[sklearn] Official example: Imputing missing values before building an estimator (filling missing values with the column mean)

Official link: http://scikit-learn.org/dev/auto_examples/plot_missing_values.html#sphx-glr-auto-examples-plot-missing-values-py

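This example is meant to show that an estimator trained after imputing the missing values (here, with the column mean) can perform better than one trained after simply dropping every sample that contains a missing value. The missing values themselves are introduced artificially, at random positions.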

The example code, with additional comments, follows:

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

import numpy as np
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
# SimpleImputer (scikit-learn >= 0.20) replaces sklearn.preprocessing.Imputer,
# which the original example used and which was removed in 0.22
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Seed the random number generator so the run is reproducible
rng = np.random.RandomState(0)

# Load the data: Boston house prices
dataset = load_boston()
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]
n_features = X_full.shape[1]

# Estimate the score on the entire dataset, with no missing values
# Random forest regressor: random_state is the random seed,
# n_estimators is the number of trees in the forest
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
# Cross-validated score; for a regressor this is R^2, not classification accuracy
score = cross_val_score(estimator, X_full, y_full).mean()
print("Score with the entire dataset = %.2f" % score)

# Add missing values in 75% of the lines
missing_rate = 0.75
n_missing_samples = int(np.floor(n_samples * missing_rate))
# hstack joins the two 1-D arrays end to end into one boolean mask
# (for 2-D inputs the row counts would have to match)
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                      dtype=bool),
                             np.ones(n_missing_samples,
                                     dtype=bool)))
# Shuffle the mask in place so the samples marked as missing are chosen at random
rng.shuffle(missing_samples)
# For each marked sample, pick one feature index at random
missing_features = rng.randint(0, n_features, n_missing_samples)

# Estimate the score without the lines containing missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)

# Estimate the score after imputation of the missing values
X_missing = X_full.copy()
# Zero out one randomly chosen feature in every marked row; 0 stands in for "missing"
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
# Every 0 is replaced by its column mean (the old Imputer spelled this axis=0);
# note that features containing genuine zeros are imputed as well
estimator = Pipeline([("imputer", SimpleImputer(missing_values=0,
                                                strategy="mean")),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－
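To make the imputation step in the pipeline more concrete, here is a minimal NumPy-only sketch of what mean imputation with 0 as the missing marker amounts to. It is an illustration, not scikit-learn's implementation; the helper name `impute_column_mean` and the array `X_demo` are hypothetical and do not appear in the example above.

```python
import numpy as np


def impute_column_mean(X, missing_value=0):
    """Replace entries equal to missing_value with the mean of the
    remaining entries in the same column (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    mask = X == missing_value                  # True where a value counts as missing
    observed = np.where(mask, np.nan, X)       # hide missing entries from the mean
    col_means = np.nanmean(observed, axis=0)   # one mean per column (feature)
    return np.where(mask, col_means, X)        # fill only the masked positions


# Small made-up array: each column has exactly one "missing" zero
X_demo = np.array([[1.0, 0.0],
                   [3.0, 4.0],
                   [0.0, 8.0]])
X_imputed = impute_column_mean(X_demo)
print(X_imputed)
```

Column 0 has observed values 1 and 3, so its zero becomes 2; column 1 has observed values 4 and 8, so its zero becomes 6. A column in which every entry is missing would yield NaN in this sketch.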
Supplement:
A. Usage of numpy.where():
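A short sketch of the two forms of `numpy.where()`, using a small made-up mask rather than the data from the example. With one argument it returns a tuple of index arrays (one per dimension), which is why the example takes element `[0]`; with three arguments it picks element-wise between two values. The variable names below are illustrative only.

```python
import numpy as np

mask = np.array([False, True, False, True, True])

# One-argument form: a tuple holding one index array per dimension
rows = np.where(mask)[0]
print(rows)  # [1 3 4]

# Pairing the row indices with one column index per row, as the example does:
# this sets X[1, 2], X[3, 0] and X[4, 1], not a whole block of cells
X = np.arange(1, 16, dtype=float).reshape(5, 3)
cols = np.array([2, 0, 1])
X[rows, cols] = 0
print(X)

# Three-argument form: element-wise choice between two values
picked = np.where(mask, -1, 1)
print(picked)  # [ 1 -1  1 -1 -1]
```

Because `rows` and `cols` have the same length, the assignment touches exactly one cell per selected row, which matches how the example removes a single feature value from each marked sample.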
