104 Missing-Value Preprocessing
http://scikit-learn.org/stable/auto_examples/missing_values.html#example-missing-values-py
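How missing values are handled can, to some extent, determine how well a model performs. Common approaches fill each gap with the mean, the median, or the most frequent value; these correspond to the three strategies of sklearn.preprocessing.Imputer (strategy='mean', 'median' and 'most_frequent'). The linked example notes that when the values vary over a wide range, the median appears to be the more stable choice.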
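To make the three strategies concrete, here is a minimal sketch in plain NumPy rather than through Imputer itself. The column values are made up for illustration; the single outlier (1000.0) shows why the median can be the steadier fill value when the range is wide.

```python
import numpy as np

# Hypothetical feature column: two missing entries and one outlier.
col = np.array([1.0, 2.0, 2.0, np.nan, 3.0, 1000.0, np.nan])
observed = col[~np.isnan(col)]            # the five observed values

mean_fill = observed.mean()               # pulled far upward by the outlier
median_fill = np.median(observed)         # unaffected by the outlier

# Most frequent value: count each distinct value and take the commonest.
values, counts = np.unique(observed, return_counts=True)
most_frequent_fill = values[counts.argmax()]

# Fill the gaps with the chosen statistic (here the median).
filled_with_median = np.where(np.isnan(col), median_fill, col)

print(mean_fill, median_fill, most_frequent_fill)
```

Here the mean is about 201.6 while the median and the most frequent value are both 2.0, so the mean would fill the gaps with a value unlike any typical observation.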
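The functions used in the example are introduced below, in the order in which they appear.
X.shape[0]: takes the first element of shape; for multi-dimensional data this is the number of rows (samples).
np.floor: returns the largest integer less than or equal to a float (or to each element of an array). The result is still a float, for example np.floor(-0.2) -> -1.0.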
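A small sketch of these two steps as the example script uses them. The zero-filled array is only a placeholder with 506 rows and 13 columns; any 2-D array behaves the same way.

```python
import numpy as np

X = np.zeros((506, 13))            # placeholder data, 506 samples x 13 features
n_samples = X.shape[0]             # number of rows
n_features = X.shape[1]            # number of columns

n_missing = np.floor(n_samples * 0.75)   # floor of 379.5, returned as a float
n_missing_int = int(n_missing)           # cast before using it as an array size

floor_negative = np.floor(-0.2)          # rounds toward negative infinity

print(n_samples, n_features, n_missing, n_missing_int, floor_negative)
```

Because np.floor returns a float, the cast to int matters: recent NumPy versions reject a float where an array length is expected.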
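np.hstack: stacks arrays horizontally (h = horizontally). For arrays with two or more dimensions it is equivalent to np.concatenate(tup, axis=1); 1-D arrays are simply concatenated along their only axis.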
http://docs.scipy.org/doc/numpy-1.6.0/reference/generated/numpy.hstack.html?highlight=hstack#numpy.hstack
Example:
>>> a = np.array([[1],[2],[3]])
>>> b = np.array([[2],[3],[4]])
>>> np.hstack((a,b))
array([[1, 2],
       [2, 3],
       [3, 4]])
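The example script calls np.hstack on 1-D arrays, which behaves slightly differently from the 2-D case above. A minimal sketch with made-up lengths:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5])

# For 1-D inputs hstack joins the arrays end to end (axis 0).
flat = np.hstack((a, b))
same = np.concatenate((a, b), axis=0)

# The same pattern builds a boolean mask: False entries first, then True.
mask = np.hstack((np.zeros(2, dtype=bool), np.ones(3, dtype=bool)))

print(flat, mask)
```

This is how the script prepares its missing_samples mask before shuffling it.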
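np.random.shuffle: shuffles an array in place. Note that for multi-dimensional data only the order along the first index is shuffled, so the rows are reordered but their contents are unchanged. For example: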
>>> arr = np.arange(9).reshape((3, 3))
>>> np.random.shuffle(arr)
>>> arr
array([[3, 4, 5],
       [6, 7, 8],
       [0, 1, 2]])
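The row-wise behaviour can be checked directly. This sketch uses a seeded RandomState, as the example script does; the array sizes are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)

arr = np.arange(9).reshape((3, 3))
rng.shuffle(arr)                      # in place; the call itself returns None

# Sorting the rows recovers the original array, so no row was altered.
rows = sorted(arr.tolist())

# Shuffling a 1-D boolean mask moves the True entries around
# without changing how many there are.
mask = np.hstack((np.zeros(2, dtype=bool), np.ones(6, dtype=bool)))
rng.shuffle(mask)

print(arr)
print(mask)
```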
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer
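Imputer: fills in missing values. The example uses the mean strategy, which is the default. The signature is sklearn.preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True).
Here axis=0 means imputing along columns: each missing entry is replaced by the mean of its column (feature), computed over all samples. A column that consists entirely of missing values is dropped from the output. With axis=1 (imputing along rows), a row that consists entirely of missing values raises an exception.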
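The column-wise behaviour can be sketched in plain NumPy. This is an illustration of the idea, not the library's implementation, and impute_column_mean is a hypothetical helper name.

```python
import numpy as np

def impute_column_mean(X):
    """Replace NaN with the column mean; drop columns that are entirely NaN."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    keep = ~missing.all(axis=0)                    # columns with some observed value
    X, missing = X[:, keep], missing[:, keep]
    counts = (~missing).sum(axis=0)                # observed entries per column
    sums = np.where(missing, 0.0, X).sum(axis=0)   # sum of observed entries
    means = sums / counts
    return np.where(missing, means, X)             # fill gaps, keep observed values

X = np.array([[1.0,    np.nan, np.nan],
              [3.0,    4.0,    np.nan],
              [np.nan, 8.0,    np.nan]])
X_imputed = impute_column_mean(X)
print(X_imputed)
```

The third column has no observed values, so it disappears from the result; the gaps in the first two columns are filled with 2.0 and 6.0 respectively.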
The full example code is reproduced below.
import numpy as np
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
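# Note: these two imports match old scikit-learn releases. sklearn.cross_validation
# was removed in 0.20 (see sklearn.model_selection) and Imputer was removed in 0.22
# (see sklearn.impute.SimpleImputer).
from sklearn.preprocessing import Imputer
from sklearn.cross_validation import cross_val_score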
rng = np.random.RandomState(0)
dataset = load_boston()
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]
n_features = X_full.shape[1]
# Step 1: estimate the score on the entire dataset, with no missing values
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_full, y_full).mean()
print("Score with the entire dataset = %.2f" % score)
# Add missing values in 75% of the lines
missing_rate = 0.75
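# np.floor returns a float; cast to int so it can be used as an array size
n_missing_samples = int(np.floor(n_samples * missing_rate))
# builtin bool replaces the np.bool alias, which was removed in NumPy 1.24
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                      dtype=bool),
                             np.ones(n_missing_samples,
                                     dtype=bool)))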
rng.shuffle(missing_samples)
missing_features = rng.randint(0, n_features, n_missing_samples)
# Step 2: estimate the score without the lines containing missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)
# Step 3: estimate the score after imputation of the missing values
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
estimator = Pipeline([("imputer", Imputer(missing_values=0,
                                          strategy="mean",
                                          axis=0)),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)
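The final scores: 0.56 on the original dataset,
0.48 after dropping the samples that contain missing values,
and 0.55 after filling the missing values with the mean.
Mean imputation recovers nearly all of the original score here (0.55 versus 0.56), so in this example it is clearly a better choice than discarding the incomplete samples.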
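One step of the script worth a closer look is X_missing[np.where(missing_samples)[0], missing_features] = 0. It relies on paired integer indexing: the i-th row index is matched with the i-th column index, so exactly one entry is blanked in each selected row. A minimal sketch on a toy matrix (the 5x4 shape and the mask are made up):

```python
import numpy as np

rng = np.random.RandomState(0)

X = np.arange(1, 21, dtype=float).reshape(5, 4)    # toy matrix containing no zeros
missing_samples = np.array([True, False, True, True, False])

rows = np.where(missing_samples)[0]                # indices of the selected rows
cols = rng.randint(0, X.shape[1], rows.size)       # one random column per selected row

X_missing = X.copy()
X_missing[rows, cols] = 0                          # sets only the (rows[i], cols[i]) pairs

zeros_per_row = (X_missing == 0).sum(axis=1)
print(X_missing)
print(zeros_per_row)
```

Since the script marks missing entries with 0 and passes missing_values=0 to Imputer, any genuine zeros already present in the data are treated as missing as well.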