In Python, missing values are usually represented as NaN, short for "Not a Number".
The following code counts how many missing values the data contains, where the data is stored in a pandas DataFrame:
print(data.isnull().sum())
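For instance, on a small toy DataFrame (the column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# A toy DataFrame with some NaN entries (hypothetical data)
data = pd.DataFrame({'rooms': [2, 3, np.nan],
                     'price': [np.nan, np.nan, 150000]})

# isnull() marks each cell True/False; sum() counts the True values per column
print(data.isnull().sum())
# rooms    1
# price    2
# dtype: int64
```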
There are several ways to handle missing values. The simplest option is to drop the columns that contain them:
data_without_missing_values = original_data.dropna(axis=1)
In most cases, we must drop the same columns from both the training dataset and the test dataset:
cols_with_missing = [col for col in original_data.columns
                     if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)
This approach is suitable when a column contains too many missing values.
The second option is imputation, which fills in the missing values instead of discarding whole columns; it usually beats dropping columns and produces a better model:
from sklearn.preprocessing import Imputer
my_imputer = Imputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)
The default imputation strategy is to fill each missing value with its column's mean.
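The mean-filling behavior can be sketched directly in plain pandas with fillna; this is only an illustration of the default strategy, not the Imputer API itself, and the data is made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'size': [1.0, np.nan, 3.0]})

# Replace each NaN with its column's mean, (1.0 + 3.0) / 2 = 2.0,
# mirroring the imputer's default strategy='mean'
filled = df.fillna(df.mean())
print(filled)
#    size
# 0   1.0
# 1   2.0
# 2   3.0
```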
If the missingness itself carries useful signal, we need to preserve which values were originally missing by recording that information in boolean columns:
# Make a copy of the original data
new_data = original_data.copy()
# Create new columns recording which entries of the affected columns were missing
cols_with_missing = (col for col in new_data.columns if new_data[col].isnull().any())
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()
# Imputation
my_imputer = Imputer()
new_data = my_imputer.fit_transform(new_data)
Below is a house-price prediction example that compares the three missing-value strategies described above.
import pandas as pd
# Load the data
melb_data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
melb_target = melb_data.Price
melb_predictors = melb_data.drop(['Price'], axis=1)
# For simplicity, we train the model on numeric columns only
melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(melb_numeric_predictors,
                                                    melb_target,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=0)
# Define a function that scores a model by its MAE
def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()  # use a random forest model here
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))
Mean Absolute Error from dropping columns with Missing Values:
347871.8471099837
from sklearn.preprocessing import Imputer
my_imputer = Imputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))
Mean Absolute Error from Imputation:
201753.99398441747
# Copy the original data
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()
# Get the names of the columns that contain missing values
cols_with_missing = (col for col in X_train.columns
                     if X_train[col].isnull().any())
# Add indicator columns recording the missing entries
# (each new column holds a True/False sequence)
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()
# Imputation
my_imputer = Imputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)
print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))
Mean Absolute Error from Imputation while Tracking What Was Imputed:
200147.29626743973
In this example, methods 2 and 3 perform almost equally well, but in some cases the gap between them can be substantial.
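One caveat: sklearn.preprocessing.Imputer was deprecated in scikit-learn 0.20 and removed in 0.22, so the snippets above only run on older versions. On current scikit-learn the equivalent class is sklearn.impute.SimpleImputer; a minimal sketch of the same mean imputation with the newer API (toy data for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value per column
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

# strategy='mean' is the default, matching the old Imputer's behavior
my_imputer = SimpleImputer(strategy='mean')
X_imputed = my_imputer.fit_transform(X)
print(X_imputed)
# [[1. 2.]
#  [2. 4.]
#  [3. 3.]]
```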