Day 1: Data Cleaning Challenge: Handling missing values | Kaggle
Day 2: Data Cleaning Challenge: Scale and Normalize Data | Kaggle
Day 3: Data Cleaning Challenge: Parsing Dates | Kaggle
Day 4: Data Cleaning Challenge: Character Encodings | Kaggle
Day 5: Data Cleaning Challenge: Inconsistent Data Entry | Kaggle
# modules we'll use
import pandas as pd
import numpy as np
# read in all our data
nfl_data = pd.read_csv("../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv")
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")
# set seed for reproducibility
np.random.seed(0)
# look at a few rows of the nfl_data file. I can see a handful of missing data already!
nfl_data.sample(5)
# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()
# look at the # of missing points in the first ten columns
missing_values_count[0:10]
# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()
# percent of data that is missing
(total_missing/total_cells) * 100
If a value is missing because it doesn't exist (like the height of the oldest child of someone who has no children), it doesn't make sense to try to guess what it might be; those values should stay NaN. On the other hand, if a value is missing because it wasn't recorded, you can try to guess what it might have been based on the other values in that column and row. (This is called "imputation", and we'll learn how to do it next! :)
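As a toy illustration (hypothetical data, not from the NFL dataset): the first kind of NaN below is structural and should stay NaN, while the second was simply not recorded and is a candidate for imputation.
# hypothetical survey data illustrating the two kinds of missingness
demo = pd.DataFrame({
    "num_children":        [2, 0, 1],
    "oldest_child_height": [110.0, np.nan, 95.0],  # second row has no children, so no height exists
    "age":                 [34.0, np.nan, 41.0],   # second row's age simply wasn't recorded
})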
Note: if you're doing very careful data analysis, this is the point where you'd look at each column individually to figure out the best strategy for filling in its missing values.
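A quick way to see which columns would need the most attention, as a sketch:
# percent of values missing in each column, worst offenders first
percent_missing = nfl_data.isnull().mean().sort_values(ascending=False) * 100
percent_missing.head(10)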
One option: drop any row that contains a missing value. Be aware that this can throw away a lot of records.
# remove all the rows that contain a missing value
nfl_data.dropna()
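In a dataset like this one, where almost every row has a NaN somewhere, dropna() can remove most or even all of the rows, so it's worth checking how much data survives; a quick sketch:
# compare row counts before and after dropping rows with any NaN
print("Rows in original dataset: %d" % nfl_data.shape[0])
print("Rows with na's dropped: %d" % nfl_data.dropna().shape[0])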
Alternatively, drop every column that has at least one missing value. This costs you features instead of records.
# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()
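To see how many features this costs, a quick comparison (sketch):
# just how much data did we lose?
print("Columns in original dataset: %d" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])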
# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
Method 1: fill every missing value with 0.
# replace all NA's with 0
subset_nfl_data.fillna(0)
Method 2: fill each missing value with the value that comes directly after it in the same column (a backfill). For example, if row 3 of column 4 is missing, the value in row 4 of column 4 is filled in; any values still missing after that (e.g. at the end of a column) are then filled with 0.
# replace all NA's with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
subset_nfl_data.bfill().fillna(0)  # fillna(method='bfill') is deprecated in recent pandas
A simple option is to drop columns with missing values (original_data here stands in for any training DataFrame):
data_without_missing_values = original_data.dropna(axis=1)
If your training and test sets live in separate DataFrames, drop the same columns from both so they stay aligned:
cols_with_missing = [col for col in original_data.columns
                     if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)
Dropping columns is usually not the best approach, but it can be reasonable when a column is missing most of its values anyway.
A better option is imputation. The filled-in values (the column mean by default) won't be exactly right, but a model trained on imputed data usually beats one trained after dropping the columns entirely.
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22
my_imputer = SimpleImputer()  # fills each column with its mean by default
data_with_imputed_values = my_imputer.fit_transform(original_data)
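fit_transform returns a plain NumPy array, so the column labels are lost; to keep them, a small sketch (assuming original_data contains only numeric columns, which the default mean strategy requires):
# wrap the imputed array back into a labelled DataFrame
data_with_imputed_values = pd.DataFrame(
    my_imputer.fit_transform(original_data),
    columns=original_data.columns)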
# make copy to avoid changing original data (when imputing)
new_data = original_data.copy()
# make new columns indicating what will be imputed
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()
# imputation (note: fit_transform returns a NumPy array, not a DataFrame)
my_imputer = SimpleImputer()
new_data = my_imputer.fit_transform(new_data)
Note: adding these indicator columns doesn't always help; in some cases it makes little or no difference.
We've loaded a function score_dataset(X_train, X_test, y_train, y_test) to compare the quality of different approaches to missing values. This function reports the out-of-sample MAE score from a RandomForest. To compare the three approaches:
import pandas as pd
# Load data
melb_data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
melb_target = melb_data.Price
melb_predictors = melb_data.drop(['Price'], axis=1)
# For the sake of keeping the example simple, we'll use only numeric predictors.
melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])
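The scoring snippets below rely on X_train, X_test, y_train, y_test and on score_dataset, which these notes mention but never show. A minimal sketch of both, assuming a 70/30 split and the RandomForest-MAE scorer described above:
# split the numeric predictors and the target into train and test sets
X_train, X_test, y_train, y_test = train_test_split(melb_numeric_predictors,
                                                    melb_target,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=0)

def score_dataset(X_train, X_test, y_train, y_test):
    # fit a RandomForest on the training set and report out-of-sample MAE
    model = RandomForestRegressor(random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)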
Get Model Score from Dropping Columns with Missing Values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))
Get Model Score from Imputation
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
my_imputer = SimpleImputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))
Get Score from Imputation with Extra Columns Showing What Was Imputed
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)
print("Mean Absolute Error from Imputation while Track What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))