Kaggle 5-Day Data Cleaning Challenge

Day 1: Data Cleaning Challenge: Handling missing values | Kaggle
Day 2: Data Cleaning Challenge: Scale and Normalize Data | Kaggle
Day 3: Data Cleaning Challenge: Parsing Dates | Kaggle
Day 4: Data Cleaning Challenge: Character Encodings | Kaggle
Day 5: Data Cleaning Challenge: Inconsistent Data Entry | Kaggle

1. Handling Missing Values

Take a first look at the data

# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
nfl_data = pd.read_csv("../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv")
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")

# set seed for reproducibility
np.random.seed(0) 

# look at a few rows of the nfl_data file. I can see a handful of missing data already!
nfl_data.sample(5)

Count how many values are missing

# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing/total_cells) * 100
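The same arithmetic on a small hand-made frame makes it concrete (the toy DataFrame below is my own, not part of the NFL dataset):

```python
import numpy as np
import pandas as pd

# toy frame: 4 rows x 3 columns = 12 cells, 3 of them NaN
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

missing_per_column = toy.isnull().sum()   # a: 1, b: 2, c: 0
total_cells = np.prod(toy.shape)          # 12
total_missing = missing_per_column.sum()  # 3
percent_missing = total_missing / total_cells * 100
print(percent_missing)  # 25.0
```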

Figure out why the data is missing

Is this value missing because it wasn’t recorded or because it doesn’t exist?

If a value is missing because it doesn’t exist (like the height of the oldest child of someone who has no children), then trying to guess it makes no sense, and those values should stay NaN. On the other hand, if a value is missing because it wasn’t recorded, you can try to guess what it might have been based on the other values in that column and row. (This is called “imputation”, and we’ll learn how to do it next! :)

Note: if you’re doing very careful data analysis, you can look at each column individually and work out the best strategy for filling in its missing values.

Drop missing values

If a row contains a missing value, we can drop that whole row. Note that this may throw away a lot of records.

# remove all the rows that contain a missing value
nfl_data.dropna()
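A tiny illustration of how aggressive row-dropping can be (my own toy frame, not the NFL data): a single NaN anywhere in a row removes the entire row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [4.0, 5.0, np.nan],
})

print(df.shape)           # (3, 2)
# only the first row is fully populated, so dropna keeps just that one
print(df.dropna().shape)  # (1, 2)
```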

If a column has missing values, drop the whole column. This reduces the number of features.

# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()
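And the column-wise counterpart on a toy frame of my own: any column containing a NaN is removed, while fully populated columns survive.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "complete": [1, 2, 3],
    "holey": [1.0, np.nan, 3.0],
})

# axis=1 drops columns instead of rows
dropped = df.dropna(axis=1)
print(list(dropped.columns))  # ['complete']
```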

Filling in missing values automatically

# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()

Option 1: replace all missing values with 0

# replace all NA's with 0
subset_nfl_data.fillna(0)

Option 2: fill each missing value with the next value in the same column (e.g. if row 3, column 4 is missing, copy in the value from row 4, column 4); any values still missing afterwards are filled with 0.

# replace each NA with the value that comes directly after it in the same column,
# then replace any remaining NA's with 0
# (fillna(method='bfill') is deprecated in modern pandas; use .bfill() instead)
subset_nfl_data.bfill(axis=0).fillna(0)
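A minimal demonstration of the two-step fill on a made-up column: backfill pulls later values upward, and the trailing NaN (which has no later value to borrow) falls through to the 0 fill.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"v": [1.0, np.nan, 3.0, np.nan]})

# bfill: [1, NaN, 3, NaN] -> [1, 3, 3, NaN]; then fillna(0) handles the tail
filled = df.bfill().fillna(0)
print(filled["v"].tolist())  # [1.0, 3.0, 3.0, 0.0]
```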

More advanced solutions

1) Drop Columns with Missing Values
data_without_missing_values = original_data.dropna(axis=1)

If the training and test sets are separate, you need to do the following so that the same columns are dropped from both.

cols_with_missing = [col for col in original_data.columns 
                                 if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)

This is usually not the best approach, but it can work well when a column is missing most of its values.

2) A Better Option: Imputation

The filled-in values (the column mean by default) won’t be exactly right, but models built this way usually do better than models built with option 1.

# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; use SimpleImputer
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)
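On a small array of my own you can see exactly what mean imputation does: each NaN is replaced by the mean of its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

# default strategy is the column mean: col 0 mean = 2.0, col 1 mean = 3.0
imputer = SimpleImputer()
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# [[1. 2.]
#  [2. 4.]
#  [3. 3.]]
```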

3) An Extension To Imputation

# make copy to avoid changing original data (when Imputing)
new_data = original_data.copy()

# make new columns indicating what will be imputed
cols_with_missing = [col for col in new_data.columns
                                 if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

# Imputation (SimpleImputer replaces the removed sklearn.preprocessing.Imputer)
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
new_data = my_imputer.fit_transform(new_data)

Note: this approach isn’t guaranteed to help much; in some cases it makes little difference.
We’ve loaded a function score_dataset(X_train, X_test, y_train, y_test) to compare the quality of different approaches to missing values. This function reports the out-of-sample MAE score from a RandomForest. Let’s use it to compare the three approaches:
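The notebook never shows score_dataset itself; a plausible sketch consistent with the description above (RandomForest fit on the training split, MAE reported on the held-out split) could look like this. The hyperparameters and random_state are my assumptions, not from the original.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_test, y_train, y_test):
    # fit a RandomForest on the training split and report the
    # out-of-sample mean absolute error on the test split
    model = RandomForestRegressor(random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)
```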

import pandas as pd

# Load data
melb_data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

melb_target = melb_data.Price
melb_predictors = melb_data.drop(['Price'], axis=1)

# For the sake of keeping the example simple, we'll use only numeric predictors. 
melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])

# split into training and test sets (the comparisons below use these names)
X_train, X_test, y_train, y_test = train_test_split(melb_numeric_predictors,
                                                    melb_target,
                                                    random_state=0)

Get Model Score from Dropping Columns with Missing Values

cols_with_missing = [col for col in X_train.columns 
                                 if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test  = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

Get Model Score from Imputation

from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

Get Score from Imputation with Extra Columns Showing What Was Imputed

imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

cols_with_missing = [col for col in X_train.columns
                                 if X_train[col].isnull().any()]
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

# Imputation
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))
