Machine Learning / Deep Learning in Practice: Kaggle House Price Prediction Competition (Data Preprocessing)

Table of Contents

    • 2. Data Preprocessing (Feature Encoding)
      • 2.1 Dropping Features
      • 2.2 Transforming Time-Related Features (Reducing Feature Magnitudes)
      • 2.3 Filling Missing Values
        • 2.3.1 Filling Numerical Features
        • 2.3.2 Filling Non-Numerical Features
      • 2.4 Converting Certain Numerical Features to Categorical Features
      • 2.5 Applying PowerTransformer to Certain Continuous Features to Make Them More Gaussian
      • 2.6 Merging Rare Categories in Certain Non-Numerical Features
      • 2.7 Encoding Non-Numerical Features
      • 2.8 Splitting Training and Test Data


  • Blog 1: Data Analysis
  • Blog 2: Data Preprocessing
  • Blog 3: Modeling and Prediction with Machine Learning Regression Algorithms
  • Blog 4: Designing a Deep Learning Model with PyTorch

Related:
Kaggle competition: House Prices - Advanced Regression Techniques

Data download: Baidu Netdisk, extraction code: w2t6


2. Data Preprocessing (Feature Encoding)

2.1 Dropping Features

  • 1) the Id column
  • 2) features with too many missing values
  • 3) features dominated by a single value such as 0 or null
# Id is not a feature, so drop it; the training and test data were concatenated earlier for joint preprocessing
# Label is a helper column added earlier to distinguish training data from test data; it is not needed for training either, so drop it
# Features such as Alley and Fence fall into the cases above: too many missing values or mostly 0/null
drop_columns = ["Id", "Alley", "Fence", "LotFrontage", "FireplaceQu", "PoolArea", "LowQualFinSF", "3SsnPorch", "MiscVal", "RoofMatl", "Street", "Condition2", "Utilities", "Heating", "Label"]
print("Number of columns before dropping : ",len(combined_df.columns))
print("Number of dropping columns : ",len(drop_columns))
combined_df.drop(columns=drop_columns, inplace=True, errors='ignore')
print("Number of columns after dropping : ",len(combined_df.columns))
Number of columns before dropping :  82
Number of dropping columns :  15
Number of columns after dropping :  67
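The dropping criteria above can also be detected programmatically rather than listed by hand. Below is a minimal sketch with a hypothetical `find_drop_candidates` helper on toy data; the threshold values are assumptions, not from this post:

```python
import pandas as pd

def find_drop_candidates(df, missing_thresh=0.5, dominance_thresh=0.95):
    """Flag columns whose missing ratio or single-value dominance is too high."""
    candidates = []
    for col in df.columns:
        missing_ratio = df[col].isnull().mean()
        # share of the most frequent non-null value (1.0 if the column is all-null)
        top_share = (df[col].value_counts(normalize=True).max()
                     if df[col].notnull().any() else 1.0)
        if missing_ratio > missing_thresh or top_share > dominance_thresh:
            candidates.append(col)
    return candidates

demo = pd.DataFrame({
    "Alley":   [None, None, None, None, "Pave"],  # mostly missing
    "Street":  ["Pave"] * 5,                      # one dominant value
    "LotArea": [8450, 9600, 11250, 9550, 14260],  # informative feature
})
print(find_drop_candidates(demo))  # ['Alley', 'Street']
```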

2.2 Transforming Time-Related Features (Reducing Feature Magnitudes)

(subtract the YearBuilt / YearRemodAdd / GarageYrBlt years from the sale year YrSold)

for feature in ['YearBuilt','YearRemodAdd','GarageYrBlt']:
    combined_df[feature]=combined_df['YrSold']-combined_df[feature]
combined_df[['YearBuilt','YearRemodAdd','GarageYrBlt']].head()
YearBuilt YearRemodAdd GarageYrBlt
0 5 5 5.00
1 31 31 31.00
2 7 6 7.00
3 91 36 8.00
4 8 8 8.00

2.3 Filling Missing Values

2.3.1 Filling Numerical Features

There are several filling strategies: fill with zero, fill with the median, fill with the mean, and so on.

for col in null_features_numerical:
    if col not in drop_columns:
#         combined_df[col] = combined_df[col].fillna(combined_df[col].mean()) # fill with the mean
#         combined_df[col] = combined_df[col].fillna(combined_df[col].median()) # fill with the median
        combined_df[col] = combined_df[col].fillna(0.0) # simply fill with 0
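The `null_features_numerical` list comes from the earlier data-analysis blog. Assuming it collects the numeric columns that contain missing values, it could be rebuilt like this (a sketch on toy data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "MasVnrArea": [196.0, None, 162.0],  # numeric with a gap
    "GarageCars": [2, 1, None],          # numeric with a gap
    "MSZoning":   ["RL", None, "RM"],    # non-numeric: excluded here
})

numeric_cols = df.select_dtypes(include=[np.number]).columns
null_features_numerical = [col for col in numeric_cols if df[col].isnull().any()]
print(null_features_numerical)  # ['MasVnrArea', 'GarageCars']
```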

2.3.2 Filling Non-Numerical Features

Two filling strategies, chosen based on the earlier data analysis:

  • fill with the literal string 'NA', which effectively adds a new category
  • fill with the most frequent category, folding the missing values into the dominant class
null_features_categorical = [col for col in combined_df.columns if combined_df[col].isnull().sum() > 0 and col in categorical_features]

# Fill these features with their most frequent category
cat_feature_mode = ["SaleType", "Exterior1st", "Exterior2nd", "KitchenQual", "Electrical", "Functional"]

for col in null_features_categorical:
    if col != 'MSZoning' and col not in cat_feature_mode:
        combined_df[col] = combined_df[col].fillna('NA')
    else:
        combined_df[col] = combined_df[col].fillna(combined_df[col].mode()[0])
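On a toy column, the mode-based fill behaves like this:

```python
import pandas as pd

s = pd.Series(["WD", "WD", "New", None])
print(s.mode()[0])      # 'WD' is the most frequent category
filled = s.fillna(s.mode()[0])
print(filled.tolist())  # ['WD', 'WD', 'New', 'WD']
```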

2.4 Converting Certain Numerical Features to Categorical Features

Although MSSubClass is stored as a numerical type, it has only a few distinct values; based on the earlier data analysis, it is better treated as a categorical feature.

# Convert "numerical" feature to categorical
convert_list = ['MSSubClass']
for col in convert_list:
    combined_df[col] = combined_df[col].astype('str')
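One way to decide which numeric columns are really categorical is to compare the number of distinct values against the data size. This is a sketch on toy data; the `len(df) / 2` cutoff is an arbitrary assumption:

```python
import pandas as pd

df = pd.DataFrame({
    "MSSubClass": [20, 60, 20, 70, 60, 20, 50, 60, 20, 70],  # class codes, not quantities
    "GrLivArea":  [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077],
})

# columns whose distinct-value count is small relative to the data size
candidates = [c for c in df.select_dtypes("number").columns
              if df[c].nunique() < len(df) / 2]
print(candidates)  # ['MSSubClass']
```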

2.5 Applying PowerTransformer to Certain Continuous Features to Make Them More Gaussian

  • Some of the continuous features are far from normally distributed, so they need to be transformed
  • First check the skewness of the continuous numerical features
from scipy.stats import skew

# get the features except object types
numeric_features = combined_df.dtypes[combined_df.dtypes != 'object'].index

# check the skewness of all numerical features
skewed_features = combined_df[numeric_features].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)

print('\n Skew in numerical features: \n')
skewness_df = pd.DataFrame({'Skew' : skewed_features})
print(skewness_df.head(10))
 Skew in numerical features: 

               Skew
LotArea       12.82
KitchenAbvGr   4.30
BsmtFinSF2     4.15
EnclosedPorch  4.00
ScreenPorch    3.95
BsmtHalfBath   3.93
MasVnrArea     2.61
OpenPorchSF    2.54
WoodDeckSF     1.84
1stFlrSF       1.47
# Apply PowerTransformer to Columns
log_list = ['BsmtUnfSF', 'LotArea', '1stFlrSF', 'GrLivArea', 'TotalBsmtSF', 'GarageArea']

for col in log_list:
    power = PowerTransformer(method='yeo-johnson', standardize=True)
    combined_df[[col]] = power.fit_transform(combined_df[[col]]) # fit with combined_data to avoid overfitting with training data?

print('Number of skewed numerical features got transform : ', len(log_list))
Number of skewed numerical features got transform :  6
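The effect of the transform can be checked on synthetic skewed data. In this sketch, a lognormal sample stands in for right-skewed features like LotArea:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# a heavily right-skewed sample
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000).reshape(-1, 1)

pt = PowerTransformer(method='yeo-johnson', standardize=True)
x_t = pt.fit_transform(x)

print(f"skew before: {skew(x.ravel()):.2f}")   # strongly positive
print(f"skew after:  {skew(x_t.ravel()):.2f}")  # close to 0
```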

2.6 Merging Rare Categories in Certain Non-Numerical Features

In some object-type features, a few categories account for only a tiny share of the samples compared with the dominant categories, so it is worth merging these minor categories. For example, in the HeatingQC feature the Fa and Po categories are very rare, so both are relabeled as Other.

# Regroup features
# In the features below, the listed categories are very rare, so merge them into a single Other category
regroup_dict = {
#     'LotConfig': ['FR2','FR3'],
#     'LandSlope':['Mod','Sev'],
#     'BldgType':['2FmCon','Duplex'],
#     'RoofStyle':['Mansard','Flat','Gambrel'],
#     'Electrical':['FuseF','FuseP','FuseA','Mix'],
#     'SaleCondition':['Abnorml','AdjLand','Alloca','Family'],
#     'BsmtExposure':['Min','Av'],
#     'Functional':['Min1','Maj1','Min2','Mod','Maj2','Sev'],
#     'LotShape':['IR2','IR3'],
    'HeatingQC':['Fa','Po'],
    # 'FireplaceQu':['Fa','Po'],
    'GarageQual':['Fa','Po'],
    'GarageCond':['Fa','Po'],
}
 

for col, regroup_value in regroup_dict.items():
    mask = combined_df[col].isin(regroup_value)
    combined_df.loc[mask, col] = 'Other'  # .loc avoids pandas chained-assignment issues
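Instead of hard-coding the regroup lists, rare categories can also be discovered from their frequencies. A sketch on a toy quality column; the 5% cutoff is an assumption:

```python
import pandas as pd

# toy quality column where Fa and Po are rare
s = pd.Series(["Ex"] * 60 + ["Gd"] * 30 + ["TA"] * 8 + ["Fa"] * 2 + ["Po"] * 1)

freq = s.value_counts(normalize=True)
rare = freq[freq < 0.05].index.tolist()
print(rare)  # ['Fa', 'Po']

# fold the rare categories into a single 'Other' label
s_merged = s.where(~s.isin(rare), 'Other')
print(sorted(s_merged.unique()))  # ['Ex', 'Gd', 'Other', 'TA']
```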

2.7 Encoding Non-Numerical Features

# Generate one-hot dummy columns
combined_df = pd.get_dummies(combined_df).reset_index(drop=True)
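On a toy frame, `pd.get_dummies` leaves numeric columns untouched and expands each object column into one indicator column per category:

```python
import pandas as pd

df = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"], "LotArea": [8450, 9600, 11250]})
encoded = pd.get_dummies(df)
print(list(encoded.columns))  # ['LotArea', 'MSZoning_RL', 'MSZoning_RM']
```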

2.8 Splitting Training and Test Data

new_train_data = combined_df.iloc[:len(train_data), :]
new_test_data = combined_df.iloc[len(train_data):, :]
X_train = new_train_data.drop('SalePrice', axis=1)
y_train = np.log1p(new_train_data['SalePrice'].values.ravel())
X_test = new_test_data.drop('SalePrice', axis=1)
# Use sklearn's RobustScaler, which scales features using statistics that are robust to outliers (the median and the interquartile range)
pre_processing_pipeline = make_pipeline(RobustScaler(),
                                        # VarianceThreshold(0.001),
                                        )

X_train = pre_processing_pipeline.fit_transform(X_train)
X_test = pre_processing_pipeline.transform(X_test)

print(X_train.shape)
print(X_test.shape)
(1460, 270)
(1459, 270)
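RobustScaler centers each feature on its median and scales by the interquartile range, so a single extreme outlier barely changes the fitted statistics. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# one feature with a single extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

scaled = RobustScaler().fit_transform(x)
print(scaled.ravel())

# manual check: median = 3, IQR = Q3 - Q1 = 4 - 2 = 2,
# so the transform is (x - 3) / 2 and the outlier does not distort the scale
```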
