Related:
Kaggle competition: House Prices - Advanced Regression Techniques
Data download: Baidu Netdisk, extraction code: w2t6
# Id is not a feature, so drop it; the training and test data have been concatenated so they can be preprocessed together
# Label is a column added above only to distinguish training rows from test rows; it is not needed for training either, so drop it
# Features such as Alley and Fence were shown above to have too many missing values or to consist mostly of 0/null
drop_columns = ["Id", "Alley", "Fence", "LotFrontage", "FireplaceQu", "PoolArea", "LowQualFinSF", "3SsnPorch", "MiscVal", 'RoofMatl','Street','Condition2','Utilities','Heating','Label']
print("Number of columns before dropping : ",len(combined_df.columns))
print("Number of dropping columns : ",len(drop_columns))
combined_df.drop(columns=drop_columns, inplace=True, errors='ignore')
print("Number of columns after dropping : ",len(combined_df.columns))
Number of columns before dropping : 82
Number of columns to drop :  15
Number of columns after dropping : 67
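For reference, the high-missing columns in drop_columns can also be found programmatically. A minimal sketch, assuming combined_df is the concatenated train/test frame from the previous step and using an illustrative 50% threshold that is not from the original notebook:

# Fraction of missing values per column, sorted descending
missing_ratio = combined_df.isnull().mean().sort_values(ascending=False)
print(missing_ratio[missing_ratio > 0.5])  # columns that are more than 50% missing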
(Subtract the YearBuilt / YearRemodAdd / GarageYrBlt year from the sale year, YrSold, to turn each of them into an age.)
for feature in ['YearBuilt','YearRemodAdd','GarageYrBlt']:
    combined_df[feature] = combined_df['YrSold'] - combined_df[feature]
combined_df[['YearBuilt','YearRemodAdd','GarageYrBlt']].head()
| | YearBuilt | YearRemodAdd | GarageYrBlt |
|---|---|---|---|
| 0 | 5 | 5 | 5.00 |
| 1 | 31 | 31 | 31.00 |
| 2 | 7 | 6 | 7.00 |
| 3 | 91 | 36 | 8.00 |
| 4 | 8 | 8 | 8.00 |
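One detail worth checking after this step: if a recorded year is larger than YrSold (which can happen with data-entry errors), the resulting age becomes negative. A possible cleanup, shown only as a sketch and not part of the original notebook, is to clip such values:

# Clip any negative "age" values to 0 (can occur when the recorded year exceeds YrSold)
for feature in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
    combined_df[feature] = combined_df[feature].clip(lower=0)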
Missing numerical values can be imputed in several ways: fill with zero, fill with the median, fill with the mean, and so on.
for col in null_features_numerical:
    if col not in drop_columns:
        # combined_df[col] = combined_df[col].fillna(combined_df[col].mean())    # fill with the mean
        # combined_df[col] = combined_df[col].fillna(combined_df[col].median())  # fill with the median
        combined_df[col] = combined_df[col].fillna(0.0)                          # simply fill with 0
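The list null_features_numerical comes from the earlier data analysis; if it needs to be rebuilt, one possible definition (an assumption, since its original construction is not shown in this section) is:

# Numeric (non-object) columns that still contain missing values
numerical_features = combined_df.select_dtypes(exclude='object').columns
null_features_numerical = [col for col in numerical_features
                           if combined_df[col].isnull().sum() > 0]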
Two different imputation strategies are used for categorical features (based mainly on the earlier data analysis):
null_features_categorical = [col for col in combined_df.columns if combined_df[col].isnull().sum() > 0 and col in categorical_features]
# For these features, fill missing values with the most frequent category (mode)
cat_feature_mode = ["SaleType", "Exterior1st", "Exterior2nd", "KitchenQual", "Electrical", "Functional"]
for col in null_features_categorical:
    if col != 'MSZoning' and col not in cat_feature_mode:
        combined_df[col] = combined_df[col].fillna('NA')
    else:
        combined_df[col] = combined_df[col].fillna(combined_df[col].mode()[0])
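After both imputation passes it is worth confirming that no missing values remain; a simple check (SalePrice will still show nulls because the test rows have no target):

# Remaining missing values per column; only SalePrice should be non-zero
remaining = combined_df.isnull().sum()
print(remaining[remaining > 0])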
Although MSSubClass is stored as a number, it takes only a few distinct values and, according to the data description, encodes the dwelling type, so it is converted to a categorical feature:
# Convert "numerical" feature to categorical
convert_list = ['MSSubClass']
for col in convert_list:
combined_df[col] = combined_df[col].astype('str')
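A quick look at the value counts confirms that MSSubClass behaves like a category code rather than a continuous quantity:

# MSSubClass only takes a handful of distinct dwelling-type codes
print(combined_df['MSSubClass'].value_counts())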
# get the features except object types
numeric_features = combined_df.dtypes[combined_df.dtypes != 'object'].index
# check the skewness of all numerical features
skewed_features = combined_df[numeric_features].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print('\n Skew in numerical features: \n')
skewness_df = pd.DataFrame({'Skew': skewed_features})
print(skewness_df.head(10))
Skew in numerical features:
Skew
LotArea 12.82
KitchenAbvGr 4.30
BsmtFinSF2 4.15
EnclosedPorch 4.00
ScreenPorch 3.95
BsmtHalfBath 3.93
MasVnrArea 2.61
OpenPorchSF 2.54
WoodDeckSF 1.84
1stFlrSF 1.47
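The hand-picked log_list below roughly corresponds to the most skewed area-type features. An alternative, shown here only as a sketch, would be to select every feature whose absolute skewness exceeds a threshold (0.75 is a common but arbitrary cut-off, not from the original notebook):

# Select features with |skew| above a threshold instead of hard-coding the list
skew_threshold = 0.75
high_skew = skewness_df[skewness_df['Skew'].abs() > skew_threshold].index.tolist()
print(len(high_skew), 'features exceed the threshold')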
# Apply PowerTransformer to selected columns
log_list = ['BsmtUnfSF', 'LotArea', '1stFlrSF', 'GrLivArea', 'TotalBsmtSF', 'GarageArea']
for col in log_list:
    power = PowerTransformer(method='yeo-johnson', standardize=True)
    combined_df[[col]] = power.fit_transform(combined_df[[col]])  # fit on the combined data to avoid overfitting to the training split?
print('Number of skewed numerical features transformed : ', len(log_list))
Number of skewed numerical features transformed :  6
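To verify that the Yeo-Johnson transform actually reduced the skewness, it can simply be recomputed for the transformed columns (assuming scipy.stats.skew is already imported, as it is used above):

# Recompute skewness for the transformed columns; values should be much closer to 0
for col in log_list:
    print(col, round(skew(combined_df[col].dropna()), 2))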
In some object-type features, certain categories are very rare compared with the dominant ones, so these minor categories are merged together. For example, in the HeatingQC feature the Fa and Po categories make up only a tiny share of the data, so both are relabelled as Other.
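The rare categories can be spotted from the normalized value counts; for instance, an illustrative check on HeatingQC:

# Share of each HeatingQC category; Fa and Po together account for only a small fraction
print(combined_df['HeatingQC'].value_counts(normalize=True).round(3))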
# Regroup features
# The categories listed below occur very rarely in their features, so they can be merged into a single 'Other' category
regroup_dict = {
# 'LotConfig': ['FR2','FR3'],
# 'LandSlope':['Mod','Sev'],
# 'BldgType':['2FmCon','Duplex'],
# 'RoofStyle':['Mansard','Flat','Gambrel'],
# 'Electrical':['FuseF','FuseP','FuseA','Mix'],
# 'SaleCondition':['Abnorml','AdjLand','Alloca','Family'],
# 'BsmtExposure':['Min','Av'],
# 'Functional':['Min1','Maj1','Min2','Mod','Maj2','Sev'],
# 'LotShape':['IR2','IR3'],
'HeatingQC':['Fa','Po'],
# 'FireplaceQu':['Fa','Po'],
'GarageQual':['Fa','Po'],
'GarageCond':['Fa','Po'],
}
for col, regroup_value in regroup_dict.items():
    mask = combined_df[col].isin(regroup_value)
    combined_df.loc[mask, col] = 'Other'  # use .loc to avoid chained-assignment issues
# Generate one-hot dummy columns
combined_df = pd.get_dummies(combined_df).reset_index(drop=True)
new_train_data = combined_df.iloc[:len(train_data), :]
new_test_data = combined_df.iloc[len(train_data):, :]
X_train = new_train_data.drop('SalePrice', axis=1)
y_train = np.log1p(new_train_data['SalePrice'].values.ravel())
X_test = new_test_data.drop('SalePrice', axis=1)
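Because pd.get_dummies was applied to the combined frame, the train and test splits are guaranteed to end up with identical one-hot columns; a quick assertion makes that explicit:

# Train and test share exactly the same columns since dummies were created on the combined frame
assert list(new_train_data.columns) == list(new_test_data.columns)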
# Use sklearn's RobustScaler, which scales features with statistics that are robust to outliers (the median and the interquartile range)
pre_processing_pipeline = make_pipeline(RobustScaler(),
                                        # VarianceThreshold(0.001),
                                        )
X_train = pre_processing_pipeline.fit_transform(X_train)
X_test = pre_processing_pipeline.transform(X_test)
print(X_train.shape)
print(X_test.shape)
(1460, 270)
(1459, 270)
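With the scaled matrices in hand, a simple end-to-end check can be run. The choice of Ridge here is purely illustrative (it is not the model used in the original post); the point is to show that, since the target was transformed with np.log1p, predictions must be mapped back to the price scale with np.expm1:

from sklearn.linear_model import Ridge  # illustrative model choice, not from the original notebook

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)               # y_train is log1p(SalePrice)
y_pred = np.expm1(model.predict(X_test))  # invert the log transform -> predicted prices
print(y_pred[:5])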