Feature Processing
Missing ratios:
Variable       Missing  Ratio   Meaning
PoolQC         2909     100%    # pool quality
MiscFeature    2814     96%     # miscellaneous features
Alley          2721     93%     # alley access near the house
Fence          2348     80%     # fence around the house
FireplaceQu    1420     49%     # fireplace quality
LotFrontage    486      17%     # distance from the house to the street
GarageYrBlt    159      5%      # garage-related
GarageFinish   159      5%
GarageQual     159      5%
GarageCond     159      5%
GarageType     157      5%
BsmtCond       82       3%      # basement-related
BsmtExposure   82       3%
BsmtQual       81       3%
BsmtFinType2   80       3%
BsmtFinType1   79       3%
MasVnrType     24       1%      # masonry veneer
MasVnrArea     23       1%
MSZoning       4        0%      # others
Utilities      2        0%
BsmtFullBath   2        0%
BsmtHalfBath   2        0%
Functional     2        0%
Exterior1st    1        0%
Exterior2nd    1        0%
BsmtFinSF1     1        0%
BsmtFinSF2     1        0%
BsmtUnfSF     1        0%
TotalBsmtSF    1        0%
Electrical     1        0%
KitchenQual    1        0%
GarageCars     1        0%
GarageArea     1        0%
SaleType       1        0%
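A table like the one above can be generated with a short pandas helper. `missing_report` and the `demo` frame below are illustrative names, not from the original notes; in practice the input would be the concatenated train/test data:

```python
import pandas as pd

def missing_report(df):
    """Missing count and percentage per column, sorted descending."""
    miss = df.isnull().sum()
    miss = miss[miss > 0].sort_values(ascending=False)
    return pd.DataFrame({"missing": miss,
                         "ratio": (miss / len(df) * 100).round().astype(int)})

# Toy frame standing in for the concatenated train/test data
demo = pd.DataFrame({"PoolQC": [None, None, "Gd"],
                     "Fence": [None, "MnPrv", "GdWo"]})
print(missing_report(demo))
```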
Attribute meanings (not all of them)
√ LotFrontage : distance from the house to the street
√ LotArea : lot size
√ MSSubClass : type of dwelling being sold (no missing values)
√ MSZoning : zoning classification of the sale
× Alley : alley access near the house
× Utilities : available utilities (electricity etc.)
√ Exterior1st : exterior covering material of the house
√ Exterior2nd : second material, if the exterior uses more than one
√ MasVnrType : masonry veneer type
√ MasVnrArea : masonry veneer area; 0 when MasVnrType is missing
√ BsmtQual : height of the basement
√ BsmtCond : overall condition of the basement
√ BsmtExposure : walkout or garden-level basement walls
√ BsmtFinType1 : rating of the basement finished area
√ BsmtFinSF1 : type 1 finished square feet
√ BsmtFinType2 : rating of a second finished area, if present
√ BsmtFinSF2 : type 2 finished square feet
√ BsmtUnfSF : unfinished basement square feet
√ TotalBsmtSF : total basement square feet
√ KitchenQual : kitchen quality (can also be mapped to numeric values, like BsmtCond) (0.14801)
√ Functional : home functionality
√ Fireplaces : number of fireplaces
√ FireplaceQu : fireplace quality; can also be replaced with numeric values
√ Garage* : garage-related variables; handled the same way as the Bsmt ones
× PoolQC : only three non-missing values, just drop it
√ PoolArea : few non-zero values, but none missing, so it can be kept
√ Fence : fence quality
√ MiscFeature : other features, such as an elevator
√ SaleType : type of sale, e.g. whether there was a discount
√ Electrical : electrical system
√ GrLivArea : above grade (ground) living area square feet
Missing-Value Handling
LotFrontage : correlates with LotArea, so fill using the lot's side length; dividing by 1.5 seems about right (0.14941/0.15100)
            : or fill with the median (0.14993)
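Both LotFrontage options can be sketched as follows; the side-length heuristic treats the lot as roughly square. The `df` frame and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0, np.nan],
                   "LotArea": [8450, 9600, 11250, 9550]})

# Option 1: approximate frontage as the lot's side length, scaled by 1.5
side = np.sqrt(df["LotArea"]) / 1.5
df["LotFrontage_side"] = df["LotFrontage"].fillna(side)

# Option 2: plain median fill
df["LotFrontage_med"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
```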
MSZoning : factor variable with only a few missing values; fill with the most frequent level
    missing in test : alternatively, it intuitively relates to MSSubClass; using
        cat_exploration(test_data[test_data['MSSubClass'] == 70], 'MSZoning')
    find the most likely MSZoning for the corresponding MSSubClass and fill with that
    in fact only four are missing, so they could be filled by hand; RL rows, it turns out, usually have no Alley either
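The `cat_exploration` helper is never defined in these notes; a minimal version consistent with how it is called might simply show the value counts of a column within a subset:

```python
import pandas as pd

def cat_exploration(df, column):
    """Distribution of a categorical column, NaN included."""
    return df[column].value_counts(dropna=False)

# e.g. the most common MSZoning among rows with MSSubClass == 70
demo = pd.DataFrame({"MSSubClass": [70, 70, 70, 20],
                     "MSZoning": ["RM", "RM", "RL", "RL"]})
print(cat_exploration(demo[demo["MSSubClass"] == 70], "MSZoning"))
```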
Exterior1st : fill with the most frequent value (only one missing)
    the missing row has MSZoning == 'RL'; cat_exploration(test_data[test_data['MSZoning'] == 'RL'], 'Exterior1st') shows VinylSd is most common
    but searching for the most similar row turns up Wd Sdng instead (0.14982)
Basement : the basement-related gaps correspond exactly to houses without a basement, so fill with None; except that in test, rows 580, 725 and 1064 have some basement fields filled in
Fence : fill with None
MiscFeature : fill with None
SaleType : only one value missing, in the test data; fill with the most frequent, WD
Electrical : only one value missing, in the test data; fill with the most frequent, SBrkr
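The individual rules above can be collected into one helper; the column lists here are a sketch based on the notes, not exhaustive:

```python
import pandas as pd

def fill_missing(df):
    """Apply the simple imputation rules described above."""
    df = df.copy()
    # gaps that mean 'the house has no such facility' -> literal "None"
    for col in ["Fence", "MiscFeature", "BsmtQual", "BsmtCond",
                "BsmtExposure", "BsmtFinType1", "BsmtFinType2"]:
        if col in df:
            df[col] = df[col].fillna("None")
    # one or two stray gaps -> most frequent level
    for col in ["SaleType", "Electrical", "MSZoning"]:
        if col in df:
            df[col] = df[col].fillna(df[col].mode()[0])
    return df
```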
The heavily missing PoolQC, MiscFeature, Alley, Fence and FireplaceQu are missing simply because the house has no pool, special feature, nearby alley, fence or fireplace. Given how much is missing, these variables can be dropped outright.
In general, missing values can be filled with:
- the most frequent value
- the median
- the median within groups defined by some other feature
or the variable can be dropped altogether.
Other Processing
- Some extreme values occur only a handful of times; the corresponding rows can be removed:
    GrLivArea : train has 4 rows above 4000, test has one
    train rows 523, 691, 1182, 1298; test row 1089 (which looks like it would be expensive)
- SalePrice can be transformed with log1p, then inverted with expm1 for the submission
- MSSubClass and MoSold are int-typed but are really categorical
- The Qual-related variables, although object-typed, work better converted to numeric
- Derived features: once the Qual-related variables are numeric, they can be combined, e.g. into an overall quality score, as new features
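The points above, sketched together on a toy frame (the values are invented; only the transformations mirror the notes, and the quality-to-number mapping is one common convention, not necessarily the one used here):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"GrLivArea": [1500, 2000, 4600, 1800],
                      "SalePrice": [200000, 250000, 160000, 220000],
                      "MSSubClass": [20, 60, 70, 20],
                      "KitchenQual": ["Gd", "TA", "Ex", "Fa"]})

# drop the few extreme GrLivArea rows
train = train[train["GrLivArea"] < 4000].copy()

# model log1p(SalePrice); invert with expm1 at prediction time
train["SalePrice_log"] = np.log1p(train["SalePrice"])

# MSSubClass (and MoSold) are int codes but really categorical
train["MSSubClass"] = train["MSSubClass"].astype(str)

# quality grades map naturally onto an ordinal scale
qual_map = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
train["KitchenQual_num"] = train["KitchenQual"].map(qual_map)
```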
Different training methods (four so far):
(1) Lasso
regr = Lasso(alpha=best_alpha, max_iter=50000)
regr.fit(xtrain_feature, train_y)
y_pred = regr.predict(xtrain_feature)
y_test = train_y
print("Lasso score on training set: ", rmse(y_test, y_pred))
y_pred_lasso = regr.predict(xtest_feature)
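All four snippets call an `rmse` helper that the notes never define; assuming it is plain root-mean-squared error on the (already log-transformed) prices, i.e. the competition metric, a minimal version is:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error; targets are assumed already log-scaled."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```

`best_alpha` in the Lasso snippet is likewise assumed to come from an earlier cross-validated search (e.g. `LassoCV`).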
(2) XGBoost
regr = xgb.XGBRegressor(
colsample_bytree=0.2,
gamma=0.0,
learning_rate=0.05,
max_depth=6,
min_child_weight=1.5,
n_estimators=7200,
reg_alpha=0.9,
reg_lambda=0.6,
subsample=0.2,
seed=42,
silent=0)
regr.fit(xtrain_feature, train_y)
# Run prediction on training set to get a rough idea of how well it does.
y_pred = regr.predict(xtrain_feature)
y_test = train_y
print("XGBoost score on training set: ", rmse(y_test, y_pred))
y_pred_xgb = regr.predict(xtest_feature)
(3) ElasticNet
elasticNet = ElasticNetCV(l1_ratio = [0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1],
alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006,
0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6],
max_iter = 50000, cv = 10)
elasticNet.fit(xtrain_feature, train_y)
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_
print("Best l1_ratio :", ratio)
print("Best alpha :", alpha )
print("Try again for more precision with l1_ratio centered around " + str(ratio))
elasticNet = ElasticNetCV(l1_ratio = [ratio * .85, ratio * .9, ratio * .95, ratio, ratio * 1.05, ratio * 1.1, ratio * 1.15],
alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6],
max_iter = 50000, cv = 10)
elasticNet.fit(xtrain_feature, train_y)
if (elasticNet.l1_ratio_ > 1):
elasticNet.l1_ratio_ = 1
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_
print("Best l1_ratio :", ratio)
print("Best alpha :", alpha )
print("Now try again for more precision on alpha, with l1_ratio fixed at " + str(ratio) +
" and alpha centered around " + str(alpha))
elasticNet = ElasticNetCV(l1_ratio = ratio,
alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85, alpha * .9,
alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3,
alpha * 1.35, alpha * 1.4],
max_iter = 50000, cv = 10)
elasticNet.fit(xtrain_feature, train_y)
if (elasticNet.l1_ratio_ > 1):
elasticNet.l1_ratio_ = 1
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_
print("Best l1_ratio :", ratio)
print("Best alpha :", alpha )
y_pred = elasticNet.predict(xtrain_feature)
y_test = train_y
y_test_ela = elasticNet.predict(xtest_feature)
print("ElasticNet score on training set: ", rmse(y_test, y_pred))
(4) Ridge
ridge = RidgeCV(alphas = [0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6, 10, 30, 60])
ridge.fit(xtrain_feature, train_y)
alpha = ridge.alpha_
print("Best alpha :", alpha)
ridge = RidgeCV(alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85,
alpha * .9, alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15,
alpha * 1.25, alpha * 1.3, alpha * 1.35, alpha * 1.4],
cv = 10)
ridge.fit(xtrain_feature, train_y)
alpha = ridge.alpha_
print("Best alpha :", alpha)
y_test = train_y
y_pred = ridge.predict(xtrain_feature)
y_test_rdg = ridge.predict(xtest_feature)
print("Ridge RMSE score on training set: ", rmse(y_test, y_pred))
Of these, XGBoost reaches about 0.0463 on the training set and ElasticNet about 0.0244; the other two land between 0.123 and 0.125.
Kaggle Kernels
https://www.kaggle.com/yadavsarthak/you-got-this-feature-engineering-and-lasso
https://www.kaggle.com/juliencs/a-study-on-regression-applied-to-the-ames-dataset
https://www.kaggle.com/apapiu/regularized-linear-models
https://www.kaggle.com/xirudieyi/house-prices-advanced-regression-techniques/house-prices
https://www.kaggle.com/xchmiao/detailed-data-exploration-in-python/comments