Reaching the Top 3% of the Leaderboard -- House Prices: Advanced Regression Techniques -- Getting Started with Kaggle (39)

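1. Competition link

House Prices: Advanced Regression Techniques

2. References

(1) Comprehensive data exploration with Python

(Access from mainland China is slow; a Chinese translation is available on my blog.)

(2) Stacked Regressions : Top 4% on LeaderBoard (leaderboard score: 0.11466)

(3) Top 2% of LeaderBoard - Advanced FE (leaderboard score: 0.11383)

3. Leaderboard score

As of February 1, 2019

Rank: 123 of 4326 (top 3%)

Leaderboard score: 0.11338

4. Workflow and code

(1) Read the data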

import pandas as pd
import numpy as np
train_data = pd.read_csv("./input/train.csv")
test_data = pd.read_csv("./input/test.csv")

(2) Check the size of the training and test sets

# Check the size of the data
print(train_data.shape)
print(test_data.shape)

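(3) Fix anomalous values in the test set. For example, one test row has a garage construction year of 2207, yet according to the competition description all data comes from 2010 or earlier, so we guess that this garage was built in 2007. (In fact, the garage construction year feature 'GarageYrBlt' is dropped entirely later on.)

## Row 1132 of the test set made the model raise an error; the anomaly is a garage built in 2207, so fix it.
print("Value before the fix: %.1f" % test_data['GarageYrBlt'][1132])
## Assume the intended year is 2007
test_data.loc[1132,'GarageYrBlt'] = 2007.0

print("Value after the fix: %.1f" % test_data['GarageYrBlt'][1132])
## Row 1089 of the test set is anomalous: the house was built in 2008 and remodelled in 2009, but sold in 2007. This can happen in practice,
## but it conflicts with the rest of the data set, where houses are built before they are sold, so we assume the sale took place in 2009.
print(test_data['YearBuilt'][1089])
print(test_data['YearRemodAdd'][1089])
print("Value before the fix: %d" % (test_data['YrSold'][1089]))
test_data.loc[1089,'YrSold'] = 2009
print("Value after the fix: %d" % test_data['YrSold'][1089])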
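Anomalies like these two can also be searched for systematically instead of waiting for a model to fail. The following is only a sketch on made-up rows (the toy values and the `LAST_SALE_YEAR` constant are my own assumptions, not taken from the real test set): it flags garages built after the last sale year and houses sold before they were built or remodelled.

```python
import pandas as pd

# Toy rows standing in for the test set; the values are invented for illustration.
toy = pd.DataFrame({
    "YearBuilt":    [2006, 2008, 1995],
    "YearRemodAdd": [2007, 2009, 1995],
    "GarageYrBlt":  [2207.0, 2008.0, 1995.0],
    "YrSold":       [2007, 2007, 2010],
})

LAST_SALE_YEAR = 2010  # assumption: no record in the data set is later than 2010

# A garage cannot have been built after the last recorded sale year.
bad_garage = toy["GarageYrBlt"] > LAST_SALE_YEAR
# A house sold before it was built or remodelled is suspicious.
sold_too_early = toy["YrSold"] < toy[["YearBuilt", "YearRemodAdd"]].max(axis=1)

suspicious = toy[bad_garage | sold_too_early]
print(suspicious)
```

Whether a flagged row should be corrected, dropped or left alone is still a judgment call; the check only tells you where to look.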

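(4) Remove outliers

The outliers can be found with the visualization approach from the first reference.

In my experiments, removing these two outliers improves the leaderboard score by 0.002.

# Remove outliers
train_data = train_data.drop(train_data[(train_data['GrLivArea']>4000) & (train_data['SalePrice']<300000)].index)
# Removing outliers improved the score by 0.01 outright, which is a lot. The idea is to rank feature importance with a model, then inspect the top few features one by one to check for outliers.
#train_data = train_data.drop(index = [197, 802, 1181, 185, 690])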
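To make the filter above concrete, here is a small sketch on invented numbers (the four toy houses are assumptions, not real rows). It shows that the mask only catches houses that are very large yet cheap, while a large and expensive house is kept.

```python
import pandas as pd

# Invented examples: two very large but cheap houses, one large expensive one.
toy = pd.DataFrame({
    "GrLivArea": [1500, 4700, 5600, 4300],
    "SalePrice": [180000, 184000, 160000, 745000],
})

# Same condition as in the post: large living area combined with a low price.
mask = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 300000)
flagged = toy.index[mask].tolist()
print("rows flagged as outliers:", flagged)

cleaned = toy.drop(toy[mask].index)
print(cleaned)
```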

(5) Combine the training and test sets for feature processing

# Concatenate the training and test data
all_data = pd.concat((train_data, test_data)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
all_data.drop(['Id'], axis=1, inplace=True)
# SalePrice is handled separately later

(6) Plot a heatmap of the feature correlation matrix and analyse how the features relate to one another. See the first reference for details.

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Correlations are only defined for numeric columns, so select those first
corrmat = train_data.select_dtypes(include=[np.number]).corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
plt.show()
# Correlation matrix of the features most correlated with SalePrice
k = 10 # number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train_data[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

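(7) Drop redundant features, for example 'GarageArea' and 'GarageCars'. Garage area and the number of cars a garage holds are directly related and carry similar information. The other three features are dropped for the same reason.

#1. Drop GarageArea, keep GarageCars
#2. Drop TotRmsAbvGrd
#3. Drop GarageYrBlt
#4. Drop 1stFlrSF
# When two features have a correlation above 0.8, keep only one of them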
all_data.drop(['GarageArea'], axis=1, inplace=True)
all_data.drop(['TotRmsAbvGrd'], axis=1, inplace=True)
all_data.drop(['GarageYrBlt'],axis =1, inplace= True)
all_data.drop(['1stFlrSF'], axis=1, inplace=True)
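The pairs above were read off the heatmap by eye. As a rough sketch of how the same 0.8 rule could be applied programmatically, the code below lists every feature pair whose absolute correlation exceeds the threshold. The data is synthetic and the column names are only borrowed for readability, so none of the numbers say anything about the real data set.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: GarageArea is built to depend strongly on GarageCars.
rng = np.random.default_rng(0)
cars = rng.integers(0, 4, size=300)
toy = pd.DataFrame({
    "GarageCars": cars,
    "GarageArea": cars * 250 + rng.normal(0, 40, size=300),
    "LotArea": rng.normal(10000, 2000, size=300),
})

THRESHOLD = 0.8  # same cut-off as in the post
corr = toy.corr().abs()

# Visit each unordered pair of columns once and keep the highly correlated ones.
high_pairs = [
    (a, b, corr.loc[a, b])
    for a, b in combinations(corr.columns, 2)
    if corr.loc[a, b] > THRESHOLD
]
for a, b, value in high_pairs:
    print(f"{a} / {b}: {value:.2f}")
```

Which member of a pair to drop is still a manual decision; a common heuristic is to keep the one more strongly correlated with the target.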

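(8) Drop features with little discriminative power. Inspecting 'Utilities' shows that 2914 rows take the value 'AllPub' and only 1 row takes 'NoSeWa'. This feature barely varies, so drop it.

# Check the value counts of the feature
all_data['Utilities'].value_counts()
# Drop the feature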
all_data.drop(['Utilities'], axis=1, inplace=True)
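The same check can be run over every column at once. This is a sketch under my own assumptions: the `dominant_share` helper and the 0.95 cut-off are hypothetical, and the toy frame only imitates the shape of the problem.

```python
import pandas as pd

def dominant_share(series):
    """Fraction of non-missing rows taken by the most frequent value."""
    return series.value_counts(normalize=True).iloc[0]

# Toy data: one nearly constant column and one reasonably balanced column.
toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 99 + ["NoSeWa"],
    "Street": ["Pave"] * 60 + ["Grvl"] * 40,
})

THRESHOLD = 0.95  # hypothetical cut-off, tune it by experiment
low_information = [c for c in toy.columns if dominant_share(toy[c]) >= THRESHOLD]
print(low_information)
```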

(9) Inspect the missing values, sorted by missing ratio

all_data_na = (all_data.isnull().sum(axis = 0) / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
all_data_na

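(10) Fill in the missing values (see the third reference for this part). Filling is really inference from what is already known, for example inferring pool quality from the rest of the house's record. In most cases numeric features are filled with the median and categorical features with the mode. It also helps to ask why a value is missing: if the pool area is 0 and the pool quality is missing, the house simply has no pool and hence no pool quality, so filling pool quality with 'None' is more reasonable.

features = all_data
# Switch the fill strategy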
features['Functional'] = features['Functional'].fillna('Typ')
features['Electrical'] = features['Electrical'].fillna("SBrkr")
features['KitchenQual'] = features['KitchenQual'].fillna("TA")
features['Exterior1st'] = features['Exterior1st'].fillna(features['Exterior1st'].mode()[0])
features['Exterior2nd'] = features['Exterior2nd'].fillna(features['Exterior2nd'].mode()[0])
features['SaleType'] = features['SaleType'].fillna(features['SaleType'].mode()[0])

features.loc[2418, 'PoolQC'] = 'Fa'
features.loc[2501, 'PoolQC'] = 'Gd'
features.loc[2597, 'PoolQC'] = 'Fa'
features['PoolQC'] = features['PoolQC'].fillna("None")
features['MiscFeature'] = features['MiscFeature'].fillna("None")
# Switch the fill strategy
features['Alley'] = features['Alley'].fillna("None")
features['Fence'] = features['Fence'].fillna("None")
features['FireplaceQu'] = features['FireplaceQu'].fillna("None")
features.loc[2124, 'GarageFinish'] = features['GarageFinish'].mode()[0]
features.loc[2574, 'GarageFinish'] = features['GarageFinish'].mode()[0]
features.loc[2574, 'GarageCars'] = features['GarageCars'].median()
features.loc[2124, 'GarageQual'] = features['GarageQual'].mode()[0]
features.loc[2574, 'GarageQual'] = features['GarageQual'].mode()[0]
features.loc[2124, 'GarageCond'] = features['GarageCond'].mode()[0]
features.loc[2574, 'GarageCond'] = features['GarageCond'].mode()[0]
features['GarageCond'] = features['GarageCond'].fillna("None")
features['GarageFinish'] = features['GarageFinish'].fillna("None")
features['GarageQual'] = features['GarageQual'].fillna("None")
features['GarageType'] = features['GarageType'].fillna("None")


features.loc[332, 'BsmtFinType2'] = 'ALQ' #since smaller than SF1
features.loc[947, 'BsmtExposure'] = 'No' 
features.loc[1485, 'BsmtExposure'] = 'No'
features.loc[2038, 'BsmtCond'] = 'TA'
features.loc[2183, 'BsmtCond'] = 'TA'
features.loc[2215, 'BsmtQual'] = 'Po' #v small basement so let's do Poor.
features.loc[2216, 'BsmtQual'] = 'Fa' #similar but a bit bigger.
features.loc[2346, 'BsmtExposure'] = 'No' #unfinished bsmt so prob not.
features.loc[2522, 'BsmtCond'] = 'Gd' #cause ALQ for bsmtfintype1

features['BsmtCond'] = features['BsmtCond'].fillna("None")
features['BsmtExposure'] = features['BsmtExposure'].fillna("None")
features['BsmtFinType1'] = features['BsmtFinType1'].fillna("None")
features['BsmtFinType2'] = features['BsmtFinType2'].fillna("None")
features['BsmtQual'] = features['BsmtQual'].fillna("None")
features['MasVnrType'] = features['MasVnrType'].fillna("None")
features['MSZoning'] = features.groupby('MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))
neighborhood_group = features.groupby('Neighborhood')
lot_medians = neighborhood_group['LotFrontage'].median()
features['LotFrontage'] = features.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerics = []
for i in features.columns:
    if features[i].dtype in numeric_dtypes: 
        numerics.append(i)        
features.update(features[numerics].fillna(0))
nulls = np.sum(features.isnull())
nullcols = nulls.loc[(nulls != 0)]
dtypes = features.dtypes
dtypes2 = dtypes.loc[(nulls != 0)]
info = pd.concat([nullcols, dtypes2], axis=1).sort_values(by=0, ascending=False)
print(info)
print("There are", len(nullcols), "columns with missing values")
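The two main fill patterns used above are easier to see on a tiny example. The sketch below uses an invented five-row frame: a missing categorical value that means "feature absent" becomes the string 'None', and a missing numeric value is replaced by the median of its neighbourhood.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names mirror the real data.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 40.0, np.nan],
    "Alley": [np.nan, "Grvl", np.nan, "Pave", np.nan],
})

filled = toy.copy()

# Missing Alley means the house has no alley access, so use an explicit category.
filled["Alley"] = filled["Alley"].fillna("None")

# Missing LotFrontage is filled with the median of the same neighbourhood.
filled["LotFrontage"] = (
    filled.groupby("Neighborhood")["LotFrontage"]
    .transform(lambda s: s.fillna(s.median()))
)
print(filled)
```

Working on a copy, as here, avoids silently modifying the original frame; note that in the post `features = all_data` is only a second name for the same object.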

(11) Apply a log transform to 'SalePrice' so that its distribution is closer to normal.

# Apply the log(1+x) transform with log1p
train_y = np.log1p(train_data["SalePrice"])
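To see what the transform does, here is a small sketch on synthetic, right-skewed "prices" (drawn from a log-normal distribution purely as an assumption). It compares the skewness before and after `np.log1p` and checks that `np.expm1` undoes the transform, which is what the prediction step relies on later.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed values standing in for SalePrice.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=2000)

log_prices = np.log1p(prices)      # log(1 + x)
recovered = np.expm1(log_prices)   # exp(x) - 1, the inverse transform

print("skew before: %.3f" % skew(prices))
print("skew after:  %.3f" % skew(log_prices))
```

A side benefit is that the competition is scored on the RMSE between the logarithms of predicted and actual prices, so a plain RMSE on the transformed target matches that metric closely.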

(12) Add combined features: build new features from the existing ones that relate more strongly to price

## Add new combined features

#features['Total_sqr_footage'] = (features['BsmtFinSF1'] + features['BsmtFinSF2'] +
#                                 features['1stFlrSF'] + features['2ndFlrSF'])

features['Total_Bathrooms'] = (features['FullBath'] + (0.5*features['HalfBath']) + 
                               features['BsmtFullBath'] + (0.5*features['BsmtHalfBath']))

features['Total_porch_sf'] = (features['OpenPorchSF'] + features['3SsnPorch'] +
                              features['EnclosedPorch'] + features['ScreenPorch'] +
                             features['WoodDeckSF'])

features['HouseYr'] = all_data['YrSold'] - all_data['YearBuilt'] 
features['HouseYrAdd'] = all_data['YrSold'] - all_data['YearRemodAdd'] 

#simplified features
features['haspool'] = features['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
features['has2ndfloor'] = features['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
features['hasgarage'] = features['GarageCars'].apply(lambda x: 1 if x > 0 else 0)
features['hasbsmt'] = features['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
features['hasfireplace'] = features['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)
all_data = features

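(13) Apply LabelEncoder to categorical features. Not every categorical feature is label-encoded, only those whose values carry meaning in their order, such as quality ratings.

# MSSubClass is the dwelling type
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)

# Apply the same conversion to OverallCond
all_data['OverallCond'] = all_data['OverallCond'].astype(str)

# Year and month sold
all_data['YrSold'] = all_data['YrSold'].astype(str)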
all_data['MoSold'] = all_data['MoSold'].astype(str)

from sklearn.preprocessing import LabelEncoder

cols = ( 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
# Transform with LabelEncoder
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(all_data[c].values)) 
    all_data[c] = lbl.transform(list(all_data[c].values))

# Check the dimensions
print('Shape of all_data: {}'.format(all_data.shape))

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
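One caveat worth knowing: LabelEncoder assigns integers according to the sorted order of the labels, not according to their meaning. The sketch below contrasts it with an explicit mapping. The `quality_order` dictionary is my own assumption about how the quality codes should be ranked, and this is an alternative to what the post does, not a description of it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

quality = pd.Series(["Ex", "Gd", "TA", "Fa", "Po", "None"])

# LabelEncoder sorts the labels as strings: Ex, Fa, Gd, None, Po, TA.
encoded = LabelEncoder().fit_transform(quality)

# Hypothetical explicit ranking from "no feature" up to "excellent".
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
ordinal = quality.map(quality_order)

print(pd.DataFrame({"label": quality, "LabelEncoder": encoded, "ordinal": ordinal}))
```

Tree-based models can cope with either encoding, but linear models such as Lasso treat the integer as a magnitude, so the explicit mapping may be worth an experiment.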

(14) Apply a Box-Cox transform to the numeric features to bring them closer to normal

# Compute the skewness of every numeric feature

from scipy.stats import skew
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head()
# After computing skewness, apply the Box-Cox transform (here to every numeric feature, with a fixed lambda)
from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    all_data[feat] = boxcox1p(all_data[feat], lam)
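For reference, `boxcox1p(x, lam)` computes ((1 + x)^lam - 1) / lam for a nonzero `lam`. The sketch below checks this against a manual calculation on synthetic, exponentially distributed values (an assumption chosen only because they are strongly right-skewed) and compares the skewness before and after.

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import skew

# Synthetic right-skewed feature, e.g. something like an area in square feet.
rng = np.random.default_rng(1)
x = rng.exponential(scale=1000.0, size=2000)

lam = 0.15  # same fixed lambda as in the post
transformed = boxcox1p(x, lam)

# Manual version of the same formula for comparison.
manual = ((1.0 + x) ** lam - 1.0) / lam

print("skew before: %.3f" % skew(x))
print("skew after:  %.3f" % skew(transformed))
```

Because the formula maps 0 to 0, indicator columns such as `haspool` keep their zeros. A common variation is to transform only the features whose absolute skewness exceeds some cut-off, which is an easy thing to try.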

(15) One-hot encode the remaining categorical variables

all_data = pd.get_dummies(all_data)
print(all_data.shape)

(16) Before training, split the processed data back into training and test sets

n_train =  len(train_data)
train_x = all_data[:n_train]
test_x = all_data[n_train:]
assert test_x.shape[0] == test_data.shape[0]

print("Shape of train_x:", train_x.shape)
print("Shape of train_y:", train_y.shape)

(17) Model training (import libraries and define helper functions)

from sklearn.preprocessing import RobustScaler,StandardScaler
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SelectFromModel,SelectKBest
import xgboost as xgb
import lightgbm as lgb
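# Cross-validation helper
n_folds = 5

def rmsle_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42)  # shuffle the data before splitting
    # train_x_new is the feature-selected training matrix built in step (18)
    rmse= np.sqrt(-cross_val_score(model, train_x_new,train_y.values, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)

# cross_val_score runs the cross-validation. Its scoring convention is that larger values are always better, so it returns the negated mean squared error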
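The sign convention is easy to verify on a toy problem. The sketch below uses synthetic regression data and a Ridge model, both stand-ins of my own choosing, and shows that the raw scores are negative mean squared errors that have to be negated before taking the square root.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data in place of the house price features.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(Ridge(alpha=1.0), X, y,
                          scoring="neg_mean_squared_error", cv=kf)

# Scores are "higher is better", so the MSE comes back negated.
rmse = np.sqrt(-neg_mse)
print("per-fold RMSE:", np.round(rmse, 3))
print("mean: %.3f  std: %.3f" % (rmse.mean(), rmse.std()))
```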

(18) Use an embedded method for feature selection

# Apply robust scaling to the features
rb_scaler = RobustScaler()
train_x_rob= rb_scaler.fit_transform(train_x)
lasso = Lasso(alpha =0.0005, random_state=1)  # model parameters can be set here; alpha is set explicitly, the rest are defaults
lasso.fit(train_x_rob, train_y)  # fit the model on X and y; the data must not contain missing values
model = SelectFromModel(lasso,prefit=True) 
train_x_new = model.transform(train_x)
test_x_new = model.transform(test_x)
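It may look odd that the Lasso is fitted on scaled data while the selector is applied to the unscaled matrices. The selector only records which columns to keep, so this works, as the sketch below illustrates on synthetic data (the data set and the `alpha=1.0` value are assumptions for the demo). For an L1-penalised estimator such as Lasso, the default threshold in SelectFromModel is 1e-5 on the absolute coefficient.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import RobustScaler

# Synthetic data: 20 features, only 5 of which carry signal.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

# Fit the Lasso on scaled features so that the penalty treats them evenly.
X_scaled = RobustScaler().fit_transform(X)
lasso = Lasso(alpha=1.0, random_state=1).fit(X_scaled, y)

# The selector keeps the columns whose coefficient is not (almost) zero.
selector = SelectFromModel(lasso, prefit=True)
mask = selector.get_support()

# Selection is just column indexing, so it applies to the unscaled matrix too.
X_selected = selector.transform(X)
print("kept %d of %d features" % (mask.sum(), X.shape[1]))
```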

(19) Model training (check each model's cross-validation score)

lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =7, nthread = -1)
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))

(20) Average the individual models into an ensemble

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # Clone every model and fit it to the data
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]      
        for model in self.models_:
            model.fit(X, y)

        return self
    
    # Predict with every model and average the predictions
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)   
averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))

score = rmsle_cv(averaged_models)
print("Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

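(21) Stack the models

Meta-model stacking:
In this approach we add a meta-model on top of the averaged base models and train it on the base models' out-of-fold predictions.
The training procedure is as follows:
1. Split the whole training set into two disjoint sets (here, train and holdout).
2. Train several base models on the first part (train).
3. Test these base models on the second part (holdout).
4. Use the predictions from step 3 (called out-of-fold predictions) as inputs and the true labels (the target variable) as outputs to train a higher-level learner, called the meta-model.
The first three steps are carried out iteratively. For example, with 5-fold stacking we first split the training data into 5 folds and then run 5 iterations. In each iteration, every base model is trained on 4 folds and predicts the remaining fold (the holdout fold).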

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
   
    # Fit clones of the base models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Fit each base model and predict its out-of-fold rows to prepare the training data for the second stacking layer
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
                
        # Train the meta-model (second layer)
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self
   
    # Predict with the stacked model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)
stacked_averaged_models = StackingAveragedModels(base_models = (ENet,GBoost,KRR),meta_model = lasso)

score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
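For comparison, scikit-learn (0.22 and later) ships a built-in `StackingRegressor` that follows the same idea of training the meta-model on cross-validated predictions. The sketch below is not the class used in this post. It runs on synthetic data with small, arbitrarily chosen models, and it differs in one detail: at prediction time the built-in version uses base models refitted on the full training set, whereas the class above averages the per-fold clones.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data in place of the house price features.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("enet", ElasticNet(alpha=0.01)),
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Lasso(alpha=0.01),  # meta-model trained on out-of-fold predictions
    cv=5,
)

scores = cross_val_score(stack, X, y, scoring="neg_mean_squared_error", cv=3)
predictions = stack.fit(X, y).predict(X)
print("mean CV MSE: %.3f" % (-scores.mean()))
```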

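(22) Predict on the test set

# Define the evaluation function (the targets are already log-transformed, so this RMSE is the RMSLE)
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))
# The models chosen for the blend below do not match my own experimental results
test_x = pd.DataFrame(test_x_new)
train_x = pd.DataFrame(train_x_new)
#StackedRegressor: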
stacked_averaged_models.fit(train_x.values, train_y.values)
stacked_train_pred = stacked_averaged_models.predict(train_x.values)
stacked_pred = np.expm1(stacked_averaged_models.predict(test_x.values))
print(rmsle(train_y.values, stacked_train_pred))
#XGBoost:
model_xgb.fit(train_x.values, train_y.values)
xgb_train_pred = model_xgb.predict(train_x.values)
xgb_pred = np.expm1(model_xgb.predict(test_x.values))
print(rmsle(train_y.values, xgb_train_pred))
#LightGBM:
model_lgb.fit(train_x.values, train_y.values)
lgb_train_pred = model_lgb.predict(train_x.values)
lgb_pred = np.expm1(model_lgb.predict(test_x.values))
print(rmsle(train_y.values, lgb_train_pred))
ensemble = stacked_pred*0.6 + xgb_pred*0.15  + lgb_pred*0.25
sub = pd.DataFrame()
sub['Id'] = test_data['Id']
sub['SalePrice'] = ensemble
sub.to_csv('submission_0131_06.csv',index=False)
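The final blend is a weighted average whose weights (0.6, 0.15, 0.25) sum to 1. A small helper makes that constraint explicit. The `blend` function and the toy predictions below are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def blend(predictions, weights):
    """Weighted average of several prediction vectors; weights must sum to 1."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("blend weights must sum to 1")
    # One column per model, then a weighted sum across the columns.
    return np.column_stack(predictions) @ weights

# Invented predictions from three models for two houses.
stacked = np.array([100000.0, 200000.0])
xgb_p = np.array([110000.0, 190000.0])
lgb_p = np.array([90000.0, 210000.0])

blended = blend([stacked, xgb_p, lgb_p], [0.6, 0.15, 0.25])
print(blended)
```

Weights that do not sum to 1 would systematically inflate or shrink every predicted price, which is why the helper refuses them.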

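5. Summary and future work

(1) Feature engineering matters a great deal: suitable feature work improves the score substantially. Tuning model parameters also helps to some extent, but the gains are smaller, and it sometimes overfits the training set, giving good cross-validation scores but poor leaderboard scores.

(2) In feature engineering, whether to drop a feature or a sample comes down to whether it brings the model more information than noise or more noise than information. In general this can only be settled by experiment.

(3) Future work should build more and stronger combined features. Feature selection here relied solely on an embedded method, without any further analysis of how individual features affect the model.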

 
