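I have just finished reading the following article:
https://blog.csdn.net/u012063773/article/details/79349256
It is an excellent article, though I did not fully understand all of it.
All the code is attached below. It is copied entirely from that article; I only added some comments of my own.
Because data analysis, data preprocessing, and data mining are all written in one script, it is a bit long. Strictly speaking, the data analysis part belongs in Jupyter Notebook, where results are easier to view and share.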
# -*- coding: utf-8 -*-
"""
Created on Fri Feb 1 09:05:20 2019
@author: xuefei
"""
# Import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Scoring utilities
from sklearn.metrics import make_scorer, mean_squared_error
# Models
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV
# Cross-validation (sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it)
from sklearn.model_selection import cross_val_score
# Fetches items from an object by index or key
from operator import itemgetter
# Iterator helpers
import itertools
# seaborn builds on top of matplotlib
import seaborn as sns
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Statistics library
from scipy import stats
# Display floats with 3 decimal places
pd.set_option('display.float_format',lambda x:'{:.3f}'.format(x))
##########################################################Data exploration###########################################################
##############################################Data feature analysis#########################################
##############################1. Overall view of the data
train = pd.read_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\train.csv',engine='python')
test = pd.read_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\test.csv',engine='python')
# This concatenation drops the Id column and the target SalePrice; alldata is a 2919 x 79 table
alldata = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],test.loc[:,'MSSubClass':'SaleCondition']),ignore_index = True)
# Get an overall description of the data
explore = train.describe(include = 'all').T
# Count the missing (NaN) entries in each column
explore['null'] = len(train) - explore['count']
# Data type of each column
explore.insert(0,'dtype',train.dtypes)
# Write out the description
explore.T.to_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\explore_train.csv')
# Do the same for the combined data
explore = alldata.describe(include = 'all').T
explore['null'] = len(alldata) - explore['count']
explore.insert(0,'dtype',alldata.dtypes)
explore.T.to_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\explore_alldata.csv')
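##################################2. With the overall picture in hand, run a correlation analysis
# Column correlations, numeric columns only (numeric_only is required by recent pandas when text columns are present)
corrmat = train.corr(numeric_only=True)
plt.subplots(figsize = (12,9))
# Draw the heatmap
sns.heatmap(corrmat,vmax = 0.9,square = True)
#sns.heatmap(corrmat,cbar = True,annot = True,square = True,fmt = '.2f',annot_kws = {'size':10})
# Look at the 10 variables most correlated with the final price
k = 10
plt.figure(figsize = (12,9))
# Take the k rows with the largest values in the SalePrice column and keep their index
cols = corrmat.nlargest(k,'SalePrice').index
# np.corrcoef treats each row as one variable, hence the transpose
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale = 1.25)
hm = sns.heatmap(cm,cbar = True,annot = True,square = True,fmt = '.2f',annot_kws = {'size':10},yticklabels = cols.values,xticklabels = cols.values)
plt.show()
Corr = train.corr(numeric_only=True)
features = Corr.loc[(Corr['SalePrice'] > 0.5) | (Corr['SalePrice'] < -0.5)]
# Draw the scatter-plot matrix; the data passed to a multi-variable plot must not contain missing values, otherwise it fails
sns.set()
cols = list(features.index)
# Pairwise scatter plots for the selected columns (the size argument was renamed to height in newer seaborn)
sns.pairplot(train[cols],height = 2.5)
plt.show()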
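#################################3. We have drawn a lot of fancy plots, but I feel the non-numeric variables should not simply be left out. Next, distribution analysis
# Check whether the data follow a normal distribution
# Histogram and normal probability plot
# Fit with a normal distribution
sns.distplot(train['SalePrice'],fit=stats.norm)
# Probability (Q-Q) plot: ordered sample values against theoretical normal quantiles
fig = plt.figure()
res = stats.probplot(train['SalePrice'],plot = plt)
print("Skewness: %f" %train['SalePrice'].skew()) # skewness
print("Kurtosis: %f" %train['SalePrice'].kurt()) # kurtosis
# The plots show that SalePrice is not normally distributed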
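# A small sketch of why the non-normal shape matters and why the script later applies np.log1p to SalePrice.
# The prices below are synthetic log-normal numbers generated for illustration, not the Kaggle data:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); NOT the Kaggle data.
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=5000)))

raw_skew = prices.skew()
log_skew = np.log1p(prices).skew()

print("skew before log1p: %.3f" % raw_skew)
print("skew after  log1p: %.3f" % log_skew)
```

# A skewness near 0 after the transform is what a roughly symmetric, bell-shaped histogram looks like in numbers.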
# Explore the feature GrLivArea
# Relationship between the label and GrLivArea
# Honestly, I think data1 is entirely redundant
data1 = pd.concat([train['SalePrice'],train['GrLivArea']],axis = 1)
data1.plot.scatter(x = 'GrLivArea' , y = 'SalePrice',ylim = (0,800000))
# Histogram and normal probability plot to check for normality
fig = plt.figure()
sns.distplot(train['GrLivArea'],fit = stats.norm)
fig = plt.figure()
res = stats.probplot(train['GrLivArea'],plot = plt)
## The scatter plot shows two outliers in the lower right corner, which should be removed; the distribution is not normal
# Explore the feature TotalBsmtSF
# Relationship between the label and TotalBsmtSF
# Honestly, I think data2 is entirely redundant
data2 = pd.concat([train['SalePrice'],train['TotalBsmtSF']],axis = 1)
data2.plot.scatter(x = 'TotalBsmtSF' , y = 'SalePrice',ylim = (0,800000))
# Histogram and normal probability plot to check for normality
fig = plt.figure()
sns.distplot(train['TotalBsmtSF'],fit = stats.norm)
fig = plt.figure()
res = stats.probplot(train['TotalBsmtSF'],plot = plt)
## The scatter plot shows one outlier in the lower right corner, which should be removed; the distribution is not normal
# Explore the feature OverallQual
# Relationship between the label and OverallQual
# Honestly, I think data3 is entirely redundant
data3 = pd.concat([train['SalePrice'],train['OverallQual']],axis = 1)
data3.plot.scatter(x = 'OverallQual' , y = 'SalePrice',ylim = (0,800000))
# Histogram and normal probability plot to check for normality
fig = plt.figure()
sns.distplot(train['OverallQual'],fit = stats.norm)
fig = plt.figure()
res = stats.probplot(train['OverallQual'],plot = plt)
# The plots show that this variable is not continuous, so look at it another way
f , ax = plt.subplots(figsize = (8,6))
fig = sns.boxplot(x = 'OverallQual' , y = 'SalePrice', data = data3)
fig.axis(ymin = 0,ymax = 800000)
# Explore the feature MoSold
# Look directly at the number of houses sold in each month
fig = plt.figure()
train.groupby('MoSold').count()['Id'].plot(kind = 'bar')
fig = plt.figure()
sns.countplot(x = 'MoSold',data = train)
# I do not quite see why these four features are explored; perhaps they serve as examples? And what about the non-numeric features? Let us read on
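##############################################Data quality analysis#########################################
#################################1. Missing values
# Handling missing values is tedious and frustrating work
# First, recall the Titanic case, where we used Sex to fill in Age; the result was poor because sex and age have no inherent relationship
# So start by inspecting the features and grouping related ones (for example, those sharing a prefix or describing the same thing); only when they are clearly related do we fill using the group mode or median
# Features sharing a prefix may all have missing values, or only some of them may; start from a feature with missing values and look for related features, whether or not those are missing too
# That is also why the code below sometimes uses all_data_na and sometimes alldata; I think using alldata throughout would be perfectly fine
# Features with no consistent related feature are filled with the mode
# Inspect missing values: return a DataFrame that lists, for every column of the full data with missing values, the missing count, missing ratio, non-missing count, non-missing count in train, non-missing count in test, and dtype
def missing_values(alldata,train,test):
    all_data_na = pd.DataFrame(pd.isnull(alldata).sum(),columns = ['missingNum'])
    all_data_na['missingRatio'] = all_data_na.missingNum / len(alldata)
    all_data_na['existNum'] = pd.notnull(alldata).sum()
    all_data_na['train_notna'] = pd.notnull(train).sum()
    all_data_na['test_notna'] = pd.notnull(test).sum()
    all_data_na['dtype'] = alldata.dtypes
    all_data_na = all_data_na.loc[all_data_na['missingNum'] > 0].sort_values(by = ['missingNum'],ascending = False)
    return all_data_na
all_data_na = missing_values(alldata,train,test)
# There are quite a few missing values; we go through them one by one below
# Note that this is still the inspection stage and nothing is being fixed yet; I find it a good habit to write the statistics next to the code as a comment right after running it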
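#1. Pool-related missing values
# PoolQC and PoolArea both describe the swimming pool, so it may help to look at them together
# PoolQC is mostly null and PoolArea is mostly zero
# Distribution of PoolQC values
print(alldata['PoolQC'].value_counts())
# Mean PoolArea per PoolQC (select the column before aggregating so non-numeric columns are not averaged)
poolqc = alldata.groupby('PoolQC')['PoolArea'].mean()
print('Mean PoolArea for each PoolQC\n',poolqc)
# Rows that have a PoolArea but no PoolQC
poolqcna = alldata[(alldata['PoolQC'].isnull())& (alldata['PoolArea']!=0)][['PoolQC','PoolArea']]
print('Rows with a PoolArea but no PoolQC\n',poolqcna)
# Rows that have a PoolQC but no PoolArea
poolareana = alldata[(alldata['PoolQC'].notnull()) & (alldata['PoolArea']==0)][['PoolQC','PoolArea']]
print('Rows with a PoolQC but no PoolArea\n',poolareana)
# Result: PoolQC takes three values, and 3 rows have an empty PoolQC with a nonzero PoolArea. Group by PoolQC, compute the mean PoolArea per group, and give each such row the PoolQC whose mean is closest to its area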
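#2. Garage-related missing values
# Again use the shared prefix as the basis for relating attributes
# Find all attributes with Garage in the name
a = pd.Series(alldata.columns)
GarageList = a[a.str.contains('Garage')].values
print(GarageList)
# .ix no longer exists in pandas; .loc with the index intersection selects only columns that have missing values
print(all_data_na.loc[all_data_na.index.intersection(GarageList),:])
# 7 in total, and every one of them has missing values
# Look at the two numeric ones first. Note: 'GarageYrBlt' also has missing values; it is handled later together with the year features
print(len(alldata[(alldata['GarageArea']==0) & (alldata['GarageCars']==0)]))# 157
print(len(alldata[(alldata['GarageArea']!=0) & (alldata['GarageCars'].isnull() == True)])) # 0
# The missing values of these two columns fall in the same row
print(len(alldata[(alldata['GarageArea'].isnull()) & (alldata['GarageCars'].isnull())])) #1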
#3. Bsmt-related missing values
# Find all attributes with Bsmt in the name
a = pd.Series(alldata.columns)
BsmtList = a[a.str.contains('Bsmt')].values
print(BsmtList)
# all_data_na only contains columns with missing values, so take the index intersection to avoid a KeyError if some Bsmt column is complete
allBsmtNa = all_data_na.loc[all_data_na.index.intersection(BsmtList),:]
print(allBsmtNa)
# Looking at these columns: BsmtQual, BsmtCond and BsmtExposure are all strings with the same rating scale
# BsmtFinType1 and BsmtFinType2 are strings with the same rating scale
# The rest are numeric
condition = (alldata['BsmtExposure'].isnull()) & (alldata['BsmtCond'].notnull() )& (alldata['BsmtQual'].notnull())
alldata[condition][BsmtList]
# Finding: three rows have an empty BsmtExposure while the other values are present; fill them with the mode
# Honestly I did not understand the sentence above; I will come back to it later
condition1 = (alldata['BsmtCond'].isnull()) & (alldata['BsmtExposure'].notnull())& (alldata['BsmtQual'].notnull())
print(len(alldata[alldata['BsmtCond']==alldata['BsmtQual']]))
#print(len(alldata[alldata['BsmtExposure']==alldata['BsmtCond']]))
alldata[condition1][BsmtList]
# Finding: three rows have an empty BsmtCond while the other values are present. In 1265 rows BsmtQual == BsmtCond, so fill BsmtCond from BsmtQual
condition2 = (alldata['BsmtQual'].isnull()) & (alldata['BsmtExposure'].notnull())& (alldata['BsmtCond'].notnull() )
alldata[condition2][BsmtList]
# Finding: two rows have an empty BsmtQual while the other values are present; fill them the same way as for condition1
# For the remaining fields, consider filling with the mode
print(alldata['BsmtFinSF1'].value_counts().head(5))# fill NaN with 0
print(alldata['BsmtFinSF2'].value_counts().head(5))# fill NaN with 0
print(alldata['BsmtFullBath'].value_counts().head(5))# fill NaN with 0
print(alldata['BsmtHalfBath'].value_counts().head(5))# fill NaN with 0
print(alldata['BsmtFinType1'].value_counts().head(5)) # fill NaN with Unf
print(alldata['BsmtFinType2'].value_counts().head(5)) # fill NaN with Unf
#4. Mas-related missing values
# Find all attributes with Mas in the name
a = pd.Series(alldata.columns)
MasList = a[a.str.contains('Mas')].values
print(MasList)
# There are two such attributes
print(alldata[['MasVnrType', 'MasVnrArea']].isnull().sum()) #24,23
print(len(alldata[(alldata['MasVnrType'].isnull())& (alldata['MasVnrArea'].isnull())])) # 23
print(len(alldata[(alldata['MasVnrType'].isnull())& (alldata['MasVnrArea'].notnull())])) #1
print(len(alldata[(alldata['MasVnrType'].notnull())& (alldata['MasVnrArea'].isnull())])) #0
print(alldata['MasVnrType'].value_counts())
MasVnrM = alldata.groupby('MasVnrType')['MasVnrArea'].median()
print(MasVnrM)
mtypena = alldata[(alldata['MasVnrType'].isnull())& (alldata['MasVnrArea'].notnull())][['MasVnrType','MasVnrArea']]
print(mtypena)
# So one row has an empty MasVnrType but a non-empty area; fill it by group, the same way as PoolQC and PoolArea
# In the other rows, fill MasVnrType with fillna("None") and MasVnrArea with fillna(0)
#5. MS-related missing values
# Find all attributes with MS in the name
a = pd.Series(alldata.columns)
MSList = a[a.str.contains('MS')].values
print(MSList) #['MSSubClass' 'MSZoning']
print(alldata[MSList].isnull().sum())
print(alldata[alldata['MSSubClass'].isnull() | alldata['MSZoning'].isnull()][['MSSubClass','MSZoning']])
pd.crosstab(alldata.MSSubClass, alldata.MSZoning)
# Observation: MSSubClass 30/70 mostly pairs with 'RM', and 20 with 'RL'; fill accordingly
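#6. LotFrontage
# LotFrontage (linear feet of street connected to the property) is likely related to Neighborhood (where the house is located)
print(alldata[["LotFrontage", "Neighborhood"]].isnull().sum())
print(alldata["LotFrontage"].value_counts().head(5))
# Consider filling it in some structured way
# For example:
# What the next statement means: alldata.groupby("Neighborhood")["LotFrontage"] groups alldata by Neighborhood
# and keeps only the LotFrontage values. transform works with groupby much like a grouped aggregate in SQL: each group's NaN are filled with that group's median
# Note: unlike the rest of this exploration section, this statement really modifies alldata
alldata["LotFrontage"] = alldata.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))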
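# A minimal sketch of the groupby + transform fill described above, on made-up data
# (the neighborhood names and numbers are invented for illustration, not taken from the Kaggle files):

```python
import numpy as np
import pandas as pd

# Toy data with made-up neighborhood names and values.
df = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 20.0, 30.0, np.nan],
})

# transform returns a result aligned with the original rows,
# so it can be assigned straight back to the column.
df["LotFrontage"] = (
    df.groupby("Neighborhood")["LotFrontage"]
      .transform(lambda s: s.fillna(s.median()))
)
print(df)
```

# Each NaN takes the median of its own neighborhood, so group A and group B are filled with different values.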
#7. Others
others = ['Functional','Utilities','SaleType','Electrical', "FireplaceQu",'Alley',"Fence", "MiscFeature",\
'KitchenQual',"LotFrontage",'Exterior1st','Exterior2nd']
print(alldata[others].isnull().sum())
print(alldata['Functional'].value_counts().head(5)) # fill with the mode
print(alldata['Utilities'].value_counts().head(5)) # fill with the mode
print(alldata['SaleType'].value_counts().head(5)) # fill with the mode
print(alldata['Electrical'].value_counts().head(5)) # fill with the mode
print(alldata["Fence"].value_counts()) # fill with the mode
print(alldata["MiscFeature"].value_counts().head(5)) # fill with the mode
print(alldata['KitchenQual'].value_counts().head(5)) # fill with the mode
print(alldata['Exterior1st'].value_counts().head(5)) # fill with the mode
print(alldata['Exterior2nd'].value_counts().head(5)) # fill with the mode
print(alldata['FireplaceQu'].value_counts().head(5)) # fill with 'None'
print(alldata['Alley'].value_counts().head(5)) # fill with 'None'
#################################2. Outliers
# The analysis above shows that some outliers need to be removed
#################################3. Duplicates
alldata[alldata.duplicated()]
# Only 2 rows; they matter little and do not affect what follows, so leave them
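##########################################################Data preprocessing###########################################################
##############################################Data cleaning#########################################
#################################1. Handling missing values
## We spent a lot of effort getting to know the data; now handle the missing values one by one based on that analysis
## Besides showing that features should be grouped, the handling here is quite representative
## Case 1, fill a nominal value using a numeric one: group by the existing nominal values, take each group's mean or median, and choose the nominal value with the smallest numeric error. Treat the case separately when the numeric column has many zeros
##1. Pool-related missing values. PoolArea has no NaN; PoolQC has a lot of NaN
# When filling PoolQC from PoolArea, three rows have a NaN PoolQC and a nonzero PoolArea; fill those three by the closest group mean and fill the rest with 'None'
# Note that 'None' here is a label, not a null value
poolqcna = alldata[(alldata['PoolQC'].isnull())& (alldata['PoolArea']!=0)][['PoolQC','PoolArea']]
areamean = alldata.groupby('PoolQC')['PoolArea'].mean()
for i in poolqcna.index:
    v = alldata.loc[i,['PoolArea']].values
    print(type(np.abs(v-areamean)))
    # idxmin returns the PoolQC label; in current pandas, argmin returns an integer position instead
    alldata.loc[i,['PoolQC']] = np.abs(v-areamean).astype('float64').idxmin()
alldata['PoolQC'] = alldata["PoolQC"].fillna("None")
#alldata['PoolArea'] = alldata["PoolArea"].fillna(0)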
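# A minimal sketch of case 1 (pick the label whose group mean is closest) on invented data.
# The quality labels and areas are made up; the point is only to show idxmin returning a label:

```python
import numpy as np
import pandas as pd

# Toy data: quality labels and areas are invented for illustration.
df = pd.DataFrame({
    "PoolQC": ["Ex", "Ex", "Gd", "Gd", "Fa", None, None],
    "PoolArea": [500, 540, 640, 660, 300, 650, 0],
})

group_mean = df.groupby("PoolQC")["PoolArea"].mean()

# Rows that have an area but no quality label.
todo = df[df["PoolQC"].isnull() & (df["PoolArea"] != 0)].index
for i in todo:
    diff = (group_mean - df.loc[i, "PoolArea"]).abs()
    # idxmin returns the index label ('Ex', 'Gd', ...) of the smallest distance.
    df.loc[i, "PoolQC"] = diff.idxmin()

# Whatever is still missing has no pool at all: use the 'None' label.
df["PoolQC"] = df["PoolQC"].fillna("None")
print(df)
```

# The row with area 650 lands in the group whose mean area is nearest, while the row with area 0 gets the 'None' label.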
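## Case 2, many missing values and no related information to use: fill directly with 0 or the 'None' label
##2. Garage-related missing values
# Non-numeric columns are all filled with 'None'
alldata[['GarageCond','GarageFinish','GarageQual','GarageType']] = alldata[['GarageCond','GarageFinish','GarageQual','GarageType']].fillna('None')
# Numeric columns are all filled with 0
alldata[['GarageCars','GarageArea']] = alldata[['GarageCars','GarageArea']].fillna(0)
# How did this one sneak in here??
# mode()[0] returns the most frequent value
alldata['Electrical'] = alldata['Electrical'].fillna( alldata['Electrical'].mode()[0])
# Note that 'GarageYrBlt' has not been filled yet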
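## Case 3, fill a nominal value using another nominal one: when they share the same label set and are strongly related, copy directly; otherwise use the mode
##3. Bsmt-related missing values
a = pd.Series(alldata.columns)
BsmtList = a[a.str.contains('Bsmt')].values
# Bsmt also has both numeric and non-numeric columns
# Start with the first three non-numeric columns, which belong together
# From the earlier analysis, BsmtQual and BsmtCond are strongly related and can fill each other directly, while BsmtExposure is less related
# So first fill BsmtExposure with its own mode
condition = (alldata['BsmtExposure'].isnull()) & (alldata['BsmtCond'].notnull()) # 3 rows
alldata.loc[(condition),'BsmtExposure'] = alldata['BsmtExposure'].mode()[0]
# Then fill the ones that can be copied from each other: one fills the other, provided the source is present and the target is missing
condition1 = (alldata['BsmtCond'].isnull()) & (alldata['BsmtExposure'].notnull()) # 3 rows
alldata.loc[(condition1),'BsmtCond'] = alldata.loc[(condition1),'BsmtQual']
condition2 = (alldata['BsmtQual'].isnull()) & (alldata['BsmtExposure'].notnull()) # 2 rows
alldata.loc[(condition2),'BsmtQual'] = alldata.loc[(condition2),'BsmtCond']
# Next, the two other non-numeric columns that belong together; same idea
condition3 = (alldata['BsmtFinType1'].notnull()) & (alldata['BsmtFinType2'].isnull())
alldata.loc[condition3,'BsmtFinType2'] = 'Unf'
# Fill the rest with 0 or 'None'. I do not fully understand why the mode is not used. [My forced explanation: whenever the mode is an option, check how much is missing; if too much is missing, use 0 or 'None' instead of the mode]
allBsmtNa = all_data_na.loc[all_data_na.index.intersection(BsmtList),:]
allBsmtNa_obj = allBsmtNa[allBsmtNa['dtype']=='object'].index
allBsmtNa_flo = allBsmtNa[allBsmtNa['dtype']!='object'].index
alldata[allBsmtNa_obj] =alldata[allBsmtNa_obj].fillna('None')
alldata[allBsmtNa_flo] = alldata[allBsmtNa_flo].fillna(0)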
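## This is case 1 again
##4. Mas-related missing values
# As said before, this is handled exactly like Pool, except with the median instead of the mean; I do not know why
MasVnrM = alldata.groupby('MasVnrType')['MasVnrArea'].median()
mtypena = alldata[(alldata['MasVnrType'].isnull())& (alldata['MasVnrArea'].notnull())][['MasVnrType','MasVnrArea']]
for i in mtypena.index:
    v = alldata.loc[i,['MasVnrArea']].values
    # idxmin returns the MasVnrType label; in current pandas, argmin returns an integer position instead
    alldata.loc[i,['MasVnrType']] = np.abs(v-MasVnrM).astype('float64').idxmin()
alldata['MasVnrType'] = alldata["MasVnrType"].fillna("None")
alldata['MasVnrArea'] = alldata["MasVnrArea"].fillna(0)
## Case 4, fill using the mode or median within groups of a related attribute
##5. MS-related missing values
# Based on the earlier analysis, fill with the mode of the related attribute
# Note the difference from case 1: there a numeric column picked the label, whereas here the missing MSZoning label takes the mode of its MSSubClass group (MSSubClass is stored as integers but is really a category code)
alldata['MSZoning'] = alldata.groupby('MSSubClass')['MSZoning'].transform(lambda x:x.fillna(x.mode()[0]))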
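## Case 5, fill a numeric value using another numeric one: polynomial fit
##6. LotFrontage
# Fill using a polynomial fit... what happened to the promised Neighborhood-based fill? (The exploration section above already ran that fill on alldata, so there may be nothing left to fill at this point)
# Here LotFrontage is modeled as a function of LotArea; I guess the other two Lot attributes are not used because they are non-numeric
# First fit the polynomial on the rows where LotFrontage is present
x = alldata.loc[alldata["LotFrontage"].notnull(), "LotArea"]
y = alldata.loc[alldata["LotFrontage"].notnull(), "LotFrontage"]
# x and y are Series; restrict the domain to decide which points the fit should use
t = (x <= 25000) & (y <= 150)
p = np.polyfit(x[t], y[t], 1)
# Then compute the missing values from the fitted polynomial
alldata.loc[alldata['LotFrontage'].isnull(), 'LotFrontage'] = np.polyval(p, alldata.loc[alldata['LotFrontage'].isnull(), 'LotArea'])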
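# A minimal sketch of case 5 (fit on the known rows, predict the missing ones) with invented numbers.
# The toy frontage is constructed as exactly 0.01 * area + 20 so that the fitted line is easy to check by hand:

```python
import numpy as np
import pandas as pd

# Toy data: frontage is exactly 0.01 * area + 20 where it is known.
df = pd.DataFrame({
    "LotArea": [4000.0, 6000.0, 8000.0, 10000.0, 7000.0],
    "LotFrontage": [60.0, 80.0, 100.0, 120.0, np.nan],
})

known = df["LotFrontage"].notnull()
# Degree-1 fit: coeffs[0] is the slope, coeffs[1] the intercept.
coeffs = np.polyfit(df.loc[known, "LotArea"], df.loc[known, "LotFrontage"], 1)

# Evaluate the fitted line only on the rows that are missing.
df.loc[~known, "LotFrontage"] = np.polyval(coeffs, df.loc[~known, "LotArea"])
print(df)
```

# On real data the fit is only as good as the linear relationship, which is why the script restricts the points to x <= 25000 and y <= 150 before fitting.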
## Same as case 2
##7. Everything else is filled with the mode
alldata['KitchenQual'] = alldata['KitchenQual'].fillna(alldata['KitchenQual'].mode()[0]) # fill with the mode
alldata['Exterior1st'] = alldata['Exterior1st'].fillna(alldata['Exterior1st'].mode()[0])
alldata['Exterior2nd'] = alldata['Exterior2nd'].fillna(alldata['Exterior2nd'].mode()[0])
alldata["Functional"] = alldata["Functional"].fillna(alldata['Functional'].mode()[0])
alldata["SaleType"] = alldata["SaleType"].fillna(alldata['SaleType'].mode()[0])
alldata["Utilities"] = alldata["Utilities"].fillna(alldata['Utilities'].mode()[0])
# For the columns with very many missing values the mode is no longer informative; fill with 'None' directly
alldata[["Fence", "MiscFeature","FireplaceQu","Alley"]] = alldata[["Fence", "MiscFeature","FireplaceQu","Alley"]].fillna('None')
## Case 6: time series
##8. Finally, GarageYrBlt is still unfilled
year_map = pd.concat(pd.Series('YearGroup' + str(i+1), index=range(1871+i*20,1891+i*20)) for i in range(0, 7))
alldata.GarageYrBlt = alldata.GarageYrBlt.map(year_map)
alldata['GarageYrBlt']= alldata['GarageYrBlt'].fillna('None')
# To sum up: if there is a related feature, use it; if there is none or it cannot be used, fill with the mode when few values are missing; otherwise fill with 0 or 'None'
# Confirm that every missing value has been filled
print(pd.isnull(alldata).sum().sum())
#################################2. Handling outliers
# Now deal with the outliers mentioned earlier
# Explore the feature GrLivArea
# Relationship between the label and GrLivArea
# Honestly, I think data1 is entirely redundant
data1 = pd.concat([train['SalePrice'],train['GrLivArea']],axis = 1)
data1.plot.scatter(x = 'GrLivArea' , y = 'SalePrice',ylim = (0,800000))
## The scatter plot shows two outliers in the lower right corner, which should be removed
outliers_id = train[(train.GrLivArea > 4000) & (train.SalePrice < 200000)].index
print(outliers_id)
alldata = alldata.drop(outliers_id)
Y = train.SalePrice.drop(outliers_id)
plt.figure(figsize=(8,6))
plt.scatter(train.TotalBsmtSF,train.SalePrice)
plt.show()
# This confirms that the TotalBsmtSF outlier seen earlier is one of the same rows, so removing them once is enough
train_now = pd.concat([alldata.iloc[:1458,:],Y], axis=1)
test_now = alldata.iloc[1458:,:]
# Data cleaning is done; save the data for the next steps
train_now.to_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\train_afterclean.csv')
test_now.to_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\test_afterclean.csv')
##############################################Data transformation#########################################
# Everything above can be reset now; read the cleaned data back in
train = pd.read_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\train_afterclean.csv',engine='python')
test = pd.read_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\test_afterclean.csv',engine='python')
alldata = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'], test.loc[:,'MSSubClass':'SaleCondition']), ignore_index=True)
alldata.shape
#################################1. Nominal attributes (non-numeric in nature; they may still be stored as int, for example the month)
# Convert nominal attributes to numbers according to the nature of each attribute
# Ordinal attributes
# Ordinal nominal data [different letters stand for different ratings]
ordinalList = ['ExterQual', 'ExterCond', 'GarageQual', 'GarageCond','PoolQC',\
'FireplaceQu', 'KitchenQual', 'HeatingQC', 'BsmtQual','BsmtCond']
ordinalmap = {'Ex': 5,'Gd': 4,'TA': 3,'Fa': 2,'Po': 1,'None': 0}
for c in ordinalList:
    alldata[c] = alldata[c].map(ordinalmap)
alldata['BsmtExposure'] = alldata['BsmtExposure'].map({'None':0, 'No':1, 'Mn':2, 'Av':3, 'Gd':4})
alldata['BsmtFinType1'] = alldata['BsmtFinType1'].map({'None':0, 'Unf':1, 'LwQ':2,'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6})
alldata['BsmtFinType2'] = alldata['BsmtFinType2'].map({'None':0, 'Unf':1, 'LwQ':2,'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6})
alldata['Functional'] = alldata['Functional'].map({'Maj2':1, 'Sev':2, 'Min2':3, 'Min1':4, 'Maj1':5, 'Mod':6, 'Typ':7})
alldata['GarageFinish'] = alldata['GarageFinish'].map({'None':0, 'Unf':1, 'RFn':2, 'Fin':3})
alldata['Fence'] = alldata['Fence'].map({'MnWw':0, 'GdWo':1, 'MnPrv':2, 'GdPrv':3, 'None':4})
# Other attributes: the nominal value is just a symbol, with no clear ordering
# Purely symbolic with a very large set of values, such as street: split into present or absent
MasVnrType_Any = alldata.MasVnrType.map({'BrkCmn': 1,'BrkFace': 1,'CBlock': 1,'Stone': 1,'None': 0})
MasVnrType_Any.name = 'MasVnrType_Any' # rename this Series
SaleCondition_PriceDown = alldata.SaleCondition.map({'Abnorml': 1,'Alloca': 1,'AdjLand': 1,'Family': 1,'Normal': 0,'Partial': 0})
SaleCondition_PriceDown.name = 'SaleCondition_PriceDown' # rename this Series
# Symbols that carry meaning: interpret them by hand and assign binary values
alldata = alldata.replace({'CentralAir': {'Y': 1,'N': 0}})
alldata = alldata.replace({'PavedDrive': {'Y': 1,'P': 0,'N': 0}})
# I did not figure out the reason for this one
newer_dwelling = alldata['MSSubClass'].map({20: 1,30: 0,40: 0,45: 0,50: 0,60: 1,70: 0,75: 0,80: 0,85: 0,90: 0,120: 1,150: 0,160: 0,180: 0,190: 0})
newer_dwelling.name= 'newer_dwelling' # rename this Series
# Change this column's dtype to mark it as a nominal attribute
alldata['MSSubClass'] = alldata['MSSubClass'].apply(str)
# Neighborhoods are also divided into good ones and bad ones
Neighborhood_Good = pd.DataFrame(np.zeros((alldata.shape[0],1)), columns=['Neighborhood_Good'])
Neighborhood_Good[alldata.Neighborhood=='NridgHt'] = 1
Neighborhood_Good[alldata.Neighborhood=='Crawfor'] = 1
Neighborhood_Good[alldata.Neighborhood=='StoneBr'] = 1
Neighborhood_Good[alldata.Neighborhood=='Somerst'] = 1
Neighborhood_Good[alldata.Neighborhood=='NoRidge'] = 1
# Neighborhood_Good = (alldata['Neighborhood'].isin(['StoneBr','NoRidge','NridgHt','Timber','Somerst']))*1 #(did not work as well as the version above)
Neighborhood_Good.name='Neighborhood_Good'# add this variable
# Season: binarize by the peak and off-peak months seen earlier
season = (alldata['MoSold'].isin([5,6,7]))*1
season.name='season'
alldata['MoSold'] = alldata['MoSold'].apply(str)
# For quality (Qual) and condition (Cond) attributes, split at a middle threshold (5 here) and build new attributes
# OverallQual: split into two sub-attributes at 5; the distance below 5 and the distance above 5 are each kept as an ordinal value
overall_poor_qu = alldata.OverallQual.copy()# a Series
overall_poor_qu = 5 - overall_poor_qu
overall_poor_qu[overall_poor_qu<0] = 0
overall_poor_qu.name = 'overall_poor_qu'
overall_good_qu = alldata.OverallQual.copy()
overall_good_qu = overall_good_qu - 5
overall_good_qu[overall_good_qu<0] = 0
overall_good_qu.name = 'overall_good_qu'
# OverallCond: split into two sub-attributes at 5; the distance below 5 and the distance above 5 are each kept as an ordinal value
overall_poor_cond = alldata.OverallCond.copy()# a Series
overall_poor_cond = 5 - overall_poor_cond
overall_poor_cond[overall_poor_cond<0] = 0
overall_poor_cond.name = 'overall_poor_cond'
overall_good_cond = alldata.OverallCond.copy()
overall_good_cond = overall_good_cond - 5
overall_good_cond[overall_good_cond<0] = 0
overall_good_cond.name = 'overall_good_cond'
# ExterQual: split into two binary flags around 3 (TA): one for ratings below 3 and one for ratings above 3
exter_poor_qu = alldata.ExterQual.copy()
exter_poor_qu[exter_poor_qu<3] = 1
exter_poor_qu[exter_poor_qu>=3] = 0
exter_poor_qu.name = 'exter_poor_qu'
exter_good_qu = alldata.ExterQual.copy()
exter_good_qu[exter_good_qu<=3] = 0
exter_good_qu[exter_good_qu>3] = 1
exter_good_qu.name = 'exter_good_qu'
# ExterCond: split into two binary flags around 3 (TA): one for ratings below 3 and one for ratings above 3
exter_poor_cond = alldata.ExterCond.copy()
exter_poor_cond[exter_poor_cond<3] = 1
exter_poor_cond[exter_poor_cond>=3] = 0
exter_poor_cond.name = 'exter_poor_cond'
exter_good_cond = alldata.ExterCond.copy()
exter_good_cond[exter_good_cond<=3] = 0
exter_good_cond[exter_good_cond>3] = 1
exter_good_cond.name = 'exter_good_cond'
# BsmtCond: split into two binary flags around 3 (TA): one for ratings below 3 and one for ratings above 3
bsmt_poor_cond = alldata.BsmtCond.copy()
bsmt_poor_cond[bsmt_poor_cond<3] = 1
bsmt_poor_cond[bsmt_poor_cond>=3] = 0
bsmt_poor_cond.name = 'bsmt_poor_cond'
bsmt_good_cond = alldata.BsmtCond.copy()
bsmt_good_cond[bsmt_good_cond<=3] = 0
bsmt_good_cond[bsmt_good_cond>3] = 1
bsmt_good_cond.name = 'bsmt_good_cond'
# GarageQual: split into two binary flags around 3 (TA): one for ratings below 3 and one for ratings above 3
garage_poor_qu = alldata.GarageQual.copy()
garage_poor_qu[garage_poor_qu<3] = 1
garage_poor_qu[garage_poor_qu>=3] = 0
garage_poor_qu.name = 'garage_poor_qu'
garage_good_qu = alldata.GarageQual.copy()
garage_good_qu[garage_good_qu<=3] = 0
garage_good_qu[garage_good_qu>3] = 1
garage_good_qu.name = 'garage_good_qu'
# GarageCond: split into two binary flags around 3 (TA): one for ratings below 3 and one for ratings above 3
garage_poor_cond = alldata.GarageCond.copy()
garage_poor_cond[garage_poor_cond<3] = 1
garage_poor_cond[garage_poor_cond>=3] = 0
garage_poor_cond.name = 'garage_poor_cond'
garage_good_cond = alldata.GarageCond.copy()
garage_good_cond[garage_good_cond<=3] = 0
garage_good_cond[garage_good_cond>3] = 1
garage_good_cond.name = 'garage_good_cond'
# KitchenQual: split into two binary flags around 3 (TA): one for ratings below 3 and one for ratings above 3
kitchen_poor_qu = alldata.KitchenQual.copy()
kitchen_poor_qu[kitchen_poor_qu<3] = 1
kitchen_poor_qu[kitchen_poor_qu>=3] = 0
kitchen_poor_qu.name = 'kitchen_poor_qu'
kitchen_good_qu = alldata.KitchenQual.copy()
kitchen_good_qu[kitchen_good_qu<=3] = 0
kitchen_good_qu[kitchen_good_qu>3] = 1
kitchen_good_qu.name = 'kitchen_good_qu'
# Combine all the attributes constructed above (this is not the final merge yet)
qu_list = pd.concat((overall_poor_qu, overall_good_qu, overall_poor_cond, overall_good_cond, exter_poor_qu,
exter_good_qu, exter_poor_cond, exter_good_cond, bsmt_poor_cond, bsmt_good_cond, garage_poor_qu,
garage_good_qu, garage_poor_cond, garage_good_cond, kitchen_poor_qu, kitchen_good_qu), axis=1)
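# A compact sketch of the OverallQual / OverallCond split above, using clip instead of the copy-subtract-mask steps.
# The ratings are made up, and clip is my own shorthand here, not what the original article uses:

```python
import pandas as pd

# Made-up ratings on a 1-10 scale.
qual = pd.Series([2, 5, 7, 10], name="OverallQual")

poor = (5 - qual).clip(lower=0).rename("overall_poor_qu")   # how far below 5
good = (qual - 5).clip(lower=0).rename("overall_good_qu")   # how far above 5

print(pd.concat([qual, poor, good], axis=1))
```

# A rating of exactly 5 is 0 in both columns, so each new column only measures distance on its own side of the threshold.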
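# Time attributes:
# The years are not simply normalized here; several new features are built instead
# Whether the house has ever been remodeled
Xremoded = (alldata['YearBuilt']!=alldata['YearRemodAdd'])*1
# Whether the remodel happened in the year of sale (or later)
Xrecentremoded = (alldata['YearRemodAdd']>=alldata['YrSold'])*1
# Whether the house was built in the year of sale (or later)
XnewHouse = (alldata['YearBuilt']>=alldata['YrSold'])*1
# Age of the house, relative to 2010
XHouseAge = 2010 - alldata['YearBuilt']
# Years since the sale, relative to 2010
XTimeSinceSold = 2010 - alldata['YrSold']
# Years between the last remodel and the sale
XYearSinceRemodel = alldata['YrSold'] - alldata['YearRemodAdd']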
Xremoded.name='Xremoded'
Xrecentremoded.name='Xrecentremoded'
XnewHouse.name='XnewHouse'
XTimeSinceSold.name='XTimeSinceSold'
XYearSinceRemodel.name='XYearSinceRemodel'
XHouseAge.name='XHouseAge'
year_list = pd.concat((Xremoded,Xrecentremoded,XnewHouse,XHouseAge,XTimeSinceSold,XYearSinceRemodel),axis=1)
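# I am not sure why this new attribute is constructed
# It seems to bucket the house prices, train an SVM on a few variables to predict the bucket each row falls into, and use the bucket number as a new feature
# What I have not understood: what pc stands for (presumably price category, given the price_category variable below), and why these particular variables were chosen
from sklearn.svm import SVC
svm = SVC(C=100, gamma=0.0001, kernel='rbf')
# pc buckets the house prices; create it as a string Series so the labels match its dtype
pc = pd.Series('pc1', index=train.index)
pc[train.SalePrice >= 150000] = 'pc2'
pc[train.SalePrice >= 220000] = 'pc3'
# I do not know why these variables are used
columns_for_pc = ['Exterior1st', 'Exterior2nd', 'RoofMatl', 'Condition1', 'Condition2', 'BldgType']
# Friendly reminder: do not double-click this variable to open it in the IDE's variable viewer. It crashes immediately and everything edited in the session is lost, even if you had saved. Luckily I keep good backups
X_t = pd.get_dummies(train.loc[:, columns_for_pc], sparse=True)
svm.fit(X_t, pc)# train
# Rescale the price (p is not used afterwards)
p = train.SalePrice/100000
price_category = pd.DataFrame(np.zeros((alldata.shape[0],1)), columns=['pc'])
X_t = pd.get_dummies(alldata.loc[:, columns_for_pc], sparse=True)
pc_pred = svm.predict(X_t) # predict
price_category[pc_pred=='pc2'] = 1
price_category[pc_pred=='pc3'] = 2
price_category.name='price_category'
# Discretize continuous data
year_map = pd.concat(pd.Series('YearGroup' + str(i+1), index=range(1871+i*20,1891+i*20)) for i in range(0, 7))
# Map each year to its group
# alldata.GarageYrBlt = alldata.GarageYrBlt.map(year_map) # already done during missing-value filling (it has to be mapped first and filled afterwards, otherwise it fails; why? I think it is because GarageYrBlt has missing values)
alldata.YearBuilt = alldata.YearBuilt.map(year_map)
alldata.YearRemodAdd = alldata.YearRemodAdd.map(year_map)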
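# A sketch of what year_map does, checked against pd.cut as an alternative way to get the same 20-year groups.
# The sample years are picked by hand, and pd.cut is my own suggestion rather than something the original article uses:

```python
import pandas as pd

# Mapping built the same way as in the script: 20-year groups from 1871 to 2010.
year_map = pd.concat(
    [pd.Series("YearGroup" + str(i + 1), index=range(1871 + i * 20, 1891 + i * 20))
     for i in range(7)]
)

years = pd.Series([1871, 1890, 1891, 1950, 2010])  # sample years, chosen by hand
via_map = years.map(year_map)

# Same grouping with pd.cut; intervals are right-closed: (1870, 1890], (1890, 1910], ...
edges = list(range(1870, 2011, 20))
labels = ["YearGroup" + str(i + 1) for i in range(7)]
via_cut = pd.cut(years, bins=edges, labels=labels).astype(str)

print(pd.DataFrame({"year": years, "map": via_map, "cut": via_cut}))
```

# Years outside 1871-2010 have no entry in year_map and come back as NaN, which is why GarageYrBlt is mapped first and filled with 'None' afterwards.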
#################################2. Numeric attributes (here the number really does measure the size of a quantity)
# First select these attributes
numeric_feats = alldata.dtypes[alldata.dtypes != 'object'].index
# Take the 75th percentile of each of these columns
t = alldata[numeric_feats].quantile(.75)
use_75_scater = t[t!=0].index
# Columns whose 75th percentile is nonzero are scaled by it (why the 75th percentile???)
alldata[use_75_scater] = alldata[use_75_scater]/alldata[use_75_scater].quantile(.75)
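# I do not know the original author's reason for the 75th percentile. One plausible explanation (my assumption, not stated in the source article)
# is that many columns are mostly zeros, so their median is 0 and cannot serve as a divisor, while the 75th percentile is nonzero more often.
# A sketch on invented columns:

```python
import pandas as pd

# Invented columns: 'porch' is mostly zeros, 'pool' is almost all zeros, 'area' has no zeros.
df = pd.DataFrame({
    "area":  [800.0, 1000.0, 1200.0, 1600.0, 2000.0],
    "porch": [0.0, 0.0, 0.0, 40.0, 80.0],
    "pool":  [0.0, 0.0, 0.0, 0.0, 500.0],
})

q50 = df.median()
q75 = df.quantile(0.75)
print(pd.DataFrame({"median": q50, "q75": q75}))

# Scale only the columns whose 75th percentile is nonzero, as the script does.
cols = q75[q75 != 0].index
scaled = df.copy()
scaled[cols] = scaled[cols] / q75[cols]
print(scaled)
```

# In this toy frame 'porch' could not be scaled by its median, but it can be scaled by its 75th percentile; 'pool' is left untouched either way.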
# Standardize: take the log of the label, and transform the other columns toward a normal distribution
from scipy.special import boxcox1p
# I also do not understand why these variables get the transform and the others do not
t = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
'1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal']
# alldata.loc[:, t] = np.log1p(alldata.loc[:, t])
train["SalePrice"] = np.log1p(train["SalePrice"])
lam = 0.15
for feat in t:
    alldata[feat] = boxcox1p(alldata[feat], lam)
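# A short check of what boxcox1p computes: ((1 + x) ** lam - 1) / lam for nonzero lam, and log1p(x) when lam is 0.
# The input values are arbitrary numbers chosen for illustration:

```python
import numpy as np
from scipy.special import boxcox1p

x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])
lam = 0.15

# Box-Cox transform of (1 + x), written out by hand.
by_formula = ((1.0 + x) ** lam - 1.0) / lam
by_scipy = boxcox1p(x, lam)
print(np.allclose(by_formula, by_scipy))

# With lambda = 0 the transform reduces to log1p.
print(np.allclose(boxcox1p(x, 0.0), np.log1p(x)))
```

# So the commented-out np.log1p line above is the lam = 0 special case of the same family, and lam = 0.15 is a slightly milder compression.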
# One-hot encode the nominal attributes
X = pd.get_dummies(alldata)
# In fact X contains no NaN at this point
X = X.fillna(X.mean())
# Why drop these???
X = X.drop('Condition2_PosN', axis=1)
X = X.drop('MSZoning_C (all)', axis=1)
X = X.drop('MSSubClass_160', axis=1)
# Now attach the pile of attributes built earlier
X= pd.concat((X, newer_dwelling, season, year_list ,qu_list,MasVnrType_Any, \
price_category,SaleCondition_PriceDown,Neighborhood_Good), axis=1)
# Then create more new attributes; this looks like combining different attributes
from itertools import product, chain
# chain(iter1, iter2, ..., iterN):
# Given a group of iterables (iter1, iter2, ..., iterN), this creates a new iterator that links them all together.
# The returned iterator yields items from iter1 until it is exhausted, then from iter2, and so on until every item of iterN has been used.
def poly(X):
    areas = ['LotArea', 'TotalBsmtSF', 'GrLivArea', 'GarageArea', 'BsmtUnfSF'] # 5 columns
    # qu_list.axes[1] gives the column names, which is the same as qu_list.columns (get_values() no longer exists in pandas)
    others = chain(qu_list.columns, year_list.columns, ordinalList,
                   ['MasVnrType_Any']) #,'Neighborhood_Good','SaleCondition_PriceDown'
    for a, b in product(areas, others):
        x = X.loc[:, [a, b]].prod(axis=1) # row-wise product of the two columns
        x.name = a + '_' + b
        print(a + '_' + b)
        yield x
XP = pd.concat(poly(X), axis=1) # (2917, 165)
X = pd.concat((X, XP), axis=1) # (2917, 460)
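# A toy version of the interaction features built by poly: every "area" column multiplied by every "quality" column.
# The frame and the helper name interactions are invented for illustration; only the column-naming pattern follows the script:

```python
from itertools import product
import pandas as pd

# Tiny invented frame: two "area" columns and two "quality" columns.
X = pd.DataFrame({
    "LotArea": [1.0, 2.0, 3.0],
    "GrLivArea": [0.5, 1.0, 1.5],
    "overall_good_qu": [0, 2, 5],
    "XnewHouse": [1, 0, 1],
})

areas = ["LotArea", "GrLivArea"]
others = ["overall_good_qu", "XnewHouse"]

def interactions(frame, left, right):
    """Yield one product column per (left, right) pair."""
    for a, b in product(left, right):
        col = frame[a] * frame[b]
        col.name = a + "_" + b
        yield col

XP = pd.concat(list(interactions(X, areas, others)), axis=1)
print(XP.columns.tolist())
print(XP)
```

# With 5 area columns and 33 other columns the script gets 5 * 33 = 165 new columns, which matches the (2917, 165) shape noted above.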
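# Split back into training set and test set; preprocessing is usually done this way, with the train and test features processed together
X_train = X[:train.shape[0]]
X_test = X[train.shape[0]:]
print(X_train.shape)#(1458, 460)
# Then attach the labels of the training set
Y= train.SalePrice
train_now = pd.concat([X_train,Y], axis=1)
test_now = X_test
# Save the result; data preprocessing is now complete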
train_now.to_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\train_afterchange.csv')
test_now.to_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\test_afterchange.csv')
##########################################################Model building###########################################################
# xgboost: one of the boosting algorithms
import xgboost as xgb
train = pd.read_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\train_afterchange.csv',engine='python')
test = pd.read_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\test_afterchange.csv',engine='python')
alldata = pd.concat((train.iloc[:,1:-1], test.iloc[:,1:]), ignore_index=True)
alldata.shape
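# Separate the features from the label
X_train = train.iloc[:,1:-1]
y = train.iloc[:,-1]
X_test = test.iloc[:,1:]
# Define the validation function
def rmse_cv(model):
    # scikit-learn cross-validation, which I have used once before; the score is a negated MSE, so flip the sign before the square root
    rmse= np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv = 5))
    return(rmse)
# Now start calling the ready-made models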
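# A self-contained sketch of the same RMSE cross-validation on synthetic data, to show where the minus sign comes from.
# Ridge and the random data are stand-ins chosen for illustration, not the models or data used in this script:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data: y is a noisy linear function of 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# scikit-learn scorers follow "higher is better", so the MSE comes back negated.
neg_mse = cross_val_score(Ridge(alpha=1.0), X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=5)
rmse = np.sqrt(-neg_mse)
print("per-fold RMSE:", rmse.round(4))
print("mean %.4f, std %.4f" % (rmse.mean(), rmse.std()))
```

# Because y in the script is log1p(SalePrice), the RMSE it reports is an error in log space, which is the metric this competition scores on.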
#LASSO MODEL
clf1 = LassoCV(alphas = [1, 0.1, 0.001, 0.0005,0.0003,0.0002, 5e-4])
clf1.fit(X_train, y)
lasso_preds = np.expm1(clf1.predict(X_test)) # exp(x) - 1 <---->log1p(x)==log(1+x)
score1 = rmse_cv(clf1)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score1.mean(), score1.std()))
y_test = np.expm1(clf1.predict(X_test))
result = pd.DataFrame({"id":(test.index+1461).astype(np.int32),'SalePrice':y_test.astype(np.int64)},columns=['id', 'SalePrice'])
result.to_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\LassoCV_predictions.csv', index=False)
#ELASTIC NET
clf2 = ElasticNet(alpha=0.0005, l1_ratio=0.9)
clf2.fit(X_train, y)
elas_preds = np.expm1(clf2.predict(X_test))
score2 = rmse_cv(clf2)
print("\nElasticNet score: {:.4f} ({:.4f})\n".format(score2.mean(), score2.std()))
y_test = np.expm1(clf2.predict(X_test)) # use the ElasticNet model here, not the Lasso model clf1
result = pd.DataFrame({"id":(test.index+1461).astype(np.int32),'SalePrice':y_test.astype(np.int64)},columns=['id', 'SalePrice'])
result.to_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\ElasticNet_predictions.csv', index=False)
#XGBOOST
clf3=xgb.XGBRegressor(colsample_bytree=0.4,
gamma=0.045,
learning_rate=0.07,
max_depth=20,
min_child_weight=1.5,
n_estimators=300,
reg_alpha=0.65,
reg_lambda=0.45,
subsample=0.95)
clf3.fit(X_train, y.values)
xgb_preds = np.expm1(clf3.predict(X_test))
score3 = rmse_cv(clf3)
print("\nxgb score: {:.4f} ({:.4f})\n".format(score3.mean(), score3.std()))
y_test = np.expm1(clf3.predict(X_test)) # use the XGBoost model here, not the Lasso model clf1
result = pd.DataFrame({"id":(test.index+1461).astype(np.int32),'SalePrice':y_test.astype(np.int64)},columns=['id', 'SalePrice'])
result.to_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\XGBOOST_predictions.csv', index=False)
# Model blending
final_result = 0.8*lasso_preds + 0.1*xgb_preds+0.1*elas_preds #0.11327
solution = pd.DataFrame({"id":test.index+1461, "SalePrice":final_result}, columns=['id', 'SalePrice'])
solution.to_csv(r'D:\学习\研一\数据挖掘\kaggle入门\练习赛\2017房价预测\result011621.csv', index = False) #
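Now some reflections after reading it.
The data analysis and preprocessing parts alone took me a whole day to read (where "a whole day" includes time spent on my phone). I am deeply impressed by how clever the expert is at data preprocessing. Compared with the meticulous preprocessing here, what I did before, scaling every numeric attribute and one-hot encoding every non-numeric attribute, was as good as doing nothing. No wonder my score did not even reach the baseline.
However, from my earlier experience with the Titanic problem, data preprocessing can easily be thankless work. When analyzing data by hand you always feel your reasoning is sound, yet the extracted or combined features often bring no performance gain, and sometimes a loss. The original author also says that his preprocessing draws on many people's kernels, and that combining the best of them is what produced good results. (Top 4%!!!! 4%!!!!! In an official competition that would be worth putting on a résumé!!!!!)
So equating data mining simply with feature engineering, or with calling packages and tuning parameters, is not rigorous. Feature engineering demands rich experience, sharp observation, fluency with analysis packages such as pandas, and the patience not to smash the computer. Calling packages and tuning parameters demands familiarity with the common machine learning algorithms and tools.
As I get older, I find that many sayings that read like platitudes actually carry real weight. Working behind closed doors for 10 years may not even produce a bicycle, so go and see how others do it.
A reading list for now:
This one is short, but it contains a parameter-tuning guide! I think those two figures are pure gold: https://blog.csdn.net/gh13uy2ql0N5/article/details/78293745
https://blog.csdn.net/jdbc/article/details/72468001
https://blog.csdn.net/u010094934/article/details/77689151
Data science is vast and there is a great deal to learn. Keep going. Finger hearts.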
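For comparison, the naive preprocessing I describe above can be sketched in a few lines. This is a reconstruction for illustration only: the rows are invented, and I am assuming z-score scaling for the numeric columns, which may not be exactly what my earlier attempt used.

```python
import pandas as pd

# Invented rows using a few Ames-style column names.
df = pd.DataFrame({
    "GrLivArea": [1000.0, 1500.0, 2000.0],
    "OverallQual": [5.0, 6.0, 8.0],
    "MSZoning": ["RL", "RM", "RL"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)

baseline = df.copy()
# Standardize every numeric column (zero mean, unit variance)...
baseline[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()
# ...and one-hot encode everything else.
baseline = pd.get_dummies(baseline, columns=list(cat_cols))
print(baseline)
```

Nothing in this version looks at what a column means, which is exactly the difference from the grouped filling, ordinal mapping, and hand-built features in the script above.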