本次分享的项目来自 Kaggle 的经典赛题:房价预测。分为数据分析和数据挖掘两部分介绍。本篇为数据分析篇。
赛题解读
比赛概述
影响房价的因素有很多,在本题的数据集中有 79 个变量几乎描述了爱荷华州艾姆斯 (Ames, Iowa) 住宅的方方面面,要求预测最终的房价。
技术栈
特征工程 (Creative feature engineering)
回归模型 (Advanced regression techniques like random forest and
gradient boosting)
最终目标
预测出每间房屋的价格,对于测试集中的每一个Id,给出变量SalePrice相应的值。
提交格式
Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.
数据分析
数据描述
首先我们导入数据并查看:
train_df = pd.read_csv('./input/train.csv', index_col=0)
test_df = pd.read_csv('./input/test.csv', index_col=0)
train_df.head()
我们可以看到有 80 列,也就是有 79 个特征。
接下来将训练集和测试集合并在一起,这么做是为了进行数据预处理的时候更加方便,让测试集和训练集的特征变换为相同的格式,等预处理进行完之后,再把他们分隔开。
我们知道SalePrice作为我们的训练目标,只出现在训练集中,不出现在测试集,因此我们需要把这一列拿出来再进行合并。在拿出这一列前,我们先来观察它,看看它长什么样子,也就是查看它的分布。
prices = DataFrame({'price': train_df['SalePrice'], 'log(price+1)': np.log1p(train_df['SalePrice'])})
prices.hist()
因为label本身并不平滑,为了我们分类器的学习更加准确,我们需要首先把label给平滑化(正态化)。我在这里使用的是log1p, 也就是 log(x+1)。要注意的是我们这一步把数据平滑化了,在最后算结果的时候,还要把预测到的平滑数据给变回去,那么log1p()的反函数就是expm1(),后面用到时再具体细说。
然后我们把这一列拿出来:
y_train = np.log1p(train_df.pop('SalePrice'))
y_train.head()
有
Id
1 12.247699
2 12.109016
3 12.317171
4 11.849405
5 12.429220
Name: SalePrice, dtype: float64
这时,y_train就是SalePrice那一列。
然后我们把两个数据集合并起来:
df = pd.concat((train_df, test_df), axis=0)
查看shape:
df.shape
(2919, 79)
df就是我们合并之后的DataFrame。
数据预处理
根据 kaggle 给出的说明,有以下特征及其说明:
SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale
接下来我们对特征进行分析。上述列出了一个目标变量SalePrice和 79 个特征,数量较多,这一步的特征分析是为了之后的特征工程做准备。
我们来查看哪些特征存在缺失值:
print(pd.isnull(df).sum())
这样并不方便观察,我们先查看缺失值最多的 10 个特征:
df.isnull().sum().sort_values(ascending=False).head(10)
为了更清楚的表示,我们用缺失率来考察缺失情况:
df_na = (df.isnull().sum() / len(df)) * 100
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'缺失率': df_na})
missing_data.head(10)
对其进行可视化:
f, ax = plt.subplots(figsize=(15,12))
plt.xticks(rotation='90')
sns.barplot(x=df_na.index, y=df_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
我们可以看到PoolQC、MiscFeature、Alley、Fence、FireplaceQu 等特征存在大量缺失,LotFrontage 有 16.7% 的缺失率,GarageType、GarageFinish、GarageQual 和 GarageCond等缺失率相近,这些特征有的是 category 数据,有的是 numerical 数据,对它们的缺失值如何处理,将在关于特征工程的部分给出。
最后,我们对每个特征进行相关性分析,查看热力图:
corrmat = train_df.corr()
plt.subplots(figsize=(15,12))
sns.heatmap(corrmat, vmax=0.9, square=True)
我们看到有些特征相关性大,容易造成过拟合现象,因此需要进行剔除。在下一篇的数据挖掘篇我们来对这些特征进行处理并训练模型。
不足之处,欢迎指正。