The day grows late and the road is long; what an age this is in the human world.
Once the general departs, the great tree sheds its leaves in the wind.
Overview
Having just worked through the California housing price prediction model, I was eager to practice, so I found a Beijing housing price dataset on Kaggle to try my hand at.
Experiment Workflow
Experiment Data
I chose a Beijing housing price dataset from Kaggle; it is an excerpt of Beijing listings from Lianjia.com covering 2011–2017.
Download and preview the data
Download and unzip the data
Preview the data
Each row represents one house, and each house has 26 attributes. A few of them deserve notes:
DOM: days on market
followers: number of people following the listing
totalPrice: total price of the house
price: price per square meter
floor: floor information; stored as Chinese text, so it needs care during processing
buildingType: building type, including tower, bungalow, duplex, and show home
renovationCondition: renovation condition, including other, unfinished, simple decoration, and refined decoration
buildingStructure: building structure, including unknown, mixed, brick and wood, brick and concrete, steel, and steel-concrete
ladderRatio: number of stairways/elevators per resident
fiveYearsProperty: property ownership (whether it has been held for five years)
district: district, a discrete attribute
Read and initially analyze the data
- Read the data
  Reading the data raised an error, which I suspected was an encoding issue. Checking the file encoding with `file new.csv` returned: `new.csv: ISO-8859 text, with CRLF line terminators`. Since the file was ISO-8859 encoded, I re-saved it as UTF-8, after which the data loaded successfully (a sketch of handling the encoding in Python appears after the code below).
- Inspect the data structure and summary
  Unlike the California data, this dataset contains a large amount of non-numeric data. There are 318,851 instances in total; DOM, buildingType, elevator, fiveYearsProperty, subway, and communityAverage have missing values. DOM has too many missing values, so that attribute can be dropped. url, id, and Cid have no influence on the housing price and can simply be ignored. My prediction target is the total price of a house, so the per-square-meter price can also be dropped.
- Check the basic statistics of the data
- Plot frequency histograms of the data
  The dataset turns out to contain many discrete attributes; the continuous ones are DOM, Lat, Lng, communityAverage, followers, and square.

```python
import pandas as pd
import matplotlib.pyplot as plt


def load_housing_data(file_path):
    return pd.read_csv(file_path, sep=',', low_memory=False)


def check_attributes(housing):
    # Print the value counts of every attribute to spot discrete or dirty columns.
    attributes = list(housing)
    for attr in attributes:
        print(housing[attr].value_counts())


if __name__ == '__main__':
    housing = load_housing_data('new.csv')
    housing = housing.drop(['url', 'id', 'price'], axis=1)
    check_attributes(housing)
    housing.describe()
    housing.hist(bins=50, figsize=(20, 15))
    plt.savefig('housing_distribution.png')
```
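Re-saving the file from an editor works, but the encoding can also be handled directly in Python; a minimal sketch (ISO-8859-1 is what `file` reported, and GBK is another plausible encoding for Chinese text, so treat both as assumptions):

```python
import pandas as pd

# Option 1: let pandas decode the original file directly.
housing = pd.read_csv('new.csv', encoding='ISO-8859-1', low_memory=False)

# Option 2: convert the file to UTF-8 once, then read it normally afterwards.
with open('new.csv', encoding='ISO-8859-1') as src, \
        open('new_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(src.read())
```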
Create a test set
I set aside 20% of the dataset as the test set. Since the data has a district attribute, it conveniently serves as the basis for stratified sampling. After the split, I check whether the test set's distribution is consistent with the original data.
```python
# split the train and test set
from sklearn.model_selection import StratifiedShuffleSplit

spliter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in spliter.split(housing, housing['district']):
    train_set = housing.loc[train_index]
    test_set = housing.loc[test_index]

test_set.hist(bins=50, figsize=(20, 15))
plt.savefig('test.png')
```
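The histograms give a visual check; a more direct way to verify the stratified split is to compare the district proportions in the test set against the full dataset (a minimal sketch):

```python
# Compare the proportion of each district in the full data and in the test set.
overall = housing['district'].value_counts(normalize=True).sort_index()
in_test = test_set['district'].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({'overall': overall, 'test set': in_test})
comparison['rel. error (%)'] = 100 * (comparison['test set'] - comparison['overall']) / comparison['overall']
print(comparison)
```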
Data Exploration and Visualization
First set the test set aside and explore the training set.
- Visualize the geographic data
  Adjust the alpha parameter to observe the density of the instances.
  I have to say, the capital's housing market is something else: the transaction volume in every district is huge.
- Visualize districts and prices
  In the figure, each circle's radius represents the price and the color represents the district, which gives a basic picture of where the listings are concentrated.
  Prices are roughly level across the capital's districts, mostly concentrated below 25 million RMB, and extremely expensive houses are rare.

```python
# explore the data
housing = train_set.copy()

housing.plot(kind='scatter', x='Lat', y='Lng')
plt.savefig('gregrophy.png')

housing.plot(kind='scatter', x='Lat', y='Lng', alpha=0.1)
plt.savefig('gregrophy_more.png')

fig = plt.scatter(x=housing['Lat'], y=housing['Lng'], alpha=0.4,
                  s=housing['totalPrice'] / 100, label='Price',
                  c=housing['district'], cmap=plt.get_cmap('jet'))
plt.colorbar(fig)
plt.legend()
plt.savefig('gregrophy_district_value.png')

fig = plt.scatter(x=housing['Lat'], y=housing['Lng'], alpha=0.4,
                  c=housing['totalPrice'], cmap=plt.get_cmap('jet'))
plt.colorbar(fig)
plt.savefig('gregrophy_price_value.png')
```
- Plot price over time
  The capital's housing prices started soaring around 2010, but show a downward trend in 2018.
  The box plot covers statistics from 2002 to 2018. There are not too many outliers, and the boxes are fairly compressed, indicating that within each month the sale prices stay within a narrow range (around 5 million RMB).

```python
price_by_trade_time = pd.DataFrame()
price_by_trade_time['totalPrice'] = housing['totalPrice']
price_by_trade_time.index = housing['tradeTime'].astype('datetime64[ns]')

# Monthly mean price as a line chart.
price_by_trade_month = price_by_trade_time.resample('M').mean().to_period('M').fillna(0)
price_by_trade_month.plot(kind='line')

# Box plot of the price distribution per trading month.
price_stat_trade_month_index = [x.strftime('%Y-%m')
                                for x in set(price_by_trade_time.to_period('M').index)]
price_stat_trade_month_index.sort()
price_stat_trade_month = []
for month in price_stat_trade_month_index:
    price_stat_trade_month.append(price_by_trade_time[month]['totalPrice'].values)
price_stat_trade_month = pd.DataFrame(price_stat_trade_month)
price_stat_trade_month.index = price_stat_trade_month_index
price_stat_trade_month = price_stat_trade_month.T
price_stat_trade_month.boxplot(figsize=(15, 10))
plt.xticks(rotation=90, fontsize=7)
plt.savefig('price_stat_trade_time.png')
```
- Explore the relationship between building age and price
  A look at the constructionTime value counts shows noise: "未知" (unknown) appears 15,475 times, "0" appears 14 times, and "1" appears 12 times. I chose to drop these records and then plotted a line chart of mean price against building age.
  Century-old houses really are something special! But they turn out to be isolated cases; building ages are concentrated around 0–65 years, so I zoomed in for a closer look.
  Most properties still sit around 5 million RMB, but half-century-old houses somehow sell for as much as new ones, which is hard to understand. Still, contrary to the rumor that Beijing housing prices are all in the tens of millions, there is hope of staying in Beijing!

```python
# price and construction-time correlations
import numpy as np

price_by_cons_time = pd.DataFrame()
price_by_cons_time['totalPrice'] = housing['totalPrice']
price_by_cons_time['constructionTime'] = housing['constructionTime']

# Drop the noisy construction years ('0', '1', '未知') and convert to building age.
price_by_cons_time = price_by_cons_time[
    (price_by_cons_time.constructionTime != '0') &
    (price_by_cons_time.constructionTime != '1') &
    (price_by_cons_time.constructionTime != '未知')
]
price_by_cons_time['constructionTime'] = price_by_cons_time['constructionTime'].astype('int64')
price_by_cons_time['constructionTime'] = 2018 - price_by_cons_time['constructionTime']

price_by_cons_time_index = list(set(price_by_cons_time['constructionTime']))
price_by_cons_time_index.sort()
price_by_cons_time.index = price_by_cons_time['constructionTime']
price_by_cons_time = price_by_cons_time.drop('constructionTime', axis=1)

price_by_cons_time_line = []
price_by_cons_time_stat = []
for years in price_by_cons_time_index:
    price_by_cons_time_line.append(price_by_cons_time.loc[years]['totalPrice'].mean())
    try:
        price_by_cons_time_stat.append(price_by_cons_time.loc[years]['totalPrice'].values)
    except Exception:
        # An age bucket with a single record comes back as a scalar, not a Series.
        price_by_cons_time_stat.append(np.array([price_by_cons_time.loc[years]['totalPrice']]))

plt.plot(list(price_by_cons_time_index), price_by_cons_time_line)
plt.savefig('price_cons_line.png')

price_by_cons_time_stat = pd.DataFrame(price_by_cons_time_stat)
price_by_cons_time_stat.index = price_by_cons_time_index
price_by_cons_time_stat = price_by_cons_time_stat.T
price_by_cons_time_stat.boxplot(figsize=(20, 15))
plt.ylim(0, 2500)
plt.savefig('price_stat_cons_time.png')
```
- Explore the relationship between price and floor area
  Luxury homes above 1,000 m² show a price surge, 600–900 m² is another rising range, 0–400 m² presumably belongs to the essential-demand segment, and 400–600 m² is basically stable in price. This could simply be a sample-size issue, so I decided to look at the overall picture.
  The floor areas turn out to be very concentrated, so I narrowed the range and looked again: most successfully traded properties in Beijing are 100 m² or smaller.
  Looking at area versus price, broadly speaking, the larger the area, the higher the price. Zoom in on the axes for a closer look.

```python
# square footage and price
price_by_square = pd.DataFrame()
price_by_square['totalPrice'] = housing['totalPrice']
price_by_square['square'] = housing['square']

# Bucket the floor area into 10 m2 bins.
price_by_square['square'] = np.ceil(price_by_square['square'])
price_by_square['square'] = price_by_square['square'] - (price_by_square['square'] % 10)

price_by_square_index = list(set(price_by_square['square']))
price_by_square_index.sort()
price_by_square.index = price_by_square['square']

price_by_square_line = []
price_by_square_stat = []
for squares in price_by_square_index:
    price_by_square_line.append(price_by_square.loc[squares]['totalPrice'].mean())
    try:
        price_by_square_stat.append(price_by_square.loc[squares]['totalPrice'].values)
    except Exception:
        # A bin with a single record comes back as a scalar, not a Series.
        price_by_square_stat.append(np.array([price_by_square.loc[squares]['totalPrice']]))

plt.plot(price_by_square_index, price_by_square_line)
plt.savefig('price_square_mean.png')

price_by_square['square'].hist(bins=50, figsize=(20, 15))
plt.savefig('price_square.png')

price_by_square_stat = pd.DataFrame(price_by_square_stat).T
price_by_square_index = [int(x) for x in price_by_square_index]
price_by_square_stat.columns = price_by_square_index
price_by_square_stat.boxplot(figsize=(20, 15))
plt.xticks(rotation=90)
plt.ylim(0, 5000)
plt.savefig('price_stat_square_time.png')
```
- Explore the relationship between time, floor area, and price
  Most Beijing properties traded on the market are concentrated roughly in the 0–25 million RMB and 0–500 m² ranges. Zooming in on the axes, and then zooming in again, 2017 prices are far out in front, while 2011 seems to have been the best time to buy a home in Beijing.

```python
# price vs. trade time and square footage
import itertools
import matplotlib as mpl


def get_mean(price_by_square):
    """Mean total price per 10 m2 area bucket for the records of one year."""
    try:
        price_by_square_index = list(set(price_by_square['square']))
        price_by_square_index.sort()
        price_by_square_line = []
        price_by_square.index = price_by_square['square']
        for squares in price_by_square_index:
            price_by_square_line.append(price_by_square.loc[squares]['totalPrice'].mean())
        price_by_square_index = [int(x) for x in price_by_square_index]
    except Exception:
        # A year with a single record comes back as a Series, not a DataFrame.
        price_by_square_line = [price_by_square.loc['totalPrice']]
        price_by_square_index = [int(price_by_square['square'])]
    return price_by_square_line, price_by_square_index


price = pd.DataFrame()
price['totalPrice'] = housing['totalPrice']
price['square'] = housing['square']
price.index = housing['tradeTime'].astype('datetime64[ns]')
price['square'] = np.ceil(price['square'])
price['square'] = price['square'] - (price['square'] % 10)
price = price.to_period('Y')

price_time_index = [x.strftime('%Y') for x in set(price.index)]
price_time_index.sort()

# One line per year, cycling through marker/colour combinations.
colormap = mpl.cm.Dark2.colors
m_styles = ['', '.', 'o', '^', '*']
for year, (marker, color) in zip(price_time_index, itertools.product(m_styles, colormap)):
    y, x = get_mean(price.loc[year])
    plt.plot(x, y, color=color, marker=marker, label=year)
plt.xticks(rotation=90)
plt.xlim(0, 750)
plt.ylim(0, 5000)
plt.legend(price_time_index)
plt.savefig('price_by_time_square.png')
```
- Check for dirty data
  livingRoom contains "#NAME?" values; consider deleting those records.
  drawingRoom mixes Chinese text with numeric values; the Chinese entries are few, so consider deleting them.
  bathRoom has obvious errors; consider deleting the erroneous records.
  The floor attribute is very messy and needs special handling.
  buildingType also contains errors.
After inspection, the attributes that need processing are:
constructionTime
buildingType
floor
bathRoom
drawingRoom
livingRoom
The continuous attributes are:
communityAverage
ladderRatio
constructionTime
square
followers
Lat
Lng
The discrete attributes are:
district
subway
fiveYearsProperty
elevator
buildingStructure
renovationCondition
buildingType
floor
bathRoom
kitchen
drawingRoom
livingRoom
Among the discrete attributes, the binary 0/1 ones do not need one-hot encoding. tradeTime is not an attribute of the house itself, so it is dropped.
Data Preparation
- Clean the data
  - The data contains too many dirty records, so clean it from scratch
  - Remove the unneeded attributes
  - Convert constructionTime into a continuous building-age attribute (using 2018 as the baseline)
  - Remove the dirty records in buildingType
  - Remove the dirty records in livingRoom, drawingRoom, and bathRoom, and convert them to numeric types
  - The floor attribute is too messy, so I decided to drop it
```python
from sklearn.base import BaseEstimator, TransformerMixin


class DataNumCleaner(BaseEstimator, TransformerMixin):
    """Drop the dirty records identified during data exploration."""

    def __init__(self, clean=True):
        self.clean = clean

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        if self.clean:
            # Drop noisy construction years and convert to building age.
            X = X[(X.constructionTime != '0') & (X.constructionTime != '1') & (X.constructionTime != '未知')]
            X['constructionTime'] = 2018 - X['constructionTime'].astype('int64')
            # Keep only the four valid building types.
            X = X[(X.buildingType == 1) | (X.buildingType == 2) | (X.buildingType == 3) | (X.buildingType == 4)]
            # Drop dirty room counts and convert them to numbers.
            X = X[X.livingRoom != '#NAME?']
            X = X[(X.drawingRoom == '0') | (X.drawingRoom == '1') | (X.drawingRoom == '2') | (X.drawingRoom == '3') | (X.drawingRoom == '4') | (X.drawingRoom == '5')]
            X = X[(X.bathRoom == '0') | (X.bathRoom == '1') | (X.bathRoom == '2') | (X.bathRoom == '3') | (X.bathRoom == '4') | (X.bathRoom == '5') | (X.bathRoom == '6') | (X.bathRoom == '7')]
            X.bathRoom = X.bathRoom.astype('float64')
            X.drawingRoom = X.drawingRoom.astype('float64')
            X.livingRoom = X.livingRoom.astype('float64')
            return X
        else:
            return X
```
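As a quick sanity check, the cleaner can also be run on its own before wiring it into the pipeline (a minimal sketch; the resulting row count depends on the data):

```python
# Apply the cleaner directly and compare row counts before and after.
cleaner = DataNumCleaner()
housing_clean = cleaner.transform(train_set.copy())
print(len(train_set), '->', len(housing_clean))
print(housing_clean[['constructionTime', 'livingRoom', 'drawingRoom', 'bathRoom']].dtypes)
```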
- The cleaning results look fairly good
- Fill missing values with the mode
- Convert buildingType, renovationCondition, buildingStructure, and district to one-hot encoding
- Build the data preparation pipeline

```python
num_pipeline = Pipeline([
    ('cleaner', DataNumCleaner()),
    ('selector', DataFrameSelector(num_attributes)),
    ('imputer', Imputer(strategy='most_frequent')),
    ('std_scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('cleaner', DataNumCleaner()),
    ('selector', DataFrameSelector(cat_attributes)),
    ('encoder', OneHotEncoder())
])

label_pipeline = Pipeline([
    ('cleaner', DataNumCleaner()),
    ('selector', DataFrameSelector(['totalPrice']))
])

full_pipeline = FeatureUnion([
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])
```
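The pipeline references a DataFrameSelector transformer and the num_attributes / cat_attributes lists, which are not shown; a minimal sketch of what they might look like (the exact attribute lists and the old-style scikit-learn imports are my assumptions):

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import Imputer, StandardScaler, OneHotEncoder  # older scikit-learn API


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.attribute_names].values


# Assumed attribute lists, following the continuous/discrete split above.
num_attributes = ['communityAverage', 'ladderRatio', 'constructionTime',
                  'square', 'followers', 'Lat', 'Lng']
cat_attributes = ['district', 'buildingStructure', 'renovationCondition', 'buildingType']
```

Assuming this setup, `housing_prepared = full_pipeline.fit_transform(train_set)` and `housing_label = label_pipeline.fit_transform(train_set)` would produce the arrays used in the training code below.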
Model Training
- Linear regression model
  The results are satisfactory (a sketch of the baseline training code appears at the end of this section).
- Decision tree
  The results are also acceptable, but training takes too long, so I considered dropping some irrelevant features.
- Check the correlations between the features
  The attributes most correlated with price are still the floor area and the community average price. But since the goal is to predict the price of a single house, the chosen features should ideally be attributes of the house itself, so I decided to drop followers and communityAverage.
  With the reduced feature set, the linear regression model's performance is still acceptable.
- Linear SVR
- Hyperparameter tuning
  Since my computer's compute power is really lacking, I can only practice with linear models for now. Grid search gives the best parameters for the linear SVR; checking the RMSE of each run, the results are acceptable.

```python
# improve the linear SVR model
from sklearn.svm import LinearSVR
from sklearn.model_selection import GridSearchCV

lin_svm_reg = LinearSVR()  # base model to tune (its definition was not shown in the original)

param_grid = [
    {'C': [0.5, 1, 2],
     'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive']}
]
grid_search = GridSearchCV(lin_svm_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_label)
grid_search.best_params_

# RMSE of each parameter combination
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(np.sqrt(-mean_score), params)

# final model
final_model = grid_search.best_estimator_
```
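The baseline training code for the linear regression and decision tree models did not make it into the write-up; a minimal sketch of how they might be fit and scored, together with the correlation check (housing_prepared and housing_label are assumed to come from the pipeline above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

housing_label = np.ravel(housing_label)  # make sure the labels are a 1-d array

lin_reg = LinearRegression().fit(housing_prepared, housing_label)
tree_reg = DecisionTreeRegressor(random_state=42).fit(housing_prepared, housing_label)

# Cross-validated RMSE for both baselines.
for name, model in (('linear regression', lin_reg), ('decision tree', tree_reg)):
    scores = cross_val_score(model, housing_prepared, housing_label,
                             scoring='neg_mean_squared_error', cv=5)
    print(name, 'RMSE:', np.sqrt(-scores).mean())

# Correlation of each numeric attribute with the total price.
corr_matrix = train_set.select_dtypes(include=[np.number]).corr()
print(corr_matrix['totalPrice'].sort_values(ascending=False))
```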
Model Validation
- Validate with the test set
  The performance is about the same as on the training set, which is acceptable.
- Randomly pick 100 records from the test set, predict them, and inspect the result
  The predictions almost coincide with the labels, so the model is usable.

```python
from random import randint

test_index = [randint(0, len(y_test) - 1) for i in range(100)]
y_label = [y_test[index] for index in test_index]
y_predict = [final_model.predict(X_test_prepared[index]) for index in test_index]

x = [i + 1 for i in range(100)]
plt.plot(x, y_label, c='red', label='label')
plt.plot(x, y_predict, c='blue', label='predict')
plt.legend()
plt.savefig('result.png')
```
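How X_test_prepared and y_test were produced is not shown; a minimal sketch that runs the test set through the same pipelines and computes the test RMSE (variable names follow the code above):

```python
from sklearn.metrics import mean_squared_error

# The test set must go through exactly the same cleaning/feature pipeline as the training set.
X_test_prepared = full_pipeline.transform(test_set)
y_test = np.ravel(label_pipeline.transform(test_set))

final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print('test RMSE:', final_rmse)
```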
- Export the model

```python
from sklearn.externals import joblib  # on newer scikit-learn: import joblib

joblib.dump(final_model, 'BeijingHousingPricePredicter.pkl')
```
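Loading the model back for later use is symmetric (a minimal sketch; new listings would still need to go through full_pipeline first):

```python
from sklearn.externals import joblib  # on newer scikit-learn: import joblib

predictor = joblib.load('BeijingHousingPricePredicter.pkl')
# new_listings_prepared = full_pipeline.transform(new_listings)
# predictor.predict(new_listings_prepared)
```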
Summary
- Beijing housing prices really are high!
- The properties that actually change hands in Beijing are mostly around 5 million RMB, around 100 m², with a building age of 0–40 years. Larger properties are listed but rarely sold.
- The best time to buy a home in Beijing was around 2011.
- Around 2017 a sky-high property actually sold for 175 million RMB; I wonder what kind of immortals the buyer and seller were.
- Data cleaning matters a lot; you can write your own transformers and add them to the Pipeline.
- Some features can be dropped based on human judgment, but feature engineering is very important!!
- Machine learning calls for a computer with decent compute power, orz.
- Full code