Beijing Housing Price Prediction: Kaggle Data

The day grows late and the road stretches far; what an age this is.

Once the general departs, the great tree is left to wither.

Overview

Having worked through the California housing price prediction model, I was itching to practice, so I found a Beijing housing price dataset on Kaggle to try my hand at.


Experiment Workflow

Experimental Data

From Kaggle I chose a dataset of Beijing housing prices, which excerpts listings from Lianjia (链家) covering 2011 to 2017.

image

Download and Preview the Data

Download and extract the data

image

Preview the data
image

Each row represents one house, and each house has 26 related attributes; the following deserve notes:
DOM: days on market
followers: number of people following the listing
totalPrice: total price of the house
price: price per square meter
floor: floor information; the values contain Chinese text and need special handling
buildingType: building type, including tower, bungalow, duplex and model home
renovationCondition: renovation condition, including other, rough (unfinished), simple and refined decoration
buildingStructure: building structure, including unknown, mixed, brick-and-wood, brick-and-concrete, steel and steel-concrete
ladderRatio: number of staircases per resident
fiveYearsProperty: property ownership (five-year) flag
district: district, categorical

Read and Initially Analyze the Data

  1. Read the data

    image

    Reading the data raised an error. Suspecting an encoding problem, I checked the file encoding:

    file new.csv  
    new.csv: ISO-8859 text, with CRLF line terminators
    

    The file is ISO-8859 encoded, so I re-saved it as UTF-8, after which the data loaded successfully.
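    As an alternative to re-saving by hand, the re-encoding can be done in Python. This is only a sketch; the 'gbk' source encoding is an assumption (the `file` tool often reports GBK/GB2312 text as ISO-8859):

    # Hypothetical re-encoding step; adjust the source encoding if needed
    with open('new.csv', 'rb') as src:
        raw = src.read()
    with open('new_utf8.csv', 'w', encoding='utf-8') as dst:
        dst.write(raw.decode('gbk', errors='replace'))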

  2. Inspect the data structure and summary

    image

    Unlike the California dataset, this one contains a large amount of non-numeric data. There are 318,851 instances in total, with missing values in DOM, buildingType, elevator, fiveYearsProperty, subway and communityAverage. DOM has too many missing values, so it can be dropped. url, id and Cid have no bearing on the price and can be ignored outright. Since my prediction target is the total price, the per-square-meter price can also be removed.
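    The structure check above presumably relied on calls like the following (a sketch; the exact commands are not shown in the post):

    # Overview of dtypes and non-null counts, plus an explicit missing-value count
    housing.info()
    print(housing.isnull().sum().sort_values(ascending=False))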

  3. Examine the overall data distribution

    image

    Plot frequency histograms of the attributes
    image

    This dataset contains many discrete attributes; the continuous ones are DOM, Lat, Lng, communityAverage, followers and square.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    
    def load_housing_data(file_path):
        return pd.read_csv(file_path, sep=',', low_memory=False)
    
    def check_attributes(housing):
        # print the value counts of every column to spot suspicious values
        attributes = list(housing)
        for attr in attributes:
            print(housing[attr].value_counts())
    
    if __name__ == '__main__':
        housing = load_housing_data('new.csv')
        # url, id and price (per square meter) are not used as features
        housing = housing.drop(['url','id','price'], axis=1)
        check_attributes(housing)
        print(housing.describe())
        housing.hist(bins=50, figsize=(20,15))
        plt.savefig('housing_distribution.png')
    

Create a Test Set

20% of the dataset is held out as the test set. Since the district attribute is available, it is a natural basis for stratified sampling. After the split, check that the test set's distribution matches the original data.

image

#split the train and test set
from sklearn.model_selection import StratifiedShuffleSplit

spliter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in spliter.split(housing, housing['district']):
    train_set = housing.loc[train_index]
    test_set = housing.loc[test_index]
test_set.hist(bins=50, figsize=(20,15))
plt.savefig('test.png')

Data Exploration and Visualization

First set the test set aside and explore the training set.

  1. Visualize the geographic data

    image

    image

    Adjust the alpha parameter to see how densely the instances are distributed
    image

    Beijing's housing market really is something: transaction volume is huge in every district.

  2. Visualize district and price information

    image

    In this plot each circle's radius represents the price and its color the district, giving a basic picture of where the listings are concentrated.
    image

    Prices turn out to be roughly level across districts, mostly concentrated below 25 million CNY, with very few exceptionally expensive houses.

    #explore the data
    housing = train_set.copy()
    housing.plot(kind='scatter', x='Lat', y='Lng')
    plt.savefig('gregrophy.png')
    
    housing.plot(kind='scatter', x='Lat', y='Lng', alpha=0.1)
    plt.savefig('gregrophy_more.png')
    
    fig = plt.scatter(x=housing['Lat'], y=housing['Lng'], alpha=0.4, \
        s=housing['totalPrice']/100, label='Price', \
        c=housing['district'], cmap=plt.get_cmap('jet'))
    plt.colorbar(fig)
    plt.legend()
    plt.savefig('gregrophy_district_value.png')
    
    fig = plt.scatter(x=housing['Lat'], y=housing['Lng'], alpha=0.4, \
        c=housing['totalPrice'], cmap=plt.get_cmap('jet'))
    plt.colorbar(fig)
    plt.savefig('gregrophy_price_value.png')
    
  3. Plot price over time

    image

    Beijing prices started to soar around 2010, with signs of a decline in 2018.
    image

    The box plot covers 2002 to 2018. There are not many outliers and the boxes are fairly compressed, meaning that within any given month the sale prices stayed within a narrow range (around 5 million CNY).

    price_by_trade_time = pd.DataFrame()
    price_by_trade_time['totalPrice'] = housing['totalPrice']
    price_by_trade_time.index = housing['tradeTime'].astype('datetime64[ns]')
    price_by_trade_month = price_by_trade_time.resample('M').mean().to_period('M').fillna(0)
    price_by_trade_month.plot(kind='line')
    
    price_stat_trade_month_index = [x.strftime('%Y-%m') for x in set(price_by_trade_time.to_period('M').index)]
    price_stat_trade_month_index.sort()
    price_stat_trade_month = []
    for month in price_stat_trade_month_index:
        price_stat_trade_month.append(price_by_trade_time[month]['totalPrice'].values)
    price_stat_trade_month = pd.DataFrame(price_stat_trade_month)
    price_stat_trade_month.index = price_stat_trade_month_index
    price_stat_trade_month = price_stat_trade_month.T
    price_stat_trade_month.boxplot(figsize=(15,10))
    plt.xticks(rotation=90,fontsize=7)
    plt.savefig('price_stat_trade_time.png')
    
  4. Explore the relationship between building age and price
    First check the constructionTime data:

    未知      15475
    0          14
    1          12
    

    There is clearly some noise here (未知 means "unknown"), which I chose to drop; I then plotted mean price against building age.

    image

    Century-old houses really are something special!
    image

    The century-old houses turn out to be isolated cases; building ages cluster around 0 to 65 years. Zoom in for a closer look:
    image

    Most properties still sit around 5 million CNY, though it is hard to understand how half-century-old houses sell for as much as new ones. Still, contrary to the rumor that every Beijing home costs tens of millions, there is hope of staying in Beijing after all!!!

    #price and constraction correlations
    price_by_cons_time = pd.DataFrame()
    price_by_cons_time['totalPrice'] = housing['totalPrice']
    price_by_cons_time['constructionTime'] = housing['constructionTime']
    price_by_cons_time = price_by_cons_time[
        (price_by_cons_time.constructionTime != '0')
        & (price_by_cons_time.constructionTime != '1')
        & (price_by_cons_time.constructionTime != '未知')
    ]
    price_by_cons_time['constructionTime'] = price_by_cons_time['constructionTime'].astype('int64')
    price_by_cons_time['constructionTime'] = 2018 - price_by_cons_time['constructionTime']
    price_by_cons_time_index = list(set(price_by_cons_time['constructionTime']))
    price_by_cons_time_index.sort()
    price_by_cons_time.index = price_by_cons_time['constructionTime']
    price_by_cons_time = price_by_cons_time.drop('constructionTime', axis=1)
    price_by_cons_time_line = []
    price_by_cons_time_stat = []
    for years in price_by_cons_time_index:
        price_by_cons_time_line.append(price_by_cons_time.loc[years]['totalPrice'].mean())
        try:
            price_by_cons_time_stat.append(price_by_cons_time.loc[years]['totalPrice'].values)
        except Exception:
            price_by_cons_time_stat.append(np.array([price_by_cons_time.loc[years]['totalPrice']]))
    plt.plot(list(price_by_cons_time_index), price_by_cons_time_line)
    plt.savefig('price_cons_line.png')
    price_by_cons_time_stat = pd.DataFrame(price_by_cons_time_stat)
    price_by_cons_time_stat.index = price_by_cons_time_index
    price_by_cons_time_stat = price_by_cons_time_stat.T
    price_by_cons_time_stat.boxplot(figsize=(20,15))
    plt.ylim(0,2500)
    plt.savefig('price_stat_cons_time.png')
    
  5. Explore the relationship between price and floor area

    image

    Above 1,000 m² the luxury-home prices shoot up, 600 to 900 m² is another rising range, 0 to 400 m² looks like the essential-demand segment, and 400 to 600 m² is roughly flat, though that may just be a sample-size effect, so I decided to look at the overall picture.
    image

    The areas are highly concentrated; narrow the range and look again:
    image

    Most of the properties that actually sell in Beijing are 100 m² or smaller.
    Now look at area versus price:
    image

    Broadly speaking, the larger the area, the higher the price.
    Zoom in on the axes for a closer look:
    image

    #square and price
    price_by_square = pd.DataFrame()
    price_by_square['totalPrice'] = housing['totalPrice']
    price_by_square['square'] = housing['square']
    price_by_square['square'] = np.ceil(price_by_square['square'])
    price_by_square['square'] = price_by_square['square'] - (price_by_square['square'] % 10)
    price_by_square_index = list(set(price_by_square['square']))
    price_by_square_index.sort()
    price_by_square.index = price_by_square['square']
    price_by_square_line = []
    price_by_square_stat = []
    for squares in price_by_square_index:
        price_by_square_line.append(price_by_square.loc[squares]['totalPrice'].mean())
        try:
            price_by_square_stat.append(price_by_square.loc[squares]['totalPrice'].values)
        except Exception:
            price_by_square_stat.append(np.array([price_by_square.loc[squares]['totalPrice']]))
    plt.plot(price_by_square_index, price_by_square_line)
    plt.savefig('price_square_mean.png')
    price_by_square['square'].hist(bins=50, figsize=(20,15))
    plt.savefig('price_square.png')
    price_by_square_stat = pd.DataFrame(price_by_square_stat).T
    price_by_square_index = [int(x) for x in price_by_square_index]
    price_by_square_stat.columns = price_by_square_index
    price_by_square_stat.boxplot(figsize=(20,15))
    plt.xticks(rotation=90)
    plt.ylim(0,5000)
    plt.savefig('price_stat_square_time.png')
    
  6. Explore price versus time and floor area

    image

    Most traded Beijing properties fall roughly within 0–25 million CNY and 0–500 m².
    Zoom in on the axes:
    image

    Zoom in further:
    image

    Prices in 2017 are far ahead of everything else, while 2011 looks like it was the best time to buy a home in Beijing.

    #price and time,square correlations (get_mean is defined after this block)
    import itertools
    import matplotlib as mpl
    
    price = pd.DataFrame()
    price['totalPrice'] = housing['totalPrice']
    price['square'] = housing['square']
    price.index = housing['tradeTime'].astype('datetime64[ns]')
    price['square'] = np.ceil(price['square'])
    price['square'] = price['square'] - (price['square'] % 10)
    price = price.to_period('Y')
    price_time_index = [x.strftime('%Y') for x in set(price.index)]
    price_time_index.sort()
    colormap = mpl.cm.Dark2.colors
    m_styles = ['','.','o','^','*']
    for year, (maker, color) in zip(price_time_index, itertools.product(m_styles, colormap)):
        y, x = get_mean(price.loc[year])
        plt.plot(x, y, color=color, marker=maker, label=year)
    plt.xticks(rotation=90)
    plt.xlim(0,750)
    plt.ylim(0,5000)
    plt.legend(price_time_index)
    plt.savefig('price_by_time_square.png')
    
    
    def get_mean(price_by_square):
        # return (mean price per area bin, bin labels) for one year's trades
        try:
            price_by_square_index = list(set(price_by_square['square']))
            price_by_square_index.sort()
            price_by_square_line = []
            price_by_square.index = price_by_square['square']
            for squares in price_by_square_index:
                price_by_square_line.append(price_by_square.loc[squares]['totalPrice'].mean())
            price_by_square_index = [int(x) for x in price_by_square_index]
        except Exception:
            # a single-row slice degenerates to a Series; fall back to scalar handling
            price_by_square_line = [price_by_square.loc['totalPrice']]
            price_by_square_index = [int(price_by_square['square'])]
        return price_by_square_line, price_by_square_index
    
    
  7. Check for dirty data
    image

    livingRoom contains '#NAME?' entries; consider dropping those records
    image

    drawingRoom mixes Chinese text with numeric values; the Chinese entries are few, so consider dropping them
    image

    bathRoom contains obvious errors; consider dropping the bad records
    The floor attribute is very messy and needs special handling
    image

    buildingType also contains errors
    After inspection, the attributes that need handling are:
    constructionTime
    buildingType
    floor
    bathRoom
    drawingRoom
    livingRoom
    The continuous attributes are:
    communityAverage
    ladderRatio
    constructionTime
    square
    followers
    Lat
    Lng
    The discrete attributes are:
    district
    subway
    fiveYearsProperty
    elevator
    buildingStructure
    renovationCondition
    buildingType
    floor
    bathRoom
    kitchen
    drawingRoom
    livingRoom
    The binary (0/1) discrete attributes do not need one-hot encoding; tradeTime is not a property of the house itself, so it is dropped.
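    A quick way to surface these mixed-type columns (a sketch; the column list simply mirrors the findings above) is to coerce each suspect column to numeric and count the failures:

    import pandas as pd

    # Count entries that fail numeric conversion in each suspect column
    suspect_cols = ['livingRoom', 'drawingRoom', 'bathRoom', 'constructionTime', 'buildingType']
    for col in suspect_cols:
        coerced = pd.to_numeric(housing[col], errors='coerce')
        n_bad = coerced.isna().sum() - housing[col].isna().sum()
        print(col, 'non-numeric entries:', n_bad)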

Data Preparation

  1. Clean the data

    • The data contain too many dirty records, so clean them from scratch
    • Remove the attributes that are not needed
    • Convert constructionTime into a continuous building-age attribute (using 2018 as the reference year)
    • Remove the dirty records in buildingType
    • Remove the dirty records in livingRoom, drawingRoom and bathRoom, and convert them to numeric types
    • The floor attribute is too messy, so I decided to drop it

    from sklearn.base import BaseEstimator, TransformerMixin

    class DataNumCleaner(BaseEstimator, TransformerMixin):
        """Drop dirty records and convert mixed-type columns to numeric values."""
        def __init__(self, clean=True):
            self.clean = clean
        def fit(self, X, y=None):
            return self
        def transform(self, X, y=None):
            if self.clean:
                # keep only rows with a valid construction year, then convert it to a building age
                X = X[(X.constructionTime != '0') & (X.constructionTime != '1') & (X.constructionTime != '未知')]
                X['constructionTime'] = 2018 - X['constructionTime'].astype('int64')
                # keep only the four valid buildingType codes
                X = X[(X.buildingType == 1) | (X.buildingType == 2) | (X.buildingType == 3) | (X.buildingType == 4)]
                # drop corrupted room counts and cast the remaining values to numbers
                X = X[X.livingRoom != '#NAME?']
                X = X[(X.drawingRoom == '0') | (X.drawingRoom == '1') | (X.drawingRoom == '2') | (X.drawingRoom == '3') | (X.drawingRoom == '4') | (X.drawingRoom == '5')]
                X = X[(X.bathRoom == '0') | (X.bathRoom == '1') | (X.bathRoom == '2') | (X.bathRoom == '3') | (X.bathRoom == '4') | (X.bathRoom == '5') | (X.bathRoom == '6') | (X.bathRoom == '7')]
                X['bathRoom'] = X['bathRoom'].astype('float64')
                X['drawingRoom'] = X['drawingRoom'].astype('float64')
                X['livingRoom'] = X['livingRoom'].astype('float64')
                return X
            else:
                return X

  2. The cleaning results look reasonably good


    image
  3. Fill missing values with the mode

  4. One-hot encode buildingType, renovationCondition, buildingStructure and district

  5. Build the data-cleaning pipeline

    # Imputer comes from older scikit-learn; newer releases use sklearn.impute.SimpleImputer instead
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.preprocessing import Imputer, StandardScaler, OneHotEncoder

    num_pipeline = Pipeline([
        ('cleaner', DataNumCleaner()),
        ('selector', DataFrameSelector(num_attributes)),
        ('imputer', Imputer(strategy='most_frequent')),
        ('std_scaler', StandardScaler())
    ])
    
    cat_pipeline = Pipeline([
        ('cleaner', DataNumCleaner()),
        ('selector', DataFrameSelector(cat_attributes)),
        ('encoder', OneHotEncoder())
    ])
    
    label_pipeline = Pipeline([
        ('cleaner', DataNumCleaner()),
        ('selector', DataFrameSelector(['totalPrice']))
    ])
    
    full_pipeline = FeatureUnion([
        ('num_pipeline', num_pipeline),
        ('cat_pipeline', cat_pipeline)
    ])
    
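    DataFrameSelector above is not part of scikit-learn; it is a small custom transformer (in the style used for the California housing exercise) that picks a list of columns out of a DataFrame. Its definition is not shown in the post; a minimal sketch might look like this:

    from sklearn.base import BaseEstimator, TransformerMixin

    class DataFrameSelector(BaseEstimator, TransformerMixin):
        """Select the given DataFrame columns and return their values as an array."""
        def __init__(self, attribute_names):
            self.attribute_names = attribute_names
        def fit(self, X, y=None):
            return self
        def transform(self, X, y=None):
            return X[self.attribute_names].values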

Model Training

  1. Linear regression model

    image

    The results look good.
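    The post only shows the resulting image; a sketch of what this step typically looks like (housing_prepared and housing_label are the pipeline outputs referenced later in the grid-search snippet):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # Fit a plain linear regression on the prepared training data and report the train RMSE
    lin_reg = LinearRegression()
    lin_reg.fit(housing_prepared, housing_label)
    predictions = lin_reg.predict(housing_prepared)
    print('Train RMSE:', np.sqrt(mean_squared_error(housing_label, predictions)))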

  2. Decision tree

    image

    The results are also decent, but training takes too long, so I considered removing some irrelevant features.
    Check the correlations between features (a sketch of this check appears at the end of this item):
    image

    image

    The attributes most correlated with price are still the floor area and the community average price. But since the goal is to predict the price of a single house, the chosen features should ideally be properties of the house itself, so I considered dropping followers and communityAverage.
    image

    After reducing the features, the linear regression model's performance is still acceptable.
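    The correlation check mentioned above might look like the following sketch (assuming `housing` is the cleaned training DataFrame; non-numeric columns are filtered out first):

    import numpy as np

    # Rank the numeric attributes by their correlation with the target
    corr_matrix = housing.select_dtypes(include=[np.number]).corr()
    print(corr_matrix['totalPrice'].sort_values(ascending=False))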

  3. Linear SVR


    image
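    The lin_svm_reg model tuned in the next step is not defined in the post; a hypothetical baseline using scikit-learn's LinearSVR might be:

    import numpy as np
    from sklearn.svm import LinearSVR

    # Hypothetical baseline; the actual hyperparameters used in the post are unknown
    lin_svm_reg = LinearSVR(epsilon=0.5, random_state=42)
    lin_svm_reg.fit(housing_prepared, np.ravel(housing_label))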
  4. Hyperparameter tuning
    Since my computer's compute power is really limited, I could only practice with the linear model for now.

    image

    This yields the best parameters for the linear SVR.
    Check the RMSE of each run:
    image

    The results are acceptable.

    #improve the linear_svr model
    from sklearn.model_selection import GridSearchCV

    param_grid = [
        {'C': [0.5, 1, 2], 'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive']}
    ]
    grid_search = GridSearchCV(lin_svm_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
    grid_search.fit(housing_prepared,housing_label)
    print(grid_search.best_params_)
    cvres = grid_search.cv_results_
    for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
        print(np.sqrt(-mean_score), params)
    
    #final model
    final_model = grid_search.best_estimator_
    

Model Validation

  1. Validate with the test set

    image

    Performance is about the same as on the training set, which is acceptable.
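    A sketch of what this evaluation step might look like, assuming the test set is pushed through the same, already-fitted pipelines as the training data:

    import numpy as np
    from sklearn.metrics import mean_squared_error

    # Hypothetical evaluation: transform the test set, then measure the final model's RMSE
    X_test_prepared = full_pipeline.transform(test_set)
    y_test = np.ravel(label_pipeline.transform(test_set))
    final_predictions = final_model.predict(X_test_prepared)
    print('Test RMSE:', np.sqrt(mean_squared_error(y_test, final_predictions)))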

  2. Randomly sample 100 records from the test set and check the predictions

    image

    The predictions almost coincide with the labels, so the model is fit for use.

    from random import randint

    # randint is inclusive at both ends, hence len(y_test) - 1
    test_index = [randint(0, len(y_test) - 1) for i in range(100)]
    y_label = [y_test[index] for index in test_index]
    y_predict = [final_model.predict(X_test_prepared[index]) for index in test_index]
    x = [i+1 for i in range(100)]
    plt.plot(x, y_label, c='red', label='label')
    plt.plot(x, y_predict, c='blue', label='predict')
    plt.legend()
    plt.savefig('result.png')
    
  3. Export the model

    image

import joblib  # in older scikit-learn this was available as sklearn.externals.joblib
joblib.dump(final_model, 'BeijingHousingPricePredicter.pkl')

Summary

  • Beijing housing prices really are high!
  • The properties that actually sell in Beijing are mostly around 5 million CNY, roughly 100 m², with building ages between 0 and 40 years; larger properties are priced but rarely traded
  • The best time to buy a home in Beijing was around 2011
  • Around 2017 a property changed hands for an astronomical 175 million CNY; one wonders what kind of immortals the buyer and seller were
  • Data cleaning matters; you can write your own transformers and put them into a Pipeline
  • Some features can be dropped based on human judgment, but feature engineering really matters!!
  • Machine learning calls for a computer with decent compute power ORZ
  • Full code
