Machine Learning Project - Predicting Housing Prices (2)

The previous post covered exploring and analyzing the data and preparing it for machine learning algorithms; this post covers selecting and training a model.

Selecting and Training a Model

Evaluating on the Training Set

  • MSE: mean squared error
  • RMSE: root mean squared error = sqrt(MSE); the smaller the value, the more accurate the model
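Concretely, for m samples with predictions and labels (this is the standard definition; the original post only states RMSE = sqrt(MSE)):

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}$$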
The following code evaluates the linear regression model on the training set:
import numpy as np
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)  # lin_reg was fitted in part 1
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)  # RMSE = sqrt(MSE)
lin_rmse  # output: 68376.64295459937

Predicting with a Decision Tree

A typical prediction error of about $68,000 is quite large for house prices, which suggests the linear model is underfitting. Let's try a more powerful model: a decision tree, trained directly on the whole training set. The basic steps:

#
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)  # train on the full training set
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse  # comes out as 0.0 -- the tree has badly overfit the training set

Training with Cross-Validation

Evaluating a model on the same data it was trained on is misleading, as the decision tree's "perfect" training score shows. Cross-validation gives an estimate of the model's performance on unseen data, and the standard deviation across folds tells you how precise that estimate is.
Here is the decision tree trained and evaluated with 10-fold cross-validation:

#
from sklearn.model_selection import cross_val_score

# sklearn's scoring convention is "greater is better", so MSE is returned
# negated; negate it back before taking the square root
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring='neg_mean_squared_error', cv=10)
tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print('Scores:', scores)
    print('Mean:', scores.mean())
    print('Standard deviation', scores.std())

display_scores(tree_rmse_scores)
#
# Output:
#
# Scores: [69302.13996806 65808.92670683 70608.88582636 68445.56825659
# 70894.80906296 74790.86650185 71951.35239344 69535.68565541
# 77793.36936907 69709.25676712]
# Mean: 70884.08605076896
# Standard deviation 3187.7987599230505

Trying Several Models and Comparing Them

Under cross-validation the decision tree actually scores worse than plain linear regression, confirming the overfitting. Other algorithms, such as linear regression and random forests, are evaluated the same way:

# 
# Linear regression:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring='neg_mean_squared_error', cv=10)
lin_rmse_score = np.sqrt(-lin_scores)
display_scores(lin_rmse_score)
#
# Output:
#
# Scores: [66877.52325028 66608.120256   70575.91118868 74179.94799352
# 67683.32205678 71103.16843468 64782.65896552 67711.29940352
# 71080.40484136 67687.6384546 ]
# Mean: 68828.99948449328
# Standard deviation 2662.7615706103393
# Random forest:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)  # not required by cross_val_score, which clones and refits
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring='neg_mean_squared_error', cv=10)
forest_rmse_score = np.sqrt(-forest_scores)
display_scores(forest_rmse_score)
#
# Output:
#
# Scores: [52325.56861425 49442.40149791 52611.75312295 55198.59370655
# 52086.45692647 54510.23701739 50958.38278796 51022.59800721
# 55156.14119727 52626.34550329]
# Mean: 52593.847838123955
# Standard deviation 1800.7936516195107

Judging by the Mean (average error) and Standard deviation (spread across folds), the random forest performs best of the three.
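Incidentally, the same evaluate-and-display pattern repeats for every model. A small helper keeps the comparison tidy; this is a sketch that is not part of the original workflow, using only names already defined above:

#
def cv_rmse(model, X, y, cv=10):
    # cross-validated RMSE for any sklearn regressor
    scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=cv)
    return np.sqrt(-scores)

for name, model in [('linear', LinearRegression()),
                    ('tree', DecisionTreeRegressor()),
                    ('forest', RandomForestRegressor())]:
    print('---', name)
    display_scores(cv_rmse(model, housing_prepared, housing_labels))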

Fine-Tuning the Model's Hyperparameters

Using GridSearchCV (grid search over user-defined parameter combinations)

With GridSearchCV you tell it which hyperparameters to explore and which values to try, and it evaluates every combination with cross-validation to find the best one. In the grid below, the first dict covers 3 × 4 = 12 combinations and the second 2 × 3 = 6; with cv=5, each of the 18 combinations is trained five times:

#
from sklearn.model_selection import GridSearchCV

param_grid = [{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
              {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

Print the best parameter combination:

grid_search.best_params_
#
# Output:
#
#{'max_features': 8, 'n_estimators': 30}

Note that max_features=8 and n_estimators=30 are the largest values in the grid, so the score might keep improving with wider ranges; it can be worth re-running the search with larger values. Print the estimator in its best configuration:

grid_search.best_estimator_
#
# Output:
#
# RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
#           max_features=8, max_leaf_nodes=None, min_impurity_split=1e-07,
#           min_samples_leaf=1, min_samples_split=2,
#           min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=1,
#           oob_score=False, random_state=None, verbose=0, warm_start=False)

Print the evaluation score for every combination tried:

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(np.sqrt(-mean_score), params)  # convert the negated MSE back to RMSE
#
# Output:
#
# 63555.775802956174 {'max_features': 2, 'n_estimators': 3}
# 55897.585272134274 {'max_features': 2, 'n_estimators': 10}
# 53493.2929122738 {'max_features': 2, 'n_estimators': 30}
# 61352.76913864102 {'max_features': 4, 'n_estimators': 3}
# 53473.33096623527 {'max_features': 4, 'n_estimators': 10}
# 51531.8672602003 {'max_features': 4, 'n_estimators': 30}
# 60057.190610289785 {'max_features': 6, 'n_estimators': 3}
# 52994.16614765234 {'max_features': 6, 'n_estimators': 10}
# 50858.38894365947 {'max_features': 6, 'n_estimators': 30}
# 59461.81214391593 {'max_features': 8, 'n_estimators': 3}
# 53191.74529149878 {'max_features': 8, 'n_estimators': 10}
# 50473.68166246625 {'max_features': 8, 'n_estimators': 30}
# 62522.060376122194 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
# 54692.88651526594 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
# 60450.287130239885 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
# 53389.980711874865 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
# 60187.65885754527 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
# 52310.095002519985 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

Randomized Search

GridSearchCV is manageable when there are relatively few combinations, but when the search space is large it is better to use RandomizedSearchCV, which samples a fixed number of random combinations; the usage is similar, as sketched below.
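The original post doesn't show the code; a minimal sketch might look like this, assuming the same housing_prepared/housing_labels data as above (the parameter ranges and n_iter are illustrative, not tuned values):

#
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# sample n_iter random combinations from the given distributions
param_distribs = {'n_estimators': randint(low=1, high=200),
                  'max_features': randint(low=1, high=8)}

forest_reg = RandomForestRegressor()
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)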

Analyzing the Best Model and Its Errors

Inspect how much each feature contributes to the final prediction:

#
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
#
# Output:
#
# array([6.19813334e-02, 5.44544572e-02, 4.61227129e-02, 1.55942139e-02,
#        1.63925503e-02, 1.58653211e-02, 1.56173032e-02, 3.54320828e-01,
#        1.00273288e-01, 3.38883022e-02, 1.08708398e-01, 3.28848905e-02,
#        5.55195800e-03, 1.32916172e-01, 6.52490730e-05, 2.21408345e-03,
#        3.14893827e-03])
Add the custom combined attributes and the one-hot-encoded text categories into the comparison, to see which features matter most:
#
extra_attribs = ['rooms_per_hhold', 'pop_per_hhold', 'bedrooms_per_room']  # combined attributes from part 1
cat_one_hot_attribs = list(encoder.classes_)  # text categories from the encoder fitted in part 1
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
#
# Output:
#
# [(0.3543208280862656, 'median_income'),
#  (0.13291617239213746, 'INLAND'),
#  (0.10870839796533181, 'pop_per_hhold'),
#  (0.10027328798545998, 'income_cat'),
#  (0.06198133338434383, 'longitude'),
#  (0.054454457170120575, 'latitude'),
#  (0.04612271290632948, 'housing_median_age'),
#  (0.033888302242462524, 'rooms_per_hhold'),
#  (0.0328848905240143, 'bedrooms_per_room'),
#  (0.016392550289412937, 'total_bedrooms'),
#  (0.015865321138028063, 'population'),
#  (0.01561730321306384, 'households'),
#  (0.015594213911598253, 'total_rooms'),
#  (0.005551958000676239, '<1H OCEAN'),
#  (0.0031489382672819253, 'NEAR OCEAN'),
#  (0.0022140834504646703, 'NEAR BAY'),
#  (6.524907300855058e-05, 'ISLAND')]

From this ranking you can see that among the ocean_proximity categories only 'INLAND' carries much useful information; the other categories could be dropped.
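The post doesn't show how to do the dropping; a minimal sketch of one approach is to keep only the k most important feature columns (the cutoff k = 5 is arbitrary, chosen for illustration):

#
def indices_of_top_k(arr, k):
    # indices (in column order) of the k largest values
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

k = 5
top_k_idx = indices_of_top_k(feature_importances, k)
housing_prepared_top_k = housing_prepared[:, top_k_idx]  # reduced training matrix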

Final Evaluation on the Test Set

Evaluate the final model on the test set:

#
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop('median_house_value', axis=1)
y_test = strat_test_set['median_house_value'].copy()

# transform(), not fit_transform(): the pipeline must not be refit on test data
X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
#
# Output:
#
# 48905.973801565415
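
This RMSE is only a point estimate. One common way, not shown in the original post, to gauge how precise it is: compute a 95% confidence interval for the generalization error with a t-interval over the squared errors:

#
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))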

Summary:

The point of this project was to walk through the key steps of a machine-learning workflow. Some parts were covered rather briefly, but later projects will revisit them; it is a classic introductory project.
