from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
You now have a working linear regression model.
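Before measuring the error, it is worth sanity-checking the model on a handful of training instances (a minimal sketch, assuming housing_prepared and housing_labels support positional slicing):

print("Predictions:", lin_reg.predict(housing_prepared[:5]))
print("Labels:", list(housing_labels[:5]))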
You can measure this regression model's RMSE on the whole training set using Scikit-Learn's mean_squared_error function:
import numpy as np
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)  # 68628.413493824875
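Newer Scikit-Learn versions can compute the RMSE directly: versions 1.4+ provide a root_mean_squared_error function (older releases instead accept squared=False in mean_squared_error). An equivalent one-liner:

from sklearn.metrics import root_mean_squared_error  # Scikit-Learn >= 1.4
lin_rmse = root_mean_squared_error(housing_labels, housing_predictions)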
Let's try a more powerful model, a DecisionTreeRegressor, which is capable of finding complex nonlinear relationships in the data:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse) # 0.0
The tree reports no error at all: an RMSE of exactly 0.0 does not mean the model is perfect, but rather that it has badly overfit the training data.
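One quick way to confirm the overfitting, before setting up the full cross-validation below, is to hold out part of the training set as a validation set. A minimal sketch using Scikit-Learn's train_test_split (the 80/20 split and random_state are illustrative choices):

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    housing_prepared, housing_labels, test_size=0.2, random_state=42)
tree_val = DecisionTreeRegressor().fit(X_train, y_train)
val_rmse = np.sqrt(mean_squared_error(y_val, tree_val.predict(X_val)))
# val_rmse comes out far above 0.0, confirming the overfitting.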
Now let's improve the evaluation strategy using Scikit-Learn's cross-validation feature. The following code performs k-fold cross-validation: it randomly splits the training set into 10 distinct subsets called folds, then trains and evaluates the decision tree model 10 times, picking a different fold for evaluation each time and training on the other 9 folds. The result is an array containing the 10 evaluation scores:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
Scikit-Learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function actually computes the negative of the MSE; that is why the code above takes np.sqrt(-scores). A small helper makes the results easier to inspect:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
Let's also try a RandomForestRegressor, which works by training many decision trees on random subsets of the data and averaging their predictions, and evaluate it the same way:

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
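Calling display_scores on each model's RMSE scores (the linear regression scores come from the same cross_val_score call with lin_reg swapped in) produces output like the following; exact numbers vary with the random seed:

display_scores(forest_rmse_scores)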
Random forest:
Scores: [49565.79529588 47337.80993895 50185.64303548 52405.39139117
49600.49582891 53273.54270025 48704.41836964 47764.91984528
52851.82081761 50215.74587276]
Mean: 50190.55830959307
Standard deviation: 1961.179867922108
Decision tree:
Scores: [69626.46134399 67991.90860685 71566.04190367 70032.02237379
70596.49995302 74664.05771371 70091.6453497 71805.24386367
78157.17712767 69618.17027461]
Mean: 71414.92285106823
Standard deviation: 2804.7690022906345
Linear regression:
Scores: [66877.52325028 66608.120256 70575.91118868 74179.94799352
67683.32205678 71103.16843468 64782.65896552 67711.29940352
71080.40484136 67687.6384546 ]
Mean: 68828.99948449328
Standard deviation: 2662.7615706103393
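For comparison, you can also measure the forest's RMSE on the training set itself, as was done for the linear model above. If that number comes out much lower than the ~50,190 cross-validation mean, the random forest is still overfitting the training data:

housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)  # typically far below the cross-validation mean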
You can easily save Scikit-Learn models with Python's pickle module or with the joblib library, which is more efficient at serializing large NumPy arrays:
import joblib
joblib.dump(my_model, "my_model.pkl")
my_model_loaded = joblib.load("my_model.pkl")
You can use Scikit-Learn's GridSearchCV to do the hyperparameter search for you: just tell it which hyperparameters to experiment with and which values to try, and it evaluates every combination using cross-validation.
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
The best combination of parameters:
grid_search.best_params_
The best estimator:
grid_search.best_estimator_
The evaluation score for each combination is also available:

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
When you are exploring relatively few combinations, grid search is fine; but when the hyperparameter search space is large, it is usually preferable to use RandomizedSearchCV instead. Rather than trying out all possible combinations, it evaluates a given number of random combinations, selecting a random value for each hyperparameter at every iteration.
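A minimal sketch of such a randomized search over the same forest model (the value ranges and n_iter=10 are illustrative assumptions):

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    'n_estimators': randint(low=1, high=200),  # sample integers in [1, 200)
    'max_features': randint(low=1, high=8),    # sample integers in [1, 8)
}
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)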
The RandomForestRegressor can also indicate the relative importance of each attribute for making accurate predictions:

feature_importances = grid_search.best_estimator_.feature_importances_
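These scores are easier to read next to the corresponding attribute names. Assuming you kept the list of column names used to build housing_prepared (called attributes here, a hypothetical name), you can sort the pairs:

# attributes: list of feature names in the same order as the columns
# of housing_prepared.
sorted(zip(feature_importances, attributes), reverse=True)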