Selecting, training, and fine-tuning models with sklearn

Selecting and training a model

Linear regression model

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

You now have a working linear regression model.
You can measure this regression model's RMSE on the entire training set with sklearn's mean_squared_error function:

import numpy as np
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)      # 68628.413493824875
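
Recent sklearn versions can also return the RMSE directly, without the manual square root (a convenience sketch, assuming sklearn >= 0.22; the newest releases deprecate this argument in favor of root_mean_squared_error):

lin_rmse = mean_squared_error(housing_labels, housing_predictions, squared=False)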

Decision tree model

Decision trees are able to find complex nonlinear relationships in the data.

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)      # 0.0

The error is exactly zero. Rather than meaning the model is perfect, this tells us the model has badly overfit the training data.
Let's improve the evaluation strategy.

Better evaluation using cross-validation

Use sklearn's cross-validation features. The following code performs K-fold cross-validation: it randomly splits the training set into 10 distinct subsets called folds, then trains and evaluates the decision tree model 10 times, each time picking a different fold for evaluation and training on the other 9 folds. The result is an array containing the 10 evaluation scores:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)

sklearn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function actually returns the negative of the MSE, which is why the code above computes np.sqrt(-scores).

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

Random forest

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
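
Displaying these scores with the same helper produces the "Random forest" output below:

display_scores(forest_rmse_scores)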

Random forest:
Scores: [49565.79529588 47337.80993895 50185.64303548 52405.39139117
49600.49582891 53273.54270025 48704.41836964 47764.91984528
52851.82081761 50215.74587276]
Mean: 50190.55830959307
Standard deviation: 1961.179867922108

Decision tree:
Scores: [69626.46134399 67991.90860685 71566.04190367 70032.02237379
70596.49995302 74664.05771371 70091.6453497 71805.24386367
78157.17712767 69618.17027461]
Mean: 71414.92285106823
Standard deviation: 2804.7690022906345

Linear regression:
Scores: [66877.52325028 66608.120256 70575.91118868 74179.94799352
67683.32205678 71103.16843468 64782.65896552 67711.29940352
71080.40484136 67687.6384546 ]
Mean: 68828.99948449328
Standard deviation: 2662.7615706103393

You can easily save a trained sklearn model with Python's pickle module or with the joblib library; joblib is more efficient at serializing models that contain large NumPy arrays:

import joblib
joblib.dump(my_model, "my_model.pkl")

my_model_loaded = joblib.load("my_model.pkl")
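
The standard-library pickle route works too (a minimal equivalent sketch):

import pickle

# serialize the trained model to disk
with open("my_model.pkl", "wb") as f:
    pickle.dump(my_model, f)

# load it back later
with open("my_model.pkl", "rb") as f:
    my_model_loaded = pickle.load(f)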

Fine-tuning the model

Grid search

You can use sklearn's GridSearchCV to do this search for you.

from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
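
This grid tells sklearn to evaluate 3 × 4 = 12 combinations of n_estimators and max_features from the first dict, plus 2 × 3 = 6 combinations from the second (with bootstrapping turned off): 18 combinations in total, each trained 5 times since cv=5 below, for 90 rounds of training overall.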

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

The best parameter combination:
grid_search.best_params_
The best estimator:
grid_search.best_estimator_
The evaluation scores:

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

Randomized search

Grid search is fine when you are exploring relatively few combinations, but when the search space is large it is usually preferable to use RandomizedSearchCV instead. Rather than trying every possible combination, it evaluates a fixed number of random combinations, selecting a random value for each hyperparameter at every iteration, as sketched below.
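
A minimal sketch of such a search (the param_distribs ranges and n_iter value are illustrative assumptions, not tuned choices):

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# illustrative search ranges -- assumptions, not tuned values
param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor()
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error')
rnd_search.fit(housing_prepared, housing_labels)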

The best estimator found by the search can also report the relative importance of each feature:

feature_importances = grid_search.best_estimator_.feature_importances_
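
To make these numbers readable, you can pair them with the feature names (a sketch; feature_names is assumed to be the list of column names of housing_prepared, which this post does not construct):

# 'feature_names' is an assumption: the column names of housing_prepared
sorted(zip(feature_importances, feature_names), reverse=True)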
