Model Tuning: Tuning a Random Forest on the Breast Cancer Dataset

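I. Dataset

The breast cancer dataset that ships with scikit-learn (load_breast_cancer).

II. Model selection

The breast cancer dataset poses a binary classification problem. We use a random forest classifier and tune its hyperparameters.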

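III. Tuning workflow

1) Build a simple baseline model and see how it performs on the dataset
2) Tune n_estimators
3) Tune max_depth
4) Tune min_samples_leaf
5) Tune min_samples_split
6) Tune max_features
7) Tune criterion
8) Settle on the best parameter combination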

IV. Tuning steps in detail

1) Import the required libraries

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

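2) Inspect the dataset

data=load_breast_cancer() #load the dataset; returns a Bunch, which has no .info() method
print(data.DESCR)         #text description of the dataset
data.data.shape
data.target.shape
data.target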
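The inspection above can be condensed into a short sketch that also reports the class balance, which is worth knowing before reading accuracy figures. The names X and y are introduced here for illustration only; they are not used elsewhere in this article.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

# load_breast_cancer() returns a dict-like Bunch, not a pandas DataFrame
data = load_breast_cancer()
X, y = data.data, data.target    # X and y are illustrative names

print(X.shape, y.shape)          # shapes of the feature matrix and label vector
print(list(data.target_names))   # class names, index-aligned with the label values
print(Counter(y.tolist()))       # number of samples per class
```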

3) Build a simple baseline model and see how it performs on the dataset

rfc=RandomForestClassifier(n_estimators=100,random_state=90)
score_pre=cross_val_score(rfc,data.data,data.target,cv=10).mean()
score_pre

score_pre comes out as 0.9666925935528475
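The baseline number is a mean over 10 folds, so it can help to look at the spread across folds as well. The sketch below also makes the splitting strategy explicit: for a classifier, passing an integer cv to cross_val_score uses stratified, unshuffled folds, so an explicit StratifiedKFold(n_splits=10) should reproduce the same per-fold scores. The variable names scores_int and scores_explicit are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

data = load_breast_cancer()
rfc = RandomForestClassifier(n_estimators=100, random_state=90)

# cv=10 with a classifier: stratified folds, no shuffling
scores_int = cross_val_score(rfc, data.data, data.target, cv=10)

# the same splitting strategy, written out explicitly
cv = StratifiedKFold(n_splits=10)
scores_explicit = cross_val_score(rfc, data.data, data.target, cv=cv)

print("mean accuracy:", scores_int.mean())
print("std across folds:", scores_int.std())
print("identical folds:", np.allclose(scores_int, scores_explicit))
```

The standard deviation gives a rough sense of how much a difference in the fourth decimal place of the mean should be trusted.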

4) Tune n_estimators

scorel=[]
for i in range(1,201,10):
    rfc=RandomForestClassifier(n_estimators=i,random_state=90)
    score=cross_val_score(rfc,data.data,data.target,cv=10).mean()
    scorel.append(score)
print(max(scorel),(scorel.index(max(scorel))*10)+1)
plt.figure(figsize=[20,5])
plt.plot(range(1,201,10),scorel)
plt.show()

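Output:

The printed result and the learning curve show that, within this coarse sweep, accuracy peaks at n_estimators=41, reaching 0.9684480598046841.

Next, narrow the range and examine n_estimators from 35 to 44.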

scorel=[]
for i in range(35,45):
    rfc=RandomForestClassifier(n_estimators=i,random_state=90)
    score=cross_val_score(rfc,data.data,data.target,cv=10).mean()
    scorel.append(score)
print(max(scorel),([*range(35,45)][scorel.index(max(scorel))]))
plt.figure(figsize=[20,5])
plt.plot(range(35,45,1),scorel)
plt.show()

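Output:

Tuning n_estimators has a clear effect: accuracy rises by about 0.0035 right away, with n_estimators=39 carried into the steps that follow. Next we move on to grid search and tune the remaining parameters one at a time, to see how reasoning about model complexity and generalization error can guide tuning toward higher accuracy.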
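The two sweeps above recover the best n_estimators by index arithmetic on the score list, which is easy to get wrong when the step size changes. A minimal sketch of the same search wrapped in a helper follows; best_n_estimators is a hypothetical function written for this illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def best_n_estimators(candidates, X, y, cv=10, random_state=90):
    """Hypothetical helper: return (best_n, best_score, all_scores)."""
    candidates = list(candidates)
    scores = []
    for n in candidates:
        rfc = RandomForestClassifier(n_estimators=n, random_state=random_state)
        scores.append(cross_val_score(rfc, X, y, cv=cv).mean())
    best = int(np.argmax(scores))  # first index of the highest mean score
    return candidates[best], scores[best], scores


data = load_breast_cancer()
best_n, best_score, scores = best_n_estimators(range(35, 45), data.data, data.target)
print(best_n, best_score)
```

Because the helper indexes the candidate list directly, the same call works for the coarse sweep, for example range(1, 201, 10), without adjusting any offsets.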

5) Tune max_depth

param_grid={'max_depth':[*np.arange(1,20,1)]}

rfc=RandomForestClassifier(n_estimators=39,random_state=90)
GS=GridSearchCV(rfc,param_grid,cv=10)
GS.fit(data.data,data.target)
GS.best_params_
GS.best_score_

Output:

The grid search returns max_depth=11 as the best value, with a best accuracy of 0.9718804920913884.
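best_params_ and best_score_ report only the winner. To see how close the other depths come, the fitted search object also exposes cv_results_, which can be loaded into a pandas DataFrame. The sketch below repeats the search so that it runs on its own; the column selection is a choice made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

data = load_breast_cancer()
param_grid = {"max_depth": list(range(1, 20))}
rfc = RandomForestClassifier(n_estimators=39, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(data.data, data.target)

# one row per candidate max_depth, with mean and spread over the 10 folds
results = pd.DataFrame(GS.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score").head())
```

If several depths score within one standard deviation of each other, the exact best value is not very meaningful.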

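This raises a question: compared with step 4, restricting max_depth lowered the accuracy instead of raising it. Tree-based models such as random forests tend to overfit, so reducing model complexity would normally be expected to improve accuracy. Here, lowering the maximum depth made accuracy worse, which suggests the model currently sits on the left side of the generalization error versus model complexity curve, that is, to the left of the point where generalization error is lowest. This depends on the dataset itself, but it is also possible that the n_estimators value we chose is already large for this dataset and has pulled the model toward the lowest point of generalization error.

When the model sits on the left side of the curve, what we need are options that increase model complexity (raising variance, lowering bias), so max_depth should be as large as possible, and min_samples_leaf and min_samples_split as small as possible. This practically means that, apart from max_features, there is nothing left to tune, because max_depth, min_samples_leaf and min_samples_split are pruning parameters that only reduce complexity. At this point we can predict that we are very close to the model's ceiling and that it probably cannot improve further.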
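One way to check which side of the curve the model is on is to compare training and validation accuracy as complexity grows. The sketch below uses scikit-learn's validation_curve for max_depth; the depth range 1 to 10 is an assumption chosen to keep the run short, not a value from this article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

data = load_breast_cancer()
depths = list(range(1, 11))  # assumed range, kept small for speed
rfc = RandomForestClassifier(n_estimators=39, random_state=90)

# each result has shape (n_depths, n_folds)
train_scores, val_scores = validation_curve(
    rfc, data.data, data.target,
    param_name="max_depth", param_range=depths, cv=10,
)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train={tr:.4f}  validation={va:.4f}")
```

If validation accuracy keeps rising or stays flat as depth increases, the model is not yet hurt by extra complexity, which is consistent with the reading that pruning parameters have little to offer here.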

6) Tune max_features

grid_param={'max_features':np.arange(5,30)}

rfc=RandomForestClassifier(n_estimators=39,random_state=90)
GS=GridSearchCV(rfc,grid_param,cv=10)
GS.fit(data.data,data.target)
GS.best_params_
GS.best_score_

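Output:

The grid search returns max_features=5 as the best value, with a best accuracy of 0.9718804920913884. Accuracy has again decreased.

The grid search returned the smallest max_features value in the grid, so accuracy falls as max_features rises. In other words, pushing the model to the right (toward higher complexity) increases generalization error. Pushing left with max_depth and pushing right with max_features both increase the error, which suggests the model already sits at the lowest point of generalization error and has reached its predictive ceiling; nothing remains that parameters can influence. The remaining error is determined by noise, leaving no room for variance or bias to play a part.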
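The workflow in section III also lists min_samples_leaf, min_samples_split and criterion, which the steps above skip on the grounds that pruning parameters cannot help. A minimal sketch of how that claim could be checked with one joint grid search follows. The candidate values in the grid are assumptions chosen for illustration; the grid includes the scikit-learn defaults (1, 2 and "gini"), so the best score can only match or exceed the untuned setting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

data = load_breast_cancer()

# assumed candidate values; the defaults are 1, 2 and "gini"
param_grid = {
    "min_samples_leaf": [1, 3, 5],
    "min_samples_split": [2, 4, 6],
    "criterion": ["gini", "entropy"],
}

rfc = RandomForestClassifier(n_estimators=39, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(data.data, data.target)

print(GS.best_params_)
print(GS.best_score_)
```

If the search returns the default values, that supports the conclusion that the model is already near its ceiling on this dataset.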

V. Tuning complete: the best parameter combination

RandomForestClassifier(n_estimators=39,random_state=90)

Accuracy before tuning: 0.9666925935528475 (96.67%)
Accuracy after tuning: 0.9719568317345088 (97.20%)
Accuracy gain: 0.0052642381816613 (+0.53 percentage points)
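The gain can be recomputed from the two scores reported above. The short sketch below also expresses it as a relative reduction in error rate, which is often easier to interpret when accuracy is already high. The variable names are illustrative.

```python
# scores as reported above
score_before = 0.9666925935528475   # n_estimators=100
score_after = 0.9719568317345088    # n_estimators=39

gain = score_after - score_before
error_before = 1 - score_before
error_after = 1 - score_after
relative_error_reduction = (error_before - error_after) / error_before

print(f"absolute gain: {gain:.4f} ({gain * 100:.2f} percentage points)")
print(f"relative error reduction: {relative_error_reduction:.1%}")
```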

·································································································································································
Full code:

#import the required libraries
from sklearn.datasets import load_breast_cancer     #breast cancer dataset loader
from sklearn.ensemble import RandomForestClassifier #random forest classifier (ensemble method)
from sklearn.model_selection import cross_val_score #cross-validation
from sklearn.model_selection import GridSearchCV    #grid search
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

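#dataset overview
data=load_breast_cancer()   #load the dataset (returns a Bunch, which has no .info() method)
print(data.DESCR)           #text description of the dataset
data.data.shape             #shape of the feature matrix
data.target.shape           #shape of the label vector
data.target                 #label values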


#simple baseline model: see how it performs on the dataset
rfc=RandomForestClassifier(n_estimators=100,random_state=90)      #instantiate
score_pre=cross_val_score(rfc,data.data,data.target,cv=10).mean() #10-fold cross-validation
score_pre

#tune n_estimators
scorel=[]
for i in range(1,201,10):
    rfc=RandomForestClassifier(n_estimators=i,random_state=90)  #fit and score for n_estimators = 1, 11, ..., 191
    score=cross_val_score(rfc,data.data,data.target,cv=10).mean()
    scorel.append(score)
print(max(scorel),(scorel.index(max(scorel))*10)+1)
plt.figure(figsize=[20,5])  #plot the learning curve
plt.plot(range(1,201,10),scorel)
plt.show()

scorel=[]
for i in range(35,45):
    rfc=RandomForestClassifier(n_estimators=i,random_state=90)  #fit and score for n_estimators = 35, ..., 44
    score=cross_val_score(rfc,data.data,data.target,cv=10).mean()
    scorel.append(score)
print(max(scorel),([*range(35,45)][scorel.index(max(scorel))]))
plt.figure(figsize=[20,5])   #plot the learning curve
plt.plot(range(35,45,1),scorel)
plt.show()

#tune max_depth with grid search
param_grid={'max_depth':[*np.arange(1,20,1)]} #parameter to search and its candidate values
rfc=RandomForestClassifier(n_estimators=39,random_state=90) #instantiate
GS=GridSearchCV(rfc,param_grid,cv=10) #grid search
GS.fit(data.data,data.target)  #fit
GS.best_params_   #best parameter
GS.best_score_    #best score

#tune max_features with grid search
grid_param={'max_features':np.arange(5,30)} #parameter to search and its candidate values
rfc=RandomForestClassifier(n_estimators=39,random_state=90) #instantiate
GS=GridSearchCV(rfc,grid_param,cv=10) #grid search
GS.fit(data.data,data.target)  #fit
GS.best_params_  #best parameter
GS.best_score_   #best score
