Decision Trees ④: Tuning sklearn Decision Trees with GridSearchCV (Parameter Search and Process Plots)

Decision Trees ①: Information Entropy, Information Gain & Gini Index
Decision Trees ②: Decision Tree Algorithms (ID3, C4.5, CART)
Decision Trees ③: Decision Tree Parameters (Classification and Regression)
Decision Trees ⑤: Implementing a Decision Tree in Python
Decision Tree Case Study ①: Titanic Classification
Decision Tree Case Study ②: Customer Churn Prediction
Decision Tree Case Study ③: Bank Lending Model

The previous post introduced the parameters of sklearn's decision trees; in this one we use GridSearchCV to search for the optimal parameter values.

1. Introduction to GridSearchCV

GridSearchCV takes the following key arguments:

estimator: the model to tune; it can be a classifier or a regressor. Here we use the decision tree classifier (and, where relevant, the decision tree regressor).

param_grid: the parameters to search over, in one of two forms:

a. A dict mapping each parameter name to a list of candidate values. The search iterates over every combination of these values and reports the best combination.

param_grid = {'criterion':['gini'],'max_depth':[30,50,60,100],'min_samples_leaf':[2,3,5,10],'min_impurity_decrease':[0.1,0.2,0.5]}

b. A list of dicts, each of the same form as above. The search finds the best parameter combination within each dict, then compares the dicts against one another to pick the overall best dict and combination.

param = [{'criterion':['gini'],'max_depth':[30,50,60,100],'min_samples_leaf':[2,3,5,10],'min_impurity_decrease':[0.1,0.2,0.5]},
         {'criterion':['gini','entropy']},
         {'max_depth': [30,60,100], 'min_impurity_decrease':[0.1,0.2,0.5]}]

cv: controls how the cross-validation folds are generated. For example, cv=5 splits the data into 5 folds; each fold in turn serves as the validation set while the remaining folds are used for training. After fitting, the best parameters and the best score are stored in the attributes best_params_ and best_score_, and cv_results_ holds all intermediate results of the search.
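The result attributes described above can be inspected directly after fitting. A minimal sketch on the iris data (the tiny max_depth grid here is illustrative only, not the article's full grid):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)

# A deliberately small grid, just to show the result attributes
param = {'max_depth': [2, 4, 6]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid=param, cv=5)
grid.fit(x, y)

print(grid.best_params_)   # the best parameter combination found
print(grid.best_score_)    # mean cross-validated score of that combination
print(sorted(grid.cv_results_.keys()))  # intermediate results: mean_test_score, std_test_score, params, ...
```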

2. Tuning the Decision Tree

# Import libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics

# Load the dataset
X = datasets.load_iris()  # returns a Bunch (dict-like) with keys such as data, target and target_names
data = X.data
target = X.target
name = X.target_names
x,y = datasets.load_iris(return_X_y=True)  # returns (data, target) directly as a tuple
print(x.shape,y.shape)

# Split the data into training and test sets
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=100)

# Search for the best parameters with GridSearchCV (dict form)
param = {'criterion':['gini'],'max_depth':[30,50,60,100],'min_samples_leaf':[2,3,5,10],'min_impurity_decrease':[0.1,0.2,0.5]}
grid = GridSearchCV(DecisionTreeClassifier(),param_grid=param,cv=6)
grid.fit(x_train,y_train)
print('Best params:', grid.best_params_, 'Best score:', grid.best_score_)  # the best parameters and score found


# Search for the best parameters with GridSearchCV (list form)
param = [{'criterion':['gini'],'max_depth':[30,50,60,100],'min_samples_leaf':[2,3,5,10],'min_impurity_decrease':[0.1,0.2,0.5]},
         {'criterion':['gini','entropy']},
         {'max_depth': [30,60,100], 'min_impurity_decrease':[0.1,0.2,0.5]}]
grid = GridSearchCV(DecisionTreeClassifier(),param_grid=param,cv=6)
grid.fit(x_train,y_train)
print('Best params:', grid.best_params_, 'Best score:', grid.best_score_)  # the best parameters and score found

The two forms return slightly different parameters, but the same cross-validation score.

# Train with the best parameters
clf = DecisionTreeClassifier(max_depth=30,min_samples_leaf=3,min_impurity_decrease=0.1)
clf.fit(x_train,y_train)
y_pred = clf.predict(x_test)
print('Training score:', clf.score(x_train,y_train), 'Test score:', clf.score(x_test,y_test))

The test score is even slightly higher than the training score, which indicates the model is not overfitting and generalizes well.
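Accuracy alone hides which classes the errors fall on. sklearn.metrics is imported at the top of the script but never used; as a possible follow-up (a sketch, not part of the original article), it can break the test-set performance down per class:

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = datasets.load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100)

# Same best parameters as found above
clf = DecisionTreeClassifier(max_depth=30, min_samples_leaf=3, min_impurity_decrease=0.1)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

print(metrics.confusion_matrix(y_test, y_pred))       # per-class error counts
print(metrics.classification_report(y_test, y_pred))  # precision / recall / F1 per class
```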

# Plot the predictions against the true labels
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(list(range(len(x_test))),y_test,marker='*',label='y_test')
ax.scatter(list(range(len(x_test))),y_pred,marker='o',label='y_pred')  # where the prediction is correct, the two markers overlap and one covers the other
plt.legend()
plt.show()

The plot shows only one misclassified point, so the test set is classified quite well.

3. Visualizing the GridSearchCV Tuning Process

# Inspect the tuning process through cv_results_ and plot it
import numpy as np

min_samples_leaf = np.arange(1,10)
param = {'min_samples_leaf': min_samples_leaf}
# return_train_score=True is required for the mean/std train-score keys in cv_results_
clf = GridSearchCV(DecisionTreeClassifier(),param_grid=param,cv=6,return_train_score=True)
clf.fit(x_train,y_train)
fig = plt.figure()
ax = fig.add_subplot(111)
# Shade a band of one standard deviation around each mean score
ax.fill_between(min_samples_leaf,clf.cv_results_['mean_train_score']+clf.cv_results_['std_train_score'],
                 clf.cv_results_['mean_train_score']-clf.cv_results_['std_train_score'],color='b',alpha=0.2)
ax.fill_between(min_samples_leaf,clf.cv_results_['mean_test_score']+clf.cv_results_['std_test_score'],
                 clf.cv_results_['mean_test_score']-clf.cv_results_['std_test_score'],color='r',alpha=0.2)
ax.plot(min_samples_leaf,clf.cv_results_['mean_train_score'],'ko-',label='mean train score')
ax.plot(min_samples_leaf,clf.cv_results_['mean_test_score'],'g*-',label='mean test score')
plt.legend()
plt.title('GridSearchCV tuning process')
plt.show()

The plot shows that min_samples_leaf=3 gives the highest cross-validation score while the training score is still good, and the standard deviation there is the smallest, so that is the value to pick.
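Instead of re-typing the winning parameters by hand as in section 2, note that GridSearchCV with the default refit=True refits the best model on the whole training set and exposes it as best_estimator_. A minimal sketch of that shortcut:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'min_samples_leaf': np.arange(1, 10)}, cv=6)
grid.fit(x_train, y_train)

best = grid.best_estimator_        # already refitted on the full training set
print(best.score(x_test, y_test))  # test accuracy of the tuned tree
```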
