Decision Trees ①: Information Entropy, Information Gain, and the Gini Index
Decision Trees ②: How the Algorithms Work (ID3, C4.5, CART)
Decision Trees ③: Parameter Reference (Classification and Regression)
Decision Trees ⑤: Implementing a Decision Tree in Python
Decision Tree Case Study ①: Titanic Survival Classification
Decision Tree Case Study ②: Customer Churn Prediction
Decision Tree Case Study ③: Bank Lending Model
The previous post covered the decision tree parameters in sklearn. Today we use GridSearchCV to tune them and find the optimal combination. Its key arguments are:
① estimator: the model whose parameters are being tuned; it can be a classifier or a regressor. Here we use the decision tree classifier and the decision tree regressor (see the regression sketch after this list).
② param_grid: the parameters to search, specified in one of two forms:
a. A dict mapping each parameter name to its candidate values; the search evaluates every combination of candidates and reports the best one, e.g.:
param_grid = {'criterion':['gini'],'max_depth':[30,50,60,100],'min_samples_leaf':[2,3,5,10],'min_impurity_decrease':[0.1,0.2,0.5]}
b. A list of dicts, each shaped as above; each dict is searched as its own grid, the best combination within each is found, and the overall winner across all grids is returned, e.g.:
param = [{'criterion':['gini'],'max_depth':[30,50,60,100],'min_samples_leaf':[2,3,5,10],'min_impurity_decrease':[0.1,0.2,0.5]},
{'criterion':['gini','entropy']},
{'max_depth': [30,60,100], 'min_impurity_decrease':[0.1,0.2,0.5]}]
③ cv: controls how the cross-validation folds are generated. For example, cv=5 splits the training data into 5 folds; each fold serves once as the validation set while the other 4 are used for fitting. After fitting, the best parameters and best score are stored in clf.best_params_ and clf.best_score_, and clf.cv_results_ keeps every intermediate result of the search.
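As noted in ①, the estimator can just as well be a regressor. A minimal sketch of the same workflow with DecisionTreeRegressor, assuming the diabetes dataset and parameter ranges of my own choosing (neither comes from this post's later example):
# Minimal sketch: the GridSearchCV workflow with a regressor.
# load_diabetes and these parameter ranges are illustrative assumptions.
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X_reg, y_reg = datasets.load_diabetes(return_X_y=True)
param = {'max_depth': [3, 5, 10], 'min_samples_leaf': [2, 5, 10]}
reg_grid = GridSearchCV(DecisionTreeRegressor(), param_grid=param, cv=5)  # default scoring for a regressor is R^2
reg_grid.fit(X_reg, y_reg)
print(reg_grid.best_params_, reg_grid.best_score_)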
# Import libraries
import numpy as np  # needed for np.arange below
import matplotlib.pyplot as plt
from sklearn import datasets, metrics
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
# Load the dataset
X = datasets.load_iris() # returns a dict-like Bunch whose keys include data, target and target_names
data = X.data
target = X.target
name = X.target_names
x,y = datasets.load_iris(return_X_y=True) # return_X_y=True returns the (data, target) pair directly
print(x.shape,y.shape)
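To see exactly which keys the returned Bunch carries (the exact set varies slightly across sklearn versions):
print(X.keys())  # e.g. data, target, target_names, DESCR, feature_names, ...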
# Split the data into training and test sets
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=100)
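An optional variant (my addition, not in the original post): passing stratify=y keeps the class proportions identical in both splits, which matters more on imbalanced data than on iris:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=100,stratify=y)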
# Use GridSearchCV to find the best parameters (dict form)
param = {'criterion':['gini'],'max_depth':[30,50,60,100],'min_samples_leaf':[2,3,5,10],'min_impurity_decrease':[0.1,0.2,0.5]}
grid = GridSearchCV(DecisionTreeClassifier(),param_grid=param,cv=6)
grid.fit(x_train,y_train)
print('Best parameters:',grid.best_params_,'Best score:', grid.best_score_) # the optimal parameters and their score
# Use GridSearchCV to find the best parameters (list form)
param = [{'criterion':['gini'],'max_depth':[30,50,60,100],'min_samples_leaf':[2,3,5,10],'min_impurity_decrease':[0.1,0.2,0.5]},
{'criterion':['gini','entropy']},
{'max_depth': [30,60,100], 'min_impurity_decrease':[0.1,0.2,0.5]}]
grid = GridSearchCV(DecisionTreeClassifier(),param_grid=param,cv=6)
grid.fit(x_train,y_train)
print('Best parameters:',grid.best_params_,'Best score:', grid.best_score_) # the optimal parameters and their score
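Because GridSearchCV refits the winning model on the whole training set by default (refit=True), predictions can also come straight from grid.best_estimator_ instead of retraining by hand:
best_clf = grid.best_estimator_  # already refit on all of x_train (refit=True is the default)
print('Test score via best_estimator_:', best_clf.score(x_test,y_test))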
# Retrain with the best parameters (equivalent to best_estimator_ above)
clf = DecisionTreeClassifier(max_depth=30,min_samples_leaf=3,min_impurity_decrease=0.1)
clf.fit(x_train,y_train)
y_pred = clf.predict(x_test)
print('Training score:', clf.score(x_train,y_train),'Test score:',clf.score(x_test,y_test))
The test score is no lower than the training score, which suggests the model is not overfitting and generalizes well.
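The metrics module imported earlier can break the result down further than score() does, reusing y_pred and the class names from above; for instance:
print(metrics.accuracy_score(y_test, y_pred))    # same value as clf.score(x_test, y_test)
print(metrics.confusion_matrix(y_test, y_pred))  # which classes get confused with which
print(metrics.classification_report(y_test, y_pred, target_names=name))  # per-class precision/recall/F1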
# Plot the test-set predictions against the actual labels
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(range(len(x_test)),y_test,marker='*',label='actual')
ax.scatter(range(len(x_test)),y_pred,marker='o',label='predicted') # wherever the prediction is correct, the two markers overlap
plt.legend()
plt.show()
The plot shows only one misclassified point, so the test-set classification is already quite good.
# Inspect the search via cv_results_ and plot the training curves
min_samples_leaf = np.arange(1,10)
param = {'min_samples_leaf': min_samples_leaf}
clf = GridSearchCV(DecisionTreeClassifier(),param_grid=param,cv=6,
                   return_train_score=True)  # training scores are only recorded when requested
clf.fit(x_train,y_train)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.fill_between(min_samples_leaf,clf.cv_results_['mean_train_score']+clf.cv_results_['std_train_score'],
                clf.cv_results_['mean_train_score']-clf.cv_results_['std_train_score'],
                color='b',alpha=0.2,label='train score ± std')
ax.fill_between(min_samples_leaf,clf.cv_results_['mean_test_score']+clf.cv_results_['std_test_score'],
                clf.cv_results_['mean_test_score']-clf.cv_results_['std_test_score'],
                color='r',alpha=0.2,label='validation score ± std')
ax.plot(min_samples_leaf,clf.cv_results_['mean_train_score'],'ko-',label='mean train score')
ax.plot(min_samples_leaf,clf.cv_results_['mean_test_score'],'g*-',label='mean validation score')
plt.legend()
plt.title('GridSearchCV training curves')
plt.show()
The plot shows that min_samples_leaf=3 gives the highest validation score, a decent training score, and the smallest standard deviation at that point, so that's our pick!
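Finally, since cv_results_ keeps every intermediate result (as noted in the parameter list above), a convenient way to browse it is as a table. This sketch assumes pandas is installed, which the post does not otherwise use:
import pandas as pd  # assumption: pandas is available
results = pd.DataFrame(clf.cv_results_)  # cv_results_ is a dict of arrays, one row per candidate
print(results[['param_min_samples_leaf','mean_test_score','std_test_score','rank_test_score']])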