In scikit-learn, the RF classification class is RandomForestClassifier and the regression class is RandomForestRegressor. The RF variant Extra Trees is also available, with the classification class ExtraTreesClassifier and the regression class ExtraTreesRegressor. Since RF and Extra Trees differ only slightly and are tuned in essentially the same way, this article focuses on tuning RF.
As with GBDT, the parameters to tune for RF fall into two parts: the parameters of the Bagging framework, and the parameters of the CART decision trees. We introduce both below.
RF's Bagging framework parameters are best understood by comparison with GBDT. As covered in the companion post on tuning scikit-learn's gradient boosted trees (GBDT), GBDT has many framework parameters, the important ones being the maximum number of iterations, the learning rate, and the subsample ratio, which makes tuning fairly laborious. RF is simpler, because the weak learners in a bagging framework have no dependencies on one another, which reduces the difficulty of tuning. In other words, reaching a comparable tuning result takes less time with RF than with GBDT.
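The independence of the trees also means they can be fitted in parallel. A minimal sketch on synthetic data (the dataset and its sizes here are illustrative assumptions, not part of the original example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary-classification data (sizes are illustrative assumptions)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Because the trees are independent, n_jobs=-1 fits them on all CPU cores
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X, y)
print(len(rf.estimators_))  # the fitted forest holds 100 independent trees
```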
As for RF's important Bagging framework parameters: RandomForestClassifier and RandomForestRegressor share the vast majority of them, so they are discussed together here, with differences pointed out where they exist.
As can be seen, RF has relatively few important framework parameters; the main one to watch is n_estimators, the maximum number of decision trees in the forest.
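To illustrate why n_estimators is the main framework knob, the following sketch (synthetic data and the tree counts are assumptions for illustration) watches the out-of-bag score level off as trees are added:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# OOB accuracy per forest size; gains typically flatten out as trees are added
oob = {}
for n in (25, 50, 100, 200):
    rf = RandomForestClassifier(n_estimators=n, oob_score=True, random_state=0)
    rf.fit(X, y)
    oob[n] = rf.oob_score_

print(oob)
```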
Next we look at RF's decision-tree parameters, which are essentially the same as those tuned for GBDT.
The most important of these decision-tree parameters are the maximum number of features max_features, the maximum depth max_depth, the minimum number of samples required to split an internal node min_samples_split, and the minimum number of samples per leaf min_samples_leaf.
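A minimal sketch of setting these four tree-level parameters together (synthetic data; the specific values are arbitrary starting points, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The parameter values below are arbitrary illustrations, not tuned choices
rf = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',     # features considered at each split
    max_depth=8,             # caps how deep each tree may grow
    min_samples_split=20,    # a node needs this many samples to be split
    min_samples_leaf=5,      # every leaf keeps at least this many samples
    random_state=0,
)
rf.fit(X, y)
print(rf.score(X, y))
```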
First, load the libraries:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in 0.20
from sklearn import metrics
from sklearn.model_selection import train_test_split
import matplotlib.pylab as plt
%matplotlib inline
Read in the data and, ignoring all parameters for now, fit with the defaults:
# hold-out split
train_x, test_x, train_y, test_y = train_test_split(sourse_x,
                                                    sourse_y,
                                                    train_size=.8,
                                                    random_state=0)
rf0 = RandomForestClassifier(oob_score=True, random_state=10)
rf0.fit(train_x, train_y)
print(rf0.oob_score_)
y_predprob = rf0.predict_proba(test_x)[:, 1]
print("AUC Score (Test): %f" % metrics.roc_auc_score(test_y, y_predprob))
The output is shown below. The out-of-bag score is not very high, while the AUC score is comparatively good.
0.7612359550561798
AUC Score (Test): 0.855797
First, grid-search over n_estimators:
param_test1 = {'n_estimators': list(range(10, 80, 10))}
gsearch1 = GridSearchCV(estimator=RandomForestClassifier(min_samples_split=100,
                                                         min_samples_leaf=20, max_depth=8,
                                                         max_features='sqrt', random_state=10),
                        param_grid=param_test1, scoring='roc_auc', cv=5)
gsearch1.fit(train_x, train_y)
# cv_results_ replaces grid_scores_, which was removed in sklearn 0.20
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_
The output, summarized as the cross-validated mean, standard deviation, and parameters for each setting:
([mean: 0.84443, std: 0.04852, params: {'n_estimators': 10},
mean: 0.84513, std: 0.04432, params: {'n_estimators': 20},
mean: 0.84728, std: 0.04408, params: {'n_estimators': 30},
mean: 0.84678, std: 0.04385, params: {'n_estimators': 40},
mean: 0.84894, std: 0.04377, params: {'n_estimators': 50},
mean: 0.84892, std: 0.04445, params: {'n_estimators': 60},
mean: 0.84892, std: 0.04419, params: {'n_estimators': 70}],
{'n_estimators': 50},
0.8489386572851073)
This gives us the best number of weak learners. Next we grid-search over the maximum tree depth max_depth and the minimum samples required to split an internal node, min_samples_split.
param_test2 = {'max_depth': list(range(3, 14, 2)), 'min_samples_split': list(range(50, 201, 20))}
gsearch2 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=50,
                                                         min_samples_leaf=20, max_features='sqrt',
                                                         oob_score=True, random_state=10),
                        param_grid=param_test2, scoring='roc_auc', cv=5)
gsearch2.fit(train_x, train_y)
gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_
The output, summarized as the cross-validated mean, standard deviation, and parameters for each setting:
([mean: 0.85061, std: 0.04207, params: {'max_depth': 3, 'min_samples_split': 50},
mean: 0.85007, std: 0.04378, params: {'max_depth': 3, 'min_samples_split': 70},
mean: 0.85069, std: 0.04298, params: {'max_depth': 3, 'min_samples_split': 90},
mean: 0.84911, std: 0.04400, params: {'max_depth': 3, 'min_samples_split': 110},
mean: 0.84819, std: 0.04332, params: {'max_depth': 3, 'min_samples_split': 130},
mean: 0.84463, std: 0.04037, params: {'max_depth': 3, 'min_samples_split': 150},
mean: 0.84255, std: 0.04089, params: {'max_depth': 3, 'min_samples_split': 170},
mean: 0.83875, std: 0.04485, params: {'max_depth': 3, 'min_samples_split': 190},
mean: 0.84958, std: 0.04185, params: {'max_depth': 5, 'min_samples_split': 50},
mean: 0.85120, std: 0.04104, params: {'max_depth': 5, 'min_samples_split': 70},
mean: 0.85097, std: 0.04297, params: {'max_depth': 5, 'min_samples_split': 90},
mean: 0.84828, std: 0.04357, params: {'max_depth': 5, 'min_samples_split': 110},
mean: 0.84692, std: 0.04469, params: {'max_depth': 5, 'min_samples_split': 130},
mean: 0.84405, std: 0.04084, params: {'max_depth': 5, 'min_samples_split': 150},
mean: 0.84208, std: 0.04236, params: {'max_depth': 5, 'min_samples_split': 170},
mean: 0.83856, std: 0.04559, params: {'max_depth': 5, 'min_samples_split': 190},
mean: 0.84924, std: 0.04304, params: {'max_depth': 7, 'min_samples_split': 50},
mean: 0.85186, std: 0.04071, params: {'max_depth': 7, 'min_samples_split': 70},
mean: 0.85149, std: 0.04402, params: {'max_depth': 7, 'min_samples_split': 90},
mean: 0.84724, std: 0.04356, params: {'max_depth': 7, 'min_samples_split': 110},
mean: 0.84688, std: 0.04493, params: {'max_depth': 7, 'min_samples_split': 130},
mean: 0.84409, std: 0.04090, params: {'max_depth': 7, 'min_samples_split': 150},
mean: 0.84208, std: 0.04236, params: {'max_depth': 7, 'min_samples_split': 170},
mean: 0.83856, std: 0.04559, params: {'max_depth': 7, 'min_samples_split': 190},
mean: 0.84887, std: 0.04300, params: {'max_depth': 9, 'min_samples_split': 50},
mean: 0.85194, std: 0.04078, params: {'max_depth': 9, 'min_samples_split': 70},
mean: 0.85158, std: 0.04410, params: {'max_depth': 9, 'min_samples_split': 90},
mean: 0.84724, std: 0.04356, params: {'max_depth': 9, 'min_samples_split': 110},
mean: 0.84688, std: 0.04493, params: {'max_depth': 9, 'min_samples_split': 130},
mean: 0.84409, std: 0.04090, params: {'max_depth': 9, 'min_samples_split': 150},
mean: 0.84208, std: 0.04236, params: {'max_depth': 9, 'min_samples_split': 170},
mean: 0.83856, std: 0.04559, params: {'max_depth': 9, 'min_samples_split': 190},
mean: 0.84887, std: 0.04300, params: {'max_depth': 11, 'min_samples_split': 50},
mean: 0.85194, std: 0.04078, params: {'max_depth': 11, 'min_samples_split': 70},
mean: 0.85158, std: 0.04410, params: {'max_depth': 11, 'min_samples_split': 90},
mean: 0.84724, std: 0.04356, params: {'max_depth': 11, 'min_samples_split': 110},
mean: 0.84688, std: 0.04493, params: {'max_depth': 11, 'min_samples_split': 130},
mean: 0.84409, std: 0.04090, params: {'max_depth': 11, 'min_samples_split': 150},
mean: 0.84208, std: 0.04236, params: {'max_depth': 11, 'min_samples_split': 170},
mean: 0.83856, std: 0.04559, params: {'max_depth': 11, 'min_samples_split': 190},
mean: 0.84887, std: 0.04300, params: {'max_depth': 13, 'min_samples_split': 50},
mean: 0.85194, std: 0.04078, params: {'max_depth': 13, 'min_samples_split': 70},
mean: 0.85158, std: 0.04410, params: {'max_depth': 13, 'min_samples_split': 90},
mean: 0.84724, std: 0.04356, params: {'max_depth': 13, 'min_samples_split': 110},
mean: 0.84688, std: 0.04493, params: {'max_depth': 13, 'min_samples_split': 130},
mean: 0.84409, std: 0.04090, params: {'max_depth': 13, 'min_samples_split': 150},
mean: 0.84208, std: 0.04236, params: {'max_depth': 13, 'min_samples_split': 170},
mean: 0.83856, std: 0.04559, params: {'max_depth': 13, 'min_samples_split': 190}],
{'max_depth': 9, 'min_samples_split': 70},
0.85194030165817)
Let's look at the out-of-bag score of the current model:
rf1 = RandomForestClassifier(n_estimators=50, max_depth=9, min_samples_split=70,
                             min_samples_leaf=20, max_features='sqrt', oob_score=True, random_state=10)
rf1.fit(train_x, train_y)
print(rf1.oob_score_)
y_predprob1 = rf1.predict_proba(test_x)[:, 1]
print("AUC Score (Test): %f" % metrics.roc_auc_score(test_y, y_predprob1))
The output:
0.8132022471910112
AUC Score (Test): 0.889526
The out-of-bag score has clearly improved, which is to say the model's generalization ability has increased.
We cannot settle min_samples_split yet, because it interacts with the other decision-tree parameters. Next we tune min_samples_split together with the minimum samples per leaf, min_samples_leaf.
param_test3 = {'min_samples_split': list(range(2, 10, 1)), 'min_samples_leaf': list(range(2, 10, 1))}
gsearch3 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=50, max_depth=9,
                                                         max_features='sqrt', oob_score=True, random_state=10),
                        param_grid=param_test3, scoring='roc_auc', cv=5)
gsearch3.fit(train_x, train_y)
gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_
The output, summarized as the cross-validated mean, standard deviation, and parameters for each setting:
([mean: 0.85195, std: 0.03518, params: {'min_samples_leaf': 2, 'min_samples_split': 2},
mean: 0.85195, std: 0.03518, params: {'min_samples_leaf': 2, 'min_samples_split': 3},
mean: 0.85195, std: 0.03518, params: {'min_samples_leaf': 2, 'min_samples_split': 4},
mean: 0.84882, std: 0.03837, params: {'min_samples_leaf': 2, 'min_samples_split': 5},
mean: 0.85731, std: 0.03290, params: {'min_samples_leaf': 2, 'min_samples_split': 6},
mean: 0.85402, std: 0.03649, params: {'min_samples_leaf': 2, 'min_samples_split': 7},
mean: 0.85446, std: 0.03748, params: {'min_samples_leaf': 2, 'min_samples_split': 8},
mean: 0.85543, std: 0.03394, params: {'min_samples_leaf': 2, 'min_samples_split': 9},
mean: 0.85799, std: 0.03434, params: {'min_samples_leaf': 3, 'min_samples_split': 2},
mean: 0.85799, std: 0.03434, params: {'min_samples_leaf': 3, 'min_samples_split': 3},
mean: 0.85799, std: 0.03434, params: {'min_samples_leaf': 3, 'min_samples_split': 4},
mean: 0.85799, std: 0.03434, params: {'min_samples_leaf': 3, 'min_samples_split': 5},
mean: 0.85799, std: 0.03434, params: {'min_samples_leaf': 3, 'min_samples_split': 6},
mean: 0.85374, std: 0.03446, params: {'min_samples_leaf': 3, 'min_samples_split': 7},
mean: 0.85442, std: 0.03244, params: {'min_samples_leaf': 3, 'min_samples_split': 8},
mean: 0.85389, std: 0.03766, params: {'min_samples_leaf': 3, 'min_samples_split': 9},
mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 2},
mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 3},
mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 4},
mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 5},
mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 6},
mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 7},
mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 8},
mean: 0.85722, std: 0.03514, params: {'min_samples_leaf': 4, 'min_samples_split': 9},
mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 2},
mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 3},
mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 4},
mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 5},
mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 6},
mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 7},
mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 8},
mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 9},
mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 2},
mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 3},
mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 4},
mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 5},
mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 6},
mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 7},
mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 8},
mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 9},
mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 2},
mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 3},
mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 4},
mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 5},
mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 6},
mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 7},
mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 8},
mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 9},
mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 2},
mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 3},
mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 4},
mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 5},
mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 6},
mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 7},
mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 8},
mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 9},
mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 2},
mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 3},
mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 4},
mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 5},
mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 6},
mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 7},
mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 8},
mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 9}],
{'min_samples_leaf': 6, 'min_samples_split': 2},
0.8604141554873008)
Finally, we tune the maximum number of features, max_features:
param_test4 = {'max_features': list(range(3, 11, 1))}
gsearch4 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=50, max_depth=9, min_samples_split=2,
                                                         min_samples_leaf=6, oob_score=True, random_state=10),
                        param_grid=param_test4, scoring='roc_auc', cv=5)
gsearch4.fit(train_x, train_y)
gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_
The output, summarized as the cross-validated mean, standard deviation, and parameters for each setting:
([mean: 0.85914, std: 0.03765, params: {'max_features': 3},
mean: 0.85764, std: 0.03818, params: {'max_features': 4},
mean: 0.86041, std: 0.03527, params: {'max_features': 5},
mean: 0.85943, std: 0.03136, params: {'max_features': 6},
mean: 0.85747, std: 0.03662, params: {'max_features': 7},
mean: 0.85878, std: 0.03528, params: {'max_features': 8},
mean: 0.85898, std: 0.03439, params: {'max_features': 9},
mean: 0.85819, std: 0.03643, params: {'max_features': 10}],
{'max_features': 5},
0.8604141554873008)
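The search above sweeps integer values, but max_features also accepts the named rules 'sqrt' and 'log2', or a float interpreted as a fraction of the feature count. A quick sketch on synthetic data (the dataset and values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# int = exact count; 'sqrt'/'log2' = derived from n_features; float = fraction
for mf in (5, 'sqrt', 'log2', 0.5):
    rf = RandomForestClassifier(n_estimators=20, max_features=mf, random_state=0)
    rf.fit(X, y)
    print(mf, rf.score(X, y))
```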
Fit the final model with the best parameters found by the search:
rf2 = RandomForestClassifier(n_estimators=50, max_depth=9, min_samples_split=2,
                             min_samples_leaf=6, max_features=5, oob_score=True, random_state=10)
rf2.fit(train_x, train_y)
print(rf2.oob_score_)
y_predprob2 = rf2.predict_proba(test_x)[:, 1]
print("AUC Score (Test): %f" % metrics.roc_auc_score(test_y, y_predprob2))
The output:
0.8174157303370787
AUC Score (Test): 0.889592
From the output above we can see that the out-of-bag score improves only slightly at this point, which may be a property of the data.
This concludes the summary of RF tuning; if anything here is wrong, I hope readers will point it out.
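As a closing note, in current scikit-learn the per-setting means and standard deviations listed throughout this post come out of GridSearchCV's cv_results_ dictionary (grid_scores_ no longer exists). A self-contained sketch on synthetic data (dataset and grid values are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

gs = GridSearchCV(
    estimator=RandomForestClassifier(random_state=10),
    param_grid={'n_estimators': [10, 30, 50]},
    scoring='roc_auc',
    cv=5,
)
gs.fit(X, y)

# cv_results_ holds the same mean/std/params that grid_scores_ used to list
for mean, std, params in zip(gs.cv_results_['mean_test_score'],
                             gs.cv_results_['std_test_score'],
                             gs.cv_results_['params']):
    print("mean: %.5f, std: %.5f, params: %s" % (mean, std, params))
print(gs.best_params_, gs.best_score_)
```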