Ensemble Learning: Random Forest Parameter Tuning

1. Overview of the Random Forest Classes in scikit-learn

    In scikit-learn, the RF classification class is RandomForestClassifier and the regression class is RandomForestRegressor. The RF variant Extra Trees is also available, as ExtraTreesClassifier for classification and ExtraTreesRegressor for regression. Since RF and Extra Trees differ only slightly and are tuned in essentially the same way, this article focuses on tuning RF.

    As with GBDT, the parameters of RF fall into two groups: the parameters of the Bagging framework and the parameters of the CART decision trees. We introduce both groups below.

2. RF Framework Parameters

    The Bagging framework parameters of RF are best understood by comparison with GBDT. As discussed in the earlier post on tuning scikit-learn's gradient boosted trees (GBDT), GBDT has quite a few framework parameters, the important ones being the maximum number of iterations, the learning rate, and the subsampling ratio, which makes tuning fairly laborious. RF is simpler, because the weak learners in a Bagging framework have no dependencies on one another, which lowers the difficulty of tuning. In other words, reaching a comparable tuning result takes less time with RF than with GBDT.
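One practical consequence of this independence is that the trees can be grown in parallel; scikit-learn exposes this through the n_jobs parameter. A minimal sketch on synthetic data (the dataset and all values here are illustrative, not from the original text):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data just for illustration
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Because the trees are built independently of one another,
# they can be fit in parallel across all CPU cores with n_jobs=-1
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X, y)
print(len(rf.estimators_))  # 200 independently grown trees
```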

    The important Bagging framework parameters of RF are listed below. Since RandomForestClassifier and RandomForestRegressor share almost all of them, they are covered together, with differences pointed out where they exist.

  1. n_estimators: the maximum number of weak learners. If n_estimators is too small, the model tends to underfit; if it is too large, the computational cost grows, and beyond a certain point adding more trees brings only a marginal improvement, so a moderate value is usually chosen. The default is 100.
  2. oob_score: whether to use out-of-bag samples to evaluate the model. The default is False. Setting it to True is recommended, because the out-of-bag score reflects the generalization ability of the fitted model.
  3. criterion: the impurity measure the CART trees use when splitting on a feature. Classification and regression use different loss functions. For classification RF, the CART classification tree defaults to the Gini index ("gini"); the alternative is information gain ("entropy"). For regression RF, the CART regression tree defaults to mean squared error ("mse"); the alternative is mean absolute error ("mae"). (Recent scikit-learn renamed these to "squared_error" and "absolute_error".) The defaults are usually a good choice.
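The three framework parameters above can be exercised together in a short sketch; the synthetic dataset and all parameter values here are illustrative, not taken from the original example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data just for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The three Bagging-framework parameters discussed above
rf = RandomForestClassifier(
    n_estimators=100,   # number of trees (the default)
    oob_score=True,     # evaluate with out-of-bag samples
    criterion="gini",   # split criterion for the CART trees
    random_state=0,
)
rf.fit(X, y)

# OOB accuracy approximates generalization ability without a held-out set
print(rf.oob_score_)
```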

    As the above shows, RF has few important framework parameters; the main one to watch is n_estimators, the maximum number of decision trees in the forest.

3. RF Decision Tree Parameters

    Next we look at the decision tree parameters of RF, which are essentially the same as those tuned for GBDT:

  1. max_features: the maximum number of features considered at each split. It accepts several kinds of values. The default "auto" (equivalent to "sqrt") considers at most √N features per split; "log2" considers at most log2(N) features. An integer gives the absolute number of features to consider; a float gives a fraction, i.e. the number of features considered is (fraction × N) rounded down. Here N is the total number of features. The default "auto" is usually fine; when there are very many features, the other options above can be used to limit the features considered per split and so control tree-building time.
  2. max_depth: the maximum tree depth. By default it is unset, in which case subtree depth is not limited during construction. With little data or few features this can generally be left alone; when the sample size and the number of features are both large, limiting the depth is recommended, with the exact value depending on the data distribution. Values between 10 and 100 are common.
  3. min_samples_split: the minimum number of samples required to split a node. This limits whether a subtree keeps splitting: a node with fewer than min_samples_split samples will not attempt to choose a best feature to split on. The default is 2. With a small sample size this can be ignored; with a very large sample size, increasing it is recommended.
  4. min_samples_leaf: the minimum number of samples in a leaf node. If a leaf ends up with fewer samples than this value, it is pruned together with its sibling. The default is 1; it accepts either an integer minimum count or a fraction of the total sample size. With a small sample size this can be ignored; with a very large sample size, increasing it is recommended.
  5. min_weight_fraction_leaf: the minimum total sample weight of a leaf node. If the sum of the weights of a leaf's samples falls below this value, the leaf is pruned together with its sibling. The default is 0, i.e. sample weights are ignored. In general, when many samples have missing values, or the class distribution of a classification tree is strongly skewed, sample weights come into play and this value deserves attention.
  6. max_leaf_nodes: the maximum number of leaf nodes. Limiting it can prevent overfitting. The default is None, i.e. no limit. When a limit is set, the algorithm builds the best tree within the allowed number of leaves. With few features it can be ignored; with many features it can be limited, and a good value can be found by cross-validation.
  7. min_impurity_split: the minimum impurity required to split a node. It limits tree growth: a node whose impurity (Gini index or mean squared error) falls below this threshold stops generating children and becomes a leaf. Changing the default of 1e-7 is generally not recommended. (This parameter has since been deprecated in scikit-learn in favor of min_impurity_decrease.)

    Of the decision tree parameters above, the most important are the maximum number of features max_features, the maximum depth max_depth, the minimum samples required to split an internal node min_samples_split, and the minimum samples per leaf min_samples_leaf.
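The effect of these tree parameters can be made visible by comparing an unconstrained forest against a constrained one; this is a sketch on synthetic data, and every value below is illustrative rather than taken from the original example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data just for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Unconstrained: trees grow until leaves are pure
deep = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Constrained by the tree parameters discussed above
shallow = RandomForestClassifier(
    n_estimators=20,
    max_depth=4,            # cap tree depth
    min_samples_split=50,   # a node needs >= 50 samples to split
    min_samples_leaf=20,    # every leaf keeps >= 20 samples
    random_state=0,
).fit(X, y)

# Average node count per tree: the constrained forest grows much smaller trees
avg_nodes = lambda forest: np.mean([t.tree_.node_count for t in forest.estimators_])
print(avg_nodes(deep), avg_nodes(shallow))
```

Smaller trees fit the training data less tightly, which is exactly how these parameters trade training accuracy for generalization.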

4. An RF Tuning Example

    First, load the libraries:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# grid_search and cross_validation were removed from scikit-learn;
# GridSearchCV and train_test_split now live in sklearn.model_selection
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import metrics

import matplotlib.pylab as plt
%matplotlib inline

    Read in the data and fit with all defaults, leaving every parameter alone:

# Hold-out split; sourse_x / sourse_y are the features and labels loaded earlier
train_x, test_x, train_y, test_y = train_test_split(sourse_x,
                                                    sourse_y,
                                                    train_size=0.8,
                                                    random_state=0)
rf0 = RandomForestClassifier(oob_score=True, random_state=10)
rf0.fit(train_x, train_y)
print(rf0.oob_score_)
y_predprob = rf0.predict_proba(test_x)[:, 1]
# Despite the label, this AUC is computed on the held-out test set
print("AUC Score (Train): %f" % metrics.roc_auc_score(test_y, y_predprob))

    The output follows. The out-of-bag score is not very high, while the AUC is relatively high.

0.7612359550561798
AUC Score (Train): 0.855797

    First, grid-search over n_estimators:

param_test1 = {'n_estimators': list(range(10, 80, 10))}
gsearch1 = GridSearchCV(estimator=RandomForestClassifier(min_samples_split=100,
                                  min_samples_leaf=20, max_depth=8,
                                  max_features='sqrt', random_state=10),
                        param_grid=param_test1, scoring='roc_auc', cv=5)
gsearch1.fit(train_x, train_y)
# grid_scores_ comes from the legacy API; on current scikit-learn use cv_results_
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

    The output is as follows:

([mean: 0.84443, std: 0.04852, params: {'n_estimators': 10},
  mean: 0.84513, std: 0.04432, params: {'n_estimators': 20},
  mean: 0.84728, std: 0.04408, params: {'n_estimators': 30},
  mean: 0.84678, std: 0.04385, params: {'n_estimators': 40},
  mean: 0.84894, std: 0.04377, params: {'n_estimators': 50},
  mean: 0.84892, std: 0.04445, params: {'n_estimators': 60},
  mean: 0.84892, std: 0.04419, params: {'n_estimators': 70}],
 {'n_estimators': 50},
 0.8489386572851073)

    This gives the best number of weak learners. Next we grid-search over the maximum tree depth max_depth and the minimum samples required to split an internal node min_samples_split.

param_test2 = {'max_depth': list(range(3, 14, 2)), 'min_samples_split': list(range(50, 201, 20))}
gsearch2 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=50,
                                  min_samples_leaf=20, max_features='sqrt',
                                  oob_score=True, random_state=10),
                        # iid was removed in scikit-learn 0.24; drop it on current versions
                        param_grid=param_test2, scoring='roc_auc', iid=False, cv=5)
gsearch2.fit(train_x, train_y)
# grid_scores_ comes from the legacy API; on current scikit-learn use cv_results_
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_

    The output is as follows:

([mean: 0.85061, std: 0.04207, params: {'max_depth': 3, 'min_samples_split': 50},
  mean: 0.85007, std: 0.04378, params: {'max_depth': 3, 'min_samples_split': 70},
  mean: 0.85069, std: 0.04298, params: {'max_depth': 3, 'min_samples_split': 90},
  mean: 0.84911, std: 0.04400, params: {'max_depth': 3, 'min_samples_split': 110},
  mean: 0.84819, std: 0.04332, params: {'max_depth': 3, 'min_samples_split': 130},
  mean: 0.84463, std: 0.04037, params: {'max_depth': 3, 'min_samples_split': 150},
  mean: 0.84255, std: 0.04089, params: {'max_depth': 3, 'min_samples_split': 170},
  mean: 0.83875, std: 0.04485, params: {'max_depth': 3, 'min_samples_split': 190},
  mean: 0.84958, std: 0.04185, params: {'max_depth': 5, 'min_samples_split': 50},
  mean: 0.85120, std: 0.04104, params: {'max_depth': 5, 'min_samples_split': 70},
  mean: 0.85097, std: 0.04297, params: {'max_depth': 5, 'min_samples_split': 90},
  mean: 0.84828, std: 0.04357, params: {'max_depth': 5, 'min_samples_split': 110},
  mean: 0.84692, std: 0.04469, params: {'max_depth': 5, 'min_samples_split': 130},
  mean: 0.84405, std: 0.04084, params: {'max_depth': 5, 'min_samples_split': 150},
  mean: 0.84208, std: 0.04236, params: {'max_depth': 5, 'min_samples_split': 170},
  mean: 0.83856, std: 0.04559, params: {'max_depth': 5, 'min_samples_split': 190},
  mean: 0.84924, std: 0.04304, params: {'max_depth': 7, 'min_samples_split': 50},
  mean: 0.85186, std: 0.04071, params: {'max_depth': 7, 'min_samples_split': 70},
  mean: 0.85149, std: 0.04402, params: {'max_depth': 7, 'min_samples_split': 90},
  mean: 0.84724, std: 0.04356, params: {'max_depth': 7, 'min_samples_split': 110},
  mean: 0.84688, std: 0.04493, params: {'max_depth': 7, 'min_samples_split': 130},
  mean: 0.84409, std: 0.04090, params: {'max_depth': 7, 'min_samples_split': 150},
  mean: 0.84208, std: 0.04236, params: {'max_depth': 7, 'min_samples_split': 170},
  mean: 0.83856, std: 0.04559, params: {'max_depth': 7, 'min_samples_split': 190},
  mean: 0.84887, std: 0.04300, params: {'max_depth': 9, 'min_samples_split': 50},
  mean: 0.85194, std: 0.04078, params: {'max_depth': 9, 'min_samples_split': 70},
  mean: 0.85158, std: 0.04410, params: {'max_depth': 9, 'min_samples_split': 90},
  mean: 0.84724, std: 0.04356, params: {'max_depth': 9, 'min_samples_split': 110},
  mean: 0.84688, std: 0.04493, params: {'max_depth': 9, 'min_samples_split': 130},
  mean: 0.84409, std: 0.04090, params: {'max_depth': 9, 'min_samples_split': 150},
  mean: 0.84208, std: 0.04236, params: {'max_depth': 9, 'min_samples_split': 170},
  mean: 0.83856, std: 0.04559, params: {'max_depth': 9, 'min_samples_split': 190},
  mean: 0.84887, std: 0.04300, params: {'max_depth': 11, 'min_samples_split': 50},
  mean: 0.85194, std: 0.04078, params: {'max_depth': 11, 'min_samples_split': 70},
  mean: 0.85158, std: 0.04410, params: {'max_depth': 11, 'min_samples_split': 90},
  mean: 0.84724, std: 0.04356, params: {'max_depth': 11, 'min_samples_split': 110},
  mean: 0.84688, std: 0.04493, params: {'max_depth': 11, 'min_samples_split': 130},
  mean: 0.84409, std: 0.04090, params: {'max_depth': 11, 'min_samples_split': 150},
  mean: 0.84208, std: 0.04236, params: {'max_depth': 11, 'min_samples_split': 170},
  mean: 0.83856, std: 0.04559, params: {'max_depth': 11, 'min_samples_split': 190},
  mean: 0.84887, std: 0.04300, params: {'max_depth': 13, 'min_samples_split': 50},
  mean: 0.85194, std: 0.04078, params: {'max_depth': 13, 'min_samples_split': 70},
  mean: 0.85158, std: 0.04410, params: {'max_depth': 13, 'min_samples_split': 90},
  mean: 0.84724, std: 0.04356, params: {'max_depth': 13, 'min_samples_split': 110},
  mean: 0.84688, std: 0.04493, params: {'max_depth': 13, 'min_samples_split': 130},
  mean: 0.84409, std: 0.04090, params: {'max_depth': 13, 'min_samples_split': 150},
  mean: 0.84208, std: 0.04236, params: {'max_depth': 13, 'min_samples_split': 170},
  mean: 0.83856, std: 0.04559, params: {'max_depth': 13, 'min_samples_split': 190}],
 {'max_depth': 9, 'min_samples_split': 70},
 0.85194030165817)

    Let's check the out-of-bag score of the current model:

rf1 = RandomForestClassifier(n_estimators=50, max_depth=9, min_samples_split=70,
                             min_samples_leaf=20, max_features='sqrt',
                             oob_score=True, random_state=10)
rf1.fit(train_x, train_y)
print(rf1.oob_score_)
y_predprob1 = rf1.predict_proba(test_x)[:, 1]
# Despite the label, this AUC is computed on the held-out test set
print("AUC Score (Train): %f" % metrics.roc_auc_score(test_y, y_predprob1))

    The output is as follows:

0.8132022471910112
AUC Score (Train): 0.889526

    The out-of-bag score has improved somewhat, i.e. the model's generalization ability has increased.

    We cannot fix min_samples_split yet, because it interacts with the other tree parameters. Next we tune min_samples_split together with the minimum samples per leaf, min_samples_leaf.

param_test3 = {'min_samples_split': list(range(2, 10, 1)), 'min_samples_leaf': list(range(2, 10, 1))}
gsearch3 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=50, max_depth=9,
                                  max_features='sqrt',
                                  oob_score=True, random_state=10),
                        # iid was removed in scikit-learn 0.24; drop it on current versions
                        param_grid=param_test3, scoring='roc_auc', iid=False, cv=5)
gsearch3.fit(train_x, train_y)
# grid_scores_ comes from the legacy API; on current scikit-learn use cv_results_
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_

    The output is as follows:

([mean: 0.85195, std: 0.03518, params: {'min_samples_leaf': 2, 'min_samples_split': 2},
  mean: 0.85195, std: 0.03518, params: {'min_samples_leaf': 2, 'min_samples_split': 3},
  mean: 0.85195, std: 0.03518, params: {'min_samples_leaf': 2, 'min_samples_split': 4},
  mean: 0.84882, std: 0.03837, params: {'min_samples_leaf': 2, 'min_samples_split': 5},
  mean: 0.85731, std: 0.03290, params: {'min_samples_leaf': 2, 'min_samples_split': 6},
  mean: 0.85402, std: 0.03649, params: {'min_samples_leaf': 2, 'min_samples_split': 7},
  mean: 0.85446, std: 0.03748, params: {'min_samples_leaf': 2, 'min_samples_split': 8},
  mean: 0.85543, std: 0.03394, params: {'min_samples_leaf': 2, 'min_samples_split': 9},
  mean: 0.85799, std: 0.03434, params: {'min_samples_leaf': 3, 'min_samples_split': 2},
  mean: 0.85799, std: 0.03434, params: {'min_samples_leaf': 3, 'min_samples_split': 3},
  mean: 0.85799, std: 0.03434, params: {'min_samples_leaf': 3, 'min_samples_split': 4},
  mean: 0.85799, std: 0.03434, params: {'min_samples_leaf': 3, 'min_samples_split': 5},
  mean: 0.85799, std: 0.03434, params: {'min_samples_leaf': 3, 'min_samples_split': 6},
  mean: 0.85374, std: 0.03446, params: {'min_samples_leaf': 3, 'min_samples_split': 7},
  mean: 0.85442, std: 0.03244, params: {'min_samples_leaf': 3, 'min_samples_split': 8},
  mean: 0.85389, std: 0.03766, params: {'min_samples_leaf': 3, 'min_samples_split': 9},
  mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 2},
  mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 3},
  mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 4},
  mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 5},
  mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 6},
  mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 7},
  mean: 0.85469, std: 0.03536, params: {'min_samples_leaf': 4, 'min_samples_split': 8},
  mean: 0.85722, std: 0.03514, params: {'min_samples_leaf': 4, 'min_samples_split': 9},
  mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 2},
  mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 3},
  mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 4},
  mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 5},
  mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 6},
  mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 7},
  mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 8},
  mean: 0.85888, std: 0.03703, params: {'min_samples_leaf': 5, 'min_samples_split': 9},
  mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 2},
  mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 3},
  mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 4},
  mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 5},
  mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 6},
  mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 7},
  mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 8},
  mean: 0.86041, std: 0.03527, params: {'min_samples_leaf': 6, 'min_samples_split': 9},
  mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 2},
  mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 3},
  mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 4},
  mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 5},
  mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 6},
  mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 7},
  mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 8},
  mean: 0.85969, std: 0.03383, params: {'min_samples_leaf': 7, 'min_samples_split': 9},
  mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 2},
  mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 3},
  mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 4},
  mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 5},
  mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 6},
  mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 7},
  mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 8},
  mean: 0.85937, std: 0.03463, params: {'min_samples_leaf': 8, 'min_samples_split': 9},
  mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 2},
  mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 3},
  mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 4},
  mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 5},
  mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 6},
  mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 7},
  mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 8},
  mean: 0.85993, std: 0.03915, params: {'min_samples_leaf': 9, 'min_samples_split': 9}],
 {'min_samples_leaf': 6, 'min_samples_split': 2},
 0.8604141554873008)

    Finally, we tune the maximum number of features, max_features:

param_test4 = {'max_features': list(range(3, 11, 1))}
gsearch4 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=50, max_depth=9,
                                  min_samples_split=2, min_samples_leaf=6,
                                  oob_score=True, random_state=10),
                        # iid was removed in scikit-learn 0.24; drop it on current versions
                        param_grid=param_test4, scoring='roc_auc', iid=False, cv=5)
gsearch4.fit(train_x, train_y)
# grid_scores_ comes from the legacy API; on current scikit-learn use cv_results_
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_

    The output is as follows:

([mean: 0.85914, std: 0.03765, params: {'max_features': 3},
  mean: 0.85764, std: 0.03818, params: {'max_features': 4},
  mean: 0.86041, std: 0.03527, params: {'max_features': 5},
  mean: 0.85943, std: 0.03136, params: {'max_features': 6},
  mean: 0.85747, std: 0.03662, params: {'max_features': 7},
  mean: 0.85878, std: 0.03528, params: {'max_features': 8},
  mean: 0.85898, std: 0.03439, params: {'max_features': 9},
  mean: 0.85819, std: 0.03643, params: {'max_features': 10}],
 {'max_features': 5},
 0.8604141554873008)

    Fit the final model with the best parameters found:

rf2 = RandomForestClassifier(n_estimators=50, max_depth=9, min_samples_split=2,
                             min_samples_leaf=6, max_features=5,
                             oob_score=True, random_state=10)
rf2.fit(train_x, train_y)
print(rf2.oob_score_)
y_predprob2 = rf2.predict_proba(test_x)[:, 1]
# Despite the label, this AUC is computed on the held-out test set
print("AUC Score (Train): %f" % metrics.roc_auc_score(test_y, y_predprob2))

    The output is as follows:

0.8174157303370787
AUC Score (Train): 0.889592

    From the output above, the out-of-bag score improves only slightly at this point, which may be a property of this particular dataset.

    That concludes this summary of RF tuning; if anything here is wrong, corrections are welcome.
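The searches above report grid_scores_, an attribute of older scikit-learn versions. As a closing note, a minimal sketch of the first search rewritten against the current sklearn.model_selection API (on synthetic data; the dataset and values are illustrative, not the ones used above) might look like:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data just for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

gsearch = GridSearchCV(
    estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20,
                                     max_depth=8, max_features='sqrt',
                                     random_state=10),
    param_grid={'n_estimators': list(range(10, 80, 10))},
    scoring='roc_auc', cv=5)
gsearch.fit(X, y)

print(gsearch.best_params_, gsearch.best_score_)
# Per-candidate means and stds now live in cv_results_
print(gsearch.cv_results_['mean_test_score'])
```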
