Getting Started with RF, GBDT, and XGBoost

1. Algorithm Introduction

RF (Random Forest), GBDT (Gradient Boosting Decision Tree), and XGBoost (eXtreme Gradient Boosting) all belong to the family of ensemble learning methods in machine learning.

Ensemble learning: accomplishing a learning task by building and combining multiple learners; it is also sometimes called a multi-classifier system or committee-based learning. (From Zhou Zhihua, "Machine Learning")

An ensemble may contain individual learners of the same type or of different types. Individual learners in a homogeneous ensemble are called "base learners", while those in a heterogeneous ensemble are called "component learners". The three algorithms discussed here all belong to the former.

Based on how the individual learners are generated, homogeneous ensemble learning falls roughly into two classes: sequential methods, in which strong dependencies exist between the individual learners and they must be generated one after another (Boosting); and parallel methods, in which no strong dependencies exist and the learners can be generated simultaneously (Bagging and Random Forest). The three common algorithms discussed here, drawn from these two classes, are as follows:

A random forest (RF) is a classifier made up of multiple decision trees; its output class is the mode of the classes output by the individual trees. (From Wikipedia)

Boosting improves classification performance by adjusting the weights of the training samples, learning multiple classifiers, and combining them linearly. When the loss function is the exponential loss (as in AdaBoost) or the squared-error loss (as in boosting trees), each optimization step is simple; for a general loss function, however, each step is usually not so easy. To address this, Friedman proposed the gradient boosting (GB) algorithm, which uses the value of the negative gradient of the loss function at the current model as an approximation of the residual. The most typical base learner in GB is the decision tree, especially CART, and GBDT is precisely the gradient boosting method that uses CART as its base learner. XGBoost is a more efficient and more fine-grained implementation of GBDT and has performed very well in many machine-learning competitions.
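To make the idea of "fitting trees to the negative gradient" concrete, here is a minimal sketch of gradient boosting for regression under squared-error loss, where the negative gradient is exactly the residual. It hand-rolls the boosting loop with sklearn's DecisionTreeRegressor as the CART base learner; the dataset and values such as n_rounds and lr are purely illustrative, not taken from any of the libraries discussed below.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

n_rounds, lr = 50, 0.1             # boosting rounds and learning rate (shrinkage)
F = np.full_like(y, y.mean())      # initial model: predict the mean everywhere
trees = []

for _ in range(n_rounds):
    residual = y - F                                             # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)   # CART fit to the residuals
    F += lr * tree.predict(X)                                    # shrink each tree's contribution
    trees.append(tree)

print("training MSE:", np.mean((y - F) ** 2))
```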

2. Environment Setup

(1) sklearn

sklearn, a very popular machine-learning library for Python, provides classification and regression implementations of both RF and GBDT:

GradientBoostingClassifier  |  GradientBoostingRegressor

RandomForestClassifier  |  RandomForestRegressor
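As a quick check that the environment works (scikit-learn can be installed with pip install scikit-learn), both classes follow the standard fit/predict interface. The toy dataset and hyperparameter values below are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Illustrative toy data; replace with your own training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

for name, model in [("RF", rf), ("GBDT", gbdt)]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```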

(2) XGBoost

XGBoost provides interfaces for multiple languages, including Python, R, the JVM, Julia, and more. Even better, it offers function signatures and calling conventions compatible with sklearn:

XGBClassifier | XGBRegressor
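Assuming the Python package is installed (for example via pip install xgboost), the sklearn-compatible wrapper drops into the same fit/predict workflow; the parameter values below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Same interface as the sklearn estimators above.
clf = XGBClassifier(n_estimators=100, learning_rate=0.3, max_depth=6,
                    objective="binary:logistic")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```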

3. Parameter Details

Taking the classifiers as an example, only the most commonly used parameters are listed here; everything is in the table below.

Parameter comparison of RF, GBDT, and XGBoost

| Parameter | RF | GBDT | XGBoost | Notes |
| --- | --- | --- | --- | --- |
| booster | None | None | Which booster to use. default=gbtree; can be gblinear or dart | When booster=gbtree, XGBoost is like RF and GBDT: the base learner is CART |
| verbose / silent | [verbose] Controls the verbosity of the tree building process. default=0 | [verbose] Enable verbose output. default=0. If 1, progress and performance are printed once in a while (the more trees, the lower the frequency); if greater than 1, they are printed for every tree | [silent] default=0. 0 means printing running messages, 1 means silent mode | |
| n_jobs | [n_jobs] The number of jobs to run in parallel for both fit and predict. default=1. If -1, the number of jobs is set to the number of cores | None | [n_jobs] Number of parallel threads used to run XGBoost (replaces nthread). Defaults to the maximum number of threads available if not set | Only GBDT cannot be parallelized |
| random_state | RandomState instance or None, optional. default=None | Same as RF | Random number seed (replaces seed). default=0 | |
| loss / objective | None | [loss] Loss function to be optimized. default='deviance'; can be 'exponential' | [objective] Specify the learning task and the corresponding learning objective, or a custom objective function to be used. default='binary:logistic'; can be 'reg:logistic', 'reg:linear', etc. | Why does RF not have this? Its trees are grown independently on bootstrap samples, so there is no stage-wise loss to optimize |
| criterion | The function to measure the quality of a split. default='gini' | Same as RF, but default='friedman_mse' | None | Why does XGBoost not have this? Its splits are scored by the gain derived from the objective's gradients, not by an impurity criterion |
| learning_rate | None | Learning rate shrinks the contribution of each tree by learning_rate. default=0.1 | Boosting learning rate (replaces eta). default=0.3 | learning_rate and n_estimators trade off against each other: the smaller the learning rate, the more trees are usually needed |
| n_estimators | The number of trees in the forest. default=10 | The number of boosting stages to perform. default=100 | Number of boosted trees to fit. default=100 | |
| min_impurity_decrease / gamma | [min_impurity_decrease] A node will be split if this split induces a decrease of the impurity greater than or equal to this value. default=0 | Same as RF | [gamma] Minimum loss reduction required to make a further partition on a leaf node of the tree. default=0 | |
| max_depth | The maximum depth of the tree. default=None | Maximum depth of the individual regression estimators. default=3 | Maximum depth of a tree. default=6 | |
| min_samples_leaf / min_child_weight | [min_samples_leaf] The minimum number of samples required to be at a leaf node. default=1 | Same as RF | [min_child_weight] Minimum sum of instance weight (hessian) needed in a child. default=1 | XGBoost only has min_child_weight to constrain leaf nodes, whereas RF and GBDT also have min_samples_split for the minimum number of samples at an internal (split) node |
| min_samples_split | The minimum number of samples required to split an internal node. default=2 | Same as RF | None | |
| max_delta_step | None | None | Maximum delta step we allow each leaf output to be. default=0 | |
| subsample | None | The fraction of samples to be used for fitting the individual base learners. default=1.0 | Subsample ratio of the training instances. default=1 | RF draws its training sets by bootstrap sampling, so it has no such parameter |
| max_features / colsample_bytree | [max_features] The number of features to consider when looking for the best split. default='auto'; can be int, float, 'auto', 'sqrt', 'log2', None | Same as RF, but default=None | [colsample_bytree] Subsample ratio of columns when constructing each tree. default=1.0 | Besides the number of features used to build each tree, XGBoost can also control the number of features sampled at each split/level (see colsample_bylevel) |
| colsample_bylevel | None | None | Subsample ratio of columns for each split, in each level. default=1.0 | |
| class_weight / scale_pos_weight | [class_weight] default=None; can be dict, list of dicts, 'balanced', 'balanced_subsample' | None | [scale_pos_weight] Control the balance of positive and negative weights, useful for unbalanced classes. default=1 | XGBoost's scale_pos_weight has to be set manually, typically to (number of negative examples) / (number of positive examples) |
| reg_alpha | None | None | L1 regularization term on weights. default=0 | |
| reg_lambda | None | None | L2 regularization term on weights. default=1 | |
| warm_start | When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble. default=False | Same as RF | None | |
| oob_score | Whether to use out-of-bag samples to estimate the generalization accuracy. default=False | None | None | |
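As a usage note for the table, the sketch below shows how the corresponding knobs line up across the three classifiers, including computing scale_pos_weight as the ratio of negative to positive examples; every concrete value is illustrative rather than a recommendation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

y_train = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])          # toy, imbalanced labels
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()    # negatives / positives

rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            class_weight="balanced", n_jobs=-1, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
                                  subsample=0.8, max_features="sqrt", random_state=0)

xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
                    subsample=0.8, colsample_bytree=0.8,
                    scale_pos_weight=pos_weight,             # handles class imbalance
                    reg_lambda=1.0, n_jobs=-1, random_state=0)
```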
