XGBoost: A Complete Guide

Contents

  • 1 XGBoost Overview
  • 2 The XGBoost Algorithm
  • 3 XGBoost in Detail
    • 3.1 Objective Function
    • 3.2 Forward Stagewise Algorithm
  • 4 XGBoost in Python
    • 4.1 Loading Data
    • 4.2 Setting Parameters
    • 4.3 Training and Prediction
  • 5 Worked Example


1 XGBoost Overview

Prerequisites: decision trees, ensemble learning.
XGBoost is a boosting-tree ensemble algorithm that has become popular in recent years, largely because of its efficiency.
The core idea is to add trees one at a time, growing each tree by repeatedly splitting on features. Each added tree learns a new function that fits the residual of the previous round's prediction. Once training has produced K trees, predicting the score of a sample works as follows: the sample's features send it to one leaf in each tree, every leaf carries a score, and the prediction is simply the sum of the leaf scores across all the trees.
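
A minimal sketch of this scoring rule, assuming each trained tree can be represented as a plain Python function mapping a feature vector to its leaf score (the `trees` list and the split thresholds are illustrative, not part of any XGBoost API):

import numpy as np

# Hypothetical "trees": each maps a feature vector to the score of
# the leaf that the sample falls into.
trees = [
    lambda x: 0.5 if x[0] > 1.0 else -0.2,    # tree 1
    lambda x: 0.25 if x[1] > 3.0 else -0.25,  # tree 2
]

def predict(x, trees):
    """Prediction = sum of the leaf scores over all K trees."""
    return sum(tree(x) for tree in trees)

print(predict(np.array([2.0, 1.0]), trees))  # 0.5 + (-0.25) = 0.25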

2 The XGBoost Algorithm

Input: training set $T=\{(x_1,y_1),\dots,(x_m,y_m)\}$, where $x_i\in \mathbb{R}^n$, $y_i\in \mathbb{R}$; loss function $L(y_i,\hat y_i)$.
Output: boosted tree model $\hat f(x)$.
(1) Initialize $f_0(x)=\arg\min_c\sum^m_{i=1}L(y_i,c)$.
(2) For each of the K trees, $k=1,2,\dots,K$:
  (a) Compute the first- and second-order derivatives of the loss with respect to the previous round's prediction: $g_{ki}=\frac{\partial L(y_i,\hat y_i^{(k-1)})}{\partial \hat y_i^{(k-1)}}$, $h_{ki}=\frac{\partial^2 L(y_i,\hat y_i^{(k-1)})}{\partial (\hat y_i^{(k-1)})^2}$, $i=1,2,\dots,m$.
  (b) For each leaf node $j=1,2,\dots,J$ of the tree, compute $G_{kj}=\sum_{i\in I_j}g_{ki}$ and $H_{kj}=\sum_{i\in I_j}h_{ki}$.
  (c) The minimum of the k-th tree's objective is $Obj^{*}_k=-\frac{1}{2}\sum_{j=1}^J\frac{G_{kj}^2}{H_{kj}+\lambda}+\gamma J$, and the optimal weight of the j-th leaf of the k-th tree is $w_{kj}^{*} = -\frac{G_{kj}}{H_{kj}+\lambda}$.
  (d) Update: $f_k(x)=f_{k-1}(x)+\sum_{j=1}^J w^{*}_{kj}\,I(x\in R_{kj})$.
(3) The boosted tree for the regression problem (for classification, apply a sign/link transform to the regression output) is:
$\hat f(x)=f_K(x)$
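
A minimal sketch of steps (b) and (c), assuming a squared-error loss so that $g_i=\hat y_i^{(k-1)}-y_i$ and $h_i=1$; the leaf-assignment array and the variable names are illustrative only:

import numpy as np

# Toy data: previous-round predictions, targets, and a hypothetical
# assignment of each sample to one of J leaves of the current tree.
y     = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.5, 2.5, 3.5])   # predictions from the first k-1 trees
leaf  = np.array([0, 0, 1, 1])           # leaf index each sample falls into
lam, gamma, J = 1.0, 0.1, 2

# Squared-error loss (assumption of this sketch): g_i = y_hat - y, h_i = 1.
g = y_hat - y
h = np.ones_like(y)

# Step (b): per-leaf sums G_j and H_j.
G = np.array([g[leaf == j].sum() for j in range(J)])
H = np.array([h[leaf == j].sum() for j in range(J)])

# Step (c): optimal leaf weights and the minimized objective.
w_opt = -G / (H + lam)
obj   = -0.5 * np.sum(G**2 / (H + lam)) + gamma * J
print(w_opt, obj)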

3 XGBoost in Detail

3.1 Objective Function

  1. $Obj=\sum_{i=1}^m L(\hat y_i,y_i)+\sum_{k=1}^K\Omega(f_k)$, where $\Omega(f_k)$ measures the complexity of the k-th tree and acts as a guard against overfitting.
  2. By the forward stagewise strategy: $Obj^{(k)} = \sum_{i=1}^m L(y_i, \hat{y}_i^{(k-1)} + f_k(x_i)) + \Omega(f_k) + \text{constant}$.
  3. By a second-order Taylor expansion: $Obj^{(k)} = \sum_{i=1}^m [L(y_i, \hat{y}_i^{(k-1)}) + g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i)] + \Omega(f_k) + \text{constant}$, where $g_i=\partial_{\hat{y}_i^{(k-1)}} L(y_i, \hat{y}_i^{(k-1)})$ is the first-order derivative and $h_i=\partial^2_{\hat{y}_i^{(k-1)}} L(y_i, \hat{y}_i^{(k-1)})$ is the second-order derivative.
  4. Dropping the constant terms: $Obj^{(k)} = \sum_{i=1}^m [g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i)] + \Omega(f_k)$.
  5. The regularization term is $\Omega(f_k)=\gamma T+\frac{1}{2}\lambda \sum_{j=1}^{T}w_j^2$, where T is the number of leaves, $w_j$ is the weight of leaf j, and $\lambda$ and $\gamma$ are hyperparameters.
  6. Combining 4 and 5, and regrouping the sum over samples into a sum over leaves (the key trick of the algorithm), gives $Obj^{(k)}\approx \sum^T_{j=1} \left[\left(\sum_{i\in I_j} g_i\right) w_j + \frac{1}{2} \left(\sum_{i\in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T$, where $i\in I_j$ means that sample i falls into leaf j.
  7. Define $G_j = \sum_{i\in I_j} g_i$ and $H_j = \sum_{i\in I_j} h_i$; the final minimization step is sketched below.
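
With these definitions the objective decouples into T independent quadratics in the leaf weights $w_j$, so each can be minimized in closed form (a sketch of the last step, consistent with the formulas in Section 2):

$Obj^{(k)} \approx \sum_{j=1}^{T}\left[G_j w_j + \frac{1}{2}(H_j+\lambda)w_j^2\right] + \gamma T$
Setting $\frac{\partial Obj^{(k)}}{\partial w_j} = G_j + (H_j+\lambda)w_j = 0$ gives
$w_j^{*} = -\frac{G_j}{H_j+\lambda}, \qquad Obj^{(k)*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j+\lambda} + \gamma T$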

3.2 Forward Stagewise Algorithm

$\hat{y}_i^{(0)} = 0$
$\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)$
$\hat{y}_i^{(2)} = f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)$
$\dots$
$\hat{y}_i^{(K)} = \sum_{k=1}^K f_k(x_i) = \hat{y}_i^{(K-1)} + f_K(x_i)$

4 XGBoost in Python

import xgboost as xgb

4.1 Loading Data

XGBoost stores data in a DMatrix object.
Supported input types:

  • LibSVM text format file
  • Comma-separated values (CSV) file
  • NumPy 2D array
  • SciPy 2D sparse array
  • Pandas data frame
  • XGBoost binary buffer file.

Note: categorical variables must be one-hot encoded before they are loaded into XGBoost (a minimal sketch follows).
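
A one-hot encoding sketch using pandas; the column names here are made up for illustration:

import pandas as pd

# Hypothetical raw frame with one categorical column.
df = pd.DataFrame({'color': ['red', 'green', 'red'], 'size': [1.0, 2.0, 3.0]})

# One-hot encode the categorical column before building the DMatrix.
X = pd.get_dummies(df, columns=['color'])
print(X.columns.tolist())  # ['size', 'color_green', 'color_red']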

  1. xgb.DMatrix(data, label=None, missing=None, weight=None, silent=False, feature_names=None, feature_types=None, nthread=None): loads data into a DMatrix object.

label: the vector of target values.
missing: the value that marks missing entries in the input matrix.
weight: per-instance weights.

dtrain = xgb.DMatrix(X_train, label = y)
dtest = xgb.DMatrix(X_test)
  2. DMatrix.save_binary('train.buffer'): saves the DMatrix to a binary file so it can be loaded much faster next time.
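
A short usage sketch of these options; the arrays are toy data and 'train.buffer' is just an example filename:

import numpy as np
import xgboost as xgb

X = np.array([[1.0, -999.0], [2.0, 3.0], [4.0, 5.0]])
y = np.array([0, 1, 1])
w = np.array([1.0, 0.5, 2.0])

# -999 marks missing entries; each row carries its own weight.
dtrain = xgb.DMatrix(X, label=y, missing=-999.0, weight=w)

# Save to XGBoost's binary buffer format and reload it later.
dtrain.save_binary('train.buffer')
dtrain_again = xgb.DMatrix('train.buffer')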

4.2 Setting Parameters

param = {'max_depth': 2,'eta': 1, 'objective': 'binary:logistic',...}
Commonly used parameters:

'nthread': 4                                     # number of parallel threads
'eval_metric': ['auc', 'ams@0', 'rmse']          # evaluation metric(s); with several, the last one drives early stopping
'max_depth': 2                                   # maximum depth of each tree
'eta': 1                                         # learning rate (shrinkage applied to each tree)
'objective': 'reg:linear' or 'binary:logistic'   # learning task: regression or binary classification
'booster': 'gbtree'                              # which booster to use (tree booster here)
'subsample': 0.7                                 # fraction of rows sampled per boosting round
'colsample_bytree': 0.8                          # fraction of columns sampled per tree
'silent': True                                   # suppress most log output
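
A complete parameter dictionary for a binary classification task might look like the following; the values are illustrative, not tuned recommendations:

param = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'max_depth': 2,
    'eta': 1,
    'subsample': 0.7,
    'colsample_bytree': 0.8,
    'nthread': 4,
    'eval_metric': ['auc', 'rmse'],
}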

4.3 Training and Prediction

  1. clf = xgb.train(params, dtrain, num_boost_round=10, evals=[], obj=None, feval=None, maximize=False, early_stopping_rounds=None, evals_result=None, verbose_eval=True, xgb_model=None, callbacks=None, learning_rates=None): trains a booster (see the usage sketch after the parameter list below).

num_boost_round: number of boosting iterations.
evals: a list of (DMatrix, string) pairs, the validation sets whose metrics are evaluated during training; this lets us track the model's performance. Specified as evallist = [(dtest, 'eval'), (dtrain, 'train')].
obj: custom objective function.
feval: custom evaluation function.
maximize: whether feval should be maximized.
early_stopping_rounds: the validation metric must improve at least once every early_stopping_rounds rounds for training to continue; e.g. early_stopping_rounds=200 stops training once 200 consecutive rounds pass without improvement. If several metrics are given, only the last one is used.
evals_result: a dictionary that is filled with the evaluation results of the items in evals.
verbose_eval: either a bool or an integer; True prints the evaluation metric at every round, while an integer n prints it every n rounds.
xgb_model: a previously trained model to continue training from.
callbacks: list of callback functions applied at the end of each iteration.
learning_rates: learning rate for each boosting round.
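
A minimal training sketch built on the param dictionary and dtrain matrix from above; X_valid and y_valid stand for a labeled validation split and are assumed to exist:

# A labeled validation split is assumed to exist (assumption of this sketch).
dvalid = xgb.DMatrix(X_valid, label=y_valid)
evallist = [(dtrain, 'train'), (dvalid, 'valid')]
evals_result = {}

clf = xgb.train(
    param, dtrain,
    num_boost_round=100,        # up to 100 boosting rounds
    evals=evallist,             # metrics are reported on both sets
    early_stopping_rounds=10,   # stop once 'valid' has not improved for 10 rounds
    evals_result=evals_result,  # filled with the per-round metric values
    verbose_eval=20,            # print the metrics every 20 rounds
)
print(clf.best_iteration)       # best round found by early stopping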

  2. xgb.cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None, metrics=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, shuffle=True): runs cross-validation and returns the per-round evaluation history (a pandas DataFrame when as_pandas=True).
model = xgb.cv(params, dtrain,  num_boost_round=500, early_stopping_rounds=100)
model.loc[30:,["test-rmse-mean", "train-rmse-mean"]].plot()

[Figure: train vs. test RMSE curves from xgb.cv]
  3. bst.save_model('0001.model'): saves the trained model to a file.

  4. ypred = clf.predict(data, output_margin=False, ntree_limit=None, validate_features=True)

ntree_limit: limits the number of trees used for prediction; if a best tree limit was recorded (i.e. early stopping was used), it defaults to that limit, otherwise to 0 (all trees are used).
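
A short prediction sketch that reuses the number of trees chosen by early stopping (only meaningful when early stopping was enabled during training):

# Without ntree_limit all trees would be used; best_ntree_limit is set
# by early stopping during training.
ypred = clf.predict(dtest, ntree_limit=clf.best_ntree_limit)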

xgb.plot_tree(bst, num_trees=2)    # plot the tree at index 2 (requires matplotlib and graphviz)
xgb.to_graphviz(bst, num_trees=2)  # return the same tree as a graphviz object

5 Worked Example

XGBoost also provides a scikit-learn style interface:

model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state=7, nthread=-1)
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# X_train, y_train, X_test omitted

# Custom evaluation function: returns (metric name, metric value)
def myFeval(preds, xgbtrain):
    label = xgbtrain.get_label()
    score = mean_squared_error(label, preds)
    return 'myFeval', score
    
xgb_params = {"booster":'gbtree','eta': 0.005, 'max_depth': 5, 'subsample': 0.7, 
              'colsample_bytree': 0.8, 'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 8}
folds = KFold(n_splits=5, shuffle=True, random_state=2018)
oof_xgb = np.zeros(len(X_train))          # out-of-fold predictions on the training set
predictions_xgb = np.zeros(len(X_test))   # averaged predictions on the test set

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
    print("fold n°{}".format(fold_+1))
    trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx])
    val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx])
    
    watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]
    clf = xgb.train(dtrain=trn_data, num_boost_round=20000, evals=watchlist, early_stopping_rounds=200, verbose_eval=100, params=xgb_params,feval = myFeval)
    oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit)
    predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits
    
print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb, y_train_)))

OVER
