Machine Learning: Re-deriving XGBoost by Hand

Recently, while discussing XGB with a colleague, I suddenly realized that I could no longer remember some of its details clearly. After several years of doing data mining, that felt a bit embarrassing, so I decided to re-derive XGBoost by hand.

1. Deriving XGB by Hand

1.1 XGB core framework

0- Define the loss function
1- Compute the loss derivatives g0, h0
2- Greedily grow a tree based on the Gain values (a one-level decision tree is used here for simplicity)
3- Compute the average predicted value w of each branch
4- Compute the prediction of the current round: y_hat = y_hat_before_cumsum + w * lr
5- Recompute g0, h0 on the updated predictions
6- Repeat steps 1-5

1.2 Loss function and Gain

Loss function

Taking MSE as an example.

XGBoost's objective consists of two parts: the training loss and a complexity (regularization) term:

$$Loss=\frac{1}{2}(y-\hat{y})^2 + \left(\gamma T + \frac{\lambda}{2}\hat{y}^2\right) \qquad (1)$$
Approximate solution via second-order Taylor expansion, with
$$g=\hat{y}-y; \quad h=1$$
$$Loss=\frac{1}{2}(y-\hat{y})^2 + \frac{g}{1!}\hat{y} + \frac{h}{2!}\hat{y}^2 + \left(\gamma T + \frac{\lambda}{2}\hat{y}^2\right) = \frac{h+\lambda}{2}\hat{y}^2 + g\hat{y} + \gamma T \qquad (2)$$
[Note]:

  • the term $\frac{1}{2}(y-\hat{y})^2$ is a constant with respect to the new tree's output, so it is dropped

From the elementary fact that a quadratic $ax^2+bx+c$ attains its minimum at $x=\frac{-b}{2a}$, the loss is minimized when we take
$$\hat{y}=\frac{-b}{2a}=\frac{-g}{h+\lambda} \qquad (3)$$

    def mse(self, y, y_hat):
        """Mean squared error (with the 1/2 factor used in the derivation)."""
        return 0.5 * np.sum((y - y_hat) * (y - y_hat)) / len(y_hat)

    def _mse_derivative(self, y, y_hat):
        """First- and second-order derivatives of the squared error: g = y_hat - y, h = 1."""
        g = y_hat - y
        h = np.ones_like(g)
        return g, h
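As a quick sanity check (a standalone sketch, not part of the class above; the toy values are made up), the analytic g and h can be compared against finite-difference derivatives of the per-sample squared-error loss:

import numpy as np

def se_loss(y, y_hat):
    # per-sample squared error, 0.5 * (y - y_hat)^2
    return 0.5 * (y - y_hat) ** 2

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([0.5, 2.5, 2.0])
eps = 1e-4

# analytic derivatives used above: g = y_hat - y, h = 1
g = y_hat - y
h = np.ones_like(g)

# central finite differences with respect to y_hat
g_num = (se_loss(y, y_hat + eps) - se_loss(y, y_hat - eps)) / (2 * eps)
h_num = (se_loss(y, y_hat + eps) - 2 * se_loss(y, y_hat) + se_loss(y, y_hat - eps)) / eps ** 2

print(np.allclose(g, g_num), np.allclose(h, h_num))  # expected: True True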

Gain

Substituting equation (3) into equation (2) simplifies the loss to
$$Loss=-\frac{1}{2}\frac{g^2}{h+\lambda}+\gamma T$$
$$Gain=Loss_{parent}-Loss_{left}-Loss_{right}$$
The candidate that maximizes this Gain is chosen as the split threshold. (In the code below, g and h are summed over the samples falling into each node, i.e. G = sum(g), H = sum(h).)

    def _gain(self, g_list, h_list):
        """
        Loss of a set of nodes: -0.5 * sum(G^2 / (H + lambda)) + gamma * T
        """
        T = 0
        gain_list = []
        for g, h in zip(g_list, h_list):
            T += 1
            G = np.sum(g, axis=0)
            H = np.sum(h, axis=0)
            gain = G * G / (H + self.lambda_)
            gain_list.append(gain)

        if T == 1:
            return -0.5 * gain + self.gamma
        return -0.5 * np.sum(gain_list) + self.gamma * T


    def compute_w(self, g, h):
        """
        The y_hat value that minimizes the loss (equation 3)
        """
        return - g / (h + self.lambda_)


    def split_node_gain(self, g, h, L, R):
        """
        Gain of splitting a node into a left (L) and a right (R) child
        """
        left = self._gain([g[L]], [h[L]])
        right = self._gain([g[R]], [h[R]])
        total = self._gain([g], [h])
        return total - left - right
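To make the Gain formula concrete, here is a small standalone sketch (toy numbers and plain functions instead of the class methods above; lambda_ and gamma mirror the hyper-parameters used later):

import numpy as np

lambda_, gamma = 0.01, 1.0

def node_loss(g, h):
    # -0.5 * G^2 / (H + lambda) + gamma, with G and H summed over the node's samples
    G, H = np.sum(g), np.sum(h)
    return -0.5 * G * G / (H + lambda_) + gamma

# toy gradients/hessians for 6 samples and a boolean split mask
g = np.array([-2.0, -2.0, -2.0, 2.0, 2.0, 2.0])
h = np.ones_like(g)
L = np.array([True, True, True, False, False, False])
R = ~L

gain = node_loss(g, h) - node_loss(g[L], h[L]) - node_loss(g[R], h[R])
print(gain)  # about 10.96: separating opposite-sign gradients reduces the loss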

2. A Simple Python Implementation

For the full code, see the author's GitHub: Simple_xgboost.py


# Load the data
tr_x, te_x, tr_y, te_y = get_data()
loss_f = LossFunction('mse')
xgb_f = XGBFunction(gamma=1, lambda_=0.01, min_split_loss=0, learning_rate=1)
num_boost_rounds = 5
t = 0
base_value = np.array([0.5] * len(tr_y))
g0, h0 = loss_f.backward(tr_y, base_value)

# 1- Compute g0, h0
# 2- Greedily grow a tree (a one-level decision tree here)
# 3- Compute the predicted value w of each branch
# 4- Compute the prediction of the current round: y_hat = base_value + w
# 5- Recompute g0, h0
while t < num_boost_rounds:
    # Greedy split: a one-level decision tree
    sp = xgb_f.stupid_split(tr_x, g0, h0)
    try:
        f = sp[t+1].split_feature
        th = sp[t+1].split_th
    except KeyError:
        print(f'early stop in num_boost_rounds [{t+1}]')
        break
    L, R = xgb_f._split_data(tr_x, f, th)
    # Average predicted value of the left and right nodes
    wl = xgb_f.compute_w(g0[L], h0[L]).mean()
    wr = xgb_f.compute_w(g0[R], h0[R]).mean()
    print(f'{t+1}', sp[t+1])
	
    # Update the latest predictions
    tr_x = np.concatenate([tr_x[L], tr_x[R]])
    tr_y = np.concatenate([tr_y[L], tr_y[R]])
    y_hat = np.concatenate([base_value[L] + wl, base_value[R] + wr])
	
    # Recompute the gradients from the updated predictions
    g0, h0 = loss_f.backward(tr_y, y_hat)
    base_value = y_hat
    # Check the prediction loss
    print('train mse:', loss_f.mse(tr_y, y_hat))
    te_pred = xgb_f.predict(te_x)
    print('test mse:', loss_f.mse(te_y, te_pred))

    t += 1

Final output:

xgb_f.split_dict
{
    1: SplitInfo(split_feature=2, split_th=3.0, node_gain=-13.502958086826098, split_gain=28.497167053742277, wl=-0.49504950495049505, wr=0.9777227722772277),
    2: SplitInfo(split_feature=3, split_th=1.8, node_gain=0.998578280748277, split_gain=4.902365970529983, wl=-0.19447343686028468, wr=0.4888176229221224),
    3: SplitInfo(split_feature=2, split_th=3.0, node_gain=0.9999998606294234, split_gain=0.0767959831019841, wl=0.18764647704037596, wr=-0.09375165749677827)
}

3. Calling xgboost and Comparing Results

xgb_params = {
    'objective' : 'reg:squarederror',
    'gamma' : 1,
    'min_split_loss': 0,
    'max_depth': 1,
    'reg_lambda': 0.01,
    'learning_rate':1
}
tr_mt = xgb.DMatrix(tr_x, label=tr_y)
te_mt = xgb.DMatrix(te_x, label=te_y)
xgb_model = xgb.train(xgb_params, tr_mt, num_boost_round=3)
te_p_xgb = xgb_model.predict(te_mt)
loss_f.mse(te_y, te_p_xgb)

xgb_tree = xgb_model.trees_to_dataframe()
print(xgb_tree)
cs.print(xgb_f.split_dict)

xgb_model.get_score(importance_type='gain')
cs.print( xgb_f.gain_importance() )

The results are essentially consistent:

   Tree  Node   ID Feature  Split  Yes   No Missing       Gain  Cover
0     0     0  0-0      f2   2.45  0-1  0-2     0-1  58.994327  120.0
1     0     1  0-1    Leaf    NaN  NaN  NaN     NaN  -0.499875   40.0
2     0     2  0-2    Leaf    NaN  NaN  NaN     NaN   0.987377   80.0
3     1     0  1-0      f3   1.75  1-1  1-2     1-1  11.572807  120.0
4     1     1  1-1    Leaf    NaN  NaN  NaN     NaN  -0.199235   85.0
5     1     2  1-2    Leaf    NaN  NaN  NaN     NaN   0.483914   35.0
6     2     0  2-0      f2   2.45  2-1  2-2     2-1   2.377620  120.0
7     2     1  2-1    Leaf    NaN  NaN  NaN     NaN   0.199060   40.0
8     2     2  2-2    Leaf    NaN  NaN  NaN     NaN  -0.099507   80.0
>>> cs.print(xgb_f.split_dict)
{
    1: SplitInfo(split_feature=2, split_th=3.0, node_gain=-13.502958086826098, split_gain=28.497167053742277, wl=-0.49504950495049505, wr=0.9777227722772277),
    2: SplitInfo(split_feature=3, split_th=1.8, node_gain=0.998578280748277, split_gain=4.902365970529983, wl=-0.19447343686028468, wr=0.4888176229221224),
    3: SplitInfo(split_feature=2, split_th=3.0, node_gain=0.9999998606294234, split_gain=0.0767959831019841, wl=0.18764647704037596, wr=-0.09375165749677827)
}
>>> loss_f.mse(te_y, te_p)
0.008353821640715195
>>> xgb_model.get_score(importance_type='gain')
{'f2': 30.685973739999998, 'f3': 11.5728073}
>>> cs.print( xgb_f.gain_importance() )
{'f2': 28.573963036844262, 'f3': 4.902365970529983}

4. XGB vs. LGB

4.1 What they have in common

In general, ensemble tree models like these can be applied to almost any tabular data.

4.2 Shortcomings of XGB

When searching for the best split point, XGB pre-sorts each feature and then walks through the sorted values to find the best split (the core is written in C++, so on moderately sized data it does not actually feel slow).
This sort-and-scan approach to split finding has two notable drawbacks (a small sketch of the scan follows the list below):

  • High memory consumption: the sorted results (sort indices) have to be stored
  • Gradient access is effectively random, and the access order differs from feature to feature
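A minimal sketch of this pre-sort-and-scan split search (simplified, not xgboost's actual C++ implementation; the helper name and toy data are made up, and gamma is omitted):

import numpy as np

def best_split_presorted(x_col, g, h, lambda_=1.0):
    """Scan every candidate threshold of one feature using its pre-sorted index."""
    order = np.argsort(x_col)              # the per-feature sort index that has to be stored
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_th = -np.inf, None
    for pos in range(len(order) - 1):
        i = order[pos]                     # gradients are read in sorted-feature order (random in memory)
        GL += g[i]; HL += h[i]
        if x_col[i] == x_col[order[pos + 1]]:
            continue                       # cannot split between identical feature values
        GR, HR = G - GL, H - HL
        gain = GL**2 / (HL + lambda_) + GR**2 / (HR + lambda_) - G**2 / (H + lambda_)
        if gain > best_gain:
            best_gain = gain
            best_th = 0.5 * (x_col[i] + x_col[order[pos + 1]])
    return best_th, 0.5 * best_gain

x = np.array([3.0, 1.0, 2.0, 5.0, 4.0])
g = np.array([0.8, -1.0, -0.9, 1.1, 0.7])
h = np.ones_like(g)
print(best_split_presorted(x, g, h))       # splits the negative-gradient samples (x <= 2) from the rest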

4.3 LGB's optimizations

  • Histogram-based decision tree algorithm (no need to traverse every distinct feature value)
  • Gradient-based One-Side Sampling (GOSS): drops a large fraction of the samples that have only small gradients and uses the remaining data to compute gains when searching for the best split
    • In practice, the samples are sorted by the absolute value of their gradients, the top-N are kept, m of the rest are sampled at random, and the gain is computed on the union (a sketch follows this list)
  • Exclusive Feature Bundling (EFB): merges many mutually exclusive features into a single feature, achieving a dimensionality-reduction effect
  • Native support for categorical features as input
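And a rough sketch of the GOSS sampling step described above (the function name and sampling rates are illustrative; the (1 - a)/b re-weighting of the sampled small-gradient instances follows the LightGBM paper):

import numpy as np

def goss_sample(g, top_rate=0.2, other_rate=0.1, seed=0):
    """Keep the top-N samples by |gradient| plus a random subset of the rest."""
    rng = np.random.default_rng(seed)
    n = len(g)
    top_n = int(top_rate * n)
    rand_n = int(other_rate * n)
    order = np.argsort(-np.abs(g))          # sort by gradient magnitude, descending
    top_idx = order[:top_n]                 # large-gradient samples are always kept
    rand_idx = rng.choice(order[top_n:], size=rand_n, replace=False)
    idx = np.concatenate([top_idx, rand_idx])
    # small-gradient samples are up-weighted so the gain estimate stays roughly unbiased
    weights = np.ones(len(idx))
    weights[top_n:] = (1.0 - top_rate) / other_rate
    return idx, weights

g = np.random.default_rng(1).normal(size=100)
idx, w = goss_sample(g)
print(len(idx), w.max())                    # 30 samples kept; sampled ones carry weight 8.0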

References

  • XGBoost: A Scalable Tree Boosting System
  • Zhihu: 机器学习集成学习之XGBoost(基于python实现)
  • Zhihu: 左手论文 右手代码 深入理解网红算法XGBoost
