Recently, while discussing XGBoost with colleagues, I suddenly realized I could no longer remember some of its details clearly. After several years of data-mining work that felt embarrassing, so I decided to re-derive and re-implement XGBoost by hand.
0- Define the loss function
1- Compute the loss-function derivatives g0, h0
2- Greedily grow a tree based on the Gain value (a one-level decision tree is used here for simplicity)
3- Compute the average predicted value w of each branch
4- Compute the prediction of the current iteration: y_hat = y_hat_before_cumsum + w * lr
5- Recompute and update g0, h0
6- Repeat steps 1-5
Taking MSE as an example:
XGBoost's objective consists of two parts: the loss function and a complexity (regularization) term.
$$Loss=\frac{1}{2}(y-\hat{y})^2 + \left(\gamma T + \frac{\lambda}{2}\hat{y}^2\right) \qquad (1)$$
Approximate it with a second-order Taylor expansion, where

$$g=\hat y - y; \qquad h=1$$

are the first- and second-order derivatives of the squared-error term:

$$Loss=\frac{1}{2}(y-\hat y)^2 + \frac{g}{1!}\hat{y} + \frac{h}{2!}\hat{y}^2 + \left(\gamma T + \frac{\lambda}{2}\hat{y}^2\right) = \frac{h+\lambda}{2}\hat{y}^2 + g\hat{y} + \gamma T \qquad (2)$$

Here $\hat{y}$ in the expansion denotes the output of the new tree, and the leading $\frac{1}{2}(y-\hat y)^2$ term is the loss at the previous prediction, a constant that can be dropped when minimizing.
[Note]: by elementary algebra, a quadratic $ax^2+bx+c$ (with $a>0$) attains its minimum at $x=\frac{-b}{2a}$, so the loss is minimized when we take

$$\hat{y}=\frac{-b}{2a}=\frac{-g}{h+\lambda} \qquad (3)$$
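As a quick numeric sanity check of equation (3) (a standalone sketch, not part of Simple_xgboost.py; the sample values are made up, except that 0.5 and 0.01 match the base score and lambda used later), we can brute-force the quadratic part of (2) and confirm the minimizer agrees with the closed form:

import numpy as np

# one sample: true label 0, previous prediction 0.5, lambda = 0.01
y, y_prev, lambda_ = 0.0, 0.5, 0.01
g, h = y_prev - y, 1.0                                # MSE derivatives from above

w_grid = np.linspace(-2, 2, 400001)                   # brute-force search grid
obj = 0.5 * (h + lambda_) * w_grid ** 2 + g * w_grid  # quadratic part of (2)
print(w_grid[np.argmin(obj)])                         # ≈ -0.49505
print(-g / (h + lambda_))                             # closed form (3): -0.5/1.01 ≈ -0.495

This -0.495 is exactly the wl reported for the first split in the output further down, where the left branch presumably contains samples with label 0 that still carry the base prediction of 0.5.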
The corresponding loss and derivative implementation:

def mse(self, y, y_hat):
    # mean squared error, kept with the 1/2 factor from the loss definition
    return 0.5 * np.sum((y - y_hat) * (y - y_hat)) / len(y_hat)

def _mse_derivative(self, y, y_hat):
    # first- and second-order derivatives of the MSE loss w.r.t. y_hat
    g = y_hat - y
    h = np.ones_like(g)
    return g, h
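A quick standalone check (my own snippet, not from the repo) that the analytic g and h above agree with finite-difference derivatives of the per-sample loss 0.5 * (y - y_hat)^2:

import numpy as np

y, y_hat, eps = np.array([1.0]), np.array([0.3]), 1e-4
loss = lambda p: 0.5 * (y - p) ** 2

# central differences for the first and second derivative w.r.t. y_hat
g_num = (loss(y_hat + eps) - loss(y_hat - eps)) / (2 * eps)                  # ≈ y_hat - y = -0.7
h_num = (loss(y_hat + eps) - 2 * loss(y_hat) + loss(y_hat - eps)) / eps**2   # ≈ 1
print(g_num, h_num)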
Substituting equation (3) into equation (2) simplifies the loss to

$$Loss=-\frac{1}{2}\cdot\frac{g^2}{h+\lambda}+\gamma T$$

and the gain of a split is

$$Gain=Loss_{parent}-Loss_{left}-Loss_{right}$$

The candidate threshold that maximizes this Gain is chosen as the split point. The corresponding implementation:
def _gain(self, g_list, h_list):
    """
    Compute the structure loss -1/2 * G^2 / (H + lambda) + gamma * T
    for the leaves whose gradients are passed in g_list / h_list.
    """
    T = 0
    gain_list = []
    for g, h in zip(g_list, h_list):
        T += 1                       # one leaf per (g, h) pair
        G = np.sum(g, axis=0)        # sum of first-order gradients in the leaf
        H = np.sum(h, axis=0)        # sum of second-order gradients in the leaf
        gain = G * G / (H + self.lambda_)
        gain_list.append(gain)
    if T == 1:
        return -0.5 * gain + self.gamma
    return -0.5 * np.concatenate(gain_list).sum(axis=0) + self.gamma * T
def compute_w(self, g, h):
    """
    The value of y_hat that minimizes the loss, i.e. equation (3).
    """
    return -g / (h + self.lambda_)
def split_node_gain(self, g, h, L, R):
    """
    Gain obtained by splitting a node into the index sets L and R.
    """
    left = self._gain([g[L]], [h[L]])
    right = self._gain([g[R]], [h[R]])
    total = self._gain([g], [h])
    return total - left - right
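To make the greedy search concrete, here is a small self-contained sketch (my own toy example, not the class method itself; the lambda_/gamma values and the data are made up) that scans candidate thresholds of a single feature and keeps the split with the largest gain, mirroring split_node_gain:

import numpy as np

lambda_, gamma = 0.01, 1.0

def leaf_loss(g, h):
    # structure score of one leaf: -1/2 * G^2 / (H + lambda) + gamma
    G, H = g.sum(), h.sum()
    return -0.5 * G * G / (H + lambda_) + gamma

def best_split(x, g, h):
    # try every distinct value of x as a threshold and keep the largest
    # gain = parent loss - left loss - right loss
    best_gain, best_th = -np.inf, None
    for th in np.unique(x)[1:]:
        L, R = x < th, x >= th
        gain = leaf_loss(g, h) - leaf_loss(g[L], h[L]) - leaf_loss(g[R], h[R])
        if gain > best_gain:
            best_gain, best_th = gain, th
    return best_th, best_gain

# toy data: the gradients change sign around x = 3, so the split x < 3 should win
x = np.array([1., 2., 3., 4., 5.])
g = np.array([-1., -1., 1., 1., 1.])
h = np.ones_like(g)
print(best_split(x, g, h))   # -> (3.0, ...)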
For the complete code, see the author's GitHub: Simple_xgboost.py
# data loading
tr_x, te_x, tr_y, te_y = get_data()
loss_f = LossFunction('mse')
xgb_f = XGBFunction(gamma=1, lambda_=0.01, min_split_loss=0, learning_rate=1)

num_boost_rounds = 5
t = 0
base_value = np.array([0.5] * len(tr_y))
g0, h0 = loss_f.backward(tr_y, base_value)

# 1- compute g0, h0
# 2- greedily grow a tree (a one-level decision tree here)
# 3- compute the branch prediction w
# 4- compute this iteration's prediction: y_hat = base_value + w
# 5- recompute and update g0, h0
while t < num_boost_rounds:
    # greedy split: a one-level decision tree
    sp = xgb_f.stupid_split(tr_x, g0, h0)
    try:
        f = sp[t + 1].split_feature
        th = sp[t + 1].split_th
    except KeyError:
        print(f'early stop in num_boost_rounds [{t+1}]')
        break
    L, R = xgb_f._split_data(tr_x, f, th)
    # average optimal leaf value of the left / right branch
    wl = xgb_f.compute_w(g0[L], h0[L]).mean()
    wr = xgb_f.compute_w(g0[R], h0[R]).mean()
    print(f'{t+1}', sp[t+1])
    # update the current predictions (samples are reordered to L then R)
    tr_x = np.concatenate([tr_x[L], tr_x[R]])
    tr_y = np.concatenate([tr_y[L], tr_y[R]])
    y_hat = np.concatenate([base_value[L] + wl, base_value[R] + wr])
    # recompute the gradients from the updated predictions
    g0, h0 = loss_f.backward(tr_y, y_hat)
    base_value = y_hat
    # track the training / test loss
    print('train mse:', loss_f.mse(tr_y, y_hat))
    te_pred = xgb_f.predict(te_x)
    print('test mse:', loss_f.mse(te_y, te_pred))
    t += 1
Final output:
xgb_f.split_dict
{
1: SplitInfo(split_feature=2, split_th=3.0, node_gain=-13.502958086826098, split_gain=28.497167053742277, wl=-0.49504950495049505, wr=0.9777227722772277),
2: SplitInfo(split_feature=3, split_th=1.8, node_gain=0.998578280748277, split_gain=4.902365970529983, wl=-0.19447343686028468, wr=0.4888176229221224),
3: SplitInfo(split_feature=2, split_th=3.0, node_gain=0.9999998606294234, split_gain=0.0767959831019841, wl=0.18764647704037596, wr=-0.09375165749677827)
}
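Given these three stored stumps, prediction on new data could look roughly like the sketch below (the real predict lives in Simple_xgboost.py; the split direction, the base score of 0.5 and the learning-rate handling are assumptions on my part):

import numpy as np

def predict_from_splits(split_dict, X, base_score=0.5, lr=1.0):
    # start from the base score and add one stump's leaf value per boosting round
    pred = np.full(len(X), base_score)
    for info in split_dict.values():
        goes_left = X[:, info.split_feature] < info.split_th   # assumed split direction
        pred += lr * np.where(goes_left, info.wl, info.wr)
    return pred

# e.g. te_pred = predict_from_splits(xgb_f.split_dict, te_x)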
For comparison, train the official xgboost library with matching parameters:

xgb_params = {
    'objective': 'reg:squarederror',
    'gamma': 1,
    'min_split_loss': 0,
    'max_depth': 1,
    'reg_lambda': 0.01,
    'learning_rate': 1
}
tr_mt = xgb.DMatrix(tr_x, label=tr_y)
te_mt = xgb.DMatrix(te_x, label=te_y)
xgb_model = xgb.train(xgb_params, tr_mt, num_boost_round=3)
te_p_xgb = xgb_model.predict(te_mt)
loss_f.mse(te_y, te_p_xgb)

# compare the tree structures and gain-based feature importances
xgb_tree = xgb_model.trees_to_dataframe()
print(xgb_tree)
cs.print(xgb_f.split_dict)
xgb_model.get_score(importance_type='gain')
cs.print(xgb_f.gain_importance())
The results are essentially consistent:
Tree Node ID Feature Split Yes No Missing Gain Cover
0 0 0 0-0 f2 2.45 0-1 0-2 0-1 58.994327 120.0
1 0 1 0-1 Leaf NaN NaN NaN NaN -0.499875 40.0
2 0 2 0-2 Leaf NaN NaN NaN NaN 0.987377 80.0
3 1 0 1-0 f3 1.75 1-1 1-2 1-1 11.572807 120.0
4 1 1 1-1 Leaf NaN NaN NaN NaN -0.199235 85.0
5 1 2 1-2 Leaf NaN NaN NaN NaN 0.483914 35.0
6 2 0 2-0 f2 2.45 2-1 2-2 2-1 2.377620 120.0
7 2 1 2-1 Leaf NaN NaN NaN NaN 0.199060 40.0
8 2 2 2-2 Leaf NaN NaN NaN NaN -0.099507 80.0
>>> cs.print(xgb_f.split_dict)
{
1: SplitInfo(split_feature=2, split_th=3.0, node_gain=-13.502958086826098, split_gain=28.497167053742277, wl=-0.49504950495049505, wr=0.9777227722772277),
2: SplitInfo(split_feature=3, split_th=1.8, node_gain=0.998578280748277, split_gain=4.902365970529983, wl=-0.19447343686028468, wr=0.4888176229221224),
3: SplitInfo(split_feature=2, split_th=3.0, node_gain=0.9999998606294234, split_gain=0.0767959831019841, wl=0.18764647704037596, wr=-0.09375165749677827)
}
>>> loss_f.mse(te_y, te_p_xgb)
0.008353821640715195
>>> xgb_model.get_score(importance_type='gain')
{'f2': 30.685973739999998, 'f3': 11.5728073}
>>> cs.print( xgb_f.gain_importance() )
{'f2': 28.573963036844262, 'f3': 4.902365970529983}
Ensemble tree models are a reasonable default for most tabular data.
When searching for the best split, XGBoost pre-sorts each feature and then scans the sorted values to find the best split point (since the core is written in C++, this usually does not feel slow on moderately sized data); a rough sketch of that scan follows below.
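The pre-sort + scan idea looks roughly like the following sketch (my own illustration, not the library's actual C++ implementation): sort the feature once, then sweep left to right while maintaining running sums of g and h, so every candidate threshold is evaluated in O(1):

import numpy as np

def presorted_best_split(x, g, h, lambda_=0.01, gamma=1.0):
    # sort the feature once
    order = np.argsort(x)
    xs, gs, hs = x[order], g[order], h[order]
    G, H = gs.sum(), hs.sum()
    GL = HL = 0.0
    best_gain, best_th = -np.inf, None
    # sweep left-to-right, keeping prefix sums of g and h
    for i in range(len(xs) - 1):
        GL += gs[i]
        HL += hs[i]
        if xs[i] == xs[i + 1]:        # cannot split between equal feature values
            continue
        GR, HR = G - GL, H - HL
        # gain = parent loss - left loss - right loss, as derived above
        gain = 0.5 * (GL ** 2 / (HL + lambda_)
                      + GR ** 2 / (HR + lambda_)
                      - G ** 2 / (H + lambda_)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_th = 0.5 * (xs[i] + xs[i + 1])   # midpoint as the threshold
    return best_th, best_gain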
This sort-and-scan way of finding the best split has two notable drawbacks: