Source: Datawhale
XGBoost is an open-source machine learning project developed by Tianqi Chen and collaborators. It is an efficient implementation of the GBDT algorithm with many algorithmic and engineering improvements, and it is widely used in Kaggle and other machine learning competitions with strong results. XGBoost is essentially still a GBDT, but it pushes speed and efficiency to the extreme, hence the X (Extreme) in the name; like GBDT, it is a boosting method. XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework and provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs on the major distributed environments (Hadoop, SGE, MPI) and can handle problems with billions of examples. XGBoost also exploits out-of-core computation, enabling data scientists to process hundreds of millions of samples on a single machine. Combining these techniques yields an end-to-end system that scales to very large datasets with minimal cluster resources. XGBoost uses CART decision trees as base learners and builds the final model as an ensemble of many CART trees via gradient tree boosting. Let us now look at how the final XGBoost model is constructed.
Following Tianqi Chen's paper, our data are $\mathcal{D}=\left\{\left(\mathbf{x}_{i}, y_{i}\right)\right\}\left(|\mathcal{D}|=n, \mathbf{x}_{i} \in \mathbb{R}^{m}, y_{i} \in \mathbb{R}\right)$.
(1) Constructing the objective function:
Suppose there are $K$ trees. The prediction for the $i$-th sample is $\hat{y}_{i}=\phi\left(\mathbf{x}_{i}\right)=\sum_{k=1}^{K} f_{k}\left(\mathbf{x}_{i}\right), \quad f_{k} \in \mathcal{F}$, where $\mathcal{F}=\left\{f(\mathbf{x})=w_{q(\mathbf{x})}\right\}\left(q: \mathbb{R}^{m} \rightarrow T, w \in \mathbb{R}^{T}\right)$; here $q$ maps a sample to a leaf index and $w$ holds the leaf weights.
The objective function is therefore constructed as:
$$\mathcal{L}(\phi)=\sum_{i} l\left(\hat{y}_{i}, y_{i}\right)+\sum_{k} \Omega\left(f_{k}\right)$$
where $\sum_{i} l\left(\hat{y}_{i}, y_{i}\right)$ is the loss function and $\sum_{k} \Omega\left(f_{k}\right)$ is the regularization term.
(2) Additive training:
Given a sample $x_i$, $\hat{y}_i^{(0)} = 0$ (the initial prediction), $\hat{y}_i^{(1)} = \hat{y}_i^{(0)} + f_1(x_i)$, $\hat{y}_i^{(2)} = \hat{y}_i^{(0)} + f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)$, and so on, which gives $\hat{y}_i^{(K)} = \hat{y}_i^{(K-1)} + f_K(x_i)$, where $\hat{y}_i^{(K-1)}$ is the prediction of the first $K-1$ trees and $f_K(x_i)$ is the prediction of the $K$-th tree.
The objective function can therefore be written as:
$$\mathcal{L}^{(K)}=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(K-1)}+f_{K}\left(\mathbf{x}_{i}\right)\right)+\sum_{k} \Omega\left(f_{k}\right)$$
Since the regularization term can likewise be split into the complexity of the first $K-1$ trees plus that of the $K$-th tree, $\mathcal{L}^{(K)}=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(K-1)}+f_{K}\left(\mathbf{x}_{i}\right)\right)+\sum_{k=1}^{K-1}\Omega\left(f_{k}\right)+\Omega\left(f_{K}\right)$, and because $\sum_{k=1}^{K-1}\Omega\left(f_{k}\right)$ is already fixed by the time the $K$-th tree is built, it is a known constant that can be dropped from the optimization, we obtain:
$$\mathcal{L}^{(K)}=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(K-1)}+f_{K}\left(\mathbf{x}_{i}\right)\right)+\Omega\left(f_{K}\right)$$
(3) Approximating the objective with a Taylor expansion:
$$\mathcal{L}^{(K)} \simeq \sum_{i=1}^{n}\left[l\left(y_{i}, \hat{y}_{i}^{(K-1)}\right)+g_{i} f_{K}\left(\mathbf{x}_{i}\right)+\frac{1}{2} h_{i} f_{K}^{2}\left(\mathbf{x}_{i}\right)\right]+\Omega\left(f_{K}\right)$$
where $g_{i}=\partial_{\hat{y}_{i}^{(K-1)}} l\left(y_{i}, \hat{y}_{i}^{(K-1)}\right)$ and $h_{i}=\partial_{\hat{y}_{i}^{(K-1)}}^{2} l\left(y_{i}, \hat{y}_{i}^{(K-1)}\right)$ are the first- and second-order derivatives of the loss with respect to the current prediction $\hat{y}_{i}^{(K-1)}$.
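As a concrete example (added here for illustration): with the squared-error loss $l\left(y_{i}, \hat{y}_{i}\right)=\frac{1}{2}\left(y_{i}-\hat{y}_{i}\right)^{2}$, these statistics become $g_{i}=\hat{y}_{i}^{(K-1)}-y_{i}$ and $h_{i}=1$, so fitting the $K$-th tree amounts to fitting the residuals of the current model, just as in ordinary gradient boosting.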
Here is a brief reminder about Taylor series:
In mathematics, a Taylor series represents a function as an infinite sum of terms computed from the function's derivatives at a single point. Its general form is:
$$f(x)=\frac{f\left(x_{0}\right)}{0 !}+\frac{f^{\prime}\left(x_{0}\right)}{1 !}\left(x-x_{0}\right)+\frac{f^{\prime \prime}\left(x_{0}\right)}{2 !}\left(x-x_{0}\right)^{2}+\cdots+\frac{f^{(n)}\left(x_{0}\right)}{n !}\left(x-x_{0}\right)^{n}+\cdots$$
Since $\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(K-1)}\right)$ is already fixed when the $K$-th tree is built, it is a known constant and can be dropped from the optimization, so:
$$\tilde{\mathcal{L}}^{(K)}=\sum_{i=1}^{n}\left[g_{i} f_{K}\left(\mathbf{x}_{i}\right)+\frac{1}{2} h_{i} f_{K}^{2}\left(\mathbf{x}_{i}\right)\right]+\Omega\left(f_{K}\right)$$
(4) Defining a single tree:
To describe a tree we need a few concepts: first, $q(x)$, the index of the leaf that a sample falls into; second, $I_{j}=\left\{i \mid q\left(\mathbf{x}_{i}\right)=j\right\}$, the set of samples that fall into leaf $j$; third, $w_{q(x)}$, the prediction value of each leaf; and fourth, the model complexity $\Omega\left(f_{K}\right)$, which can be built from the number of leaves and the leaf values: $\Omega\left(f_{K}\right) = \gamma T+\frac{1}{2} \lambda \sum_{j=1}^{T} w_{j}^{2}$. Consider the example in the figure below:
$q(x_1) = 1$, $q(x_2) = 3$, $q(x_3) = 1$, $q(x_4) = 2$, $q(x_5) = 3$; $I_1 = \{1,3\}$, $I_2 = \{4\}$, $I_3 = \{2,5\}$; $w = (15,12,20)$
Substituting this notation into the objective function gives:
$$\begin{aligned} \tilde{\mathcal{L}}^{(K)} &=\sum_{i=1}^{n}\left[g_{i} f_{K}\left(\mathbf{x}_{i}\right)+\frac{1}{2} h_{i} f_{K}^{2}\left(\mathbf{x}_{i}\right)\right]+\gamma T+\frac{1}{2} \lambda \sum_{j=1}^{T} w_{j}^{2} \\ &=\sum_{j=1}^{T}\left[\left(\sum_{i \in I_{j}} g_{i}\right) w_{j}+\frac{1}{2}\left(\sum_{i \in I_{j}} h_{i}+\lambda\right) w_{j}^{2}\right]+\gamma T \end{aligned}$$
Since our goal is to minimize the objective, which is now a quadratic function of $w$, we can apply the extremum formula for a quadratic $y=ax^{2}+bx+c$: the axis of symmetry lies at $x=-\frac{b}{2 a}$ and the extremum is $y=\frac{4 a c-b^{2}}{4 a}$. For each leaf $j$ the bracketed term is exactly such a quadratic in $w_j$, with $a=\frac{1}{2}\left(\sum_{i \in I_{j}} h_{i}+\lambda\right)$, $b=\sum_{i \in I_{j}} g_{i}$ and $c=0$, therefore:
$$w_{j}^{*}=-\frac{\sum_{i \in I_{j}} g_{i}}{\sum_{i \in I_{j}} h_{i}+\lambda}$$
and
$$\tilde{\mathcal{L}}^{(K)}(q)=-\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_{j}} g_{i}\right)^{2}}{\sum_{i \in I_{j}} h_{i}+\lambda}+\gamma T$$
(5) Finding the tree structure:
Note that the discussion so far computed $w$ and $L$ assuming the tree structure was already fixed; in practice, however, we need to search for the structure, just as when learning a decision tree. We therefore follow the decision-tree approach and use the change in the objective function as the criterion for splitting a node. An example makes this concrete:
The example has 8 samples, split as shown below, so:
$$\begin{aligned} \tilde{\mathcal{L}}^{(old)} &= -\frac{1}{2}\left[\frac{(g_7 + g_8)^2}{h_7+h_8 + \lambda} + \frac{(g_1 +\cdots+ g_6)^2}{h_1+\cdots+h_6 + \lambda}\right] + 2\gamma \\ \tilde{\mathcal{L}}^{(new)} &= -\frac{1}{2}\left[\frac{(g_7 + g_8)^2}{h_7+h_8 + \lambda} + \frac{(g_1 +\cdots+ g_3)^2}{h_1+\cdots+h_3 + \lambda} + \frac{(g_4 +\cdots+ g_6)^2}{h_4+\cdots+h_6 + \lambda}\right] + 3\gamma \\ \tilde{\mathcal{L}}^{(old)} - \tilde{\mathcal{L}}^{(new)} &= \frac{1}{2}\left[ \frac{(g_1 +\cdots+ g_3)^2}{h_1+\cdots+h_3 + \lambda} + \frac{(g_4 +\cdots+ g_6)^2}{h_4+\cdots+h_6 + \lambda} - \frac{(g_1+\cdots+g_6)^2}{h_1+\cdots+h_6+\lambda}\right] - \gamma \end{aligned}$$
This example shows that the criterion for splitting a node is $\max\{\tilde{\mathcal{L}}^{(old)} - \tilde{\mathcal{L}}^{(new)}\}$, i.e.:
$$\mathcal{L}_{\text{split}}=\frac{1}{2}\left[\frac{\left(\sum_{i \in I_{L}} g_{i}\right)^{2}}{\sum_{i \in I_{L}} h_{i}+\lambda}+\frac{\left(\sum_{i \in I_{R}} g_{i}\right)^{2}}{\sum_{i \in I_{R}} h_{i}+\lambda}-\frac{\left(\sum_{i \in I} g_{i}\right)^{2}}{\sum_{i \in I} h_{i}+\lambda}\right]-\gamma$$
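As a minimal helper (an illustration, not XGBoost's internal code), $\mathcal{L}_{\text{split}}$ can be computed directly from the gradient statistics; the function assumes NumPy arrays g and h for the samples in the parent node and a boolean mask selecting the left child:
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children (the L_split criterion above)."""
    GL, HL = g[left_mask].sum(), h[left_mask].sum()      # left-child statistics
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()    # right-child statistics
    G, H = GL + GR, HL + HR                              # parent statistics
    return 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)) - gamma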
Exact greedy split-finding algorithm:
When XGBoost grows a new tree, the most basic operation is node splitting, and the key step in node splitting is finding the best feature and the best split point; the leaf is then split on that feature at that point. One way to select them is as follows: enumerate all candidate features and all candidate split points, compute $\mathcal{L}_{\text{split}}$ for each, and choose the feature and split point with the largest $\mathcal{L}_{\text{split}}$. This is called the exact greedy algorithm. It is a heuristic, because at each split it only picks the currently best splitting strategy rather than a globally optimal one. A sketch of the computation is given below:
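The sketch below illustrates the idea only, not the library's implementation (which works on pre-sorted column blocks with caching and parallelism); it assumes NumPy arrays X (features of the samples in the node) and g, h (their gradient statistics):
import numpy as np

def exact_greedy_split(X, g, h, lam=1.0, gamma=0.0):
    """Return (best_gain, best_feature, best_threshold) by enumerating every split."""
    n, m = X.shape
    G, H = g.sum(), h.sum()                         # parent-node statistics
    best = (-np.inf, None, None)
    for j in range(m):                              # enumerate candidate features
        order = np.argsort(X[:, j])                 # scan feature j in sorted order
        GL = HL = 0.0
        for k in range(n - 1):                      # enumerate candidate split positions
            i = order[k]
            GL += g[i]; HL += h[i]                  # accumulate left-child statistics
            if X[order[k], j] == X[order[k + 1], j]:
                continue                            # cannot split between equal values
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                          - G**2 / (H + lam)) - gamma
            if gain > best[0]:
                thr = 0.5 * (X[order[k], j] + X[order[k + 1], j])
                best = (gain, j, thr)
    return best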
Histogram-based approximate algorithm:
The exact greedy algorithm is very effective for choosing the best feature and split point: it evaluates every split point of every feature and picks the best one, so the model fits the training data well. However, when the data cannot be fully loaded into memory, the exact greedy algorithm becomes very inefficient, because it has to keep swapping data between memory and disk, which is very time-consuming; the same problem appears in distributed environments. To choose the best feature and split point more efficiently, XGBoost proposes an approximate algorithm. The main idea of the histogram-based approximate algorithm is: when searching for the best split point of a feature, first bucket the feature's split points by quantiles (e.g., percentiles) to obtain a set of candidate split points, so that every split point of the feature is assigned to a bucket; then, for each bucket, accumulate the statistics G and H to form a histogram, where G is the sum of the first-order statistics g of the samples in the bucket and H is the sum of their second-order statistics h; finally, among all candidate features and candidate split points, choose the one whose bucket statistics give the largest gain as the best feature and split point. A simplified sketch of the idea is given below:
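Again as an illustration only (the real algorithm uses a weighted quantile sketch with h as sample weights, and has global and local proposal variants), here is a simplified single-feature version assuming NumPy arrays x (one feature column), g and h:
import numpy as np

def histogram_split(x, g, h, n_bins=16, lam=1.0, gamma=0.0):
    """Approximate split search on one feature using quantile buckets."""
    cuts = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])  # candidate split points at quantiles
    bins = np.searchsorted(cuts, x)                             # bucket index of every sample
    Gb = np.bincount(bins, weights=g, minlength=n_bins)         # per-bucket sum of g (the "G" histogram)
    Hb = np.bincount(bins, weights=h, minlength=n_bins)         # per-bucket sum of h (the "H" histogram)
    G, H = Gb.sum(), Hb.sum()
    best_gain, best_cut = -np.inf, None
    GL = HL = 0.0
    for b in range(len(cuts)):                                  # splits are tried only at bucket boundaries
        GL += Gb[b]; HL += Hb[b]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_cut = gain, cuts[b]
    return best_gain, best_cut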
That covers the theory behind XGBoost; next, let us walk through the XGBoost system in practice:
Official documentation
A practitioner's summary
Dataset download: mushroom
Data conversion
Contents of the agaricus-lepiota.names file:
1. Title: Mushroom Database
2. Sources:
(a) Mushroom records drawn from The Audubon Society Field Guide to North
American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred
A. Knopf
(b) Donor: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
(c) Date: 27 April 1987
3. Past Usage:
1. Schlimmer,J.S. (1987). Concept Acquisition Through Representational
Adjustment (Technical Report 87-19). Doctoral disseration, Department
of Information and Computer Science, University of California, Irvine.
--- STAGGER: asymptoted to 95% classification accuracy after reviewing
1000 instances.
2. Iba,W., Wogulis,J., & Langley,P. (1988). Trading off Simplicity
and Coverage in Incremental Concept Learning. In Proceedings of
the 5th International Conference on Machine Learning, 73-79.
Ann Arbor, Michigan: Morgan Kaufmann.
-- approximately the same results with their HILLARY algorithm
3. In the following references a set of rules (given below) were
learned for this data set which may serve as a point of
comparison for other researchers.
Duch W, Adamczak R, Grabczewski K (1996) Extraction of logical rules
from training data using backpropagation networks, in: Proc. of the
The 1st Online Workshop on Soft Computing, 19-30.Aug.1996, pp. 25-30,
available on-line at: http://www.bioele.nuee.nagoya-u.ac.jp/wsc1/
Duch W, Adamczak R, Grabczewski K, Ishikawa M, Ueda H, Extraction of
crisp logical rules using constrained backpropagation networks -
comparison of two new approaches, in: Proc. of the European Symposium
on Artificial Neural Networks (ESANN'97), Bruge, Belgium 16-18.4.1997,
pp. xx-xx
Wlodzislaw Duch, Department of Computer Methods, Nicholas Copernicus
University, 87-100 Torun, Grudziadzka 5, Poland
e-mail: [email protected]
WWW http://www.phys.uni.torun.pl/kmk/
Date: Mon, 17 Feb 1997 13:47:40 +0100
From: Wlodzislaw Duch
Organization: Dept. of Computer Methods, UMK
I have attached a file containing logical rules for mushrooms.
It should be helpful for other people since only in the last year I
have seen about 10 papers analyzing this dataset and obtaining quite
complex rules. We will try to contribute other results later.
With best regards, Wlodek Duch
________________________________________________________________
Logical rules for the mushroom data sets.
Logical rules given below seem to be the simplest possible for the
mushroom dataset and therefore should be treated as benchmark results.
Disjunctive rules for poisonous mushrooms, from most general
to most specific:
P_1) odor=NOT(almond.OR.anise.OR.none)
120 poisonous cases missed, 98.52% accuracy
P_2) spore-print-color=green
48 cases missed, 99.41% accuracy
P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND.
(stalk-color-above-ring=NOT.brown)
8 cases missed, 99.90% accuracy
P_4) habitat=leaves.AND.cap-color=white
100% accuracy
Rule P_4) may also be
P_4' ) population=clustered.AND.cap_color=white
These rule involve 6 attributes (out of 22). Rules for edible
mushrooms are obtained as negation of the rules given above, for
example the rule:
odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green
gives 48 errors, or 99.41% accuracy on the whole dataset.
Several slightly more complex variations on these rules exist,
involving other attributes, such as gill_size, gill_spacing,
stalk_surface_above_ring, but the rules given above are the simplest
we have found.
4. Relevant Information:
This data set includes descriptions of hypothetical samples
corresponding to 23 species of gilled mushrooms in the Agaricus and
Lepiota Family (pp. 500-525). Each species is identified as
definitely edible, definitely poisonous, or of unknown edibility and
not recommended. This latter class was combined with the poisonous
one. The Guide clearly states that there is no simple rule for
determining the edibility of a mushroom; no rule like ``leaflets
three, let it be'' for Poisonous Oak and Ivy.
5. Number of Instances: 8124
6. Number of Attributes: 22 (all nominally valued)
7. Attribute Information: (classes: edible=e, poisonous=p)
1. cap-shape: bell=b,conical=c,convex=x,flat=f,
knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,
pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,
musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,
green=r,orange=o,pink=p,purple=u,red=e,
white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e,
rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,
pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,
pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,
none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,
orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n,
scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p,
urban=u,waste=w,woods=d
8. Missing Attribute Values: 2480 of them (denoted by "?"), all for
attribute #11.
9. Class Distribution:
-- edible: 4208 (51.8%)
-- poisonous: 3916 (48.2%)
-- total: 8124 instances
#!/usr/bin/python
# Build a feature map from the attribute listing and convert the raw categorical
# data file into LibSVM-style one-hot features.
def loadfmap( fname ):
    fmap = {}
    nmap = {}

    for l in open( fname ):
        arr = l.split()
        if arr[0].find('.') != -1:
            idx = int( arr[0].strip('.') )
            assert idx not in fmap
            fmap[ idx ] = {}
            ftype = arr[1].strip(':')
            content = arr[2]
        else:
            content = arr[0]
        for it in content.split(','):
            if it.strip() == '':
                continue
            k , v = it.split('=')
            fmap[ idx ][ v ] = len(nmap)
            nmap[ len(nmap) ] = ftype+'='+k
    return fmap, nmap

def write_nmap( fo, nmap ):
    for i in range( len(nmap) ):
        fo.write('%d\t%s\ti\n' % (i, nmap[i]) )

# start here
fmap, nmap = loadfmap( 'agaricus-lepiota.fmap' )
fo = open( 'featmap.txt', 'w' )
write_nmap( fo, nmap )
fo.close()

fo = open( 'agaricus.txt', 'w' )
for l in open( 'agaricus-lepiota.data' ):
    arr = l.split(',')
    if arr[0] == 'p':          # poisonous -> label 1
        fo.write('1')
    else:
        assert arr[0] == 'e'   # edible -> label 0
        fo.write('0')
    for i in range( 1,len(arr) ):
        fo.write( ' %d:1' % fmap[i][arr[i].strip()] )
    fo.write('\n')
fo.close()
#!/usr/bin/python
# Randomly split agaricus.txt into training and test files (roughly 4:1).
import sys
import random

random.seed( 10 )
k = 1
nfold = 5
fi = open( 'agaricus.txt', 'r' )
ftr = open( 'agaricus.txt'+'.train', 'w' )
fte = open( 'agaricus.txt'+'.test', 'w' )
for l in fi:
    if random.randint( 1 , nfold ) == k:
        fte.write( l )
    else:
        ftr.write( l )
fi.close()
ftr.close()
fte.close()
# Getting started with the native XGBoost library:
import xgboost as xgb  # import the library
# read in data
dtrain = xgb.DMatrix('agaricus.txt.train')  # XGBoost's own data format; DataFrames and ndarrays also work
dtest = xgb.DMatrix('agaricus.txt.test')    # XGBoost's own data format; DataFrames and ndarrays also work
# specify parameters via map
param = {'max_depth':2, 'eta':1, 'objective':'binary:logistic' }  # XGBoost parameters, passed as a dict
num_round = 2  # number of boosting rounds (trees to build)
bst = xgb.train(param, dtrain, num_round)  # train
# make prediction
preds = bst.predict(dtest)  # predict
Output:
[18:36:54] 6541x126 matrix with 143902 entries loaded from agaricus.txt.train
[18:36:54] 1583x126 matrix with 34826 entries loaded from agaricus.txt.test
[18:36:54] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=2
[18:36:54] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
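To sanity-check the demo model, the predicted probabilities can be compared against the labels stored in the test DMatrix (a small addition, not part of the original demo):
import numpy as np

labels = dtest.get_label()                # true 0/1 labels stored in the test DMatrix
error = np.mean((preds > 0.5) != labels)  # threshold the predicted probabilities at 0.5
print('test error rate: %.4f' % error)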
XGBoost parameter settings (names in parentheses are the corresponding sklearn-interface parameter names):
Recommended blog post
Recommended official documentation
XGBoost's parameters fall into three groups:
General parameters: (there are two types of booster; tree boosters perform far better than the linear booster, so the linear one is rarely used.)
Task parameters: (these control the optimization objective and the metric used to evaluate each step.)
Command-line parameters: (not covered here, since the command-line version is rarely used.)
Notes on tuning XGBoost:
General steps of parameter tuning; a concrete cross-validation sketch is shown below:
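As one instance of the usual workflow (fix a learning rate, let cross-validation with early stopping pick the number of rounds, then tune tree-structure and regularization parameters), xgb.cv can be used as below; this is a sketch that reuses the dtrain DMatrix built from the agaricus data above, and the parameter values are placeholders rather than recommendations:
import xgboost as xgb

cv_params = {'max_depth': 4, 'eta': 0.1, 'objective': 'binary:logistic', 'eval_metric': 'auc'}
cv_results = xgb.cv(cv_params, dtrain,
                    num_boost_round=500,
                    nfold=5,
                    early_stopping_rounds=20,  # stop once the cross-validated metric stops improving
                    seed=42)
print('best number of rounds:', len(cv_results))  # rows kept by early stopping = best round count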
Installing XGBoost:
pip install xgboost
Data interfaces (DMatrix, the data format XGBoost works with):
# 1. LibSVM text format files
dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer')
# 2. CSV files (must not contain categorical text columns; encode them first, e.g. one-hot)
dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
# 3. NumPy arrays
data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
label = np.random.randint(2, size=5)  # binary target
dtrain = xgb.DMatrix(data, label=label)
# 4. scipy.sparse matrices
csr = scipy.sparse.csr_matrix((dat, (row, col)))
dtrain = xgb.DMatrix(csr)
# 5. pandas DataFrame
data = pandas.DataFrame(np.arange(12).reshape((4,3)), columns=['a', 'b', 'c'])
label = pandas.DataFrame(np.random.randint(2, size=4))
dtrain = xgb.DMatrix(data, label=label)
Tip: saving the DMatrix to an XGBoost binary file first makes subsequent loading much faster:
# 1. Save a DMatrix to an XGBoost binary file
dtrain = xgb.DMatrix('train.svm.txt')
dtrain.save_binary('train.buffer')
# 2. Missing values can be replaced by a default value passed to the DMatrix constructor:
dtrain = xgb.DMatrix(data, label=label, missing=-999.0)
# 3. Sample weights can be set when needed:
w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)
How to set the parameters:
import pandas as pd
# load and preprocess the data
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',header=None)
df_wine.columns = ['Class label', 'Alcohol','Malic acid', 'Ash','Alcalinity of ash','Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols','Proanthocyanins','Color intensity', 'Hue','OD280/OD315 of diluted wines','Proline']
df_wine = df_wine[df_wine['Class label'] != 1]  # drop class 1
y = df_wine['Class label'].values
X = df_wine[['Alcohol','OD280/OD315 of diluted wines']].values
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.preprocessing import LabelEncoder  # encode the class labels
le = LabelEncoder()
y = le.fit_transform(y)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1,stratify=y)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test)
# 1. Booster parameters
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 2,                # number of classes, used together with multi:softmax (this wine subset has two classes)
    'gamma': 0.1,                  # minimum loss reduction required for a split; larger is more conservative (typically 0.1-0.2)
    'max_depth': 12,               # tree depth; larger values overfit more easily
    'lambda': 2,                   # L2 regularization on leaf weights; larger values make overfitting less likely
    'subsample': 0.7,              # row subsampling of the training data
    'colsample_bytree': 0.7,       # column subsampling when building each tree
    'min_child_weight': 3,
    'silent': 1,                   # 1 suppresses training output; set to 0 to see it
    'eta': 0.007,                  # learning rate
    'seed': 1000,
    'nthread': 4,                  # number of CPU threads
    'eval_metric': 'auc'
}
plst = list(params.items())
# evallist = [(dtest, 'eval'), (dtrain, 'train')]  # optional validation sets
Training:
# 2. Train
num_round = 10
bst = xgb.train(plst, dtrain, num_round)
# bst = xgb.train(plst, dtrain, num_round, evallist)
Saving the model:
# 3. Save the model
bst.save_model('0001.model')
# dump model
bst.dump_model('dump.raw.txt')
# dump model with feature map
#bst.dump_model('dump.raw.txt', 'featmap.txt')
Loading a saved model:
# 4. Load a saved model:
bst = xgb.Booster({'nthread': 4})  # init model
bst.load_model('0001.model')  # load the saved model
Setting up early stopping:
# 5. Early stopping can also be used (a validation set is required)
xgb.train(..., evals=evals, early_stopping_rounds=10)
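A fuller sketch, assuming the X_train/y_train/X_test/y_test arrays and the dtrain DMatrix defined above; the evaluation set must carry labels for early stopping to work, and es_params here is a hypothetical parameter dict with an explicit eval_metric:
es_params = {'objective': 'multi:softmax', 'num_class': 2, 'eta': 0.1, 'max_depth': 6, 'eval_metric': 'mlogloss'}
deval = xgb.DMatrix(X_test, label=y_test)        # labelled evaluation set
evallist = [(dtrain, 'train'), (deval, 'eval')]  # the last entry drives early stopping
bst = xgb.train(es_params, dtrain,
                num_boost_round=1000,
                evals=evallist,
                early_stopping_rounds=10)        # stop after 10 rounds with no improvement on 'eval'
print(bst.best_iteration, bst.best_score)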
Prediction:
# 6. Predict
ypred = bst.predict(dtest)
Plotting feature importance:
# 1. Plot feature importance
xgb.plot_importance(bst)
# 2. Plot an individual tree
#xgb.plot_tree(bst, num_trees=2)
# 3. Convert a tree to graphviz with xgboost.to_graphviz()
#xgb.to_graphviz(bst, num_trees=2)
from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score  # accuracy metric
# load the sample dataset
iris = load_iris()
X,y = iris.data,iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234565)  # split the dataset
# algorithm parameters
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'gamma': 0.1,
    'max_depth': 6,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.75,
    'min_child_weight': 3,
    'silent': 0,
    'eta': 0.1,
    'seed': 1,
    'nthread': 4,
}
plst = list(params.items())
dtrain = xgb.DMatrix(X_train, y_train)  # build the training DMatrix
num_rounds = 500
model = xgb.train(plst, dtrain, num_rounds)  # train the XGBoost model
# predict on the test set
dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)
# compute the accuracy
accuracy = accuracy_score(y_test,y_pred)
print("accuracy: %.2f%%" % (accuracy*100.0))
# plot feature importance
plot_importance(model)
plt.show()
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
# load the dataset
boston = load_boston()
X,y = boston.data,boston.target
# XGBoost training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
params = {
    'booster': 'gbtree',
    'objective': 'reg:squarederror',
    'gamma': 0.1,
    'max_depth': 5,
    'lambda': 3,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'silent': 1,
    'eta': 0.1,
    'seed': 1000,
    'nthread': 4,
}
dtrain = xgb.DMatrix(X_train, y_train)
num_rounds = 300
plst = list(params.items())
model = xgb.train(plst, dtrain, num_rounds)
# predict on the test set
dtest = xgb.DMatrix(X_test)
ans = model.predict(dtest)
# plot feature importance
plot_importance(model)
plt.show()
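mean_squared_error is imported above but never used; as a small addition, the regression predictions can be evaluated like this:
mse = mean_squared_error(y_test, ans)
print('MSE: %.4f, RMSE: %.4f' % (mse, mse ** 0.5))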
import xgboost as xgb
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
iris = load_iris()
X,y = iris.data,iris.target
col = iris.target_names
train_x, valid_x, train_y, valid_y = train_test_split(X, y, test_size=0.3, random_state=1)  # training/validation split
parameters = {
    'max_depth': [5, 10, 15, 20, 25],
    'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
    'n_estimators': [500, 1000, 2000, 3000, 5000],
    'min_child_weight': [0, 2, 5, 10, 20],
    'max_delta_step': [0, 0.2, 0.6, 1, 2],
    'subsample': [0.6, 0.7, 0.8, 0.85, 0.95],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9],
    'reg_alpha': [0, 0.25, 0.5, 0.75, 1],
    'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1],
    'scale_pos_weight': [0.2, 0.4, 0.6, 0.8, 1]
}
xlf = xgb.XGBClassifier(max_depth=10,
                        learning_rate=0.01,
                        n_estimators=2000,
                        silent=True,
                        objective='multi:softmax',
                        num_class=3,
                        nthread=-1,
                        gamma=0,
                        min_child_weight=1,
                        max_delta_step=0,
                        subsample=0.85,
                        colsample_bytree=0.7,
                        colsample_bylevel=1,
                        reg_alpha=0,
                        reg_lambda=1,
                        scale_pos_weight=1,
                        seed=0,
                        missing=None)
gs = GridSearchCV(xlf, param_grid=parameters, scoring='accuracy', cv=3)
gs.fit(train_x, train_y)
print("Best score: %0.3f" % gs.best_score_)
print("Best parameters set: %s" % gs.best_params_ )
Best score: 0.933
Best parameters set: {'max_depth': 5}
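Since GridSearchCV refits the best parameter combination on the training split by default (refit=True), the tuned model can be evaluated directly on the held-out data, for example:
from sklearn.metrics import accuracy_score

best_model = gs.best_estimator_               # refitted with the best parameters found
pred_y = best_model.predict(valid_x)
print('validation accuracy: %.3f' % accuracy_score(valid_y, pred_y))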
LightGBM, like XGBoost, is a gradient boosting ensemble algorithm. On the whole it works the same way as XGBoost and is essentially the same algorithm, only with further optimizations on top of XGBoost, so we will not repeat the underlying theory; instead, let us look at how the algorithms differ.
Advantages of LightGBM:
1) Faster training
2) Lower memory usage
3) Higher accuracy
4) Support for parallel learning
5) Ability to handle large-scale data
1. Speed comparison:
2. Accuracy comparison:
3. Memory-usage comparison:
LightGBM parameter reference (a minimal usage sketch follows the list below):
Recommended doc 1: https://lightgbm.apachecn.org/#/docs/6
Recommended doc 2: https://lightgbm.readthedocs.io/en/latest/Parameters.html
1. Core parameters (names in parentheses are aliases)
2. Parameters controlling the learning process
3. Metric parameters
4. GPU parameters
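A minimal native-API sketch showing where the core, learning-control, and metric parameters go; the data and the parameter values here are placeholders for illustration only:
import lightgbm as lgb
import numpy as np

X = np.random.rand(200, 5)
y = np.random.randint(2, size=200)
train_set = lgb.Dataset(X, label=y)

params = {
    'objective': 'binary',     # core parameter: the learning task
    'boosting_type': 'gbdt',   # core parameter: the boosting algorithm
    'metric': 'auc',           # metric parameter
    'num_leaves': 31,          # learning-control parameter: tree complexity
    'learning_rate': 0.1,      # learning-control parameter
}
booster = lgb.train(params, train_set, num_boost_round=50)
pred = booster.predict(X[:5])  # predicted probabilities for the first five rows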
Tuning LightGBM with a grid-search-style loop:
Reference code: https://blog.csdn.net/u012735708/article/details/83749703
import lightgbm as lgb
import pandas as pd
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
canceData=load_breast_cancer()
X=canceData.data
y=canceData.target
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0,test_size=0.2)
### data conversion
print('Converting data')
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,free_raw_data=False)
### initial parameters (cross-validation parameters not included)
print('Setting parameters')
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'nthread': 4,
    'learning_rate': 0.1
}
### cross-validation (tuning)
print('Cross-validation')
max_auc = float('0')
best_params = {}
# accuracy
print("Tuning step 1: improve accuracy")
for num_leaves in range(5,100,5):
    for max_depth in range(3,8,1):
        params['num_leaves'] = num_leaves
        params['max_depth'] = max_depth
        cv_results = lgb.cv(
            params,
            lgb_train,
            seed=1,
            nfold=5,
            metrics=['auc'],
            early_stopping_rounds=10,
            verbose_eval=True
        )
        mean_auc = pd.Series(cv_results['auc-mean']).max()
        boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['num_leaves'] = num_leaves
            best_params['max_depth'] = max_depth
if 'num_leaves' in best_params and 'max_depth' in best_params:
    params['num_leaves'] = best_params['num_leaves']
    params['max_depth'] = best_params['max_depth']
# overfitting
print("Tuning step 2: reduce overfitting")
for max_bin in range(5,256,10):
    for min_data_in_leaf in range(1,102,10):
        params['max_bin'] = max_bin
        params['min_data_in_leaf'] = min_data_in_leaf
        cv_results = lgb.cv(
            params,
            lgb_train,
            seed=1,
            nfold=5,
            metrics=['auc'],
            early_stopping_rounds=10,
            verbose_eval=True
        )
        mean_auc = pd.Series(cv_results['auc-mean']).max()
        boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['max_bin'] = max_bin
            best_params['min_data_in_leaf'] = min_data_in_leaf
if 'max_bin' in best_params and 'min_data_in_leaf' in best_params:
    params['min_data_in_leaf'] = best_params['min_data_in_leaf']
    params['max_bin'] = best_params['max_bin']
print("调参3:降低过拟合")
for feature_fraction in [0.6,0.7,0.8,0.9,1.0]:
for bagging_fraction in [0.6,0.7,0.8,0.9,1.0]:
for bagging_freq in range(0,50,5):
params['feature_fraction'] = feature_fraction
params['bagging_fraction'] = bagging_fraction
params['bagging_freq'] = bagging_freq
cv_results = lgb.cv(
params,
lgb_train,
seed=1,
nfold=5,
metrics=['auc'],
early_stopping_rounds=10,
verbose_eval=True
)
mean_auc = pd.Series(cv_results['auc-mean']).max()
boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
if mean_auc >= max_auc:
max_auc=mean_auc
best_params['feature_fraction'] = feature_fraction
best_params['bagging_fraction'] = bagging_fraction
best_params['bagging_freq'] = bagging_freq
if 'feature_fraction' and 'bagging_fraction' and 'bagging_freq' in best_params.keys():
params['feature_fraction'] = best_params['feature_fraction']
params['bagging_fraction'] = best_params['bagging_fraction']
params['bagging_freq'] = best_params['bagging_freq']
print("调参4:降低过拟合")
for lambda_l1 in [1e-5,1e-3,1e-1,0.0,0.1,0.3,0.5,0.7,0.9,1.0]:
for lambda_l2 in [1e-5,1e-3,1e-1,0.0,0.1,0.4,0.6,0.7,0.9,1.0]:
params['lambda_l1'] = lambda_l1
params['lambda_l2'] = lambda_l2
cv_results = lgb.cv(
params,
lgb_train,
seed=1,
nfold=5,
metrics=['auc'],
early_stopping_rounds=10,
verbose_eval=True
)
mean_auc = pd.Series(cv_results['auc-mean']).max()
boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
if mean_auc >= max_auc:
max_auc=mean_auc
best_params['lambda_l1'] = lambda_l1
best_params['lambda_l2'] = lambda_l2
if 'lambda_l1' and 'lambda_l2' in best_params.keys():
params['lambda_l1'] = best_params['lambda_l1']
params['lambda_l2'] = best_params['lambda_l2']
print("调参5:降低过拟合2")
for min_split_gain in [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]:
params['min_split_gain'] = min_split_gain
cv_results = lgb.cv(
params,
lgb_train,
seed=1,
nfold=5,
metrics=['auc'],
early_stopping_rounds=10,
verbose_eval=True
)
mean_auc = pd.Series(cv_results['auc-mean']).max()
boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
if mean_auc >= max_auc:
max_auc=mean_auc
best_params['min_split_gain'] = min_split_gain
if 'min_split_gain' in best_params.keys():
params['min_split_gain'] = best_params['min_split_gain']
print(best_params)
Output:
{'bagging_fraction': 0.7,
 'bagging_freq': 30,
 'feature_fraction': 0.8,
 'lambda_l1': 0.1,
 'lambda_l2': 0.0,
 'max_bin': 255,
 'max_depth': 4,
 'min_data_in_leaf': 81,
 'min_split_gain': 0.1,
 'num_leaves': 10}
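With the tuned values merged back into params, one possible final step (not part of the original reference code) is to train on the full training Dataset and check the AUC on the held-out split:
model = lgb.train(params, lgb_train, num_boost_round=100)
y_prob = model.predict(X_test)                             # predicted probabilities
print('test AUC: %.4f' % metrics.roc_auc_score(y_test, y_prob))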
In this chapter we focused on boosting-based ensemble methods: AdaBoost, which is driven by classification error; boosting trees, which improve the model by fitting residuals; GBDT, which fits negative gradients; XGBoost, which relies on a second-order Taylor approximation; and LightGBM. In real competitions and engineering practice, boosting-based ensembles are highly effective and extremely widely used. Readers who want to go deeper are encouraged to study the original papers and reproduce them. In the next chapter we turn to another ensemble strategy, stacking; although it is not as widely used as boosting, it may well make your models stand out in a competition.