AdaBoost Algorithm (Part 2): Theoretical Derivation


Ensemble learning series posts:

  • Ensemble Learning Basics (ensemble learning)
  • Random Forest (random forest)
  • AdaBoost Algorithm (Part 1): Fundamentals
  • AdaBoost Algorithm (Part 2): Theoretical Derivation

The previous post, AdaBoost Algorithm (Part 1): Fundamentals, covered the basics and the workings of AdaBoost in detail. If you only want to understand how AdaBoost works, that post alone is enough; and if formulas make your head spin, you can skip this one as well. The derivation here is admittedly dry, and it matters little to readers who simply want to apply AdaBoost rather than study or improve it, so reading only AdaBoost Algorithm (Part 1): Fundamentals is completely fine.
This post covers the theory behind AdaBoost from the following two angles, and is written based on Li Hang's *Statistical Learning Methods*.

  1. Training error analysis of the AdaBoost algorithm
  2. Interpreting AdaBoost as an additive model

1. Training Error Analysis of the AdaBoost Algorithm

Let us first analyze the training error of the AdaBoost algorithm. Most of the formulas below are taken directly from Li Hang's *Statistical Learning Methods*; I have only added some intermediate steps and explanations. As noted in the previous post, AdaBoost's training error is $\frac{1}{N}\sum_{i=1}^{N}I(G(x_i) \neq y_i)$. Does this error have an upper bound? The book gives one, together with a proof: the training error of the final classifier is bounded as follows.
$$\frac{1}{N}\sum_{i=1}^{N}I(G(x_i) \neq y_i) \leq \frac{1}{N}\sum_{i}\exp(-y_i f(x_i)) = \prod_{m}Z_m \tag{1}$$

Here $G(x)$ denotes the final classifier, $G_m(x)$ the base classifier of the $m$-th round, and $f(x)$ the linear combination of base classifiers, i.e. $f(x)=\sum_{m=1}^{M}\alpha_m G_m(x)$; $Z_m$ is the normalization factor from the previous post, $Z_m=\sum_{i=1}^{N}w_{mi}\exp(-\alpha_m y_i G_m(x_i))$. We now prove inequality (1).
When $G(x_i) \neq y_i$ we have $y_i f(x_i) < 0$, so $\exp(-y_i f(x_i)) \ge 1$, which immediately gives $\frac{1}{N}\sum_{i=1}^{N}I(G(x_i) \neq y_i) \leq \frac{1}{N}\sum_{i}\exp(-y_i f(x_i))$.
It remains to show that $\frac{1}{N}\sum_{i}\exp(-y_i f(x_i)) = \prod_{m}Z_m$. From the formulas in AdaBoost Algorithm (Part 1): Fundamentals we know that $w_{mi}\exp(-\alpha_m y_i G_m(x_i)) = Z_m w_{m+1,i}$.
Proof:

$$\begin{aligned}
\frac{1}{N}\sum_{i}\exp(-y_i f(x_i)) &= \frac{1}{N}\sum_{i}\exp\Big(-\sum_{m=1}^{M}\alpha_m y_i G_m(x_i)\Big) \\
&= \sum_{i}w_{1i}\exp[-\alpha_1 y_i G_1(x_i) - \alpha_2 y_i G_2(x_i) - \cdots - \alpha_M y_i G_M(x_i)] \\
&= \sum_{i}w_{1i}\prod_{m=1}^{M}\exp(-\alpha_m y_i G_m(x_i)) \\
&= Z_1\sum_{i}w_{2i}\prod_{m=2}^{M}\exp(-\alpha_m y_i G_m(x_i)) \\
&= Z_1 Z_2\sum_{i}w_{3i}\prod_{m=3}^{M}\exp(-\alpha_m y_i G_m(x_i)) \\
&= \cdots \\
&= Z_1 Z_2 \cdots Z_{M-1}\sum_{i}w_{Mi}\exp(-\alpha_M y_i G_M(x_i)) \\
&= \prod_{m}Z_m
\end{aligned} \tag{2}$$

The theorem above tells us that if, in each round, we choose the $G_m$ that makes $Z_m$ as small as possible, the training error of AdaBoost decreases fastest. A quick way to see bound (1) in action is to compare its two sides numerically, as in the sketch below.
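Here is a minimal numerical sketch of bound (1): a tiny AdaBoost with decision stumps on a synthetic 1-D dataset, comparing the final training error against $\prod_m Z_m$. The dataset, the stump learner, and the round count `M` are illustrative assumptions of mine, not part of the book's derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.uniform(-1, 1, N)
y = np.where(X + 0.3 * rng.normal(size=N) > 0, 1, -1)   # noisy 1-D labels in {-1, +1}

def fit_stump(X, y, w):
    """Return (threshold, polarity, weighted error) of the best decision stump."""
    best = (0.0, 1, np.inf)
    for thr in np.unique(X):
        for pol in (1, -1):
            pred = np.where(X > thr, pol, -pol)
            err = np.sum(w[pred != y])
            if err < best[2]:
                best = (thr, pol, err)
    return best

M = 10
w = np.full(N, 1.0 / N)      # initial weights w_{1i} = 1/N
f = np.zeros(N)              # accumulated scores f(x_i)
prod_Z = 1.0
for m in range(M):
    thr, pol, e_m = fit_stump(X, y, w)       # weighted error e_m (weights sum to 1)
    e_m = min(max(e_m, 1e-12), 1 - 1e-12)    # guard against log(0)
    alpha = 0.5 * np.log((1 - e_m) / e_m)
    pred = np.where(X > thr, pol, -pol)
    f += alpha * pred
    unnorm = w * np.exp(-alpha * y * pred)
    Z_m = unnorm.sum()                       # normalization factor Z_m
    w = unnorm / Z_m                         # updated weights w_{m+1,i}
    prod_Z *= Z_m

train_err = np.mean(np.sign(f) != y)
print(f"training error = {train_err:.4f}  <=  prod Z_m = {prod_Z:.4f}")
```

On this toy data the printed training error stays below $\prod_m Z_m$, exactly as (1) predicts.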
In particular, for binary classification the training error of AdaBoost decreases at an exponential rate:
$$\prod_{m=1}^{M}Z_m = \prod_{m=1}^{M}\Big[2\sqrt{e_m(1-e_m)}\Big] = \prod_{m=1}^{M}\sqrt{1-4\gamma_m^2} \leq \exp\Big(-2\sum_{m=1}^{M}\gamma_m^2\Big) \tag{3}$$
where $\gamma_m = \frac{1}{2} - e_m$.
We now prove (3).
$$\begin{aligned}
Z_m &= \sum_{i=1}^{N}w_{mi}\exp(-\alpha_m y_i G_m(x_i)) \\
&= \sum_{y_i = G_m(x_i)}w_{mi}\,e^{-\alpha_m} + \sum_{y_i \neq G_m(x_i)}w_{mi}\,e^{\alpha_m} \\
&= e^{-\alpha_m}\sum_{y_i = G_m(x_i)}w_{mi} + e^{\alpha_m}\sum_{y_i \neq G_m(x_i)}w_{mi} \\
&= e^{-\alpha_m}(1-e_m) + e^{\alpha_m}e_m \\
&= (1-e_m)\,e^{-\frac{1}{2}\log\frac{1-e_m}{e_m}} + e_m\,e^{\frac{1}{2}\log\frac{1-e_m}{e_m}} \\
&= 2\sqrt{e_m(1-e_m)} \\
&= \sqrt{1-4\gamma_m^2}
\end{aligned} \tag{4}$$
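A quick sanity check of the last lines of (4): plugging $\alpha_m = \frac{1}{2}\log\frac{1-e_m}{e_m}$ into $e^{-\alpha_m}(1-e_m) + e^{\alpha_m}e_m$ should collapse to $2\sqrt{e_m(1-e_m)} = \sqrt{1-4\gamma_m^2}$. The error values below are arbitrary illustrative numbers.

```python
import numpy as np

for e_m in (0.1, 0.25, 0.4, 0.49):
    alpha = 0.5 * np.log((1 - e_m) / e_m)
    Z_direct = (1 - e_m) * np.exp(-alpha) + e_m * np.exp(alpha)   # 4th line of (4)
    gamma = 0.5 - e_m
    # all three printed values agree for each e_m
    print(Z_direct, 2 * np.sqrt(e_m * (1 - e_m)), np.sqrt(1 - 4 * gamma**2))
```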
Next we prove the inequality
$$\prod_{m=1}^{M}\sqrt{1-4\gamma_m^2} \leq \exp\Big(-2\sum_{m=1}^{M}\gamma_m^2\Big)$$
Since
$$\exp\Big(-2\sum_{m=1}^{M}\gamma_m^2\Big) = \exp(-2\gamma_1^2)\cdot\exp(-2\gamma_2^2)\cdots\exp(-2\gamma_M^2) = \prod_{m=1}^{M}\exp(-2\gamma_m^2) \tag{5}$$
it suffices to show
$$\prod_{m=1}^{M}\sqrt{1-4\gamma_m^2} \leq \prod_{m=1}^{M}\exp(-2\gamma_m^2)$$
This can be shown by expanding $e^{-2r^2}$ and $\sqrt{1-4r^2}$ as Taylor series around $r=0$, using the standard expansions of $e^x$ and $(1+x)^\alpha$ at $x=0$:
$$e^x = 1 + x + \frac{1}{2!}x^2 + \frac{1}{3!}x^3 + \frac{1}{4!}x^4 + \cdots + \frac{1}{n!}x^n + \cdots$$
$$(1+x)^\alpha = 1 + \alpha x + \frac{\alpha(\alpha-1)}{2!}x^2 + \cdots + \frac{\alpha(\alpha-1)\cdots(\alpha-n+1)}{n!}x^n + \cdots$$
Hence the Taylor expansion of $\sqrt{1-4r^2}$ is
$$\sqrt{1+(-4r^2)} = 1 - 2r^2 - 2r^4 - \cdots$$
and the Taylor expansion of $e^{-2r^2}$ is
$$e^{-2r^2} = 1 - 2r^2 + 2r^4 - \cdots$$
Comparing the two expansions gives $\sqrt{1-4r^2} \le e^{-2r^2}$, and therefore $\prod_{m=1}^{M}\sqrt{1-4\gamma_m^2} \leq \prod_{m=1}^{M}\exp(-2\gamma_m^2)$, which completes the proof. This shows that the training error of AdaBoost decreases at an exponential rate.
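The inequality $\sqrt{1-4r^2} \le e^{-2r^2}$ is also easy to confirm numerically over the whole range that $\gamma_m = \frac{1}{2}-e_m$ can take when $0 \le e_m \le \frac{1}{2}$; a small sketch:

```python
import numpy as np

r = np.linspace(0, 0.5, 1001)           # gamma_m ranges over [0, 1/2]
lhs = np.sqrt(1 - 4 * r**2)
rhs = np.exp(-2 * r**2)
assert np.all(lhs <= rhs + 1e-12)       # sqrt(1-4r^2) <= exp(-2r^2), equality at r = 0
print(float(np.max(lhs - rhs)))         # maximum gap is <= 0
```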

2. Interpreting AdaBoost as an Additive Model

This part interprets AdaBoost using an additive model with the exponential loss function. As *Statistical Learning Methods* explains, the forward stagewise algorithm learns an additive model, and when the basis functions are base classifiers this additive model is exactly AdaBoost's final classifier
$$f(x) = \sum_{m=1}^{M}\alpha_m G_m(x)$$
The forward stagewise algorithm learns the basis functions one at a time from front to back, which matches how AdaBoost learns its base classifiers one by one. Moreover, when the loss function of the forward stagewise algorithm is the exponential loss $L(y, f(x)) = \exp[-y f(x)]$, its learning steps are equivalent to AdaBoost. Below is the derivation showing how AdaBoost is obtained from the additive model via the forward stagewise algorithm; readers who find the formulas heavy going can skip this section. Even if you do read it, I still recommend this part of *Statistical Learning Methods* itself; I have only filled in a few intermediate steps to make the formulas easier to follow.
Suppose that after $m-1$ iterations the forward stagewise algorithm has produced $f_{m-1}(x)$:
$$f_{m-1}(x) = f_{m-2}(x) + \alpha_{m-1}G_{m-1}(x) = \alpha_1 G_1(x) + \alpha_2 G_2(x) + \cdots + \alpha_{m-1}G_{m-1}(x)$$
Then in the $m$-th iteration,
$$f_m(x) = f_{m-1}(x) + \alpha_m G_m(x)$$
The goal is to find the $\alpha_m$ and $G_m(x)$ that make $f_m(x)$ minimize the exponential loss on the training set, i.e.
$$(\alpha_m, G_m(x)) = \arg\min_{\alpha, G}\sum_{i=1}^{N}\exp\big[-y_i(f_{m-1}(x_i) + \alpha G(x_i))\big] \tag{6}$$
Writing $\bar{w}_{mi}=\exp[-y_i f_{m-1}(x_i)]$, equation (6) becomes
$$(\alpha_m, G_m(x)) = \arg\min_{\alpha, G}\sum_{i=1}^{N}\bar{w}_{mi}\exp[-y_i \alpha G(x_i)] \tag{7}$$
Since $\bar{w}_{mi}$ depends on neither $\alpha$ nor $G$, it can be treated as a constant during the minimization.
We now work out the $\alpha^*_m$ and $G^*_m(x)$ that minimize (7).

First consider $G^*_m(x)$. For any fixed $\alpha > 0$, the factor $\exp(-y_i\alpha G(x_i))$ takes only two values: $e^{-\alpha}$ when $G$ classifies $x_i$ correctly and $e^{\alpha}$ when it does not. Minimizing (7) therefore amounts to putting as little weight as possible on the misclassified samples, i.e. $G^*_m(x)$ must have the smallest weighted classification error. Hence
$$G^*_m(x) = \arg\min_{G}\sum_{i=1}^{N}\bar{w}_{mi}I(y_i \neq G(x_i)) \tag{8}$$
This classifier $G^*_m(x)$ is exactly AdaBoost's base classifier $G_m(x)$, because it is the base classifier that minimizes the weighted classification error on the training data in the $m$-th round.
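The step from (7) to (8) can also be seen numerically: for fixed $\alpha>0$ the exponential objective is an increasing affine function of the weighted error, so both criteria pick the same classifier. In the sketch below the candidate classifiers are hypothetical threshold rules on random data, chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
X = rng.uniform(-1, 1, N)
y = np.where(X > 0.1, 1, -1)
w_bar = rng.dirichlet(np.ones(N))        # arbitrary positive weights \bar w_{mi}
alpha = 0.7

# a finite family of candidate base classifiers G (threshold rules)
thresholds = np.linspace(-1, 1, 21)
exp_loss, wgt_err = [], []
for t in thresholds:
    G = np.where(X > t, 1, -1)
    exp_loss.append(np.sum(w_bar * np.exp(-alpha * y * G)))   # objective (7)
    wgt_err.append(np.sum(w_bar * (G != y)))                  # objective (8)

print(int(np.argmin(exp_loss)), int(np.argmin(wgt_err)))      # same minimizing index
```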

Next we solve for $\alpha^*_m$:
$$\begin{aligned}
\sum_{i=1}^{N}\bar{w}_{mi}\exp(-y_i \alpha G(x_i)) &= \sum_{y_i = G_m(x_i)}\bar{w}_{mi}\,e^{-\alpha} + \sum_{y_i \neq G_m(x_i)}\bar{w}_{mi}\,e^{\alpha} \\
&= (e^{\alpha} - e^{-\alpha})\sum_{i=1}^{N}\bar{w}_{mi}I(y_i \neq G(x_i)) + e^{-\alpha}\sum_{i=1}^{N}\bar{w}_{mi}
\end{aligned} \tag{9}$$
Differentiating (9) with respect to $\alpha$ and setting the derivative to zero yields
$$\alpha^*_m = \frac{1}{2}\log\frac{1-e_m}{e_m} \tag{10}$$
where
$$e_m = \frac{\sum_{i=1}^{N}\bar{w}_{mi}I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N}\bar{w}_{mi}} \tag{11}$$
Equation (11) also equals $\sum_{i=1}^{N}w_{mi}I(y_i \neq G_m(x_i))$ because the AdaBoost weights $w_{mi}$ are exactly the $\bar{w}_{mi}$ normalized to sum to one (the two differ only by the constant factor $\sum_{i}\bar{w}_{mi} = N\prod_{k=1}^{m-1}Z_k$), so dividing by $\sum_{i}\bar{w}_{mi}$ in (11) turns the unnormalized weights into the normalized ones.
One can see that this $\alpha^*_m$ is identical to AdaBoost's $\alpha_m$.
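To double-check (10), one can minimize the objective in (9) over $\alpha$ by brute force and compare the minimizer with the closed form $\frac{1}{2}\log\frac{1-e_m}{e_m}$. The weights and base-classifier outputs below are synthetic, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
y = rng.choice([-1, 1], N)
G = np.where(rng.uniform(size=N) < 0.7, y, -y)    # base classifier with roughly 30% error
w_bar = rng.uniform(0.5, 1.5, N)                  # unnormalized weights \bar w_{mi}

e_m = np.sum(w_bar * (G != y)) / np.sum(w_bar)    # weighted error, eq. (11)
alphas = np.linspace(1e-3, 3, 20001)              # grid search over alpha
losses = np.exp(-np.outer(alphas, y * G)) @ w_bar # objective (9) evaluated on the grid

print(alphas[losses.argmin()], 0.5 * np.log((1 - e_m) / e_m))   # nearly identical values
```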

Finally, consider the sample weight update. Since $\bar{w}_{mi}=\exp[-y_i f_{m-1}(x_i)]$, we have
$$\begin{aligned}
\bar{w}_{m+1,i} &= \exp[-y_i f_m(x_i)] \\
&= \exp[-y_i(f_{m-1}(x_i) + \alpha_m G_m(x_i))] \\
&= \exp(-y_i f_{m-1}(x_i))\cdot\exp(-y_i\alpha_m G_m(x_i)) \\
&= \bar{w}_{mi}\exp(-y_i\alpha_m G_m(x_i))
\end{aligned}$$
This matches AdaBoost's sample weight update (the two differ only by the normalization factor $Z_m$).
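As a last small sketch of this point: the unnormalized update $\bar{w}_{m+1,i} = \bar{w}_{mi}\exp(-y_i\alpha_m G_m(x_i))$ and AdaBoost's normalized update (dividing by $Z_m$) give the same weight distribution after normalization. All values below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50
y = rng.choice([-1, 1], N)
G = np.where(rng.uniform(size=N) < 0.8, y, -y)   # some base classifier's predictions
alpha = 0.6
w_bar = rng.uniform(0.2, 2.0, N)                 # unnormalized \bar w_{mi}
w = w_bar / w_bar.sum()                          # AdaBoost weights w_{mi}

w_bar_next = w_bar * np.exp(-y * alpha * G)      # forward stagewise update
w_next = w * np.exp(-y * alpha * G)
w_next /= w_next.sum()                           # AdaBoost update: divide by Z_m

print(np.allclose(w_bar_next / w_bar_next.sum(), w_next))   # True
```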

That wraps up this introduction to the theoretical derivation of AdaBoost.



References
[1] Li Hang. *Statistical Learning Methods* (统计学习方法).
