Please credit the source when reposting: SVM算法理论推导及python实现 (SVM: theory derivation and Python implementation)
This article is aimed at readers who already have the prerequisite SVM background (for example, having read 《统计学习方法》) and who want a deeper understanding of the details of SMO (Sequential Minimal Optimization). While implementing a simple SVM myself I spent quite a bit of effort working through this part, so I wrote this thoroughly chewed-over article with the goal that a reader who follows it in order will be able to understand it. My writing habit is to keep simple things simple and to explain the complicated parts fully.
For the support vector machine with a soft margin, the primal problem is:
$$
\begin{aligned}
& \min_{w,b,\xi} \quad \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{N} \xi_{i} \\
& \ s.t. \quad\ y_{i}\left(w \cdot x_{i} + b\right) \ge 1 - \xi_{i}, \quad i=1,2,\cdots,N \\
& \qquad\quad\ \xi_{i} \ge 0, \quad i=1,2,\cdots,N
\end{aligned} \tag{1}
$$
To solve this primal problem:
Introduce Lagrange multipliers $\alpha_i \ge 0,\ \mu_i \ge 0,\ i=1,2,\cdots,N$ and construct the Lagrangian:
$$
L(w,b,\xi,\alpha,\mu) = \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{N}\xi_{i} + \sum_{i=1}^{N}\alpha_{i}\bigl(-y_{i}(w \cdot x_{i} + b) + 1 - \xi_{i}\bigr) - \sum_{i=1}^{N}\mu_{i}\xi_{i} \tag{2}
$$
Here $\alpha = (\alpha_1, \alpha_2, \cdots, \alpha_N)^T$ and $\mu = (\mu_1, \mu_2, \cdots, \mu_N)^T$ are the Lagrange multiplier vectors, and every component is non-negative.
For the underlying theory and the derivation of the KKT conditions, see my previous post: SVM之拉格朗日对偶问题与KKT条件推导 (the Lagrangian dual problem and KKT conditions for SVM).
The dual problem is:
$$
\max_{\alpha,\mu} \min_{w,b,\xi} L(w,b,\xi,\alpha,\mu) \tag{3}
$$
Treating $\alpha, \mu$ as constants, the inner minimum is obtained by setting the partial derivatives to zero:
$$
\nabla_{w} L(w,b,\xi,\alpha,\mu) = w - \sum_{i=1}^{N}\alpha_{i}y_{i}x_{i} = 0 \tag{4}
$$
$$
\nabla_{b} L(w,b,\xi,\alpha,\mu) = -\sum_{i=1}^{N}\alpha_{i}y_{i} = 0 \tag{5}
$$
$$
\nabla_{\xi_{i}} L(w,b,\xi,\alpha,\mu) = C - \alpha_{i} - \mu_{i} = 0 \tag{6}
$$
which gives:
$$
w = \sum_{i=1}^N \alpha_i y_i x_i \tag{7}
$$
$$
\sum_{i=1}^N \alpha_i y_i = 0 \tag{8}
$$
$$
C - \alpha_i - \mu_i = 0 \tag{9}
$$
Substituting (7)-(9) into (2) (and introducing a kernel function) gives:
$$
\min_{w,b,\xi} L(w,b,\xi,\alpha,\mu) = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^N \alpha_i \tag{10}
$$
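For readers who want the intermediate step spelled out (it is only implicit above): substituting (7) into (2), the $b$ term vanishes because of (8) and the $\xi$ terms vanish because of (9),
$$
\begin{aligned}
L &= \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^N (C - \alpha_i - \mu_i)\xi_i \\
&\quad - \sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j y_i y_j (x_i \cdot x_j) - b\sum_{i=1}^N \alpha_i y_i + \sum_{i=1}^N \alpha_i \\
&= -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^N \alpha_i ,
\end{aligned}
$$
and replacing the inner product $x_i \cdot x_j$ with $K(x_i, x_j)$ yields (10).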
Maximizing (10) over the multipliers:
$$
\max_{\alpha,\mu} \quad -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^N \alpha_i \tag{11}
$$
$$
s.t. \quad \sum_{i=1}^N \alpha_i y_i = 0 \tag{12}
$$
$$
s.t. \quad C - \alpha_i - \mu_i = 0 \tag{13}
$$
$$
s.t. \quad \alpha_i \ge 0 \tag{14}
$$
$$
s.t. \quad \mu_i \ge 0 \tag{15}
$$
Since (13) gives $\mu_i = C - \alpha_i$ and (15) requires $\mu_i \ge 0$, constraints (13)-(15) can be simplified to
$$
0 \le \alpha_i \le C \tag{16}
$$
and $\mu$ can be eliminated. Flipping the sign of the objective converts the maximization (11) into a minimization, so the dual problem becomes:
$$
\min_\alpha \quad \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^N \alpha_i \tag{17}
$$
$$
s.t. \quad \sum_{i=1}^N \alpha_i y_i = 0 \tag{18}
$$
$$
s.t. \quad 0 \le \alpha_i \le C, \quad i=1,2,\cdots,N \tag{19}
$$
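To make the dual problem concrete, here is a minimal NumPy sketch (not part of the implementation discussed later) that evaluates the objective (17) and checks the constraints (18)-(19) for a given $\alpha$; for simplicity it assumes a linear kernel $K(x_i, x_j) = x_i \cdot x_j$.

```python
import numpy as np

# Minimal sketch: evaluate the dual objective (17) for a given alpha,
# assuming a linear kernel K(x_i, x_j) = x_i . x_j.
def dual_objective(alpha, X, y):
    K = X @ X.T                      # Gram matrix K_ij = x_i . x_j
    ay = alpha * y
    return 0.5 * ay @ K @ ay - alpha.sum()

# Check the constraints (18)-(19) up to a small tolerance.
def dual_feasible(alpha, y, C, tol=1e-8):
    return (abs(np.dot(alpha, y)) < tol
            and np.all(alpha >= -tol)
            and np.all(alpha <= C + tol))
```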
The converged solution must satisfy the following KKT conditions (these will be used when deriving SMO):
$$
\nabla_{w} L(w^*,b^*,\xi^*,\alpha^*,\mu^*) = w^* - \sum_{i=1}^N \alpha_i^* y_i x_i = 0 \tag{20}
$$
$$
\nabla_{b} L(w^*,b^*,\xi^*,\alpha^*,\mu^*) = -\sum_{i=1}^N \alpha_i^* y_i = 0 \tag{21}
$$
$$
\nabla_{\xi_i} L(w^*,b^*,\xi^*,\alpha^*,\mu^*) = C - \alpha_i^* - \mu_i^* = 0 \tag{22}
$$
$$
\alpha_i^* \ge 0 \tag{23}
$$
$$
1 - \xi_i^* - y_i(w^* \cdot x_i + b^*) \le 0 \tag{24}
$$
$$
\alpha_i^*\bigl(1 - \xi_i^* - y_i(w^* \cdot x_i + b^*)\bigr) = 0 \tag{25}
$$
$$
\mu_i^* \ge 0 \tag{26}
$$
$$
\xi_i^* \ge 0 \tag{27}
$$
$$
\mu_i^* \xi_i^* = 0 \tag{28}
$$
In total there are 3 stationarity conditions plus $2 \times 3$ conditions linking the two multipliers with the inequality constraints, i.e. 9 KKT conditions.
The SMO algorithm works by repeatedly solving small sub-problems of the dual (17)-(19) to approach its solution: at each step it selects two components of $\alpha$ and optimizes over them with all other components fixed (how the two components are chosen is discussed later). The sub-problem in $\alpha_1, \alpha_2$ is:
$$
\begin{aligned}
\min_{\alpha_1,\alpha_2} \quad W(\alpha_1,\alpha_2) = \frac{1}{2}\alpha_1^2 K_{11} + \frac{1}{2}\alpha_2^2 K_{22} + \alpha_1\alpha_2 y_1 y_2 K_{12} \\
+ \alpha_1 y_1\sum_{i=3}^N \alpha_i y_i K_{1i} + \alpha_2 y_2\sum_{i=3}^N \alpha_i y_i K_{2i} - \alpha_1 - \alpha_2
\end{aligned} \tag{29}
$$
$$
s.t. \quad \alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^N y_i \alpha_i = \varsigma \tag{30}
$$
$$
s.t. \quad 0 \le \alpha_i \le C, \quad i=1,2 \tag{31}
$$
With two variables, we first use the rearranged form of (30), $\alpha_1 = (\varsigma - \alpha_2 y_2) y_1$, and substitute it into (29) to obtain an expression in $\alpha_2$ only:
$$
\begin{aligned}
\min_{\alpha_2} \quad W(\alpha_2) = \frac{1}{2}(\varsigma - \alpha_2 y_2)^2 K_{11} + \frac{1}{2}\alpha_2^2 K_{22} + (\varsigma - \alpha_2 y_2)\alpha_2 y_2 K_{12} \\
+ (\varsigma - \alpha_2 y_2)\sum_{i=3}^N \alpha_i y_i K_{1i} + \alpha_2 y_2\sum_{i=3}^N \alpha_i y_i K_{2i} - (\varsigma - \alpha_2 y_2) y_1 - \alpha_2
\end{aligned} \tag{32}
$$
Clearly, we take the partial derivative with respect to $\alpha_2$ (using $y_1^2 = y_2^2 = 1$) so that we can set it to zero:
$$
\frac{\partial W}{\partial \alpha_2} = -\varsigma y_2 K_{11} + K_{11}\alpha_2 + K_{22}\alpha_2 + \varsigma y_2 K_{12} - 2K_{12}\alpha_2 - y_2\sum_{i=3}^N \alpha_i y_i K_{1i} + y_2\sum_{i=3}^N \alpha_i y_i K_{2i} + y_1 y_2 - 1 \tag{33}
$$
$$
= (K_{11} + K_{22} - 2K_{12})\alpha_2 + y_2\Bigl(-\varsigma K_{11} + \varsigma K_{12} - \sum_{i=3}^N \alpha_i y_i K_{1i} + \sum_{i=3}^N \alpha_i y_i K_{2i} + y_1 - y_2\Bigr) \tag{34}
$$
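As a sanity check on (33)-(34) (not part of the SVM implementation below), the derivative can be verified symbolically. In the sketch the two sums $\sum_{i=3}^N \alpha_i y_i K_{1i}$ and $\sum_{i=3}^N \alpha_i y_i K_{2i}$ are treated as opaque symbols `s1` and `s2`:

```python
import sympy as sp

a2, y1, y2, c, K11, K12, K22, s1, s2 = sp.symbols('alpha2 y1 y2 varsigma K11 K12 K22 s1 s2')

# W(alpha_2) from Eq. (32), with the two sums over i >= 3 written as s1 and s2.
W = (sp.Rational(1, 2) * (c - a2 * y2) ** 2 * K11
     + sp.Rational(1, 2) * a2 ** 2 * K22
     + (c - a2 * y2) * a2 * y2 * K12
     + (c - a2 * y2) * s1
     + a2 * y2 * s2
     - (c - a2 * y2) * y1
     - a2)

dW = sp.diff(W, a2)
# Eq. (34), the claimed grouped form of the derivative.
expected = (K11 + K22 - 2 * K12) * a2 + y2 * (-c * K11 + c * K12 - s1 + s2 + y1 - y2)

# The two expressions agree whenever y1, y2 are labels in {-1, +1}.
for v1 in (1, -1):
    for v2 in (1, -1):
        assert sp.simplify((dW - expected).subs({y1: v1, y2: v2})) == 0
print("Eq. (34) verified")
```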
We now introduce some convenient notation:

1. The model's prediction for an input $x$:
$$
g(x) = \sum_{i=1}^N \alpha_i y_i K(x_i, x) + b \tag{35}
$$
2. The prediction minus the true label:
$$
E_i = g(x_i) - y_i = \Bigl(\sum_{j=1}^N \alpha_j y_j K(x_j, x_i) + b\Bigr) - y_i, \quad i=1,2 \tag{36}
$$
3. The awkward sums appearing in (34):
$$
v_i = \sum_{j=3}^N \alpha_j y_j K_{ij} = g(x_i) - \sum_{j=1}^2 \alpha_j y_j K_{ij} - b, \quad i=1,2 \tag{37}
$$
$$
= E_i + y_i - \sum_{j=1}^2 \alpha_j y_j K_{ij} - b, \quad i=1,2 \tag{38}
$$
Also note that $\alpha_1 y_1 + \alpha_2 y_2 = \sum_{i=1}^2 \alpha_i y_i = \varsigma$; in particular the old values satisfy $\alpha_1^{old} y_1 + \alpha_2^{old} y_2 = \varsigma$, which is used below.
Setting (34) to zero (writing the solution as $\alpha_2^{new,unc}$) and substituting the notation above, where $\alpha_i$, $v_i$ and $E_i$ on the right-hand side all take their old values:
$$
(K_{11}+K_{22}-2K_{12})\alpha_2^{new,unc} = y_2\Bigl(\varsigma K_{11} - \varsigma K_{12} + \sum_{i=3}^N \alpha_i y_i K_{1i} - \sum_{i=3}^N \alpha_i y_i K_{2i} - y_1 + y_2\Bigr) \tag{39}
$$
$$
= y_2\Bigl(\sum_{i=1}^2 \alpha_i y_i K_{11} - \sum_{i=1}^2 \alpha_i y_i K_{12} + v_1 - v_2 - y_1 + y_2\Bigr) \tag{40}
$$
$$
= y_2\Bigl(\sum_{i=1}^2 \alpha_i y_i K_{11} - \sum_{i=1}^2 \alpha_i y_i K_{12} + E_1 + y_1 - \sum_{i=1}^2 \alpha_i y_i K_{1i} - b - E_2 - y_2 + \sum_{i=1}^2 \alpha_i y_i K_{2i} + b - y_1 + y_2\Bigr) \tag{41}
$$
$$
= y_2\bigl(E_1 - E_2 + \alpha_2^{old} y_2 K_{11} - 2\alpha_2^{old} y_2 K_{12} + \alpha_2^{old} y_2 K_{22}\bigr) \tag{42}
$$
$$
= (K_{11} - 2K_{12} + K_{22})\alpha_2^{old} + y_2(E_1 - E_2) \tag{43}
$$
From (43) we obtain:
$$
\alpha_2^{new,unc} = \alpha_2^{old} + \frac{y_2(E_1 - E_2)}{K_{11} - 2K_{12} + K_{22}} \tag{44}
$$
The superscript "unc" on the left means unclipped, i.e. the box constraint has not yet been applied. We now discuss the admissible range of $\alpha_2$:
The only constraints on $\alpha_1, \alpha_2$ are (30) and (31).
We therefore distinguish cases according to $y_1, y_2$. Constraint (30) confines $(\alpha_1, \alpha_2)$ to a line, and (31) confines it to the box $[0,C] \times [0,C]$, so $\alpha_2^{new}$ must lie in an interval $[L, H]$:

If $y_1 \ne y_2$ (the line is $\alpha_1 - \alpha_2 = \alpha_1^{old} - \alpha_2^{old}$):
$$
L = \max\bigl(0,\ \alpha_2^{old} - \alpha_1^{old}\bigr) \tag{45}
$$
$$
H = \min\bigl(C,\ C + \alpha_2^{old} - \alpha_1^{old}\bigr) \tag{46}
$$
If $y_1 = y_2$ (the line is $\alpha_1 + \alpha_2 = \alpha_1^{old} + \alpha_2^{old}$):
$$
L = \max\bigl(0,\ \alpha_2^{old} + \alpha_1^{old} - C\bigr) \tag{47}
$$
$$
H = \min\bigl(C,\ \alpha_2^{old} + \alpha_1^{old}\bigr) \tag{48}
$$
Clipping the unconstrained solution to this interval gives
$$
\alpha_2^{new} =
\begin{cases}
H, & \alpha_2^{new,unc} > H \\
\alpha_2^{new,unc}, & L \le \alpha_2^{new,unc} \le H \\
L, & \alpha_2^{new,unc} < L
\end{cases}
$$
From (30) we also have:
$$
\alpha_1^{old} y_1 + \alpha_2^{old} y_2 = \alpha_1^{new} y_1 + \alpha_2^{new} y_2 \tag{49}
$$
and therefore:
$$
\alpha_1^{new} = (\alpha_1^{old} y_1 + \alpha_2^{old} y_2 - \alpha_2^{new} y_2) y_1 \tag{50}
$$
$$
= \alpha_1^{old} + (\alpha_2^{old} - \alpha_2^{new}) y_1 y_2 \tag{51}
$$
Next, the threshold $b$ is updated. The principle is to compute it from a support vector, i.e. a point lying exactly on the margin boundary (for which $0 < \alpha_i < C$ and $y_i g(x_i) = 1$):
$$
b = y_i - \sum_{j=1}^N \alpha_j y_j K_{ij} \tag{52}
$$
If $\alpha_1^{new}$ satisfies this condition ($0 < \alpha_1^{new} < C$), then:
$$
b_1^{new} = y_1 - \sum_{i=3}^N \alpha_i y_i K_{1i} - \alpha_1^{new} y_1 K_{11} - \alpha_2^{new} y_2 K_{12} \tag{53}
$$
Part of this expression can be replaced using $E_1$:
$$
E_1 = g(x_1) - y_1 = \sum_{i=3}^N \alpha_i y_i K_{1i} + \alpha_1^{old} y_1 K_{11} + \alpha_2^{old} y_2 K_{21} + b^{old} - y_1 \tag{54}
$$
Combining (53) and (54) to bring $E_1$ into (53) gives:
$$
b_1^{new} = -E_1 + y_1 K_{11}(\alpha_1^{old} - \alpha_1^{new}) + y_2 K_{12}(\alpha_2^{old} - \alpha_2^{new}) + b^{old} \tag{55}
$$
Caching $E_i$ at each step makes these computations much more convenient.
Similarly, if $0 < \alpha_2^{new} < C$, then:
$$
b_2^{new} = -E_2 + y_1 K_{12}(\alpha_1^{old} - \alpha_1^{new}) + y_2 K_{22}(\alpha_2^{old} - \alpha_2^{new}) + b^{old} \tag{56}
$$
We now discuss the final value of $b^{new}$:
If $0 < \alpha_1^{new} < C$ and $0 < \alpha_2^{new} < C$:
$b^{new} = b_1^{new} = b_2^{new}$ (both $x_1$ and $x_2$ lie on the margin boundary).

If only one of them satisfies $0 < \alpha_i^{new} < C$, $i \in \{1,2\}$:
$b^{new} = b_i^{new}$.

If $\alpha_1^{new}, \alpha_2^{new} \in \{0, C\}$:
$b^{new} = \frac{1}{2}(b_1^{new} + b_2^{new})$. (Here $\alpha_i = 0$ means $x_i$ is not a support vector and $y_i g(x_i) \ge 1$, i.e. $x_i$ lies on the correctly classified side of the margin, while $\alpha_i = C$ means $y_i g(x_i) \le 1$; both facts follow from the KKT conditions (20)-(28) and are derived again below.)
Finally, the cached error $E_i$ is refreshed using the new $\alpha$ and $b^{new}$:
$$
E_i = \sum_{j \in S} \alpha_j y_j K_{ij} + b^{new} - y_i \tag{57}
$$
where $S$ is the set of indices $j$ with $\alpha_j > 0$, i.e. the support vectors.
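A minimal sketch of (57) follows. It is an assumed helper written in the same style as the class methods shown later, not taken from the author's repository: it recomputes $E_i$ from the support vectors and the current threshold.

```python
import numpy as np

# Sketch of Eq. (57): recompute E_i using only the support vectors (alpha_j > 0)
# and the current threshold b. Assumes the same attributes (alpha, Y, b, _K_)
# as the class methods shown below.
def _E_from_sv_(self, i):
    sv = np.where(self.alpha > 0)[0]              # indices of support vectors
    K = np.array([self._K_(j, i) for j in sv])    # K(x_j, x_i) for j in S
    return np.dot(self.alpha[sv] * self.Y[sv], K) + self.b - self.Y[i]
```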
Since the optimal solution at convergence satisfies the KKT conditions, the first variable is chosen as the component of $\alpha$ that violates the KKT conditions most severely:
From (20)-(28) we can derive the following:
Case $\alpha_i = 0$:
(1) From $C - \alpha_i^* - \mu_i^* = 0$ we get $\mu_i^* = C > 0$.
(2) From $\mu_i^* \xi_i^* = 0$ we get $\xi_i^* = 0$.
(3) From $y_i(w^* \cdot x_i + b^*) \ge 1 - \xi_i^*$ we get $y_i(w^* \cdot x_i + b^*) \ge 1$.
(4) Therefore,
$$
\alpha_i = 0 \Leftrightarrow y_i g(x_i) \ge 1 \tag{58}
$$
Case $0 < \alpha_i < C$:
(1) From $C - \alpha_i^* - \mu_i^* = 0$ we get $\mu_i^* > 0$.
(2) From $\mu_i^* \xi_i^* = 0$ we get $\xi_i^* = 0$.
(3) From $\alpha_i^*\bigl(y_i(w^* \cdot x_i + b^*) - 1 + \xi_i^*\bigr) = 0$, $\alpha_i^* > 0$ and the previous step, we get $y_i(w^* \cdot x_i + b^*) - 1 = 0$.
(4) Therefore,
$$
0 < \alpha_i < C \Leftrightarrow y_i g(x_i) = 1 \tag{59}
$$
Case $\alpha_i = C$:
(1) From $C - \alpha_i^* - \mu_i^* = 0$ we get $\mu_i^* = 0$.
(2) Since $\mu_i^* = 0$, complementary slackness $\mu_i^* \xi_i^* = 0$ no longer forces $\xi_i^* = 0$; primal feasibility only gives $\xi_i^* \ge 0$.
(3) From $\alpha_i^*\bigl(y_i(w^* \cdot x_i + b^*) - 1 + \xi_i^*\bigr) = 0$ and $\alpha_i = C > 0$ we get
$y_i(w^* \cdot x_i + b^*) - 1 + \xi_i^* = 0$.
(4) Combining (2) and (3): $y_i(w^* \cdot x_i + b^*) \le 1$.
(5) Therefore,
$$
\alpha_i = C \Leftrightarrow y_i g(x_i) \le 1 \tag{60}
$$
One caveat: because of floating-point precision, a direct `==` comparison on a computer will often give the wrong answer, so all of the KKT checks above should be performed within a tolerance $\epsilon$.
Selection procedure for the first variable: first scan the sample points with $0 < \alpha_i < C$ (the support vectors) and check whether they satisfy (59).
If some point violates it, select that point.
If they all satisfy it, traverse the entire training set and check whether every point satisfies the KKT conditions; if all of them do, the stopping criterion is met.
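A hedged sketch of this scan order (the method name `_select_alpha1_` and the exact traversal are my own choices, not necessarily how the author's GitHub code does it):

```python
# Sketch: return the index of the first alpha component that violates the KKT
# conditions, scanning non-bound points (0 < alpha_i < C) before the rest.
# Returns None if every point satisfies the KKT conditions (stopping criterion).
def _select_alpha1_(self):
    non_bound = [i for i in range(self.m) if 0 < self.alpha[i] < self.C]
    bound = [i for i in range(self.m) if i not in non_bound]
    for i in non_bound + bound:
        if not self._satisfy_KKT_(i):
            return i
    return None
```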
According to (44), the change in $\alpha_2$ is linear in $E_1 - E_2$, so the strategy for choosing $\alpha_2$ is: scan the cached errors to find the $E_2$ that maximizes $|E_1 - E_2|$; the component of $\alpha$ it corresponds to is taken as $\alpha_2$.
As you can see, $E$ plays a major role in updating $\alpha_2$ and the threshold $b$, and in selecting $\alpha_2$.
However, this simple strategy sometimes fails to find a point that makes the objective (17) decrease sufficiently. What then? The fallback is to iterate over the points on the margin boundary one by one and check whether any of them produces a sufficient decrease.
If that still fails, the only option left is to abandon this $\alpha_1$ and reselect.
So at this point we can see that SMO involves a form of backtracking.
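A hedged sketch of the full $\alpha_2$ selection with this fallback. The method name `_select_alpha2_`, the assumed helper `_objective_decrease_` ("does this pair give enough decrease?") and the use of `random.shuffle` are my own assumptions; the author's GitHub code may organize this differently.

```python
import random

# Sketch: choose alpha_2 for a given alpha_1 index i1. Try the |E1 - E2| heuristic
# first, then fall back to the non-bound points, then to the whole training set.
# Returns None to signal "give up on this alpha_1 and reselect" (backtracking).
def _select_alpha2_(self, i1):
    E1 = self.E[i1]
    candidates = [j for j in range(self.m) if j != i1]
    # 1. Heuristic: maximize |E1 - E2|.
    i2 = max(candidates, key=lambda j: abs(E1 - self.E[j]))
    if self._objective_decrease_(i1, i2):          # assumed helper: "enough decrease?"
        return i2
    # 2. Fallback: non-bound points (on the margin boundary), then all remaining points.
    non_bound = [j for j in candidates if 0 < self.alpha[j] < self.C]
    rest = [j for j in candidates if j not in non_bound]
    random.shuffle(non_bound)
    random.shuffle(rest)
    for j in non_bound + rest:
        if self._objective_decrease_(i1, j):
            return j
    return None                                    # backtrack: pick a new alpha_1
```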
To summarize, one round of SMO proceeds as follows:
1. Pick one component of $\alpha$ as $\alpha_1$ according to whether it violates the KKT conditions; if the stopping criterion is met, the algorithm terminates.
2. Pick $\alpha_2$ so as to maximize the change of $\alpha_2$, i.e. maximize $|E_1 - E_2|$; backtracking to step 1 may occur here.
3. Compute $\alpha_2^{new,unc}$ from $E_1, E_2$ and the kernel values via (44).
4. Clip $\alpha_2^{new,unc}$ to its feasible interval to obtain $\alpha_2^{new}$.
5. Compute $\alpha_1^{new}$ from (51).
6. Compute $b_1^{new}$ and $b_2^{new}$ from (55) and (56) using $E_i$.
7. Choose $b^{new}$ according to how $\alpha_1^{new}, \alpha_2^{new}$ compare with $0$ and $C$.
8. Update $E_1, E_2$ in preparation for the next round.
These steps are repeated until the specified number of iterations is reached or the stopping criterion is satisfied.
The code walkthrough below roughly follows the algorithm flow above.
First, the KKT check for a single sample:

```python
# Check whether alpha[i] satisfies the KKT conditions (within tolerance epsilon):
def _satisfy_KKT_(self, i):
    tmp = self.Y[i] * self._g_(i)
    # Eq. (58): alpha_i = 0  <=>  y_i * g(x_i) >= 1
    if abs(self.alpha[i]) < self.epsilon:  # epsilon is the tolerance for float comparisons
        return tmp >= 1
    # Eq. (60): alpha_i = C  <=>  y_i * g(x_i) <= 1
    elif abs(self.alpha[i] - self.C) < self.epsilon:
        return tmp <= 1
    # Eq. (59): 0 < alpha_i < C  <=>  y_i * g(x_i) = 1
    else:
        return abs(tmp - 1) < self.epsilon
```
Next, the fragment that picks $\alpha_2$ for a given first index `i` by maximizing $|E_1 - E_2|$:

```python
# Fragment of the pair-selection routine: given the first index i, choose
# alpha_2 by maximizing |E1 - E2|. `alpha_1_index` is the candidate index list
# built by the enclosing routine.
imax = (0, 0)  # stores (|E1 - E2|, index of alpha_2)
E1 = self.E[i]
alpha_1_index.remove(i)  # exclude i itself from the alpha_2 candidates
# Find the alpha_2 that maximizes |E1 - E2|
for j in alpha_1_index:
    E2 = self.E[j]
    if abs(E1 - E2) > imax[0]:
        imax = (abs(E1 - E2), j)
return i, imax[1]
```
The update step for a chosen index pair `i1`, `i2`:

```python
# Update step for the chosen pair (i1, i2).
E1, E2 = self.E[i1], self.E[i2]
# eta is the denominator of Eq. (44): K11 + K22 - 2*K12 (assumed > 0 here)
eta = self._K_(i1, i1) + self._K_(i2, i2) - 2 * self._K_(i1, i2)  # 7.107 in 《统计学习方法》
# Eq. (44): unclipped update of alpha_2
alpha2_new_unc = self.alpha[i2] + self.Y[i2] * (E1 - E2) / eta  # 7.106
# Clip alpha_2 to its feasible interval [L, H], Eqs. (45)-(48)
if self.Y[i1] == self.Y[i2]:
    L = max(0, self.alpha[i2] + self.alpha[i1] - self.C)
    H = min(self.C, self.alpha[i2] + self.alpha[i1])
else:
    L = max(0, self.alpha[i2] - self.alpha[i1])
    H = min(self.C, self.C + self.alpha[i2] - self.alpha[i1])
alpha2_new = H if alpha2_new_unc > H else L if alpha2_new_unc < L else alpha2_new_unc  # 7.108
# Eq. (51): recover alpha_1 from the equality constraint
alpha1_new = self.alpha[i1] + self.Y[i1] * self.Y[i2] * (self.alpha[i2] - alpha2_new)
# Eq. (55)
b1_new = -E1 - self.Y[i1] * self._K_(i1, i1) * (alpha1_new - self.alpha[i1]) \
         - self.Y[i2] * self._K_(i2, i1) * (alpha2_new - self.alpha[i2]) + self.b
# Eq. (56)
b2_new = -E2 - self.Y[i1] * self._K_(i1, i2) * (alpha1_new - self.alpha[i1]) \
         - self.Y[i2] * self._K_(i2, i2) * (alpha2_new - self.alpha[i2]) + self.b
# Choose b_new according to the case discussion after Eq. (56)
if 0 < alpha1_new < self.C:
    self.b = b1_new
elif 0 < alpha2_new < self.C:
    self.b = b2_new
else:
    self.b = (b1_new + b2_new) / 2
```
Finally, the prediction and error helpers, and the error-cache refresh at the end of each round:

```python
# Eq. (35): the model's prediction g(x_i)   (requires `import numpy as np`)
def _g_(self, i):
    K = np.array([self._K_(j, i) for j in range(self.m)])
    return np.dot(self.alpha * self.Y, K) + self.b

# Eq. (36): difference between the prediction for x_i and the label y_i
def _E_(self, i):
    return self._g_(i) - self.Y[i]

# Refresh the cached E_1, E_2 for the updated alpha_1, alpha_2 (step 8)
self.E[i1] = self._E_(i1)
self.E[i2] = self._E_(i2)
```
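Putting the fragments together, the outer training loop might look roughly like the sketch below. The method names `_select_pair_` and `_update_pair_` are placeholders standing in for the selection and update fragments above (the author's actual code on GitHub is organized differently), so this is only an illustration of how the eight steps connect.

```python
# Sketch of the outer SMO loop. `_select_pair_` and `_update_pair_` are
# hypothetical names for the selection and update logic shown above.
def fit(self, max_iter=1000):
    self.E = [self._E_(i) for i in range(self.m)]   # initialize the error cache
    for _ in range(max_iter):
        pair = self._select_pair_()                 # steps 1-2: choose alpha_1, alpha_2
        if pair is None:                            # stopping criterion: KKT satisfied
            break
        i1, i2 = pair
        self._update_pair_(i1, i2)                  # steps 3-8: update alpha_1, alpha_2, b, E
```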
The complete code is available on my GitHub.
I am well aware of the limits of my knowledge; please point out anything improper.
References:
《统计学习方法》
《理解SVM的三重境界》
《机器学习》