SVM Algorithm: Theoretical Derivation and a Python Implementation

Please credit the source when reposting: SVM算法理论推导及python实现

This article is aimed at readers who already have the SVM prerequisites (for example, having read 《统计学习方法》) and who want a deeper understanding of the details of SMO (Sequential Minimal Optimization). While implementing a simple SVM myself, I spent quite some effort working through this part, so I wrote this pre-digested article with the goal that anyone who reads it in order can follow it all the way through. My writing habit: keep simple things simple, and make sure the complicated parts are explained thoroughly.

I. Deriving the SVM dual form that SMO solves

For the soft-margin support vector machine, the primal problem is:

$$\begin{aligned} & \min_{w,b,\xi} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^N \xi_i \\ & \text{s.t.} \quad y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad i=1,2,\cdots,N \\ & \qquad\;\; \xi_i \ge 0, \quad i=1,2,\cdots,N \end{aligned} \tag{1}$$
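As a quick numeric illustration: for any fixed $(w, b)$, the optimal slack in (1) is $\xi_i = \max(0,\ 1 - y_i(w \cdot x_i + b))$ (the hinge loss), so the primal objective can be evaluated directly. A minimal sketch, where the data and parameter values are made up for illustration:

```python
import numpy as np

# Hypothetical toy data: 4 points in 2-D, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 1.8], [0.0, 0.0], [0.3, -0.5]])
y = np.array([1, 1, -1, -1])

def primal_objective(w, b, C, X, y):
    """Objective of (1), with each xi_i at its smallest feasible value,
    xi_i = max(0, 1 - y_i (w . x_i + b))."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * xi.sum()

# For this (w, b), all four points clear the margin, so every xi_i = 0
# and the objective is just 0.5 * ||w||^2 = 1.0.
val = primal_objective(np.array([1.0, 1.0]), -2.0, 1.0, X, y)
```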
To solve this primal problem:

1. Construct the Lagrange function

Introduce Lagrange multipliers $\alpha_i \ge 0,\ \mu_i \ge 0,\ i=1,2,\cdots,N$ and build the Lagrange function:

$$L(w,b,\xi,\alpha,\mu) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^N \xi_i + \sum_{i=1}^N \alpha_i\left(-y_i(w \cdot x_i + b) + 1 - \xi_i\right) - \sum_{i=1}^N \mu_i \xi_i \tag{2}$$

where $\alpha = (\alpha_1, \alpha_2, \cdots, \alpha_N)^T$ and $\mu = (\mu_1, \mu_2, \cdots, \mu_N)^T$ are the Lagrange multipliers; every component is non-negative.

2. Convert to the Lagrange dual problem

For the underlying theory and the derivation of the KKT conditions, see my previous post: SVM之拉格朗日对偶问题与KKT条件推导
The dual problem is:

$$\max_{\alpha,\mu}\ \min_{w,b,\xi}\ L(w,b,\xi,\alpha,\mu) \tag{3}$$

3. Solve the inner min first

Since $\alpha, \mu$ are treated as constants here, the minimum is found by setting the partial derivatives to zero:

$$\nabla_w L(w,b,\xi,\alpha,\mu) = w - \sum_{i=1}^N \alpha_i y_i x_i = 0 \tag{4}$$
$$\nabla_b L(w,b,\xi,\alpha,\mu) = -\sum_{i=1}^N \alpha_i y_i = 0 \tag{5}$$
$$\nabla_{\xi_i} L(w,b,\xi,\alpha,\mu) = C - \alpha_i - \mu_i = 0 \tag{6}$$

which give:

$$w = \sum_{i=1}^N \alpha_i y_i x_i \tag{7}$$
$$\sum_{i=1}^N \alpha_i y_i = 0 \tag{8}$$
$$C - \alpha_i - \mu_i = 0 \tag{9}$$

Substituting (7)-(9) into (2) (a kernel function is introduced here) yields:

$$\min_{w,b,\xi} L(w,b,\xi,\alpha,\mu) = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^N \alpha_i \tag{10}$$

4. Solve the outer max

Maximize (10) over the multipliers:

$$\max_{\alpha,\mu}\ -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^N \alpha_i \tag{11}$$
$$\text{s.t.} \quad \sum_{i=1}^N \alpha_i y_i = 0 \tag{12}$$
$$\qquad\ C - \alpha_i - \mu_i = 0 \tag{13}$$
$$\qquad\ \alpha_i \ge 0 \tag{14}$$
$$\qquad\ \mu_i \ge 0 \tag{15}$$

Constraints (13)-(15) can be combined into:

$$0 \le \alpha_i \le C \tag{16}$$

5. Final form: the dual of a convex quadratic program

$$\min_\alpha \quad \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^N \alpha_i \tag{17}$$
$$\text{s.t.} \quad \sum_{i=1}^N \alpha_i y_i = 0 \tag{18}$$
$$\qquad\ 0 \le \alpha_i \le C, \quad i=1,2,\cdots,N \tag{19}$$

The converged solution must satisfy the KKT conditions (these will be useful when solving SMO):

$$\nabla_w L(w^*,b^*,\xi^*,\alpha^*,\mu^*) = w^* - \sum_{i=1}^N \alpha_i^* y_i x_i = 0 \tag{20}$$
$$\nabla_b L(w^*,b^*,\xi^*,\alpha^*,\mu^*) = -\sum_{i=1}^N \alpha_i^* y_i = 0 \tag{21}$$
$$\nabla_{\xi_i} L(w^*,b^*,\xi^*,\alpha^*,\mu^*) = C - \alpha_i^* - \mu_i^* = 0 \tag{22}$$
$$\alpha_i^* \ge 0 \tag{23}$$
$$1 - \xi_i^* - y_i(w^* \cdot x_i + b^*) \le 0 \tag{24}$$
$$\alpha_i^*\left(1 - \xi_i^* - y_i(w^* \cdot x_i + b^*)\right) = 0 \tag{25}$$
$$\mu_i^* \ge 0 \tag{26}$$
$$\xi_i^* \ge 0 \tag{27}$$
$$\mu_i^* \xi_i^* = 0 \tag{28}$$

In total: 3 zero-gradient conditions + 2×3 conditions relating the multipliers to the inequality constraints = 9 KKT conditions.
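As a sanity check on the dual (17)-(19), the objective and the feasibility constraints can be evaluated directly for any candidate $\alpha$. A minimal sketch; the Gram matrix and multiplier values below are made up:

```python
import numpy as np

def dual_objective(alpha, y, K):
    """Objective of (17): 0.5 * sum_ij a_i a_j y_i y_j K_ij - sum_i a_i."""
    ay = alpha * y
    return 0.5 * ay @ (K @ ay) - alpha.sum()

def dual_feasible(alpha, y, C, tol=1e-8):
    """Constraints (18) and (19), checked within a tolerance."""
    return (abs(np.dot(alpha, y)) < tol
            and np.all(alpha > -tol) and np.all(alpha < C + tol))

K = np.eye(2)                    # toy Gram matrix
alpha = np.array([0.5, 0.5])
y = np.array([1.0, -1.0])
obj = dual_objective(alpha, y, K)   # 0.5*(0.25 + 0.25) - 1 = -0.75
ok = dual_feasible(alpha, y, C=1.0) # sum a_i y_i = 0, 0 <= a_i <= C
```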

II. Getting to the point: the complete SMO derivation

The SMO algorithm first selects two components of $\alpha$ and solves the corresponding subproblem of (17)-(19); the idea is to repeatedly solve subproblems to approach the solution of the full problem. How those two components are selected is discussed later.

1. Select two components of $\alpha$ and solve the subproblem of (17)-(19), to obtain the update rule for these two components

$$\begin{aligned} \min_{\alpha_1,\alpha_2} \quad W(\alpha_1,\alpha_2) = \frac{1}{2}\alpha_1^2 K_{11} + \frac{1}{2}\alpha_2^2 K_{22} + \alpha_1\alpha_2 y_1 y_2 K_{12} \\ + \alpha_1 y_1 \sum_{i=3}^N \alpha_i y_i K_{1i} + \alpha_2 y_2 \sum_{i=3}^N \alpha_i y_i K_{2i} - \alpha_1 - \alpha_2 \end{aligned} \tag{29}$$
$$\text{s.t.} \quad \alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^N y_i \alpha_i = \varsigma \tag{30}$$
$$\qquad\ 0 \le \alpha_i \le C, \quad i=1,2 \tag{31}$$

1.1 Substitution

With two variables, first substitute the rearranged form of (30), $\alpha_1 = (\varsigma - \alpha_2 y_2)y_1$, into (29) to get an expression in $\alpha_2$ only:

$$\begin{aligned} \min_{\alpha_2} W(\alpha_2) = \frac{1}{2}(\varsigma - \alpha_2 y_2)^2 K_{11} + \frac{1}{2}\alpha_2^2 K_{22} + (\varsigma - \alpha_2 y_2)\alpha_2 y_2 K_{12} + \\ (\varsigma - \alpha_2 y_2)\sum_{i=3}^N \alpha_i y_i K_{1i} + \alpha_2 y_2 \sum_{i=3}^N \alpha_i y_i K_{2i} - (\varsigma - \alpha_2 y_2)y_1 - \alpha_2 \end{aligned} \tag{32}$$

1.2 Find the stationary point

Clearly, take the partial derivative with respect to $\alpha_2$ and set it to zero:

$$\frac{\partial W}{\partial \alpha_2} = -\varsigma y_2 K_{11} + K_{11}\alpha_2 + K_{22}\alpha_2 + \varsigma y_2 K_{12} - 2K_{12}\alpha_2 - y_2\sum_{i=3}^N \alpha_i y_i K_{1i} + y_2\sum_{i=3}^N \alpha_i y_i K_{2i} + y_1 y_2 - 1 \tag{33}$$
$$= (K_{11}+K_{22}-2K_{12})\alpha_2 + y_2\left(-\varsigma K_{11} + \varsigma K_{12} - \sum_{i=3}^N \alpha_i y_i K_{1i} + \sum_{i=3}^N \alpha_i y_i K_{2i} + y_1 - y_2\right) \tag{34}$$

Let us define some convenient notation:
1. The model's prediction for $x$:
$$g(x) = \sum_{i=1}^N \alpha_i y_i K(x_i, x) + b \tag{35}$$
2. The prediction minus the true label:
$$E_i = g(x_i) - y_i = \left(\sum_{j=1}^N \alpha_j y_j K(x_j, x_i) + b\right) - y_i, \quad i=1,2 \tag{36}$$
3. The hard-to-handle sums in (34):
$$v_i = \sum_{j=3}^N \alpha_j y_j K_{ij} = g(x_i) - \sum_{j=1}^2 \alpha_j y_j K_{ij} - b, \quad i=1,2 \tag{37}$$
$$= E_i + y_i - \sum_{j=1}^2 \alpha_j y_j K_{ij} - b, \quad i=1,2 \tag{38}$$

Also note that $\alpha_1 y_1 + \alpha_2 y_2 = \sum_{i=1}^2 \alpha_i y_i = \varsigma$.
Set (34) to zero and substitute the notation above:

$$(K_{11}+K_{22}-2K_{12})\alpha_2^{new,unc} = y_2\left(\varsigma K_{11} - \varsigma K_{12} + \sum_{i=3}^N \alpha_i y_i K_{1i} - \sum_{i=3}^N \alpha_i y_i K_{2i} - y_1 + y_2\right) \tag{39}$$
$$= y_2\left(\sum_{i=1}^2 \alpha_i y_i K_{11} - \sum_{i=1}^2 \alpha_i y_i K_{12} + v_1 - v_2 - y_1 + y_2\right) \tag{40}$$
$$= y_2\left(\sum_{i=1}^2 \alpha_i y_i K_{11} - \sum_{i=1}^2 \alpha_i y_i K_{12} + E_1 + y_1 - \sum_{i=1}^2 \alpha_i y_i K_{1i} - b - E_2 - y_2 + \sum_{i=1}^2 \alpha_i y_i K_{2i} + b - y_1 + y_2\right) \tag{41}$$
$$= y_2\left(E_1 - E_2 + \alpha_2 y_2 K_{11} - 2\alpha_2 y_2 K_{12} + \alpha_2 y_2 K_{22}\right) \tag{42}$$
$$= (K_{11} - 2K_{12} + K_{22})\alpha_2 + y_2(E_1 - E_2) \tag{43}$$

1.3 Obtain the update rule for $\alpha_2^{new,unc}$

From (43), where the $\alpha_2$ and the $E_i$ on the right-hand side are computed with the current (old) values, we get:

$$\alpha_2^{new,unc} = \alpha_2^{old} + \frac{y_2(E_1 - E_2)}{K_{11} - 2K_{12} + K_{22}} \tag{44}$$
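The update (44) is a one-liner; a sketch with made-up kernel and error values:

```python
def alpha2_unclipped(alpha2_old, y2, E1, E2, K11, K12, K22):
    """Eq. (44): the unconstrained minimizer along the constraint line."""
    eta = K11 - 2.0 * K12 + K22  # positive for distinct points under a valid kernel
    return alpha2_old + y2 * (E1 - E2) / eta

# eta = 2 - 0 + 2 = 4, step = 1 * (1 - 0) / 4 = 0.25
a2 = alpha2_unclipped(0.5, 1.0, 1.0, 0.0, K11=2.0, K12=0.0, K22=2.0)
```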

1.4 Clip by the feasible range of $\alpha_2$ to get $\alpha_2^{new}$

The superscript "unc" on the left-hand side means unclipped: the value has not yet been cut to the feasible region. Now consider the range of $\alpha_2$.
The only constraints on $\alpha_1, \alpha_2$ are (30) and (31).
We proceed by cases on $y_1, y_2$:

  • $y_1 = y_2$
    From (30), set $\alpha_1 + \alpha_2 = k$.
    From (31):
    $$\begin{cases} 0 \le \alpha_2 \le C \\ 0 \le k - \alpha_2 \le C \end{cases} \Rightarrow \begin{cases} 0 \le \alpha_2 \le C \\ k - C \le \alpha_2 \le k \end{cases} \Rightarrow \begin{cases} 0 \le \alpha_2 \le C \\ \alpha_1^{old} + \alpha_2^{old} - C \le \alpha_2 \le \alpha_1^{old} + \alpha_2^{old} \end{cases}$$
    Writing the upper bound of $\alpha_2$ as $H$ and the lower bound as $L$:
    $$L = \max(0,\ \alpha_1^{old} + \alpha_2^{old} - C) \tag{45}$$
    $$H = \min(C,\ \alpha_1^{old} + \alpha_2^{old}) \tag{46}$$
  • $y_1 \neq y_2$
    From (30), set $\alpha_1 - \alpha_2 = k$.
    From (31):
    $$\begin{cases} 0 \le \alpha_2 \le C \\ 0 \le \alpha_2 + k \le C \end{cases} \Rightarrow \begin{cases} 0 \le \alpha_2 \le C \\ -k \le \alpha_2 \le C - k \end{cases} \Rightarrow \begin{cases} 0 \le \alpha_2 \le C \\ \alpha_2^{old} - \alpha_1^{old} \le \alpha_2 \le C + \alpha_2^{old} - \alpha_1^{old} \end{cases}$$
    giving the new bounds:
    $$L = \max(0,\ \alpha_2^{old} - \alpha_1^{old}) \tag{47}$$
    $$H = \min(C,\ C + \alpha_2^{old} - \alpha_1^{old}) \tag{48}$$

In both cases the clipping rule is:
$$\alpha_2^{new} = \begin{cases} H, & \alpha_2^{new,unc} > H \\ \alpha_2^{new,unc}, & L \le \alpha_2^{new,unc} \le H \\ L, & \alpha_2^{new,unc} < L \end{cases}$$
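The bounds (45)-(48) and the clipping rule combine into a small helper; a sketch with made-up values:

```python
def clip_alpha2(a2_unc, a1_old, a2_old, y1, y2, C):
    """Bounds (45)-(48) for the two label cases, then the clipping rule."""
    if y1 == y2:
        L = max(0.0, a1_old + a2_old - C)   # (45)
        H = min(C, a1_old + a2_old)         # (46)
    else:
        L = max(0.0, a2_old - a1_old)       # (47)
        H = min(C, C + a2_old - a1_old)     # (48)
    return min(max(a2_unc, L), H)

# y1 == y2, so L = max(0, -0.3) = 0 and H = min(1, 0.7) = 0.7; 0.9 clips to H
clipped = clip_alpha2(0.9, a1_old=0.3, a2_old=0.4, y1=1, y2=1, C=1.0)
```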

1.5 Use the equality constraint to get $\alpha_1^{new}$

From (30):
$$\alpha_1^{old}y_1 + \alpha_2^{old}y_2 = \alpha_1^{new}y_1 + \alpha_2^{new}y_2 \tag{49}$$
hence:
$$\alpha_1^{new} = (\alpha_1^{old}y_1 + \alpha_2^{old}y_2 - \alpha_2^{new}y_2)y_1 \tag{50}$$
$$= \alpha_1^{old} + (\alpha_2^{old} - \alpha_2^{new})y_1 y_2 \tag{51}$$

2. Use the update of these two components of $\alpha$ to update the other quantities

2.1 Compute the threshold $b$

The idea: compute $b$ from a support vector, that is, a point exactly on the margin boundary (for which $0 < \alpha_i < C$ and $y_i g(x_i) = 1$):
$$b = y_i - \sum_{j=1}^N \alpha_j y_j K_{ij} \tag{52}$$
If $\alpha_1^{new}$ satisfies this condition, then
$$b_1^{new} = y_1 - \sum_{i=3}^N \alpha_i y_i K_{1i} - \alpha_1^{new}y_1 K_{11} - \alpha_2^{new}y_2 K_{12} \tag{53}$$
Part of this formula can be replaced using $E_1$:
$$E_1 = g(x_1) - y_1 = \sum_{i=3}^N \alpha_i y_i K_{1i} + \alpha_1^{old}y_1 K_{11} + \alpha_2^{old}y_2 K_{21} + b^{old} - y_1 \tag{54}$$
Substituting (54) into (53) to bring in $E_1$ gives:
$$b_1^{new} = -E_1 + y_1 K_{11}(\alpha_1^{old} - \alpha_1^{new}) + y_2 K_{12}(\alpha_2^{old} - \alpha_2^{new}) + b^{old} \tag{55}$$
Caching the $E_i$ at each step therefore greatly simplifies this computation.
Similarly, if $0 < \alpha_2^{new} < C$:
$$b_2^{new} = -E_2 + y_1 K_{12}(\alpha_1^{old} - \alpha_1^{new}) + y_2 K_{22}(\alpha_2^{old} - \alpha_2^{new}) + b^{old} \tag{56}$$
Now for the final value of $b^{new}$:

  • $0 < \alpha_1 < C$ and $0 < \alpha_2 < C$:
    $b^{new} = b_1^{new} = b_2^{new}$ (both $x_1$ and $x_2$ lie on the margin boundary)

  • only one of $0 < \alpha_i < C,\ i \in \{1,2\}$:
    $b^{new} = b_i^{new}$

  • $\alpha_1, \alpha_2 \in \{0, C\}$:
    $b^{new} = \frac{1}{2}(b_1^{new} + b_2^{new})$. (If $\alpha_i = 0$, then $x_i$ is not a support vector and $y_i g(x_i) \ge 1$, i.e. $x_i$ lies on the correctly classified side of the margin; if $\alpha_i = C$, then $y_i g(x_i) \le 1$. Both follow from the KKT conditions (20)-(28), and are derived again below.)

2.2 Update $E_i$, to simplify the next computation of $b$

$$E_i = \sum_{j \in S} \alpha_j y_j K_{ij} + b^{new} - y_i \tag{57}$$
where $S$ is the set of indices with $\alpha_j > 0$, i.e. the support vectors. (Components with $\alpha_j = 0$ contribute nothing to the sum in (36), so summing over $S$ suffices; the threshold $b^{new}$ still appears.)
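Keeping the cache consistent with (57) only requires summing over the support vectors; a sketch with hypothetical arrays:

```python
import numpy as np

def refresh_E(E, idxs, alpha, y, K, b):
    """Recompute E_i (eq. 57) for the given indices, summing only over
    support vectors (alpha_j > 0); K is the full Gram matrix."""
    S = alpha > 0
    for i in idxs:
        E[i] = np.dot(alpha[S] * y[S], K[S, i]) + b - y[i]
    return E

alpha = np.array([0.5, 0.0, 0.5])
y = np.array([1.0, 1.0, -1.0])
K = np.eye(3)                      # toy Gram matrix
E = refresh_E(np.zeros(3), [0, 2], alpha, y, K, b=0.0)
# E[0] = 0.5*1*1 - 1 = -0.5;  E[2] = 0.5*(-1)*1 - (-1) = 0.5
```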

3. Selection strategy for $\alpha$

3.1 Choose $\alpha_1$ by violation of the KKT conditions

Since the converged optimum satisfies the KKT conditions, the first variable chosen is the component of $\alpha$ that violates the KKT conditions most.
From (20)-(28) we can derive:

  • $\alpha_i = 0$:
    (1) From $C - \alpha_i^* - \mu_i^* = 0$: $\mu_i^* = C > 0$.
    (2) From $\mu_i^* \xi_i^* = 0$: $\xi_i^* = 0$.
    (3) From $y_i(w^* \cdot x_i + b^*) \ge 1 - \xi_i^*$: $y_i(w^* \cdot x_i + b^*) \ge 1$.
    (4) Therefore:
    $$\alpha_i = 0 \Leftrightarrow y_i g(x_i) \ge 1 \tag{58}$$

  • $0 < \alpha_i < C$:
    (1) From $C - \alpha_i^* - \mu_i^* = 0$: $\mu_i^* > 0$.
    (2) From $\mu_i^* \xi_i^* = 0$: $\xi_i^* = 0$.
    (3) From $\alpha_i^*\left(y_i(w^* \cdot x_i + b^*) - 1 + \xi_i^*\right) = 0$ and the above: $y_i(w^* \cdot x_i + b^*) - 1 = 0$.
    (4) Therefore:
    $$0 < \alpha_i < C \Leftrightarrow y_i g(x_i) = 1 \tag{59}$$

  • $\alpha_i = C$:
    (1) From $C - \alpha_i^* - \mu_i^* = 0$: $\mu_i^* = 0$.
    (2) With $\mu_i^* = 0$, the complementarity condition $\mu_i^* \xi_i^* = 0$ no longer forces $\xi_i^* = 0$; we only know $\xi_i^* \ge 0$.
    (3) From $\alpha_i^*\left(y_i(w^* \cdot x_i + b^*) - 1 + \xi_i^*\right) = 0$ and $\alpha_i = C > 0$:
    $y_i(w^* \cdot x_i + b^*) - 1 + \xi_i^* = 0$.
    (4) Combining (2) and (3): $y_i(w^* \cdot x_i + b^*) \le 1$.
    (5) Therefore:
    $$\alpha_i = C \Leftrightarrow y_i g(x_i) \le 1 \tag{60}$$

A note: because of floating-point precision, a direct "==" comparison often gives the wrong result, so all of the KKT checks above should be performed within a tolerance $\epsilon$.
Selection algorithm: first scan the support-vector points with $0 < \alpha_i < C$ and check whether each satisfies (59).
If one violates it, select it.
If they all satisfy it, traverse the whole training set checking the KKT conditions; if every point satisfies them, the stopping condition is met.
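The two-pass selection can be sketched as below, with `violates_kkt` standing in for the check implemented in Part III (both the callable and the toy inputs are hypothetical):

```python
def pick_first(alpha, C, violates_kkt):
    """First scan the points with 0 < alpha_i < C; only if they all satisfy
    the KKT conditions, fall back to the remaining points. Returns an index,
    or None when no violator exists (the stopping condition)."""
    inside = [i for i in range(len(alpha)) if 0 < alpha[i] < C]
    rest = [i for i in range(len(alpha)) if not 0 < alpha[i] < C]
    for i in inside + rest:
        if violates_kkt(i):
            return i
    return None

idx = pick_first([0.5, 0.0], C=1.0, violates_kkt=lambda i: i == 1)   # index 1
done = pick_first([0.5, 0.0], C=1.0, violates_kkt=lambda i: False)   # None: stop
```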

3.2 Find $\alpha_2$ by maximizing the size of the update step

From (44), the step taken by $\alpha_2$ is linear in $E_1 - E_2$, so the selection strategy for $\alpha_2$ is:
traverse to find the $E_2$ that maximizes $|E_1 - E_2|$; the corresponding component of $\alpha$ becomes $\alpha_2$.
As you can see, $E$ plays a major role in updating $\alpha_2$, updating the threshold $b$, and selecting $\alpha_2$.

But this simple strategy sometimes fails to find a point that makes the objective (17) decrease sufficiently. What then? Traverse the points on the margin boundary one by one, checking whether any of them yields enough decrease.

Still nothing? Then give up this $\alpha_1$ and reselect.

So we see here that SMO can backtrack.

4. A brief summary of the whole SMO procedure

1. Find a component of $\alpha$ violating the KKT conditions to serve as $\alpha_1$; if the stopping condition holds, the algorithm ends.
2. Find $\alpha_2$ by maximizing the size of its update step; backtracking may occur here, returning to step 1.
3. Use $E_1, E_2$, i.e. formula (44), to get $\alpha_2^{new,unc}$.
4. Clip $\alpha_2^{new,unc}$ to its feasible range to get $\alpha_2^{new}$.
5. Then get $\alpha_1^{new}$ from (51).
6. Use (55), (56) with the cached $E_i$ to get $b_1^{new}, b_2^{new}$.
7. Choose $b^{new}$ according to how $\alpha_1, \alpha_2$ compare with $0$ and $C$.
8. Update $E_1, E_2$ to prepare for the next round.

Repeat these steps until the given number of iterations is reached or the stopping condition is met.
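The eight steps above can be wired together into a minimal, self-contained sketch: a toy SMO with a linear kernel, the second index chosen by the max-$|E_1 - E_2|$ rule, and none of the fallback heuristics, run here on a made-up two-point problem:

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-3, max_passes=5):
    """Toy SMO loop following steps 1-8 above (linear kernel only)."""
    n = len(y)
    K = X @ X.T                                   # linear-kernel Gram matrix
    alpha, b = np.zeros(n), 0.0
    g = lambda i: np.dot(alpha * y, K[:, i]) + b  # eq. (35)

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E1 = g(i) - y[i]                      # eq. (36)
            # step 1: KKT violation within tolerance (eqs. 58-60)
            if not ((y[i]*E1 < -tol and alpha[i] < C) or (y[i]*E1 > tol and alpha[i] > 0)):
                continue
            # step 2: pick j maximizing |E1 - E2|
            E = np.array([g(k) - y[k] for k in range(n)])
            j = int(np.argmax(np.abs(E1 - E)))
            if j == i:
                continue
            E2 = E[j]
            a1_old, a2_old = alpha[i], alpha[j]
            if y[i] == y[j]:                      # bounds (45)-(46)
                L, H = max(0, a1_old + a2_old - C), min(C, a1_old + a2_old)
            else:                                 # bounds (47)-(48)
                L, H = max(0, a2_old - a1_old), min(C, C + a2_old - a1_old)
            eta = K[i, i] - 2*K[i, j] + K[j, j]   # denominator of (44)
            if L >= H or eta <= 0:
                continue
            # steps 3-5: update alpha_2 via (44), clip, recover alpha_1 via (51)
            a2 = float(np.clip(a2_old + y[j]*(E1 - E2)/eta, L, H))
            a1 = a1_old + y[i]*y[j]*(a2_old - a2)
            # steps 6-7: thresholds (55)-(56) and the case analysis
            b1 = -E1 + y[i]*K[i, i]*(a1_old - a1) + y[j]*K[i, j]*(a2_old - a2) + b
            b2 = -E2 + y[i]*K[i, j]*(a1_old - a1) + y[j]*K[j, j]*(a2_old - a2) + b
            alpha[i], alpha[j] = a1, a2
            b = b1 if 0 < a1 < C else b2 if 0 < a2 < C else (b1 + b2)/2
            changed += 1       # step 8 is implicit: g() reads the fresh alpha, b
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X
    return w, b

X = np.array([[2.0], [-2.0]])       # hypothetical 1-D, linearly separable data
y = np.array([1.0, -1.0])
w, b = smo_train(X, y)
pred = np.sign(X @ w + b)           # should separate the two points
```

On this symmetric toy problem a single update already satisfies the KKT conditions (both points end up exactly on the margin). Part III below shows the corresponding fragments from the real implementation.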

III. Implementing the core steps in Python, with code walkthrough

The walkthrough follows the algorithm flow above.

1. Check the KKT conditions

# check if alpha[i] satisfies the KKT conditions:
def _satisfy_KKT_(self, i):
    tmp = self.Y[i] * self._g_(i)
    # eq. (58): alpha_i == 0  =>  y_i g(x_i) >= 1
    if abs(self.alpha[i]) < self.epsilon:  # epsilon is the tolerance for float equality
        return tmp >= 1
    # eq. (60): alpha_i == C  =>  y_i g(x_i) <= 1
    elif abs(self.alpha[i] - self.C) < self.epsilon:
        return tmp <= 1
    # eq. (59): 0 < alpha_i < C  =>  y_i g(x_i) == 1
    else:
        return abs(tmp - 1) < self.epsilon

2. Find $\alpha_2$

imax = (0, 0)  # stores (|E1 - E2|, index of alpha_2)
E1 = self.E[i]
alpha_1_index.remove(i)
# find the alpha_2 that maximizes |E1 - E2|
for j in alpha_1_index:
    E2 = self.E[j]
    if abs(E1 - E2) > imax[0]:
        imax = (abs(E1 - E2), j)

return i, imax[1]

3. Get $\alpha_2^{new,unc}$, clip it to obtain $\alpha_2^{new}$, then obtain $\alpha_1^{new}$

E1, E2 = self.E[i1], self.E[i2]
# eta is the denominator of eq. (44)
eta = self._K_(i1, i1) + self._K_(i2, i2) - 2*self._K_(i1, i2)  # 7.107
# eq. (44)
alpha2_new_unc = self.alpha[i2] + self.Y[i2] * (E1 - E2) / eta  # 7.106
# clipping, with bounds (45)-(48)
if self.Y[i1] == self.Y[i2]:
    L = max(0, self.alpha[i2] + self.alpha[i1] - self.C)
    H = min(self.C, self.alpha[i2] + self.alpha[i1])
else:
    L = max(0, self.alpha[i2] - self.alpha[i1])
    H = min(self.C, self.C + self.alpha[i2] - self.alpha[i1])

alpha2_new = H if alpha2_new_unc > H else L if alpha2_new_unc < L else alpha2_new_unc  # 7.108
# eq. (51)
alpha1_new = self.alpha[i1] + self.Y[i1]*self.Y[i2]*(self.alpha[i2] - alpha2_new)

4. Get the new threshold $b^{new}$

# eq. (55)
b1_new = -E1 - self.Y[i1]*self._K_(i1, i1)*(alpha1_new - self.alpha[i1]) \
         - self.Y[i2]*self._K_(i2, i1)*(alpha2_new - self.alpha[i2]) + self.b
# eq. (56)
b2_new = -E2 - self.Y[i1]*self._K_(i1, i2)*(alpha1_new - self.alpha[i1]) \
         - self.Y[i2]*self._K_(i2, i2)*(alpha2_new - self.alpha[i2]) + self.b

# the case analysis for b following eqs. (55)-(56)
if alpha1_new > 0 and alpha1_new < self.C:
    self.b = b1_new
elif alpha2_new > 0 and alpha2_new < self.C:
    self.b = b2_new
else:
    self.b = (b1_new + b2_new) / 2

5. Update $E_1, E_2$

# eq. (35): the prediction for x_i
def _g_(self, i):
    K = np.array([self._K_(j, i) for j in range(self.m)])
    return np.dot(self.alpha * self.Y, K) + self.b

# eq. (36): difference between the prediction for x_i and y_i
def _E_(self, i):
    return self._g_(i) - self.Y[i]

# update E_1, E_2 for alpha_1, alpha_2
self.E[i1] = self._E_(i1)
self.E[i2] = self._E_(i2)

The complete code is on my GitHub.

I am well aware of the limits of my knowledge; please point out anything improper.

References:


《统计学习方法》
《理解SVM的三重境界》
《机器学习》
