Linearly Separable Support Vector Machines (SVM) and Hard-Margin Maximization

The classification decision function is
$$f(x)=\operatorname{sign}(w\cdot x+b)$$
where
$$\operatorname{sign}(z)=\begin{cases}+1 & \text{if } z\ge 0\\ -1 & \text{otherwise}\end{cases}$$
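As a minimal sketch of this decision rule (the hyperplane parameters below are made up for illustration):

```python
import numpy as np

def sign(z):
    # By the convention above, sign(0) = +1.
    return 1 if z >= 0 else -1

def decision(w, b, x):
    """f(x) = sign(w . x + b) for a linear classifier."""
    return sign(np.dot(w, x) + b)

# Hypothetical separating hyperplane w . x + b = 0 with w = (1, 1), b = -1.
w, b = np.array([1.0, 1.0]), -1.0
print(decision(w, b, np.array([2.0, 2.0])))    # positive side -> +1
print(decision(w, b, np.array([-1.0, -1.0])))  # negative side -> -1
```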

The functional margin of a sample $(x^{(i)},y^{(i)})$ in the training set $T$ is
$$\hat\gamma^{(i)} = y^{(i)}(w\cdot x^{(i)} + b).$$

The functional margin has a problem: if we scale $w$ and $b$ by a factor of 2, the margin also doubles, yet nothing meaningful has changed, since the hyperplane is still the same hyperplane, and our goal is to choose a good hyperplane.

To fix this, we define the geometric margin, which is invariant to the scale of $w$ and $b$ (obtained by normalizing by the L2 norm $\lVert w\rVert$):

$$\gamma^{(i)} = y^{(i)}\left(\dfrac{w}{\lVert w\rVert}\cdot x^{(i)} + \dfrac{b}{\lVert w\rVert}\right).$$

The geometric margin of the whole training set $T$ is
$$\gamma=\min_{i=1,\cdots,N}\gamma^{(i)}$$
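The scale invariance can be checked numerically; this sketch uses a made-up hyperplane and sample:

```python
import numpy as np

def functional_margin(w, b, x, y):
    # hat(gamma) = y * (w . x + b): doubles when (w, b) is doubled
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    # gamma = y * (w . x + b) / ||w||: invariant to rescaling (w, b)
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

# Hypothetical hyperplane with w = (3, 4), so ||w|| = 5, and one sample.
w, b = np.array([3.0, 4.0]), -5.0
x, y = np.array([2.0, 1.0]), 1

print(functional_margin(w, b, x, y), functional_margin(2 * w, 2 * b, x, y))  # 5.0 10.0
print(geometric_margin(w, b, x, y), geometric_margin(2 * w, 2 * b, x, y))    # 1.0 1.0
```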
The optimization problem is:
$$\begin{aligned} &\max_{w,b} \quad \gamma \\ &\text{s.t.}\quad y^{(i)}\left(\dfrac{w}{\lVert w\rVert}\cdot x^{(i)}+\dfrac{b}{\lVert w\rVert}\right) \ge\gamma \end{aligned}$$
Since $\gamma=\dfrac{\hat\gamma}{\lVert w\rVert}$,

this optimization problem is equivalent to:

$$\begin{aligned} &\max_{w,b} \quad \dfrac{\hat\gamma}{\lVert w\rVert} \\ &\text{s.t.}\quad y^{(i)}\left(\dfrac{w}{\lVert w\rVert}\cdot x^{(i)}+\dfrac{b}{\lVert w\rVert}\right) \ge \dfrac{\hat\gamma}{\lVert w\rVert} \end{aligned}$$

Multiplying both sides of the constraint by $\lVert w\rVert$ gives:

$$\begin{aligned} &\max_{w,b} \quad \dfrac{\hat\gamma}{\lVert w\rVert} \\ &\text{s.t.}\quad y^{(i)}(w\cdot x^{(i)}+b) \ge \hat\gamma \end{aligned}$$

Since rescaling $(w,b)$ does not affect the inequality constraints of the optimization problem, we can set $\hat\gamma=1$, and the problem becomes
$$\begin{aligned} &\max_{w,b} \quad \dfrac{1}{\lVert w\rVert} \\ &\text{s.t.}\quad y^{(i)}(w\cdot x^{(i)}+b) \ge 1 \end{aligned}$$
Since maximizing $\dfrac{1}{\lVert w\rVert}$ is equivalent to minimizing $\dfrac{1}{2}\lVert w\rVert^2$, the optimization problem becomes:
$$\begin{aligned} &\min_{w,b} \quad \dfrac{1}{2}\lVert w\rVert^2 \\ &\text{s.t.}\quad y^{(i)}(w\cdot x^{(i)}+b)-1 \ge 0 \end{aligned}$$
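As a sanity check, this quadratic program can be solved numerically for a tiny made-up dataset. The sketch below uses scipy's general-purpose SLSQP solver (a dedicated QP solver would be the usual choice in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable toy data: two positive points, one negative.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

# Decision variables theta = (w1, w2, b); minimize (1/2)||w||^2.
objective = lambda th: 0.5 * (th[0] ** 2 + th[1] ** 2)

# One inequality constraint per sample: y_i (w . x_i + b) - 1 >= 0.
cons = [{"type": "ineq", "fun": lambda th, i=i: y[i] * (X[i] @ th[:2] + th[2]) - 1.0}
        for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=cons, method="SLSQP")
w, b = res.x[:2], res.x[2]
print(w, b)  # for this data the optimum is w = (0.5, 0.5), b = -2
```

All three constraints can then be verified to hold, with two of them active (equal to 0).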

Introducing a Lagrange multiplier $\alpha_i\ge 0$ for each inequality constraint, we define the Lagrangian:
$$L(w,b,\alpha)=\dfrac{1}{2}\lVert w\rVert^2 -\sum_{i=1}^N\alpha_i \left[y^{(i)}(w\cdot x^{(i)}+b)-1\right]$$
The primal problem is then:
$$\min_{w,b}\;\max_{\alpha} \; L(w,b,\alpha)$$
and its Lagrangian dual problem is:
$$\max_{\alpha}\;\min_{w,b} \; L(w,b,\alpha)$$


Setting the gradients of $L$ with respect to $w$ and $b$ to zero:
$$\begin{aligned} &\nabla_w L(w,b,\alpha)=w-\sum_{i=1}^N\alpha_iy^{(i)}x^{(i)}=0 \\ &\nabla_b L(w,b,\alpha)=-\sum_{i=1}^N\alpha_iy^{(i)}=0 \end{aligned}$$

Substituting these back, we obtain
$$\min_{w,b} L(w,b,\alpha)=-\dfrac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\alpha_i\alpha_jy^{(i)}y^{(j)}(x^{(i)}\cdot x^{(j)}) +\sum_{i=1}^N\alpha_i$$
with
$$w=\sum_{i=1}^N\alpha_iy^{(i)}x^{(i)}$$
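Maximizing this expression over $\alpha\ge 0$, subject to $\sum_i\alpha_i y^{(i)}=0$, gives the dual problem. As a sketch, it can be solved numerically for a made-up toy dataset (the data and solver choice here are illustrative, not part of the derivation):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable toy data: two positive points, one negative.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

# Matrix G_ij = y_i * y_j * <x_i, x_j>.
Yx = y[:, None] * X
G = Yx @ Yx.T

def neg_dual(a):
    """Negated dual objective: (1/2) a'Ga - sum(a); minimized instead of maximized."""
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(3),
               bounds=[(0.0, None)] * 3,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
               method="SLSQP")
alpha = res.x
w = (alpha * y) @ X   # recover w = sum_i alpha_i y_i x_i
print(alpha, w)       # for this data: alpha = (0.25, 0, 0.25), w = (0.5, 0.5)
```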

In general $\max\min \le \min\max$; the solution of the primal problem equals the solution of the dual problem only when the KKT conditions hold.

For a problem of the form
$$\begin{aligned} &\min_w \quad f(w) \\ &\ \text{s.t.}\quad g_i(w) \le 0, \quad i = 1, \dots, k\\ &\phantom{\ \text{s.t.}\quad} h_i(w) = 0, \quad i = 1, \dots, l \end{aligned}$$
with Lagrangian
$$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^k \alpha_i g_i(w) + \sum_{i=1}^l \beta_i h_i(w),$$
the KKT conditions are:
$$\begin{aligned} \dfrac{\partial}{\partial w_i}L(w, \alpha, \beta) &= 0, \quad i = 1,\cdots ,N \\ \dfrac{\partial}{\partial \beta_i}L(w, \alpha, \beta) &= 0, \quad i = 1, \cdots ,l \\ \alpha_i g_i(w) &= 0 ,\quad i = 1, \cdots ,k\\ g_i(w) &\le 0, \quad i = 1, \cdots ,k \\ \alpha_i &\ge 0, \quad i = 1, \cdots ,k \end{aligned}$$
Note that since $g_i(w)=-\left[y^{(i)}(w\cdot x^{(i)}+b)-1\right]\le 0$,
if $\alpha_i>0$ then necessarily $g_i(w)=0$, meaning the functional margin of sample $i$ is exactly 1, i.e., sample $i$ lies on the margin boundary. We call such points support vectors.

Suppose $\alpha_j>0$; then
$$y^{(j)}(w\cdot x^{(j)}+b)-1=0.$$
Substituting the expression for $w$ gives
$$\begin{aligned} 0&=y^{(j)}\left(\sum_i\alpha_iy^{(i)}x^{(i)}\cdot x^{(j)}+b\right)-1 \\ &=(y^{(j)})^2\left(\sum_i\alpha_iy^{(i)}x^{(i)}\cdot x^{(j)}+b\right)-y^{(j)} \\ &=\sum_i\alpha_iy^{(i)}x^{(i)}\cdot x^{(j)}+b-y^{(j)} \end{aligned}$$

Hence
$$b=y^{(j)}-\sum_{i=1}^N\alpha_iy^{(i)}x^{(i)}\cdot x^{(j)}$$
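This recovery of $b$ can be sketched directly; the toy data and the dual solution $\alpha=(0.25,0,0.25)$ below are assumed values for illustration:

```python
import numpy as np

# Made-up toy data with an assumed dual solution alpha = (0.25, 0, 0.25).
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25])

sv = np.where(alpha > 1e-8)[0]   # support vector indices (alpha_j > 0)
j = sv[0]                        # any support vector gives the same b
b = y[j] - sum(alpha[i] * y[i] * (X[i] @ X[j]) for i in range(len(y)))
print(sv, b)  # support vectors are samples 0 and 2; b = -2.0
```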

Having obtained $w$ and $b$, to classify a new data point $x$ we check the sign of $w\cdot x+b$. Substituting the expression for $w$:
$$\begin{aligned} w^T x + b &= \left(\sum_{i=1}^N\alpha_iy^{(i)}x^{(i)}\right)^T x+b \\ &=\sum_{i=1}^N\alpha_iy^{(i)}\langle x^{(i)},x\rangle+b \end{aligned}$$
Note that only the $\alpha_i$ corresponding to support vectors can be greater than 0; all other $\alpha_i$ are 0. Since only a small number of training points are support vectors, computing the inner products above is much cheaper than it looks. This also shows that the final classifier depends only on the support vectors and on no other points.
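A sketch of this support-vector-only prediction, with the toy data, the dual solution $\alpha$, and the intercept $b$ all assumed for illustration:

```python
import numpy as np

# Made-up toy data with an assumed dual solution and intercept.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25])
b = -2.0
support = np.where(alpha > 1e-8)[0]   # only these indices contribute

def predict(x):
    # The sum runs over support vectors only; all other alpha_i are exactly 0.
    s = sum(alpha[i] * y[i] * (X[i] @ x) for i in support)
    return 1 if s + b >= 0 else -1

print(predict(np.array([4.0, 4.0])), predict(np.array([0.0, 1.0])))  # 1 -1
```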

References:

[1] http://cs229.stanford.edu/notes/cs229-notes3.pdf
[2] Li Hang, Statistical Learning Methods (统计学习方法)
