The classification decision function is:

$$f(x)=\mathrm{sign}(wx+b)$$
where

$$\mathrm{sign}(z)=\begin{cases}+1 & \text{if } z\ge 0\\ -1 & \text{otherwise}\end{cases}$$
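As a quick sketch of the decision function (the helper names `sign` and `predict` are illustrative, not from the text):

```python
import numpy as np

def sign(z):
    """Return +1 if z >= 0, else -1, matching the definition above."""
    return 1 if z >= 0 else -1

def predict(w, b, x):
    """Classify x by the sign of the linear score w.x + b."""
    return sign(np.dot(w, x) + b)

print(predict(np.array([2.0, -1.0]), 0.5, np.array([1.0, 1.0])))   # score 1.5 -> 1
print(predict(np.array([2.0, -1.0]), 0.5, np.array([-1.0, 1.0])))  # score -2.5 -> -1
```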
The functional margin of a sample $(x^{(i)},y^{(i)})$ in the training set $T$:

$$\hat\gamma^{(i)} = y^{(i)}(wx^{(i)} + b).$$
The functional margin has a problem: if we scale $w$ and $b$ by a factor of 2, the margin also doubles, yet nothing meaningful has changed; the hyperplane is still the same hyperplane, and our goal is to choose a good hyperplane.
To fix this, we define the geometric margin, which is invariant to the scale of $w$ and $b$ (obtained by normalizing $w$ and $b$ by the L2 norm $||w||$):

$$\gamma^{(i)} = y^{(i)}\left(\dfrac{w}{||w||} x^{(i)} + \dfrac{b}{||w||}\right).$$
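A quick numeric check of this scale invariance (the particular $w$, $b$, and sample are made up for illustration): doubling $w$ and $b$ doubles the functional margin but leaves the geometric margin unchanged.

```python
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5
x, y = np.array([1.0, 1.0]), 1

func = y * (np.dot(w, x) + b)            # functional margin
geom = func / np.linalg.norm(w)          # geometric margin

func2 = y * (np.dot(2 * w, x) + 2 * b)   # after scaling (w, b) by 2
geom2 = func2 / np.linalg.norm(2 * w)

print(func2 / func)              # 2.0: functional margin doubled
print(np.isclose(geom, geom2))   # True: geometric margin unchanged
```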
The geometric margin of the whole training set $T$ is the smallest per-sample margin:

$$\gamma=\min_{i=1,\cdots,N}\gamma^{(i)}$$
The optimization problem is:

$$\begin{aligned} &\max_{w,b} \quad \gamma \\ &\text{s.t.}\quad y^{(i)}\left(\dfrac{w}{||w||}x^{(i)}+\dfrac{b}{||w||}\right) \ge\gamma \end{aligned}$$
Since $\gamma=\dfrac{\hat\gamma}{||w||}$, this optimization problem is equivalent to:

$$\begin{aligned} &\max_{w,b} \quad \dfrac{\hat\gamma}{||w||} \\ &\text{s.t.}\quad y^{(i)}\left(\dfrac{w}{||w||}x^{(i)}+\dfrac{b}{||w||}\right) \ge \dfrac{\hat\gamma}{||w||} \end{aligned}$$
that is,

$$\begin{aligned} &\max_{w,b} \quad \dfrac{\hat\gamma}{||w||} \\ &\text{s.t.}\quad y^{(i)}(wx^{(i)}+b) \ge \hat\gamma \end{aligned}$$
Because rescaling $w$ and $b$ does not change the set of hyperplanes satisfying the inequality constraints, we may fix $\hat\gamma=1$, and the problem becomes:

$$\begin{aligned} &\max_{w,b} \quad \dfrac{1}{||w||} \\ &\text{s.t.}\quad y^{(i)}(wx^{(i)}+b) \ge 1 \end{aligned}$$
Since maximizing $\dfrac{1}{||w||}$ is equivalent to minimizing $\dfrac{1}{2}||w||^2$, the problem becomes:

$$\begin{aligned} &\min_{w,b} \quad \dfrac{1}{2}||w||^2 \\ &\text{s.t.}\quad y^{(i)}(wx^{(i)}+b)-1 \ge 0 \end{aligned}$$
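This primal problem is a small quadratic program, so it can be checked numerically. The following sketch (the toy dataset and the choice of SciPy's SLSQP solver are illustrative assumptions, not from the text) minimizes $\frac{1}{2}||w||^2$ subject to $y^{(i)}(wx^{(i)}+b)\ge 1$:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable 2-D data (made up for illustration).
X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# theta = (w1, w2, b); minimize (1/2)||w||^2 s.t. y_i (w.x_i + b) >= 1.
def objective(theta):
    w = theta[:2]
    return 0.5 * np.dot(w, w)

constraints = [
    {"type": "ineq",
     "fun": lambda th, i=i: y[i] * (np.dot(th[:2], X[i]) + th[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print(np.round(w, 3), round(b, 3))  # roughly w = (1, 0), b = 0 for this data
```

For this symmetric data the optimum is $w=(1,0)$, $b=0$: the points $(1,0)$ and $(-1,0)$ sit exactly on the margin boundaries.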
Introducing a Lagrange multiplier $\alpha_i\ge 0$ for each inequality constraint, define the Lagrangian:

$$L(w,b,\alpha)=\dfrac{1}{2}||w||^2 -\sum_{i=1}^N\alpha_i \left[y^{(i)}(wx^{(i)}+b)-1\right]$$
The primal problem is then:

$$\min_{w,b}\max_{\alpha} L(w,b,\alpha)$$
and its Lagrangian dual problem is:

$$\max_{\alpha} \min_{w,b} L(w,b,\alpha)$$
Setting the gradients to zero,

$$\begin{aligned} &\nabla_w L(w,b,\alpha)=w-\sum_{i=1}^N\alpha_iy^{(i)}x^{(i)}=0 \\ &\nabla_b L(w,b,\alpha)=-\sum_{i=1}^N\alpha_iy^{(i)}=0 \end{aligned}$$
we obtain

$$\begin{aligned} \min_{w,b} L(w,b,\alpha)&=-\dfrac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\alpha_i\alpha_jy^{(i)}y^{(j)}(x^{(i)}\cdot x^{(j)}) +\sum_{i=1}^N\alpha_i \\ w&=\sum_{i=1}^N\alpha_iy^{(i)}x^{(i)} \end{aligned}$$
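The first line follows by expanding $L(w,b,\alpha)$ and substituting the two stationarity conditions above (note that $\sum_i\alpha_iy^{(i)}(w\cdot x^{(i)})=w\cdot w$ once $w=\sum_i\alpha_iy^{(i)}x^{(i)}$):

$$\begin{aligned} L(w,b,\alpha)&=\dfrac{1}{2}\,w\cdot w-\sum_{i=1}^N\alpha_iy^{(i)}(w\cdot x^{(i)})-b\sum_{i=1}^N\alpha_iy^{(i)}+\sum_{i=1}^N\alpha_i \\ &=\dfrac{1}{2}\,w\cdot w - w\cdot w - 0 + \sum_{i=1}^N\alpha_i \\ &=-\dfrac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\alpha_i\alpha_jy^{(i)}y^{(j)}(x^{(i)}\cdot x^{(j)})+\sum_{i=1}^N\alpha_i \end{aligned}$$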
In general $\max\min \le \min\max$ (weak duality); only when the KKT conditions hold does the optimal value of the primal problem equal that of the dual problem (strong duality).
For a problem of the form

$$\begin{aligned} &\min_w \quad f(w) \\ &\text{s.t.} \quad g_i(w) \le 0, \quad i = 1, \dots, k \\ &\qquad\; h_i(w) = 0, \quad i = 1, \dots, l, \end{aligned}$$
with Lagrangian

$$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^k \alpha_i g_i(w) + \sum_{i=1}^l \beta_i h_i(w),$$
the KKT conditions are as follows (here $n$ is the dimension of $w$):

$$\begin{aligned} \dfrac{\partial}{\partial w_i}L(w, \alpha, \beta) &= 0, \quad i = 1,\cdots ,n \\ \dfrac{\partial}{\partial \beta_i}L(w, \alpha, \beta) &= 0, \quad i = 1, \cdots ,l \\ \alpha_i g_i(w) &= 0 ,\quad i = 1, \cdots ,k \\ g_i(w) &\le 0, \quad i = 1, \cdots ,k \\ \alpha_i &\ge 0, \quad i = 1, \cdots ,k \end{aligned}$$
Note that since $g_i(w)=-\left[y^{(i)}(wx^{(i)}+b)-1\right]\le 0$, if $\alpha_i>0$ then necessarily $g_i(w)=0$ (complementary slackness). That is, sample $i$ has functional margin exactly 1 and lies at geometric distance $1/||w||$ from the separating hyperplane; such points are called support vectors.
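A small numeric check of complementary slackness (the dataset and the values of $\alpha$ are illustrative assumptions; for this symmetric toy data the optimal hyperplane is $w=(1,0)$, $b=0$ with $\alpha=(0.5, 0, 0.5, 0)$):

```python
import numpy as np

X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.5, 0.0, 0.5, 0.0])

w = (alpha * y) @ X            # w = sum_i alpha_i y_i x_i = (1, 0)
b = 0.0
g = -(y * (X @ w + b) - 1.0)   # g_i(w) <= 0 for every sample

print(np.allclose(alpha * g, 0.0))  # True: alpha_i g_i(w) = 0 for all i
print(np.where(alpha > 0)[0])       # the support vectors: exactly the points with functional margin 1
```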
Suppose $\alpha_j>0$. Then

$$y^{(j)}(wx^{(j)}+b)-1=0$$
Substituting the expression for $w$ gives (the second line multiplies through by $y^{(j)}$, and the third uses $(y^{(j)})^2=1$):

$$\begin{aligned} 0&=y^{(j)}\left(\sum_i\alpha_iy^{(i)}x^{(i)}x^{(j)}+b\right)-1 \\ &=(y^{(j)})^2\left(\sum_i\alpha_iy^{(i)}x^{(i)}x^{(j)}+b\right)-y^{(j)} \\ &=\sum_i\alpha_iy^{(i)}x^{(i)}x^{(j)}+b-y^{(j)} \end{aligned}$$
so

$$b=y^{(j)}-\sum_{i=1}^N\alpha_iy^{(i)}x^{(i)}x^{(j)}$$
Having obtained $w$ and $b$, to classify a new point $x$ we check the sign of $wx+b$; substituting the expression for $w$:

$$\begin{aligned} w^T x + b &= \left(\sum_{i=1}^N\alpha_iy^{(i)}x^{(i)}\right)^T x+b \\ &=\sum_{i=1}^N\alpha_iy^{(i)}\langle x^{(i)},x\rangle+b \end{aligned}$$
Note that only the $\alpha_i$ corresponding to support vectors can be positive; all other $\alpha_i$ are 0. Since typically only a few training points are support vectors, the inner-product sum above is much cheaper to evaluate than it looks. This also shows that the final classifier depends only on the support vectors, not on the other points.
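Putting the last two formulas together on a toy problem (the dataset and $\alpha$ values are illustrative assumptions; for this data the optimum is $w=(1,0)$, $b=0$): compute $b$ from any support vector $j$, then classify a new point using only inner products with the support vectors.

```python
import numpy as np

X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.5, 0.0, 0.5, 0.0])  # positive only on the support vectors

j = 0  # any index with alpha_j > 0
b = y[j] - np.sum(alpha * y * (X @ X[j]))

def f(x_new):
    """Decision value using only the support-vector expansion of w."""
    sv = alpha > 0
    return np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new)) + b

print(round(b, 6))                                  # 0.0 for this toy data
print(1 if f(np.array([3.0, 2.0])) >= 0 else -1)    # 1
print(1 if f(np.array([-3.0, 1.0])) >= 0 else -1)   # -1
```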