Video link
Below are my notes after watching the video, covering the full derivation and solution of the SVM formulation:
$$
\left\{
\begin{aligned}
& \max\; \text{margin}(w, b) \\
& \text{s.t.} \quad y_i(w^T x_i + b) > 0, \quad i \in \{1, 2, \dots, N\}
\end{aligned}
\right.
$$
Since $y_i \in \{-1, +1\}$, the point-to-hyperplane distance formula gives:
$$
\begin{aligned}
\max\; \text{margin}(w,b) &= \max_{w,b} \min_{x_i}\; \text{distance}(w, b, x_i) \\
&= \max_{w,b} \min_{x_i} \frac{1}{\|w\|} |w^T x_i + b| \\
&= \max_{w,b} \frac{1}{\|w\|} \min_{x_i} |w^T x_i + b| \\
&= \max_{w,b} \frac{1}{\|w\|} \min_{x_i} y_i(w^T x_i + b)
\end{aligned}
$$
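As a quick numerical sanity check of the distance formula, here is a minimal Python sketch (the hyperplane and the sample point are made-up values for illustration):

```python
import numpy as np

def distance(w, b, x):
    """Distance from point x to the hyperplane w^T x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

# Hypothetical 2-D hyperplane x1 + x2 - 1 = 0 and a sample point.
w = np.array([1.0, 1.0])
b = -1.0
x = np.array([2.0, 2.0])
print(distance(w, b, x))  # |2 + 2 - 1| / sqrt(2) ≈ 2.1213
```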
Since rescaling $w$ and $b$ by the same factor does not change the hyperplane $w^T x + b = 0$, we are free to fix the scale so that $\min_{x_i} y_i(w^T x_i + b) = 1$. The expression above then simplifies to:
$$
\begin{aligned}
\max\; \text{margin}(w,b) &= \max_{w,b} \frac{1}{\|w\|} \\
&= \min_{w,b} \frac{1}{2} \|w\|^2
\end{aligned}
$$
The factor $\frac{1}{2}$ is added purely for convenience when differentiating and does not change the minimizer. Here $\|w\|$ is the 2-norm of $w$, so $\|w\|^2 = w^T w$.
So the SVM optimization problem can finally be written as:
$$
\left\{
\begin{aligned}
& \min_{w,b} \frac{1}{2} w^T w \\
& \text{s.t.} \quad \min_{x_i} y_i(w^T x_i + b) = 1
\end{aligned}
\right.
$$
which is equivalent to:
$$
\left\{
\begin{aligned}
& \min_{w,b} \frac{1}{2} w^T w \\
& \text{s.t.} \quad y_i(w^T x_i + b) \geq 1
\end{aligned}
\right.
$$
or, rearranged:
$$
\left\{
\begin{aligned}
& \min_{w,b} \frac{1}{2} w^T w \\
& \text{s.t.} \quad 1 - y_i(w^T x_i + b) \leq 0
\end{aligned}
\right.
$$
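To make the formulation concrete, here is a minimal sketch that feeds this constrained QP to `scipy.optimize.minimize` on a hypothetical, linearly separable toy dataset (the data and the choice of SLSQP are illustrative assumptions, not part of the derivation; a dedicated QP solver would be used in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

def objective(theta):
    w = theta[:2]          # theta packs (w1, w2, b)
    return 0.5 * w @ w     # (1/2) w^T w

# One inequality per sample: y_i (w^T x_i + b) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": lambda theta, i=i: y[i] * (X[i] @ theta[:2] + theta[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w_opt, b_opt = res.x[:2], res.x[2]
print(w_opt, b_opt)  # for this toy data: w ≈ (0.5, 0.5), b ≈ -2
```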
Consider this constrained convex quadratic optimization problem. Using the Lagrangian, we can turn it into an unconstrained min-max primal problem.
Because the problem is convex and quadratic, strong duality holds: the optimal value of the dual problem equals that of the primal, and the KKT conditions are satisfied at the optimum. We can therefore differentiate the Lagrangian and use the KKT conditions to find the optimal solution.
The constrained problem:
$$
\left\{
\begin{aligned}
& \min_{w,b} \frac{1}{2} w^T w \\
& \text{s.t.} \quad 1 - y_i(w^T x_i + b) \leq 0
\end{aligned}
\right.
$$
Converting to an unconstrained problem via the Lagrangian, we introduce multipliers $\lambda_1, \lambda_2, \dots, \lambda_N$:
$$
L(w, b, \lambda) = \frac{1}{2} w^T w + \sum_{i=1}^N \lambda_i \left(1 - y_i(w^T x_i + b)\right)
$$
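The Lagrangian translates directly into code; a minimal numpy sketch for reference (the arguments are whatever values the caller supplies):

```python
import numpy as np

def lagrangian(w, b, lam, X, y):
    """L(w, b, λ) = (1/2) w^T w + Σ_i λ_i (1 - y_i (w^T x_i + b))."""
    margins = y * (X @ w + b)                  # y_i (w^T x_i + b), shape (N,)
    return 0.5 * w @ w + lam @ (1.0 - margins)
```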
The constrained problem can then be rewritten as the following unconstrained min-max problem:
$$
\left\{
\begin{aligned}
& \min_{w,b} \max_{\lambda} \; L(w, b, \lambda) \\
& \text{s.t.} \quad \lambda_i \geq 0
\end{aligned}
\right.
$$
By duality, this min-max primal problem can be converted into its dual, a max-min problem:
$$
\left\{
\begin{aligned}
& \max_{\lambda} \min_{w,b} \; L(w, b, \lambda) \\
& \text{s.t.} \quad \lambda_i \geq 0
\end{aligned}
\right.
$$
(1) First solve $\min_{w,b} L(w, b, \lambda)$.
Setting the partial derivatives of $L(w, b, \lambda)$ with respect to $w$ and $b$ to zero gives:
$$
w = \sum_{i=1}^N \lambda_i y_i x_i
$$

$$
\sum_{i=1}^N \lambda_i y_i = 0
$$
Substituting these back into the Lagrangian $L(w, b, \lambda)$ yields:
$$
\min_{w,b} L(w, b, \lambda) = -\frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^N \lambda_i
$$
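Note that this reduced objective depends on the data only through the inner products $(x_i \cdot x_j)$; in code it needs just the Gram matrix. A minimal sketch:

```python
import numpy as np

def dual_objective(lam, X, y):
    """-(1/2) ΣΣ λ_i λ_j y_i y_j (x_i · x_j) + Σ λ_i."""
    K = X @ X.T        # Gram matrix of pairwise inner products
    v = lam * y        # elementwise λ_i y_i
    return -0.5 * v @ K @ v + lam.sum()
```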
(2) Then solve $\max_{\lambda} \min_{w,b} L(w, b, \lambda)$:
$$
\max_{\lambda} \min_{w,b} L(w,b,\lambda) =
\left\{
\begin{aligned}
\max_{\lambda} & \quad -\frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^N \lambda_i \\
\text{s.t.} & \quad \sum_{i=1}^N \lambda_i y_i = 0 \\
& \quad \lambda_i \geq 0, \quad i = 1, 2, \dots, N
\end{aligned}
\right.
$$
Flipping the sign turns the maximization into a minimization, so the dual of the original problem can finally be expressed as:
$$
\max_{\lambda} \min_{w,b} L(w,b,\lambda) =
\left\{
\begin{aligned}
\min_{\lambda} & \quad \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^N \lambda_i \\
\text{s.t.} & \quad \sum_{i=1}^N \lambda_i y_i = 0 \\
& \quad \lambda_i \geq 0, \quad i = 1, 2, \dots, N
\end{aligned}
\right.
$$
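Here is a minimal sketch that solves this dual QP numerically, again with SLSQP on the same hypothetical toy data (real implementations typically use SMO or a dedicated QP solver instead):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
K = X @ X.T  # Gram matrix

def neg_dual(lam):
    # The minimization form: (1/2) ΣΣ λ_i λ_j y_i y_j (x_i · x_j) - Σ λ_i.
    v = lam * y
    return 0.5 * v @ K @ v - lam.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(len(y)),
    method="SLSQP",
    bounds=[(0.0, None)] * len(y),                             # λ_i >= 0
    constraints=[{"type": "eq", "fun": lambda lam: lam @ y}],  # Σ λ_i y_i = 0
)
lam_star = res.x  # λ*, used below to recover w* and b*
```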
Suppose the solution of the dual problem in $\lambda$ is

$$
\lambda^* = (\lambda_1^*, \lambda_2^*, \dots, \lambda_N^*)^T
$$

Then $w^*$ and $b^*$ can be recovered from $\lambda^*$.
Note: for the proof and the supporting theorem, see page 105 of 《统计学习方法》 (Statistical Learning Methods), 2nd edition; the details are omitted here.
From the KKT conditions we can derive (where $j$ is any index with $\lambda_j^* > 0$, i.e., $x_j$ is a support vector):
$$
w^* = \sum_{i=1}^N \lambda_i^* y_i x_i
$$

$$
b^* = y_j - \sum_{i=1}^N \lambda_i^* y_i (x_i \cdot x_j)
$$
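Continuing the sketch, $w^*$ and $b^*$ follow directly from $\lambda^*$; any index $j$ with $\lambda_j^* > 0$ works for $b^*$:

```python
import numpy as np

def recover_w_b(lam_star, X, y):
    """Recover w* and b* from the dual solution λ*."""
    w_star = (lam_star * y) @ X                   # w* = Σ λ_i* y_i x_i
    j = int(np.argmax(lam_star))                  # an index with λ_j* > 0 (a support vector)
    b_star = y[j] - (lam_star * y) @ (X @ X[j])   # b* = y_j - Σ λ_i* y_i (x_i · x_j)
    return w_star, b_star
```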
From these, the separating hyperplane can be written as:
$$
\sum_{i=1}^N \lambda_i^* y_i (x \cdot x_i) + b^* = 0
$$
and the classification decision function as:
$$
f(x) = \operatorname{sign}\left(\sum_{i=1}^N \lambda_i^* y_i (x \cdot x_i) + b^*\right)
$$
Note: here $x$ is a test input. The decision function therefore depends only on inner products between the input $x$ and the training samples $x_i$.
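Since prediction needs only those inner products, the decision function is a one-liner over the training set; this is also what makes the kernel trick possible, since $(x \cdot x_i)$ can be replaced by any kernel $K(x, x_i)$. A minimal sketch:

```python
import numpy as np

def predict(x, lam_star, b_star, X, y):
    """f(x) = sign(Σ λ_i* y_i (x · x_i) + b*): only inner products with the training data."""
    return np.sign((lam_star * y) @ (X @ x) + b_star)
```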