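For a sample $(x, y)$, a traditional regression model computes the loss from the difference between the model output $f(x)$ and the true value $y$; the loss is zero if and only if $f(x)$ equals $y$ exactly.
Support vector regression (SVR) is different: a loss is incurred only when the deviation between $f(x)$ and $y$ exceeds $\epsilon$. This amounts to building a margin band of width $2\epsilon$ centered on $f(x)$; any training sample that falls inside the band is treated as correctly predicted.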
Formally,
$$\min_{w,b}\ \frac{1}{2}\|w\|^2+C\sum_{i=1}^{m}l_{\epsilon}\bigl(f(x_i)-y_i\bigr)\tag{30}$$
where $C$ is the regularization constant and $l_{\epsilon}$ is the $\epsilon$-insensitive loss
$$l_{\epsilon}(z)=\begin{cases} 0, & \text{if } |z|\leq \epsilon\\ |z|-\epsilon, & \text{otherwise} \end{cases}\tag{31}$$
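As a quick illustration of (31), the loss can be sketched in a few lines of Python. The function name `eps_insensitive_loss` and the sample residuals are made up for this example; they are not taken from any library.

```python
def eps_insensitive_loss(z, eps):
    """Return 0 inside the band |z| <= eps, and |z| - eps outside it."""
    if abs(z) <= eps:
        return 0.0
    return abs(z) - eps


eps = 0.1
# Hypothetical residuals z = f(x) - y, chosen only for illustration.
residuals = [-0.3, -0.1, 0.0, 0.05, 0.25]
losses = [eps_insensitive_loss(z, eps) for z in residuals]

for z, loss in zip(residuals, losses):
    print(f"z = {z:+.2f}  ->  loss = {loss:.2f}")
```

Residuals inside the band, including those exactly on its edge, cost nothing; outside the band the loss grows linearly with the distance from the band.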
Introduce slack variables $\xi_i$ and $\hat{\xi}_i$.
Note that two slack variables are needed because the amount of slack can differ on the two sides of the margin band.
To see where the constraints come from, write $z=f(x_i)-y_i$ and let $\xi_i=l_{\epsilon}(z)$; clearly $\xi_i\geq 0$.
When $|z|\leq \epsilon$, $\xi_i=0$.
When $|z|> \epsilon$, $\xi_i=|z|-\epsilon>0$.
In both cases $|z|-\epsilon\leq \xi_i$, i.e. $|z|\leq \xi_i+\epsilon$.
Hence $-\xi_i-\epsilon\leq z\leq \xi_i+\epsilon$, that is, $-\xi_i-\epsilon\leq f(x_i)-y_i\leq \xi_i+\epsilon$.
Allowing a separate slack on each side of the band, (30) can be rewritten as
$$\begin{aligned} \min_{w,b,\xi_i,\hat{\xi}_i}\quad & \frac{1}{2}\|w\|^2+C\sum_{i=1}^{m}(\xi_i+\hat{\xi}_i)\\ \text{s.t.}\quad & f(x_i)-y_i\leq\epsilon+\xi_i\\ & y_i-f(x_i)\leq\epsilon+\hat{\xi}_i\\ & \xi_i\geq 0,\ \hat{\xi}_i\geq 0,\quad i=1,2,\dots,m \end{aligned}\tag{32}$$
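The link between the slack variables and the $\epsilon$-insensitive loss can be checked numerically. The sketch below is only an illustration: the helper name `minimal_slacks` and the three (prediction, target) pairs are hypothetical. For a fixed prediction it computes the smallest slacks that satisfy the two inequality constraints and compares their sum with the loss.

```python
def minimal_slacks(f_x, y, eps):
    """Smallest (xi, xi_hat) satisfying both inequality constraints for one sample."""
    xi = max(0.0, f_x - y - eps)      # prediction above the band
    xi_hat = max(0.0, y - f_x - eps)  # prediction below the band
    return xi, xi_hat


def eps_insensitive_loss(z, eps):
    """Loss of a residual z: zero inside the band, linear outside."""
    return max(0.0, abs(z) - eps)


eps = 0.1
# Hypothetical (prediction, target) pairs: inside, above, and below the band.
cases = [(1.05, 1.0), (1.5, 1.0), (0.4, 1.0)]
results = []
for f_x, y in cases:
    xi, xi_hat = minimal_slacks(f_x, y, eps)
    loss = eps_insensitive_loss(f_x - y, eps)
    results.append((xi, xi_hat, loss))
    print(f"f(x)={f_x}, y={y}: xi={xi:.2f}, xi_hat={xi_hat:.2f}, loss={loss:.2f}")
```

Because the objective penalizes the slacks, the minimizer pushes each one down to these smallest feasible values, so $\xi_i+\hat{\xi}_i$ reproduces the loss term of the original objective and at most one of the two slacks is nonzero for any sample.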
As before, we introduce Lagrange multipliers $\alpha_i,\hat{\alpha}_i,\mu_i,\hat{\mu}_i\geq 0$ and write the Lagrangian
$$\begin{aligned} L(w,b,\xi,\hat{\xi},\alpha,\hat{\alpha},\mu,\hat{\mu})={}&\frac{1}{2}\|w\|^2+C\sum_{i=1}^{m}(\xi_i+\hat{\xi}_i)\\ &+\sum_{i=1}^{m}\alpha_i\bigl(f(x_i)-y_i-\epsilon-\xi_i\bigr)+\sum_{i=1}^{m}\hat{\alpha}_i\bigl(y_i-f(x_i)-\epsilon-\hat{\xi}_i\bigr)\\ &-\sum_{i=1}^{m}\mu_i\xi_i-\sum_{i=1}^{m}\hat{\mu}_i\hat{\xi}_i \end{aligned}\tag{33}$$
where $f(x_i)=w^Tx_i+b$.
Take the partial derivatives of $L$ with respect to $w$, $b$, $\xi_i$ and $\hat{\xi}_i$:
$$\begin{aligned} \frac{\partial L}{\partial w}&=w+\sum_{i=1}^{m}\alpha_ix_i-\sum_{i=1}^{m}\hat{\alpha}_ix_i\\ \frac{\partial L}{\partial b}&=\sum_{i=1}^{m}\alpha_i-\sum_{i=1}^{m}\hat{\alpha}_i\\ \frac{\partial L}{\partial \xi_i}&=C-\alpha_i-\mu_i\\ \frac{\partial L}{\partial \hat{\xi}_i}&=C-\hat{\alpha}_i-\hat{\mu}_i \end{aligned}$$
Setting them to zero and rearranging gives
$$\begin{aligned} w&=\sum_{i=1}^{m}(\hat{\alpha}_i-\alpha_i)x_i\\ 0&=\sum_{i=1}^{m}(\hat{\alpha}_i-\alpha_i)\\ C&=\alpha_i+\mu_i\\ C&=\hat{\alpha}_i+\hat{\mu}_i \end{aligned}\tag{34}$$
Substituting (34) into the Lagrangian yields the dual problem
$$\begin{aligned} \max_{\alpha,\hat{\alpha}}\quad & \sum_{i=1}^{m}\Bigl[y_i(\hat{\alpha}_i-\alpha_i)-\epsilon(\hat{\alpha}_i+\alpha_i)\Bigr]-\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}(\hat{\alpha}_i-\alpha_i)(\hat{\alpha}_j-\alpha_j)x_i^Tx_j\\ \text{s.t.}\quad & \sum_{i=1}^{m}(\hat{\alpha}_i-\alpha_i)=0\\ & 0\leq \alpha_i,\hat{\alpha}_i\leq C,\quad i=1,2,\dots,m \end{aligned}\tag{35}$$
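The substitution step is compressed above, so here is one way to spell it out. This is a sketch that uses only (33) and (34), grouping the Lagrangian by variable. The terms in $\xi_i$, $\hat{\xi}_i$ and $b$ vanish, and the terms in $w$ collapse to $-\frac{1}{2}\|w\|^2$:

$$\begin{aligned} \sum_{i=1}^{m}(C-\alpha_i-\mu_i)\,\xi_i&=0,\qquad \sum_{i=1}^{m}(C-\hat{\alpha}_i-\hat{\mu}_i)\,\hat{\xi}_i=0,\qquad b\sum_{i=1}^{m}(\alpha_i-\hat{\alpha}_i)=0,\\ \frac{1}{2}\|w\|^2+w^T\sum_{i=1}^{m}(\alpha_i-\hat{\alpha}_i)x_i&=\frac{1}{2}\|w\|^2-w^Tw=-\frac{1}{2}\|w\|^2=-\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}(\hat{\alpha}_i-\alpha_i)(\hat{\alpha}_j-\alpha_j)x_i^Tx_j,\\ \sum_{i=1}^{m}\bigl[\alpha_i(-y_i-\epsilon)+\hat{\alpha}_i(y_i-\epsilon)\bigr]&=\sum_{i=1}^{m}\bigl[y_i(\hat{\alpha}_i-\alpha_i)-\epsilon(\hat{\alpha}_i+\alpha_i)\bigr]. \end{aligned}$$

Adding the last two lines gives the dual objective. The box constraint on the multipliers follows from $C=\alpha_i+\mu_i$ with $\mu_i\geq 0$, which forces $\alpha_i\leq C$, and likewise for $\hat{\alpha}_i$.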
The KKT conditions must also hold:
$$\begin{cases} \alpha_i\bigl(f(x_i)-y_i-\epsilon-\xi_i\bigr)=0\\ \hat{\alpha}_i\bigl(y_i-f(x_i)-\epsilon-\hat{\xi}_i\bigr)=0\\ \mu_i\xi_i=0,\ \hat{\mu}_i\hat{\xi}_i=0\ \Rightarrow\ (C-\alpha_i)\xi_i=0,\ (C-\hat{\alpha}_i)\hat{\xi}_i=0\\ f(x_i)-y_i\leq\epsilon+\xi_i\\ y_i-f(x_i)\leq\epsilon+\hat{\xi}_i\\ \xi_i\geq 0,\ \hat{\xi}_i\geq 0,\ \alpha_i\geq 0,\ \hat{\alpha}_i\geq 0,\quad i=1,2,\dots,m \end{cases}\tag{36}$$
$\alpha_i$ can take a nonzero value only if $f(x_i)-y_i-\epsilon-\xi_i=0$.
$\hat{\alpha}_i$ can take a nonzero value only if $y_i-f(x_i)-\epsilon-\hat{\xi}_i=0$.
In other words, $\alpha_i$ and $\hat{\alpha}_i$ can be nonzero only when the sample does not lie strictly inside the margin band. Moreover, $f(x_i)-y_i-\epsilon-\xi_i=0$ and $y_i-f(x_i)-\epsilon-\hat{\xi}_i=0$ cannot hold at the same time: adding them gives $-2\epsilon-\xi_i-\hat{\xi}_i=0$, which is impossible for $\epsilon>0$ and nonnegative slacks. Therefore at least one of $\alpha_i$ and $\hat{\alpha}_i$ is zero.
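These conditions can be checked on a tiny hand-solved case. The sketch below makes several assumptions: the three 1-D samples, $\epsilon=0.1$, $C=1$ and the dual values are a toy example worked out by hand, not the output of a solver. It recovers $w$ from the stationarity condition for $w$, recovers $b$ from a sample with $0<\alpha_i<C$ (for which $\xi_i=0$, so $f(x_i)-y_i-\epsilon=0$), and then evaluates the complementary-slackness products.

```python
# Toy 1-D data and dual variables (hypothetical values, chosen by hand).
X = [0.0, 1.0, 2.0]
Y = [0.0, 1.0, 2.0]
EPS, C = 0.1, 1.0
alpha = [0.45, 0.0, 0.0]       # multipliers of f(x_i) - y_i <= eps + xi_i
alpha_hat = [0.0, 0.0, 0.45]   # multipliers of y_i - f(x_i) <= eps + xi_hat_i

# Equality constraint of the dual: sum_i (alpha_hat_i - alpha_i) = 0.
balance = sum(ah - a for a, ah in zip(alpha, alpha_hat))

# Stationarity in w: w = sum_i (alpha_hat_i - alpha_i) * x_i.
w = sum((ah - a) * x for a, ah, x in zip(alpha, alpha_hat, X))

# A sample with 0 < alpha_i < C has xi_i = 0, hence f(x_i) = y_i + eps.
i = next(k for k, a in enumerate(alpha) if 0.0 < a < C)
b = Y[i] + EPS - w * X[i]


def f(x):
    """Linear model f(x) = w * x + b in one dimension."""
    return w * x + b


# Complementary-slackness products (all slacks are zero in this toy case).
cs_upper = [a * (f(x) - y - EPS) for a, x, y in zip(alpha, X, Y)]
cs_lower = [ah * (y - f(x) - EPS) for ah, x, y in zip(alpha_hat, X, Y)]
both = [a * ah for a, ah in zip(alpha, alpha_hat)]

print("w =", w, " b =", b)
print("complementary slackness:", cs_upper, cs_lower)
print("alpha_i * alpha_hat_i:", both)
```

In this toy case only the first and last samples sit on the edge of the band and carry nonzero multipliers; the middle sample lies strictly inside the band and both of its multipliers are zero, which is the sparsity described above.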