[Series 3] Linear Regression

1. Least Squares

Dataset: $D=\{(x_i,y_i)\mid i=1,2,\dotsc,N,\ x_i\in\mathbb{R}^p,\ y_i\in\mathbb{R}\}$
$$X=(x_1,x_2,\dotsc,x_N)^T_{N\times p},\qquad Y=\begin{pmatrix}y_1\\y_2\\\vdots\\y_N\end{pmatrix}_{N\times 1}$$
Linear regression needs little introduction: it fits a linear function to the data. Assume the fitted model is $f(w)=w^Tx$.
Least-squares estimate:
$$L(w)=\sum_{i=1}^{N}\|w^Tx_i-y_i\|^2=(w^TX^T-Y^T)(Xw-Y)=w^TX^TXw-2w^TX^TY+Y^TY$$
$$\hat w=\arg\min_w L(w),\qquad \frac{\partial L(w)}{\partial w}=2X^TXw-2X^TY=0\ \Rightarrow\ \hat w=(X^TX)^{-1}X^TY$$
Geometrically, the loss is the sum of the squared vertical (y-axis) distances between the predicted points and the observed points.
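Below is a minimal NumPy sketch of the closed form $\hat w=(X^TX)^{-1}X^TY$; the synthetic data, sizes, and variable names are illustrative assumptions, not part of the derivation.

```python
import numpy as np

# Illustrative synthetic data (assumed): N samples, p features.
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))                      # design matrix, N x p
w_true = np.array([1.5, -2.0, 0.5])
Y = X @ w_true + rng.normal(scale=0.1, size=N)   # targets with Gaussian noise

# Closed-form least squares: w_hat = (X^T X)^{-1} X^T Y
w_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
# Numerically safer equivalent: np.linalg.lstsq(X, Y, rcond=None)[0]
print(w_hat)
```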

2. Least Squares from a Probabilistic Perspective

Assume the noise in the data is Gaussian, i.e. $\epsilon\sim N(0,\sigma^2)$; then $y=f(w)+\epsilon=w^Tx+\epsilon$, so
$$y|x;w\sim N(w^Tx,\sigma^2)$$
$$w_{MLE}=\hat w=\arg\max_w\log P(Y|X;w)=\arg\max_w \log\prod_{i=1}^{N}P(y_i|x_i;w)=\arg\max_w\sum_{i=1}^{N}\left(\log\frac{1}{\sqrt{2\pi}\sigma}-\frac{(y_i-w^Tx_i)^2}{2\sigma^2}\right)$$
$$\hat w=\arg\min_w\sum_{i=1}^{N}(y_i-w^Tx_i)^2$$
This shows that (LSE: least squares estimate):
$$LSE\iff MLE$$
That is, the least-squares estimate coincides with the maximum-likelihood estimate under the condition that the noise is Gaussian.
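As a sanity check of $LSE\iff MLE$, the sketch below numerically maximizes the Gaussian log-likelihood (by minimizing its negative with `scipy.optimize.minimize`) and compares the result with the least-squares solution; the data and $\sigma$ are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, p, sigma = 200, 3, 0.5
X = rng.normal(size=(N, p))
w_true = np.array([2.0, -1.0, 0.3])
Y = X @ w_true + rng.normal(scale=sigma, size=N)

def neg_log_likelihood(w):
    # -log P(Y|X; w) under Gaussian noise, dropping the constant term
    r = Y - X @ w
    return np.sum(r ** 2) / (2 * sigma ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(p)).x
w_lse = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(w_mle, w_lse, atol=1e-4))      # True: the two estimates coincide
```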

3. Regularization

When the number of samples $N$ is not much larger than the feature dimension $p$ (in particular when $N\ll p$), the model easily overfits and $X^TX$ may not be invertible. Common remedies:

  1. Collect more data
  2. Feature selection / dimensionality reduction (e.g. PCA)
  3. Regularization

Regularization framework:
$$\arg\min_w\big[L(w)+\lambda P(w)\big]$$
$$L_1\ (\text{Lasso}):\ P(w)=\|w\|_1$$
$$L_2\ (\text{Ridge, ridge regression}):\ P(w)=\|w\|_2^2=w^Tw$$
$$J(w)=\sum_{i=1}^{N}\|w^Tx_i-y_i\|^2+\lambda w^Tw=w^TX^TXw-2w^TX^TY+Y^TY+\lambda w^Tw$$
$$\hat w=\arg\min_w J(w),\qquad \frac{\partial J(w)}{\partial w}=2(X^TX+\lambda I)w-2X^TY=0\ \Rightarrow\ \hat w=(X^TX+\lambda I)^{-1}X^TY$$
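A short sketch of the ridge closed form $\hat w=(X^TX+\lambda I)^{-1}X^TY$: with $N<p$ the matrix $X^TX$ is singular, but $X^TX+\lambda I$ is invertible for any $\lambda>0$. The data and the value of $\lambda$ below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 20, 50                                    # fewer samples than features: X^T X is singular
X = rng.normal(size=(N, p))
w_true = np.zeros(p)
w_true[:3] = [1.0, -2.0, 0.5]
Y = X @ w_true + rng.normal(scale=0.1, size=N)

lam = 1.0
# Ridge closed form: (X^T X + lam * I) is invertible even though X^T X is not
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
print(w_ridge[:5])
```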

4. Ridge Regression from a Bayesian Perspective

Prior: $w\sim N(0,\sigma_0^2)$
Likelihood: $y|x;w\sim N(w^Tx,\sigma^2)$
$$P(w|y)=\frac{P(y|w)P(w)}{P(y)}$$
MAP:
$$\arg\max_w P(w|y)=\arg\max_w \log[P(y|w)P(w)]=\arg\max_w\left\{-\frac{(y-w^Tx)^2}{2\sigma^2}-\frac{\|w\|^2}{2\sigma_0^2}\right\}=\arg\min_w\left[(y-w^Tx)^2+\frac{\sigma^2}{\sigma_0^2}\|w\|^2\right]$$
Therefore:
$$Regularized\ LSE\iff MAP$$
That is, regularized least squares (ridge regression) coincides with the MAP estimate when both the noise and the prior on $w$ are Gaussian, with $\lambda=\sigma^2/\sigma_0^2$.
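The sketch below checks this correspondence numerically: the MAP estimate under a Gaussian prior matches ridge regression with $\lambda=\sigma^2/\sigma_0^2$. The values of $\sigma$, $\sigma_0$ and the data are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
N, p = 50, 5
sigma, sigma0 = 0.5, 2.0
X = rng.normal(size=(N, p))
Y = X @ rng.normal(size=p) + rng.normal(scale=sigma, size=N)

# Ridge solution with lambda = sigma^2 / sigma_0^2
lam = sigma ** 2 / sigma0 ** 2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Negative log-posterior (up to constants): ||Y - Xw||^2/(2 sigma^2) + ||w||^2/(2 sigma0^2)
def neg_log_posterior(w):
    return np.sum((Y - X @ w) ** 2) / (2 * sigma ** 2) + np.sum(w ** 2) / (2 * sigma0 ** 2)

w_map = minimize(neg_log_posterior, x0=np.zeros(p)).x
print(np.allclose(w_map, w_ridge, atol=1e-4))    # True: MAP coincides with ridge
```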

Original video link:

【机器学习】【白板推导系列】【合集 1~23】 (Machine Learning: White-board Derivation Series, Episodes 1–23)
