Dataset:
$$D=\{(x_i,y_i)\mid i=1,2,\dotsc,N,\; x_i\in\mathbb{R}^p,\; y_i\in\mathbb{R}\}$$
$$X=(x_1,x_2,\dots,x_N)^T_{N\times p},\qquad Y=\begin{pmatrix}y_1\\y_2\\\vdots\\y_N\end{pmatrix}_{N\times 1}$$
Linear regression itself needs little introduction: it fits a linear function to the data. Assume the fitted model is $f(w)=w^Tx$ (a bias term can be absorbed into $w$ by appending a constant 1 to $x$).
Least squares estimate:
$$L(w)=\sum_{i=1}^{N}\|w^Tx_i-y_i\|^2=(w^TX^T-Y^T)(Xw-Y)=w^TX^TXw-2w^TX^TY+Y^TY$$
$$\hat w=\arg\min_w L(w),\qquad \frac{\partial L(w)}{\partial w}=2X^TXw-2X^TY=0\;\Rightarrow\;\hat w=(X^TX)^{-1}X^TY$$
Geometrically, the loss $L(w)$ is the sum of squared distances, measured along the $y$-axis, between the predicted and observed points.
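A minimal numpy sketch of the normal-equation solution above. The toy data is hypothetical, and `np.linalg.solve` is used rather than forming $(X^TX)^{-1}$ explicitly, which is numerically safer:

```python
import numpy as np

def ols_fit(X, Y):
    """Closed-form least squares: w_hat = (X^T X)^{-1} X^T Y."""
    # Solve the normal equations instead of inverting X^T X explicitly.
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Hypothetical toy data: y = 2*x1 - 3*x2 + Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

print(ols_fit(X, Y))  # close to [ 2. -3.]
```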
Assume the noise in the data is Gaussian, i.e. $\epsilon\sim N(0,\sigma^2)$, so that $y=f(w)+\epsilon=w^Tx+\epsilon$. Then
$$y\mid x;w\sim N(w^Tx,\sigma^2)$$
$$w_{MLE}=\hat w=\arg\max_w\log P(Y\mid X;w)=\arg\max_w\log\prod_{i=1}^{N}P(y_i\mid x_i;w)\\
=\arg\max_w\sum_{i=1}^{N}\left[\log\frac{1}{\sqrt{2\pi}\sigma}-\frac{(y_i-w^Tx_i)^2}{2\sigma^2}\right]\\
=\arg\min_w\sum_{i=1}^{N}\frac{1}{2\sigma^2}(y_i-w^Tx_i)^2$$
This shows that (LSE = least squares estimation):
$$\text{LSE}\iff\text{MLE}$$
The equivalence holds under the maximum-likelihood assumption that the noise is Gaussian.
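As a sanity check on this equivalence, the sketch below (assuming scipy is available; the data is synthetic) minimizes the Gaussian negative log-likelihood numerically and compares the result with the closed-form LSE solution:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data with a known Gaussian noise level (values are hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
sigma = 0.3
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=200)

def neg_log_likelihood(w):
    # Gaussian NLL up to additive constants: sum_i (y_i - w^T x_i)^2 / (2 sigma^2)
    r = Y - X @ w
    return (r @ r) / (2 * sigma**2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x
w_lse = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(w_mle, w_lse, atol=1e-4))  # True: same minimizer
```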
When $N$ is not much larger than $p$ (too few samples for the dimensionality), the model overfits easily and $X^TX$ may not be invertible. A standard remedy is regularization:
Regularization framework:
$$\arg\min_w\,[L(w)+\lambda P(w)]$$
$L_1$ (Lasso): $P(w)=\|w\|_1$
$L_2$ (Ridge regression): $P(w)=\|w\|_2^2=w^Tw$
$$J(w)=\sum_{i=1}^{N}\|w^Tx_i-y_i\|^2+\lambda w^Tw=w^TX^TXw-2w^TX^TY+Y^TY+\lambda w^Tw$$
$$\hat w=\arg\min_w J(w),\qquad\frac{\partial J(w)}{\partial w}=2(X^TX+\lambda I)w-2X^TY=0\;\Rightarrow\;\hat w=(X^TX+\lambda I)^{-1}X^TY$$
For $\lambda>0$, $X^TX+\lambda I$ is positive definite and therefore always invertible, which fixes the singularity problem above.
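A sketch of the ridge closed form on hypothetical data. The point of the demo is that the system stays solvable even when $N<p$, where ordinary least squares breaks down:

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Closed-form ridge regression: w_hat = (X^T X + lam*I)^{-1} X^T Y."""
    p = X.shape[1]
    # X^T X + lam*I is positive definite for lam > 0, hence invertible.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Fewer samples than features: X^T X is singular, but ridge still works.
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 10))  # N=5 samples, p=10 features
Y = rng.normal(size=5)
print(ridge_fit(X, Y, lam=0.1))
```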
From the Bayesian viewpoint, place a Gaussian prior on the weights:
$$w\sim N(0,\sigma_0^2),\qquad y\mid x;w\sim N(w^Tx,\sigma^2)$$
By Bayes' rule,
$$P(w\mid y)=\frac{P(y\mid w)P(w)}{P(y)}$$
MAP:
$$\arg\max_w P(w\mid y)=\arg\max_w\log[P(y\mid w)P(w)]\\
=\arg\max_w\left\{-\frac{(y-w^Tx)^2}{2\sigma^2}-\frac{\|w\|^2}{2\sigma_0^2}\right\}\\
=\arg\min_w\left[(y-w^Tx)^2+\frac{\sigma^2}{\sigma_0^2}\|w\|^2\right]$$
(Written for a single sample; over the whole dataset the first term becomes $\sum_{i=1}^{N}(y_i-w^Tx_i)^2$.) Comparing with $J(w)$ above identifies $\lambda=\sigma^2/\sigma_0^2$.
Therefore:
$$\text{Regularized LSE}\iff\text{MAP}$$
The equivalence holds when, in the maximum a posteriori estimate, both the noise and the prior are Gaussian.
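The correspondence $\lambda=\sigma^2/\sigma_0^2$ can be verified numerically. In this hypothetical sketch (again assuming scipy), the negative log-posterior is minimized directly and compared with the ridge closed form:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical setup: known noise std sigma and prior std sigma_0.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
Y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + rng.normal(scale=0.5, size=50)
sigma, sigma0 = 0.5, 1.0
lam = sigma**2 / sigma0**2  # ridge penalty induced by the Gaussian prior

def neg_log_posterior(w):
    # -log P(w|Y) up to constants: ||Y - Xw||^2/(2 sigma^2) + ||w||^2/(2 sigma_0^2)
    r = Y - X @ w
    return (r @ r) / (2 * sigma**2) + (w @ w) / (2 * sigma0**2)

w_map = minimize(neg_log_posterior, x0=np.zeros(4)).x
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ Y)
print(np.allclose(w_map, w_ridge, atol=1e-4))  # True: MAP == ridge
```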
Reference: 【机器学习】【白板推导系列】【合集 1~23】 (the "whiteboard derivations" machine learning video series, episodes 1–23).