Machine Learning: Homework 2

Problem 1: Maximum Likelihood Estimation for the Linear Regression Model

1. $E\{\textbf w_{MLE}\}$

$$E\{\widehat{\textbf w}\}=\int \widehat{\textbf w}\,p(\textbf y|\textbf X,\textbf w,\delta^2)\,d\textbf y$$

where $\widehat{\textbf w}=(\textbf X^T\textbf X)^{-1}\textbf X^T\textbf y$. Substituting,
$$E\{\widehat{\textbf w}\}=(\textbf X^T\textbf X)^{-1}\textbf X^T\int \textbf y\,p(\textbf y|\textbf X,\textbf w,\delta^2)\,d\textbf y\\ =(\textbf X^T\textbf X)^{-1}\textbf X^TE\{\textbf y\}\\ =(\textbf X^T\textbf X)^{-1}\textbf X^T\textbf X\textbf w\\ =\textbf w$$
so the MLE is unbiased.
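
As a quick sanity check, here is a minimal Monte Carlo sketch of this result (the sizes `n` and `d`, the noise level `delta`, and `w_true` are all hypothetical choices, assuming the model $\textbf y=\textbf X\textbf w+\epsilon$ with $\epsilon\sim N(\textbf 0,\delta^2\textbf I)$): averaging $\widehat{\textbf w}$ over many simulated datasets approaches the true $\textbf w$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, delta = 200, 3, 0.5             # assumed problem sizes and noise level
X = rng.normal(size=(n, d))           # a fixed, hypothetical design matrix
w_true = np.array([1.0, -2.0, 0.5])   # assumed true parameter vector

# Average the MLE w_hat = (X^T X)^{-1} X^T y over many simulated datasets.
w_hats = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ w_true + delta * rng.normal(size=n)))
    for _ in range(5000)
])

print(w_hats.mean(axis=0))  # approaches w_true: E{w_hat} = w
```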

2. $\mathrm{cov}\{\textbf w_{MLE}\}$

$$\mathrm{cov}\{\widehat{\textbf w}\}=E\{\widehat{\textbf w}\widehat{\textbf w}^T\}-E\{\widehat{\textbf w}\}E\{\widehat{\textbf w}\}^T\\ =E\{\widehat{\textbf w}\widehat{\textbf w}^T\}-\textbf w\textbf w^T$$

where
$$E\{\widehat{\textbf w}\widehat{\textbf w}^T\}=E\{((\textbf X^T\textbf X)^{-1}\textbf X^T\textbf y)((\textbf X^T\textbf X)^{-1}\textbf X^T\textbf y)^T\}\\ =(\textbf X^T\textbf X)^{-1}\textbf X^TE\{\textbf y\textbf y^T\}\textbf X(\textbf X^T\textbf X)^{-1}$$

and, since $\mathrm{cov}\{\textbf y\}=\delta^2\textbf I=E\{\textbf y\textbf y^T\}-E\{\textbf y\}E\{\textbf y\}^T$,
$$E\{\textbf y\textbf y^T\}=E\{\textbf y\}E\{\textbf y\}^T+\delta^2\textbf I\\ =\textbf X\textbf w(\textbf X\textbf w)^T+\delta^2\textbf I\\ =\textbf X\textbf w\textbf w^T\textbf X^T+\delta^2\textbf I$$

Therefore
$$E\{\widehat{\textbf w}\widehat{\textbf w}^T\}=(\textbf X^T\textbf X)^{-1}\textbf X^T\textbf X\textbf w\textbf w^T\textbf X^T\textbf X(\textbf X^T\textbf X)^{-1}+\delta^2(\textbf X^T\textbf X)^{-1}\textbf X^T\textbf X(\textbf X^T\textbf X)^{-1}\\ =\textbf w\textbf w^T+\delta^2(\textbf X^T\textbf X)^{-1}$$

$$\mathrm{cov}\{\widehat{\textbf w}\}=\textbf w\textbf w^T+\delta^2(\textbf X^T\textbf X)^{-1}-\textbf w\textbf w^T\\ =\delta^2(\textbf X^T\textbf X)^{-1}$$
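
The covariance result can be verified the same way, under the same hypothetical setup as before: the empirical covariance of the simulated $\widehat{\textbf w}$ should approach $\delta^2(\textbf X^T\textbf X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, delta = 200, 3, 0.5             # same assumed setup as above
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])

w_hats = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ w_true + delta * rng.normal(size=n)))
    for _ in range(20000)
])

emp_cov = np.cov(w_hats.T)                   # empirical covariance of w_hat
theory = delta**2 * np.linalg.inv(X.T @ X)   # delta^2 (X^T X)^{-1}
print(np.abs(emp_cov - theory).max())        # small, shrinking with more draws
```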

Problem 2: Ridge Regression and Regularization

1. Find the mean and covariance of $p(\textbf w|D)$

$$p(\textbf w|D)\propto p(D|\textbf w)p(\textbf w)\\ =\left(\frac{1}{\sqrt{2\pi}\delta}\right)^ne^{-\frac{(\textbf y-\textbf X\textbf w)^T(\textbf y-\textbf X\textbf w)}{2\delta^2}}\times \frac{1}{\sqrt{(2\pi)^d|\Sigma_0|}}e^{-\frac{1}{2}(\textbf w-\mu_0)^T\Sigma_0^{-1}(\textbf w-\mu_0)}\\ \propto e^{-\frac{1}{2}\left(\frac{(\textbf y-\textbf X\textbf w)^T(\textbf y-\textbf X\textbf w)}{\delta^2}+(\textbf w-\mu_0)^T\Sigma_0^{-1}(\textbf w-\mu_0)\right)}\\ \propto e^{-\frac{1}{2}\left(\frac{-2\textbf y^T\textbf X\textbf w+\textbf w^T\textbf X^T\textbf X\textbf w}{\delta^2}+\textbf w^T\Sigma_0^{-1}\textbf w-2\mu_0^T\Sigma_0^{-1}\textbf w\right)}$$

The posterior must itself be Gaussian,
$$p(\textbf w|D)=N(\mu_w,\Sigma_w)\\ \propto e^{-\frac{1}{2}(\textbf w-\mu_w)^T\Sigma_w^{-1}(\textbf w-\mu_w)}\\ \propto e^{-\frac{1}{2}\left(\textbf w^T\Sigma_w^{-1}\textbf w-2\mu_w^T\Sigma_w^{-1}\textbf w\right)}$$

Matching the quadratic terms in $\textbf w$ gives the covariance:
$$\textbf w^T\Sigma_w^{-1}\textbf w=\frac{\textbf w^T\textbf X^T\textbf X\textbf w}{\delta^2}+\textbf w^T\Sigma_0^{-1}\textbf w=\textbf w^T\left(\frac{\textbf X^T\textbf X}{\delta^2}+\Sigma_0^{-1}\right)\textbf w\\ \Sigma_w^{-1}=\frac{\textbf X^T\textbf X}{\delta^2}+\Sigma_0^{-1}\\ \Sigma_w=\left(\frac{1}{\delta^2}\textbf X^T\textbf X+\Sigma_0^{-1}\right)^{-1}$$

Similarly, matching the linear terms gives the mean $\mu_w$, so both the covariance and the mean of $\textbf w$ are determined:
$$-2\mu_w^T\Sigma_w^{-1}\textbf w=\frac{-2\textbf y^T\textbf X\textbf w}{\delta^2}-2\mu_0^T\Sigma_0^{-1}\textbf w\\ \mu_w^T\Sigma_w^{-1}=\frac{\textbf y^T\textbf X}{\delta^2}+\mu_0^T\Sigma_0^{-1}\\ \mu_w=\Sigma_w\left(\frac{1}{\delta^2}\textbf X^T\textbf y+\Sigma_0^{-1}\mu_0\right)\\ =\left(\frac{1}{\delta^2}\textbf X^T\textbf X+\Sigma_0^{-1}\right)^{-1}\left(\frac{1}{\delta^2}\textbf X^T\textbf y+\Sigma_0^{-1}\mu_0\right)$$
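
The completing-the-square step can also be checked numerically. In the sketch below (all data, dimensions, and the SPD prior covariance are arbitrary assumed values), if $\mu_w$ and $\Sigma_w$ are correct, the log of the unnormalized posterior and the log of the unnormalized $N(\mu_w,\Sigma_w)$ density differ only by a constant that does not depend on $\textbf w$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, delta = 50, 3, 0.3            # assumed sizes and noise level
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
mu0 = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma0 = A @ A.T + np.eye(d)        # an arbitrary SPD prior covariance

Sigma0_inv = np.linalg.inv(Sigma0)
Sigma_w = np.linalg.inv(X.T @ X / delta**2 + Sigma0_inv)
mu_w = Sigma_w @ (X.T @ y / delta**2 + Sigma0_inv @ mu0)

def log_post_unnorm(w):
    # -(1/2)[ ||y - Xw||^2/delta^2 + (w - mu0)^T Sigma0^{-1} (w - mu0) ]
    r = y - X @ w
    return -0.5 * (r @ r / delta**2 + (w - mu0) @ Sigma0_inv @ (w - mu0))

def log_gauss_unnorm(w):
    # -(1/2)(w - mu_w)^T Sigma_w^{-1} (w - mu_w)
    dw = w - mu_w
    return -0.5 * dw @ np.linalg.solve(Sigma_w, dw)

# If mu_w and Sigma_w are right, the difference is the same for every w.
diffs = [log_post_unnorm(w) - log_gauss_unnorm(w)
         for w in rng.normal(size=(10, d))]
print(np.ptp(diffs))  # ~0: all differences are equal
```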

2. Why is $w_{MAP}=(\textbf X^T\textbf X+\lambda\textbf I)^{-1}\textbf X^T\textbf y$?

Because the posterior is Gaussian, the most probable $\textbf w$ is at its mean. If $\mu_0=[0,0,...,0]^T$, this becomes $w_{MAP}=\mu_w=\left(\frac{1}{\delta^2}\textbf X^T\textbf X+\Sigma_0^{-1}\right)^{-1}\frac{1}{\delta^2}\textbf X^T\textbf y$.

If additionally $\Sigma_0=\sigma_0^2\textbf I$, factoring $\frac{1}{\delta^2}$ out of the inverse gives $w_{MAP}=(\textbf X^T\textbf X+\lambda\textbf I)^{-1}\textbf X^T\textbf y$ with $\lambda=\delta^2/\sigma_0^2$, which is exactly the ridge regression solution.
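
A minimal sketch of this equivalence, assuming $\mu_0=\textbf 0$ and $\Sigma_0=\sigma_0^2\textbf I$ (the values of `delta`, `s0`, and the data below are made up): the posterior mean coincides with the ridge solution for $\lambda=\delta^2/\sigma_0^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
delta, s0 = 0.3, 2.0                  # assumed noise std and prior std
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Posterior mean with mu0 = 0 and Sigma0 = s0^2 I ...
Sigma_w = np.linalg.inv(X.T @ X / delta**2 + np.eye(d) / s0**2)
w_map = Sigma_w @ (X.T @ y / delta**2)

# ... equals the ridge solution with lambda = delta^2 / s0^2.
lam = delta**2 / s0**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(w_map, w_ridge))  # True
```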

3. Why is $\textbf X^T\textbf X+\lambda\textbf I$ invertible?

Since $(\textbf X^T\textbf X)^T=\textbf X^T\textbf X$, the matrix $\textbf X^T\textbf X$ is symmetric, and $\textbf v^T\textbf X^T\textbf X\textbf v=\|\textbf X\textbf v\|^2\ge 0$ for every $\textbf v$, so all of its eigenvalues are non-negative. Adding $\lambda\textbf I$ with $\lambda>0$ shifts every eigenvalue up by $\lambda$, so every eigenvalue of $\textbf X^T\textbf X+\lambda\textbf I$ is strictly positive. It is therefore positive definite, and positive definite matrices are invertible.
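
A small numerical illustration (the $5\times 10$ design below is a hypothetical choice with $n<d$, so $\textbf X^T\textbf X$ by itself is rank-deficient and singular): adding $\lambda\textbf I$ lifts the smallest eigenvalue from about $0$ to at least $\lambda$, restoring invertibility.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 10))   # n < d on purpose: X^T X alone is singular
lam = 0.1

G = X.T @ X
print(np.linalg.eigvalsh(G).min())                     # ~0: G is not invertible
print(np.linalg.eigvalsh(G + lam * np.eye(10)).min())  # >= lam > 0
print(np.linalg.matrix_rank(G + lam * np.eye(10)))     # 10: full rank, invertible
```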
