[Whiteboard Derivation Series Notes] Linear Regression: Least Squares and Its Geometric Interpretation & Least Squares from a Probabilistic View (Gaussian Noise, MLE)

$$
\begin{gathered}
D=\left\{(x_{1},y_{1}),(x_{2},y_{2}),\cdots ,(x_{N},y_{N})\right\}\\
x_{i}\in \mathbb{R}^{p},\quad y_{i}\in \mathbb{R},\quad i=1,2,\cdots ,N\\
X=\begin{pmatrix} x_{1} & x_{2} & \cdots & x_{N} \end{pmatrix}^{T}=\begin{pmatrix} x_{1}^{T} \\ x_{2}^{T} \\ \vdots \\ x_{N}^{T} \end{pmatrix}=\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}_{N \times p}\\
Y=\begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{pmatrix}_{N \times 1}
\end{gathered}
$$

With this notation, the least-squares estimate minimizes the loss
$$
\begin{aligned}
L(\omega)&=\sum\limits_{i=1}^{N}\left\|\omega^{T}x_{i}-y_{i}\right\|^{2}\\
&=\sum\limits_{i=1}^{N}(\omega^{T}x_{i}-y_{i})^{2}\\
&=\begin{pmatrix} \omega^{T}x_{1}-y_{1} & \omega^{T}x_{2}-y_{2} & \cdots & \omega^{T}x_{N}-y_{N} \end{pmatrix}\begin{pmatrix} \omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N} \end{pmatrix}\\
&=\left[\begin{pmatrix} \omega^{T}x_{1} & \omega^{T}x_{2} & \cdots & \omega^{T}x_{N} \end{pmatrix}-\begin{pmatrix} y_{1} & y_{2} & \cdots & y_{N} \end{pmatrix}\right]\begin{pmatrix} \omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N} \end{pmatrix}\\
&=\left[\omega^{T}\begin{pmatrix} x_{1} & x_{2} & \cdots & x_{N} \end{pmatrix}-\begin{pmatrix} y_{1} & y_{2} & \cdots & y_{N} \end{pmatrix}\right]\begin{pmatrix} \omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N} \end{pmatrix}\\
&=(\omega^{T}X^{T}-Y^{T})\begin{pmatrix} \omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N} \end{pmatrix}\\
&=(\omega^{T}X^{T}-Y^{T})(X \omega-Y)\\
&=\omega^{T}X^{T}X \omega-2 \omega^{T}X^{T}Y+Y^{T}Y
\end{aligned}
$$
For $\hat{\omega}$, we have
$$
\begin{aligned}
\hat{\omega}&=\text{argmin }L(\omega)\\
\frac{\partial L(\omega)}{\partial \omega}&=2X^{T}X \omega-2X^{T}Y\\
2X^{T}X \omega-2X^{T}Y&=0\\
\omega&=(X^{T}X)^{-1}X^{T}Y
\end{aligned}
$$
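As a quick sanity check (my addition, not part of the original derivation), here is a minimal NumPy sketch that solves the normal equation $\hat{\omega}=(X^{T}X)^{-1}X^{T}Y$ on synthetic data and compares it with NumPy's built-in least-squares solver. All data and numbers are made up for illustration.

```python
import numpy as np

# Synthetic data: N samples, p features (all values are arbitrary for illustration).
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))           # design matrix, N x p
w_true = np.array([1.5, -2.0, 0.7])   # hypothetical ground-truth weights
Y = X @ w_true + 0.1 * rng.normal(size=N)

# Closed-form solution from the normal equation.
w_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# Library solver for comparison (numerically preferable in practice).
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(w_hat)
print(np.allclose(w_hat, w_lstsq))    # True: both give the same minimizer
```

In practice one would solve the linear system (e.g. `np.linalg.solve` or `lstsq`) rather than forming the explicit inverse, but the explicit form mirrors the formula above.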

Supplement: matrix derivative rules
$$
\begin{aligned}
x&=\begin{pmatrix}x_{1} & x_{2} & \cdots & x_{n}\end{pmatrix}^{T}\\
f(x)&=Ax \;\Rightarrow\; \frac{\partial f (x)}{\partial x^T} = \frac{\partial (Ax)}{\partial x^T} =A\\
f(x)&=x^TAx \;\Rightarrow\; \frac{\partial f (x)}{\partial x} = \frac{\partial (x^TAx)}{\partial x} =Ax+A^Tx\\
f(x)&=a^{T}x \;\Rightarrow\; \frac{\partial (a^Tx)}{\partial x} = \frac{\partial (x^Ta)}{\partial x} =a\\
f(x)&=x^{T}Ay \;\Rightarrow\; \frac{\partial (x^TAy)}{\partial x} = Ay,\quad \frac{\partial (x^TAy)}{\partial A} = xy^T
\end{aligned}
$$
Author: zealscott
Link: 矩阵求导法则与性质
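These rules give the gradient used above, $\frac{\partial L(\omega)}{\partial \omega}=2X^{T}X\omega-2X^{T}Y$. The short finite-difference check below is my own addition (synthetic data, arbitrary values) to confirm the formula numerically.

```python
import numpy as np

# Verify the analytic gradient of L(w) = ||Xw - Y||^2 against central differences.
rng = np.random.default_rng(1)
N, p = 50, 4
X = rng.normal(size=(N, p))
Y = rng.normal(size=N)
w = rng.normal(size=p)

def L(w):
    r = X @ w - Y
    return r @ r

grad_analytic = 2 * X.T @ X @ w - 2 * X.T @ Y

eps = 1e-6
grad_numeric = np.array([
    (L(w + eps * e) - L(w - eps * e)) / (2 * eps)
    for e in np.eye(p)          # perturb one coordinate at a time
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```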

Geometrically, least squares sums the squared distances between the model (here, a line) and the observed values. Suppose the samples span a $p$-dimensional space (in the full-rank case): $X=\text{Span}(x_1,\cdots,x_N)$, and the model can be written as $f(w)=x_{i}^{T}\beta$, i.e. some combination of $x_1,\cdots,x_N$. Least squares asks that $Y$ be as close to this model as possible, so their difference should be perpendicular to the spanned space:
$$
X\bot(Y-X\beta)\longrightarrow X^T\cdot(Y-X\beta)=0_{p\times1}\longrightarrow\beta=(X^TX)^{-1}X^TY
$$

Author: tsyw
Link: 线性回归 · 语雀 (yuque.com)

A few points from my own understanding:

  1. Since $X=\begin{pmatrix}x_{1}^{T} \\ x_{2}^{T} \\ \vdots \\ x_{N}^{T}\end{pmatrix}$, stacking $x_{i}^{T}\beta$ over all $i$ gives exactly $X\beta$.
  2. In general, $Y$ does not lie in this $p$-dimensional space.
  3. $$\begin{aligned} X \beta&=\begin{pmatrix}x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np}\end{pmatrix}\begin{pmatrix}\beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{p}\end{pmatrix}\\&=\beta_{1}\begin{pmatrix}x_{11} \\ x_{21} \\ \vdots \\ x_{N1}\end{pmatrix}+\beta_{2}\begin{pmatrix}x_{12} \\ x_{22} \\ \vdots \\ x_{N2}\end{pmatrix}+\cdots +\beta_{p}\begin{pmatrix}x_{1p} \\ x_{2p} \\ \vdots \\ x_{Np}\end{pmatrix}\end{aligned}$$
    This can be read as $\beta$ being mapped by $X$ from the standard basis $\begin{pmatrix}1 \\ 0 \\ \vdots \\ 0\end{pmatrix},\begin{pmatrix}0 \\ 1 \\ \vdots \\ 0\end{pmatrix},\cdots ,\begin{pmatrix}0 \\ 0 \\ \vdots \\ 1\end{pmatrix}$ to the new basis $\begin{pmatrix}x_{11} \\ x_{21} \\ \vdots \\ x_{N1}\end{pmatrix},\begin{pmatrix}x_{12} \\ x_{22} \\ \vdots \\ x_{N2}\end{pmatrix},\cdots ,\begin{pmatrix}x_{1p} \\ x_{2p} \\ \vdots \\ x_{Np}\end{pmatrix}$, so the vector $X\beta$ always lies in the $p$-dimensional space, while $Y$ in general does not. To make $Y$ and $X\beta$ as close as possible we adjust $\beta$ so that the vector $Y-X\beta$ is exactly perpendicular to that $p$-dimensional space; the distance is then minimal. Hence $X^{T}(Y-X\beta)=\boldsymbol{0}$, i.e. the residual is orthogonal to every column of $X$ (see the numerical sketch below).
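The sketch below is my own illustration of this picture on synthetic data: $X\beta$ is literally a combination of the columns of $X$, and at the least-squares solution the residual is orthogonal to all of them.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 20, 3
X = rng.normal(size=(N, p))
Y = rng.normal(size=N)     # a generic Y, generally not in the column space of X

# Least-squares beta from the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
residual = Y - X @ beta_hat

# The residual is orthogonal to every column of X: X^T (Y - X beta_hat) = 0.
print(np.allclose(X.T @ residual, np.zeros(p), atol=1e-10))   # True

# X @ beta is a combination of X's columns weighted by beta.
beta = rng.normal(size=p)
combo = sum(beta[j] * X[:, j] for j in range(p))
print(np.allclose(X @ beta, combo))                           # True
```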

For the one-dimensional case, write $y=\omega^{T}x+\epsilon$ with $\epsilon \sim N(0,\sigma^{2})$. Then
$$
y|x;\omega \sim N(\omega^{T}x,\ \sigma^{2})
$$
Note that here $x$ is the observed data and $\omega$ is a parameter, so conditioned on $x$, $y$ is just $\epsilon$ shifted by the constant $\omega^{T}x$ and therefore has the same Gaussian distribution (with its mean moved to $\omega^{T}x$).

$$
P(y|x;\omega)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[- \frac{(y-\omega^{T}x)^{2}}{2\sigma^{2}}\right]
$$
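As a small check (my addition, with arbitrary numbers), this is just a Gaussian density with mean $\omega^{T}x$ and variance $\sigma^{2}$, so evaluating the formula by hand should agree with SciPy's normal distribution.

```python
import numpy as np
from scipy.stats import norm

w = np.array([0.5, -1.0])
x = np.array([2.0, 3.0])
sigma = 0.8
y = 1.2

mu = w @ x   # mean of y given x
p_manual = 1.0 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-(y - mu) ** 2 / (2 * sigma ** 2))
p_scipy = norm(loc=mu, scale=sigma).pdf(y)

print(np.isclose(p_manual, p_scipy))   # True
```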
The maximum likelihood estimate is then
$$
\begin{aligned}
L(\omega)&=\log P(Y|X;\omega)\\
&=\log \prod\limits_{i=1}^{N}P(y_{i}|x_{i};\omega)\\
&=\sum\limits_{i=1}^{N}\log P(y_{i}|x_{i};\omega)\\
&=\sum\limits_{i=1}^{N}\left\{\log \frac{1}{\sqrt{2\pi}\,\sigma}+\log \exp\left[- \frac{(y_{i}-\omega^{T}x_{i})^{2}}{2\sigma^{2}}\right]\right\}\\
\hat{\omega}&=\mathop{\text{argmax}}\limits_{\omega}L(\omega)\\
&=\mathop{\text{argmax}}\limits_{\omega}\sum\limits_{i=1}^{N}\left[- \frac{1}{2\sigma^{2}}(y_{i}-\omega^{T}x_{i})^{2}\right]\\
&=\mathop{\text{argmin}}\limits_{\omega}\sum\limits_{i=1}^{N}(y_{i}-\omega^{T}x_{i})^{2}
\end{aligned}
$$
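The sketch below (my addition, synthetic data and a fixed known $\sigma$) checks this equivalence numerically: maximizing the Gaussian log-likelihood over $\omega$ gives the same answer as the closed-form least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
N, p = 200, 3
X = rng.normal(size=(N, p))
w_true = np.array([2.0, -1.0, 0.5])
sigma = 0.3
Y = X @ w_true + sigma * rng.normal(size=N)

def neg_log_likelihood(w):
    # Negative Gaussian log-likelihood of Y given X, w (sigma treated as known).
    r = Y - X @ w
    return 0.5 * N * np.log(2 * np.pi * sigma ** 2) + (r @ r) / (2 * sigma ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(p)).x
w_ls = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(w_mle, w_ls, atol=1e-4))   # True: Gaussian MLE == least squares
```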

So far, as far as determining $\omega$ is concerned, maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function defined by
$$
E(\omega)=\frac{1}{2}\sum\limits_{n=1}^{N}\left[y(x_{n},\omega)-t_{n}\right]^{2}
$$
Therefore, under the assumption of Gaussian noise, the sum-of-squares error function arises as a natural consequence of maximizing the likelihood function.

Source: 《PRML Translation》, p. 27, by 马春鹏
Original: *Pattern Recognition and Machine Learning*, by Christopher M. Bishop

PRML also gives the maximum likelihood estimate of the precision parameter $\beta$, which corresponds to $1/\sigma^{2}$ here. The $y$ in these notes is the $t$ in PRML.
(Unless stated otherwise, PRML's notation is used below.)
$$
\begin{aligned}
\ln p(T|X,\omega,\beta)&=- \frac{\beta}{2}\sum\limits_{n=1}^{N}\left[y(x_{n},\omega)-t_{n}\right]^{2}+ \frac{N}{2}\ln \beta- \frac{N}{2}\ln (2 \pi)\\
\hat{\beta}&=\mathop{\text{argmax}}\limits_{\beta}\ L(\beta),\qquad L(\beta)=- \beta\sum\limits_{n=1}^{N}\left[y(x_{n},\omega)-t_{n}\right]^{2}+ N\ln \beta\\
\frac{\partial L(\beta)}{\partial \beta}&=-\sum\limits_{n=1}^{N}\left[y(x_{n},\omega_\text{MLE})-t_{n}\right]^{2}+ \frac{N}{\beta_\text{MLE}}=0\\
\frac{1}{\beta_\text{MLE}}&=\frac{1}{N}\sum\limits_{n=1}^{N}\left[y(x_{n},\omega_\text{MLE})-t_{n}\right]^{2}
\end{aligned}
$$
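A short numerical illustration (mine, on synthetic data): $1/\beta_\text{MLE}$ is just the mean squared residual at $\omega_\text{MLE}$, and with enough samples it should be close to the true noise variance.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 500, 3
X = rng.normal(size=(N, p))
w_true = np.array([1.0, 0.5, -2.0])
sigma_true = 0.4                       # so beta_true = 1 / sigma_true^2 = 6.25
Y = X @ w_true + sigma_true * rng.normal(size=N)

# First estimate w by least squares (the Gaussian MLE for w), then the precision.
w_mle = np.linalg.solve(X.T @ X, X.T @ Y)
inv_beta_mle = np.mean((Y - X @ w_mle) ** 2)   # 1 / beta_MLE, the mean squared residual

print(inv_beta_mle)        # close to sigma_true^2 = 0.16
print(1.0 / inv_beta_mle)  # close to beta_true = 6.25
```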
