$$
\begin{gathered}
D=\left\{(x_{1},y_{1}),(x_{2},y_{2}),\cdots ,(x_{N},y_{N})\right\}\\
x_{i}\in \mathbb{R}^{p},\ y_{i}\in \mathbb{R},\ i=1,2,\cdots ,N\\
X=\begin{pmatrix} x_{1} & x_{2} & \cdots & x_{N} \end{pmatrix}^{T}=\begin{pmatrix} x_{1}^{T} \\ x_{2}^{T} \\ \vdots \\ x_{N}^{T} \end{pmatrix}=\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}_{N \times p}\\
Y=\begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{pmatrix}_{N \times 1}
\end{gathered}
$$
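To make the shapes concrete, here is a minimal NumPy sketch that builds a dataset matching these definitions ($N$, $p$, the weights, and the noise level are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3                                   # illustrative sizes, not from the notes
X = rng.normal(size=(N, p))                     # row i is x_i^T, so X is N x p
w_true = np.array([1.5, -2.0, 0.5])             # hypothetical true weights
Y = X @ w_true + rng.normal(scale=0.1, size=N)  # y_i = w^T x_i + noise; Y has length N
```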
For the least-squares estimate, we then have
$$
\begin{aligned}
L(\omega)&=\sum\limits_{i=1}^{N}\left\|\omega^{T}x_{i}-y_{i}\right\|^{2}\\
&=\sum\limits_{i=1}^{N}(\omega^{T}x_{i}-y_{i})^{2}\\
&=\begin{pmatrix} \omega^{T}x_{1}-y_{1} & \omega^{T}x_{2}-y_{2} & \cdots & \omega^{T}x_{N}-y_{N} \end{pmatrix}\begin{pmatrix} \omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N} \end{pmatrix}\\
&=\left[\begin{pmatrix} \omega^{T}x_{1} & \omega^{T}x_{2} & \cdots & \omega^{T}x_{N} \end{pmatrix}-\begin{pmatrix} y_{1} & y_{2} & \cdots & y_{N} \end{pmatrix}\right]\begin{pmatrix} \omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N} \end{pmatrix}\\
&=\left[\omega^{T}\begin{pmatrix} x_{1} & x_{2} & \cdots & x_{N} \end{pmatrix}-\begin{pmatrix} y_{1} & y_{2} & \cdots & y_{N} \end{pmatrix}\right]\begin{pmatrix} \omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N} \end{pmatrix}\\
&=(\omega^{T}X^{T}-Y^{T})\begin{pmatrix} \omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N} \end{pmatrix}\\
&=(\omega^{T}X^{T}-Y^{T})(X \omega-Y)\\
&=\omega^{T}X^{T}X \omega-2 \omega^{T}X^{T}Y+Y^{T}Y
\end{aligned}
$$
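The final quadratic form can be sanity-checked numerically; a minimal sketch (random data, purely illustrative) comparing the direct sum of squares with the expanded expression:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 4                                    # arbitrary illustrative sizes
X = rng.normal(size=(N, p))
Y = rng.normal(size=N)
w = rng.normal(size=p)

loss_direct = np.sum((X @ w - Y) ** 2)          # sum_i (w^T x_i - y_i)^2
# Expanded form: w^T X^T X w - 2 w^T X^T Y + Y^T Y
loss_expanded = w @ X.T @ X @ w - 2 * w @ X.T @ Y + Y @ Y
assert np.isclose(loss_direct, loss_expanded)
```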
For $\hat{\omega}$ (assuming $X^{T}X$ is invertible), we have
$$
\begin{aligned}
\hat{\omega}&=\mathop{argmin}\limits_{\omega}\ L(\omega)\\
\frac{\partial L(\omega)}{\partial \omega}&=2X^{T}X \omega-2X^{T}Y\\
2X^{T}X \omega-2X^{T}Y&=0\\
\hat{\omega}&=(X^{T}X)^{-1}X^{T}Y
\end{aligned}
$$
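A quick check of the closed-form solution on synthetic data (sizes and weights below are my own illustrative choices). In practice one avoids forming $(X^{T}X)^{-1}$ explicitly and uses a least-squares solver instead, which is numerically more stable:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 3
X = rng.normal(size=(N, p))
Y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(scale=0.1, size=N)

# Normal-equation solution: w = (X^T X)^{-1} X^T Y
w_normal = np.linalg.inv(X.T @ X) @ X.T @ Y
# Reference: NumPy's least-squares solver (numerically preferable)
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(w_normal, w_lstsq)
```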
Supplement: matrix derivative rules
$$
\begin{aligned}
x&=\begin{pmatrix}x_{1} & x_{2} & \cdots & x_{n}\end{pmatrix}^{T}\\
f(x)&=Ax,\ \text{then } \frac{\partial f(x)}{\partial x^{T}}=\frac{\partial (Ax)}{\partial x^{T}}=A\\
f(x)&=x^{T}Ax,\ \text{then } \frac{\partial f(x)}{\partial x}=\frac{\partial (x^{T}Ax)}{\partial x}=Ax+A^{T}x\\
f(x)&=a^{T}x,\ \text{then } \frac{\partial a^{T}x}{\partial x}=\frac{\partial x^{T}a}{\partial x}=a\\
f(x)&=x^{T}Ay,\ \text{then } \frac{\partial x^{T}Ay}{\partial x}=Ay,\quad \frac{\partial x^{T}Ay}{\partial A}=xy^{T}
\end{aligned}
$$
Author: zealscott
Link: 矩阵求导法则与性质 (matrix derivative rules and properties)
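These rules are easy to verify numerically. For example, a finite-difference sketch (random $A$ and $x$, purely illustrative) checking $\frac{\partial (x^{T}Ax)}{\partial x}=Ax+A^{T}x$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.normal(size=(n, n))
x = rng.normal(size=n)

f = lambda v: v @ A @ v            # f(x) = x^T A x
analytic = A @ x + A.T @ x         # claimed gradient: A x + A^T x

# Central finite differences along each coordinate as an independent check
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])
assert np.allclose(analytic, numeric, atol=1e-5)
```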
Geometrically, least squares corresponds to summing the squared distances between the model (here, a straight line) and the observed values. Suppose our samples span a $p$-dimensional space (the full-rank case): $X=\mathrm{Span}(x_1,\cdots,x_N)$, and the model can be written as $f(w)=x_{i}^{T}\beta$, i.e., some combination of $x_1,\cdots,x_N$. Least squares asks that $Y$ be as close as possible to this model, so their difference should be perpendicular to the spanned space:
$$
X\bot(Y-X\beta)\longrightarrow X^{T}\cdot(Y-X\beta)=0_{p\times 1}\longrightarrow \beta=(X^{T}X)^{-1}X^{T}Y
$$

Author: tsyw
Link: 线性回归 · 语雀 (yuque.com)

A few points from my own understanding:
- Since $X=\begin{pmatrix}x_{1}^{T} \\ x_{2}^{T} \\ \vdots \\ x_{N}^{T}\end{pmatrix}$, stacking the scalars $x_{i}^{T}\beta$ over $i=1,\cdots,N$ gives exactly $X\beta$.
- In general, $Y$ does not lie in this $p$-dimensional space.
- $X\beta=\begin{pmatrix}x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np}\end{pmatrix}\begin{pmatrix}\beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{p}\end{pmatrix}=\beta_{1}\begin{pmatrix}x_{11} \\ x_{21} \\ \vdots \\ x_{N1}\end{pmatrix}+\beta_{2}\begin{pmatrix}x_{12} \\ x_{22} \\ \vdots \\ x_{N2}\end{pmatrix}+\cdots +\beta_{p}\begin{pmatrix}x_{1p} \\ x_{2p} \\ \vdots \\ x_{Np}\end{pmatrix}$
This can be viewed as $\beta$, under the action of the matrix $X$, having its original basis $\begin{pmatrix}1 \\ 0 \\ \vdots \\ 0\end{pmatrix},\begin{pmatrix}0 \\ 1 \\ \vdots \\ 0\end{pmatrix},\cdots ,\begin{pmatrix}0 \\ 0 \\ \vdots \\ 1\end{pmatrix}$ mapped to the new basis $\begin{pmatrix}x_{11} \\ x_{21} \\ \vdots \\ x_{N1}\end{pmatrix},\begin{pmatrix}x_{12} \\ x_{22} \\ \vdots \\ x_{N2}\end{pmatrix},\cdots ,\begin{pmatrix}x_{1p} \\ x_{2p} \\ \vdots \\ x_{Np}\end{pmatrix}$, i.e., the columns of $X$. The new vector $X\beta$ therefore always lies in the $p$-dimensional space spanned by these columns, while $Y$ generally does not. So to minimize the distance between $Y$ and $X\beta$, we should adjust $\beta$ so that the vector $Y-X\beta$ is exactly perpendicular to this $p$-dimensional space; that is where the distance is smallest. Hence $X^{T}(Y-X\beta)=\boldsymbol{0}$, as the sketch below confirms numerically.
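A minimal numerical confirmation of this orthogonality (random data, purely illustrative): the least-squares residual should satisfy $X^{T}(Y-X\hat{\beta})=0$ up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 100, 3
X = rng.normal(size=(N, p))
Y = rng.normal(size=N)                        # a generic Y, not in the column space of X

beta = np.linalg.lstsq(X, Y, rcond=None)[0]   # least-squares fit
residual = Y - X @ beta
# The residual is orthogonal to every column of X: X^T (Y - X beta) ~ 0
assert np.allclose(X.T @ residual, 0.0, atol=1e-10)
```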
For the one-dimensional case, write $y=\omega^{T}x+\epsilon$, $\epsilon \sim N(0,\sigma^{2})$; then
$$
y|x;\omega \sim N(\omega^{T}x,\ \sigma^{2})
$$
Note that here $x$ is the known data and $\omega$ is a parameter, so $y$ is just $\epsilon$ shifted by the constant $\omega^{T}x$: it inherits $\epsilon$'s Gaussian distribution, with the mean moved to $\omega^{T}x$.
Hence
$$
P(y|x;\omega)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(y-\omega^{T}x)^{2}}{2\sigma^{2}}\right]
$$
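As a small sanity check, this density is exactly a Gaussian pdf with mean $\omega^{T}x$; a sketch comparing the hand-written formula against `scipy.stats.norm.pdf` (all numbers are arbitrary illustrations):

```python
import numpy as np
from scipy.stats import norm

w = np.array([1.0, -2.0])          # hypothetical parameters
x = np.array([0.3, 0.7])           # a single input point
y, sigma = 1.2, 0.5

mean = w @ x                       # w^T x
manual = np.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
assert np.isclose(manual, norm.pdf(y, loc=mean, scale=sigma))
```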
The maximum likelihood estimate is then
$$
\begin{aligned}
L(\omega)&=\log P(Y|X;\omega)\\
&=\log \prod\limits_{i=1}^{N}P(y_{i}|x_{i};\omega)\\
&=\sum\limits_{i=1}^{N}\log P(y_{i}|x_{i};\omega)\\
&=\sum\limits_{i=1}^{N}\left\{\log \frac{1}{\sqrt{2\pi}\,\sigma}+\log \exp\left[-\frac{(y_{i}-\omega^{T}x_{i})^{2}}{2\sigma^{2}}\right]\right\}\\
\hat{\omega}&=\mathop{argmax}\limits_{\omega}\ L(\omega)\\
&=\mathop{argmax}\limits_{\omega}\ \sum\limits_{i=1}^{N}\left[-\frac{1}{2\sigma^{2}}(y_{i}-\omega^{T}x_{i})^{2}\right]\\
&=\mathop{argmin}\limits_{\omega}\ \sum\limits_{i=1}^{N}(y_{i}-\omega^{T}x_{i})^{2}
\end{aligned}
$$
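This equivalence can also be checked numerically. A sketch (illustrative synthetic data; $\sigma$ treated as known and fixed, as in the derivation above) that maximizes the Gaussian log-likelihood via `scipy.optimize.minimize` and compares against the least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
N, p, sigma = 200, 3, 0.3
X = rng.normal(size=(N, p))
Y = X @ np.array([0.5, 1.0, -1.5]) + rng.normal(scale=sigma, size=N)

def neg_log_lik(w):
    # Negative Gaussian log-likelihood in w (sigma fixed, constants dropped)
    return np.sum((Y - X @ w) ** 2) / (2 * sigma ** 2)

w_mle = minimize(neg_log_lik, x0=np.zeros(p)).x
w_ls = np.linalg.lstsq(X, Y, rcond=None)[0]
assert np.allclose(w_mle, w_ls, atol=1e-4)
```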
So far, then, for the problem of determining $\omega$, maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function defined by
$$
E(\omega)=\frac{1}{2}\sum\limits_{n=1}^{N}\left[y(x_{n},\omega)-t_{n}\right]^{2}
$$
Therefore, under the assumption of Gaussian noise, the sum-of-squares error function is a natural consequence of maximizing the likelihood function. Source: the Chinese translation of PRML, p. 27.
Translator: 马春鹏
Original work: Pattern Recognition and Machine Learning
Author: Christopher M. Bishop
PRML also derives the maximum likelihood estimate of the precision parameter $\beta$, which corresponds to $1/\sigma^{2}$ here. The $y$ here is the $t$ of PRML (unless otherwise noted, PRML's notation is used below).
$$
\begin{aligned}
\ln p(T|X,\omega,\beta)&=-\frac{\beta}{2}\sum\limits_{n=1}^{N}\left[y(x_{n},\omega)-t_{n}\right]^{2}+\frac{N}{2}\ln \beta-\frac{N}{2}\ln(2\pi)\\
\hat{\beta}&=\mathop{argmax}\limits_{\beta}\ L(\beta),\quad L(\beta)=-\beta\sum\limits_{n=1}^{N}\left[y(x_{n},\omega)-t_{n}\right]^{2}+N\ln \beta\\
\frac{\partial L(\beta)}{\partial \beta}&=-\sum\limits_{n=1}^{N}\left[y(x_{n},\omega_\text{MLE})-t_{n}\right]^{2}+\frac{N}{\beta_\text{MLE}}=0\\
\frac{1}{\beta_\text{MLE}}&=\frac{1}{N}\sum\limits_{n=1}^{N}\left[y(x_{n},\omega_\text{MLE})-t_{n}\right]^{2}
\end{aligned}
$$
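A quick numerical check of this estimator (everything below is an illustrative sketch on synthetic data; `sigma_true` is my own name): $1/\beta_{\text{MLE}}$ is just the mean squared residual at $\omega_{\text{MLE}}$, and on data with a known noise scale it should land near the true $\sigma^{2}$.

```python
import numpy as np

rng = np.random.default_rng(6)
N, p, sigma_true = 500, 3, 0.2                 # illustrative; true noise std is 0.2
X = rng.normal(size=(N, p))
T = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=sigma_true, size=N)

w_mle = np.linalg.lstsq(X, T, rcond=None)[0]   # w_MLE = least-squares solution
inv_beta_mle = np.mean((X @ w_mle - T) ** 2)   # 1 / beta_MLE = mean squared residual
print(inv_beta_mle, sigma_true ** 2)           # the two should be close (~0.04)
```

Note that this is the biased maximum likelihood variance estimate: it divides by $N$ rather than $N-p$, so it slightly underestimates $\sigma^{2}$ in small samples.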