[Series 2] Mathematical Foundations - Probability - Gaussian Distribution 1

1. Maximum likelihood estimation of the one-dimensional Gaussian parameters $\mu,\sigma$

Data:
$$X=(x_1,x_2,\dotsc,x_n)^T_{n\times p}=\begin{pmatrix}x_{11}&x_{12}&\dotsc&x_{1p}\\x_{21}&x_{22}&\dotsc&x_{2p}\\&\vdots&&\\x_{n1}&x_{n2}&\dotsc&x_{np}\end{pmatrix}$$
$$x_i\overset{\text{iid}}{\sim}N(\mu,\Sigma)$$
(iid: independent and identically distributed. Unless stated otherwise, all samples below are assumed iid.)
$$\theta=(\mu,\Sigma),\qquad p(x)=\frac{1}{(2\pi)^{\frac{p}{2}}|\Sigma|^{\frac{1}{2}}}\exp\!\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
Setting $p=1$ (where $p$ is the dimension of $x_i$) gives $\theta=(\mu,\sigma^2)$ and
$$p(x)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$
i.e., the multivariate Gaussian reduces to the one-dimensional Gaussian.
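As a quick numeric sanity check, the multivariate density evaluated at $p=1$ should match the univariate formula above. A minimal sketch using SciPy (the parameter values here are arbitrary illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Arbitrary illustrative parameters (not from the text)
mu, sigma = 1.5, 0.8
x = 0.3

# Univariate formula: 1/(sqrt(2*pi)*sigma) * exp(-(x-mu)^2 / (2*sigma^2))
uni = norm.pdf(x, loc=mu, scale=sigma)
# Multivariate formula with p = 1: covariance is the 1x1 matrix [[sigma^2]]
multi = multivariate_normal.pdf([x], mean=[mu], cov=[[sigma**2]])

print(np.isclose(uni, multi))  # True: the p = 1 case reduces as claimed
```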
Maximum likelihood estimation:
$$\begin{aligned}
\theta_{MLE}&=\arg\max_{\theta}P(X|\theta)\\
\log P(X|\theta)&=\log\prod_{i=1}^{n}p(x_i|\theta)=\sum_{i=1}^{n}\log p(x_i|\theta)\\
&=\sum_{i=1}^{n}\log\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\\
&=\sum_{i=1}^{n}\left[\log\frac{1}{\sqrt{2\pi}}+\log\frac{1}{\sigma}-\frac{(x_i-\mu)^2}{2\sigma^2}\right]
\end{aligned}$$
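This decomposition into a constant, a $\log\frac{1}{\sigma}$ term, and a squared-error term can be verified numerically. A minimal sketch (the simulated data and parameter values are assumptions for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5
x = rng.normal(mu, sigma, size=1000)  # simulated iid sample
n = len(x)

# log P(X|theta) = sum_i [log(1/sqrt(2*pi)) + log(1/sigma) - (x_i-mu)^2/(2*sigma^2)]
loglik = n * (-0.5 * np.log(2 * np.pi) - np.log(sigma)) \
         - np.sum((x - mu) ** 2) / (2 * sigma ** 2)

# Cross-check against SciPy's log-density
print(np.isclose(loglik, norm.logpdf(x, mu, sigma).sum()))  # True
```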
Since this is maximum likelihood estimation, constant terms can be dropped. Moreover, the MLE of $\mu$ does not depend on $\sigma$ (terms involving only $\sigma$ act as constants here, and vice versa). Simplifying:
$$\mu_{MLE}=\arg\max_{\mu}P(X|\theta)=\arg\min_{\mu}\sum_{i=1}^{n}(x_i-\mu)^2$$
Setting the partial derivative to zero:
$$\begin{aligned}
\frac{\partial}{\partial\mu}\sum_{i=1}^{n}(x_i-\mu)^2&=\sum_{i=1}^{n}2(x_i-\mu)(-1)=0\\
\sum_{i=1}^{n}(x_i-\mu)&=0\\
\sum_{i=1}^{n}x_i&=n\mu\\
\mu_{MLE}&=\frac{1}{n}\sum_{i=1}^{n}x_i
\end{aligned}$$
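In code, $\mu_{MLE}$ is just the sample mean. A brute-force grid search over the squared-error objective lands on the same value (a sketch on simulated data, with an assumed grid resolution):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=500)  # simulated iid sample

mu_mle = x.mean()  # closed form: (1/n) * sum_i x_i

# Brute force: evaluate sum_i (x_i - mu)^2 on a grid and take the argmin
grid = np.linspace(x.min(), x.max(), 2001)
sse = ((x[:, None] - grid[None, :]) ** 2).sum(axis=0)
mu_grid = grid[sse.argmin()]

print(np.isclose(mu_grid, mu_mle, atol=grid[1] - grid[0]))  # True, up to grid spacing
```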
Similarly (keeping the sum over samples, and dropping terms that do not involve $\sigma$):
$$\sigma^2_{MLE}=\arg\max_{\sigma}P(X|\theta)=\arg\max_{\sigma}\sum_{i=1}^{n}\left(\log\frac{1}{\sigma}-\frac{(x_i-\mu)^2}{2\sigma^2}\right)=\arg\max_{\sigma}\ell(\sigma)$$
Setting the derivative to zero:
$$\begin{aligned}
\frac{\partial\ell}{\partial\sigma}&=\sum_{i=1}^{n}\left(-\frac{1}{\sigma}+(x_i-\mu)^2\sigma^{-3}\right)=0\\
\sum_{i=1}^{n}(x_i-\mu)^2&=\sum_{i=1}^{n}\sigma^2=n\sigma^2\\
\sigma^2_{MLE}&=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu_{MLE})^2
\end{aligned}$$
where in the last line the unknown true mean $\mu$ is replaced by its estimate $\mu_{MLE}$, since both parameters are estimated jointly.
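$\sigma^2_{MLE}$ is exactly NumPy's default variance, which uses the same $\frac{1}{n}$ normalization (`ddof=0`). A minimal check on simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(2.0, 1.5, size=500)  # simulated iid sample

mu_mle = x.mean()
sigma2_mle = np.mean((x - mu_mle) ** 2)  # (1/n) * sum_i (x_i - mu_MLE)^2

# np.var defaults to ddof=0, i.e. the same 1/n normalization as the MLE
print(np.isclose(sigma2_mle, np.var(x)))  # True
```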

2. Unbiased and biased estimators

The $\mu_{MLE}$ derived in the previous section is an unbiased estimator, while $\sigma^2_{MLE}$ is a biased one.
If a population follows a Gaussian distribution, then the parameters $\mu,\sigma$ objectively exist. By observing samples and applying the maximum likelihood method, we estimate the population parameters. If the expectation of an estimator equals the parameter being estimated, the estimator is called unbiased; otherwise it is biased.
$$E[\mu_{MLE}]=E\!\left[\frac{1}{n}\sum_{i=1}^{n}x_i\right]=\frac{1}{n}\sum_{i=1}^{n}E[x_i]=\frac{1}{n}\sum_{i=1}^{n}\mu=\mu$$
so $\mu_{MLE}$ is unbiased. For the variance, first rewrite $\sigma^2_{MLE}$:
$$\begin{aligned}
\sigma^2_{MLE}&=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu_{MLE})^2=\frac{1}{n}\sum_{i=1}^{n}(x_i^2-2x_i\mu_{MLE}+\mu_{MLE}^2)\\
&=\frac{1}{n}\sum_{i=1}^{n}x_i^2-2\mu_{MLE}^2+\mu_{MLE}^2=\frac{1}{n}\sum_{i=1}^{n}x_i^2-\mu_{MLE}^2
\end{aligned}$$
(using $\frac{1}{n}\sum_{i=1}^{n}x_i\mu_{MLE}=\mu_{MLE}\cdot\frac{1}{n}\sum_{i=1}^{n}x_i=\mu_{MLE}^2$). Then
$$\begin{aligned}
E[\sigma^2_{MLE}]&=E\!\left[\left(\frac{1}{n}\sum_{i=1}^{n}x_i^2-\mu^2\right)-(\mu_{MLE}^2-\mu^2)\right]\\
&=E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i^2-\mu^2)\right]-E[\mu_{MLE}^2-\mu^2]\\
&=\frac{1}{n}\sum_{i=1}^{n}\left(E[x_i^2]-E^2[x_i]\right)-\left(E[\mu_{MLE}^2]-E^2[\mu_{MLE}]\right)\\
&=\sigma^2-\mathrm{var}[\mu_{MLE}]\\
&=\sigma^2-\frac{1}{n}\sigma^2=\frac{n-1}{n}\sigma^2
\end{aligned}$$
where the third line uses $E[x_i]=\mu$ and $E[\mu_{MLE}]=\mu$, so each subtracted $\mu^2$ becomes a squared expectation and each bracket becomes a variance, and
$$\mathrm{var}[\mu_{MLE}]=\mathrm{var}\!\left[\frac{1}{n}\sum_{i=1}^{n}x_i\right]=\frac{1}{n^2}\sum_{i=1}^{n}\mathrm{var}[x_i]=\frac{\sigma^2}{n}$$
Since $E[\sigma^2_{MLE}]=\frac{n-1}{n}\sigma^2\neq\sigma^2$, the MLE of the variance is biased (it systematically underestimates $\sigma^2$); rescaling by $\frac{n}{n-1}$ yields the familiar unbiased estimator $\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\mu_{MLE})^2$.
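The bias factor $\frac{n-1}{n}$ can also be seen empirically. The Monte Carlo sketch below (with assumed values of $n$, $\sigma^2$, and trial count) averages $\sigma^2_{MLE}$ over many simulated datasets and compares it with the `ddof=1` corrected estimator:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n, trials = 0.0, 4.0, 10, 200_000  # assumed illustrative values

# Each row is one dataset of size n drawn from N(mu, sigma2)
samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))

sigma2_mle = samples.var(axis=1, ddof=0)  # 1/n normalization: the MLE (biased)
sigma2_unb = samples.var(axis=1, ddof=1)  # 1/(n-1): Bessel-corrected (unbiased)

print(sigma2_mle.mean())       # ~ (n-1)/n * sigma2 = 3.6 (underestimates)
print((n - 1) / n * sigma2)    # 3.6, the theoretical expectation
print(sigma2_unb.mean())       # ~ 4.0, matching the true variance
```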

Original video link:

【机器学习】【白板推导系列】【合集 1~23】 ([Machine Learning] Whiteboard Derivation Series, Collections 1~23)
