Machine Learning Whiteboard Derivation Series Notes (2): Mathematical Foundations

This article follows shuhuai008's whiteboard-derivation video on Bilibili: 数学基础_150min (Mathematical Foundations, 150 min).

Index of all notes in this series: Machine Learning Whiteboard Derivation Series Notes

I. Overview

Suppose we have the following data:

$$X=(x_{1},x_{2},\cdots ,x_{N})^{T}=\begin{pmatrix} x_{1}^{T}\\ x_{2}^{T}\\ \vdots \\ x_{N}^{T} \end{pmatrix}_{N \times p}$$

where $x_{i}\in \mathbb{R}^{p}$ and $x_{i}\overset{iid}{\sim }N(\mu ,\Sigma )$.

The parameters are $\theta =(\mu ,\Sigma )$.

II. Estimating the Mean and Variance of a Gaussian by Maximum Likelihood

(1) Maximum likelihood

$$\theta_{MLE}=\underset{\theta }{\mathrm{argmax}}\,P(X|\theta )$$

(2) The Gaussian distribution

One-dimensional Gaussian: $p(x)=\frac{1}{\sqrt{2\pi }\sigma }\exp\left(-\frac{(x-\mu )^{2}}{2\sigma ^{2}}\right)$

Multivariate Gaussian ($p$-dimensional): $p(x)=\frac{1}{(2\pi )^{p/2}|\Sigma |^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma ^{-1}(x-\mu)\right)$
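As a sanity check (a minimal sketch; assumes NumPy and SciPy are available, with illustrative values), the multivariate density above can be evaluated directly and compared with `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the p-dimensional Gaussian density at x using the formula above."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff          # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * quad) / norm

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, -1.2])

print(gaussian_pdf(x, mu, Sigma))                      # hand-written formula
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # same value from SciPy
```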

(3) Estimation for the one-dimensional Gaussian

1. The log-likelihood as a function of $\theta$

$$\begin{aligned}\log P(X|\theta )&=\log\prod_{i=1}^{N}p(x_{i}|\theta )\\ &=\sum_{i=1}^{N}\log\frac{1}{\sqrt{2\pi }\sigma }\exp\left(-\frac{(x_{i}-\mu )^{2}}{2\sigma ^{2}}\right)\\ &=\sum_{i=1}^{N}\left[\log\frac{1}{\sqrt{2\pi }}+\log\frac{1}{\sigma }-\frac{(x_{i}-\mu )^{2}}{2\sigma ^{2}}\right]\end{aligned}$$

2. Solving for $\mu_{MLE}$ by maximum likelihood

$$\begin{aligned}\mu _{MLE}&=\underset{\mu }{\mathrm{argmax}}\,\log P(X|\theta)\\ &=\underset{\mu }{\mathrm{argmax}}\sum_{i=1}^{N}-\frac{(x_{i}-\mu )^{2}}{2\sigma ^{2}}\\ &=\underset{\mu }{\mathrm{argmin}}\sum_{i=1}^{N}(x_{i}-\mu )^{2}\end{aligned}$$

Taking the derivative with respect to $\mu$ and setting it to zero:

$$\frac{\partial \sum_{i=1}^{N}(x_{i}-\mu )^{2}}{\partial \mu}=\sum_{i=1}^{N}2(x_{i}-\mu )(-1)=0 \Leftrightarrow \sum_{i=1}^{N}(x_{i}-\mu )=0 \Leftrightarrow \sum_{i=1}^{N}x_{i}-\underset{N\mu }{\underbrace{\sum_{i=1}^{N}\mu }}=0$$

Solving gives $\mu _{MLE}=\frac{1}{N}\sum_{i=1}^{N}x_{i}$.
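As a quick numeric sketch (NumPy, synthetic data with illustrative values), the closed-form $\mu_{MLE}$ is simply the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true, N = 2.0, 1.5, 10_000
x = rng.normal(mu_true, sigma_true, size=N)

mu_mle = x.sum() / N          # (1/N) * sum_i x_i
print(mu_mle, np.mean(x))     # identical; both close to mu_true = 2.0
```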

3. Proof that $\mu_{MLE}$ is unbiased

$$E[\mu _{MLE}]=\frac{1}{N}\sum_{i=1}^{N}E[x_{i}] =\frac{1}{N}\sum_{i=1}^{N}\mu =\frac{1}{N}N\mu =\mu$$

4. Solving for $\sigma_{MLE}^{2}$ by maximum likelihood

$$\begin{aligned}\sigma _{MLE}^{2}&=\underset{\sigma }{\mathrm{argmax}}\,\log P(X|\theta )=\underset{\sigma }{\mathrm{argmax}}\underset{L}{\underbrace{\sum_{i=1}^{N}\left(-\log\sigma -\frac{(x_{i}-\mu )^{2}}{2\sigma ^{2}}\right)}}\\ \frac{\partial L}{\partial \sigma}&=\sum_{i=1}^{N}\left[-\frac{1}{\sigma }+(x_{i}-\mu )^{2}\sigma ^{-3}\right]=0\\ &\Leftrightarrow \sum_{i=1}^{N}\left[-\sigma ^{2}+(x_{i}-\mu )^{2}\right]=0\\ &\Leftrightarrow -\sum_{i=1}^{N}\sigma ^{2}+\sum_{i=1}^{N}(x_{i}-\mu )^{2}=0\\ \sigma _{MLE}^{2}&=\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu )^{2}\end{aligned}$$

When $\mu$ is taken to be $\mu_{MLE}$, this becomes $\sigma _{MLE}^{2}=\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu _{MLE})^{2}$.

5. Proof that $\sigma_{MLE}^{2}$ is biased

To show that $\sigma _{MLE}^{2}$ is biased we need to check whether $E[\sigma _{MLE}^{2}]\overset{?}{=}\sigma ^{2}$. The computation is as follows:

$$Var[\mu _{MLE}]=Var\left[\frac{1}{N}\sum_{i=1}^{N}x_{i}\right]=\frac{1}{N^{2}}\sum_{i=1}^{N}Var[x_{i}]=\frac{1}{N^{2}}\sum_{i=1}^{N}\sigma ^{2}=\frac{\sigma ^{2}}{N}$$

$$\begin{aligned}\sigma _{MLE}^{2}&=\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu _{MLE})^{2}\\ &=\frac{1}{N}\sum_{i=1}^{N}(x_{i}^{2}-2x_{i}\mu _{MLE}+\mu _{MLE}^{2})\\ &=\frac{1}{N}\sum_{i=1}^{N}x_{i}^{2}-\frac{1}{N}\sum_{i=1}^{N}2x_{i}\mu _{MLE}+\frac{1}{N}\sum_{i=1}^{N}\mu _{MLE}^{2}\\ &=\frac{1}{N}\sum_{i=1}^{N}x_{i}^{2}-2\mu _{MLE}^{2}+\mu _{MLE}^{2}\\ &=\frac{1}{N}\sum_{i=1}^{N}x_{i}^{2}-\mu _{MLE}^{2}\end{aligned}$$

$$\begin{aligned}E[\sigma _{MLE}^{2}]&=E\left[\frac{1}{N}\sum_{i=1}^{N}x_{i}^{2}-\mu _{MLE}^{2}\right]\\ &=E\left[\left(\frac{1}{N}\sum_{i=1}^{N}x_{i}^{2}-\mu ^{2}\right)-(\mu _{MLE}^{2}-\mu ^{2})\right]\\ &=E\left[\frac{1}{N}\sum_{i=1}^{N}(x_{i}^{2}-\mu ^{2})\right]-E[\mu _{MLE}^{2}-\mu ^{2}]\\ &=\frac{1}{N}\sum_{i=1}^{N}E[x_{i}^{2}-\mu ^{2}]-(E[\mu _{MLE}^{2}]-\mu ^{2})\\ &=\frac{1}{N}\sum_{i=1}^{N}(E[x_{i}^{2}]-\mu ^{2})-(E[\mu _{MLE}^{2}]-E[\mu _{MLE}]^{2})\\ &=\frac{1}{N}\sum_{i=1}^{N}(Var[x_{i}]+E[x_{i}]^{2}-\mu ^{2})-Var[\mu _{MLE}]\\ &=\frac{1}{N}\sum_{i=1}^{N}(Var[x_{i}]+\mu ^{2}-\mu ^{2})-Var[\mu _{MLE}]\\ &=\frac{1}{N}\sum_{i=1}^{N}Var[x_{i}]-Var[\mu _{MLE}]\\ &=\sigma ^{2}-\frac{1}{N}\sigma ^{2}=\frac{N-1}{N}\sigma ^{2}\end{aligned}$$

Here we used $E[\mu_{MLE}]=\mu$, so $\mu^{2}=E[\mu_{MLE}]^{2}$.

Intuitively, once $\mu$ is replaced by $\mu_{MLE}$, the sum of all $x_{i}$ is already pinned to $N\mu_{MLE}$: after $N-1$ of the $x_{i}$ are fixed, the $N$-th one is determined as well. One "degree of freedom" is lost, which is why $E[{\sigma _{MLE}^{2}}]=\frac{N-1}{N}\sigma ^{2}$.

The unbiased estimator of the variance is:

$$\hat{\sigma} ^{2}=\frac{1}{N-1}\sum_{i=1}^{N}(x_{i}-\mu _{MLE})^{2}$$
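Both estimators are easy to compare numerically (a Monte Carlo sketch with illustrative values; $N$ is kept deliberately small so the bias factor $(N-1)/N$ is visible):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2_true, N, trials = 4.0, 5, 200_000

x = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, N))
mu_mle = x.mean(axis=1, keepdims=True)
var_mle = ((x - mu_mle) ** 2).mean(axis=1)              # biased MLE, divides by N
var_unbiased = ((x - mu_mle) ** 2).sum(axis=1) / (N - 1)  # divides by N-1

print(var_mle.mean())        # approx (N-1)/N * sigma2_true = 3.2
print(var_unbiased.mean())   # approx sigma2_true = 4.0
```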

III. Why the Contours of a Gaussian Are Ellipses

(1) The Gaussian distribution and the Mahalanobis distance

1. The multivariate Gaussian

$$x\sim N(\mu ,\Sigma ),\qquad p(x)=\frac{1}{(2\pi )^{p/2}|\Sigma |^{1/2}}\exp\Big(-\frac{1}{2}\underset{\text{quadratic form}}{\underbrace{(x-\mu)^{T}\Sigma ^{-1}(x-\mu)}}\Big)$$

$$x\in \mathbb{R}^{p}\ (\text{r.v.}),\quad x=\begin{pmatrix} x_{1}\\ x_{2}\\ \vdots \\ x_{p} \end{pmatrix},\quad \mu =\begin{pmatrix} \mu_{1}\\ \mu_{2}\\ \vdots \\ \mu_{p} \end{pmatrix},\quad \Sigma = \begin{bmatrix} \sigma _{11}& \sigma _{12}& \cdots & \sigma _{1p}\\ \sigma _{21}& \sigma _{22}& \cdots & \sigma _{2p}\\ \vdots & \vdots & \ddots & \vdots \\ \sigma _{p1}& \sigma _{p2}& \cdots & \sigma _{pp} \end{bmatrix}_{p\times p}$$

In general $\Sigma$ is positive semi-definite; for this derivation we assume it is positive definite, i.e. all eigenvalues are strictly positive (none are $0$).

2. Mahalanobis distance

$\sqrt{(x-\mu)^{T}\Sigma ^{-1}(x-\mu)}$ is the Mahalanobis distance between $x$ and $\mu$; when $\Sigma=I$ it reduces to the Euclidean distance.

(2) Showing that the Gaussian contours are ellipses

1. Eigendecomposition of the covariance matrix

Any $N \times N$ real symmetric matrix has $N$ linearly independent eigenvectors, and these eigenvectors can be orthonormalized into a set of mutually orthogonal unit vectors. Hence the real symmetric matrix $\Sigma$ can be decomposed as $\Sigma=U\Lambda U^{T}$.

Σ \Sigma Σ进行特征分解, Σ = U Λ U T \Sigma=U\Lambda U^{T} Σ=UΛUT
其中 U U T = U T U = I , Λ = d i a g ( λ i ) i = 1 , 2 , ⋯   , p , U = ( u 1 , u 2 , ⋯   , u p ) p × p UU^{T}=U^{T}U=I,\underset{i=1,2,\cdots ,p}{\Lambda =diag(\lambda _{i})},U=(u _{1},u _{2},\cdots ,u _{p})_{p\times p} UUT=UTU=Ii=1,2,,pΛ=diag(λi)U=(u1,u2,,up)p×p
因此 Σ = U Λ U T = ( u 1 u 2 ⋯ u p ) [ λ 1 0 ⋯ 0 0 λ 2 ⋯ 0 ⋮ ⋮ ⋱ ⋮ 0 0 ⋯ λ p ] ( u 1 T u 2 T ⋮ u p T ) = ( u 1 λ 1 u 2 λ 2 ⋯ u p λ p ) ( u 1 T u 2 T ⋮ u p T ) = ∑ i = 1 p u i λ i u i T \Sigma=U\Lambda U^{T}\\ =\begin{pmatrix} u _{1} & u _{2} & \cdots & u _{p} \end{pmatrix}\begin{bmatrix} \lambda _{1} & 0 & \cdots & 0 \\ 0 & \lambda _{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda _{p} \end{bmatrix}\begin{pmatrix} u_{1}^{T}\\ u_{2}^{T}\\ \vdots \\ u_{p}^{T} \end{pmatrix}\\ =\begin{pmatrix} u _{1}\lambda _{1} & u _{2}\lambda _{2} & \cdots & u _{p}\lambda _{p} \end{pmatrix}\begin{pmatrix} u_{1}^{T}\\ u_{2}^{T}\\ \vdots \\ u_{p}^{T} \end{pmatrix}\\ =\sum_{i=1}^{p}u_{i}\lambda _{i}u_{i}^{T} Σ=UΛUT=(u1u2up)λ1000λ2000λpu1Tu2TupT=(u1λ1u2λ2upλp)u1Tu2TupT=i=1puiλiuiT
Σ − 1 = ( U Λ U T ) − 1 = ( U T ) − 1 Λ − 1 U − 1 = U Λ − 1 U T = ∑ i = 1 p u i 1 λ i u i T , 其 中 Λ − 1 = d i a g ( 1 λ i ) , i = 1 , 2 , ⋯   , p \Sigma ^{-1}=(U\Lambda U^{T})^{-1}=(U^{T})^{-1}\Lambda ^{-1}U^{-1}=U{\Lambda^{-1}}U^{T}=\sum_{i=1}^{p}u_{i}\frac{1}{\lambda _{i}}u _{i}^{T},其中\Lambda^{-1}=diag(\frac{1}{\lambda _{i}}),i=1,2,\cdots ,p Σ1=(UΛUT)1=(UT)1Λ1U1=UΛ1UT=i=1puiλi1uiT,Λ1=diag(λi1),i=1,2,,p
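These identities can be checked numerically with NumPy's `eigh` (a sketch using a randomly generated positive-definite $\Sigma$):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)        # symmetric positive definite

lam, U = np.linalg.eigh(Sigma)         # eigenvalues lam, orthonormal eigenvectors as columns of U
print(np.allclose(Sigma, U @ np.diag(lam) @ U.T))                     # Sigma = U Lambda U^T
print(np.allclose(np.linalg.inv(Sigma), U @ np.diag(1 / lam) @ U.T))  # Sigma^{-1} = U Lambda^{-1} U^T
```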

2. Rearranging the density into the form of an ellipse equation

$$\begin{aligned}\Delta &=(x-\mu )^{T}\Sigma ^{-1}(x-\mu )\\ &=(x-\mu )^{T}\sum_{i=1}^{p}u _{i}\frac{1}{\lambda _{i}}u _{i}^{T}(x-\mu )\\ &=\sum_{i=1}^{p}(x-\mu )^{T}u _{i}\frac{1}{\lambda _{i}}u _{i}^{T}(x-\mu )\qquad (\text{let } y_{i}=(x-\mu )^{T}u _{i})\\ &=\sum_{i=1}^{p}y_{i}\frac{1}{\lambda _{i}}y_{i}^{T}=\sum_{i=1}^{p}\frac{y_{i}^{2}}{\lambda _{i}}\end{aligned}$$

Here $y_{i}=(x-\mu )^{T}u _{i}$ can be read as first centering $x$ by subtracting the mean and then projecting it onto the direction $u _{i}$, i.e. a change of coordinate axes.

When $x$ is two-dimensional ($p=2$), $\Delta =\frac{y_{1}^{2}}{\lambda _{1}}+\frac{y_{2}^{2}}{\lambda _{2}}$, which has the form of an ellipse equation; this explains why the contours of the Gaussian are ellipses.
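A small numeric sketch of this change of coordinates (illustrative 2-D values): the quadratic form computed directly equals $\sum_i y_i^2/\lambda_i$ computed in the eigenbasis:

```python
import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
lam, U = np.linalg.eigh(Sigma)

x = np.array([2.5, 0.5])
delta_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)   # (x-mu)^T Sigma^{-1} (x-mu)

y = U.T @ (x - mu)                 # project the centered x onto the eigenvector directions
delta_diag = np.sum(y ** 2 / lam)  # sum_i y_i^2 / lambda_i

print(np.isclose(delta_direct, delta_diag))   # True
```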

IV. Limitations of the Gaussian Distribution

(1) Too many parameters

Since $\Sigma _{p\times p}$ is symmetric, the covariance matrix contains $1+2+\cdots +p=\frac{p(p+1)}{2}$ free parameters. When the dimension $p$ of $x$ is large, the Gaussian therefore has a large number of parameters, and the cost grows as $O(p^{2})$.

One way to reduce the number of parameters is to assume the covariance matrix is diagonal. In that case the eigenvector directions are parallel to the coordinate axes, so the Gaussian contours (concentric ellipses) are not tilted.

If, in addition to being diagonal, the covariance matrix has all eigenvalues equal (i.e. $\lambda _{1}=\lambda _{2}=\cdots=\lambda _{p}$), the contours become circles with no tilt; this case is called isotropic.

(2) A single Gaussian has limited expressive power

The remedy is to combine several Gaussians, e.g. a Gaussian mixture model.

V. Marginal and Conditional Distributions of a Gaussian

(1) Setup

First partition the variable, the mean, and the covariance:

$$x=\begin{pmatrix} x_{a}\\ x_{b} \end{pmatrix}$$

where $x_{a}$ is $m$-dimensional and $x_{b}$ is $n$-dimensional,

$$\mu =\begin{pmatrix} \mu_{a}\\ \mu_{b} \end{pmatrix},\quad \Sigma =\begin{pmatrix} \Sigma _{aa}&\Sigma _{ab}\\ \Sigma _{ba}&\Sigma _{bb} \end{pmatrix}$$

The goal of this part is to derive $P(x_{a})$, $P(x_{b}|x_{a})$, $P(x_{b})$, and $P(x_{a}|x_{b})$ from the quantities above.

(2) A theorem

The following theorem is the main tool used in the derivation; we only state it here and omit the proof:

Given: $x\sim N(\mu ,\Sigma ),\ x\in \mathbb{R}^{p}$ and $y=Ax+B,\ y\in \mathbb{R}^{q}$.

Conclusion: $y\sim N(A\mu +B,\ A\Sigma A^{T})$.

A simple (though not rigorous) argument:

$$E[y]=E[Ax+B]=AE[x]+B=A\mu +B$$
$$Var[y]=Var[Ax+B]=Var[Ax]+Var[B]=AVar[x]A^{T}+0=A\Sigma A^{T}$$
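The theorem itself can also be checked empirically by sampling (a sketch; $A$, $B$, $\mu$, $\Sigma$ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])       # maps R^2 -> R^3
B = np.array([0.5, 0.0, -2.0])

x = rng.multivariate_normal(mu, Sigma, size=500_000)
y = x @ A.T + B                   # y = A x + B for every sample

print(np.round(y.mean(axis=0), 3), A @ mu + B)   # empirical mean vs A mu + B
print(np.round(np.cov(y.T), 2))                  # empirical covariance of y ...
print(A @ Sigma @ A.T)                           # ... vs A Sigma A^T
```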

(3) The marginal $P(x_{a})$

$$x_{a}=\underset{A}{\underbrace{\begin{pmatrix} I_{m} & 0_{n} \end{pmatrix}}}\underset{x}{\underbrace{\begin{pmatrix} x_{a}\\ x_{b} \end{pmatrix}}}$$

$$E[x_{a}]=\begin{pmatrix} I_{m} & 0_{n} \end{pmatrix}\begin{pmatrix} \mu _{a}\\ \mu _{b} \end{pmatrix}=\mu _{a}$$

$$Var[x_{a}]=\begin{pmatrix} I_{m} & 0_{n} \end{pmatrix}\begin{pmatrix} \Sigma _{aa}&\Sigma _{ab}\\ \Sigma _{ba}&\Sigma _{bb} \end{pmatrix}\begin{pmatrix} I_{m}\\ 0_{n} \end{pmatrix}=\begin{pmatrix} \Sigma _{aa}&\Sigma _{ab} \end{pmatrix}\begin{pmatrix} I_{m}\\ 0_{n} \end{pmatrix}=\Sigma _{aa}$$

Therefore $x_{a}\sim N(\mu _{a},\Sigma _{aa})$, and by the same argument $x_{b}\sim N(\mu _{b},\Sigma _{bb})$.
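A sampling sketch of the marginal result (illustrative values): draw from the joint Gaussian, keep only the $x_a$ block, and its empirical moments match $(\mu_a,\Sigma_{aa})$:

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.0, 1.0, -1.0])                 # mu_a = first 2 entries, mu_b = last entry
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m = 2                                           # dimension of x_a

x = rng.multivariate_normal(mu, Sigma, size=300_000)
xa = x[:, :m]                                   # keep only the x_a block

print(np.round(xa.mean(axis=0), 3), mu[:m])     # empirical mean vs mu_a
print(np.round(np.cov(xa.T), 2))                # empirical covariance ...
print(Sigma[:m, :m])                            # ... vs Sigma_aa
```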

(4) The conditional $P(x_{b}|x_{a})$

Construct

$$\left\{\begin{matrix} x_{b\cdot a}=x_{b}-\Sigma _{ba}\Sigma _{aa}^{-1}x_{a}\\ \mu _{b\cdot a}=\mu_{b}-\Sigma _{ba}\Sigma _{aa}^{-1}\mu_{a}\\ \Sigma _{bb\cdot a}=\Sigma _{bb}-\Sigma _{ba}\Sigma _{aa}^{-1}\Sigma _{ab} \end{matrix}\right.$$

where $\Sigma _{bb\cdot a}$ is the Schur complement of $\Sigma _{aa}$. Then

$$x_{b\cdot a}=\underset{A}{\underbrace{\begin{pmatrix} -\Sigma _{ba}\Sigma _{aa}^{-1}& I_{n} \end{pmatrix}}}\underset{x}{\underbrace{\begin{pmatrix} x_{a}\\ x_{b} \end{pmatrix}}}$$

$$E[x_{b\cdot a}]=\begin{pmatrix} -\Sigma _{ba}\Sigma _{aa}^{-1}& I_{n} \end{pmatrix}\begin{pmatrix} \mu _{a}\\ \mu _{b} \end{pmatrix}=\mu_{b}-\Sigma _{ba}\Sigma _{aa}^{-1}\mu_{a}=\mu _{b\cdot a}$$

$$\begin{aligned}Var[x_{b\cdot a}]&=\begin{pmatrix} -\Sigma _{ba}\Sigma _{aa}^{-1}& I_{n} \end{pmatrix}\begin{pmatrix} \Sigma _{aa}&\Sigma _{ab}\\ \Sigma _{ba}&\Sigma _{bb} \end{pmatrix}\begin{pmatrix} -\Sigma _{aa}^{-1}\Sigma _{ba}^{T}\\ I_{n} \end{pmatrix}\\ &=\begin{pmatrix} -\Sigma _{ba}\Sigma _{aa}^{-1}\Sigma _{aa}+\Sigma _{ba}& -\Sigma _{ba}\Sigma _{aa}^{-1}\Sigma _{ab}+\Sigma _{bb} \end{pmatrix}\begin{pmatrix} -\Sigma _{aa}^{-1}\Sigma _{ba}^{T}\\ I_{n} \end{pmatrix}\\ &=\begin{pmatrix} 0& -\Sigma _{ba}\Sigma _{aa}^{-1}\Sigma _{ab}+\Sigma _{bb} \end{pmatrix}\begin{pmatrix} -\Sigma _{aa}^{-1}\Sigma _{ba}^{T}\\ I_{n} \end{pmatrix}\\ &=\Sigma _{bb}-\Sigma _{ba}\Sigma _{aa}^{-1}\Sigma _{ab}=\Sigma _{bb\cdot a}\end{aligned}$$

We now have $x_{b\cdot a}\sim N(\mu _{b\cdot a},\Sigma _{bb\cdot a})$. From the relation between $x_{b}$ and $x_{b\cdot a}$ we can obtain the distribution of $x_{b}|x_{a}$:

$$x_{b}=\underset{x}{\underbrace{x_{b\cdot a}}}+\underset{B}{\underbrace{\Sigma _{ba}\Sigma _{aa}^{-1}x_{a}}}$$

(When computing the conditional $P(x_{b}|x_{a})$, $x_{a}$ is treated as known from the point of view of $x_{b}$, so the term $\Sigma _{ba}\Sigma _{aa}^{-1}x_{a}$ above is regarded as a constant $B$.)

$$E[x_{b}|x_{a}]=\mu _{b\cdot a}+\Sigma _{ba}\Sigma _{aa}^{-1}x_{a},\qquad Var[x_{b}|x_{a}]=Var[x_{b\cdot a}]=\Sigma _{bb\cdot a}$$

Hence $x_{b}|x_{a}\sim N(\mu _{b\cdot a}+\Sigma _{ba}\Sigma _{aa}^{-1}x_{a},\ \Sigma _{bb\cdot a})$, and by symmetry $x_{a}|x_{b}\sim N(\mu _{a\cdot b}+\Sigma _{ab}\Sigma _{bb}^{-1}x_{b},\ \Sigma _{aa\cdot b})$.
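These conditional formulas can be wrapped in a small helper (a sketch; `conditional_gaussian`, `a_idx`, and `b_idx` are hypothetical names introduced here for illustration):

```python
import numpy as np

def conditional_gaussian(mu, Sigma, a_idx, b_idx, xa):
    """Parameters of p(x_b | x_a = xa) for a joint Gaussian N(mu, Sigma)."""
    mu_a, mu_b = mu[a_idx], mu[b_idx]
    S_aa = Sigma[np.ix_(a_idx, a_idx)]
    S_ab = Sigma[np.ix_(a_idx, b_idx)]
    S_ba = Sigma[np.ix_(b_idx, a_idx)]
    S_bb = Sigma[np.ix_(b_idx, b_idx)]

    S_aa_inv = np.linalg.inv(S_aa)
    # mu_{b.a} + Sigma_ba Sigma_aa^{-1} x_a, written in the rearranged form mu_b + Sigma_ba Sigma_aa^{-1} (x_a - mu_a)
    cond_mean = mu_b + S_ba @ S_aa_inv @ (xa - mu_a)
    cond_cov = S_bb - S_ba @ S_aa_inv @ S_ab           # Schur complement Sigma_{bb.a}
    return cond_mean, cond_cov

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
print(conditional_gaussian(mu, Sigma, a_idx=[0, 1], b_idx=[2], xa=np.array([0.2, 0.8])))
```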

VI. Deriving the Joint Gaussian Distribution

(1) Setup

$$p(x)=N(x|\mu ,\Lambda ^{-1}),\qquad p(y|x)=N(y|Ax+b ,L ^{-1})$$

$\Lambda$ and $L$ are precision matrices, where $\text{precision matrix}=(\text{covariance matrix})^{-1}$.

The goal of this part is to derive $p(y)$ and $p(x|y)$ from the quantities above.

(2) Solving for $p(y)$

The relation between $y$ and $x$ given above is a linear Gaussian model, so $y$ and $x$ satisfy:

$$y=Ax+b+\varepsilon ,\qquad \varepsilon\sim N(0,L ^{-1}),\ \varepsilon\perp x$$

Then compute the mean and variance of $y$:

$$E[y]=E[Ax+b+\varepsilon]=E[Ax+b]+E[\varepsilon]=A\mu+b$$
$$Var[y]=Var[Ax+b+\varepsilon]=Var[Ax+b]+Var[\varepsilon]=A\Lambda ^{-1}A^{T}+L ^{-1}$$

Therefore $y\sim N(A\mu+b,\ L ^{-1}+A\Lambda ^{-1}A^{T})$.
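A sampling sketch of this result (illustrative $\Lambda$, $L$, $A$, $b$): generate $x$ and $\varepsilon$, form $y=Ax+b+\varepsilon$, and compare the empirical moments with $A\mu+b$ and $L^{-1}+A\Lambda^{-1}A^{T}$:

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -1.0])
Lambda = np.array([[2.0, 0.4],
                   [0.4, 1.0]])        # precision of x, so Cov[x] = Lambda^{-1}
L = np.array([[4.0, 0.0],
              [0.0, 2.0]])             # precision of the noise, Cov[eps] = L^{-1}
A = np.array([[1.0, 2.0],
              [0.5, -1.0]])
b = np.array([0.3, -0.2])

n = 400_000
x = rng.multivariate_normal(mu, np.linalg.inv(Lambda), size=n)
eps = rng.multivariate_normal(np.zeros(2), np.linalg.inv(L), size=n)
y = x @ A.T + b + eps

print(np.round(y.mean(axis=0), 3), A @ mu + b)              # mean of y vs A mu + b
print(np.round(np.cov(y.T), 3))                             # covariance of y ...
print(np.linalg.inv(L) + A @ np.linalg.inv(Lambda) @ A.T)   # ... vs L^{-1} + A Lambda^{-1} A^T
```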

(3) Solving for $p(x|y)$

To obtain $p(x|y)$ we first find the joint distribution of $x$ and $y$, and then apply the conditional formula from the previous part directly.

Construct

$$z=\begin{pmatrix} x\\ y \end{pmatrix}\sim N\left ( \begin{bmatrix} \mu \\ A\mu+b \end{bmatrix} , \begin{bmatrix} \Lambda ^{-1} & \Delta \\ \Delta ^{T} & L ^{-1}+A\Lambda ^{-1}A^{T} \end{bmatrix}\right )$$

It remains to find $\Delta$:

$$\begin{aligned}\Delta&=Cov(x,y)=E[(x-E[x])(y-E[y])^{T}]\\ &=E[(x-\mu )(y-A\mu-b)^{T}]\\ &=E[(x-\mu )(Ax+b+\varepsilon-A\mu-b)^{T}]\\ &=E[(x-\mu )(Ax-A\mu+\varepsilon)^{T}]\\ &=E[(x-\mu )(Ax-A\mu)^{T}+(x-\mu)\varepsilon^{T}]\\ &=E[(x-\mu )(Ax-A\mu)^{T}]+E[(x-\mu)\varepsilon^{T}]\\ &=E[(x-\mu )(Ax-A\mu)^{T}]+E[x-\mu]\,E[\varepsilon^{T}]\\ &=E[(x-\mu )(Ax-A\mu)^{T}]+E[x-\mu]\cdot 0\\ &=E[(x-\mu )(x-\mu )^{T}A^{T}]\\ &=E[(x-\mu )(x-\mu )^{T}]A^{T}\\ &=Var[x]A^{T}=\Lambda ^{-1}A^{T}\end{aligned}$$

(Since $x\perp \varepsilon$, we also have $(x-\mu)\perp \varepsilon$, so $E[(x-\mu)\varepsilon^{T}]=E[x-\mu]\,E[\varepsilon^{T}]$.)

Hence

$$z=\begin{pmatrix} x\\ y \end{pmatrix}\sim N\left ( \begin{bmatrix} \mu \\ A\mu+b \end{bmatrix} , \begin{bmatrix} \Lambda ^{-1} & \Lambda ^{-1}A^{T} \\ A\Lambda ^{-1} & L ^{-1}+A\Lambda ^{-1}A^{T} \end{bmatrix}\right )$$

Applying the formula from Part V gives $x|y\sim N\big(\mu _{x\cdot y}+\Lambda ^{-1}A^{T} (L ^{-1}+A\Lambda ^{-1}A^{T})^{-1}y,\ \Sigma _{xx\cdot y}\big)$, where $\mu _{x\cdot y}$ and $\Sigma _{xx\cdot y}$ are defined analogously to $\mu_{b\cdot a}$ and $\Sigma_{bb\cdot a}$ above.
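Putting the pieces together (a sketch reusing the same illustrative values): build $\Delta=\Lambda^{-1}A^{T}$ and the covariance of $y$, then apply the conditional formula from Part V to get the mean and covariance of $p(x|y)$:

```python
import numpy as np

mu = np.array([1.0, -1.0])
Lambda = np.array([[2.0, 0.4],
                   [0.4, 1.0]])
L = np.array([[4.0, 0.0],
              [0.0, 2.0]])
A = np.array([[1.0, 2.0],
              [0.5, -1.0]])
b = np.array([0.3, -0.2])

Lam_inv, L_inv = np.linalg.inv(Lambda), np.linalg.inv(L)
cov_y = L_inv + A @ Lam_inv @ A.T
Delta = Lam_inv @ A.T                        # Cov(x, y) = Lambda^{-1} A^T

# conditional p(x | y = y_obs): mean and covariance via Part V's formula
y_obs = np.array([0.0, 1.0])
mean_x_given_y = mu + Delta @ np.linalg.inv(cov_y) @ (y_obs - (A @ mu + b))
cov_x_given_y = Lam_inv - Delta @ np.linalg.inv(cov_y) @ Delta.T    # Schur complement

print(mean_x_given_y)
print(cov_x_given_y)
```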

Next chapter: Whiteboard Derivation Series Notes (3): Linear Regression

