The Relationship Between MLE and MAP

Let the dataset be $D$ and the parameters be $\theta$.

For MLE, assuming the samples are i.i.d., we have

$$P(D \mid \theta) = P(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta)$$

Taking the logarithm:

$$\log P(D \mid \theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta)$$
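As a small numerical sketch of this (the Gaussian model, the data, and the grid search are all illustrative assumptions, not part of the original derivation): for i.i.d. Gaussian data with known variance, maximizing $\sum_i \log P(x_i \mid \theta)$ over the mean $\theta$ recovers the sample mean.

```python
import numpy as np

# Illustrative setup: 500 i.i.d. draws from N(2, 1).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)

def log_likelihood(theta, x, sigma=1.0):
    # sum_i log N(x_i | theta, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - theta)**2 / (2 * sigma**2))

# Maximize the log-likelihood by brute-force grid search.
thetas = np.linspace(0.0, 4.0, 2001)
lls = np.array([log_likelihood(t, x) for t in thetas])
theta_mle = thetas[np.argmax(lls)]

print(theta_mle, x.mean())  # the grid argmax sits at the sample mean
```

The grid search stands in for the usual closed-form or gradient-based maximization; it makes the "argmax of the summed log-likelihoods" reading of the formula concrete.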

For MAP, Bayes' rule gives $P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$; taking the logarithm and dropping the constant $\log P(D)$ (it does not depend on $\theta$), we have

$$\log P(\theta \mid D) = \sum_{i=1}^{n} \log P(x_i \mid \theta) + \log P(\theta)$$

Suppose $\theta$ has a Gaussian prior, $\theta \sim N(0, \sigma^2)$:

$$P(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\theta^2}{2\sigma^2}\right)$$

Taking the logarithm:

$$\log P(\theta) = -\log\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{\theta^2}{2\sigma^2}$$

Then:

$$\log P(\theta \mid D) = \sum_{i=1}^{n} \log P(x_i \mid \theta) - \log\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{\theta^2}{2\sigma^2}$$

Since $\sigma$ is a constant, the term $-\log\!\left(\sqrt{2\pi}\,\sigma\right)$ does not depend on $\theta$ and can be dropped, giving

$$\log P(\theta \mid D) = \sum_{i=1}^{n} \log P(x_i \mid \theta) - \frac{1}{2\sigma^2}\,\theta^2$$

Here $\frac{1}{2\sigma^2}$ plays the role of the L2 regularization coefficient $\lambda$, so when $P(\theta)$ is a Gaussian prior, MAP is simply MLE plus L2 regularization.
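This equivalence can be checked numerically (the unit likelihood variance, the prior scale, and the data are illustrative assumptions): maximizing the log-likelihood minus $\frac{1}{2\sigma_p^2}\theta^2$ lands exactly on the closed-form MAP estimate for a Gaussian mean.

```python
import numpy as np

# Illustrative setup: Gaussian likelihood with unit variance,
# Gaussian prior N(0, sigma_p^2). The L2 penalty coefficient is
# lambda = 1 / (2 * sigma_p**2), and the closed-form MAP estimate is
#   theta_map = sum(x) / (n + 1/sigma_p^2).
rng = np.random.default_rng(1)
x = rng.normal(loc=1.5, scale=1.0, size=50)
sigma_p = 0.5

def objective(theta):
    log_lik = np.sum(-0.5 * (x - theta)**2)   # Gaussian log-lik, constants dropped
    l2_penalty = theta**2 / (2 * sigma_p**2)  # -log P(theta), constants dropped
    return log_lik - l2_penalty

thetas = np.linspace(-1.0, 3.0, 4001)
theta_map = thetas[np.argmax([objective(t) for t in thetas])]
theta_closed = x.sum() / (len(x) + 1 / sigma_p**2)

print(theta_map, theta_closed)  # grid argmax matches the closed form
```

Note that $\theta_{MAP}$ is pulled toward the prior mean 0 relative to the sample mean; that shrinkage is exactly what the L2 penalty does in ridge regression.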

Now suppose $\theta$ has a Laplace prior, $\theta \sim \mathrm{Laplace}(0, b)$:

$$P(\theta) = \frac{1}{2b}\exp\!\left(-\frac{|\theta|}{b}\right)$$

Taking the logarithm:

$$\log P(\theta) = -\log(2b) - \frac{|\theta|}{b}$$

Then:

$$\log P(\theta \mid D) = \sum_{i=1}^{n} \log P(x_i \mid \theta) - \log(2b) - \frac{|\theta|}{b}$$

Since $b$ is a constant, the term $-\log(2b)$ does not depend on $\theta$ and can be dropped, giving

$$\log P(\theta \mid D) = \sum_{i=1}^{n} \log P(x_i \mid \theta) - \frac{1}{b}\,|\theta|$$

Here $\frac{1}{b}$ plays the role of the L1 regularization coefficient $\lambda$, so when $P(\theta)$ is a Laplace prior, MAP is simply MLE plus L1 regularization.
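A sketch of why the L1 penalty induces sparsity (the unit-variance Gaussian likelihood, the data, and the prior scale are illustrative assumptions): for estimating a Gaussian mean under a Laplace prior, the MAP estimate is the soft-thresholded sample mean, so a small sample mean is snapped exactly to zero.

```python
import numpy as np

# Illustrative setup: 20 draws from a Gaussian whose true mean is
# near zero; Laplace prior with scale b, so lambda = 1/b.
rng = np.random.default_rng(2)
x = rng.normal(loc=0.05, scale=1.0, size=20)
b = 0.1

def objective(theta):
    # log-likelihood (constants dropped) minus the L1 penalty |theta|/b
    return np.sum(-0.5 * (x - theta)**2) - abs(theta) / b

# Maximize on a fine grid.
thetas = np.linspace(-1.0, 1.0, 20001)
theta_map = thetas[np.argmax([objective(t) for t in thetas])]

# Closed form for this objective: soft-threshold the sample mean
# by 1/(n*b).
xbar = x.mean()
soft = np.sign(xbar) * max(abs(xbar) - 1 / (len(x) * b), 0.0)

print(theta_map, soft)
```

With these numbers the threshold $1/(nb) = 0.5$ usually exceeds $|\bar{x}|$, so the MAP estimate is exactly 0, while a Gaussian prior would only shrink it toward 0 without reaching it.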

Since MAP is MLE plus a log-prior term, i.e.

$$\log P(\theta \mid D) = \sum_{i=1}^{n} \log P(x_i \mid \theta) + \log P(\theta)$$

as the dataset grows, $\sum_{i=1}^{n} \log P(x_i \mid \theta)$ grows in magnitude with $n$ while $\log P(\theta)$ stays fixed, so the likelihood term dominates and the MAP estimate converges to the MLE.
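This limiting behavior is easy to see with the Gaussian-prior closed form from above (the prior scale, true mean, and sample sizes are illustrative assumptions): the gap between the MAP and MLE estimates shrinks as $n$ grows.

```python
import numpy as np

# Illustrative setup: unit-variance Gaussian likelihood, prior
# N(0, sigma_p^2). Then theta_map = sum(x) / (n + 1/sigma_p^2)
# while theta_mle = mean(x); the fixed prior is overwhelmed by the
# likelihood term, which scales with n.
rng = np.random.default_rng(3)
sigma_p = 0.5
true_mean = 2.0

gaps = []
for n in [10, 100, 10000]:
    x = rng.normal(loc=true_mean, scale=1.0, size=n)
    theta_mle = x.mean()
    theta_map = x.sum() / (n + 1 / sigma_p**2)
    gaps.append(abs(theta_map - theta_mle))

print(gaps)  # each gap is smaller than the last
```

Analytically the gap here is $|\bar{x}| \cdot \frac{1/\sigma_p^2}{n + 1/\sigma_p^2}$, which tends to 0 as $n \to \infty$, matching the argument above.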
