Let the dataset be $D = \{x_1, x_2, \dots, x_n\}$ and the parameter be $\theta$.

For MLE we have

$$P(D \mid \theta) = P(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta)$$

Taking the logarithm:

$$\log P(D \mid \theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta)$$
For MAP, Bayes' rule gives $P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$, so up to an additive constant

$$\log P(\theta \mid D) = \sum_{i=1}^{n} \log P(x_i \mid \theta) + \log P(\theta)$$
When $P(\theta)$ is a Gaussian, $\theta \sim N(0, \sigma^2)$:

$$P(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\theta^2}{2\sigma^2}\right)$$

Taking the logarithm:

$$\log P(\theta) = -\log\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{\theta^2}{2\sigma^2}$$

Then

$$\log P(\theta \mid D) = \sum_{i=1}^{n} \log P(x_i \mid \theta) - \log\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{\theta^2}{2\sigma^2}$$

Since $\sigma$ is a constant, the term $-\log\!\left(\sqrt{2\pi}\,\sigma\right)$ can be dropped, leaving

$$\log P(\theta \mid D) = \sum_{i=1}^{n} \log P(x_i \mid \theta) - \frac{1}{2\sigma^2}\,\theta^2$$

Taking $\lambda = \frac{1}{2\sigma^2}$ as the L2 penalty coefficient, we see that when $P(\theta)$ is Gaussian, MAP is simply MLE plus L2 regularization.
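As a quick numerical check (a sketch, assuming Gaussian data with known unit variance and a hypothetical prior standard deviation `sigma0`), maximizing the log-likelihood plus the Gaussian log-prior over a grid recovers the same estimate as the closed-form L2-shrunk mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=50)   # data with known unit variance (assumption)
sigma0 = 1.5                         # hypothetical prior std: theta ~ N(0, sigma0^2)
lam = 1.0 / (2 * sigma0**2)          # the L2 penalty implied by that prior

# MAP objective = log-likelihood + log-prior with constants dropped;
# this is exactly the L2-regularized log-likelihood from the derivation
grid = np.linspace(-5.0, 5.0, 200001)
objective = -0.5 * ((x[None, :] - grid[:, None]) ** 2).sum(axis=1) - lam * grid**2
theta_map = grid[np.argmax(objective)]

# closed form for this Gaussian-mean model: sum(x) / (n + 1/sigma0^2)
theta_closed = x.sum() / (len(x) + 1.0 / sigma0**2)
print(theta_map, theta_closed)  # agree up to grid resolution
```

Note the prior pulls the estimate toward $0$: the MAP value is slightly smaller than the plain sample mean, and the pull weakens as `sigma0` (the prior's spread) grows.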
When $P(\theta)$ follows a Laplace distribution, $\theta \sim \mathrm{Laplace}(0, b)$:

$$P(\theta) = \frac{1}{2b} \exp\!\left(-\frac{|\theta|}{b}\right)$$

Taking the logarithm:

$$\log P(\theta) = -\log(2b) - \frac{|\theta|}{b}$$

Then

$$\log P(\theta \mid D) = \sum_{i=1}^{n} \log P(x_i \mid \theta) - \log(2b) - \frac{|\theta|}{b}$$

Since $b$ is a constant, the term $-\log(2b)$ can be dropped, leaving

$$\log P(\theta \mid D) = \sum_{i=1}^{n} \log P(x_i \mid \theta) - \frac{1}{b}\,|\theta|$$

Taking $\lambda = \frac{1}{b}$ as the L1 penalty coefficient, we see that when $P(\theta)$ follows a Laplace distribution, MAP is simply MLE plus L1 regularization.
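The same check works for the Laplace prior (again a sketch, assuming Gaussian data with known unit variance and a hypothetical prior scale `b`). Here the L1 penalty soft-thresholds the sample mean, which is why Laplace priors tend to produce exactly-zero estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.3, 1.0, size=20)   # data with known unit variance (assumption)
b = 0.5                              # hypothetical prior scale: theta ~ Laplace(0, b)
lam = 1.0 / b                        # the L1 penalty implied by that prior

# MAP objective = log-likelihood + log-prior with constants dropped;
# this is exactly the L1-regularized log-likelihood from the derivation
grid = np.linspace(-2.0, 2.0, 400001)
objective = -0.5 * ((x[None, :] - grid[:, None]) ** 2).sum(axis=1) - lam * np.abs(grid)
theta_map = grid[np.argmax(objective)]

# the L1 penalty soft-thresholds the sample mean: shrink by lam/n, clip at 0
xbar = x.mean()
theta_soft = np.sign(xbar) * max(abs(xbar) - lam / len(x), 0.0)
print(theta_map, theta_soft)  # agree up to grid resolution
```

If the sample mean lies within $\lambda/n$ of zero, the MAP estimate is exactly $0$, in contrast to the Gaussian prior, which only shrinks the estimate without ever zeroing it out.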
Since MAP is MLE plus a log-prior term,

$$\log P(\theta \mid D) = \sum_{i=1}^{n} \log P(x_i \mid \theta) + \log P(\theta)$$

as the amount of data grows, the magnitude of $\sum_{i=1}^{n} \log P(x_i \mid \theta)$ grows with $n$ while $\log P(\theta)$ stays fixed, so the likelihood term dominates and the MAP estimate converges to the MLE.
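This convergence is easy to observe numerically (a sketch reusing the Gaussian-mean model, with a hypothetical prior centered far from the true parameter): the gap between the MLE and MAP estimates shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma0 = 1.0        # hypothetical Gaussian prior: theta ~ N(0, sigma0^2)
true_theta = 3.0    # data-generating mean, deliberately far from the prior mean 0

gaps = []
for n in (10, 1000, 100000):
    x = rng.normal(true_theta, 1.0, size=n)
    theta_mle = x.mean()                          # maximizes the likelihood alone
    theta_map = x.sum() / (n + 1.0 / sigma0**2)   # prior shrinks the estimate toward 0
    gaps.append(abs(theta_mle - theta_map))
    print(n, gaps[-1])  # the MLE-vs-MAP gap shrinks roughly like 1/n
```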