The EM algorithm is an iterative algorithm for maximum likelihood estimation, or maximum a posteriori estimation, of the parameters of probabilistic models that contain latent variables. Each EM iteration consists of two steps: the E-step, which computes an expectation, and the M-step, which performs a maximization.
Introduction to the EM Algorithm
Given a probabilistic model with a latent variable $Z$, the goal is to maximize the log-likelihood of the observed data $Y$ with respect to the parameter $\theta$, i.e., to maximize
$$L(\theta) = \log P(Y \mid \theta) = \log \sum_Z P(Y, Z \mid \theta) = \log\left(\sum_Z P(Y \mid Z, \theta)\, P(Z \mid \theta)\right)$$
The main difficulty of this maximization is that it involves the unobserved data $Z$ and the logarithm of a sum. Suppose the estimate of $\theta$ after the $i$-th iteration is $\theta_i$, and consider
$$L(\theta) - L(\theta_i) = \log\left(\sum_Z P(Y \mid Z, \theta)\, P(Z \mid \theta)\right) - \log P(Y \mid \theta_i)$$
Applying Jensen's inequality gives a lower bound:
$$L(\theta) - L(\theta_i) \ge \sum_Z P(Z \mid Y, \theta_i) \log \frac{P(Y \mid Z, \theta)\, P(Z \mid \theta)}{P(Z \mid Y, \theta_i)\, P(Y \mid \theta_i)}$$
Let
$$B(\theta, \theta_i) = L(\theta_i) + \sum_Z P(Z \mid Y, \theta_i) \log \frac{P(Y \mid Z, \theta)\, P(Z \mid \theta)}{P(Z \mid Y, \theta_i)\, P(Y \mid \theta_i)}$$
Then $L(\theta) \ge B(\theta, \theta_i)$, i.e., $B(\theta, \theta_i)$ is a lower bound of $L(\theta)$. Consequently, any $\theta$ that increases $B(\theta, \theta_i)$ also increases $L(\theta)$. To make $L(\theta)$ grow as much as possible, choose $\theta_{i+1}$ to maximize $B(\theta, \theta_i)$:
$$\theta_{i+1} = \arg\max_\theta B(\theta, \theta_i)$$
Dropping the terms of $B(\theta, \theta_i)$ that do not depend on $\theta$, this is equivalent to
$$\theta_{i+1} = \arg\max_\theta \sum_Z P(Z \mid Y, \theta_i) \log P(Y, Z \mid \theta)$$
Defining
$$Q(\theta, \theta_i) = \sum_Z P(Z \mid Y, \theta_i) \log P(Y, Z \mid \theta)$$
we obtain
$$\theta_{i+1} = \arg\max_\theta Q(\theta, \theta_i)$$
This is exactly one iteration of the EM algorithm: compute the Q function, then maximize it. The EM algorithm thus approaches the maximization of the log-likelihood by repeatedly maximizing this lower bound. Concretely, the algorithm proceeds as follows:
(1) Choose an initial value $\theta_0$ for the parameters and start the iteration;
(2) E-step: compute $Q(\theta, \theta_i) = \sum_Z P(Z \mid Y, \theta_i) \log P(Y, Z \mid \theta)$, where $P(Z \mid Y, \theta_i)$ is the conditional distribution of the latent data $Z$ given the observed data $Y$ and the current parameter estimate $\theta_i$;
(3) M-step: find the $\theta_{i+1}$ that maximizes $Q(\theta, \theta_i)$;
(4) Repeat steps (2) and (3) until convergence.
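The four steps above can be sketched in code. Below is a minimal illustration on a two-component Bernoulli mixture (the model, function name, and data are hypothetical choices for demonstration, not from the text); it also checks the monotonicity that the lower-bound argument guarantees.

```python
import numpy as np

# Observed data y_j in {0, 1}; latent Z picks component k with
# probability pi_k; component k emits 1 with probability p_k.
def em_bernoulli_mixture(y, pi, p, n_iter=50):
    """Run EM; return parameters and the log-likelihood trace."""
    y = np.asarray(y, dtype=float)
    ll_trace = []
    for _ in range(n_iter):
        # E-step: responsibilities P(Z = k | y_j, theta_i)
        lik = np.stack([pi[k] * p[k]**y * (1 - p[k])**(1 - y)
                        for k in range(2)])           # shape (2, N)
        gamma = lik / lik.sum(axis=0, keepdims=True)  # normalize over k
        ll_trace.append(np.log(lik.sum(axis=0)).sum())
        # M-step: maximize Q(theta, theta_i) in closed form
        nk = gamma.sum(axis=1)
        pi = nk / y.size
        p = (gamma * y).sum(axis=1) / nk
    return pi, p, ll_trace

rng = np.random.default_rng(0)
y = np.concatenate([rng.binomial(1, 0.9, 100), rng.binomial(1, 0.2, 100)])
pi, p, ll = em_bernoulli_mixture(y, pi=np.array([0.5, 0.5]),
                                 p=np.array([0.6, 0.4]))
# Each iteration maximizes a lower bound of L(theta), so the
# log-likelihood never decreases:
assert all(b >= a - 1e-7 for a, b in zip(ll, ll[1:]))
```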
The EM algorithm can be used not only for supervised learning but also for unsupervised learning.
EM Algorithm for Parameter Estimation of Gaussian Mixture Models
A Gaussian mixture model is a probability distribution of the form
$$P(y \mid \theta) = \sum_{k=1}^K \alpha_k\, \phi(y \mid \theta_k)$$
where the $\alpha_k$ are mixing coefficients with $\alpha_k \ge 0$ and $\sum_{k=1}^K \alpha_k = 1$, and $\phi(y \mid \theta_k)$ is the Gaussian density with $\theta_k = (\mu_k, \sigma_k)$:
$$\phi(y \mid \theta_k) = \frac{1}{\sqrt{2\pi}\, \sigma_k} \exp\left(-\frac{(y - \mu_k)^2}{2\sigma_k^2}\right)$$
(1) Identify the latent variable and write the complete-data log-likelihood
The observed data can be imagined to be generated as follows: first, the $k$-th Gaussian component $\phi(y \mid \theta_k)$ is selected with probability $\alpha_k$; then the observation $y_j$ is generated according to that component's distribution. Which component each observation $y_j$ came from is unobserved, and is represented by the latent variable
$$\gamma_{jk} = \begin{cases} 1, & \text{if the } j\text{-th observation comes from the } k\text{-th component} \\ 0, & \text{otherwise} \end{cases}$$
The complete-data log-likelihood is then
$$\log P(y, \gamma \mid \theta) = \sum_{k=1}^K \left\{ n_k \log \alpha_k + \sum_{j=1}^N \gamma_{jk} \left[ \log \frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{1}{2\sigma_k^2} (y_j - \mu_k)^2 \right] \right\}$$
where $n_k = \sum_{j=1}^N \gamma_{jk}$.
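This expression follows from writing the complete-data likelihood as a product over observations, since each $(y_j, \gamma_{j1}, \dots, \gamma_{jK})$ selects exactly one component:
$$P(y, \gamma \mid \theta) = \prod_{j=1}^N P(y_j, \gamma_{j1}, \dots, \gamma_{jK} \mid \theta) = \prod_{k=1}^K \prod_{j=1}^N \left[ \alpha_k\, \phi(y_j \mid \theta_k) \right]^{\gamma_{jk}} = \prod_{k=1}^K \alpha_k^{n_k} \prod_{j=1}^N \left[ \phi(y_j \mid \theta_k) \right]^{\gamma_{jk}}$$
Taking the logarithm and substituting the Gaussian density $\phi$ yields the bracketed form.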
(2)EM算法的E步:确定Q函数
$$Q(\theta, \theta_i) = E\left[\log P(y, \gamma \mid \theta) \mid y, \theta_i\right] = \sum_{k=1}^K \left\{ \sum_{j=1}^N (E\gamma_{jk}) \log \alpha_k + \sum_{j=1}^N (E\gamma_{jk}) \left[ \log \frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{1}{2\sigma_k^2} (y_j - \mu_k)^2 \right] \right\}$$
After simplification,
$$Q(\theta, \theta_i) = \sum_{k=1}^K \left\{ n_k \log \alpha_k + \sum_{j=1}^N \hat\gamma_{jk} \left[ \log \frac{1}{\sqrt{2\pi}} - \log \sigma_k - \frac{1}{2\sigma_k^2} (y_j - \mu_k)^2 \right] \right\}$$
where
$$\hat\gamma_{jk} = E\gamma_{jk} = \frac{\alpha_k\, \phi(y_j \mid \theta_k)}{\sum_{k=1}^K \alpha_k\, \phi(y_j \mid \theta_k)}$$
Here $\hat\gamma_{jk}$ is the probability, under the current model parameters, that the $j$-th observation comes from the $k$-th component, called the responsibility of component $k$ for observation $y_j$; and $n_k = \sum_{j=1}^N E\gamma_{jk}$.
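The responsibilities $\hat\gamma_{jk}$ can be computed in a vectorized way; a sketch in NumPy (the function name and array-shape conventions are my own choices, not from the text):

```python
import numpy as np

def responsibilities(y, alpha, mu, sigma):
    """E-step: gamma[j, k] = alpha_k * phi(y_j | theta_k), normalized over k."""
    y = np.asarray(y, dtype=float)[:, None]   # shape (N, 1)
    # Gaussian density phi(y_j | mu_k, sigma_k), broadcast to shape (N, K)
    phi = np.exp(-(y - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    weighted = alpha * phi
    return weighted / weighted.sum(axis=1, keepdims=True)

g = responsibilities([0.0, 5.0], alpha=np.array([0.5, 0.5]),
                     mu=np.array([0.0, 5.0]), sigma=np.array([1.0, 1.0]))
# Each row sums to 1; each point is assigned almost entirely to the
# component centered on it.
```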
(3) Determine the M-step of EM
The M-step maximizes $Q(\theta, \theta_i)$ with respect to $\theta$, yielding the model parameters for the next iteration. Concretely, take the partial derivative with respect to each parameter and set it to zero (for $\alpha_k$, subject to the constraint $\sum_{k=1}^K \alpha_k = 1$); the results are:
$$\hat\mu_k = \frac{\sum_{j=1}^N \hat\gamma_{jk}\, y_j}{\sum_{j=1}^N \hat\gamma_{jk}}, \qquad \hat\sigma_k^2 = \frac{\sum_{j=1}^N \hat\gamma_{jk} (y_j - \hat\mu_k)^2}{\sum_{j=1}^N \hat\gamma_{jk}}, \qquad \hat\alpha_k = \frac{n_k}{N} = \frac{\sum_{j=1}^N \hat\gamma_{jk}}{N}$$
Repeat these computations until the log-likelihood no longer changes appreciably. Note that the M-step depends on the E-step only through $\hat\gamma_{jk}$, so in practice the E-step only needs to compute $\hat\gamma_{jk}$.
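Putting the E-step and M-step together gives a complete fitting loop; a minimal one-dimensional NumPy sketch (the function name, initialization heuristic, and fixed iteration count are my own choices, not prescribed by the text):

```python
import numpy as np

def fit_gmm(y, K, n_iter=200, seed=0):
    """Fit a 1-D Gaussian mixture by EM, following the updates above."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    N = y.size
    # Initialization: uniform weights, means at random data points,
    # shared broad standard deviation (a common heuristic).
    alpha = np.full(K, 1.0 / K)
    mu = rng.choice(y, size=K, replace=False)
    sigma = np.full(K, y.std())
    for _ in range(n_iter):
        # E-step: responsibilities gamma[j, k]
        phi = np.exp(-(y[:, None] - mu)**2 / (2 * sigma**2)) \
              / (np.sqrt(2 * np.pi) * sigma)
        w = alpha * phi
        gamma = w / w.sum(axis=1, keepdims=True)
        # M-step: closed-form updates derived above
        nk = gamma.sum(axis=0)
        mu = (gamma * y[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((gamma * (y[:, None] - mu)**2).sum(axis=0) / nk)
        alpha = nk / N
    return alpha, mu, sigma

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-3, 1, 500), rng.normal(4, 1.5, 500)])
alpha, mu, sigma = fit_gmm(y, K=2)
```

Note that the loop computes only $\hat\gamma_{jk}$ in the E-step, matching the observation above that the M-step needs nothing else.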
Reference:
Li Hang, Statistical Learning Methods (《统计学习方法》)