Maximum likelihood estimation (MLE) is a widely used method for parameter estimation. Given height data from some region, and knowing that height follows a Gaussian distribution, MLE lets us estimate the two parameters $\mu$ and $\sigma$ of that Gaussian.
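As a warm-up, here is a minimal sketch of the single-Gaussian case (the data are synthetic, invented for illustration): the MLE solutions are simply the sample mean and the biased sample standard deviation.

```python
import numpy as np

# Hypothetical height data (cm); any 1-D sample works here.
rng = np.random.default_rng(0)
heights = rng.normal(loc=170.0, scale=6.0, size=1000)

# Closed-form MLE for a Gaussian: sample mean and biased sample std (ddof=0).
mu_hat = heights.mean()
sigma_hat = heights.std(ddof=0)
print(f"mu_hat = {mu_hat:.2f}, sigma_hat = {sigma_hat:.2f}")
```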
But what if the data come from several groups and several models? Suppose we now have height data from many provinces across the country, but we do not know which province each sample belongs to; we only know that each province follows its own Gaussian distribution. The resulting model is called a Gaussian mixture model (GMM), with the formula
$$P(x;\theta)=\sum_{k=1}^K p_k\,N(x;\mu_k,\Sigma_k) \qquad (1)$$
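To make (1) concrete, here is a small sketch (component weights and parameters are invented for illustration) that evaluates a one-dimensional GMM density with scipy:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 3-component mixture: weights p_k, means mu_k, stds sigma_k.
p = np.array([0.5, 0.3, 0.2])          # must sum to 1
mu = np.array([160.0, 170.0, 180.0])
sigma = np.array([5.0, 6.0, 5.5])

def gmm_pdf(x):
    # Equation (1): sum_k p_k * N(x; mu_k, sigma_k^2)
    return sum(p_k * norm.pdf(x, m_k, s_k) for p_k, m_k, s_k in zip(p, mu, sigma))

print(gmm_pdf(165.0))
```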
Plain MLE cannot recover the parameters of this model cleanly, because the model contains a latent variable: each sample's label (which component it came from). Denote it by $Z$; the log-likelihood then takes the form
$$L(\theta)=\sum_{i=1}^m\log p(x_i;\theta)=\sum_{i=1}^m\log \int p(x_i,z;\theta)\,dz \qquad (2)$$
Differentiating (2) directly, as ordinary MLE would require, means differentiating expressions of the form $\log(y_1+y_2+\dots)$, which quickly becomes computationally intractable.
This is where the EM algorithm comes in. Before introducing EM, let us first recall an inequality, Jensen's inequality, defined as follows:
If $X$ is a random variable and $f(X)$ is a convex function (second derivative greater than or equal to 0), then:
$$E[f(X)] \ge f(E[X])$$
If $f(X)$ is concave, the inequality reverses:
$$E[f(X)] \le f(E[X])$$
Equality holds if and only if $X$ is (almost surely) constant.
We can get an intuitive picture of Jensen's inequality by considering two points $x=a$ and $x=b$: $E[f(X)]$ can be viewed as $\frac{1}{2}(f(a)+f(b))$, and $f(E[X])$ as $f\!\left(\frac{1}{2}(a+b)\right)$, the function value at the midpoint of $a$ and $b$. For a convex curve the chord lies above the graph, so $E[f(X)] \ge f(E[X])$ follows immediately.
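As a quick sanity check (a toy numeric example, not part of the original derivation): $\log$ is concave, so $E[\log X] \le \log E[X]$ should hold for any positive sample.

```python
import numpy as np

# log is concave, so Jensen gives E[log X] <= log E[X].
rng = np.random.default_rng(1)
x = rng.uniform(0.5, 2.0, size=10_000)  # any positive sample

lhs = np.log(x).mean()   # E[f(X)]
rhs = np.log(x.mean())   # f(E[X])
assert lhs <= rhs
print(f"E[log X] = {lhs:.4f} <= log E[X] = {rhs:.4f}")
```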
Using Jensen's inequality, and the fact that $\log$ is concave, expression (2) can be rewritten as:
$$\begin{aligned} \sum_{i=1}^m\log \int p(x_i,z;\theta)\,dz &=\sum_{i=1}^m\log \int Q(z)\frac{p(x_i,z;\theta)}{Q(z)}\,dz\\ &\ge \sum_{i=1}^m\int Q(z)\log \frac{p(x_i,z;\theta)}{Q(z)}\,dz \qquad (3) \end{aligned}$$
Since $Q(z)$ is a probability distribution, it satisfies
$$\sum_z Q(z)=1 \qquad (4)$$
By Jensen's inequality, equality in (3) holds when
$$\frac{p(x_i,z;\theta)}{Q(z)}=C \qquad (5)$$
Combining (4) and (5):
$$\sum_{z}Q(z)=\sum_{z}\frac{p(x_i,z;\theta)}{C}=1 \qquad (6)$$
$$Q(z)=\frac{p(x_i,z;\theta)}{C}=\frac{p(x_i,z;\theta)}{\sum_{z}p(x_i,z;\theta)}=\frac{p(x_i,z;\theta)}{p(x_i)}=p(z\mid x_i;\theta) \qquad (7)$$
Substituting (7) back into (3):
$$\sum_{i=1}^m\log\int p(x_i,z;\theta)\,dz\ge \sum_{i=1}^m\int p(z\mid x_i;\theta)\log \frac{p(x_i,z;\theta)}{p(z\mid x_i;\theta)}\,dz$$
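A small numeric sketch (a toy one-sample, two-component mixture; all numbers invented) can confirm that with $Q(z)=p(z\mid x;\theta)$ the bound above is tight:

```python
import numpy as np
from scipy.stats import norm

# Toy 2-component 1-D mixture and a single observation x.
p = np.array([0.4, 0.6])
mu, sigma = np.array([0.0, 3.0]), np.array([1.0, 1.5])
x = 1.2

joint = p * norm.pdf(x, mu, sigma)          # p(x, z) for z = 0, 1
log_lik = np.log(joint.sum())               # log p(x)
Q = joint / joint.sum()                     # posterior p(z | x)

elbo = np.sum(Q * np.log(joint / Q))        # sum_z Q(z) log p(x,z)/Q(z)
print(log_lik, elbo)                        # equal up to floating-point error
```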
By repeatedly setting
$$\hat \theta=\arg \max_\theta\int p(z\mid x_i;\theta_i)\log \frac{p(x_i,z;\theta)}{p(z\mid x_i;\theta_i)}\,dz=\arg \max_\theta\int p(z\mid x_i;\theta_i)\log p(x_i,z;\theta)\,dz$$
(where the posterior is evaluated at the current parameters $\theta_i$, and the denominator, constant in $\theta$, can be dropped) we can iterate so that the likelihood keeps increasing. In essence, EM is an iterative algorithm that repeatedly raises a lower bound, thereby pushing the likelihood itself upward.
The EM algorithm therefore proceeds as follows (a generic skeleton is sketched right after this list):
1. Initialize the parameters $\theta_0$.
2. E-step: with the current $\theta_i$ fixed, compute the posterior $p(z\mid x,\theta_i)$ and form $Q(\theta,\theta_i)=\sum_z p(z\mid x,\theta_i)\log p(x,z\mid\theta)$.
3. M-step: update $\theta_{i+1}=\arg\max_\theta Q(\theta,\theta_i)$.
4. Repeat steps 2 and 3 until convergence.
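A minimal, model-agnostic sketch of this loop (the `e_step`, `m_step`, and `log_lik` functions are placeholders to be supplied per model):

```python
def em(x, theta, e_step, m_step, log_lik, tol=1e-6, max_iter=100):
    """Generic EM loop: alternate E and M steps until the log-likelihood stalls."""
    prev = log_lik(x, theta)
    for _ in range(max_iter):
        posterior = e_step(x, theta)      # p(z | x, theta_i)
        theta = m_step(x, posterior)      # argmax_theta Q(theta, theta_i)
        cur = log_lik(x, theta)
        if abs(cur - prev) < tol:         # monotone increase makes this stop
            break
        prev = cur
    return theta
```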
Why does EM approximately carry out maximum likelihood estimation on the observed data? We need to show that the log-likelihood keeps increasing across iterations. Suppose we are at iteration $i$ with parameters $\theta_i$ and want a new estimate $\theta$ that increases $L(\theta)$, i.e. $L(\theta)>L(\theta_i)$. Then:
$$\begin{aligned} L(\theta)-L(\theta_{i}) &=\log \Big(\sum_Z p(x\mid z,\theta)p(z\mid\theta)\Big)-\log p(x\mid\theta_{i}) \\ &=\log \sum_Z p(z\mid x,\theta_{i})\frac{p(x\mid z,\theta)p(z\mid\theta)}{p(z\mid x,\theta_{i})}-\log p(x\mid\theta_{i})\\ &\ge \sum_Z p(z\mid x,\theta_{i})\log\frac{p(x\mid z,\theta)p(z\mid\theta)}{p(z\mid x,\theta_{i})}-\log p(x\mid\theta_{i})\\ &= \sum_Z p(z\mid x,\theta_{i})\log\frac{p(x\mid z,\theta)p(z\mid\theta)}{p(z\mid x,\theta_{i})\,p(x\mid\theta_{i})} \qquad (8) \end{aligned}$$
Let
$$B(\theta,\theta_{i})=L(\theta_{i})+\sum_Z p(z\mid x,\theta_{i})\log\frac{p(x\mid z,\theta)p(z\mid\theta)}{p(z\mid x,\theta_{i})\,p(x\mid\theta_{i})} \qquad (9)$$
Then $L(\theta)\ge B(\theta,\theta_{i})$, so $B(\theta,\theta_{i})$ is a lower bound of $L(\theta)$, and moreover $L(\theta_{i})=B(\theta_{i},\theta_{i})$. To make $L(\theta)$ as large as possible, we solve:
$$\begin{aligned} \arg \max_\theta B(\theta,\theta_{i}) &=\arg \max_\theta \Big[L(\theta_{i})+ \sum_Z p(z\mid x,\theta_{i})\log\frac{p(x\mid z,\theta)p(z\mid\theta)}{p(z\mid x,\theta_{i})\,p(x\mid\theta_{i})}\Big] \\ &= \arg \max_\theta \sum_Z p(z\mid x,\theta_{i})\log \big(p(x\mid z,\theta)p(z\mid\theta)\big) \\ &= \arg \max_\theta \sum_Z p(z\mid x,\theta_{i})\log p(x,z\mid\theta)\\ &= \arg \max_\theta Q(\theta,\theta_i) \qquad (10) \end{aligned}$$
This is exactly one EM iteration, which proves that EM approaches the maximization of the log-likelihood by repeatedly maximizing its lower bound.
We can use the monotone-and-bounded argument to prove that EM converges. First, $P(x\mid\theta)$ is the likelihood of the observed data; we want to show monotonicity, i.e.
$$P(x\mid\theta_{i+1}) \ge P(x\mid \theta_i) \qquad (11)$$
Proof:
$$P(x\mid\theta)=\frac{P(x,z\mid\theta)}{P(z\mid x,\theta)}\iff \log P(x\mid\theta) = \log P(x,z\mid\theta) - \log P(z\mid x,\theta) \qquad (12)$$
Let
$$H(\theta,\theta_i)=\sum_z P(z\mid x,\theta_i)\log P(z\mid x,\theta) \qquad (13)$$
and, by the definition of the Q function,
$$Q(\theta,\theta_i)=\sum_z P(z\mid x,\theta_i)\log P(x,z\mid\theta)$$
Multiplying (12) by $P(z\mid x,\theta_i)$ and summing over $z$ (the left-hand side does not depend on $z$, and the weights sum to 1), the log-likelihood can be written as
$$\log P(x\mid\theta) =Q(\theta,\theta_i)-H(\theta,\theta_i) \qquad (14)$$
Evaluating (14) at $\theta_{i+1}$ and $\theta_i$ and subtracting:
$$\log P(x\mid\theta_{i+1})- \log P(x\mid\theta_i) =[Q(\theta_{i+1},\theta_i)-Q(\theta_i,\theta_i)]-[H(\theta_{i+1},\theta_i)-H(\theta_i,\theta_i)] \qquad (15)$$
Since $\theta_{i+1}$ maximizes $Q(\theta,\theta_i)$, we have
$$Q(\theta_{i+1},\theta_i)-Q(\theta_i,\theta_i) \ge 0 \qquad (16)$$
and by (13) together with Jensen's inequality,
$$\begin{aligned} H(\theta_{i+1},\theta_i)-H(\theta_i,\theta_i) &=\sum_z\left(\log \frac{P(z\mid x,\theta_{i+1})}{P(z\mid x,\theta_i)}\right)P(z\mid x,\theta_i) \\ &\le \log \left(\sum_z \frac{P(z\mid x,\theta_{i+1})}{P(z\mid x,\theta_i)}P(z\mid x,\theta_i)\right) \\ &=\log \sum_z P(z\mid x,\theta_{i+1})=\log 1=0 \qquad (17) \end{aligned}$$
Hence
$$\log P(x\mid\theta_{i+1})- \log P(x\mid\theta_i) \ge 0 \qquad (18)$$
Moreover, since $P(x\mid\theta_i) \le 1$ the sequence is bounded above, so $L(\theta_i)=\log P(x\mid\theta_i)$ must converge.
Because EM solves the problem iteratively, it is guaranteed to converge to a stationary point, but not necessarily to the global maximum; it is a local-optimum algorithm. Of course, if the objective $B(\theta,\theta_i)$ is concave (so the maximization problem is convex), EM converges to the global maximum, just as gradient-based iterative methods do in that case. The choice of initial values therefore matters a great deal; a common practice is to try several initializations, as sketched below, and keep the best result.
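For example, with scikit-learn's `GaussianMixture` (shown here only as an illustration of the multiple-restart idea, on invented data), `n_init` runs EM from several random initializations and keeps the run with the highest likelihood:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1-D data from two groups, reshaped to (n_samples, n_features).
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(160, 5, 500), rng.normal(175, 6, 500)]).reshape(-1, 1)

# n_init=10: run EM from 10 random starts, keep the highest-likelihood fit.
gmm = GaussianMixture(n_components=2, n_init=10, random_state=0).fit(x)
print(gmm.weights_, gmm.means_.ravel())
```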
There is another way to arrive at EM, via the ELBO. Starting again from (12) and introducing an arbitrary distribution $q(z)$:
$$\begin{aligned} \log P(x\mid\theta) &=\log P(x,z\mid\theta)-\log P(z\mid x,\theta) \\ &=\log \frac{P(x,z\mid\theta)}{q(z)}-\log \frac{P(z\mid x,\theta)}{q(z)} \end{aligned}$$
Taking the expectation of both sides with respect to $q(z)$ (i.e., multiplying by $q(z)$ and integrating over $z$), the left-hand side gives
$$\text{left}=\int \log P(x\mid\theta)\, q(z)\,dz=\log P(x\mid\theta)\int q(z)\,dz= \log P(x\mid\theta)$$
while the right-hand side gives
$$\text{right}=\int_z q(z)\log \frac{P(x,z\mid\theta)}{q(z)}\,dz-\int_z q(z)\log \frac{P(z\mid x,\theta)}{q(z)}\,dz$$
Recall the definition of relative entropy (KL divergence): for two distributions $p(x)$ and $q(x)$, the relative entropy of $p$ with respect to $q$ is
$$D(p\,\|\,q)=\sum_{i=1}^n p(x_i)\log \frac{p(x_i)}{q(x_i)}$$
One can see that the second term on the right-hand side is exactly the relative entropy of $q(z)$ with respect to $P(z\mid x,\theta)$, so
$$\log P(x\mid\theta) =\int_z q(z)\log \frac{P(x,z\mid\theta)}{q(z)}\,dz+D\big(q(z)\,\|\,p(z\mid x,\theta)\big)$$
Relative entropy has two properties: it is not symmetric, i.e. $D(p\|q)\ne D(q\|p)$, and $D(p\|q) \ge 0$, with equality when $p$ and $q$ are the same distribution. This means the first term is a lower bound of $\log p(x\mid\theta)$, called the ELBO (evidence lower bound). We want to raise the ELBO so that $\log p(x\mid\theta)$ increases:
$$\begin{aligned} \hat \theta &=\arg \max_\theta \int_z q(z)\log \frac{P(x,z\mid\theta)}{q(z)}\,dz \\ &=\arg \max_\theta \int_z P(z\mid x,\theta_i)\log \frac{P(x,z\mid\theta)}{P(z\mid x,\theta_i)}\,dz \\ &=\arg \max_\theta \int_z P(z\mid x,\theta_i)\log P(x,z\mid\theta)\,dz \end{aligned}$$
Here $q(z)$ is chosen as $P(z\mid x,\theta_i)$, which makes the KL term vanish at the current parameters; this recovers exactly the EM update for $\theta$.
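As a toy check of the decomposition (a discrete example with invented numbers, where integrals become sums): the ELBO plus the KL term should reproduce $\log P(x\mid\theta)$ for any choice of $q$.

```python
import numpy as np

# Toy joint P(x, z) over 3 latent states for a fixed observation x.
joint = np.array([0.10, 0.25, 0.15])        # P(x, z) for z = 0, 1, 2
log_evidence = np.log(joint.sum())          # log P(x)
posterior = joint / joint.sum()             # P(z | x)

q = np.array([0.2, 0.5, 0.3])               # any distribution over z
elbo = np.sum(q * np.log(joint / q))        # ELBO term
kl = np.sum(q * np.log(q / posterior))      # D(q || P(z|x))
print(log_evidence, elbo + kl)              # identical up to rounding
```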
Let us now use the EM algorithm to solve the GMM. The model is:
$$p(x\mid\theta)=\sum_{k=1}^K p_k\,N(x\mid\mu_k,\Sigma_k) \qquad (19)$$
The joint probability is
$$p(x,z)=p(z)\,p(x\mid z)=p_z\cdot N(x\mid\mu_z,\Sigma_z)$$
and the posterior is
$$p(z\mid x)=\frac{p(x,z)}{p(x)}=\frac{p_z\cdot N(x\mid\mu_z,\Sigma_z)}{\sum_{k=1}^K p_k\cdot N(x\mid\mu_k,\Sigma_k)}$$
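In code, this E-step posterior (the "responsibilities") is a small vectorized computation; a sketch for the 1-D case with made-up parameters:

```python
import numpy as np
from scipy.stats import norm

def responsibilities(x, p, mu, sigma):
    """E-step: p(z=k | x_i) for every sample i and component k, shape (N, K)."""
    joint = p * norm.pdf(x[:, None], mu, sigma)   # p_k * N(x_i | mu_k, sigma_k)
    return joint / joint.sum(axis=1, keepdims=True)

x = np.array([158.0, 171.0, 183.0])
gamma = responsibilities(x, np.array([0.5, 0.5]), np.array([160.0, 178.0]),
                         np.array([5.0, 6.0]))
print(gamma)   # each row sums to 1
```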
In the EM algorithm, we have
$$\begin{aligned} Q(\theta,\theta_t) &=\sum_z\log P(x,z\mid\theta)\,P(z\mid x,\theta_t)\\ &=\sum_z\Big(\log \prod_{i=1}^N P(x_i,z_i\mid\theta)\Big)\prod_{i=1}^N P(z_i\mid x_i,\theta_t)\\ &=\sum_{z_1,\dots,z_N}\Big(\sum_{i=1}^N\log P(x_i,z_i\mid\theta)\Big)\prod_{i=1}^N P(z_i\mid x_i,\theta_t)\\ &=\sum_{z_1,\dots,z_N}\big(\log P(x_1,z_1\mid\theta)+\log P(x_2,z_2\mid\theta)+\dots+\log P(x_N,z_N\mid\theta)\big)\prod_{i=1}^N P(z_i\mid x_i,\theta_t) \end{aligned}$$
Consider one of these terms:
$$\begin{aligned} &\sum_{z_1,\dots,z_N}\log P(x_1,z_1\mid\theta)\prod_{i=1}^N P(z_i\mid x_i,\theta_t)\\ &=\sum_{z_1,\dots,z_N}\log P(x_1,z_1\mid\theta)\,P(z_1\mid x_1,\theta_t)\prod_{i=2}^N P(z_i\mid x_i,\theta_t) \\ &=\sum_{z_1}\log P(x_1,z_1\mid\theta)\,P(z_1\mid x_1,\theta_t)\sum_{z_2,\dots,z_N}\prod_{i=2}^N P(z_i\mid x_i,\theta_t) \\ &=\sum_{z_1}\log P(x_1,z_1\mid\theta)\,P(z_1\mid x_1,\theta_t)\sum_{z_2}P(z_2\mid x_2,\theta_t)\cdots\sum_{z_N}P(z_N\mid x_N,\theta_t) \\ &=\sum_{z_1}\log P(x_1,z_1\mid\theta)\,P(z_1\mid x_1,\theta_t) \end{aligned}$$
since each trailing sum equals 1.
Therefore
$$\begin{aligned} Q(\theta,\theta_t) &=\sum_{z_1,\dots,z_N}\big(\log P(x_1,z_1\mid\theta)+\dots+\log P(x_N,z_N\mid\theta)\big)\prod_{i=1}^N P(z_i\mid x_i,\theta_t)\\ &=\sum_{z_1}\log P(x_1,z_1\mid\theta)\,P(z_1\mid x_1,\theta_t)+\dots+\sum_{z_N}\log P(x_N,z_N\mid\theta)\,P(z_N\mid x_N,\theta_t) \\ &=\sum_{i=1}^N\sum_{z_i}\log P(x_i,z_i\mid\theta)\,P(z_i\mid x_i,\theta_t) \\ &=\sum_{i=1}^N\sum_{z_i}\log\big(p_{z_i}\cdot N(x_i\mid\mu_{z_i},\Sigma_{z_i})\big)\cdot P(z_i\mid x_i,\theta_t)\\ &=\sum_{k=1}^K\sum_{i=1}^N\log\big(p_k\cdot N(x_i\mid\mu_{k},\Sigma_{k})\big)\cdot P(z_i=c_k\mid x_i,\theta_t)\\ &=\sum_{k=1}^K\sum_{i=1}^N\big(\log p_k+\log N(x_i\mid\mu_{k},\Sigma_{k})\big)\cdot P(z_i=c_k\mid x_i,\theta_t) \end{aligned}$$
Next we solve for $\theta_{t+1}=(p_k,\mu_k,\Sigma_k)$:
$$p_k^{t+1}=\arg \max_{p_k}\sum_{k=1}^K\sum_{i=1}^N\log p_k\cdot P(z_i=c_k\mid x_i,\theta_t),\quad \text{s.t.}\ \sum_{k=1}^K p_k=1$$
$$\mu_k^{t+1}=\arg \max_{\mu_k}\sum_{k=1}^K\sum_{i=1}^N\log N(x_i\mid\mu_{k},\Sigma_{k})\cdot P(z_i=c_k\mid x_i,\theta_t)$$
$$\Sigma_k^{t+1}=\arg \max_{\Sigma_k}\sum_{k=1}^K\sum_{i=1}^N\log N(x_i\mid\mu_{k},\Sigma_{k})\cdot P(z_i=c_k\mid x_i,\theta_t)$$
Expanding these and setting derivatives to zero yields the optima; for $p_k^{t+1}$ the constraint is handled with a Lagrange multiplier. Writing $\gamma_{ik}=P(z_i=c_k\mid x_i,\theta_t)$, the resulting updates are the familiar weighted averages: $p_k^{t+1}=\frac{1}{N}\sum_i\gamma_{ik}$, $\mu_k^{t+1}=\frac{\sum_i\gamma_{ik}x_i}{\sum_i\gamma_{ik}}$, and $\Sigma_k^{t+1}=\frac{\sum_i\gamma_{ik}(x_i-\mu_k^{t+1})(x_i-\mu_k^{t+1})^T}{\sum_i\gamma_{ik}}$. Repeat this E-step and M-step until the log-likelihood no longer changes appreciably.
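Putting the pieces together, here is a compact sketch of EM for a 1-D GMM (synthetic data invented for illustration; the M-step uses the weighted-average updates above):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture; returns weights, means, stds."""
    rng = np.random.default_rng(seed)
    N = len(x)
    p = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)   # initialize means at data points
    sigma = np.full(K, x.std())

    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] = P(z_i = c_k | x_i, theta_t)
        joint = p * norm.pdf(x[:, None], mu, sigma)
        gamma = joint / joint.sum(axis=1, keepdims=True)

        # M-step: weighted-average updates derived above
        Nk = gamma.sum(axis=0)
        p = Nk / N
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return p, mu, sigma

# Synthetic two-province heights, then recover the mixture.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(162, 5, 600), rng.normal(176, 6, 400)])
print(em_gmm_1d(x, K=2))
```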