The EM Algorithm and Its Applications
Consider a Gaussian distribution $p(\boldsymbol{x}\mid\theta)$ with $\theta=(\mu,\Sigma)$. Each sample in the sample set $X=\{x_1,\dots,x_N\}$ is drawn independently from this Gaussian, so the i.i.d. assumption holds.
The probability of drawing this sample set is therefore
$$p(X\mid\theta)=\prod_{i=1}^N p(x_i\mid\theta)$$
We want to estimate the parameter vector $\theta$. Since the sample set $X$ is now known, $p(X\mid\theta)$ can be viewed as a function of $\theta$, called the likelihood function $L(\theta\mid X)$ of the sample set $X$:
$$L(\theta\mid X)=\prod_{i=1}^N p(x_i\mid\theta)$$
Maximum likelihood estimation maximizes this function: it finds the parameter vector $\theta$ that maximizes $p(X\mid\theta)$, in other words the parameters under which the observed sample set $X$ is most probable.
For computational convenience we usually work with the logarithm of the likelihood, which turns the product into a sum; this is the "log-likelihood". Since a logarithm with base greater than 1 is monotonically increasing, the parameter vector that maximizes the log-likelihood also maximizes the likelihood itself. The log-likelihood is
$$L(\theta\mid X)=\log\Big(\prod_{i=1}^N p(x_i\mid\theta)\Big)=\sum_{i=1}^N\log p(x_i\mid\theta)$$
and the parameter estimate is $\hat{\theta}=\arg\underset{\theta}{\max}\,L(\theta\mid X)$.
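For a single Gaussian this maximization has a closed-form answer: the sample mean and the (biased, $1/N$) sample variance. A minimal 1-D sketch in plain Python (function names are our own):

```python
import math

def gaussian_log_likelihood(xs, mu, sigma2):
    """Log-likelihood sum_i log p(x_i | mu, sigma^2) of a 1-D Gaussian."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (x - mu) ** 2 / (2 * sigma2)
               for x in xs)

def mle_gaussian(xs):
    """Closed-form MLE: sample mean and (biased, 1/N) sample variance."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, sigma2

xs = [1.0, 2.0, 3.0, 4.0]
mu_hat, sigma2_hat = mle_gaussian(xs)   # mu_hat = 2.5, sigma2_hat = 1.25
```

The closed form exists because the log turns the product of exponentials into a quadratic in $\mu$; this is exactly what breaks down for mixtures.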
A Gaussian mixture model is a probability distribution formed by a weighted combination of $k$ Gaussian components:
$$p(\mathbf{x}\mid\Theta)=\sum_{l=1}^{k}\alpha_l\, p_l(\mathbf{x}\mid\theta_l)$$
Here the parameters are $\Theta=(\alpha_1,\dots,\alpha_k,\theta_1,\dots,\theta_k)$ with $\sum_{l=1}^{k}\alpha_l=1$, where $\alpha_l$ is the weight of the $l$-th Gaussian component in the mixture. Now suppose the observed sample set $X=(x_1,\dots,x_N)$ comes from this mixture model under the i.i.d. assumption. To estimate the mixture parameters $\Theta$, we write out the log-likelihood of these $N$ data points:
$$\begin{aligned} L(\Theta\mid X) &= \log p(X\mid\Theta) = \log\Big(\prod_{i=1}^{N}p(x_i\mid\Theta)\Big) \\ &= \sum_{i=1}^{N}\log p(x_i\mid\Theta) = \sum_{i=1}^{N}\log\Big(\sum_{l=1}^{k}\alpha_l\, p_l(x_i\mid\theta_l)\Big) \end{aligned}$$
Our goal is to estimate $\Theta$ by maximizing this likelihood.
Notice that the logarithm now contains a sum. If we try to solve this problem the same way we obtained the maximum likelihood estimate for a single Gaussian, no closed-form solution exists, so we need a better approach.
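To see the difficulty concretely, here is the mixture log-likelihood for 1-D components in plain Python (names are our own): the sum over components sits *inside* the log, so the log no longer separates into per-parameter terms.

```python
import math

def gauss_pdf(x, mu, sigma2):
    """Density of a 1-D Gaussian N(mu, sigma2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gmm_log_likelihood(xs, alphas, mus, sigma2s):
    """L(Theta | X) = sum_i log( sum_l alpha_l * p_l(x_i | theta_l) )."""
    return sum(math.log(sum(a * gauss_pdf(x, m, s2)
                            for a, m, s2 in zip(alphas, mus, sigma2s)))
               for x in xs)

# Two well-separated components, each claiming one data point.
ll = gmm_log_likelihood([0.0, 5.0], [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
```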
Basic Procedure
Suppose a sample set $X$ is observed from some distribution with parameters $\Theta$; we call $X$ the incomplete data. We introduce a set of random variables $Z$ that cannot be observed directly, called latent variables. $X$ and $Z$ together are called the complete data, with joint distribution
$$p(X,Z\mid\Theta)=p(Z\mid X,\Theta)\,p(X\mid\Theta)$$
We define a new likelihood, the complete-data likelihood:
$$L(\Theta\mid X,Z)=p(X,Z\mid\Theta)$$
In principle we could estimate $\Theta$ by maximizing this likelihood. However, $Z$ is latent and its values are unknown, so the expression cannot be maximized directly; instead we take the expectation of the complete-data log-likelihood with respect to $Z$ and maximize that, which in turn increases the marginal likelihood of the observed data. We define this expectation as the $Q$ function:
$$\begin{aligned} Q(\Theta,\Theta^{g}) &= \mathbb{E}_Z\Big[\log p(X,Z\mid\Theta)\mid X,\Theta^g\Big] \\ &= \sum_Z \log\big(p(X,Z\mid\Theta)\big)\,p(Z\mid X,\Theta^g) \end{aligned}$$
Here $\Theta^g$ is the current parameter estimate and $X$ is the observed data; both are treated as constants. $\Theta$ is the parameter we maximize over, and $Z$ is a random variable with distribution $p(Z\mid X,\Theta^g)$: the conditional (posterior) distribution of the unobserved variables, which depends on the observed data $X$ and the current model parameters $\Theta^g$.
The E-step of the EM algorithm is exactly the computation of this expectation.
The M-step then maximizes this expectation, yielding the parameters $\Theta$ that maximize it:
$$\begin{aligned} \Theta^{g+1} &= \arg\underset{\Theta}{\max}\,Q(\Theta,\Theta^{g}) \\ &= \arg\underset{\Theta}{\max}\sum_Z \log\big(p(X,Z\mid\Theta)\big)\,p(Z\mid X,\Theta^g) \end{aligned}$$
The E-step and M-step are repeated until convergence.
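The alternation above can be sketched as a generic driver loop (a sketch with our own names; `e_step` and `m_step` are supplied per model):

```python
def em(xs, theta0, e_step, m_step, tol=1e-9, max_iter=200):
    """Generic EM driver: alternate the E-step (posterior over Z under the
    current parameters) and the M-step (maximize Q) until the parameter
    vector stops moving."""
    theta = theta0
    for _ in range(max_iter):
        posterior = e_step(xs, theta)      # E-step: p(Z | X, theta^g)
        theta_new = m_step(xs, posterior)  # M-step: argmax_Theta Q(Theta, theta^g)
        if all(abs(a - b) < tol for a, b in zip(theta_new, theta)):
            return theta_new
        theta = theta_new
    return theta

# Degenerate illustration: a one-component "mixture", where the M-step is
# just the sample mean; EM reaches its fixed point after one update.
theta = em([1.0, 2.0, 3.0, 4.0],
           (0.0,),
           e_step=lambda xs, th: None,
           m_step=lambda xs, post: (sum(xs) / len(xs),))
```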
Convergence Proof (omitted)
Returning to the parameter estimation problem for the Gaussian mixture model, we treat the sample set $X$ as incomplete data and introduce unobservable random variables $Z=\{z_i\}_{i=1}^{N}$ with $z_i\in\{1,\dots,k\}$, indicating which Gaussian component generated each data point (think of it as a class label that we cannot observe directly). For example, $z_i=k$ means the $i$-th sample was generated by the $k$-th component of the mixture. The distribution of the complete data $(X,Z)$ is then
$$\begin{aligned} p(X,Z\mid\Theta) &= \prod_{i=1}^{N}p(x_i,z_i\mid\Theta) = \prod_{i=1}^{N}p(z_i\mid\Theta)\,p(x_i\mid z_i,\Theta) \\ &= \prod_{i=1}^{N}\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i}) \end{aligned}$$
In the mixture model, $p(z_i\mid\Theta)$ is the prior probability of the $z_i$-th component, i.e. its weight $\alpha_{z_i}$. Given which Gaussian a sample comes from, $p(x_i\mid z_i,\Theta)$ is the density of that component, $p_{z_i}(x_i\mid\theta_{z_i})$.
By Bayes' rule we also obtain the conditional distribution of the latent variables:
$$\begin{aligned} p(Z\mid X,\Theta) &= \prod_{i=1}^N p(z_i\mid x_i,\Theta) = \prod_{i=1}^N\frac{p(x_i,z_i\mid\Theta)}{p(x_i\mid\Theta)} \\ &= \prod_{i=1}^N\frac{\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i})}{\sum_{z_i=1}^k\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i})} \end{aligned}$$
We can therefore write out the corresponding $Q$ function:
$$\begin{aligned} Q(\Theta,\Theta^{g}) &= \sum_Z\log\big(p(X,Z\mid\Theta)\big)\,p(Z\mid X,\Theta^g) \\ &= \sum_Z\log\Big(\prod_{i=1}^{N}\alpha_{z_i}p_{z_i}(x_i\mid\theta_{z_i})\Big)\prod_{i=1}^Np(z_i\mid x_i,\Theta^g) \\ &= \sum_Z\sum_{i=1}^{N}\log\Big(\alpha_{z_i}p_{z_i}(x_i\mid\theta_{z_i})\Big)\prod_{i=1}^Np(z_i\mid x_i,\Theta^g) \\ &= \sum_{z_1=1}^k\sum_{z_2=1}^k\cdots\sum_{z_N=1}^k\sum_{i=1}^{N}\log\Big(\alpha_{z_i}p_{z_i}(x_i\mid\theta_{z_i})\Big)\prod_{i=1}^Np(z_i\mid x_i,\Theta^g) \\ &= \underbrace{\sum_{z_1=1}^k\cdots\sum_{z_N=1}^k\log\Big(\alpha_{z_1}p_{z_1}(x_1\mid\theta_{z_1})\Big)\prod_{i=1}^Np(z_i\mid x_i,\Theta^g)}_{\mathcal{A}} \\ &\quad+ \underbrace{\sum_{z_1=1}^k\cdots\sum_{z_N=1}^k\sum_{i=2}^{N}\log\Big(\alpha_{z_i}p_{z_i}(x_i\mid\theta_{z_i})\Big)\prod_{i=1}^Np(z_i\mid x_i,\Theta^g)}_{\mathcal{B}} \end{aligned}$$
For the first term, marginalizing out $z_2,\dots,z_N$ gives
$$\begin{aligned} \mathcal{A} &= \sum_{z_1=1}^k\cdots\sum_{z_N=1}^k\log\Big(\alpha_{z_1}p_{z_1}(x_1\mid\theta_{z_1})\Big)\prod_{i=1}^Np(z_i\mid x_i,\Theta^g) \\ &= \sum_{z_1=1}^k\log\Big(\alpha_{z_1}p_{z_1}(x_1\mid\theta_{z_1})\Big)p(z_1\mid x_1,\Theta^g)\underbrace{\sum_{z_2=1}^k\cdots\sum_{z_N=1}^k\prod_{i=2}^Np(z_i\mid x_i,\Theta^g)}_{=\,1} \\ &= \sum_{z_1=1}^k\log\Big(\alpha_{z_1}p_{z_1}(x_1\mid\theta_{z_1})\Big)p(z_1\mid x_1,\Theta^g) \end{aligned}$$
The term $\mathcal{B}$ decomposes by the same trick, so we omit the details.
We also write $l$ in place of $z_i$ to simplify notation. The $Q$ function then reduces to
$$\begin{aligned} Q(\Theta,\Theta^{g}) &= \sum_{i=1}^N\sum_{z_i=1}^k\log\Big(\alpha_{z_i}p_{z_i}(x_i\mid\theta_{z_i})\Big)p(z_i\mid x_i,\Theta^g) \\ &= \sum_{i=1}^N\sum_{l=1}^k\log\Big(\alpha_{l}p_{l}(x_i\mid\theta_{l})\Big)p(l\mid x_i,\Theta^g) \\ &= \sum_{l=1}^k\sum_{i=1}^N\log\Big(\alpha_{l}p_{l}(x_i\mid\theta_{l})\Big)p(l\mid x_i,\Theta^g) \\ &= \sum_{l=1}^k\sum_{i=1}^N\log(\alpha_l)\,p(l\mid x_i,\Theta^g)+\sum_{l=1}^k\sum_{i=1}^N\log\Big(p_l(x_i\mid\theta_{l})\Big)p(l\mid x_i,\Theta^g) \end{aligned}$$
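The collapse of the sum over all $k^N$ assignments $Z$ into a double sum over data points and components can be verified numerically on a tiny instance ($N=2$, $k=2$; a sketch with our own names):

```python
import math
from itertools import product

xs = [0.0, 4.0]
alphas, mus, sigma2s = [0.4, 0.6], [0.0, 4.0], [1.0, 1.0]
k, N = 2, len(xs)

def pdf(x, m, s2):
    return math.exp(-(x - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def resp(i, l):
    """Posterior responsibility p(l | x_i, Theta^g) (here Theta^g = Theta)."""
    num = alphas[l] * pdf(xs[i], mus[l], sigma2s[l])
    den = sum(alphas[m] * pdf(xs[i], mus[m], sigma2s[m]) for m in range(k))
    return num / den

# Naive Q: explicit sum over all k^N assignments z = (z_1, ..., z_N).
q_naive = 0.0
for z in product(range(k), repeat=N):
    logp = sum(math.log(alphas[z[i]] * pdf(xs[i], mus[z[i]], sigma2s[z[i]]))
               for i in range(N))
    weight = 1.0
    for i in range(N):
        weight *= resp(i, z[i])
    q_naive += logp * weight

# Simplified Q: double sum over data points i and components l.
q_simple = sum(math.log(alphas[l] * pdf(xs[i], mus[l], sigma2s[l])) * resp(i, l)
               for i in range(N) for l in range(k))
```

The two quantities agree to floating-point precision, which is exactly what the $\mathcal{A}+\mathcal{B}$ decomposition asserts.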
This lets us maximize the terms containing the parameters $\alpha_l$ and the terms containing the parameters $\theta_l$ separately to obtain their respective estimates.
1. Estimating the parameters $\alpha_l$
This can be written as the following constrained optimization problem:
$$\begin{aligned} \underset{\alpha_l}{\max}&\quad \sum_{l=1}^k\sum_{i=1}^N\log(\alpha_l)\,p(l\mid x_i,\Theta^g) \\ \text{s.t.}&\quad \sum_{l=1}^k\alpha_l=1 \end{aligned}$$
Introducing a Lagrange multiplier $\lambda$, we build the Lagrangian
$$\begin{aligned} \mathcal{L}(\alpha_1,\dots,\alpha_k,\lambda) &= \sum_{l=1}^k\sum_{i=1}^N\log(\alpha_l)\,p(l\mid x_i,\Theta^g)-\lambda\Big(\sum_{l=1}^k\alpha_l-1\Big) \\ &= \sum_{l=1}^k\log(\alpha_l)\sum_{i=1}^Np(l\mid x_i,\Theta^g)-\lambda\Big(\sum_{l=1}^k\alpha_l-1\Big) \end{aligned}$$
Setting the partial derivative with respect to $\alpha_l$ to $0$,
$$\frac{\partial\mathcal{L}}{\partial\alpha_l}=\frac{1}{\alpha_l}\sum_{i=1}^{N}p(l\mid x_i,\Theta^g)-\lambda=0$$
gives
$$\alpha_l=\frac{1}{\lambda}\sum_{i=1}^{N}p(l\mid x_i,\Theta^g)$$
Substituting back into the constraint:
$$1-\frac{1}{\lambda}\sum_{l=1}^{k}\sum_{i=1}^{N}p(l\mid x_i,\Theta^g)=0$$
In the earlier derivation we obtained the conditional distribution of the latent variable $z$:
$$p(z_i\mid x_i,\Theta)=\frac{p(x_i,z_i\mid\Theta)}{p(x_i\mid\Theta)}=\frac{\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i})}{\sum_{z_i=1}^k\alpha_{z_i}\,p_{z_i}(x_i\mid\theta_{z_i})}$$
Again writing $l$ in place of $z_i$ to simplify notation:
$$p(l\mid x_i,\Theta)=\frac{p(x_i,l\mid\Theta)}{p(x_i\mid\Theta)}=\frac{\alpha_{l}\,p_{l}(x_i\mid\theta_{l})}{\sum_{l=1}^k\alpha_{l}\,p_{l}(x_i\mid\theta_{l})}$$
Substituting this back into the previous equation:
$$\begin{aligned} 1-\frac{1}{\lambda}\sum_{l=1}^{k}\sum_{i=1}^{N}\frac{\alpha_{l}p_{l}(x_i\mid\theta_{l})}{\sum_{l=1}^k\alpha_{l}p_{l}(x_i\mid\theta_{l})}&=0 \\ 1-\frac{1}{\lambda}\sum_{i=1}^{N}\sum_{l=1}^{k}\frac{\alpha_{l}p_{l}(x_i\mid\theta_{l})}{\sum_{l=1}^k\alpha_{l}p_{l}(x_i\mid\theta_{l})}&=0 \\ 1-\frac{1}{\lambda}\sum_{i=1}^{N}1&=0 \\ 1-\frac{N}{\lambda}&=0 \\ \lambda&=N \end{aligned}$$
Hence
$$\alpha_l=\frac{1}{N}\sum_{i=1}^{N}p(l\mid x_i,\Theta^g)$$
Before estimating the remaining parameters, let us review some linear algebra facts we will need shortly.
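The weight update just derived is simply the average responsibility per component, which is easy to express in code (a 1-D sketch with our own names):

```python
import math

def gauss_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def responsibilities(xs, alphas, mus, sigma2s):
    """E-step quantities p(l | x_i, Theta^g): one row per data point."""
    rows = []
    for x in xs:
        joint = [a * gauss_pdf(x, m, s2) for a, m, s2 in zip(alphas, mus, sigma2s)]
        total = sum(joint)
        rows.append([j / total for j in joint])
    return rows

def update_alphas(resp):
    """M-step update: alpha_l = (1/N) * sum_i p(l | x_i, Theta^g)."""
    n, k = len(resp), len(resp[0])
    return [sum(row[l] for row in resp) / n for l in range(k)]

# Two points near 0, one near 5: the weights become roughly [2/3, 1/3].
resp = responsibilities([0.0, 0.1, 5.0], [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
new_alphas = update_alphas(resp)
```

Because each row of responsibilities sums to 1, the updated weights automatically satisfy the constraint $\sum_l \alpha_l = 1$.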
[Background: Matrix Algebra]
The trace of a matrix is the sum of its main diagonal entries, and it has the following properties:
$$\operatorname{tr}(A+B)=\operatorname{tr}(A)+\operatorname{tr}(B)$$
$$\operatorname{tr}(AB)=\operatorname{tr}(BA)$$
$$\sum_i x_i^TAx_i=\operatorname{tr}(AB)\quad\text{where}\quad B=\sum_i x_ix_i^T$$
Writing $a_{i,j}$ for the entry in row $i$, column $j$ of matrix $A$, we also list some matrix derivative formulas:
$$\frac{\partial\,x^TAx}{\partial x}=(A+A^T)x$$
$$\frac{\partial\log|A|}{\partial A}=2A^{-1}-\operatorname{diag}(A^{-1})\quad\text{(for symmetric }A\text{)}$$
$$\frac{\partial\operatorname{tr}(AB)}{\partial A}=B+B^T-\operatorname{diag}(B)\quad\text{(for symmetric }A\text{)}$$
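The trace identities are easy to verify numerically (a 2×2 example with hand-rolled matrix helpers; the derivative formulas are harder to test directly and are taken as given):

```python
def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [1.0, 5.0]]
xs = [[1.0, 2.0], [3.0, -1.0]]   # two sample vectors x_i

# tr(A + B) = tr(A) + tr(B)
tr_sum = trace([[A[i][j] + B[i][j] for j in range(2)] for i in range(2)])

# tr(AB) = tr(BA)
tr_ab = trace(matmul(A, B))
tr_ba = trace(matmul(B, A))

# sum_i x_i^T A x_i = tr(A * C) with C = sum_i x_i x_i^T
quad = sum(sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))
           for x in xs)
outer = [[sum(x[i] * x[j] for x in xs) for j in range(2)] for i in range(2)]
tr_aouter = trace(matmul(A, outer))
```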
For a $d$-dimensional Gaussian, the parameters are $\theta=(\mu,\Sigma)$ and
$$p_l(x\mid\mu_l,\Sigma_l)=\frac{1}{(2\pi)^{d/2}|\Sigma_l|^{1/2}}\exp\Big[-\frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l)\Big]$$
2. Estimating the parameters $\mu_l$
$$\begin{aligned} &\quad\sum_{l=1}^k\sum_{i=1}^N\log\Big(p_l(x_i\mid\mu_l,\Sigma_l)\Big)p(l\mid x_i,\Theta^g) \\ &= \sum_{l=1}^k\sum_{i=1}^N\Big(-\frac{d}{2}\log(2\pi)-\frac{1}{2}\log|\Sigma_l|-\frac{1}{2}(x_i-\mu_l)^T\Sigma_l^{-1}(x_i-\mu_l)\Big)p(l\mid x_i,\Theta^g) \end{aligned}$$
Dropping the constant term (its derivative is zero), this simplifies to
$$\sum_{l=1}^k\sum_{i=1}^N\Big(-\frac{1}{2}\log|\Sigma_l|-\frac{1}{2}(x_i-\mu_l)^T\Sigma_l^{-1}(x_i-\mu_l)\Big)p(l\mid x_i,\Theta^g)$$
Differentiating with respect to a particular $\mu_l$ (only the $l$-th term of the outer sum depends on it) gives
$$\sum_{i=1}^N\Bigg(-\frac{1}{2}\Big(\Sigma_l^{-1}+(\Sigma_l^{-1})^T\Big)\big(x_i-\mu_l\big)\big(-1\big)\Bigg)p(l\mid x_i,\Theta^g)$$
Since the covariance matrix $\Sigma_l$ is symmetric, $\frac{1}{2}\big(\Sigma_l^{-1}+(\Sigma_l^{-1})^T\big)=\Sigma_l^{-1}$, so the expression further simplifies to
$$\sum_{i=1}^N\Sigma_l^{-1}(x_i-\mu_l)\,p(l\mid x_i,\Theta^g)$$
Setting it to $0$ and multiplying through by $\Sigma_l$:
$$\begin{aligned} \sum_{i=1}^N\mu_l\,p(l\mid x_i,\Theta^g)&=\sum_{i=1}^N x_i\,p(l\mid x_i,\Theta^g) \\ \mu_l\sum_{i=1}^Np(l\mid x_i,\Theta^g)&=\sum_{i=1}^N x_i\,p(l\mid x_i,\Theta^g) \\ \mu_l&=\frac{\sum_{i=1}^N x_i\,p(l\mid x_i,\Theta^g)}{\sum_{i=1}^Np(l\mid x_i,\Theta^g)} \end{aligned}$$
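Putting the two derived updates together gives one full EM iteration for a 1-D mixture (a sketch with our own names; the component variances are held fixed here, since their update is not derived in this excerpt):

```python
import math

def gauss_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def em_step(xs, alphas, mus, sigma2s):
    """One EM iteration. E-step: responsibilities p(l | x_i, Theta^g).
    M-step: alpha_l = (1/N) sum_i p(l|x_i)  and  mu_l = responsibility-weighted
    mean of the x_i, exactly the two closed-form updates derived above."""
    n, k = len(xs), len(alphas)
    # E-step
    resp = []
    for x in xs:
        joint = [a * gauss_pdf(x, m, s2) for a, m, s2 in zip(alphas, mus, sigma2s)]
        tot = sum(joint)
        resp.append([j / tot for j in joint])
    # M-step
    nl = [sum(resp[i][l] for i in range(n)) for l in range(k)]
    new_alphas = [nl[l] / n for l in range(k)]
    new_mus = [sum(resp[i][l] * xs[i] for i in range(n)) / nl[l] for l in range(k)]
    return new_alphas, new_mus

# Three points near 0 and two near 5, with unit variances held fixed.
xs = [-0.2, 0.0, 0.3, 4.8, 5.1]
alphas, mus = [0.5, 0.5], [0.0, 5.0]
for _ in range(20):
    alphas, mus = em_step(xs, alphas, mus, [1.0, 1.0])
```

After a few iterations the weights approach $[3/5, 2/5]$ and the means approach the two cluster averages, matching the closed-form updates $\alpha_l$ and $\mu_l$ above.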