Machine Learning Notes on the EM Algorithm (2): Derivation of the EM Algorithm Formula

  • Introduction
    • Review: The EM Algorithm Formula
    • Derivation

Introduction

The previous article introduced latent variables and the EM algorithm, and, taking the EM algorithm formula as given, proved that as the iterations proceed, each newly obtained parameter $\theta^{(t+1)}$ is at least as good as the earlier iterates $\theta^{(t)}, \theta^{(t-1)}, \cdots$, so that the procedure reaches at least a local optimum. This article presents the derivation of the EM algorithm formula.

Review: The EM Algorithm Formula

In essence, the EM algorithm solves for the optimal parameter of a probability model $P(\mathcal X \mid \theta)$ that contains a latent variable $\mathcal Z$.

The latent variable is an artificially defined quantity: from the observed sample set alone it is difficult to discern the distribution of the model $P(\mathcal X \mid \theta)$, so a latent variable is introduced to assist in solving for the model.

The underlying logic of the EM algorithm is still maximum likelihood estimation:

$$\hat \theta = \mathop{\arg\max}\limits_{\theta} \log P(\mathcal X \mid \theta)$$

Since the latent variable is merely a device for solving the model and does not actually exist, it is only introduced during the solution process; it is not written directly into the model, as in $\mathop{\arg\max}\limits_{\theta} \log P(\mathcal X, \mathcal Z \mid \theta)$.

  • $\mathcal X$ is called the observed data: the actual information obtained from the real sample set;
  • $\mathcal Z$ is called the unobserved data, i.e. the latent variable (Latent Variable), which can be viewed as the regularity hidden inside the sample set;
  • $(\mathcal X, \mathcal Z)$ is called the complete data;
  • $\theta$ is the parameter of the probability model $P(\mathcal X \mid \theta)$.

The EM algorithm formula is:

$$\theta^{(t+1)} = \mathop{\arg\max}\limits_{\theta} \int_{\mathcal Z} \log P(\mathcal X,\mathcal Z \mid \theta)\, P(\mathcal Z \mid \mathcal X,\theta^{(t)})\, d\mathcal Z$$

This formula is in essence an iterative procedure; each iteration consists of two steps, executed alternately:

  • E-step (Expectation step): view $\int_{\mathcal Z} \log P(\mathcal X,\mathcal Z \mid \theta)\, P(\mathcal Z \mid \mathcal X,\theta^{(t)})\, d\mathcal Z$ as the expectation of $\log P(\mathcal X,\mathcal Z \mid \theta)$ under the distribution $P(\mathcal Z \mid \mathcal X,\theta^{(t)})$, i.e.:
    $\mathbb E_{\mathcal Z \mid \mathcal X,\theta^{(t)}} \left[\log P(\mathcal X,\mathcal Z \mid \theta)\right]$
  • M-step (Maximization step): choose the parameter $\theta^{(t+1)}$ that maximizes the E-step expectation (a code sketch of this loop follows the list):
    $\theta^{(t+1)} = \mathop{\arg\max}\limits_{\theta} \left\{\mathbb E_{\mathcal Z \mid \mathcal X,\theta^{(t)}} \left[\log P(\mathcal X,\mathcal Z \mid \theta)\right]\right\}$
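
As a minimal sketch of this alternation, assume the model $P(\mathcal X \mid \theta)$ is a two-component 1-D Gaussian mixture with the component indicator as $\mathcal Z$; the function name `em_gmm` and the parameterization (weights `pi`, means `mu`, variances `var`) are illustrative choices, not part of the general formula:

```python
import numpy as np

def em_gmm(x, n_iter=50):
    # theta^(0): a crude initialization of weights, means, and variances
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([1.0, 1.0])
    for _ in range(n_iter):
        # E-step: posterior P(Z | X, theta^(t)) per sample ("responsibilities")
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: argmax_theta E_{Z|X,theta^(t)}[log P(X, Z | theta)];
        # for a Gaussian mixture the maximizer has this closed form
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
print(em_gmm(x))  # recovers weights near (0.3, 0.7) and means near (-2, 3)
```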

Taking the EM formula as given, the previous article also verified, from the maximum-likelihood perspective, that the likelihood under $\theta^{(t+1)}$ is indeed no worse than the likelihood under $\theta^{(t)}$, i.e. the validity of the EM formula (a quick empirical check follows):

$$\log P(\mathcal X \mid \theta^{(t+1)}) \geq \log P(\mathcal X \mid \theta^{(t)})$$
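
This inequality can be observed empirically. The following is a hedged check under the same illustrative two-component Gaussian-mixture assumption as above: the observed-data log-likelihood $\log P(\mathcal X \mid \theta^{(t)})$ is recorded at every iteration and should never decrease (up to floating-point noise):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
pi, mu, var = np.array([0.5, 0.5]), np.array([x.min(), x.max()]), np.array([1.0, 1.0])

prev_ll = -np.inf
for t in range(50):
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    ll = np.log(dens.sum(axis=1)).sum()   # log P(X | theta^(t))
    assert ll >= prev_ll - 1e-9           # monotone, up to float noise
    prev_ll = ll
    resp = dens / dens.sum(axis=1, keepdims=True)   # E-step
    nk = resp.sum(axis=0)                           # M-step (closed form)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
```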

What follows is the derivation of the EM algorithm formula itself.

Derivation

Starting once more from maximum likelihood estimation and introducing the latent variable $\mathcal Z$, the log-likelihood of the model can be written (by Bayes' rule) as:

$$\log P(\mathcal X \mid \theta) = \log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{P(\mathcal Z \mid \mathcal X,\theta)} = \log P(\mathcal X,\mathcal Z \mid \theta) - \log P(\mathcal Z \mid \mathcal X,\theta)$$
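
A small numeric sanity check of this identity, on a made-up $2 \times 2$ discrete joint table (the table values are arbitrary illustrative numbers):

```python
import numpy as np

# P(X = x, Z = z | theta): rows index x, columns index z (made-up values)
joint = np.array([[0.1, 0.3],
                  [0.2, 0.4]])
x = 0                              # any observed value of X
p_x = joint[x].sum()               # P(X = x | theta), marginalizing out Z
for z in range(2):
    p_z_given_x = joint[x, z] / p_x                 # P(Z = z | X = x, theta)
    lhs = np.log(p_x)
    rhs = np.log(joint[x, z]) - np.log(p_z_given_x)
    assert np.isclose(lhs, rhs)   # the identity holds for every value of z
```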

Here comes a key trick: introduce the log of a distribution $\mathcal Q(\mathcal Z)$ over the latent variable, $\log \mathcal Q(\mathcal Z)$, by adding and subtracting it. Then:

$$\begin{aligned} \log P(\mathcal X \mid \theta) & = \log P(\mathcal X,\mathcal Z \mid \theta) - \log \mathcal Q(\mathcal Z) - \left[\log P(\mathcal Z \mid \mathcal X,\theta) - \log \mathcal Q(\mathcal Z)\right] \\ & = \log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)} - \log \frac{P(\mathcal Z \mid \mathcal X,\theta)}{\mathcal Q(\mathcal Z)} \end{aligned}$$
Now take the expectation of both sides with respect to the distribution $\mathcal Q(\mathcal Z)$.

  • Expectation of the left side, $\log P(\mathcal X \mid \theta)$, under $\mathcal Q(\mathcal Z)$:
    $\int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log P(\mathcal X \mid \theta)\, d\mathcal Z$
    Since $\log P(\mathcal X \mid \theta)$ does not involve the variable $\mathcal Z$, it can be treated as a constant and pulled out of the integral:
    $\log P(\mathcal X \mid \theta) \int_{\mathcal Z} \mathcal Q(\mathcal Z)\, d\mathcal Z$
    Because a probability density integrates to one, $\int_{\mathcal Z} \mathcal Q(\mathcal Z)\, d\mathcal Z = 1$, the expectation of $\log P(\mathcal X \mid \theta)$ under $\mathcal Q(\mathcal Z)$ is just $\log P(\mathcal X \mid \theta)$ itself:
    $\log P(\mathcal X \mid \theta) \int_{\mathcal Z} \mathcal Q(\mathcal Z)\, d\mathcal Z = \log P(\mathcal X \mid \theta) \cdot 1 = \log P(\mathcal X \mid \theta)$

  • Expectation of the right side, $\log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)} - \log \frac{P(\mathcal Z \mid \mathcal X,\theta)}{\mathcal Q(\mathcal Z)}$, under $\mathcal Q(\mathcal Z)$:
    $\int_{\mathcal Z} \mathcal Q(\mathcal Z) \left[\log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)} - \log \frac{P(\mathcal Z \mid \mathcal X,\theta)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z$
    Expanding this expression into two parts:
    $\int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)}\, d\mathcal Z + \left[- \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \frac{P(\mathcal Z \mid \mathcal X,\theta)}{\mathcal Q(\mathcal Z)}\, d\mathcal Z\right]$
    The second term is exactly the relative entropy between the two distributions $\mathcal Q(\mathcal Z)$ and $P(\mathcal Z \mid \mathcal X,\theta)$, better known as the KL divergence (Kullback-Leibler Divergence). In practical terms, it measures how different the two distributions are:
    $- \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \frac{P(\mathcal Z \mid \mathcal X,\theta)}{\mathcal Q(\mathcal Z)}\, d\mathcal Z = \mathcal{KL}\left(\mathcal Q(\mathcal Z) \,\|\, P(\mathcal Z \mid \mathcal X,\theta)\right)$
    The first term likewise has a name: the evidence lower bound (Evidence Lower Bound, ELBO). As shown next, it is a lower bound on $\log P(\mathcal X \mid \theta)$.
Combining the two expectations and rearranging:

$$\begin{aligned} \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log P(\mathcal X \mid \theta)\, d\mathcal Z & = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \left[\log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)} - \log \frac{P(\mathcal Z \mid \mathcal X,\theta)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z \\ \Rightarrow \log P(\mathcal X \mid \theta) & = \text{ELBO} + \mathcal{KL}\left(\mathcal Q(\mathcal Z) \,\|\, P(\mathcal Z \mid \mathcal X,\theta)\right) \end{aligned}$$
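
This decomposition can also be checked numerically. Below is a hedged check on a made-up discrete model (one observed value of $\mathcal X$, two values of $\mathcal Z$; the numbers in `joint` and `q` are arbitrary):

```python
import numpy as np

joint = np.array([0.1, 0.3])          # P(X = x, Z = z | theta) for the observed x
q = np.array([0.6, 0.4])              # an arbitrary distribution Q(Z)
p_x = joint.sum()                     # evidence P(X = x | theta)
post = joint / p_x                    # posterior P(Z | X = x, theta)

elbo = (q * np.log(joint / q)).sum()  # E_Q[log P(X, Z | theta) / Q(Z)]
kl = (q * np.log(q / post)).sum()     # KL(Q || P(Z | X, theta)) >= 0
assert np.isclose(np.log(p_x), elbo + kl)

# equality case (anticipating the condition discussed next): with Q(Z) equal
# to the posterior, the KL term vanishes and ELBO = log P(X | theta)
q = post
elbo = (q * np.log(joint / q)).sum()
kl = (q * np.log(q / post)).sum()
assert np.isclose(kl, 0.0) and np.isclose(elbo, np.log(p_x))
```
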
By the properties of the KL divergence,

$$\mathcal{KL}\left(\mathcal Q(\mathcal Z) \,\|\, P(\mathcal Z \mid \mathcal X,\theta)\right) \geq 0$$

so $\log P(\mathcal X \mid \theta) \geq \text{ELBO}$ always holds: $\log P(\mathcal X \mid \theta)$ is bounded below by the $\text{ELBO}$. When is the bound tight? Again by the properties of the KL divergence, equality holds exactly when the two distributions $\mathcal Q(\mathcal Z)$ and $P(\mathcal Z \mid \mathcal X,\theta)$ are identical:

$$\begin{aligned} \mathcal Q(\mathcal Z) = P(\mathcal Z \mid \mathcal X,\theta) & \Rightarrow \mathcal{KL}\left(\mathcal Q(\mathcal Z) \,\|\, P(\mathcal Z \mid \mathcal X,\theta)\right) = 0 \\ & \Rightarrow \log P(\mathcal X \mid \theta) = \text{ELBO} \end{aligned}$$
In fact, $\mathcal Q(\mathcal Z)$ and $P(\mathcal Z \mid \mathcal X,\theta)$ should grow ever closer over the course of the iterations: both describe distributions over the latent variable $\mathcal Z$, and it would make no sense for the gap between them to keep growing. Based on this observation, $\mathcal{KL}\left(\mathcal Q(\mathcal Z) \,\|\, P(\mathcal Z \mid \mathcal X,\theta)\right)$ steadily approaches $0$.

Our core goal, however, is still to maximize $\log P(\mathcal X \mid \theta)$. Since the KL term, although always $\geq 0$, is being driven toward $0$, it cannot take on any share of the work of increasing $\log P(\mathcal X \mid \theta)$; that work falls entirely to the $\text{ELBO}$.

This suggests a natural idea, consisting of two steps: first make $\mathcal Q(\mathcal Z)$ equal to $P(\mathcal Z \mid \mathcal X,\theta)$, and then maximize the $\text{ELBO}$, thereby maximizing $\log P(\mathcal X \mid \theta)$:

$$\hat \theta = \mathop{\arg\max}\limits_{\theta} \log P(\mathcal X \mid \theta) = \mathop{\arg\max}\limits_{\theta} \text{ELBO} \quad \left(\mathcal Q(\mathcal Z) = P(\mathcal Z \mid \mathcal X,\theta)\right)$$

Substituting in the definition of the $\text{ELBO}$:

$$\hat \theta = \mathop{\arg\max}\limits_{\theta} \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)}\, d\mathcal Z$$
Now substitute $\mathcal Q(\mathcal Z) = P(\mathcal Z \mid \mathcal X,\theta^{(t)})$. Note that $P(\mathcal Z \mid \mathcal X,\theta^{(t)})$ here is the posterior distribution of the latent variable $\mathcal Z$ from the previous iteration, not the abstract $P(\mathcal Z \mid \mathcal X,\theta)$ itself.

A personal note on why the superscript $(t)$ appears: we want $\mathcal{KL}\left(\mathcal Q(\mathcal Z) \,\|\, P(\mathcal Z \mid \mathcal X,\theta)\right) = 0$, which requires $\mathcal Q(\mathcal Z) = P(\mathcal Z \mid \mathcal X,\theta)$. The optimal posterior for the current iteration would in theory be $P(\mathcal Z \mid \mathcal X,\theta^{(t+1)})$, but $\theta^{(t+1)}$ is exactly the parameter this iteration is trying to find, so $P(\mathcal Z \mid \mathcal X,\theta^{(t+1)})$ is not yet available. We can therefore only pick the best distribution over $\mathcal Z$ available at the current step; by the convergence of the EM iteration, that is the posterior produced by the previous iteration's parameter $\theta^{(t)}$, namely $P(\mathcal Z \mid \mathcal X,\theta^{(t)})$. Choosing $\mathcal Q(\mathcal Z) = P(\mathcal Z \mid \mathcal X,\theta^{(t)})$ may not make the KL divergence exactly $0$, but it is undeniably the choice that brings it closest to $0$.
$$\hat \theta = \mathop{\arg\max}\limits_{\theta} \int_{\mathcal Z} P(\mathcal Z \mid \mathcal X,\theta^{(t)}) \log \left[\frac{P(\mathcal X,\mathcal Z \mid \theta)}{P(\mathcal Z \mid \mathcal X,\theta^{(t)})}\right] d\mathcal Z$$

Expanding the expression:

$$\hat \theta = \mathop{\arg\max}\limits_{\theta} \left\{\int_{\mathcal Z} P(\mathcal Z \mid \mathcal X,\theta^{(t)}) \log P(\mathcal X,\mathcal Z \mid \theta)\, d\mathcal Z - \int_{\mathcal Z} P(\mathcal Z \mid \mathcal X,\theta^{(t)}) \log P(\mathcal Z \mid \mathcal X,\theta^{(t)})\, d\mathcal Z\right\}$$

Look at the second term inside the braces:

$$\int_{\mathcal Z} P(\mathcal Z \mid \mathcal X,\theta^{(t)}) \log P(\mathcal Z \mid \mathcal X,\theta^{(t)})\, d\mathcal Z$$

Here $\theta^{(t)}$ is the optimal parameter produced by the previous iteration and is a constant within the current iteration, so this term does not depend on $\theta$ and can be dropped from the $\arg\max$. After simplification:

$$\hat \theta = \mathop{\arg\max}\limits_{\theta} \int_{\mathcal Z} \log P(\mathcal X,\mathcal Z \mid \theta)\, P(\mathcal Z \mid \mathcal X,\theta^{(t)})\, d\mathcal Z$$

This completes the derivation.
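
To make the final formula concrete, here is a hedged sketch of a single EM update computed directly from it, using the classic "two biased coins" toy model as an illustrative assumption: for each trial a hidden fair choice $\mathcal Z$ picks coin 0 or coin 1 (with head probabilities $\theta = (p_0, p_1)$), and $\mathcal X$ records the number of heads in $n = 10$ flips. All data values and names below are made up for illustration:

```python
import numpy as np
from scipy.stats import binom

xs, n = np.array([9, 8, 2, 1, 7]), 10   # observed head counts in five trials
theta_t = np.array([0.6, 0.5])          # theta^(t) from the previous iteration

# E-step: posterior P(Z | X, theta^(t)) for every trial (rows) and coin (cols)
joint = 0.5 * binom.pmf(xs[:, None], n, theta_t)   # P(X, Z | theta^(t))
post = joint / joint.sum(axis=1, keepdims=True)

# M-step: argmax_theta of sum_Z P(Z | X, theta^(t)) log P(X, Z | theta);
# for this model the maximizer has a closed form: the weighted fraction of
# heads attributed to each coin
theta_next = (post * xs[:, None]).sum(axis=0) / (n * post.sum(axis=0))
print(theta_next)  # theta^(t+1): coin 0 drifts toward the high-count trials
```

Iterating this pair of steps reproduces exactly the update $\theta^{(t+1)} = \mathop{\arg\max}_{\theta} \int_{\mathcal Z} \log P(\mathcal X,\mathcal Z \mid \theta)\, P(\mathcal Z \mid \mathcal X,\theta^{(t)})\, d\mathcal Z$ derived above, with the integral reduced to a finite sum over the discrete $\mathcal Z$.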

