机器学习笔记之变分推断(二)公式推导过程(基于平均场假设)

机器学习笔记之变分推断——基于平均场假设的公式推导过程

  • 引言
    • 回顾:推断与变分推断
    • 变分推断:公式推导过程
      • 初始转化过程
      • 最大变分 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]求解过程
        • 基于平均场假设的求解思想
        • 求解过程实现

引言

上一节介绍了分别从频率角度贝叶斯角度认识机器学习问题,并介绍了推断(Inference)在整个贝叶斯角度的重要作用。本节将正式介绍确定性近似推断的代表方法——变分推断(Variational Inference)

提示:本节推导过程与EM算法部分存在相似之处,请对比食用,谢谢

回顾:推断与变分推断

关于贝叶斯角度认识问题,本质上是给定样本集合 X \mathcal X X,针对陌生数据 x ^ \hat x x^预测问题,即 P ( x ^ ∣ X ) P(\hat x \mid \mathcal X) P(x^X)

基于上述逻辑,贝叶斯角度的具体做法是:针对样本集合 X \mathcal X X构建模型,通过模型参数 θ \theta θ作为样本集合 X \mathcal X X与陌生数据 x ^ \hat x x^之间构建关系的桥梁,将 P ( x ^ ∣ X ) P(\hat x \mid \mathcal X) P(x^X)表示为如下形式:
P ( x ^ ∣ X ) = ∫ θ P ( x ^ , θ ∣ X ) d θ = ∫ θ P ( x ^ ∣ θ ) ⋅ P ( θ ∣ X ) d θ \begin{aligned} P(\hat x \mid \mathcal X) & = \int_{\theta} P(\hat x ,\theta \mid \mathcal X) d\theta \\ & = \int_{\theta} P(\hat x \mid \theta) \cdot P(\theta \mid \mathcal X)d\theta \end{aligned} P(x^X)=θP(x^,θX)dθ=θP(x^θ)P(θX)dθ

基于上述公式,使用贝叶斯定理求解 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θX)
P ( θ ∣ X ) = P ( X ∣ θ ) ⋅ P ( θ ) P ( X ) \begin{aligned} P(\theta \mid \mathcal X) & = \frac{P(\mathcal X \mid \theta) \cdot P(\theta)}{P(\mathcal X)} \end{aligned} P(θX)=P(X)P(Xθ)P(θ)
至此,关于求解 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θX)的过程称为推断
关于样本数据的边缘概率分布 P ( X ) P(\mathcal X) P(X)可看成一个积分操作
引入‘隐变量’。
P ( X ) = ∫ Z P ( X ∣ Z ) ⋅ P ( Z ) d Z = ∫ z 1 ⋯ ∫ z K P ( X ∣ Z ) ⋅ P ( Z ) d z , ⋯   , z K \begin{aligned} P(\mathcal X) & = \int_{\mathcal Z} P(\mathcal X \mid \mathcal Z) \cdot P(\mathcal Z)d\mathcal Z \\ & = \int_{z_1} \cdots \int_{z_{\mathcal K}} P(\mathcal X \mid \mathcal Z) \cdot P(\mathcal Z) d z_,\cdots,z_{\mathcal K} \end{aligned} P(X)=ZP(XZ)P(Z)dZ=z1zKP(XZ)P(Z)dz,,zK
如果引入的隐变量 Z = ( z 1 , ⋯   , z K ) T \mathcal Z = (z_1,\cdots,z_{\mathcal K})^{T} Z=(z1,,zK)T中的维度 K \mathcal K K过高,导致 P ( X ) P(\mathcal X) P(X)积分困难,最终使得 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θX)无法求解
因此,需要使用一些方法近似求解 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θX)。而变分推断(Variational Inference,VI)就是 近似推断中,确定性近似的代表方法

变分推断:公式推导过程

近似推断(Approximate Inference)的核心观点是针对 ∫ Z P ( X ∣ Z ) ⋅ P ( Z ) d Z \int_{\mathcal Z} P(\mathcal X \mid Z) \cdot P(\mathcal Z)d\mathcal Z ZP(XZ)P(Z)dZ积分困难的问题,通过找出一个关于隐变量 Z \mathcal Z Z的概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)去逼近后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(ZX)。即:
Q ( Z ) ≈ P ( Z ∣ X ) \mathcal Q(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) Q(Z)P(ZX)
这里定义:

  • X \mathcal X X观测变量(Observed Data),即真实的样本数据
  • Z \mathcal Z Z表示 隐变量(Latent Data)和 模型参数(Parameter)的统称
    因为‘隐变量’本身就不真实存在,它只是一个‘表达’概率模型的中间环节。因此,‘隐变量 + 模型参数’合并在一起是合理的。
    该定义与EM算法中的定义式略有区分的,EM算法中,隐变量是隐变量,参数是参数。
  • 依然将联合概率分布 ( X , Z ) (\mathcal X,\mathcal Z) (X,Z)称作 完整数据(Complete Data)。

初始转化过程

EM算法中,底层逻辑是使用极大似然估计(Maximum Likelihood Estimate,MLE)进行求解,并且求解的模型参数 θ \theta θ
变分推断求解过程中,模型参数 θ \theta θ合并进隐变量 Z \mathcal Z Z。这里依然从概率模型 P ( X ) P(\mathcal X) P(X)入手,执行推导过程
条件概率公式~为方便推导过程,依然保留‘log函数’。
log ⁡ P ( X ) = log ⁡ [ P ( X , Z ) P ( Z ∣ X ) ] = log ⁡ P ( X , Z ) − log ⁡ P ( Z ∣ X ) \begin{aligned} \log P(\mathcal X) & = \log \left[\frac{P(\mathcal X,\mathcal Z)}{P(\mathcal Z \mid \mathcal X)}\right] \\ & = \log P(\mathcal X,\mathcal Z) - \log P(\mathcal Z \mid \mathcal X) \end{aligned} logP(X)=log[P(ZX)P(X,Z)]=logP(X,Z)logP(ZX)

  • EM算法推导思路相同,引入一个关于隐变量的概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)。则有:
    log ⁡ P ( X ) = log ⁡ P ( X , Z ) − Q ( Z ) − [ log ⁡ P ( Z ∣ X ) − Q ( Z ) ] = log ⁡ [ P ( X , Z ) Q ( Z ) ] − log ⁡ [ P ( Z ∣ X ) Q ( Z ) ] \begin{aligned} \log P(\mathcal X) & = \log P(\mathcal X,\mathcal Z) - \mathcal Q(\mathcal Z) - [\log P(\mathcal Z \mid \mathcal X) - \mathcal Q(\mathcal Z)] \\ & = \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] - \log \left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] \end{aligned} logP(X)=logP(X,Z)Q(Z)[logP(ZX)Q(Z)]=log[Q(Z)P(X,Z)]log[Q(Z)P(ZX)]

  • 等式两端分别对 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)积分

    • 等式左端:
      ∫ Z log ⁡ P ( X ) ⋅ Q ( Z ) d Z = log ⁡ P ( X ) ∫ Z Q ( Z ) d Z = log ⁡ P ( X ) \int_{\mathcal Z} \log P(\mathcal X) \cdot \mathcal Q(\mathcal Z) d\mathcal Z = \log P(\mathcal X) \int_{\mathcal Z} \mathcal Q(\mathcal Z) d\mathcal Z = \log P(\mathcal X) ZlogP(X)Q(Z)dZ=logP(X)ZQ(Z)dZ=logP(X)
    • 等式右端:
      ∫ Z Q ( Z ) ⋅ log ⁡ [ P ( X , Z ) Q ( Z ) ] d Z − ∫ Z Q ( Z ) ⋅ [ P ( Z ∣ X ) Q ( Z ) ] d Z \begin{aligned} \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z - \int_{\mathcal Z}\mathcal Q(\mathcal Z) \cdot \left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z \end{aligned} ZQ(Z)log[Q(Z)P(X,Z)]dZZQ(Z)[Q(Z)P(ZX)]dZ
  • EM算法中定义 ∫ Z Q ( Z ) ⋅ log ⁡ [ P ( X , Z ) Q ( Z ) ] d Z \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z ZQ(Z)log[Q(Z)P(X,Z)]dZ证据下界(Evidence Lower Bound,ELBO);
    − ∫ Z Q ( Z ) ⋅ [ P ( Z ∣ X ) Q ( Z ) ] d Z - \int_{\mathcal Z}\mathcal Q(\mathcal Z) \cdot \left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z ZQ(Z)[Q(Z)P(ZX)]dZ是关于 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)隐变量 Z \mathcal Z Z后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(ZX) K L \mathcal K\mathcal L KL散度

    因此, log ⁡ P ( X ) \log P(\mathcal X) logP(X)将表示如下:
    log ⁡ P ( X ) = E L B O + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \log P(\mathcal X) = ELBO + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] logP(X)=ELBO+KL[Q(Z)∣∣P(ZX)]

  • 概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)看成一个变量,并将 E L B O ELBO ELBO部分定义成一个关于 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)的函数 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]
    L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)] Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)的变分。
    log ⁡ P ( X ) = L [ Q ( Z ) ] + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \log P(\mathcal X) = \mathcal L[\mathcal Q(\mathcal Z)] + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] logP(X)=L[Q(Z)]+KL[Q(Z)∣∣P(ZX)]
    观察上式,根据数学定义, K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] ≥ 0 \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] \geq 0 KL[Q(Z)∣∣P(ZX)]0恒成立,则有:
    log ⁡ P ( X ) ≥ L [ Q ( Z ) ] \log P(\mathcal X) \geq \mathcal L[\mathcal Q(\mathcal Z)] logP(X)L[Q(Z)]

  • 换句话说, L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]存在上界,并且 上界结果是 log ⁡ P ( X ) \log P(\mathcal X) logP(X)

    观测变量 X \mathcal X X给定的条件下, log ⁡ P ( X ) \log P(\mathcal X) logP(X)可看作固定常数;因此 L [ Q ( Z ) ] + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal L[\mathcal Q(\mathcal Z)] + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] L[Q(Z)]+KL[Q(Z)∣∣P(ZX)]结果也是固定常数

    在上述条件下, L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]在什么情况下能够取得最大值自然是 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(ZX)]越趋近于0,那么 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(ZX)]越趋近于上界 log ⁡ P ( X ) \log P(\mathcal X) logP(X)

    K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(ZX)]本身数学性质表示概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)与概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(ZX)的关联关系
    Q ( Z ) ≈ P ( Z ∣ X ) ⇒ K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] → 0 \mathcal Q(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) \Rightarrow \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] \to 0 Q(Z)P(ZX)KL[Q(Z)∣∣P(ZX)]0

至此,逻辑关系已经找到:
目标:找到一个与隐变量 Z \mathcal Z Z后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(ZX)近似的分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z),替代 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(ZX)参与运算;

  • 基于 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]存在上界 log ⁡ P ( X ) \log P(\mathcal X) logP(X),并且 log ⁡ P ( X ) \log P(\mathcal X) logP(X)只和观测变量 X \mathcal X X有关,视为常数
  • log ⁡ P ( X ) \log P(\mathcal X) logP(X)固定的情况下, L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]越大, K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(ZX)]越小,因而 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z) P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(ZX)越近似。

从而将寻找近似分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)转化为最大化 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]的问题。

最大变分 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]求解过程

基于上述问题转化,基于隐变量的最优分布 Q ^ ( Z ) \hat {\mathcal Q}(\mathcal Z) Q^(Z)可进行如下表示:
Q ^ ( Z ) = arg ⁡ max ⁡ Q ( Z ) L [ Q ( Z ) ] ⇒ Q ^ ( Z ) ≈ P ( Z ∣ X ) \hat {\mathcal Q}(\mathcal Z) = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \mathcal L[\mathcal Q(\mathcal Z)] \Rightarrow \hat {\mathcal Q}(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) Q^(Z)=Q(Z)argmaxL[Q(Z)]Q^(Z)P(ZX)

基于平均场假设的求解思想

概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)进行假设:平均场假设(Mean Theory)。
上面定义 Z \mathcal Z Z隐变量模型参数 的统称。这里关于 Z \mathcal Z Z数量的统计是指 概率图中状态变量的数量假设状态变量的数量是 K \mathcal K K,并将其划分成 M \mathcal M M个组,每个组中的 状态变量之间不存在重复。
数学符号表达如下:
Q ( Z ) = ∏ i = 1 M Q i ( Z ( i ) ) = Q 1 ( Z ( 1 ) ) ⋅ Q 2 ( Z ( 2 ) ) ⋯ Q M ( Z ( M ) ) \begin{aligned} \mathcal Q(\mathcal Z) & = \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \\ & = \mathcal Q_1(\mathcal Z^{(1)}) \cdot \mathcal Q_2(\mathcal Z^{(2)}) \cdots \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \end{aligned} Q(Z)=i=1MQi(Z(i))=Q1(Z(1))Q2(Z(2))QM(Z(M))
其中, Z ( i ) \mathcal Z^{(i)} Z(i)表示第 i i i个子集合包含的 状态变量的具体结果,而 Q i ( Z ( i ) ) \mathcal Q_i(\mathcal Z^{(i)}) Qi(Z(i))表示向量 Z ( i ) \mathcal Z^{(i)} Z(i)对应的概率分布。也可以理解为: i i i个子集合所包含状态变量的联合概率分布

  • 具体想法是,由于各子集合之间相互独立,想要求解 Q j ( Z ( j ) ) ( j = { 1 , 2 , ⋯   , M } ) \mathcal Q_j(\mathcal Z^{(j)}) (j=\{1,2,\cdots,\mathcal M\}) Qj(Z(j))(j={1,2,,M}),则需要固定 Q − j ( Z ( − j ) ) \mathcal Q_{-j}(\mathcal Z^{(-j)}) Qj(Z(j))分量的结果
    Q − j ( Z ( − j ) ) \mathcal Q_{-j}(\mathcal Z^{(-j)}) Qj(Z(j))表示除去编号 j j j子集合后剩余子集合的概率分布。
  • 基于上述步骤,会分别求解出各向量 Z ( i ) \mathcal Z^{(i)} Z(i)对应的概率分布 Q i ( Z ( i ) ) ( i = 1 , 2 , ⋯ M ) \mathcal Q_{i}(\mathcal Z^{(i)})(i=1,2,\cdots \mathcal M) Qi(Z(i))(i=1,2,M),最后相乘即可。

求解过程实现

Q ( Z ) = ∏ i = 1 M Q i ( Z ( i ) ) \mathcal Q(\mathcal Z) =\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) Q(Z)=i=1MQi(Z(i))带入 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]中:
L [ Q ( Z ) ] = ∫ Z Q ( Z ) ⋅ log ⁡ [ P ( X , Z ) Q ( Z ) ] d Z = ∫ Z Q ( Z ) log ⁡ P ( X , Z ) d Z − ∫ Z Q ( Z ) log ⁡ Q ( Z ) d Z \begin{aligned} \mathcal L[\mathcal Q(\mathcal Z)] & = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z \\ & = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log P(\mathcal X,\mathcal Z)d \mathcal Z - \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \mathcal Q(\mathcal Z) d\mathcal Z \end{aligned} L[Q(Z)]=ZQ(Z)log[Q(Z)P(X,Z)]dZ=ZQ(Z)logP(X,Z)dZZQ(Z)logQ(Z)dZ

  • 带入第一项
    按照上述固定要求,将 Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj(Z(j))提出来。
    ∫ Z ∏ i = 1 M [ Q i ( Z ( i ) ) ⋅ log ⁡ P ( X , Z ) ] d Z ( 1 ) , ⋯   , d Z ( M ) = ∫ Z ( j ) Q j ( Z ( j ) ) ⋅ { ∫ Z ( 1 ) ⋯ ∫ Z ( j − 1 ) ∫ Z ( j + 1 ) ⋯ ∫ Z ( M ) ∏ i ≠ j M [ Q i ( Z ( i ) ) ⋅ log ⁡ P ( X , Z ) ] d Z ( 1 ) , ⋯   , Z ( j − 1 ) , Z ( j + 1 ) , ⋯   , Z ( M ) } d Z ( j ) \begin{aligned} & \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)}) \cdot \log P(\mathcal X,\mathcal Z) \right] d\mathcal Z^{(1)},\cdots,d\mathcal Z^{(\mathcal M)} \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \cdot \left\{\int_{\mathcal Z^{(1)}}\cdots \int_{\mathcal Z^{(j-1)}} \int_{\mathcal Z^{(j+1)}}\cdots\int_{\mathcal Z^{(\mathcal M)}} \prod_{i \neq j}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)})\cdot \log P(\mathcal X,\mathcal Z)\right] d\mathcal Z^{(1)},\cdots,\mathcal Z^{(j-1)},\mathcal Z^{(j+1)},\cdots,\mathcal Z^{(\mathcal M)}\right\} d \mathcal Z^{(j)} \end{aligned} Zi=1M[Qi(Z(i))logP(X,Z)]dZ(1),,dZ(M)=Z(j)Qj(Z(j)) Z(1)Z(j1)Z(j+1)Z(M)i=jM[Qi(Z(i))logP(X,Z)]dZ(1),,Z(j1),Z(j+1),,Z(M) dZ(j)
    观察上述大括号中的项,我们可以将其看做期望形式
    ∫ Z ( 1 ) ⋯ ∫ Z ( j − 1 ) ∫ Z ( j + 1 ) ⋯ ∫ Z ( M ) ∏ i ≠ j M [ Q i ( Z ( i ) ) ⋅ log ⁡ P ( X , Z ) ] d Z ( 1 ) , ⋯   , Z ( j − 1 ) , Z ( j + 1 ) , ⋯   , Z ( M ) = E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] \begin{aligned} & \int_{\mathcal Z^{(1)}}\cdots \int_{\mathcal Z^{(j-1)}} \int_{\mathcal Z^{(j+1)}}\cdots\int_{\mathcal Z^{(\mathcal M)}} \prod_{i \neq j}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)})\cdot \log P(\mathcal X,\mathcal Z)\right] d\mathcal Z^{(1)},\cdots,\mathcal Z^{(j-1)},\mathcal Z^{(j+1)},\cdots,\mathcal Z^{(\mathcal M)} \\ & = \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] \end{aligned} Z(1)Z(j1)Z(j+1)Z(M)i=jM[Qi(Z(i))logP(X,Z)]dZ(1),,Z(j1),Z(j+1),,Z(M)=Ei=jMQi(Z(i))[logP(X,Z)]
    至此,第一项表示为
    ∫ Z j Q j ( Z ( j ) ) E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] d Z j \int_{\mathcal Z_j} \mathcal Q_j(\mathcal Z^{(j)}) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] d \mathcal Z_j ZjQj(Z(j))Ei=jMQi(Z(i))[logP(X,Z)]dZj

  • 带入第二项
    log部分改成求和方式~
    ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ ∏ i = 1 M Q i ( Z ( i ) ) d Z = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ ∑ i = 1 M log ⁡ Q i ( Z ( i ) ) d Z \begin{aligned} & \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z \\ & = \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \sum_{i=1}^{\mathcal M} \log\mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z \end{aligned} Zi=1MQi(Z(i))logi=1MQi(Z(i))dZ=Zi=1MQi(Z(i))i=1MlogQi(Z(i))dZ
    继续展开(连加部分展开):
    = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ [ log ⁡ Q 1 ( Z ( 1 ) ) + log ⁡ Q 2 ( Z ( 2 ) ) + ⋯ + log ⁡ Q M ( Z ( M ) ) ] d Z = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ Q 1 ( Z ( 1 ) ) d Z + ⋯ + ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ Q M ( Z ( M ) ) d Z \begin{aligned} & = \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \left[\log \mathcal Q_1(\mathcal Z^{(1)}) + \log \mathcal Q_2(\mathcal Z^{(2)}) + \cdots + \log \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \right] d\mathcal Z \\ & = \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z + \cdots + \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) d\mathcal Z \end{aligned} =Zi=1MQi(Z(i))[logQ1(Z(1))+logQ2(Z(2))++logQM(Z(M))]dZ=Zi=1MQi(Z(i))logQ1(Z(1))dZ++Zi=1MQi(Z(i))logQM(Z(M))dZ

    • 观察上式中的第一项
      ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ Q 1 ( Z ( 1 ) ) d Z \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z Zi=1MQi(Z(i))logQ1(Z(1))dZ
    • 该项中的多重积分连乘符号全部展开
      ∫ Z ( 1 ) ∫ Z ( 2 ) ⋯ ∫ Z ( M ) [ Q 1 ( Z ( 1 ) ) ⋅ Q 2 ( Z ( 2 ) ) ⋯ Q M ( Z ( M ) ) ⋅ log ⁡ Q 1 ( Z ( 1 ) ) ] d Z ( 1 ) , Z ( 2 ) , ⋯   , Z ( M ) \int_{\mathcal Z^{(1)}} \int_{\mathcal Z^{(2)}} \cdots \int_{\mathcal Z^{(\mathcal M)}} \left[\mathcal Q_1(\mathcal Z^{(1)}) \cdot Q_2(\mathcal Z^{(2)}) \cdots Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)})\right] d \mathcal Z^{(1)},\mathcal Z^{(2)},\cdots,\mathcal Z^{(\mathcal M)} Z(1)Z(2)Z(M)[Q1(Z(1))Q2(Z(2))QM(Z(M))logQ1(Z(1))]dZ(1),Z(2),,Z(M)
    • 积分带入
      ∫ Z ( 1 ) Q 1 ( Z ( 1 ) ) log ⁡ Q 1 ( Z ( 1 ) ) d Z ( 1 ) ⋅ ∫ Z ( 2 ) Q 2 ( Z ( 2 ) ) ⋯ ∫ Z ( M ) Q M ( Z ( M ) ) \int_{\mathcal Z^{(1)}} \mathcal Q_1(\mathcal Z^{(1)}) \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z^{(1)} \cdot \int_{\mathcal Z^{(2)}} \mathcal Q_2(\mathcal Z^{(2)}) \cdots \int_{\mathcal Z^{(\mathcal M)}} \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) Z(1)Q1(Z(1))logQ1(Z(1))dZ(1)Z(2)Q2(Z(2))Z(M)QM(Z(M))
    • 从第二项开始到最后一项,使用概率密度积分结果均为 1 1 1
      ∫ Z ( 2 ) Q 2 ( Z ( 2 ) ) = ⋯ = ∫ Z ( M ) Q M ( Z ( M ) ) = 1 \int_{\mathcal Z^{(2)}} \mathcal Q_2(\mathcal Z^{(2)}) =\cdots = \int_{\mathcal Z^{(\mathcal M)}} \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) = 1 Z(2)Q2(Z(2))==Z(M)QM(Z(M))=1
    • 因此,则有:
      ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ Q 1 ( Z ( 1 ) ) d Z = ∫ Z ( 1 ) Q 1 ( Z ( 1 ) ) log ⁡ Q 1 ( Z ( 1 ) ) d Z ( 1 ) \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z = \int_{\mathcal Z^{(1)}} \mathcal Q_1(\mathcal Z^{(1)}) \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z^{(1)} Zi=1MQi(Z(i))logQ1(Z(1))dZ=Z(1)Q1(Z(1))logQ1(Z(1))dZ(1)
    • 因此,整个 连加部分展开式 可以写成:
      ∑ i = 1 M ∫ Z ( i ) Q i ( Z ( i ) ) log ⁡ Q i ( Z ( i ) ) d Z ( i ) \sum_{i=1}^{\mathcal M} \int_{\mathcal Z^{(i)}} \mathcal Q_i(\mathcal Z^{(i)}) \log \mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z^{(i)} i=1MZ(i)Qi(Z(i))logQi(Z(i))dZ(i)
    • 基于求解项是 Z ( j ) \mathcal Z^{(j)} Z(j),将 Z ( j ) \mathcal Z^{(j)} Z(j)与其他向分开:
      其他项不是需要求解的内容,视为常数。用 C \mathcal C C表示。
      ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ Q j ( Z ( j ) ) d Z ( j ) + C \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} + \mathcal C Z(j)Qj(Z(j))logQj(Z(j))dZ(j)+C
  • 至此,观察第一项和第二项的结果:
    ∫ Z j Q j ( Z ( j ) ) E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] d Z ( j ) ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ Q j ( Z ( j ) ) d Z ( j ) + C \int_{\mathcal Z_j} \mathcal Q_j(\mathcal Z^{(j)}) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] d \mathcal Z^{(j)} \\ \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} + \mathcal C ZjQj(Z(j))Ei=jMQi(Z(i))[logP(X,Z)]dZ(j)Z(j)Qj(Z(j))logQj(Z(j))dZ(j)+C
    这两项很相似,它们有共同的积分 ∫ Z ( j ) \int_{\mathcal Z^{(j)}} Z(j)第一个乘数项均是 Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj(Z(j)),因此,为了能够使两项合并设定 E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] Ei=jMQi(Z(i))[logP(X,Z)]是关于 X , Z ( j ) \mathcal X,\mathcal Z^{(j)} X,Z(j)函数。即:
    首先, Z \mathcal Z Z中包含 Z ( j ) \mathcal Z^{(j)} Z(j);
    其次,由于除了 Z ( j ) \mathcal Z^{(j)} Z(j)之外,其余变量均被固定住,视作常量。因此则有 ϕ ^ ( X , Z ( j ) ) \hat \phi(\mathcal X,\mathcal Z^{(j)}) ϕ^(X,Z(j))的表达。
    E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] = log ⁡ ϕ ^ ( X , Z ( j ) ) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] = \log \hat \phi(\mathcal X,\mathcal Z^{(j)}) Ei=jMQi(Z(i))[logP(X,Z)]=logϕ^(X,Z(j))
    此时,两项合并的结果如下:
    ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ ϕ ^ ( X , Z ( j ) ) d Z ( j ) − ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ Q j ( Z ( j ) ) d Z ( j ) = ∫ Z ( j ) Q j ( Z ( j ) ) [ log ⁡ ϕ ^ ( X , Z ( j ) ) − log ⁡ Q j ( Z ( j ) ) ] = ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ [ ϕ ^ ( X , Z ( j ) ) Q j ( Z ( j ) ) ] d Z ( j ) + C = − K L [ ϕ ^ ( X , Z ( j ) ) ∣ ∣ Q j ( Z ( j ) ) ] \begin{aligned} & \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \hat \phi(\mathcal X,\mathcal Z^{(j)}) d\mathcal Z^{(j)} - \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \left[\log \hat \phi(\mathcal X,\mathcal Z^{(j)}) - \log \mathcal Q_j(\mathcal Z^{(j)})\right] \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \left[\frac{\hat \phi(\mathcal X,\mathcal Z^{(j)})}{\mathcal Q_j(\mathcal Z^{(j)})}\right]d \mathcal Z^{(j)} +\mathcal C \\ & = - \mathcal K\mathcal L \left[\hat \phi(\mathcal X,\mathcal Z^{(j)})||Q_j(\mathcal Z^{(j)})\right] \end{aligned} Z(j)Qj(Z(j))logϕ^(X,Z(j))dZ(j)Z(j)Qj(Z(j))logQj(Z(j))dZ(j)=Z(j)Qj(Z(j))[logϕ^(X,Z(j))logQj(Z(j))]=Z(j)Qj(Z(j))log[Qj(Z(j))ϕ^(X,Z(j))]dZ(j)+C=KL[ϕ^(X,Z(j))∣∣Qj(Z(j))]
    最终:
    Q j ^ ( Z ( j ) ) = arg ⁡ max ⁡ Q ( Z ) L [ Q ( Z ) ] = arg ⁡ max ⁡ Q ( Z ) { − K L [ ϕ ^ ( X , Z ( j ) ) ∣ ∣ Q j ( Z ( j ) ) ] } \begin{aligned} \hat{\mathcal Q_j}(\mathcal Z^{(j)}) & = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \mathcal L[\mathcal Q(\mathcal Z)] \\ & = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \left\{- \mathcal K\mathcal L \left[\hat \phi(\mathcal X,\mathcal Z^{(j)})||Q_j(\mathcal Z^{(j)})\right]\right\} \end{aligned} Qj^(Z(j))=Q(Z)argmaxL[Q(Z)]=Q(Z)argmax{KL[ϕ^(X,Z(j))∣∣Qj(Z(j))]}
    因此,基于平均场假设 Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj(Z(j))的最优解即:
    Q j ^ ( Z ( j ) ) = ϕ ^ ( X , Z ( j ) ) ( j ∈ { 1 , 2 , ⋯   , M } ) \hat {\mathcal Q_j}(\mathcal Z^{(j)}) = \hat \phi(\mathcal X,\mathcal Z^{(j)}) (j \in \{1,2,\cdots,\mathcal M\}) Qj^(Z(j))=ϕ^(X,Z(j))(j{1,2,,M})
    从而最终求得最优解 Q ^ ( Z ) \hat {\mathcal Q}(\mathcal Z) Q^(Z)
    Q ^ ( Z ) = ∏ j = 1 M Q j ( Z ( j ) ) \hat {\mathcal Q}(\mathcal Z) = \prod_{j=1}^{\mathcal M} \mathcal Q_j(\mathcal Z^{(j)}) Q^(Z)=j=1MQj(Z(j))

基于平均场假设,关于 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(ZX)的近似求解推导结束。下一节将介绍变分推断与EM算法之间的关联关系

相关参考:
变分法:理解变分
机器学习-变分推断2(公式推导)

你可能感兴趣的:(机器学习,机器学习,变分推断,推断,平均场假设,概率密度积分)