上一节介绍了分别从频率角度和贝叶斯角度认识机器学习问题,并介绍了推断(Inference)在整个贝叶斯角度的重要作用。本节将正式介绍确定性近似推断的代表方法——变分推断(Variational Inference)
提示:本节推导过程与EM算法部分存在相似之处,请对比食用,谢谢。
关于从贝叶斯角度认识问题,本质上是给定样本集合 X \mathcal X X,针对陌生数据 x ^ \hat x x^的预测问题,即 P ( x ^ ∣ X ) P(\hat x \mid \mathcal X) P(x^∣X)。
基于上述逻辑,贝叶斯角度的具体做法是:针对样本集合 X \mathcal X X构建模型,通过模型参数 θ \theta θ作为样本集合 X \mathcal X X与陌生数据 x ^ \hat x x^之间构建关系的桥梁,将 P ( x ^ ∣ X ) P(\hat x \mid \mathcal X) P(x^∣X)表示为如下形式:
P ( x ^ ∣ X ) = ∫ θ P ( x ^ , θ ∣ X ) d θ = ∫ θ P ( x ^ ∣ θ ) ⋅ P ( θ ∣ X ) d θ \begin{aligned} P(\hat x \mid \mathcal X) & = \int_{\theta} P(\hat x ,\theta \mid \mathcal X) d\theta \\ & = \int_{\theta} P(\hat x \mid \theta) \cdot P(\theta \mid \mathcal X)d\theta \end{aligned} P(x^∣X)=∫θP(x^,θ∣X)dθ=∫θP(x^∣θ)⋅P(θ∣X)dθ
基于上述公式,使用贝叶斯定理求解 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θ∣X):
P ( θ ∣ X ) = P ( X ∣ θ ) ⋅ P ( θ ) P ( X ) \begin{aligned} P(\theta \mid \mathcal X) & = \frac{P(\mathcal X \mid \theta) \cdot P(\theta)}{P(\mathcal X)} \end{aligned} P(θ∣X)=P(X)P(X∣θ)⋅P(θ)
至此,关于求解 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θ∣X)的过程称为推断;
关于样本数据的边缘概率分布 P ( X ) P(\mathcal X) P(X)可看成一个积分操作:
引入‘隐变量’。
P ( X ) = ∫ Z P ( X ∣ Z ) ⋅ P ( Z ) d Z = ∫ z 1 ⋯ ∫ z K P ( X ∣ Z ) ⋅ P ( Z ) d z , ⋯ , z K \begin{aligned} P(\mathcal X) & = \int_{\mathcal Z} P(\mathcal X \mid \mathcal Z) \cdot P(\mathcal Z)d\mathcal Z \\ & = \int_{z_1} \cdots \int_{z_{\mathcal K}} P(\mathcal X \mid \mathcal Z) \cdot P(\mathcal Z) d z_,\cdots,z_{\mathcal K} \end{aligned} P(X)=∫ZP(X∣Z)⋅P(Z)dZ=∫z1⋯∫zKP(X∣Z)⋅P(Z)dz,⋯,zK
如果引入的隐变量 Z = ( z 1 , ⋯ , z K ) T \mathcal Z = (z_1,\cdots,z_{\mathcal K})^{T} Z=(z1,⋯,zK)T中的维度 K \mathcal K K过高,导致 P ( X ) P(\mathcal X) P(X)积分困难,最终使得 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θ∣X)无法求解。
因此,需要使用一些方法近似求解 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θ∣X)。而变分推断(Variational Inference,VI)就是 近似推断中,确定性近似的代表方法。
近似推断(Approximate Inference)的核心观点是针对 ∫ Z P ( X ∣ Z ) ⋅ P ( Z ) d Z \int_{\mathcal Z} P(\mathcal X \mid Z) \cdot P(\mathcal Z)d\mathcal Z ∫ZP(X∣Z)⋅P(Z)dZ积分困难的问题,通过找出一个关于隐变量 Z \mathcal Z Z的概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)去逼近后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)。即:
Q ( Z ) ≈ P ( Z ∣ X ) \mathcal Q(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) Q(Z)≈P(Z∣X)
这里定义:
因为‘隐变量’本身就不真实存在,它只是一个‘表达’概率模型的中间环节。因此,‘隐变量 + 模型参数’合并在一起是合理的。
该定义与EM算法中的定义式略有区分的,EM算法中,隐变量是隐变量,参数是参数。
在EM算法中,底层逻辑是使用极大似然估计(Maximum Likelihood Estimate,MLE)进行求解,并且求解的是模型参数 θ \theta θ;
在变分推断求解过程中,模型参数 θ \theta θ合并进隐变量 Z \mathcal Z Z中。这里依然从概率模型 P ( X ) P(\mathcal X) P(X)入手,执行推导过程:
条件概率公式~为方便推导过程,依然保留‘log函数’。
log P ( X ) = log [ P ( X , Z ) P ( Z ∣ X ) ] = log P ( X , Z ) − log P ( Z ∣ X ) \begin{aligned} \log P(\mathcal X) & = \log \left[\frac{P(\mathcal X,\mathcal Z)}{P(\mathcal Z \mid \mathcal X)}\right] \\ & = \log P(\mathcal X,\mathcal Z) - \log P(\mathcal Z \mid \mathcal X) \end{aligned} logP(X)=log[P(Z∣X)P(X,Z)]=logP(X,Z)−logP(Z∣X)
和EM算法推导思路相同,引入一个关于隐变量的概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)。则有:
log P ( X ) = log P ( X , Z ) − Q ( Z ) − [ log P ( Z ∣ X ) − Q ( Z ) ] = log [ P ( X , Z ) Q ( Z ) ] − log [ P ( Z ∣ X ) Q ( Z ) ] \begin{aligned} \log P(\mathcal X) & = \log P(\mathcal X,\mathcal Z) - \mathcal Q(\mathcal Z) - [\log P(\mathcal Z \mid \mathcal X) - \mathcal Q(\mathcal Z)] \\ & = \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] - \log \left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] \end{aligned} logP(X)=logP(X,Z)−Q(Z)−[logP(Z∣X)−Q(Z)]=log[Q(Z)P(X,Z)]−log[Q(Z)P(Z∣X)]
等式两端分别对 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)求积分:
在EM算法中定义 ∫ Z Q ( Z ) ⋅ log [ P ( X , Z ) Q ( Z ) ] d Z \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z ∫ZQ(Z)⋅log[Q(Z)P(X,Z)]dZ为 证据下界(Evidence Lower Bound,ELBO);
− ∫ Z Q ( Z ) ⋅ [ P ( Z ∣ X ) Q ( Z ) ] d Z - \int_{\mathcal Z}\mathcal Q(\mathcal Z) \cdot \left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z −∫ZQ(Z)⋅[Q(Z)P(Z∣X)]dZ是关于 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)和隐变量 Z \mathcal Z Z的后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)的 K L \mathcal K\mathcal L KL散度;
因此, log P ( X ) \log P(\mathcal X) logP(X)将表示如下:
log P ( X ) = E L B O + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \log P(\mathcal X) = ELBO + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] logP(X)=ELBO+KL[Q(Z)∣∣P(Z∣X)]
将概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)看成一个变量,并将 E L B O ELBO ELBO部分定义成一个关于 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)的函数: L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]:
称
L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]为
Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)的变分。
log P ( X ) = L [ Q ( Z ) ] + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \log P(\mathcal X) = \mathcal L[\mathcal Q(\mathcal Z)] + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] logP(X)=L[Q(Z)]+KL[Q(Z)∣∣P(Z∣X)]
观察上式,根据数学定义, K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] ≥ 0 \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] \geq 0 KL[Q(Z)∣∣P(Z∣X)]≥0恒成立,则有:
log P ( X ) ≥ L [ Q ( Z ) ] \log P(\mathcal X) \geq \mathcal L[\mathcal Q(\mathcal Z)] logP(X)≥L[Q(Z)]
换句话说, L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]存在上界,并且 上界结果是 log P ( X ) \log P(\mathcal X) logP(X)。
当观测变量 X \mathcal X X给定的条件下, log P ( X ) \log P(\mathcal X) logP(X)可看作固定常数;因此 L [ Q ( Z ) ] + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal L[\mathcal Q(\mathcal Z)] + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] L[Q(Z)]+KL[Q(Z)∣∣P(Z∣X)]结果也是固定常数。
在上述条件下, L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]在什么情况下能够取得最大值?自然是 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]越趋近于0,那么 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]越趋近于上界 log P ( X ) \log P(\mathcal X) logP(X)。
而 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]本身数学性质表示概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)与概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)的关联关系:
Q ( Z ) ≈ P ( Z ∣ X ) ⇒ K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] → 0 \mathcal Q(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) \Rightarrow \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] \to 0 Q(Z)≈P(Z∣X)⇒KL[Q(Z)∣∣P(Z∣X)]→0
至此,逻辑关系已经找到:
目标:找到一个与隐变量 Z \mathcal Z Z的后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)近似的分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z),替代 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)参与运算;
从而将寻找近似分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)转化为最大化 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]的问题。
基于上述问题转化,基于隐变量的最优分布 Q ^ ( Z ) \hat {\mathcal Q}(\mathcal Z) Q^(Z)可进行如下表示:
Q ^ ( Z ) = arg max Q ( Z ) L [ Q ( Z ) ] ⇒ Q ^ ( Z ) ≈ P ( Z ∣ X ) \hat {\mathcal Q}(\mathcal Z) = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \mathcal L[\mathcal Q(\mathcal Z)] \Rightarrow \hat {\mathcal Q}(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) Q^(Z)=Q(Z)argmaxL[Q(Z)]⇒Q^(Z)≈P(Z∣X)
对概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)进行假设:平均场假设(Mean Theory)。
上面定义 Z \mathcal Z Z是隐变量和模型参数 的统称。这里关于 Z \mathcal Z Z数量的统计是指 概率图中状态变量的数量。假设状态变量的数量是 K \mathcal K K个,并将其划分成 M \mathcal M M个组,每个组中的 状态变量之间不存在重复。
数学符号表达如下:
Q ( Z ) = ∏ i = 1 M Q i ( Z ( i ) ) = Q 1 ( Z ( 1 ) ) ⋅ Q 2 ( Z ( 2 ) ) ⋯ Q M ( Z ( M ) ) \begin{aligned} \mathcal Q(\mathcal Z) & = \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \\ & = \mathcal Q_1(\mathcal Z^{(1)}) \cdot \mathcal Q_2(\mathcal Z^{(2)}) \cdots \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \end{aligned} Q(Z)=i=1∏MQi(Z(i))=Q1(Z(1))⋅Q2(Z(2))⋯QM(Z(M))
其中, Z ( i ) \mathcal Z^{(i)} Z(i)表示第 i i i个子集合包含的 状态变量的具体结果,而 Q i ( Z ( i ) ) \mathcal Q_i(\mathcal Z^{(i)}) Qi(Z(i))表示向量 Z ( i ) \mathcal Z^{(i)} Z(i)对应的概率分布。也可以理解为:第 i i i个子集合所包含状态变量的联合概率分布。
表示除去编号
j j j子集合后剩余子集合的概率分布。
将 Q ( Z ) = ∏ i = 1 M Q i ( Z ( i ) ) \mathcal Q(\mathcal Z) =\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) Q(Z)=∏i=1MQi(Z(i))带入 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]中:
L [ Q ( Z ) ] = ∫ Z Q ( Z ) ⋅ log [ P ( X , Z ) Q ( Z ) ] d Z = ∫ Z Q ( Z ) log P ( X , Z ) d Z − ∫ Z Q ( Z ) log Q ( Z ) d Z \begin{aligned} \mathcal L[\mathcal Q(\mathcal Z)] & = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z \\ & = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log P(\mathcal X,\mathcal Z)d \mathcal Z - \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \mathcal Q(\mathcal Z) d\mathcal Z \end{aligned} L[Q(Z)]=∫ZQ(Z)⋅log[Q(Z)P(X,Z)]dZ=∫ZQ(Z)logP(X,Z)dZ−∫ZQ(Z)logQ(Z)dZ
带入第一项:
按照上述固定要求,将
Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj(Z(j))提出来。
∫ Z ∏ i = 1 M [ Q i ( Z ( i ) ) ⋅ log P ( X , Z ) ] d Z ( 1 ) , ⋯ , d Z ( M ) = ∫ Z ( j ) Q j ( Z ( j ) ) ⋅ { ∫ Z ( 1 ) ⋯ ∫ Z ( j − 1 ) ∫ Z ( j + 1 ) ⋯ ∫ Z ( M ) ∏ i ≠ j M [ Q i ( Z ( i ) ) ⋅ log P ( X , Z ) ] d Z ( 1 ) , ⋯ , Z ( j − 1 ) , Z ( j + 1 ) , ⋯ , Z ( M ) } d Z ( j ) \begin{aligned} & \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)}) \cdot \log P(\mathcal X,\mathcal Z) \right] d\mathcal Z^{(1)},\cdots,d\mathcal Z^{(\mathcal M)} \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \cdot \left\{\int_{\mathcal Z^{(1)}}\cdots \int_{\mathcal Z^{(j-1)}} \int_{\mathcal Z^{(j+1)}}\cdots\int_{\mathcal Z^{(\mathcal M)}} \prod_{i \neq j}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)})\cdot \log P(\mathcal X,\mathcal Z)\right] d\mathcal Z^{(1)},\cdots,\mathcal Z^{(j-1)},\mathcal Z^{(j+1)},\cdots,\mathcal Z^{(\mathcal M)}\right\} d \mathcal Z^{(j)} \end{aligned} ∫Zi=1∏M[Qi(Z(i))⋅logP(X,Z)]dZ(1),⋯,dZ(M)=∫Z(j)Qj(Z(j))⋅⎩ ⎨ ⎧∫Z(1)⋯∫Z(j−1)∫Z(j+1)⋯∫Z(M)i=j∏M[Qi(Z(i))⋅logP(X,Z)]dZ(1),⋯,Z(j−1),Z(j+1),⋯,Z(M)⎭ ⎬ ⎫dZ(j)
观察上述大括号中的项,我们可以将其看做期望形式:
∫ Z ( 1 ) ⋯ ∫ Z ( j − 1 ) ∫ Z ( j + 1 ) ⋯ ∫ Z ( M ) ∏ i ≠ j M [ Q i ( Z ( i ) ) ⋅ log P ( X , Z ) ] d Z ( 1 ) , ⋯ , Z ( j − 1 ) , Z ( j + 1 ) , ⋯ , Z ( M ) = E ∏ i ≠ j M Q i ( Z ( i ) ) [ log P ( X , Z ) ] \begin{aligned} & \int_{\mathcal Z^{(1)}}\cdots \int_{\mathcal Z^{(j-1)}} \int_{\mathcal Z^{(j+1)}}\cdots\int_{\mathcal Z^{(\mathcal M)}} \prod_{i \neq j}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)})\cdot \log P(\mathcal X,\mathcal Z)\right] d\mathcal Z^{(1)},\cdots,\mathcal Z^{(j-1)},\mathcal Z^{(j+1)},\cdots,\mathcal Z^{(\mathcal M)} \\ & = \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] \end{aligned} ∫Z(1)⋯∫Z(j−1)∫Z(j+1)⋯∫Z(M)i=j∏M[Qi(Z(i))⋅logP(X,Z)]dZ(1),⋯,Z(j−1),Z(j+1),⋯,Z(M)=E∏i=jMQi(Z(i))[logP(X,Z)]
至此,第一项表示为:
∫ Z j Q j ( Z ( j ) ) E ∏ i ≠ j M Q i ( Z ( i ) ) [ log P ( X , Z ) ] d Z j \int_{\mathcal Z_j} \mathcal Q_j(\mathcal Z^{(j)}) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] d \mathcal Z_j ∫ZjQj(Z(j))E∏i=jMQi(Z(i))[logP(X,Z)]dZj
带入第二项:
log部分改成求和方式~
∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ∏ i = 1 M Q i ( Z ( i ) ) d Z = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ ∑ i = 1 M log Q i ( Z ( i ) ) d Z \begin{aligned} & \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z \\ & = \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \sum_{i=1}^{\mathcal M} \log\mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z \end{aligned} ∫Zi=1∏MQi(Z(i))⋅logi=1∏MQi(Z(i))dZ=∫Zi=1∏MQi(Z(i))⋅i=1∑MlogQi(Z(i))dZ
继续展开(连加部分展开):
= ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ [ log Q 1 ( Z ( 1 ) ) + log Q 2 ( Z ( 2 ) ) + ⋯ + log Q M ( Z ( M ) ) ] d Z = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log Q 1 ( Z ( 1 ) ) d Z + ⋯ + ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log Q M ( Z ( M ) ) d Z \begin{aligned} & = \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \left[\log \mathcal Q_1(\mathcal Z^{(1)}) + \log \mathcal Q_2(\mathcal Z^{(2)}) + \cdots + \log \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \right] d\mathcal Z \\ & = \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z + \cdots + \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) d\mathcal Z \end{aligned} =∫Zi=1∏MQi(Z(i))⋅[logQ1(Z(1))+logQ2(Z(2))+⋯+logQM(Z(M))]dZ=∫Zi=1∏MQi(Z(i))⋅logQ1(Z(1))dZ+⋯+∫Zi=1∏MQi(Z(i))⋅logQM(Z(M))dZ
其他项不是需要求解的内容,视为常数。用
C \mathcal C C表示。
至此,观察第一项和第二项的结果:
∫ Z j Q j ( Z ( j ) ) E ∏ i ≠ j M Q i ( Z ( i ) ) [ log P ( X , Z ) ] d Z ( j ) ∫ Z ( j ) Q j ( Z ( j ) ) log Q j ( Z ( j ) ) d Z ( j ) + C \int_{\mathcal Z_j} \mathcal Q_j(\mathcal Z^{(j)}) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] d \mathcal Z^{(j)} \\ \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} + \mathcal C ∫ZjQj(Z(j))E∏i=jMQi(Z(i))[logP(X,Z)]dZ(j)∫Z(j)Qj(Z(j))logQj(Z(j))dZ(j)+C
这两项很相似,它们有共同的积分 ∫ Z ( j ) \int_{\mathcal Z^{(j)}} ∫Z(j),第一个乘数项均是 Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj(Z(j)),因此,为了能够使两项合并,设定 E ∏ i ≠ j M Q i ( Z ( i ) ) [ log P ( X , Z ) ] \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] E∏i=jMQi(Z(i))[logP(X,Z)]是关于 X , Z ( j ) \mathcal X,\mathcal Z^{(j)} X,Z(j)的函数。即:
首先,
Z \mathcal Z Z中包含
Z ( j ) \mathcal Z^{(j)} Z(j);
其次,由于除了
Z ( j ) \mathcal Z^{(j)} Z(j)之外,其余变量均被固定住,视作常量。因此则有
ϕ ^ ( X , Z ( j ) ) \hat \phi(\mathcal X,\mathcal Z^{(j)}) ϕ^(X,Z(j))的表达。
E ∏ i ≠ j M Q i ( Z ( i ) ) [ log P ( X , Z ) ] = log ϕ ^ ( X , Z ( j ) ) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] = \log \hat \phi(\mathcal X,\mathcal Z^{(j)}) E∏i=jMQi(Z(i))[logP(X,Z)]=logϕ^(X,Z(j))
此时,两项合并的结果如下:
∫ Z ( j ) Q j ( Z ( j ) ) log ϕ ^ ( X , Z ( j ) ) d Z ( j ) − ∫ Z ( j ) Q j ( Z ( j ) ) log Q j ( Z ( j ) ) d Z ( j ) = ∫ Z ( j ) Q j ( Z ( j ) ) [ log ϕ ^ ( X , Z ( j ) ) − log Q j ( Z ( j ) ) ] = ∫ Z ( j ) Q j ( Z ( j ) ) log [ ϕ ^ ( X , Z ( j ) ) Q j ( Z ( j ) ) ] d Z ( j ) + C = − K L [ ϕ ^ ( X , Z ( j ) ) ∣ ∣ Q j ( Z ( j ) ) ] \begin{aligned} & \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \hat \phi(\mathcal X,\mathcal Z^{(j)}) d\mathcal Z^{(j)} - \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \left[\log \hat \phi(\mathcal X,\mathcal Z^{(j)}) - \log \mathcal Q_j(\mathcal Z^{(j)})\right] \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \left[\frac{\hat \phi(\mathcal X,\mathcal Z^{(j)})}{\mathcal Q_j(\mathcal Z^{(j)})}\right]d \mathcal Z^{(j)} +\mathcal C \\ & = - \mathcal K\mathcal L \left[\hat \phi(\mathcal X,\mathcal Z^{(j)})||Q_j(\mathcal Z^{(j)})\right] \end{aligned} ∫Z(j)Qj(Z(j))logϕ^(X,Z(j))dZ(j)−∫Z(j)Qj(Z(j))logQj(Z(j))dZ(j)=∫Z(j)Qj(Z(j))[logϕ^(X,Z(j))−logQj(Z(j))]=∫Z(j)Qj(Z(j))log[Qj(Z(j))ϕ^(X,Z(j))]dZ(j)+C=−KL[ϕ^(X,Z(j))∣∣Qj(Z(j))]
最终:
Q j ^ ( Z ( j ) ) = arg max Q ( Z ) L [ Q ( Z ) ] = arg max Q ( Z ) { − K L [ ϕ ^ ( X , Z ( j ) ) ∣ ∣ Q j ( Z ( j ) ) ] } \begin{aligned} \hat{\mathcal Q_j}(\mathcal Z^{(j)}) & = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \mathcal L[\mathcal Q(\mathcal Z)] \\ & = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \left\{- \mathcal K\mathcal L \left[\hat \phi(\mathcal X,\mathcal Z^{(j)})||Q_j(\mathcal Z^{(j)})\right]\right\} \end{aligned} Qj^(Z(j))=Q(Z)argmaxL[Q(Z)]=Q(Z)argmax{−KL[ϕ^(X,Z(j))∣∣Qj(Z(j))]}
因此,基于平均场假设的 Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj(Z(j))的最优解即:
Q j ^ ( Z ( j ) ) = ϕ ^ ( X , Z ( j ) ) ( j ∈ { 1 , 2 , ⋯ , M } ) \hat {\mathcal Q_j}(\mathcal Z^{(j)}) = \hat \phi(\mathcal X,\mathcal Z^{(j)}) (j \in \{1,2,\cdots,\mathcal M\}) Qj^(Z(j))=ϕ^(X,Z(j))(j∈{1,2,⋯,M})
从而最终求得最优解 Q ^ ( Z ) \hat {\mathcal Q}(\mathcal Z) Q^(Z):
Q ^ ( Z ) = ∏ j = 1 M Q j ( Z ( j ) ) \hat {\mathcal Q}(\mathcal Z) = \prod_{j=1}^{\mathcal M} \mathcal Q_j(\mathcal Z^{(j)}) Q^(Z)=j=1∏MQj(Z(j))
基于平均场假设,关于 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)的近似求解推导结束。下一节将介绍变分推断与EM算法之间的关联关系。
相关参考:
变分法:理解变分
机器学习-变分推断2(公式推导)