The previous section introduced how Markov chain Monte Carlo (MCMC) methods handle the intractable probability distributions that arise when computing the parameter gradients of a Boltzmann machine. This section introduces the variational inference approach to the same gradient problem.
Compared with the restricted Boltzmann machine, the Boltzmann machine places much looser constraints on the dependencies among random variables: the observed variables are connected among themselves, and so are the latent variables. Taking the weight matrix $\mathcal W$ between observed and latent variables as an example, the log-likelihood gradient with respect to $\mathcal W$ can be written as:
$$\nabla_{\mathcal W} \left[\log \mathcal P(v^{(i)};\theta)\right] = \mathbb E_{\mathcal P_{data}} \left[v^{(i)}(h^{(i)})^T\right] - \mathbb E_{\mathcal P_{model}} \left[v^{(i)}(h^{(i)})^T\right]$$
Here $\mathcal P_{data}$ denotes the true (data) distribution. The underlying logic is that $N$ samples are drawn from an objectively existing probability model $\mathcal P_{data}(\mathcal V)$, forming the current sample set $\mathcal V = \{v^{(1)},v^{(2)},\cdots,v^{(N)}\}$.
However, the true distribution $\mathcal P_{data}$ here differs slightly from the $\mathcal P_{data}$ appearing in the positive phase of Partition Function: Stochastic Maximum Likelihood:
For the detailed derivation, see: Boltzmann Machine: Basic Introduction.
From that derivation one can see that in the Boltzmann machine, both the positive phase and the negative phase involve probability distributions containing latent variables:
$$\begin{aligned} \mathcal P_{data} &\Rightarrow \mathcal P_{model}(h^{(i)} \mid v^{(i)}) \\ \mathcal P_{model} &\Rightarrow \mathcal P_{model}(h^{(i)},v^{(i)}) \end{aligned}$$
Under the connectivity constraints of the Boltzmann machine, both $\mathcal P_{model}(h^{(i)},v^{(i)})$ and $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$ are extremely hard to compute or approximate directly.
In the early 1980s, before the concept of variational inference had been proposed, the log-likelihood gradient with respect to $\mathcal W$ was computed via Gibbs sampling. The core idea of that approach is to express the posterior probability of a single variable at a time (one observed or latent unit), rather than the joint posterior over all latent/observed variables.
For the derivation of these single-variable posterior probabilities, see Boltzmann Machine: Gradient Computation (MCMC).
$$\begin{aligned} \mathcal P(v_i^{(i)} \mid h^{(i)},v_{-i}^{(i)}) &= \begin{cases} \text{Sigmoid} \left\{\sum_{j=1}^{\mathcal P} \mathcal W_{ij} \cdot h_j^{(i)} + \sum_{k \neq i}^{\mathcal D} \mathcal L_{ik} \cdot v_k^{(i)}\right\} & v_i^{(i)} = 1\\ 1 - \text{Sigmoid} \left\{\sum_{j=1}^{\mathcal P} \mathcal W_{ij} \cdot h_j^{(i)} + \sum_{k \neq i}^{\mathcal D} \mathcal L_{ik} \cdot v_k^{(i)}\right\} & v_i^{(i)} = 0 \end{cases} \\ \mathcal P(h_j^{(i)} \mid v^{(i)},h_{-j}^{(i)}) &= \begin{cases} \text{Sigmoid} \left\{\sum_{i=1}^{\mathcal D} \mathcal W_{ij} \cdot v_i^{(i)} + \sum_{m \neq j}^{\mathcal P} \mathcal J_{jm} \cdot h_m^{(i)}\right\} & h_j^{(i)} = 1 \\ 1 - \text{Sigmoid} \left\{\sum_{i=1}^{\mathcal D} \mathcal W_{ij} \cdot v_i^{(i)} + \sum_{m \neq j}^{\mathcal P} \mathcal J_{jm} \cdot h_m^{(i)}\right\} & h_j^{(i)} = 0 \end{cases} \end{aligned}$$
Both of these conditional probabilities are tractable. During Gibbs sampling, all random variables other than the one currently being sampled are held fixed, the conditional distribution of that variable is computed, and a new value is drawn from it. Once every random variable has been resampled, one sweep is complete; after a sufficient number of sweeps, the chain reaches its stationary distribution.
Samples drawn from this stationary distribution can then directly approximate the parameter gradient $\nabla_{\mathcal W} \left[\log \mathcal P(v^{(i)};\theta)\right]$, bypassing the explicit evaluation of the positive-phase and negative-phase expectations.
If you understand this part differently, feel free to discuss it in the comments.
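To make the procedure above concrete, here is a minimal NumPy sketch of one Gibbs sweep over a general Boltzmann machine, assuming binary units and symmetric weight matrices $\mathcal W$ (visible-hidden, $\mathcal D \times \mathcal P$), $\mathcal L$ (visible-visible) and $\mathcal J$ (hidden-hidden) with zero diagonals. The function and variable names are illustrative, not from any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h, W, L, J, rng):
    """One Gibbs sweep over a Boltzmann machine with visible units v (D,) and
    hidden units h (P,), using the single-variable conditionals shown above.
    W: (D, P) visible-hidden weights, L: (D, D) visible-visible,
    J: (P, P) hidden-hidden; L and J are symmetric with zero diagonal."""
    D, P = W.shape
    # Resample each visible unit given all other units.
    for i in range(D):
        activation = W[i, :] @ h + L[i, :] @ v - L[i, i] * v[i]
        v[i] = rng.random() < sigmoid(activation)
    # Resample each hidden unit given all other units.
    for j in range(P):
        activation = W[:, j] @ v + J[j, :] @ h - J[j, j] * h[j]
        h[j] = rng.random() < sigmoid(activation)
    return v, h

# Usage: run several sweeps, then use the (v, h) samples to estimate the
# positive-phase / negative-phase expectations E[v h^T] in the gradient.
rng = np.random.default_rng(0)
D, P = 4, 3
W = rng.normal(scale=0.1, size=(D, P))
L = rng.normal(scale=0.1, size=(D, D)); L = (L + L.T) / 2; np.fill_diagonal(L, 0)
J = rng.normal(scale=0.1, size=(P, P)); J = (J + J.T) / 2; np.fill_diagonal(J, 0)
v = rng.integers(0, 2, size=D).astype(float)
h = rng.integers(0, 2, size=P).astype(float)
for _ in range(100):
    v, h = gibbs_sweep(v, h, W, L, J, rng)
```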
For large collections of random variables, however, the sampling time of that approach likewise grows at an exponential rate as the number of variables increases. The core of the method presented in this section is to use variational inference to approximate the posterior $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$ directly, thereby avoiding the earlier MCMC sampling procedure.
For the positive phase, $\mathcal P_{data} \Rightarrow \mathcal P_{data}(v^{(i)} \in \mathcal V) \cdot \mathcal P_{model}(h^{(i)} \mid v^{(i)})$, the earlier MCMC approach either had to approximate $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$ by sampling, or had to rely on the restricted Boltzmann machine's constraints to express $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$ in closed form through the $\text{Sigmoid}$ function. In this section, $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$ is instead described using variational inference (Variational Inference) under the mean-field assumption.
The core of variational inference for $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$ is to find a suitable distribution $\mathcal Q(h^{(i)} \mid v^{(i)})$ that approximates $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$. The underlying logic of variational inference is still maximum likelihood estimation:
Introduce the latent variable $h^{(i)}$ that the model associates with $v^{(i)}$.
$$\begin{aligned} \log \mathcal P(v^{(i)};\theta) & = \log \left[\frac{\mathcal P(v^{(i)},h^{(i)};\theta)}{\mathcal P(h^{(i)} \mid v^{(i)};\theta)}\right]\\ & = \log \mathcal P(v^{(i)},h^{(i)};\theta) - \log \mathcal P(h^{(i)} \mid v^{(i)};\theta) \end{aligned} \\ \begin{cases} \theta = \{\mathcal W,\mathcal L,\mathcal J\} \\ v^{(i)} \in \mathcal V = \{v^{(1)},v^{(2)},\cdots,v^{(N)}\} \end{cases}$$
On this basis, introduce the approximate distribution $\mathcal Q(h^{(i)}\mid v^{(i)};\phi)$, where $\phi$ denotes the parameters of this approximate distribution.
$$\begin{aligned} \log \mathcal P(v^{(i)};\theta) & = \left[\log \mathcal P(v^{(i)},h^{(i)};\theta) - \log \mathcal Q(h^{(i)}\mid v^{(i)};\phi)\right] - \left[\log \mathcal P(h^{(i)} \mid v^{(i)};\theta) - \log \mathcal Q(h^{(i)}\mid v^{(i)};\phi)\right] \\ & = \log \left[\frac{\mathcal P(v^{(i)},h^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right] - \log \left[\frac{\mathcal P(h^{(i)} \mid v^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right] \end{aligned}$$
Integrate both sides of the equation over $h^{(i)}$. Since this is a Boltzmann machine, every variable is a discrete random variable following a Bernoulli distribution, so the integral becomes a summation $\sum_{h^{(i)}}$.
Note again: it is the term carrying the negative sign that constitutes the $\mathcal K\mathcal L\text{ Divergence}$.
$$\begin{aligned}\text{Equation Left : } \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal P(v^{(i)};\theta) & = \log \mathcal P(v^{(i)};\theta) \cdot \underbrace{\sum_{h^{(i)}} \mathcal Q(h^{(i)}\mid v^{(i)};\phi)}_{=1} = \log \mathcal P(v^{(i)};\theta) \end{aligned}$$
$$\begin{aligned} \text{Equation Right :}&\quad \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left\{\log \left[\frac{\mathcal P(v^{(i)},h^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right] - \log \left[\frac{\mathcal P(h^{(i)} \mid v^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right]\right\} \\ & = \underbrace{\sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi)\log \left[\frac{\mathcal P(v^{(i)},h^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right]}_{\text{ELBO}} \underbrace{- \sum_{h^{(i)}}\mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \left[\frac{\mathcal P(h^{(i)} \mid v^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right]}_{\mathcal K\mathcal L\left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi) \,||\, \mathcal P(h^{(i)} \mid v^{(i)};\theta)\right]} \end{aligned}$$
With this, the Evidence Lower Bound (ELBO) can be written as follows:
The evidence lower bound (ELBO) is also referred to as the variational of $\mathcal Q(h^{(i)} \mid v^{(i)};\phi)$ and is denoted by $\mathcal L \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right]$.
$\mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] = - \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal Q(h^{(i)} \mid v^{(i)};\phi)$ denotes the entropy of the distribution $\mathcal Q(h^{(i)} \mid v^{(i)};\phi)$.
$$\begin{aligned} \text{ELBO} & = \mathcal L \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \\ & = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi)\log \left[\frac{\mathcal P(v^{(i)},h^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right] \\ & = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[\log \mathcal P(v^{(i)},h^{(i)};\theta) - \log \mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \\ & = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal P(v^{(i)},h^{(i)};\theta) - \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \\ & = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal P(v^{(i)},h^{(i)};\theta) + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \end{aligned}$$
The subsequent solution idea is: solve for the parameters $\phi$ of the approximate distribution so that the $\text{ELBO}$ is maximized. Since $\log \mathcal P(v^{(i)};\theta)$ does not depend on $\phi$, this is equivalent to pushing the $\mathcal K\mathcal L \text{ Divergence}$ toward $0$, making $\mathcal Q(h^{(i)} \mid v^{(i)};\phi)$ as close as possible to $\mathcal P(h^{(i)}\mid v^{(i)};\theta)$.
Thus the problem of finding the approximate distribution $\mathcal Q(h^{(i)}\mid v^{(i)};\phi)$ is transformed into finding the optimal parameters $\hat \phi$ that maximize the $\text{ELBO}$:
$$\hat \phi = \mathop{\arg\max}\limits_{\phi} \mathcal L \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right]$$
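The decomposition $\log \mathcal P(v^{(i)};\theta) = \text{ELBO} + \mathcal{KL}$ behind this objective can be verified numerically. The following sketch uses an arbitrary toy discrete joint distribution (not an actual Boltzmann machine) and an arbitrary $\mathcal Q$; it only illustrates that the identity holds for any choice of $\mathcal Q$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint P(v, h) for one fixed observed v and 3 binary hidden units (8 configs).
# Any positive table works; this is purely an illustration, not a Boltzmann machine.
n_h = 8
joint = rng.random(n_h)
joint /= joint.sum() * 2             # make the slice sum to 0.5, i.e. pretend P(v) = 0.5
p_v = joint.sum()                    # P(v) = sum_h P(v, h)
posterior = joint / p_v              # P(h | v)

q = rng.random(n_h); q /= q.sum()    # an arbitrary approximate distribution Q(h | v)

elbo = np.sum(q * np.log(joint / q))        # sum_h Q log[P(v,h) / Q]
kl = np.sum(q * np.log(q / posterior))      # KL(Q || P(h|v))
print(np.log(p_v), elbo + kl)               # the two numbers agree
```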
As in the earlier introduction to variational inference under the mean-field assumption, the mean-field assumption on $\mathcal Q(h^{(i)}\mid v^{(i)};\phi)$ partitions $h^{(i)} = \left(h_1^{(i)},h_2^{(i)},\cdots,h_{\mathcal P}^{(i)}\right)^T$ into several mutually independent subsets. Because of this independence, the posterior distribution can be written as a product of the posteriors of the individual subsets:
Since $h^{(i)}$ contains $\mathcal P$ random variables in total, the number of subsets is taken to be $\mathcal P$ here, i.e. each subset contains exactly $1$ random variable.
$$\mathcal Q(h^{(i)} \mid v^{(i)};\phi) = \prod_{j=1}^{\mathcal P} \mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)$$
Since each $h_j^{(i)}\;(j=1,2,\cdots,\mathcal P)$ follows a Bernoulli distribution, introduce the following notation for the distribution $\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)$:
$$\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi) = \begin{cases} \mathcal Q(h_j^{(i)}=1 \mid v^{(i)};\phi) = \phi_j \\ \mathcal Q(h_j^{(i)}=0 \mid v^{(i)};\phi) = 1- \phi_j \end{cases}$$
Strictly speaking, $\phi_j$ is not a parameter but merely a real number describing a probability; still, once $\phi_j$ has been determined, $\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)$ is determined as well.
Therefore the model parameter $\phi$ can also be viewed as the collection of the probability information of the individual random variables: $\{\phi_1,\phi_2,\cdots,\phi_{\mathcal P}\}$.
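As a small illustration, under this mean-field assumption the whole approximate posterior is specified by the vector $\phi = (\phi_1,\cdots,\phi_{\mathcal P})$. Below is a minimal sketch of how $\mathcal Q(h^{(i)} \mid v^{(i)};\phi)$ is evaluated; the names and numerical values are arbitrary.

```python
import numpy as np

def q_prob(h, phi):
    """Probability of a binary hidden configuration h (shape (P,)) under the fully
    factorized Bernoulli distribution Q(h | v; phi) = prod_j phi_j^{h_j} (1 - phi_j)^{1 - h_j}."""
    return np.prod(np.where(h == 1, phi, 1.0 - phi))

phi = np.array([0.9, 0.2, 0.6])      # one phi_j per hidden unit
h = np.array([1, 0, 1])
print(q_prob(h, phi))                # 0.9 * 0.8 * 0.6
```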
At this point, the variational inference target $\hat \phi$ has been decomposed into $\mathcal P$ mutually independent probability values $\hat {\phi}_j\;(j=1,2,\cdots,\mathcal P)$:
Each $\hat {\phi}_j\;(j=1,2,\cdots,\mathcal P)$ has to be solved for.
$$\hat {\phi}_j = \mathop{\arg\max}\limits_{\phi_j} \mathcal L \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right]$$
Substitute the expanded form of the $\text{ELBO}$, and expand the Boltzmann machine joint distribution (see Boltzmann Machine: Probability Density Function for a review)
$$\mathcal P(v^{(i)},h^{(i)};\theta) = \frac{1}{\mathcal Z} \exp \left\{(v^{(i)})^T\mathcal W\cdot h^{(i)} + \frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)} + \frac{1}{2} (h^{(i)})^T\mathcal J \cdot h^{(i)}\right\}$$
inside it. The $\log$ and the $\exp$ cancel each other.
$$\begin{aligned} \hat {\phi}_j & = \mathop{\arg\max}\limits_{\phi_j} \left\{\sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal P(v^{(i)},h^{(i)};\theta) + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right]\right\} \\ & = \mathop{\arg\max}\limits_{\phi_j} \left\{\sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[-\log \mathcal Z + (v^{(i)})^T\mathcal W\cdot h^{(i)} + \frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)} + \frac{1}{2} (h^{(i)})^T\mathcal J \cdot h^{(i)}\right] + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right]\right\} \end{aligned}$$
Split the terms inside the brackets into two parts: those containing $h^{(i)}$ and those not containing $h^{(i)}$:
$$\begin{cases} \Delta_1 = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[-\log \mathcal Z + \frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)}\right] \\ \Delta_2 = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[(v^{(i)})^T\mathcal W\cdot h^{(i)} + \frac{1}{2} (h^{(i)})^T\mathcal J \cdot h^{(i)}\right] \end{cases} \\ \hat {\phi_j} = \mathop{\arg\max}\limits_{\phi_j} \left\{\Delta_1 + \Delta_2 + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \right\}$$
Simplify $\Delta_1$:
Clearly, $-\log \mathcal Z + \frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)}$ has no dependence on $h^{(i)}$ and can be pulled out of the sum as a constant, while $\sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi)$ is just the total probability of a distribution over its support, which equals $1$.
$$\begin{aligned} \Delta_1 & = \left[-\log \mathcal Z + \frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)}\right] \underbrace{\sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi)}_{=1} \\ & = -\log \mathcal Z + \frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)} \end{aligned}$$
Meanwhile, $\mathcal Z = \sum_{h^{(i)},v^{(i)}}\exp\{-\mathbb E(v^{(i)},h^{(i)})\}$ is the partition function and has no dependence on $\phi_j$ (the partition function sums out $h^{(i)}$ entirely); likewise $\frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)}$ does not depend on $\phi_j$ ($\phi_j$ describes the posterior probability of $h_j^{(i)}$, and this term contains no latent variables). Therefore, when solving for the optimal $\hat {\phi}_j$, the entire $\Delta_1$ term can be dropped.
$$\hat {\phi_j} = \mathop{\arg\max}\limits_{\phi_j} \left\{ \Delta_2 + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \right\}$$
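A quick numerical sanity check of why $\Delta_1$ can be dropped: for a tiny Boltzmann machine with arbitrary parameters, evaluating $\Delta_1$, $\Delta_2$ and the entropy by brute-force enumeration shows that only $\Delta_2$ and $\mathcal H$ change when $\phi$ changes. This is only a sketch; the model sizes and parameter values are assumptions made for illustration.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
D, P = 3, 3                                        # tiny sizes chosen for illustration
W = rng.normal(scale=0.5, size=(D, P))
L = rng.normal(scale=0.5, size=(D, D)); L = (L + L.T) / 2; np.fill_diagonal(L, 0)
J = rng.normal(scale=0.5, size=(P, P)); J = (J + J.T) / 2; np.fill_diagonal(J, 0)
v = np.array([1.0, 0.0, 1.0])                      # one observed sample v^(i)

def neg_energy(v, h):
    # exponent of the Boltzmann joint: v^T W h + 1/2 v^T L v + 1/2 h^T J h
    return v @ W @ h + 0.5 * v @ L @ v + 0.5 * h @ J @ h

log_Z = np.log(sum(np.exp(neg_energy(np.array(vc, float), np.array(hc, float)))
                   for vc in product([0, 1], repeat=D)
                   for hc in product([0, 1], repeat=P)))

def q_prob(h, phi):
    return np.prod(np.where(h == 1, phi, 1.0 - phi))

def delta1_delta2_entropy(phi):
    d1 = d2 = H = 0.0
    for hc in product([0, 1], repeat=P):
        h = np.array(hc, float)
        q = q_prob(h, phi)
        d1 += q * (-log_Z + 0.5 * v @ L @ v)       # Delta_1
        d2 += q * (v @ W @ h + 0.5 * h @ J @ h)    # Delta_2
        H -= q * np.log(q)                         # entropy H[Q]
    return d1, d2, H

for phi in (np.array([0.2, 0.5, 0.8]), np.array([0.9, 0.1, 0.4])):
    print(delta1_delta2_entropy(phi))   # Delta_1 is identical for both phi; Delta_2 and H differ
```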
The idea for what follows: since this is a maximization problem, take the partial derivative of $\Delta_2 + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right]$ with respect to $\phi_j$. If the derivative exists, set it to $0$ and solve for the extremum; if it does not, an approximate optimum can be found with gradient ascent.
Expand the expression above and split it into the following three parts:
$$\begin{aligned} \Delta_2 + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] & = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[(v^{(i)})^T\mathcal W\cdot h^{(i)} + \frac{1}{2} (h^{(i)})^T\mathcal J \cdot h^{(i)}\right] + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \\ & = \Lambda_1 + \Lambda_2 + \Lambda_3 \\ & \begin{cases} \Lambda_1 = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[(v^{(i)})^T\mathcal W\cdot h^{(i)}\right] \\ \Lambda_2 = \frac{1}{2}\sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[(h^{(i)})^T\mathcal J \cdot h^{(i)}\right] \\ \Lambda_3 = \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \end{cases} \end{aligned}$$
Simplify $\Lambda_1$: first expand $\Lambda_1$ further so that the individual components $h_l^{(i)}$ appear explicitly:
Two parts need to be expanded: $\mathcal Q(h^{(i)} \mid v^{(i)};\phi)$ is expanded using the mean-field assumption, and the matrix product $(v^{(i)})^T\mathcal W\cdot h^{(i)}$ is written out as a double sum.
$$\Lambda_1 = \sum_{h^{(i)}} \prod_{l=1}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi) \cdot \sum_{i=1}^{\mathcal D}\sum_{l=1}^{\mathcal P} v_i^{(i)} \cdot \mathcal W_{il} \cdot h_l^{(i)}$$
Clearly the number of terms involved is large ($\mathcal D \times \mathcal P$ terms, combined through multiplications and additions). Take the first term $v_1^{(i)} \cdot \mathcal W_{11} \cdot h_1^{(i)}$ as an example and check whether it can be simplified further:
Separate $\mathcal Q(h_1^{(i)} \mid v^{(i)};\phi)$ out of $\prod_{l=1}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi)$, and separate $\sum_{h_1^{(i)}}$ out of $\sum_{h^{(i)}}$.
This step is in fact handled in exactly the same way as in the derivation of mean-field variational inference.
$$\begin{aligned} & \quad \sum_{h^{(i)}} \prod_{l=1}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi) \cdot \left[v_1^{(i)} \cdot \mathcal W_{11} \cdot h_1^{(i)}\right] \\ & = \sum_{h_1^{(i)}} \mathcal Q(h_1^{(i)} \mid v^{(i)};\phi) \cdot \left[v_1^{(i)} \cdot \mathcal W_{11} \cdot h_1^{(i)}\right] \cdot \sum_{h_2^{(i)},\cdots,h_{\mathcal P}^{(i)}} \prod_{l = 2}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi) \\ & = \sum_{h_1^{(i)}} \mathcal Q(h_1^{(i)} \mid v^{(i)};\phi) \cdot \left[v_1^{(i)} \cdot \mathcal W_{11} \cdot h_1^{(i)}\right] \cdot \underbrace{\sum_{h_2^{(i)}}\mathcal Q(h_2^{(i)} \mid v^{(i)};\phi)}_{=1} \cdots \underbrace{\sum_{h_{\mathcal P}^{(i)}}\mathcal Q(h_{\mathcal P}^{(i)} \mid v^{(i)};\phi)}_{=1} \\ & = \sum_{h_1^{(i)}} \mathcal Q(h_1^{(i)} \mid v^{(i)};\phi) \cdot \left[v_1^{(i)} \cdot \mathcal W_{11} \cdot h_1^{(i)}\right] \end{aligned}$$
Since $h_1^{(i)}$ likewise follows a Bernoulli distribution, the expression simplifies further:
$$\mathcal Q(h_1^{(i)} = 1 \mid v^{(i)};\phi) \cdot \left[v_1^{(i)} \cdot \mathcal W_{11} \cdot 1\right] + 0 = \phi_1 \cdot v_1^{(i)} \cdot \mathcal W_{11}$$
All the other terms are handled in the same way. Thus $\Lambda_1$ simplifies to:
There are $\mathcal D \times \mathcal P$ terms in total, and each one is reduced in this manner.
$$\Lambda_1 = \sum_{i=1}^{\mathcal D}\sum_{l=1}^{\mathcal P} \phi_l \cdot v_i^{(i)} \cdot \mathcal W_{il}$$
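This closed form can be checked by brute force: under the factorized $\mathcal Q$, the expectation of $(v^{(i)})^T \mathcal W h^{(i)}$ equals $\sum_i \sum_l \phi_l \cdot v_i \cdot \mathcal W_{il}$, i.e. $v^T \mathcal W \phi$. Below is a sketch with arbitrary small sizes and random values.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
D, P = 3, 4
W = rng.normal(size=(D, P))
v = rng.integers(0, 2, size=D).astype(float)
phi = rng.random(P)                                   # phi_l = Q(h_l = 1 | v)

def q_prob(h, phi):
    return np.prod(np.where(h == 1, phi, 1.0 - phi))

# Brute-force expectation of v^T W h under the factorized Q
lambda1_bruteforce = sum(q_prob(np.array(hc, float), phi) * (v @ W @ np.array(hc, float))
                         for hc in product([0, 1], repeat=P))
# Closed form: sum_i sum_l phi_l * v_i * W_il
lambda1_closed = v @ W @ phi
print(lambda1_bruteforce, lambda1_closed)             # the two values match
```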
Simplify $\Lambda_2$:
The simplification of $\Lambda_2$ follows exactly the same idea as that of $\Lambda_1$, only it is slightly more involved because each term contains $2$ factors of $h$.
$$\Lambda_2 = \frac{1}{2}\sum_{h^{(i)}} \prod_{l=1}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi) \cdot \sum_{j=1}^{\mathcal P}\sum_{l=1}^{\mathcal P}h_j^{(i)} \cdot \mathcal J_{jl} \cdot h_l^{(i)}$$
Case 1: $j \neq l \Rightarrow \mathcal J_{jl}$ is not on the diagonal of $\mathcal J$. Take $h_1^{(i)} \cdot \mathcal J_{12} \cdot h_2^{(i)}$ as an example:
$$\begin{aligned} & \quad \frac{1}{2} \sum_{h^{(i)}}\prod_{l=1}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi) \cdot \left[h_1^{(i)} \cdot \mathcal J_{12} \cdot h_2^{(i)}\right] \\ & = \frac{1}{2} \sum_{h_1^{(i)}} \sum_{h_2^{(i)}}\mathcal Q(h_1^{(i)} \mid v^{(i)};\phi)\cdot \mathcal Q(h_2^{(i)} \mid v^{(i)};\phi) \cdot \left[h_1^{(i)} \cdot \mathcal J_{12} \cdot h_2^{(i)}\right] \cdot \underbrace{\sum_{h_3^{(i)},\cdots,h_{\mathcal P}^{(i)}} \prod_{l = 3}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi)}_{=1} \\ & = \frac{1}{2} \sum_{h_1^{(i)}} \sum_{h_2^{(i)}}\mathcal Q(h_1^{(i)} \mid v^{(i)};\phi)\cdot \mathcal Q(h_2^{(i)} \mid v^{(i)};\phi) \cdot \left[h_1^{(i)} \cdot \mathcal J_{12} \cdot h_2^{(i)}\right] \end{aligned}$$
At this point there are four possible value combinations for $h_1^{(i)},h_2^{(i)}$: $(0,0),(0,1),(1,0),(1,1)$. In fact, only $h_1^{(i)} = 1, h_2^{(i)} = 1$ yields a nonzero term; the other three combinations all contribute $0$. Therefore the term $h_1^{(i)} \cdot \mathcal J_{12} \cdot h_2^{(i)}$ evaluates to:
$$\frac{1}{2} \cdot \mathcal Q(h_1^{(i)}=1 \mid v^{(i)};\phi)\cdot \mathcal Q(h_2^{(i)}=1 \mid v^{(i)};\phi) \cdot \left[1 \cdot \mathcal J_{12} \cdot 1\right] = \frac{1}{2} \phi_1 \cdot \mathcal J_{12} \cdot \phi_2$$
A special property of this first case: since the parameter matrix $\mathcal J$ is a real symmetric matrix, we likewise have:
This means that the terms $h_1^{(i)} \cdot \mathcal J_{12} \cdot h_2^{(i)}$ and $h_2^{(i)} \cdot \mathcal J_{21} \cdot h_1^{(i)}$ give identical results.
$$\frac{1}{2} \phi_1 \cdot \mathcal J_{12} \cdot \phi_2 = \frac{1}{2} \phi_2 \cdot \mathcal J_{21} \cdot \phi_1$$
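Because $h_1^{(i)}$ and $h_2^{(i)}$ are independent under the factorized $\mathcal Q$, the off-diagonal expectation satisfies $\mathbb E_{\mathcal Q}\left[h_1^{(i)} \mathcal J_{12} h_2^{(i)}\right] = \phi_1 \cdot \mathcal J_{12} \cdot \phi_2$, which a tiny enumeration confirms (the numerical values are arbitrary).

```python
import numpy as np
from itertools import product

phi = np.array([0.7, 0.3])        # phi_1 = Q(h_1 = 1 | v), phi_2 = Q(h_2 = 1 | v)
J12 = 0.5                         # an arbitrary off-diagonal entry of J

def q_prob(h, phi):
    return np.prod(np.where(h == 1, phi, 1.0 - phi))

# Enumerate the four (h_1, h_2) combinations; only (1, 1) contributes.
expectation = sum(q_prob(np.array(hc, float), phi) * hc[0] * J12 * hc[1]
                  for hc in product([0, 1], repeat=2))
print(expectation, phi[0] * J12 * phi[1])   # both equal 0.105
```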
Case 2: $j = l \Rightarrow \mathcal J_{jl}$ lies on the diagonal of $\mathcal J$. Take $h_1^{(i)} \cdot \mathcal J_{11} \cdot h_1^{(i)}$ as an example:
Unlike the first case, only a single sum, $\sum_{h_1^{(i)}}$, can be separated out here.
$$\begin{aligned} & \quad \frac{1}{2} \sum_{h^{(i)}}\prod_{l=1}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi) \cdot \left[h_1^{(i)} \cdot \mathcal J_{11} \cdot h_1^{(i)}\right] \\ & = \frac{1}{2} \sum_{h_1^{(i)}} \mathcal Q(h_1^{(i)} \mid v^{(i)};\phi)\cdot \left[h_1^{(i)} \cdot \mathcal J_{11} \cdot h_1^{(i)}\right] \cdot \underbrace{\sum_{h_2^{(i)},\cdots,h_{\mathcal P}^{(i)}}\prod_{l=2}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi)}_{=1} \\ & = \frac{1}{2} \sum_{h_1^{(i)}} \mathcal Q(h_1^{(i)} \mid v^{(i)};\phi)\cdot \left[h_1^{(i)} \cdot \mathcal J_{11} \cdot h_1^{(i)}\right] \end{aligned}$$
Similar to the first case, but now there are only two possibilities: $h_1^{(i)} = 1$ or $h_1^{(i)} = 0$. The final result is:
Since $h_1^{(i)}$ only takes the values $0$ and $1$, only the case $h_1^{(i)} = 1$ contributes a nonzero term.