Machine Learning Notes on the Boltzmann Machine (III): Gradient Solving (Variational Inference Based on Mean-Field Theory)


  • Introduction
    • Review: the difficulty of solving Boltzmann machine parameter gradients, and the MCMC approach
    • Handling the Boltzmann machine log-likelihood gradient with variational inference

Introduction

The previous section introduced the Markov chain Monte Carlo (MCMC) approach to the intractable probability distributions that arise when solving for the Boltzmann machine's parameter gradients. This section introduces variational inference as an alternative treatment of the same gradient problem.

Review: Gradient-Solving Difficulties in the Boltzmann Machine and the MCMC Approach

Compared with the restricted Boltzmann machine, the Boltzmann machine imposes much looser constraints on the relations between random variables: the observed variables are connected among themselves, and so are the hidden variables. Taking the parameter matrix $\mathcal W$, which couples observed and hidden variables, as an example, the log-likelihood gradient with respect to $\mathcal W$ can be written as:
$$\nabla_{\mathcal W} \left[\log \mathcal P(v^{(i)};\theta)\right] = \mathbb E_{\mathcal P_{data}} \left[v^{(i)}(h^{(i)})^T\right] - \mathbb E_{\mathcal P_{model}} \left[v^{(i)}(h^{(i)})^T\right]$$
Here $\mathcal P_{data}$ denotes the true distribution. The underlying logic is that $N$ samples are drawn from an objectively existing model $\mathcal P_{data}(\mathcal V)$, forming the current sample set $\mathcal V = \{v^{(1)},v^{(2)},\cdots,v^{(N)}\}$.
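As a minimal sketch of how this gradient is typically estimated in practice, the two expectations are replaced by sample averages of $v h^T$. The arrays below are hypothetical placeholders standing in for data-phase and model-phase samples, not output from a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, N = 4, 3, 500  # visible units, hidden units, sample count (illustrative sizes)

# Hypothetical (v, h) samples: one batch standing in for the data phase,
# one for the model phase. In practice these come from inference/sampling.
v_data, h_data = rng.integers(0, 2, (N, D)), rng.integers(0, 2, (N, P))
v_model, h_model = rng.integers(0, 2, (N, D)), rng.integers(0, 2, (N, P))

# Monte Carlo estimates of E[v h^T] under each distribution
positive_phase = v_data.T @ h_data / N    # estimate of E_{P_data}[v h^T]
negative_phase = v_model.T @ h_model / N  # estimate of E_{P_model}[v h^T]

# Estimate of the log-likelihood gradient with respect to W, shape (D, P)
grad_W = positive_phase - negative_phase
```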

However, the true distribution $\mathcal P_{data}$ here differs slightly from the $\mathcal P_{data}$ that appears in the positive phase of stochastic maximum likelihood (the partition-function setting):

  • The $\mathcal P_{data}$ in stochastic maximum likelihood is purely the distribution obtained by running the Monte Carlo approximation in reverse:
$$\mathbb E_{\mathcal P_{data}} \left[\nabla_{\theta} \log \hat{\mathcal P}(x^{(i)};\theta)\right] \approx \frac{1}{N} \sum_{i=1}^N \nabla_{\theta} \log \hat{\mathcal P}(x^{(i)};\theta)$$
  • But for the parameter $\mathcal W$ (the same holds for the Boltzmann machine's other parameters), the positive phase of the gradient involves not only this Monte Carlo approximation but also the posterior over the hidden variables.
    For the detailed derivation, see Boltzmann Machine — Basic Introduction.
$$\begin{aligned} \frac{1}{N} \sum_{i=1}^{N} \sum_{h^{(i)}} \mathcal P(h^{(i)} \mid v^{(i)}) \left[v^{(i)}(h^{(i)})^T\right] & = \frac{1}{N}\sum_{i=1}^N \left\{\mathbb E_{\mathcal P(h^{(i)} \mid v^{(i)})} \left[v^{(i)}(h^{(i)})^T\right]\right\} \\ & \approx \mathbb E_{\mathcal P_{data}(v^{(i)} \in \mathcal V)} \left\{\mathbb E_{\mathcal P(h^{(i)} \mid v^{(i)})} \left[v^{(i)}(h^{(i)})^T\right]\right\} \\ & = \mathbb E_{\mathcal P_{data}} \left[v^{(i)}(h^{(i)})^T\right] \end{aligned}$$
$$\mathcal P_{data} \Rightarrow \mathcal P_{data}(v^{(i)} \in \mathcal V) \cdot \mathcal P_{model}(h^{(i)} \mid v^{(i)})$$

The derivation above shows that in the Boltzmann machine, both the positive phase and the negative phase involve distributions containing the hidden variables:
$$\begin{aligned} \mathcal P_{data} & \Rightarrow \mathcal P_{model}(h^{(i)} \mid v^{(i)}) \\ \mathcal P_{model} & \Rightarrow \mathcal P_{model}(h^{(i)},v^{(i)}) \end{aligned}$$

Under the Boltzmann machine's connectivity constraints, both $\mathcal P_{model}(h^{(i)},v^{(i)})$ and $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$ are extremely hard to solve, even approximately.

In the early 1980s, before the concept of variational inference had been introduced, the log-likelihood gradient for $\mathcal W$ was solved via Gibbs sampling. The core idea of this approach: express the posterior of a single variable (one observed or hidden unit), rather than the joint posterior over all hidden/observed variables.
For the derivation of the single-variable posteriors, see Boltzmann Machine — Gradient Solving (MCMC).
$$\begin{aligned} \mathcal P(v_i^{(i)} \mid h^{(i)},v_{-i}^{(i)}) &= \begin{cases} \text{Sigmoid} \left\{\sum_{j=1}^{\mathcal P} \mathcal W_{ij} \cdot h_j^{(i)} + \sum_{k \neq i}^{\mathcal D} \mathcal L_{ik} \cdot v_k^{(i)}\right\} & v_i^{(i)} = 1 \\ 1 - \text{Sigmoid} \left\{\sum_{j=1}^{\mathcal P} \mathcal W_{ij} \cdot h_j^{(i)} + \sum_{k \neq i}^{\mathcal D} \mathcal L_{ik} \cdot v_k^{(i)}\right\} & v_i^{(i)} = 0 \end{cases} \\ \mathcal P(h_j^{(i)} \mid v^{(i)},h_{-j}^{(i)}) &= \begin{cases} \text{Sigmoid} \left\{\sum_{i=1}^{\mathcal D} \mathcal W_{ij} \cdot v_i^{(i)} + \sum_{m \neq j} \mathcal J_{jm} \cdot h_m^{(i)}\right\} & h_j^{(i)} = 1 \\ 1 - \text{Sigmoid} \left\{\sum_{i=1}^{\mathcal D} \mathcal W_{ij} \cdot v_i^{(i)} + \sum_{m \neq j} \mathcal J_{jm} \cdot h_m^{(i)}\right\} & h_j^{(i)} = 0 \end{cases} \end{aligned}$$

Both of these conditionals are tractable. In Gibbs sampling, all random variables other than the one being sampled are held fixed, the conditional distribution of the target variable is computed, and a sample is drawn from it. Once every variable has been resampled, one iteration ends; after sufficiently many iterations, the chain reaches its stationary distribution.

Samples drawn from that stationary distribution can then directly approximate the parameter gradient $\nabla_{\mathcal W} \left[\log \mathcal P(v^{(i)};\theta)\right]$, bypassing the explicit solution of the positive- and negative-phase expectations.
If you understand this differently, feel free to discuss in the comments.
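The Gibbs sweep described above can be sketched as follows. This is a toy illustration under stated assumptions: the parameter matrices `W`, `L`, `J` and the initial states are random placeholders, not a trained model, and the diagonals of `L` and `J` are excluded from each unit's input, matching the $k \neq i$ and $m \neq j$ sums in the conditionals:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h, W, L, J, rng):
    """One Gibbs sweep over a general Boltzmann machine.

    Each unit is resampled from its sigmoid conditional while all
    other units are held fixed.
    """
    D, P = v.size, h.size
    for i in range(D):  # visible units: P(v_i = 1 | h, v_{-i})
        p = sigmoid(W[i] @ h + L[i] @ v - L[i, i] * v[i])
        v[i] = rng.random() < p
    for j in range(P):  # hidden units: P(h_j = 1 | v, h_{-j})
        p = sigmoid(W[:, j] @ v + J[j] @ h - J[j, j] * h[j])
        h[j] = rng.random() < p
    return v, h

# Toy run with small random parameters (illustrative only)
rng = np.random.default_rng(1)
D, P = 5, 3
W = rng.normal(size=(D, P))
L, J = rng.normal(size=(D, D)), rng.normal(size=(P, P))
L, J = (L + L.T) / 2, (J + J.T) / 2   # symmetric coupling matrices
v = rng.integers(0, 2, D).astype(float)
h = rng.integers(0, 2, P).astype(float)
for _ in range(100):                  # iterate toward the stationary distribution
    v, h = gibbs_sweep(v, h, W, L, J, rng)
```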

Handling the Boltzmann Machine Log-Likelihood Gradient with Variational Inference

The core of this method is to use variational inference to directly approximate the posterior $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$, thereby avoiding the earlier MCMC-based solution.
The MCMC approach scales poorly to large collections of random variables: its sampling time grows exponentially as the number of variables increases.

For the positive phase, $\mathcal P_{data} \Rightarrow \mathcal P_{data}(v^{(i)} \in \mathcal V) \cdot \mathcal P_{model}(h^{(i)} \mid v^{(i)})$, the earlier MCMC method either had to approximate $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$ by sampling, or rely on the restricted Boltzmann machine's constraints, under which $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$ has a closed $\text{Sigmoid}$ form. This section instead describes $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$ using variational inference under the mean-field assumption.

The core of variational inference for $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$ is to find a suitable distribution $\mathcal Q(h^{(i)} \mid v^{(i)})$ that approximates $\mathcal P_{model}(h^{(i)} \mid v^{(i)})$. The underlying logic of variational inference is still maximum likelihood estimation.
Introduce the hidden variable $h^{(i)}$ that the model associates with $v^{(i)}$:
$$\begin{aligned} \log \mathcal P(v^{(i)};\theta) & = \log \left[\frac{\mathcal P(v^{(i)},h^{(i)};\theta)}{\mathcal P(h^{(i)} \mid v^{(i)};\theta)}\right] \\ & = \log \mathcal P(v^{(i)},h^{(i)};\theta) - \log \mathcal P(h^{(i)} \mid v^{(i)};\theta) \end{aligned}$$
$$\begin{cases} \theta = \{\mathcal W,\mathcal L,\mathcal J\} \\ v^{(i)} \in \mathcal V = \{v^{(1)},v^{(2)},\cdots,v^{(N)}\} \end{cases}$$
On this basis, introduce the approximate distribution $\mathcal Q(h^{(i)}\mid v^{(i)};\phi)$, where $\phi$ denotes the parameters of this approximate distribution.
$$\begin{aligned} \log \mathcal P(v^{(i)};\theta) & = \left[\log \mathcal P(v^{(i)},h^{(i)};\theta) - \log \mathcal Q(h^{(i)}\mid v^{(i)};\phi)\right] - \left[\log \mathcal P(h^{(i)} \mid v^{(i)};\theta) - \log \mathcal Q(h^{(i)}\mid v^{(i)};\phi)\right] \\ & = \log \left[\frac{\mathcal P(v^{(i)},h^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right] - \log \left[\frac{\mathcal P(h^{(i)} \mid v^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right] \end{aligned}$$
Now take the expectation of both sides with respect to $h^{(i)}$ under $\mathcal Q$. Because this is a Boltzmann machine, every variable is a discrete random variable following a Bernoulli distribution, so the integral becomes the sum $\sum_{h^{(i)}}$.
To emphasize once more: it is the term carrying the minus sign that forms the $\mathcal{KL}\text{ Divergence}$.
$$\text{Equation Left: } \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal P(v^{(i)};\theta) = \log \mathcal P(v^{(i)};\theta) \cdot \underbrace{\sum_{h^{(i)}} \mathcal Q(h^{(i)}\mid v^{(i)};\phi)}_{=1} = \log \mathcal P(v^{(i)};\theta)$$
$$\begin{aligned} \text{Equation Right: } & \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left\{\log \left[\frac{\mathcal P(v^{(i)},h^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right] - \log \left[\frac{\mathcal P(h^{(i)} \mid v^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right]\right\} \\ & = \underbrace{\sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi)\log \left[\frac{\mathcal P(v^{(i)},h^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right]}_{\text{ELBO}} \underbrace{- \sum_{h^{(i)}}\mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \left[\frac{\mathcal P(h^{(i)} \mid v^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right]}_{\mathcal{KL}\left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\,\|\,\mathcal P(h^{(i)} \mid v^{(i)};\theta)\right]} \end{aligned}$$
At this point, the Evidence Lower Bound (ELBO) can be written as follows:
The ELBO is also called the variational of $\mathcal Q(h^{(i)} \mid v^{(i)};\phi)$ and is denoted $\mathcal L \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right]$.
$\mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] = - \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal Q(h^{(i)} \mid v^{(i)};\phi)$ denotes the entropy of the distribution $\mathcal Q(h^{(i)} \mid v^{(i)};\phi)$.
$$\begin{aligned} \text{ELBO} & = \mathcal L \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \\ & = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi)\log \left[\frac{\mathcal P(v^{(i)},h^{(i)};\theta)}{\mathcal Q(h^{(i)}\mid v^{(i)};\phi)}\right] \\ & = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[\log \mathcal P(v^{(i)},h^{(i)};\theta) - \log \mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \\ & = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal P(v^{(i)},h^{(i)};\theta) - \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \\ & = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal P(v^{(i)},h^{(i)};\theta) + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \end{aligned}$$
The subsequent solution idea: find the approximate distribution's parameters $\phi$ that maximize the $\text{ELBO}$. Since $\log \mathcal P(v^{(i)};\theta)$ does not depend on $\phi$, maximizing the $\text{ELBO}$ is equivalent to driving the $\mathcal{KL}\text{ Divergence}$ toward $0$, which makes $\mathcal Q(h^{(i)} \mid v^{(i)};\phi)$ as close as possible to $\mathcal P(h^{(i)}\mid v^{(i)};\theta)$.
At this point, solving for the approximate distribution $\mathcal Q(h^{(i)}\mid v^{(i)};\phi)$ has been converted into solving for the optimal parameter $\hat \phi$ that maximizes the $\text{ELBO}$:
$$\hat \phi = \mathop{\arg\max}\limits_{\phi} \mathcal L \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right]$$
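As a quick numeric sanity check of the decomposition above, the identity $\log \mathcal P(v) = \text{ELBO} + \mathcal{KL}[\mathcal Q \,\|\, \mathcal P(h \mid v)]$ holds for any choice of $\mathcal Q$. The toy joint below is made-up numbers, not a Boltzmann machine:

```python
import numpy as np

# Toy joint over (v, h) with binary h: P(v, h=0) and P(v, h=1) for one fixed v
p_joint = np.array([0.3, 0.2])
p_v = p_joint.sum()            # marginal P(v)
p_post = p_joint / p_v         # posterior P(h | v)

q = np.array([0.6, 0.4])       # an arbitrary approximate distribution Q(h | v)

elbo = np.sum(q * np.log(p_joint / q))  # Σ_h Q log[P(v,h)/Q]
kl = np.sum(q * np.log(q / p_post))     # KL[Q || P(h|v)], always ≥ 0
# elbo + kl recovers log P(v) exactly, for any Q
```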
In variational inference under the mean-field assumption, the mean-field assumption on $\mathcal Q(h^{(i)}\mid v^{(i)};\phi)$ is, concretely, to partition $h^{(i)} = \left(h_1^{(i)},h_2^{(i)},\cdots,h_{\mathcal P}^{(i)}\right)^T$ into several mutually independent subsets. Because of this independence, the posterior distribution can be written as a product of the subsets' posteriors.
Since $h^{(i)}$ contains $\mathcal P$ random variables in total, assume here that the number of subsets is also $\mathcal P$; that is, each subset contains exactly $1$ random variable.
$$\mathcal Q(h^{(i)} \mid v^{(i)};\phi) = \prod_{j=1}^{\mathcal P} \mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)$$
Since each $h_j^{(i)}\;(j=1,2,\cdots,\mathcal P)$ follows a Bernoulli distribution, parameterize the distribution $\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)$ as follows:
$$\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi) = \begin{cases} \mathcal Q(h_j^{(i)}=1 \mid v^{(i)};\phi) = \phi_j \\ \mathcal Q(h_j^{(i)}=0 \mid v^{(i)};\phi) = 1- \phi_j \end{cases}$$
Although $\phi_j$ is not a parameter in the usual sense but merely a real number describing a probability, once $\phi_j$ has been solved, $\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)$ is determined as well.
Therefore, the model parameter $\phi$ can also be viewed as the collection of per-variable probability information $\{\phi_1,\phi_2,\cdots,\phi_{\mathcal P}\}$.
At this point, the variational target $\hat \phi$ has been decomposed into $\mathcal P$ mutually independent pieces of probability information $\hat{\phi}_j\;(j=1,2,\cdots,\mathcal P)$.
Each $\hat{\phi}_j\;(j=1,2,\cdots,\mathcal P)$ must be solved.
$$\hat{\phi}_j = \mathop{\arg\max}\limits_{\phi_j} \mathcal L \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right]$$
ELBO \text{ELBO} ELBO的展开式带入,并将 P ( v ( i ) , h ( i ) ; θ ) = 1 Z exp ⁡ { ( v ( i ) ) T W ⋅ h ( i ) + 1 2 ( v ( i ) ) T L ⋅ v ( i ) + 1 2 ( h ( i ) ) T J ⋅ h ( i ) } \mathcal P(v^{(i)},h^{(i)};\theta) = \frac{1}{\mathcal Z} \exp \left\{(v^{(i)})^T\mathcal W\cdot h^{(i)} + \frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)} + \frac{1}{2} (h^{(i)})^T\mathcal J \cdot h^{(i)}\right\} P(v(i),h(i);θ)=Z1exp{(v(i))TWh(i)+21(v(i))TLv(i)+21(h(i))TJh(i)}进行展开。玻尔兹曼机——概率密度函数回顾
log ⁡ \log log exp ⁡ \exp exp之间相互消掉了。
ϕ ^ j = arg ⁡ max ⁡ ϕ j { ∑ h ( i ) Q ( h ( i ) ∣ v ( i ) ; ϕ ) log ⁡ P ( v ( i ) , h ( i ) ; θ ) + H [ Q ( h ( i ) ∣ v ( i ) ; ϕ ) ] } = arg ⁡ max ⁡ ϕ j { ∑ h ( i ) Q ( h ( i ) ∣ v ( i ) ; ϕ ) [ − log ⁡ Z + ( v ( i ) ) T W ⋅ h ( i ) + 1 2 ( v ( i ) ) T L ⋅ v ( i ) + 1 2 ( h ( i ) ) T J ⋅ h ( i ) ] + H [ Q ( h ( i ) ∣ v ( i ) ; ϕ ) ] } \begin{aligned} \hat {\phi}_j & = \mathop{\arg\max}\limits_{\phi_j} \left\{\sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal P(v^{(i)},h^{(i)};\theta) + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right]\right\} \\ & = \mathop{\arg\max}\limits_{\phi_j} \left\{\sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[-\log \mathcal Z + (v^{(i)})^T\mathcal W\cdot h^{(i)} + \frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)} + \frac{1}{2} (h^{(i)})^T\mathcal J \cdot h^{(i)}\right] + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right]\right\} \end{aligned} ϕ^j=ϕjargmax{h(i)Q(h(i)v(i);ϕ)logP(v(i),h(i);θ)+H[Q(h(i)v(i);ϕ)]}=ϕjargmax{h(i)Q(h(i)v(i);ϕ)[logZ+(v(i))TWh(i)+21(v(i))TLv(i)+21(h(i))TJh(i)]+H[Q(h(i)v(i);ϕ)]}
Split the bracketed terms into a part containing $h^{(i)}$ and a part not containing $h^{(i)}$:
$$\begin{cases} \Delta_1 = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[-\log \mathcal Z + \frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)}\right] \\ \Delta_2 = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[(v^{(i)})^T\mathcal W\cdot h^{(i)} + \frac{1}{2} (h^{(i)})^T\mathcal J \cdot h^{(i)}\right] \end{cases}$$
$$\hat{\phi_j} = \mathop{\arg\max}\limits_{\phi_j} \left\{\Delta_1 + \Delta_2 + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \right\}$$
Δ 1 \Delta_1 Δ1进行化简:
很明显, − log ⁡ Z + 1 2 ( v ( i ) ) T L ⋅ v ( i ) -\log \mathcal Z + \frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)} logZ+21(v(i))TLv(i) h ( i ) h^{(i)} h(i)没有关联关系,可看作常数提到公式前面; ∑ h ( i ) Q ( h ( i ) ∣ v ( i ) ; ϕ ) \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) h(i)Q(h(i)v(i);ϕ)本身是‘概率密度积分’,其结果是 1 1 1.
Δ 1 = [ − log ⁡ Z + 1 2 ( v ( i ) ) T L ⋅ v ( i ) ] ∑ h ( i ) Q ( h ( i ) ∣ v ( i ) ; ϕ ) ⏟ = 1 = − log ⁡ Z + 1 2 ( v ( i ) ) T L ⋅ v ( i ) \begin{aligned} \Delta_1 & = \left[-\log \mathcal Z + \frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)}\right] \underbrace{\sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi)}_{=1} \\ & = -\log \mathcal Z + \frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)} \end{aligned} Δ1=[logZ+21(v(i))TLv(i)]=1 h(i)Q(h(i)v(i);ϕ)=logZ+21(v(i))TLv(i)
At the same time, $\mathcal Z = \sum_{h^{(i)},v^{(i)}}\exp\{-E(v^{(i)},h^{(i)})\}$ is the partition function and has no relation to $\phi_j$ (the partition function sums out all of $h^{(i)}$); and $\frac{1}{2} (v^{(i)})^T\mathcal L \cdot v^{(i)}$ is likewise unrelated to $\phi_j$ ($\phi_j$ describes the posterior probability of $h_j^{(i)}$, and this term contains no hidden variables). Therefore, the entire $\Delta_1$ term can be dropped when solving for the optimal $\hat{\phi}_j$:
$$\hat{\phi_j} = \mathop{\arg\max}\limits_{\phi_j} \left\{ \Delta_2 + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \right\}$$
The idea going forward: since this is a maximization, take the partial derivative of $\Delta_2 + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right]$ with respect to $\phi_j$. If the derivative exists, set it to $0$ and solve for the extremum; if it does not, use gradient ascent to find an approximate optimum.

Expand the expression above into the following three parts:
$$\begin{aligned} \Delta_2 + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] & = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[(v^{(i)})^T\mathcal W\cdot h^{(i)} + \frac{1}{2} (h^{(i)})^T\mathcal J \cdot h^{(i)}\right] + \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \\ & = \Lambda_1 + \Lambda_2 + \Lambda_3 \end{aligned}$$
$$\begin{cases} \Lambda_1 = \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[(v^{(i)})^T\mathcal W\cdot h^{(i)}\right] \\ \Lambda_2 = \frac{1}{2}\sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \left[(h^{(i)})^T\mathcal J \cdot h^{(i)}\right] \\ \Lambda_3 = \mathcal H \left[\mathcal Q(h^{(i)} \mid v^{(i)};\phi)\right] \end{cases}$$

  • Λ 1 \Lambda_1 Λ1进行化简:首先将 Λ 1 \Lambda_1 Λ1继续展开,将 h j ( i ) h_j^{(i)} hj(i)表示出来:
    需要展开两个部分:将 Q ( h ( i ) ∣ v ( i ) ; ϕ ) \mathcal Q(h^{(i)} \mid v^{(i)};\phi) Q(h(i)v(i);ϕ)使用平均场假设进行展开;将矩阵乘法 ( v ( i ) ) T W ⋅ h ( i ) (v^{(i)})^T\mathcal W\cdot h^{(i)} (v(i))TWh(i)进行展开。
    Λ 1 = ∑ h ( i ) ∏ l = 1 P Q ( h l ( i ) ∣ v ( i ) ; ϕ ) ⋅ ∑ i = 1 D ∑ l = 1 P v i ( i ) ⋅ W i l ⋅ h l ( i ) \begin{aligned} \Lambda_1 = \sum_{h^{(i)}} \prod_{l=1}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi) \cdot \sum_{i=1}^{\mathcal D}\sum_{l=1}^{\mathcal P} v_i^{(i)} \cdot \mathcal W_{il} \cdot h_l^{(i)} \end{aligned} Λ1=h(i)l=1PQ(hl(i)v(i);ϕ)i=1Dl=1Pvi(i)Wilhl(i)
    可以发现,里面的项数是非常多的( D × P \mathcal D \times \mathcal P D×P项,包含乘法、加法),以第一项 v 1 ( i ) ⋅ W 11 ⋅ h 1 ( i ) v_1^{(i)} \cdot \mathcal W_{11} \cdot h_1^{(i)} v1(i)W11h1(i)为例,观察是否能够向下化简:
    ∏ l = 1 P Q ( h l ( i ) ∣ v ( i ) ; ϕ ) \prod_{l=1}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi) l=1PQ(hl(i)v(i);ϕ)中单独将 Q ( h 1 ( i ) ∣ v ( i ) ; ϕ ) \mathcal Q(h_1^{(i)} \mid v^{(i)};\phi) Q(h1(i)v(i);ϕ)分出来;并且将 ∑ h 1 ( i ) \sum_{h_1^{(i)}} h1(i) ∑ h ( i ) \sum_{h^{(i)}} h(i)中分出来。
    实际上,这步操作和变分推断(平均场假设)推导过程的处理方式是相同的。
    ∑ h ( i ) ∏ l = 1 P Q ( h l ( i ) ∣ v ( i ) ; ϕ ) ⋅ [ v 1 ( i ) ⋅ W 11 ⋅ h 1 ( i ) ] = ∑ h 1 ( i ) Q ( h 1 ( i ) ∣ v ( i ) ; ϕ ) ⋅ [ v 1 ( i ) ⋅ W 11 ⋅ h 1 ( i ) ] ⋅ ∑ h 2 ( i ) , ⋯   , h P ( i ) ∏ l = 2 P Q ( h l ( i ) ∣ v ( i ) ; ϕ ) = ∑ h 1 ( i ) Q ( h 1 ( i ) ∣ v ( i ) ; ϕ ) ⋅ [ v 1 ( i ) ⋅ W 11 ⋅ h 1 ( i ) ] ⋅ ∑ h 2 ( i ) Q ( h 2 ( i ) ∣ v ( i ) ; ϕ ) ⏟ = 1 ⋯ ∑ h P ( i ) Q ( h P ( i ) ∣ v ( i ) ; ϕ ) ⏟ = 1 = ∑ h 1 ( i ) Q ( h 1 ( i ) ∣ v ( i ) ; ϕ ) ⋅ [ v 1 ( i ) ⋅ W 11 ⋅ h 1 ( i ) ] \begin{aligned} & \quad \sum_{h^{(i)}} \prod_{l=1}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi) \cdot \left[v_1^{(i)} \cdot \mathcal W_{11} \cdot h_1^{(i)}\right] \\ & = \sum_{h_1^{(i)}} \mathcal Q(h_1^{(i)} \mid v^{(i)};\phi) \cdot \left[v_1^{(i)} \cdot \mathcal W_{11} \cdot h_1^{(i)}\right] \cdot \sum_{h_2^{(i)},\cdots,h_{\mathcal P}^{(i)}} \prod_{l = 2}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi) \\ & = \sum_{h_1^{(i)}} \mathcal Q(h_1^{(i)} \mid v^{(i)};\phi) \cdot \left[v_1^{(i)} \cdot \mathcal W_{11} \cdot h_1^{(i)}\right] \cdot \underbrace{\sum_{h_2^{(i)}}\mathcal Q(h_2^{(i)} \mid v^{(i)};\phi)}_{=1} \cdots \underbrace{\sum_{h_{\mathcal P}^{(i)}}\mathcal Q(h_{\mathcal P}^{(i)} \mid v^{(i)};\phi)}_{=1} \\ & = \sum_{h_1^{(i)}} \mathcal Q(h_1^{(i)} \mid v^{(i)};\phi) \cdot \left[v_1^{(i)} \cdot \mathcal W_{11} \cdot h_1^{(i)}\right] \end{aligned} h(i)l=1PQ(hl(i)v(i);ϕ)[v1(i)W11h1(i)]=h1(i)Q(h1(i)v(i);ϕ)[v1(i)W11h1(i)]h2(i),,hP(i)l=2PQ(hl(i)v(i);ϕ)=h1(i)Q(h1(i)v(i);ϕ)[v1(i)W11h1(i)]=1 h2(i)Q(h2(i)v(i);ϕ)=1 hP(i)Q(hP(i)v(i);ϕ)=h1(i)Q(h1(i)v(i);ϕ)[v1(i)W11h1(i)]
    由于 h 1 ( i ) h_1^{(i)} h1(i)同样也是服从伯努利分布,继续将上式化简:
    Q ( h 1 ( i ) = 1 ∣ v ( i ) ; ϕ ) ⋅ [ v 1 ( i ) ⋅ W 11 ⋅ 1 ] + 0 = ϕ 1 ⋅ v 1 ( i ) ⋅ W 11 \begin{aligned} \mathcal Q(h_1^{(i)} = 1 \mid v^{(i)};\phi) \cdot \left[v_1^{(i)} \cdot \mathcal W_{11} \cdot 1\right] + 0 = \phi_1 \cdot v_1^{(i)} \cdot \mathcal W_{11} \end{aligned} Q(h1(i)=1v(i);ϕ)[v1(i)W111]+0=ϕ1v1(i)W11
    其他项的处理方式均相同。至此, Λ 1 \Lambda_1 Λ1可化简为:
    一共包含 D × P \mathcal D \times \mathcal P D×P项,均要进行还原。
    Λ 1 = ∑ i = 1 D ∑ l = 1 P ϕ l ⋅ v i ( i ) ⋅ W i l \Lambda_1 = \sum_{i=1}^{\mathcal D}\sum_{l=1}^{\mathcal P} \phi_l \cdot v_i^{(i)} \cdot \mathcal W_{il} Λ1=i=1Dl=1Pϕlvi(i)Wil
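A brute-force check over all $2^{\mathcal P}$ hidden configurations confirms the closed form of $\Lambda_1$. Sizes and values below are arbitrary toy numbers:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
D, P = 3, 3
v = rng.integers(0, 2, D).astype(float)
W = rng.normal(size=(D, P))
phi = rng.random(P)                    # φ_l = Q(h_l = 1 | v)

# Brute force: Λ1 = Σ_h [Π_l Q(h_l | v)] · v^T W h over all 2^P configurations
lam1_brute = 0.0
for h in product([0.0, 1.0], repeat=P):
    h = np.array(h)
    q = np.prod(np.where(h == 1, phi, 1 - phi))  # mean-field Q(h | v)
    lam1_brute += q * (v @ W @ h)

# Closed form: Σ_i Σ_l φ_l · v_i · W_il
lam1_closed = (v @ W) @ phi
```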

  • Λ 2 \Lambda_2 Λ2进行化简
    关于 Λ 2 \Lambda_2 Λ2的化简思路和 Λ 1 \Lambda_1 Λ1是完全相同的,只不过更加复杂一些。因为包含 2 2 2 h h h项。
    Λ 2 = 1 2 ∑ h ( i ) ∏ l = 1 P Q ( h l ( i ) ∣ v ( i ) ; ϕ ) ⋅ ∑ j = 1 P ∑ l = 1 P h j ( i ) ⋅ J i l ⋅ h l ( i ) \begin{aligned} \Lambda_2 = \frac{1}{2}\sum_{h^{(i)}} \prod_{l=1}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi) \cdot \sum_{j=1}^{\mathcal P}\sum_{l=1}^{\mathcal P}h_j^{(i)} \cdot \mathcal J_{il} \cdot h_l^{(i)} \end{aligned} Λ2=21h(i)l=1PQ(hl(i)v(i);ϕ)j=1Pl=1Phj(i)Jilhl(i)
    第一种情况 i ≠ l ⇒ J i l i \neq l \Rightarrow\mathcal J_{il} i=lJil不在 J \mathcal J J的对角线上。以 h 1 ( i ) J 12 ⋅ h 2 ( i ) h_1^{(i)} \mathcal J_{12} \cdot h_2^{(i)} h1(i)J12h2(i)为例:
    1 2 ∑ h ( i ) ∏ l = 1 P Q ( h l ( i ) ∣ v ( i ) ; ϕ ) ⋅ [ h 1 ( i ) ⋅ J 12 ⋅ h 2 ( i ) ] = 1 2 ∑ h 1 ( i ) ∑ h 2 ( i ) Q ( h 1 ( i ) ∣ v ( i ) ; ϕ ) ⋅ Q ( h 2 ( i ) ∣ v ( i ) ; ϕ ) ⋅ [ h 1 ( i ) ⋅ J 12 ⋅ h 2 ( i ) ] ⋅ ∑ h 3 ( i ) , ⋯   , h P ( i ) ∏ l = 3 P Q ( h l ( i ) ∣ v ( i ) ; ϕ ) ⏟ = 1 = 1 2 ∑ h 1 ( i ) ∑ h 2 ( i ) Q ( h 1 ( i ) ∣ v ( i ) ; ϕ ) ⋅ Q ( h 2 ( i ) ∣ v ( i ) ; ϕ ) ⋅ [ h 1 ( i ) ⋅ J 12 ⋅ h 2 ( i ) ] \begin{aligned} & \quad \frac{1}{2} \sum_{h^{(i)}}\prod_{l=1}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi) \cdot \left[h_1^{(i)} \cdot \mathcal J_{12} \cdot h_2^{(i)}\right] \\ & = \frac{1}{2} \sum_{h_1^{(i)}} \sum_{h_2^{(i)}}\mathcal Q(h_1^{(i)} \mid v^{(i)};\phi)\cdot \mathcal Q(h_2^{(i)} \mid v^{(i)};\phi) \cdot \left[h_1^{(i)} \cdot \mathcal J_{12} \cdot h_2^{(i)}\right] \cdot \underbrace{\sum_{h_3^{(i)},\cdots,h_{\mathcal P}^{(i)}} \prod_{l = 3}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi)}_{=1} \\ & = \frac{1}{2} \sum_{h_1^{(i)}} \sum_{h_2^{(i)}}\mathcal Q(h_1^{(i)} \mid v^{(i)};\phi)\cdot \mathcal Q(h_2^{(i)} \mid v^{(i)};\phi) \cdot \left[h_1^{(i)} \cdot \mathcal J_{12} \cdot h_2^{(i)}\right] \end{aligned} 21h(i)l=1PQ(hl(i)v(i);ϕ)[h1(i)J12h2(i)]=21h1(i)h2(i)Q(h1(i)v(i);ϕ)Q(h2(i)v(i);ϕ)[h1(i)J12h2(i)]=1 h3(i),,hP(i)l=3PQ(hl(i)v(i);ϕ)=21h1(i)h2(i)Q(h1(i)v(i);ϕ)Q(h2(i)v(i);ϕ)[h1(i)J12h2(i)]
    At this point, the values of $h_1^{(i)},h_2^{(i)}$ fall into four cases:

    • $h_1^{(i)} = 0,\;h_2^{(i)} = 0$
    • $h_1^{(i)} = 1,\;h_2^{(i)} = 0$
    • $h_1^{(i)} = 0,\;h_2^{(i)} = 1$
    • $h_1^{(i)} = 1,\;h_2^{(i)} = 1$

    However, only $h_1^{(i)} = 1,\;h_2^{(i)} = 1$ actually yields a nonzero term; all other cases give $0$. The term $h_1^{(i)} \cdot \mathcal J_{12} \cdot h_2^{(i)}$ therefore contributes:
$$\frac{1}{2} \cdot \mathcal Q(h_1^{(i)}=1 \mid v^{(i)};\phi)\cdot \mathcal Q(h_2^{(i)}=1 \mid v^{(i)};\phi) \cdot \left[1 \cdot \mathcal J_{12} \cdot 1\right] = \frac{1}{2} \phi_1 \cdot \mathcal J_{12} \cdot \phi_2$$
    A further note on this case: since the parameter matrix $\mathcal J$ is itself a real symmetric matrix, we likewise have:
    This means $h_1^{(i)} \cdot \mathcal J_{12} \cdot h_2^{(i)}$ and $h_2^{(i)} \cdot \mathcal J_{21} \cdot h_1^{(i)}$ give the same result.
$$\frac{1}{2} \phi_1 \cdot \mathcal J_{12} \cdot \phi_2 = \frac{1}{2} \phi_2 \cdot \mathcal J_{21} \cdot \phi_1$$
    Case 2: $j = l \Rightarrow \mathcal J_{jl}$ lies on the diagonal of $\mathcal J$. Take $h_1^{(i)} \cdot \mathcal J_{11} \cdot h_1^{(i)}$ as an example:
    Unlike Case 1, only one sum can be separated out here: $\sum_{h_1^{(i)}}$.
$$\begin{aligned} & \quad \frac{1}{2} \sum_{h^{(i)}}\prod_{l=1}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi) \cdot \left[h_1^{(i)} \cdot \mathcal J_{11} \cdot h_1^{(i)}\right] \\ & = \frac{1}{2} \sum_{h_1^{(i)}} \mathcal Q(h_1^{(i)} \mid v^{(i)};\phi)\cdot \left[h_1^{(i)} \cdot \mathcal J_{11} \cdot h_1^{(i)}\right] \cdot \underbrace{\sum_{h_2^{(i)},\cdots,h_{\mathcal P}^{(i)}}\prod_{l=2}^{\mathcal P} \mathcal Q(h_l^{(i)} \mid v^{(i)};\phi)}_{=1} \\ & = \frac{1}{2} \sum_{h_1^{(i)}} \mathcal Q(h_1^{(i)} \mid v^{(i)};\phi)\cdot \left[h_1^{(i)} \cdot \mathcal J_{11} \cdot h_1^{(i)}\right] \end{aligned}$$
    This resembles Case 1, but there are only two choices, $h_1^{(i)} = 1$ and $h_1^{(i)} = 0$. Since only $h_1^{(i)} = 1$ contributes a nonzero term, the final result is:
$$\frac{1}{2} \cdot \mathcal Q(h_1^{(i)}=1 \mid v^{(i)};\phi) \cdot \left[1 \cdot \mathcal J_{11} \cdot 1\right] = \frac{1}{2} \phi_1 \cdot \mathcal J_{11}$$
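The same brute-force check applies to $\Lambda_2$: under the mean-field $\mathcal Q$, each off-diagonal term contributes $\frac{1}{2}\phi_j \mathcal J_{jl} \phi_l$ and each diagonal term contributes $\frac{1}{2}\phi_j \mathcal J_{jj}$ (because $h_j^2 = h_j$ for binary units). Toy values only:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
P = 3
J = rng.normal(size=(P, P))
J = (J + J.T) / 2                      # symmetric coupling matrix
phi = rng.random(P)                    # φ_j = Q(h_j = 1 | v)

# Brute force: Λ2 = ½ Σ_h [Π_l Q(h_l | v)] · h^T J h over all 2^P configurations
lam2_brute = 0.0
for h in product([0.0, 1.0], repeat=P):
    h = np.array(h)
    q = np.prod(np.where(h == 1, phi, 1 - phi))
    lam2_brute += 0.5 * q * (h @ J @ h)

# Closed form: ½ Σ_{j≠l} φ_j J_jl φ_l  +  ½ Σ_j φ_j J_jj
off_diag = J - np.diag(np.diag(J))
lam2_closed = 0.5 * (phi @ off_diag @ phi) + 0.5 * np.diag(J) @ phi
```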
