基于玻尔兹曼机(三)梯度求解(基于平均场理论的变分推断)的思路继续求解 Λ 3 \Lambda_3 Λ3的梯度。
对 Λ 3 \Lambda_3 Λ3进行化简。其就是一个熵的形式:
Λ 3 = − ∑ h ( i ) Q ( h ( i ) ∣ v ( i ) ; ϕ ) log Q ( h ( i ) ∣ v ( i ) ; ϕ ) \Lambda_3 = - \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal Q(h^{(i)} \mid v^{(i)};\phi) Λ3=−h(i)∑Q(h(i)∣v(i);ϕ)logQ(h(i)∣v(i);ϕ)
由于平均场假设作为条件,将上式进行展开:
这里暂时先将负号带回去了~
Λ 3 = ∑ h ( i ) { ∏ j = 1 P Q ( h j ( i ) ∣ v ( i ) ; ϕ ) log ∏ j = 1 P [ 1 Q ( h j ( i ) ∣ v ( i ) ; ϕ ) ] } = ∑ h ( i ) { ∏ j = 1 P Q ( h j ( i ) ∣ v ( i ) ; ϕ ) ∑ j = 1 P log [ 1 Q ( h j ( i ) ∣ v ( i ) ; ϕ ) ] } \begin{aligned} \Lambda_3 & = \sum_{h^{(i)}} \left\{\prod_{j=1}^{\mathcal P} \mathcal Q(h_j^{(i)} \mid v^{(i)};\phi) \log \prod_{j=1}^{\mathcal P} \left[\frac{1}{\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)}\right]\right\} \\ & = \sum_{h^{(i)}} \left\{\prod_{j=1}^{\mathcal P} \mathcal Q(h_j^{(i)} \mid v^{(i)};\phi) \sum_{j=1}^{\mathcal P}\log \left[\frac{1}{\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)}\right]\right\} \end{aligned} Λ3=h(i)∑{j=1∏PQ(hj(i)∣v(i);ϕ)logj=1∏P[Q(hj(i)∣v(i);ϕ)1]}=h(i)∑{j=1∏PQ(hj(i)∣v(i);ϕ)j=1∑Plog[Q(hj(i)∣v(i);ϕ)1]}
继续将 ∑ h ( i ) = ∑ h 1 ( i ) , h 2 ( i ) , ⋯ , h P ( i ) \sum_{h^{(i)}} = \sum_{h_1^{(i)},h_2^{(i)},\cdots,h_{\mathcal P}^{(i)}} ∑h(i)=∑h1(i),h2(i),⋯,hP(i)展开,并与相关项进行归纳:
Λ 3 = ∑ j = 1 P ∑ h j ( i ) Q ( h j ( i ) ∣ v ( i ) ; ϕ ) ⋅ log [ 1 Q ( h j ( i ) ∣ v ( i ) ; ϕ ) ] \Lambda_3 = \sum_{j=1}^{\mathcal P} \sum_{h_j^{(i)}} \mathcal Q(h_j^{(i)} \mid v^{(i)};\phi) \cdot \log \left[\frac{1}{\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)}\right] Λ3=j=1∑Phj(i)∑Q(hj(i)∣v(i);ϕ)⋅log[Q(hj(i)∣v(i);ϕ)1]
继续观察 ∑ h j ( i ) Q ( h j ( i ) ∣ v ( i ) ; ϕ ) ⋅ log [ 1 Q ( h j ( i ) ∣ v ( i ) ; ϕ ) ] \sum_{h_j^{(i)}} \mathcal Q(h_j^{(i)} \mid v^{(i)};\phi) \cdot \log \left[\frac{1}{\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)}\right] ∑hj(i)Q(hj(i)∣v(i);ϕ)⋅log[Q(hj(i)∣v(i);ϕ)1],基于 h j ( i ) h_j^{(i)} hj(i)的伯努利分布,该式可表示为如下形式:
Q ( h j ( i ) = 1 ∣ v ( i ) ; ϕ ) ⋅ l o g [ 1 Q ( h j ( i ) = 1 ∣ v ( i ) ; ϕ ) ] + Q ( h j ( i ) = 0 ∣ v ( i ) ; ϕ ) ⋅ l o g [ 1 Q ( h j ( i ) = 0 ∣ v ( i ) ; ϕ ) ] = ϕ j ⋅ log [ 1 ϕ j ] + ( 1 − ϕ j ) log [ 1 1 − ϕ j ] \begin{aligned} & \quad \mathcal Q(h_j^{(i)}=1 \mid v^{(i)};\phi) \cdot log \left[\frac{1}{\mathcal Q(h_j^{(i)}=1 \mid v^{(i)};\phi)}\right] + \mathcal Q(h_j^{(i)}=0 \mid v^{(i)};\phi) \cdot log \left[\frac{1}{\mathcal Q(h_j^{(i)}=0 \mid v^{(i)};\phi)}\right] \\ & = \phi_j \cdot \log \left[\frac{1}{\phi_j}\right] + (1 - \phi_j) \log \left[\frac{1}{1 - \phi_j}\right] \end{aligned} Q(hj(i)=1∣v(i);ϕ)⋅log[Q(hj(i)=1∣v(i);ϕ)1]+Q(hj(i)=0∣v(i);ϕ)⋅log[Q(hj(i)=0∣v(i);ϕ)1]=ϕj⋅log[ϕj1]+(1−ϕj)log[1−ϕj1]
至此, Λ 3 \Lambda_3 Λ3可表示为:
Λ 3 = ∑ j = 1 P { ϕ j ⋅ log [ 1 ϕ j ] + ( 1 − ϕ j ) log [ 1 1 − ϕ j ] } \Lambda_3 = \sum_{j=1}^{\mathcal P}\left\{\phi_j \cdot \log \left[\frac{1}{\phi_j}\right] + (1 - \phi_j) \log \left[\frac{1}{1 - \phi_j}\right]\right\} Λ3=j=1∑P{ϕj⋅log[ϕj1]+(1−ϕj)log[1−ϕj1]}
至此, Λ 1 , Λ 2 , Λ 3 \Lambda_1,\Lambda_2,\Lambda_3 Λ1,Λ2,Λ3全部化简完毕,将这三项分别对 ϕ j \phi_j ϕj求偏导。具体结果如下:
{ ∂ ∂ ϕ j [ ∑ i = 1 D ∑ l = 1 P ϕ l ⋅ v i ( i ) ⋅ W i l ] = ∑ i = 1 D v i ( i ) ⋅ W i j ∂ ∂ ϕ j [ ∑ j = 1 P ∑ l ≠ j P ϕ i ⋅ ϕ l ⋅ J i l ] = ∑ l ≠ j P ϕ l ⋅ J i l ∂ ∂ ϕ j [ ∑ j = 1 P { ϕ j ⋅ log [ 1 ϕ j ] + ( 1 − ϕ j ) log [ 1 1 − ϕ j ] } ] = − log ϕ j 1 − ϕ j \begin{cases} \frac{\partial}{\partial \phi_j} \left[\sum_{i=1}^{\mathcal D}\sum_{l=1}^{\mathcal P} \phi_l \cdot v_i^{(i)} \cdot \mathcal W_{il}\right] = \sum_{i=1}^{\mathcal D} v_i^{(i)} \cdot \mathcal W_{ij} \\ \frac{\partial}{\partial \phi_j} \left[\sum_{j=1}^{\mathcal P}\sum_{l\neq j}^{\mathcal P} \phi_i \cdot \phi_l \cdot \mathcal J_{il}\right] = \sum_{l \neq j}^{\mathcal P} \phi_l \cdot \mathcal J_{il} \\ \frac{\partial}{\partial \phi_j} [\sum_{j=1}^{\mathcal P}\left\{\phi_j \cdot \log \left[\frac{1}{\phi_j}\right] + (1 - \phi_j) \log \left[\frac{1}{1 - \phi_j}\right]\right\}] = -\log \frac{\phi_j}{1 - \phi_j} \end{cases} ⎩ ⎨ ⎧∂ϕj∂[∑i=1D∑l=1Pϕl⋅vi(i)⋅Wil]=∑i=1Dvi(i)⋅Wij∂ϕj∂[∑j=1P∑l=jPϕi⋅ϕl⋅Jil]=∑l=jPϕl⋅Jil∂ϕj∂[∑j=1P{ϕj⋅log[ϕj1]+(1−ϕj)log[1−ϕj1]}]=−log1−ϕjϕj
令三项之和为0,求解 ϕ j \phi_j ϕj:
∂ ∂ ϕ j [ Λ 1 + Λ 2 + Λ 3 ] ≜ 0 ⇔ ∑ i = 1 D v i ( i ) ⋅ W i j + ∑ l ≠ j P ϕ l ⋅ J i l − log ϕ j 1 − ϕ j = 0 \frac{\partial}{\partial \phi_j} [\Lambda_1 + \Lambda_2 + \Lambda_3] \triangleq 0 \Leftrightarrow \sum_{i=1}^{\mathcal D} v_i^{(i)} \cdot \mathcal W_{ij} + \sum_{l \neq j}^{\mathcal P} \phi_l \cdot \mathcal J_{il} - \log \frac{\phi_j}{1 - \phi_j} = 0 ∂ϕj∂[Λ1+Λ2+Λ3]≜0⇔i=1∑Dvi(i)⋅Wij+l=j∑Pϕl⋅Jil−log1−ϕjϕj=0
因而有:
ϕ j [ 1 + exp ( ∑ i = 1 D v i ( i ) ⋅ W i j + ∑ l ≠ j P ϕ l ⋅ J i l ) ] = exp ( ∑ i = 1 D v i ( i ) ⋅ W i j + ∑ l ≠ j P ϕ l ⋅ J i l ) ⇒ ϕ j = 1 1 + exp ( ∑ i = 1 D v i ( i ) ⋅ W i j + ∑ l ≠ j P ϕ l ⋅ J i l ) = Sigmoid [ ∑ i = 1 D v i ( i ) ⋅ W i j + ∑ l ≠ j P ϕ l ⋅ J i l ] \phi_j \left[1 + \exp \left(\sum_{i=1}^{\mathcal D} v_i^{(i)} \cdot \mathcal W_{ij} + \sum_{l \neq j}^{\mathcal P} \phi_l \cdot \mathcal J_{il}\right)\right] = \exp \left(\sum_{i=1}^{\mathcal D} v_i^{(i)} \cdot \mathcal W_{ij} + \sum_{l \neq j}^{\mathcal P} \phi_l \cdot \mathcal J_{il}\right) \\ \begin{aligned} \Rightarrow \phi_j & = \frac{1}{1 + \exp \left(\sum_{i=1}^{\mathcal D} v_i^{(i)} \cdot \mathcal W_{ij} + \sum_{l \neq j}^{\mathcal P} \phi_l \cdot \mathcal J_{il}\right)} \\ & = \text{Sigmoid} \left[\sum_{i=1}^{\mathcal D} v_i^{(i)} \cdot \mathcal W_{ij} + \sum_{l \neq j}^{\mathcal P} \phi_l \cdot \mathcal J_{il}\right] \end{aligned} ϕj 1+exp i=1∑Dvi(i)⋅Wij+l=j∑Pϕl⋅Jil =exp i=1∑Dvi(i)⋅Wij+l=j∑Pϕl⋅Jil ⇒ϕj=1+exp(∑i=1Dvi(i)⋅Wij+∑l=jPϕl⋅Jil)1=Sigmoid i=1∑Dvi(i)⋅Wij+l=j∑Pϕl⋅Jil
很明显,这是一个迭代方程——用非 ϕ j \phi_j ϕj的其他结果的线性运算 ∑ l ≠ j P \sum_{l \neq j}^{\mathcal P} ∑l=jP对 ϕ j \phi_j ϕj进行表示。并且它是一个不动点方程。它的具体求解方式是:在初始状态下给定随机结果,固定住 ϕ j \phi_j ϕj之外的其他项,求解当前迭代步骤下 ϕ j \phi_j ϕj的最优解;以此类推,直到所有的 ϕ j ( j = 1 , 2 , ⋯ , P ) \phi_{j}(j=1,2,\cdots,\mathcal P) ϕj(j=1,2,⋯,P)均固定一遍,此时一次迭代结束,继续迭代下去,基于不动点方程的性质, ϕ j ( j = 1 , 2 , ⋯ , P ) \phi_j(j=1,2,\cdots,\mathcal P) ϕj(j=1,2,⋯,P)均会收敛至某一结果,至此 ϕ = { ϕ 1 , ϕ 2 , ⋯ , ϕ P } \phi = \{\phi_1,\phi_2,\cdots,\phi_{\mathcal P}\} ϕ={ϕ1,ϕ2,⋯,ϕP}可以被近似出来,最终求解 Q ( h ( i ) ∣ v ( i ) ; ϕ ) \mathcal Q(h^{(i)} \mid v^{(i)};\phi) Q(h(i)∣v(i);ϕ)的分布结果,并最终替代 P m o d e l ( h ( i ) ∣ v ( i ) ; θ ) \mathcal P_{model}(h^{(i)} \mid v^{(i)};\theta) Pmodel(h(i)∣v(i);θ)对模型参数的梯度进行描述。
这实际上就是‘坐标上升思想’。
返回要求解的模型参数梯度:
{ ∇ W = η ( E P d a t a [ v ( i ) ( h ( i ) ) T ] − E P m o d e l [ v ( i ) ( h ( i ) ) T ] ) P d a t a ⇒ P d a t a ( v ( i ) ∈ V ) ⋅ P m o d e l ( h ( i ) ∣ v ( i ) ) P m o d e l = P m o d e l ( v ( i ) , h ( i ) ) \begin{cases} \nabla_{\mathcal W} = \eta \left(\mathbb E_{\mathcal P_{data}} \left[v^{(i)}(h^{(i)})^T\right] - \mathbb E_{\mathcal P_{model}} \left[v^{(i)}(h^{(i)})^T\right]\right)\\ \mathcal P_{data} \Rightarrow \mathcal P_{data}(v^{(i)} \in \mathcal V) \cdot \mathcal P_{model}(h^{(i)} \mid v^{(i)}) \\ \mathcal P_{model} = \mathcal P_{model}(v^{(i)},h^{(i)}) \end{cases} ⎩ ⎨ ⎧∇W=η(EPdata[v(i)(h(i))T]−EPmodel[v(i)(h(i))T])Pdata⇒Pdata(v(i)∈V)⋅Pmodel(h(i)∣v(i))Pmodel=Pmodel(v(i),h(i))
此时关于隐变量的后验概率 P m o d e l \mathcal P_{model} Pmodel并不需要使用MCMC采样方法进行求解,通过变分推断的方式也可以进行求解。
关于负相的处理方式,依然需要使用采样方法,随着技术的迭代,采样方式也得到了更新,例如对比散度方法,同样可以加快采样速度。
至此,关于玻尔兹曼机的介绍到此结束。
相关参考:
(系列二十八)玻尔兹曼机7-平均场推断3