上一节介绍了含隐变量能量模型的对数似然梯度求解。本节针对受限玻尔兹曼机,对模型参数进行求解。
已知一个随机变量集合 X \mathcal X X,随机变量间关系可描述成马尔可夫随机场,并包含观测变量 v v v和隐变量 h h h两种。
根据能量模型的描述,将联合概率分布 P ( X ) \mathcal P(\mathcal X) P(X)表示为如下形式:
P ( X ) = P ( h , v ) = 1 Z exp { − E [ h , v ] } \mathcal P(\mathcal X) = \mathcal P(h,v) = \frac{1}{\mathcal Z} \exp\{ - \mathbb E[h,v]\} P(X)=P(h,v)=Z1exp{−E[h,v]}
其中, Z \mathcal Z Z表示配分函数。如果此时给定样本集合(基于观测变量的集合):
V = { v ( i ) } i = 1 N v ( i ) ∈ R n \mathcal V = \{v^{(i)}\}_{i=1}^N \quad v^{(i)} \in \mathbb R^n V={v(i)}i=1Nv(i)∈Rn
关于样本集合 V \mathcal V V的对数似然 L ( θ ) \mathcal L(\theta) L(θ)可表示为:
各样本间依然服从‘独立同分布’。
L ( θ ) = 1 N ∑ v ( i ) ∈ V log P ( v ( i ) ; θ ) \mathcal L(\theta) = \frac{1}{N} \sum_{v^{(i)} \in \mathcal V} \log \mathcal P(v^{(i)};\theta) L(θ)=N1v(i)∈V∑logP(v(i);θ)
对于对数似然 L ( θ ) \mathcal L(\theta) L(θ)的梯度 ∇ θ L ( θ ) \nabla_{\theta}\mathcal L(\theta) ∇θL(θ)可表示为:
∇ θ L ( θ ) = ∂ ∂ θ [ 1 N ∑ v ( i ) ∈ V log P ( v ( i ) ; θ ) ] = 1 N ∑ v ( i ) ∈ V ∇ θ log P ( v ( i ) ; θ ) ∇ θ log P ( v ( i ) ; θ ) = ∑ h ( i ) , v ( i ) { P ( h ( i ) , v ( i ) ) ⋅ ∂ ∂ θ [ E ( h ( i ) , v ( i ) ) ] } − ∑ h ( i ) { P ( h ( i ) ∣ v ( i ) ) ⋅ ∂ ∂ θ [ E ( h ( i ) , v ( i ) ) ] } ⇒ { h ( i ) = ( h 1 ( i ) , h 2 ( i ) , ⋯ , h m ( i ) ) T v ( i ) = ( v 1 ( i ) , v 2 ( i ) , ⋯ , v n ( i ) ) T m + n = p \begin{aligned} \nabla_{\theta}\mathcal L(\theta) & = \frac{\partial}{\partial \theta} \left[\frac{1}{N} \sum_{v^{(i)} \in \mathcal V} \log \mathcal P(v^{(i)};\theta)\right] \\ & = \frac{1}{N} \sum_{v^{(i)} \in \mathcal V} \nabla_{\theta} \log \mathcal P(v^{(i)};\theta) \\ \nabla_{\theta} \log \mathcal P(v^{(i)};\theta) & = \sum_{h^{(i)},v^{(i)}} \left\{\mathcal P(h^{(i)},v^{(i)}) \cdot \frac{\partial}{\partial \theta} \left[\mathbb E(h^{(i)},v^{(i)})\right]\right\} - \sum_{h^{(i)}} \left\{\mathcal P(h^{(i)} \mid v^{(i)})\cdot \frac{\partial}{\partial \theta} \left[\mathbb E(h^{(i)},v^{(i)})\right]\right\} \\ & \Rightarrow \begin{cases} h^{(i)} = (h_1^{(i)},h_2^{(i)},\cdots,h_m^{(i)})^T \\ v^{(i)} = (v_1^{(i)},v_2^{(i)},\cdots,v_n^{(i)})^T \\ m + n = p \end{cases} \end{aligned} ∇θL(θ)∇θlogP(v(i);θ)=∂θ∂⎣⎡N1v(i)∈V∑logP(v(i);θ)⎦⎤=N1v(i)∈V∑∇θlogP(v(i);θ)=h(i),v(i)∑{P(h(i),v(i))⋅∂θ∂[E(h(i),v(i))]}−h(i)∑{P(h(i)∣v(i))⋅∂θ∂[E(h(i),v(i))]}⇒⎩⎪⎨⎪⎧h(i)=(h1(i),h2(i),⋯,hm(i))Tv(i)=(v1(i),v2(i),⋯,vn(i))Tm+n=p
基于受限玻尔兹曼机结构的联合概率分布 P ( X ) \mathcal P(\mathcal X) P(X)表示如下:
P ( X ) = P ( h , v ) = 1 Z exp [ v T W h + b T v + c T h ] \mathcal P(\mathcal X) = \mathcal P(h,v) = \frac{1}{\mathcal Z} \exp \left[v^T \mathcal W h + b^Tv + c^Th \right] P(X)=P(h,v)=Z1exp[vTWh+bTv+cTh]
关于模型参数 θ \theta θ的描述,自然是 θ = { W , b , c } \theta = \{\mathcal W,b,c\} θ={W,b,c}。但本质上只有一种信息表示:结点之间关联关系的表示。其中 b , c b,c b,c分别表示某观测变量/某隐变量与自身的关联关系,这种关系被视作特殊的节点间关系。在本节中,仅求解一般性的关联关系——存在边的,隐变量与观测变量之间的关联关系,也就是模型参数 W \mathcal W W:
W = ( w 11 , w 12 , ⋯ , w 1 n w 21 , w 22 , ⋯ , w 2 n ⋮ w m 1 , w m 2 , ⋯ , w m n ) m × n ⇒ W ( i ) = [ w k j ( i ) ] m × n \mathcal W = \begin{pmatrix} w_{11},w_{12},\cdots,w_{1n} \\ w_{21},w_{22},\cdots,w_{2n} \\ \vdots \\ w_{m1},w_{m2},\cdots,w_{mn} \\ \end{pmatrix}_{m \times n} \Rightarrow \mathcal W^{(i)} = [w_{kj}^{(i)}]_{m \times n} W=⎝⎜⎜⎜⎛w11,w12,⋯,w1nw21,w22,⋯,w2n⋮wm1,wm2,⋯,wmn⎠⎟⎟⎟⎞m×n⇒W(i)=[wkj(i)]m×n
主体思路:如果能够将 W \mathcal W W的梯度求解出来,自然就可以使用极大似然估计,基于梯度上升法,将 W \mathcal W W近似出来:
W ( t + 1 ) ⇐ W ( t ) + η ∇ W L ( W ) \mathcal W^{(t+1)} \Leftarrow \mathcal W^{(t)} + \eta \nabla_{\mathcal W} \mathcal L(\mathcal W) W(t+1)⇐W(t)+η∇WL(W)
基于能量模型的对数似然梯度,将受限玻尔兹曼机的结构表示代入公式中:
依然仅针对一个观测样本
v ( i ) v^{(i)} v(i),该样本的第
k k k个分量(维度)
v k ( i ) v_k^{(i)} vk(i)和对应隐变量
h ( i ) h^{(i)} h(i)的第
j j j个分量
h j ( i ) h_j^{(i)} hj(i)之间的模型参数
w k j ( i ) w_{kj}^{(i)} wkj(i)进行求导。
∇ θ [ log P ( v ( i ) ; θ ) ] = ∂ ∂ w k j ( i ) [ log P ( v ( i ) ; θ ) ] = − ∑ h ( i ) { P ( h ( i ) ∣ v ( i ) ) ⋅ ∂ ∂ w k j ( i ) [ E [ h ( i ) , v ( i ) ] ] } + ∑ h ( i ) , v ( i ) { P ( h ( i ) , v ( i ) ) ⋅ ∂ ∂ w k j ( i ) [ E ( h ( i ) , v ( i ) ) ] } \begin{aligned} \nabla_{\theta}\left[\log \mathcal P(v^{(i)};\theta)\right] & = \frac{\partial}{\partial w_{kj}^{(i)}} \left[ \log \mathcal P(v^{(i)};\theta)\right] \\ & = - \sum_{h^{(i)}} \left\{\mathcal P(h^{(i)} \mid v^{(i)}) \cdot \frac{\partial}{\partial w_{kj}^{(i)}} \left[ \mathbb E [h^{(i)},v^{(i)}]\right]\right\} + \sum_{h^{(i)},v^{(i)}} \left\{\mathcal P(h^{(i)},v^{(i)}) \cdot \frac{\partial}{\partial w_{kj}^{(i)}} \left[\mathbb E(h^{(i)},v^{(i)})\right]\right\} \end{aligned} ∇θ[logP(v(i);θ)]=∂wkj(i)∂[logP(v(i);θ)]=−h(i)∑{P(h(i)∣v(i))⋅∂wkj(i)∂[E[h(i),v(i)]]}+h(i),v(i)∑{P(h(i),v(i))⋅∂wkj(i)∂[E(h(i),v(i))]}
由于这里不考虑模型参数 b , c b,c b,c的影响,使用 Δ \Delta Δ进行表示。因而 能量函数 E [ h ( i ) , v ( i ) ] \mathbb E[h^{(i)},v^{(i)}] E[h(i),v(i)]可表示为如下形式:
E [ h ( i ) , v ( i ) ] = − { [ h ( i ) ] T W ( i ) v ( i ) + Δ ( i ) } = − ( ∑ k = 1 m ∑ j = 1 n h k ( i ) ⋅ w k j ( i ) ⋅ v j ( i ) + Δ ( i ) ) \begin{aligned} \mathbb E[h^{(i)},v^{(i)}] & = - \left\{[h^{(i)}]^T \mathcal W^{(i)} v^{(i)} + \Delta^{(i)}\right\} \\ & = - \left(\sum_{k=1}^m \sum_{j=1}^n h_k^{(i)} \cdot w_{kj}^{(i)} \cdot v_j^{(i)} + \Delta^{(i)} \right) \end{aligned} E[h(i),v(i)]=−{[h(i)]TW(i)v(i)+Δ(i)}=−(k=1∑mj=1∑nhk(i)⋅wkj(i)⋅vj(i)+Δ(i))
将其代入,有:
能量函数
E [ h ( i ) , v ( i ) ] \mathbb E[h^{(i)},v^{(i)}] E[h(i),v(i)]仅对
w k j ( i ) w_{kj}^{(i)} wkj(i)一项进行求导,连加号
∑ ≠ k m , ∑ ≠ j n \sum_{\neq k}^m,\sum_{\neq j}^n ∑=km,∑=jn均视作常数:
∂ ∂ w k j ( i ) [ E [ h ( i ) , v ( i ) ] ] = ∂ ∂ w k j ( i ) [ − ( h 1 ( i ) ⋅ w 11 ( i ) ⋅ v 1 ( i ) + ⋯ h k ( i ) ⋅ w k j ( i ) ⋅ v j ( i ) + ⋯ + h m ( i ) ⋅ w m n ( i ) ⋅ v n ( i ) + Δ ( i ) ) ] = ∂ ∂ w k j ( i ) [ − ( h k ( i ) ⋅ w k j ( i ) ⋅ v j ( i ) ) ] = − h k ( i ) v j ( i ) ∇ θ [ log P ( v ( i ) ; θ ) ] = − ∑ h ( i ) { P ( h ( i ) ∣ v ( i ) ) ⋅ ( − h k ( i ) v j ( i ) ) } + ∑ h ( i ) , v ( i ) { P ( h ( i ) , v ( i ) ) ⋅ ( − h k ( i ) v j ( i ) ) } \begin{aligned} \frac{\partial}{\partial w_{kj}^{(i)}} \left[\mathbb E[h^{(i)},v^{(i)}]\right] & = \frac{\partial}{\partial w_{kj}^{(i)}} \left[- \left(h_1^{(i)} \cdot w_{11}^{(i)} \cdot v_1^{(i)} + \cdots h_k^{(i)} \cdot w_{kj}^{(i)} \cdot v_j^{(i)} + \cdots + h_m^{(i)} \cdot w_{mn}^{(i)} \cdot v_{n}^{(i)} + \Delta^{(i)}\right)\right] \\ & = \frac{\partial}{\partial w_{kj}^{(i)}} \left[ -\left(h_k^{(i)} \cdot w_{kj}^{(i)} \cdot v_j^{(i)}\right)\right] \\ & = - h_{k}^{(i)} v_j^{(i)} \\ \nabla_{\theta}\left[\log \mathcal P(v^{(i)};\theta)\right] & = -\sum_{h^{(i)}} \left\{\mathcal P(h^{(i)} \mid v^{(i)}) \cdot (-h_k^{(i)}v_j^{(i)})\right\} + \sum_{h^{(i)},v^{(i)}} \left\{\mathcal P(h^{(i)},v^{(i)}) \cdot (-h_k^{(i)} v_j^{(i)})\right\} \end{aligned} ∂wkj(i)∂[E[h(i),v(i)]]∇θ[logP(v(i);θ)]=∂wkj(i)∂[−(h1(i)⋅w11(i)⋅v1(i)+⋯hk(i)⋅wkj(i)⋅vj(i)+⋯+hm(i)⋅wmn(i)⋅vn(i)+Δ(i))]=∂wkj(i)∂[−(hk(i)⋅wkj(i)⋅vj(i))]=−hk(i)vj(i)=−h(i)∑{P(h(i)∣v(i))⋅(−hk(i)vj(i))}+h(i),v(i)∑{P(h(i),v(i))⋅(−hk(i)vj(i))}
继续化简,有:
∇ θ [ log P ( v ( i ) ; θ ) ] = ∑ h ( i ) { P ( h ( i ) ∣ v ( i ) ) ⋅ h k ( i ) v j ( i ) } − ∑ h ( i ) , v ( i ) { P ( h ( i ) , v ( i ) ) ⋅ h k ( i ) v j ( i ) } \begin{aligned} \nabla_{\theta}\left[\log \mathcal P(v^{(i)};\theta)\right] & = \sum_{h^{(i)}} \left\{\mathcal P(h^{(i)} \mid v^{(i)}) \cdot h_k^{(i)}v_j^{(i)}\right\} - \sum_{h^{(i)},v^{(i)}} \left\{\mathcal P(h^{(i)},v^{(i)}) \cdot h_k^{(i)} v_j^{(i)}\right\} \end{aligned} ∇θ[logP(v(i);θ)]=h(i)∑{P(h(i)∣v(i))⋅hk(i)vj(i)}−h(i),v(i)∑{P(h(i),v(i))⋅hk(i)vj(i)}
这里依然假设变量信息(隐变量、观测变量)服从伯努利分布(Bernoulli Distribution),因而所有变量均是离散型随机变量,并且仅有两种取值: 0 , 1 0,1 0,1。基于这种假设,可以继续向下化简:
将
∑ h ( i ) \sum_{h^{(i)}} ∑h(i)按照随机变量维度展开.
除去
∑ h k ( i ) \sum_{h_k^{(i)}} ∑hk(i)项之外的其他项全被积分掉了。例如:
剩余的项以此类推~从而有:
将
P ( h ( i ) , v ( i ) ) \mathcal P(h^{(i)},v^{(i)}) P(h(i),v(i))按照条件概率公式展开.
将仅和
v ( i ) v^{(i)} v(i)相关的和仅和
h ( i ) h^{(i)} h(i)相关的项分开.
第一项的结果
∑ h ( i ) { P ( h ( i ) ∣ v ( i ) ) ⋅ h k ( i ) v j ( i ) } = P ( h k ( i ) = 1 ∣ v ( i ) ) ⋅ v j ( i ) \sum_{h^{(i)}}\left\{\mathcal P(h^{(i)} \mid v^{(i)}) \cdot h_k^{(i)}v_j^{(i)}\right\} = \mathcal P(h_k^{(i)}=1 \mid v^{(i)}) \cdot v_j^{(i)} ∑h(i){P(h(i)∣v(i))⋅hk(i)vj(i)}=P(hk(i)=1∣v(i))⋅vj(i)直接拿来用就可以。
最终,受限玻尔兹曼机的梯度可表示为:
∇ θ [ log P ( v ( i ) ; θ ) ] = P ( h k ( i ) = 1 ∣ v ( i ) ) ⋅ v j ( i ) − ∑ v ( i ) P ( v ( i ) ) ⋅ P ( h k ( i ) = 1 ∣ v ( i ) ) ⋅ v j ( i ) \nabla_{\theta}\left[\log \mathcal P(v^{(i)};\theta)\right] = \mathcal P(h_k^{(i)}=1 \mid v^{(i)}) \cdot v_j^{(i)} - \sum_{v^{(i)}} \mathcal P(v^{(i)}) \cdot \mathcal P(h_k^{(i)} = 1 \mid v^{(i)}) \cdot v_j^{(i)} ∇θ[logP(v(i);θ)]=P(hk(i)=1∣v(i))⋅vj(i)−v(i)∑P(v(i))⋅P(hk(i)=1∣v(i))⋅vj(i)
继续观察等号右侧第一项: P ( h k ( i ) = 1 ∣ v ( i ) ) ⋅ v j ( i ) \mathcal P(h_k^{(i)}=1 \mid v^{(i)}) \cdot v_j^{(i)} P(hk(i)=1∣v(i))⋅vj(i)由于 v ( i ) v^{(i)} v(i)是一个样本,那么 v j ( i ) v_j^{(i)} vj(i)自然是已知的;而 P ( h k ( i ) = 1 ∣ v ( i ) ) \mathcal P(h_k^{(i)}=1 \mid v^{(i)}) P(hk(i)=1∣v(i))在受限玻尔兹曼机——推断任务(后验概率)可以将其表示为Sigmoid函数的形式:
P ( h k ( i ) = 1 ∣ v ( i ) ) = Sigmoid ( ∑ l = 1 n w l k ( i ) ⋅ v l + c l ) \mathcal P(h_k^{(i)}=1 \mid v^{(i)}) = \text{Sigmoid}(\sum_{l=1}^n w_{lk}^{(i)} \cdot v_l + c_l) P(hk(i)=1∣v(i))=Sigmoid(l=1∑nwlk(i)⋅vl+cl)
因此可以判定:第一项是**可求 **的。
关于等号右侧第二项: ∑ v ( i ) P ( v ( i ) ) ⋅ P ( h k ( i ) = 1 ∣ v ( i ) ) ⋅ v j ( i ) \sum_{v^{(i)}} \mathcal P(v^{(i)}) \cdot \mathcal P(h_k^{(i)} = 1 \mid v^{(i)}) \cdot v_j^{(i)} ∑v(i)P(v(i))⋅P(hk(i)=1∣v(i))⋅vj(i)
这里关于 v ( i ) v^{(i)} v(i)的积分是约不掉的,基于它的伯努利分布,v^{(i)}的 n n n维随机变量,其时间复杂度是 O ( n ) = 2 n \mathcal O(n) = 2^n O(n)=2n。因此不能硬求它的梯度结果。需要使用吉布斯采样/对比散度的方式进行求解。
针对第二部分,不再使用常规的吉布斯采样进行求解,而是使用 CD-k \text{CD-k} CD-k的方式进行求解。这种方法的核心区别在于:在初始化吉布斯采样时,并不使用随机样本作为输入,而是选择真实分布 P d a t a ( sample from V ) \mathcal P_{data}(\text{sample from }\mathcal V) Pdata(sample from V)的样本作为初始化采样:
由于受限玻尔兹曼机的性质,隐变量中各随机变量之间相互独立,因此没有必要通过‘固定其他隐变量进行采样,走坐标上升法’;而是各随机变量采各自的样本即可。
t = 1 : Initialization V t = 1 = { v t = 1 ( 1 ) , v t = 1 ( 2 ) , ⋯ , v t = 1 ( N ) } ∈ V ⇒ { v 1 , t = 1 ( 1 ) , v 1 , t = 1 ( 2 ) , ⋯ , v 1 , t = 1 ( N ) ⇒ 1 N ∑ i = 1 N v 1 , t = 1 ( i ) ⇒ h 1 , t = 1 v 2 , t = 1 ( 1 ) , v 2 , t = 1 ( 2 ) , ⋯ , v 2 , t = 1 ( N ) ⇒ 1 N ∑ i = 1 N v 2 , t = 1 ( i ) ⇒ h 2 , t = 1 ⋮ v m , t = 1 ( 1 ) , v m , t = 1 ( 2 ) , ⋯ , v m , t = 1 ( N ) ⇒ 1 N ∑ i = 1 N v m , t = 1 ( i ) ⇒ h m , t = 1 \begin{aligned} & t=1: \text{Initialization } \mathcal V^{t=1} = \{v_{t=1}^{(1)},v_{t=1}^{(2)},\cdots,v_{t=1}^{(N)}\} \in \mathcal V \\ & \Rightarrow \begin{cases} v_{1,t=1}^{(1)},v_{1,t=1}^{(2)},\cdots,v_{1,t=1}^{(N)} \Rightarrow \frac{1}{N} \sum_{i=1}^N v_{1,t=1}^{(i)} \Rightarrow h_{1,t=1} \\ v_{2,t=1}^{(1)},v_{2,t=1}^{(2)},\cdots,v_{2,t=1}^{(N)} \Rightarrow \frac{1}{N} \sum_{i=1}^N v_{2,t=1}^{(i)} \Rightarrow h_{2,t=1} \\ \quad \vdots \\ v_{m,t=1}^{(1)},v_{m,t=1}^{(2)},\cdots,v_{m,t=1}^{(N)} \Rightarrow \frac{1}{N} \sum_{i=1}^N v_{m,t=1}^{(i)} \Rightarrow h_{m,t=1} \\ \end{cases} \end{aligned} t=1:Initialization Vt=1={vt=1(1),vt=1(2),⋯,vt=1(N)}∈V⇒⎩⎪⎪⎪⎪⎨⎪⎪⎪⎪⎧v1,t=1(1),v1,t=1(2),⋯,v1,t=1(N)⇒N1∑i=1Nv1,t=1(i)⇒h1,t=1v2,t=1(1),v2,t=1(2),⋯,v2,t=1(N)⇒N1∑i=1Nv2,t=1(i)⇒h2,t=1⋮vm,t=1(1),vm,t=1(2),⋯,vm,t=1(N)⇒N1∑i=1Nvm,t=1(i)⇒hm,t=1
此时,基于初始化样本第一次采样得到的关于 h t = 1 h_{t=1} ht=1的采样结果表示完成,后续执行直到 V t = k \mathcal V^{t=k} Vt=k中采样,对第二部分进行近似计算:
∑ v ( i ) P ( v ( i ) ) ⋅ P ( h k ( i ) = 1 ∣ v ( i ) ) ⋅ v j ( i ) = E v ( k ) ∼ P ( V t = k ) [ P ( h k ( i ) = 1 ∣ v ( i ) ) ⋅ v j ( i ) ] \begin{aligned} \sum_{v^{(i)}}\mathcal P(v^{(i)}) \cdot \mathcal P(h_k^{(i)} = 1 \mid v^{(i)}) \cdot v_j^{(i)} & = \mathbb E_{v^{(k)} \sim \mathcal P(\mathcal V^{t=k})} \left[\mathcal P(h_k^{(i)} = 1 \mid v^{(i)}) \cdot v_j^{(i)}\right] \end{aligned} v(i)∑P(v(i))⋅P(hk(i)=1∣v(i))⋅vj(i)=Ev(k)∼P(Vt=k)[P(hk(i)=1∣v(i))⋅vj(i)]
而常见的吉布斯采样默认待采样分布中随机变量之间存在关联关系,当对第 i i i维度进行采样时,需要将其他所有维度(包含已更新的、未更新的)均固定住。直到所有维度均采样结束后,再向下一时刻迭代。
受限玻尔兹曼机的吉布斯采样方式称为块吉布斯采样(Block Gibbs Sampling),相比传统吉布斯采样方式更加快捷。
相关参考:
直面配分函数:6-Log-Likelihood Gradient of RBM
直面配分函数:7-CD-k for RBM(RBM的对比散度算法)