Before diving into variational inference, we add some background so that each module stands on its own. We know that machine learning can be divided into two schools, the frequentist school and the Bayesian school, each with its own characteristics.
From the frequentist perspective, learning ultimately becomes an optimization problem. Take the familiar regression problem:
suppose we have data $D=\{(x_i,y_i)\}_{i=1}^N$, $x_i\in\mathbb{R}^P$, $y_i\in\mathbb{R}$, and we define the model as:
$$f(x)=w^Tx\tag{1}$$
This defines a line; we need to estimate $w$ from the data.
To fit the data, we introduce a loss function:
$$L(w)=\sum_{i=1}^{N}\|w^Tx_i-y_i\|^2\tag{2}$$
The resulting estimate of $w$ is denoted $\hat w$:
$$\hat w=\mathop{\arg\min}\limits_{w}L(w)\tag{3}$$
This is an unconstrained optimization problem.
There are two ways to solve it.
Analytical solution:
$$\frac{\partial L(w)}{\partial w}=0\ \Rightarrow\ w^*=(X^TX)^{-1}X^TY\tag{4}$$
Numerical solution:
when no analytical solution is available, we solve by gradient descent (GD) or by stochastic gradient descent (SGD).
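As a quick illustration of the two approaches, here is a minimal sketch (the data, learning rate, and iteration count are assumptions for the example, not from the text) comparing the closed-form solution (4) with plain gradient descent on the loss (2):

```python
import numpy as np

# Illustrative data: noiseless linear targets so both methods recover w exactly.
rng = np.random.default_rng(0)
N, P = 100, 3
X = rng.normal(size=(N, P))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

# Analytical solution: w* = (X^T X)^{-1} X^T y
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerical solution: gradient descent on L(w) = sum_i (w^T x_i - y_i)^2
w_gd = np.zeros(P)
lr = 0.001
for _ in range(2000):
    grad = 2 * X.T @ (X @ w_gd - y)   # gradient of L(w)
    w_gd -= lr * grad

print(w_closed, w_gd)  # both close to w_true
```

With noiseless data both recover the true weights; SGD would use the same gradient computed on mini-batches.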
The support vector machine (SVM) is essentially a classification problem. We define the SVM model as:
$$f(x)=\mathrm{sign}(w^Tx+b)\tag{5}$$
The SVM objective (a constrained loss function) is:
$$L(w)=\frac{1}{2}w^Tw,\quad \text{s.t. } y_i(w^Tx_i+b)\geq 1,\ i=1,2,\dots,N\tag{6}$$
The EM algorithm is also an optimization idea: it finds the best parameters by repeated iteration. The update is:
$$\theta^{(t+1)}=\mathop{\arg\max}\limits_{\theta}\int_z \log P(X,Z|\theta)\cdot P(Z|X,\theta^{(t)})\,dz\tag{7}$$
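To make the iteration (7) concrete, here is a minimal sketch on an assumed toy model (not from the text): a two-component 1D Gaussian mixture with unit variances and equal weights, estimating only the two means. The E-step computes the responsibilities $P(Z|X,\theta^{(t)})$ and the M-step maximizes the expected complete-data log-likelihood:

```python
import numpy as np

# Illustrative data: two well-separated clusters.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4, 1, 200), rng.normal(4, 1, 200)])

mu = np.array([-1.0, 1.0])               # initial guess for the two means
for _ in range(50):
    # E-step: responsibilities P(Z | X, theta^(t)) under unit-variance components
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: argmax of the expected complete-data log-likelihood
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

print(np.sort(mu))  # close to the true means -4 and 4
```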
From the Bayesian perspective, on the other hand, learning ultimately becomes an integration problem. By Bayes' theorem:
$$P(\theta|X)=\frac{P(X|\theta)\cdot P(\theta)}{P(X)}\tag{8}$$
The goal of variational inference (VI) is to find a distribution $q(z)$ that approximates the posterior $p(z|x)$, for which no analytical solution can be computed. We write:
$$P(X)=\frac{P(X,Z)}{P(Z|X)}=\frac{P(X,Z)/q(Z)}{P(Z|X)/q(Z)}\tag{9}$$
Taking logarithms on both sides:
$$\log P(X)=\log\frac{P(X,Z)}{q(Z)}-\log\frac{P(Z|X)}{q(Z)}\tag{10}$$
Taking the expectation of both sides with respect to $q(Z)$:
Left side:
$$\int_Z q(Z)\log P(X)\,dZ=\log P(X)\int_Z q(Z)\,dZ=\log P(X)$$
Right side:
$$\underbrace{\int_Z q(Z)\log \frac{P(X,Z)}{q(Z)}\,dZ}_{ELBO}+\underbrace{\int_Z\Big(-q(Z)\log \frac{P(Z|X)}{q(Z)}\Big)dZ}_{KL(q\|p)}\tag{11}$$
Therefore:
$$\log P(X)=\underbrace{\int_Z q(Z)\log \frac{P(X,Z)}{q(Z)}\,dZ}_{ELBO=L(q)}+\underbrace{\int_Z\Big(-q(Z)\log \frac{P(Z|X)}{q(Z)}\Big)dZ}_{KL(q\|p)}\tag{12}$$
We write the ELBO as a functional $L(q)$ to emphasize that its input is the distribution $q$, a probability density we are free to choose; the functional $L(q)$ is what the word "variational" refers to. Since $P(Z|X)$ cannot be computed, our aim is to find a $q(Z)$ close to $P(Z|X)$, that is, to make $KL(q\|p)$ as small as possible. Because $\log P(X)$ is fixed with respect to $q$, a smaller KL term means the ELBO gets closer to $\log P(X)$, so the objective is equivalent to:
$$\hat{q}(Z)=\mathop{\arg\max}\limits_{q(Z)} L(q)\ \Rightarrow\ \hat{q}(Z)\approx P(Z|X)\tag{13}$$
To solve for $q(Z)$, note that $Z$ is a collection of latent variables. We assume $q(Z)$ can be split into $M$ groups that are mutually independent, an idea borrowed from mean-field theory in physics:
$$q(Z)=\prod_{i=1}^{M}q_i(Z_i)\tag{14}$$
Following the mean-field idea, we now substitute $q(Z)$ into $L(q)$. Since $q(Z)$ has $M$ factors, the plan is to solve for the $j$-th factor while holding all the others fixed (fixing $\{1,2,\dots,j-1,j+1,\dots,M\}$); solving for each $q_j$ in turn and taking the product then yields $q(Z)$.
Substituting (14) into $L(q)=\int_Z q(Z)\log P(X,Z)\,dZ-\int_Z q(Z)\log q(Z)\,dZ$ splits it into two terms, which we label ① and ②. For term ①:
$$①=\int_Z \prod_{i=1}^{M}q_i(Z_i)\log P(X,Z)\,dZ$$
$$=\int_{Z_j}q_j(Z_j)\Big[\int \log P(X,Z)\prod_{i\neq j}q_i(Z_i)\,dZ_1\cdots dZ_{j-1}\,dZ_{j+1}\cdots dZ_M\Big]dZ_j$$
$$=\int_{Z_j}q_j(Z_j)\cdot E_{\prod_{i\neq j}q_i(Z_i)}[\log P(X,Z)]\,dZ_j$$
For term ②:
$$②=\int_Z q(Z)\log q(Z)\,dZ=\int_Z\prod_{i=1}^{M}q_i(Z_i)\log \prod_{i=1}^{M}q_i(Z_i)\,dZ=\int_Z\prod_{i=1}^{M}q_i(Z_i)\Big[\sum_{i=1}^{M}\log q_i(Z_i)\Big]dZ$$
Expanding the sum and taking out the first term for analysis:
$$\text{first term}=\int_Z\prod_{i=1}^{M}q_i(Z_i)\log q_1(Z_1)\,dZ$$
$$=\int_{Z_1}q_1(Z_1)\log q_1(Z_1)\,dZ_1\cdot\underbrace{\int_{Z_2\cdots Z_M}\prod_{i=2}^{M}q_i(Z_i)\,dZ_2\cdots dZ_M}_{=1}=\int_{Z_1}q_1(Z_1)\log q_1(Z_1)\,dZ_1$$
Treating the other terms the same way:
$$②=\int_{Z_1}q_1(Z_1)\log q_1(Z_1)\,dZ_1+\int_{Z_2}q_2(Z_2)\log q_2(Z_2)\,dZ_2+\cdots+\int_{Z_M}q_M(Z_M)\log q_M(Z_M)\,dZ_M$$
$$=\sum_{i=1}^{M}\int_{Z_i} q_i(Z_i)\log q_i(Z_i)\,dZ_i$$
Since we only care about the $j$-th term and the others are held fixed, this becomes:
$$②=\int_{Z_j} q_j(Z_j)\log q_j(Z_j)\,dZ_j+C\quad (C\text{ a constant})$$
Putting the two pieces together: the expressions for ① and ② have different forms, which makes simplification awkward, so we rewrite ① to match the log form of ②. Since:
$$①=\int_{Z_j}q_j(Z_j)\cdot E_{\prod_{i\neq j}q_i(Z_i)}[\log P(X,Z)]\,dZ_j$$
for convenience we define $\log \hat{P}(X,Z_j)=E_{\prod_{i\neq j}q_i(Z_i)}[\log P(X,Z)]$, which turns ① into:
$$①=\int_{Z_j}q_j(Z_j)\cdot\log \hat{P}(X,Z_j)\,dZ_j$$
And since:
$$②=\int_{Z_j} q_j(Z_j)\log q_j(Z_j)\,dZ_j+C$$
$$L(q)=①-②=\int_{Z_j}q_j(Z_j)\cdot\log \frac{\hat{P}(X,Z_j)}{q_j(Z_j)}\,dZ_j-C=-KL\big(q_j(Z_j)\,\|\,\hat{P}(X,Z_j)\big)-C$$
Since we only care about the maximizer, the constant $C$ can be ignored; only the KL term matters.
$$L(q)=①-②=-KL\big(q_j(Z_j)\,\|\,\hat{P}(X,Z_j)\big)\leq 0$$
$L(q)$ is maximized when the KL term vanishes, i.e. when $q_j(Z_j)=\hat{P}(X,Z_j)$, giving the mean-field update $\log q_j(Z_j)=E_{\prod_{i\neq j}q_i(Z_i)}[\log P(X,Z)]+\text{const}$.
We have now covered variational inference based on mean-field theory (mean-field VI), also called classical VI.
This iterative scheme follows the same idea as coordinate ascent. The above is the classical mean-field approach to variational inference, but it also has drawbacks.
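To make the coordinate-ascent flavor concrete, here is a minimal sketch on an assumed toy posterior (not from the text): a correlated bivariate Gaussian $N(\mu,\Sigma)$, approximated by a factorized $q(z)=q_1(z_1)\,q_2(z_2)$. The classical result for this model is that each optimal factor is Gaussian with precision $\Lambda_{jj}$ and a mean that depends on the other factor's current mean, so the updates are iterated one coordinate at a time:

```python
import numpy as np

# Target "posterior": a correlated bivariate Gaussian p(z1, z2) = N(mu, Sigma).
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)  # precision matrix

# Mean-field family q(z) = q1(z1) q2(z2); each optimal factor is Gaussian with
# precision Lam[j, j] and mean m_j = mu_j - Lam[j, k]/Lam[j, j] * (m_k - mu_k).
# Coordinate ascent just iterates the two mean updates, each holding the other fixed.
m = np.zeros(2)  # initial factor means
for _ in range(50):
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

print(m)  # the factor means converge to the true posterior mean
```

Note that the factorized approximation matches the true means here but, with variances $1/\Lambda_{jj}$, it underestimates the marginal variances, one of the known drawbacks of mean-field VI.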
Variational inference is, at bottom, still inference: in probabilistic graphical models, inference usually means computing a posterior. EM and variational inference use the same machinery (the ELBO) in their derivations; they just address somewhat different problems. Because variational inference cares about the posterior, we de-emphasize the parameter $\theta$.
$$\log P_{\theta}(X)=\underbrace{ELBO}_{L(q)}+\underbrace{KL(q\|p)}_{\geq 0}\geq L(q)\tag{18}$$
To describe things more conveniently from here on, we use lowercase letters for random variables.
Stochastic Gradient Variational Inference (SGVI) is the stochastic-gradient approach to variational inference. For a model made of $z$ and $x$, the generative model refers to $z\to x$ (the likelihood $P_{\theta}(x|z)$) and the inference model to $x\to z$ (the approximate posterior $q_{\phi}(z|x)$).
In the previous section we analyzed variational inference based on the mean-field assumption and saw that it solves the problem in the spirit of coordinate ascent. Since coordinate ascent works, it is natural to try gradient ascent instead, which leads to SGVI. A stochastic gradient algorithm needs a direction and a step size, with updates of the form:
$$\phi^{(t+1)}=\phi^{(t)}+\lambda^{(t)}\nabla_{\phi} L(\phi)\tag{23}$$
$$\nabla_{\phi}L(\phi)=\nabla_{\phi}E_{q_{\phi}(z)}[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]$$
$$=\nabla_{\phi}\int_z q_{\phi}(z)\cdot[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\,dz$$
By the product rule this splits into two terms:
$$=\underbrace{\int_z \nabla_{\phi}q_{\phi}(z)\cdot[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\,dz}_{①}+\underbrace{\int_z q_{\phi}(z)\cdot\nabla_{\phi}[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\,dz}_{②}$$
We analyze the two terms separately:
$$①=\int_z \nabla_{\phi}q_{\phi}(z)\cdot[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\,dz$$
$$②=\int_z q_{\phi}(z)\cdot\nabla_{\phi}[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\,dz$$
Note that $\log P_{\theta}(x^{(i)},z)$ does not depend on $\phi$, so its gradient with respect to $\phi$ is 0.
$$②=\int_z q_{\phi}(z)\cdot\nabla_{\phi}[\underbrace{\log P_{\theta}(x^{(i)},z)}_{\text{independent of }\phi}-\log q_{\phi}(z)]\,dz$$
$$=\int_z q_{\phi}(z)\cdot\nabla_{\phi}[-\log q_{\phi}(z)]\,dz$$
$$=-\int_z q_{\phi}(z)\cdot\frac{1}{q_{\phi}(z)}\nabla_{\phi}q_{\phi}(z)\,dz$$
$$=-\int_z \nabla_{\phi}q_{\phi}(z)\,dz=-\nabla_{\phi}\underbrace{\int_z q_{\phi}(z)\,dz}_{=1}=-\nabla_{\phi}1=0$$
Conclusion:
$$\nabla_{\phi}L(\phi)=\int_z \nabla_{\phi}q_{\phi}(z)\cdot[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\,dz\tag{28}$$
Since $\nabla_{\phi}q_{\phi}(z)=q_{\phi}(z)\nabla_{\phi}\log q_{\phi}(z)$, the expression becomes:
$$\nabla_{\phi}L(\phi)=\int_z q_{\phi}(z)\,\nabla_{\phi}\log q_{\phi}(z)\cdot[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\,dz\tag{29}$$
Rewriting this as an expectation with respect to the distribution $q_{\phi}(z)$ gives the conclusion:
$$\nabla_{\phi}L(\phi)=E_{q_{\phi}(z)}\big[\nabla_{\phi}\log q_{\phi}(z)\cdot[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\big]\tag{30}$$
We have now expressed the gradient as an expectation; what remains is to estimate that expectation, which we do by Monte Carlo sampling: draw $z^{(l)}\sim q_{\phi}(z)$ for $l=1,\dots,L$ and average the bracketed quantity over the samples.
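The sampled estimator of (30) can be sketched on an assumed toy model (not from the text): prior $p(z)=N(0,1)$, likelihood $p(x|z)=N(z,1)$, variational family $q_{\phi}(z)=N(\phi,1)$. For this model a short calculation gives the exact ELBO gradient $x-2\phi$, which lets us sanity-check the Monte Carlo estimate:

```python
import numpy as np

# Score-function estimator of the ELBO gradient, eq. (30), on a toy model:
# p(z) = N(0,1), p(x|z) = N(z,1), q_phi(z) = N(phi, 1).
rng = np.random.default_rng(0)
x, phi, L = 1.0, 0.0, 200_000

z = rng.normal(phi, 1.0, size=L)                             # z^(l) ~ q_phi(z)
log_p = -0.5 * z**2 - 0.5 * (x - z)**2 - np.log(2 * np.pi)   # log p(x, z)
log_q = -0.5 * (z - phi)**2 - 0.5 * np.log(2 * np.pi)        # log q_phi(z)
score = z - phi                                              # grad_phi log q_phi(z)
grad_est = np.mean(score * (log_p - log_q))                  # Monte Carlo average

print(grad_est)   # close to the exact gradient x - 2*phi = 1.0
```

Even on this tiny model the estimator needs many samples: its per-sample variance is large, which is one motivation for the reparameterization trick discussed next.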
The core idea of the reparameterization trick is to simplify sampling from $q_{\phi}(z)$: if the randomness can be moved into a fixed, parameter-free distribution $p(\epsilon)$, the expectation that was hard to handle above becomes easy.
$$\nabla_{\phi}L(\phi)=\nabla_{\phi}E_{q_{\phi}(z)}[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\tag{34}$$
Note: since $z$ comes from $q_{\phi}(z)$, if we can separate out the randomness in $z$ and express it through another variable $\epsilon$, we can define a function relating $\epsilon$ to the random variable $z$:
$$z=g_{\phi}(\epsilon,x^{(i)}),\quad \epsilon \sim p(\epsilon)\tag{36}$$
Since $z=g_{\phi}(\epsilon,x^{(i)})$ is a deterministic mapping, probability mass is conserved under the change of variables: the probability that $z$ falls in $dz$ equals the probability that $\epsilon$ falls in $d\epsilon$. Writing the former as $q_{\phi}(z|x^{(i)})\,dz$ and the latter as $p(\epsilon)\,d\epsilon$, we get:
$$|q_{\phi}(z|x^{(i)})\,dz|=|p(\epsilon)\,d\epsilon|\tag{39}$$
Now we rework the gradient using this change of variables:
$$\nabla_{\phi}L(\phi)=\nabla_{\phi}\int_z q_{\phi}(z)\cdot[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\,dz\tag{40}$$
$$=\nabla_{\phi}\int_z [\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\,q_{\phi}(z)\,dz\tag{41}$$
$$=\nabla_{\phi}\int [\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\,p(\epsilon)\,d\epsilon\quad(\text{using }q_{\phi}(z)\,dz=p(\epsilon)\,d\epsilon)\tag{42}$$
$$=\nabla_{\phi}\int_{\epsilon}[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\,p(\epsilon)\,d\epsilon\tag{43}$$
$$=\nabla_{\phi}E_{p(\epsilon)}[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\tag{44}$$
$$=E_{p(\epsilon)}\big[\nabla_{\phi}[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\big]\tag{45}$$
$$=E_{p(\epsilon)}\big[\nabla_{z}[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z)]\cdot\nabla_{\phi}z\big]\tag{46}$$
Recall that $q_{\phi}(z)=q_{\phi}(z|x^{(i)})$ and $z=g_{\phi}(\epsilon,x^{(i)})$, so:
$$\nabla_{\phi}L(\phi)=E_{p(\epsilon)}\big\{\nabla_{z}[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z|x^{(i)})]\big\}\cdot\nabla_{\phi}g_{\phi}(\epsilon,x^{(i)})\tag{47}$$
This makes the sampling for the expectation much simpler, because $p(\epsilon)$ does not depend on $\phi$: we first take the derivative of $[\log P_{\theta}(x^{(i)},z)-\log q_{\phi}(z|x^{(i)})]$ with respect to $z$, then the derivative of $g_{\phi}(\epsilon,x^{(i)})$ with respect to $\phi$; these pieces decouple from the sampling. Finally we sample:
$$\epsilon^{(l)}\sim p(\epsilon),\quad l=1,2,\dots,L\tag{48}$$
After this transformation, the function inside the expectation is simple enough that we can estimate it by Monte Carlo sampling:
$$\nabla_{\phi}L(\phi)\approx\frac{1}{L}\sum_{l=1}^{L}\nabla_{z}[\log P_{\theta}(x^{(i)},z^{(l)})-\log q_{\phi}(z^{(l)}|x^{(i)})]\cdot\nabla_{\phi}g_{\phi}(\epsilon^{(l)},x^{(i)})\tag{49}$$
where $z^{(l)}=g_{\phi}(\epsilon^{(l)},x^{(i)})$. Finally, substitute this gradient estimate into the update rule:
$$\phi^{(t+1)}=\phi^{(t)}+\lambda^{(t)}\nabla_{\phi}L(\phi)\tag{50}$$
This completes the derivation.
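For contrast with the score-function estimator, here is a minimal sketch of the reparameterized estimator (49) on an assumed toy model (not from the text): $p(z)=N(0,1)$, $p(x|z)=N(z,1)$, $q_{\phi}(z)=N(\phi,1)$, so $z=g_{\phi}(\epsilon)=\phi+\epsilon$ with $\epsilon\sim N(0,1)$, and the exact ELBO gradient is $x-2\phi$:

```python
import numpy as np

# Reparameterization-trick estimator of eq. (49) on a toy model:
# p(z) = N(0,1), p(x|z) = N(z,1), q_phi(z) = N(phi, 1), z = phi + eps.
rng = np.random.default_rng(0)
x, phi, L = 1.0, 0.0, 10_000

eps = rng.normal(size=L)          # eps^(l) ~ p(eps)
z = phi + eps                     # z^(l) = g_phi(eps^(l))
dlogp_dz = -z + (x - z)           # grad_z log P(x, z)
dlogq_dz = -(z - phi)             # grad_z log q_phi(z)
dg_dphi = np.ones(L)              # grad_phi g_phi(eps) = 1
grad_est = np.mean((dlogp_dz - dlogq_dz) * dg_dphi)

print(grad_est)   # close to the exact gradient x - 2*phi = 1.0
```

Even with far fewer samples than the score-function sketch, the estimate is tight, reflecting the much lower variance the reparameterization trick typically brings.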