条件变分自编码器(CVAE)及相关论文ELBO推导

推导用到的概率公式:

P ( A , B ∣ C ) = P ( A ∣ B , C ) P ( B ∣ C ) P(A,B|C) = P(A|B,C)P(B|C) P(A,BC)=P(AB,C)P(BC)
证明:
由于 P ( A ∣ B ) = P ( A , B ) P ( B ) P(A|B) = {P(A,B) \over P(B)} P(AB)=P(B)P(A,B),所以 P ( A , B ∣ C ) = P ( A , B , C ) P ( C ) P(A,B|C)={P(A,B,C) \over P(C)} P(A,BC)=P(C)P(A,B,C)
P ( A , B , C ) = P ( A ∣ B , C ) P ( B ∣ C ) P ( C ) P(A,B,C)=P(A|B,C)P(B|C)P(C) P(A,B,C)=P(AB,C)P(BC)P(C),所以 P ( A , B ∣ C ) = P ( A ∣ B , C ) P ( B ∣ C ) P(A,B|C) = P(A|B,C)P(B|C) P(A,BC)=P(AB,C)P(BC)

VAE的缺点

VAE是一种无监督模型,只能生成与输入类似的数据,有研究者提出能够生成不同的数据,比如,输入一张黄皮肤人脸,可以生成类似的白皮肤人脸或者黑皮肤人脸,这种情况下VAE就做不到了,故有研究人员提出了CVAE,这个C就是附加的条件,可以利用附加的条件生成更有多样性的数据。

CVAE

CVAE是一个系列,CVAE的模型不止一个,其他模型以后再慢慢补充。CVAE的推导方法与VAE的类似,不了解VAE的可以看我之前的博客变分自编码器(VAE)

CVAE的模型图如下图所示:
条件变分自编码器(CVAE)及相关论文ELBO推导_第1张图片
在下面的推导中, c c c是附加信息, x x x是输入数据, z z z是隐变量,
l o g p θ ( x ∣ c ) = l o g ∫ z p θ ( x , z ∣ c ) d z = l o g ∫ z q φ ( z ∣ x , c ) p θ ( x , z ∣ c ) q φ ( z ∣ x , c ) d z = l o g E q φ ( z ∣ x , c ) [ p θ ( x , z ∣ c ) q φ ( z ∣ x , c ) ] ≥ E q φ ( z ∣ x , c ) [ l o g p θ ( x , z ∣ c ) q φ ( z ∣ x , c ) ] (1) \begin{aligned} \tag{1}logp_{\theta}(x|c) = & log\int_{z}p_{\theta}(x, z|c)dz \\ = & log\int_{z}{q_{\varphi}(z|x,c)p_{\theta}(x,z|c) \over q_{\varphi}(z|x,c)}dz \\ = & log E_{q_{\varphi}(z|x,c)}[{p_{\theta}(x,z|c) \over q_{\varphi}(z|x,c)}] \\ \ge & E_{q_{\varphi}(z|x,c)}[log{p_{\theta}(x,z|c) \over q_{\varphi}(z|x,c)}] \\ \end{aligned} logpθ(xc)===logzpθ(x,zc)dzlogzqφ(zx,c)qφ(zx,c)pθ(x,zc)dzlogEqφ(zx,c)[qφ(zx,c)pθ(x,zc)]Eqφ(zx,c)[logqφ(zx,c)pθ(x,zc)](1)
E L B O = E q φ ( z ∣ x , c ) [ l o g p θ ( x , z ∣ c ) q φ ( z ∣ x , c ) ] = E q φ ( z ∣ x , c ) [ l o g p θ ( x ∣ z , c ) p θ ( z ∣ c ) q φ ( z ∣ x , c ) ] = E q φ ( z ∣ x , c ) [ l o g p θ ( x ∣ z , c ) ] − K L [ q φ ( z ∣ x , c ) ∣ ∣ p θ ( z ∣ c ) ] (2) \begin{aligned} \tag{2}ELBO = & E_{q_{\varphi}(z|x,c)}[log{p_{\theta}(x,z|c) \over q_{\varphi}(z|x,c)}] \\ = & E_{q_{\varphi}(z|x,c)}[log{p_{\theta}(x|z,c)p_{\theta}(z|c) \over q_{\varphi}(z|x,c)}] \\ = & E_{q_{\varphi}(z|x,c)}[logp_{\theta}(x|z,c)] - KL[q_{\varphi}(z|x,c)||p_{\theta}(z|c)] \end{aligned} ELBO===Eqφ(zx,c)[logqφ(zx,c)pθ(x,zc)]Eqφ(zx,c)[logqφ(zx,c)pθ(xz,c)pθ(zc)]Eqφ(zx,c)[logpθ(xz,c)]KL[qφ(zx,c)pθ(zc)](2)
在(2)式中, p θ ( z ∣ c ) 是 z 的 先 验 , q φ ( z ∣ x , c ) 是 z 的 后 验 p_{\theta}(z|c)是z的先验,q_{\varphi}(z|x,c)是z的后验 pθ(zc)z,qφ(zx,c)z
相应的网络结构如下图所示:
条件变分自编码器(CVAE)及相关论文ELBO推导_第2张图片

与CVAE相关的论文:

[1]Modeling Event Background for If-Then Commonsense Reasoning Using Context-awareVariational Autoencoder.
[2]Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders.
[3]Learning Structured Output Representation using Deep Conditional Generative Models.

论文中的 E L B O ELBO ELBO推导:

[1]中的 E L B O ELBO ELBO推导:
用到的概率公式:
P ( A , B , C ∣ D ) = P ( A ∣ B , C , D ) P ( B , C ∣ D ) = P ( A ∣ B , C , D ) P ( B ∣ C , D ) P ( C ∣ D ) P(A,B,C|D) = P(A|B,C,D)P(B,C|D)= P(A|B,C,D)P(B|C,D)P(C|D) P(A,B,CD)=P(AB,C,D)P(B,CD)=P(AB,C,D)P(BC,D)P(CD)
条件变分自编码器(CVAE)及相关论文ELBO推导_第3张图片
概率图模型如上图所示,虚线是推理网络,实线是生成网络, x 是 x是 xbase event(输入数据), y 是 y是 ytarget(输出数据), z c ′ 和 z c z_{c^{'}}和z_c zczc是隐变量,分别表示event的背景信息和inference dimensions.
l o g p θ ( y ∣ x ) = l o g ∬ p θ ( y , z , z c ′ ∣ x ) d z d z c ′ = l o g ∬ p θ ( y , z , z c ′ ∣ x ) q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) d z d z c ′ = l o g E q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) [ p θ ( y , z , z c ′ ∣ x ) q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) ] ≥ E q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) [ l o g p θ ( y , z , z c ′ ∣ x ) q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) ] E L B O = E q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) [ l o g p θ ( y , z , z c ′ ∣ x ) q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) ] = E q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) [ l o g p θ ( y ∣ z , z c ′ x ) p θ ( z ∣ z c ′ , x ) p θ ( z c ′ ∣ x ) q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) ] = E q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) [ l o g p θ ( y ∣ z , z c ′ x ) ] + E q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) [ l o g p θ ( z ∣ z c ′ , x ) p θ ( z c ′ ∣ x ) q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) ] = E q φ ( z ∣ x , z c ′ , y ) q φ ( z c ′ ∣ x , y ) [ l o g p θ ( y ∣ z , z c ′ x ) ] − ∫ z q φ ( z ∣ x , z c ′ , y ) K L [ q φ ( z c ′ ∣ x , y ) ∣ ∣ p θ ( z c ′ ∣ x ) ] d z − ∫ z c ′ q φ ( z c ′ ∣ x , y ) K L [ q φ ( z ∣ x , z c ′ , y ) ∣ ∣ p θ ( z ∣ z c ′ , x ) ] d z c ′ \begin{aligned} logp_{\theta}(y|x) = & log \iint p_{\theta}(y, z, z_{c^{'}}|x)dzdz_{c^{'}} \\ = & log \iint { p_{\theta}(y, z, z_{c^{'}}|x)q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y) \over q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y)}dzdz_{c^{'}} \\ = & logE_{q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y)}[{p_{\theta}(y, z, z_{c^{'}}|x) \over q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y)}] \\ \ge & E_{q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y)}[log{p_{\theta}(y, z, z_{c^{'}}|x) \over q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y)}] \\ ELBO = & E_{q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y)}[log{p_{\theta}(y, z, z_{c^{'}}|x) \over q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y)}] \\ = & E_{q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y)}[log{p_{\theta}(y |z, z_{c^{'}}x)p_{\theta}(z|z_{c^{'}},x)p_{\theta}(z_{c^{'}}|x) \over q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y)}] \\ = & E_{q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y)}[log{p_{\theta}(y |z, z_{c^{'}}x)]} +E_{q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y)}[log{p_{\theta}(z|z_{c^{'}},x)p_{\theta}(z_{c^{'}}|x) \over q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y)}] \\ =&E_{q_{\varphi}(z|x,z_{c^{'}}, y)q_{\varphi}(z_{c^{'}}|x,y)}[log{p_{\theta}(y |z, z_{c^{'}}x)]} \\ & - \int_z q_{\varphi}(z|x,z_{c^{'}},y)KL[q_{\varphi}(z_{c^{'}}|x,y)||p_{\theta}(z_{c^{'}}|x)]dz \\ & - \int_{z_{c^{'}}}q_{\varphi}(z_{c^{'}}|x,y)KL[q_{\varphi}(z|x, z_{c^{'}},y)||p_{\theta}(z|z_{c^{'}},x)]dz_{c^{'}} \end{aligned} logpθ(yx)===ELBO====logpθ(y,z,zcx)dzdzclogqφ(zx,zc,y)qφ(zcx,y)pθ(y,z,zcx)qφ(zx,zc,y)qφ(zcx,y)dzdzclogEqφ(zx,zc,y)qφ(zcx,y)[qφ(zx,zc,y)qφ(zcx,y)pθ(y,z,zcx)]Eqφ(zx,zc,y)qφ(zcx,y)[logqφ(zx,zc,y)qφ(zcx,y)pθ(y,z,zcx)]Eqφ(zx,zc,y)qφ(zcx,y)[logqφ(zx,zc,y)qφ(zcx,y)pθ(y,z,zcx)]Eqφ(zx,zc,y)qφ(zcx,y)[logqφ(zx,zc,y)qφ(zcx,y)pθ(yz,zcx)pθ(zzc,x)pθ(zcx)]Eqφ(zx,zc,y)qφ(zcx,y)[logpθ(yz,zcx)]+Eqφ(zx,zc,y)qφ(zcx,y)[logqφ(zx,zc,y)qφ(zcx,y)pθ(zzc,x)pθ(zcx)]Eqφ(zx,zc,y)qφ(zcx,y)[logpθ(yz,zcx)]zqφ(zx,zc,y)KL[qφ(zcx,y)pθ(zcx)]dzzcqφ(zcx,y)KL[qφ(zx,zc,y)pθ(zzc,x)]dzc
第一项是重构误差,也就是生成网络,包括 p θ ( y ∣ z , z c ′ , x ) , 用 神 经 网 络 来 拟 合 p_{\theta}(y|z,z_{c^{'}},x),用神经网络来拟合 pθ(yz,zc,x)
后面两项是KL项,也就是编码网络,包括 两 个 先 验 网 络 p θ ( z c ′ ∣ x ) 和 p θ ( z ∣ z c ′ , x ) , 两 个 识 别 网 络 q φ ( z ∣ x , z c ′ , y ) 和 q φ ( z c ′ ∣ x , y ) 两个先验网络p_{\theta}(z_{c^{'}}|x)和p_{\theta}(z|z_{c^{'}},x),两个识别网络q_{\varphi}(z|x,z_{c^{'}},y)和q_{\varphi}(z_{c^{'}}|x,y) pθ(zcx)pθ(zzc,x)qφ(zx,zc,y)qφ(zcx,y),因此总共有5个神经网络来拟合上面提到的概率分布。两个识别网络是用来近似真实的后验分布 p θ ( z ∣ x , z c ′ , y ) 和 p θ ( z c ′ ∣ x , y ) p_{\theta}(z|x,z_{c^{'}},y)和p_{\theta}(z_{c^{'}}|x,y) pθ(zx,zc,y)pθ(zcx,y)的,所以VAE和CVAE都是近似推断而不是精确推断,。
:上面的推导结果和分析与论文中有差异。

[2]中的 E L B O ELBO ELBO推导:
用到的概率公式:
P ( A , B , C ∣ D ) = P ( A ∣ B , C , D ) P ( B , C ∣ D ) = P ( A ∣ B , C , D ) P ( B ∣ C , D ) P ( C ∣ D ) P(A,B,C|D) = P(A|B,C,D)P(B,C|D)= P(A|B,C,D)P(B|C,D)P(C|D) P(A,B,CD)=P(AB,C,D)P(B,CD)=P(AB,C,D)P(BC,D)P(CD)
条件变分自编码器(CVAE)及相关论文ELBO推导_第4张图片
概率图模型如上图所示,虚线是推理网络,实线是生成网络, x x x是对话中的第 k k k条语句, c c c是前 k − 1 k-1 k1条语句,也就是上下文, y y y是语言的一些特征,比如整个对话表达的动作行为, z z z是隐变量。
l o g p θ ( x ∣ c ) = l o g ∬ p θ ( x , z , y ∣ c ) d z d y = l o g ∬ p θ ( x , z , y ∣ c ) q φ ( z ∣ c , x , y ) q φ ( z ∣ c , x , y ) d z d y = l o g E q φ ( z ∣ c , x , y ) [ p θ ( x , z , y ∣ c ) q φ ( z ∣ c , x , y ) ] ≥ E q φ ( z ∣ c , x , y ) [ l o g p θ ( x , z , y ∣ c ) q φ ( z ∣ c , x , y ) ] E L B O = E q φ ( z ∣ c , x , y ) [ l o g p θ ( x , z , y ∣ c ) q φ ( z ∣ c , x , y ) ] = E q φ ( z ∣ c , x , y ) [ l o g p θ ( x ∣ c , z , y ) p θ ( y ∣ c , z ) p θ ( z ∣ c ) q φ ( z ∣ c , x , y ) ] = E q φ ( z ∣ c , x , y ) [ l o g p θ ( x ∣ c , z , y ) ] + E q φ ( z ∣ c , x , y ) [ l o g p θ ( y ∣ c , z ) ] − K L [ q φ ( z ∣ c , x , y ) ∣ ∣ p θ ( z ∣ c ) ] \begin{aligned} logp_{\theta}(x|c) = & log\iint p_{\theta}(x,z,y|c)dzdy \\ = & log \iint{p_{\theta}(x,z,y|c)q_{\varphi}(z|c,x,y) \over q_{\varphi}(z|c,x,y)}dzdy \\ = & logE_{q_{\varphi}(z|c,x,y)}[{p_{\theta}(x,z,y|c)\over q_{\varphi}(z|c,x,y)}] \\ \ge & E_{q_{\varphi}(z|c,x,y)}[log{p_{\theta}(x,z,y|c)\over q_{\varphi}(z|c,x,y)}] \\ ELBO = & E_{q_{\varphi}(z|c,x,y)}[log{p_{\theta}(x,z,y|c)\over q_{\varphi}(z|c,x,y)}] \\ = & E_{q_{\varphi}(z|c,x,y)}[log{p_{\theta}(x|c,z,y)p_{\theta}(y|c,z)p_{\theta}(z|c)\over q_{\varphi}(z|c,x,y)}] \\ = & E_{q_{\varphi}(z|c,x,y)}[logp_{\theta}(x|c,z,y)] \\ &+ E_{q_{\varphi}(z|c,x,y)}[logp_{\theta}(y|c,z)] \\ & - KL[q_{\varphi}(z|c,x,y)||p_{\theta}(z|c)] \end{aligned} logpθ(xc)===ELBO===logpθ(x,z,yc)dzdylogqφ(zc,x,y)pθ(x,z,yc)qφ(zc,x,y)dzdylogEqφ(zc,x,y)[qφ(zc,x,y)pθ(x,z,yc)]Eqφ(zc,x,y)[logqφ(zc,x,y)pθ(x,z,yc)]Eqφ(zc,x,y)[logqφ(zc,x,y)pθ(x,z,yc)]Eqφ(zc,x,y)[logqφ(zc,x,y)pθ(xc,z,y)pθ(yc,z)pθ(zc)]Eqφ(zc,x,y)[logpθ(xc,z,y)]+Eqφ(zc,x,y)[logpθ(yc,z)]KL[qφ(zc,x,y)pθ(zc)]
前两项是重构误差,也就是生成网络,包括 l o g p θ ( x ∣ c , z , y ) logp_{\theta}(x|c,z,y) logpθ(xc,z,y) l o g p θ ( y ∣ c , z ) logp_{\theta}(y|c,z) logpθ(yc,z),分别用来生成 x 和 y x和y xy,第三项是KL项,也就是编码网络,包括先验网络 p θ ( z ∣ c ) p_{\theta}(z|c) pθ(zc)和后验网络 q φ ( z ∣ c , x , y ) q_{\varphi}(z|c,x,y) qφ(zc,x,y),总共有4个网络来拟合上面的概率分布。

你可能感兴趣的:(机器学习,NLP,变分贝叶斯系列)