For more details, see Su Jianlin's blog, Scientific Spaces (科学空间).
Let $x$ be the observed variable, $z$ the latent variable, and $\tilde{p}(x)$ the evidence distribution of $x$. Then
$$q(x)=q_{\theta}(x)=\int q_{\theta}(x,z)\,\mathrm{d}z$$
Usually we want to approximate $\tilde{p}(x)$ with $q_{\theta}(x)$, i.e., to minimize the KL divergence (equivalent to maximizing the likelihood, or minimizing the cross entropy):
$$KL(\tilde{p}(x)||q(x))=\int \tilde{p}(x)\log\frac{\tilde{p}(x)}{q(x)}\,\mathrm{d}x$$
We now introduce the joint distribution $p(x,z)$. From the relation between joint and marginal distributions, $\tilde{p}(x)=\int p(x,z)\,\mathrm{d}z$. The essence of variational inference is to replace the KL divergence between the marginals, $KL(\tilde{p}(x)||q(x))$, with the KL divergence between the joints, $KL(p(x,z)||q(x,z))$:
$$KL(p(x,z)||q(x,z))=\iint p(x,z)\log\frac{p(x,z)}{q(x,z)}\,\mathrm{d}x\,\mathrm{d}z$$
By Bayes' rule, $p(x,z)=p(z|x)\tilde{p}(x)$ and $q(x,z)=q(z|x)q(x)$, so
$$KL(p(x,z)||q(x,z))=\iint p(z|x)\tilde{p}(x)\log\frac{p(z|x)\tilde{p}(x)}{q(z|x)q(x)}\,\mathrm{d}x\,\mathrm{d}z$$
Splitting the $\log$ gives
$$\iint p(z|x)\tilde{p}(x)\log\frac{\tilde{p}(x)}{q(x)}\,\mathrm{d}x\,\mathrm{d}z+\iint p(z|x)\tilde{p}(x)\log\frac{p(z|x)}{q(z|x)}\,\mathrm{d}x\,\mathrm{d}z$$
where
$$\iint p(z|x)\tilde{p}(x)\log\frac{\tilde{p}(x)}{q(x)}\,\mathrm{d}x\,\mathrm{d}z=\int\tilde{p}(x)\log\frac{\tilde{p}(x)}{q(x)}\,\mathrm{d}x\int p(z|x)\,\mathrm{d}z=KL(\tilde{p}(x)||q(x))$$

$$\iint p(z|x)\tilde{p}(x)\log\frac{p(z|x)}{q(z|x)}\,\mathrm{d}x\,\mathrm{d}z=\int\tilde{p}(x)\int p(z|x)\log\frac{p(z|x)}{q(z|x)}\,\mathrm{d}z\,\mathrm{d}x=\int\tilde{p}(x)\,KL(p(z|x)||q(z|x))\,\mathrm{d}x$$
Therefore,
$$KL(p(x,z)||q(x,z))=KL(\tilde{p}(x)||q(x))+\int\tilde{p}(x)\,KL(p(z|x)||q(z|x))\,\mathrm{d}x\ge KL(\tilde{p}(x)||q(x))$$
This means the joint KL is an upper bound on the marginal KL: minimizing it also drives down $KL(\tilde{p}(x)||q(x))$. Moreover, $KL(p(x,z)||q(x,z))$ is usually easier to compute than $KL(\tilde{p}(x)||q(x))$, so variational inference turns an intractable objective into a tractable one.
Now let $q(x,z)=q(x|z)q(z)$ and $p(x,z)=\tilde{p}(x)p(z|x)$, and substitute into the joint KL divergence:
$$KL(p(x,z)||q(x,z))=\iint\tilde{p}(x)p(z|x)\log\frac{\tilde{p}(x)p(z|x)}{q(x|z)q(z)}\,\mathrm{d}x\,\mathrm{d}z$$
Expanding the $\log$ gives
$$\iint\tilde{p}(x)p(z|x)\log\tilde{p}(x)\,\mathrm{d}x\,\mathrm{d}z-\iint\tilde{p}(x)p(z|x)\log q(x|z)\,\mathrm{d}x\,\mathrm{d}z+\int\tilde{p}(x)\,KL(p(z|x)||q(z))\,\mathrm{d}x$$
From the relation between exact (numerical) integration and sampling-based computation, we know that
$$\mathbb{E}_{x\sim p(x)}[f(x)]=\int f(x)p(x)\,\mathrm{d}x\approx\frac{1}{n}\sum_{i=1}^{n}f(x_i)$$
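As a quick illustration of this approximation, here is a minimal Python sketch; the choice of $f(x)=x^2$ and $p(x)=\mathcal{N}(0,1)$ is mine, purely for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # An arbitrary test function; E_{x ~ N(0,1)}[x^2] = 1 exactly.
    return x ** 2

samples = rng.standard_normal(10_000)  # x_i drawn from p(x) = N(0, 1)
mc_estimate = f(samples).mean()        # (1/n) * sum_i f(x_i)
print(mc_estimate)                     # close to 1.0
```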
Returning to the expansion above: since $\log\tilde{p}(x)$ contains nothing to optimize, the first term can be treated as a constant and dropped, which yields the VAE objective
$$\mathbb{E}_{x\sim\tilde{p}(x)}\left[-\int p(z|x)\log q(x|z)\,\mathrm{d}z+KL(p(z|x)||q(z))\right]=\mathbb{E}_{x\sim\tilde{p}(x)}\left[\mathbb{E}_{z\sim p(z|x)}[-\log q(x|z)]+KL(p(z|x)||q(z))\right]$$
Paper [1] introduces the evidence lower bound (ELBO), which is exactly the VAE objective. For the latent variable $z$, we approximate the posterior $q(z|x)$ with $p(z)$ (note that the roles of $p$ and $q$ here are swapped relative to the notation of [1]); then
$$KL(p(z)||q(z|x))=\int p(z)\log\frac{p(z)}{q(z|x)}\,\mathrm{d}z=\mathbb{E}_{z\sim p(z)}[\log p(z)]-\mathbb{E}_{z\sim p(z)}[\log q(z|x)]$$
Applying Bayes' rule $q(z|x)=q(z,x)/q(x)$ to the second expectation (all expectations below are over $z\sim p(z)$), we obtain
$$\mathbb{E}[\log p(z)]-\mathbb{E}[\log q(z,x)]+\log q(x)$$
Define
$$\mathrm{ELBO}(p)=\mathbb{E}[\log q(z,x)]-\mathbb{E}[\log p(z)]=\mathbb{E}[\log q(x|z)]-KL(p(z)||q(z))$$
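The second equality follows from $q(z,x)=q(x|z)q(z)$; written out, with all expectations taken over $z\sim p(z)$:

$$\mathbb{E}[\log q(z,x)]-\mathbb{E}[\log p(z)]=\mathbb{E}[\log q(x|z)]+\mathbb{E}[\log q(z)]-\mathbb{E}[\log p(z)]=\mathbb{E}[\log q(x|z)]-KL(p(z)||q(z))$$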
Since $\log q(x)$ does not depend on $p(z)$, minimizing $KL(p(z)||q(z|x))$ is equivalent to maximizing the ELBO. This is consistent with the objective derived earlier: $KL(p(z)||q(z|x))$ and $KL(p(z)||q(z))$ serve the same purpose!
In the objective above, $\mathbb{E}_{z\sim p(z|x)}$ requires sampling the latent variable $z$, but sampling is not differentiable. The fix is the reparameterization trick: sample $\xi$ from $\mathcal{N}(0,1)$ and set $z=\mu+\sigma\xi$, so that the randomness no longer depends on the parameters. When only one sample is drawn per data point, the VAE objective becomes
$$\mathbb{E}_{x\sim\tilde{p}(x)}\left[-\log q(x|\mu,\sigma)+KL(p(\mu,\sigma|x)||q(\mu,\sigma))\right]$$
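As a concrete sketch of this objective, the following minimal PyTorch code assumes a Gaussian encoder $p(z|x)=\mathcal{N}(\mu,\sigma^2 I)$, the standard normal prior $q(z)=\mathcal{N}(0,I)$ (so the KL term has a closed form), and a Bernoulli decoder; the function names are my own, not from the papers:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, log_var):
    # z = mu + sigma * xi with xi ~ N(0, I): the sampling noise is moved
    # into xi, so gradients can flow through mu and sigma.
    xi = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * xi

def vae_loss(x, x_recon, mu, log_var):
    # -log q(x|z): for a Bernoulli decoder this is the binary cross entropy
    # (x_recon is assumed to lie in (0, 1), e.g. after a sigmoid).
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL( N(mu, sigma^2) || N(0, I) ) in closed form, summed over dimensions.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```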
GAN fixes $q(z)\sim N(z;0,I)$ and sets $q(x|z)=\delta(x-G(z))$, where $\delta(x)$ is the Dirac delta function and $G(z)$ is the generator. GAN then introduces a binary variable $y$ to form the joint distribution
$$q(x,y)=\begin{cases}\tilde{p}(x)p_1, & y=1\\ q(x)p_0, & y=0\end{cases}$$
where $p_1$ and $p_0$ are the prior probabilities of $y=1$ and $y=0$. Let $p(x,y)=p(y|x)\tilde{p}(x)$; then
$$KL(q(x,y)||p(x,y))=\int\tilde{p}(x)p_1\log\frac{\tilde{p}(x)p_1}{p(1|x)\tilde{p}(x)}\,\mathrm{d}x+\int q(x)p_0\log\frac{q(x)p_0}{p(0|x)\tilde{p}(x)}\,\mathrm{d}x$$
Let $D(x)=p(1|x)$ be the discriminator and optimize alternately. First fix $G(z)$, so that $q(x)$ is a constant; dropping the minus sign (turning the minimization into a maximization) along with the terms and constant factors that do not depend on $D$, we get
$$D=\argmax_D\int\tilde{p}(x)\log D(x)\,\mathrm{d}x+\int q(x)\log(1-D(x))\,\mathrm{d}x=\argmax_D\mathbb{E}_{x\sim\tilde{p}(x)}[\log D(x)]+\mathbb{E}_{x\sim q(x)}[\log(1-D(x))]$$
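On minibatches, this maximization becomes a gradient step. Below is a minimal PyTorch sketch, assuming `D` maps a batch to probabilities in $(0,1)$ and that `G`, `opt_D`, and `z_dim` are defined elsewhere (all names here are hypothetical):

```python
import torch

def discriminator_step(D, G, x_real, z_dim, opt_D):
    # Ascend E[log D(x)] + E[log(1 - D(G(z)))] by descending its negative.
    z = torch.randn(x_real.size(0), z_dim)
    x_fake = G(z).detach()  # G is held fixed during the D update
    loss_D = -(torch.log(D(x_real)).mean()
               + torch.log(1.0 - D(x_fake)).mean())
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()
    return loss_D.item()
```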
Next, fix $D(x)$ and optimize $G$; only the second term depends on $q(x)$, so
$$G=\argmin_G\int q(x)\log\frac{q(x)}{(1-D(x))\tilde{p}(x)}\,\mathrm{d}x$$
From the optimal discriminator (here $q^o(x)$ denotes the generator distribution at the time $D$ was last updated),
$$D(x)=\frac{\tilde{p}(x)}{\tilde{p}(x)+q^o(x)}$$
we can solve for $\tilde{p}(x)$:
$$\tilde{p}(x)=\frac{D(x)\,q^o(x)}{1-D(x)}$$
Substituting this back into the generator objective (the $1-D(x)$ factors cancel) gives
$$\int q(x)\log\frac{q(x)}{D(x)q^o(x)}\,\mathrm{d}x=-\mathbb{E}_{z\sim q(z)}[\log D(G(z))]+KL(q(x)||q^o(x))$$
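The first term on the right is the familiar non-saturating generator loss; in standard GAN practice the $KL(q(x)||q^o(x))$ term, which only measures how far the updated generator drifts from the previous one, is dropped. A matching sketch, under the same assumptions as the discriminator step above:

```python
import torch

def generator_step(D, G, batch_size, z_dim, opt_G):
    # Minimize -E_{z ~ q(z)}[log D(G(z))] with D held fixed.
    z = torch.randn(batch_size, z_dim)
    loss_G = -torch.log(D(G(z))).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_G.item()
```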
The content above mainly follows Su Jianlin's articles; these notes simply record the intermediate steps in more detail as I worked through them (even more detailed derivations can be found in Su's other posts). Understanding the theory of variational inference makes the theoretical derivation of VAE accessible, which in turn is what makes it possible to apply these models to other concrete domains and problems, and to improve and optimize them.
The Bayesian mixture of Gaussians example from paper [1] has not been worked through here yet.
[1] Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. "Variational inference: A review for statisticians." Journal of the American Statistical Association 112.518 (2017): 859-877.
[2] Su, Jianlin. “Variational inference: A unified framework of generative models and some revelations.” arXiv preprint arXiv:1807.05936 (2018).