Recall from the earlier article 【理论推导】变分自动编码器 Variational AutoEncoder (VAE) the result
$$\log p(x) = \mathbb E_{z\sim q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right] + \text{KL}(q||p) \geq \mathbb E_{z\sim q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right]$$
An alternative derivation of this inequality is as follows:
$$\log p(x) = \log \mathbb E_{z\sim q(z|x)}\left[\frac{p(x,z)}{q(z|x)}\right] \geq \mathbb E_{z\sim q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right]$$
where the inequality follows from Jensen's inequality.
Extending the single-layer VAE to a multi-layer VAE gives
$$\begin{aligned} \log p(x) &= \log \int_{z_1}\int_{z_2} p(x, z_1,z_2)\, dz_1 dz_2 \\ &= \log \int_{z_1}\int_{z_2} q(z_1, z_2|x)\, \frac{p(x, z_1,z_2)}{q(z_1, z_2|x)}\, dz_1 dz_2 \\ &= \log \mathbb E_{z_1,z_2\sim q(z_1,z_2|x)}\left[\frac{p(x, z_1,z_2)}{q(z_1, z_2|x)}\right] \\ &\geq \mathbb E_{z_1,z_2\sim q(z_1,z_2|x)}\left[\log \frac{p(x, z_1,z_2)}{q(z_1, z_2|x)}\right] \\ &\overset{(i)}{=} \mathbb E_{z_1,z_2\sim q(z_1,z_2|x)}\left[\log \frac{p(x|z_1)\,p(z_1|z_2)\,p(z_2)}{q(z_1|x)\,q(z_2|z_1)}\right] \end{aligned}$$
Equality (i) requires the variables to satisfy the Markov assumption, so that $p(x,z_1,z_2)=p(x|z_1)p(z_1|z_2)p(z_2)$ and $q(z_1,z_2|x)=q(z_1|x)q(z_2|z_1)$. If we extend the multi-layer VAE to many more layers, we obtain a graphical structure very close to that of a diffusion model, so we can analyze diffusion models with the same tools used for VAEs.
A diffusion model turns an image into noise by repeatedly adding noise to it; this is called the forward diffusion process. Conversely, sampling an initial noise image from a prior noise distribution and generating an image by repeated denoising is the reverse process, analogous in spirit to image generation with Langevin dynamics. We now formalize the forward process. Suppose $x_0\sim q(x)$ is a sample from the real data distribution $q$; we add Gaussian noise to it over $T$ steps according to
$$q(x_t|x_{t-1}) = \mathcal N\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
where $\beta_t \in (0,1)$. The whole process satisfies the Markov assumption, so $q(x_{1:T}|x_0) = \prod_{t=1}^T q(x_t|x_{t-1})$; as $T\rightarrow \infty$, $x_T$ approaches an isotropic Gaussian distribution.
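As an illustration, here is a minimal sketch of one noising step $q(x_t|x_{t-1})$ via the reparameterization trick; the batch shape and the linear $\beta$ schedule below are assumptions for demonstration, not part of the derivation itself.

```python
import torch

def forward_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    """One step of q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    z = torch.randn_like(x_prev)                                  # z ~ N(0, I)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * z

x = torch.randn(4, 3, 32, 32)                                     # stand-in batch for x_0 ~ q(x)
for beta_t in torch.linspace(1e-4, 0.02, 1000):                   # linear schedule as in DDPM
    x = forward_step(x, beta_t.item())                            # x gradually becomes pure noise
```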
If we want to obtain $x_t$ directly rather than by iterating the recursion, we can derive a closed-form expression. Let $\alpha_t = 1-\beta_t$, $\overline{\alpha}_t = \prod_{i=1}^t\alpha_i$, and let $\{z_i, \overline{z}_i \sim \mathcal N(0,I)\}_{i=0}^T$ be independent, identically distributed random variables. From the recursion,
$$\begin{aligned} x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,z_{t-1} \\ &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}\,z_{t-2} + \sqrt{1-\alpha_t}\,z_{t-1} \\ &\overset{(i)}{=} \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\overline{z}_{t-2} \\ &= \dots \\ &= \sqrt{\overline{\alpha}_t}\,x_0+\sqrt{1-\overline{\alpha}_t}\,\overline{z}_0 \end{aligned} \tag{1}$$
Equality (i) uses the fact that a linear combination of two independent Gaussian variables is again Gaussian: for independent $A\sim \mathcal{N}(\mu_a, \sigma_a^2)$ and $B\sim \mathcal{N}(\mu_b, \sigma_b^2)$, we have $mA+nB \sim \mathcal{N}(m\mu_a+n\mu_b,\ m^2\sigma_a^2+n^2\sigma_b^2)$; here the merged variance is $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1-\alpha_t\alpha_{t-1}$. Therefore,
$$x_t \sim \mathcal N\!\left(\sqrt{\overline\alpha_t}\,x_0,\ (1-\overline\alpha_t)I\right) \tag{2}$$
For the diffusion process we want the noise strength to grow from small to large, i.e. $\beta_1 <\beta_2 < \dots <\beta_{T-1} < \beta_T$, which gives $1>\overline{\alpha}_1 > \dots > \overline{\alpha}_T>0$.
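The closed form (1)/(2) means $x_t$ can be sampled in a single jump from $x_0$. A small sketch under the same assumed linear schedule (variable names are mine):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)         # beta_1 < ... < beta_T
alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)      # alpha_bar_t = prod_{i<=t} alpha_i

def q_sample(x0: torch.Tensor, t: int, eps: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) directly, without iterating the Markov chain."""
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

x0 = torch.randn(4, 3, 32, 32)
xT = q_sample(x0, T - 1, torch.randn_like(x0))
# alpha_bars decreases monotonically from ~1 toward ~4e-5, so x_T is essentially pure noise.
```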
For training the network we need one more property: the posterior $q(x_{t-1}|x_t,x_0)$ can be computed in closed form. Applying Bayes' rule,
$$\begin{aligned} q(x_{t-1}|x_t,x_0) &= q(x_t |x_{t-1}, x_0)\,\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} \\ &\propto \exp\left( -\frac{1}{2}\left(\frac{(x_t-\sqrt{\alpha_t}x_{t-1})^2}{\beta_t}+\frac{(x_{t-1}-\sqrt{\overline\alpha_{t-1}}x_{0})^2}{1-\overline\alpha_{t-1}}-\frac{(x_{t}-\sqrt{\overline\alpha_{t}}x_{0})^2}{1-\overline\alpha_{t}}\right)\right) \\ &=\exp\left( -\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\overline\alpha_{t-1}}\right)x_{t-1}^2 - \left(\frac{2\sqrt{\alpha_t}}{\beta_t}x_t+\frac{2\sqrt{\overline \alpha_{t-1}}}{1-\overline \alpha_{t-1}}x_0\right)x_{t-1} + C(x_0,x_t) \right)\right) \end{aligned}$$
Matching this against the form of a Gaussian density, the conditional distribution $x_{t-1}|x_t,x_0$ is Gaussian with the following mean and variance:
$$\begin{aligned} \mu &= \left(\frac{\sqrt{\alpha_t}}{\beta_t}x_t+\frac{\sqrt{\overline \alpha_{t-1}}}{1-\overline \alpha_{t-1}}x_0\right) \Big/ \left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\overline\alpha_{t-1}}\right) = \frac{\sqrt{\alpha_t}(1-\overline\alpha_{t-1})}{1-\overline\alpha_t}x_t+\frac{\sqrt{\overline\alpha_{t-1}}\,\beta_{t}}{1-\overline\alpha_{t}}x_0 \\ \sigma^2 &= \tilde{\beta}_t = \frac{1}{\frac{\alpha_t}{\beta_t}+\frac{1}{1-\overline\alpha_{t-1}}} = \frac{1-\overline\alpha_{t-1}}{1-\overline \alpha_t}\beta_t \end{aligned} \tag{3}$$
Substituting equation (1), i.e. $x_0 = \frac{1}{\sqrt{\overline\alpha_t}}\left(x_t - \sqrt{1-\overline\alpha_t}\,\overline z_0\right)$, to eliminate $x_0$ gives
$$\mu = \tilde \mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\,\overline z_0\right) \tag{4}$$
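For reference, a sketch (the helper name `q_posterior` is mine) computing the posterior mean and variance from equation (3); this is exactly the quantity the reverse model will later be trained to match:

```python
import torch

def q_posterior(x0, xt, t, betas, alphas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for an integer step t (0-indexed)."""
    a_bar_t = alpha_bars[t]
    a_bar_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
    mean = (alphas[t].sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)) * xt \
         + (a_bar_prev.sqrt() * betas[t] / (1 - a_bar_t)) * x0
    var = (1 - a_bar_prev) / (1 - a_bar_t) * betas[t]              # beta_tilde_t in (3)
    return mean, var
```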
Finally, consider the design of the loss function. Suppose we use a probabilistic model $p_\theta$ with parameters $\theta$ to fit the real data distribution $q$. By the non-negativity of the KL divergence,
$$\begin{aligned} -\log p_\theta(x_0) &\leq -\log p_\theta(x_0) +\text{KL}(q(x_{1:T}|x_0)||p_\theta(x_{1:T}|x_0)) \\ &= -\log p_\theta(x_0) +\mathbb E_{q(x_{1:T}|x_0)}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})/p_\theta(x_0)}\right] \\ &= \mathbb E_{q(x_{1:T}|x_0)}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \end{aligned}$$
Taking the expectation of both sides over $q(x_0)$ gives
$$\mathbb E_{q(x_0)}[-\log p_\theta(x_0)]\leq \mathbb E_{q(x_{0:T})}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \overset{\triangle}{=} L_\text{VLB}$$
Simplifying $L_\text{VLB}$,
$$\begin{aligned} L_\text{VLB} &= \mathbb E_{q(x_{0:T})}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \\ &=\mathbb E_{q(x_{0:T})}\left[-\log p(x_T)+\sum_{t=1}^T\log\frac{q(x_t|x_{t-1})}{p_\theta(x_{t-1}|x_t)}\right] \\ &\overset{(i)}{=}\mathbb E_{q(x_{0:T})}\left[-\log p(x_T)+\sum_{t=2}^T\log\left(\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)}\cdot\frac{q(x_t|x_0)}{q(x_{t-1}|x_0)}\right) + \log\frac{q(x_1|x_{0})}{p_\theta(x_{0}|x_1)}\right] \\ &=\mathbb E_{q(x_{0:T})}\left[-\log p(x_T)+\sum_{t=2}^T\log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)}+\log\frac{q(x_T|x_0)}{q(x_{1}|x_0)} + \log\frac{q(x_1|x_{0})}{p_\theta(x_{0}|x_1)}\right] \\ &=\mathbb E_{q(x_{0:T})}\left[\sum_{t=2}^T\log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)}+\log\frac{q(x_T|x_0)}{p(x_T)} - \log p_\theta(x_{0}|x_1)\right] \\ &=\text{KL}(q(x_T|x_0)||p(x_T)) +\sum_{t=2}^T\text{KL}(q(x_{t-1}|x_t,x_0)||p_\theta(x_{t-1}|x_t))-\log p_\theta(x_0|x_1) \end{aligned} \tag{5}$$
Equality (i) follows from the Markov property together with Bayes' rule:
$$q(x_t|x_{t-1}) = q(x_t|x_{t-1},x_0)=\frac{q(x_{t-1}|x_t,x_0)\,q(x_t|x_0)}{q(x_{t-1}|x_0)}$$
We fix the variance $\beta_t$ as a hyperparameter, so the first term in equation (5) contains no learnable parameters and can be ignored; as for the last term, the DDPM authors report that training with a simplified objective that absorbs it works better. Since $p_\theta$ is the model we use to fit the distribution, we can set $p_\theta(x_{t-1}|x_t) = \mathcal N(\mu_\theta(x_t,t),\sigma_\theta(x_t,t))$; with the variance fixed, only the mean of this distribution depends on the learnable parameters, which gives
$$\text{KL}(q(x_{t-1}|x_t,x_0)||p_\theta(x_{t-1}|x_t)) =\mathbb E_q\left[\frac{1}{2\sigma_t^2}\left\|\tilde\mu_t(x_t,x_0)-\mu_\theta(x_t,t)\right\|_2^2\right] +C$$
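This closed form is just the KL divergence between two Gaussians that share the same fixed variance, which collapses to a scaled squared distance between the means. A quick numerical check (the example values are mine):

```python
import torch
from torch.distributions import Normal, kl_divergence

sigma = 0.3
mu_q = torch.tensor([0.5, -1.0, 2.0])           # stands in for the posterior mean mu_tilde_t
mu_p = torch.tensor([0.1,  0.0, 1.5])           # stands in for the model mean mu_theta
kl = kl_divergence(Normal(mu_q, sigma), Normal(mu_p, sigma)).sum()
closed_form = ((mu_q - mu_p) ** 2).sum() / (2 * sigma ** 2)
print(torch.allclose(kl, closed_form))           # True
```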
Let $\epsilon \sim \mathcal N(0,I)$. Using equation (1) we write $x_t = x_t(x_0,\epsilon) = \sqrt{\overline\alpha_t}x_0+\sqrt{1-\overline\alpha_t}\,\epsilon$, and we substitute equation (4) for $\tilde\mu_t$ (with $\epsilon$ playing the role of $\overline z_0$):
$$\text{KL}(q(x_{t-1}|x_t,x_0)||p_\theta(x_{t-1}|x_t)) =\mathbb E_q\left[\frac{1}{2\sigma_t^2}\left\|\frac{1}{\sqrt{\alpha_t}}\left(x_t(x_0,\epsilon)-\frac{\beta_t}{\sqrt{1-\overline\alpha_t}}\epsilon\right)-\mu_\theta(x_t,t)\right\|_2^2\right] +C$$
We want the model $\mu_\theta$ to have the same functional form as $\tilde\mu_t$ so that the network only needs to fit the noise term; that is, we parameterize $\mu_\theta$ through a noise predictor $\epsilon_\theta$ as
$$\mu_\theta(x_t,t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t-\frac{\beta_t}{\sqrt{1-\overline\alpha_t}}\,\epsilon_\theta(x_t,t)\right)$$
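As a side note, this parameterization is exactly what one reverse (denoising) step uses at sampling time. A sketch assuming a learned noise predictor `eps_model` (a hypothetical network taking $(x_t, t)$) and the fixed variance choice $\sigma_t^2=\beta_t$:

```python
import torch

@torch.no_grad()
def p_sample(eps_model, xt, t, betas, alphas, alpha_bars):
    """One step x_t -> x_{t-1} of p_theta, with mu_theta built from eps_theta as above."""
    t_batch = torch.full((xt.shape[0],), t, device=xt.device, dtype=torch.long)
    eps = eps_model(xt, t_batch)
    mean = (xt - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
    if t == 0:
        return mean                                    # no noise is added at the final step
    return mean + betas[t].sqrt() * torch.randn_like(xt)
```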
Therefore,
$$\text{KL}(q(x_{t-1}|x_t,x_0)||p_\theta(x_{t-1}|x_t)) =\mathbb E_{x_0,\epsilon}\left[\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\overline{\alpha}_t)}\left\|\epsilon-\epsilon_\theta\!\left(\sqrt{\overline\alpha_t}x_0+\sqrt{1-\overline\alpha_t}\,\epsilon,\ t\right)\right\|_2^2\right] +C$$
Dropping the time-dependent weighting coefficient, as DDPM does for its simplified objective, the loss function becomes
$$\mathcal L(\theta) =\mathbb E_{t,x_0,\epsilon}\left[\left\|\epsilon-\epsilon_\theta\!\left(\sqrt{\overline\alpha_t}x_0+\sqrt{1-\overline\alpha_t}\,\epsilon,\ t\right)\right\|_2^2\right] \tag{6}$$
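Equation (6) translates directly into a training step: sample $t$ and $\epsilon$, build $x_t$ in closed form via equation (1), and regress the injected noise. A minimal sketch, assuming `eps_model` is any network (e.g. a U-Net) taking $(x_t, t)$ and that the shapes below are illustrative:

```python
import torch
import torch.nn.functional as F

def loss_simple(eps_model, x0, alpha_bars):
    """Simplified objective of equation (6) for one batch of images x0."""
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (b,), device=x0.device)   # t ~ Uniform{1, ..., T}
    eps = torch.randn_like(x0)                                       # eps ~ N(0, I)
    a_bar = alpha_bars[t].view(b, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps                # x_t from equation (1)
    return F.mse_loss(eps_model(xt, t), eps)                         # || eps - eps_theta ||^2
```

Each gradient step then simply backpropagates this mean squared error through `eps_model`.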
Reference: Ho, Jain, Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS 2020.