DDPM (Denoising Diffusion Probabilistic Model)

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, Pieter Abbeel
NeurIPS 2020

1 Background

[Figure 1]
A diffusion model is a latent-variable model. The latents $\bm{x}_1, ..., \bm{x}_T$ have the same dimensionality as the data $\bm{x}_0 \sim q(\bm{x}_0)$, and the sequence of latents forms a Markov chain. Given $p(\bm{x}_T) = \mathcal{N}(\bm{x}_T; \bm{0}, \bm{I})$, the joint distribution $p_{\theta}(\bm{x}_{0:T})$ is called the reverse process:
$$
p_{\theta}(\bm{x}_{0:T}) = p(\bm{x}_T) \prod_{t=1}^T p_{\theta}(\bm{x}_{t-1} | \bm{x}_t), \quad p_{\theta}(\bm{x}_{t-1} | \bm{x}_t) = \mathcal{N}(\bm{x}_{t-1}; \bm{\mu}_{\theta}(\bm{x}_t, t), \bm{\Sigma}_{\theta}(\bm{x}_t, t)) \tag{1}
$$

Here $\bm{\mu}_{\theta}(\bm{x}_t, t)$ and $\bm{\Sigma}_{\theta}(\bm{x}_t, t)$ are neural networks.
The approximate posterior $q(\bm{x}_{1:T} | \bm{x}_0)$ is called the diffusion process (also known as the forward process):
$$
q(\bm{x}_{1:T} | \bm{x}_0) = \prod_{t=1}^T q(\bm{x}_t | \bm{x}_{t-1}), \quad q(\bm{x}_t | \bm{x}_{t-1}) = \mathcal{N}(\bm{x}_t; \sqrt{1 - \beta_t} \bm{x}_{t-1}, \beta_t \bm{I}) \tag{2}
$$

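The diffusion process can be viewed as gradually adding Gaussian noise to the data, with the amount of noise at each step set by the hyperparameter $\beta_t$.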
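As a minimal sketch of one noising step from Eq. 2, the NumPy snippet below draws $\bm{x}_t$ from $\bm{x}_{t-1}$. The function name `forward_step`, the toy data vector, and the ten-value schedule are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I), as in Eq. 2."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

rng = np.random.default_rng(0)
x = np.ones(4)                        # toy vector standing in for x_0
betas = np.linspace(1e-4, 0.02, 10)   # assumed schedule, for illustration only
for beta_t in betas:                  # apply steps t = 1, ..., 10 in order
    x = forward_step(x, beta_t, rng)
```

With `beta_t = 0` the step returns its input unchanged, which is a quick sanity check on the two coefficients.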
Next we introduce a useful property of the diffusion process. Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. From Eq. 2,
$$
\begin{aligned}
\bm{x}_t &= \sqrt{\alpha_t} \bm{x}_{t-1} + \sqrt{\beta_t} \bm{\epsilon}_t \\
&= \sqrt{\alpha_t \alpha_{t-1}} \bm{x}_{t-2} + \sqrt{\alpha_t \beta_{t-1}} \bm{\epsilon}_{t-1} + \sqrt{\beta_t} \bm{\epsilon}_t \\
&= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}} \bm{x}_{t-3} + \sqrt{\alpha_t \alpha_{t-1} \beta_{t-2}} \bm{\epsilon}_{t-2} + \sqrt{\alpha_t \beta_{t-1}} \bm{\epsilon}_{t-1} + \sqrt{\beta_t} \bm{\epsilon}_t \\
&= ... \\
&= \sqrt{\bar{\alpha}_t} \bm{x}_0 + \sqrt{\alpha_t \alpha_{t-1} ... \alpha_2 \beta_1} \bm{\epsilon}_1 + ... + \sqrt{\alpha_t \beta_{t-1}} \bm{\epsilon}_{t-1} + \sqrt{\beta_t} \bm{\epsilon}_t
\end{aligned}
\tag{3}
$$

where all $\bm{\epsilon}_t \sim \mathcal{N}(\bm{0}, \bm{I})$ are independent random noise. Since a sum of independent Gaussians is again Gaussian, with the variances added,
$$
\begin{aligned}
q(\bm{x}_t | \bm{x}_0) &= \mathcal{N}(\bm{x}_t; \sqrt{\bar{\alpha}_t} \bm{x}_0, (\alpha_t \alpha_{t-1} ... \alpha_2 \beta_1 + ... + \alpha_t \beta_{t-1} + \beta_t)\bm{I}) \\
&= \mathcal{N}(\bm{x}_t; \sqrt{\bar{\alpha}_t} \bm{x}_0, (1 - \bar{\alpha}_t)\bm{I})
\end{aligned}
\tag{4}
$$
The second line holds because the variance sum telescopes: with $\beta_s = 1 - \alpha_s$, each term is $\alpha_t ... \alpha_{s+1} \beta_s = \alpha_t ... \alpha_{s+1} - \alpha_t ... \alpha_s$, so only $1 - \bar{\alpha}_t$ remains.

With this property, $\bm{x}_t$ can be sampled directly from $\bm{x}_0$ for any $t$, without simulating the intermediate steps.
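The closed form in Eq. 4 can be sketched and checked numerically as follows. The linear schedule from $10^{-4}$ to $0.02$ with $T = 1000$ is an assumed choice for illustration, and `q_sample` and `accumulated_variance` are hypothetical helper names. The second helper sums the per-step variances from Eq. 3 explicitly so they can be compared with $1 - \bar{\alpha}_t$.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{s <= t} alpha_s

def q_sample(x0, t, eps):
    """Sample x_t from q(x_t | x_0) in one shot (Eq. 4); t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

def accumulated_variance(t):
    """Sum the per-step noise variances of Eq. 3 explicitly."""
    total = 0.0
    for s in range(1, t + 1):
        # beta_s is scaled by alpha_{s+1} * ... * alpha_t on the way to step t
        total += betas[s - 1] * np.prod(alphas[s:t])
    return total

# The explicit sum agrees with the closed form 1 - alpha_bar_t.
assert np.isclose(accumulated_variance(50), 1.0 - alpha_bars[49])
```

Precomputing `alpha_bars` once with a cumulative product is what makes sampling at an arbitrary step a constant-time operation.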

The goal is to train the neural network so that the marginal $p_{\theta}(\bm{x}_0)$ produced by the reverse process is as close as possible to the true distribution $q(\bm{x}_0)$, i.e. to minimize the KL divergence:
$$
\begin{aligned}
D_{KL}(q(\bm{x}_0) \| p_{\theta}(\bm{x}_0)) &= \underbrace{\int q(\bm{x}_0)\log q(\bm{x}_0) \, d\bm{x}_0}_{\text{const}} - \int q(\bm{x}_0)\log p_{\theta}(\bm{x}_0) \, d\bm{x}_0 \\
&= \text{const} - \int q(\bm{x}_0)\log \left[ \int p_{\theta}(\bm{x}_{0:T}) \, d\bm{x}_{1:T} \right] d\bm{x}_0 \\
&= \text{const} - \int q(\bm{x}_0)\log \left[ \int q(\bm{x}_{1:T} | \bm{x}_0) \frac{p_{\theta}(\bm{x}_{0:T})}{q(\bm{x}_{1:T} | \bm{x}_0)} \, d\bm{x}_{1:T} \right] d\bm{x}_0 \\
&\le \text{const} - \int q(\bm{x}_{0:T}) \log \frac{p_{\theta}(\bm{x}_{0:T})}{q(\bm{x}_{1:T} | \bm{x}_0)} \, d\bm{x}_{0:T} \\
&= \text{const} + \mathbb{E}_q\left[- \log \frac{p_{\theta}(\bm{x}_{0:T})}{q(\bm{x}_{1:T} | \bm{x}_0)}\right] \\
&= \text{const} + \underbrace{\mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t \ge 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_t | \bm{x}_{t-1})}\right]}_{L}
\end{aligned}
\tag{5}
$$
The inequality is Jensen's inequality applied to the inner integral (the logarithm is concave).

Since $q(\bm{x}_0)$ is the true data distribution, the first term is a constant, and the training objective becomes minimizing $L$:
$$
\begin{aligned}
L &= \mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t > 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_t | \bm{x}_{t-1})} - \log \frac{p_{\theta}(\bm{x}_0 | \bm{x}_1)}{q(\bm{x}_1 | \bm{x}_0)}\right] \\
&= \mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t > 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_t | \bm{x}_{t-1}, \bm{x}_0)} - \log \frac{p_{\theta}(\bm{x}_0 | \bm{x}_1)}{q(\bm{x}_1 | \bm{x}_0)}\right] \\
&= \mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t > 1} \log \left( \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0)} \cdot \frac{q(\bm{x}_{t-1} | \bm{x}_0)}{q(\bm{x}_t | \bm{x}_0)} \right) - \log \frac{p_{\theta}(\bm{x}_0 | \bm{x}_1)}{q(\bm{x}_1 | \bm{x}_0)}\right] \\
&= \mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t > 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0)} - \sum_{t > 1} \log \frac{q(\bm{x}_{t-1} | \bm{x}_0)}{q(\bm{x}_t | \bm{x}_0)} - \log \frac{p_{\theta}(\bm{x}_0 | \bm{x}_1)}{q(\bm{x}_1 | \bm{x}_0)}\right] \\
&= \mathbb{E}_q\left[- \log \frac{p(\bm{x}_T)}{q(\bm{x}_T | \bm{x}_0)} - \sum_{t > 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0)} - \log p_{\theta}(\bm{x}_0 | \bm{x}_1)\right] \\
&= \mathbb{E}_q\left[\underbrace{D_{KL}(q(\bm{x}_T | \bm{x}_0) \| p(\bm{x}_T))}_{L_T} + \sum_{t > 1} \underbrace{D_{KL}(q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0) \| p_{\theta}(\bm{x}_{t-1} | \bm{x}_t))}_{L_{t-1}} \underbrace{- \log p_{\theta}(\bm{x}_0 | \bm{x}_1)}_{L_0}\right]
\end{aligned}
\tag{6}
$$
The second line uses the Markov property, $q(\bm{x}_t | \bm{x}_{t-1}) = q(\bm{x}_t | \bm{x}_{t-1}, \bm{x}_0)$, and the third line applies Bayes' rule to that term.
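The step from the fourth to the fifth line of Eq. 6 relies on a telescoping sum. Written out as a worked step (using only the quantities already defined above):

$$
\begin{aligned}
\sum_{t > 1} \log \frac{q(\bm{x}_{t-1} | \bm{x}_0)}{q(\bm{x}_t | \bm{x}_0)} &= \sum_{t=2}^{T} \big[ \log q(\bm{x}_{t-1} | \bm{x}_0) - \log q(\bm{x}_t | \bm{x}_0) \big] \\
&= \log q(\bm{x}_1 | \bm{x}_0) - \log q(\bm{x}_T | \bm{x}_0)
\end{aligned}
$$

The $\log q(\bm{x}_1 | \bm{x}_0)$ term cancels the denominator of the last term in the fourth line, and $\log q(\bm{x}_T | \bm{x}_0)$ combines with $-\log p(\bm{x}_T)$ to give $-\log \frac{p(\bm{x}_T)}{q(\bm{x}_T | \bm{x}_0)}$.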

2 Diffusion models and denoising autoencoders

2.1 Forward process and $L_T$

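Because the $\beta_t$ are fixed hyperparameters, $q(\bm{x}_T | \bm{x}_0)$ has no learnable parameters, and $p(\bm{x}_T)$ is a fixed standard normal. $L_T$ is therefore a constant with respect to $\theta$ and can be ignored during training. It is also close to 0 whenever $\bar{\alpha}_T \approx 0$, since then $q(\bm{x}_T | \bm{x}_0) \approx \mathcal{N}(\bm{0}, \bm{I})$ by Eq. 4.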
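To get a feel for the size of $L_T$, the sketch below evaluates the KL divergence between $q(\bm{x}_T | \bm{x}_0)$ from Eq. 4 and the standard normal prior. The linear schedule with $T = 1000$ and the all-ones data point are assumptions for illustration, and `kl_to_standard_normal` is a hypothetical helper that uses the standard closed form for diagonal Gaussians.

```python
import numpy as np

def kl_to_standard_normal(mean, var):
    """KL( N(mean, diag(var)) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(mean**2 + var - 1.0 - np.log(var))

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear schedule
alpha_bar_T = np.prod(1.0 - betas)     # alpha_bar at the final step

x0 = np.ones(8)                        # toy data point
L_T = kl_to_standard_normal(np.sqrt(alpha_bar_T) * x0,
                            (1.0 - alpha_bar_T) * np.ones_like(x0))
```

Under these assumptions `alpha_bar_T` is tiny, so `L_T` comes out very small but not exactly zero, and it does not depend on any model parameter.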

2.2 Reverse process and $L_{1:T-1}$

In $L_{t-1}$, $q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0) = \mathcal{N}(\bm{x}_{t-1}; \widetilde{\bm{\mu}}_t(\bm{x}_t, \bm{x}_0), \widetilde{\beta}_t \bm{I})$; the proof is shown in the figure below. Figure source: 54、Diffusion Model扩散模型理论与完整PyTorch代码详细解读.
[Figure 2: derivation of $q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0)$]
The authors replace $\bm{\Sigma}_{\theta}(\bm{x}_t, t)$ in $p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)$ with $\sigma_t^2 \bm{I}$, where $\sigma_t^2$ is set to either $\beta_t$ or $\widetilde{\beta}_t$; the two choices gave similar results.
Since $D_{KL}(\mathcal{N}(\mu_1, \sigma_1^2) \| \mathcal{N}(\mu_2, \sigma_2^2)) = \frac{(\mu_1 - \mu_2)^2 + \sigma_1^2}{2 \sigma_2^2} + \log \frac{\sigma_2}{\sigma_1} - \frac{1}{2}$, and the variances do not depend on $\theta$, we have
$$
\begin{aligned}
L_{t-1} &= \mathbb{E}_q\left[\frac{1}{2\sigma_t^2} \left\| \widetilde{\bm{\mu}}_t(\bm{x}_t, \bm{x}_0) - \bm{\mu}_{\theta}(\bm{x}_t, t) \right\|^2\right] + C \\
&= \mathbb{E}_q\left[\frac{1}{2\sigma_t^2} \left\| \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \bm{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \bm{x}_0 - \bm{\mu}_{\theta}(\bm{x}_t, t) \right\|^2\right] + C \\
&= \mathbb{E}_{\bm{x}_0, \bm{\epsilon}}\left[\frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}} \Big( \bm{x}_t(\bm{x}_0, \bm{\epsilon}) - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\bm{\epsilon} \Big) - \bm{\mu}_{\theta} \big( \bm{x}_t(\bm{x}_0, \bm{\epsilon}), t \big) \right\|^2\right] + C
\end{aligned}
\tag{7}
$$

where $\bm{x}_t(\bm{x}_0, \bm{\epsilon}) = \sqrt{\bar{\alpha}_t}\bm{x}_0 + \sqrt{1 - \bar{\alpha}_t} \bm{\epsilon}$, $\bm{\epsilon}$ is the Gaussian noise introduced when sampling $\bm{x}_t$ from $\bm{x}_0$, and $C$ is a constant.
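The move from the second to the third line of Eq. 7 substitutes $\bm{x}_0 = (\bm{x}_t - \sqrt{1 - \bar{\alpha}_t}\bm{\epsilon}) / \sqrt{\bar{\alpha}_t}$ into the posterior mean. A small numerical check of that identity is sketched below; the linear schedule, the random test vectors, and both function names are assumptions chosen for illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_mean(x_t, x0, t):
    """mu_tilde_t(x_t, x_0), second line of Eq. 7; t is 1-indexed, t >= 2."""
    a_t, b_t = alphas[t - 1], betas[t - 1]
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    coef_xt = np.sqrt(a_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    coef_x0 = np.sqrt(ab_prev) * b_t / (1.0 - ab_t)
    return coef_xt * x_t + coef_x0 * x0

def posterior_mean_from_eps(x_t, eps, t):
    """The same mean written in terms of the noise (third line of Eq. 7)."""
    a_t, b_t, ab_t = alphas[t - 1], betas[t - 1], alpha_bars[t - 1]
    return (x_t - b_t / np.sqrt(1.0 - ab_t) * eps) / np.sqrt(a_t)

rng = np.random.default_rng(0)
x0, eps, t = rng.standard_normal(5), rng.standard_normal(5), 300
x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps

# Both parameterizations give the same posterior mean.
assert np.allclose(posterior_mean(x_t, x0, t), posterior_mean_from_eps(x_t, eps, t))
```

The agreement follows from $\sqrt{\bar{\alpha}_{t-1}} / \sqrt{\bar{\alpha}_t} = 1 / \sqrt{\alpha_t}$ and $\alpha_t (1 - \bar{\alpha}_{t-1}) + \beta_t = 1 - \bar{\alpha}_t$.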
Based on Eq. 7, the network $\bm{\mu}_{\theta}$ can be reparameterized in terms of a noise-prediction network $\bm{\epsilon}_{\theta}$ as
$$
\bm{\mu}_{\theta}(\bm{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \Big( \bm{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\bm{\epsilon}_{\theta}(\bm{x}_t, t) \Big) \tag{8}
$$

The loss $L_{t-1}$ then becomes
$$
L_{t-1} = \mathbb{E}_{\bm{x}_0, \bm{\epsilon}}\left[\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \left\| \bm{\epsilon} - \bm{\epsilon}_{\theta} \big( \bm{x}_t(\bm{x}_0, \bm{\epsilon}), t \big) \right\|^2\right] + C \tag{9}
$$
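The weight in Eq. 9 can be checked against the mean-matching form of Eq. 7 with a short sketch. The schedule, the choice $\sigma_t^2 = \beta_t$, and the random vector standing in for the network output are all assumptions for illustration; `mu_theta` and `eq9_weight` are hypothetical helper names.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def mu_theta(x_t, eps_pred, t):
    """Eq. 8: reverse-process mean built from a noise prediction; t is 1-indexed."""
    a_t, b_t, ab_t = alphas[t - 1], betas[t - 1], alpha_bars[t - 1]
    return (x_t - b_t / np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(a_t)

def eq9_weight(t, sigma2):
    """Coefficient in front of ||eps - eps_theta||^2 in Eq. 9."""
    a_t, b_t, ab_t = alphas[t - 1], betas[t - 1], alpha_bars[t - 1]
    return b_t**2 / (2.0 * sigma2 * a_t * (1.0 - ab_t))

rng = np.random.default_rng(0)
t = 500
sigma2 = betas[t - 1]                  # the sigma_t^2 = beta_t choice
x_t = rng.standard_normal(6)
eps = rng.standard_normal(6)           # true noise
eps_pred = rng.standard_normal(6)      # stand-in for eps_theta(x_t, t)

# Left: mean-matching form (Eq. 7). Right: noise-matching form (Eq. 9).
lhs = np.sum((mu_theta(x_t, eps, t) - mu_theta(x_t, eps_pred, t))**2) / (2.0 * sigma2)
rhs = eq9_weight(t, sigma2) * np.sum((eps - eps_pred)**2)
assert np.isclose(lhs, rhs)
```

Here `mu_theta(x_t, eps, t)` with the true noise plays the role of $\widetilde{\bm{\mu}}_t$, which is exactly the substitution made in the third line of Eq. 7.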

2.3 Reverse process and $L_0$

In the paper the authors make some adjustments here based on the properties of image data; these are omitted in this note.
From Eq. 1, with $\bm{\Sigma}_{\theta}$ replaced by $\sigma_t^2 \bm{I}$,
$$
p_{\theta}(\bm{x}_0 | \bm{x}_1) = \mathcal{N}(\bm{x}_0; \bm{\mu}_{\theta}(\bm{x}_1, 1), \sigma_1^2 \bm{I}) \tag{10}
$$

Therefore,
$$
L_0 = \frac{1}{2 \sigma_1^2} \mathbb{E}_{\bm{x}_0, \bar{\bm{\epsilon}}_1} \left[\left\| \bm{x}_0 - \bm{\mu}_{\theta} \big(\bm{x}_1(\bm{x}_0, \bar{\bm{\epsilon}}_1), 1 \big) \right\|^2\right] + C' \tag{11}
$$

where $C'$ is a constant.

2.4 Simplified training objective

The authors found that simplifying the loss $L$ to $L_{\text{simple}}$, which drops the per-step weights of Eq. 9, improves the quality of the model:
$$
L_{\text{simple}}(\theta) = \mathbb{E}_{t, \bm{x}_0, \bm{\epsilon}}\left[ \left\| \bm{\epsilon} - \bm{\epsilon}_{\theta} \big( \bm{x}_t(\bm{x}_0, \bm{\epsilon}), t \big) \right\|^2\right]
$$

The training pseudocode is shown in the figure below.
[Figure 3: training pseudocode]
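As a rough sketch of what one training iteration evaluates, the snippet below computes a Monte Carlo estimate of $L_{\text{simple}}$ for a batch. It is not the authors' code: `eps_model` is a hypothetical placeholder that always predicts zero, the linear schedule is assumed, and the gradient step on $\theta$ is omitted because plain NumPy has no automatic differentiation.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    """Hypothetical placeholder for eps_theta; a real model is a trained network."""
    return np.zeros_like(x_t)

def simple_loss(x0_batch, rng):
    """One Monte Carlo estimate of L_simple over a batch of shape (B, D)."""
    batch = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=batch)        # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0_batch.shape)     # eps ~ N(0, I)
    a_bar = alpha_bars[t - 1][:, None]            # broadcast over features
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    sq_err = np.sum((eps - eps_model(x_t, t))**2, axis=1)
    return float(np.mean(sq_err))

rng = np.random.default_rng(0)
loss = simple_loss(rng.standard_normal((256, 2)), rng)
```

With the zero predictor the loss is simply the mean squared norm of the noise, so it sits near the data dimension; a trained network would be expected to push it below that baseline.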

3 Sampling

Sampling draws $\bm{x}_T$ from the prior and runs the reverse process step by step to generate $\bm{x}_0$; the pseudocode is shown in the figure below.
[Figure 4: sampling pseudocode]
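The reverse loop can be sketched as follows, using the mean from Eq. 8 and the $\sigma_t^2 = \beta_t$ choice. This is an illustrative sketch under assumptions: the schedule is assumed, `eps_model` is a hypothetical stand-in for a trained network, and no noise is added at the final step, which is how the sampling algorithm is usually stated.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t, t):
    """Hypothetical placeholder for a trained eps_theta network."""
    return np.zeros_like(x_t)

def sample(shape, rng):
    """Ancestral sampling: start at x_T ~ N(0, I) and step down to x_0."""
    x = rng.standard_normal(shape)                       # x_T
    for t in range(T, 0, -1):
        a_t, b_t, ab_t = alphas[t - 1], betas[t - 1], alpha_bars[t - 1]
        # Eq. 8: mean of p_theta(x_{t-1} | x_t) from the predicted noise
        mean = (x - b_t / np.sqrt(1.0 - ab_t) * eps_model(x, t)) / np.sqrt(a_t)
        # fresh noise for every step except the last one (t = 1)
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(b_t) * z                      # sigma_t^2 = beta_t
    return x

x0 = sample((4, 2), np.random.default_rng(0))
```

Because the placeholder predicts zero noise, the output here is not a meaningful sample; the sketch only shows the control flow and the per-step update.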
