Jonathan Ho, Ajay Jain, Pieter Abbeel
NeurIPS 2020
The diffusion model is a latent variable model. $\bm{x}_1, ..., \bm{x}_T$ are latent variables with the same dimensionality as the original data $\bm{x}_0 \sim q(\bm{x}_0)$, and the latent variables form a Markov chain. Given $p(\bm{x}_T) = \mathcal{N}(\bm{x}_T; \bm{0}, \bm{I})$, the joint distribution $p_{\theta}(\bm{x}_{0:T})$ is called the reverse process.

$$
p_{\theta}(\bm{x}_{0:T}) = p(\bm{x}_T) \prod_{t=1}^T p_{\theta}(\bm{x}_{t-1} | \bm{x}_t), \quad p_{\theta}(\bm{x}_{t-1} | \bm{x}_t) = \mathcal{N}(\bm{x}_{t-1}; \bm{\mu}_{\theta}(\bm{x}_t, t), \bm{\Sigma}_{\theta}(\bm{x}_t, t)) \tag{1}
$$

where $\bm{\mu}_{\theta}(\bm{x}_t, t)$ and $\bm{\Sigma}_{\theta}(\bm{x}_t, t)$ are given by neural networks.
The approximate posterior $q(\bm{x}_{1:T} | \bm{x}_0)$ is called the diffusion process.

$$
q(\bm{x}_{1:T} | \bm{x}_0) = \prod_{t=1}^T q(\bm{x}_t | \bm{x}_{t-1}), \quad q(\bm{x}_t | \bm{x}_{t-1}) = \mathcal{N}(\bm{x}_t; \sqrt{1 - \beta_t} \bm{x}_{t-1}, \beta_t \bm{I}) \tag{2}
$$

The diffusion process can be viewed as gradually adding Gaussian noise to the data, with the amount of noise at each step set by the hyperparameter $\beta_t$.
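The single step in Eq. (2) can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the linear schedule from $10^{-4}$ to $0.02$ with $T = 1000$, the toy data, and the helper name `forward_step` are choices made here, not part of the text above.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I), as in Eq. (2)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
x = np.full(10_000, 3.0)                # toy "data": every coordinate equals 3.0
for beta_t in betas:
    x = forward_step(x, beta_t, rng)

# After all T steps the samples should look roughly like standard Gaussian noise.
print(x.mean(), x.std())
```

With this schedule the original signal is almost entirely washed out after the last step, which is why $p(\bm{x}_T)$ can be taken to be a standard normal.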
Next we introduce a useful property of the diffusion process. Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. From Eq. (2),

$$
\begin{aligned}
\bm{x}_t &= \sqrt{\alpha_t} \bm{x}_{t-1} + \sqrt{\beta_t} \bm{\epsilon}_t \\
&= \sqrt{\alpha_t \alpha_{t-1}} \bm{x}_{t-2} + \sqrt{\alpha_t \beta_{t-1}} \bm{\epsilon}_{t-1} + \sqrt{\beta_t} \bm{\epsilon}_t \\
&= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}} \bm{x}_{t-3} + \sqrt{\alpha_t \alpha_{t-1} \beta_{t-2}} \bm{\epsilon}_{t-2} + \sqrt{\alpha_t \beta_{t-1}} \bm{\epsilon}_{t-1} + \sqrt{\beta_t} \bm{\epsilon}_t \\
&= ... \\
&= \sqrt{\bar{\alpha}_t} \bm{x}_0 + \sqrt{\alpha_t \alpha_{t-1} ... \alpha_2 \beta_1} \bm{\epsilon}_1 + ... + \sqrt{\alpha_t \beta_{t-1}} \bm{\epsilon}_{t-1} + \sqrt{\beta_t} \bm{\epsilon}_t
\end{aligned} \tag{3}
$$

where all $\bm{\epsilon}_t \sim \mathcal{N}(\bm{0}, \bm{I})$ are independent random noise. Because a sum of independent Gaussians is again Gaussian, with the variances adding up, we obtain

$$
\begin{aligned}
q(\bm{x}_t | \bm{x}_0) &= \mathcal{N}(\bm{x}_t; \sqrt{\bar{\alpha}_t} \bm{x}_0, (\alpha_t \alpha_{t-1} ... \alpha_2 \beta_1 + ... + \alpha_t \beta_{t-1} + \beta_t)\bm{I}) \\
&= \mathcal{N}(\bm{x}_t; \sqrt{\bar{\alpha}_t} \bm{x}_0, (1 - \bar{\alpha}_t)\bm{I})
\end{aligned} \tag{4}
$$

The variance simplifies because each $\beta_s = 1 - \alpha_s$, so the sum telescopes to $1 - \bar{\alpha}_t$. With this property we can sample $\bm{x}_t$ at any step $t$ directly from $\bm{x}_0$, without simulating the intermediate steps.
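The simplification of the variance in Eq. (4) is easy to check numerically. The sketch below assumes a linear schedule with $T = 1000$; the helper names `noise_variance_sum` and `q_sample` are made up for illustration.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)      # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)            # alpha_bar_t = prod_{s <= t} alpha_s

def noise_variance_sum(t):
    """Sum of the noise variances in Eq. (3) for step t (1-indexed)."""
    total = 0.0
    for s in range(1, t + 1):
        # beta_s is scaled by alpha_{s+1} * ... * alpha_t (empty product = 1).
        total += betas[s - 1] * np.prod(alphas[s:t])
    return total

def q_sample(x0, t, eps):
    """Sample x_t directly from x_0 using the closed form in Eq. (4)."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

t = 50
# The two printed numbers should agree up to floating-point error.
print(noise_variance_sum(t), 1.0 - alpha_bars[t - 1])
```

`q_sample` is the one-shot replacement for running the step-by-step chain $t$ times, which is what makes training on randomly chosen timesteps cheap.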
Our goal is to train the neural networks so that the distribution $p_{\theta}(\bm{x}_0)$ produced by the reverse process is as close as possible to the true distribution $q(\bm{x}_0)$, i.e., to minimize the KL divergence.

$$
\begin{aligned}
D_{KL}(q(\bm{x}_0) \| p_{\theta}(\bm{x}_0)) &= \underbrace{\int q(\bm{x}_0)\log q(\bm{x}_0) \, d\bm{x}_0}_{const} - \int q(\bm{x}_0)\log p_{\theta}(\bm{x}_0) \, d\bm{x}_0 \\
&= const - \int q(\bm{x}_0)\log \left[ \int p_{\theta}(\bm{x}_{0:T}) \, d\bm{x}_{1:T} \right] d\bm{x}_0 \\
&= const - \int q(\bm{x}_0)\log \left[ \int q(\bm{x}_{1:T} | \bm{x}_0) \frac{p_{\theta}(\bm{x}_{0:T})}{q(\bm{x}_{1:T} | \bm{x}_0)} \, d\bm{x}_{1:T} \right] d\bm{x}_0 \\
&\le const - \int q(\bm{x}_{0:T}) \log \frac{p_{\theta}(\bm{x}_{0:T})}{q(\bm{x}_{1:T} | \bm{x}_0)} \, d\bm{x}_{0:T} \\
&= const + \mathbb{E}_q\left[- \log \frac{p_{\theta}(\bm{x}_{0:T})}{q(\bm{x}_{1:T} | \bm{x}_0)}\right] \\
&= const + \underbrace{\mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t \ge 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_t | \bm{x}_{t-1})}\right]}_{L}
\end{aligned} \tag{5}
$$

The inequality follows from Jensen's inequality applied to the logarithm. Since $q(\bm{x}_0)$ is the true data distribution, the first term is a constant, and the training objective becomes minimizing the upper bound $L$.
$$
\begin{aligned}
L &= \mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t > 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_t | \bm{x}_{t-1})} - \log \frac{p_{\theta}(\bm{x}_0 | \bm{x}_1)}{q(\bm{x}_1 | \bm{x}_0)}\right] \\
&= \mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t > 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_t | \bm{x}_{t-1}, \bm{x}_0)} - \log \frac{p_{\theta}(\bm{x}_0 | \bm{x}_1)}{q(\bm{x}_1 | \bm{x}_0)}\right] \\
&= \mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t > 1} \log \left( \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0)} \cdot \frac{q(\bm{x}_{t-1} | \bm{x}_0)}{q(\bm{x}_t | \bm{x}_0)} \right) - \log \frac{p_{\theta}(\bm{x}_0 | \bm{x}_1)}{q(\bm{x}_1 | \bm{x}_0)}\right] \\
&= \mathbb{E}_q\left[- \log p(\bm{x}_T) - \sum_{t > 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0)} - \sum_{t > 1} \log \frac{q(\bm{x}_{t-1} | \bm{x}_0)}{q(\bm{x}_t | \bm{x}_0)} - \log \frac{p_{\theta}(\bm{x}_0 | \bm{x}_1)}{q(\bm{x}_1 | \bm{x}_0)}\right] \\
&= \mathbb{E}_q\left[- \log \frac{p(\bm{x}_T)}{q(\bm{x}_T | \bm{x}_0)} - \sum_{t > 1} \log \frac{p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)}{q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0)} - \log p_{\theta}(\bm{x}_0 | \bm{x}_1)\right] \\
&= \mathbb{E}_q\left[\underbrace{D_{KL}(q(\bm{x}_T | \bm{x}_0) \| p(\bm{x}_T))}_{L_T} + \sum_{t > 1} \underbrace{D_{KL}(q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0) \| p_{\theta}(\bm{x}_{t-1} | \bm{x}_t))}_{L_{t-1}} \underbrace{- \log p_{\theta}(\bm{x}_0 | \bm{x}_1)}_{L_0}\right]
\end{aligned} \tag{6}
$$

The second line uses the Markov property, $q(\bm{x}_t | \bm{x}_{t-1}) = q(\bm{x}_t | \bm{x}_{t-1}, \bm{x}_0)$; the third applies Bayes' rule; and the fifth follows because the sum over $t > 1$ telescopes to $\log q(\bm{x}_1 | \bm{x}_0) - \log q(\bm{x}_T | \bm{x}_0)$.
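$L_T$ contains no trainable parameters, because $\beta_t$ is fixed and $p(\bm{x}_T)$ is a standard normal, so it is a constant during training and can be ignored. Moreover, $q(\bm{x}_T | \bm{x}_0)$ is very close to $\mathcal{N}(\bm{0}, \bm{I})$ when $\bar{\alpha}_T \approx 0$, so $L_T \approx 0$.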
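As a rough numerical check on the size of $L_T$, the sketch below evaluates the per-coordinate KL divergence between $q(\bm{x}_T | \bm{x}_0)$ from Eq. (4) and the standard normal prior. The linear schedule with $T = 1000$ and the value $x_0 = 1$ are assumptions chosen for illustration, and `prior_kl_per_dim` is a hypothetical helper.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear schedule, T = 1000
alpha_bar_T = np.prod(1.0 - betas)

def prior_kl_per_dim(x0, a_bar):
    """KL( N(sqrt(a_bar) * x0, 1 - a_bar) || N(0, 1) ) for one coordinate."""
    mean_sq = a_bar * x0 ** 2
    var = 1.0 - a_bar
    return 0.5 * (mean_sq + var - 1.0 - np.log(var))

kl = prior_kl_per_dim(x0=1.0, a_bar=alpha_bar_T)
# Both numbers should be tiny: the prior term is small but not exactly zero.
print(alpha_bar_T, kl)
```

Under these assumptions the term is small and, more importantly, does not depend on $\theta$, so it plays no role in training.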
In $L_{t-1}$, the posterior is Gaussian, $q(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0) = \mathcal{N}(\bm{x}_{t-1}; \widetilde{\bm{\mu}}_t(\bm{x}_t, \bm{x}_0), \widetilde{\beta}_t \bm{I})$; the proof is shown in the figure below. Figure source: 54、Diffusion Model扩散模型理论与完整PyTorch代码详细解读.

The authors replace $\bm{\Sigma}_{\theta}(\bm{x}_t, t)$ in $p_{\theta}(\bm{x}_{t-1} | \bm{x}_t)$ with the fixed covariance $\sigma_t^2 \bm{I}$, where $\sigma_t^2$ is set to either $\beta_t$ or $\widetilde{\beta}_t$; the two choices give similar results.
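To get a feel for how the two variance choices differ, the sketch below compares $\beta_t$ with the posterior variance. The closed form $\widetilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$ is taken from the original DDPM paper rather than from the text above, and the linear schedule and the convention $\bar{\alpha}_0 = 1$ are assumptions.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)              # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
alpha_bars_prev = np.append(1.0, alpha_bars[:-1])  # alpha_bar_0 = 1 by convention

# Posterior variance: (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
beta_tilde = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas

for t in (1, 2, 10, 100, 1000):
    print(t, betas[t - 1], beta_tilde[t - 1])
```

The two options differ mostly at small $t$ (with $\widetilde{\beta}_1 = 0$) and become nearly identical for large $t$, which is consistent with the similar results reported for both.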
Since $D_{KL}(\mathcal{N}(\mu_1, \sigma_1^2) \| \mathcal{N}(\mu_2, \sigma_2^2)) = \frac{(\mu_1 - \mu_2)^2 + \sigma_1^2}{2 \sigma_2^2} + \log \frac{\sigma_2}{\sigma_1} - \frac{1}{2}$, and the variances of both distributions are fixed, only the mean term depends on $\theta$:

$$
\begin{aligned}
L_{t-1} &= \mathbb{E}_q\left[\frac{1}{2\sigma_t^2} \left\| \widetilde{\bm{\mu}}_t(\bm{x}_t, \bm{x}_0) - \bm{\mu}_{\theta}(\bm{x}_t, t) \right\|^2\right] + C \\
&= \mathbb{E}_q\left[\frac{1}{2\sigma_t^2} \left\| \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \bm{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \bm{x}_0 - \bm{\mu}_{\theta}(\bm{x}_t, t) \right\|^2\right] + C \\
&= \mathbb{E}_{\bm{x}_0, \bm{\epsilon}}\left[\frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}} \Big( \bm{x}_t(\bm{x}_0, \bm{\epsilon}) - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\bm{\epsilon} \Big) - \bm{\mu}_{\theta} \big( \bm{x}_t(\bm{x}_0, \bm{\epsilon}), t \big) \right\|^2\right] + C
\end{aligned} \tag{7}
$$

where $\bm{x}_t(\bm{x}_0, \bm{\epsilon}) = \sqrt{\bar{\alpha}_t}\bm{x}_0 + \sqrt{1 - \bar{\alpha}_t} \bm{\epsilon}$, $\bm{\epsilon}$ is the Gaussian noise introduced when sampling $\bm{x}_t$ from $\bm{x}_0$, and $C$ is a constant.
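The step from the second to the third line of Eq. (7) substitutes $\bm{x}_0 = (\bm{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\bm{\epsilon}) / \sqrt{\bar{\alpha}_t}$ into the posterior mean. A small numerical sanity check of that algebra is sketched below; the linear schedule, the step $t = 500$, and the random test vectors are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)       # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500                                     # any step with t > 1
beta_t, alpha_t = betas[t - 1], alphas[t - 1]
a_bar_t, a_bar_prev = alpha_bars[t - 1], alpha_bars[t - 2]

x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
xt = np.sqrt(a_bar_t) * x0 + np.sqrt(1.0 - a_bar_t) * eps

# Second line of Eq. (7): posterior mean written with x_t and x_0.
mean_from_x0 = (np.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t) * xt
                + np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t) * x0)

# Third line of Eq. (7): the same mean written with x_t and eps.
mean_from_eps = (xt - beta_t / np.sqrt(1.0 - a_bar_t) * eps) / np.sqrt(alpha_t)

# The difference should be at the level of floating-point rounding.
print(np.max(np.abs(mean_from_x0 - mean_from_eps)))
```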
Based on Eq. (7), the network $\bm{\mu}_{\theta}$ can be reparameterized so that it predicts the noise instead of the mean:

$$
\bm{\mu}_{\theta}(\bm{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \Big( \bm{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\bm{\epsilon}_{\theta}(\bm{x}_t, t) \Big) \tag{8}
$$

The loss $L_{t-1}$ then becomes

$$
\mathbb{E}_{\bm{x}_0, \bm{\epsilon}}\left[\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \left\| \bm{\epsilon} - \bm{\epsilon}_{\theta} \big( \bm{x}_t(\bm{x}_0, \bm{\epsilon}), t \big) \right\|^2\right] + C \tag{9}
$$
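For the last term, the authors of the original paper make some modifications based on the properties of image data; this note omits them.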
From Eq. (1),

$$
p_{\theta}(\bm{x}_0 | \bm{x}_1) = \mathcal{N}(\bm{x}_0; \bm{\mu}_{\theta}(\bm{x}_1, 1), \sigma_1^2 \bm{I}) \tag{10}
$$

Therefore,

$$
L_0 = \frac{1}{2 \sigma_1^2} \mathbb{E}_{\bm{x}_0, \bar{\bm{\epsilon}}_1} \left[\left\| \bm{x}_0 - \bm{\mu}_{\theta} \big(\bm{x}_1(\bm{x}_0, \bar{\bm{\epsilon}}_1), 1 \big) \right\|^2\right] + C' \tag{11}
$$

where $\bar{\bm{\epsilon}}_1$ is the noise used to sample $\bm{x}_1$ from $\bm{x}_0$ and $C'$ is a constant.
The authors find that simplifying the loss $L$ to $L_{simple}$, which drops the per-step weighting in Eq. (9), improves the model's results.

$$
L_{simple}(\theta) = \mathbb{E}_{t, \bm{x}_0, \bm{\epsilon}}\left[ \left\| \bm{\epsilon} - \bm{\epsilon}_{\theta} \big( \bm{x}_t(\bm{x}_0, \bm{\epsilon}), t \big) \right\|^2\right]
$$
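A Monte Carlo estimate of $L_{simple}$ on one batch might look like the sketch below. Everything outside the formula itself is an assumption: the linear schedule, drawing $t$ uniformly from $\{1, ..., T\}$, the toy batch, and in particular `eps_model`, a hypothetical placeholder that always predicts zero where a real setup would use a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(xt, t):
    """Hypothetical stand-in for the network eps_theta; always predicts zero."""
    return np.zeros_like(xt)

def simple_loss(x0_batch, rng):
    """One Monte Carlo estimate of L_simple on a batch of shape (B, D)."""
    batch = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=batch)            # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0_batch.shape)
    a_bar = alpha_bars[t - 1][:, None]                # broadcast over features
    xt = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    sq_err = np.sum((eps - eps_model(xt, t)) ** 2, axis=1)
    return sq_err.mean()

x0_batch = rng.standard_normal((256, 8))    # toy data, 8 features per example
loss = simple_loss(x0_batch, rng)
print(loss)
```

With the zero predictor the loss is simply the average squared norm of the noise, so it sits near the feature dimension; a trained network would be expected to push it lower.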
Sampling consists of drawing $\bm{x}_T$ from the prior and running the reverse process to generate $\bm{x}_0$; the pseudocode is shown in the figure below.
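A NumPy sketch of this sampling loop, combining Eq. (1) with the mean from Eq. (8), is given below. It makes several assumptions: a linear schedule, the choice $\sigma_t^2 = \beta_t$, no noise added on the final step (as in the sampling algorithm of the original paper), and `eps_model` as a hypothetical placeholder for a trained network.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(xt, t):
    """Hypothetical stand-in for a trained eps_theta; always predicts zero."""
    return np.zeros_like(xt)

def sample(shape, rng):
    """Ancestral sampling: start at x_T ~ N(0, I) and apply Eq. (1) T times."""
    x = rng.standard_normal(shape)
    for t in range(T, 0, -1):
        beta_t, alpha_t, a_bar_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        # Mean of p_theta(x_{t-1} | x_t) from Eq. (8).
        mean = (x - beta_t / np.sqrt(1.0 - a_bar_t) * eps_model(x, t)) / np.sqrt(alpha_t)
        # sigma_t^2 = beta_t; no noise is added on the final step (t = 1).
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(beta_t) * z
    return x

x0_sample = sample((16, 8), np.random.default_rng(0))
print(x0_sample.shape)
```

Because the placeholder network carries no information about any data, the output here is meaningless as a sample; the sketch only shows the control flow and where each formula enters.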