Starting from the initial data distribution $x_0 \sim q(x)$, Gaussian noise is added step by step until the distribution of $x_T$ becomes an isotropic Gaussian.
Definition of the forward diffusion process:
$q(x_t|x_{t-1})=N(x_t;\sqrt{1-\beta_t}x_{t-1},\beta_tI)$
$q(x_{1:T}|x_0)=\prod_{t=1}^Tq(x_t|x_{t-1})$ (a Markov chain)
Using the reparameterization trick, $q(x_t)$ at any timestep can be derived in closed form, with no iteration required:
$x_t=\sqrt{\alpha_t}x_{t-1}+\sqrt{1-\alpha_t}z_{t-1}=...=\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}z$
where $\overline{\alpha}_t=\prod_{i=1}^t\alpha_i$. The reparameterization shows up in $\sqrt{\alpha_t}(\sqrt{\alpha_{t-1}}x_{t-2}+\sqrt{1-\alpha_{t-1}}z_{t-2})+\sqrt{1-\alpha_t}z_{t-1}$: the sum of the two Gaussian terms $\sqrt{\alpha_t-\alpha_t\alpha_{t-1}}z_{t-2}+\sqrt{1-\alpha_t}z_{t-1}$ can be reparameterized as a single Gaussian $\sqrt{1-\alpha_t\alpha_{t-1}}\overline{z}_{t-2}$.
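A quick numerical check of the bookkeeping above (a sketch, not from the original text): accumulating the per-step signal scale $\sqrt{\alpha_t}$ across all timesteps must equal the closed-form $\sqrt{\overline{\alpha}_T}$, and with the standard linear schedule almost no signal remains at $t=T$.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # linear beta schedule (as used by DDPM)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)      # alpha_bar_t = prod_{i<=t} alpha_i

# Accumulate sqrt(alpha_t) step by step, as the iterative forward process would
scale = 1.0
for a in alphas:
    scale *= np.sqrt(a)

# The iterated scale matches the closed-form sqrt(alpha_bar_T)
assert np.allclose(scale, np.sqrt(alpha_bar[-1]))
# alpha_bar_T is tiny, so x_T is essentially pure noise
assert alpha_bar[-1] < 1e-3
```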
The variance $\beta_t$ of the noise added at each timestep follows a fixed schedule and increases with $t$.
The scaling of the mean, $\sqrt{1-\beta_t}$, is likewise tied to $\beta_t$; this choice makes $x_T$ converge stably to $N(0,I)$.
From $x_t=\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}z$ we obtain
$q(x_t|x_0)=N(x_t;\sqrt{\overline{\alpha}_t}x_0,(1-\overline{\alpha}_t)I)$
As noise keeps being added, $x_t$ gradually approaches pure Gaussian noise.
$x_0=\frac{1}{\sqrt{\overline{\alpha}_t}}(x_t-\sqrt{1-\overline{\alpha}_t}z_t)$
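The inversion above is exact when the noise $z$ is known; a minimal numeric check (all values here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_bar_t = 0.3                       # any value in (0, 1)
x0 = rng.standard_normal(4)
z = rng.standard_normal(4)

# Forward: x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) z
xt = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * z
# Inverse: recover x_0 from x_t and z
x0_rec = (xt - np.sqrt(1 - alpha_bar_t) * z) / np.sqrt(alpha_bar_t)

assert np.allclose(x0, x0_rec)
```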
The posterior of the diffusion process, $q(x_{t-1}|x_t,x_0)$, has a closed form: given $x_t$ and $x_0$, the distribution of $x_{t-1}$ can be computed.
When $\beta_t$ is small enough, $q(x_{t-1}|x_t,x_0)=N(x_{t-1};\tilde{\mu}(x_t,x_0),\tilde{\beta}_tI)$
Using the Gaussian density $f(x)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ and Bayes' rule:
$$\begin{aligned} q(x_{t-1}|x_t,x_0) &=q(x_t|x_{t-1},x_0)\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} \\ &\propto \exp\left(-\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\overline{\alpha}_{t-1}}\right)x_{t-1}^2-\left(\frac{2\sqrt{\alpha_t}}{\beta_t}x_t+\frac{2\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_{t-1}}x_0\right)x_{t-1}+C\right)\right) \end{aligned}$$
Reading off the mean and variance of this quadratic form gives
$\tilde{\mu}(x_t,x_0)=\frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}x_t+\frac{\sqrt{\overline{\alpha}_{t-1}}\beta_t}{1-\overline{\alpha}_t}x_0$
$\tilde{\beta}_t=\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_{t}}\beta_t$ (the DDPM authors use $\tilde{\beta}_t=\beta_t$, arguing the two give similar results)
Substituting the formula for $x_0$ (with $z_t \sim N(0,I)$, via reparameterization) gives
$\tilde{\mu}_t=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}z_t)$
That is, conditioned on $x_0$, the posterior distribution can be computed from $x_t$ and $z_t$.
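The two forms of the posterior mean, $\tilde{\mu}(x_t,x_0)$ and $\tilde{\mu}_t(x_t,z_t)$, should agree whenever $x_t$ is built from $x_0$ and $z$ by the closed-form forward step. A small check (the schedule and sample values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

t = 500
x0 = rng.standard_normal(5)
z = rng.standard_normal(5)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * z

# mu_tilde written in terms of (x_t, x_0)
mu1 = (np.sqrt(alphas[t]) * (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t]) * xt
       + np.sqrt(alpha_bar[t - 1]) * betas[t] / (1 - alpha_bar[t]) * x0)
# mu_tilde written in terms of (x_t, z)
mu2 = (xt - betas[t] / np.sqrt(1 - alpha_bar[t]) * z) / np.sqrt(alphas[t])

assert np.allclose(mu1, mu2)
```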
The reverse process gradually recovers the original data $x_0$ from Gaussian noise $x_T$; it is also a Markov chain.
$p_{\theta}(x_{t-1}|x_t)=N(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))$
$p_{\theta}(x_{0:T})=p(x_T)\prod_{t=1}^Tp_{\theta}(x_{t-1}|x_t)$
Applying the variational lower bound (VLB) to the negative log-likelihood $L=E_{q(x_0)}[-\log p_\theta(x_0)]$ and simplifying yields the final loss:
$L_t^{simple}=E_{t,x_0,\epsilon}[||\epsilon-\epsilon_\theta(\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}\epsilon,t)||^2]$
In the derivation, the loss reduces to the KL divergence between the two Gaussians $q(x_{t-1}|x_t,x_0)=N(x_{t-1};\tilde{\mu}(x_t,x_0),\tilde{\beta}_tI)$ and $p_{\theta}(x_{t-1}|x_t)=N(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))$; substituting the formulas for $\mu$ and $x_t$ turns the loss into a function of $\epsilon$, $x_0$, and $t$.
The DDPM authors choose to predict the random variable (the noise), rather than directly predicting the posterior mean or the original data.
The DDPM authors replace the variance $\Sigma_{\theta}(x_t,t)$ with the fixed $\beta_t$ or $\tilde{\beta}_t$, so the trainable parameters appear only in the mean; this makes training more stable.
Take original data $x_0 \sim q(x_0)$
Sample a timestep $t \sim \mathrm{Uniform}(\{1,...,T\})$
Sample a noise $\epsilon \sim N(0,I)$ from the standard Gaussian
Optimize the objective $||\epsilon-\epsilon_\theta(\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}\epsilon,t)||^2$ by gradient descent
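The four training steps above can be sketched end to end on toy 2-D data. The tiny MLP `EpsModel` and the way it is conditioned on $t$ are illustrative assumptions, not the network from the DDPM paper:

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

class EpsModel(nn.Module):
    """Toy eps_theta: an MLP conditioned on the normalized timestep."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x, t):
        # condition on t by appending t/T as an extra input feature
        return self.net(torch.cat([x, t.float().view(-1, 1) / T], dim=1))

model = EpsModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    x0 = torch.randn(128, 2) * 0.5 + 1.0       # step 1: toy data x_0 ~ q(x_0)
    t = torch.randint(0, T, (128,))            # step 2: t ~ Uniform
    eps = torch.randn_like(x0)                 # step 3: eps ~ N(0, I)
    # x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps
    xt = (alphas_cumprod[t].sqrt()[:, None] * x0
          + (1 - alphas_cumprod[t]).sqrt()[:, None] * eps)
    loss = (eps - model(xt, t)).square().mean()  # step 4: gradient descent
    opt.zero_grad()
    loss.backward()
    opt.step()
```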
At each timestep, compute $p_{\theta}(x_{t-1}|x_t)$ from $x_t$ and $t$
Mean $\mu_\theta(x_t,t)=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}z_\theta(x_t,t))$, variance $\Sigma_{\theta}(x_t,t)=\tilde{\beta}_t=\beta_t$
$p_{\theta}(x_{t-1}|x_t)=N(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))$
Sample $x_{t-1}$ from $p_{\theta}(x_{t-1}|x_t)$ via reparameterization
Iterate until $x_0$ is obtained
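The sampling loop above, sketched with a dummy `eps_theta` that predicts zero noise. The dummy network is purely an assumption to show the loop mechanics; a trained noise-prediction model would replace it:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

def eps_theta(x, t):
    # placeholder for the trained noise-prediction network
    return torch.zeros_like(x)

x = torch.randn(16, 2)  # start from x_T ~ N(0, I)
for t in reversed(range(T)):
    eps = eps_theta(x, t)
    # mu_theta = 1/sqrt(alpha_t) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps)
    mean = (x - betas[t] / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
    # add fresh noise for t > 0; the final step is deterministic
    z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    x = mean + betas[t].sqrt() * z
```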
Define the constants needed by the formulas: the number of timesteps, $\beta_t$, $\sqrt{\overline{\alpha}_t}$, and so on.
$x_t=\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}z$
$\mu_\theta=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}z_\theta(x_t,t))$
$\Sigma_{\theta}(x_t,t)=\tilde{\beta}_t=\beta_t$
In the DDPM paper, the number of timesteps $T$ is set to 1000, and $\beta_t$ is linearly interpolated between 0.0001 and 0.02.
```python
import torch

num_timesteps = 1000
schedule_low = 1e-4
schedule_high = 0.02

# beta_t: linear schedule from 1e-4 to 0.02
betas = torch.linspace(schedule_low, schedule_high, num_timesteps)
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)                   # alpha_bar_t
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)                # sqrt(alpha_bar_t)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1 - alphas_cumprod)  # sqrt(1 - alpha_bar_t)
reciprocal_sqrt_alphas = torch.sqrt(1 / alphas)                 # 1 / sqrt(alpha_t)
betas_over_sqrt_one_minus_alphas_cumprod = betas / sqrt_one_minus_alphas_cumprod
sqrt_betas = torch.sqrt(betas)                                  # std of the reverse-step noise
```
Forward diffusion process
$x_t=\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}z$
$||\epsilon-\epsilon_\theta(\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}\epsilon,t)||^2$
```python
def forward_diffusion_process(model, x0, num_timesteps, sqrt_alphas_cumprod,
                              sqrt_one_minus_alphas_cumprod):
    batch_size = x0.shape[0]
    # sample a timestep for each example in the batch
    t = torch.randint(0, num_timesteps, size=(batch_size,))
    noise = torch.randn_like(x0)
    # reshape the per-sample coefficients so they broadcast over x0's feature dims
    shape = (batch_size,) + (1,) * (x0.dim() - 1)
    xt = (sqrt_alphas_cumprod[t].view(shape) * x0
          + sqrt_one_minus_alphas_cumprod[t].view(shape) * noise)
    estimated_noise = model(xt, t)
    # simple loss: MSE between the true and the predicted noise
    loss = (noise - estimated_noise).square().mean()
    return loss
```
Reverse diffusion process
$p_{\theta}(x_{t-1}|x_t)=N(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))$
$\mu_\theta(x_t,t)=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}z_\theta(x_t,t))$
$\Sigma_{\theta}(x_t,t)=\tilde{\beta}_t=\beta_t$
```python
def reverse_diffusion_process(model, shape, num_timesteps, reciprocal_sqrt_alphas,
                              betas_over_sqrt_one_minus_alphas_cumprod, sqrt_betas):
    # start from pure Gaussian noise x_T
    current_x = torch.randn(shape)
    x_seq = [current_x]
    for t in reversed(range(num_timesteps)):
        current_x = sample(model, current_x, t, shape[0], reciprocal_sqrt_alphas,
                           betas_over_sqrt_one_minus_alphas_cumprod, sqrt_betas)
        x_seq.append(current_x)
    return x_seq


def sample(model, xt, t, batch_size, reciprocal_sqrt_alphas,
           betas_over_sqrt_one_minus_alphas_cumprod, sqrt_betas):
    ts = torch.full([batch_size, 1], t)
    estimated_noise = model(xt, ts)
    # mu_theta(x_t, t) = 1/sqrt(alpha_t) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta)
    mean = reciprocal_sqrt_alphas[ts] * (xt - betas_over_sqrt_one_minus_alphas_cumprod[ts] * estimated_noise)
    if t > 0:
        z = torch.randn_like(xt)
    else:
        # the final step (t = 0) adds no noise
        z = torch.zeros_like(xt)
    return mean + sqrt_betas[t] * z
```