Diffusion Model Notes


Forward Diffusion Process

Starting from the initial data distribution $x_0 \sim q(x)$, Gaussian noise is added step by step until the distribution of $x_T$ becomes an isotropic Gaussian.

  • Definition of the forward diffusion process:

    $q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{1-\beta_t}\,x_{t-1},\beta_t I)$

    $q(x_{1:T}|x_0)=\prod_{t=1}^T q(x_t|x_{t-1})$ (a Markov chain)

  • With the reparameterization trick, $q(x_t)$ at any timestep can be written in closed form, with no iteration required:

    $x_t=\sqrt{\alpha_t}\,x_{t-1}+\sqrt{1-\alpha_t}\,z_{t-1}=\dots=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,z$

    where $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t=\prod_{i=1}^t\alpha_i$. The reparameterization appears when expanding $\sqrt{\alpha_t}(\sqrt{\alpha_{t-1}}\,x_{t-2}+\sqrt{1-\alpha_{t-1}}\,z_{t-2})+\sqrt{1-\alpha_t}\,z_{t-1}$: the sum of the two independent Gaussian terms $\sqrt{\alpha_t-\alpha_t\alpha_{t-1}}\,z_{t-2}+\sqrt{1-\alpha_t}\,z_{t-1}$ can be merged into the single term $\sqrt{1-\alpha_t\alpha_{t-1}}\,\bar{z}_{t-2}$.

  • The variance $\beta_t$ of the noise added at each timestep is given by a fixed schedule and increases with $t$.

  • The mean of each transition is scaled by $\sqrt{1-\beta_t}$, which depends on $\beta_t$; this scaling is what makes $x_T$ converge stably to $\mathcal{N}(0,I)$.

  • From $x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,z$ it follows that

    • $q(x_t|x_0)=\mathcal{N}(x_t;\sqrt{\bar{\alpha}_t}\,x_0,(1-\bar{\alpha}_t)I)$

    • as noise keeps being added, $x_t$ gradually approaches pure Gaussian noise

    • $x_0=\frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t-\sqrt{1-\bar{\alpha}_t}\,z_t\right)$

  • The posterior of the forward process, $q(x_{t-1}|x_t,x_0)$, has a closed form: given $x_t$ and $x_0$, the distribution of $x_{t-1}$ can be computed.

    Assuming $\beta_t$ is small enough, $q(x_{t-1}|x_t,x_0)=\mathcal{N}(x_{t-1};\tilde{\mu}(x_t,x_0),\tilde{\beta}_t I)$.

    From the Gaussian density $f(x)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ and Bayes' rule,

    $\begin{aligned} q(x_{t-1}|x_t,x_0) &= q(x_t|x_{t-1},x_0)\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} \\ &\propto \exp\left(-\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right)x_{t-1}^2-\left(\frac{2\sqrt{\alpha_t}}{\beta_t}x_t+\frac{2\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}x_0\right)x_{t-1}+C\right)\right) \end{aligned}$

    Reading the mean and variance off this quadratic form gives

    $\tilde{\mu}(x_t,x_0)=\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t+\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}x_0$

    $\tilde{\beta}_t=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$ (the DDPM authors use $\tilde{\beta}_t=\beta_t$, finding the two choices give similar results)

    x 0 x_0 x0的公式代入得( z t ∼ N ( 0 , I ) z_t \sim N(0,I) ztN(0,I)使用了重参数化)

    μ ~ t = 1 α t ( x t − β t 1 − α ‾ t z t ) \mathbf{\tilde{\mu}_t=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}z_t)} μ~t=αt 1(xt1αt βtzt)

    即在 x 0 x_0 x0条件下,后验条件概率分布可通过 x t x_t xt z t z_t zt计算得到
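The identities above can be verified numerically. The sketch below is illustrative only: it assumes the linear $\beta_t$ schedule described later in this note, toy 2-D data, and an arbitrary interior timestep $t=500$:

```python
import torch

torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 1000)  # the linear schedule used later in this note
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

t = 500                                   # an arbitrary interior timestep
x0 = torch.randn(4, 2)                    # toy 2-D data standing in for real samples
z = torch.randn_like(x0)
a_bar_t, a_bar_tm1 = alphas_cumprod[t], alphas_cumprod[t - 1]

# closed-form jump to x_t, then recovery of x_0 by inverting the same formula
xt = a_bar_t.sqrt() * x0 + (1 - a_bar_t).sqrt() * z
x0_rec = (xt - (1 - a_bar_t).sqrt() * z) / a_bar_t.sqrt()
assert torch.allclose(x0_rec, x0, atol=1e-5)

# the posterior mean written in terms of (x_t, x_0) ...
mu_from_x0 = (alphas[t].sqrt() * (1 - a_bar_tm1) / (1 - a_bar_t)) * xt \
    + (a_bar_tm1.sqrt() * betas[t] / (1 - a_bar_t)) * x0
# ... equals the same mean after substituting x_0, in terms of (x_t, z)
mu_from_z = (xt - betas[t] / (1 - a_bar_t).sqrt() * z) / alphas[t].sqrt()
assert torch.allclose(mu_from_x0, mu_from_z, atol=1e-5)

# away from t = 0, beta_tilde is close to beta_t, as the DDPM authors note
beta_tilde = (1 - a_bar_tm1) / (1 - a_bar_t) * betas[t]
assert torch.isclose(beta_tilde, betas[t], rtol=1e-2)
```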

Reverse Diffusion Process

Starting from Gaussian noise $x_T$, the original data $x_0$ is recovered step by step. This is again a Markov chain.

  • $p_{\theta}(x_{t-1}|x_t)=\mathcal{N}(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))$

    $p_{\theta}(x_{0:T})=p(x_T)\prod_{t=1}^T p_{\theta}(x_{t-1}|x_t)$

Objective Function

Applying the variational lower bound (VLB) to the negative log-likelihood $L=\mathbb{E}_{q(x_0)}[-\log p_\theta(x_0)]$ and simplifying yields the final loss:

  • $L_t^{\mathrm{simple}}=\mathbb{E}_{t,x_0,\epsilon}\left[\|\epsilon-\epsilon_\theta(\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,t)\|^2\right]$

  • During the derivation, the loss reduces to the KL divergence between the two Gaussians $q(x_{t-1}|x_t,x_0)=\mathcal{N}(x_{t-1};\tilde{\mu}(x_t,x_0),\tilde{\beta}_t I)$ and $p_{\theta}(x_{t-1}|x_t)=\mathcal{N}(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))$; substituting the formulas for $\mu$ and $x_t$ rewrites the loss as a function of $\epsilon$, $x_0$, and $t$.

  • The DDPM authors predict the random variable (the noise) rather than directly predicting the posterior mean or the original data.

  • The DDPM authors replace the variance $\Sigma_{\theta}(x_t,t)$ with the fixed $\beta_t$ or $\tilde{\beta}_t$, so the learned parameters appear only in the mean; this makes training more stable.
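The KL reduction above can be checked numerically: when both Gaussians share the same fixed variance, the KL collapses to a scaled squared error between the means, which is why the loss ends up as an MSE. A small sketch (the dimension, means, and variance are arbitrary stand-ins):

```python
import torch

torch.manual_seed(0)
dim = 5
mu_q = torch.randn(dim)   # stand-in for the posterior mean mu_tilde
mu_p = torch.randn(dim)   # stand-in for the model's predicted mean mu_theta
var = 0.01                # fixed variance, as when Sigma_theta is set to beta_t

kl = torch.distributions.kl_divergence(
    torch.distributions.MultivariateNormal(mu_q, var * torch.eye(dim)),
    torch.distributions.MultivariateNormal(mu_p, var * torch.eye(dim)),
)

# With equal fixed covariances, KL = ||mu_q - mu_p||^2 / (2 * var):
# only the squared distance between the means survives
assert torch.allclose(kl, (mu_q - mu_p).square().sum() / (2 * var), rtol=1e-3)
```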

Training Procedure

  1. Sample original data $x_0 \sim q(x_0)$

  2. Sample $t \sim \mathrm{Uniform}(\{1,\dots,T\})$

  3. Sample a noise $\epsilon \sim \mathcal{N}(0,I)$ from the standard Gaussian

  4. Take a gradient-descent step on the objective $\|\epsilon-\epsilon_\theta(\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,t)\|^2$
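The four steps above can be sketched as a training loop. Everything below is illustrative: the tiny MLP noise predictor, the 2-D stand-in data, and the crude normalized-timestep embedding are hypothetical placeholders, not the DDPM architecture:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_timesteps = 1000
betas = torch.linspace(1e-4, 0.02, num_timesteps)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)
sqrt_ac = alphas_cumprod.sqrt()
sqrt_om_ac = (1 - alphas_cumprod).sqrt()

# Hypothetical noise predictor: a tiny MLP on 2-D toy data; the input
# concatenates x_t with a normalized timestep (a crude time embedding)
model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(10):                          # a few steps, for illustration only
    x0 = torch.randn(32, 2)                     # step 1: stand-in for real data
    t = torch.randint(0, num_timesteps, (32,))  # step 2: sample t uniformly
    noise = torch.randn_like(x0)                # step 3: sample epsilon
    xt = sqrt_ac[t, None] * x0 + sqrt_om_ac[t, None] * noise
    t_embed = (t.float() / num_timesteps).unsqueeze(-1)
    predicted = model(torch.cat([xt, t_embed], dim=-1))
    loss = (noise - predicted).square().mean()  # step 4: the simplified objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

assert torch.isfinite(loss).item()
```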

Inference Procedure

  1. At each timestep, compute $p_{\theta}(x_{t-1}|x_t)$ from $x_t$ and $t$:

    mean $\mu_\theta(x_t,t)=\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}z_\theta(x_t,t)\right)$, variance $\Sigma_{\theta}(x_t,t)=\tilde{\beta}_t=\beta_t$

    $p_{\theta}(x_{t-1}|x_t)=\mathcal{N}(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))$

  2. Sample $x_{t-1}$ from $p_{\theta}(x_{t-1}|x_t)$ via reparameterization

  3. Iterate until $x_0$ is obtained

Code Implementation

  • Define the constants needed in the formulas: the number of timesteps, $\beta_t$, $\sqrt{\bar{\alpha}_t}$, and so on.

    $x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,z$

    $\mu_\theta=\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}z_\theta(x_t,t)\right)$

    $\Sigma_{\theta}(x_t,t)=\tilde{\beta}_t=\beta_t$

    In the DDPM paper the number of timesteps $T$ is set to 1000, and $\beta_t$ is linearly interpolated between 0.0001 and 0.02.

    import torch

    num_timesteps = 1000
    schedule_low = 1e-4
    schedule_high = 0.02
    betas = torch.linspace(schedule_low, schedule_high, num_timesteps)

    alphas = 1 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
    sqrt_one_minus_alphas_cumprod = torch.sqrt(1 - alphas_cumprod)
    reciprocal_sqrt_alphas = torch.sqrt(1 / alphas)
    betas_over_sqrt_one_minus_alphas_cumprod = betas / sqrt_one_minus_alphas_cumprod
    sqrt_betas = torch.sqrt(betas)
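A quick check that this schedule actually drives $\bar{\alpha}_T$ to (nearly) zero, so that $x_T$ is essentially pure noise:

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

# alpha_bar_T is vanishingly small under the linear schedule, so
# x_T = sqrt(alpha_bar_T) x_0 + sqrt(1 - alpha_bar_T) z is almost pure Gaussian noise
assert alphas_cumprod[-1] < 1e-4
```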
    
  • Forward diffusion process

    $x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,z$

    $\|\epsilon-\epsilon_\theta(\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,t)\|^2$

    def forward_diffusion_process(model, x0, num_timesteps, sqrt_alphas_cumprod, sqrt_one_minus_alphas_cumprod):
        batch_size = x0.shape[0]
        t = torch.randint(0, num_timesteps, size=(batch_size,))
        noise = torch.randn_like(x0)
        # index the schedule per sample and add a trailing axis so it
        # broadcasts over the feature dimension of 2-D data (batch, dim);
        # for image tensors, reshape to (batch, 1, 1, 1) instead
        xt = sqrt_alphas_cumprod[t].unsqueeze(-1) * x0 \
            + sqrt_one_minus_alphas_cumprod[t].unsqueeze(-1) * noise
        estimated_noise = model(xt, t)
        loss = (noise - estimated_noise).square().mean()
        return loss
    
  • Reverse diffusion process

    $p_{\theta}(x_{t-1}|x_t)=\mathcal{N}(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))$

    $\mu_\theta(x_t,t)=\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}z_\theta(x_t,t)\right)$

    $\Sigma_{\theta}(x_t,t)=\tilde{\beta}_t=\beta_t$

    def reverse_diffusion_process(model, shape, num_timesteps, reciprocal_sqrt_alphas, betas_over_sqrt_one_minus_alphas_cumprod, sqrt_betas):
        current_x = torch.randn(shape)      # start from pure Gaussian noise x_T
        x_seq = [current_x]
        for t in reversed(range(num_timesteps)):
            current_x = sample(model, current_x, t, shape[0], reciprocal_sqrt_alphas, betas_over_sqrt_one_minus_alphas_cumprod, sqrt_betas)
            x_seq.append(current_x)
        return x_seq

    def sample(model, xt, t, batch_size, reciprocal_sqrt_alphas, betas_over_sqrt_one_minus_alphas_cumprod, sqrt_betas):
        ts = torch.full((batch_size,), t)   # same timestep for every sample in the batch
        estimated_noise = model(xt, ts)
        # posterior mean mu_theta(x_t, t); t is a Python int here, so the
        # schedule entries are scalars and broadcast over the batch
        mean = reciprocal_sqrt_alphas[t] * (xt - betas_over_sqrt_one_minus_alphas_cumprod[t] * estimated_noise)
        if t > 0:
            z = torch.randn_like(xt)
        else:
            z = 0                           # no noise is added at the final step
        return mean + sqrt_betas[t] * z
    
