[Paper Reading] [Distillation - Diffusion] BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping

Apple; University of Pennsylvania



Diffusion models have demonstrated excellent potential for generating diverse images.
However, their performance often suffers from slow generation due to iterative denoising. Knowledge distillation has been recently proposed as a remedy which can reduce the number of inference steps to one or a few, without significant quality degradation.
However, existing distillation methods either require significant amounts of offline computation for generating synthetic training data from the teacher model, or need to perform expensive online learning with the help of real data.
In this work, we present a novel technique called BOOT, that overcomes these limitations with an efficient data-free distillation algorithm.
The core idea is to learn a time-conditioned model that predicts the output of a pre-trained diffusion model teacher given any time-step.
Such a model can be efficiently trained based on bootstrapping from two consecutive sampled steps.
Furthermore, our method can be easily adapted to large-scale text-to-image diffusion models,
which are challenging for conventional methods given the fact that the training sets are often large and difficult to access.
We demonstrate the effectiveness of our approach on several benchmark datasets in the DDIM setting, achieving comparable generation quality while being orders of magnitude faster than the diffusion teacher.
The text-to-image results show that the proposed approach is able to handle highly complex distributions, shedding light on more efficient generative modeling.




[...] standard diffusion models often have slow inference times (around 50 to 1000× slower than single-step models like GANs).

To address this issue, previous studies have proposed using knowledge distillation to improve the inference speed (Hinton et al., 2015).
The idea is to train a faster student model that can replicate the output of a pre-trained diffusion model.
In this work, we focus on learning single-step models that only require one neural function evaluation (NFE).
However, conventional methods, such as Luhman & Luhman (2021), require executing the full teacher sampling to generate synthetic targets for every student update, which is impractical for distilling large diffusion models like StableDiffusion (SD, Rombach et al., 2021).
Recently, several techniques have been proposed to avoid sampling using the concept of "bootstrap".
For example, Salimans & Ho (2022) gradually reduce the number of inference steps based on the previous stage's student, while Song et al. (2023) and Berthelot et al. (2023) train single-step denoisers by enforcing self-consistency between adjacent student outputs along the same diffusion trajectory (see Fig. 2).
However, these approaches rely on the availability of real data to simulate the intermediate diffusion states as input, which limits their applicability in scenarios where the desired real data is not accessible.


In this paper, we propose BOOT, a data-free knowledge distillation method for denoising diffusion models based on bootstrapping.
BOOT is partially motivated by the observation made by the consistency model (CM, Song et al., 2023) that all points on the same diffusion trajectory (also known as the PF-ODE (Song et al., 2020b)) have a deterministic mapping between each other.
Unlike CM, which seeks self-consistency from any $x_t$ to $x_0$, BOOT predicts all possible $x_t$ given the same noise point $\epsilon$ and a time indicator $t$.
Since our model $g_\theta$ always reads pure Gaussian noise, there is no need to sample from real data.
Moreover, learning all $x_t$ from the same $\epsilon$ enables bootstrapping: it is easier to predict $x_t$ if the model has already learned to generate $x_{t'}$ where $t' > t$.
However, formulating bootstrapping in this way presents additional challenges, such as noisy sample prediction, which is non-trivial for neural networks.
To address this, we learn the student model from a novel Signal-ODE derived from the original PF-ODE.
We also design objectives and boundary conditions to enhance the sampling quality and diversity.
This enables efficient inference of large diffusion models in scenarios where the original training corpus is inaccessible due to privacy or other concerns.
For example, we can obtain an efficient model for synthesizing images of "raccoon astronaut" by distilling the text-to-image model with the corresponding prompts (shown in Fig. 3), even though collecting such data in reality is difficult.



In the experiments, we first demonstrate the efficacy of BOOT on various challenging image generation benchmarks, including unconditional and class-conditional settings.
Next, we show that the proposed method can be easily adopted to distill text-to-image diffusion models. An illustration of sampled images from our distilled text-to-image model is shown in Fig. 1.



2 Preliminaries
2.1 Diffusion Models 
2.2 Knowledge Distillation
Orthogonal to the development of ODE solvers, distillation-based techniques have been proposed to learn faster student models from a pre-trained diffusion teacher.
The most straightforward approach is to perform direct distillation (Luhman & Luhman, 2021), where a student model $g_\theta$ is trained to learn from the output of the diffusion model, which is computationally expensive itself:

$$\mathcal{L}^{\text{Direct}}_\theta = \mathbb{E}_{x_T \sim \mathcal{N}(0, I)} \left\| g_\theta(x_T) - \text{ODE-Solver}(f_\phi, x_T, T \to 0) \right\|_2^2 \qquad (3)$$

Here, ODE-Solver refers to any solver like DDIM as mentioned above.
While this naive approach shows promising results, it typically requires over 50 steps of evaluations to obtain reasonable distillation targets, which becomes a bottleneck when learning large-scale models.
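To make the cost of direct distillation concrete, here is a minimal NumPy sketch of how one distillation target is produced. The cosine schedule `alpha_sigma`, the closed-form `toy_teacher`, the DDIM-style `ddim_step`, and the `counter` dictionary are all illustrative assumptions, not code from the paper; the point is only that every target costs `num_steps` teacher evaluations.

```python
import numpy as np

def alpha_sigma(t):
    """Toy schedule (an assumption): alpha_t^2 + sigma_t^2 = 1, t in [0, 1]."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def toy_teacher(x_t, t):
    """Hypothetical stand-in for the teacher f_phi, predicting the clean signal."""
    alpha, _ = alpha_sigma(t)
    return alpha * x_t

def ddim_step(teacher, x_t, t, s):
    """One deterministic DDIM-style update from time t to an earlier time s."""
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    x0_hat = teacher(x_t, t)                 # one teacher evaluation
    eps_hat = (x_t - a_t * x0_hat) / s_t     # implied noise estimate
    return a_s * x0_hat + s_s * eps_hat

def ode_solver(teacher, x_T, num_steps, counter):
    """Run the full trajectory T -> 0, counting teacher evaluations."""
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    x = x_T
    for t, s in zip(ts[:-1], ts[1:]):
        x = ddim_step(teacher, x, t, s)
        counter["nfe"] += 1
    return x

def direct_distillation_loss(student, teacher, x_T, num_steps, counter):
    """Squared error between the one-step student and the multi-step teacher."""
    target = ode_solver(teacher, x_T, num_steps, counter)
    return float(np.mean((student(x_T) - target) ** 2))
```

With `num_steps=50`, a single call to `direct_distillation_loss` already spends 50 teacher evaluations before the student receives any training signal, which is the bottleneck described above.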
Alternatively, recent studies (Salimans & Ho, 2022; Song et al., 2023; Berthelot et al., 2023) have proposed methods to avoid running the full diffusion path during distillation.
For instance, the consistency model (CM, Song et al., 2023) trains a time-conditioned student model $g_\theta(x_t, t)$ to predict self-consistent outputs along the diffusion trajectory in a bootstrap fashion:

$$\mathcal{L}^{\text{CM}}_\theta = \mathbb{E}_{x_t \sim q(x_t \mid x),\; s, t \sim [0, T],\; s < t} \left\| g_\theta(x_t, t) - g_{\theta^-}(x_s, s) \right\|_2^2 \qquad (4)$$

where $x_s = \text{ODE-Solver}(f_\phi, x_t, t \to s)$, typically with a single-step evaluation using Eq. (2).
In this case, $\theta^-$ represents an exponential moving average (EMA) of the student parameters $\theta$, which is important to prevent the self-consistency objectives from collapsing into trivial solutions by always predicting similar outputs.
After training, samples can be generated by executing $g_\theta(x_T, T)$ with a single NFE.
It is worth noting that Eq. (4) requires sampling $x_t$ from the real data sample $x$, which is the essence of bootstrapping: the model learns to denoise increasingly noisy inputs until $x_T$.
However, in many tasks, the original training data $x$ for distillation is inaccessible.
For example, text-to-image generation models require billions of paired data for training.
One possible solution is to use a different dataset for distillation; however, the mismatch in the distributions of the two datasets would result in suboptimal distillation performance.
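The dependence on real data can be seen in a small sketch of one consistency-style training step. This is a simplified illustration, not the authors' or the CM implementation: the schedule `alpha_sigma`, the DDIM-style `one_step_solver`, and the dictionary-based `ema_update` are assumptions chosen for brevity.

```python
import numpy as np

def alpha_sigma(t):
    """Toy schedule (an assumption): alpha_t^2 + sigma_t^2 = 1, t in [0, 1]."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def one_step_solver(teacher, x_t, t, s):
    """Single DDIM-style update t -> s with a signal-predicting teacher."""
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    x0_hat = teacher(x_t, t)
    return a_s * x0_hat + s_s * (x_t - a_t * x0_hat) / s_t

def consistency_loss(student, ema_student, teacher, x_real, t, s, rng):
    """Self-consistency between adjacent points on one trajectory (s < t)."""
    a_t, s_t = alpha_sigma(t)
    eps = rng.standard_normal(x_real.shape)
    x_t = a_t * x_real + s_t * eps            # needs a real data sample x
    x_s = one_step_solver(teacher, x_t, t, s)
    target = ema_student(x_s, s)              # EMA copy, treated as a constant
    return float(np.mean((student(x_t, t) - target) ** 2))

def ema_update(ema_params, params, decay=0.999):
    """Exponential moving average of the student parameters."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}
```

The `x_real` argument is the part that a data-free method has to remove: without it there is no way to construct `x_t` in this formulation.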


3 Method

In this section, we present BOOT, a novel distillation approach that is inspired by the concept of bootstrapping and does not require target domain data during training.
We begin by introducing signal-ODE, a modeling technique focused exclusively on signals (§ 3.1), and its corresponding distillation process (§ 3.2).
Subsequently, we explore the application of BOOT in text-to-image generation (§ 3.3).
The training pipeline is depicted in Fig. 3, providing an overview of the process.


Figure 3: Training pipeline of BOOT.
$s$ and $t$ are two consecutive timesteps where $s < t$.
From a noise map $\epsilon$, the objective of BOOT minimizes the difference between the output of a student model at timestep $s$ and the output of stacking the same student model and a teacher model at an earlier time $t$.
The whole process is data-free.
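The objective described in the caption can be sketched generically as follows. This is a hedged illustration rather than the paper's exact loss: `student` and `teacher_step` are hypothetical callables, any loss weighting is omitted, and in an autodiff framework the target would be wrapped in a stop-gradient.

```python
import numpy as np

def boot_loss(student, teacher_step, eps, t, s):
    """Bootstrapping loss for consecutive timesteps s < t, using only noise eps.

    student(eps, t)               -> student output at time t
    teacher_step(y_t, eps, t, s)  -> one teacher update carrying y_t from t to s
    """
    y_t = student(eps, t)                    # student output at the noisier step
    target = teacher_step(y_t, eps, t, s)    # teacher stacked on the student
    # No real data appears anywhere: the only input is the noise map eps.
    return float(np.mean((student(eps, s) - target) ** 2))

# Toy usage with hypothetical closed-form models for which the loss vanishes.
rng = np.random.default_rng(0)
eps = rng.standard_normal((2, 4))
toy_student = lambda e, t: (1.0 - t) * e
toy_teacher_step = lambda y, e, t, s: y + (t - s) * e
loss = boot_loss(toy_student, toy_teacher_step, eps, t=0.5, s=0.4)
```

The toy models are chosen so that the student already satisfies the recursion; a student that does not will yield a positive loss.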


3.1 Signal-ODE
We utilize a time-conditioned student model $g_\theta(\epsilon, t)$ in our approach.
Similar to direct distillation (Luhman & Luhman, 2021), BOOT always takes random noise $\epsilon$ as input and approximates the intermediate diffusion model variable: $g_\theta(\epsilon, t) \approx x_t = \text{ODE-Solver}(f_\phi, \epsilon, T \to t), \quad \epsilon \sim \mathcal{N}(0, I)$.


This approach eliminates the need to sample from real data during training.
The final sample can be obtained as $g_\theta(\epsilon, 0) \approx x_0$.
However, it poses a challenge to train $g_\theta$ effectively, as neural networks struggle to predict partially noisy images (Berthelot et al., 2023), leading to out-of-distribution (OOD) problems and additional complexities in learning $g_\theta$ accurately.


To overcome the aforementioned challenge, we propose an alternative approach where we predict $y_t = (x_t - \sigma_t \epsilon) / \alpha_t$.
In this case, $y_t$ represents the low-frequency "signal" component of $x_t$, which is easier for neural networks to learn.
The initial noise for diffusion is denoted by $\epsilon$. This prediction target is reasonable since it aligns with the boundary condition of the teacher model, where $y_0 = x_0$.
Furthermore, we can derive an iterative equation from Eq. (2) for consecutive timesteps:
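The iteration can be worked out as follows, assuming Eq. (2) is the standard DDIM update with a signal-predicting teacher $f_\phi$ and a schedule $(\alpha_t, \sigma_t)$; the ratio $r_{s,t}$ is a symbol introduced here for brevity and is not taken from the source.

```latex
\begin{aligned}
x_s &= \alpha_s f_\phi(x_t, t) + \frac{\sigma_s}{\sigma_t}\bigl(x_t - \alpha_t f_\phi(x_t, t)\bigr)
      && \text{(assumed form of Eq. (2))} \\
x_t &= \alpha_t y_t + \sigma_t \epsilon
      && \text{(definition of } y_t\text{)} \\
x_s &= \alpha_s f_\phi(x_t, t) + \frac{\sigma_s \alpha_t}{\sigma_t}\bigl(y_t - f_\phi(x_t, t)\bigr) + \sigma_s \epsilon \\
y_s &= \frac{x_s - \sigma_s \epsilon}{\alpha_s}
     = (1 - r_{s,t})\, f_\phi(x_t, t) + r_{s,t}\, y_t,
      \qquad r_{s,t} = \frac{\sigma_s / \alpha_s}{\sigma_t / \alpha_t}
\end{aligned}
```

Under these assumptions the noise term $\sigma_s \epsilon$ cancels exactly, so $y_s$ is a convex combination of the teacher's signal prediction and the previous signal $y_t$ whenever $\sigma_t / \alpha_t$ increases with $t$.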

Figure 4: Comparison between the generated outputs of DDIM/Signal-ODE and our distilled model given the same prompt "A raccoon wearing a space suit, wearing a helmet. Oil painting in the style of Rembrandt" and initial noise input.
By definition, signal-ODE converges to the same final sample as the original DDIM, while the distilled single-step model does not necessarily follow.
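The claim that signal-ODE and DDIM reach the same final sample can be checked numerically on a toy problem. The sketch below assumes a cosine schedule, a hypothetical deterministic `toy_teacher`, and a start time slightly below 1 with $y_T = 0$ (so that $\alpha_T \neq 0$); none of these choices come from the paper.

```python
import numpy as np

def alpha_sigma(t):
    """Toy schedule (an assumption): alpha_t^2 + sigma_t^2 = 1."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def toy_teacher(x_t, t):
    """Hypothetical signal-predicting teacher; any deterministic map works."""
    return np.tanh(x_t) * (1.0 - 0.5 * t)

def ddim_trajectory(eps, ts):
    """Iterate on the noisy state x_t, starting from x_T = sigma_T * eps."""
    x = alpha_sigma(ts[0])[1] * eps
    xs = [x]
    for t, s in zip(ts[:-1], ts[1:]):
        a_t, s_t = alpha_sigma(t)
        a_s, s_s = alpha_sigma(s)
        f = toy_teacher(x, t)
        x = a_s * f + s_s * (x - a_t * f) / s_t
        xs.append(x)
    return xs

def signal_trajectory(eps, ts):
    """Iterate on the signal y_t = (x_t - sigma_t * eps) / alpha_t, with y_T = 0."""
    y = np.zeros_like(eps)
    ys = [y]
    for t, s in zip(ts[:-1], ts[1:]):
        a_t, s_t = alpha_sigma(t)
        a_s, s_s = alpha_sigma(s)
        f = toy_teacher(a_t * y + s_t * eps, t)   # teacher still reads x_t
        r = (s_s / a_s) / (s_t / a_t)
        y = (1.0 - r) * f + r * y
        ys.append(y)
    return ys

rng = np.random.default_rng(0)
eps = rng.standard_normal((3, 4))
ts = np.linspace(0.95, 0.0, 11)
xs = ddim_trajectory(eps, ts)
ys = signal_trajectory(eps, ts)
```

At every step the two trajectories are related by `x = alpha * y + sigma * eps`, and at `t = 0` they coincide, which is the sense in which signal-ODE follows the original DDIM path.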

