Apple, University of Pennsylvania
Abstract
Diffusion models have demonstrated excellent potential for generating diverse images.
However, their performance often suffers from slow generation due to iterative denoising. Knowledge distillation has recently been proposed as a remedy that can reduce the number of inference steps to one or a few without significant quality degradation.
However, existing distillation methods either require significant amounts of offline computation for generating synthetic training data from the teacher model, or need to perform expensive online learning with the help of real data.
In this work, we present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm.
The core idea is to learn a time-conditioned model that predicts the output of a pre-trained diffusion model teacher given any time step.
Such a model can be efficiently trained based on bootstrapping from two consecutive sampled steps.
Furthermore, our method can be easily adapted to large-scale text-to-image diffusion models,
which are challenging for conventional methods given the fact that the training sets are often large and difficult to access.
We demonstrate the effectiveness of our approach on several benchmark datasets in the DDIM setting, achieving comparable generation quality while being orders of magnitude faster than the diffusion teacher.
The text-to-image results show that the proposed approach is able to handle highly complex distributions, shedding light on more efficient generative modeling.
Introduction
Top: consistency model (CM)
Bottom: BOOT
xxx, standard diffusion models often have slow inference times (around 50 ∼ 1000× slower than single-step models like GANs)
To address this issue, previous studies have proposed using knowledge distillation to improve the inference speed (Hinton et al., 2015).
(Note: the earliest KD method.)
The idea is to train a faster student model that can replicate the output of a pre-trained diffusion model.
In this work, we focus on learning single-step models that only require one neural function evaluation (NFE).
However, conventional methods, such as Luhman & Luhman (2021), require executing the full teacher sampling to generate synthetic targets for every student update, which is impractical for distilling large diffusion models like StableDiffusion (SD, Rombach et al., 2021).
Recently, several techniques have been proposed to avoid sampling using the concept of "bootstrap".
For example, Salimans & Ho (2022) gradually reduce the number of inference steps based on the previous stage's student, while Song et al. (2023) and Berthelot et al. (2023) train single-step denoisers by enforcing self-consistency between adjacent student outputs along the same diffusion trajectory (see Fig. 2).
However, these approaches rely on the availability of real data to simulate the intermediate diffusion states as input, which limits their applicability in scenarios where the desired real data is not accessible.
In this paper, we propose BOOT, a data-free knowledge distillation method for denoising diffusion models based on bootstrapping.
BOOT is partially motivated by the observation made by the consistency model (CM, Song et al., 2023) that all points on the same diffusion trajectory (also known as PF-ODE (Song et al., 2020b)) have a deterministic mapping between each other.
Unlike CM, which seeks self-consistency from any $x_t$ to $x_0$, BOOT predicts all possible $x_t$ given the same noise point $\epsilon$ and a time indicator $t$.
Since our model $g_\theta$ always reads pure Gaussian noise, there is no need to sample from real data.
Moreover, learning all $x_t$ from the same $\epsilon$ enables bootstrapping:
it is easier to predict $x_t$ if the model has already learned to generate $x_{t'}$ where $t' > t$. However, formulating bootstrapping in this way presents additional challenges, such as noisy sample prediction, which is non-trivial for neural networks.
To address this, we learn the student model from a novel Signal-ODE derived from the original PF-ODE.
We also design objectives and boundary conditions to enhance the sampling quality and diversity.
This enables efficient inference of large diffusion models in scenarios where the original training corpus is inaccessible due to privacy or other concerns.
For example, we can obtain an efficient model for synthesizing images of "raccoon astronaut" by distilling the text-to-image model with the corresponding prompts (shown in Fig. 3), even though collecting such data in reality is difficult.
In the experiments, we first demonstrate the efficacy of BOOT on various challenging image generation benchmarks, including unconditional and class-conditional settings.
Next, we show that the proposed method can be easily adopted to distill text-to-image diffusion models. An illustration of sampled images from our distilled text-to-image model is shown in Fig. 1.
2 Preliminaries
2.1 Diffusion Models
2.2 Knowledge Distillation
Orthogonal to the development of ODE solvers, distillation-based techniques have been proposed to learn faster student models from a pre-trained diffusion teacher.
The most straightforward approach is to perform direct distillation (Luhman & Luhman, 2021), where a student model $g_\theta$ is trained to learn from the output of the diffusion model, which is computationally expensive itself:
Here, ODE-solver refers to any solvers like DDIM as mentioned above.
While this naive approach shows promising results, it typically requires over 50 steps of evaluations to obtain reasonable distillation targets, which becomes a bottleneck when learning large-scale models.
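To make this cost concrete, the following is a minimal sketch of direct distillation, not the authors' implementation. It assumes the objective is a squared error between a single-step student and the teacher's fully solved sample, a toy cosine noise schedule, and a placeholder teacher that predicts $x_0$. The names `CountingTeacher`, `ddim_solve`, and `direct_distillation_loss` are hypothetical and introduced only for illustration.

```python
import numpy as np

# Assumed toy schedule on t in [0, 1]: alpha_0 = 1, sigma_0 = 0, alpha_1 ~ 0, sigma_1 = 1.
def alpha(t):
    return np.cos(0.5 * np.pi * t)

def sigma(t):
    return np.sin(0.5 * np.pi * t)

class CountingTeacher:
    """Placeholder x0-predicting teacher f_phi(x_t, t) that counts its evaluations."""
    def __init__(self):
        self.nfe = 0

    def __call__(self, x_t, t):
        self.nfe += 1
        return np.tanh(x_t)  # stand-in for a trained denoiser (assumption)

def ddim_solve(teacher, eps, n_steps):
    """Deterministic DDIM from t = 1 down to t = 0, starting at x_T = eps."""
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    x = eps.copy()
    for t, s in zip(ts[:-1], ts[1:]):
        x0_hat = teacher(x, t)
        eps_hat = (x - alpha(t) * x0_hat) / sigma(t)
        x = alpha(s) * x0_hat + sigma(s) * eps_hat
    return x

def direct_distillation_loss(student, teacher, eps, n_steps=50):
    """Squared error between the one-step student and the multi-step teacher sample."""
    target = ddim_solve(teacher, eps, n_steps)  # costs n_steps teacher evaluations
    return float(np.mean((student(eps) - target) ** 2))

teacher = CountingTeacher()
eps = np.random.default_rng(0).standard_normal((4, 8))
loss = direct_distillation_loss(lambda e: np.zeros_like(e), teacher, eps, n_steps=50)
print("teacher evaluations for one batch of targets:", teacher.nfe)
```

Every student update pays for a full teacher trajectory, which is the bottleneck described above.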
Alternatively, recent studies (Salimans & Ho, 2022; Song et al., 2023; Berthelot et al., 2023) have proposed methods to avoid running the full diffusion path during distillation.
For instance, the consistency model (CM, Song et al., 2023) trains a time-conditioned student model $g_\theta(x_t, t)$ to predict self-consistent outputs along the diffusion trajectory in a bootstrap fashion:
where $x_s = \text{ODE-Solver}(f_\phi, x_t, t \to s)$, typically with a single-step evaluation using Eq. (2).
In this case, $\theta^-$ represents an exponential moving average (EMA) of the student parameters $\theta$, which is important to prevent the self-consistency objectives from collapsing into trivial solutions by always predicting similar outputs.
After training, samples can be generated by executing $g_\theta(x_T, T)$ with a single NFE.
It is worth noting that Eq. (4) requires sampling $x_t$ from the real data sample $x$, which is the essence of bootstrapping:
the model learns to denoise increasingly noisy inputs until $x_T$.
However, in many tasks, the original training data x for distillation is inaccessible.
For example, text-to-image generation models require billions of paired data for training.
One possible solution is to use a different dataset for distillation; however, the mismatch in the distributions of the two datasets would result in suboptimal distillation performance.
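The dependence on real data can be seen in a small sketch of a consistency-style loss. This is an illustration under stated assumptions rather than the CM authors' code: the squared-error form, the toy cosine schedule, the two-parameter linear student, and the names `ddim_step`, `ema_update`, `toy_student`, and `consistency_loss` are all hypothetical.

```python
import numpy as np

def alpha(t):
    return np.cos(0.5 * np.pi * t)

def sigma(t):
    return np.sin(0.5 * np.pi * t)

def toy_teacher(x_t, t):
    return np.tanh(x_t)  # placeholder x0-predictor (assumption)

def ddim_step(teacher, x_t, t, s):
    """One deterministic DDIM step from time t to an earlier time s < t."""
    x0_hat = teacher(x_t, t)
    eps_hat = (x_t - alpha(t) * x0_hat) / sigma(t)
    return alpha(s) * x0_hat + sigma(s) * eps_hat

def ema_update(theta_ema, theta, decay=0.999):
    """theta^- follows theta as an exponential moving average."""
    return decay * theta_ema + (1.0 - decay) * theta

def toy_student(theta, x, t):
    return theta[0] * x + theta[1]  # hypothetical two-parameter student

def consistency_loss(theta, theta_ema, teacher, x_real, eps, t, s):
    x_t = alpha(t) * x_real + sigma(t) * eps  # requires a real data sample x_real
    x_s = ddim_step(teacher, x_t, t, s)       # a single teacher evaluation
    pred = toy_student(theta, x_t, t)
    target = toy_student(theta_ema, x_s, s)   # treated as a constant target
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x_real = rng.standard_normal((4, 8))  # stands in for real training data
eps = rng.standard_normal((4, 8))
theta = np.array([0.5, 0.0])
print(consistency_loss(theta, theta.copy(), toy_teacher, x_real, eps, t=0.6, s=0.5))
```

In this toy setup a student that ignores its input (`theta[0] = 0`) reaches zero loss whenever the target network matches it, which is the trivial collapse that the EMA target is meant to guard against.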
3 Method
In this section, we present BOOT, a novel distillation approach inspired by the concept of bootstrapping without requiring target domain data during training.
We begin by introducing signal-ODE, a modeling technique focused exclusively on signals (§3.1), and its corresponding distillation process (§3.2).
Subsequently, we explore the application of BOOT in text-to-image generation (§3.3).
The training pipeline is depicted in Fig. 3, providing an overview of the process.
Figure 3: Training pipeline of BOOT.
s and t are two consecutive timesteps where s < t.
From a noise map ϵ, the objective of BOOT minimizes the difference between the output of a student model at timestep s, and the output of stacking the same student model and a teacher model at an earlier time t.
The whole process is data-free.
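The pipeline in the caption can be sketched as a single loss evaluation. This is a hedged reading of the figure, not the paper's exact objective: it assumes a squared-error loss, a target branch that is treated as a constant, the signal-space update of §3.1 written in discrete form, and a toy cosine schedule. The name `boot_step_loss` and the toy student and teacher are hypothetical.

```python
import numpy as np

def alpha(t):
    return np.cos(0.5 * np.pi * t)

def sigma(t):
    return np.sin(0.5 * np.pi * t)

def boot_step_loss(student, student_frozen, teacher, eps, t, s):
    """Compare the student at time s with (student at t, then one teacher step)."""
    y_t = student_frozen(eps, t)                  # target branch, held constant
    x_t = alpha(t) * y_t + sigma(t) * eps         # re-noise with the same eps
    r = (sigma(s) * alpha(t)) / (sigma(t) * alpha(s))
    y_s_target = (1.0 - r) * teacher(x_t, t) + r * y_t
    return float(np.mean((student(eps, s) - y_s_target) ** 2))

# Toy check: for a constant teacher c, y_t = c + K * sigma_t / alpha_t solves the update exactly.
c, K = 0.7, 0.2
const_teacher = lambda x_t, t: np.full_like(x_t, c)
exact_student = lambda eps, t: c + K * eps * sigma(t) / alpha(t)
eps = np.random.default_rng(0).standard_normal((4, 8))
print(boot_step_loss(exact_student, exact_student, const_teacher, eps, t=0.6, s=0.5))
```

Only noise `eps` enters the loss; no real sample is drawn at any point, which matches the data-free claim in the caption.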
3.1 Signal-ODE
We utilize a time-conditioned student model $g_\theta(\epsilon, t)$ in our approach.
Similar to direct distillation (Luhman & Luhman, 2021), BOOT always takes random noise $\epsilon$ as input and approximates the intermediate diffusion model variable: $g_\theta(\epsilon, t) \approx x_t = \text{ODE-Solver}(f_\phi, \epsilon, T \to t)$, $\epsilon \sim \mathcal{N}(0, I)$.
This approach eliminates the need to sample from real data during training.
The final sample can be obtained as $g_\theta(\epsilon, 0) \approx x_0$.
However, it poses a challenge to train $g_\theta$ effectively, as neural networks struggle to predict partially noisy images (Berthelot et al., 2023), leading to out-of-distribution (OOD) problems and additional complexities in learning $g_\theta$ accurately.
To overcome the aforementioned challenge, we propose an alternative approach where we predict $y_t = (x_t - \sigma_t \epsilon)/\alpha_t$.
In this case, $y_t$ represents the low-frequency "signal" component of $x_t$, which is easier for neural networks to learn.
The initial noise for diffusion is denoted by $\epsilon$. This prediction target is reasonable since it aligns with the boundary condition of the teacher model, where $y_0 = x_0$.
Furthermore, we can derive an iterative equation from Eq. (2) for consecutive timesteps:
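The iterative equation is not reproduced in these notes. As a worked sketch, assume Eq. (2) is the deterministic DDIM update with a teacher $f_\phi$ that predicts the clean signal (an assumption, since §2.1 is not reproduced here). Substituting $x_t = \alpha_t y_t + \sigma_t \epsilon$ and $x_s = \alpha_s y_s + \sigma_s \epsilon$ then gives a recursion in $y$ alone; the shorthand $r$ and $\lambda_t$ is introduced only for this derivation.

```latex
\begin{aligned}
x_s &= \frac{\sigma_s}{\sigma_t}\,x_t
      + \Bigl(\alpha_s - \alpha_t\frac{\sigma_s}{\sigma_t}\Bigr) f_\phi(x_t, t)
      && \text{(assumed form of Eq. (2))} \\
\alpha_s y_s + \sigma_s \epsilon
    &= \frac{\sigma_s \alpha_t}{\sigma_t}\,y_t + \sigma_s \epsilon
      + \Bigl(\alpha_s - \alpha_t\frac{\sigma_s}{\sigma_t}\Bigr) f_\phi(x_t, t)
      && \text{(substitute } x_t,\ x_s\text{)} \\
y_s &= r\,y_t + (1 - r)\, f_\phi(x_t, t),
      \qquad r = \frac{\sigma_s \alpha_t}{\sigma_t \alpha_s} = e^{\lambda_s - \lambda_t},
      \quad \lambda_t = \log\frac{\sigma_t}{\alpha_t}
\end{aligned}
```

The noise term $\sigma_s \epsilon$ cancels on both sides, so under this assumption each step is an interpolation between the previous signal estimate $y_t$ and the teacher's prediction, with $x_t = \alpha_t y_t + \sigma_t \epsilon$ recomputed from the fixed $\epsilon$.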
Figure 4: Comparison between the generated outputs of DDIM/Signal-ODE and our distilled model given the same prompt, "A raccoon wearing a space suit, wearing a helmet. Oil painting in the style of Rembrandt", and the same initial noise input.
By definition, signal-ODE converges to the same final sample as the original DDIM, while the distilled single-step model does not necessarily follow.
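The claim that signal-ODE and DDIM reach the same final sample can be checked numerically on a toy problem. The sketch below assumes a deterministic DDIM update with an $x_0$-predicting teacher, a cosine schedule, and a start time slightly below 1 so that $\alpha_t > 0$; the placeholder teacher and the names `ddim_sample` and `signal_ode_sample` are hypothetical.

```python
import numpy as np

def alpha(t):
    return np.cos(0.5 * np.pi * t)

def sigma(t):
    return np.sin(0.5 * np.pi * t)

def toy_teacher(x_t, t):
    return np.tanh(x_t)  # placeholder x0-predictor (assumption)

def ddim_sample(teacher, eps, ts):
    """Iterate in x-space, starting from x = eps at time ts[0]."""
    x = eps.copy()
    for t, s in zip(ts[:-1], ts[1:]):
        x0_hat = teacher(x, t)
        x = alpha(s) * x0_hat + sigma(s) * (x - alpha(t) * x0_hat) / sigma(t)
    return x

def signal_ode_sample(teacher, eps, ts):
    """Iterate in y-space, where y_t = (x_t - sigma_t * eps) / alpha_t."""
    y = (eps - sigma(ts[0]) * eps) / alpha(ts[0])  # same starting point, x = eps
    for t, s in zip(ts[:-1], ts[1:]):
        x_t = alpha(t) * y + sigma(t) * eps        # rebuild the noisy input
        r = (sigma(s) * alpha(t)) / (sigma(t) * alpha(s))
        y = (1.0 - r) * teacher(x_t, t) + r * y
    return y

eps = np.random.default_rng(0).standard_normal((4, 8))
ts = np.linspace(0.99, 0.0, 51)  # 50 steps; start below 1 so alpha > 0
x0_ddim = ddim_sample(toy_teacher, eps, ts)
y0_signal = signal_ode_sample(toy_teacher, eps, ts)
print("max abs difference:", np.max(np.abs(x0_ddim - y0_signal)))
```

The two loops differ only in the variable they carry between steps, so any gap between them is floating-point rounding. A distilled single-step student has no such guarantee, which is the contrast the caption draws.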