Paper reading: Scalable Diffusion Models with Transformers

 Figure 1. Diffusion models with transformer backbones achieve state-of-the-art image quality. We show selected samples from two of our class-conditional DiT-XL/2 models trained on ImageNet at 512×512 and 256×256 resolution, respectively.



We explore a new class of diffusion models based on the transformer architecture.
We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches.
We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops.
We find that DiTs with higher Gflops, through increased transformer depth/width or increased number of input tokens, consistently have lower FID.
In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512×512 and 256×256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

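The authors train latent diffusion models of images and swap the commonly used U-Net backbone for a transformer that operates on latent patches.
Besides scaling well, the largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512×512 and 256×256 benchmarks, reaching a state-of-the-art FID of 2.27 on the latter (256×256).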
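
To make "a transformer that operates on latent patches" concrete, here is a minimal sketch of the patchify step in plain Python. The 4-channel 32×32 latent and the patch size of 2 are assumptions chosen for illustration (this section does not state them), and `patchify` is a hypothetical helper, not code from the paper.

```python
def patchify(latent, patch_size):
    """Split a latent of shape [C][H][W] (nested lists) into flattened patches.

    Returns a list of T = (H // p) * (W // p) tokens, each of length p * p * C.
    """
    channels = len(latent)
    height = len(latent[0])
    width = len(latent[0][0])
    if height % patch_size or width % patch_size:
        raise ValueError("latent height and width must be divisible by patch_size")

    tokens = []
    # Walk over the patch grid in row-major order.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = [
                latent[c][top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
                for c in range(channels)
            ]
            tokens.append(token)
    return tokens


# Assumed example sizes: a 4-channel 32x32 latent and patch size 2.
latent = [[[0.0] * 32 for _ in range(32)] for _ in range(4)]
tokens = patchify(latent, patch_size=2)
print(len(tokens), len(tokens[0]))  # 256 tokens, each of dimension 16
```

Halving the patch size quadruples the number of tokens, which is one of the ways the abstract says Gflops can be increased.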


Machine learning is experiencing a renaissance powered by transformers.
Over the past five years, neural architectures for natural language processing [8, 39], vision [10] and several other domains have largely been subsumed by transformers [57].
Many classes of image-level generative models remain holdouts to the trend, though—while transformers see widespread use in autoregressive models [3,6,40,44], they have seen less adoption in other generative modeling frameworks.
For example, diffusion models have been at the forefront of recent advances in image-level generative models [9,43];
yet, they all adopt a convolutional U-Net architecture as the de facto choice of backbone.


The seminal work of Ho et al. [19] first introduced the U-Net backbone for diffusion models. The design choice was inherited from PixelCNN++ [49, 55], an autoregressive generative model, with a few architectural changes. The model is convolutional, comprised primarily of ResNet [15] blocks.
In contrast to the standard U-Net [46], additional spatial self-attention blocks, which are essential components in transformers, are interspersed at lower resolutions.
Dhariwal and Nichol [9] ablated several architecture choices for the U-Net, such as the use of adaptive normalization layers [37] to inject conditional information and channel counts for convolutional layers.
However, the high-level design of the U-Net from Ho et al. has largely remained intact.

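The U-Net design was inherited from PixelCNN++ [49, 55], an autoregressive generative model, with a few architectural changes. The model is convolutional and built mainly from ResNet [15] blocks.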

With this work, we aim to demystify the significance of architectural choices in diffusion models and offer empirical baselines for future generative modeling research.
We show that the U-Net inductive bias is not crucial to the performance of diffusion models, and they can be readily replaced with standard designs such as transformers.
As a result, diffusion models are well-poised to benefit from the recent trend of architecture unification, e.g., by inheriting best practices and training recipes from other domains, as well as retaining favorable properties like scalability, robustness and efficiency.
A standardized architecture would also open up new possibilities for cross-domain research.

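The authors show that the U-Net inductive bias is not crucial to the performance of diffusion models, and that the U-Net can readily be replaced with a standard design such as a transformer.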

In this paper, we focus on a new class of diffusion models based on transformers.
We call them Diffusion Transformers, or DiTs for short.
DiTs adhere to the best practices of Vision Transformers (ViTs) [10], which have been shown to scale more effectively for visual recognition than traditional convolutional networks (e.g., ResNet [15]).


More specifically, we study the scaling behavior of transformers with respect to network complexity vs. sample quality. We show that by constructing and benchmarking the DiT design space under the Latent Diffusion Models (LDMs) [45] framework, where diffusion models are trained within a VAE’s latent space, we can successfully replace the U-Net backbone with a transformer.
We further show that DiTs are scalable architectures for diffusion models: there is a strong correlation between the network complexity (measured by Gflops) vs. sample quality (measured by FID).
By simply scaling up DiT and training an LDM with a high-capacity backbone (118.6 Gflops), we are able to achieve a state-of-the-art result of 2.27 FID on the class-conditional 256×256 ImageNet generation benchmark.

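Simply scaling up DiT and training an LDM with a high-capacity backbone (118.6 Gflops) is enough to reach a state-of-the-art 2.27 FID on the class-conditional 256×256 ImageNet generation benchmark.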
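
The link between network complexity and depth, width and token count can be illustrated with a rough cost estimate. The sketch below counts only the large matrix multiplications of a plain transformer block; it is an approximation under stated assumptions, not the paper's Gflops measurement. The function `estimate_macs` and all configurations are hypothetical.

```python
def estimate_macs(depth, width, num_tokens, mlp_ratio=4):
    """Rough multiply-accumulate count for one forward pass of a plain transformer.

    Counts only the large matrix multiplications in each block; embeddings,
    normalization, conditioning layers and the output head are ignored.
    """
    # Q, K, V and output projections: 4 * width^2 per token.
    attn_proj = 4 * width * width
    # Two MLP layers (width -> mlp_ratio * width -> width) per token.
    mlp = 2 * mlp_ratio * width * width
    # Attention scores (QK^T) and the weighted sum over values: 2 * T * width per token.
    attn_mix = 2 * num_tokens * width
    return depth * num_tokens * (attn_proj + mlp + attn_mix)


# Hypothetical configurations, for illustration only.
configs = {
    "base": estimate_macs(depth=12, width=768, num_tokens=256),
    "deeper": estimate_macs(depth=24, width=768, num_tokens=256),
    "wider": estimate_macs(depth=12, width=1536, num_tokens=256),
    "more tokens": estimate_macs(depth=12, width=768, num_tokens=1024),
}

for name, macs in configs.items():
    print(f"{name:12s} {macs / 1e9:8.1f} G multiply-accumulates")
```

In this estimate, doubling depth doubles the cost, doubling width roughly quadruples it, and quadrupling the token count more than quadruples it because the attention term grows quadratically with sequence length.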

Related work


 Figure 3. The Diffusion Transformer (DiT) architecture. 
Left: We train conditional latent DiT models.
The input latent is decomposed into patches and processed by several DiT blocks.
Right: Details of our DiT blocks. We experiment with variants of standard transformer blocks that incorporate conditioning via adaptive layer norm, cross-attention and extra input tokens. Adaptive layer norm works best.
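
As a rough illustration of conditioning via adaptive layer norm, the sketch below normalizes a token and then applies a scale and shift predicted from a conditioning vector. The caption does not give the exact formulation, so the `(1 + scale)` convention, the single linear layer per parameter, and every name here are assumptions made for this example.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(vec, weight, bias):
    """Apply y = W @ vec + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def adaptive_layer_norm(token, cond, w_scale, b_scale, w_shift, b_shift):
    """Modulate a normalized token with a scale and shift predicted from `cond`."""
    scale = linear(cond, w_scale, b_scale)
    shift = linear(cond, w_shift, b_shift)
    normed = layer_norm(token)
    # Assumed convention: zero scale and shift leave the normalized token unchanged.
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]


# Toy example: a 4-dimensional token and a 2-dimensional conditioning vector
# (standing in for, e.g., a combined timestep and class embedding).
token = [1.0, 2.0, 3.0, 4.0]
cond = [0.5, -0.5]
zero_w = [[0.0, 0.0] for _ in range(4)]
zero_b = [0.0] * 4

plain = adaptive_layer_norm(token, cond, zero_w, zero_b, zero_w, zero_b)
shifted = adaptive_layer_norm(token, cond, zero_w, zero_b, zero_w, [1.0] * 4)
print(plain)
print(shifted)
```

With all-zero modulation weights the output equals the plain layer norm of the token. A nonzero shift or scale lets the conditioning signal change the statistics of every token without adding tokens to the sequence.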
