High-Resolution Image Synthesis with Latent Diffusion Models

1 Title

        High-Resolution Image Synthesis with Latent Diffusion Models (Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer)

2 Conclusion

        Since DMs typically operate directly in pixel space, optimizing powerful DMs often consumes hundreds of GPU days, and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, this paper applies them in the latent space of powerful pretrained autoencoders. The resulting latent diffusion models (LDMs) achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis, and highly competitive performance on various tasks, including unconditional image generation, text-to-image synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.

3 Good Sentences

        1、In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner.(The advantages of this method in contrast to previous works)
        2、To increase the accessibility of this powerful model class and at the same time reduce its significant resource consumption, a method is needed that reduces the computational complexity for both training and sampling. Reducing the computational demands of DMs without impairing their performance is, therefore, key to enhancing their accessibility. (The main questions of DMs to be improved)
        3、While approaches to jointly learn an encoding/decoding model together with a score-based prior exist, they still require a difficult weighting between reconstruction and generative capabilities and are outperformed by our approach. (The advantages of this method in contrast to previous work)


Introduction

        The diffusion model is a likelihood-based model that achieves better generation quality than GANs. However, its sampling is sequential, requiring many repeated iterative evaluations, so both training and inference are very expensive. This paper proposes moving the diffusion process into a latent space, which greatly reduces computational complexity while still achieving very good generation quality. In addition, a cross-attention mechanism is proposed to enable multimodal conditioning, so that class-conditional, text-to-image, and layout-to-image generation also become possible.

Method

(Figure 1: overall LDM framework.) The overall framework is shown in the figure: first train an autoencoder (consisting of an encoder and a decoder). Diffusion is then performed on the data compressed by the encoder, and the decoder restores the result to pixel space.
        The diffusion process itself is unchanged; the only difference is that the targets of diffusion and reconstruction are now vectors in the latent space. The diffusion model is implemented as a time-conditional UNet.
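Since the forward noising process is unchanged and merely applied to latents z = E(x), it can be sketched in a few lines of NumPy. The schedule values and latent shape below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def forward_diffuse(z0, t, alpha_bar):
    """Sample z_t ~ q(z_t | z_0): the closed-form forward noising step,
    applied to a latent z0 = E(x) instead of a pixel-space image."""
    eps = np.random.randn(*z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps  # the UNet is trained to predict eps from (zt, t)

# Illustrative linear beta schedule over T steps
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# A stand-in latent of shape (4, 8, 8) representing the encoder output
z0 = np.random.randn(4, 8, 8)
zt, eps = forward_diffuse(z0, t=500, alpha_bar=alpha_bar)
print(zt.shape)  # (4, 8, 8)
```

Because the latent is much smaller than the original image, every UNet evaluation during training and sampling is correspondingly cheaper.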


        In the latent space, imperceptible high-frequency details have been abstracted away, so it forms an efficient, lower-dimensional space that is better suited to likelihood-based generative modeling.

Perceptual Image Compression

        Given an image x in RGB space, the encoder E maps x to a latent representation z = E(x), and the decoder D reconstructs the image from the latent. The encoder downsamples the image by a factor f. To avoid arbitrarily high-variance latent spaces, two kinds of regularization are used: KL-reg, which imposes a slight KL penalty toward a standard normal on the learned latent, and VQ-reg, which uses a vector-quantization layer.
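As a sketch of the KL-reg variant, the closed-form KL divergence between the encoder's diagonal Gaussian posterior and a standard normal can be computed as follows (the latent shape and downsampling factor f = 8 are illustrative assumptions):

```python
import numpy as np

def kl_reg(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian,
    summed over all latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# Hypothetical encoder output for a 64x64 RGB image with f = 8:
# latent shape (c, H/f, W/f) = (4, 8, 8)
mu = np.zeros((4, 8, 8))
logvar = np.zeros((4, 8, 8))
print(kl_reg(mu, logvar))  # 0.0 when the posterior already matches N(0, 1)
```

The penalty is kept small so the latent stays expressive; it only discourages the latent distribution from drifting far from a standard normal.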

Conditioning Mechanisms

        To turn DMs into more flexible conditional image generators, a cross-attention mechanism is introduced. To preprocess y from various modalities (such as language prompts), a domain-specific encoder τ_θ is also introduced; it projects y to an intermediate representation τ_θ(y), which is then mapped to the intermediate layers of the UNet via cross-attention layers: Attention(Q, K, V) = softmax(QKᵀ/√d)·V, where Q = W_Q·φ_i(z_t), K = W_K·τ_θ(y), and V = W_V·τ_θ(y).
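The cross-attention step above can be sketched in plain NumPy; the shapes and random weight matrices below are hypothetical stand-ins for the learned projections W_Q, W_K, W_V:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phi, tau_y, Wq, Wk, Wv):
    """softmax(Q K^T / sqrt(d)) V, with Q from flattened UNet features
    phi(z_t) and K, V from the conditioning encoder output tau(y)."""
    Q = phi @ Wq      # (N, d): N flattened spatial positions
    K = tau_y @ Wk    # (M, d): M conditioning tokens
    V = tau_y @ Wv    # (M, d)
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V  # (N, d)

rng = np.random.default_rng(0)
phi = rng.standard_normal((64, 32))    # hypothetical 8x8 UNet feature map, flattened
tau_y = rng.standard_normal((10, 16))  # hypothetical text-encoder token embeddings
Wq = rng.standard_normal((32, 8))
Wk = rng.standard_normal((16, 8))
Wv = rng.standard_normal((16, 8))
out = cross_attention(phi, tau_y, Wq, Wk, Wv)
print(out.shape)  # (64, 8)
```

Because K and V come from τ_θ(y) while Q comes from the UNet features, each spatial position of the denoiser attends over the conditioning tokens, which is what makes text, layout, and other modalities pluggable.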
