Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

1 Title 

        Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, Mohammad Norouzi)

2 Conclusion

         This paper presents Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models pretrained on text-only corpora are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model.

3 Good Sentences

        1. In contrast to prior work that uses only image-text data for model training, the key finding behind Imagen is that text embeddings from large LMs, pretrained on text-only corpora, are remarkably effective for text-to-image synthesis. (The difference between Imagen and previous works that train on image-text data)
        2. DrawBench enables deeper insights through a multi-dimensional evaluation of text-to-image models, with text prompts designed to probe different semantic properties of models. (The contribution of DrawBench)
        3. Large language models can be another model of choice to encode text for text-to-image generation. Recent progress in large language models has led to leaps in textual understanding and generative capabilities. Language models are trained on text-only corpora significantly larger than paired image-text data, and are thus exposed to a very rich and wide distribution of text. These models are also generally much larger than the text encoders in current image-text models. (Why use large language models as pretrained text encoders)
        4. This is a train-test mismatch, and since the diffusion model is iteratively applied on its own output throughout sampling, the sampling process produces unnatural images and sometimes even diverges. To counter this problem, we investigate static thresholding and dynamic thresholding. (The mismatch between training and sampling, and the solutions to it)
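The two thresholding rules mentioned in point 4 can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: static thresholding clamps the predicted x0 to [-1, 1] at every sampling step, while dynamic thresholding picks a per-sample percentile `s` of the absolute pixel values (the percentile `p` is a hyperparameter) and, when `s > 1`, clamps to [-s, s] and rescales, which avoids the saturation caused by high guidance weights.

```python
import numpy as np

def static_threshold(x0):
    # Clamp the predicted x0 to [-1, 1] at every sampling step.
    return np.clip(x0, -1.0, 1.0)

def dynamic_threshold(x0, p=99.5):
    # s = p-th percentile of |x0| per sample; if s > 1,
    # clamp to [-s, s] and rescale back into [-1, 1].
    s = np.percentile(np.abs(x0), p, axis=tuple(range(1, x0.ndim)), keepdims=True)
    s = np.maximum(s, 1.0)
    return np.clip(x0, -s, s) / s

# With a high guidance weight, predictions can overshoot [-1, 1]:
x0 = np.array([[-3.0, 0.5, 2.0, 0.1]])
print(static_threshold(x0))   # saturates at the boundaries
print(dynamic_threshold(x0))  # rescaled, relative contrast preserved
```

Static thresholding caps the extremes but lets large regions pile up at exactly ±1; dynamic thresholding instead shrinks the whole prediction, which is why it works better with large guidance weights.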


       The main contributions of this paper are as follows:

        1. The discovery that a frozen language model trained on text-only data is a better text encoder for text-to-image generation than models trained on multimodal data, and that scaling up this encoder improves sample quality more effectively than scaling up the diffusion model.

        2. Dynamic thresholding, a new diffusion sampling technique that makes high guidance weights usable and produces more photorealistic and detailed images than before.

        3. Several important diffusion architecture design choices, and Efficient U-Net, a new architecture variant that is simpler, converges faster, and is more memory-efficient.

        4. DrawBench, a new comprehensive and challenging evaluation benchmark for the text-to-image task.


Imagen

Imagen consists of a text encoder and a cascade of conditional diffusion models: the former maps text to a sequence of embeddings, and the latter map these embeddings to images of progressively higher resolution.
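The cascade described above can be sketched as a pipeline of three stages. All components below are hypothetical stand-ins (random outputs and nearest-neighbor upsampling), meant only to show the data flow from prompt to 64×64 base sample to 256×256 and 1024×1024 super-resolution outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt):
    # Stand-in for a frozen language-model text encoder (e.g. T5):
    # maps a prompt to a sequence of token embeddings.
    tokens = prompt.split()
    return rng.standard_normal((len(tokens), 8))

def base_model(text_emb):
    # Stand-in for the text-conditioned 64x64 base diffusion model.
    return rng.standard_normal((64, 64, 3))

def super_res(image, text_emb, size):
    # Stand-in for a text-conditioned super-resolution diffusion
    # model; here just nearest-neighbor upsampling.
    reps = size // image.shape[0]
    return np.repeat(np.repeat(image, reps, axis=0), reps, axis=1)

def imagen_pipeline(prompt):
    emb = encode_text(prompt)
    img64 = base_model(emb)
    img256 = super_res(img64, emb, 256)
    img1024 = super_res(img256, emb, 1024)
    return img1024

out = imagen_pipeline("a corgi playing a flute")
print(out.shape)  # (1024, 1024, 3)
```

Note that every stage in the real cascade conditions on the text embeddings, not just the base model; the stand-ins above keep that interface even though they ignore it.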

[Figure 1: Overview of the Imagen architecture]


        Imagen explores pretrained text encoders — BERT, T5, and CLIP — with the weights of these encoders kept frozen.

Diffusion models and classifier-free guidance

Diffusion models are a class of generative models that convert Gaussian noise into samples from a learned data distribution through an iterative denoising process.
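The iterative denoising process can be sketched as standard ancestral (DDPM-style) sampling: start from pure Gaussian noise and repeatedly apply the learned noise predictor to step toward the data distribution. The noise predictor below is a hypothetical stand-in, since the point is only the structure of the loop:

```python
import numpy as np

def sample(eps_model, betas, shape, rng):
    # Ancestral DDPM sampling: start from Gaussian noise and
    # iteratively denoise with the (learned) noise predictor.
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Add fresh noise at every step except the last.
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Hypothetical stand-in for a trained noise predictor.
betas = np.linspace(1e-4, 0.02, 50)
rng = np.random.default_rng(0)
x0 = sample(lambda x, t: x, betas, (4,), rng)
print(x0)
```

In Imagen each application of `eps_model` is also conditioned on the text embeddings; the loop itself is unchanged.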

Classifier guidance is a technique that uses gradients from a pretrained classifier during sampling to improve sample quality, at the cost of diversity, in conditional diffusion models. Classifier-free guidance is an alternative that avoids this pretrained classifier: a single diffusion model is jointly trained on conditional and unconditional objectives by randomly dropping the condition c during training (e.g., with 10% probability).
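The two sides of classifier-free guidance described above fit in a few lines: at training time the condition is dropped with some probability so one model learns both objectives, and at sampling time the unconditional prediction is extrapolated toward the conditional one with a guidance weight `w`. A minimal numpy sketch (function names are my own):

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, w):
    # Classifier-free guidance: extrapolate from the unconditional
    # noise prediction toward the conditional one with weight w.
    # w = 1 recovers the plain conditional prediction; w > 1
    # amplifies the influence of the condition.
    return eps_uncond + w * (eps_cond - eps_uncond)

def maybe_drop_condition(text_emb, rng, p_drop=0.1):
    # Training side: replace the condition with a null embedding
    # with probability p_drop (10% in the description above).
    if rng.random() < p_drop:
        return np.zeros_like(text_emb)
    return text_emb

eps_c = np.array([1.0, 2.0])
eps_u = np.array([0.0, 0.0])
print(guided_eps(eps_c, eps_u, w=1.0))  # [1. 2.]
print(guided_eps(eps_c, eps_u, w=7.5))  # [ 7.5 15. ]
```

Large `w` is exactly what pushes predictions outside [-1, 1] and motivates the dynamic thresholding discussed earlier.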

Neural network architecture

The pretrained text encoder can be BERT, T5, or CLIP.

Base model: a U-Net serves as the diffusion model that generates 64×64 images; the network conditions on text by pooling the embedding sequence and adding the pooled text embedding to the diffusion model's timestep embedding.
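The pooled-embedding conditioning described above can be sketched as follows. The sinusoidal timestep embedding is the standard one used in diffusion U-Nets; the pooling here is a simple mean over tokens, which is an assumption for illustration:

```python
import numpy as np

def timestep_embedding(t, dim=8, max_period=10000.0):
    # Standard sinusoidal timestep embedding used in diffusion U-Nets.
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

def conditioning_vector(t, text_emb_seq):
    # Pool the sequence of text embeddings (mean over tokens, as an
    # illustrative choice) and add it to the timestep embedding.
    pooled = text_emb_seq.mean(axis=0)
    return timestep_embedding(t, dim=pooled.shape[0]) + pooled

text_seq = np.random.default_rng(0).standard_normal((5, 8))  # 5 tokens, dim 8
cond = conditioning_vector(t=10, text_emb_seq=text_seq)
print(cond.shape)  # (8,)
```

The resulting vector is then injected wherever the U-Net already consumes its timestep embedding, so text conditioning reuses the existing plumbing.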

Super-resolution models: Efficient U-Net is used as the model that upsamples images to higher resolutions.
