Generative Adversarial Text to Image Synthesis [reading notes]


  • Abstract
  • Introduction
  • Background
    • Generative adversarial networks
    • Deep symmetric structured joint embedding
  • Method


In this work, we develop a novel deep architecture and GAN formulation to effectively bridge these advances in text and image modeling, translating visual concepts from characters to pixels.


In this work we are interested in translating text in the form of single-sentence human-written descriptions directly into image pixels.
In short: convert a single sentence of text into image pixels.

“this small bird has a short, pointy orange beak and white belly”
“the petals of this flower are pink and the anther are yellow”

This type of detailed visual information about an object has been captured in attribute representations, i.e. distinguishing characteristics of the object category encoded into a vector (Farhadi et al., 2009; Kumar et al., 2009; Parikh & Grauman, 2011; Lampert et al., 2014),

in particular to enable zero-shot visual recognition (Fu et al., 2014; Akata et al., 2015),

and recently for conditional image generation (Yan et al., 2015).

Recently, deep convolutional and recurrent networks for text have yielded highly discriminative and generalizable (in the zero-shot learning sense) text representations learned automatically from words and characters (Reed et al., 2016).


  • first, learn a text feature representation that captures the important visual details;
  • second, use these features to synthesize a compelling image that a human might mistake for real

The distribution of images conditioned on a text description is highly multimodal, in the sense that there are very many plausible configurations of pixels that correctly illustrate the description.

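In image captioning (the reverse problem), however, learning is made practical by the fact that the word or character sequence can be decomposed sequentially according to the chain rule; i.e. one trains the model to predict the next token conditioned on the image and all previous tokens, which is a more well-defined prediction problem.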
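
The chain-rule decomposition mentioned here can be written out explicitly. This is only a sketch: the symbols $I$ (the image) and $w_1, \dots, w_T$ (the caption tokens) are introduced for illustration and do not come from these notes.

```latex
p(w_1, \dots, w_T \mid I) \;=\; \prod_{t=1}^{T} p\left(w_t \mid I,\, w_1, \dots, w_{t-1}\right)
```

Each factor is a single next-token prediction, which is what makes every training step a well-defined prediction problem.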

The main contribution is to develop a simple and effective GAN architecture and training strategy that enables compelling text to image synthesis of bird and flower images from human-written descriptions.


  • Caltech-UCSD Birds dataset
  • Oxford-102 Flowers dataset

Each image comes with five text descriptions.

Our model is trained on a subset of training categories, and we demonstrate its performance both on the training set categories and on the testing set, i.e. “zero-shot” text to image synthesis.
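
The key point of the zero-shot setup is that the split is made over categories, not over individual images, so no test category is ever seen during training. A minimal sketch of such a split is shown below. The function name `split_by_category`, the 25% test fraction, and the toy data are all assumptions made for illustration; they are not the splits used in the paper.

```python
import random


def split_by_category(samples, test_fraction=0.25, seed=0):
    """Split (category, image_id) pairs so that train and test share no category."""
    categories = sorted({cat for cat, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(categories)
    # Hold out whole categories (at least one) for zero-shot evaluation.
    n_test = max(1, int(len(categories) * test_fraction))
    test_cats = set(categories[:n_test])
    train = [s for s in samples if s[0] not in test_cats]
    test = [s for s in samples if s[0] in test_cats]
    return train, test


# Toy data: 8 hypothetical categories with 3 images each.
samples = [(f"class_{c}", f"img_{c}_{i}") for c in range(8) for i in range(3)]
train, test = split_by_category(samples)
```

Evaluating on `train` categories measures how well the model handles familiar classes, while evaluating on `test` measures the zero-shot case.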


Generative adversarial networks
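
As a reminder, a GAN (Goodfellow et al., 2014) is a two-player minimax game between a generator G and a discriminator D. In its usual textbook form, with $p_{data}$ the data distribution and $p_z$ the noise prior (notation assumed here, not transcribed from these notes), the objective is:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

D is trained to tell real samples from generated ones, while G is trained to fool D.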


Deep symmetric structured joint embedding
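
For orientation, the structured joint embedding objective of Reed et al. (2016) is usually written roughly as below. Treat this as a sketch from memory rather than a transcription: $v_n$, $t_n$, $y_n$ denote an image, its text description, and its class label; $\phi$ and $\varphi$ are the image and text encoders; $\Delta$ is the 0-1 loss; $\mathcal{T}(y)$ and $\mathcal{V}(y)$ are the sets of texts and images of class $y$.

```latex
\frac{1}{N} \sum_{n=1}^{N} \Delta\left(y_n, f_v(v_n)\right) + \Delta\left(y_n, f_t(t_n)\right)

f_v(v) = \arg\max_{y \in \mathcal{Y}} \; \mathbb{E}_{t \sim \mathcal{T}(y)}\left[\phi(v)^{\top} \varphi(t)\right]

f_t(t) = \arg\max_{y \in \mathcal{Y}} \; \mathbb{E}_{v \sim \mathcal{V}(y)}\left[\phi(v)^{\top} \varphi(t)\right]
```

The two classifiers are symmetric: an image is classified by its compatibility with the texts of each class, and a text by its compatibility with the images of each class.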



Our approach is to train a deep convolutional generative adversarial network (DC-GAN) conditioned on text features encoded by a hybrid character-level convolutional-recurrent neural network. Both the generator network G and the discriminator network D perform feed-forward inference conditioned on the text feature.
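
A minimal sketch of what "conditioned on the text feature" can look like in practice. It assumes the text feature is first compressed by a fully connected layer with a leaky ReLU, then concatenated with the noise vector for G, and replicated across spatial positions and concatenated along the channel axis for D. All function names and sizes below are hypothetical toy values, and the convolutional layers of G and D are omitted entirely.

```python
import random


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def compress_text(text_feat, weights):
    """Fully connected layer + leaky ReLU that shrinks the text feature."""
    return [leaky_relu(sum(w * t for w, t in zip(row, text_feat)))
            for row in weights]


def generator_input(z, text_feat, weights):
    """Concatenate the noise vector with the compressed text feature."""
    return list(z) + compress_text(text_feat, weights)


def discriminator_input(feature_map, text_feat, weights):
    """Append the compressed text feature at every spatial location."""
    t = compress_text(text_feat, weights)
    return [[list(cell) + t for cell in row] for row in feature_map]


rng = random.Random(0)
TEXT_DIM, SMALL_DIM, NOISE_DIM = 16, 4, 8  # hypothetical toy sizes
weights = [[rng.gauss(0, 0.1) for _ in range(TEXT_DIM)]
           for _ in range(SMALL_DIM)]
text_feat = [rng.gauss(0, 1) for _ in range(TEXT_DIM)]  # stand-in for the text encoder output
z = [rng.gauss(0, 1) for _ in range(NOISE_DIM)]

g_in = generator_input(z, text_feat, weights)
fmap = [[[0.0] * 3 for _ in range(4)] for _ in range(4)]  # toy 4x4 map, 3 channels
d_in = discriminator_input(fmap, text_feat, weights)
```

Here `g_in` is the vector the generator would decode into an image, and `d_in` is the text-augmented feature map the discriminator would score.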
