[Paper Notes] Unified Vision-Language Pre-Training for Image Captioning and VQA

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that
(1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks
(2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models.


What "unified" means:

  • After fine-tuning on downstream tasks, the model can perform both generation tasks (e.g., image captioning) and understanding tasks (e.g., VQA), so it is fairly comprehensive
  • The encoder and decoder are a single shared multi-layer Transformer network, rather than two separate models

Introduction

[Figure 1]
Existing methods

Although significant improvements have been reported on individual downstream tasks using different pre-trained models, it remains challenging to pre-train a single, unified model that is universally applicable, via fine-tuning, to a wide range of vision-language tasks as disparate as vision-language generation (e.g., image captioning) and understanding (e.g., VQA).


The paper's motivation: existing methods lack generality; a pre-trained model typically suits only a single kind of downstream task and cannot be applied across many different ones.

To this end, we propose a unified encoder-decoder model, called the Vision-Language Pre-training (VLP) model, which can be fine-tuned for both vision-language generation and understanding tasks. The VLP model uses a shared multi-layer Transformer network (Vaswani et al. 2017) for encoding and decoding, pre-trained on large amounts of image-caption pairs (Sharma et al. 2018), and optimized for two unsupervised vision-language prediction tasks: bidirectional and sequence to sequence (seq2seq) masked language prediction.


Summary of what the paper does:

  • The proposed method, VLP (Vision-Language Pre-training), is a pre-training approach
  • The resulting model can be fine-tuned for both understanding and generation tasks
  • The Transformer encoder and decoder are merged into a single shared model
  • Pre-training uses large amounts of image-caption pairs (Sharma et al. 2018)
  • There are two pre-training objectives: bidirectional masked language prediction and seq2seq masked language prediction
  • The bidirectional objective lets the model use context on both sides of the masked word
  • The seq2seq objective lets the model use only the context to the left of the masked word

VLP

Mathematical notation for the inputs:

  • Image: I
  • Caption (text description): S
  • An off-the-shelf object detector extracts N regions R = [R1, …, RN], each with a d-dimensional feature vector
  • The corresponding class-label sequence C = [C1, …, CN], where each Ci is an L-dimensional probability distribution and L is the number of object classes
  • The region geometry sequence G = [G1, …, GN], where each Gi is 5-dimensional: the top-left and bottom-right corner coordinates plus the fraction of the image area the region covers (see the sketch below)
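
To make the 5-dimensional geometry feature Gi concrete, here is a minimal sketch of my own (not from the paper); the notes only state the five components, so normalizing the corner coordinates by the image width and height is my assumption.

```python
import numpy as np

def region_geometry(box, image_w, image_h):
    """5-d geometry vector for one detected region: the two corner
    coordinates plus the fraction of image area the box covers.
    Normalizing the corners by the image size is an assumption here."""
    x1, y1, x2, y2 = box  # pixel coordinates of the top-left / bottom-right corners
    area_ratio = (x2 - x1) * (y2 - y1) / (image_w * image_h)
    return np.array([x1 / image_w, y1 / image_h,
                     x2 / image_w, y2 / image_h,
                     area_ratio], dtype=np.float32)
```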

Network architecture

[Figure 2]

The region embedding is defined as:
r_i = LayerNorm(W_r R_i + W_p [W_c C_i | W_g G_i])
where [·|·] indicates the concatenation on the feature dimension, LayerNorm represents Layer Normalization. The second term mimics the positional embedding in BERT, but adding extra region class information, and Wr, Wp, Wc, Wg are the embedding weights (the bias term and the nonlinearity term are omitted).


  • The region embedding r_i folds in the class-label and geometry information, mimicking BERT's input embedding layer (see the sketch after this list)
  • In the linear transformations for each W, the bias term and the nonlinearity are omitted from the notation
  • The text side is embedded exactly as in BERT
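
Below is a minimal PyTorch sketch of the region embedding as reconstructed above. The exact placement of LayerNorm and W_p, as well as the feature, class, and hidden dimensions, are my assumptions based on the quoted description, not details taken from the paper.

```python
import torch
import torch.nn as nn

class RegionEmbedding(nn.Module):
    """Sketch of r_i = LayerNorm(W_r R_i + W_p [W_c C_i | W_g G_i]);
    biases and nonlinearities are dropped, as the quoted text notes."""
    def __init__(self, feat_dim=2048, num_classes=1601, hidden=768):
        # feat_dim / num_classes / hidden are illustrative defaults, not from the notes
        super().__init__()
        self.w_r = nn.Linear(feat_dim, hidden, bias=False)           # region feature R_i
        self.w_c = nn.Linear(num_classes, hidden // 2, bias=False)   # class distribution C_i
        self.w_g = nn.Linear(5, hidden - hidden // 2, bias=False)    # geometry G_i
        self.w_p = nn.Linear(hidden, hidden, bias=False)             # BERT-style "positional" term
        self.norm = nn.LayerNorm(hidden)

    def forward(self, R, C, G):
        # R: (N, feat_dim), C: (N, num_classes), G: (N, 5)
        pos = self.w_p(torch.cat([self.w_c(C), self.w_g(G)], dim=-1))
        return self.norm(self.w_r(R) + pos)
```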

Pre-training objectives

[Figure 3]

During the pre-training, we alternate per-batch between the two objectives and the proportions of seq2seq and bidirectional are determined by hyper parameters λ and 1 − λ, respectively.


  • The difference between the two objectives was described above; the red boxes in the figure help make it concrete
  • Additional detail: during training the two objectives alternate per batch, mixed in proportions λ and 1 − λ (see the mask sketch below)
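
A minimal sketch of how the two objectives can be expressed purely as different self-attention masks over an input laid out as [image regions | caption tokens]. The layout and the "1 = visible" convention are my assumptions for illustration, not details quoted from the paper.

```python
import torch

def self_attention_mask(num_regions, num_words, seq2seq=False):
    """1 = position i may attend to position j. Bidirectional: full attention.
    Seq2seq: regions attend among themselves, caption tokens attend to the
    regions and to the words to their left only."""
    n = num_regions + num_words
    if not seq2seq:
        return torch.ones(n, n, dtype=torch.long)        # bidirectional objective
    mask = torch.zeros(n, n, dtype=torch.long)
    mask[:, :num_regions] = 1                            # every position sees the regions
    mask[num_regions:, num_regions:] = torch.tril(       # left-to-right over caption words
        torch.ones(num_words, num_words, dtype=torch.long))
    return mask

# During pre-training, each batch would use one of the two masks, chosen
# with probability λ (seq2seq) versus 1 − λ (bidirectional).
```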

Fine-tuning on downstream tasks

Image captioning

We fine-tune the pre-trained VLP model on the target dataset using the seq2seq objective. During inference, we first encode the image regions along with the special [CLS] and [SEP] tokens and then start the generation by feeding in a [MASK] token and sampling a word from the word likelihood output (e.g., greedy sampling). Then, the [MASK] token in the previous input sequence is replaced by the sampled word and a new [MASK] token is appended to the input sequence to trigger the next prediction. The generation terminates when the [STOP] token is chosen. Other inference approaches like beam search could apply as well.


  • At each step a [MASK] token is appended to the end of the sequence
  • The model then predicts the word at [MASK], exactly as in pre-training
  • A word is sampled from the predicted likelihood, e.g., greedily
  • Generation stops once the model selects [STOP]
  • Beam search can also be used at inference time, as in machine translation (a decoding-loop sketch follows this list)
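
A pseudocode-level sketch of that greedy decoding loop. The `model` and `tokenizer` here are hypothetical stand-ins (the model is assumed to return per-position vocabulary logits), and the handling of [CLS]/[SEP] is simplified away; this is only meant to show the append-[MASK]-and-predict pattern.

```python
import torch

@torch.no_grad()
def greedy_caption(model, tokenizer, region_inputs, max_len=20):
    """Append [MASK], predict it, replace it with the argmax word, repeat
    until [STOP] is produced (hypothetical interface, for illustration)."""
    words = []
    for _ in range(max_len):
        text_ids = words + [tokenizer.mask_id]                    # [MASK] triggers the next prediction
        logits = model(region_inputs, torch.tensor([text_ids]))   # shape (1, len(text_ids), vocab)
        next_word = logits[0, -1].argmax().item()                 # greedy sampling
        if next_word == tokenizer.stop_id:                        # [STOP] ends generation
            break
        words.append(next_word)                                   # previous [MASK] becomes this word
    return tokenizer.decode(words)
```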

Visual question answering

We frame VQA as a multi-label classification problem. In this work we focus on open domain VQA where top k most frequent answers are selected as answer vocabulary and used as class labels. Following (Anderson et al. 2018) we set k to 3129.
During the fine-tuning, a multi-layer Perceptron (Linear+ReLU+Linear+Sigmoid) on top of the element-wise product of the last hidden states of [CLS] and [SEP] is learned, similar to (Lu et al. 2019).


So VQA is still framed as a multi-label classification problem, with 3129 candidate answers in total.
Personally, I feel it would be nicer if VQA could also be cast as a generation task.
The classifier input is built from the [CLS] and [SEP] tokens (the element-wise product of their last hidden states); a sketch of this head follows.
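
A minimal sketch of that classification head (Linear+ReLU+Linear+Sigmoid over the element-wise product of the [CLS] and [SEP] hidden states). The hidden sizes, the intermediate width, and the choice of BCE loss are my assumptions for illustration.

```python
import torch
import torch.nn as nn

class VQAHead(nn.Module):
    """Multi-label answer classifier over the k = 3129 most frequent answers."""
    def __init__(self, hidden=768, num_answers=3129):
        # hidden size and the intermediate width are illustrative assumptions
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden * 2),
            nn.ReLU(),
            nn.Linear(hidden * 2, num_answers),
            nn.Sigmoid(),
        )

    def forward(self, h_cls, h_sep):
        # element-wise product of the last hidden states of [CLS] and [SEP]
        return self.mlp(h_cls * h_sep)   # per-answer probabilities

# Fine-tuning would train this with a multi-label loss, e.g. nn.BCELoss(),
# against soft answer targets (a common choice, not stated in these notes).
```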

Experiments

[Figures 4-6]
