LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Contents

  • Model Architecture
  • Pre-Training Strategies
  • Experimental Setup and Results
  • References


Model Architecture

[Figure: the LXMERT model architecture]

  • Input Embeddings: the input embedding layers convert the sentence and the image into word-level sentence embeddings and object-level image embeddings, respectively
    • Word-Level Sentence Embeddings: the sentence is first split into tokens with the WordPiece tokenizer; the word embedding and the positional (index) embedding are then summed to obtain the index-aware word embedding:

      $$\hat{w}_i = \text{WordEmbed}(w_i), \quad \hat{u}_i = \text{IdxEmbed}(i), \quad h_i = \text{LayerNorm}(\hat{w}_i + \hat{u}_i)$$
    • Object-Level Image Embeddings: Faster R-CNN first detects $m$ objects and returns, for each object $j$, a position feature (i.e., bounding-box coordinates) $p_j$ and a 2048-$d$ RoI feature $f_j$; these are then projected by FC layers to obtain the position-aware embedding (see the embedding sketch after this list):

      $$\hat{f}_j = \text{LayerNorm}(W_F f_j + b_F), \quad \hat{p}_j = \text{LayerNorm}(W_P p_j + b_P), \quad v_j = (\hat{f}_j + \hat{p}_j)/2$$

      The layer normalization is applied to the projected features before summation so as to balance the energy of the two different types of features.
  • Encoders: single-modality encoders + a cross-modality encoder. They are built mainly from self-attention layers and cross-attention layers (multi-head attention)
    • Single-Modality Encoders: a language encoder + an object-relationship encoder. Each layer has the same structure as the basic block of the Transformer encoder
    • Cross-Modality Encoder: each layer consists of two self-attention sub-layers, one bi-directional cross-attention sub-layer, and two feed-forward sub-layers. The bi-directional cross-attention sub-layer is composed of two uni-directional cross-attention sub-layers (one from language to vision and one from vision to language). Let $\{h_i^{k-1}\}$ be the language features and $\{v_j^{k-1}\}$ the vision features output by the $(k-1)$-th layer; the cross-attention can then be written as follows, where the first argument of each function is the query and the second is the set of keys and values (a PyTorch sketch of this layer follows the layer-count paragraph below):
      $$\hat{h}_i^k = \text{CrossAtt}_{L \rightarrow R}\big(h_i^{k-1}, \{v_1^{k-1}, \ldots, v_m^{k-1}\}\big), \quad \hat{v}_j^k = \text{CrossAtt}_{R \rightarrow L}\big(v_j^{k-1}, \{h_1^{k-1}, \ldots, h_n^{k-1}\}\big)$$

      and the self-attention as

      $$\tilde{h}_i^k = \text{SelfAtt}_{L \rightarrow L}\big(\hat{h}_i^k, \{\hat{h}_1^k, \ldots, \hat{h}_n^k\}\big), \quad \tilde{v}_j^k = \text{SelfAtt}_{R \rightarrow R}\big(\hat{v}_j^k, \{\hat{v}_1^k, \ldots, \hat{v}_m^k\}\big)$$

      The $k$-th layer outputs $h_i^k$ and $v_j^k$ are then produced by the feed-forward sub-layers; as in the Transformer, a residual connection and layer normalization follow every sub-layer.
  • Output Representations: the feature sequences produced by the cross-modality encoder are the language and vision outputs; the output at the special token [CLS], which is prepended to the sentence, serves as the cross-modality output
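
A minimal PyTorch sketch of the two embedding layers above, assuming the 768-d hidden size and the 2048-d RoI features mentioned in the text; the module and variable names (`WordLevelSentenceEmbedding`, `ObjectLevelImageEmbedding`, `feat_fc`, …) are illustrative and not taken from the official code:

```python
import torch
import torch.nn as nn

HIDDEN = 768    # hidden size, same as BERT-base
ROI_DIM = 2048  # Faster R-CNN RoI feature dimension
POS_DIM = 4     # bounding-box coordinates

class WordLevelSentenceEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, max_len=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, HIDDEN)  # WordEmbed
        self.idx_emb = nn.Embedding(max_len, HIDDEN)      # IdxEmbed
        self.norm = nn.LayerNorm(HIDDEN)

    def forward(self, token_ids):                         # (B, n) WordPiece ids
        idx = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.norm(self.word_emb(token_ids) + self.idx_emb(idx))  # h_i

class ObjectLevelImageEmbedding(nn.Module):
    def __init__(self):
        super().__init__()
        self.feat_fc = nn.Linear(ROI_DIM, HIDDEN)   # W_F f_j + b_F
        self.pos_fc = nn.Linear(POS_DIM, HIDDEN)    # W_P p_j + b_P
        self.feat_norm = nn.LayerNorm(HIDDEN)       # LayerNorm before the summation
        self.pos_norm = nn.LayerNorm(HIDDEN)

    def forward(self, roi_feats, boxes):            # (B, m, 2048), (B, m, 4)
        f_hat = self.feat_norm(self.feat_fc(roi_feats))
        p_hat = self.pos_norm(self.pos_fc(boxes))
        return (f_hat + p_hat) / 2                  # v_j

# Example usage with dummy inputs.
h = WordLevelSentenceEmbedding()(torch.randint(0, 30522, (2, 20)))                # (2, 20, 768)
v = ObjectLevelImageEmbedding()(torch.randn(2, 36, 2048), torch.rand(2, 36, 4))   # (2, 36, 768)
```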

$N_L$, $N_X$, and $N_R$ are set to 9, 5, and 5 respectively, and the hidden size is 768, the same as $\text{BERT}_{\text{BASE}}$. Note that the language encoder uses more layers, to balance against the visual features that have already been extracted by Faster R-CNN. If a single-modality layer is counted as half of a cross-modality layer, the model effectively has $(9 + 5)/2 + 5 = 12$ cross-modality layers, the same number of layers as $\text{BERT}_{\text{BASE}}$.
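
Below is a hedged sketch of one cross-modality layer (of which $N_X = 5$ are stacked), built from the standard `nn.MultiheadAttention`; the post-norm residual placement and the GELU feed-forward are assumptions borrowed from the usual Transformer/BERT convention rather than a line-by-line copy of the reference implementation:

```python
import torch
import torch.nn as nn

class CrossModalityLayer(nn.Module):
    """Bi-directional cross-attention -> per-modality self-attention ->
    feed-forward, each sub-layer followed by residual + LayerNorm."""

    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.cross_lang = nn.MultiheadAttention(hidden, heads, batch_first=True)  # CrossAtt L->R
        self.cross_vis = nn.MultiheadAttention(hidden, heads, batch_first=True)   # CrossAtt R->L
        self.self_lang = nn.MultiheadAttention(hidden, heads, batch_first=True)   # SelfAtt L->L
        self.self_vis = nn.MultiheadAttention(hidden, heads, batch_first=True)    # SelfAtt R->R
        self.ffn_lang = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                      nn.Linear(4 * hidden, hidden))
        self.ffn_vis = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                     nn.Linear(4 * hidden, hidden))
        self.norms = nn.ModuleList(nn.LayerNorm(hidden) for _ in range(6))

    def forward(self, h, v):
        # h: (B, n, hidden) language features, v: (B, m, hidden) vision features.
        # Cross-attention: query from one modality, keys/values from the other.
        h_hat = self.norms[0](h + self.cross_lang(h, v, v, need_weights=False)[0])
        v_hat = self.norms[1](v + self.cross_vis(v, h, h, need_weights=False)[0])
        # Self-attention within each modality.
        h_tld = self.norms[2](h_hat + self.self_lang(h_hat, h_hat, h_hat, need_weights=False)[0])
        v_tld = self.norms[3](v_hat + self.self_vis(v_hat, v_hat, v_hat, need_weights=False)[0])
        # Feed-forward sub-layers give the layer outputs h^k, v^k.
        return (self.norms[4](h_tld + self.ffn_lang(h_tld)),
                self.norms[5](v_tld + self.ffn_vis(v_tld)))

# Example: stacking N_X = 5 such layers, as described in the paragraph above.
cross_layers = nn.ModuleList(CrossModalityLayer() for _ in range(5))
h = torch.randn(2, 20, 768)   # dummy language features (batch, tokens, hidden)
v = torch.randn(2, 36, 768)   # dummy vision features (batch, objects, hidden)
for layer in cross_layers:
    h, v = layer(h, v)
```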

Pre-Training Strategies

Pre-Training Tasks

[Figure: the LXMERT pre-training tasks]

  • Language Task: Masked Cross-Modality Language Model (LM): as in BERT, words are randomly masked with a probability of 15%, and the model must predict the masked words from the remaining words and the visual information
  • Vision Task: Masked Object Prediction: objects are randomly masked with a probability of 15% (i.e., their RoI features are zeroed out), and the model must predict the masked objects from the remaining objects and the text; see the masking sketch after this list. Depending on how the prediction is made, this splits into two sub-tasks: (1) RoI-Feature Regression: regress the object RoI feature $f_j$ with an L2 loss; (2) Detected-Label Classification: predict the object's class with a cross-entropy loss ("Although most of our pre-training images have object-level annotations, the ground truth labels of the annotated objects are inconsistent in different datasets (e.g., different number of label classes). For these reasons, we take detected labels output by Faster R-CNN")
  • Cross-Modality Tasks: (1) Cross-Modality Matching: given an image-text pair, the sentence is replaced with a different sentence with a probability of 50%, and an additional classifier is trained to judge whether the image and the sentence match; (2) Image Question Answering (QA): the model is asked to answer image-related questions (using an answer table of 9,500 candidate answers, which covers roughly 90% of the questions in the image QA datasets)
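
A small sketch of the two masking steps, under the simplifying assumption that every selected word position is replaced by a single [MASK] id (the BERT-style 80/10/10 replacement scheme is omitted); `MASK_TOKEN_ID` and the function names are illustrative:

```python
import torch

MASK_TOKEN_ID = 103  # assumption: BERT-style [MASK] token id

def mask_words(token_ids, prob=0.15):
    """Mask ~15% of word tokens (simplified: always replace with [MASK])."""
    mask = torch.rand(token_ids.shape, device=token_ids.device) < prob
    return token_ids.masked_fill(mask, MASK_TOKEN_ID), mask

def mask_objects(roi_feats, prob=0.15):
    """Mask ~15% of objects by zeroing their RoI features."""
    mask = torch.rand(roi_feats.shape[:2], device=roi_feats.device) < prob  # (B, m)
    return roi_feats.masked_fill(mask.unsqueeze(-1), 0.0), mask

# Example shapes: a batch of 2 sentences with 20 tokens and 36 objects each.
tokens, token_mask = mask_words(torch.randint(0, 30522, (2, 20)))
feats, object_mask = mask_objects(torch.randn(2, 36, 2048))
```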

The losses of all these pre-training tasks are simply summed for training ("We train the model for 20 epochs with a batch size of 256. We only pre-train with image QA task for the last 10 epochs, because this task converges faster and empirically needs a smaller learning rate."); a sketch of this schedule follows.
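
A sketch of how the summed multi-task loss and the QA schedule quoted above might look, assuming each per-task loss has already been computed as a scalar tensor elsewhere; the function and argument names are made up for illustration:

```python
import torch

def total_pretraining_loss(lm_loss, obj_loss, match_loss, qa_loss,
                           epoch, num_epochs=20):
    """Sum the per-task losses; add the image-QA loss only in the last 10 epochs."""
    loss = lm_loss + obj_loss + match_loss   # masked LM + masked object prediction + matching
    if epoch >= num_epochs - 10:             # epochs assumed 0-indexed
        loss = loss + qa_loss
    return loss

# Example with dummy scalar losses at epoch 15 of 20 (QA loss included).
loss = total_pretraining_loss(torch.tensor(1.2), torch.tensor(0.8),
                              torch.tensor(0.3), torch.tensor(0.5), epoch=15)
```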


Pre-Training Data

  • Five vision-and-language datasets whose images come from MS COCO or Visual Genome are aggregated (only data from the train and dev splits are collected), yielding 9.18M image-and-sentence pairs over 180K distinct images
    [Table: pre-training data statistics]

Pre-Training Procedure

  • (1) We consistently keep 36 objects for each image to maximize the pre-training compute utilization by avoiding padding (see the batching sketch after this list).
  • (2) During pre-training, the parameters of both the encoders and the embedding layers are trained from scratch; loading pre-trained BERT parameters as the initialization actually gives worse results
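
A minimal sketch of point (1), assuming the detector's outputs are already sorted by confidence; truncating (or, rarely, zero-padding) every image to exactly 36 RoIs lets a batch be stacked into fixed-shape tensors with no per-batch padding logic. Names and shapes are illustrative:

```python
import torch

NUM_OBJECTS = 36  # fixed number of RoIs kept per image

def fix_num_objects(roi_feats, boxes):
    """Truncate (or zero-pad) one image's detections to exactly 36 RoIs.
    roi_feats: (k, 2048), boxes: (k, 4) -> (36, 2048), (36, 4)."""
    k = roi_feats.size(0)
    if k >= NUM_OBJECTS:
        return roi_feats[:NUM_OBJECTS], boxes[:NUM_OBJECTS]
    pad = NUM_OBJECTS - k
    return (torch.cat([roi_feats, roi_feats.new_zeros(pad, roi_feats.size(1))]),
            torch.cat([boxes, boxes.new_zeros(pad, boxes.size(1))]))

# Every image now contributes exactly 36 objects, so a batch stacks cleanly.
feats, boxes = fix_num_objects(torch.randn(50, 2048), torch.rand(50, 4))
batch_feats = torch.stack([feats, feats])   # (2, 36, 2048)
```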

Experimental Setup and Results

[Table: LXMERT results on the downstream benchmarks (VQA, GQA, NLVR²)]

References

  • Paper: Tan & Bansal, "LXMERT: Learning Cross-Modality Encoder Representations from Transformers", EMNLP 2019
  • Code: https://github.com/airsplay/lxmert
