Reading Notes: XGPT: Cross-modal Generative Pre-Training for Image Captioning



Contribution

  • Most existing VL pre-trained models are built on a Transformer-encoder architecture, which makes them ill-suited to vision-and-language generation tasks, because:

    On one hand, pre-trained models developed for understanding tasks only provide the encoder. To support generation tasks, separate decoders have to be trained, like the methods proposed by VideoBERT and CBT. On the other hand, existing VL pre-training objectives are almost all related to masked region or span prediction, including VLP. None of the pre-training tasks is designed for whole-sentence generation.

  • This paper proposes a generative, encoder-decoder VL model. The decoder shares parameters with the encoder, and on the decoder side, the self-attention and the encoder-decoder attention within each decoder block are also shared.
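The sharing scheme can be sketched as follows; this is a minimal illustration under assumed wiring (single head, no masking), not the authors' implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One shared set of projections (an assumption for illustration): the same
# weights serve both decoder self-attention and encoder-decoder attention.
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = rng.normal(size=(3, d, d))

def attention(queries, keys_values):
    q, k, v = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ v

text = rng.normal(size=(5, d))    # decoder-side token states
image = rng.normal(size=(3, d))   # encoder-side region states

self_out = attention(text, text)    # decoder self-attention
cross_out = attention(text, image)  # encoder-decoder attention, same weights
```

Sharing one module for both roles halves the attention parameters, at the cost of forcing self- and cross-attention into the same projection space.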

  • Three new pre-training tasks are proposed (four in total, together with the original Image Captioning):

    1. Image Captioning (IC)

      Given the image regions as input, generate the caption autoregressively.

    2. Image-conditioned Masked Language Modeling (IMLM)

      Predict a masked span of contiguous tokens. The difference from BERT's MLM is that:

      the decoder has to generate masked tokens of the fragment, and extract useful image-conditioned information from the encoder side
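The IMLM corruption step can be sketched as follows (names and details are assumptions for illustration, not the authors' code):

```python
import random

def mask_fragment(tokens, span_len, mask_token="[MASK]"):
    """Mask a contiguous fragment on the encoder side; the decoder's
    target is to regenerate exactly the masked fragment, drawing
    image-conditioned information from the encoder."""
    start = random.randrange(len(tokens) - span_len + 1)
    fragment = tokens[start:start + span_len]
    corrupted = (tokens[:start]
                 + [mask_token] * span_len
                 + tokens[start + span_len:])
    return corrupted, fragment

random.seed(0)
tokens = "a man rides a horse on the beach".split()
corrupted, fragment = mask_fragment(tokens, span_len=3)
```

Note that, unlike BERT's per-token masking, the whole fragment is contiguous, so the decoder must generate it as a sequence.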

    3. Image-conditioned Denoising Autoencoding (IDA)

      On the encoder side, a masked token fragment is marked with only a single [MASK] token, so the model must reconstruct the original sentence on the decoder side without knowing the length of the masked fragment. During decoding, the ground truth is available for teacher forcing.
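The IDA corruption can be sketched like this (a hedged illustration; the point is that the single [MASK] hides the fragment length):

```python
def corrupt_for_ida(tokens, start, span_len, mask_token="[MASK]"):
    """Replace a whole fragment with ONE mask token, so the corrupted
    input no longer reveals how many tokens were removed; the decoder
    target is the full original sentence (teacher-forced)."""
    corrupted = tokens[:start] + [mask_token] + tokens[start + span_len:]
    return corrupted, tokens

tokens = "a man rides a horse".split()
corrupted, target = corrupt_for_ida(tokens, start=2, span_len=2)
# corrupted == ['a', 'man', '[MASK]', 'horse']; target is the original sentence
```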

      To align text tokens with image region tokens, an attention matrix A is computed.


    With A, each word token is combined with weighted information from every region, and the result is then fed into the encoder to generate the masked fragment.
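A hedged sketch of this alignment step: A attends each word token over all region features, and the weighted region information is mixed back into the word tokens (the similarity score and the additive fusion are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def align_words_to_regions(words, regions):
    """words: (T, d) token states; regions: (R, d) region features.
    A[t, r] is the attention of word t over region r; each word then
    aggregates a weighted combination of all regions."""
    A = softmax(words @ regions.T)   # (T, R), rows sum to 1
    return words + A @ regions       # region-aware word states (assumed fusion)

rng = np.random.default_rng(0)
out = align_words_to_regions(rng.normal(size=(5, 8)), rng.normal(size=(3, 8)))
```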

    4. Text-conditioned Image Feature Generation (TIFG)

      TIFG aims to regress the decoder output of all image regions conditioned on text descriptions rather than only the masked regions.

      Only text is fed into the encoder; on the decoder side, all generated vectors (which have the same dimensionality as the image vectors) are averaged, and the MSE against the ground truth is computed.
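The regression objective described above might look like this; this is a sketch of one plausible reading (averaging both the decoder outputs and the ground-truth region features before the MSE), not the authors' exact loss:

```python
import numpy as np

def tifg_loss(decoder_out, gt_regions):
    """decoder_out: (T, d) vectors generated by the decoder, with the same
    dimensionality d as the ground-truth region features gt_regions: (R, d).
    Average each side, then take the MSE (an assumed reading of the notes)."""
    pred = decoder_out.mean(axis=0)
    target = gt_regions.mean(axis=0)
    return float(((pred - target) ** 2).mean())

loss = tifg_loss(np.ones((6, 4)) * 2.0, np.ones((3, 4)))
# pred is all 2s and target all 1s, so the loss is 1.0
```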

Minor Concern

The probability of each target token is estimated by the decoder given the cross-attention performed over the final hidden layer of the encoder.

  • Where does the cross-attention actually show up? Is it simply the K and V passed over from the last encoder layer?

  • In the Text-conditioned Image Feature Generation (TIFG) pre-training task, if only text is fed into the encoder, how does the decoder produce semantic vectors for the image? If the decoder starts from a blank input and generates them, then, given that an image has no natural order, how can teacher forcing be applied?
    Perhaps the order of regions in the image sequence is defined by hand (top to bottom, left to right).
