Reading notes: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks



Contribution

  • Proposes ViLBERT, a two-stream model: two BERT-style transformer streams encode the text and image inputs separately and exchange information through co-attentional (cross-attention) layers. The model is pretrained on two proxy tasks, then fine-tuned on four downstream tasks: visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval.

  • Points out the problem with mainstream visual-language models, quoting the paper:

    the dominant strategy is to start with separate language and vision models pretrained for other large-scale tasks and then learn grounding as part of task training – often resulting in myopic grounding
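The core of the two-stream design above is the co-attentional layer: the text stream computes attention using the image stream's keys and values, and vice versa. Below is a minimal, single-head NumPy sketch of that exchange; all function names are my own, and the real model additionally uses learned Q/K/V projections, multiple heads, residual connections, and feed-forward sublayers.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # scaled dot-product attention where queries come from one stream
    # and keys/values come from the other stream
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

def co_attention_block(text, image):
    # ViLBERT-style co-attentional exchange (simplified sketch):
    # text tokens attend over image region features, and image regions
    # attend over text tokens, so each stream is conditioned on the other.
    new_text = cross_attention(text, image, image)    # (n_tokens, d)
    new_image = cross_attention(image, text, text)    # (n_regions, d)
    return new_text, new_image
```

In the full model this block is stacked with ordinary within-stream transformer layers, so each modality alternates between attending to itself and attending to the other stream.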
