X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks (Paper Notes)

Title: X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Code

1. Motivation

  • CLIP-style methods align vision and text only at the image level;
  • Other methods use a pre-trained object detector for object-level vision-text alignment, but they encode only the features inside each detected object and cannot effectively model the contextual relations among multiple objects;
  • This paper pre-trains with multi-grained (objects, regions, and images) vision-text alignment tasks.

2. Model Architecture

[Figure 1: Overall X2-VLM architecture, consisting of a vision encoder, a text encoder, and a cross-modal fusion module; objects, regions, and full images are all encoded by the same vision encoder.]

3. Loss Functions

3.1 contrastive loss

  1. Similarity between a visual concept $V$ and a text $T$, computed from the projected [CLS] embeddings of the vision and text encoders:

     $$s(V, T) = g_v(v_{cls})^{\top} g_t(t_{cls})$$

     where $g_v$ and $g_t$ are linear projections of the respective [CLS] embeddings.
  2. In-batch vision-to-text similarity (a softmax over the $N$ texts in the batch, with learnable temperature $\tau$):

     $$p^{v2t}_j(V) = \frac{\exp\big(s(V, T_j)/\tau\big)}{\sum_{j'=1}^{N} \exp\big(s(V, T_{j'})/\tau\big)}$$
  3. Text-to-vision similarity, defined symmetrically:

     $$p^{t2v}_j(T) = \frac{\exp\big(s(T, V_j)/\tau\big)}{\sum_{j'=1}^{N} \exp\big(s(T, V_{j'})/\tau\big)}$$
  4. Ground truth: one-hot vectors $y^{v2t}(V)$ and $y^{t2v}(T)$, in which the matched pair has probability 1 and all other in-batch pairs have probability 0.
  5. The contrastive loss is the cross-entropy $H$ between the ground-truth and predicted distributions, averaged over both directions:

     $$\mathcal{L}_{cl} = \frac{1}{2}\,\mathbb{E}_{(V,T)\sim D}\Big[H\big(y^{v2t}(V), p^{v2t}(V)\big) + H\big(y^{t2v}(T), p^{t2v}(T)\big)\Big]$$
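A minimal PyTorch sketch of this contrastive objective is below. It assumes `vis_cls` and `txt_cls` are the [CLS] embeddings of a batch of matched pairs; `proj_v`, `proj_t`, and the fixed temperature are illustrative stand-ins, not names from the paper's code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(vis_cls, txt_cls, proj_v, proj_t, temperature=0.07):
    # g_v / g_t: project [CLS] embeddings and L2-normalize, so the dot
    # product below is a cosine similarity s(V, T).
    v = F.normalize(proj_v(vis_cls), dim=-1)        # (N, d)
    t = F.normalize(proj_t(txt_cls), dim=-1)        # (N, d)
    sim = v @ t.T / temperature                     # (N, N): s(V_i, T_j) / tau
    # One-hot ground truth: the i-th visual concept matches the i-th text.
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_v2t = F.cross_entropy(sim, targets)        # H(y^{v2t}(V), p^{v2t}(V))
    loss_t2v = F.cross_entropy(sim.T, targets)      # H(y^{t2v}(T), p^{t2v}(T))
    return 0.5 * (loss_v2t + loss_t2v)
```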

3.2 matching loss

  1. For each visual concept in a mini-batch, we sample an in-batch hard negative text by following $p^{v2t}(V)$ (texts more similar to the current visual concept are more likely to be sampled).
  2. We also sample one hard negative visual concept for each text.
  3. The pairs are fed to the fusion module, and $x_{cls}$, the output [CLS] embedding of the fusion module, is used to predict the matching probability $p^{match}$. The loss is:

     $$\mathcal{L}_{match} = \mathbb{E}_{(V,T)\sim D}\, H\big(y^{match}, p^{match}(V, T)\big)$$

     where $y^{match}$ is a 2-class one-hot label indicating whether the pair is matched.
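A sketch of the hard-negative sampling and matching loss, under the same assumptions as above; `fusion_module` (returning the fused [CLS] embedding per pair) and the binary classifier `itm_head` are hypothetical module names.

```python
import torch
import torch.nn.functional as F

def matching_loss(sim, vis_feats, txt_feats, fusion_module, itm_head):
    # sim: (N, N) similarity matrix s(V_i, T_j) / tau from the contrastive step.
    N = sim.size(0)
    with torch.no_grad():
        w_v2t = F.softmax(sim, dim=1)     # p^{v2t}(V): sampling weights over texts
        w_t2v = F.softmax(sim.T, dim=1)   # p^{t2v}(T): sampling weights over images
        w_v2t.fill_diagonal_(0)           # never pick the positive as a negative
        w_t2v.fill_diagonal_(0)
        neg_txt = torch.multinomial(w_v2t, 1).squeeze(1)  # hard negative text per image
        neg_vis = torch.multinomial(w_t2v, 1).squeeze(1)  # hard negative image per text

    # N positive pairs + N (image, hard-negative text) + N (hard-negative image, text).
    vis = torch.cat([vis_feats, vis_feats, vis_feats[neg_vis]], dim=0)
    txt = torch.cat([txt_feats, txt_feats[neg_txt], txt_feats], dim=0)
    labels = torch.cat([torch.ones(N), torch.zeros(2 * N)]).long().to(sim.device)

    x_cls = fusion_module(vis, txt)       # fused [CLS] embedding per pair
    logits = itm_head(x_cls)              # (3N, 2): match / no-match logits
    return F.cross_entropy(logits, labels)
```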

3.3 masked language modeling loss (MLM)

Randomly masked text tokens are predicted from both the visual concept and the unmasked text context via the fusion module. With $\hat{T}$ denoting the masked text, the loss is the cross-entropy over the masked positions:

$$\mathcal{L}_{mlm} = \mathbb{E}_{(V, \hat{T})\sim D}\, H\big(y^{msk}, p^{msk}(V, \hat{T})\big)$$
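A sketch of the MLM term, assuming a simple uniform masking scheme; `text_encoder`, `fusion_module`, `mlm_head`, and `mask_prob` are illustrative names, and the paper's actual masking strategy may differ.

```python
import torch
import torch.nn.functional as F

def mlm_loss(token_ids, vis_feats, text_encoder, fusion_module, mlm_head,
             mask_token_id, mask_prob=0.15):
    # Mask a random subset of text tokens; only those positions are supervised.
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels[~mask] = -100                          # ignored by cross_entropy
    masked_ids = token_ids.masked_fill(mask, mask_token_id)

    # Predict masked tokens from the visual concept plus the surrounding text.
    txt_feats = text_encoder(masked_ids)
    hidden = fusion_module(vis_feats, txt_feats)  # (B, L, d) fused token states
    logits = mlm_head(hidden)                     # (B, L, vocab_size)
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```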

3.4 bbox loss

To learn multi-grained alignments, the model also regresses the bounding box of the visual concept that a text describes. The box is predicted from the fused [CLS] embedding as a normalized $(c_x, c_y, w, h)$ tuple, $\hat{b} = \mathrm{Sigmoid}(\mathrm{MLP}(x_{cls}))$, and is supervised by a GIoU term plus an L1 term:

$$\mathcal{L}_{bbox} = \mathbb{E}_{(V,T)\sim D}\big[\mathcal{L}_{GIoU}(b, \hat{b}) + \lVert b - \hat{b}\rVert_1\big]$$
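A sketch of the box regression term, assuming normalized $(c_x, c_y, w, h)$ targets; `bbox_head` is an illustrative MLP, while `box_convert` and `generalized_box_iou_loss` are real utilities from `torchvision.ops`.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_convert, generalized_box_iou_loss

def bbox_loss(x_cls, gt_boxes_cxcywh, bbox_head):
    # Regress a normalized (cx, cy, w, h) box from the fused [CLS] embedding.
    pred = torch.sigmoid(bbox_head(x_cls))                     # (B, 4) in [0, 1]
    l1 = F.l1_loss(pred, gt_boxes_cxcywh)
    # GIoU operates on corner-format (x1, y1, x2, y2) boxes.
    pred_xyxy = box_convert(pred, in_fmt="cxcywh", out_fmt="xyxy")
    gt_xyxy = box_convert(gt_boxes_cxcywh, in_fmt="cxcywh", out_fmt="xyxy")
    giou = generalized_box_iou_loss(pred_xyxy, gt_xyxy, reduction="mean")
    return l1 + giou
```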
