Paper Reading [TPAMI-2022] MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network



Keywords

Visualization; Feature extraction; Semantics; Knowledge discovery; Cognition; Task analysis; Natural languages; Visual question answering; visual relation; attention mechanism; relation attention

Machine Learning; Machine Vision; Natural Language Processing

Fine-grained Vision; Semantic Analysis; NLP Question Answering; Visual Question Answering; Visual Reasoning; Visual Spatial Relations; Attention Mechanism; Multi-modal Perception

Abstract

Visual Question Answering (VQA) is a task to answer natural language questions tied to the content of visual images.


Most recent VQA approaches usually apply an attention mechanism to focus on the relevant visual objects and/or consider the relations between objects via off-the-shelf visual relation reasoning methods.

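As background for this line of work, question-guided attention over detected object features typically looks like the minimal PyTorch-style sketch below. The class name, dimensions, and scoring form are illustrative assumptions about the generic baseline the abstract refers to, not MRA-Net's actual design:

```python
import torch
import torch.nn as nn

class QuestionGuidedAttention(nn.Module):
    """Toy question-guided attention over detected object features.

    Hypothetical sketch of the generic attention baseline; all
    dimensions and the tanh/softmax scoring form are assumptions.
    """
    def __init__(self, obj_dim=2048, q_dim=1024, hidden=512):
        super().__init__()
        self.proj_obj = nn.Linear(obj_dim, hidden)
        self.proj_q = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, obj_feats, q_feat):
        # obj_feats: (batch, num_objects, obj_dim); q_feat: (batch, q_dim)
        h = torch.tanh(self.proj_obj(obj_feats) + self.proj_q(q_feat).unsqueeze(1))
        alpha = torch.softmax(self.score(h), dim=1)   # (batch, num_objects, 1)
        return (alpha * obj_feats).sum(dim=1)         # attended visual feature
```

Such object-level attention weights individual regions but, as the drawbacks below note, does not by itself model the relations between them.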

However, they still suffer from several drawbacks.


First, they mostly model only simple relations between objects, so many complicated questions cannot be answered correctly because the models fail to provide sufficient relational knowledge.


Second, they seldom leverage the harmonious cooperation of visual appearance features and relation features.


To solve these problems, we propose a novel end-to-end VQA model, termed Multi-modal Relation Attention Network (MRA-Net).


The proposed model explores both textual and visual relations to improve performance and interpretability.


Specifically, we devise 1) a self-guided word relation attention scheme, which explores the latent semantic relations between words, and 2) two question-adaptive visual relation attention modules that can extract not only fine-grained and precise binary relations between objects but also more sophisticated trinary relations.

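The abstract does not give the module equations, but as a rough, hypothetical illustration, a question-adaptive binary relation attention could embed all object pairs and score them against the question; trinary relations would extend the same construction to object triples, and the self-guided word relation attention would apply an analogous scheme over word features. Everything below (names, dimensions, the additive tanh/softmax scoring) is an assumption, not the paper's formulation:

```python
import torch
import torch.nn as nn

class BinaryRelationAttention(nn.Module):
    """Hypothetical question-adaptive attention over object pairs.

    Illustrative only: MRA-Net's actual formulation is not given in
    the abstract; names and dimensions here are assumptions.
    """
    def __init__(self, obj_dim=2048, q_dim=1024, hidden=512):
        super().__init__()
        self.rel = nn.Linear(2 * obj_dim, hidden)  # pairwise relation embedding
        self.q = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, obj_feats, q_feat):
        # obj_feats: (B, N, D); q_feat: (B, Dq)
        B, N, D = obj_feats.shape
        # Build all ordered object pairs (i, j), self-pairs included: (B, N, N, 2D).
        pairs = torch.cat(
            [obj_feats.unsqueeze(2).expand(B, N, N, D),
             obj_feats.unsqueeze(1).expand(B, N, N, D)], dim=-1)
        r = torch.tanh(self.rel(pairs) + self.q(q_feat)[:, None, None, :])
        alpha = torch.softmax(self.score(r).view(B, N * N, 1), dim=1)
        # Question-adaptive binary relation feature: weighted sum over pairs.
        return (alpha * r.view(B, N * N, -1)).sum(dim=1)  # (B, hidden)
```

Note the cost implication of this construction: pairwise attention scales as O(N^2) in the number of objects, and a trinary variant over triples would scale as O(N^3).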

Both kinds of question-related visual relations provide richer and deeper visual semantics, thereby improving the visual reasoning ability of question answering.


Furthermore, the proposed model also combines appearance features with relation features to reconcile the two types of features effectively.

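The abstract does not say how the two feature types are combined; one common way to reconcile two feature streams is a learned gate, sketched below purely as an assumption:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Generic gated fusion of appearance and relation features.

    A common reconciliation pattern shown as an assumption; the
    paper's actual combination scheme is not specified in the abstract.
    """
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, appearance, relation):
        # appearance, relation: (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([appearance, relation], dim=-1)))
        return g * appearance + (1.0 - g) * relation  # element-wise blend
```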

Extensive experiments on five large benchmark datasets, VQA-1.0, VQA-2.0, COCO-QA, VQA-CP v2, and TDIUC, demonstrate that our proposed model outperforms state-of-the-art approaches…


Authors

Liang Peng, Yang Yang, Zheng Wang, Zi Huang, Heng Tao Shen
