【论文笔记】LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Vision-and-languagereasoningrequiresanunderstandingofvisualconcepts,languagesemantics,and,mostimportantly,thealignmentandrelationshipsbetweenthesetwomodalities.做视觉文本的理解任务,需要模型能理解视觉概念和文本语义信息,但最重要的是视觉和文