Reading Notes: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Contribution
Proposes the ViLBERT model (a two-stream model): two BERT-style streams process the text and image inputs separately, and the two streams interact through co-attentional transformer layers.
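To make the two-stream idea concrete, below is a minimal sketch of a co-attentional block in which each stream uses the other stream's features as keys/values. The dimensions (hidden size 768, 8 heads), the class name `CoAttentionBlock`, and the simplified structure (residual + layer norm only, no feed-forward sublayer) are my own assumptions for illustration, not the authors' released code.

```python
# Sketch (assumed, simplified): cross-attention between a text stream and an
# image stream, in the spirit of ViLBERT's co-attentional transformer layer.
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """Each stream attends to the other stream's keys/values."""

    def __init__(self, hidden_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Queries come from one modality, keys/values from the other.
        self.text_attends_image = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.image_attends_text = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm_text = nn.LayerNorm(hidden_dim)
        self.norm_image = nn.LayerNorm(hidden_dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        # Text queries attend over image keys/values, and vice versa.
        text_out, _ = self.text_attends_image(text_feats, image_feats, image_feats)
        image_out, _ = self.image_attends_text(image_feats, text_feats, text_feats)
        # Residual connection + layer norm, as in a standard transformer block.
        text_feats = self.norm_text(text_feats + text_out)
        image_feats = self.norm_image(image_feats + image_out)
        return text_feats, image_feats


if __name__ == "__main__":
    block = CoAttentionBlock()
    text = torch.randn(2, 20, 768)    # (batch, text tokens, hidden)
    image = torch.randn(2, 36, 768)   # (batch, image regions, hidden)
    t, v = block(text, image)
    print(t.shape, v.shape)           # shapes are preserved per stream
```

In the full model such blocks are stacked together with per-stream transformer layers, so each modality keeps its own processing depth while repeatedly exchanging information with the other.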