Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods


Visual Description Generation

Image Description Generation

Standard Image Description Generation

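Dense Image Description Generation: aims to generate descriptions for local objects (regions) in an image.

Image Paragraph Generation: generates a paragraph-length description.

Spoken Language Image Description Generation: changes the output from written to spoken language.

Stylistic Image Description Generation: adds a linguistic style, for example humor.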

Unseen Objects Image Description Generation:

Diverse Image Description Generation:

Controllable Image Description Generation: controls which objects in an image are selected and described.

Video Description Generation

Global Video Description Generation: 

Dense Video Description Generation: similar to Dense Image Description Generation.

Movie Description Generation: movie clips are used as input

Visual Storytelling

Image Storytelling:

Video Storytelling:

Visual Question Answering

Image Question Answering

Video Question Answering

Visual Dialog

Image Dialog

Video Dialog

Visual Reasoning

Image Reasoning

Video Reasoning

Visual Referring Expression

Image Referring Expression

Video Referring Expression

Visual Entailment

Image Entailment

Language-to-Vision Generation

Language-to-Image Generation
Sentence-level Language-to-Image Generation
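Image Manipulation: uses text to guide the editing of an image while preserving the regions that are irrelevant to the text. Another approach modifies the image content interactively, and a third modifies it through dialog.

Fine-grained Image Generation:

Sequential Image Generation: given a passage of text (multiple sentences), generates a sequence of images, like a visualization of a story; the inverse of image storytelling.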

Language-to-Video Generation


Vision-and-Language Navigation

Image and Language Navigation

Multimodal Machine Translation

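Machine Translation with Image: translates a source-language sentence that describes an image into the target language.

Multisource MMT: the difference is that the same image is described in several languages at once.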

Machine Translation with Video

Image Description Generation

  • Flickr8K

  • Flickr30K:

  • Flickr30K-Entities:

  • MSCOCO-Entities:

  • STAIR (Japanese captions):

  • Multi30K-CLID (German captions)

  • Conceptual Captions (large-scale dataset):

Video Description Generation

  • Microsoft Video Description (MSVD; contains descriptions in Chinese, English, German, etc.):

  • MPII Cooking (consists of 65 different cooking activities):

  • YouCook:

  • YouCook II:

  • Textually Annotated Cooking Scenes (TACoS):

  • TACoS-MultiLevel:

  • MPII Movie Description (MPII-MD):
  • Montreal Video Annotation (M-VAD):
  • MSR Video to Text (MSR-VTT):
  • Videos Titles in the Wild (VTW):
  • ActivityNet Captions (ANetCap):
  • ActivityNet Entities (ANetEntities):

Image Storytelling

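  • New York City Storytelling (NYC-Storytelling): the dataset is split in the ratio 0.8/0.1/0.1.

  • Disneyland Storytelling: the dataset is split in the ratio 0.8/0.1/0.1.

  • SIND: a large-scale dataset, split in the ratio 0.8/0.1/0.1.

  • VIST: the second version of SIND.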
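
The 0.8/0.1/0.1 split noted for the storytelling datasets can be sketched as follows. This is a minimal illustration, not the procedure used by the dataset authors: the function name `split_dataset`, the shuffling, the fixed seed, and reading the three parts as train/validation/test are all assumptions.

```python
import random


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and cut them into three consecutive parts by ratio.

    Hypothetical helper: the three parts are assumed to be
    train / validation / test, which the notes do not state.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    first_end = round(n * ratios[0])
    second_end = first_end + round(n * ratios[1])
    # The last part takes whatever remains, so no item is dropped.
    return shuffled[:first_end], shuffled[first_end:second_end], shuffled[second_end:]


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

For story datasets one would normally split by album or story rather than by individual image, so that images from one story do not leak across parts.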

Video Storytelling

  • VideoStory
  • VideoStory-NUS


Image Question Answering

  • VQA v1.0: answers are also open-ended, either a few words or a choice from several given answers.

  • VQA v2.0

  • OK-VQA:

  • KVQA:

Video Question Answering

  • MovieQA:
  • TVQA:
  • TVQA+:

Image Dialog

  • VisDial:
  • CLEVR-Dialog:

Video Dialog

  • The Scene-Aware Dialog (AVSD):

Image Reasoning

  • Compositional Language and Elementary Visual Reasoning (CLEVR):
  • CLEVR-CoGenT:
  • GQA:
  • Relational and Analogical Visual rEasoNing (RAVEN):

Video Reasoning

  • COG:

Image Referring Expression

Real Images

  • RefCOCO:
  • RefCOCO+:
  • RefCOCOg:
  • RefClef:
  • GuessWhat:

Synthetic Images

  • CLEVR-Ref+:

Video Referring Expression

  • Cityscapes:
  • ORGaze:

Image Entailment

  • V-SNLI
  • SNLI-VE:

Image Generation

  • Oxford-102:
  • Caltech-UCSD Birds (CUB):
  • MSCOCO-Gen:

Video Generation


  • Text2Video

Image-and-Language Navigation

  • Room-2-Room (R2R):

Machine Translation with Image

  • Multi30K-MMT:

Machine Translation with Video

  • VATEX:



Visual Genome

  • Visual Genome: comprehend interactions and relationships between objects observed in an image,
  • How2:
  • Berkeley Deep Drive eXplanation (BDD-X): autonomous driving.



Image Representation

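  • global feature representation: AlexNet, VGG, GoogLeNet, Inception-v3, Residual Nets (ResNet), and DenseNets are commonly used to learn global features; however, pretrained features are not suitable for some vision-and-language tasks.
  • local feature representation: R-CNN and similar models.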

Video Representation


RNN, LSTM, BiLSTM, GRU, BiGRU, and Transformer are commonly used.

Vision and Language

Visual Storytelling

Visual Dialog

Visual Reasoning

Visual Referring Expression

Visual Entailment

Language-to-Vision Generation

Vision-and-Language Navigation

Multimodal Machine Translation

Evaluation Measures

Common Measures

Language Metrics


  • Bilingual Evaluation Understudy (BLEU): proposed for machine translation to compare machine-generated output with human ground truth; commonly used for Visual Caption Generation, Visual Storytelling, Video Dialog, and Multimodal Machine Translation.
  • Metric for Evaluation of Translation with Explicit Ordering (METEOR): commonly used for Visual Caption Generation, Visual Storytelling, Video Dialog, and Multimodal Machine Translation.
  • Recall Oriented Understudy for Gisting Evaluation (ROUGE): commonly used for Visual Caption Generation.
  • Consensus based Image Description Evaluation (CIDEr): commonly used for image caption generation evaluation, Video Caption Generation, Visual Storytelling, and Video Dialog.
  • Semantic Propositional Image Captioning Evaluation (SPICE):
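
To make the comparison between generated output and human ground truth concrete, here is a minimal sketch of sentence-level BLEU (clipped n-gram precision with a brevity penalty). It assumes pre-tokenized input, uniform n-gram weights, and no smoothing; the function names are illustrative, and published scores are normally computed at corpus level with an established toolkit rather than with code like this.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        if not cand_counts:
            return 0.0  # candidate is shorter than n tokens
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precision_sum += math.log(clipped / sum(cand_counts.values()))
    # Brevity penalty: use the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda l: (abs(l - c), l))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "the cat sat on the mat".split()
print(sentence_bleu("the cat sat on the mat".split(), [reference]))
print(sentence_bleu("the the the the".split(), [reference], max_n=1))
```

The second call shows why counts are clipped: repeating "the" four times only earns credit for the two occurrences in the reference.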

Retrieval Metrics

  • Recall@k (R@k)
  • Median Rank (MedRank)
  • Mean Reciprocal Rank (MRR)
  • Mean Rank (Mean)
  • Normalized Discounted Cumulative Gain (NDCG)
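
The retrieval metrics listed above can be sketched from a list of ranks. The sketch assumes each query has exactly one ground-truth item and that `ranks` holds its 1-based position in the returned list; the NDCG shown is the generic linear-gain form over graded relevances, which may differ from the variant a particular benchmark defines. All function names are illustrative.

```python
import math
from statistics import mean, median


def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth item is ranked in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return mean([1.0 / r for r in ranks])


def ndcg(relevances, k=None):
    """NDCG for one query; relevances are listed in the returned rank order."""
    rels = relevances if k is None else relevances[:k]
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))
    ideal = sorted(relevances, reverse=True)[:len(rels)]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


ranks = [1, 3, 2, 10, 1]  # made-up ranks for five queries
print("R@1:", recall_at_k(ranks, 1))
print("R@5:", recall_at_k(ranks, 5))
print("MedRank:", median(ranks))
print("Mean:", mean(ranks))
print("MRR:", mean_reciprocal_rank(ranks))
print("NDCG:", ndcg([0, 0, 1]))
```

Higher is better for R@k, MRR, and NDCG; lower is better for MedRank and Mean.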

Task-specific Metrics

Image Reasoning

  • Querying attribute (QA)
  • Compare Attribute (CA)
  • Compare Numbers (CN)
  • Count
  • Exist

Video Reasoning 

  • Pointing
  • Yes/No
  • Conditional (Condit)
  • Attribute-related (Atts)

Language-to-Vision Generation

  • Inception Score (IS)
  • Fréchet Inception Distance (FID)
  • R-precision
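
As an illustration of how the Inception Score is computed, the sketch below takes the class probabilities p(y|x) for each generated image and returns exp of the mean KL divergence between p(y|x) and the marginal p(y). It assumes the probabilities are already available (in practice they come from a pretrained Inception classifier, which is not included here); the function name and the toy inputs are made up.

```python
import math


def inception_score(probs):
    """probs: one list of class probabilities p(y|x) per generated image."""
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    kl_sum = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)); terms with zero probability contribute nothing.
        kl_sum += sum(pc * math.log(pc / marginal[c])
                      for c, pc in enumerate(p) if pc > 0)
    return math.exp(kl_sum / n)


# Confident and diverse predictions give the maximum score for two classes.
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))
# Identical predictions for every image give the minimum score.
print(inception_score([[0.5, 0.5], [0.5, 0.5]]))
```

The score ranges from 1 up to the number of classes: it rises when each image is classified confidently and the images together cover many classes.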

Vision-and-Language Navigation

  • Path Length (PL)
  • Navigation Error (NE)
  • Success Rate (SR)
  • Oracle Success Rate (OSR) 
  • Success weighted by Path Length (SPL)
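
The navigation metrics above can be sketched over a list of episodes. This is a minimal illustration under stated assumptions: the `Episode` fields and function name are hypothetical, distances are in meters, the 3.0 success threshold is a commonly used value treated here as an assumption, and SPL is computed as success weighted by the ratio of shortest path length to the longer of the agent and shortest path lengths.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    path_length: float      # length of the path the agent actually took
    shortest_length: float  # shortest path length from start to goal (> 0)
    final_distance: float   # distance from the stopping point to the goal
    min_distance: float     # closest the agent came to the goal on its path


def navigation_metrics(episodes, threshold=3.0):
    """Return PL, NE, SR, OSR, and SPL averaged over episodes."""
    n = len(episodes)
    success = [e.final_distance < threshold for e in episodes]
    oracle = [e.min_distance < threshold for e in episodes]
    spl_terms = [
        s * e.shortest_length / max(e.path_length, e.shortest_length)
        for s, e in zip(success, episodes)
    ]
    return {
        "PL": sum(e.path_length for e in episodes) / n,
        "NE": sum(e.final_distance for e in episodes) / n,
        "SR": sum(success) / n,
        "OSR": sum(oracle) / n,
        "SPL": sum(spl_terms) / n,
    }


episodes = [
    Episode(10.0, 10.0, 1.0, 1.0),  # success along the shortest path
    Episode(20.0, 10.0, 2.0, 0.5),  # success, but with a detour
    Episode(15.0, 12.0, 6.0, 2.0),  # passed near the goal, stopped far away
]
print(navigation_metrics(episodes))
```

The second episode counts fully toward SR but only half toward SPL, which is how SPL penalizes inefficient paths.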

Human Evaluation

State-of-the-Art Results

Visual Storytelling Results

Image Storytelling

Video Storytelling

Visual Dialog Results

Image Dialog

Video Dialog

Visual Reasoning Results

Image Reasoning

Video Reasoning

Visual Referring Expression Results

Image Referring Expression

Video Referring Expression

Visual Entailment Results

Image Entailment

Language-to-Vision Generation Results

Language-to-Image Generation

Language-to-Video Generation

Vision-and-Language Navigation Results

Image-and-Language Navigation

Multimodal Machine Translation Results

Machine Translation with Image

Machine Translation with Video

Future Directions

  • Leveraging External Knowledge
  • Addressing Large-scale Data Limitations
  • Combining Multiple Tasks
  • Novel Neural Architectures for Representation: transformer
  • Image vs Video: more attention is needed on combining video with language.
  • Automatic Evaluation Measures:


