1、Datasets
Flickr8k[55]《Framing image description as a ranking task: Data, models and evaluation metrics》
http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/Flickr8k_Dataset.zip
http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/Flickr8k_text.zip
Flickr30k[113]《Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models》
http://shannon.cs.illinois.edu/DenotationGraph/data/index.html
MS-COCO[83]《Microsoft coco: Common objects in context》
https://cocodataset.org/#download
http://images.cocodataset.org/zips/train2014.zip
http://images.cocodataset.org/zips/val2014.zip
http://images.cocodataset.org/zips/test2014.zip
http://images.cocodataset.org/annotations/annotations_trainval2014.zip
http://msvocds.blob.core.windows.net/annotations-1-0-3/captions_train-val2014.zip
Conceptual Captions《Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning》(ACL 2018)
2、Evaluation metrics:
1) BLEU【110】《BLEU: a method for automatic evaluation of machine translation》
2) ROUGE【81】《Rouge: A package for automatic evaluation of summaries》
3) METEOR【9】《METEOR: An automatic metric for MT evaluation with improved correlation with human judgments》
4) CIDEr【139】《Cider: Consensus-based image description evaluation》
5) SPICE【3】《Spice: Semantic propositional image caption evaluation》
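As a rough illustration of what the first metric measures, the sketch below computes BLEU's clipped ("modified") n-gram precision for one candidate caption against several references. It covers only that one ingredient: full BLEU also combines several n-gram orders with a geometric mean and applies a brevity penalty, both omitted here. The function names `ngrams` and `modified_precision` are illustrative and not taken from any evaluation toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip candidate counts so that repeating a word cannot be over-credited.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
p1 = modified_precision(candidate, references, 1)  # 2 clipped matches / 7 words
```

The degenerate candidate shows why clipping matters: without it, every "the" would count as a match and the unigram precision would be 7/7.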
3、Encoder-Decoder:
【142】《Show and tell: A neural image caption generator》
《Deep visual-semantic alignments for generating image descriptions》
《Rethinking the Form of Latent States in Image Captioning (ECCV 2018) 》
《What Value Do Explicit High Level Concepts Have in Vision to Language Problems》CVPR2016
Improved encoder:
《Image Captioning with Visual-Semantic LSTM》
*《Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering》2018CVPR ATTENTION
Improved decoder:
《SemStyle: Learning to Generate Stylised Image Captions Using Unaligned Text》
《Stack-Captioning: Coarse-to-Fine Learning for Image Captioning》
《What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? (ACL 2017)》
【92】《Guiding the long-short term memory model for image caption generation》
【144】《Image captioning with deep bidirectional LSTMs》
4、attention
【152】《Show, attend and tell: Neural image caption generation with visual attention》spatial attention
【61】《Aligning where to see and what to tell: image caption with region-based attention and scene factorization》
【151】《Encode, Review, and Decode: Reviewer Module for Caption Generation》
【112】《Areas of Attention for Image Captioning》
【88】《Knowing when to look: Adaptive attention via A visual sentinel for image captioning》2017CVPR
【84】《Attention Correctness in Neural Image Captioning》
【21】《SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning》CVPR2017 channel wise attention
【134】《Paying Attention to Descriptions Generated by Image Captioning Models》
【111】《Attend to You: Personalized Image Captioning with Context Sequence Memory Networks》
《SimNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions》
《What value do explicit high level concepts have in vision to language problems》
《Show, Observe and Tell: Attribute-driven Attention Model for Image Captioning (IJCAI 2018)》
5、Semantic
【155】《Boosting image captioning with attributes》ICCV2017
【156】《Image captioning with semantic attention》
【150】《Image captioning and visual question answering based on attributes and external knowledge》
【41】《Semantic compositional networks for visual captioning》2017CVPR
【148】《Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition》
《Semantic Regularisation for Recurrent Image Annotation (CVPR 2017)》
6、Compositional architectures
【33】《From captions to visual concepts and back》traditional language modeling approach
【90】《Describing images by feeding LSTM with structural words》
【135】《Rich image captioning in the wild》
《Review Networks for Caption Generation (NIPS 2016)》
《Recurrent Fusion Network for Image Captioning (ECCV 2018)》
《Learning to Guide Decoding for Image Captioning (AAAI 2018)》
《A Neural Compositional Paradigm for Image Captioning(NIPS 2018)》
7、CNN language models
【5】《Convolutional image captioning》CVPR2018
【147】《CNN+ CNN: Convolutional Decoders for Image Captioning》
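8、Unsupervised learning
Exposure Bias (accumulated error in the RNN: training feeds ground-truth words, but inference feeds the model's own predictions, so early mistakes compound)
Solutions: 1) Scheduled sampling: during training, the input at each step is the ground-truth word with probability p and the model's output for the previous word with probability 1-p. p decays as training progresses (exponential, inverse sigmoid, or linear schedule).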
Word-Level Oracle Word《Scheduled sampling for sequence prediction with recurrent neural networks》Google 2015
《Scheduled Sampling for Transformers》
Sentence-Level Oracle Word(Beam-Search《Sequence-to-Sequence Learning as Beam-Search Optimization》)
《Guiding Long-Short Term Memory for Image Caption Generation (ICCV 2015)》
《Guided Open Vocabulary Image Captioning with Constrained Beam Search (EMNLP 2017)》
《Bridging the Gap between Training and Inference for Neural Machine Translation》
2) Reinforcement learning: https://github.com/zhjohnchan/awesome-image-captioning
《Deep Reinforcement Learning-based Image Captioning with Embedding Reward》
*《Self-critical sequence training for image captioning》CVPR2017
《Actor-critic sequence training for image captioning》(NIPS 2017 workshop)
《Sequence level training with recurrent neural networks》
《Improved Image Captioning via Policy Gradient optimization of SPIDEr》
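3) Adversarial training: (introduces perturbations during training, acting as a regularizer that improves the model's generalization) gradient penalty
4) GAN:
Problem 1) Discrete data is non-differentiable: Policy gradients
Problem 2) Vanishing gradients and error propagation: SeqGAN《Sequence generative adversarial nets with policy gradient》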
《Towards Diverse and Natural Image Descriptions via a Conditional GAN》https://github.com/doubledaibo/gancaption_iccv2017
《Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training》https://github.com/rakshithShetty/captionGAN
《Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner 》ICCV2017
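Returning to the scheduled-sampling strategy in 1): a minimal sketch of the three decay schedules for p and of the per-step choice between the ground-truth word and the model's previous output. The schedule forms follow the scheduled-sampling paper cited above; the function names, default constants, and the token-list interface are assumptions made for illustration, and a real captioner would make this choice inside its decoding loop rather than on precomputed lists.

```python
import math
import random


def linear_decay(step, total_steps, floor=0.0):
    """p falls linearly from 1.0 to `floor` over `total_steps` steps."""
    return max(floor, 1.0 - step / total_steps)


def exponential_decay(step, k=0.999):
    """p = k ** step, with 0 < k < 1."""
    return k ** step


def inverse_sigmoid_decay(step, k=1000.0):
    """p = k / (k + exp(step / k)), with k >= 1."""
    # The exponent is capped only to avoid float overflow at very large steps.
    return k / (k + math.exp(min(step / k, 700.0)))


def choose_inputs(gold_prev, model_prev, p, rng=random):
    """Per decoding step, take the ground-truth previous word with
    probability p, otherwise the model's own previous-step prediction."""
    return [g if rng.random() < p else m
            for g, m in zip(gold_prev, model_prev)]


# Hypothetical toy sequences: ground-truth inputs vs. the model's predictions.
rng = random.Random(0)
gold = ["<bos>", "a", "dog", "runs"]
pred = ["<bos>", "a", "cat", "sits"]
mixed = choose_inputs(gold, pred, inverse_sigmoid_decay(step=5000), rng)
```

With p = 1 this reduces to ordinary teacher forcing; as p decays toward 0 the decoder is trained mostly on its own outputs, which is the condition it faces at inference.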
9、Others
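Stylized image captioning (diverse styles)
Image captioning with selectable objects
Dense image captioning
Adversarial examples
Unsupervised image captioning
Open vocabulary《Neural Baby Talk》 CVPR2018
Knowledge graphs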
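10、Recent progress
《Reflective Decoding Network for Image Captioning》2019 CVPR applies visual attention and textual attention jointly
《Learning to Collocate Neural Modules for Image Captioning》2019 ICCV one function-word module and three visual content-word modules (noun, adjective, verb)
《Learning to Generate Grounded Image Captions without Localization Supervision》generates captions without supervised object-location information
《Better Captioning with Sequence-Level Exploration》2020 CVPR improves the recall of sequence-level models. Method: add a sequence-level exploration term to the loss function that maximizes the distance between the generated captions.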
https://github.com/zhjohnchan/awesome-image-captioning
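The sequence-level exploration idea noted for the 2020 CVPR paper above (reward captions that differ from one another) can be illustrated with a toy scalar. This is only a sketch of the note's description, not the paper's formulation: the token-set Jaccard distance, the names `jaccard_distance`, `exploration_term`, `total_loss`, and the weight `lam` are all assumptions, and the value is not differentiable as written, so a real model would need a differentiable or policy-gradient version.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the token sets of two captions."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def exploration_term(captions):
    """Mean pairwise distance among captions sampled for one image."""
    pairs = list(combinations(captions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def total_loss(base_loss, captions, lam=0.1):
    """Subtract the term, so minimizing the loss pushes captions apart."""
    return base_loss - lam * exploration_term(captions)


# Hypothetical samples for one image.
samples = ["a dog runs on grass".split(),
           "a dog runs on grass".split(),
           "two people watch a brown dog".split()]
loss = total_loss(2.0, samples, lam=0.1)
```

Identical samples contribute zero to the term, so a model that always emits the same caption gets no reduction in loss from it.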