Image-Sentence Matching Models: A Running List (continuously updated)

Each entry gives: short name, paper title, venue, year, then (training objective, image encoder, text encoder). Objective shorthand as used here: tri = triplet ranking loss, LL = log-likelihood, corr = correlation-based objective, CE = cross-entropy, RL = reinforcement learning, GAN = adversarial loss.

  • DeViSE: DeViSE: A Deep Visual-Semantic Embedding Model, NIPS, 2013 (tri, AlexNet, w2v)
  • SDT-RNN: Grounded Compositional Semantics for Finding and Describing Images with Sentences, TACL, 2014 (tri, CNN, w2v + RNN*)
  • VSE0: Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, NIPS Workshop, 2014 (tri, CNN, w2v + LSTM)
  • Deep Fragment: Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, NIPS, 2014 (tri, R-CNN, w2v)
  • m-RNN: Explain Images with Multimodal Recurrent Neural Networks, arXiv, 2014 (LL, VGG16, one-hot + simple RNN)
  • DCCA: Deep Correlation for Matching Images and Text, CVPR, 2015 (corr, AlexNet, TF-IDF)
  • DVSA: Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR, 2015 (tri, R-CNN, w2v + RNN)
  • LRCN: Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR, 2015 (LL, VGG16, one-hot + LSTM)
  • m-CNN: Multimodal Convolutional Neural Networks for Matching Image and Sentence, ICCV, 2015 (tri, VGG19, w2v + CNN)
  • GMM-FV: Associating Neural Word Embeddings with Deep Image Representations Using Fisher Vectors, CVPR, 2015 (VGG19, w2v + GMM + HGLMM)
  • VQA-A: Leveraging Visual Question Answering for Image-Caption Ranking, ECCV, 2016 (LL, VGG19, BOW + LSTM)
  • RNN-FV: RNN Fisher Vectors for Action Recognition and Image Annotation, ECCV, 2016 (LL, VGG19, GMM-FV)
  • SPE: Learning Deep Structure-Preserving Image-Text Embeddings, CVPR, 2016 (tri, VGG19, GMM-FV)
  • HM-LSTM: Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding, ICCV, 2017 (tri, R-CNN, w2v + LSTM)
  • sm-LSTM: Instance-aware Image and Sentence Matching with Selective Multimodal LSTM, CVPR, 2017 (tri, VGG19, w2v + Bi-LSTM)
  • RRF-Net: Learning a Recurrent Residual Fusion Network for Multimodal Matching, ICCV, 2017 (tri, ResNet152, GMM-FV)
  • 2WayNet: Linking Image and Text with 2-Way Nets, CVPR, 2017 (corr, VGG16, GMM-FV)
  • DAN: Dual Attention Networks for Multimodal Reasoning and Matching, CVPR, 2017 (tri, ResNet152, one-hot + Bi-LSTM)
  • DPC: Dual-Path Convolutional Image-Text Embedding with Instance Loss, arXiv, 2017 (tri + CE, ResNet152, w2v + ResNet152)
  • VSE++: VSE++: Improving Visual-Semantic Embeddings with Hard Negatives, BMVC, 2018 (tri, ResNet152, w2v + GRU)
  • SCO: Learning Semantic Concepts and Order for Image and Sentence Matching, CVPR, 2018 (tri, ResNet152, one-hot + conventional LSTM)
  • GXN: Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models, CVPR, 2018 (tri + CE + RL + GAN, ResNet152, Bi-GRU)
  • SCAN: Stacked Cross Attention for Image-Text Matching, ECCV, 2018 (tri, Faster R-CNN (ResNet101), one-hot -> w2v + Bi-GRU)
  • Multi-task Learning of Hierarchical Vision-Language Representation, CVPR, 2019
  • Saliency-Guided Attention Network for Image-Sentence Matching, arXiv, 2019 (reported state of the art at the time of writing)
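
Most entries in the list are marked "tri", i.e. trained with a triplet ranking loss. As a rough illustration only, the sketch below computes a bidirectional hinge-based ranking loss over a small batch of paired image and sentence embeddings, with an optional hardest-negative mode in the spirit of the VSE++ title. The function name, the cosine similarity, and the margin of 0.2 are assumptions made for this example; the individual papers differ in their similarity functions, margins, and negative sampling.

```python
import math


def _normalise(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def triplet_ranking_loss(img, txt, margin=0.2, hardest=False):
    """Bidirectional triplet ranking loss (hypothetical helper, not from any paper's code).

    img, txt: lists of n embedding vectors; img[i] is the match of txt[i].
    margin:   assumed hinge margin.
    hardest:  False sums the hinge over all in-batch negatives;
              True keeps only the hardest negative per query.
    """
    img = [_normalise(v) for v in img]
    txt = [_normalise(v) for v in txt]
    n = len(img)
    # sim[i][j] = cosine similarity between image i and sentence j.
    sim = [[sum(a * b for a, b in zip(img[i], txt[j])) for j in range(n)]
           for i in range(n)]

    total = 0.0
    for q in range(n):
        pos = sim[q][q]
        # Image q as the query: every other sentence is a negative.
        i2t = [max(0.0, margin + sim[q][j] - pos) for j in range(n) if j != q]
        # Sentence q as the query: every other image is a negative.
        t2i = [max(0.0, margin + sim[i][q] - pos) for i in range(n) if i != q]
        if hardest:
            total += max(i2t, default=0.0) + max(t2i, default=0.0)
        else:
            total += sum(i2t) + sum(t2i)
    return total
```

With perfectly aligned embeddings, for example `triplet_ranking_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])`, every negative sits more than the margin below its positive, so the loss is zero. Swapping the two sentences makes every term violate the margin.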
