VQA - Notes on the Innovations in Top-Conference Visual Question Answering Papers of the Past Five Years

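A brief survey of the innovations in Visual Question Answering (VQA) papers published at top conferences over the past five years. The papers are drawn from NIPS, CVPR, ICCV, ACL, and others; 86 have been covered so far.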

Revised 2019.10.21: added 5 papers from ACL 2019.
2014 A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input

Malinowski M, Fritz M. A multi-world approach to question answering about real-world scenes based on uncertain input[C]//Advances in neural information processing systems. 2014: 1682-1690.

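This paper is the conceptual seed of VQA. However, the later paper [2015 VQA Visual Question Answering] argues that the problem defined here restricts answers to 16 predefined basic colors and 894 object categories, so it counts only as a VQA effort and does not truly define VQA.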
[Figure 1: Overview of our approach to question answering with multiple latent worlds in contrast to single world approach.]


2015 Are You Talking to a Machine Dataset and Methods for Multilingual Image Question Answering

Gao H, Mao J, Zhou J, et al. Are you talking to a machine? dataset and methods for multilingual image question answering[C]//Advances in neural information processing systems. 2015: 2296-2304.

[Figure 2: Illustration of the mQA model architecture. We input an image and a question about the image (i.e. “What is the cat doing?”) to the model. The model is trained to generate the answer to the question (i.e. “Sitting on the umbrella”). The weight matrix in the word embedding layers of the two LSTMs (one for the question and one for the answer) are shared. In addition, as in [25], this weight matrix is also shared, in a transposed manner, with the weight matrix in the Softmax layer. Different colors in the figure represent different components of the model. (Best viewed in color.)]



This paper provides the Freestyle Multilingual Image Question Answering (FM-IQA) dataset.
2015 Ask Your Neurons A Neural-Based Approach to Answering Questions About Images

Malinowski M, Rohrbach M, Fritz M. Ask your neurons: A neural-based approach to answering questions about images[C]//Proceedings of the IEEE international conference on computer vision. 2015: 1-9.

[Figure 1. Our approach Neural-Image-QA to question answering with a Recurrent Neural Network using Long Short Term Memory (LSTM). To answer a question about an image, we feed in both, the image (CNN features) and the question (green boxes) into the LSTM. After the (variable length) question is encoded, we generate the answers (multiple words, orange boxes). During the answer generation phase the previously predicted answers are fed into the LSTM until the END symbol is predicted.]
2015 Exploring Models and Data for Image Question Answering

Ren M, Kiros R, Zemel R. Exploring models and data for image question answering[C]//Advances in neural information processing systems. 2015: 2953-2961.

[Figure 2: VIS+LSTM Model]

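This is also something like an encoder-decoder framework: the image feature and then each word of the question are fed into an LSTM in sequence for encoding. Instead of decoding an output sentence, however, the vector obtained at the end of encoding is used to classify over a predefined vocabulary and predict the answer word.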
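The encode-then-classify idea can be sketched as follows. This is a minimal sketch, not the paper's model: a plain tanh recurrence stands in for the LSTM, the weights are random rather than trained, and the toy vocabularies, the dimension `DIM`, and the helper names `encode` and `classify` are all assumptions made for illustration.

```python
import math
import random

random.seed(0)
DIM = 4  # hypothetical feature / hidden size

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

# Toy vocabularies (assumptions, not taken from the paper).
WORDS = {"what": 0, "color": 1, "is": 2, "the": 3, "cat": 4}
ANSWERS = ["black", "white", "two", "yes"]

EMBED = rand_matrix(len(WORDS), DIM)    # word embeddings
W_IN = rand_matrix(DIM, DIM)            # input-to-hidden weights
W_REC = rand_matrix(DIM, DIM)           # hidden-to-hidden weights
W_OUT = rand_matrix(len(ANSWERS), DIM)  # classifier over answer words

def encode(image_feature, question):
    """Feed the image feature first, then each word; return the final state."""
    inputs = [image_feature] + [EMBED[WORDS[w]] for w in question]
    h = [0.0] * DIM
    for x in inputs:
        a, b = matvec(W_IN, x), matvec(W_REC, h)
        h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
    return h

def classify(h):
    """Softmax over the predefined answer vocabulary."""
    logits = matvec(W_OUT, h)
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

image_feature = [0.2, -0.1, 0.4, 0.3]  # stands in for projected CNN features
probs = classify(encode(image_feature, ["what", "color", "is", "the", "cat"]))
answer = ANSWERS[probs.index(max(probs))]
```

The point of the sketch is the shape of the computation: no decoder runs, and the answer is a single class picked from a fixed list.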

2015 Visalogy Answering Visual Analogy Questions

Sadeghi F, Zitnick C L, Farhadi A. Visalogy: Answering visual analogy questions[C]//Advances in Neural Information Processing Systems. 2015: 1882-1890.

This paper studies the visual analogy problem: image A is to image B as image C is to which image?

The paper uses a convolutional neural network with a quadruple Siamese architecture.
[Figure 2: VISALOGY Network has quadruple Siamese architecture with shared θ parameters. The network is trained with correct analogy quadruples of images [I1, I2, I3, I4] along with wrong analogy quadruples as negative samples. The contrastive loss function pushes (I1; I2) and (I3; I4) of correct analogies close to each other in the embedding space while forcing the distance between (I1; I2) and (I3; I4) in negative samples to be more than margin m.]
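The loss described in the caption can be sketched as below. This is a minimal sketch under stated assumptions: the transformation between two images is taken to be the difference of their embeddings, the loss uses the usual squared margin-based contrastive form, and the names `embed_difference` and `contrastive_loss` are hypothetical. The embeddings themselves would come from the shared-parameter CNN, which is not modeled here.

```python
import math

def embed_difference(xa, xb):
    """Represent the a -> b transformation as a vector difference (assumption)."""
    return [b - a for a, b in zip(xa, xb)]

def contrastive_loss(x1, x2, x3, x4, is_correct, margin=1.0):
    """Pull (I1, I2) and (I3, I4) together for correct analogies,
    push them at least `margin` apart for wrong ones."""
    t12 = embed_difference(x1, x2)
    t34 = embed_difference(x3, x4)
    dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(t12, t34)))
    if is_correct:
        return dist ** 2
    return max(0.0, margin - dist) ** 2

# Same transformation in both pairs: no loss when labeled correct,
# full margin penalty when labeled wrong.
same_pos = contrastive_loss([0, 0], [1, 0], [2, 2], [3, 2], is_correct=True)
same_neg = contrastive_loss([0, 0], [1, 0], [2, 2], [3, 2], is_correct=False)
# Very different transformations labeled wrong: already beyond the margin.
far_neg = contrastive_loss([0, 0], [1, 0], [2, 2], [2, 5], is_correct=False)
```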

2015 VisKE Visual Knowledge Extraction and Question Answering by Visual Verification of Relation Phrases

Sadeghi F, Kumar Divvala S K, Farhadi A. Viske: Visual knowledge extraction and question answering by visual verification of relation phrases[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 1456-1464.



This paper is the first to pose the research problem of visually verifying relation phrases, and it develops the Visual Knowledge Extraction system (VisKE).
[Figure 2. Approach Overview. Given a relation predicate, such as fish(bear,salmon) VisKE formulates visual verification as the problem of estimating the most probable explanation (MPE) by searching for visual consistencies among the patterns of subject, object and the action being involved.]

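Input relation predicate: bear (noun, subject), fish (verb), salmon (noun, object). Given a relation predicate such as fish(bear, salmon), VisKE formulates visual verification as estimating the most probable explanation (MPE), which it does by searching for visual consistencies among the patterns of the subject, the object, and the action.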
2015 Visual Madlibs Fill in the Blank Description Generation and Question Answering

Yu L, Park E, Berg A C, et al. Visual madlibs: Fill in the blank description generation and question answering[C]//Proceedings of the ieee international conference on computer vision. 2015: 2461-2469.

[Figure 1. An example from the Visual Madlibs Dataset, including a variety of targeted descriptions for people and objects.]

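This paper releases the Visual Madlibs dataset, which uses fill-in-the-blank templates to generate descriptions of people, objects, appearances, activities, interactions, and scenes.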
2015 VQA Visual Question Answering

Antol S, Agrawal A, Lu J, et al. Vqa: Visual question answering[C]//Proceedings of the IEEE international conference on computer vision. 2015: 2425-2433.

[Figure 1: Examples of free-form, open-ended questions collected for images via Amazon Mechanical Turk. Note that commonsense knowledge is needed along with a visual understanding of the scene to answer many questions.]
2016 Answer-Type Prediction for Visual Question Answering

Kafle K, Kanan C. Answer-type prediction for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4976-4984.

[Figure 1: In the open-ended VQA problem, an algorithm is given an image and a question, and it must output a string containing the answer. We obtain state-of-the-art results on multiple VQA datasets by adopting a Bayesian approach that incorporates information about the form the answer should take. In this example, the system is given an image of a bear and it is asked about the color of the bear. Our method explicitly infers that this is a “color” question and uses that information in its predictive process.]


2016 Ask Me Anything Free-Form Visual Question Answering Based on Knowledge from External Sources

Wu Q, Wang P, Shen C, et al. Ask me anything: Free-form visual question answering based on knowledge from external sources[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4622-4630.


[Figure 2. Our proposed framework: given an image, a CNN is first applied to produce the attribute-based representation Vatt(I). The internal textual representation is made up of image captions generated based on the image-attributes. The hidden state of the caption-LSTM after it has generated the last word in each caption is used as its vector representation. These vectors are then aggregated as Vcap(I) with average-pooling. The external knowledge is mined from the KB (in this case DBpedia) and the responses encoded by Doc2Vec, which produces a vector Vknow(I). The 3 vectors V are combined into a single representation of scene content, which is input to the VQA LSTM model which interprets the question and generates an answer.]

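Image captioning converts the image to text, which is vectorized as Vcap(I). Image annotation converts the image to word attributes, which are used to query a knowledge graph for textual descriptions that are vectorized as Vknow(I). The three vectors Vatt(I), Vcap(I), and Vknow(I) are then combined into a single scene representation that is fed to the VQA LSTM.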

2016 Hierarchical Question-Image Co-Attention for Visual Question Answering

Lu J, Yang J, Batra D, et al. Hierarchical question-image co-attention for visual question answering[C]//Advances In Neural Information Processing Systems. 2016: 289-297.



This paper builds a hierarchical attention mechanism over the question text (word level, phrase level, question level).
[Figure 1: Flowchart of our proposed hierarchical co-attention model. Given a question, we extract its word level, phrase level and question level embeddings. At each level, we apply co-attention on both the image and question. The final answer prediction is based on all the co-attended image and question features.]

2016 Image Question Answering Using Convolutional Neural Network With Dynamic Parameter Prediction

Noh H, Hongsuck Seo P, Han B. Image question answering using convolutional neural network with dynamic parameter prediction[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 30-38.





[Figure 2. Overall architecture of the proposed Dynamic Parameter Prediction network (DPPnet), which is composed of the classification network and the parameter prediction network. The weights in the dynamic parameter layer are mapped by a hashing trick from the candidate weights obtained from the parameter prediction network.]

2016 MovieQA Understanding Stories in Movies Through Question-Answering

Tapaswi M, Zhu Y, Stiefelhagen R, et al. Movieqa: Understanding stories in movies through question-answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 4631-4640.

[Figure 1: Our MovieQA dataset contains 14,944 questions about 408 movies. It contains multiple sources of information: plots, subtitles, video clips, scripts, and DVS transcriptions. In this figure we show example QAs from The Matrix and localize them in the timeline.]



Drawing on the MemN2N model, this paper designs its own Memory Network for QA.
2016 Stacked Attention Networks for Image Question Answering

Yang Z, He X, Gao J, et al. Stacked attention networks for image question answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 21-29.

This paper proposes stacked attention networks (SANs).

[Figure 1: Model architecture and visualization. (b) Visualization of the learned multiple attention layers. The stacked attention network first focuses on all referred concepts, e.g., bicycle, basket and objects in the basket (dogs) in the first attention layer and then further narrows down the focus in the second layer and finds out the answer dog.]




14×14 is the number of regions in the image and 512 is the dimension of the feature vector for each region.
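For reference, the stacked attention layers can be written as below. The notation is reconstructed from the SAN paper as best it can be here and should be checked against the original: $v_I \in \mathbb{R}^{d \times m}$ holds the $m = 14 \times 14$ region features after projection to dimension $d$, $v_Q$ is the question vector, $k$ indexes the attention layer, and $\oplus$ adds a vector to every column of a matrix.

```latex
\begin{aligned}
h_A^{k} &= \tanh\!\left(W_{I,A}^{k}\, v_I \oplus \left(W_{Q,A}^{k}\, u^{k-1} + b_A^{k}\right)\right), \\
p_I^{k} &= \operatorname{softmax}\!\left(W_P^{k}\, h_A^{k} + b_P^{k}\right), \\
\tilde{v}_I^{k} &= \sum_{i=1}^{m} p_i^{k}\, v_i, \\
u^{k} &= \tilde{v}_I^{k} + u^{k-1}, \qquad u^{0} = v_Q .
\end{aligned}
```

Each layer uses the refined query $u^{k-1}$ to re-weight the same region features, which is what lets the second layer narrow the focus found by the first.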

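The authors of this paper, Zichao Yang, Xiaodong He, and others, also happen to be the proposers of the Hierarchical Attention Network (HAN). I like their papers very much: they take real care in explaining principles, and always lay out the technical ideas with the clearest reasoning and the most precise wording. My thanks to them!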
2016 Visual Question Answering with Question Representation Update (QRU)

Li R, Jia J. Visual question answering with question representation update (qru)[C]//Advances in Neural Information Processing Systems. 2016: 4655-4663.

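The method iterates over the image regions. Each iteration computes the relevance between an image region and the question, selects the question-relevant regions to update the question representation, and then learns to give the correct answer.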
[Figure 2: The overall architecture of our model with single reasoning layer for VQA]


2016 Visual7W Grounded Question Answering in Images

Zhu Y, Groth O, Bernstein M, et al. Visual7w: Grounded question answering in images[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 4995-5004.



[Figure 1: Deep image understanding relies on detailed knowledge about different image parts. We employ diverse questions to acquire detailed information on images, ground objects mentioned in text with their visual appearances, and provide a multiple-choice setting for evaluating a visual question answering task with both textual and visual answers.]

2016 Where to Look Focus Regions for Visual Question Answering

Shih K J, Singh S, Hoiem D. Where to look: Focus regions for visual question answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 4613-4621.



[Figure 3. Overview of our network for the example question-answer pairing: “What color is the fire hydrant? Yellow.” Question and answer representations are concatenated, fed through the network, then combined with selectively weighted image region features to produce a score.]

image region feature vector + text feature vector // here "+" means concatenation

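The dot product followed by softmax is a single attention step that attends to image regions according to the text features. The region vectors and the text vector are projected into a common vector space.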
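A minimal sketch of this attention step, assuming the region and text vectors have already been projected into the common space (the learned projections are omitted). The function names and toy values are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_vectors, text_vector):
    """Score each region by its dot product with the text, then softmax."""
    scores = [sum(r * t for r, t in zip(region, text_vector))
              for region in region_vectors]
    weights = softmax(scores)
    dim = len(region_vectors[0])
    pooled = [sum(w * region[k] for w, region in zip(weights, region_vectors))
              for k in range(dim)]
    return weights, pooled

regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # toy projected region features
text = [2.0, 0.0]                               # toy projected text feature
weights, pooled = attend(regions, text)

# Concatenate the weighted region feature with the text feature,
# matching the "+" (concatenation) noted above.
joint = pooled + text
```

Regions aligned with the text vector receive most of the weight, so `pooled` is dominated by their features.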

2016 Yin and Yang Balancing and Answering Binary Visual Questions

Zhang P, Goyal Y, Summers-Stay D, et al. Yin and yang: Balancing and answering binary visual questions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 5014-5022.


[Figure 1: We address the problem of answering binary questions about images. To eliminate strong language priors that shadow the role of detailed visual understanding in visual question answering (VQA), we use abstract scenes to collect a balanced dataset containing pairs of complementary scenes: the two scenes have opposite answers to the same question, while being visually as similar as possible. We view the task of answering binary questions as a visual verification task: we convert the question into a tuple that concisely summarizes the visual concept, which if present, result in the answer of the question being “yes”, and otherwise “no”. Our approach attends to relevant portions of the image when verifying the presence of the visual concept.]


2017 A Dataset and Exploration of Models for Understanding Video Data Through Fill-In-The-Blank Question-Answering

Maharaj T, Ballas N, Rohrbach A, et al. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6884-6893.


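This paper presents the MovieFIB (Movie Fill-In-the-Blank) dataset, with 300,000 examples, built from descriptive video annotations prepared for the visually impaired.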
[Figure 1. Two examples from the training set of our fill-in-the-blank dataset.]

2017 An Analysis of Visual Question Answering Algorithms

Kafle K, Kanan C. An analysis of visual question answering algorithms[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 1965-1973.

The main contribution is a new dataset, the Task Driven Image Understanding Challenge (TDIUC), whose questions are divided into 12 categories.
[Figure 1: A good VQA benchmark tests a wide range of computer vision tasks in an unbiased manner. In this paper, we propose a new dataset with 12 distinct tasks and evaluation metrics that compensate for bias, so that the strengths and limitations of algorithms can be better measured.]

2017 An Empirical Evaluation of Visual Question Answering for Novel Objects

Ramakrishnan S K, Pal A, Sharma G, et al. An empirical evaluation of visual question answering for novel objects[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4392-4401.

[Figure 1: We are interested in answering questions about images containing objects not seen at training.]



2017 Are You Smarter Than a Sixth Grader Textbook Question Answering for Multimodal Machine Comprehension

Kembhavi A, Seo M, Schwenk D, et al. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4999-5007.

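The textbook question answering problem studied here belongs to Multi-Modal Machine Comprehension (M3C): given a context composed of text, diagrams, and images, the machine must answer multimodal questions.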
[Figure 1. An overview of the Multi-modal Machine Comprehension (M3C) paradigm, statistics of the proposed Textbook Question Answering (TQA) dataset and an illustration of a lesson in it. TQA can be downloaded at http://textbookqa.org .]

This paper releases the Textbook Question Answering (TQA) dataset.

2017 Creativity Generating Diverse Questions Using Variational Autoencoders

Jain U, Zhang Z, Schwing A G. Creativity: Generating diverse questions using variational autoencoders[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6485-6494.

[Figure 3: High level VAE overview of our approach.]

This paper proposes combining a variational autoencoder (VAE) with an LSTM to build a creative algorithm for visual question generation.
2017 End-To-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering

Yu Y, Ko H, Choi J, et al. End-to-end concept word detection for video captioning, retrieval, and question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 3165-3173.

The paper's concept word detector takes an input video, generates a list of concept words, and supplies them to the language generation model.
[Figure 1. The intuition of the proposed concept word detector. Given a video clip, a set of tracing LSTMs extract multiple concept words that consistently appear across frame regions. We then employ semantic attention to combine the detected concepts with text encoding/decoding for several video-to-language tasks of LSMDC 2016, such as captioning, retrieval, and question answering.]

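The concept word detector is trained on videos and their corresponding description sentences; once trained, it can generate a set of high-level concept words for each video.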
[Figure 2. The architecture of the concept word detection in a top red box (section 2.2), and our video description model in bottom, which uses semantic attention on the detected concept words (section 3.1).]

2017 Explicit Knowledge-based Reasoning for Visual Question Answering

Wang P, Wu Q, Shen C, et al. Explicit knowledge-based reasoning for visual question answering[C]//Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 2017: 1290-1296.

[Figure 1: A real example of the proposed KB-VQA dataset and the results given by Ahab, the proposed VQA approach. Our approach answers questions by extracting several types of visual concepts from an image and aligning them to large-scale structured knowledge bases. Apart from answers, our approach can also provide reasons and explanations for certain types of questions.]

[Figure 3: Top: An RDF graph such as might be constructed by Ahab. For simplicity, we only show entities that are relevant to answering the questions in Fig. 1. Each arrow corresponds to one triple in the graph, with circles representing entities and green text reflecting predicate type. The graph of extracted visual concepts (left side) is linked to DBpedia (right side) by mapping object/attribute/scene to DBpedia entities using the predicate same-concept. Bottom: The question processing pipeline. The input question is parsed using a set of NLP tools to identify the appropriate template. The extracted slot-phrases are then mapped to entities in the KB. Next, KB queries are generated to mine the relevant relationships for the KB-entities. Finally, the answer and reason are generated based on the query results. The predicate category/?broader is used to obtain the categories transitively.]

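The approach builds an RDF (Resource Description Framework [Cyganiak et al., 2014]) graph, extends the knowledge base, and parses, maps, and logically queries the natural-language question to obtain answers whose reasoning process is interpretable.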
2017 Graph-Structured Representations for Visual Question Answering

Teney D, Liu L, van den Hengel A. Graph-structured representations for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1-9.




[Figure 2. Architecture of the proposed neural network. The input is provided as a description of the scene (a list of objects with their visual characteristics) and a parsed question (words with their syntactic relations). The scene-graph contains a node with a feature vector for each object, and edge features that represent their spatial relationships. The question-graph reflects the parse tree of the question, with a word embedding for each node, and a vector embedding of types of syntactic dependencies for edges. A recurrent unit (GRU) is associated with each node of both graphs. Over multiple iterations, the GRU updates a representation of each node that integrates context from its neighbours within the graph. Features of all objects and all words are combined (concatenated) pairwise, and they are weighted with a form of attention. That effectively matches elements between the question and the scene. The weighted sum of features is passed through a final classifier that predicts scores over a fixed set of candidate answers.]


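Question graph: the syntactically parsed question. The question graph is the parse tree of the question, with one node per word. Each node holds the word's embedding, and each edge holds a vector embedding of the syntactic dependency between the two words.
The feature vectors of all objects and all words are combined pairwise (the Words-Objects matrix in Figure 2) and summed with attention weights (the Matching weights matrix holds the attention weights).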
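The pairwise matching step can be sketched roughly as follows. This is an illustration only: the matching weight here is a sigmoid of the cosine similarity between a word feature and an object feature, which is an assumption standing in for the paper's learned weighting, and it requires both feature types to share one dimension. The names `match_and_pool`, `words`, and `objects` are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def match_and_pool(word_feats, object_feats):
    """Weight every (word, object) pair, then sum the weighted concatenations."""
    weights = [[sigmoid(cosine(w, o)) for o in object_feats] for w in word_feats]
    dim = len(word_feats[0]) + len(object_feats[0])
    pooled = [0.0] * dim
    for i, w in enumerate(word_feats):
        for j, o in enumerate(object_feats):
            pair = w + o  # concatenated word/object feature
            for k in range(dim):
                pooled[k] += weights[i][j] * pair[k]
    return weights, pooled

words = [[1.0, 0.0], [0.0, 1.0]]     # toy node features from the question graph
objects = [[1.0, 0.0], [0.0, -1.0]]  # toy node features from the scene graph
weights, pooled = match_and_pool(words, objects)
```

In the paper the node features would first be refined by the per-node GRUs; the sketch starts from features that are assumed to be refined already.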

Limitation: the scene graph in this paper only encodes relative spatial positions.
2017 High-Order Attention Models for Visual Question Answering

Schwartz I, Schwing A, Hazan T. High-order attention models for visual question answering[C]//Advances in Neural Information Processing Systems. 2017: 3664-3674.



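Generally applicable: an attention mechanism that can be applied broadly to a variety of tasks.
High-order correlation: it can learn high-order correlations between different data modalities. A k-th order correlation models the correlation among k modalities.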



[Figure 2: Our state-of-the-art VQA system]


Unary potentials:

Pairwise potentials:
Ternary potentials:


The decision-making stage of this paper uses MCB and MCT pooling:

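MCB pooling (Multimodal Compact Bilinear Pooling): in the decision-making stage, this bilinear pooling fuses the two modalities of the pairwise setting into one pooled output.
MCT pooling (Multimodal Compact Trilinear Pooling): in the decision-making stage, this trilinear pooling fuses the data of three modalities into one pooled output.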
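The core trick behind compact bilinear pooling can be sketched as below. This is a toy illustration, not the paper's implementation: each vector is compressed with a count sketch, and the two sketches are combined by circular convolution, which equals the count sketch of the full outer product. Real implementations compute the convolution with FFTs; the direct loop here is a swapped-in equivalent chosen to stay dependency-free. All names, sizes, and values are assumptions.

```python
import random

def make_projection(n, d, rng):
    """Sample one fixed hash bucket and one sign per input dimension."""
    hashes = [rng.randrange(d) for _ in range(n)]
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    return hashes, signs

def count_sketch(vec, hashes, signs, d):
    out = [0.0] * d
    for value, h, s in zip(vec, hashes, signs):
        out[h] += s * value
    return out

def circular_convolution(a, b):
    d = len(a)
    out = [0.0] * d
    for i in range(d):
        for j in range(d):
            out[(i + j) % d] += a[i] * b[j]
    return out

def sketch_of_outer_product(x, q, proj_x, proj_q, d):
    """Count sketch of the full outer product, built the slow way for comparison."""
    (hx, sx), (hq, sq) = proj_x, proj_q
    out = [0.0] * d
    for i, xi in enumerate(x):
        for j, qj in enumerate(q):
            out[(hx[i] + hq[j]) % d] += sx[i] * sq[j] * xi * qj
    return out

rng = random.Random(0)
d = 8                        # sketch size (toy value)
x = [1.0, 2.0, -1.0]         # toy visual feature
q = [0.5, -2.0, 3.0, 1.0]    # toy question feature
proj_x = make_projection(len(x), d, rng)
proj_q = make_projection(len(q), d, rng)

pooled = circular_convolution(
    count_sketch(x, *proj_x, d), count_sketch(q, *proj_q, d)
)
reference = sketch_of_outer_product(x, q, proj_x, proj_q, d)
```

`pooled` and `reference` agree, which is why the outer product never has to be materialized. A trilinear variant would presumably convolve a third sketch in the same way, though whether that matches the paper's MCT exactly is not verified here.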

2017 Knowledge Acquisition for Visual Question Answering via Iterative Querying

Zhu Y, Lim J J, Fei-Fei L. Knowledge acquisition for visual question answering via iterative querying[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1154-1163.


[Figure 2: (a) An illustration of a standard VQA model. (b) An overview of our iterative model. (c) Detailed flowchart of our model. The model consists of two major components: core network (green) and query generator (blue). The query generator proposes task-driven queries to fetch evidence from external sources. Acquired knowledge is encoded and stored as memories in the core network for answering a question.]

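Specifically, the model obtains supporting evidence through a series of queries to knowledge sources. The acquired evidence is encoded and stored in a memory bank. The model then uses the freshly updated memory either to propose the next round of queries or to answer the target question.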
2017 Learning to Disambiguate by Asking Discriminative Questions

Li Y, Huang C, Tang X, et al. Learning to disambiguate by asking discriminative questions[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 3419-3428.

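Humans gather information, understand the world, and resolve ambiguity by asking questions. Inspired by this, the paper poses a new research problem: how can discriminative questions be generated to help disambiguate visual instances?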
[Figure 4: Overview of the attribute-conditioned question generation process. Given a pair of ambiguous images, we first extract semantic attributes from the images respectively. The attribute scores are sent into a selection model to select the distinguishing attributes pair, which reflects the most obvious difference between the ambiguous images. Then the visual feature and selected attribute pair are fed into an attribute-conditioned LSTM model to generate discriminative questions.]



2017 Learning to Reason End-to-End Module Networks for Visual Question Answering

Hu R, Andreas J, Rohrbach M, et al. Learning to reason: End-to-end module networks for visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 804-813.

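Building on the recently proposed Neural Module Network (NMN) architecture, this paper proposes End-to-End Module Networks (N2NMNs), which learn to reason by directly predicting instance-specific network layouts without using the parser that NMN relies on.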

[Figure 2: Model overview. Our approach first computes a deep representation of the question, and uses this as an input to a layout-prediction policy implemented with a recurrent neural network. This policy emits both a sequence of structural actions, specifying a template for a modular neural network in reverse Polish notation, and a sequence of attentive actions, extracting parameters for these neural modules from the input sentence. These two sequences are passed to a network builder, which dynamically instantiates an appropriate neural network and applies it to the input image to obtain an answer.]



2017 Making the V in VQA Matter Elevating the Role of Image Understanding in Visual Question Answering

Goyal Y, Khot T, Summers-Stay D, et al. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6904-6913.


[Figure 1: Examples from our balanced VQA dataset.]

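Specifically, for every question the paper finds a pair of semantically complementary images, balancing the answers (e.g. man/woman, yes/no) so that VQA models are not driven by statistical regularities unrelated to the visual content. The fully balanced dataset in this paper is the VisualQA dataset.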
2017 MarioQA Answering Questions by Watching Gameplay Videos

Mun J, Hongsuck Seo P, Jung I, et al. Marioqa: Answering questions by watching gameplay videos[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2867-2875.

[Figure 1: Overall QA generation procedure. Given a gameplay video and event logs shown on the left, (a) target event is selected (marked as a green box), (b) question semantic chunk is generated from the target event, (c) question template is sampled from template pool, and (d) QA pairs are generated by filling the template and the linguistically realizing answer.]

2017 Multi-level Attention Networks for Visual Question Answering

Yu D, Fu J, Mei T, et al. Multi-level attention networks for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4709-4717.



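Semantic concepts are generated from the high-level semantics of the CNN, and the concepts relevant to the question are selected as semantic attention.
A bidirectional RNN encodes the region-based mid-level CNN outputs into spatial embeddings, and an MLP further localizes the answer-relevant regions as visual attention.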

[Figure 2. Overall framework of multi-level attention networks. Our framework consists of three components: (A) semantic attention, (B) context-aware visual attention and (C) joint attention learning. Here, we denote by vq the representation of the question Q, by vimg, vc the representation of image content on the visual and semantic level queried by the question, respectively. vr and pimg c is the activation of the last convolutional layer and the probability layer from the CNN.]




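vc: the image representation obtained by generating semantic concepts from the high-level semantics of the CNN and selecting among them with semantic attention.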

2017 Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering

Yu Z, Yu J, Fan J, et al. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering[C]//Proceedings of the IEEE international conference on computer vision. 2017: 1821-1830.

This paper proposes the Multi-modal Factorized Bilinear (MFB) pooling approach, which strengthens multimodal feature fusion to improve VQA.

[Figure 3. MFB with Co-Attention network architecture for VQA. Different from the network of MFB baseline, the images and questions are firstly represented as the fine-grained features respectively. Then, Question Attention and Image Attention modules are jointly modeled in the framework to provide more accurate answer predictions.]

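Multi-modal Compact Bilinear pooling (MCB) takes the outer product of two feature vectors, and the quadratic expansion yields a very high-dimensional feature vector. MLB alleviates the high-dimensionality problem with low-rank projection matrices.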

Multi-modal Low-rank Bilinear Pooling (MLB):




Multi-modal Factorized Bilinear pooling (MFB):
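A minimal sketch of the two poolings, under stated assumptions. MLB is taken in its core form $z = U^{\top}x \circ V^{\top}y$ (the nonlinearity and output projection are omitted), and MFB is taken as the same elementwise product followed by sum pooling over non-overlapping windows of size $k$; the power and L2 normalization that usually follow are also left out. The function names and the toy matrices are hypothetical.

```python
def matvec_t(matrix, vec):
    """Compute matrix^T @ vec for a matrix stored as a list of rows."""
    cols = len(matrix[0])
    return [sum(matrix[r][c] * vec[r] for r in range(len(vec)))
            for c in range(cols)]

def mfb_pool(x, y, U, V, k):
    """Project both inputs to k*o dimensions, multiply elementwise,
    then sum-pool every k consecutive entries into one output."""
    px = matvec_t(U, x)
    py = matvec_t(V, y)
    joint = [a * b for a, b in zip(px, py)]
    return [sum(joint[i:i + k]) for i in range(0, len(joint), k)]

x = [1, 2]     # toy image feature
y = [3, 1, 0]  # toy question feature
U = [[1, 0, 1, 0],
     [0, 1, 0, 1]]   # 2 x (k*o) with k = 2, o = 2
V = [[1, 1, 0, 0],
     [0, 0, 1, 1],
     [1, 0, 1, 0]]   # 3 x (k*o)

mfb_out = mfb_pool(x, y, U, V, k=2)  # two pooled outputs
mlb_out = mfb_pool(x, y, U, V, k=1)  # no pooling: the MLB-style product
```

With `k=1` the pooling does nothing and the result is the plain low-rank product, which is one way to see MLB as a special case of MFB.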

2017 Multimodal Learning and Reasoning for Visual Question Answering

Ilievski I, Feng J. Multimodal learning and reasoning for visual question answering[C]//Advances in Neural Information Processing Systems. 2017: 551-562.




Namely, the VQA problem can be solved by modeling the likelihood probability distribution 

which for each answer in the answer set outputs the probability of being the correct answer, given a question about an image


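That is, the likelihood distribution of a parameterized model should output the correct answer when conditioned on the input question and image.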

[Figure 1: Network architecture diagram of the ReasonNet model applied on the VQA task. Round rectangles represent attention modules, squared rectangles represent classification modules, small trapezoids represent encoder units (Eq. (3)), thin rectangles represent the learned multimodal representation vectors, x represents the bilinear interaction model (Eq. (4)), and the big trapezoid is a multi-layer perceptron network that classifies the reasoning vector g to an answer a (Eq. (7))]


x denotes the bilinear interaction model;

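ReasonNet processes the image and the question with multiple modules (attention modules and classification modules). The module outputs are encoded and then passed through bilinear interaction, and the resulting vectors are concatenated into one long vector that the final answer classifier uses for classification.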
2017 MUTAN Multimodal Tucker Fusion for Visual Question Answering

Ben-Younes H, Cadene R, Cord M, et al. Mutan: Multimodal tucker fusion for visual question answering[C]//Proceedings of the IEEE international conference on computer vision. 2017: 2612-2620.


[Figure 2: MUTAN fusion scheme for global Visual QA. The prediction is modeled as a bilinear interaction between visual and linguistic features, parametrized by the tensor T . In MUTAN, we factorise the tensor T using a Tucker decomposition, resulting in an architecture with three intra-modal matrices Wq, Wv and Wo, and a smaller tensor T c. The complexity of T c is controlled via a structured sparsity constraint on the slice matrices of the tensor.]

2017 Structured Attentions for Visual Question Answering

Zhu C, Zhao Y, Huang S, et al. Structured attentions for visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 1291-1300.

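The paper argues that VQA questions often involve complex relations among multiple image regions, while few existing attention models can effectively encode cross-region relations.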

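By showing that ResNet has a limited effective receptive field, the paper demonstrates the importance of encoding relations between regions. It therefore proposes modeling the visual attention over image regions as a multivariate distribution on a grid-structured conditional random field (CRF), and explains how to convert the iterative inference algorithms (Mean Field and Loopy Belief Propagation) into recurrent layers of an end-to-end neural network.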
[Figure 2: The whole picture of the proposed model. The inputs to the recurrent inference layers are the unary potential ψi(zi) and pairwise potential ψij(zi, zj), computed with Eq. 8. ψi(zi) can also be used as an additional glimpse, which usually detects the key nouns. In the inference layers, xi represents b(i) for MF and m(i) for LBP. The recurrent inference layers generates a structured glimpse with MF or LBP. The 2 glimpses are used to weight-sum the visual feature vectors. The classifier use both of the attended visual features and the question feature to predict the answer. The demonstration is a real case.]
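Mean-field inference on a grid can be sketched as follows. This is the textbook update for a pairwise model, not the paper's network layer: each grid cell carries a binary variable, the pairwise potential is shared by all edges, the updates are synchronous, and the belief of state 1 is read off as the attention weight. These choices and all names are assumptions made for illustration.

```python
import math

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def neighbors(r, c, rows, cols):
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield nr, nc

def mean_field(unary, pairwise, steps=5):
    """unary[r][c] = (psi_i(0), psi_i(1)); pairwise[zi][zj] = psi_ij(zi, zj)."""
    rows, cols = len(unary), len(unary[0])
    beliefs = [[normalize(unary[r][c]) for c in range(cols)] for r in range(rows)]
    for _ in range(steps):
        updated = []
        for r in range(rows):
            row = []
            for c in range(cols):
                scores = []
                for zi in (0, 1):
                    # log psi_i(zi) plus the expected log pairwise potential
                    log_score = math.log(unary[r][c][zi])
                    for nr, nc in neighbors(r, c, rows, cols):
                        for zj in (0, 1):
                            log_score += beliefs[nr][nc][zj] * math.log(pairwise[zi][zj])
                    scores.append(math.exp(log_score))
                row.append(normalize(scores))
            updated.append(row)
        beliefs = updated
    return [[b[1] for b in row] for row in beliefs]

# With a uniform pairwise potential the regions do not interact.
independent = mean_field([[(1.0, 3.0), (2.0, 2.0)]], [[1.0, 1.0], [1.0, 1.0]])

# With an attractive potential, an undecided middle region follows its neighbors.
attractive = mean_field([[(1.0, 9.0), (1.0, 1.0), (1.0, 9.0)]],
                        [[2.0, 1.0], [1.0, 2.0]])
```

Because every step is a fixed sequence of differentiable operations, unrolling a few steps is what allows this kind of inference to sit inside a network as a recurrent layer.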
2017 TGIF-QA Toward Spatio-Temporal Reasoning in Visual Question Answering

Jang Y, Song Y, Yu Y, et al. Tgif-qa: Toward spatio-temporal reasoning in visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2758-2766.



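    Repetition count: answer how many times an action occurs;
    Repeating action: answer which action is repeated in the video;
    State transition: answer questions about transitions of states such as facial expressions, actions, places, and object properties.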

[Figure 3. The proposed ST-VQA model for spatio-temporal VQA. See Figure 4 for the structure of spatial and temporal attention modules.]
2017 The VQA-Machine Learning How to Use Existing Vision Algorithms to Answer New Questions

Wang P, Wu Q, Shen C, et al. The vqa-machine: Learning how to use existing vision algorithms to answer new questions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1173-1182.


[Figure 2: The proposed VQA model. The input question, facts and image features are weighted at three question-encoding levels. Given the co-weighted features at all levels, a multi-layer perceptron (MLP) classifier is used to predict answers. Then the ranked facts are used to generate reasons.]

Model inputs: question, visual facts, and image;
    the question is represented by hierarchical question encoding with three levels: word, phrase, and sentence;
    visual facts are represented as triples (subject, relation, object);




[Figure 3: The sequential co-attention module. Given the feature sequences for the question (Q), facts (F) and image (V), this module sequentially generates weighted features (˜v, ˜q,˜f ).]

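The sequential co-attention mechanism in this paper generates the attention weights for one feature at a time, using the other features as guidance. It can be formalized as: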
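A rough sketch of the idea rather than the paper's exact formulation: each feature sequence is attended in turn, guided by summaries of the other two. The scoring here is a plain dot product with the summed guides, swapped in for the paper's learned scoring; the initial guides are mean vectors; and all three feature types are assumed to share one dimension. The order image, question, facts follows the figure caption, and all names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_vector(seq):
    return [sum(col) / len(seq) for col in zip(*seq)]

def attend(seq, guide_a, guide_b):
    """Weight the items of `seq` by their agreement with the two guides."""
    guide = [a + b for a, b in zip(guide_a, guide_b)]
    weights = softmax([sum(x * g for x, g in zip(item, guide)) for item in seq])
    dim = len(seq[0])
    return [sum(w * item[k] for w, item in zip(weights, seq)) for k in range(dim)]

def sequential_co_attention(Q, F, V):
    q0, f0 = mean_vector(Q), mean_vector(F)
    v_att = attend(V, q0, f0)        # image guided by question and facts
    q_att = attend(Q, v_att, f0)     # question guided by attended image and facts
    f_att = attend(F, v_att, q_att)  # facts guided by attended image and question
    return v_att, q_att, f_att

Q = [[1.0, 0.0]]                # toy question features
F = [[0.0, 1.0]]                # toy fact features
V = [[1.0, 1.0], [-1.0, -1.0]]  # toy image region features
v_att, q_att, f_att = sequential_co_attention(Q, F, V)
```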


2017 Video Question Answering via Hierarchical Spatio-Temporal Attention Networks

Zhao Z, Yang Q, Cai D, et al. Video question answering via hierarchical spatio-temporal attention networks[C]//Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 2017: 3518-3524.

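The paper argues that current VQA research focuses on static images, and that existing methods cannot handle video question answering effectively because they do not model the temporal dynamics of video content.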

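Starting from a spatio-temporal attentional encoder-decoder learning framework, the paper proposes a hierarchical spatio-temporal attention network that learns a joint representation of the dynamic video content conditioned on the given question. It develops a spatio-temporal attention network with a multi-step reasoning process for video question answering.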
[Figure 2: The Overview of Open-Ended Video Question Answering via Hierarchical Spatial-Temporal Attentional Encoder-Decoder Learning Framework (r-STAN in case of r = 2). The hierarchical spatio-temporal attentional encoder networks learn the joint representation of multimodal spatio-temporal attentional video and textual question with multiple reasoning steps, and the recurrent decoder network generates the natural language answer for open-ended video question answering.]
2017 VQS Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

Gan C, Li Y, Li H, et al. Vqs: Linking segmentations to questions and answers for supervised attention in vqa and question-focused semantic segmentation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 1811-1820.

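This paper links the instance segmentation annotations in the COCO dataset with the question and answer annotations in the VQA dataset, and names the result the VQS (Visual Questions and Segmentation answers) dataset. The new annotations may help open up new research problems and models.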
2017 What’s in a Question Using Visual Questions as a Form of Supervision

Ganju S, Russakovsky O, Gupta A. What’s in a question: Using visual questions as a form of supervision[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 241-250.



[Figure 5: Framework of the iBOWIMG-2x model. The representation consists of three parts: (1) visual image features, (2) text embedding of the target question, and (3) text embedding of the other questions concatenated together. This representation is passed through a learned fully connected layer to predict the answer to the target question.]


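The text embedding of the other questions concatenated together. // The other questions about the same image are added as weak supervision.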

2018 Chain of Reasoning for Visual Question Answering

Wu C, Liu J, Wang X, et al. Chain of reasoning for visual question answering[C]//Advances in Neural Information Processing Systems. 2018: 275-285.



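This paper builds a chain of reasoning (CoR) model that supports multi-step, dynamic reasoning over changing relations and objects. Specifically, the relational reasoning operation forms new relations between objects, and the object refining operation generates new compound objects from relations.

In the CoR model, both relations and compound objects are nodes in the reasoning chain.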



Relational reasoning can be expressed as:


2018 Cross-Dataset Adaptation for Visual Question Answering

Chao W L, Hu H, Sha F. Cross-dataset adaptation for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5716-5725.


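This paper proposes a domain adaptation algorithm. It reduces the statistical distribution mismatch between the source and target datasets by transforming the feature representations of the data in the target dataset. In addition, it maximizes the likelihood that a VQA model trained on the source dataset answers correctly on the target dataset.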
[Figure 1. An illustration of the dataset bias in visual question answering. Given the same image, Visual QA datasets like VQA [4] (right) and Visual7W [50] (left) provide different styles of questions, correct answers (red), and candidate answer sets, each can contributes to the bias to prevent cross-dataset generalization.]
2018 Customized Image Narrative Generation via Interactive Visual Question Generation and Answering

Shin A, Ushiku Y, Harada T. Customized Image Narrative Generation via Interactive Visual Question Generation and Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 8925-8933.


This paper proposes a customized image narrative generation task, in which the user narrates an image by answering the questions the system poses.
[Figure 4: Questions that allow for multiple responses are generated to reflect user’s interest and corresponding regions proceed to image narrative generation process.]
2018 Differential Attention for Visual Question Answering

Patro B, Namboodiri V P. Differential attention for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7680-7688.


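Therefore, the paper proposes obtaining a differential attention region from one or more supporting and opposing exemplars.

In exemplar theory from cognitive science, an individual compares a new stimulus with known instances in memory and finds an answer based on those exemplars. The aim of this paper is to obtain attention through an exemplar model. Its premise is that the difference between the semantically nearest exemplar and a semantically distant exemplar can guide attention to a specific image region.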
[Figure 2. Differential Attention Network]


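Obtain a reference attention embedding from the input image and question;
Use either the differential attention network (DAN) to refine the attention or the differential context network to obtain a differential context feature; both approaches improve the correlation between the model's attention and human attention;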

2018 Don’t Just Assume; Look and Answer Overcoming Priors for Visual Question Answering

Agrawal A, Batra D, Parikh D, et al. Don’t just assume; look and answer: Overcoming priors for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4971-4980.



This paper proposes the Grounded Visual Question Answering (GVQA) model.
[Figure 3: The proposed Grounded Visual Question Answering (GVQA) model.]

The Visual Concept Classifier (VCC) is always active, whereas the Answer Cluster Predictor (ACP) and the Concept Extractor (CE) are mutually exclusive: only one of them runs for a given question. The Answer Predictor (AP) and the Visual Verifier (VV) are likewise mutually exclusive.



2018 DVQA Understanding Data Visualizations via Question Answering

Kafle K, Price B, Cohen S, et al. DVQA: Understanding data visualizations via question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5648-5656.

This paper proposes Data Visualizations Question Answering (DVQA) and releases the DVQA dataset, which tests many aspects of bar chart understanding within a question answering framework.
[Figure 4: Overview of our Multi-Output Model (MOM) for DVQA. MOM uses two sub-networks: 1) classification sub-network that is responsible for generic answers, and 2) OCR sub-network that is responsible for chart-specific answers.]

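The paper's Multi-Output Model (MOM) for DVQA contains two sub-networks: a classification sub-network responsible for generic answers, and an OCR sub-network responsible for chart-specific answers.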


2018 Embodied Question Answering

Das A, Datta S, Gkioxari G, et al. Embodied question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018: 2054-2063.

[Figure 1: Embodied Question Answering – EmbodiedQA– tasks agents with navigating rich 3D environments in order to answer questions. These agents must jointly learn language understanding, visual reasoning, and goal-driven navigation to succeed.]

The embodiment hypothesis is the idea that intelligence emerges in the interaction of an agent with an environment and as a result of sensorimotor activity.

-Smith and Gasser, “The development of embodied cognition: six lessons from babies.,” Artificial life, vol. 11, no. 1-2, 2005.

2018 Focal Visual-Text Attention for Visual Question Answering

Liang J, Jiang L, Cao L, et al. Focal visual-text attention for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6135-6143.

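This paper studies VQA in real-life settings: questions are answered over a sequence of photos or a video (which makes it a form of video VQA) rather than over a single static image as in conventional VQA.

The paper proposes the Focal Visual-Text Attention (FVTA) network. FVTA addresses the problem of attending to the parts of the sequential data that are relevant to the question. Besides answering the question, the model can also provide the evidence that justifies its answer.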
[Figure 2. An overview of Focal Visual-Text Attention (FVTA) model. For visual-text embedding, we use a pre-trained convolutional neural network to embed the photos and pre-trained word vectors to embed the words. We use a bi-directional LSTM as the sequence encoder. All hidden states from the question and the context are used to calculate the FVTA tensor. Based on the FVTA attention, both question and the context are summarized into single vectors for the output layer to produce final answer. The output layer is used for multiple choice question classification. The text embedding of the answer choice is also used as the input. This input is not shown in the figure.]
2018 FVQA Fact-Based Visual Question Answering

Wang P, Wu Q, Shen C, et al. Fvqa: Fact-based visual question answering[J]. IEEE transactions on pattern analysis and machine intelligence, 2018, 40(10): 2413-2427.


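This paper presents the FVQA (Fact-based VQA) dataset, which contains questions that can only be answered with external information and thus supports research on deeper reasoning.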

[Fig. 3. An example of the reasoning process of the proposed VQA approach. The visual concepts (objects, scene, attributes) of the input image are extracted using trained models, which are further linked to the corresponding semantic entities in the knowledge base. The input question is first mapped to one of the query types using the LSTM model shown in Section 4.2. The types of key relationships, key visual concept and answer source can be determined accordingly. A specific query (see Section 4.3) is then performed to find all facts meeting the search conditions in KB. These facts are further matched to the keywords extracted from the question sentence. The fact with the highest matching score is selected and the answer is also obtained accordingly.]

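First, for collecting images and visual concepts, the paper samples 2190 images from the validation set of Microsoft COCO and the test set of ImageNet. Images in Microsoft COCO provide more contextual information, while ImageNet images have simpler content but cover more object categories (200 in ImageNet versus 80 in Microsoft COCO). On these 2190 images, 5826 questions covering 32 question types were created by human annotation.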
[The Relationships in Different Knowledge Bases Used for Generating Questions]


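DBpedia: structured knowledge extracted from Wikipedia through crowdsourcing. The paper uses the category relationships of visual concepts (concepts are linked to their categories and super-categories based on the SKOS Vocabulary).
ConceptNet: automatically generated from sentences of the Open Mind Common Sense (OMCS) project. It consists of several kinds of commonsense relations, such as UsedFor, CreatedBy, and IsA. The paper uses 11 of its common relations.
WebChild: automatically extracted from the Web; an overlooked repository of commonsense facts that covers comparative relations such as Faster, Bigger, and Heavier.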


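Selecting Concept: given an image and some visual concepts in it (objects, scenes, and actions), the annotator chooses one of the visual concepts;
Selecting Fact: once a visual concept is chosen, the system shows some facts related to it. The annotator chooses one fact that is correct and relevant to the image;
Asking Question and Giving Answer: based on the chosen visual concept and fact, the annotator poses a question and gives its answer. The question must require both the image information and the fact to answer, and the answer must be one of the two concepts in the chosen fact, i.e. either the visual concept chosen in the first step or the related concept in the fact chosen in the second step.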

2018 Improved Fusion of Visual and Language Representations by Dense Symmetric Co-attention for Visual Question Answering

Nguyen D K, Okatani T. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6087-6096.



[Figure 2: The internal structure of a single dense coattention layer of layer index l + 1.]

The attention layer designed in this paper is called the dense co-attention layer.

2018 IQA Visual Question Answering in Interactive Environments

Gordon D, Kembhavi A, Rastegari M, et al. Iqa: Visual question answering in interactive environments[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4089-4098.

This paper proposes the Interactive Question Answering (IQA) task, in which an agent must interact with a dynamic visual environment to answer questions.
[Figure 2. An overview of the Hierarchical Interactive Memory Network (HIMN)]
2018 iVQA Inverse Visual Question Answering

Liu F, Xiang T, Hospedales T M, et al. iVQA: Inverse visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 8611-8619.


[Figure 2. Overall architecture of the proposed iVQA model]
2018 Learning Answer Embeddings for Visual Question Answering

Hu H, Chao W L, Sha F. Learning answer embeddings for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5428-5436.




[Figure 1. Conceptual diagram of our approach. We learn two embedding functions to transform image question pair (i, q) and (possible) answer a into a joint embedding space. The distance (by inner products) between the embedded (i, q) and a is then measured and the closest a (in red) would be selected as the output answer.]
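The scoring step described in the caption can be sketched as follows. This is a minimal illustration that assumes linear embedding functions; the names `predict_answer`, `W_iq` and `W_a` are hypothetical, and the paper's actual embedding functions are learned networks.

```python
import numpy as np

def predict_answer(iq_feature, answer_features, W_iq, W_a):
    """Embed the (image, question) pair and each candidate answer, then pick the closest answer."""
    z = iq_feature @ W_iq            # (k,) embedding of the image-question pair
    A = answer_features @ W_a        # (n, k) embeddings of the n candidate answers
    scores = A @ z                   # (n,) inner products in the joint space
    return int(np.argmax(scores)), scores

# Toy example with identity embeddings (hypothetical values).
W = np.eye(2)
best, scores = predict_answer(np.array([1.0, 0.0]),
                              np.array([[0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]),
                              W, W)
```

Because the answers are embedded rather than treated as fixed class indices, the same scoring function can in principle rank answers that were not part of a fixed output layer.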

2018 Learning by Asking Questions

Misra I, Girshick R, Fergus R, et al. Learning by asking questions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 11-20.


[Figure 3: Our approach to the learning-by-asking setting for VQA. Given an image I, the agent generates a diverse set of questions using a question generator g. It then filters out “irrelevant” questions using a relevance model r to produce a list of question proposals. The agent then answers its own questions using the VQA model v. With these predicted answers and its self-knowledge of past performance, it selects one question from the proposals to be answered by the oracle. The oracle provides answer-level supervision from which the agent learns to ask informative questions in subsequent iterations.]



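Question Generator

Question Relevance

Question Answering Module: the agent answers its own questions with its own VQA model

Question Selection Module: based on the predicted answers and the agent's self-knowledge of its past performance, it selects one question from the question proposals to be answered by the oracle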


2018 Learning Conditioned Graph Structures for Interpretable Visual Question Answering

Norcliffe-Brown W, Vafeias S, Parisot S. Learning conditioned graph structures for interpretable visual question answering[C]//Advances in Neural Information Processing Systems. 2018: 8334-8343.

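Few existing studies build on high-level image representations, and few try to capture semantic and spatial relations. Most existing VQA work concentrates on designing new attention architectures rather than modeling the semantic relations between objects in the scene. Regarding typical scene graph generation work such as 【2017 Scene Graph Generation by Iterative Message Passing】 (a paper on automatic scene graph generation), this paper notes that a scene graph represents an image as a graph structure and can explicitly model the objects in the image and their interactions, which is why scene graphs have drawn attention in recent VQA research. On the other hand, the paper argues that existing scene graph approaches require substantial engineering, are specific to the image rather than to the question, are hard to transfer from synthetic scenes to real images, and are poorly interpretable.

This paper proposes a graph-based VQA approach that adds a graph learner module, which learns a question-specific graph representation of the input image. Specifically, the approach uses graph convolutions to learn image representations that capture the interactions relevant to the question.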

[Figure 2: Overview of the proposed model architecture. We model the VQA problem as a classification problem, where each answer from the training set is a class. The core of our method is the graph learner, which takes as input a question encoding, and a set of object bounding boxes with corresponding image features. The graph learner module learns a graph representation of the image that is conditioned on the question, and models the relevant interactions between objects in the scene. We use this graph representation to learn image features that are influenced by their relevant neighbours using graph convolutions, followed by max-pooling, element-wise product and fully connected layers.]

The core of the method is the graph learner.










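The resulting adjacency matrix is fully connected: this definition of the adjacency matrix, and of the edge set computed from it, places no constraint on the sparsity of the graph. Such a fully connected, dense edge set is not only computationally expensive but also unhelpful for VQA, because VQA needs to focus on the nodes relevant to the question. The graph structure learned through the adjacency matrix serves as the backbone of the graph convolution layers, so it should be filtered before the graph convolutions are computed, keeping the subset of nodes and edges relevant to the VQA task rather than the relations between all nodes. The paper therefore sparsifies the adjacency matrix by ranking its entries and keeping only the largest values.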
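A minimal sketch of this kind of sparsification, assuming the selection is done per node (per row) and that the number of retained neighbours `m` is a hyperparameter; the function name `sparsify_top_m` is hypothetical.

```python
import numpy as np

def sparsify_top_m(A, m):
    """Keep the m largest entries in each row of A and set the rest to zero."""
    n = A.shape[0]
    # Column indices of the m largest values in every row (order within the m is arbitrary).
    idx = np.argpartition(A, -m, axis=1)[:, -m:]
    mask = np.zeros(A.shape, dtype=bool)
    mask[np.arange(n)[:, None], idx] = True
    return np.where(mask, A, 0.0)

A = np.array([[0.9, 0.1, 0.5],
              [0.2, 0.8, 0.3],
              [0.4, 0.6, 0.7]])
A_sparse = sparsify_top_m(A, m=2)
# A_sparse is [[0.9, 0.0, 0.5], [0.0, 0.8, 0.3], [0.0, 0.6, 0.7]]
```

The sparse matrix then defines which neighbours each node aggregates over in the graph convolution layers.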

2018 Learning to Specialize with Knowledge Distillation for Visual Question Answering

Mun J, Lee K, Shin J, et al. Learning to specialize with knowledge distillation for visual question answering[C]//Advances in Neural Information Processing Systems. 2018: 8081-8091.

This paper studies knowledge distillation in VQA.

Visual Question Answering (VQA) is a notoriously challenging problem because it involves various heterogeneous tasks defined by questions within a unified framework. Learning specialized models for individual types of tasks is intuitively attracting but surprisingly difficult; it is not straightforward to outperform naïve independent ensemble approach. We present a principled algorithm to learn specialized models with knowledge distillation under a multiple choice learning (MCL) framework, where training examples are assigned dynamically to a subset of models for updating network parameters. The assigned and non-assigned models are learned to predict ground-truth answers and imitate their own base models before specialization, respectively. Our approach alleviates the limitation of data deficiency in existing MCL frameworks, and allows each model to learn its own specialized expertise without forgetting general knowledge. The proposed framework is model-agnostic and applicable to any tasks other than VQA, e.g., image classification with a large number of labels but few per-class examples, which is known to be difficult under existing MCL schemes. Our experimental results indeed demonstrate that our method outperforms other baselines for VQA and image classification.

[Figure 2: Overall framework of our multiple choice learning with knowledge distillation]
2018 Learning Visual Knowledge Memory Networks for Visual Question Answering

Su Z, Zhu C, Dong Y, et al. Learning visual knowledge memory networks for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7736-7745.


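Apparent objective: can be answered directly from the image;
Indiscernible objective: visual recognition is unclear, and common sense is needed as a constraint;
Invisible objective: cannot be answered from the visual content; requires induction or reasoning over external knowledge.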


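This paper proposes the visual knowledge memory network (VKMN), which seamlessly incorporates structured human knowledge and deep visual features into memory networks within an end-to-end learning framework.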


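A mechanism for integrating visual content with knowledge facts. VKMN realizes this by jointly embedding knowledge triples (subject, relation, target) and deep visual features into visual knowledge features.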


[Figure 2: Illustration of VKMN for the VQA task. Note that three replicated memory sub-blocks (different combination of s, r, t as key-part or value-part) are used to handle the ambiguity on which part of the knowledge triple is missing in the query question.] [Figure 3: Diagram of Visual Knowledge Memory Network based VQA system]
2018 Motion-Appearance Co-Memory Networks for Video Question Answering

Gao J, Ge R, Chen K, et al. Motion-appearance co-memory networks for video question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6576-6585.



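Based on the differences between video VQA and image VQA, the paper proposes a motion-appearance co-memory network for video QA. The network builds on the concept of the Dynamic Memory Network (DMN) and introduces new mechanisms:

Co-memory attention mechanism: generates attention from both motion and appearance cues;
Temporal conv-deconv network: generates multi-level contextual facts;
Dynamic fact ensemble method: dynamically constructs temporal representations for different questions.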

[Figure 4. Co-memory attention module extracts useful cues from both appearance and motion memories to generate attention g_t^a / g_t^b for motion and appearance separately. Dynamic fact ensemble takes the multi-layer contextual facts A_L / B_L and the attention scores g_t^a / g_t^b to construct proper facts A_L^{s/h} / B_L^{s/h}, which are encoded by an attention-based GRU. The final hidden state c_t^b / c_t^a of the GRU is used to update the memory m_t^b / m_t^a. The final output memory m_h is the concatenation of the motion and appearance memory, and it is used to generate answers.]
2018 Neural-Symbolic VQA Disentangling Reasoning from Vision and Language Understanding

Yi K, Wu J, Gan C, et al. Neural-symbolic vqa: Disentangling reasoning from vision and language understanding[C]//Advances in Neural Information Processing Systems. 2018: 1031-1042.

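This paper disentangles reasoning from visual and language understanding. It argues that knowledge reasoning need not be carried out inside the deep representations built for visual and language understanding: visual recognition and language understanding are handled by deep representation learning, while reasoning is handled by executing symbolic programs. The paper proposes the neural-symbolic visual question answering (NS-VQA) system, which first builds a structural scene representation from the image and a program trace from the question, and then executes the program on the scene representation to reason and obtain the answer.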



[Figure 2: Our model has three components: first, a scene parser (de-renderer) that segments an input image (a-b) and recovers a structural scene representation (c); second, a question parser (program generator) that converts a question in natural language (d) into a program (e); third, a program executor that runs the program on the structural scene representation to obtain the answer.]


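Reasoning experiments are mostly run on CLEVR, but CLEVR is a synthetic dataset made of cubes, cylinders and similar shapes that vary in relative position, color and appearance. Can NS-VQA generalize to real-world images? A Minecraft world may validate its generalization ability to some extent. The paper uses Minecraft to generate 10,000 game scenes, each containing 3 to 6 objects. When the number of programs reaches 500, accuracy reaches 87.3%.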
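The idea of running a symbolic program on a structural scene representation can be illustrated with a toy executor. This is only a sketch under assumptions: the scene format, the operation names (`filter`, `count`, `query`) and the `execute` function are hypothetical and much simpler than the program vocabulary used in the paper.

```python
# A structural scene representation: one attribute dictionary per object (hypothetical format).
scene = [
    {"shape": "cube", "color": "red", "size": "large"},
    {"shape": "cylinder", "color": "blue", "size": "small"},
    {"shape": "cube", "color": "blue", "size": "small"},
]

def execute(program, scene):
    """Run a list of (operation, *arguments) steps on the scene and return the final state."""
    state = scene
    for op, *args in program:
        if op == "filter":            # keep objects whose attribute equals the given value
            key, value = args
            state = [obj for obj in state if obj[key] == value]
        elif op == "count":           # number of objects currently selected
            state = len(state)
        elif op == "query":           # read an attribute of the single selected object
            (obj,) = state            # raises ValueError unless exactly one object remains
            state = obj[args[0]]
        else:
            raise ValueError(f"unknown operation: {op}")
    return state

# "How many blue things are there?"
count_blue = execute([("filter", "color", "blue"), ("count",)], scene)
# "What color is the cylinder?"
cylinder_color = execute([("filter", "shape", "cylinder"), ("query", "color")], scene)
```

Since every step is an explicit operation on explicit attributes, the intermediate states can be inspected, which is the source of the interpretability that symbolic execution offers.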
2018 Out of the Box Reasoning with Graph Convolution Nets for Factual Visual Question Answering

Narasimhan M, Lazebnik S, Schwing A. Out of the box: Reasoning with graph convolution nets for factual visual question answering[C]//Advances in Neural Information Processing Systems. 2018: 2654-2665.

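Looking at this sequential process, the paper argues that forming a local decision from one fact at a time is sub-optimal. It therefore builds an entity graph and uses a graph convolutional network (GCN) to consider all nodes jointly and infer the correct answer.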
[Figure 2: Outline of the proposed approach: Given an image and a question, we use a similarity scoring technique (1) to obtain relevant facts from the fact space. An LSTM (2) predicts the relation from the question to further reduce the set of relevant facts and its entities. An entity embedding is obtained by concatenating the visual concepts embedding of the image (3), the LSTM embedding of the question (4), and the LSTM embedding of the entity (5). Each entity forms a single node in the graph and the relations constitute the edges (6). A GCN followed by an MLP performs joint assessment (7) to predict the answer. Our approach is trained end-to-end.]


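Retrieve relevant facts from the fact space;
Obtain an entity embedding by concatenating the visual concepts embedding of the image, the LSTM embedding of the question, and the LSTM embedding of the entity;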

2018 Overcoming Language Priors in Visual Question Answering with Adversarial Regularization

Ramakrishnan S, Agrawal A, Lee S. Overcoming language priors in visual question answering with adversarial regularization[C]//Advances in Neural Information Processing Systems. 2018: 1541-1551.


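This paper proposes a regularization scheme for VQA. The scheme introduces a question-only model that takes only the VQA question as input and must therefore rely on language alone to make predictions. The question-only model is then set against the VQA model in an adversarial game, which discourages the VQA model from capturing language biases in its question encoding.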
[Figure 1: Given an arbitrary base VQA model (A), we introduce two regularizers. First, we build a question-only adversary (B) that takes the question embedding qi from the VQA model and is trained to output the correct answer from this information alone. For this network to succeed, qi must capture language biases from the dataset – the same biases that lead the base VQA model to ignore visual content. To reduce these biases, we set the base VQA model and the question-only adversary against each other, with the base VQA network modifying its question embedding to reduce question-only performance (shown here as gradient negation of the question-only model loss). Further, the question-only model allows estimation of the change in answer confidence given image (C), which we maximize explicitly.]
2018 Textbook Question Answering Under Instructor Guidance With Memory Networks

Li J, Su H, Zhu J, et al. Textbook question answering under instructor guidance with memory networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 3655-3663.



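This paper proposes the Instructor Guidance with Memory Networks (IGMN) method, which tackles the textbook question answering (TQA) task by finding contradictions between candidate answers and their corresponding context.

The paper builds a Contradiction Entity-Relationship Graph (CERG), which extends multimodal contradictions from the passage level to the essay level. The machine then acts as an instructor and extracts essay-level contradictions as the Guidance. A memory network then captures the information in the Guidance, and attention mechanisms reason jointly over the global features of the multimodal input.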

[Figure 2: Overall architecture of our proposed method, the Instructor Guidance with Memory Networks (IGMN). The lower part of the figure is the module of Instructor-Guided Knowledge Extraction (IGKE), which represents facts in the long essays and images with the Contradiction Entity-Relationship Graphs (CERGs). The upper part is the module of Answer Generation by Joint Reasoning (AGJR), which accesses the Guidance under a memory network and consequently generates answers by reasoning over the integrated latent facts accordingly by the attention mechanisms.]
2018 Tips and Tricks for Visual Question Answering Learnings from the 2017 Challenge

Teney D, Anderson P, He X, et al. Tips and tricks for visual question answering: Learnings from the 2017 challenge[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4223-4232.




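Use sigmoid outputs: allows multiple correct answers per question;
Use soft scores as ground truth targets: casts the task as regression of candidate answer scores rather than conventional classification;
Use gated tanh activations: as the activation function of the non-linear layers;
Use image features from bottom-up attention: provides region-specific features instead of a grid over conventional CNN feature maps;
Use pretrained representations of candidate answers: to initialize the weights of the output layer;
Use large mini-batches and smart shuffling of the training data during stochastic gradient descent (SGD) training.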
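The first two tips can be sketched together: with sigmoid outputs, each candidate answer gets an independent score, and soft targets in [0, 1] can be fitted with a binary cross-entropy loss. The sketch below is an illustration under assumptions; the function name `soft_score_bce` and the example target values are hypothetical, and the exact loss normalization in the paper may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_score_bce(logits, soft_targets, eps=1e-12):
    """Binary cross-entropy between per-answer sigmoid outputs and soft target scores.

    logits, soft_targets: arrays of shape (batch, num_candidate_answers).
    """
    p = sigmoid(logits)
    loss = -(soft_targets * np.log(p + eps) + (1.0 - soft_targets) * np.log(1.0 - p + eps))
    return loss.sum(axis=1).mean()     # sum over answers, average over the batch

# One question with three candidate answers; two of them are (partly) correct.
targets = np.array([[1.0, 0.0, 0.5]])
good_loss = soft_score_bce(np.array([[4.0, -4.0, 0.0]]), targets)
bad_loss = soft_score_bce(np.array([[-4.0, 4.0, 0.0]]), targets)
```

Unlike a softmax over answers, nothing here forces the scores to compete, so several answers can receive high scores for the same question.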

[Figure 2. Overview of the proposed model. A deep neural network implements a joint embedding of the input question and image, followed by a multi-label classifier over a fixed set of candidate answers. Gray numbers indicate the dimensions of the vector representations between layers. Yellow elements use learned parameters. The elements w represent linear layers, and w non-linear layers (gated tanh).]

2018 Two can play this Game Visual Dialog with Discriminative Question Generation

Jain U, Lazebnik S, Schwing A G. Two can play this game: visual dialog with discriminative question generation and answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5754-5763.

[Figure 3: Architecture of our model for selecting the best answer option from a set of 100 candidates. LSTM nets transform all sequential inputs to a fixed size representation. The combined representations of T −1 previous question-answer pairs are concatenated to obtain the final history representation. Multi-class cross-entropy loss is computed by comparing a one-hot ground truth vector (based on the correct option) to output probabilities of the answer options.]
2018 Visual Question Answering with Memory-Augmented Networks

Ma C, Shen C, Dick A, et al. Visual question answering with memory-augmented networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6975-6984.

This paper studies how memory-augmented neural networks can be used to build accurate VQA models that predict correctly even for answers that occur very rarely in the training set.

[Figure 3. Flowchart of the proposed algorithm. We use the last pooling layer of pre-trained CNNs to extract image features that encode spatial layout information. We employ bi-directional LSTMs to generate a fixed-length feature vector for each word. A co-attention mechanism attends to relevant image regions and textual words. We concatenate the attended image and question feature vectors and feed them into a memory-augmented network, which consists of a standard LSTM as controller and an augmented external memory. The controller LSTM determines when to write or read from the external memory. The memory-augmented network plays a key role in maintaining a long-term memory of scarce training data. We take the outputs of the memory-augmented network as final embedding for the image and question pair, and feed this embedding into a classifier to predicts answers.]



2018 Visual Question Generation as Dual Task of Visual Question Answering

Li Y, Duan N, Zhou B, et al. Visual question generation as dual task of visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6116-6124.

[Figure 1. Problem solving schemes of VQA (top) and VQG (bottom), both of which utilize the encoder-fusion-decoder pipeline with Q and A in inverse order. v, q and a respectively denote the encoded features of input image, question, and answer, while ˆa and ˆq represent the predicted answer/question features.]

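This paper proposes an end-to-end Invertible Question Answering Network (iQAN), which treats VQA and VQG as dual tasks. In iQAN, the proposed invertible bilinear fusion module and parameter sharing scheme allow one model to perform both VQA and its dual task VQG. During training, iQAN trains the VQA and VQG tasks jointly with the proposed dual regularizers (called Dual Training). At test time, the trained iQAN predicts a question when given an answer and predicts an answer when given a question. Through dual learning of VQA and VQG, iQAN gains a better understanding of the interactions among images, questions, and answers.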
[Figure 2. Overview of Invertible Question Answering Network (iQAN), which consists two components for VQA and VQG respectively. The upper component is MUTAN VQA component [3], and the lower component is its dual VQG model. Input questions and answers are encoded respectively by an RNN and a lookup table E_a into fixed-length features. With attention and MUTAN fusion module, predicted features are obtained. The predict features are used for obtaining output (by LSTM and W_a for questions and answers respectively). A duality and Q duality are duality regularizers to constrain the similarity between the answer and question representations in both models. Two components share the MUTAN and Attention Modules. (·)* denotes the dual form. E_a also shares parameters with W_a.]
2018 Visual Question Reasoning on General Dependency Tree

Cao Q, Liang X, Li B, et al. Visual question reasoning on general dependency tree[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7249-7257.

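This paper proposes the Adversarial Composition Modular Network (ACMN), which reasons effectively over a global context aligned across the image and language domains in diverse and unrestricted settings. ACMN contains two collaborative modules:

An adversarial attention module: extracts local visual evidence for each word parsed from the question;

A residual composition module: composes the evidence already mined.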

[Figure 1: Illustration of our Adversarial Composition Module Network (ACMN) that sequentially performs reasoning over a dependency tree parsed from the question. Conditioning on preceding word nodes, our ACMN alternatively mines visual evidence for nodes with modifier relations via an adversarial attention module and integrates features of child nodes of nodes with clausal predicate relation via a residual composition module.] [Figure 2: The modules in our ACMN: a) each ACMN module that is composed by an adversarial attention module and residual composition module; b) adversarial attention module; c) residual composition module. The blue arrows indicate the modifier relation and the yellow arrows represent the clausal predicate relation. Each node receives the output attention maps and the hidden features from its children, as well as the image feature and word encoding. The adversarial attention module is employed to generate a new attention map conditioned on image feature, word encoding and previous attended regions given by modifier-dependent children. The residual composition module is learned to evolve higher-level representation by integrating features of its children and local visual evidence.]
2018 VizWiz Grand Challenge Answering Visual Questions from Blind People

Gurari D, Li Q, Stangl A J, et al. Vizwiz grand challenge: Answering visual questions from blind people[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 3608-3617.


[Figure 1. Examples of visual questions asked by blind people and corresponding answers agreed upon by crowd workers. The examples include questions that both can be answered from the image (top row) and cannot be answered from the image (bottom row).]
2019 Answer Them All! Toward Universal Visual Question Answering Models

Shrestha R, Kafle K, Kanan C. Answer them all! toward universal visual question answering models[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 10472-10481.





[Figure 2: Our recurrent aggregation of multimodal embeddings network (RAMEN).]

This paper proposes the Recurrent Aggregation of Multimodal Embeddings Network (RAMEN).


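Early fusion of vision and language features: prior work shows that fusing visual and language features early helps compositional reasoning. RAMEN does this by concatenating the question features with the spatially localized visual features.
Learning bimodal embeddings via shared projections: the concatenated visual and question features are passed through a shared network that produces spatially localized bimodal embeddings. This stage helps the network learn the interrelationships between visual and textual features.
Recurrent aggregation of the learned bimodal embeddings: a bidirectional GRU (bi-GRU) aggregates the bimodal embeddings over the whole scene to capture interactions among them. The final forward and backward states essentially retain all the information needed to answer the question.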

2019 Cycle-Consistency for Robust Visual Question Answering

Shah M, Chen X, Rohrbach M, et al. Cycle-consistency for robust visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 6649-6658.




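The VQA-Rephrasings dataset is derived from VQA v2.0 and covers about 40,000 images; each of the corresponding 40,000 or so questions was manually rewritten into 3 differently worded rephrasings.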
[Figure 2. (a) Abstract representation of the proposed cycle-consistent training scheme: Given a triplet of image I, question Q, and ground truth answer A, a VQA model is a transformation F : (Q, I) ↦ A′ used to predict the answer A′. Similarly, a VQG model G : (A′, I) ↦ Q′ is used to generate a rephrasing Q′ of Q. The generated rephrasing Q′ is passed through F to obtain A′′ and consistency is enforced between Q and Q′ and between A′ and A′′. Image I is not shown for clarity. (b) Detailed architecture of our visual question generation module G. The predicted answer A′ and image I are embedded to a lower dimension using task specific encoders and the resulting feature maps are summed up with additive noise and fed to an LSTM to generate questions rephrasings Q′.]


(a) The regenerated question and answer should be consistent with the ground truth
(b) Architecture details of the visual question generation module

2019 Deep Modular Co-Attention Networks for Visual Question Answering

Yu Z, Yu J, Cui Y, et al. Deep Modular Co-Attention Networks for Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 6281-6290.



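This paper proposes the deep Modular Co-Attention Network (MCAN), which consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer jointly models the self-attention of questions and images and the question-guided attention of images, using a modular composition of two basic attention units.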
[Figure 4: Overall flowchart of the deep Modular Co-Attention Networks (MCAN). In the Deep Co-attention Learning stage, we have two alternative strategies for deep co-attention learning, namely stacking and encoder-decoder.]

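The dense co-attention mechanism models dense interactions between any image region and any question word, addressing the problem that insufficient cross-modal interaction prevents a model from correctly understanding the image-question relationship it needs to answer the question. Current dense co-attention models such as BAN and DCN can be cascaded in depth, yet they show no clear improvement over their shallow counterparts or over the coarse co-attention model MFH. The paper argues that the bottleneck of deep co-attention models is the lack of simultaneous dense self-attention modeling within each modality, for example word-to-word relations in the question and region-to-region relations in the image.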
[Figure 2: Two basic attention units with multi-head attention for different types of inputs. SA takes one group of input features X and output the attended features Z for X; GA takes two groups of input features X and Y and output the attended features Z for X guided by Y .] [Figure 3: Flowcharts of three MCA variants for VQA. (Y) and (X) denote the question and image features respectively.]


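Self-attention (SA) unit: models dense intra-modal interactions (word-to-word, region-to-region);

Guided-attention (GA) unit: models inter-modal interactions (word-to-region);
The Modular Co-Attention (MCA) layer is built by composing SA and GA units. MCA layers can be cascaded in depth, and multiple cascaded MCA layers form the deep MCAN model proposed in the paper.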
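How SA and GA units compose into a stackable layer can be sketched with plain scaled dot-product attention. This is a simplified illustration under assumptions: it uses a single head, omits the learned projections, feed-forward sublayers and layer normalization of the actual units, and the names `sa_unit`, `ga_unit` and `mca_layer` are hypothetical.

```python
import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention: one output row per query row."""
    d = queries.shape[1]
    weights = softmax_rows(queries @ keys.T / np.sqrt(d))
    return weights @ values

def sa_unit(X):
    # Self-attention: the features attend to themselves (intra-modal).
    return X + attention(X, X, X)

def ga_unit(X, Y):
    # Guided attention: X is re-weighted under the guidance of Y (inter-modal).
    return X + attention(X, Y, Y)

def mca_layer(X, Y):
    """X: image region features (m, d); Y: question word features (n, d)."""
    Y = sa_unit(Y)            # word-to-word
    X = sa_unit(X)            # region-to-region
    X = ga_unit(X, Y)         # regions guided by words
    return X, Y

rng = np.random.default_rng(0)
X0, Y0 = rng.normal(size=(6, 8)), rng.normal(size=(5, 8))
X, Y = X0, Y0
for _ in range(2):            # cascade two layers in depth
    X, Y = mca_layer(X, Y)
```

Because every unit maps its input to an output of the same shape, layers can be stacked to any depth without changing the interface.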

2019 Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering

Gao P, Jiang Z, You H, et al. Dynamic Fusion With Intra-and Inter-Modality Attention Flow for Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 6639-6648.


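This paper proposes a method for dynamically fusing multimodal features based on intra-modality and inter-modality information flow: the Dynamic Fusion with Intra- and Inter-modality Attention Flow (DFAF) framework, which fuses multimodal features efficiently in order to answer visual questions accurately. The method passes dynamic information between the visual and language modalities and robustly captures the high-level interactions between the language and vision domains, which substantially improves VQA performance. The proposed dynamic intra-modality attention flow, conditioned on the other modality, dynamically modulates the intra-modality attention of the target modality and is important for multimodal feature fusion.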
[Figure 1: Illustration of the proposed Dynamic Fusion with Intra- and Inter-modality Attention Flow (DFAF) for visual question answering. Each DFAF module contains one Inter-Modality Attention Flow and one of Intra-Modality Attention Flow Module. Stacking several blocks of DFAF can help the network gradually focus on important image regions, question words and the latent alignments.]


[Figure 2: Illustration of the proposed Dynamic Intra-Modality Attention Flow module. Only intra-modality attention flow within the visual modality conditioned on question are shown. By average pooling over question features, the conditional gating vector can be acquired for controlling the information flows among region features. Attention will focus on question related information flows. Row-wise softmax is applied to obtain the attention weight.]

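Dynamic Intra-Modality Attention Flow module:
The conditional gating vector, obtained by average pooling over the question features, controls the information flowing among region features, so that attention focuses on question-related information flows.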
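A minimal pure-Python sketch of question-conditioned gating on region-to-region attention follows. It makes several assumptions that are not in the notes: question and region features share the same width, the gate is a plain sigmoid of the pooled question vector (no learned layer), and the gate rescales queries and keys by `(1 + gate)` while leaving values untouched. The function name `gated_intra_attention` is hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def gated_intra_attention(regions, question_words):
    dim = len(regions[0])
    # Average pooling over the question word features.
    pooled = [sum(w[c] for w in question_words) / len(question_words)
              for c in range(dim)]
    # Conditional gating vector (assumed form: element-wise sigmoid).
    gate = [1.0 / (1.0 + math.exp(-v)) for v in pooled]
    # Gate the features used as queries and keys; values stay unchanged.
    gated = [[(1.0 + g) * v for g, v in zip(gate, r)] for r in regions]
    weights, updated = [], []
    for q in gated:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in gated]
        w = softmax(scores)  # row-wise softmax over all regions
        weights.append(w)
        updated.append([sum(wj * r[c] for wj, r in zip(w, regions))
                        for c in range(dim)])
    return updated, weights
```

With the same regions, a question whose pooled features emphasize the first feature channel shifts attention away from regions that are active only in the second channel, which is the intended effect of the gate.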

2019 Explicit Bias Discovery in Visual Question Answering Models

Manjunatha V, Saini N, Davis L S. Explicit Bias Discovery in Visual Question Answering Models[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 9562-9571.

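Existing VQA models over-learn the statistical biases in the data: they answer from statistical regularities in the data rather than from the visual content.

Using a simple rule mining algorithm, the paper discovers human-interpretable rules that offer a distinctive view into this model behavior.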
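To make the idea of rule mining concrete, here is a small sketch that counts how often a set of antecedent items (question phrases and visual words) co-occurs with an answer, and keeps rules with enough support and confidence. The thresholds, the function name `mine_rules`, and the toy data are all hypothetical; this is a generic support/confidence filter, not necessarily the algorithm used in the paper.

```python
from collections import Counter
from itertools import combinations

def mine_rules(examples, min_support=2, min_confidence=0.8, max_len=2):
    # examples: list of (set_of_antecedent_items, answer) pairs.
    ante_counts = Counter()
    rule_counts = Counter()
    for items, answer in examples:
        for size in range(1, max_len + 1):
            for ante in combinations(sorted(items), size):
                ante_counts[ante] += 1
                rule_counts[(ante, answer)] += 1
    rules = []
    for (ante, answer), n in rule_counts.items():
        confidence = n / ante_counts[ante]
        if n >= min_support and confidence >= min_confidence:
            rules.append((ante, answer, n, confidence))
    # Highest confidence first, then highest support.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0]))

# Toy examples (made up for illustration only).
examples = [
    ({"what time", "kite"}, "afternoon"),
    ({"what time", "kite"}, "afternoon"),
    ({"what time", "streetlight"}, "night"),
    ({"what time", "streetlight"}, "night"),
    ({"what time", "clock"}, "3:15"),
]
rules = mine_rules(examples)
```

Each returned tuple reads as "antecedent implies answer", together with its support count and confidence. In the toy data the phrase "what time" alone yields no rule, because no single answer dominates it.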
[Figure 1. On the left, we show examples of two questions from the VQA dataset of [6, 18] where a model would require a “skill” to answer correctly (such as telling the time, or reading the English language), and a third which can be answered using statistical biases in the data itself. On the right, we show examples of statistical biases for a set of questions containing the phrase “What time?” and various visual elements (antecedents). Note that each row in this figure represents multiple questions in the VQA validation set. The * next to the answer (or consequent) reminds us that it is from the set of answer words. There are several visual words associated with afternoon and night, but we have provided only two for brevity.]


2019 Generating Question Relevant Captions to Aid Visual Question Answering

Jialin Wu, Zeyuan Hu, Raymond J. Mooney. Generating Question Relevant Captions to Aid Visual Question Answering[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 3585–3594.

[Figure 2: Overall structure of our model that generates question-relevant captions to aid VQA. Our model is first trained to generate question-relevant captions as determined in an online fashion in phase 1. Then, the VQA model is fine-tuned with generated captions from the first phase to predict answers. ⊗ denotes element-wise multiplication and ⊕ denotes element-wise addition. Blue arrows denote fully-connected layers (fc) and yellow arrows denote attention embedding.]


In phase 2, the VQA model is fine-tuned with the question-relevant captions generated in phase 1 and predicts the answer.

2019 Improving Visual Question Answering by Referring to Generated Paragraph Captions

Hyounghun Kim, Mohit Bansal. Improving Visual Question Answering by Referring to Generated Paragraph Captions[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 3606-3612.

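The paper argues that paragraph-style image captions carry more information than single-sentence captions. A single-sentence caption gives only a general description of the image, whereas a paragraph caption can describe different aspects of the image, more abstract information, more easily understood information, and symbolic information, all of which complement the semantics the image itself can express.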
[Figure 1: VTQA Architecture: Early, Late, and Later Fusion between the Vision and Paragraph Features.]

The paper proposes the Visual and Textual Question Answering (VTQA) model. Given a question, the model outputs an answer based on the image and its paragraph caption.


Early Fusion: this stage fuses the visual features with the paragraph caption and object properties features.
Late Fusion: this stage consolidates the logits output by each module into a single vector.

2019 Information Maximizing Visual Question Generation

Krishna R, Bernstein M, Fei-Fei L. Information Maximizing Visual Question Generation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 2008-2018.





2019 Multi-grained Attention with Object-level Grounding for Visual Question Answering

Pingping Huang, Jianhui Huang, Yuqing Guo, Min Qiao, Yong Zhu. Multi-grained Attention with Object-level Grounding for Visual Question Answering[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 3595–3600.



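Mainstream VQA systems are deep neural networks trained end to end: the question and the image are each encoded into representation vectors, the multimodal features are fused into a unified representation, and the answer is predicted from it. Finding the image regions most relevant to the question is crucial. The dominant approach is the attention mechanism, where a spatial attention distribution indicates the visually attended locations.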
[Figure 2: The architecture of our proposed model. The enhanced modules are illustrated in dot lines.]

Word-Label Attention
Word-Object Attention
Sentence-Object Attention
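The three attention weights WL, WO and SO are summed to give multi-grained attention weights over the object features. These weights are used to take a weighted average of the object features, yielding an attended object feature vector that serves as the visual representation. It is finally combined with the sentence embedding into a fused feature used for VQA answer classification.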
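The summation and weighted average described above can be sketched in a few lines of Python. The function names are hypothetical, and combining the attended vector with the sentence embedding by concatenation is an assumption made here only for illustration; the notes do not say which combination operator is used.

```python
def multi_grained_attend(wl, wo, so, object_features):
    # Sum the three per-object attention weights.
    combined = [a + b + c for a, b, c in zip(wl, wo, so)]
    total = sum(combined)
    weights = [w / total for w in combined]  # renormalize to sum to 1
    dim = len(object_features[0])
    # Weighted average of the object features.
    attended = [sum(w * f[c] for w, f in zip(weights, object_features))
                for c in range(dim)]
    return weights, attended

def fuse(attended, sentence_embedding):
    # Assumed fusion: plain concatenation of the two vectors.
    return list(attended) + list(sentence_embedding)

wl = [0.7, 0.2, 0.1]   # word-label attention over 3 objects
wo = [0.6, 0.3, 0.1]   # word-object attention
so = [0.5, 0.4, 0.1]   # sentence-object attention
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, attended = multi_grained_attend(wl, wo, so, objects)
fused = fuse(attended, [0.3, 0.3, 0.3])
```

The fused vector would then be fed to an answer classifier.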

2019 MUREL Multimodal Relational Reasoning for Visual Question Answering

Cadene R, Ben-Younes H, Cord M, et al. Murel: Multimodal relational reasoning for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 1989-1998.



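MuRel cell: an atomic reasoning primitive that represents the interactions between the question and the image regions through a rich vector representation, and models the relations between regions through pairwise combinations.
MuRel network: progressively refines the visual and question interactions, and can define visualization schemes finer than mere attention maps.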

[Figure 2. MuRel cell. In the MuRel cell, the bilinear fusion represents rich and fine-grained interactions between question and region vectors q and si. All the resulting multimodal vectors mi pass through a pairwise modeling block to provide a context-aware embedding xi per region. The cell’s output ˆsi is finally computed as a sum between si and xi, acting as residual function of si.]

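MuRel cell

The inputs of the MuRel cell are the question vector q and the region vectors s_i (together with the corresponding bounding box coordinates).
A bilinear fusion of q with each s_i yields a local multimodal embedding m_i for each region;
A Pairwise Relational Modeling component then produces, from each fused region vector m_i, a context-aware embedding x_i for that region. The cell's output ŝ_i is the sum of s_i and x_i, so the cell acts as a residual function of s_i.
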
[Figure 3. MuRel network. The MuRel network merges the question embedding q into spatially-grounded visual representations {vi} by iterating through a single MuRel cell. This module takes as input a set of localized vectors {si} and updates their representation using a multimodal fusion component. Moreover, it models all the possible pairwise relations between regions by combining spatial and semantic information. To construct the importance map at step t, we count the number of time each region provides the maximal value of max_i{s_i^t} (over the 2048 dimensions).]
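Two pieces of the captions above are easy to illustrate: the residual output of the cell (the sum of s_i and x_i) and the importance map, which counts how often each region provides the maximum over feature dimensions. The sketch below uses hypothetical function names and tiny made-up vectors in place of the 2048-dimensional features; breaking ties in favour of the first region is an assumption.

```python
def residual_update(s, x):
    # Cell output: element-wise sum of each region vector s_i
    # and its context-aware embedding x_i.
    return [[a + b for a, b in zip(si, xi)] for si, xi in zip(s, x)]

def importance_map(regions):
    # For every feature dimension, find the region with the maximal value
    # and count one "win" for it (ties go to the first such region).
    counts = [0] * len(regions)
    for dim in range(len(regions[0])):
        column = [r[dim] for r in regions]
        counts[column.index(max(column))] += 1
    return counts

s = [[0.4, 0.1, 0.5], [0.2, 0.3, 0.1]]
x = [[0.5, 0.0, 0.3], [0.0, 0.4, 0.2]]
s_hat = residual_update(s, x)
wins = importance_map(s_hat)
```

Here the first region wins two of the three dimensions and the second region wins one, so the first region would be highlighted more strongly in the map.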
2019 OK-VQA A Visual Question Answering Benchmark Requiring External Knowledge

Marino K, Rastegari M, Farhadi A, et al. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 3195-3204.



[Figure 2: Dataset examples. Some example questions and their corresponding images and answers have been shown. We show one example question for each knowledge category.]

2019 Psycholinguistics Meets Continual Learning Measuring Catastrophic Forgetting in Visual Question Answering

Claudio Greco, Barbara Plank, Raquel Fernández, Raffaella Bernardi. Psycholinguistics Meets Continual Learning Measuring Catastrophic Forgetting in Visual Question Answering[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 3601-3605.

Psycholinguistics meets continual learning?
[Figure 1: Overview of our linguistically-informed CL setup for VQA.]

Specifically, the paper evaluates and analyzes the problem of dramatic forgetting (or catastrophic forgetting) in VQA.
2019 Textbook Question Answering with Multi-modal Context Graph Understanding and Self-supervised Open-set Comprehension

Daesik Kim, Seonhoon Kim, Nojun Kwak. Textbook Question Answering with Multi-modal Context Graph Understanding and Self-supervised Open-set Comprehension[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 3568-3584.



[Figure 3: Overall framework of our model: (a) The preparation step for the k-th answer among n candidates. The context m is determined by TF-IDF score with the question and the k-th answer. Then, the context m is converted to a context graph m. The question and the k-th answer are also embedded by GloVe and character embedding. This step is repeated for n candidates. (b) The embedding step uses RNNC as a sequence embedding module and f-GCN as a graph embedding module. With attention methods, we can obtain combined features. After concatenation, RNNS and the fully connected module predict final distribution in the solving step.]
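The preparation step in the caption above selects a context by TF-IDF score against the question and a candidate answer. A minimal sketch of that kind of selection is shown below. Whitespace tokenization, the smoothed IDF formula, merging question and answer into one query, and the function name `select_context` are all assumptions for illustration; the paragraphs are assumed to be non-empty.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def select_context(paragraphs, question, answer):
    docs = [tokenize(p) for p in paragraphs]
    n = len(docs)
    # Document frequency of every term.
    df = Counter(t for d in docs for t in set(d))
    query = set(tokenize(question) + tokenize(answer))

    def score(doc):
        tf = Counter(doc)
        # Sum of TF * smoothed IDF over the query terms found in this doc.
        return sum((tf[t] / len(doc)) * math.log((1 + n) / (1 + df[t]))
                   for t in query if t in tf)

    scores = [score(d) for d in docs]
    best = max(range(n), key=scores.__getitem__)
    return best, scores

paragraphs = [
    "the heart pumps blood through the body",
    "plants use sunlight to make food",
    "the moon orbits the earth",
]
best, scores = select_context(paragraphs, "what pumps blood", "heart")
```

In the full pipeline this selection would be repeated once per candidate answer, and the chosen paragraph would then be converted into a context graph.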


2019 Transfer Learning via Unsupervised Task Discovery for Visual Question Answering

Noh H, Kim T, Mun J, et al. Transfer Learning via Unsupervised Task Discovery for Visual Question Answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 8385-8394.

[Figure 2. Overview of the proposed algorithm. (a) Unsupervised task discovery samples a task specification for a sampled visual data (a, I, b), where I, b and a are an image, a bounding box and an label (answer), respectively. It leverages linguistic knowledge sources such as visual description d and WordNet. (b) A visual data with a task specification, denoted by (a, I, b, t), is employed to pretrain a task conditional visual classifier. (c) The pretrained task conditional visual classifier is transferred to VQA and the parameters are frozen. Attention layer and question encoder are learned from scratch with VQA dataset. The terms label and answer are used interchangeably.]
2019 Visual Question Answering as Reading Comprehension

Li H, Wang P, Shen C, et al. Visual Question Answering as Reading Comprehension[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 6319-6328.


[Figure 1 – Comparison between VQA and TQA. Question1 is observation based, which can be inferred from the image itself. Question2 is knowledge based, which has to refer knowledge beyond the image. Extra knowledge commonly appears in text, which is easier to be combined to the context paragraph in TQA.]

