VQA Paper Notes: Learning Conditioned Graph Structures for Interpretable Visual Question Answering

1. Motivation

The authors argue that:
1. Existing graph-based VQA methods are hand-crafted for a specific setting and do not extend from abstract scenes to real images.
2. They do not incorporate the question information into the graph construction.
3. They do not expose how the answer is obtained (i.e. they are not interpretable).

2. Contributions

1. A new, interpretable VQA method based on graph convolutional networks.
The graph nodes are the bounding boxes extracted as image features; the edges encode how strongly the corresponding objects are related (the stronger the relation, the thicker the edge in the visualisation).
The edge weights are learned with the question injected as prior knowledge.
2. Model interpretability
The learned bounding boxes and edge strengths can be drawn directly on the image, exposing how the model arrives at its answer.
3. Experimental results
66.18% accuracy on the VQA v2 dataset.

3. Network Architecture

[Figure 1: overall network architecture]

1. We develop a deep neural network that combines spatial, image and textual features in a novel manner in order to answer a question about an image.
2. Our graph learning module then learns an adjacency matrix of the image objects that is conditioned on a given question.
3. The spatial graph convolutions focus not only on the objects but also on the object relationships that are the most relevant to the question.

4. Method Steps

Step 1. Computing the model inputs
1. Embedding images and questions (a sketch of the question encoder follows below)
Images --(object detector)--> visual features (one bounding box per object) --> embedding for each bounding box (the mean of the corresponding area of the convolutional feature map)
Questions --(pre-trained word embeddings, e.g. GloVe)--> variable-length sequence of embeddings --(RNN (GRU) encoder)--> single question embedding q
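The following is a minimal sketch of the question-encoding path, assuming PyTorch and a GRU over pre-trained GloVe vectors; the module name QuestionEncoder and the hidden size are illustrative, not the authors' released code.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Encode a tokenised question into a single embedding q (illustrative sketch)."""
    def __init__(self, glove_weights: torch.Tensor, hidden_dim: int = 1024):
        super().__init__()
        # glove_weights: (vocab_size, 300) matrix of pre-trained GloVe vectors
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.gru = nn.GRU(glove_weights.size(1), hidden_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) word indices
        w = self.embed(token_ids)     # (batch, seq_len, 300)
        _, h_last = self.gru(w)       # h_last: (1, batch, hidden_dim)
        return h_last.squeeze(0)      # q: (batch, hidden_dim)
```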

Step 2. Graph learner (generates a graph structure over the input image, conditioned on the question)
1. Overview of the graph learner — learn, conditioned on the question, the neighbours most relevant to each node
[Figure: overview of the graph learner — the image is represented as a graph G = (V, E) whose adjacency matrix A is learned]

V: the graph nodes (the bounding boxes from the image features)
E: the edges between the graph nodes
A: the adjacency matrix learned by the graph learner
2. Joint embedding
To capture both the similarity between feature vectors and their relevance to the question, each object feature vector is concatenated with the question embedding, giving the joint embedding below.
[Equation: joint embedding — each object feature vector concatenated with the question embedding q]

Each joint embedding is then passed through a learned transformation to give a vector e_i; stacking these vectors into a matrix E defines an adjacency matrix with self-loops, A = EE^T, i.e. a_ij = e_i^T e_j (see the sketch below).
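A minimal sketch of the joint embedding and adjacency computation, assuming PyTorch; the module name GraphLearner and the choice of a single ReLU-activated linear layer as the learned transformation are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphLearner(nn.Module):
    """Learn a question-conditioned adjacency matrix A = E E^T (illustrative sketch)."""
    def __init__(self, obj_dim: int, q_dim: int, embed_dim: int = 512):
        super().__init__()
        # Learned transformation applied to [v_n || q]; a single layer is an assumption.
        self.project = nn.Linear(obj_dim + q_dim, embed_dim)

    def forward(self, obj_feats: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, N, obj_dim), q: (batch, q_dim)
        q_rep = q.unsqueeze(1).expand(-1, obj_feats.size(1), -1)  # (batch, N, q_dim)
        joint = torch.cat([obj_feats, q_rep], dim=-1)             # [v_n || q]
        e = F.relu(self.project(joint))                           # (batch, N, embed_dim)
        adj = torch.bmm(e, e.transpose(1, 2))                     # A = E E^T, a_ij = e_i^T e_j
        return adj
```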
3. Neighbourhood system (for each node in the graph, learn the top-m neighbours most relevant to it, conditioned on the question prior)
Ranking strategy: shown below
[Equation: ranking strategy — each node's neighbourhood keeps the m nodes with the largest adjacency values a_ij]
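A sketch of the top-m neighbourhood selection, under the same assumptions as above; masking the adjacency with torch.topk is one plausible realisation of the ranking strategy.

```python
import torch

def topm_neighbourhoods(adj: torch.Tensor, m: int) -> torch.Tensor:
    """Keep, for every node, only its m strongest edges (illustrative sketch).

    adj: (batch, N, N) learned adjacency matrix.
    Returns a sparsified adjacency of the same shape.
    """
    _, indices = adj.topk(m, dim=-1)                      # indices of the top-m entries per row
    mask = torch.zeros_like(adj).scatter_(-1, indices, 1.0)
    return adj * mask                                     # edges outside the top-m are dropped
```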

Step 3. Spatial graph convolutions (add spatial information on top of the question-specific graph structure obtained in Step 2)
1. Coordinate system
Prior work: for each vertex i, a coordinate system centred at i, with u(i, j) being the coordinates of vertex j in that system.
This paper: the function u(i, j) returns a polar coordinate vector (ρ, θ) describing the relative spatial positions of the centres of the bounding boxes associated with vertices i and j (see the sketch below).
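A small sketch of one way to compute u(i, j) from bounding-box centres; the (x1, y1, x2, y2) box format is an assumption.

```python
import torch

def pseudo_coords(boxes: torch.Tensor) -> torch.Tensor:
    """Pairwise polar coordinates (rho, theta) between bounding-box centres (sketch).

    boxes: (N, 4) tensor in (x1, y1, x2, y2) format -- an assumed convention.
    Returns: (N, N, 2) tensor where entry (i, j) is u(i, j) = (rho, theta).
    """
    centres = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                           (boxes[:, 1] + boxes[:, 3]) / 2], dim=-1)  # (N, 2)
    delta = centres.unsqueeze(0) - centres.unsqueeze(1)               # delta[i, j] = centre_j - centre_i
    rho = delta.norm(dim=-1)                                          # radial distance
    theta = torch.atan2(delta[..., 1], delta[..., 0])                 # angle
    return torch.stack([rho, theta], dim=-1)
```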
2.patch operator
Prior work:
[Equation: prior (MoNet-style) patch operator — neighbour features f(j) aggregated with learnable Gaussian kernel weights w_k(u(i, j))]

This paper:
[Equation: this paper's patch operator — the same kernel-weighted aggregation, additionally weighted by the learned edges from Step 2]

3. Output

The output for each vertex is obtained by linearly transforming its K kernel responses and concatenating them, h(i) = ‖_{k=1..K} G_k f_k(i), where each G_k ∈ R^((d_h/K) × d_v) is a matrix of learnable weights (the convolution filters), with d_h the chosen dimensionality of the output convolved features. This results in a convolved graph representation H ∈ R^(N × d_h).
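Putting Step 3 together, the sketch below implements a MoNet-style spatial graph convolution whose neighbour contributions are also weighted by the learned adjacency from Step 2; the number of kernels and the diagonal-Gaussian parameterisation are assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """MoNet-style spatial graph convolution weighted by a learned adjacency (sketch)."""
    def __init__(self, in_dim: int, out_dim: int, n_kernels: int = 8):
        super().__init__()
        assert out_dim % n_kernels == 0
        self.n_kernels = n_kernels
        # Gaussian kernel parameters over 2-D pseudo-coordinates (rho, theta); diagonal covariance assumed.
        self.mu = nn.Parameter(torch.randn(n_kernels, 2))
        self.log_sigma = nn.Parameter(torch.zeros(n_kernels, 2))
        # One learnable filter G_k per kernel; concatenating the K outputs gives out_dim features.
        self.filters = nn.ModuleList(
            [nn.Linear(in_dim, out_dim // n_kernels, bias=False) for _ in range(n_kernels)]
        )

    def forward(self, feats, adj, coords):
        # feats: (N, in_dim), adj: (N, N) sparsified adjacency, coords: (N, N, 2) pseudo-coordinates u(i, j)
        diff = coords.unsqueeze(2) - self.mu.view(1, 1, self.n_kernels, 2)      # (N, N, K, 2)
        sigma = self.log_sigma.exp().view(1, 1, self.n_kernels, 2)
        w = torch.exp(-0.5 * (diff / sigma).pow(2).sum(-1))                     # Gaussian kernel weights
        w = w * adj.unsqueeze(-1)                                               # weight by learned edges
        outputs = []
        for k in range(self.n_kernels):
            f_k = w[..., k] @ feats                                             # patch operator output for kernel k
            outputs.append(self.filters[k](f_k))                                # apply G_k
        return torch.cat(outputs, dim=-1)                                       # H: (N, out_dim)
```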

Step 4. Prediction layer
1. The convolved graph representation H is max-pooled over the node dimension to give a global graph vector h_max.
2. h_max is combined with the question embedding q via an element-wise product.
3. The fused vector is passed through a 2-layer MLP (with ReLU) to compute the answer logits (a sketch follows below).
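A minimal sketch of the prediction head described above, assuming PyTorch; the projection that matches q to the pooled graph vector, the hidden size, and the module name are assumptions.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Max-pool the graph, fuse with the question, and predict answer logits (sketch)."""
    def __init__(self, d_h: int, q_dim: int, n_answers: int, hidden: int = 1024):
        super().__init__()
        self.fuse = nn.Linear(q_dim, d_h)     # assumed projection so q matches h_max's size
        self.mlp = nn.Sequential(
            nn.Linear(d_h, hidden), nn.ReLU(),
            nn.Linear(hidden, n_answers),
        )

    def forward(self, H: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # H: (batch, N, d_h) convolved graph representation, q: (batch, q_dim)
        h_max = H.max(dim=1).values           # max-pool over the node dimension
        fused = h_max * self.fuse(q)          # element-wise product with the question embedding
        return self.mlp(fused)                # answer logits
```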

Step 5. Loss function
[Equation: loss function]
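These notes do not reproduce the paper's loss. As a reference point, many VQA v2 models train with a multi-label binary cross-entropy against soft answer scores; the sketch below shows that standard loss as an assumption, which may differ from the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_vqa_loss(logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """Multi-label BCE against VQA soft answer scores (a common choice, assumed here).

    logits: (batch, n_answers) raw scores from the prediction head.
    soft_targets: (batch, n_answers) per-answer agreement scores in [0, 1].
    """
    return F.binary_cross_entropy_with_logits(logits, soft_targets)
```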

5. Takeaways

1. Questions often require answers that cannot be found among the predefined answers.
2. Scalar edge weights may not be able to capture the full complexity of the relationships between graph items.
