Paper title: "Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering"
Citation: Vahid Kazemi and Ali Elqursh. "Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering." arXiv preprint arXiv:1704.03162, 2017.
This paper presents a new baseline for visual question answering task. Given an image and a question in natural language, our model produces accurate answers according to the content of the image. Our model, while being architecturally simple and relatively small in terms of trainable parameters, sets a new state of the art on both unbalanced and balanced VQA benchmark. On VQA 1.0 [2] open ended challenge, our model achieves 64.6% accuracy on the test-standard set without using additional data, an improvement of 0.4% over state of the art, and on newly released VQA 2.0 [8], our model scores 59.7% on validation set outperforming best previously reported results by 0.5%. The results presented in this paper are especially interesting because very similar models have been tried before [32] but significantly lower performance were reported. In light of the new results we hope to see more meaningful research on visual question answering in the future.
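This paper proposes a new baseline method for VQA. On the VQA 1.0 dataset it reaches 64.6% accuracy, 0.4 percentage points above the previous best method. On the VQA 2.0 dataset it reaches 59.7%, 0.5 percentage points above the previous best model. The authors also point out that the model is very similar to SAN (see my earlier notes on SAN: [Paper reading] SAN: a VQA network with two-layer attention (T. Do et al., arXiv, 2015, code available)), yet it performs better than SAN.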
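In the experiments, dropout is set to 0.5, the optimizer is Adam, the batch size is 128, and training runs for 100K steps. The initial learning rate is 0.001 and begins to decay after 50K steps. The figure below shows the experimental results for the following settings: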
• No l2 norm: ResNet features are not l2 normalized.
• No dropout on FC/Conv: Dropout is not applied to the inputs of fully connected and convolution layers.
• No dropout on LSTM: Dropout is not applied to the inputs of LSTM layers.
• No attention: Instead of using soft-attention we perform average spatial pooling before feeding image features to the classifier.
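• Sampled loss: Instead of averaging the log-likelihood of correct answers we sample one answer at a time. In other words, only one correct answer is used at a time (in the default setting each question comes with 10 reference answers, and the predicted answer is scored against all 10).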
• With positional features: Image features φ are augmented with x and y coordinates of each cell along the depth dimension producing a tensor of size 14 × 14 × 2050.
• Bidirectional LSTM: We use a bidirectional LSTM to encode the question.
• Word embedding size: We try word embeddings of different sizes including 100, 300 (default), and 500.
• LSTM state size: We explore different configurations of LSTM state sizes; these include a one-layer LSTM of size 512, 1024 (default), and 2048, or a stacked two-layer LSTM of size 1024.
• Attention size: Different attention configurations are explored. The first number indicates the size of the first convolution layer and the second number indicates the number of attention glimpses.
• Classifier size: By default the classifier G consists of a fully connected layer of size 1024 with ReLU nonlinearity, followed by an M = 3000 dimensional linear layer, followed by softmax. We explore shallower, deeper, and wider alternatives.
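
To make the image-side settings in the list above more concrete, here is a minimal NumPy sketch of l2 normalization, positional features, soft attention with several glimpses, and the average-pooling alternative. It is an illustration under assumptions, not the authors' implementation: the function names are made up for this note, the coordinates are scaled to [0, 1] (the list above does not say how they are encoded), and the attention logits are random stand-ins for what the model's convolution layers would produce from the image features and the question encoding.

```python
import numpy as np

def l2_normalize(phi, eps=1e-12):
    """Scale the feature vector of every cell to unit l2 norm."""
    norm = np.linalg.norm(phi, axis=-1, keepdims=True)
    return phi / np.maximum(norm, eps)

def add_positional_features(phi):
    """Append the x and y coordinate of each cell along the depth dimension.
    Assumption: coordinates are scaled to [0, 1]."""
    h, w, _ = phi.shape
    ys, xs = np.meshgrid(np.linspace(0.0, 1.0, h),
                         np.linspace(0.0, 1.0, w), indexing="ij")
    return np.concatenate([phi, xs[..., None], ys[..., None]], axis=-1)

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(phi, logits):
    """Soft attention: one weighted average over all cells per glimpse.
    phi: (H, W, D), logits: (H, W, G) -> concatenated glimpses of size G * D."""
    h, w, d = phi.shape
    weights = softmax(logits.reshape(h * w, -1), axis=0)  # each glimpse sums to 1
    glimpses = weights.T @ phi.reshape(h * w, d)          # (G, D)
    return glimpses.reshape(-1)

def average_pool(phi):
    """'No attention' variant: plain spatial average of the cells."""
    return phi.mean(axis=(0, 1))

rng = np.random.default_rng(0)
phi = l2_normalize(rng.standard_normal((14, 14, 2048)))  # stand-in ResNet features
logits = rng.standard_normal((14, 14, 2))                # stand-in logits, 2 glimpses

print(add_positional_features(phi).shape)  # (14, 14, 2050)
print(attend(phi, logits).shape)           # (4096,)
print(average_pool(phi).shape)             # (2048,)
```

One way to read the "No attention" setting: if every cell receives the same attention logit, the softmax weights are uniform and a single glimpse reduces to average spatial pooling, so the ablation measures how much the learned, question-dependent weighting contributes.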