[Paper Notes] TRAR: Routing the Attention Spans in Transformer for Visual Question Answering

TRAR: Routing the Attention Spans in Transformer for Visual Question Answering

一、Background

With its superior ability to model global dependencies, the Transformer and its variants have become the primary architecture for many vision and language tasks. However, in tasks such as visual question answering (VQA) and referring expression comprehension (REC), multimodal prediction often requires visual information ranging from the macro to the micro level. Therefore, how to dynamically schedule global and local dependency modeling in the Transformer becomes an emerging problem.

二、Motivation

1) In some V&L tasks, such as visual question answering (VQA) and referring expression comprehension (REC), multimodal reasoning usually requires visual attention over different receptive fields. The model should not only understand the overall semantics but, more importantly, capture local relationships in order to give the correct answer.
2) In this paper, the authors propose a new lightweight routing scheme called Transformer Routing (TRAR), which enables automatic attention-span selection for each sample with negligible extra computation and memory overhead.

三、Model

(一)The framework of TRAR

[Figure: the overall framework of TRAR]

(二)Routing Process

  • To achieve per-example dynamic routing, the intuitive solution is a multi-branch network structure, with each layer equipped with modules of different settings. Specifically, given the features of the previous inference step X∈R^n×d and the routing space F=[F_0, …, F_N], the output of the next inference step X' is obtained as follows:
    X' = Σ_{i=0}^{N} α_i · F_i(X),   where α_i is the routing weight of the i-th module F_i predicted for the current example.

  • However, from the above equation, we can see that such a routing scheme will inevitably make the network very complicated and greatly increase the training cost.

  • The key to reducing this burden is to rethink how routing is defined. Revisit the definition of standard self-attention:
    SA(X) = A · V,   A = Softmax( (Q · K^T) / √d ),   Q = X·W_q,  K = X·W_k,  V = X·W_v

  • From this definition, SA can be regarded as the feature-update function of a fully connected graph, where A∈R^n×n acts as a weighted adjacency matrix. Therefore, to obtain features of different attention spans, we only need to restrict the graph connections of each input element. This can be achieved by multiplying the scaled dot-product by an adjacency mask D∈R^n×n, as shown below:

    SA(X, D) = Softmax( (Q · K^T) / √d ⊙ D ) · V

  • Based on the above equations, a routing layer for SA is then defined as:
    X' = Σ_{i=0}^{N} α_i · SA(X, D_i)

  • However, the above formulation is still computationally expensive. Therefore, the authors further simplify the module selection problem to the selection of the adjacency mask D, defined as follows (a code sketch of this mask-based routing is given at the end of this subsection):
    X' = SA(X, D̄) = Softmax( (Q · K^T) / √d ⊙ D̄ ) · V,   where D̄ = Σ_{i=0}^{N} α_i · D_i

  • In TRAR, each SA layer is equipped with a path controller to predict the probabilities of the routing options, i.e., the module (mask) selection. Specifically, given the input features X∈R^n×d, the path probabilities α∈R^(N+1) are defined as:
    α = Softmax( MLP( Pool(X) ) ),   where Pool(·) aggregates the n input features into a single vector before the small routing MLP.
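
To make the mask-based routing above concrete, below is a minimal PyTorch-style sketch of a single routed self-attention layer. It is only an illustration of the equations in this subsection, not the authors' official implementation: the single-head attention, the square-window routing space built by build_span_masks, the average pooling inside the path controller, and the element-wise application of the routed mask to the scaled dot-product are all simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_span_masks(grid=8, windows=(0, 3, 5, 7)):
    """Build one (n, n) binary adjacency mask per routing option for a
    grid x grid visual feature map (n = grid * grid).

    Window 0 denotes the fully connected (global) span; window w keeps only
    neighbours inside a w x w box around each position (local span).
    The concrete routing space chosen here is an assumption for illustration.
    """
    n = grid * grid
    coords = torch.stack(
        torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij"),
        dim=-1,
    ).view(n, 2)
    # Chebyshev distance between every pair of grid positions.
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(dim=-1)
    masks = []
    for w in windows:
        if w == 0:
            masks.append(torch.ones(n, n))            # global attention span
        else:
            masks.append((dist <= w // 2).float())    # local w x w attention span
    return torch.stack(masks)                          # (N+1, n, n)


class RoutingSelfAttention(nn.Module):
    """Single-head self-attention whose attention span is soft-routed over a
    set of pre-defined adjacency masks D_0, ..., D_N, following the equations
    above (sketch only, not the official TRAR code)."""

    def __init__(self, dim, num_masks):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Path controller: pooled features -> probabilities over the N+1 masks.
        self.router = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.ReLU(inplace=True),
            nn.Linear(dim // 2, num_masks),
        )

    def forward(self, x, masks):
        # x: (B, n, d) input features; masks: (N+1, n, n) binary span masks.
        B, n, d = x.shape

        # Path probabilities alpha in R^(N+1), one distribution per example.
        alpha = F.softmax(self.router(x.mean(dim=1)), dim=-1)       # (B, N+1)

        # Instead of running N+1 attention modules, fuse the masks first:
        # D_bar = sum_i alpha_i * D_i (the simplification described above).
        d_bar = torch.einsum("bi,ijk->bjk", alpha, masks)           # (B, n, n)

        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = q @ k.transpose(-2, -1) / d ** 0.5                 # (B, n, n)

        # Restrict the attention span by weighting the scaled dot-product with
        # the routed mask, as in the equation above. (Implementations may
        # instead add a large negative value at masked positions.)
        attn = F.softmax(logits * d_bar, dim=-1)
        return attn @ v


# Usage sketch (hypothetical sizes): an 8x8 visual feature grid, d = 512.
# masks = build_span_masks(grid=8)                       # (4, 64, 64)
# layer = RoutingSelfAttention(dim=512, num_masks=masks.size(0))
# out = layer(torch.randn(2, 64, 512), masks)            # (2, 64, 512)
```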

(三)Optimization

  • By applying the Softmax function, routing-path selection becomes a continuous and differentiable operation. The router and the entire network can then be optimized end-to-end with the task-specific objective arg min_{w,z} L_train(w, z). At test time, the features of different attention spans are combined dynamically. Since soft routing requires no additional parameters, training is relatively easy.

  • Hard routing performs binary path selection, which further allows dedicated CUDA kernels to accelerate model inference. However, discrete routing makes the router weights non-differentiable, and directly binarizing the soft-routing results may create a feature gap between training and testing. To solve this problem, the authors introduce the Gumbel-Softmax trick to implement differentiable hard routing:

α̂_i = exp( (log α_i + g_i) / τ ) / Σ_{j=0}^{N} exp( (log α_j + g_j) / τ ),   g_j ~ Gumbel(0, 1),

where τ is the temperature; as τ → 0 the routing vector α̂ approaches a one-hot (binary) selection.
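
As a reference, here is a minimal sketch of how the router's output can be switched between soft routing and Gumbel-Softmax hard routing. It is an illustrative snippet rather than the authors' code; the temperature value and the use of PyTorch's straight-through gumbel_softmax (hard=True) are assumptions.

```python
import torch
import torch.nn.functional as F


def route_probabilities(logits, training, tau=1.0, hard_routing=True):
    """Turn router logits of shape (B, N+1) into path weights alpha.

    Soft routing: a plain Softmax, fully differentiable.
    Hard routing: Gumbel-Softmax with the straight-through estimator, so the
    forward pass uses a one-hot (binary) path while gradients still flow back
    to the router, avoiding the train/test feature gap described above.
    """
    if not hard_routing:
        return F.softmax(logits, dim=-1)
    if training:
        # Adds Gumbel(0, 1) noise to the logits, divides by the temperature tau,
        # applies Softmax, and (with hard=True) discretizes to one-hot while
        # keeping the soft probabilities in the backward pass.
        return F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
    # At test time, simply select the most probable path (binary selection),
    # which permits span-specific kernels to speed up inference.
    index = logits.argmax(dim=-1, keepdim=True)
    return torch.zeros_like(logits).scatter_(-1, index, 1.0)


# Usage sketch with the (hypothetical) router from the previous snippet:
# alpha = route_probabilities(router(x.mean(dim=1)), training=True)
```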

四、Experiment

  • To validate the proposed TRAR, the authors apply it to visual question answering (VQA) and referring expression comprehension (REC), and conduct extensive experiments on five benchmark datasets: VQA2.0, CLEVR, RefCOCO, RefCOCO+, and RefCOCOg.

(一) Ablations

[Ablation study results from the paper]

(二) Comparison with SOTA

[Tables: comparisons with state-of-the-art methods on the benchmark datasets, from the paper]

(三) Qualitative Analysis

[Figures: qualitative examples from the paper]

五、Conclusion

  • In this paper, the authors examine dependency modeling for the Transformer in two vision-language tasks, namely VQA and REC. These tasks typically require visual attention over different receptive fields, which the standard Transformer cannot fully handle.
  • To this end, the authors propose a lightweight routing scheme called Transformer Routing (TRAR) that helps the model dynamically select an attention span for each sample. TRAR transforms the module selection problem into the selection of an attention mask, making the extra computation and memory overhead negligible.
  • To verify the effectiveness of TRAR, extensive experiments were conducted on five benchmark datasets, and the results confirm its superiority.
