Attention Mechanisms in NLP: A Summary of Attention Modules, State-of-the-Art Methods up to 2019 (Updating)


  • First proposed for NMT in 'Neural Machine Translation by Jointly Learning to Align and Translate' (Bahdanau et al., ICLR 2015), though the idea appeared earlier in the computer vision field
  • Computes the weights of a linear combination of encoder hidden states; the resulting weighted sum (context vector) makes up part of the decoder hidden state computation

Why we need an attention model:

    • The original NMT models failed to capture information in long sentences
    • RNN-based encoder-decoder structures were proposed to deal with the long-term memory problem
    • But the encoder-decoder still struggles with very long sentences, because a fixed-length hidden state cannot store unbounded information while the test paragraph can be very long (you can input 5000 words into Google Translate). Bahdanau et al. indicated that the computation of the context vector could be the bottleneck of generation performance
    • Besides, information in nearby sentences is not necessarily more important than information farther away, which means we need to take the relevance between words and sentences into account rather than only their distance
    • Attention was first proposed in image processing, and then for NMT in Bahdanau's 2014 paper
    • [1409.0473] Neural Machine Translation by Jointly Learning to Align and Translate https://arxiv.org/abs/1409.0473

Calculation of attention:

  • Alignment model (proposed by Bahdanau et al.)

    • The attention weight is a softmax over the alignment scores: $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$

  • where $e_{ij}$ is an alignment (scoring) function judging how well the output word $y_i$ matches the input $x_j$

  • the authors compute the alignment score with a weight matrix in a feedforward network (a code sketch of the full attention computation follows below)

    • We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system
    • $e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$
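To make this pipeline concrete, here is a minimal NumPy sketch (toy shapes and variable names are my own, not from the paper): score each encoder state with the feedforward alignment model, normalize with a softmax, and take the weighted sum as the context vector.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                      # numerical stability
    e = np.exp(x)
    return e / e.sum()

# Toy dimensions (illustrative only)
T_x, enc_dim, dec_dim, att_dim = 6, 8, 8, 10
rng = np.random.default_rng(0)

h = rng.normal(size=(T_x, enc_dim))      # encoder hidden states h_1..h_Tx
s_prev = rng.normal(size=dec_dim)        # previous decoder state s_{i-1}

# Feedforward alignment model: e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
W_a = rng.normal(size=(att_dim, dec_dim))
U_a = rng.normal(size=(att_dim, enc_dim))
v_a = rng.normal(size=att_dim)

e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in h])

alpha = softmax(e)                       # attention weights alpha_ij
context = alpha @ h                      # context vector c_i = sum_j alpha_ij * h_j
print(alpha.shape, context.shape)        # (6,) (8,)
```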
  • Therefore, one component we can modify is the score function $e$; several classic choices follow:

  • Bahdanau's additive attention:

    • $\mathrm{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t; h_i])$
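A one-function sketch of this concat-form score (function and argument names are illustrative, not from the paper):

```python
import numpy as np

def additive_concat_score(s_t, h_i, W_a, v_a):
    """Concat-form additive score: v_a^T tanh(W_a [s_t; h_i])."""
    return v_a @ np.tanh(W_a @ np.concatenate([s_t, h_i]))

# toy usage: decoder state of size 4, encoder state of size 6
rng = np.random.default_rng(0)
s_t, h_i = rng.normal(size=4), rng.normal(size=6)
W_a, v_a = rng.normal(size=(5, 10)), rng.normal(size=5)
print(additive_concat_score(s_t, h_i, W_a, v_a))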
  • Multiplicative attention

    • $\mathrm{score}(h_t, \bar{h}_s) = h_t^\top W_a \bar{h}_s$ (general form)
    • Additive and multiplicative attention are similar in complexity, although multiplicative attention is faster and more space-efficient in practice as it can be implemented more efficiently using matrix multiplication.
    • http://ruder.io/deep-learning-nlp-best-practices/index.html#fn35
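A minimal sketch of the two common multiplicative variants, the dot form and the general (bilinear) form, with illustrative names:

```python
import numpy as np

def dot_score(h_t, h_s):
    """Dot form: score(h_t, h_s) = h_t^T h_s (requires equal dimensions)."""
    return h_t @ h_s

def general_score(h_t, h_s, W_a):
    """General (bilinear) form: score(h_t, h_s) = h_t^T W_a h_s."""
    return h_t @ (W_a @ h_s)

rng = np.random.default_rng(0)
h_t, h_s = rng.normal(size=4), rng.normal(size=4)
W_a = rng.normal(size=(4, 4))
print(dot_score(h_t, h_s), general_score(h_t, h_s, W_a))
```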
  • Additive attention

    • $\mathrm{score}(s_t, h_i) = v_a^\top \tanh(W_1 s_t + W_2 h_i)$
    • additive attention performs better for larger dimensions
    • http://ruder.io/deep-learning-nlp-best-practices/index.html#fn35
  • Location-based function:

    • Computed from the target hidden state alone, so the alignment depends only on position rather than on the source content;
    • $a_t = \mathrm{softmax}(W_a s_t)$
    • Effective Approaches to Attention-based Neural Machine Translation https://www.aclweb.org/anthology/D15-1166
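A minimal NumPy sketch of the location-based weights, assuming $W_a$ maps the decoder state to one score per source position (i.e., a fixed maximum source length, which is my simplification):

```python
import numpy as np

def location_based_weights(s_t, W_a):
    """a_t = softmax(W_a s_t): one weight per source position, from the decoder state alone."""
    z = W_a @ s_t
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
dec_dim, max_src_len = 6, 10            # assumed fixed maximum source length
s_t = rng.normal(size=dec_dim)
W_a = rng.normal(size=(max_src_len, dec_dim))
print(location_based_weights(s_t, W_a).sum())   # 1.0
```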
  • Scaled dot-product function:

    • $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
    • scaling by $\sqrt{d_k}$ deals with a backpropagation problem: when the dot products (and thus the attention logits) grow too large, the softmax saturates and its gradients become extremely small, making the model difficult to train
    • [1706.03762] Attention Is All You Need https://arxiv.org/abs/1706.03762
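A minimal NumPy sketch of the scaled dot-product form (shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with the softmax taken over the key axis."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 3))
print(scaled_dot_product_attention(Q, K, V).shape)          # (2, 3)
```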
  • Differences in how states are selected:

    • soft attention (considers all states, with parameterized weights) vs. hard attention (selects only the most relevant state, via Monte Carlo stochastic sampling)

      • drawback of hard attention: non-differentiable
    • global attention (aligns over all source states) vs. local attention (focuses on only part of the source)

      • Global attention weight for each encoder state, where $h_t$ is the current decoder state and $\bar{h}_s$ is the encoder state being scored

      • $a_t(s) = \frac{\exp(\mathrm{score}(h_t, \bar{h}_s))}{\sum_{s'} \exp(\mathrm{score}(h_t, \bar{h}_{s'}))}$

      • Local attention score function

      • first predict an aligned position $p_t$, then attend only within a limited 'window' of encoder hidden states around it (see the sketch after this list)

      • $p_t = S \cdot \mathrm{sigmoid}(v_p^\top \tanh(W_p h_t))$, where $S$ is the source sentence length

      • the influence of an encoder hidden state decays as its distance from the chosen position $p_t$ grows, via a Gaussian centered at $p_t$

      • $a_t(s) = \mathrm{align}(h_t, \bar{h}_s)\exp\!\left(-\frac{(s - p_t)^2}{2\sigma^2}\right)$, with $\sigma = D/2$ for a window of half-width $D$

      • ref:

        • Attention Mechanism: Benefits and Applications https://www.saama.com/blog/attention-mechanism-benefits-and-applications/
        • Paper notes on the attention mechanism: global attention and local attention - nbawj's blog - CSDN https://blog.csdn.net/nbawj/article/details/80551404
        • Model Summary 24 - A Detailed Introduction to the Attention Mechanism in Deep Learning: Principles, Categories and Applications - Zhihu https://zhuanlan.zhihu.com/p/31547842
    • local attention is somewhat similar to hard attention, but it remains differentiable
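As promised above, a minimal NumPy sketch of Luong-style local-p attention. Toy dimensions and variable names are mine, and the clipping of the window at sentence boundaries is a simplification:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
S_len, dim, D = 12, 8, 2                       # source length, state size, half window

h_enc = rng.normal(size=(S_len, dim))          # encoder states
h_t = rng.normal(size=dim)                     # current decoder state
W_p, v_p = rng.normal(size=(dim, dim)), rng.normal(size=dim)
W_a = rng.normal(size=(dim, dim))

# Predicted alignment position p_t = S * sigmoid(v_p^T tanh(W_p h_t))
p_t = S_len * (1.0 / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t)))))

# Only positions inside the window [p_t - D, p_t + D] are considered
lo, hi = max(0, int(p_t) - D), min(S_len, int(p_t) + D + 1)
window = h_enc[lo:hi]

# General score inside the window, then a Gaussian centered at p_t (sigma = D / 2)
align = softmax(window @ (W_a @ h_t))
sigma = D / 2.0
positions = np.arange(lo, hi)
a_t = align * np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))

context = a_t @ window                         # context vector from the local window
print(p_t, a_t.round(3), context.shape)
```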

  • self attention:

  • Hierarchical model:

    • (figure: hierarchical attention network architecture; word-level attention produces sentence vectors, and sentence-level attention aggregates them into a document vector)
    • Hierarchical Attention Networks for Document Classification https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf
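A minimal NumPy sketch of the two-level attention in a hierarchical attention network. I skip the GRU encoders and pretend the word annotations are already given; all names and shapes are illustrative:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attend(H, W, b, u):
    """u_t = tanh(W h_t + b); weights = softmax(u_t^T u); return the weighted sum of H."""
    U = np.tanh(H @ W.T + b)           # (T, att_dim)
    alpha = softmax(U @ u)             # (T,)
    return alpha @ H                   # context vector

rng = np.random.default_rng(0)
n_sent, n_words, dim, att = 3, 5, 8, 6

# Pretend these come from a word-level encoder (one annotation per word per sentence)
word_annotations = rng.normal(size=(n_sent, n_words, dim))

W_w, b_w, u_w = rng.normal(size=(att, dim)), rng.normal(size=att), rng.normal(size=att)
W_s, b_s, u_s = rng.normal(size=(att, dim)), rng.normal(size=att), rng.normal(size=att)

# Word-level attention -> one vector per sentence
sent_vecs = np.stack([attend(S, W_w, b_w, u_w) for S in word_annotations])
# Sentence-level attention -> one vector per document
doc_vec = attend(sent_vecs, W_s, b_s, u_s)
print(sent_vecs.shape, doc_vec.shape)   # (3, 8) (8,)
```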
  • Difference between Memory mechanism and Attention mechanism?

    • Facebook proposed a famous memory architecture in 2015 with hierarchical memory; it can also be viewed as one kind of attention
    • two papers introducing the memory mechanism:
    • [1410.3916] Memory Networks https://arxiv.org/abs/1410.3916
    • [1503.08895] End-To-End Memory Networks https://arxiv.org/abs/1503.08895
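As a concrete point of comparison, a single read step ('hop') of an end-to-end memory network (the second paper above) is itself a soft attention over memory slots. A minimal NumPy sketch under toy bag-of-words assumptions of my own:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
n_mem, vocab, dim = 4, 20, 8

# Bag-of-words sentences and a query (binary word counts), purely illustrative
sentences = rng.integers(0, 2, size=(n_mem, vocab)).astype(float)
query = rng.integers(0, 2, size=vocab).astype(float)

A, B, C = (rng.normal(size=(dim, vocab)) for _ in range(3))

m = sentences @ A.T            # input memory representations m_i
c = sentences @ C.T            # output memory representations c_i
u = B @ query                  # internal query state

p = softmax(m @ u)             # attention weights over memory slots
o = p @ c                      # response vector read from memory
print(p.round(3), o.shape)
```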
  • SOTA models:

  • ref:

    • Attention Mechanism: Benefits and Applications https://www.saama.com/blog/attention-mechanism-benefits-and-applications/
