Note 1: Transformer

Attention Is All You Need [1]


1. Encoder-Decoder

  • The encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n).
  • Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time.
  • At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
Overview of Transformer
  • Encoder
    • It has 6 identical layers.
    • Each layer has two sub-layers: a multi-head self-attention sub-layer followed by a position-wise fully connected feed-forward sub-layer.
    • Each sub-layer uses a residual connection and is followed by a normalization layer.
    • The residual connection takes x + F(x) as its result, where F is the function computed by the sub-layer itself.
  • Decoder
    • It has 6 identical layers.
    • Each layer has three sub-layers:
      • The masked multi-head attention layer ensures that the prediction at time t can only depend on the known outputs at positions less than t.
      • The multi-head attention layer further incorporates the output of the encoder stack into the decoder.
      • The position-wise fully connected feed-forward layer.
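The sub-layer wrapper used in every encoder and decoder layer (residual connection followed by a normalization layer) can be sketched as follows; this is a minimal NumPy illustration where layer_norm is simplified to have no learnable gain/bias, and the tanh stand-in for the sub-layer function F is an assumption:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's feature vector (no learnable gain/bias here)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, f):
    # residual connection x + F(x), followed by the normalization layer
    return layer_norm(x + f(x))

x = np.random.default_rng(0).standard_normal((4, 8))  # (positions, d_model), toy sizes
y = sublayer_connection(x, np.tanh)                   # np.tanh stands in for attention/FFN
```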

2. Attention

  • Maps a query and a set of key-value pairs to an output, computed as a weighted sum of the values.
  • Scaled Dot-product Attention
    • Given three matrices, a query Q, keys K and values V:

      Attention(Q, K, V) = softmax(Q K^T / √d_k) V

    • The scaling factor 1/√d_k keeps the dot products from growing too large and pushing the softmax into regions with extremely small gradients.
  • Multi-head Attention
    • Jointly collects information from different representation subspaces focused on different positions.
    • Given queries Q, keys K and values V:

      MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

      • W_i^Q ∈ R^{d_model×d_k}, W_i^K ∈ R^{d_model×d_k}, W_i^V ∈ R^{d_model×d_v}, W^O ∈ R^{h·d_v×d_model}, h = 8, d_k = d_v = d_model/h = 64.
      • The queries and keys have the same dimension d_k.
      • First, it linearly projects the queries, keys and values h times to learn different d_k-, d_k- and d_v-dimensional representations.
      • Next, it concatenates all h yielded output values together.
      • At last, it projects the concatenated vector into a d_model-dimensional vector.
        Example [2]
  • Attention in Transformer
    • Encoder's multi-head:
      • Q = K = V = the output of the previous layer.
      • Each position in the encoder can attend to all positions in the previous layer of the encoder.
    • Decoder's masked multi-head:
      • Q = K = V = the masked output of the previous layer.
      • For example, if we predict the t-th output token, all tokens from timestamp t onward have to be masked.
      • This prevents leftward information flow in the decoder in order to preserve the auto-regressive property.
      • It masks out (setting to −∞) all values in the input of the softmax which correspond to illegal connections during the scaled dot-product attention.
    • Decoder's multi-head:
      • Q = the output of the previous decoder layer, K = V = the encoder stack's output.
      • This allows every position in the decoder to attend over all positions in the input sequence.
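The scaled dot-product attention and the decoder's causal masking described above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation; the toy sizes and random inputs are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # shift by the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # mask out illegal connections by setting them to -inf before the softmax
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ V

# self-attention with a causal mask, as in the decoder's masked multi-head:
# position t may attend only to positions <= t
rng = np.random.default_rng(0)
n, d_k = 4, 8                          # toy sizes, not the paper's d_model = 512
Q = K = V = rng.standard_normal((n, d_k))
mask = np.tril(np.ones((n, n), dtype=bool))
out = scaled_dot_product_attention(Q, K, V, mask)
```

With the causal mask, the first position can attend only to itself, so its output row is exactly V's first row.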

3. Position-wise Feed-forward Networks

  • Two linear transformations with a ReLU activation (max(0, x)) in between: FFN(x) = max(0, x W_1 + b_1) W_2 + b_2.
  • It's applied to each position separately and identically.
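The feed-forward network above can be sketched in NumPy; d_model = 512 and inner dimension d_ff = 2048 follow the paper, while the random weights and toy sequence length are placeholders:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    # the same weights are applied to every position separately and identically
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n = 512, 2048, 3            # dimensions from the paper; n is a toy length
x = rng.standard_normal((n, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
y = position_wise_ffn(x, W1, b1, W2, b2)
```

Because the transformation is position-wise, running it on a single position yields the same row as running it on the whole sequence.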

4. Positional Encoding

  • To make use of the order of the sequence.
  • Added at the bottoms of the encoder and decoder stacks.
  • Has the same dimension d_model as the embeddings.

      PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
      PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

    • pos is the index of position: pos ∈ [0, n) in the encoder while pos ∈ [0, m) in the decoder.
    • i is the index of the dimension, 0 ≤ 2i, 2i+1 < d_model.
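The sinusoidal encoding above can be computed in a few lines of NumPy; the sequence length 50 here is an arbitrary choice for illustration:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(n_positions)[:, None]        # position index
    i = np.arange(d_model // 2)[None, :]         # dimension-pair index
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = positional_encoding(50, 512)  # same dimension d_model as the embeddings
```

At pos = 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check.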

Reference

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
[2] 口仆. Transformer 原理解析 (Transformer principles explained, in Chinese). https://zhuanlan.zhihu.com/p/135873679
