Note 1: Transformer

Attention Is All You Need [1]


1. Encoder-Decoder

  • The encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n).
  • Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time.
  • At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
Overview of Transformer
  • Encoder
    • It has 6 identical layers.
    • Each layer has two sub-layers: a multi-head self-attention sub-layer followed by a position-wise fully connected feed-forward sub-layer.
    • Each sub-layer uses a residual connection and is followed by a normalization layer.
    • The residual connection takes x + F(x) as its result, where F is the function computed by the sub-layer itself.
  • Decoder
    • It has 6 identical layers.
    • Each layer has three sub-layers:
      • The masked multi-head attention layer ensures that the prediction at time t can only depend on the known outputs at positions less than t.
      • The multi-head attention layer further incorporates the output of the encoder stack into the decoder.
      • The position-wise fully connected feed-forward layer.
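The sub-layer wrapper used in every encoder and decoder layer (residual connection followed by a normalization layer) can be sketched as follows; this is a minimal NumPy illustration where layer_norm is simplified to have no learnable gain/bias, and the tanh stand-in for the sub-layer function F is an assumption:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's feature vector (no learnable gain/bias here)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, f):
    # residual connection x + F(x), followed by the normalization layer
    return layer_norm(x + f(x))

x = np.random.default_rng(0).standard_normal((4, 8))  # (positions, d_model), toy sizes
y = sublayer_connection(x, np.tanh)                   # np.tanh stands in for attention/FFN
```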

2. Attention

  • Maps a query and a set of key-value pairs to an output, computed as a weighted sum of the values.
  • Scaled Dot-product Attention
    • Given three matrices, a query Q, keys K and values V:

      Attention(Q, K, V) = softmax(Q K^T / √d_k) V

    • The scaling factor 1/√d_k keeps the dot products from growing too large and pushing the softmax into regions with extremely small gradients.
  • Multi-head Attention
    • Jointly collects information from different representation subspaces focused on different positions.
    • Given queries Q, keys K and values V:

      MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

      • W_i^Q ∈ R^{d_model×d_k}, W_i^K ∈ R^{d_model×d_k}, W_i^V ∈ R^{d_model×d_v}, W^O ∈ R^{h·d_v×d_model}, h = 8, d_k = d_v = d_model/h = 64.
      • The queries and keys have the same dimension d_k.
      • First, it linearly projects the queries, keys and values h times to learn different d_k-, d_k- and d_v-dimensional representations.
      • Next, it concatenates all h yielded output values together.
      • At last, it projects the concatenated vector into a d_model-dimensional vector.
        Example [2]
  • Attention in Transformer
    • Encoder's multi-head:
      • Q = K = V = the output of the previous layer.
      • Each position in the encoder can attend to all positions in the previous layer of the encoder.
    • Decoder's masked multi-head:
      • Q = K = V = the masked output of the previous layer.
      • For example, if we predict the t-th output token, all tokens from timestamp t onward have to be masked.
      • This prevents leftward information flow in the decoder in order to preserve the auto-regressive property.
      • It masks out (setting to −∞) all values in the input of the softmax which correspond to illegal connections during the scaled dot-product attention.
    • Decoder's multi-head:
      • Q = the output of the previous decoder layer, K = V = the encoder stack's output.
      • This allows every position in the decoder to attend over all positions in the input sequence.
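The scaled dot-product attention and the decoder's causal masking described above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation; the toy sizes and random inputs are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # shift by the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # mask out illegal connections by setting them to -inf before the softmax
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ V

# self-attention with a causal mask, as in the decoder's masked multi-head:
# position t may attend only to positions <= t
rng = np.random.default_rng(0)
n, d_k = 4, 8                          # toy sizes, not the paper's d_model = 512
Q = K = V = rng.standard_normal((n, d_k))
mask = np.tril(np.ones((n, n), dtype=bool))
out = scaled_dot_product_attention(Q, K, V, mask)
```

With the causal mask, the first position can attend only to itself, so its output row is exactly V's first row.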

3. Position-wise Feed-forward Networks

  • Two linear transformations with a ReLU activation (max(0, x)) in between: FFN(x) = max(0, x W_1 + b_1) W_2 + b_2.
  • It's applied to each position separately and identically.
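The feed-forward network above can be sketched in NumPy; d_model = 512 and inner dimension d_ff = 2048 follow the paper, while the random weights and toy sequence length are placeholders:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    # the same weights are applied to every position separately and identically
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n = 512, 2048, 3            # dimensions from the paper; n is a toy length
x = rng.standard_normal((n, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
y = position_wise_ffn(x, W1, b1, W2, b2)
```

Because the transformation is position-wise, running it on a single position yields the same row as running it on the whole sequence.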

4. Positional Encoding

  • To make use of the order of the sequence.
  • Added at the bottoms of the encoder and decoder stacks.
  • Has the same dimension d_model as the embeddings.

      PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
      PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

    • pos is the index of position: pos ∈ [0, n) in the encoder while pos ∈ [0, m) in the decoder.
    • i is the index of the dimension, 0 ≤ 2i, 2i+1 < d_model.
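The sinusoidal encoding above can be computed in a few lines of NumPy; the sequence length 50 here is an arbitrary choice for illustration:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(n_positions)[:, None]        # position index
    i = np.arange(d_model // 2)[None, :]         # dimension-pair index
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = positional_encoding(50, 512)  # same dimension d_model as the embeddings
```

At pos = 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check.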

Reference

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
[2] 口仆. Transformer 原理解析 (Transformer principles explained, in Chinese). https://zhuanlan.zhihu.com/p/135873679
