Each sub-layer's output is LayerNorm(x + Sublayer(x)).
Transformer pseudocode, compiled from the paper
Encoder input: Inputs; decoder input: Outputs (shifted right)
X = Positional_Encoding(Input_Embedding(Inputs))
X = LayerNorm(X + Multi-Head_Attention(X))
X = LayerNorm(X + Feed_Forward(X))
Y = Positional_Encoding(Output_Embedding(Outputs))
Y = LayerNorm(Y + Masked_Multi-Head_Attention(Y))
Y = LayerNorm(Y + Multi-Head_Attention(Q=Y, K=X, V=X))
Y = LayerNorm(Y + Feed_Forward(Y))
Y = Linear(Y)
Output Probabilities = Softmax(Y)
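As a quick, hedged illustration of this flow, the snippet below wires the same steps together using PyTorch's built-in `nn.Transformer` as a stand-in for the encoder/decoder stacks; the vocabulary size, batch shapes, and the omission of positional encoding are assumptions for brevity, not part of the notes above.

```python
import torch
import torch.nn as nn

# Minimal sketch, assuming d_model = 512, h = 8 and a toy vocabulary.
d_model, vocab = 512, 10000
emb = nn.Embedding(vocab, d_model)                    # Input_/Output_Embedding
transformer = nn.Transformer(d_model=d_model, nhead=8, batch_first=True)

src = torch.randint(0, vocab, (2, 7))                 # Inputs  (batch, src_len)
tgt = torch.randint(0, vocab, (2, 5))                 # Outputs (batch, tgt_len)
tgt_mask = transformer.generate_square_subsequent_mask(tgt.size(1))  # causal mask

# Positional encoding omitted here for brevity (see the PE formulas below).
out = transformer(emb(src), emb(tgt), tgt_mask=tgt_mask)
probs = nn.Linear(d_model, vocab)(out).softmax(dim=-1)   # Linear + Softmax
print(probs.shape)                                        # torch.Size([2, 5, 10000])
```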
$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
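A minimal PyTorch sketch of this formula; the function name and the optional `mask` argument are my additions.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Block positions where mask == 0 (used by the decoder's masked attention).
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V)
```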
$MultiHead(Q, K, V) = Concat(head_1, \ldots, head_h)W^O$
where $head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$
$W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$
$W_i^K \in \mathbb{R}^{d_{model} \times d_k}$
$W_i^V \in \mathbb{R}^{d_{model} \times d_v}$
$W^O \in \mathbb{R}^{hd_v \times d_{model}}$
In this work we employ h = 8 parallel attention layers, or heads.
For each of these we use $d_k = d_v = d_{model}/h = 64$.
Due to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full dimensionality.
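A sketch of multi-head attention matching the formulas and dimensions above; fusing the $h$ per-head projections into a single `nn.Linear` each is an implementation convenience, not part of the paper's notation, and it reuses the `scaled_dot_product_attention` sketched earlier.

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O."""

    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h             # d_k = d_v = d_model / h
        # One fused d_model x d_model projection covers all h per-head W_i matrices.
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)          # W^O: (h * d_v) x d_model

    def forward(self, Q, K, V, mask=None):
        B = Q.size(0)

        def split(x, proj):
            # Project, then reshape (B, L, d_model) -> (B, h, L, d_k).
            return proj(x).view(B, -1, self.h, self.d_k).transpose(1, 2)

        q, k, v = split(Q, self.W_Q), split(K, self.W_K), split(V, self.W_V)
        out = scaled_dot_product_attention(q, k, v, mask)   # sketched above
        # Concat(head_1, ..., head_h): (B, h, L, d_k) -> (B, L, h * d_k).
        out = out.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.W_O(out)
```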
$FFN(x) = max(0, xW_1 + b_1)W_2 + b_2$
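A sketch of the position-wise feed-forward network; the inner size `d_ff = 2048` is the paper's base-model value and does not appear in the notes above.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied identically at every position."""

    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # W_1, b_1
        self.linear2 = nn.Linear(d_ff, d_model)   # W_2, b_2

    def forward(self, x):
        return self.linear2(F.relu(self.linear1(x)))   # max(0, .) is the ReLU
```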
$PE_{(pos, 2i)} = sin(pos / 10000^{2i/d_{model}})$
$PE_{(pos, 2i+1)} = cos(pos / 10000^{2i/d_{model}})$
where pos is the position and i is the dimension.
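A sketch of these sinusoidal encodings; computing the frequencies in log space is a common numerical convenience, not something the formulas require.

```python
import math
import torch

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    # 1 / 10000^(2i/d_model), computed in log space for numerical stability.
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe   # added to the embeddings: X = Input_Embedding(Inputs) + pe[:seq_len]
```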
Detailed version with explicit Q, K, V. Encoder input: Inputs; decoder input: Outputs (shifted right)
X = Positional_Encoding(Input_Embedding(Inputs))
Q_X, K_X, V_X = X
X = LayerNorm(X + Multi-Head_Attention(Q_X, K_X, V_X))
X = LayerNorm(X + Feed_Forward(X))
K_X, V_X = X
Y = Positional_Encoding(Output_Embedding(Outputs))
Q_Y, K_Y, V_Y = Y
Y = LayerNorm(Y + Masked_Multi-Head_Attention(Q_Y, K_Y, V_Y))
Q_Y = Y
Y = LayerNorm(Y + Multi-Head_Attention(Q_Y, K_X, V_X))
Y = LayerNorm(Y + Feed_Forward(Y))
Y = Linear(Y)
Output Probabilities = Softmax(Y)
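Putting the pieces together, here is a hedged sketch of one encoder layer and one decoder layer following the pseudocode above (post-layer-norm residuals, as written); `MultiHeadAttention` and `FeedForward` are the modules sketched earlier, and the hyperparameter defaults are the paper's base-model values.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """X = LayerNorm(X + Self_Attention(X)); X = LayerNorm(X + FFN(X))."""

    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)   # sketched earlier
        self.ffn = FeedForward(d_model, d_ff)              # sketched earlier
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, src_mask=None):
        x = self.norm1(x + self.self_attn(x, x, x, src_mask))   # Q_X, K_X, V_X
        return self.norm2(x + self.ffn(x))


class DecoderLayer(nn.Module):
    """Masked self-attention, then cross-attention with Q from the decoder
    state and K, V from the encoder output, then the feed-forward network."""

    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, h)
        self.cross_attn = MultiHeadAttention(d_model, h)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, y, enc_out, tgt_mask=None, src_mask=None):
        # tgt_mask: 1 where attention is allowed, 0 where blocked (causal mask).
        y = self.norm1(y + self.self_attn(y, y, y, tgt_mask))               # Q_Y, K_Y, V_Y
        y = self.norm2(y + self.cross_attn(y, enc_out, enc_out, src_mask))  # Q_Y, K_X, V_X
        return self.norm3(y + self.ffn(y))
```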
We trained our models on one machine with 8 NVIDIA P100 GPUs.
We trained the base models for a total of 100,000 steps or 12 hours.
The big models were trained for 300,000 steps (3.5 days).
We used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$.
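The same optimizer settings in PyTorch; the model and the fixed learning rate below are placeholders, since the paper's warm-up schedule is not reproduced in these notes.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)   # placeholder standing in for the Transformer model
# beta_1 = 0.9, beta_2 = 0.98, eps = 1e-9 as quoted above; lr is a stand-in value.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.98), eps=1e-9)
```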
Residual Dropout
Label Smoothing
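A hedged sketch of the two regularizers named above; the rates $P_{drop} = 0.1$ and $\epsilon_{ls} = 0.1$ are the paper's base-model values and are not stated in these notes.

```python
import torch.nn as nn

# Residual dropout: applied to each sub-layer's output before it is added to the
# residual and normalized, and to the sums of embeddings and positional encodings.
residual_dropout = nn.Dropout(p=0.1)

# Label smoothing on the training loss (eps_ls = 0.1).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```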