Transformer Summary

Attention is all you need

Transformer


LayerNorm(x + Sublayer(x))
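Every sub-layer in both the encoder and the decoder is wrapped this way: a residual connection around the sub-layer, followed by layer normalization. A minimal PyTorch sketch of that wrapper (the class name `ResidualLayerNorm` and its `d_model` argument are illustrative choices, not names from the paper):

```python
import torch.nn as nn

class ResidualLayerNorm(nn.Module):
    """Computes LayerNorm(x + sublayer(x)) for an arbitrary sub-layer."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        # sublayer is a callable, e.g. a multi-head attention or feed-forward module
        return self.norm(x + sublayer(x))
```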

Organized Transformer pseudocode
Input: Inputs; Output: Outputs

X = Positional_Encoding(Input_Embedding(Inputs))
X = LayerNorm(X + Multi-Head_Attention(X))
X = LayerNorm(X + Feed_Forward(X))

Y = Positional_Encoding(Output_Embedding(Outputs))
Y = LayerNorm(Y + Masked_Multi-Head_Attention(Y))
Y = LayerNorm(Y + Multi-Head_Attention(Y_Q, X_K, X_V))
Y = LayerNorm(Y + Feed_Forward(Y))

Y = Linear(Y)
Output Probabilities = Softmax(Y)


Scaled Dot-Product Attention

$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$
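A minimal sketch of this formula in PyTorch; the optional `mask` argument is an assumption added so the same function can also serve the masked attention in the decoder:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (..., len_q, len_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = torch.softmax(scores, dim=-1)                # attention weights
    return weights @ V
```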

Multi-Head Attention

$MultiHead(Q, K, V) = Concat(head_1, \ldots, head_h)W^O$

where $head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$

$W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$
$W_i^K \in \mathbb{R}^{d_{model} \times d_k}$
$W_i^V \in \mathbb{R}^{d_{model} \times d_v}$
$W^O \in \mathbb{R}^{hd_v \times d_{model}}$

In this work we employ h = 8 parallel attention layers, or heads.
For each of these we use $d_k = d_v = d_{model}/h = 64$.
Due to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full dimensionality.
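With $d_k = d_v = d_{model}/h$, the per-head projections can be packed into single $d_{model} \times d_{model}$ linear layers and the result reshaped into heads. A sketch under that assumption, reusing the `scaled_dot_product_attention` function from the previous snippet ($d_{model} = 512$ follows from $h = 8$ and $d_{model}/h = 64$):

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # W_i^Q, W_i^K, W_i^V for all heads packed into one projection each, plus W^O
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def _split(self, x):
        # (batch, seq, d_model) -> (batch, h, seq, d_k)
        b, s, _ = x.shape
        return x.view(b, s, self.h, self.d_k).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        q, k, v = self._split(self.w_q(q)), self._split(self.w_k(k)), self._split(self.w_v(v))
        heads = scaled_dot_product_attention(q, k, v, mask)           # (batch, h, seq, d_k)
        concat = heads.transpose(1, 2).reshape(q.size(0), -1, self.h * self.d_k)
        return self.w_o(concat)                                       # Concat(head_1..head_h) W^O
```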

Position-wise Feed-Forward Networks

$FFN(x) = max(0, xW_1 + b_1)W_2 + b_2$
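A sketch of this two-layer network; the inner dimension $d_{ff} = 2048$ used below is the base-model value from the paper, stated here as an assumption since it does not appear in this summary:

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)    # W1, b1
        self.linear2 = nn.Linear(d_ff, d_model)    # W2, b2

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))
```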

Positional Encoding

$PE_{(pos, 2i)} = sin(pos/10000^{2i/d_{model}}) $

$PE_{(pos, 2i+1)} = cos(pos/10000^{2i/d_{model}}) $

where pos is the position and i is the dimension.
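A sketch that precomputes the sinusoidal table defined by these two formulas; `max_len` is an illustrative upper bound on sequence length, and the table is simply added to the embeddings, as in `Positional_Encoding(Input_Embedding(Inputs))` above:

```python
import math
import torch

def positional_encoding(max_len, d_model):
    """Table with PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)           # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))                     # 10000^(-2i/d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                                             # (max_len, d_model)
```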

Rewritten Transformer pseudocode

Input: Inputs; Output: Outputs

X = Positional_Encoding(Input_Embedding(Inputs))

Q_X, K_X, V_X = X

X = LayerNorm(X + Multi-Head_Attention(Q_X, K_X, V_X))

X = LayerNorm(X + Feed_Forward(X))

Q_X, K_X, V_X = X

Y = Positional_Encoding(Output_Embedding(Outputs))

Q_Y, K_Y, V_Y = Y

Y = LayerNorm(Y + Masked_Multi-Head_Attention(Q_Y, K_Y, V_Y))

Q_Y, K_Y, V_Y = Y

Y = LayerNorm(Y + Multi-Head_Attention(Q_Y, K_X, V_X))

Y = LayerNorm(Y + Feed_Forward(Y))

Y = Linear(Y)
Output Probabilities = Softmax(Y)
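Putting the pieces together, a minimal one-encoder-layer, one-decoder-layer forward pass that mirrors the pseudocode above. It assumes the `ResidualLayerNorm`, `scaled_dot_product_attention`, `MultiHeadAttention`, `PositionwiseFeedForward`, and `positional_encoding` sketches defined earlier; sharing one embedding table for source and target and the default sizes are simplifying assumptions, and the paper stacks N = 6 such layers on each side:

```python
import torch
import torch.nn as nn

class TransformerSketch(nn.Module):
    def __init__(self, vocab_size, d_model=512, h=8, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.register_buffer('pe', positional_encoding(max_len, d_model))
        self.enc_attn = MultiHeadAttention(d_model, h)
        self.enc_ffn = PositionwiseFeedForward(d_model)
        self.dec_self_attn = MultiHeadAttention(d_model, h)
        self.dec_cross_attn = MultiHeadAttention(d_model, h)
        self.dec_ffn = PositionwiseFeedForward(d_model)
        self.sublayers = nn.ModuleList([ResidualLayerNorm(d_model) for _ in range(5)])
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, inputs, outputs):
        # X = Positional_Encoding(Input_Embedding(Inputs))
        x = self.embed(inputs) + self.pe[:inputs.size(1)]
        # Encoder: self-attention (Q, K, V all from X), then feed-forward
        x = self.sublayers[0](x, lambda t: self.enc_attn(t, t, t))
        x = self.sublayers[1](x, self.enc_ffn)
        # Y = Positional_Encoding(Output_Embedding(Outputs))
        y = self.embed(outputs) + self.pe[:outputs.size(1)]
        # Masked self-attention: a causal mask keeps position t from seeing positions > t
        causal = torch.tril(torch.ones(outputs.size(1), outputs.size(1), device=y.device))
        y = self.sublayers[2](y, lambda t: self.dec_self_attn(t, t, t, mask=causal))
        # Cross-attention: queries from the decoder (Y), keys and values from the encoder (X)
        y = self.sublayers[3](y, lambda t: self.dec_cross_attn(t, x, x))
        y = self.sublayers[4](y, self.dec_ffn)
        # Output Probabilities = Softmax(Linear(Y))
        return torch.softmax(self.proj(y), dim=-1)
```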

Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs.
We trained the base models for a total of 100,000 steps or 12 hours.
The big models were trained for 300,000 steps (3.5 days).

Optimizer

We used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$.
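With those values, the optimizer can be set up in PyTorch as below; `model` and the learning rate are placeholders, and the warmup learning-rate schedule the paper applies on top of Adam is omitted here:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.98), eps=1e-9)
```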

Regularization

Residual Dropout
Label Smoothing
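A brief sketch of how these two techniques are typically realized in PyTorch: dropout applied to each sub-layer's output before the residual addition, and label smoothing folded into the cross-entropy loss. The rate of 0.1 for both matches the paper's base-model settings and is stated here as an assumption:

```python
import torch.nn as nn

# Residual dropout: applied to a sub-layer's output before it is added back to the input,
# e.g. norm(x + dropout(sublayer(x))) inside the residual wrapper
dropout = nn.Dropout(p=0.1)

# Label smoothing: soften the one-hot targets when computing the training loss
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```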
