Transformer

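For a detailed introduction to the attention mechanism, see the companion article 《注意力机制》 (Attention Mechanism). This article focuses on the structure of the Transformer.

1. Transformer Structure

The Transformer follows an Encoder-Decoder design: the Encoders are a stack of N Encoder blocks (N = 6 in the paper), and the Decoders are a stack of N Decoder blocks (N = 6 in the paper). The overall structure is shown below:

Figure 1: Breaking down the Transformer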

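Each Encoder block has two sub-layers:

Multi-Head Attention (Self-Attention)
Feed Forward (fully connected layer)

Each Decoder block has three sub-layers:

Multi-Head Attention (Self-Attention)
Multi-Head Attention (Encoder-Decoder Attention)
Feed Forward (fully connected layer)

The Decoders are followed by a Linear layer and a Softmax layer. Note also that the Transformer wraps every sub-layer (Multi-Head Attention, Feed Forward) in a residual connection followed by layer normalization, so the output of each sub-layer becomes $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$. The residual connection looks like this:

Figure 2: Residual connection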
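
To make the residual connection concrete, here is a minimal NumPy sketch of "add, then normalize" around an arbitrary sub-layer. The function names `layer_norm` and `add_and_norm` are illustrative, the sub-layer is a stand-in linear map, and the learnable gain and bias of layer normalization are left out to keep it short.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position (last axis) to zero mean and unit variance.
    Assumption: the learnable gain and bias are omitted for brevity."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization:
    LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # 4 positions, toy model dimension of 8
w = rng.normal(size=(8, 8))      # weights of a stand-in sub-layer

# Any function that keeps the shape of x can play the role of the sub-layer.
out = add_and_norm(x, lambda h: h @ w)
```

Because the sub-layer output is added to `x`, the sub-layer must preserve the shape of its input, which is why every sub-layer in the model works at the same dimension.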

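The detailed structure of the Transformer is shown in the figure below:

Figure 3: Transformer structure

Self-Attention and Multi-Head Attention are covered in the companion article 《注意力机制》 (Attention Mechanism), so they are not introduced in detail here.

2. The Transformer Step by Step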

2.1 Input

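The input to the Transformer carries two kinds of information:

word embedding
time signal

The time signal is represented by a Positional Encoding that has the same dimension as the embedding, namely $d_{model}$, so the two can be added together. See the figure below:

Figure 4: Word embedding + positional encoding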

2.1.1 Positional Encoding

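Because Self-Attention by itself ignores the order of the sequence, the Transformer adds extra information to its input, such as the relative or absolute position of each token. Positional encodings can be either learned or fixed; see the paper 《Convolutional Sequence to Sequence Learning》 for details.

The Transformer uses a fixed encoding, defined as follows:

$$PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$

where $pos$ is the position and $i$ indexes the dimension.

That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.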
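
The fixed encoding can be sketched in a few lines of NumPy. This is only an illustration: the function name `positional_encoding`, the toy sizes, and the all-zero placeholder embeddings are assumptions, and the model dimension is assumed to be even.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Fixed sinusoidal encoding: even dimensions 2i use sin, odd dimensions
    2i+1 use cos, both with angle pos / 10000^(2i / d_model).
    Assumption: d_model is even."""
    pos = np.arange(max_len)[:, None]          # shape (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # shape (1, d_model / 2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # dimensions 0, 2, 4, ...
    pe[:, 1::2] = np.cos(angle)                # dimensions 1, 3, 5, ...
    return pe

pe = positional_encoding(max_len=50, d_model=16)

# The encoding has the same shape as the embeddings, so the two are summed.
embeddings = np.zeros((50, 16))                # placeholder word embeddings
model_input = embeddings + pe
```

Since the encoding depends only on the position and the dimension index, it can be computed once and reused for every input sequence.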

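In their experiments, the authors also compared this with the learned approach and found that the two perform about the same.

2.2 Encoders

As mentioned above, the Encoders are a stack of Encoder blocks, and each Encoder block has two sub-layers, as shown below:

Figure 5: Encoder

The computation inside the Encoder is described step by step below.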

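2.2.1 The Self-Attention Process

Step 1: Multiply the embedding by the weight matrices $W^Q$, $W^K$, $W^V$ to obtain $Q$, $K$, $V$. Because this is Self-Attention, the inputs to the three projections are identical: all of them are the embedding. (This is what the code shows.)

Figure 6: Self-attention step 1

Step 2: Compute the dot product of $Q$ and $K$.

Figure 7: Self-attention step 2

Step 3 + Step 4: Scale the scores (divide by $\sqrt{d_k}$), then apply softmax.

Figure 8: Self-attention step 3 + step 4

Step 5 + Step 6: Multiply the softmax weights by $V$ and sum the results to obtain the Attention output.

Figure 9: Self-attention step 5 + step 6

Steps 1-6 produce the attention output of a single head. Multi-Head Attention runs several copies of this process in parallel.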
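
The six steps for one head can be written compactly in NumPy. This is a minimal sketch under assumed toy sizes; the names `self_attention`, `w_q`, `w_k`, `w_v` are illustrative, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # step 1: project the same input
    scores = q @ k.T                            # step 2: dot products of Q and K
    scores = scores / np.sqrt(k.shape[-1])      # step 3: scale by sqrt(d_k)
    weights = softmax(scores, axis=-1)          # step 4: softmax over the keys
    return weights @ v, weights                 # steps 5-6: weighted sum of V

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))                     # 3 tokens, embedding size 8
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
z, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to 1 and says how much that token draws on every token in the sequence, itself included.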

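2.2.2 The Multi-Head Attention Process

Figure 10: Multi-Head Attention: step 1

Figure 11: Multi-Head Attention: steps 2-6

Figure 12: Multi-Head Attention: concat

Finally, the concatenated heads are multiplied by the weight matrix $W^O$. The result is the output of the Multi-Head Attention sub-layer, which is passed on to the Feed Forward sub-layer.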
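
The sketch below shows one way to express the multi-head computation: run the single-head process once per head, concatenate, and project. The stacked weight layout `(h, d_model, d_k)` and the name `multi_head_attention` are assumptions made for illustration. Real implementations usually compute all heads in one batched matrix product instead of a Python loop.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """w_q, w_k, w_v have shape (h, d_model, d_k); w_o has shape (h * d_k, d_model)."""
    heads = []
    for wq, wk, wv in zip(w_q, w_k, w_v):       # one iteration per head
        q, k, v = x @ wq, x @ wk, x @ wv
        weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
        heads.append(weights @ v)               # (seq_len, d_k) for this head
    concat = np.concatenate(heads, axis=-1)     # (seq_len, h * d_k)
    return concat @ w_o                         # back to (seq_len, d_model)

rng = np.random.default_rng(0)
h, d_model, d_k = 2, 8, 4
x = rng.normal(size=(3, d_model))
w_q, w_k, w_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
w_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(x, w_q, w_k, w_v, w_o)
```

The final projection brings the concatenated heads back to the model dimension, so the output has the same shape as the input.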

2.2.3 Encoders Summary

The complete Encoders computation is shown in the figure below:

Figure 13: The complete Encoders computation
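
Putting the pieces together, a single Encoder block can be sketched as attention, then feed-forward, each wrapped in a residual connection and layer normalization. Several simplifications are assumed here: one attention head instead of several, a ReLU feed-forward layer, no learnable normalization parameters, random weights, and toy sizes. The parameter names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, p):
    # Simplification: a single head followed by the output projection.
    q, k, v = x @ p["w_q"], x @ p["w_k"], x @ p["w_v"]
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v @ p["w_o"]

def feed_forward(x, p):
    # Two linear maps with a ReLU in between, applied to each position.
    return np.maximum(0.0, x @ p["w_1"] + p["b_1"]) @ p["w_2"] + p["b_2"]

def encoder_block(x, p):
    x = layer_norm(x + self_attention(x, p))    # sub-layer 1 + residual
    return layer_norm(x + feed_forward(x, p))   # sub-layer 2 + residual

def init_params(rng, d_model, d_ff):
    shapes = {"w_q": (d_model, d_model), "w_k": (d_model, d_model),
              "w_v": (d_model, d_model), "w_o": (d_model, d_model),
              "w_1": (d_model, d_ff), "b_1": (d_ff,),
              "w_2": (d_ff, d_model), "b_2": (d_model,)}
    return {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}

rng = np.random.default_rng(0)
blocks = [init_params(rng, d_model=8, d_ff=32) for _ in range(2)]
x = rng.normal(size=(4, 8))     # stands in for embedding + positional encoding
for p in blocks:                # the output of one block feeds the next
    x = encoder_block(x, p)
```

Every block maps a `(seq_len, d_model)` array to another array of the same shape, which is what allows the blocks to be stacked.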

2.3 Decoders

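The Decoders are a stack of Decoder blocks, and each Decoder block has three sub-layers:

Self-Attention
Encoder-Decoder Attention
Feed Forward

A Decoder block is shown below:

Figure 14: Decoder

The decoding process is shown in the figure below:

Figure 15: The decoding process

Points to note about the Decoder:

2.3.1 Masked Attention

In the Decoder, to keep information about later tokens from leaking, attention only looks at the words generated so far; everything after the current position is masked.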

Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
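
A small NumPy sketch of this masking follows. The helper names `causal_mask` and `masked_softmax` are illustrative, and the scores are set to zero so that the effect of the mask is easy to read off.

```python
import numpy as np

def causal_mask(n):
    """Boolean mask that is True where attention is allowed:
    position i may attend to positions j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Set disallowed scores to -inf before the softmax so they get weight 0."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                  # equal raw scores for every pair
weights = masked_softmax(scores, causal_mask(4))
# Row i spreads its weight evenly over positions 0..i and gives 0 to the rest.
```

Every row keeps at least its own position unmasked, so the softmax never has to normalize a row made entirely of −∞.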

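2.3.2 Self-Attention and Encoder-Decoder Attention

During training, the first sub-layer is Self-Attention, so its $Q$, $K$, $V$ all come from the embedding of the target sequence. The second sub-layer is Encoder-Decoder Attention, so its $K$ and $V$ come from the output of the Encoders, while its $Q$ is the output of the Self-Attention sub-layer.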
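
The difference between the two attention layers is only in where the queries, keys, and values come from. The following sketch illustrates the Encoder-Decoder case with random stand-in arrays; the names `enc_out` and `dec_self_out` and all sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention with separate query and key/value inputs."""
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v

rng = np.random.default_rng(0)
d = 8
enc_out = rng.normal(size=(5, d))        # Encoders output, source length 5
dec_self_out = rng.normal(size=(3, d))   # decoder Self-Attention output, target length 3
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# Queries come from the decoder side; keys and values come from the Encoders.
out = attention(dec_self_out @ w_q, enc_out @ w_k, enc_out @ w_v)
```

The result has one row per target position, and each row is a weighted mix over the five source positions.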

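At prediction time the target sequence is unknown, so the first attention step is given a special start-of-sequence marker as the first token of the target, and the rest of the sequence is generated one token at a time.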
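
The generation loop can be sketched as below. Everything model-specific here is hypothetical: `toy_decoder_step` stands in for the whole Decoders + Linear stack, the token ids `BOS_ID` and `EOS_ID` are made up, and picking the highest-scoring token at each step (greedy decoding) is just the simplest choice for illustration.

```python
import numpy as np

# Hypothetical vocabulary: id 0 marks the start, id 4 marks the end.
BOS_ID, EOS_ID, VOCAB_SIZE = 0, 4, 5

def toy_decoder_step(tokens):
    """Stand-in for the real model: returns next-token scores given the
    tokens generated so far. This toy always favors (last token + 1)."""
    logits = np.zeros(VOCAB_SIZE)
    logits[min(tokens[-1] + 1, EOS_ID)] = 1.0
    return logits

def greedy_decode(step_fn, max_len=10):
    tokens = [BOS_ID]                       # start from the start marker
    while len(tokens) < max_len:
        next_id = int(np.argmax(step_fn(tokens)))
        tokens.append(next_id)              # feed the prediction back in
        if next_id == EOS_ID:               # stop at the end marker
            break
    return tokens

decoded = greedy_decode(toy_decoder_step)   # [0, 1, 2, 3, 4]
```

Each iteration sees only the tokens produced so far, which is the same restriction the mask enforces during training.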

2.3.3 Label Smoothing

During training, the Transformer uses label smoothing with $\epsilon_{ls} = 0.1$. For details on label smoothing, see 《Rethinking the Inception Architecture for Computer Vision》.
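
As a rough illustration, label smoothing replaces each one-hot target with a mixture of the one-hot vector and a uniform distribution over the vocabulary. The function name `smooth_labels` and the toy vocabulary size are assumptions, and implementations differ in exactly how they spread the smoothing mass.

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    """Mix one-hot targets with a uniform distribution:
    (1 - eps) * one_hot + eps / vocab_size."""
    one_hot = np.eye(vocab_size)[target_ids]
    return one_hot * (1.0 - eps) + eps / vocab_size

smoothed = smooth_labels(np.array([2, 0]), vocab_size=5)
# The true class gets 0.9 + 0.1 / 5 = 0.92, every other class gets 0.02.
```

Each row still sums to 1, so the smoothed targets remain a valid probability distribution for a cross-entropy loss.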

2.4 The Final Linear and Softmax Layer

Figure 16: Linear and Softmax layer
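
A minimal sketch of this last stage: the Linear layer projects each decoder output vector to one score per vocabulary entry, and the Softmax turns those scores into probabilities. The names `w_vocab` and `b_vocab`, the toy vocabulary of 10 entries, and the random weights are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def output_probabilities(decoder_out, w_vocab, b_vocab):
    """Linear projection to vocabulary size, then softmax per position."""
    logits = decoder_out @ w_vocab + b_vocab    # (seq_len, vocab_size)
    return softmax(logits, axis=-1)

rng = np.random.default_rng(0)
decoder_out = rng.normal(size=(3, 8))           # 3 positions, model dimension 8
w_vocab = rng.normal(size=(8, 10))              # toy vocabulary of 10 entries
b_vocab = np.zeros(10)

probs = output_probabilities(decoder_out, w_vocab, b_vocab)
predicted_ids = probs.argmax(axis=-1)           # most likely token per position
```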

Learning Resources

Transformer

《Attention Is All You Need》
The Illustrated Transformer
深度学习中的注意力机制
BERT大火却不懂Transformer？读这一篇就够了
The Annotated Transformer
《Attention is All You Need》浅读（简介+代码）
拆 Transformer 系列一：Encoder-Decoder 模型架构详解
拆 Transformer 系列二：Multi- Head Attention 机制详解
Transformer 原理解析
github tensor2tensor

Positional Encoding

positional encoding位置编码详解：绝对位置与相对位置编码对比

Mask mechanism

Transformer 源码中 Mask 机制的实现

label smoothing

label smoothing(标签平滑)学习笔记

Other Resources

github transformers
Transformer code (TensorFlow version)
tensor2tensor github