Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

attention [ə'tenʃ(ə)n]: n. notice, regard, attentiveness, care; int. attention! (as a command)
Computer Science, CS
Computer Vision, CV
Computation and Language, CL
Bilingual Evaluation Understudy, BLEU
University of Toronto, UofT or UToronto
Google Research
long short-term memory, LSTM
recurrent neural network, RNN
recursive neural network, RvNN

arXiv (archive - the X represents the Greek letter chi [χ]) is a repository of electronic preprints approved for posting after moderation, but not full peer review.

Google Brain is a deep learning artificial intelligence research team at Google.

December 6, 2017

Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Neural Information Processing Systems: NeurIPS (formerly abbreviated NIPS)
Long Beach: a city in California
California, CA: a U.S. state, nicknamed the Golden State

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

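Mainstream sequence-to-sequence transduction models are all based on complex recurrent or convolutional networks that contain an encoder and a decoder, and the best-performing models add an attention mechanism between the encoder and the decoder. This paper proposes a new network structure that discards recurrence and convolution and relies on attention alone.

Mainstream sequence transduction models are mostly based on RNNs/CNNs. Google's Transformer abandons the RNN/CNN structure entirely and, starting from the characteristics of natural language itself, builds a machine translation architecture based purely on attention.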

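transduction [træns'dʌkʃən]: n. transduction, conversion (of energy)
dispense [dɪ'spens]: v. to distribute, hand out, provide (especially a service), make up (medicine); "dispense with" means to do without
ensemble [ɒn'sɒmb(ə)l]: n. musical group, whole, totality, matching outfit
constituency [kən'stɪtjʊənsi]: n. electoral district, the voters of a district, (collectively) supporters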

1 Introduction

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.

Recurrent models typically organize their computation around the symbol positions of the input and output sequences.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

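RNN/LSTM/GRU are the standard tools for sequence modeling and transduction problems, especially in language modeling and machine translation. A typical recurrent network computes along the symbol positions of the input and output sequences: at each step, the hidden state $h_t$ is a function of the previous hidden state $h_{t-1}$ and the input at the current position $t$. This sequential dependence is what blocks parallel training. Attention mechanisms aim to solve the problem of long-distance dependencies in sequences. The Transformer avoids recurrence and relies entirely on attention to capture global dependencies between the input and output sequences.

RNN-based seq2seq models struggle with long sentences, cannot be parallelized, and face alignment problems. Adding attention to seq2seq models improved accuracy, and seq2seq models at the time were all RNN + attention models.

The network must compress all necessary information of the source sentence into a fixed-length vector, which can make it hard to cope with long sentences, especially ones longer than those in the training corpus. The output at each time step depends on the outputs of earlier time steps, so the model cannot be parallelized, is inefficient, and faces alignment problems. CNNs cannot directly handle variable-length sequences but can be computed in parallel. A seq2seq model built entirely on CNNs can be parallelized, but it is very memory-hungry, and tuning its parameters on large data sets is not easy. Attention Is All You Need drops the earlier pattern of combining encoder-decoder models with RNNs/CNNs and uses attention alone, reducing computation and improving parallel efficiency without hurting final accuracy.

Future direction: directionality of the input (unidirectional -> bidirectional)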

2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

albeit [ɔːl'biːɪt]: conj. although, even though
counteract [.kaʊntər'ækt]: v. to offset, resist, act against
entailment [en'teɪlmənt]: n. entailment, something that is logically implied

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].

3 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations $(x_{1}, ..., x_{n})$ to a sequence of continuous representations $\mathbf{z} = (z_{1}, ..., z_{n})$. Given $\mathbf{z}$, the decoder then generates an output sequence $(y_{1}, ..., y_{m})$ of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

Figure 1: The Transformer - model architecture.

The Transformer

sequence-to-sequence, Seq2Seq
The encoder on the left takes the input; the decoder on the right produces the output.

The encoder on the left and the decoder on the right are combined. The encoder has N layers and the decoder has N layers. The output of the encoder is fed into every decoder layer.

3.1 Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm($x$ + Sublayer($x$)), where Sublayer($x$) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_\text{model} = 512$.

The encoder has $N = 6$ layers, each with two sub-layers.
The first sub-layer is a multi-head self-attention mechanism, which computes self-attention over the input.
The second sub-layer is a simple fully connected network.

Each sub-layer is wrapped in a residual connection modeled on residual networks, so the output of each sub-layer is LayerNorm($x$ + Sublayer($x$)), where Sublayer($x$) is the mapping the sub-layer applies to its input $x$. To make the residual addition possible, all sub-layers and the embedding layer produce outputs of the same dimension.
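The wrapper LayerNorm($x$ + Sublayer($x$)) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `layer_norm` and `apply_sublayer` are hypothetical, the learned gain and bias of layer normalization are omitted, and a random linear map stands in for a real sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance.
    Assumption: the learned gain and bias parameters are left out for brevity."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def apply_sublayer(x, sublayer):
    """Residual connection around the sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

d_model = 512
rng = np.random.default_rng(0)
x = rng.normal(size=(10, d_model))               # 10 positions, d_model features each
w = rng.normal(size=(d_model, d_model)) * 0.01   # stand-in sub-layer weights
out = apply_sublayer(x, lambda h: h @ w)
print(out.shape)  # (10, 512)
```

Because the residual adds $x$ to Sublayer($x$), the sub-layer must return the same shape it receives, which is why every sub-layer outputs $d_\text{model}$ features.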

Decoder: The decoder is also composed of a stack of $N = 6$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.

attend [ə'tend]: v. to be present at, to pay attention, to concentrate
facilitate [fə'sɪləteɪt]: v. to promote, to make easier

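The decoder has $N = 6$ layers, each with 3 sub-layers.
The first sub-layer is masked multi-head attention, which also computes self-attention over its input. Because decoding is a generation process, at step i only the earlier steps have results and the later steps do not, so a mask is needed.
The second sub-layer performs multi-head attention over the output of the encoder stack.
The third sub-layer is a fully connected network, the same as in the encoder.
The self-attention layer in the decoder is modified so that it can only see inputs before the current step: attention is computed only over the inputs at steps before t. This is the mask operation.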

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

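Attention measures degree of relevance (for example, in translation, different English words depend on the Chinese words to different degrees). It maps a query (Q) and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of all values (V), with the weights computed from the query and the corresponding key. The computation has three steps.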

  1. Compare the similarity of $Q$ and each key $K_{i}$, denoted $f$.
    $f(Q, K_{i}), \ i = 1, 2, \ldots, m$

  2. Apply softmax to the similarities.
    $w_{i} = \frac{e^{f(Q, K_{i})}}{\sum_{j=1}^{m} e^{f(Q, K_{j})}}, \ i = 1, 2, \ldots, m$

  3. Use the weights $w_{i}$ to take a weighted sum over all values in $V$, giving the attention vector.
    $\sum_{i=1}^{m} w_{i} V_{i}$

The similarity in step 1 is commonly computed in one of four ways:
Dot product: $f(Q, K_{i}) = Q^{T}K_{i}$
Weighted (general): $f(Q, K_{i}) = Q^{T}WK_{i}$
Concatenation (concat): $f(Q, K_{i}) = W[Q; K_{i}]$
Perceptron: $f(Q, K_{i}) = V^{T} \tanh(WQ + UK_{i})$, where $V$ here is a learned weight vector, not the value matrix
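The three steps and the four similarity functions can be tried side by side. The sketch below is illustrative only: the function `attend`, the parameter names (`W`, `W_cat`, `Wp`, `Up`, `vp`) and the random weights are assumptions made for the example, not values from the paper.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())          # subtract the max for numerical stability
    return e / e.sum()

def attend(q, keys, values, score):
    scores = np.array([score(q, k) for k in keys])   # step 1: similarity f(Q, K_i)
    weights = softmax(scores)                        # step 2: softmax over the scores
    return weights @ values, weights                 # step 3: weighted sum of the values

rng = np.random.default_rng(0)
d = 4
q = rng.normal(size=d)
keys = rng.normal(size=(3, d))
values = rng.normal(size=(3, d))

# Hypothetical parameters for the parameterized score functions.
W = rng.normal(size=(d, d))
W_cat = rng.normal(size=2 * d)       # maps the concatenation [q; k] to a scalar
Wp, Up = rng.normal(size=(d, d)), rng.normal(size=(d, d))
vp = rng.normal(size=d)

score_fns = {
    "dot": lambda query, key: query @ key,
    "general": lambda query, key: query @ W @ key,
    "concat": lambda query, key: W_cat @ np.concatenate([query, key]),
    "perceptron": lambda query, key: vp @ np.tanh(Wp @ query + Up @ key),
}

results = {name: attend(q, keys, values, fn) for name, fn in score_fns.items()}
for name, (out, weights) in results.items():
    print(name, weights.round(3))
```

Only the score function changes between the four variants; the softmax and the weighted sum are shared.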

Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.

3.2.1 Scaled Dot-Product Attention

We call our particular attention “Scaled Dot-Product Attention” (Figure 2). The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right) V \tag{1}
$$
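Equation (1) translates almost directly into matrix code. The following is a minimal NumPy sketch, assuming unbatched 2-D inputs and no masking; the function name `scaled_dot_product_attention` and the example shapes are choices made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights           # output: (n_queries, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 64))
K = rng.normal(size=(7, 64))
V = rng.normal(size=(7, 32))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (5, 32) (5, 7)
```

All queries are handled by two matrix multiplications, which is the practical advantage the text attributes to dot-product attention.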

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_{k}}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ [3]. We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.

To illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean 0 and variance 1. Then their dot product, $q \cdot k = \sum^{d_k}_{i=1} q_{i}k_{i}$, has mean 0 and variance $d_k$.
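The variance claim follows because each product $q_i k_i$ has mean 0 and variance $E[q_i^2]E[k_i^2] = 1$, and the variances of $d_k$ independent terms add. A quick simulation can check this empirically; the sketch assumes standard normal components (one distribution that satisfies the stated conditions), and the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5000
variances = {}
for d_k in (4, 64, 512):
    q = rng.normal(size=(n_samples, d_k))   # components with mean 0, variance 1
    k = rng.normal(size=(n_samples, d_k))
    dots = (q * k).sum(axis=1)              # one dot product per sample
    raw_var = dots.var()                    # expected to be close to d_k
    scaled_var = (dots / np.sqrt(d_k)).var()  # expected to be close to 1
    variances[d_k] = (raw_var, scaled_var)
    print(d_k, round(raw_var, 1), round(scaled_var, 2))
```

Dividing by $\sqrt{d_k}$ brings the variance back to roughly 1 regardless of $d_k$, which keeps the softmax inputs in a range where gradients are not vanishingly small.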

  1. $Q, K, V$ in Scaled Dot-Product Attention. Given the input $X$, three linear transformations map $X$ to $Q, K, V$. For example, the two words Thinking and Machines are embedded as two $[1 \times 4]$ vectors $X_{1}, X_{2}$. Each is multiplied by the three $[4 \times 3]$ matrices $(W^{Q}, W^{K}, W^{V})$ to give six $[1 \times 3]$ vectors $\{q_{1}, q_{2}\}, \{k_{1}, k_{2}\}, \{v_{1}, v_{2}\}$.

  2. The dot product of $q_{1}$ and $k_{1}$ gives a score of 112, and the dot product of $q_{1}$ and $k_{2}$ gives a score of 96.

  3. To make the gradients more stable, the scores are normalized by dividing by 8 (the square root of $d_k = 64$, the key dimension used in the paper). Softmax over the resulting scores $[14, 12]$ gives the weights $[0.88, 0.12]$.

  4. The weights $[0.88, 0.12]$ multiply the values $[v_{1}, v_{2}]$ to give weighted values, which are summed to give $z_{1}$, the output of this layer for the first word. In other words, $Q$ and $K$ are used to compute the weights of thinking with respect to thinking and machines; these weights multiply the $V$ of thinking and machines, and the sum gives the output $Z$ for each word.

  5. The steps above operate on single vectors. In matrix form, the input is a $[2 \times 4]$ matrix (the word embeddings), each projection is a $[4 \times 3]$ matrix, and the result is $Q, K, V$.

$Q$ is multiplied by the transpose of $K$ and divided by the square root of $d_k$. Softmax gives weights that sum to 1, and these multiply $V$ to give the output $Z$. $Z$ is an output that takes into account the words around thinking (machines).
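The numbers in the walkthrough can be reproduced with a few lines of standard-library Python. The only assumption is $d_k = 64$, which is what makes the divisor equal to 8.

```python
import math

scores = [112.0, 96.0]   # q1 . k1 and q1 . k2 from the walkthrough
d_k = 64                 # assumed key dimension, so sqrt(d_k) = 8

scaled = [s / math.sqrt(d_k) for s in scores]
largest = max(scaled)
exps = [math.exp(s - largest) for s in scaled]   # shift by the max for stability
weights = [e / sum(exps) for e in exps]

print(scaled)                          # [14.0, 12.0]
print([round(w, 2) for w in weights])  # [0.88, 0.12]
```

The gap of 2 between the scaled scores is all that matters to the softmax: the first weight is $1 / (1 + e^{-2}) \approx 0.88$.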

$QK^T$ forms a word-to-word attention map (after softmax, each row is a set of weights that sums to 1). For example, if the input is the sentence I have a dream, 4 words in total, it forms a $4 \times 4$ attention map in which every word has a weight for every word.

The encoder uses self-attention and the decoder uses masked self-attention. Masking means that in language modelling (or translation) the model is not shown future information. In the attention map, the gray region above the diagonal ends up with weight 0; per the paper, this is done by setting those scores to $-\infty$ before the softmax.

I, the first word, attends only to itself. have, the second word, attends to I and have. a, the third word, attends to the three words I, have, a. Only the last word, dream, attends to all 4 words of the sentence. After the softmax, each row sums to 1.
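The masking described for the four-word example can be sketched as follows. Raw scores are set to zero here purely so the effect of the mask is easy to read; the function name `causal_attention_weights` is an assumption for this example.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over each row after setting scores for future positions to -inf."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)  # row max is finite (the diagonal is kept)
    e = np.exp(masked)                                    # exp(-inf) = 0
    return e / e.sum(axis=-1, keepdims=True)

tokens = ["I", "have", "a", "dream"]
scores = np.zeros((len(tokens), len(tokens)))   # equal raw scores, for illustration only
weights = causal_attention_weights(scores)
print(weights)
```

With equal raw scores, row $i$ spreads its weight uniformly over the first $i + 1$ tokens: the first row is all on I, and the last row gives 0.25 to each of the four words.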

3.2.2 Multi-Head Attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

$$
\begin{aligned}
\text{MultiHead}(Q, K, V) &= \text{Concat}(\text{head}_{1}, ..., \text{head}_{h})W^{O} \\
\text{where} \ \text{head}_{i} &= \text{Attention}(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V})
\end{aligned}
$$

Where the projections are parameter matrices $W_{i}^{Q} \in \mathbb{R}^{d_\text{model}\times d_{k}}$, $W_{i}^{K} \in \mathbb{R}^{d_\text{model}\times d_{k}}$, $W_{i}^{V} \in \mathbb{R}^{d_\text{model}\times d_{v}}$ and $W^{O} \in \mathbb{R}^{h d_{v} \times d_\text{model}}$.

In this work we employ $h = 8$ parallel attention layers, or heads. For each of these we use $d_{k} = d_{v} = d_\text{model}/h = 64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

inhibit [ɪn'hɪbɪt]: v. to restrain, prevent, hinder
projection [prə'dʒekʃ(ə)n]: n. projection, casting (of light or an image), forecast, screening

Multi-Head Attention runs the Scaled Dot-Product Attention process $h$ times and then combines the outputs $Z$.

The same kind of operation is repeated 8 times, giving 8 matrices $Z_i$.
To make the output match the shape of the input, the concatenated result is multiplied by a linear projection $W^O$ to give the final $Z$.
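A compact sketch of multi-head self-attention with the paper's sizes ($d_\text{model} = 512$, $h = 8$) is given below. It is an illustration under assumptions: the weights are random rather than trained, there is no masking or batching, and the function names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    # One scaled dot-product attention per head, each on its own projection.
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate the h outputs (each d_v wide) and project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_O

d_model, h = 512, 8
d_k = d_v = d_model // h
rng = np.random.default_rng(0)
W_Q = rng.normal(size=(h, d_model, d_k)) * 0.05   # random stand-ins for learned weights
W_K = rng.normal(size=(h, d_model, d_k)) * 0.05
W_V = rng.normal(size=(h, d_model, d_v)) * 0.05
W_O = rng.normal(size=(h * d_v, d_model)) * 0.05

X = rng.normal(size=(10, d_model))                # 10 positions
Z = multi_head(X, X, X, W_Q, W_K, W_V, W_O)       # self-attention: Q = K = V = X
print(d_k, Z.shape)  # 64 (10, 512)
```

Since $h \cdot d_v = d_\text{model}$, the concatenated heads already have width 512, and $W^O$ mixes information across heads without changing the shape.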

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

  • In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].

  • The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

  • Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax which correspond to illegal connections. See Figure 2.

mimic ['mɪmɪk]: v. to imitate (someone's speech or manner), especially for comic effect, to resemble; adj. imitative; n. a person (or animal) that imitates

3.3 Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

$$
\text{FFN}(x) = \max(0, xW_{1} + b_{1})W_{2} + b_{2} \tag{2}
$$

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_\text{model} = 512$, and the inner-layer has dimensionality $d_{ff} = 2048$.
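Equation (2) is a two-layer network applied row by row. The sketch below uses the paper's dimensions with random stand-in weights (an assumption for illustration) and checks the "position-wise" property: running a single position on its own gives the same result as running it inside the full sequence.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row (position) of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * 0.02   # random stand-ins for learned weights
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))             # 10 positions
y = ffn(x, W1, b1, W2, b2)
y_single = ffn(x[3:4], W1, b1, W2, b2)         # position 3 processed alone
print(y.shape, np.allclose(y_single[0], y[3]))
```

No information moves between positions in this sub-layer; mixing across positions happens only in the attention sub-layers.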

3.4 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_\text{model}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by $\sqrt{d_\text{model}}$.

The output of the decoder stack is the input to the final linear and softmax layers, which (reading the figure from the bottom up) produce the final prediction.
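The weight sharing can be illustrated with a single matrix used in both directions. This is a rough sketch under assumptions: the vocabulary size of 1000 and the token ids are made up, the matrix `E` is random rather than learned, and the decoder layers that would normally sit between embedding and output projection are skipped.

```python
import numpy as np

vocab_size, d_model = 1000, 512      # vocab_size is a hypothetical value
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d_model)) * 0.02   # the one shared weight matrix

def embed(token_ids):
    """Embedding lookup, scaled by sqrt(d_model) as described in the text."""
    return E[token_ids] * np.sqrt(d_model)

def next_token_probs(decoder_output):
    """Pre-softmax linear transformation that reuses E (transposed), then softmax."""
    logits = decoder_output @ E.T
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

tokens = np.array([5, 42, 7])        # made-up token ids
x = embed(tokens)
probs = next_token_probs(x)          # stand-in: real decoder layers would transform x first
print(x.shape, probs.shape)  # (3, 512) (3, 1000)
```

Sharing `E` means the model stores one vocabulary-sized matrix instead of three.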

3.5 Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_\text{model}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].

In this work, we use sine and cosine functions of different frequencies:

$$
\begin{aligned}
PE(pos, 2i) &= \sin(pos/10000^{2i/d_\text{model}}) \\
PE(pos, 2i+1) &= \cos(pos/10000^{2i/d_\text{model}})
\end{aligned}
$$

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

sinusoidal [ˌsɪnə'sɔɪdl]: adj. sine-shaped, of or relating to a sine curve
extrapolate [ɪk'stræpəleɪt]: v. to infer, deduce, estimate beyond known data
hypothesize [haɪ'pɒθəsaɪz]: v. to suppose, to assume

The model contains no recurrence or convolution, so it cannot capture sequence order. For example, if the rows of $K$ and $V$ are shuffled in the same way, the attention result is unchanged. But order information is very important, as it represents the global structure, so the relative or absolute position of each token in the sequence must be made available to the model.

Each token's position embedding is a vector of dimension $d_\text{model}$. The original input embedding and the position embedding are added to form the final embedding, which is the input to the encoder/decoder. The position embedding is computed as:

$$
\begin{aligned}
PE(pos, 2i) &= \sin(pos/10000^{2i/d_\text{model}}) \\
PE(pos, 2i+1) &= \cos(pos/10000^{2i/d_\text{model}})
\end{aligned}
$$

where $pos$ is the position index and $i$ is the dimension index.

The position embedding itself encodes absolute position, but in language relative position also matters. An important reason Google chose the formula above is the pair of identities:

$$
\begin{aligned}
\sin(\alpha + \beta) &= \sin \alpha \cos \beta + \cos \alpha \sin \beta \\
\cos(\alpha + \beta) &= \cos \alpha \cos \beta - \sin \alpha \sin \beta
\end{aligned}
$$

These show that the vector at position $p+k$ can be expressed as a linear transformation of the vector at position $p$, which makes it possible to express relative position information.

In earlier models a position embedding is usually a trained vector and only an extra feature: having it helps, but performance does not drop drastically without it, because RNNs/CNNs capture position information by themselves. In the Transformer, however, the position embedding is the only source of position information, so it is a core component of the model rather than an auxiliary feature.
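The sinusoidal encoding and the linear-shift property can both be checked numerically. In this sketch the function names `positional_encoding` and `shift_matrix` are hypothetical, and a small $d_\text{model} = 16$ is used for readability (the paper uses 512). The shift matrix is built from the angle-addition identities: each (sin, cos) pair is rotated by an angle that depends only on the offset $k$.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

def shift_matrix(k, d_model):
    """Block-diagonal matrix M, independent of pos, with PE[pos + k] = M @ PE[pos]."""
    M = np.zeros((d_model, d_model))
    for i in range(d_model // 2):
        w = 1.0 / 10000.0 ** (2 * i / d_model)    # angular frequency of pair i
        c, s = np.cos(w * k), np.sin(w * k)
        M[2 * i, 2 * i], M[2 * i, 2 * i + 1] = c, s           # sin(a + b) row
        M[2 * i + 1, 2 * i], M[2 * i + 1, 2 * i + 1] = -s, c  # cos(a + b) row
    return M

d_model = 16
pe = positional_encoding(50, d_model)
M = shift_matrix(3, d_model)
print(np.allclose(M @ pe[10], pe[13]), np.allclose(M @ pe[20], pe[23]))
```

The same matrix `M` maps every position to the position 3 steps later, which is the sense in which $PE_{pos+k}$ is a linear function of $PE_{pos}$.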

4 Why Self-Attention

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_{1}, \ldots, x_{n})$ to another sequence of equal length $(z_{1}, \ldots, z_{n})$, with $x_{i}, z_{i} \in \mathbb{R}^{d}$, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.

One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.

The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.

As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O(n)$ sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length $n$ is smaller than the representation dimensionality $d$, which is most often the case with sentence representations used by state-of-the-art models in machine translation, such as word-piece [38] and byte-pair [31] representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size $r$ in the input sequence centered around the respective output position. This would increase the maximum path length to $O(n/r)$. We plan to investigate this approach further in future work.
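A small worked comparison, as a sketch. The per-layer costs $n^2 \cdot d$ for self-attention and $n \cdot d^2$ for a recurrent layer are the entries of Table 1 in the paper, taken up to constant factors; the function names, the example sizes ($n = 50$, $d = 512$) and the use of `ceil(n / r)` as a stand-in for $O(n/r)$ are assumptions made for illustration.

```python
import math


def self_attention_ops(n, d):
    """Per-layer cost of self-attention, up to a constant factor: n^2 * d."""
    return n * n * d


def recurrent_ops(n, d):
    """Per-layer cost of a recurrent layer, up to a constant factor: n * d^2."""
    return n * d * d


def restricted_max_path_length(n, r):
    """Rough number of hops between the farthest positions with neighborhood size r."""
    return math.ceil(n / r)


n, d = 50, 512  # a sentence of 50 tokens, 512-dimensional representations
print(self_attention_ops(n, d))               # 1280000
print(recurrent_ops(n, d))                    # 13107200
print(restricted_max_path_length(1000, 100))  # 10
```

The ratio of the two costs is $d/n$, so self-attention is the cheaper layer exactly when $n < d$.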

Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. $n$ is the sequence length, $d$ is the representation dimension, $k$ is the kernel size of convolutions and $r$ the size of the neighborhood in restricted self-attention.

A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions. Doing so requires a stack of $O(n/k)$ convolutional layers in the case of contiguous kernels, or $O(\log_{k}(n))$ in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of $k$. Separable convolutions [6], however, decrease the complexity considerably, to $O(k \cdot n \cdot d + n \cdot d^{2})$. Even with $k = n$, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
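The last claim is simple arithmetic, sketched below with constant factors ignored. The cost $n^2 \cdot d$ for self-attention comes from the paper's Table 1, and $n \cdot d^2$ is used for the point-wise feed-forward layer; the function names and example sizes are illustrative.

```python
def separable_conv_ops(n, d, k):
    """Separable convolution cost from the text: k*n*d + n*d^2."""
    return k * n * d + n * d * d


def self_attention_ops(n, d):
    """Self-attention cost, up to a constant factor: n^2 * d."""
    return n * n * d


def pointwise_ffn_ops(n, d):
    """Point-wise feed-forward cost, up to a constant factor: n * d^2."""
    return n * d * d


n, d = 50, 512
# With k = n, the first term k*n*d becomes n^2 * d, the self-attention cost.
lhs = separable_conv_ops(n, d, k=n)
rhs = self_attention_ops(n, d) + pointwise_ffn_ops(n, d)
print(lhs == rhs)  # True
```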

As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

desideratum: n. something needed or wanted (plural: desiderata)
present [prɪ'zent]: n. the present time; a gift. adj. existing, in attendance. v. to show, to put forward, to submit

5 Training

This section describes the training regime for our models.

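regime [reɪ'ʒiːm]: n. a system of government; a method of organization; a management system
token ['təʊkən]: n. a sign or symbol; a disc used to operate a machine or as a form of payment; a voucher. adj. done for appearance only, perfunctory, symbolic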

5.1 Training Data and Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
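The paper does not spell out the batching procedure, so the following is only one plausible sketch of "batching by approximate sequence length": sort the sentence pairs by length, then fill each batch until a token budget is reached. The function name `batch_by_length`, the sort key and the toy data are assumptions.

```python
def batch_by_length(pairs, max_tokens=25000):
    """Group (source_tokens, target_tokens) pairs into batches of similar length.

    A batch is closed once adding another pair would push either its source
    or its target token count past max_tokens.
    """
    ordered = sorted(pairs, key=lambda p: max(len(p[0]), len(p[1])))
    batches, current = [], []
    src_count = tgt_count = 0
    for src, tgt in ordered:
        over_budget = (src_count + len(src) > max_tokens
                       or tgt_count + len(tgt) > max_tokens)
        if current and over_budget:
            batches.append(current)
            current, src_count, tgt_count = [], 0, 0
        current.append((src, tgt))
        src_count += len(src)
        tgt_count += len(tgt)
    if current:
        batches.append(current)
    return batches


# Toy data with a tiny budget so the grouping is easy to follow.
pairs = [
    (["a"] * 3, ["b"] * 4),
    (["a"] * 10, ["b"] * 9),
    (["a"] * 2, ["b"] * 2),
    (["a"] * 11, ["b"] * 12),
]
batches = batch_by_length(pairs, max_tokens=12)
print([len(batch) for batch in batches])  # [2, 1, 1]
```

Grouping sentences of similar length keeps padding low, so most of the roughly 25000 source and 25000 target tokens in a batch are real tokens.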

5.2 Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the bottom line of Table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).
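The quoted durations follow from steps times step time, up to rounding. A quick check, with a helper name chosen here for illustration:

```python
def training_time_hours(steps, seconds_per_step):
    """Wall-clock training time in hours."""
    return steps * seconds_per_step / 3600


base_hours = training_time_hours(100_000, 0.4)
big_days = training_time_hours(300_000, 1.0) / 24
print(round(base_hours, 1))  # 11.1
print(round(big_days, 2))    # 3.47
```

The base figure lands slightly under the stated 12 hours because 0.4 seconds is itself rounded: 12 hours over 100,000 steps is about 0.43 seconds per step.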

5.3 Optimizer

We used the Adam optimizer [20] with $\beta_{1} = 0.9$, $\beta_{2} = 0.98$ and $\epsilon = 10^{-9}$. We varied the learning rate over the course of training, according to the formula:

$$
lrate = d_{\text{model}}^{-0.5} \cdot \min(step\_num^{-0.5}, step\_num \cdot warmup\_steps^{-1.5}) \tag{3}
$$

This corresponds to increasing the learning rate linearly for the first $warmup\_steps$ training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used $warmup\_steps = 4000$.
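Equation (3) translates directly into a few lines of Python. The default `d_model=512` is the base model's value in the paper and is an assumption here, since this section does not restate it; `step_num` is assumed to start at 1.

```python
def learning_rate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate from equation (3); step_num must be at least 1."""
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)


peak = learning_rate(4000)
for step in (1, 2000, 4000, 16000, 100000):
    print(step, learning_rate(step))
```

The two arguments of `min` are equal at `step_num == warmup_steps`, which is where the schedule peaks (about 7e-4 with these defaults). Halfway through warmup (step 2000) the rate is half the peak, and it falls back to half the peak again at step 16000.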

5.4 Regularization

We employ three types of regularization during training:

Residual Dropout We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop} = 0.1$.
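The order of operations, LayerNorm(x + Dropout(Sublayer(x))), can be sketched for a single position vector as below. This is a simplified illustration, not the paper's implementation: it assumes inverted dropout (survivors rescaled by 1 / (1 - p)), uses a layer normalization without the learned gain and bias, and all function names are made up for the example.

```python
import math
import random


def dropout(x, p_drop, rng):
    """Inverted dropout: zero each element with probability p_drop, rescale the rest."""
    keep = 1.0 - p_drop
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


def layer_norm(x, eps=1e-6):
    """Normalize to zero mean and unit variance (learned gain and bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def residual_sublayer(x, sublayer, p_drop=0.1, rng=None, training=True):
    """Compute LayerNorm(x + Dropout(sublayer(x))) for one position vector x."""
    rng = rng or random.Random()
    out = sublayer(x)
    if training:
        # Dropout touches only the sub-layer output, before the residual add.
        out = dropout(out, p_drop, rng)
    return layer_norm([xi + oi for xi, oi in zip(x, out)])


x = [1.0, 2.0, 3.0, 4.0]
y = residual_sublayer(x, lambda v: [2 * t for t in v], rng=random.Random(0))
print(y)
```

Because dropout is applied before the residual addition, the skip path always carries the unmodified input `x`.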

Label Smoothing During training, we employed label smoothing of value $\epsilon_{ls} = 0.1$ [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
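A minimal sketch of why smoothing makes the model "more unsure". It assumes one common variant, in which the true token keeps $1 - \epsilon_{ls}$ of the probability mass and the rest is spread evenly over the other tokens; the paper does not specify the variant, and the function names and the 4-token vocabulary are illustrative.

```python
import math


def smooth_labels(target_index, vocab_size, eps=0.1):
    """Target distribution with 1 - eps on the true token, eps spread over the rest."""
    off_value = eps / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[target_index] = 1.0 - eps
    return dist


def cross_entropy(target_dist, predicted_probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, predicted_probs) if t > 0)


smoothed = smooth_labels(target_index=0, vocab_size=4)
confident = [0.97, 0.01, 0.01, 0.01]  # very sure of the correct token
loss_confident = cross_entropy(smoothed, confident)
loss_matched = cross_entropy(smoothed, smoothed)  # prediction equals the smoothed target
print(loss_confident > loss_matched)  # True
```

Under the smoothed target, the very confident prediction is penalized more than the prediction that matches the smoothed distribution, so training pushes probabilities away from 1.0. That lowers the likelihood assigned to the reference tokens, which is the sense in which perplexity gets worse.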

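hurt [hɜː(r)t]: v. to injure, to feel pain, to upset. n. a grievance, emotional pain. adj. injured (physically or emotionally)
perplexity [pə(r)'pleksəti]: n. confusion, bewilderment; something hard to understand. In language modeling it is a metric, the exponential of the average per-token negative log-likelihood, so lower is better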

6 Results

6.1 Machine Translation

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.

Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.

Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.
The table reports tests that vary the model's own hyperparameters one group at a time, to confirm which of them affect the model most.

wordpiece: n. a subword unit produced by word-piece tokenization

On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate $P_{drop} = 0.1$, instead of 0.3.

For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and length penalty $\alpha = 0.6$ [38]. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + 50, but terminate early when possible [38].
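Checkpoint averaging means averaging the parameter values themselves, not the model outputs. A minimal sketch, assuming each checkpoint is a plain dict from parameter name to a list of floats (a stand-in for real tensors); both function names are illustrative.

```python
def average_checkpoints(checkpoints):
    """Average parameter values element-wise across checkpoints.

    Each checkpoint maps a parameter name to a list of floats.
    """
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / count for col in columns]
    return averaged


def max_output_length(input_length, extra=50):
    """Decoding length cap described in the text: input length + 50."""
    return input_length + extra


last_three = [
    {"w": [1.0, 2.0]},
    {"w": [3.0, 4.0]},
    {"w": [5.0, 6.0]},
]
averaged = average_checkpoints(last_three)
print(averaged)               # {'w': [3.0, 4.0]}
print(max_output_length(20))  # 70
```

The result is a single set of weights, so inference costs the same as for one model, unlike an ensemble.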

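beam search: a directed search that keeps only a fixed number of the best partial hypotheses at each step
beam [biːm]: n. a structural beam, a ray of light, a balance beam, a beam (of radio waves). v. to shine, to emit light, to smile broadly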

Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU.

We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.
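The estimate is a plain product, so it can be reproduced from numbers given in the text: 12 hours and 3.5 days of training on 8 P100 GPUs at 9.5 TFLOPS each. The function and variable names below are illustrative.

```python
# Sustained single-precision throughput per GPU, in TFLOPS, as listed above.
SUSTAINED_TFLOPS = {"K80": 2.8, "K40": 3.7, "M40": 6.0, "P100": 9.5}


def training_flops(hours, num_gpus, gpu):
    """Estimated floating point operations: time * number of GPUs * per-GPU rate."""
    seconds = hours * 3600
    return seconds * num_gpus * SUSTAINED_TFLOPS[gpu] * 1e12


base_flops = training_flops(hours=12, num_gpus=8, gpu="P100")
big_flops = training_flops(hours=3.5 * 24, num_gpus=8, gpu="P100")
print(f"{base_flops:.1e}")  # 3.3e+18
print(f"{big_flops:.1e}")   # 2.3e+19
```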

6.2 Model Variations

To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.

In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
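Keeping the amount of computation constant means the per-head dimension shrinks as the number of heads grows, so that heads times per-head dimension stays equal to $d_\text{model}$. A small sketch, assuming the base model's $d_\text{model} = 512$ from the paper (not restated in this section) and a helper name chosen for illustration:

```python
def per_head_dim(d_model, num_heads):
    """Key/value dimension per head when num_heads * dim is held at d_model."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads


head_dims = {h: per_head_dim(512, h) for h in (1, 4, 8, 16, 32)}
print(head_dims)  # {1: 512, 4: 128, 8: 64, 16: 32, 32: 16}
```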

In Table 3 rows (B), we observe that reducing the attention key size $d_k$ hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.

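determine [dɪ'tɜː(r)mɪn]: v. to establish, to decide, to measure, to find out
sophisticate [sə'fɪstɪkeɪt]: v. to mislead with sophistry, to confuse, to tamper with, to adulterate. n. a worldly-wise, widely experienced person (the adjective "sophisticated" used in the text means complex, refined)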

6.3 English Constituency Parsing

To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].

attain [ə'teɪn]: v. to obtain, to reach (an age, level, or condition)

Table 4: The Transformer generalizes well to English constituency parsing (Results are on Section 23 of WSJ)

We trained a 4-layer transformer with $d_{model} = 1024$ on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences. We also trained it in a semi-supervised setting, using the larger high-confidence and BerkeleyParser corpora with approximately 17M sentences [37]. We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.

We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning rates and beam size on the Section 22 development set; all other parameters remained unchanged from the English-to-German base translation model. During inference, we increased the maximum output length to input length + 300. We used a beam size of 21 and $\alpha = 0.3$ for both WSJ only and the semi-supervised setting.

corpus ['kɔːpəs]: n. a body of texts, collected works, a language corpus (plural: corpora ['kɔːpərə])

Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].

In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.

7 Conclusion

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.

We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.

The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.

Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.

fruitful ['fruːtf(ə)l]: adj. productive, yielding good results, fertile
inspiration [.ɪnspə'reɪʃ(ə)n]: n. a sudden good idea; a person or thing that inspires or motivates
modality [məʊ'dæləti]: n. form, mode, manner

Attention Visualizations

Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb ‘making’, completing the phrase ‘making…more difficult’. Attentions here shown only for the word ‘making’. Different colors represent different heads. Best viewed in color.

spirit ['spɪrɪt]: n. spirit, soul, mind, courage. v. to carry off secretly or mysteriously
registration [.redʒɪ'streɪʃ(ə)n]: n. registration, enrollment, a registration record

Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5 and 6. Note that the attentions are very sharp for this word.

anaphora [ə'næfərə]: n. backward reference (a later word refers back to or stands in for an earlier one)

Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.

References

WeChat official account ID: 小小挖掘机
WeChat official account ID: 深度学习自然语言处理
