Color legend: orange = purpose, conclusions, advantages; magenta = breakthrough content or conclusions, knowledge points I urgently need; red = especially important content; yellow = important content; green = problems; blue = solutions; gray = unverified personal doubts/assumptions, or outdated and unimportant markings.
Three core ideas of the Transformer:
--> Dependencies are tied to positions and computation steps (alignment), which limits parallelization, and patch-up strategies do not remove this limit ==> attention mechanism ==> Transformer, which allows more parallelization.
--> Because relating input and output signals takes more operations the farther apart they are, the Transformer relates them in a constant number of operations ==> this reduces effective resolution ==> hence Multi-Head Attention.
Table of Contents
Abstract
1. Introduction
1.1 Background
1.2 Recurrent models
1.3 Attention mechanisms
1.4 Transformer
2. Background
2.1 Self-attention
2.2 End-to-End memory networks
2.3 Transformer
3. Model Architecture
Encoder-Decoder Structure
3.1 Encoder and Decoder Stacks
3.2 Attention
3.2.1 Scaled Dot-Product Attention
3.2.2 Multi-Head Attention
3.2.3 Applications of Attention in our Model
3.3 Position-wise Feed-Forward Networks
3.4 Embeddings and Softmax
3.5 Positional Encoding
4. Why Self-Attention
Reasons for choosing self-attention
5. Training
5.4 Regularization
6. Results
7. Conclusions
7.2 Future research of Transformer
References
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.
The best performing models also connect the encoder and decoder through an attention mechanism.
The Transformer is a new simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
RNNs, LSTMs, and gated recurrent neural networks have been firmly established as state-of-the-art approaches in sequence modeling and transduction problems such as language modeling and machine translation.
Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.
Recurrent models typically factor computation along the symbol positions of the input and output sequences.
Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t.
Problem: This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
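A minimal sketch of that sequential constraint, using a hypothetical vanilla tanh RNN cell (the weight names W_x, W_h and the toy sizes are illustrative, not the paper's):

```python
import numpy as np

# Hypothetical toy dimensions; the point is the sequential loop, not the cell itself.
d_in, d_h, seq_len = 4, 8, 10
rng = np.random.default_rng(0)
W_x = rng.normal(size=(d_in, d_h))
W_h = rng.normal(size=(d_h, d_h))
x = rng.normal(size=(seq_len, d_in))      # one training example

h = np.zeros(d_h)
hidden_states = []
for t in range(seq_len):                  # positions must be visited in order:
    h = np.tanh(x[t] @ W_x + h @ W_h)     # h_t depends on h_{t-1}
    hidden_states.append(h)
# The loop over t cannot be parallelized within this example,
# which is exactly the constraint the Transformer removes.
```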
Solve: Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance in the case of the latter.
Problem: The fundamental constraint of sequential computation remains, however.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences.
In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.
The Transformer is a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.
It allows for significantly more parallelization and can reach a new state of the art in translation quality.
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet, and ConvS2S, all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions.
Problem: The number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet.
This makes it more difficult to learn dependencies between distant positions.
Solve: In the Transformer this is reduced to a constant number of operations.
Problem: This comes at the cost of reduced effective resolution due to averaging attention-weighted positions.
Solve: An effect we counteract with Multi-Head Attention (Section 3.2).
Self-attention, also called intra-attention, relates different positions of a single sequence in order to compute a representation of the sequence.
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence.
The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.
Most competitive neural sequence transduction models have an encoder-decoder structure.
Encoder: maps an input sequence of symbol representations (x1, x2, ..., xn) to a sequence of continuous representations z = (z1, z2, ..., zn).
Decoder: given z, generates an output sequence (y1, y2, ..., ym) of symbols, one element at a time.
At each step the model is auto-regressive --> it consumes the previously generated symbols as additional input when generating the next.
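A minimal sketch of this auto-regressive interface; encode() and decode_step() below are hypothetical stand-ins for the real encoder and decoder stacks, shown only to make the generation loop concrete:

```python
# Hypothetical encode() / decode_step() placeholders for the actual stacks.
def encode(src_tokens):
    # maps (x_1, ..., x_n) to continuous representations z = (z_1, ..., z_n)
    return [f"z({tok})" for tok in src_tokens]

def decode_step(z, generated):
    # returns the next symbol given z and all previously generated symbols
    return f"y{len(generated)}"

def generate(src_tokens, max_len=5, eos="<eos>"):
    z = encode(src_tokens)           # the encoder runs once
    ys = ["<bos>"]                   # decoding starts from a begin-of-sequence symbol
    for _ in range(max_len):
        y_next = decode_step(z, ys)  # one element at a time, conditioned on ys
        ys.append(y_next)
        if y_next == eos:
            break
    return ys[1:]

print(generate(["x1", "x2", "x3"]))
```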
Figure 1: The Transformer model architecture.
Further reading:
Questions and answers about the Transformer model: https://www.jianshu.com/p/4064217e1c19
Encoder: The encoder is composed of a stack of N = 6 identical layers.
==> These 6 layers are stacked in sequence: each layer takes the output of the previous layer as its input.
Reference: "In the Transformer model, what is the first input to the decoder?" - Zhihu
Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1].
That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.
To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.
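A minimal sketch of the sub-layer wrapper LayerNorm(x + Sublayer(x)); the layer_norm here omits the learned gain and bias for brevity:

```python
import numpy as np

d_model = 512

def layer_norm(x, eps=1e-6):
    # normalize over the feature dimension (learned gain/bias omitted)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    # "the output of each sub-layer is LayerNorm(x + Sublayer(x))"
    return layer_norm(x + sublayer(x))

# toy usage: an identity "sub-layer" on a (seq_len, d_model) input
x = np.random.randn(10, d_model)
out = sublayer_connection(x, lambda h: h)
assert out.shape == (10, d_model)
```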
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer,
the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.
We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions (using a mask).
This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.
The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
We call our particular attention "Scaled Dot-Product Attention" (Figure 2).
The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.
--> Dot product vs. Hadamard product:
Dot product: ordinary matrix multiplication; each entry is the inner product of a row of the first matrix with a column of the second.
Hadamard product: element-wise multiplication of two matrices of the same dimensions, yielding another matrix of the same dimensions.
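A two-line illustration of the difference, using toy matrices in numpy notation:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

print(a @ b)   # dot (matrix) product: rows of a times columns of b
print(a * b)   # Hadamard product: element-wise, same shape as the inputs
```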
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as: Attention(Q, K, V) = softmax(QK^T / √d_k) V.
The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1/√d_k. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
Difference between additive attention and dot-product (multiplicative) attention: although the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
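A minimal numpy sketch of Scaled Dot-Product Attention as defined above; the toy shapes and the softmax helper are mine, not the paper's:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)          # weights over the values
    return weights @ V

# toy usage: 4 queries, 6 key-value pairs
Q = np.random.randn(4, 8)     # d_k = 8
K = np.random.randn(6, 8)
V = np.random.randn(6, 16)    # d_v = 16
out = scaled_dot_product_attention(Q, K, V)
assert out.shape == (4, 16)
```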
Problem: Instead of performing a single attention function with d_model-dimensional keys, values and queries,
Solve: we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively.
On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
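A sketch of the project-attend-concatenate-project pattern described above. The random matrices stand in for the learned projections; h = 8 and d_model = 512 are the paper's base-model settings, assumed here for concreteness:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, h=8, d_model=512):
    d_k = d_v = d_model // h                   # 64 per head in the base model
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(h):
        # random placeholders for the learned per-head projections
        W_q = rng.normal(size=(d_model, d_k))
        W_k = rng.normal(size=(d_model, d_k))
        W_v = rng.normal(size=(d_model, d_v))
        heads.append(attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = rng.normal(size=(h * d_v, d_model))
    return np.concatenate(heads, axis=-1) @ W_o  # concat, then final projection

x = np.random.randn(10, 512)                     # self-attention: Q = K = V = x
out = multi_head_attention(x, x, x)
assert out.shape == (10, 512)
```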
The Transformer uses multi-head attention in three different ways:
In the encoder-decoder attention layers, the queries come from the previous decoder layer and the keys and values come from the output of the encoder; this allows every position in the decoder to attend over all positions in the input sequence.
In the encoder's self-attention layers, all of the keys, values and queries come from the output of the previous layer of the encoder.
In the decoder's self-attention layers, each position in the decoder is allowed to attend to all positions in the decoder up to and including that position.
We need to prevent leftward information flow in the decoder to preserve the auto-regressive property ==> we implement this inside scaled dot-product attention by masking out (setting to -∞) all values in the input of the softmax which correspond to illegal connections.
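A sketch of that masking step: the upper-triangular mask and the toy shapes are mine, but the idea of setting illegal scores to -∞ before the softmax is the one described above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # mask out (set to -inf) the scores of "illegal" connections:
    # position i may not attend to any position j > i
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ V

x = np.random.randn(5, 8)
out = masked_attention(x, x, x)
# after the softmax, each position's attention weights on future positions are 0
```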
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
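A minimal sketch of this position-wise feed-forward network, FFN(x) = max(0, xW1 + b1)W2 + b2, assuming the base model's sizes d_model = 512 and d_ff = 2048:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position identically
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(10, d_model))   # (seq_len, d_model)
assert position_wise_ffn(x, W1, b1, W2, b2).shape == (10, d_model)
```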
We use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model.
We use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by √d_model.
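A sketch of the shared weight matrix; W_emb and the toy vocabulary size are hypothetical. The two points illustrated are the √d_model scaling in the embedding layers and the reuse of the same matrix for the pre-softmax projection:

```python
import numpy as np

vocab, d_model = 100, 512
rng = np.random.default_rng(0)
W_emb = rng.normal(size=(vocab, d_model))   # one matrix shared by both embedding
                                            # layers and the pre-softmax projection

def embed(token_ids):
    # "in the embedding layers, we multiply those weights by sqrt(d_model)"
    return W_emb[token_ids] * np.sqrt(d_model)

def next_token_logits(decoder_output):
    # the pre-softmax linear transformation reuses the transposed embedding matrix
    return decoder_output @ W_emb.T

x = embed(np.array([3, 7, 42]))             # (3, d_model)
logits = next_token_logits(x)               # (3, vocab)
```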
Problem: Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence,
Solve: we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks.
The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].
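A sketch of the fixed (sinusoidal) choice used in the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), summed with the embeddings since both have dimension d_model:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d_model = 512
embeddings = np.random.randn(10, d_model)
x = embeddings + sinusoidal_positional_encoding(10, d_model)  # summed, same dimension
```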
Motivating our use of self-attention we consider three desiderata:
Problem: Learning long-range dependencies is a key challenge in many sequence transduction tasks.
Analysis: One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network.
The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12].
Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d.
A self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations.
To improve computational performance for tasks involving very long sequences, self-attention can be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position.
This would increase the maximum path length to O(n/r). We plan to investigate this approach further in future work.
The complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
Self-attention could yield more interpretable models.
We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized.
In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of P_drop = 0.1.
During training, we employed label smoothing of value ε_ls = 0.1 [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
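A sketch of one common label-smoothing formulation (spreading ε over the non-target tokens); the exact smoothing distribution here is an assumption, only the value ε = 0.1 comes from the paper:

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    # keep 1 - eps on the true token, spread eps uniformly over the other tokens
    n = len(target_ids)
    dist = np.full((n, vocab_size), eps / (vocab_size - 1))
    dist[np.arange(n), target_ids] = 1.0 - eps
    return dist

targets = smooth_labels(np.array([2, 0, 5]), vocab_size=8, eps=0.1)
assert np.allclose(targets.sum(axis=1), 1.0)   # each row is still a distribution
```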
Bigger models are better, and dropout is very helpful in avoiding over-fitting.
The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.