Attention and Self-Attention, Explained Simply

Attention is, to some extent, motivated by how we pay visual attention to different regions of an image or correlate words in one sentence. Take the picture of a Shiba Inu in Fig. 1 as an example.
[Fig. 1: the picture of a Shiba Inu]
Human visual attention allows us to focus on a certain region with “high resolution” (i.e. look at the pointy ear in the yellow box) while perceiving the surrounding image in “low resolution” (i.e. now how about the snowy background and the outfit?), and then adjust the focal point or do the inference accordingly.
Given a small patch of an image, pixels in the rest provide clues about what should be displayed there. We expect to see a pointy ear in the yellow box because we have seen a dog’s nose, another pointy ear on the right, and Shiba’s mystery eyes (stuff in the red boxes). However, the sweater and blanket at the bottom would not be as helpful as those doggy features.
Similarly, we can explain the relationship between words in one sentence or close context. When we see “eating”, we expect to encounter a food word very soon. The color term describes the food, but probably not so much with “eating” directly.

[Figure: one word “attends” to other words in the same sentence differently]

In a nutshell, attention in deep learning can be broadly interpreted as a vector of importance weights: in order to predict or infer one element, such as a pixel in an image or a word in a sentence, we estimate using the attention vector how strongly it is correlated with (or “attends to” as you may have read in many papers) other elements and take the sum of their values weighted by the attention vector as the approximation of the target.
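To make the idea concrete, here is a tiny NumPy sketch (my own illustration, not from any cited paper): arbitrary relevance scores are turned into attention weights by a softmax, and the weighted sum of value vectors serves as the approximation of the target element. All names and numbers are made up.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))      # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical relevance scores of 4 context elements w.r.t. the element we want to infer.
scores = np.array([2.0, 0.5, -1.0, 0.1])
values = np.random.randn(4, 8)     # one 8-dimensional value vector per context element

attn = softmax(scores)             # attention vector: importance weights that sum to 1
approximation = attn @ values      # weighted sum of values approximates the target element

print(attn)                        # the highest score gets the largest weight
print(approximation.shape)         # (8,)
```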

What’s Wrong with Seq2Seq Model?

The seq2seq model was born in the field of language modeling (Sutskever, et al. 2014). Broadly speaking, it aims to transform an input sequence (source) to a new one (target) and both sequences can be of arbitrary lengths. Examples of transformation tasks include machine translation between multiple languages in either text or audio, question-answer dialog generation, or even parsing sentences into grammar trees.
The seq2seq model normally has an encoder-decoder architecture, composed of:

  • An encoder processes the input sequence and compresses the information into a context vector (also known as sentence embedding or “thought” vector) of a fixed length. This representation is expected to be a good summary of the meaning of the whole source sequence.
  • A decoder is initialized with the context vector to emit the transformed output. The early work only used the last state of the encoder network as the decoder initial state.

Both the encoder and decoder are recurrent neural networks, i.e. using LSTM or GRU units.
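For intuition, here is a minimal NumPy sketch of this encoder-decoder pattern. It is only an illustration: it uses plain tanh RNN cells instead of LSTM/GRU, random untrained weights, and it skips the embedding and output layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # hidden size of the toy model

# Toy parameters (random; a real model would learn these). Vanilla RNN cells are used
# instead of LSTM/GRU purely to keep the sketch short.
W_enc, U_enc = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
W_dec, U_dec = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

def encode(source_embeddings):
    """Compress the whole source sequence into ONE fixed-length context vector."""
    h = np.zeros(d)
    for x in source_embeddings:
        h = np.tanh(W_enc @ x + U_enc @ h)
    return h                              # the last hidden state is the context vector

def decode(context, target_len):
    """Initialize the decoder with the context vector and unroll it."""
    s, outputs = context, []
    for _ in range(target_len):
        # A real decoder would also feed in the previously emitted token and project
        # each state to vocabulary logits; this sketch only tracks the hidden states.
        s = np.tanh(W_dec @ s + U_dec @ context)
        outputs.append(s)
    return outputs

source = [rng.normal(size=d) for _ in range(10)]   # 10 embedded source tokens
outputs = decode(encode(source), target_len=8)
print(len(outputs), outputs[0].shape)              # 8 decoder states, each of shape (16,)
```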

[Figure: the encoder-decoder model, translating the sentence “she is eating a green apple” to Chinese; the visualization of both the encoder and decoder is unrolled in time]

A critical and apparent disadvantage of this fixed-length context vector design is the inability to remember long sentences. Often the model has forgotten the first part once it finishes processing the whole input. The attention mechanism was born (Bahdanau et al., 2015) to resolve this problem.

Born for Translation

The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT). Rather than building a single context vector out of the encoder’s last hidden state, the secret sauce invented by attention is to create shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element.
While the context vector has access to the entire input sequence, we don’t need to worry about forgetting. The alignment between the source and target is learned and controlled by the context vector. Essentially the context vector consumes three pieces of information:

  • encoder hidden states;
  • decoder hidden states;
  • alignment between source and target.
[Figure: the encoder-decoder model with additive attention mechanism (Bahdanau et al., 2015)]

Definition
Now let’s define the attention mechanism introduced in NMT in a scientific way. Say we have a source sequence $\mathbf{x}$ of length $n$ and try to output a target sequence $\mathbf{y}$ of length $m$:

$$\mathbf{x} = [x_1, x_2, \dots, x_n], \qquad \mathbf{y} = [y_1, y_2, \dots, y_m]$$
The encoder is a bidirectional RNN (or another recurrent network setting of your choice) with a forward hidden state $\overrightarrow{h_i}$ and a backward one $\overleftarrow{h_i}$. A simple concatenation of the two represents the encoder state. The motivation is to include both the preceding and following words in the annotation of one word:

$$h_i = \big[\overrightarrow{h_i}^\top; \overleftarrow{h_i}^\top\big]^\top, \quad i = 1, \dots, n$$
The decoder network has hidden state $s_t = f(s_{t-1}, y_{t-1}, c_t)$ for the output word at position $t$, $t = 1, \dots, m$, where the context vector $c_t$ is a sum of the hidden states of the input sequence, weighted by alignment scores:

$$c_t = \sum_{i=1}^{n} a_{t,i}\, h_i, \qquad a_{t,i} = \operatorname{align}(y_t, x_i) = \operatorname{softmax}\big(\operatorname{score}(s_{t-1}, h_i)\big)$$
The alignment model assigns a score $a_{t,i}$ to the pair of input at position $i$ and output at position $t$, $(y_t, x_i)$, based on how well they match. The set of $\{a_{t,i}\}$ are weights defining how much of each source hidden state should be considered for each output. In Bahdanau’s paper, the alignment score is parametrized by a feed-forward network with a single hidden layer, and this network is jointly trained with other parts of the model. The score function is therefore in the following form, given that tanh is used as the non-linear activation function:

$$\operatorname{score}(s_t, h_i) = v_a^\top \tanh\big(W_a [s_t; h_i]\big)$$
where $v_a$ and $W_a$ are weight matrices to be learned in the alignment model.
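A small NumPy sketch of one decoding step of this additive attention may help; it just evaluates the formulas above with random, untrained parameters (the shapes and variable names are my own choices).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 16                             # source length and hidden size
H = rng.normal(size=(n, d))              # encoder hidden states h_1 ... h_n
s_prev = rng.normal(size=d)              # previous decoder state s_{t-1}

# Parameters of the single-hidden-layer alignment network (randomly set here).
W_a = rng.normal(size=(d, 2 * d)) * 0.1
v_a = rng.normal(size=d)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# score(s_{t-1}, h_i) = v_a^T tanh(W_a [s_{t-1}; h_i])
scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, h_i])) for h_i in H])
alpha = softmax(scores)                  # alignment weights a_{t,i}
c_t = alpha @ H                          # context vector: weighted sum of encoder states

print(alpha.round(2), c_t.shape)         # weights sum to 1; c_t has shape (16,)
```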

The matrix of alignment scores is a nice byproduct to explicitly show the correlation between source and target words.

[Figure: alignment matrix of “L’accord sur l’Espace économique européen a été signé en août 1992” (French) and its English translation “The agreement on the European Economic Area was signed in August 1992”]

A Family of Attention Mechanisms

With the help of the attention, the dependencies between source and target sequences are not restricted by the in-between distance anymore! Given the big improvement by attention in machine translation, it soon got extended into the computer vision field (Xu et al. 2015) and people started exploring various other forms of attention mechanisms (Luong, et al., 2015; Britz et al., 2017; Vaswani, et al., 2017).

Summary
Below is a summary of several popular attention mechanisms and their corresponding alignment score functions.
[Table: popular attention mechanisms and their corresponding alignment score functions]
(^) It adds a scaling factor $1/\sqrt{n}$, motivated by the concern that when the input is large, the softmax function may have an extremely small gradient, making efficient learning difficult.
Below is a summary of broader categories of attention mechanisms.
[Table: broader categories of attention mechanisms]
Self-attention: relates different positions of the same input sequence. Theoretically, self-attention can adopt any of the score functions above, simply replacing the target sequence with the same input sequence.
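Since the table itself is an image, here is a NumPy sketch of a few score functions that commonly appear in such summaries (dot-product, scaled dot-product, general/bilinear, and additive); the exact set and notation in the original table may differ, and all parameters below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
s, h = rng.normal(size=d), rng.normal(size=d)          # a decoder state and an encoder state
W = rng.normal(size=(d, d)) * 0.1                      # bilinear ("general") weight
W_a, v_a = rng.normal(size=(d, 2 * d)) * 0.1, rng.normal(size=d)

def dot_score(s, h):
    return s @ h                                        # plain dot-product

def scaled_dot_score(s, h):
    return (s @ h) / np.sqrt(d)                         # dot-product scaled by sqrt(dimension)

def general_score(s, h):
    return s @ W @ h                                    # bilinear ("general") score

def additive_score(s, h):
    return v_a @ np.tanh(W_a @ np.concatenate([s, h]))  # additive (Bahdanau-style) score

for fn in (dot_score, scaled_dot_score, general_score, additive_score):
    print(fn.__name__, float(fn(s, h)))
```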

Self-Attention

Self-attention, also known as intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence. It has been shown to be very useful in machine reading, abstractive summarization, or image description generation.
The long short-term memory network paper used self-attention to do machine reading. In the example below, the self-attention mechanism enables us to learn the correlation between the current words and the previous part of the sentence.

[Figure: the current word is in red; the size of the blue shade indicates the activation level with respect to it]

Soft vs Hard Attention

In the “show, attend and tell” paper, the attention mechanism is applied to images to generate captions. The image is first encoded by a CNN to extract features. Then an LSTM decoder consumes the convolutional features to produce descriptive words one by one, where the weights are learned through attention. The visualization of the attention weights clearly demonstrates which regions of the image the model is paying attention to in order to output a certain word.

(Example: “A woman is throwing a frisbee in a park.”)

This paper first proposed the distinction between “soft” vs “hard” attention, based on whether the attention has access to the entire image or only a patch:

  • Soft Attention: the alignment weights are learned and placed “softly” over all patches in the source image; essentially the same type of attention as in Bahdanau et al., 2015.
    • Pro: the model is smooth and differentiable.
    • Con: expensive when the source input is large.
  • Hard Attention: only selects one patch of the image to attend to at a time.
    • Pro: less calculation at inference time.
    • Con: the model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train.

Global vs Local Attention
Luong, et al., 2015 proposed the “global” and “local” attention. The global attention is similar to the soft attention, while the local one is an interesting blend between hard and soft, an improvement over the hard attention to make it differentiable: the model first predicts a single aligned position for the current target word and a window centered around the source position is then used to compute a context vector.

[Figure: global vs local attention]

Neural Turing Machines

Alan Turing in 1936 proposed a minimalistic model of computation. It is composed of an infinitely long tape and a head that interacts with the tape. The tape has countless cells on it, each filled with a symbol: 0, 1, or blank (" "). The operation head can read symbols, edit symbols, and move left/right on the tape. Theoretically a Turing machine can simulate any computer algorithm, irrespective of how complex or expensive the procedure might be. The infinite memory gives a Turing machine an edge to be mathematically limitless. However, infinite memory is not feasible in real modern computers, so we only consider the Turing machine as a mathematical model of computation.
[Figure: a Turing machine, composed of an infinitely long tape and an operation head]
Neural Turing Machine (NTM, Graves, Wayne & Danihelka, 2014) is a model architecture for coupling a neural network with external memory storage. The memory mimics the Turing machine tape and the neural network controls the operation heads to read from or write to the tape. However, the memory in NTM is finite, and thus it probably looks more like a “Neural von Neumann Machine”.
NTM contains two major components, a controller neural network and a memory bank. The controller is in charge of executing operations on the memory. It can be any type of neural network, feed-forward or recurrent. The memory stores processed information. It is a matrix of size $N \times M$, containing $N$ vector rows, each of $M$ dimensions.
In one update iteration, the controller processes the input and interacts with the memory bank accordingly to generate output. The interaction is handled by a set of parallel read and write heads. Both read and write operations are “blurry” by softly attending to all the memory addresses.

[Figure: Neural Turing Machine architecture]

Reading and Writing

When reading from the memory at time $t$, an attention vector of size $N$, $w_t$, controls how much attention to assign to different memory locations (matrix rows). The read vector $r_t$ is a sum weighted by attention intensity:
$$r_t = \sum_{i=1}^{N} w_t(i)\, M_t(i), \qquad \text{where } \sum_{i=1}^{N} w_t(i) = 1$$
where $w_t(i)$ is the $i$-th element in $w_t$ and $M_t(i)$ is the $i$-th row vector in the memory.
When writing into the memory at time $t$, as inspired by the input and forget gates in LSTM, a write head first wipes off some old content according to an erase vector $e_t$ and then adds new information by an add vector $a_t$:
$$\tilde{M}_t(i) = M_{t-1}(i)\,\big[\mathbf{1} - w_t(i)\, e_t\big] \qquad \text{(erase)}$$
$$M_t(i) = \tilde{M}_t(i) + w_t(i)\, a_t \qquad \text{(add)}$$
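A NumPy sketch of the blurry read and the erase/add write described above; it assumes the attention vector $w_t$ is already given and uses arbitrary values for everything else.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 8, 5                                # memory: N rows of M dimensions
memory = rng.normal(size=(N, M))

w_t = np.full(N, 1.0 / N)                  # attention over memory rows (sums to 1)

# Blurry read: attention-weighted sum of memory rows.
r_t = w_t @ memory                         # shape (M,)

# Blurry write: erase old content, then add new content, both gated by w_t.
e_t = rng.uniform(size=M)                  # erase vector, entries in [0, 1]
a_t = rng.normal(size=M)                   # add vector
memory = memory * (1 - np.outer(w_t, e_t)) # erase step
memory = memory + np.outer(w_t, a_t)       # add step

print(r_t.shape, memory.shape)             # (5,) (8, 5)
```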

Attention Mechanisms

In the Neural Turing Machine, how to generate the attention distribution $w_t$ depends on the addressing mechanisms: NTM uses a mixture of content-based and location-based addressing.
Content-based addressing
Content-based addressing creates attention vectors based on the similarity between the key vector $k_t$, extracted by the controller from the input, and the memory rows. The content-based attention scores are computed as cosine similarity and then normalized by softmax. In addition, NTM adds a strength multiplier $\beta_t$ to amplify or attenuate the focus of the distribution:
$$w_t^c(i) = \operatorname{softmax}\big(\beta_t \cdot \operatorname{cosine}[k_t, M_t(i)]\big) = \frac{\exp\big(\beta_t \cdot \operatorname{cosine}[k_t, M_t(i)]\big)}{\sum_{j=1}^{N} \exp\big(\beta_t \cdot \operatorname{cosine}[k_t, M_t(j)]\big)}$$
Interpolation
Then an interpolation gate scalar $g_t$ is used to blend the newly generated content-based attention vector with the attention weights from the last time step:
$$w_t^g = g_t\, w_t^c + (1 - g_t)\, w_{t-1}$$

Location-based addressing
Location-based addressing sums up the values at different positions in the attention vector, weighted by a distribution over allowable integer shifts. It is equivalent to a 1-d convolution with a kernel $s_t(\cdot)$, a function of the position offset. There are multiple ways to define this distribution; see Fig. 11 for inspiration.

[Fig. 11: two ways to represent the shift weighting distribution $s_t$]

Finally, the attention distribution after the shift convolution, $\tilde{w}_t$, is enhanced by a sharpening scalar $\gamma_t \ge 1$:
$$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_{j=1}^{N} \tilde{w}_t(j)^{\gamma_t}}$$
The complete process of generating the attention vector $w_t$ at time step $t$ is illustrated in Fig. 12. All the parameters produced by the controller are unique for each head. If there are multiple read and write heads in parallel, the controller would output multiple sets.
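Putting the steps together, here is a NumPy sketch of the whole addressing pipeline (content similarity scaled by $\beta_t$, interpolation with gate $g_t$, circular convolution with the shift weighting $s_t$, and sharpening with $\gamma_t$). The controller outputs are hard-coded here instead of being produced by a network, and the shift range {-1, 0, +1} is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 8, 5
memory = rng.normal(size=(N, M))
w_prev = np.full(N, 1.0 / N)               # attention from the previous time step

# Controller outputs for one head (hard-coded for the sketch).
k_t = rng.normal(size=M)                   # key vector
beta_t = 2.0                               # key strength
g_t = 0.9                                  # interpolation gate
s_t = np.array([0.1, 0.8, 0.1])            # shift weighting over offsets {-1, 0, +1}
gamma_t = 2.0                              # sharpening scalar (>= 1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# 1) Content-based addressing: cosine similarity scaled by beta, then softmax.
cos = memory @ k_t / (np.linalg.norm(memory, axis=1) * np.linalg.norm(k_t) + 1e-8)
w_c = softmax(beta_t * cos)

# 2) Interpolation with the previous attention vector.
w_g = g_t * w_c + (1 - g_t) * w_prev

# 3) Location-based addressing: circular convolution with the shift distribution.
w_tilde = np.zeros(N)
for i in range(N):
    for offset, weight in zip((-1, 0, 1), s_t):
        w_tilde[i] += weight * w_g[(i - offset) % N]

# 4) Sharpening.
w_t = w_tilde ** gamma_t
w_t /= w_t.sum()

print(w_t.round(3), w_t.sum())             # final attention over memory rows, sums to 1
```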

[Fig. 12: flow diagram of the addressing mechanisms in Neural Turing Machine]

Pointer Network

In problems like sorting or the travelling salesman problem, both the input and the output are sequential data. Unfortunately, they cannot be easily solved by classic seq2seq or NMT models, given that the discrete categories of output elements are not determined in advance but depend on the variable input size. The Pointer Net (Ptr-Net; Vinyals, et al. 2015) is proposed to resolve this type of problem, where the output elements correspond to positions in an input sequence. Rather than using attention to blend hidden units of an encoder into a context vector (See Fig. 8), the Pointer Net applies attention over the input elements to pick one as the output at each decoder step.

[Figure: the architecture of a Pointer Network model]

The Ptr-Net outputs a sequence of integer indices $c = (c_1, \dots, c_m)$ given a sequence of input vectors $x = (x_1, x_2, \dots, x_n)$, where $1 \le c_i \le n$. The model still embraces an encoder-decoder framework. The encoder and decoder hidden states are denoted as $(h_1, h_2, \dots, h_n)$ and $(s_1, s_2, \dots, s_m)$, respectively. Note that $s_i$ is the output gate after cell activation in the decoder. The Ptr-Net applies additive attention between states and then normalizes it by softmax to model the output conditional probability:
$$u_j^i = v^\top \tanh\big(W_1 h_j + W_2 s_i\big), \quad j = 1, \dots, n$$
$$p(c_i \mid c_1, \dots, c_{i-1}, x) = \operatorname{softmax}(u^i)$$
The attention mechanism is simplified, as Ptr-Net does not blend the encoder states into the output with attention weights. In this way, the output only responds to the positions but not the input content.
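A NumPy sketch of a single pointer step under this formulation: additive attention scores over the encoder states are normalized by a softmax and directly form the distribution over input positions, without blending the encoder states into a context vector. All parameters are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 7, 16
H = rng.normal(size=(n, d))                # encoder states h_1 ... h_n
s_i = rng.normal(size=d)                   # current decoder state
W1, W2 = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
v = rng.normal(size=d)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Additive attention scores u_j = v^T tanh(W1 h_j + W2 s_i) for every input position j.
u = np.array([v @ np.tanh(W1 @ h_j + W2 @ s_i) for h_j in H])
p = softmax(u)                             # p(c_i | c_1..c_{i-1}, x): a distribution over positions

print(p.round(3), int(p.argmax()))         # the pointed-to input position
```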

Transformer

“Attention is All you Need” (Vaswani, et al., 2017) is, without a doubt, one of the most impactful and interesting papers of 2017. It presented a lot of improvements to soft attention and made it possible to do seq2seq modeling without recurrent network units. The proposed “transformer” model is entirely built on self-attention mechanisms without using a sequence-aligned recurrent architecture.

Key, Value and Query

The major component in the transformer is the unit of multi-head self-attention mechanism. The transformer views the encoded representation of the input as a set of key-value pairs, $(K, V)$, both of dimension $n$ (the input sequence length); in the context of NMT, both the keys and values are the encoder hidden states. In the decoder, the previous output is compressed into a query ($Q$, of dimension $m$), and the next output is produced by mapping this query and the set of keys and values.
The transformer adopts the scaled dot-product attention: the output is a weighted sum of the values, where the weight assigned to each value is determined by the dot-product of the query with all the keys:
$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{n}}\right) V$$
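A NumPy sketch of scaled dot-product attention; the shapes follow the notation above, and the helper below is my own minimal implementation rather than any particular library's API.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (m, d), K: (n, d), V: (n, d_v) -> output: (m, d_v)."""
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                            # (m, n): each query against every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(6)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)                                 # (4, 8) (4, 6)
```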
Rather than only computing the attention once, the multi-head mechanism runs through the scaled dot-product attention multiple times in parallel. The independent attention outputs are simply concatenated and linearly transformed into the expected dimensions. I assume the motivation is because ensembling always helps? According to the paper, “multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this."
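A rough NumPy sketch of the multi-head idea: project $Q$, $K$, $V$ into several lower-dimensional subspaces, run scaled dot-product attention in each head in parallel, concatenate the results, and apply a final linear map. The projection matrices and sizes below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(7)
d_model, num_heads = 16, 4
d_head = d_model // num_heads

def attention(Q, K, V):
    # Scaled dot-product attention with a row-wise softmax.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Random per-head projections plus the output projection (learned in a real model).
W_q = rng.normal(size=(num_heads, d_model, d_head)) * 0.1
W_k = rng.normal(size=(num_heads, d_model, d_head)) * 0.1
W_v = rng.normal(size=(num_heads, d_model, d_head)) * 0.1
W_o = rng.normal(size=(d_model, d_model)) * 0.1

def multi_head_attention(Q, K, V):
    heads = [attention(Q @ W_q[h], K @ W_k[h], V @ W_v[h]) for h in range(num_heads)]
    return np.concatenate(heads, axis=-1) @ W_o    # concatenate heads, then linear transform

X = rng.normal(size=(5, d_model))                  # self-attention: Q = K = V = the input sequence
print(multi_head_attention(X, X, X).shape)         # (5, 16)
```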
The encoder generates an attention-based representation with the capability to locate a specific piece of information from a potentially infinitely large context.

  • A stack of N = 6 identical layers.
  • Each layer has a multi-head self-attention layer and a simple position-wise fully connected feed-forward network.
  • Each sub-layer adopts a residual connection and layer normalization. All the sub-layers output data of the same dimension $d_{model} = 512$.
The decoder is able to retrieve from the encoded representation.
  • A stack of N = 6 identical layers.
  • Each layer has two sub-layers of multi-head attention mechanisms and one sub-layer of a fully connected feed-forward network.
  • Similar to the encoder, each sub-layer adopts a residual connection and layer normalization.
  • The first multi-head attention sub-layer is modified to prevent positions from attending to subsequent positions, as we don't want to look into the future of the target sequence when predicting the current position.

Full Architecture

Finally here is the complete view of the transformer’s architecture:

  • Both the source and target sequences first go through embedding layers to produce data of the same dimension $d_{model} = 512$.
  • To preserve position information, a sinusoid-wave-based positional encoding is applied and summed with the embedding output (see the sketch after this list).
  • A softmax and linear layer are added to the final decoder output.
[Figure: the full model architecture of the transformer]
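A NumPy sketch of the sinusoid-based positional encoding mentioned in the list above, following the interleaved sine/cosine layout of the original paper; the helper name is my own.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(max_len)[:, None]                      # (max_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

embeddings = np.random.randn(10, 512) * 0.1                # 10 embedded tokens of dimension 512
inputs = embeddings + sinusoidal_positional_encoding(10)   # summed with the embedding output
print(inputs.shape)                                        # (10, 512)
```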

SNAIL

The transformer has no recurrent or convolutional structure; even with the positional encoding added to the embedding vector, the sequential order is only weakly incorporated. For problems sensitive to positional dependency, like reinforcement learning, this can be a big problem.
The Simple Neural Attention Meta-Learner (SNAIL) (Mishra et al., 2017) was developed partially to resolve the problem with positioning in the transformer model by combining the self-attention mechanism in transformer with temporal convolutions. It has been demonstrated to be good at both supervised learning and reinforcement learning tasks.
[Figure: SNAIL model architecture]
SNAIL was born in the field of meta-learning, which is another big topic worthy of a post by itself. But in simple words, the meta-learning model is expected to be generalizable to novel, unseen tasks in the similar distribution. Read this nice introduction if interested.

Self-Attention GAN

The Self-Attention GAN (SAGAN; Zhang et al., 2018) adds self-attention layers into GAN, enabling both the generator and the discriminator to better model relationships between spatial regions.
The classic DCGAN (Deep Convolutional GAN) represents both the discriminator and the generator as multi-layer convolutional networks. However, the representation capacity of the network is restrained by the filter size, as the feature of one pixel is limited to a small local region. In order to connect regions far apart, the features have to be diluted through layers of convolutional operations, and the dependencies are not guaranteed to be maintained.
As the (soft) self-attention in the vision context is designed to explicitly learn the relationship between one pixel and all other positions, even regions far apart, it can easily capture global dependencies. Hence GAN equipped with self-attention is expected to handle details better, hooray!
SAGAN adopts the non-local neural network to apply the attention computation. The convolutional image feature map $x$ is branched out into three copies, corresponding to the concepts of key, value, and query in the transformer:
$$f(x) = W_f\, x, \qquad g(x) = W_g\, x, \qquad h(x) = W_h\, x$$
Then we apply the dot-product attention to output the self-attention feature maps:
$$a_{i,j} = \operatorname{softmax}\big(f(x_i)^\top g(x_j)\big), \qquad o_j = \sum_{i=1}^{N} a_{i,j}\, h(x_i)$$
Note that $a_{i,j}$ is one entry in the attention map, indicating how much attention the model should pay to the $i$-th position when synthesizing the $j$-th location. $W_f$, $W_g$, and $W_h$ are all 1x1 convolution filters. If you feel that a 1x1 convolution sounds like a weird concept (i.e., isn't it just multiplying the whole feature map by one number?), watch this short tutorial by Andrew Ng. The output $o_j$ is a column vector of the final output $o = (o_1, o_2, \dots, o_j, \dots, o_N)$.
Furthermore, the output of the attention layer is multiplied by a scale parameter and added back to the original input feature map:
$$y_i = \gamma\, o_i + x_i$$

While the scaling parameter $\gamma$ is increased gradually from 0 during the training, the network is configured to first rely on the cues in the local regions and then gradually learn to assign more weight to the regions that are farther away.
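A NumPy sketch of this self-attention layer, with the 1x1 convolutions written as plain matrix multiplications over flattened spatial positions; the channel sizes and weights are placeholder choices, not those of the actual SAGAN implementation.

```python
import numpy as np

rng = np.random.default_rng(8)
C, H, W = 8, 4, 4                          # channels and spatial size of the feature map
N = H * W                                  # number of spatial positions
x = rng.normal(size=(C, N))                # feature map x, flattened over space

# 1x1 convolutions are per-position linear maps, i.e. plain matrix multiplications.
C_bar = C // 2
W_f = rng.normal(size=(C_bar, C)) * 0.1    # f(x) = W_f x
W_g = rng.normal(size=(C_bar, C)) * 0.1    # g(x) = W_g x
W_h = rng.normal(size=(C, C)) * 0.1        # h(x) = W_h x

f, g, h = W_f @ x, W_g @ x, W_h @ x        # (C_bar, N), (C_bar, N), (C, N)

# a[i, j]: how much position i matters when synthesizing position j (softmax over i).
logits = f.T @ g                           # (N, N), entry (i, j) = f(x_i)^T g(x_j)
a = np.exp(logits - logits.max(axis=0, keepdims=True))
a /= a.sum(axis=0, keepdims=True)

o = h @ a                                  # o_j = sum_i a[i, j] * h(x_i), shape (C, N)

gamma = 0.0                                # learnable scale, initialized at 0 during training
y = gamma * o + x                          # the output starts from the local features alone
print(y.shape)                             # (8, 16)
```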

References

[1] “Attention and Memory in Deep Learning and NLP.” Jan 3, 2016, by Denny Britz.

[2] “Neural Machine Translation (seq2seq) Tutorial.”

[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” ICLR 2015.

[4] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. “Show, attend and tell: Neural image caption generation with visual attention.” ICML 2015.

[5] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” NIPS 2014.

[6] Thang Luong, Hieu Pham, and Christopher D. Manning. “Effective Approaches to Attention-based Neural Machine Translation.” EMNLP 2015.

[7] Denny Britz, Anna Goldie, Thang Luong, and Quoc Le. “Massive exploration of neural machine translation architectures.” ACL 2017.

[8] Ashish Vaswani, et al. “Attention is all you need.” NIPS 2017.

[9] Jianpeng Cheng, Li Dong, and Mirella Lapata. “Long short-term memory-networks for machine reading.” EMNLP 2016.

[10] Xiaolong Wang, et al. “Non-local Neural Networks.” CVPR 2018.

[11] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. “Self-Attention Generative Adversarial Networks.” arXiv preprint arXiv:1805.08318 (2018).

[12] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. “A simple neural attentive meta-learner.” ICLR 2018.

[13] “WaveNet: A Generative Model for Raw Audio.” Sep 8, 2016, by DeepMind.

[14] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. “Pointer networks.” NIPS 2015.

[15] Alex Graves, Greg Wayne, and Ivo Danihelka. “Neural turing machines.” arXiv preprint arXiv:1410.5401 (2014).
