DeepLearning.ai Code Notes 5: Sequence Models

The Attention Mechanism Model

Model: it is divided into an Encoder layer, an Attention layer, and a Decoder layer.

The Decoder (post-attention) LSTM's previous hidden state $s^{\langle t-1 \rangle}$ is copied $T_x$ times and concatenated with all $T_x$ activations $a$ of the Encoder (one per input time step); this concatenation is the input to the Attention layer. The Attention layer computes $T_x$ attention weights $\alpha$, one per input word, multiplies each weight by the corresponding activation ($\alpha \cdot a$), and sums these $T_x$ products into a context vector, which is then fed into a single time step of the Decoder layer for the subsequent computation.

The main idea is that, for each output time step, the activations of the different input time steps are each multiplied by their own per-word attention weight and then combined.
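To make this weighted sum concrete, here is a minimal NumPy sketch of the idea; the sizes `Tx`, `n_a` and all variable names are made up for illustration and are not taken from the notebook.

```python
import numpy as np

Tx, n_a = 30, 64                               # hypothetical: input length, size of one activation
a = np.random.randn(Tx, n_a)                   # Encoder activations a<1>..a<Tx>
e = np.random.randn(Tx)                        # unnormalized "energies" for one output step t

alphas = np.exp(e) / np.exp(e).sum()           # softmax -> Tx attention weights summing to 1
context = (alphas[:, None] * a).sum(axis=0)    # weighted sum of the activations

print(alphas.sum(), context.shape)             # ~1.0, (64,)
```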

(figure: the full attention model — Encoder, Attention and Decoder layers)

The figure below shows the Attention-to-context part of the figure above, i.e. the implementation of the attention mechanism:

(figure: the attention mechanism — from Attention to context)

Process:

  • There are two separate LSTMs in this model (see diagram on the left). Because the one at the bottom of the picture is a Bi-directional LSTM and comes before the attention mechanism, we will call it the pre-attention Bi-LSTM. The LSTM at the top of the diagram comes after the attention mechanism, so we will call it the post-attention LSTM. The pre-attention Bi-LSTM goes through $T_x$ time steps; the post-attention LSTM goes through $T_y$ time steps.

    Notes: there are two LSTM layers in the model; they differ in whether they come before or after the attention mechanism and in which time steps they are connected to. The bottom one is called the pre-attention Bi-LSTM and the top one the post-attention LSTM; these correspond to the Encoder and Decoder parts of a Seq2Seq model. Bi-LSTM means bidirectional LSTM. The pre-attention Bi-LSTM is connected to the $T_x$ input time steps, and the post-attention LSTM to the $T_y$ output time steps.

  • The post-attention LSTM passes $s^{\langle t \rangle}$, $c^{\langle t \rangle}$ from one time step to the next. In the lecture videos, we were using only a basic RNN for the post-attention sequence model, so the state captured by the RNN was the output activation $s^{\langle t \rangle}$. But since we are using an LSTM here, the LSTM has both the output activation $s^{\langle t \rangle}$ and the hidden cell state $c^{\langle t \rangle}$. However, unlike previous text generation examples (such as Dinosaurus in week 1), in this model the post-attention LSTM at time $t$ will not take the specific generated $y^{\langle t-1 \rangle}$ as input; it only takes $s^{\langle t \rangle}$ and $c^{\langle t \rangle}$ as input. We have designed the model this way because (unlike language generation, where adjacent characters are highly correlated) there isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.

    Notes: $s$ is the hidden-state activation of the post-attention LSTM and $c$ is its memory cell. The post-attention LSTM does not take the previously generated $y^{\langle t-1 \rangle}$ as input, because in this example adjacent output characters (the digits of a date) are not strongly correlated. The initial values of $s$ and $c$ are set as for an ordinary LSTM (usually zeros), while the per-step input comes from the Attention layer's computation.

  • We use $a^{\langle t \rangle} = [\overrightarrow{a}^{\langle t \rangle}; \overleftarrow{a}^{\langle t \rangle}]$ to represent the concatenation of the activations of both the forward-direction and backward-direction passes of the pre-attention Bi-LSTM.

    Notes: this explains the bidirectional RNN notation: each $a^{\langle t \rangle}$ is the concatenation of the forward and backward activations at input time step $t$.

  • The diagram on the right uses a RepeatVector node to copy $s^{\langle t-1 \rangle}$'s value $T_x$ times, and then Concatenation to concatenate $s^{\langle t-1 \rangle}$ and $a^{\langle t \rangle}$ to compute $e^{\langle t,t' \rangle}$, which is then passed through a softmax to compute $\alpha^{\langle t,t' \rangle}$. We'll explain how to use RepeatVector and Concatenation in Keras below.

    Notes: the $s^{\langle t-1 \rangle}$ used here is the previous hidden state of the post-attention (Decoder) LSTM. It is copied $T_x$ times so that it can be concatenated with the Encoder's $T_x$ outputs (activations) $a$, similar to forming an augmented matrix, and the concatenated tensor is the input to the Attention layer; a Keras sketch of this step is given right after this list.
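As a concrete illustration of the repeat-and-concatenate step, here is a minimal Keras sketch. It assumes TensorFlow 2.x; the sizes `Tx`, `n_a`, `n_s` and the variable names are made-up placeholders, not values from the notebook.

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, RepeatVector, Concatenate

Tx, n_a, n_s = 30, 32, 64                     # hypothetical sizes (input length, Bi-LSTM units, decoder units)

a = Input(shape=(Tx, 2 * n_a))                # all pre-attention Bi-LSTM activations a<1>..a<Tx>
s_prev = Input(shape=(n_s,))                  # previous post-attention LSTM hidden state s<t-1>

s_rep = RepeatVector(Tx)(s_prev)              # copy s<t-1> Tx times -> (batch, Tx, n_s)
concat = Concatenate(axis=-1)([a, s_rep])     # concatenate with the activations -> (batch, Tx, 2*n_a + n_s)

print(concat.shape)                           # (None, 30, 128)
```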

Model implementation: implement two functions, one_step_attention() and model().

  • one_step_attention(): At step $t$, given all the hidden states of the Bi-LSTM ($[a^{\langle 1 \rangle}, a^{\langle 2 \rangle}, ..., a^{\langle T_x \rangle}]$) and the previous hidden state of the second LSTM ($s^{\langle t-1 \rangle}$), one_step_attention() will compute the attention weights ($[\alpha^{\langle t,1 \rangle}, \alpha^{\langle t,2 \rangle}, ..., \alpha^{\langle t,T_x \rangle}]$) and output the context vector (see Figure 1 (right) for details):

    $$context^{\langle t \rangle} = \sum_{t'=1}^{T_x} \alpha^{\langle t,t' \rangle} a^{\langle t' \rangle} \tag{1}$$

    Note that we are denoting the attention context in this notebook as $context^{\langle t \rangle}$. In the lecture videos, the context was denoted $c^{\langle t \rangle}$, but here we are calling it $context^{\langle t \rangle}$ to avoid confusion with the (post-attention) LSTM's internal memory cell variable, which is sometimes also denoted $c^{\langle t \rangle}$.

    Notes: at each output time step $t$, the Decoder's previous hidden state $s^{\langle t-1 \rangle}$ ($t-1$ rather than $t$ because it is the state left over from the previous output step) is copied $T_x$ times and concatenated with all of the Encoder's outputs as the input to the Attention layer. The computation yields $T_x$ weights $\alpha$; each $\alpha$ is multiplied by the output of the corresponding Encoder time step, and the products are summed to give the final context, which is the input to the Decoder layer at this step. A Keras sketch of one_step_attention() is given after this list.

    (Figure 1, right: one step of the attention mechanism)

  • model(): Implements the entire model. It first runs the input through a Bi-LSTM to get back $[a^{\langle 1 \rangle}, a^{\langle 2 \rangle}, ..., a^{\langle T_x \rangle}]$. Then, it calls one_step_attention() $T_y$ times (in a for loop). At each iteration of this loop, it gives the computed context vector $c^{\langle t \rangle}$ to the second LSTM, and runs the output of the LSTM through a dense layer with softmax activation to generate a prediction $\hat{y}^{\langle t \rangle}$.

    Notes: the model function first computes the Encoder layer, then uses its cached outputs for the Decoder computation, which runs for $T_y$ time steps to produce the final outputs. At every output time step the Attention layer is called once, and each call uses the cached Encoder outputs from all $T_x$ time steps. Sketches of both functions are given below.
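Here is a sketch of what one_step_attention() could look like in Keras, using the same hypothetical sizes as the earlier sketch. The small Dense layers (10 tanh units, then 1 relu unit) are an assumption for illustration, and a Softmax(axis=1) layer is used in place of a custom softmax over the time axis; the shared layers are defined once so the same weights are reused at every output step.

```python
from tensorflow.keras.layers import RepeatVector, Concatenate, Dense, Softmax, Dot

Tx, n_a, n_s = 30, 32, 64                 # hypothetical sizes, as above

# Shared layers: defined once at module level so every call reuses the same weights
repeator = RepeatVector(Tx)               # copies s<t-1> Tx times
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation="tanh")    # small "energy" network (sizes are assumptions)
densor2 = Dense(1, activation="relu")
activator = Softmax(axis=1)               # softmax over the Tx axis -> attention weights
dotor = Dot(axes=1)                       # weighted sum over the Tx axis

def one_step_attention(a, s_prev):
    """a: (batch, Tx, 2*n_a) Bi-LSTM activations; s_prev: (batch, n_s) previous decoder state.
    Returns the context vector of shape (batch, 1, 2*n_a)."""
    s_prev = repeator(s_prev)             # (batch, Tx, n_s)
    concat = concatenator([a, s_prev])    # (batch, Tx, 2*n_a + n_s)
    e = densor1(concat)                   # (batch, Tx, 10)
    energies = densor2(e)                 # (batch, Tx, 1)
    alphas = activator(energies)          # (batch, Tx, 1), weights sum to 1 over the Tx axis
    context = dotor([alphas, a])          # (batch, 1, 2*n_a) = sum_t' alpha<t,t'> * a<t'>
    return context
```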
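And a corresponding sketch of model(), reusing one_step_attention() and the sizes `Tx`, `n_a`, `n_s` from the block above. The output length and vocabulary sizes are again placeholders, and zero vectors are assumed for the initial decoder state.

```python
from tensorflow.keras.layers import Input, LSTM, Bidirectional, Dense
from tensorflow.keras.models import Model

Ty = 10                                                # hypothetical output length
human_vocab_size, machine_vocab_size = 37, 11          # hypothetical vocabulary sizes

# Shared decoder-side layers, defined once so they are reused across the Ty steps
post_attention_LSTM_cell = LSTM(n_s, return_state=True)
output_layer = Dense(machine_vocab_size, activation="softmax")

def model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    X = Input(shape=(Tx, human_vocab_size))            # one-hot encoded input sequence
    s0 = Input(shape=(n_s,), name="s0")                # initial decoder hidden state
    c0 = Input(shape=(n_s,), name="c0")                # initial decoder cell state
    s, c = s0, c0
    outputs = []

    # Encoder: pre-attention Bi-LSTM over the Tx input time steps
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)

    # Decoder: one attention step and one post-attention LSTM step per output time step
    for t in range(Ty):
        context = one_step_attention(a, s)             # (batch, 1, 2*n_a)
        s, _, c = post_attention_LSTM_cell(context, initial_state=[s, c])
        outputs.append(output_layer(s))                # prediction y_hat<t>

    return Model(inputs=[X, s0, c0], outputs=outputs)
```

With these placeholders, the resulting model takes [X, s0, c0] as inputs (s0 and c0 typically all zeros) and produces a list of $T_y$ softmax predictions, which can then be compiled and trained in the usual Keras way.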
