[TensorFlow official documentation on GitHub: link]
[Reference link]
[Reference link]

The main references are two papers, from 2014 and 2015:

Published (arXiv ID)	Paper link
1409.0473	NMT_BahdanauAttention
1508.04025	LuongAttention

TensorFlow (1.x, under tf.contrib) supports both of these Attention mechanisms:
tf.contrib.seq2seq.BahdanauAttention
tf.contrib.seq2seq.LuongAttention

-1- Fixed-length Context Vector

Encoder_Decoder.png

Seq2Seq.png

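Early Encoder-Decoder and Seq2Seq models pass information between the Encoder (one or more layers of RNN/LSTM/GRU) and the Decoder (one or more layers of RNN/LSTM/GRU) through a single fixed-length Context Vector. In effect, the Encoder compresses the Input Sentence into one fixed-length Context Vector, and the Decoder then decodes that compressed vector into the Output Sentence.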

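Compressing the Input Sentence into a fixed-length Context Vector has an unavoidable drawback: for longer, information-rich sentences, squeezing everything into a vector of the same fixed length loses more information, and the model becomes less accurate.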

-2- Introducing the Attention Mechanism

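⚠️A preconception worth planting up front: the Attention mechanism is an Encoder and a Decoder plus an Alignment_Model, also written as the a() function or the score() function. This Alignment_Model can be trained jointly with the Encoder and Decoder through backpropagation.⚠️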

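The attention mechanism reduces information loss by making use of the original sentence. When the Decoder generates y at each time step, it no longer relies only on the Context Vector built from the final hidden state; it also draws on the information for x1, x2, x3 ... In addition, the attention mechanism lets the translator zoom in or out (use local or global information).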

The attention mechanism sounds sophisticated and mysterious, but the whole implementation needs only a few parameters and some simple math.

Seq2Seq_Attention.png

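In the figure above, the blue part on the left is the Encoder and the red part is the Decoder. On top of the traditional Encoder and Decoder, the Attention mechanism computes more context vectors: one context vector for each word_y that the Decoder generates.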

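Each context vector is a weighted sum of the information from every word_x of the Input_Sentence. The weight vector is the attention vector: it expresses how important each word_x of the Input_Sentence is to the word_y being generated at this step of the Output_Sentence. Finally, the context vector is fused with the information of the current y to produce the output.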

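Building each context vector is straightforward:
First:
For a fixed target word_y, compare its target state_y with every Encoder state_x, which yields one score per state_x;
Then:
Normalize these score_x values with softmax, which gives a conditional probability distribution given the target state_y.
Finally:
Take the weighted sum of the source state_x values to obtain the context vector context_vector_y, then fuse it with the target state_y to form the final output.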
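
The three steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the score is a plain dot product, the fusion is a simple concatenation, and the names `attend`, `source_states` and `target_state` are hypothetical, not taken from either paper.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

def attend(target_state, source_states):
    """One attention step.

    target_state:  shape (hidden_dim,), the state_y of the current word_y
    source_states: shape (Tx, hidden_dim), one state_x per source word
    """
    # Step 1: one score per state_x (dot product is an assumption for brevity)
    scores = source_states @ target_state
    # Step 2: normalize the scores into a probability distribution
    weights = softmax(scores)
    # Step 3: weighted sum of the source states gives the context vector
    context = weights @ source_states
    # Fuse context and target state (concatenation is an illustrative choice)
    fused = np.concatenate([context, target_state])
    return weights, context, fused

rng = np.random.default_rng(0)
source_states = rng.normal(size=(5, 4))   # Tx = 5, hidden_dim = 4
target_state = rng.normal(size=4)
weights, context, fused = attend(target_state, source_states)
```

Bahdanau and Luong attention differ mainly in how step 1 computes the score and in how the fused output is built.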

image.png

-3- BahdanauAttention

-3.1- Bahdanau Attention Model Structure

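The Bahdanau Attention model consists of an Encoder, a Decoder and an Alignment Model:
Encoder:
The Encoder is a bidirectional RNN, built from 2 RNN_cells arranged as a pair.
Why does the Encoder use a bidirectional RNN?
For word_x_j of the Input Sentence, the hidden state h_j computed by the bidirectional RNN contains both forward and backward semantic information. Because RNNs are forgetful, h_j mostly represents the few words around word_x_j (preceding_words & following_words).
The sequence of these hidden states h_j is then used, together with the Decoder and the Alignment Model, to compute each Context_Vector_i.
Decoder:
The Decoder is a unidirectional RNN, built from 1 RNN_cell.
Alignment Model:
The Alignment Model is a feed-forward neural network that is trained jointly with the other components of the model. It can be optimized by backpropagating the gradient of the loss function.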
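
To make the bidirectional Encoder concrete, here is a rough NumPy sketch of how each h_j can be formed by concatenating a forward and a backward state. It assumes a plain tanh RNN cell instead of the gated units used in the paper, and all names (`rnn_pass`, `bidirectional_annotations`, `make_params`) are hypothetical.

```python
import numpy as np

def rnn_pass(inputs, W_in, W_hid):
    """Run a plain tanh RNN over inputs of shape (Tx, input_dim)."""
    h = np.zeros(W_hid.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W_in @ x + W_hid @ h)
        states.append(h)
    return np.stack(states)                      # (Tx, hidden_dim)

def bidirectional_annotations(inputs, fwd_params, bwd_params):
    """Concatenate forward and backward states at every position j."""
    forward = rnn_pass(inputs, *fwd_params)
    # Read the sentence right to left, then flip back so row j matches word_x_j
    backward = rnn_pass(inputs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=1)   # (Tx, 2 * hidden_dim)

def make_params(rng, input_dim, hidden_dim):
    """Small random weights for one direction (illustrative initialization)."""
    return (rng.normal(size=(hidden_dim, input_dim)) * 0.1,
            rng.normal(size=(hidden_dim, hidden_dim)) * 0.1)

rng = np.random.default_rng(0)
input_dim, hidden_dim, Tx = 3, 4, 6
fwd_params = make_params(rng, input_dim, hidden_dim)
bwd_params = make_params(rng, input_dim, hidden_dim)
inputs = rng.normal(size=(Tx, input_dim))
H = bidirectional_annotations(inputs, fwd_params, bwd_params)
```

Row j of `H` therefore summarizes both the words before word_x_j (forward half) and the words after it (backward half).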

Bahdanau Attention model structure (Alignment Model not shown).png

-3.2- Trainable Parameter Matrices in the Bahdanau Attention Model

Encoder (bi-RNN) model parameters:
left_input_layer_matrix:
left_hidden_layer_matrix:
left_output_layer_matrix:
right_input_layer_matrix:
right_hidden_layer_matrix:
right_output_layer_matrix:

Decoder (RNN) model parameters:
input_layer_matrix:
hidden_layer_matrix:
output_layer_matrix:

Alignment Model parameters:
Alignment Model a():
The Alignment Model is a feed-forward neural network that is trained jointly with the other components of the model. It can be optimized by backpropagating the gradient of the loss function.
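
For reference, the paper (arXiv 1409.0473) parameterizes a() as a feed-forward network with one hidden layer. The symbols below follow the paper rather than this post: $s_{i-1}$ is the previous Decoder state, $h_j$ is the Encoder hidden state for word_x_j, and $W_a$, $U_a$, $v_a$ are the trainable parameters of the Alignment Model.

$\epsilon_{ij} = a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$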

-3.3- Forward Propagation (predict/calculate_error) of the Bahdanau Attention Model and the Computation of Attention_Prob

-3.3.1- Output Sentence:
$y_1, \dots, y_i, \dots, y_{Ty}$
Each $y_i$ is a vector of dimension output_dim

-3.3.2- Input Sentence:
$x_1, \dots, x_j, \dots, x_{Tx}$
Each $x_j$ is a vector of dimension input_dim

-3.3.3- Probability $\alpha_{ij}$'s associated energy $\epsilon_{ij}$ (intermediate result, not a model parameter):

$\begin{bmatrix} \epsilon_{11} & {\cdots} & \epsilon_{i1} & {\cdots} & \epsilon_{Ty1} \\ \ {\vdots} & { } & {\vdots} & { } & {\vdots} \\ \epsilon_{1j} & {\cdots} & \epsilon_{ij} & {\cdots} & \epsilon_{Tyj} \\ \ {\vdots} & { } & {\vdots} & { } & {\vdots} \\ \epsilon_{1Tx} & {\cdots} & \epsilon_{iTx} & {\cdots} & \epsilon_{TyTx} \\ \end{bmatrix}$
Each element of the matrix E is a scalar

-3.3.4- Probability $\alpha_{ij}$ (intermediate result, not a model parameter):

$\begin{bmatrix} \alpha_{11} & {\cdots} & \alpha_{i1} & {\cdots} & \alpha_{Ty1} \\ \ {\vdots} & { } & {\vdots} & { } & {\vdots} \\ \alpha_{1j} & {\cdots} & \alpha_{ij} & {\cdots} & \alpha_{Tyj} \\ \ {\vdots} & { } & {\vdots} & { } & {\vdots} \\ \alpha_{1Tx} & {\cdots} & \alpha_{iTx} & {\cdots} & \alpha_{TyTx} \\ \end{bmatrix}$
Each element of the matrix A is a scalar
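
For reference, the paper obtains each probability from the energies with a softmax over the source positions. In the layout used here (column index i for the target word, row index j for the source word), this means every column of A sums to 1:

$\alpha_{ij} = \frac{\exp(\epsilon_{ij})}{\sum_{k=1}^{Tx} \exp(\epsilon_{ik})}$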

-3.3.5- Context Vector $c_i$ (intermediate result, not a model parameter):

Each Context Vector $c_i$, one per $y_i$, is a vector of dimension hidden_dim
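
A rough NumPy sketch of one target position i ties the three intermediate results together: the energies, the probabilities, and the context vector $c_i = \sum_j \alpha_{ij} h_j$. It assumes the additive score from the paper, $v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$; the function name `bahdanau_step`, the dimension names and the random inputs are hypothetical.

```python
import numpy as np

def bahdanau_step(s_prev, H, W_a, U_a, v_a):
    """Attention for one target position i.

    s_prev: previous Decoder state s_{i-1}, shape (dec_dim,)
    H:      Encoder hidden states h_1..h_Tx, shape (Tx, enc_dim)
    """
    # Energies: epsilon_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), all j at once
    energies = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a       # (Tx,)
    # Probabilities: softmax over source positions (column i of A)
    exp = np.exp(energies - energies.max())
    alpha = exp / exp.sum()                                    # (Tx,)
    # Context vector: c_i = sum_j alpha_ij * h_j
    context = alpha @ H                                        # (enc_dim,)
    return energies, alpha, context

rng = np.random.default_rng(0)
Tx, enc_dim, dec_dim, att_dim = 5, 8, 4, 6
H = rng.normal(size=(Tx, enc_dim))
s_prev = rng.normal(size=dec_dim)
W_a = rng.normal(size=(att_dim, dec_dim))
U_a = rng.normal(size=(att_dim, enc_dim))
v_a = rng.normal(size=att_dim)
energies, alpha, context = bahdanau_step(s_prev, H, W_a, U_a, v_a)
```

Because $s_{i-1}$ is needed to score position i, the columns of E and A are filled in one at a time as decoding proceeds, not all at once.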

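-3.3.6- Output (I am still not fully clear on this part)
Does BahdanauAttention concatenate the states of the bidirectional RNN and use that as the output? As far as I can tell from the paper, the concatenation happens on the Encoder side only: each h_j is the forward and backward hidden states joined together. The Decoder then updates its state as s_i = f(s_{i-1}, y_{i-1}, c_i) and predicts y_i from g(y_{i-1}, s_i, c_i).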

-4- LuongAttention - Global

-4.1- Luong Attention - Global Model Structure

The Luong Attention model consists of an Encoder, a Decoder and an Alignment Model:

Luong Attention - Global model structure.png

-4.2- Trainable Parameter Matrices in the Luong Attention - Global Model

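Encoder (bi-RNN) model parameters: ...
Decoder (RNN) model parameters: ...
Alignment Model parameters, score() or Wa:
In the Luong Attention - Global model, the alignment model score() is built from the parameter matrix Wa. In the paper's notation, $h_t$ is the current Decoder state and $\bar{h}_s$ is an Encoder state:

$score(h_t, \bar{h}_s) = \begin{cases} h_t^\top \bar{h}_s & dot \\ h_t^\top W_a \bar{h}_s & general \\ v_a^\top \tanh(W_a [h_t; \bar{h}_s]) & concat \end{cases}$

score() has three alternative forms
(Note: the expression above is not a piecewise function over three parameter ranges; exactly one of the three forms is chosen)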
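
The three alternatives from the paper (arXiv 1508.04025), usually called dot, general and concat, can be sketched as three small NumPy functions. The function names and the random test vectors are hypothetical; `W_a` and `v_a` stand for the trainable parameters, and the dot form has none.

```python
import numpy as np

def score_dot(h_t, h_s):
    """dot: h_t^T h_s, no trainable parameters."""
    return h_t @ h_s

def score_general(h_t, h_s, W_a):
    """general: h_t^T W_a h_s, with W_a of shape (dim, dim)."""
    return h_t @ W_a @ h_s

def score_concat(h_t, h_s, W_a, v_a):
    """concat: v_a^T tanh(W_a [h_t; h_s]), with W_a of shape (att_dim, 2 * dim)."""
    return v_a @ np.tanh(W_a @ np.concatenate([h_t, h_s]))

rng = np.random.default_rng(0)
dim, att_dim = 4, 6
h_t = rng.normal(size=dim)      # current Decoder state
h_s = rng.normal(size=dim)      # one Encoder state

s_dot = score_dot(h_t, h_s)
s_general = score_general(h_t, h_s, rng.normal(size=(dim, dim)))
s_concat = score_concat(h_t, h_s,
                        rng.normal(size=(att_dim, 2 * dim)),
                        rng.normal(size=att_dim))
```

A model picks one of these and uses it for every (target, source) pair; with `W_a` set to the identity matrix, the general form reduces to the dot form.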

-4.3- Forward Propagation (predict/calculate_error) of the Luong Attention - Global Model and the Computation of Attention_Prob

-4.3.1- Output Sentence:
$y_1, \dots, y_i, \dots, y_{Ty}$
Each $y_i$ is a vector of dimension output_dim

-4.3.2- Input Sentence:
$x_1, \dots, x_j, \dots, x_{Tx}$
Each $x_j$ is a vector of dimension input_dim

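-4.3.3- Probability $\alpha_{ij}$'s associated energy $\epsilon_{ij}$ (intermediate result, not a model parameter):

Each element of the matrix E (same layout as in -3.3.3-) is a scalar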

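-4.3.4- Probability $\alpha_{ij}$ (intermediate result, not a model parameter):

Each element of the matrix A (same layout as in -3.3.4-) is a scalar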

-4.3.5- Context Vector $c_i$ (intermediate result, not a model parameter):

Each Context Vector $c_i$, one per $y_i$, is a vector of dimension hidden_dim
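
A rough NumPy sketch of one Global attention step, under stated assumptions: it uses the "general" score $h_t^\top W_a \bar{h}_s$ and the attentional state $\tilde{h}_t = \tanh(W_c [c_t; h_t])$ described in the paper (arXiv 1508.04025). The function name `luong_global_step`, the dimensions and the random inputs are hypothetical.

```python
import numpy as np

def luong_global_step(h_t, H_src, W_a, W_c):
    """Global attention for one target step.

    h_t:   current Decoder state, shape (dim,)
    H_src: all Encoder states, shape (Tx, dim)
    """
    # Energies with the "general" score: h_t^T W_a h_s for every source state
    scores = H_src @ W_a.T @ h_t                              # (Tx,)
    # Probabilities over ALL source positions (this is what makes it global)
    exp = np.exp(scores - scores.max())
    a_t = exp / exp.sum()                                      # (Tx,)
    # Context vector: weighted average of the Encoder states
    c_t = a_t @ H_src                                          # (dim,)
    # Attentional hidden state that feeds the output layer
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))        # (dim,)
    return a_t, c_t, h_tilde

rng = np.random.default_rng(0)
Tx, dim = 5, 4
H_src = rng.normal(size=(Tx, dim))
h_t = rng.normal(size=dim)
W_a = rng.normal(size=(dim, dim))
W_c = rng.normal(size=(dim, 2 * dim))
a_t, c_t, h_tilde = luong_global_step(h_t, H_src, W_a, W_c)
```

Unlike the Bahdanau step, the score here uses the current Decoder state $h_t$, not the previous one.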

-5- LuongAttention - Local

-5.1- Luong Attention - Local Model Structure

Luong Attention - Local model structure.png

-5.2- Trainable Parameter Matrices in the Luong Attention - Local Model

Encoder (bi-RNN) model parameters: ...
Decoder (RNN) model parameters: ...
Alignment Model parameter Wa:

-5.3- Forward Propagation (predict/calculate_error) of the Luong Attention - Local Model and the Computation of Attention_Prob

-5.3.1- Output Sentence:
$y_1, \dots, y_i, \dots, y_{Ty}$
Each $y_i$ is a vector of dimension output_dim

-5.3.2- Input Sentence:
$x_1, \dots, x_j, \dots, x_{Tx}$
Each $x_j$ is a vector of dimension input_dim

-5.3.3- Probability $\alpha_{ij}$ (intermediate result, not a model parameter):

Each element of the matrix A is a scalar

-5.3.4- Context Vector $c_i$ (intermediate result, not a model parameter):

Each Context Vector $c_i$, one per $y_i$, is a vector of dimension hidden_dim
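
A rough NumPy sketch of the predictive variant of Local attention (local-p in arXiv 1508.04025), under stated assumptions. It follows the paper's formulas: the aligned position $p_t = S \cdot sigmoid(v_p^\top \tanh(W_p h_t))$, a window $[p_t - D, p_t + D]$, and a Gaussian with $\sigma = D/2$ centered on $p_t$. Source positions are 0-indexed here for convenience, the score is the "general" form, and the names `luong_local_p_step`, `W_p`, `v_p` and all sizes are hypothetical.

```python
import numpy as np

def luong_local_p_step(h_t, H_src, W_a, W_p, v_p, D):
    """Local attention (predictive alignment) for one target step."""
    S = H_src.shape[0]
    positions = np.arange(S)
    # Predict the aligned source position p_t, a real number in (0, S)
    p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))
    # Only source states inside the window [p_t - D, p_t + D] are attended to
    in_window = np.abs(positions - p_t) <= D
    scores = H_src @ W_a.T @ h_t
    scores = np.where(in_window, scores, -np.inf)
    # Softmax restricted to the window (positions outside get weight 0)
    exp = np.exp(scores - scores[in_window].max())
    align = exp / exp.sum()
    # Favor positions near p_t with a Gaussian, sigma = D / 2
    sigma = D / 2.0
    weights = align * np.exp(-((positions - p_t) ** 2) / (2.0 * sigma ** 2))
    context = weights @ H_src
    return p_t, weights, context

rng = np.random.default_rng(0)
S, dim, D = 10, 4, 2
H_src = rng.normal(size=(S, dim))
h_t = rng.normal(size=dim)
W_a = rng.normal(size=(dim, dim))
W_p = rng.normal(size=(dim, dim))
v_p = rng.normal(size=dim)
p_t, weights, context = luong_local_p_step(h_t, H_src, W_a, W_p, v_p, D)
```

Because of the Gaussian factor, the weights no longer sum to exactly 1; the sketch leaves them unnormalized, as the paper's formula does.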

Sequence Modeling (3): BahdanauAttention & LuongAttention