AttentionCellWrapper in TensorFlow: A More General Attention Mechanism

The Puzzle of AttentionCellWrapper

Anyone who follows attention mechanisms knows that attention was first proposed by Bahdanau et al. for the Encoder-Decoder architecture:

"Neural Machine Translation by Jointly Learning to Align and Translate"
https://arxiv.org/abs/1409.0473

This attention structure looks roughly like the figure below:

Figure 1: Attention structure in an Encoder-Decoder

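But careful readers will notice a problem: what I normally use is a plain RNN that is unrolled in one direction to produce its result, not an Encoder-Decoder, so this attention mechanism does not apply!

So people dig through the TensorFlow API and eventually find an attention API that works with a single unidirectional RNN: AttentionCellWrapper

Using it in an RNN is very simple; just wrap the cell: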

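import tensorflow as tf  # TensorFlow 1.x only: tf.contrib was removed in TensorFlow 2.x
cell = tf.contrib.rnn.LSTMCell(num_units)
# attention_len is the attention window you choose: how many past steps to attend over
cell = tf.contrib.rnn.AttentionCellWrapper(cell, attn_length=attention_len)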

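But look closely at the API description:

Implementation based on "Neural Machine Translation by Jointly Learning to Align and Translate"

Looks familiar, doesn't it? My RNN (LSTM) has only one layer and no second Decoder layer, so how is it supposed to use earlier information?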

Where AttentionCellWrapper Gets Its Idea

After a lot of Googling, I found that TensorFlow's AttentionCellWrapper is not designed around the Encoder-Decoder architecture. Its inspiration is the attention described in this post:

https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn

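That post proposes an attention structure that works with a single unidirectional RNN: when processing the input at each step, it looks at the outputs of the previous N steps, maps and weights them, and adds this history to the prediction for the current input. The formulas are shown in the figure below:

Figure 2: Attention formulas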

where:

Figure 3: Explanation of the formulas
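
For readers who cannot see the figures, here is my recollection of the formulas in the linked Magenta post. Treat it as a transcription from memory, not a quotation, and defer to the post itself if the two differ. The symbol names are my assumption about the notation.

```latex
\begin{aligned}
u_i^t &= v^\top \tanh\left(W_1 h_i + W_2 c_t\right), \qquad i = t-n, \dots, t-1 \\
a^t   &= \operatorname{softmax}\left(u^t\right) \\
h_t'  &= \sum_{i=t-n}^{t-1} a_i^t \, h_i
\end{aligned}
```

Here h_i are the outputs of the previous n steps, c_t is the cell state at the current step, W_1, W_2 and v are learnable parameters, a_i^t is the attention weight given to step i, and h_t' is the weighted summary of the history.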

The formulas alone can be confusing, so read them alongside the structure diagram:

Figure 4: General-purpose attention structure

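I did not draw the LSTM inputs or the connections to other steps here; they are implied. In the figure:
Green line: the cell of the current step (step t)
Blue lines: the outputs of earlier steps (h_i)
Red lines: each step's output after attention weighting; a dashed line means a small weight, a solid line means a large weight and therefore a larger influence on the current step
Yellow blocks: learnable parameter matrices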

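This way, processing the current step draws on the outputs of the earlier steps (in practice, the ones inside the attention window). How much each step contributes depends on how well the attention matrices are trained.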
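
To make the weighting concrete, here is a minimal NumPy sketch of one attention step over a window of past outputs. It is not TensorFlow or Magenta code. The additive tanh scoring, the function name `lookback_attention`, and the parameter names `W1`, `W2` and `v` are my own assumptions for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def lookback_attention(past_outputs, cell_state, W1, W2, v):
    """Weight the last n outputs by their relevance to the current cell state.

    past_outputs: (n, d) outputs h_i of the previous n steps
    cell_state:   (d_c,) state of the current step
    W1: (d, k), W2: (d_c, k), v: (k,) -- learnable in a real model, random here
    """
    # One score per past step: v . tanh(h_i W1 + c_t W2)
    scores = np.tanh(past_outputs @ W1 + cell_state @ W2) @ v   # shape (n,)
    weights = softmax(scores)                                   # sums to 1
    context = weights @ past_outputs                            # shape (d,)
    return context, weights


rng = np.random.default_rng(0)
n, d, k = 4, 3, 5                        # window length, output size, attention size
past_outputs = rng.normal(size=(n, d))
cell_state = rng.normal(size=d)
W1 = rng.normal(size=(d, k))
W2 = rng.normal(size=(d, k))
v = rng.normal(size=k)

context, weights = lookback_attention(past_outputs, cell_state, W1, W2, v)
print("attention weights:", weights)
print("context vector:", context)
```

The `weights` array plays the role of the dashed and solid red lines in the figure: larger entries mean that step's output contributes more to `context`.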

Inside the AttentionCellWrapper Structure

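However, TensorFlow's AttentionCellWrapper is a bit more complex than the structure above. It considers two additional questions:

Can the current step's output, computed using the information from the earlier steps, be fed into the input of the next step?

Can both the current step's output and its cell state be included in the attention computation?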

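Based on these two points, TensorFlow's AttentionCellWrapper implements a more complete structure, shown below:

Figure 5: Structure of TensorFlow's AttentionCellWrapper

The purple lines are the newly added connections.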
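
The two extra connections can be sketched in plain NumPy as follows. This is a simplified illustration under my own assumptions, not the library's implementation: the class `ToyAttentionWrapper` and all of its weight names are hypothetical, and a toy tanh cell stands in for the LSTM. With an LSTM, the query would presumably combine the cell state c and the output h.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


class ToyAttentionWrapper:
    """Hypothetical NumPy stand-in for an attention-wrapped RNN cell."""

    def __init__(self, input_size, num_units, attn_length, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(scale=0.1, size=shape)
        self.num_units, self.attn_length = num_units, attn_length
        self.W_in = init(input_size + num_units, input_size)    # mixes input with previous attention
        self.W_cell = init(input_size + num_units, num_units)   # toy tanh RNN cell
        self.W_h = init(num_units, num_units)                    # projects the stored outputs
        self.W_q = init(num_units, num_units)                    # projects the new cell state (query)
        self.v = init(num_units)
        self.W_out = init(2 * num_units, num_units)              # mixes cell output with new attention

    def zero_state(self):
        return (np.zeros(self.num_units),                        # cell state
                np.zeros(self.num_units),                        # previous attention vector
                np.zeros((self.attn_length, self.num_units)))    # window of past outputs

    def step(self, x, state):
        h, attn, window = state
        # Extra connection 1: the previous attention result feeds into this step's input.
        x_mixed = np.concatenate([x, attn]) @ self.W_in
        h_new = np.tanh(np.concatenate([x_mixed, h]) @ self.W_cell)
        # Attend over the window of past outputs, using the new cell state as the query.
        scores = np.tanh(window @ self.W_h + h_new @ self.W_q) @ self.v
        attn_new = softmax(scores) @ window
        # Extra connection 2: the emitted output combines cell output and attention.
        output = np.concatenate([h_new, attn_new]) @ self.W_out
        # Slide the window: drop the oldest output, append the newest.
        window_new = np.vstack([window[1:], output])
        return output, (h_new, attn_new, window_new)


cell = ToyAttentionWrapper(input_size=3, num_units=4, attn_length=5)
state = cell.zero_state()
inputs = np.random.default_rng(1).normal(size=(7, 3))   # 7 time steps
outputs = []
for x in inputs:
    out, state = cell.step(x, state)
    outputs.append(out)
outputs = np.stack(outputs)
print(outputs.shape)
```

After the loop, the third element of `state` holds the last `attn_length` outputs, which is the history the next step would attend over.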

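Summary

TensorFlow provides a general-purpose attention structure that lets simple sequence models use attention, so the model can focus on the steps that contribute most. To some extent this mitigates the limited memory of sequence models.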

This is an original article by YoungLittleFat. Do not reproduce it without permission.
