In a traditional seq2seq model, the encoder compresses the input sequence into a single context vector, and the decoder uses that vector as its initial hidden state to generate the target sequence. As the input sequence gets longer, the encoder struggles to pack all of the input information into one context vector; information is lost and high-quality decoding becomes difficult.
An attention mechanism instead, at every decoding step, uses information such as the decoder's current hidden state, input, or output to compute attention scores over the hidden states at each position of the input sequence, and forms a weighted context vector for that step's decoding. With attention, decoding at different time steps can focus on different parts of the input, which improves prediction accuracy.
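As a minimal illustration of this score-then-weighted-sum idea (not specific to Bahdanau or Luong; the dot-product scoring and all tensor names below are assumptions for the sketch):

```python
import tensorflow as tf

# Toy shapes: 2 sentences in the batch, 5 encoder positions, hidden size 8.
batch_size, seq_len, hidden = 2, 5, 8
encoder_outputs = tf.random.normal((batch_size, seq_len, hidden))  # all encoder hidden states
decoder_state = tf.random.normal((batch_size, hidden))             # current decoder hidden state

# One score per encoder position, here a plain dot product: (batch_size, seq_len)
scores = tf.einsum('bh,bsh->bs', decoder_state, encoder_outputs)
# Normalize the scores into an attention distribution over input positions.
weights = tf.nn.softmax(scores, axis=-1)
# Context vector = attention-weighted sum of encoder states: (batch_size, hidden)
context = tf.einsum('bs,bsh->bh', weights, encoder_outputs)
```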
Bahdanau attention is essentially an additive attention mechanism: the decoder hidden state and all encoder outputs are aligned through a learned linear combination (plus a nonlinearity) to produce a context vector, which improves sequence-to-sequence translation models.
The characteristics of Bahdanau attention are:
At time step $t$, the decoder hidden state is
$$\boldsymbol s_t = f(\boldsymbol s_{t-1},\boldsymbol c_t,y_{t-1})$$
At time step $t$, the attention scores of the previous hidden state $\boldsymbol s_{t-1}$ over the encoder outputs $X$ at all positions are
$$\boldsymbol\alpha_t(\boldsymbol s_{t-1}, X) = \text{softmax}\!\left(\tanh(\boldsymbol s_{t-1}W_{decoder}+X W_{encoder})W_{alignment}\right),\qquad \boldsymbol c_t=\sum_i\alpha_{ti}\,\boldsymbol x_i$$
The decoding process when using the Bahdanau attention mechanism (sketched step by step below):
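A hedged per-step sketch, following the formula $\boldsymbol s_t = f(\boldsymbol s_{t-1},\boldsymbol c_t,y_{t-1})$ above. The function and argument names are assumptions: `attention_fn` stands for something like the `bahdanau_attention` method in the custom implementation further below, and `gru_cell` is assumed to behave like `tf.keras.layers.GRUCell`.

```python
import tensorflow as tf

def decode_step(prev_state, prev_token, encoder_outputs,
                embedding, gru_cell, output_layer, attention_fn):
    """One decoding step t with Bahdanau-style attention (illustrative sketch).

    prev_state:      s_{t-1}, shape (batch_size, dec_units)
    prev_token:      y_{t-1}, shape (batch_size,)
    encoder_outputs: shape (batch_size, seq_len, enc_units)
    """
    # 1. Score s_{t-1} against every encoder position and build c_t.
    context, weights = attention_fn(prev_state, encoder_outputs)
    # 2. Feed [embedding(y_{t-1}); c_t] into the RNN cell to get s_t.
    x = tf.concat([embedding(prev_token), context], axis=-1)
    output, new_states = gru_cell(x, [prev_state])
    # 3. Project s_t to vocabulary logits, i.e. the prediction (written
    #    \hat y_{t+1} later in this section).
    logits = output_layer(output)
    return logits, new_states[0], weights
```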
TensorFlow source implementation
```python
def _bahdanau_score(processed_query, keys, attention_v):
    """Implements Bahdanau-style (additive) scoring function.

    Args:
      processed_query: Tensor, shape `[batch_size, num_units]` to compare to
        keys.
      keys: Processed memory, shape `[batch_size, max_time, num_units]`.
      attention_v: Tensor, shape `[num_units]`.

    Returns:
      A `[batch_size, max_time]` tensor of unnormalized score values.
    """
    # Reshape from [batch_size, ...] to [batch_size, 1, ...] for broadcasting.
    processed_query = tf.expand_dims(processed_query, 1)
    return tf.reduce_sum(attention_v * tf.tanh(keys + processed_query), [2])


def _calculate_attention(self, query, state):
    """Score the query based on the keys and values.

    Args:
      query: Tensor of dtype matching `self.values` and shape
        `[batch_size, query_depth]`.
      state: Tensor of dtype matching `self.values` and shape
        `[batch_size, alignments_size]`
        (`alignments_size` is memory's `max_time`).

    Returns:
      alignments: Tensor of dtype matching `self.values` and shape
        `[batch_size, alignments_size]` (`alignments_size` is memory's
        `max_time`).
      next_state: same as alignments.
    """
    processed_query = self.query_layer(query) if self.query_layer else query
    score = _bahdanau_score(
        processed_query,
        self.keys,
        self.attention_v,
    )
    alignments = self.probability_fn(score, state)
    next_state = alignments
    return alignments, next_state
```
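A quick toy check of the shapes that `_bahdanau_score` above operates on (random tensors, illustrative only):

```python
import tensorflow as tf

batch_size, max_time, num_units = 2, 7, 16
processed_query = tf.random.normal((batch_size, num_units))  # query already projected to num_units
keys = tf.random.normal((batch_size, max_time, num_units))   # processed encoder memory
attention_v = tf.random.normal((num_units,))

score = _bahdanau_score(processed_query, keys, attention_v)
print(score.shape)  # (2, 7): one unnormalized score per encoder position
```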
Custom TensorFlow implementation
```python
# Bahdanau attention layers (defined in the decoder's __init__)
self.att_w1 = tf.keras.layers.Dense(self.dec_units)
self.att_w2 = tf.keras.layers.Dense(self.dec_units)
self.att_v = tf.keras.layers.Dense(1)

def bahdanau_attention(self, query, values):
    """Bahdanau attention mechanism.

    query:  shape=(batch_size, hidden_size), decoder hidden state
    values: shape=(batch_size, seq_len, hidden_size), encoder outputs
    """
    query = tf.expand_dims(query, axis=1)
    # shape=(batch_size, seq_len, 1)
    score = self.att_v(tf.nn.tanh(self.att_w1(query) + self.att_w2(values)))
    weight = tf.nn.softmax(score, axis=1)
    # shape=(batch_size, hidden_size)
    context = tf.reduce_sum(weight * values, axis=1)
    return context, weight
```
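Assuming these layers live inside a decoder class (`decoder` below is a hypothetical instance of that class), the method can be exercised as a shape check:

```python
import tensorflow as tf

batch_size, seq_len, hidden_size = 4, 10, 32
query = tf.random.normal((batch_size, hidden_size))            # decoder hidden state
values = tf.random.normal((batch_size, seq_len, hidden_size))  # encoder outputs

# `decoder` is a hypothetical instance of the class holding att_w1/att_w2/att_v.
context, weight = decoder.bahdanau_attention(query, values)
# context: (batch_size, hidden_size), weight: (batch_size, seq_len, 1)
```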
Luong attention is essentially a multiplicative attention mechanism: the decoder hidden state is matrix-multiplied with the encoder outputs to obtain the context vector.
Luong attention is an improvement over Bahdanau attention. Depending on whether all encoder outputs are used, it comes in two variants: global attention and local attention. Global attention suits short input sequences, while local attention suits long ones (computing global attention over a long sequence is expensive). Only global attention is covered below.
The main differences between the Luong and Bahdanau attention mechanisms are summarized in points I–III further below.
The three ways of computing the Luong attention score are listed below.
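As given in the Luong paper (reference 3), for decoder state $\boldsymbol s_t$ and encoder output $\overline{\boldsymbol s}_i$:

$$
\text{score}(\boldsymbol s_t, \overline{\boldsymbol s}_i)=
\begin{cases}
\boldsymbol s_t^\top\,\overline{\boldsymbol s}_i & \text{dot}\\[2pt]
\boldsymbol s_t^\top W_a\,\overline{\boldsymbol s}_i & \text{general}\\[2pt]
\boldsymbol v_a^\top\tanh\!\left(W_a[\boldsymbol s_t;\overline{\boldsymbol s}_i]\right) & \text{concat}
\end{cases}
$$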
TensorFlow source implementation
```python
def _calculate_attention(self, query, state):
    """Score the query based on the keys and values.

    Args:
      query: Tensor of dtype matching `self.values` and shape
        `[batch_size, query_depth]`.
      state: Tensor of dtype matching `self.values` and shape
        `[batch_size, alignments_size]`
        (`alignments_size` is memory's `max_time`).

    Returns:
      alignments: Tensor of dtype matching `self.values` and shape
        `[batch_size, alignments_size]` (`alignments_size` is memory's
        `max_time`).
      next_state: Same as the alignments.
    """
    score = _luong_score(query, self.keys, self.scale_weight)
    alignments = self.probability_fn(score, state)
    next_state = alignments
    return alignments, next_state


def _luong_score(query, keys, scale):
    """Implements Luong-style (multiplicative) scoring function.

    To enable the second form, call this function with `scale=True`.

    Args:
      query: Tensor, shape `[batch_size, num_units]` to compare to keys.
      keys: Processed memory, shape `[batch_size, max_time, num_units]`.
      scale: the optional tensor to scale the attention score.

    Returns:
      A `[batch_size, max_time]` tensor of unnormalized score values.

    Raises:
      ValueError: If `key` and `query` depths do not match.
    """
    # Reshape from [batch_size, depth] to [batch_size, 1, depth]
    query = tf.expand_dims(query, 1)
    score = tf.matmul(query, keys, transpose_b=True)
    score = tf.squeeze(score, [1])
    if scale is not None:
        score = scale * score
    return score
```
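For symmetry with the custom Bahdanau implementation above, here is a minimal sketch of a custom global Luong attention layer using the "general" score. The class itself and the layer name `wa` are assumptions, and `units` must equal the encoder hidden size for the matrix product to line up:

```python
import tensorflow as tf

class LuongAttention(tf.keras.layers.Layer):
    """Global Luong attention with the 'general' score (illustrative sketch)."""

    def __init__(self, units):
        super().__init__()
        # W_a of score(s_t, s_i) = s_t^T W_a s_i; units must match the encoder hidden size.
        self.wa = tf.keras.layers.Dense(units)

    def call(self, query, values):
        """query: (batch_size, hidden_size), decoder state s_t
        values: (batch_size, seq_len, hidden_size), encoder outputs
        """
        # (batch_size, 1, hidden_size) @ (batch_size, hidden_size, seq_len) -> (batch_size, 1, seq_len)
        score = tf.matmul(tf.expand_dims(self.wa(query), 1), values, transpose_b=True)
        # (batch_size, seq_len, 1), normalized over the sequence dimension
        weight = tf.nn.softmax(tf.transpose(score, [0, 2, 1]), axis=1)
        # (batch_size, hidden_size): weighted sum of encoder outputs
        context = tf.reduce_sum(weight * values, axis=1)
        return context, weight
```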
I. The context vector is computed differently
At decoding step $t$, Bahdanau attention uses the decoder hidden state from step $t-1$, $\boldsymbol s_{t-1}$, to weight the encoder outputs $\overline{\boldsymbol s}_i$ and obtain $\boldsymbol c_t$, whereas Luong attention uses the decoder hidden state of the current step, $\boldsymbol s_t$, to weight each encoder hidden state $\overline{\boldsymbol s}_i$ and obtain $\boldsymbol c_t$.
II. The decoder's inputs and outputs differ
At decoding step $t$, Bahdanau attention concatenates the previous hidden state $\boldsymbol s_{t-1}$ with $\boldsymbol c_t$ as the decoder input, runs it through the RNN to obtain $\boldsymbol s_t$, and outputs $\hat y_{t+1}$ directly; Luong attention concatenates the current hidden state $\boldsymbol s_t$ with $\boldsymbol c_t$ and produces $\hat y_{t+1}$ through a fully connected layer. Note that Luong attention operates only on the top layer of a stacked RNN.
The decoder output at time step $t$ is $o_t$; since it is the prediction of the next token, it is usually written as $\hat y_{t+1}$ (the same convention is used in the equations below).
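In the notation of the Luong paper (reference 3), the concatenation and output step described in II can be written as

$$
\tilde{\boldsymbol s}_t = \tanh\!\left(W_c[\boldsymbol c_t;\boldsymbol s_t]\right),\qquad
\hat y_{t+1} = \text{softmax}\!\left(W_s\,\tilde{\boldsymbol s}_t\right)
$$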
III. Summary
Bahdanau's computation flow is $\boldsymbol s_{t-1}\to \boldsymbol a_t \to \boldsymbol c_t \to\boldsymbol s_t$, whereas Luong's is $\boldsymbol s_t\to\boldsymbol a_t \to\boldsymbol c_t \to\tilde{\boldsymbol s}_t$.
1. Attention: Sequence 2 Sequence model with Attention Mechanism
2. Neural Machine Translation by Jointly Learning to Align and Translate
3. Effective Approaches to Attention-based Neural Machine Translation
4. Attention Mechanism