Reposted from: https://blog.csdn.net/liuchonge/article/details/78856692
After the TensorFlow version upgrade, the old tf.nn.seq2seq code was moved under tf.contrib.legacy_seq2seq. This part of the API will probably be deprecated eventually, because a newer, more flexible API has been developed under tf.contrib.seq2seq; however, the code and example implementations currently found online still mostly use legacy_seq2seq, so we first analyze the functionality and source code of this part. In this post we cover the functions listed below; all of their definitions can be found in python/ops/seq2seq.py.
Let's first look at how this file is organized. It mainly contains the functions below, which can be grouped by call relationship and functionality into the following structure:
model_with_buckets
  seq2seq functions:
    basic_rnn_seq2seq
      rnn_decoder
    tied_rnn_seq2seq
    embedding_tied_rnn_seq2seq
    embedding_rnn_seq2seq
      embedding_rnn_decoder
    embedding_attention_seq2seq
      embedding_attention_decoder
        attention_decoder
          attention
    one2many_rnn_seq2seq
  loss functions:
    sequence_loss_by_example
    sequence_loss
Following the call hierarchy, the functions are introduced as follows.
Top-level function: model_with_buckets(), defined as:
def model_with_buckets(encoder_inputs,
                       decoder_inputs,
                       targets,
                       weights,
                       buckets,
                       seq2seq,
                       softmax_loss_function=None,
                       per_example_loss=False,
                       name=None):
  if len(encoder_inputs) < buckets[-1][0]:
    raise ValueError("Length of encoder_inputs (%d) must be at least that of la"
                     "st bucket (%d)." % (len(encoder_inputs), buckets[-1][0]))
  if len(targets) < buckets[-1][1]:
    raise ValueError("Length of targets (%d) must be at least that of last "
                     "bucket (%d)." % (len(targets), buckets[-1][1]))
  if len(weights) < buckets[-1][1]:
    raise ValueError("Length of weights (%d) must be at least that of last "
                     "bucket (%d)." % (len(weights), buckets[-1][1]))

  all_inputs = encoder_inputs + decoder_inputs + targets + weights
  # Store the loss and outputs for each bucket.
  losses = []
  outputs = []
  with ops.name_scope(name, "model_with_buckets", all_inputs):
    # Build one model per bucket, each with its own slice of the data.
    for j, bucket in enumerate(buckets):
      # Parameters are shared (reused) across buckets.
      with variable_scope.variable_scope(
          variable_scope.get_variable_scope(), reuse=True if j > 0 else None):
        # Call seq2seq to decode and obtain the outputs. Note that
        # encoder_inputs and decoder_inputs are predefined placeholders: lists
        # whose length equals the maximum sequence length, i.e. the largest
        # bucket. In the example below these are lists of length 20 and 30.
        # When building the model for one bucket, only the first bucket[0] /
        # bucket[1] placeholders are used; e.g. for the bucket (5, 10) the
        # first 5 and 10 placeholders respectively.
        bucket_outputs, _ = seq2seq(encoder_inputs[:bucket[0]],
                                    decoder_inputs[:bucket[1]])
        outputs.append(bucket_outputs)
        # If per_example_loss is set, call sequence_loss_by_example; losses
        # then collects a list of batch_size per-sample loss values.
        if per_example_loss:
          losses.append(
              sequence_loss_by_example(
                  outputs[-1],
                  targets[:bucket[1]],
                  weights[:bucket[1]],
                  softmax_loss_function=softmax_loss_function))
        # Otherwise call sequence_loss, which sums the above; losses then
        # collects a single scalar per bucket.
        else:
          losses.append(
              sequence_loss(
                  outputs[-1],
                  targets[:bucket[1]],
                  weights[:bucket[1]],
                  softmax_loss_function=softmax_loss_function))

  return outputs, losses
encoder_inputs: the encoder inputs, a list of tensors; each item is one token fed to the encoder.
decoder_inputs: the decoder inputs, a list of tensors; each item is one token fed to the decoder.
targets: the target values, int32; they differ from decoder_inputs only by an <EOS> symbol.
weights: the mask over the target sequence; weight = 0 at padding positions and weight = 1 otherwise.
buckets: the defined buckets, a list such as [(5, 10), (10, 20), (20, 30), ...].
seq2seq: the seq2seq model to use, e.g. embedding_attention_seq2seq, embedding_rnn_seq2seq or basic_rnn_seq2seq described below.
softmax_loss_function: the loss function, taking (labels, logits); defaults to sparse_softmax_cross_entropy_with_logits.
per_example_loss: if True, sequence_loss_by_example is called and a list is returned whose elements are the loss values of the individual samples; if False, sequence_loss is called and a single summed loss value is returned for the whole batch (see the analysis below).
name: optional name for this operation, defaults to "model_with_buckets".
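Before moving on, here is a minimal usage sketch (not from the original post) of wiring placeholders into model_with_buckets under TF 1.x; the bucket list, vocabulary sizes, cell size and placeholder names are illustrative assumptions only:

# Minimal sketch (TF 1.x) of calling model_with_buckets; all sizes and names
# below are illustrative assumptions, not values from the original code.
import tensorflow as tf

buckets = [(5, 10), (10, 20), (20, 30)]
max_enc, max_dec = buckets[-1]   # placeholders are as long as the largest bucket

encoder_inputs = [tf.placeholder(tf.int32, [None], name="enc%d" % i)
                  for i in range(max_enc)]
decoder_inputs = [tf.placeholder(tf.int32, [None], name="dec%d" % i)
                  for i in range(max_dec)]
# Targets are the decoder inputs shifted by one step (dropping the GO symbol).
targets = decoder_inputs[1:] + [tf.zeros_like(decoder_inputs[0])]
weights = [tf.placeholder(tf.float32, [None], name="weight%d" % i)
           for i in range(max_dec)]

cell = tf.contrib.rnn.GRUCell(128)

def seq2seq_f(enc, dec):
    # Any of the seq2seq constructors described below can be plugged in here.
    return tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(
        enc, dec, cell,
        num_encoder_symbols=10000, num_decoder_symbols=10000,
        embedding_size=128)

outputs, losses = tf.contrib.legacy_seq2seq.model_with_buckets(
    encoder_inputs, decoder_inputs, targets, weights, buckets, seq2seq_f)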
The purpose of bucketing is to reduce computation and speed up the model. Note that this part of the code is fairly old and still uses functions such as static_rnn(); once dynamic_rnn was introduced in newer versions of TF this trick is no longer necessary.
Analysis: the idea is simple. Input lengths are divided into several intervals, so the data only has to be padded to the length of its bucket rather than to the global maximum length.
Example: take buckets = [(5, 10), (10, 20), (20, 30), ...], where the first number of each bucket is the padded source length and the second the padded target length. E.g. '我爱你' -> 'I love you' would be assigned to the first bucket; '我爱你' is then padded to a sequence of length 5 and 'I love you' to a sequence of length 10 (see the sketch below).
In effect, every bucket describes one model configuration: one model is built per bucket, training feeds sequences of the corresponding length to it, and all these models share their parameters. This can be understood by analogy with today's dynamic_rnn: dynamic_rnn pads each batch to the length of the longest sample in that batch, whereas bucketing clusters the data by length during preprocessing.
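To make the bucket assignment and padding concrete, here is a small pure-Python sketch of the preprocessing step; PAD_ID, the helper name and the id values are made up for illustration:

# Pure-Python sketch of assigning a sample to a bucket and padding it.
# PAD_ID, the helper name and the ids are illustrative assumptions.
PAD_ID = 0
buckets = [(5, 10), (10, 20), (20, 30)]

def bucket_and_pad(source_ids, target_ids):
    for enc_len, dec_len in buckets:
        if len(source_ids) <= enc_len and len(target_ids) <= dec_len:
            src = source_ids + [PAD_ID] * (enc_len - len(source_ids))
            tgt = target_ids + [PAD_ID] * (dec_len - len(target_ids))
            return (enc_len, dec_len), src, tgt
    raise ValueError("Sample is longer than the largest bucket.")

# '我爱你' -> 'I love you' falls into the first bucket (5, 10):
bucket, src, tgt = bucket_and_pad([11, 12, 13], [21, 22, 23])
print(bucket)   # (5, 10)
print(src)      # [11, 12, 13, 0, 0]
print(tgt)      # [21, 22, 23, 0, 0, 0, 0, 0, 0, 0]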
tf.nn.seq2seq.embedding_attention_seq2seq
This is one of the seq2seq functions called above to perform the decoding. As the name suggests, it implements both embedding and attention, where the attention follows the definition in the paper "Neural Machine Translation by Jointly Learning to Align and Translate":
# T stands for time_steps, the sequence length.
def embedding_attention_seq2seq(encoder_inputs,  # [T, batch_size]
                                decoder_inputs,  # [T, batch_size]
                                cell,
                                num_encoder_symbols,
                                num_decoder_symbols,
                                embedding_size,
                                num_heads=1,             # number of attention heads
                                output_projection=None,  # decoder output projection
                                feed_previous=False,
                                dtype=None,
                                scope=None,
                                initial_state_attention=False):
  with variable_scope.variable_scope(
      scope or "embedding_attention_seq2seq", dtype=dtype) as scope:
    dtype = scope.dtype
    # Encoder. First deepcopy the cell: the seq2seq model consists of two RNNs
    # with the same structure but without shared parameters, so the encoder and
    # decoder need two different RNN cells.
    encoder_cell = copy.deepcopy(cell)
    # Embed the encoder inputs by simply wrapping the RNN cell in an
    # EmbeddingWrapper.
    encoder_cell = core_rnn_cell.EmbeddingWrapper(
        encoder_cell,
        embedding_classes=num_encoder_symbols,
        embedding_size=embedding_size)
    # The encoder RNN is still built with static_rnn here.
    encoder_outputs, encoder_state = rnn.static_rnn(
        encoder_cell, encoder_inputs, dtype=dtype)

    # First calculate a concatenation of encoder outputs to put attention on.
    # Turn the list of encoder outputs into a single tensor of shape
    # [batch_size, encoder_input_length, output_size]; this tensor then serves
    # as the attention input.
    top_states = [
        array_ops.reshape(e, [-1, 1, cell.output_size]) for e in encoder_outputs
    ]
    attention_states = array_ops.concat(top_states, 1)

    # Decoder.
    output_size = None
    # Project the decoder output to num_decoder_symbols dimensions by simply
    # wrapping the RNN cell in an OutputProjectionWrapper.
    if output_projection is None:
      cell = core_rnn_cell.OutputProjectionWrapper(cell, num_decoder_symbols)
      output_size = num_decoder_symbols

    # If feed_previous is a Python bool, call embedding_attention_decoder
    # directly to decode.
    if isinstance(feed_previous, bool):
      return embedding_attention_decoder(
          decoder_inputs,
          encoder_state,
          attention_states,
          cell,
          num_decoder_symbols,
          embedding_size,
          num_heads=num_heads,
          output_size=output_size,
          output_projection=output_projection,
          feed_previous=feed_previous,
          initial_state_attention=initial_state_attention)

    # If feed_previous is a Tensor, we construct 2 graphs and use cond.
    def decoder(feed_previous_bool):
      # This function is called twice: the first call without reuse, the second
      # one with reuse, i.e. decoder(True) then decoder(False).
      reuse = None if feed_previous_bool else True
      with variable_scope.variable_scope(
          variable_scope.get_variable_scope(), reuse=reuse):
        outputs, state = embedding_attention_decoder(
            decoder_inputs,
            encoder_state,
            attention_states,
            cell,
            num_decoder_symbols,
            embedding_size,
            num_heads=num_heads,
            output_size=output_size,
            output_projection=output_projection,
            feed_previous=feed_previous_bool,
            update_embedding_for_previous=False,
            initial_state_attention=initial_state_attention)
        state_list = [state]
        if nest.is_sequence(state):
          state_list = nest.flatten(state)
        return outputs + state_list

    outputs_and_state = control_flow_ops.cond(feed_previous,
                                              lambda: decoder(True),
                                              lambda: decoder(False))
    outputs_len = len(decoder_inputs)  # Outputs length same as decoder inputs.
    state_list = outputs_and_state[outputs_len:]
    state = state_list[0]
    if nest.is_sequence(encoder_state):
      state = nest.pack_sequence_as(
          structure=encoder_state, flat_sequence=state_list)
    return outputs_and_state[:outputs_len], state
encoder_inputs: the encoder inputs, a list of int32 id tensors.
decoder_inputs: the decoder inputs, a list of int32 id tensors.
cell: an RNNCell; any of the common RNNCell definitions can be used.
num_encoder_symbols: the source vocabulary size, used to define the encoder embedding matrix.
num_decoder_symbols: the target vocabulary size, used to define the decoder embedding matrix.
embedding_size: the dimensionality of the embedding vectors.
num_heads: the number of attention heads, i.e. how many different attention weightings are used; extra parameters are used to compute several attention vectors.
output_projection: the output projection. To obtain actual words in num_decoder_symbols dimensions an extra projection layer with parameters W and b is needed, where W is [output_size, num_decoder_symbols] and b is [num_decoder_symbols]. If output_projection is left at its default None (training mode), the cell is wrapped in an OutputProjectionWrapper, which turns [batch_size, output_size] into [batch_size, num_decoder_symbols]. If output_projection is not None, the cell output remains [batch_size, output_size]. (The two cells are therefore different, which directly affects the subsequent embedding_rnn_decoder decoding process and the definition of loop_function.)
feed_previous: whether the output of the previous time step is fed as the input of the next step. This is usually set to True at test time; then only the first decoder input (the "GO" symbol) is used and every other decoder input depends on the output of the previous step.
initial_state_attention: defaults to False, meaning the initial attention is zero; if True, attention starts from the initial state and the attention states.

The code above implements the encoder stage with embeddings and finally yields the hidden vector encoder_outputs for every time step; the per-step outputs are then reshaped and concatenated into a [batch_size, encoder_input_length, output_size] tensor, which makes it convenient to compute the context vector Ci at every decoding step.
In the decoder stage, the RNNCell is first wrapped in an OutputProjectionWrapper for the output-layer projection (mapping the output to the desired dimensionality), and then embedding_attention_decoder is called directly for decoding. When feed_previous is not a Python bool but a tensor, the inner function decoder is executed instead and both graphs are built with tf.cond; the sketch below shows this case. The function returns an (outputs, state) tuple pair.
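To illustrate the tensor case of feed_previous, here is a minimal sketch (assuming TF 1.x; the sequence lengths, vocabulary sizes and cell size are made up) in which a single graph switches between teacher forcing and feeding back its own predictions via a bool placeholder:

# Minimal sketch (TF 1.x): one graph that switches between teacher forcing and
# feeding back its own predictions, exercising the tf.cond branch described
# above. All sizes and names are illustrative assumptions.
import tensorflow as tf

enc_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(5)]
dec_inputs = [tf.placeholder(tf.int32, [None]) for _ in range(10)]
feed_previous = tf.placeholder(tf.bool, [], name="feed_previous")

cell = tf.contrib.rnn.GRUCell(64)
outputs, state = tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(
    enc_inputs, dec_inputs, cell,
    num_encoder_symbols=8000, num_decoder_symbols=8000,
    embedding_size=64, feed_previous=feed_previous)

# At run time: feed feed_previous=False for training (ground-truth decoder
# inputs) and feed_previous=True for greedy decoding (previous argmax as the
# next input).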
tf.nn.seq2seq.embedding_attention_decoder
The embedding_attention_seq2seq function above calls this function directly during decoding.
Its code definition:
def embedding_attention_decoder(decoder_inputs,
                                initial_state,
                                attention_states,
                                cell,
                                num_symbols,
                                embedding_size,
                                num_heads=1,
                                output_size=None,
                                output_projection=None,
                                feed_previous=False,
                                update_embedding_for_previous=True,
                                dtype=None,
                                scope=None,
                                initial_state_attention=False):
  if output_size is None:
    output_size = cell.output_size
  if output_projection is not None:
    proj_biases = ops.convert_to_tensor(output_projection[1], dtype=dtype)
    proj_biases.get_shape().assert_is_compatible_with([num_symbols])

  with variable_scope.variable_scope(
      scope or "embedding_attention_decoder", dtype=dtype) as scope:
    # The decoder-side embedding matrix.
    embedding = variable_scope.get_variable("embedding",
                                            [num_symbols, embedding_size])
    # loop_function applies output_projection to the previous cell output and
    # embeds the resulting argmax to obtain the current cell input; it is only
    # used when feed_previous is set.
    loop_function = _extract_argmax_and_embed(
        embedding, output_projection,
        update_embedding_for_previous) if feed_previous else None
    # Embed decoder_inputs to get the word vectors (these are used as-is when
    # feed_previous is not set).
    emb_inp = [
        embedding_ops.embedding_lookup(embedding, i) for i in decoder_inputs
    ]
    return attention_decoder(
        emb_inp,
        initial_state,
        attention_states,
        cell,
        output_size=output_size,
        num_heads=num_heads,
        loop_function=loop_function,
        initial_state_attention=initial_state_attention)
decoder_inputs: here the inputs are token ids, with shape a list of [batch_size]; in other words you do not embed the inputs yourself, you simply pass the tokens' indices (ids) in the vocabulary and the function converts the ids to embeddings internally.
num_symbols: the decoder-side vocab_size.
embedding_size: the dimensionality each token is embedded into.
output_projection: if output_projection is the default None (training mode), the cell has been wrapped in an OutputProjectionWrapper, so its output [batch_size, output_size] is converted to [batch_size, num_symbols]; if output_projection is not None, the cell output remains [batch_size, output_size].
update_embedding_for_previous: has no effect when the previous output is not used as the current input (feed_previous=False); it only matters when feed_previous is True. If it is False, backpropagation then only updates the embedding vector of the 'GO' symbol and leaves all other embeddings unchanged.
initial_state: a 2D tensor [batch_size x cell.state_size], the initial RNN state.
attention_states: a 3D tensor [batch_size x attn_length x attn_size], the encoder hidden vectors computed above.

The first step creates the embedding used for decoding; the second step creates a loop function, loop_function, which maps the previous step's output back into the vocabulary space and produces a word embedding to be used as the input of the next step.
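For reference, here is a simplified sketch of what such an argmax-and-embed loop function looks like; it paraphrases _extract_argmax_and_embed rather than quoting the TensorFlow source verbatim, and the factory name is made up:

# Simplified sketch of the argmax-and-embed loop function (a paraphrase of
# _extract_argmax_and_embed, not the verbatim TensorFlow source).
import tensorflow as tf

def make_loop_function(embedding, output_projection=None, update_embedding=True):
  def loop_function(prev, _):
    # Project the previous cell output back to vocabulary logits if needed.
    if output_projection is not None:
      prev = tf.nn.xw_plus_b(prev, output_projection[0], output_projection[1])
    # Greedy choice: take the most likely symbol ...
    prev_symbol = tf.argmax(prev, 1)
    # ... and embed it as the next decoder input.
    emb_prev = tf.nn.embedding_lookup(embedding, prev_symbol)
    if not update_embedding:
      # Block gradients so the embedding is not updated through this path.
      emb_prev = tf.stop_gradient(emb_prev)
    return emb_prev
  return loop_function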
tf.nn.attention_decoder
The paper involves three formulas:
$u^{t}_{i} = v^{T}\tanh(W^{'}_{1}h_{i}+W^{'}_{2}d_{t})$
$a^{t}_{i} = \mathrm{softmax}(u^{t}_{i})$
$d^{'}_{t} = \sum_{i=1}^{T_{A}}a^{t}_{i}h_{i}$
The encoder outputs the hidden states $(h_{1},\dots,h_{T_{A}})$ and the decoder has the hidden states $(d_{1},\dots,d_{T_{B}})$. $v^{T}$, $W^{'}_{1}$ and $W^{'}_{2}$ are parameters the model has to learn. Attention means that at every decoding time step the encoder hidden states are summed with weights, paying different amounts of attention to different pieces of information; the key point is therefore to compute the weight of each hidden state. The attention mechanism in this source code is the most common one and proceeds in three steps: (1) compute the weight $u^{t}_{i}$ from the current hidden state $d_{t}$ and the attended hidden state $h_{i}$; (2) normalize with softmax to obtain probabilities; (3) use them as weights to sum the hidden states, obtaining an information vector $d^{'}_{t}$. How $d^{'}_{t}$ is used afterwards depends on the specific task.
$a^{t}_{i}$ above is the weight assigned to $h_{i}$ at time step $t$.
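A small NumPy sketch of these three steps, with made-up shapes and random values, may help before reading the TensorFlow implementation:

# NumPy sketch of the three attention steps above; shapes and values are
# illustrative assumptions only.
import numpy as np

T_A, attn_size, state_size, vec_size = 7, 16, 16, 16
h = np.random.randn(T_A, attn_size)         # encoder hidden states h_1..h_TA
d_t = np.random.randn(state_size)           # current decoder state d_t
W1 = np.random.randn(attn_size, vec_size)   # W'_1
W2 = np.random.randn(state_size, vec_size)  # W'_2
v = np.random.randn(vec_size)               # v

# (1) scores u^t_i = v^T tanh(W'_1 h_i + W'_2 d_t)
u = np.tanh(h @ W1 + d_t @ W2) @ v          # shape [T_A]
# (2) softmax normalization to get a^t_i
a = np.exp(u - u.max()); a /= a.sum()
# (3) weighted sum d'_t = sum_i a^t_i h_i
d_prime = (a[:, None] * h).sum(axis=0)      # shape [attn_size]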
def attention_decoder(decoder_inputs,    # T * [batch_size, input_size]
                      initial_state,     # [batch_size, cell.state_size]
                      attention_states,  # [batch_size, attn_length, attn_size]
                      cell,
                      output_size=None,
                      num_heads=1,
                      loop_function=None,
                      dtype=None,
                      scope=None,
                      initial_state_attention=False):
  if not decoder_inputs:
    raise ValueError("Must provide at least 1 input to attention decoder.")
  if num_heads < 1:
    raise ValueError("With less than 1 heads, use a non-attention decoder.")
  if attention_states.get_shape()[2].value is None:
    raise ValueError("Shape[2] of attention_states must be known: %s" %
                     attention_states.get_shape())
  if output_size is None:
    output_size = cell.output_size

  with variable_scope.variable_scope(
      scope or "attention_decoder", dtype=dtype) as scope:
    dtype = scope.dtype

    batch_size = array_ops.shape(decoder_inputs[0])[0]  # Needed for reshaping.
    attn_length = attention_states.get_shape()[1].value
    if attn_length is None:
      attn_length = array_ops.shape(attention_states)[1]
    attn_size = attention_states.get_shape()[2].value

    # To calculate W1 * h_t we use a 1-by-1 convolution, need to reshape before.
    # attention_states is reshaped into a 4-D tensor
    # [batch_size, attn_length, 1, attn_size] so that a 1x1 convolution can be
    # applied; the fourth dimension is attn_size.
    hidden = array_ops.reshape(attention_states,
                               [-1, attn_length, 1, attn_size])
    # One entry per attention head: hidden_features stores W*hj and v stores v;
    # each head has its own parameters.
    hidden_features = []
    v = []
    # Next, v * tanh(W*hj + U*zi) is computed to measure how relevant the
    # encoder states are to the decoder state.
    attention_vec_size = attn_size  # Size of query vectors for attention.
    # Compute W*hj for every encoder hidden state.
    for a in xrange(num_heads):
      # The kernel size is 1x1, the input channel is attn_size and there are
      # attention_vec_size filters.
      k = variable_scope.get_variable("AttnW_%d" % a,
                                      [1, 1, attn_size, attention_vec_size])
      # The convolution result has shape
      # [batch_size, attn_length, 1, attention_vec_size].
      hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
      v.append(
          variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))

    state = initial_state  # Hidden state of the decoding RNN.

    def attention(query):
      """Put attention masks on hidden using hidden_features and query."""
      ds = []  # Results of attention reads will be stored here.
      # If the query is a tuple, flatten it and concatenate into a 2-D tensor.
      if nest.is_sequence(query):  # If the query is a tuple, flatten it.
        query_list = nest.flatten(query)
        for q in query_list:  # Check that ndims == 2 if specified.
          ndims = q.get_shape().ndims
          if ndims:
            assert ndims == 2
        query = array_ops.concat(query_list, 1)
      for a in xrange(num_heads):
        with variable_scope.variable_scope("Attention_%d" % a):
          # Compute U*zi and reshape it to [batch_size, 1, 1, attention_vec_size].
          y = Linear(query, attention_vec_size, True)(query)
          y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
          # Attention mask is a softmax of v^T * tanh(...).
          # Compute v * tanh(W*hj + U*zi). hidden_features[a] + y has shape
          # [batch_size, attn_length, 1, attention_vec_size]; multiplying by the
          # vector v ([attention_vec_size]) keeps that shape, and reducing over
          # dimensions 2 and 3 yields a [batch_size, attn_length] tensor of
          # scores, one per encoder hidden vector.
          s = math_ops.reduce_sum(v[a] * math_ops.tanh(hidden_features[a] + y),
                                  [2, 3])
          # Normalize the scores with softmax.
          a = nn_ops.softmax(s)
          # Now calculate the attention-weighted vector d.
          # Weighted sum over all encoder hidden vectors.
          d = math_ops.reduce_sum(
              array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden, [1, 2])
          ds.append(array_ops.reshape(d, [-1, attn_size]))
      return ds

    outputs = []
    prev = None
    batch_attn_size = array_ops.stack([batch_size, attn_size])
    attns = [
        array_ops.zeros(
            batch_attn_size, dtype=dtype) for _ in xrange(num_heads)
    ]
    for a in attns:  # Ensure the second shape of attention vectors is set.
      a.set_shape([None, attn_size])
    # If initial_state_attention is set, call attention on the initial state
    # right away instead of starting from the all-zero attention vectors.
    if initial_state_attention:
      attns = attention(initial_state)
    # Iterate over all decoder_inputs and decode them one by one.
    for i, inp in enumerate(decoder_inputs):
      if i > 0:
        # Reuse the decoding RNN parameters from the second step onwards.
        variable_scope.get_variable_scope().reuse_variables()
      # If loop_function is set, we use it instead of decoder_inputs.
      # When the previous output should serve as the current input, override
      # inp with the value produced by loop_function.
      if loop_function is not None and prev is not None:
        with variable_scope.variable_scope("loop_function", reuse=True):
          inp = loop_function(prev, i)
      # Merge input and previous attentions into one vector of the right size.
      input_size = inp.get_shape().with_rank(2)[1]
      if input_size.value is None:
        raise ValueError("Could not infer input size from input: %s" % inp.name)
      # The cell input is the concatenation of inp and the attention vectors.
      inputs = [inp] + attns
      x = Linear(inputs, input_size, True)(inputs)
      # Run the RNN.
      cell_output, state = cell(x, state)
      # Run the attention mechanism.
      # Compute the attention vectors for the next time step.
      if i == 0 and initial_state_attention:
        with variable_scope.variable_scope(
            variable_scope.get_variable_scope(), reuse=True):
          attns = attention(state)
      else:
        attns = attention(state)

      with variable_scope.variable_scope("AttnOutputProjection"):
        inputs = [cell_output] + attns
        output = Linear(inputs, output_size, True)(inputs)
      if loop_function is not None:
        prev = output
      outputs.append(output)

  return outputs, state
Regarding the num_heads parameter: attention is a weighted sum of information, and one attention head corresponds to one way of computing that weighted sum. This parameter defines how many attention heads are used, so the third formula can be further written as $\sum^{num\_heads}_{j=1}\sum^{T_{A}}_{i=1}a_{i,j}h_{i}$.
$W_{1}*h_{i}$ is implemented with a convolution; the returned tensor has shape [batch_size, attn_length, 1, attention_vec_size]:
# To calculate W1 * h_t we use a 1-by-1 convolution
hidden = array_ops.reshape(
    attention_states, [-1, attn_length, 1, attn_size])
hidden_features = []
v = []
attention_vec_size = attn_size  # Size of query vectors for attention.
for a in xrange(num_heads):
  k = variable_scope.get_variable("AttnW_%d" % a,
                                  [1, 1, attn_size, attention_vec_size])
  hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
  v.append(
      variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))
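The 1x1 convolution is just a per-position linear map. The quick sanity check below (assuming TF 1.x; all sizes and values are made up) confirms that it matches multiplying every $h_{i}$ by the same $W_{1}$:

# Check that a 1x1 convolution over [batch, attn_length, 1, attn_size] equals
# multiplying every encoder state h_i by the same W1. Sizes are illustrative.
import numpy as np
import tensorflow as tf

batch, attn_length, attn_size, vec_size = 2, 7, 16, 16
states = np.random.randn(batch, attn_length, attn_size).astype(np.float32)
W1 = np.random.randn(attn_size, vec_size).astype(np.float32)

hidden = tf.reshape(tf.constant(states), [-1, attn_length, 1, attn_size])
k = tf.constant(W1.reshape(1, 1, attn_size, vec_size))
conv = tf.nn.conv2d(hidden, k, [1, 1, 1, 1], "SAME")  # [batch, T, 1, vec_size]
matmul = tf.einsum("bta,av->btv", tf.constant(states), tf.constant(W1))

with tf.Session() as sess:
    c, m = sess.run([conv, matmul])
print(np.allclose(c[:, :, 0, :], m, atol=1e-5))   # True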
$W_{2}*d_{t}$ is implemented by the linear mapping function linear below:
for a in xrange(num_heads):
  with variable_scope.variable_scope("Attention_%d" % a):
    # query corresponds to the current decoder hidden state d_t.
    y = linear(query, attention_vec_size, True)
    y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
    # Compute u_t.
    s = math_ops.reduce_sum(
        v[a] * math_ops.tanh(hidden_features[a] + y), [2, 3])
    a = nn_ops.softmax(s)
    # Compute the attention-weighted vector d.
    d = math_ops.reduce_sum(
        array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden,
        [1, 2])
    ds.append(array_ops.reshape(d, [-1, attn_size]))