This post takes a quick look at the APIs provided under tf.contrib.seq2seq from the source-code perspective. The files and functions in this package, the relationships between them, and how to use them were already covered in the previous post, so if you are not interested in the source code itself you can stop reading here~~
For simplicity, let's start the analysis from dynamic_decode, the entry point of decoding:
dynamic_decode(
    decoder,
    output_time_major=False,
    impute_finished=False,
    maximum_iterations=None,
    parallel_iterations=32,
    swap_memory=False,
    scope=None
)
decoder: a BasicDecoder, BeamSearchDecoder, or custom decoder instance
output_time_major: same meaning as in the RNN APIs; when True the outputs are [max_time, batch_size, ...], when False they are [batch_size, max_time, ...]
impute_finished: Boolean. When True, once a sequence is finished its last state is copied forward and its outputs are zeroed, so the final states and outputs have correct values and the finished steps are ignored during backpropagation; this makes the result more reliable but slows things down a bit.
maximum_iterations: the maximum number of decoding steps. For training this is usually set to decoder_inputs_length; for inference just set whatever maximum sequence length you want. Decoding stops when the end token is produced or the maximum number of steps is reached. (A minimal usage sketch follows below.)
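Before diving into the internals, here is a minimal training-time usage sketch of dynamic_decode together with BasicDecoder and TrainingHelper. Names such as decoder_cell, decoder_inputs_embedded, decoder_inputs_length, encoder_final_state and vocab_size are placeholders for whatever your own model defines, not part of the library:

import tensorflow as tf

# Assumed to exist in your own graph: decoder_cell, encoder_final_state,
# decoder_inputs_embedded ([batch_size, max_time, embed_dim]),
# decoder_inputs_length ([batch_size]) and vocab_size.
helper = tf.contrib.seq2seq.TrainingHelper(
    inputs=decoder_inputs_embedded,
    sequence_length=decoder_inputs_length)
decoder = tf.contrib.seq2seq.BasicDecoder(
    cell=decoder_cell,
    helper=helper,
    initial_state=encoder_final_state,
    output_layer=tf.layers.Dense(vocab_size))
final_outputs, final_state, final_sequence_lengths = tf.contrib.seq2seq.dynamic_decode(
    decoder,
    impute_finished=True,
    maximum_iterations=tf.reduce_max(decoder_inputs_length))
logits = final_outputs.rnn_output      # [batch_size, max_time, vocab_size]
sample_ids = final_outputs.sample_id   # [batch_size, max_time]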
Put simply, dynamic_decode first calls the decoder's initialize function to set up the state and other variables needed for decoding, and then repeatedly calls the decoder's step function to decode round after round. If I were writing it myself it might just be a for loop, but the real source is more involved because of all the condition checks needed to keep the program correct and to raise errors properly. So let's go straight to the main part, which is a control_flow_ops.while_loop; it is also a good opportunity to learn how that function is used:
while_loop(cond, body, loop_vars, shape_invariants=None, parallel_iterations=10, back_prop=True, swap_memory=False, name=None)
cond is the loop condition and body is the loop body; both are functions. loop_vars are the variables the loop carries; cond and body take the same arguments, namely loop_vars, although cond usually only looks at one or two of them to decide whether the loop should stop, while most are used inside body. parallel_iterations is the number of iterations allowed to run in parallel. Looking at the code below, the condition function simply checks whether every element of finished has become True (the loop stops once they all are), and the body function essentially runs decoder.step(time, inputs, state) followed by a series of assignments and checks.
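To make the cond/body/loop_vars idea concrete before we return to the decoder, here is a tiny self-contained toy example (unrelated to seq2seq) that sums the integers 0 through 4 with tf.while_loop:

import tensorflow as tf

# Toy example: sum the integers 0..4 with tf.while_loop.
def cond(i, total):
    # Keep looping while i < 5.
    return tf.less(i, 5)

def body(i, total):
    # Must return new values for every loop variable, in the same order.
    return i + 1, total + i

final_i, final_total = tf.while_loop(cond, body, loop_vars=[tf.constant(0), tf.constant(0)])

with tf.Session() as sess:
    print(sess.run([final_i, final_total]))  # [5, 10]

Back to dynamic_decode, its condition and body look like this: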
def condition(unused_time, unused_outputs_ta, unused_state, unused_inputs,
              finished, unused_sequence_lengths):
  return math_ops.logical_not(math_ops.reduce_all(finished))

def body(time, outputs_ta, state, inputs, finished, sequence_lengths):
  # ====== 1. Call the decoder's step function to get this step's outputs and state,
  # plus the next inputs (produced by the helper) and the decoder_finished flag.
  (next_outputs, decoder_state, next_inputs, decoder_finished) = decoder.step(time, inputs, state)
  # ====== 2. Decide whether decoding is finished, combining decoder_finished with
  # whether time has already reached maximum_iterations.
  next_finished = math_ops.logical_or(decoder_finished, finished)
  if maximum_iterations is not None:
    next_finished = math_ops.logical_or(
        next_finished, time + 1 >= maximum_iterations)
  next_sequence_lengths = array_ops.where(
      math_ops.logical_and(math_ops.logical_not(finished), next_finished),
      array_ops.fill(array_ops.shape(sequence_lengths), time + 1),
      sequence_lengths)
  nest.assert_same_structure(state, decoder_state)
  nest.assert_same_structure(outputs_ta, next_outputs)
  nest.assert_same_structure(inputs, next_inputs)
  # ====== 3. If impute_finished is True, zero out next_outputs for sequences that are
  # already finished so they do not take part in backpropagation, and copy the previous
  # state through as the next state. This is why setting it to True costs some extra time.
  if impute_finished:
    emit = nest.map_structure(
        lambda out, zero: array_ops.where(finished, zero, out),
        next_outputs, zero_outputs)
  else:
    emit = next_outputs

  # Copy through states past finish
  def _maybe_copy_state(new, cur):
    # TensorArrays and scalar states get passed through.
    if isinstance(cur, tensor_array_ops.TensorArray):
      pass_through = True
    else:
      new.set_shape(cur.shape)
      pass_through = (new.shape.ndims == 0)
    return new if pass_through else array_ops.where(finished, cur, new)

  if impute_finished:
    next_state = nest.map_structure(_maybe_copy_state, decoder_state, state)
  else:
    next_state = decoder_state
  # ====== 4. Write this step's outputs and return the updated loop variables.
  outputs_ta = nest.map_structure(lambda ta, out: ta.write(time, out), outputs_ta, emit)
  return (time + 1, outputs_ta, next_state, next_inputs, next_finished, next_sequence_lengths)

# Run the decoding loop with the condition and body defined above.
res = control_flow_ops.while_loop(
    condition, body,
    loop_vars=[initial_time, initial_outputs_ta, initial_state, initial_inputs,
               initial_finished, initial_sequence_lengths],
    parallel_iterations=parallel_iterations, swap_memory=swap_memory)
After reading the code above, the natural question is what decoder.step() actually does. You can think of it as running the RNNCell for one step, plus the extra decoding work built on top of that, such as using the helper to pick the output token and turn it into the next step's input. It looks like this:
def step(self, time, inputs, state, name=None):
  with ops.name_scope(name, "BasicDecoderStep", (time, inputs, state)):
    cell_outputs, cell_state = self._cell(inputs, state)
    if self._output_layer is not None:
      # If an output layer is set, project the cell outputs (e.g. to vocabulary size).
      cell_outputs = self._output_layer(cell_outputs)
    # Pick the answer from the outputs: the greedy helper takes the highest-probability
    # word, the scheduled helpers sample from some distribution, and so on.
    sample_ids = self._helper.sample(time=time, outputs=cell_outputs, state=cell_state)
    # Turn the chosen result into the next input: during training this is simply the next
    # time step of decoder_inputs; during inference the chosen word is embedded.
    (finished, next_inputs, next_state) = self._helper.next_inputs(
        time=time, outputs=cell_outputs, state=cell_state, sample_ids=sample_ids)
  outputs = BasicDecoderOutput(cell_outputs, sample_ids)  # a namedtuple bundling both as this step's output
  return (outputs, next_state, next_inputs, finished)
Next, let's look at what the sample and next_inputs functions of the different helper classes actually do.
TrainingHelper
def sample(self, time, outputs, name=None, **unused_kwargs):
  with ops.name_scope(name, "TrainingHelperSample", [time, outputs]):
    # Take the argmax over the outputs. Note these sample_ids are only recorded in the
    # decoder output; the next input below still comes from the ground-truth inputs.
    sample_ids = math_ops.cast(math_ops.argmax(outputs, axis=-1), dtypes.int32)
    return sample_ids

def next_inputs(self, time, outputs, state, name=None, **unused_kwargs):
  with ops.name_scope(name, "TrainingHelperNextInputs", [time, outputs, state]):
    next_time = time + 1
    finished = (next_time >= self._sequence_length)
    all_finished = math_ops.reduce_all(finished)
    # Read the next value directly from decoder_inputs as the next decoding input.
    def read_from_ta(inp):
      return inp.read(next_time)
    next_inputs = control_flow_ops.cond(
        all_finished, lambda: self._zero_inputs,
        lambda: nest.map_structure(read_from_ta, self._input_tas))
    return (finished, next_inputs, state)
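As an aside, the scheduled-sampling variant mentioned earlier is ScheduledEmbeddingTrainingHelper. A hedged construction sketch (decoder_inputs_embedded, decoder_inputs_length and embedding_matrix are again placeholders from your own model, and 0.25 is an arbitrary example value):

# With probability sampling_probability the helper embeds its own sampled prediction and
# feeds it back; otherwise it reads the ground-truth input just like TrainingHelper.
helper = tf.contrib.seq2seq.ScheduledEmbeddingTrainingHelper(
    inputs=decoder_inputs_embedded,
    sequence_length=decoder_inputs_length,
    embedding=embedding_matrix,
    sampling_probability=0.25)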
GreedyEmbeddingHelper
def sample(self, time, outputs, state, name=None):
  del time, state  # unused by sample_fn
  if not isinstance(outputs, ops.Tensor):
    raise TypeError("Expected outputs to be a single Tensor, got: %s" % type(outputs))
  # Take the argmax over the outputs, i.e. greedily pick the most probable word.
  sample_ids = math_ops.cast(math_ops.argmax(outputs, axis=-1), dtypes.int32)
  return sample_ids

def next_inputs(self, time, outputs, state, sample_ids, name=None):
  del time, outputs  # unused by next_inputs_fn
  finished = math_ops.equal(sample_ids, self._end_token)
  all_finished = math_ops.reduce_all(finished)
  # Embed sample_ids to obtain the word vectors used as the next time step's input.
  next_inputs = control_flow_ops.cond(
      all_finished,
      # If we're finished, the next_inputs value doesn't matter
      lambda: self._start_inputs,
      lambda: self._embedding_fn(sample_ids))
  return (finished, next_inputs, state)
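Putting GreedyEmbeddingHelper to work at inference time looks roughly like the following sketch; GO_ID, EOS_ID, embedding_matrix, decoder_cell, batch_size and the other names are assumed to come from your own model, and 50 is just an example length limit:

# Greedy decoding at inference time (all names are placeholders from your own graph).
helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    embedding=embedding_matrix,
    start_tokens=tf.fill([batch_size], GO_ID),
    end_token=EOS_ID)
decoder = tf.contrib.seq2seq.BasicDecoder(
    cell=decoder_cell,
    helper=helper,
    initial_state=encoder_final_state,
    output_layer=tf.layers.Dense(vocab_size))
final_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=50)
predicted_ids = final_outputs.sample_id  # [batch_size, max_time]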
You may have noticed that none of the code above touches the attention mechanism, so where does it come in? In the TF source, attention is wrapped around the RNNCell, just like DropoutWrapper: an AttentionWrapper is provided that wraps attention directly into the RNNCell, so every call to self._cell(inputs, state) also runs the attention computation. Also note that the attention implementation borrows ideas from Memory Networks, which is why many variables are named memory, query, keys, values and so on; don't let that confuse you. This post only covers the code of the two attention mechanisms below; for the underlying theory see my earlier article 从头实现一个深度学习对话系统–Seq-to-Seq模型详解.
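For context, a rough sketch of how the wrapper is wired together in practice (encoder_outputs, encoder_inputs_length, encoder_final_state, rnn_size and batch_size are placeholders for your own graph):

attention_mechanism = tf.contrib.seq2seq.LuongAttention(
    num_units=rnn_size,
    memory=encoder_outputs,                    # "memory" is the set of encoder outputs
    memory_sequence_length=encoder_inputs_length)
decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
    cell=tf.nn.rnn_cell.LSTMCell(rnn_size),
    attention_mechanism=attention_mechanism,
    attention_layer_size=rnn_size)
# The wrapper has its own state structure, so the encoder state is cloned into it.
decoder_initial_state = decoder_cell.zero_state(batch_size, tf.float32).clone(
    cell_state=encoder_final_state)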
LuongAttention
Let's first look at how the similarity score is computed:
def _luong_score(query, keys, scale):
  # Computes the similarity score between query and the memory, here simply a dot product.
  depth = query.get_shape()[-1]
  key_units = keys.get_shape()[-1]
  dtype = query.dtype
  # query is the current decoder state, [batch_size, rnn_size]; keys is the processed
  # memory, [batch_size, max_time, rnn_size], so query is first expanded by one dimension.
  query = array_ops.expand_dims(query, 1)                  # [batch_size, 1, rnn_size]
  score = math_ops.matmul(query, keys, transpose_b=True)   # [batch_size, 1, max_time]
  score = array_ops.squeeze(score, [1])  # [batch_size, max_time]: similarity between the query and each of the max_time memory slots
  if scale:
    g = variable_scope.get_variable(
        "attention_g", dtype=dtype, initializer=1.)
    score = g * score
  return score
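If the shapes feel abstract, this small NumPy shape check (with made-up sizes, not from the source) mirrors what _luong_score does:

import numpy as np

# Made-up sizes, just to check the shapes produced by the dot-product score.
batch_size, max_time, rnn_size = 2, 5, 4
query = np.random.rand(batch_size, rnn_size)            # decoder state at the current step
keys = np.random.rand(batch_size, max_time, rnn_size)   # processed encoder memory
score = np.matmul(query[:, None, :], keys.transpose(0, 2, 1))  # [batch_size, 1, max_time]
score = score.squeeze(1)                                 # [batch_size, max_time]
print(score.shape)  # (2, 5)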
Next, the definition of the LuongAttention class:
class LuongAttention(_BaseAttentionMechanism):
  def __init__(self, num_units, memory, memory_sequence_length=None, scale=False,
               probability_fn=None, score_mask_value=float("-inf"), name="LuongAttention"):
    # probability_fn normalizes the scores; softmax is used by default.
    if probability_fn is None:
      probability_fn = nn_ops.softmax
    wrapped_probability_fn = lambda score, _: probability_fn(score)
    # Call the constructor of the _BaseAttentionMechanism base class.
    super(LuongAttention, self).__init__(
        query_layer=None,
        memory_layer=layers_core.Dense(num_units, name="memory_layer", use_bias=False),
        memory=memory, probability_fn=wrapped_probability_fn,
        memory_sequence_length=memory_sequence_length,
        score_mask_value=score_mask_value, name=name)
    self._num_units = num_units
    self._scale = scale
    self._name = name

  def __call__(self, query, previous_alignments):
    with variable_scope.variable_scope(None, "luong_attention", [query]):
      # Compute the scores.
      score = _luong_score(query, self._keys, self._scale)
    # Normalize the scores into alignments.
    alignments = self._probability_fn(score, previous_alignments)
    return alignments
BahdanauAttention
The BahdanauAttention class itself differs very little from LuongAttention above, so I won't paste it here; look at the source if you're interested. Its score function is:
def _bahdanau_score(processed_query, keys, normalize):
  dtype = processed_query.dtype
  num_units = keys.shape[2].value or array_ops.shape(keys)[2]
  processed_query = array_ops.expand_dims(processed_query, 1)
  v = variable_scope.get_variable("attention_v", [num_units], dtype=dtype)
  if normalize:
    # Scalar used in weight normalization
    g = variable_scope.get_variable(
        "attention_g", dtype=dtype, initializer=math.sqrt((1. / num_units)))
    # Bias added prior to the nonlinearity
    b = variable_scope.get_variable(
        "attention_b", [num_units], dtype=dtype, initializer=init_ops.zeros_initializer())
    # normed_v = g * v / ||v||
    normed_v = g * v * math_ops.rsqrt(math_ops.reduce_sum(math_ops.square(v)))
    return math_ops.reduce_sum(normed_v * math_ops.tanh(keys + processed_query + b), [2])
  else:
    return math_ops.reduce_sum(v * math_ops.tanh(keys + processed_query), [2])
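Constructing the public BahdanauAttention mechanism looks almost the same as LuongAttention; normalize=True switches on the weight-normalized branch shown above (the names below are again placeholders):

attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
    num_units=rnn_size,
    memory=encoder_outputs,
    memory_sequence_length=encoder_inputs_length,
    normalize=False)  # set True to use the weight-normalized (g, b) branch above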
The source also implements several other attention mechanisms, which I won't go into here.
_tile_batch
There is quite a lot of code in the beam_search part, but it was fun to read, because it uses exactly the trick I had come up with myself before ever looking at the source: tile the inputs beam_size times. It's only a small detail, but I could brag about it for a while ==# Anyway, let's look at the code.
As mentioned back when implementing the chatbot, to use beam_search you first have to run the encoder's output, state, and length through the tile_batch function, expanding the batch dimension by beam_size so it becomes batch_size*beam_size. I won't repeat the reason here; let's just see what this function actually does:
def _tile_batch(t, multiplier):
  t = ops.convert_to_tensor(t, name="t")
  shape_t = array_ops.shape(t)
  if t.shape.ndims is None or t.shape.ndims < 1:
    raise ValueError("t must have statically known rank")
  tiling = [1] * (t.shape.ndims + 1)
  tiling[1] = multiplier
  tiled_static_batch_size = (
      t.shape[0].value * multiplier if t.shape[0].value is not None else None)
  # Expand t by one dimension and replicate it with tile.
  tiled = array_ops.tile(array_ops.expand_dims(t, 1), tiling)
  # Reshape the tiled tensor into [batch_size * beam_size, ...].
  tiled = array_ops.reshape(
      tiled, array_ops.concat(([shape_t[0] * multiplier], shape_t[1:]), 0))
  tiled.set_shape(
      tensor_shape.TensorShape([tiled_static_batch_size]).concatenate(t.shape[1:]))
  return tiled
The following example shows what the function above achieves:
a = tf.constant([[1, 2, 3], [4, 5, 6]])        # batch_size is 2, shape [2, 3]
tiling = [1, 3, 1]                             # take beam_size = 3
tiled = tf.tile(tf.expand_dims(a, 1), tiling)  # replicate each row of a three times
sess.run(tiled)
Output: array([[[1, 2, 3],
                [1, 2, 3],
                [1, 2, 3]],
               [[4, 5, 6],
                [4, 5, 6],
                [4, 5, 6]]])
tiled = tf.reshape(tiled, tf.concat(([6], [3]), 0))  # 6 = 2 * 3; reshape to [6, 3]
sess.run(tiled)
Out[11]:
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6]])
BeamSearchDecoder
As we know, BeamSearchDecoder is just another Decoder class, the same kind of thing as BasicDecoder, except that it does not need a helper. Here is its definition (with some uninteresting validation code removed):
def __init__(self,
             cell,
             embedding,
             start_tokens,
             end_token,
             initial_state,
             beam_width,
             output_layer=None,
             length_penalty_weight=0.0):
  # This constructor mostly just stores its arguments, initializing the variables
  # needed before decoding starts.
  self._cell = cell
  self._output_layer = output_layer
  # Note that embedding can be either an embedding matrix (a variable) or a callable
  # lookup function.
  if callable(embedding):
    self._embedding_fn = embedding
  else:
    self._embedding_fn = (
        lambda ids: embedding_ops.embedding_lookup(embedding, ids))
  self._start_tokens = ops.convert_to_tensor(
      start_tokens, dtype=dtypes.int32, name="start_tokens")
  self._end_token = ops.convert_to_tensor(
      end_token, dtype=dtypes.int32, name="end_token")
  self._batch_size = array_ops.size(start_tokens)
  self._beam_width = beam_width
  self._length_penalty_weight = length_penalty_weight
  self._initial_cell_state = nest.map_structure(
      self._maybe_split_batch_beams, initial_state, self._cell.state_size)
  # Tile start_tokens to [batch_size, beam_width] and embed them to get the first inputs.
  self._start_tokens = array_ops.tile(
      array_ops.expand_dims(self._start_tokens, 1), [1, self._beam_width])
  self._start_inputs = self._embedding_fn(self._start_tokens)
  # The finished flags are expanded to [batch_size, beam_width] as well.
  self._finished = array_ops.zeros(
      [self._batch_size, self._beam_width], dtype=dtypes.bool)
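To tie the constructor back to the tile_batch discussion, a hedged end-to-end inference sketch might look like the following. beam_width, GO_ID, EOS_ID, embedding_matrix and the encoder tensors are placeholders, 50 is an arbitrary length limit, and if you combine this with AttentionWrapper the attention memory must be built from the tiled encoder outputs:

beam_width = 5
# Only needed as attention memory if you use AttentionWrapper:
tiled_encoder_outputs = tf.contrib.seq2seq.tile_batch(encoder_outputs, multiplier=beam_width)
tiled_sequence_length = tf.contrib.seq2seq.tile_batch(encoder_inputs_length, multiplier=beam_width)
tiled_encoder_state = tf.contrib.seq2seq.tile_batch(encoder_final_state, multiplier=beam_width)

decoder = tf.contrib.seq2seq.BeamSearchDecoder(
    cell=decoder_cell,
    embedding=embedding_matrix,
    start_tokens=tf.fill([batch_size], GO_ID),
    end_token=EOS_ID,
    initial_state=tiled_encoder_state,
    beam_width=beam_width,
    output_layer=tf.layers.Dense(vocab_size),
    length_penalty_weight=0.0)
final_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=50)
predicted_ids = final_outputs.predicted_ids  # [batch_size, max_time, beam_width]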
Now for the step function. As we know, step is what while_loop calls at every decoding step, so this is where the main work happens. Just like BasicDecoder, it first calls cell_outputs, next_cell_state = self._cell(inputs, cell_state) to run the RNNCell and obtain the current outputs and state, then reshapes the outputs from [batch_size*beam_width, vocab_size] to [batch_size, beam_width, vocab_size], and finally calls _beam_search_step() to choose the outputs and produce the next step's inputs; this part essentially plays the role the helper class plays for BasicDecoder.
Before reading the code, one concept has to be clear: we ultimately want beam_width sequences, but some of them may emit the end-of-sequence token before reaching the maximum length, i.e. some beams finish decoding early while others keep going until the last step. How do we mark whether a sequence has finished decoding? By keeping finished and length variables that record, for each sequence, whether it has ended and its final length. A large part of _beam_search_step is devoted to exactly this bookkeeping (personally I feel that, for simplicity, one could just decode everything to the maximum length and then, when converting ids to strings, discard whatever comes after the first eos).
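A tiny concrete example of the other piece of bookkeeping below: after flattening the scores to [batch_size, beam_width * vocab_size], each top-k index encodes both which parent beam it came from and which word it picked (the numbers here are made up):

# Made-up numbers: beam_width = 3, vocab_size = 5.
vocab_size = 5
word_indices = [7, 12, 3]   # indices returned by top_k over the flattened scores
parent_beam_ids = [i // vocab_size for i in word_indices]   # [1, 2, 0]
next_word_ids = [i % vocab_size for i in word_indices]      # [2, 2, 3]
print(parent_beam_ids, next_word_ids)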
Let's now walk through the implementation of _beam_search_step:
def _beam_search_step(time, logits, next_cell_state, beam_state, batch_size,
                      beam_width, end_token, length_penalty_weight):
  """Performs a single step of Beam Search Decoding.

  Args:
    time: the decoding step, starting from 0. At the first step every beam is fed the
      same start_token, so only the first beam's outputs are used to pick the initial
      beam_width candidates.
    logits: the cell outputs, reshaped from [batch_size*beam_width, vocab_size] to
      [batch_size, beam_width, vocab_size] before being passed in.
    next_cell_state: the next cell state returned by the cell.
    beam_state: An instance of `BeamSearchDecoderState`.
    batch_size: The batch size for this input.
    beam_width: Python int. The size of the beams.
    end_token: The int32 end token.
    length_penalty_weight: Float weight to penalize length. Disabled with 0.0.
  """
  static_batch_size = tensor_util.constant_value(batch_size)

  # Calculate the current lengths of the predictions
  prediction_lengths = beam_state.lengths
  previously_finished = beam_state.finished

  # Normalize the cell outputs with log-softmax; for beams that are already finished,
  # mask the probabilities so that only the end token can be emitted, leaving the rest
  # unchanged. Then add the log probabilities accumulated so far, so the highest-scoring
  # sequences can be selected later.
  step_log_probs = nn_ops.log_softmax(logits)
  step_log_probs = _mask_probs(step_log_probs, end_token, previously_finished)
  total_probs = array_ops.expand_dims(beam_state.log_probs, 2) + step_log_probs

  # For beams that have not finished yet, compute how much their length grows.
  vocab_size = logits.shape[-1].value or array_ops.shape(logits)[-1]
  lengths_to_add = array_ops.one_hot(
      indices=array_ops.tile(array_ops.reshape(end_token, [1, 1]), [batch_size, beam_width]),
      depth=vocab_size,
      on_value=constant_op.constant(0, dtype=dtypes.int64),
      off_value=constant_op.constant(1, dtype=dtypes.int64),
      dtype=dtypes.int64)
  add_mask = (1 - math_ops.to_int64(previously_finished))
  lengths_to_add = array_ops.expand_dims(add_mask, 2) * lengths_to_add
  new_prediction_lengths = (lengths_to_add + array_ops.expand_dims(prediction_lengths, 2))

  # Re-score each sequence using its length, e.g. to penalize overly long sequences.
  scores = _get_scores(
      log_probs=total_probs,
      sequence_lengths=new_prediction_lengths,
      length_penalty_weight=length_penalty_weight)
  time = ops.convert_to_tensor(time, name="time")

  # At the first step only the first beam's scores are considered; afterwards the top k
  # is taken over all beam_width * vocab_size candidates of every sample.
  scores_shape = array_ops.shape(scores)
  scores_flat = control_flow_ops.cond(
      time > 0,
      lambda: array_ops.reshape(scores, [batch_size, -1]),
      lambda: scores[:, 0])
  num_available_beam = control_flow_ops.cond(
      time > 0,
      lambda: math_ops.reduce_prod(scores_shape[1:]),
      lambda: math_ops.reduce_prod(scores_shape[2:]))

  # next_beam_size is the minimum of beam_width and num_available_beam, because near the
  # end of decoding there may be fewer than beam_width valid candidates left. The
  # next_beam_size highest-scoring candidates are then selected.
  next_beam_size = math_ops.minimum(
      ops.convert_to_tensor(beam_width, dtype=dtypes.int32, name="beam_width"),
      num_available_beam)
  next_beam_scores, word_indices = nn_ops.top_k(scores_flat, k=next_beam_size)
  # Reshape the result to [static_batch_size, beam_width]: after every step, only the
  # beam_width most probable sequences are kept for each sample in the batch.
  next_beam_scores.set_shape([static_batch_size, beam_width])
  word_indices.set_shape([static_batch_size, beam_width])

  # Pick out the probs, beam_ids, and states according to the chosen predictions
  next_beam_probs = _tensor_gather_helper(
      gather_indices=word_indices,
      gather_from=total_probs,
      batch_size=batch_size,
      range_size=beam_width * vocab_size,
      gather_shape=[-1],
      name="next_beam_probs")
  # Note: just doing the following
  #   math_ops.to_int32(word_indices % vocab_size,
  #       name="next_beam_word_ids")
  # would be a lot cleaner but for reasons unclear, that hides the results of
  # the op which prevents capturing it with tfdbg debug ops.
  raw_next_word_ids = math_ops.mod(word_indices, vocab_size,
                                   name="next_beam_word_ids")
  next_word_ids = math_ops.to_int32(raw_next_word_ids)
  next_beam_ids = math_ops.to_int32(word_indices / vocab_size,
                                    name="next_beam_parent_ids")

  # Append new ids to current predictions
  previously_finished = _tensor_gather_helper(
      gather_indices=next_beam_ids,
      gather_from=previously_finished,
      batch_size=batch_size,
      range_size=beam_width,
      gather_shape=[-1])
  next_finished = math_ops.logical_or(previously_finished,
                                      math_ops.equal(next_word_ids, end_token),
                                      name="next_beam_finished")

  # Calculate the length of the next predictions.
  # 1. Finished beams remain unchanged
  # 2. Beams that are now finished (EOS predicted) remain unchanged
  # 3. Beams that are not yet finished have their length increased by 1
  lengths_to_add = math_ops.to_int64(
      math_ops.not_equal(next_word_ids, end_token))
  lengths_to_add = (1 - math_ops.to_int64(next_finished)) * lengths_to_add
  next_prediction_len = _tensor_gather_helper(
      gather_indices=next_beam_ids,
      gather_from=beam_state.lengths,
      batch_size=batch_size,
      range_size=beam_width,
      gather_shape=[-1])
  next_prediction_len += lengths_to_add

  # Pick out the cell_states according to the next_beam_ids. We use a
  # different gather_shape here because the cell_state tensors, i.e.
  # the tensors that would be gathered from, all have dimension
  # greater than two and we need to preserve those dimensions.
  # pylint: disable=g-long-lambda
  next_cell_state = nest.map_structure(
      lambda gather_from: _maybe_tensor_gather_helper(
          gather_indices=next_beam_ids,
          gather_from=gather_from,
          batch_size=batch_size,
          range_size=beam_width,
          gather_shape=[batch_size * beam_width, -1]),
      next_cell_state)
  # pylint: enable=g-long-lambda

  next_state = BeamSearchDecoderState(
      cell_state=next_cell_state,
      log_probs=next_beam_probs,
      lengths=next_prediction_len,
      finished=next_finished)
  output = BeamSearchDecoderOutput(
      scores=next_beam_scores,
      predicted_ids=next_word_ids,
      parent_ids=next_beam_ids)
  return output, next_state
That wraps up this rough walk through the tf.contrib.seq2seq source code. Hopefully it has given you a clearer picture and you now feel ready to write your own code. You could try building your own Helper class on top of CustomHelper to implement a customized seq2seq model~~
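For reference, a bare-bones CustomHelper skeleton might look like the following. The three callbacks decide what the first input is, how to sample from the outputs, and how to build the next input; everything inside them here is just an assumed example (batch_size, start_inputs, EOS_ID and embedding_matrix are placeholders from your own model), not the only way to do it:

def initialize_fn():
    # Returns (finished, first_inputs) for time step 0.
    finished = tf.tile([False], [batch_size])
    return finished, start_inputs                       # e.g. embedded <GO> tokens

def sample_fn(time, outputs, state):
    # Greedy sampling as an example; anything returning int32 ids works.
    return tf.cast(tf.argmax(outputs, axis=-1), tf.int32)

def next_inputs_fn(time, outputs, state, sample_ids):
    finished = tf.equal(sample_ids, EOS_ID)
    next_inputs = tf.nn.embedding_lookup(embedding_matrix, sample_ids)
    return finished, next_inputs, state

helper = tf.contrib.seq2seq.CustomHelper(initialize_fn, sample_fn, next_inputs_fn)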