Please credit the source when reposting: https://blog.csdn.net/qq_33427047/article/details/81320098
The Sequence-to-Sequence Models functions take an output_projection argument. What does it mean?
Take the embedding_attention_seq2seq() function as an example:
def embedding_attention_seq2seq(encoder_inputs,
                                decoder_inputs,
                                cell,
                                num_encoder_symbols,
                                num_decoder_symbols,
                                embedding_size,
                                num_heads=1,
                                output_projection=None,
                                feed_previous=False,
                                dtype=None,
                                scope=None,
                                initial_state_attention=False)
The tutorial at https://www.tensorflow.org/versions/r1.2/tutorials/seq2seq describes it as follows:
One more important argument used above is output_projection. If not specified, the outputs of the embedding model will be tensors of shape batch-size by num_decoder_symbols as they represent the logits for each generated symbol. When training models with large output vocabularies, i.e., when num_decoder_symbols is large, it is not practical to store these large tensors. Instead, it is better to return smaller output tensors, which will later be projected onto a large output tensor using output_projection. This allows to use our seq2seq models with a sampled softmax loss, as described in Jean et. al., 2014 (pdf).
The key sentence here is "it is better to return smaller output tensors, which will later be projected onto a large output tensor using output_projection". In other words, if output_projection is not specified, the model's outputs are tensors of shape batch-size by num_decoder_symbols, "as they represent the logits for each generated symbol". When num_decoder_symbols is large, storing these large tensors is impractical, so it is better to return smaller output tensors. output_projection then maps these smaller output tensors onto the large output tensor.
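To get a feel for the size difference, here is a small back-of-the-envelope sketch. The batch size, number of decoder steps, vocabulary size and cell size below are hypothetical values chosen only for illustration, and tensor_bytes is a helper made up for this example. It assumes 32-bit floats and one output tensor per decoder step.

```python
def tensor_bytes(batch_size, num_steps, width, bytes_per_value=4):
    """Bytes needed to hold one [batch_size, width] float tensor per decoder step."""
    return batch_size * num_steps * width * bytes_per_value


# Hypothetical sizes, for illustration only.
batch_size = 64
num_steps = 50
num_decoder_symbols = 40000  # width of the logits without output_projection
size = 1024                  # width of the cell outputs with output_projection

full_logits = tensor_bytes(batch_size, num_steps, num_decoder_symbols)
small_outputs = tensor_bytes(batch_size, num_steps, size)

print(full_logits / 1e6, "MB of logits without output_projection")
print(small_outputs / 1e6, "MB of cell outputs with output_projection")
```

The two numbers differ by the factor num_decoder_symbols / size, which is why the saving grows with the vocabulary.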
Using sampled softmax to handle a large output vocabulary
# Excerpt from inside a model method: num_samples, size and dtype
# are defined by the enclosing code.
if num_samples > 0 and num_samples < self.target_vocab_size:
    w_t = tf.get_variable("proj_w", [self.target_vocab_size, size], dtype=dtype)
    w = tf.transpose(w_t)
    b = tf.get_variable("proj_b", [self.target_vocab_size], dtype=dtype)
    output_projection = (w, b)

    def sampled_loss(labels, inputs):
        labels = tf.reshape(labels, [-1, 1])
        # We need to compute the sampled_softmax_loss using 32bit floats to
        # avoid numerical instabilities.
        local_w_t = tf.cast(w_t, tf.float32)
        local_b = tf.cast(b, tf.float32)
        local_inputs = tf.cast(inputs, tf.float32)
        return tf.cast(
            tf.nn.sampled_softmax_loss(
                weights=local_w_t,
                biases=local_b,
                labels=labels,
                inputs=local_inputs,
                num_sampled=num_samples,
                num_classes=self.target_vocab_size),
            dtype)
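The idea behind a sampled loss can be sketched in plain Python. This is a deliberately simplified illustration, not what tf.nn.sampled_softmax_loss actually computes: it draws negatives uniformly and applies no correction for the sampling distribution. The function name simplified_sampled_loss and all values are made up for this example. What it does show is that only num_sampled + 1 rows of the [vocab, size] matrix w_t are touched, so the full batch-size by target_vocab_size logits are never built.

```python
import math
import random


def simplified_sampled_loss(hidden, w_t, b, label, num_sampled, rng):
    """Cross-entropy over the true class plus uniformly sampled negatives.

    hidden: cell output for one example, length `size`.
    w_t:    one weight row per class, shape [vocab, size].
    b:      one bias per class, length vocab.
    Simplification (assumption of this sketch): no sampling correction.
    """
    vocab_size = len(w_t)
    negatives = rng.sample([c for c in range(vocab_size) if c != label], num_sampled)
    candidates = [label] + negatives
    # Logits are computed only for the candidate classes.
    logits = [sum(h * w for h, w in zip(hidden, w_t[c])) + b[c] for c in candidates]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]


rng = random.Random(0)
vocab_size, size = 10, 3
w_t = [[0.0] * size for _ in range(vocab_size)]
b = [0.0] * vocab_size
loss = simplified_sampled_loss([1.0, 2.0, 3.0], w_t, b, label=7, num_sampled=4, rng=rng)
print(loss)
```

With all-zero weights every candidate receives the same logit, so the loss equals log(num_sampled + 1) regardless of which negatives are drawn.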
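With sampled softmax, the number of samples (512 by default) is smaller than the target vocabulary size, so an output projection is constructed. It is a pair consisting of a weight matrix and a bias vector. The RNN cell then returns vectors of shape batch-size by size, rather than batch-size by target_vocab_size. To recover the logits, multiply by the weight matrix and add the bias vector.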
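As a minimal sketch of that last step, the following plain-Python example recovers logits from small cell outputs. The helpers transpose and project and the tiny matrices are hypothetical stand-ins for tf.transpose, tf.matmul and the real variables. Shapes follow the code above: w_t is [vocab, size], w is [size, vocab], b is [vocab].

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def project(outputs, w, b):
    """Map outputs [batch, size] to logits [batch, vocab] via outputs * w + b."""
    return [
        [sum(o * w_row[j] for o, w_row in zip(row, w)) + b[j] for j in range(len(b))]
        for row in outputs
    ]


# Toy values: vocab = 3, size = 2, batch = 1.
w_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # [vocab, size]
w = transpose(w_t)                           # [size, vocab]
b = [0.5, 0.5, 0.5]                          # [vocab]

outputs = [[1.0, 2.0]]                       # [batch, size]
logits = project(outputs, w, b)              # [batch, vocab]
print(logits)
```

Each row of the result has one entry per vocabulary item, even though the cell itself only produced size values per example.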
if output_projection is not None:
    for b in range(len(buckets)):
        self.outputs[b] = [tf.matmul(output, output_projection[0]) + output_projection[1] for ...]
The snippet above operates on the outputs of model_with_buckets. Written out in full, it can look like this:
if output_projection is not None:
    # Project the small outputs of every bucket back to full logits.
    for b in range(len(buckets)):
        self.outputs[b] = [
            tf.matmul(output, output_projection[0]) + output_projection[1]
            for output in self.outputs[b]
        ]
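The per-bucket loop can be tried out without TensorFlow. The sketch below assumes that outputs[i] holds one [batch, size] matrix per decoder step of bucket i, mirroring self.outputs[b] above. The project helper and the toy values are hypothetical and stand in for tf.matmul plus the bias addition.

```python
def project(step_output, w, b):
    """Map one [batch, size] matrix to [batch, vocab] via step_output * w + b."""
    return [
        [sum(o * w_row[j] for o, w_row in zip(row, w)) + b[j] for j in range(len(b))]
        for row in step_output
    ]


# Toy projection: size = 2, vocab = 3.
w = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b = [0.0, 0.0, 0.5]
output_projection = (w, b)

# Two buckets with 2 and 3 decoder steps, batch size 1.
outputs = [
    [[[1.0, 0.0]], [[0.0, 1.0]]],
    [[[1.0, 1.0]], [[2.0, 0.0]], [[0.0, 2.0]]],
]

if output_projection is not None:
    for i in range(len(outputs)):
        outputs[i] = [
            project(step, output_projection[0], output_projection[1])
            for step in outputs[i]
        ]

print(outputs[0])
```

The number of buckets and of steps per bucket is unchanged. Only the last dimension of each step grows from size to the vocabulary size.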