The output_projection parameter in seq2seq

Please credit the source when reposting: https://blog.csdn.net/qq_33427047/article/details/81320098


TensorFlow's Sequence-to-Sequence Models take an output_projection parameter. What does it mean?

Take the embedding_attention_seq2seq() function as an example:

def embedding_attention_seq2seq(encoder_inputs,
                                  decoder_inputs,
                                  cell,
                                  num_encoder_symbols,
                                  num_decoder_symbols,
                                  embedding_size,
                                  num_heads=1,
                                  output_projection=None,
                                  feed_previous=False,
                                  dtype=None,
                                  scope=None,
                                  initial_state_attention=False)

The tutorial at https://www.tensorflow.org/versions/r1.2/tutorials/seq2seq describes it as follows:

One more important argument used above is output_projection. If not specified, the outputs of the embedding model will be tensors of shape batch-size by num_decoder_symbols as they represent the logits for each generated symbol. When training models with large output vocabularies, i.e., when num_decoder_symbols is large, it is not practical to store these large tensors. Instead, it is better to return smaller output tensors, which will later be projected onto a large output tensor using output_projection. This allows to use our seq2seq models with a sampled softmax loss, as described in Jean et. al., 2014 (pdf).

The key sentence in this passage is "it is better to return smaller output tensors, which will later be projected onto a large output tensor using output_projection". If output_projection is not specified, the model's outputs are tensors of shape batch-size by num_decoder_symbols, "as they represent the logits for each generated symbol". When num_decoder_symbols is large, storing such large tensors is impractical, so it is better to have the model return smaller output tensors. These smaller tensors are later mapped onto the large output tensor using output_projection.
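To get a feel for the sizes involved, here is a rough back-of-the-envelope sketch in plain Python. The batch size, number of decoder steps, cell size and vocabulary size are illustrative assumptions, not values from the tutorial, and the helper name tensor_bytes is hypothetical.

```python
# Rough memory estimate for decoder outputs, with and without output_projection.
# All sizes below are illustrative assumptions, not values from the tutorial.

BYTES_PER_FLOAT32 = 4


def tensor_bytes(*shape):
    """Return the bytes needed to store a float32 tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_FLOAT32


batch_size = 64               # assumed
decoder_steps = 50            # assumed number of decoder time steps
size = 1024                   # assumed RNN cell output size
num_decoder_symbols = 40000   # assumed output vocabulary size

# Without output_projection: one [batch_size, num_decoder_symbols] logits
# tensor per decoder step.
full_logits = decoder_steps * tensor_bytes(batch_size, num_decoder_symbols)

# With output_projection: one [batch_size, size] tensor per decoder step,
# plus a single shared projection (weights [size, num_decoder_symbols] and bias).
small_outputs = decoder_steps * tensor_bytes(batch_size, size)
projection = (tensor_bytes(size, num_decoder_symbols)
              + tensor_bytes(num_decoder_symbols))

print("full logits  :", full_logits, "bytes")
print("small outputs:", small_outputs, "bytes")
print("projection   :", projection, "bytes")
```

Under these assumed sizes, the per-step outputs shrink by a factor of num_decoder_symbols / size. The projection weights are stored once and do not grow with the number of decoder steps.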


Sampled softmax and output projection

Sampled softmax is used to handle a large output vocabulary:

if num_samples > 0 and num_samples < self.target_vocab_size:
    w_t = tf.get_variable("proj_w", [self.target_vocab_size, size], dtype=dtype)
    w = tf.transpose(w_t)
    b = tf.get_variable("proj_b", [self.target_vocab_size], dtype=dtype)
    output_projection = (w, b)

    def sampled_loss(labels, inputs):
        labels = tf.reshape(labels, [-1, 1])
        # We need to compute the sampled_softmax_loss using 32bit floats to
        # avoid numerical instabilities.
        local_w_t = tf.cast(w_t, tf.float32)
        local_b = tf.cast(b, tf.float32)
        local_inputs = tf.cast(inputs, tf.float32)
        return tf.cast(
            tf.nn.sampled_softmax_loss(
                weights=local_w_t,
                biases=local_b,
                labels=labels,
                inputs=local_inputs,
                num_sampled=num_samples,
                num_classes=self.target_vocab_size),
            dtype)

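The sampled softmax loss is constructed only when the number of samples (512 by default) is smaller than the target vocabulary size. In that case an output projection is built as well, consisting of a (weight matrix, bias vector) pair. With it, the RNN cell returns vectors of shape batch-size by size, rather than batch-size by target_vocab_size. To recover the logits, multiply by the weight matrix and add the bias vector: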

if output_projection is not None:
    for b in range(len(buckets)):
        self.outputs[b] = [tf.matmul(output, output_projection[0]) + output_projection[1] for ...]

The snippet above is applied to the outputs returned by model_with_buckets. Written out in full, it could look like this:

if output_projection is not None:
    for b in range(len(buckets)):
        # Map [batch_size, size] outputs to [batch_size, target_vocab_size] logits.
        self.outputs[b] = [
            tf.matmul(output, output_projection[0]) + output_projection[1]
            for output in self.outputs[b]]
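
The projection step above can be checked on toy numbers. The following is a minimal sketch in plain Python rather than TensorFlow. The helper name project and all values are hypothetical, chosen only to show the shapes: an output of shape batch-size by size becomes logits of shape batch-size by target_vocab_size.

```python
# Toy version of: tf.matmul(output, output_projection[0]) + output_projection[1]
# Plain Python lists stand in for tensors; all numbers are made up.


def project(output, w, b):
    """Map [batch, size] outputs to [batch, vocab] logits via output @ w + b."""
    return [
        [sum(row[k] * w[k][j] for k in range(len(w))) + b[j]
         for j in range(len(b))]
        for row in output
    ]


# size = 2, target_vocab_size = 3, batch size = 2 (illustrative assumptions)
w = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]   # shape [size, target_vocab_size]
b = [0.5, 0.0, -0.5]     # shape [target_vocab_size]
output_projection = (w, b)

output = [[1.0, 2.0],
          [3.0, 0.0]]    # shape [batch, size]

logits = project(output, output_projection[0], output_projection[1])
# Greedy choice of the next symbol: index of the largest logit in each row.
predicted = [row.index(max(row)) for row in logits]

print(logits)     # [[1.5, 2.0, -0.5], [3.5, 0.0, 5.5]]
print(predicted)  # [1, 2]
```

Each row of logits now has one entry per vocabulary symbol, so it can be fed to a softmax or, as in this sketch, to a simple argmax.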
