Writing an LSTM Language Model from Scratch with TensorFlow 1.4

LSTM Equations

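For diagrams explaining how an LSTM works, see http://colah.github.io/posts/2015-08-Understanding-LSTMs/. I also referred to this implementation on GitHub: https://github.com/jonnykira/Tensorflow_mLSTM
Equations (bias terms are omitted, as in the code in this post):
$$i_t = \sigma(x_t W_i + h_{t-1} U_i)$$
$$f_t = \sigma(x_t W_f + h_{t-1} U_f)$$
$$o_t = \sigma(x_t W_o + h_{t-1} U_o)$$
$$\tilde{c}_t = \tanh(x_t W_c + h_{t-1} U_c)$$
Updating the cell state:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
Updating the hidden state:
$$h_t = o_t \odot \tanh(c_t)$$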

Defining the Variables We Need

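First, an initializer object. Initialization is a bit of a dark art; I use Xavier initialization here:

import tensorflow as tf

initializer = tf.contrib.layers.xavier_initializer()
Global variables: seq_length, embedding_size, rnn_size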
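
For reference, with its default settings the Xavier initializer samples uniformly from (-limit, limit), where limit = sqrt(6 / (fan_in + fan_out)). Here is a minimal NumPy sketch of that rule. The helper name `xavier_uniform` and the sizes 128 and 256 are made up for illustration and are not part of the original code.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.RandomState(0)
# Illustrative sizes only: embedding_size=128, rnn_size=256
W = xavier_uniform(128, 256, rng)
limit = np.sqrt(6.0 / (128 + 256))
print(W.shape, float(np.abs(W).max()) <= limit)
```

The variance of such a matrix is limit^2 / 3 = 2 / (fan_in + fan_out), which is what keeps activations from blowing up or vanishing at the start of training.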

Defining the Weight Matrices

# TensorFlow's own implementation appears to concatenate x_t and h_{t-1};
# to keep things simple, we multiply them separately here
Wi = tf.get_variable('Wi', shape=(embedding_size, rnn_size), initializer=initializer)
Ui = tf.get_variable('Ui', shape=(rnn_size, rnn_size), initializer=initializer)

Wf = tf.get_variable('Wf', shape=(embedding_size, rnn_size), initializer=initializer)
Uf = tf.get_variable('Uf', shape=(rnn_size, rnn_size), initializer=initializer)

Wo = tf.get_variable('Wo', shape=(embedding_size, rnn_size), initializer=initializer)
Uo = tf.get_variable('Uo', shape=(rnn_size, rnn_size), initializer=initializer)

Wc = tf.get_variable('Wc', shape=(embedding_size, rnn_size), initializer=initializer)
Uc = tf.get_variable('Uc', shape=(rnn_size, rnn_size), initializer=initializer)
# If you want weight normalization, you can continue from here...
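
The two formulations (separate matrices versus one concatenated matrix) compute the same thing. A small NumPy check, with made-up sizes, shows that x W + h U equals [x, h] multiplied by W stacked on top of U:

```python
import numpy as np

rng = np.random.RandomState(0)
batch_size, embedding_size, rnn_size = 4, 3, 5   # illustrative sizes

x = rng.randn(batch_size, embedding_size)
h = rng.randn(batch_size, rnn_size)
W = rng.randn(embedding_size, rnn_size)
U = rng.randn(rnn_size, rnn_size)

# Separate products, as in this post
separate = x @ W + h @ U

# Concatenated form: one (batch, emb + rnn) input times one (emb + rnn, rnn) matrix
xh = np.concatenate([x, h], axis=1)
WU = np.concatenate([W, U], axis=0)
fused = xh @ WU

print(np.allclose(separate, fused))
```

The concatenated form needs a single matmul per gate, which is why library implementations tend to prefer it; the separate form is easier to read.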

LSTM Cell

def lstm_cell(x, h, c):
    # x: (batch_size, embedding_size); h, c: (batch_size, rnn_size)
    it = tf.sigmoid(tf.matmul(x, Wi) + tf.matmul(h, Ui))  # input gate
    ft = tf.sigmoid(tf.matmul(x, Wf) + tf.matmul(h, Uf))  # forget gate
    ot = tf.sigmoid(tf.matmul(x, Wo) + tf.matmul(h, Uo))  # output gate
    ct = tf.tanh(tf.matmul(x, Wc) + tf.matmul(h, Uc))     # candidate cell state

    c_new = (ft * c) + (it * ct)
    h_new = ot * tf.tanh(c_new)

    return c_new, h_new
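
To sanity-check shapes without building a graph, the same step can be sketched in NumPy. This is only a reference sketch: the `params` dictionary, `lstm_step`, and the sizes are my own illustrative names, not part of the TensorFlow code. Using different values for embedding_size and rnn_size makes shape mistakes (for example multiplying h by a W matrix) surface immediately.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, p):
    """One LSTM step without biases; returns (c_new, h_new)."""
    i = sigmoid(x @ p['Wi'] + h @ p['Ui'])          # input gate
    f = sigmoid(x @ p['Wf'] + h @ p['Uf'])          # forget gate
    o = sigmoid(x @ p['Wo'] + h @ p['Uo'])          # output gate
    c_tilde = np.tanh(x @ p['Wc'] + h @ p['Uc'])    # candidate cell state
    c_new = f * c + i * c_tilde
    h_new = o * np.tanh(c_new)
    return c_new, h_new

rng = np.random.RandomState(0)
batch_size, embedding_size, rnn_size = 2, 3, 4      # illustrative sizes
params = {}
for gate in 'ifoc':
    params['W' + gate] = rng.randn(embedding_size, rnn_size) * 0.1
    params['U' + gate] = rng.randn(rnn_size, rnn_size) * 0.1

x = rng.randn(batch_size, embedding_size)
h0 = np.zeros((batch_size, rnn_size))
c0 = np.zeros((batch_size, rnn_size))
c1, h1 = lstm_step(x, h0, c0, params)
print(c1.shape, h1.shape)
```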

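Unrolling the LSTM

In TensorFlow this step is handled by tf.nn.static_rnn and tf.nn.dynamic_rnn, but in practice a plain loop is enough. (P.S. tf.nn.dynamic_rnn is implemented with tf.while_loop, so different batches can have different seq_length, whereas the number of time steps in tf.nn.static_rnn cannot change once the graph has been defined.)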

def transform(x):
    # Preprocess the input; an RNN batch is laid out a bit differently from a CNN batch
    embedding_outputs = embedding(x)  # embedding function, you need to define it yourself
    shape = tf.shape(embedding_outputs)
    # keep_prob = 0.5; the leading 1 in noise_shape shares one dropout mask across the batch
    embedding_inputs = tf.nn.dropout(embedding_outputs, 0.5,
                                     noise_shape=[1, shape[1], shape[2]])
    # (batch_size, seq_length, embedding_size)
    inputs_split = tf.split(embedding_inputs, seq_length, axis=1)
    # after squeezing, it's a list: seq_length x (batch_size, embedding_size)
    list_inputs = [tf.squeeze(input_, [1]) for input_ in inputs_split]
    return list_inputs
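
The split-then-squeeze step is easier to see on a tiny array. Below is a NumPy sketch of the same reshaping with made-up sizes. As far as I know, tf.unstack(embedding_inputs, axis=1) should produce the same list in one call, but I have not swapped it in here.

```python
import numpy as np

batch_size, seq_length, embedding_size = 2, 3, 4    # illustrative sizes
emb = np.arange(batch_size * seq_length * embedding_size, dtype=np.float32)
emb = emb.reshape(batch_size, seq_length, embedding_size)

# Splitting along the time axis gives seq_length pieces of shape (batch, 1, emb)
pieces = np.split(emb, seq_length, axis=1)
# Squeezing the time axis leaves one (batch, emb) matrix per time step
steps = [np.squeeze(p, axis=1) for p in pieces]

print(len(steps), steps[0].shape)
```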


def unroll_lstm(lstm_cell, x, length):
    # length holds the true length of each sequence
    # x.shape = (batch_size, seq_length); this seq_length is the padded length
    batch_size = tf.shape(x)[0]
    # embed x
    input_list = transform(x)
    outputs = []
    # unrolled lstm loop
    # define output & state (both start at zero) to hold the running results
    output = tf.tile(tf.expand_dims(tf.Variable(tf.zeros(rnn_size),
                     trainable=False), 0), [batch_size, 1])
    state = tf.tile(tf.expand_dims(tf.Variable(tf.zeros(rnn_size),
                     trainable=False), 0), [batch_size, 1])
    for ipt in input_list:
        state, output = lstm_cell(ipt, output, state)
        outputs.append(output)
    # use a mask to cut off everything past the true sequence length (set it to 0)
    mask = tf.sequence_mask(length, seq_length)
    out_tensor = tf.stack(outputs, axis=1)
    outputs = tf.where(tf.stack([mask] * rnn_size, axis=-1), out_tensor,
                       tf.zeros_like(out_tensor))
    # note: state is the cell state after the final (possibly padded) time step
    return outputs, state
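
The masking at the end of unroll_lstm can be reproduced in NumPy to see exactly which positions are zeroed. This is a sketch with made-up lengths, where an array of ones stands in for the stacked LSTM outputs:

```python
import numpy as np

seq_length, rnn_size = 4, 2                        # illustrative sizes
length = np.array([2, 4])                          # true lengths of two sentences
outputs = np.ones((2, seq_length, rnn_size))       # stand-in for stacked LSTM outputs

# Equivalent of tf.sequence_mask(length, seq_length): True where t < length
mask = np.arange(seq_length)[None, :] < length[:, None]    # (batch, seq_length)
# Broadcast the mask over the feature axis and zero out the padded steps
masked = np.where(mask[:, :, None], outputs, 0.0)

print(mask)
```

For the first sentence (length 2) only the first two time steps survive; the second sentence (length 4) is left untouched.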

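Extracting the Output

The LSTM output above has shape (batch_size, seq_length, rnn_size). Some sentences in the batch may be shorter than seq_length, so we use tf.gather_nd to pick out the output at each sentence's true last position.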

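# Compute the indices of each sentence's last real output
# Here I added a placeholder recording the sentence lengths in the batch:
# ph_length, shape: (batch_size,), dtype tf.int32 so it can be stacked with tf.range
output_indices = tf.stack([tf.range(tf.shape(ph_length)[0]),
                          ph_length - 1], 1)
# (batch_size, rnn_size)
lstm_out_with_len = tf.gather_nd(lstm_outs, output_indices)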
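
The same indexing in NumPy, on a small made-up array, shows what each (row, time) index pair selects:

```python
import numpy as np

batch_size, seq_length, rnn_size = 3, 4, 2          # illustrative sizes
lstm_outs = np.arange(batch_size * seq_length * rnn_size).reshape(
    batch_size, seq_length, rnn_size)
length = np.array([1, 4, 2])                        # true sentence lengths

# One (row, time) pair per sentence, built the same way as output_indices
indices = np.stack([np.arange(batch_size), length - 1], axis=1)
# Fancy indexing plays the role of tf.gather_nd here
last = lstm_outs[indices[:, 0], indices[:, 1]]      # (batch_size, rnn_size)

print(last)
```

Row 0 takes time step 0, row 1 takes time step 3, and row 2 takes time step 1, so each sentence contributes the output at its own last real token.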

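The Language Model Loss

An LSTM-based language model works like this: for a sentence $w_1, w_2, \dots, w_n$ (where the $w_i$ are the tokens produced by word segmentation), use $w_1, \dots, w_{t-1}$ to predict the $t$-th word $w_t$. If the input is one continuous article, with no padding and no end-of-sentence markers, the job is fairly simple. Once padding is added, the inputs and outputs need some extra processing when computing the loss: the padded part has to be cut off.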
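
As a concrete picture of that processing, here is a small pure-Python sketch that turns one id-encoded sentence into a padded input, a shifted target, and its true length. The helper `make_example`, the ids, and the choice of 0 for both the end-of-sentence marker and padding are assumptions for illustration only.

```python
EOS_ID = 0   # assumed id of the end-of-sentence marker
PAD_ID = 0   # assumed padding id; padded positions are masked out of the loss anyway

def make_example(ids, max_seq_len):
    """Return (input_x, input_y, length) for one sentence with len(ids) <= max_seq_len."""
    length = len(ids)
    target = ids[1:] + [EOS_ID]                 # shift left by one, finish with the marker
    pad = [PAD_ID] * (max_seq_len - length)
    return ids + pad, target + pad, length

# A made-up 7-token sentence padded to 9 positions
x, y, n = make_example([5, 20, 11, 38, 79, 3, 7], max_seq_len=9)
print(x)
print(y)
print(n)
```

Only the first n positions of x and y should contribute to the loss; everything after that is padding.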
Reference code for the loss: https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/model.py. Its loss function uses sequence_loss_by_example, which puzzled me a little, since plain cross-entropy seemed sufficient, so I looked at the API: https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py. sequence_loss_by_example computes batch_size x sequence_length sparse_softmax_cross_entropy_with_logits values and collects them in a list. Written out directly, it looks like this:

# Inputs:
#   outputs: the LSTM output at every time step, shape = (batch_size, sequence_len, lstm_cell_size)
#   length: the actual length of each sentence in this batch, shape = (batch_size, )
#   max_seq_len: the maximum sentence length
#   (optional) embed_mat: the lookup table used for embedding (vocabulary_size, lstm_cell_size)


# mask tensor marking the first N positions of each sentence as True
mask = tf.sequence_mask(length, max_seq_len)
# keep only the LSTM outputs at non-padding positions
output = tf.boolean_mask(outputs, mask) # (?, lstm_cell_size)

# Build the prediction targets. For example, for "落 霞 与 孤 鹜 齐 飞" the target is
# "霞 与 孤 鹜 齐 飞 <EOS>" → [20, 11, 38, 79, 3, 7, 0]
# (assuming the end-of-sentence marker, written <EOS> here, has id 0)
# This is best done during preprocessing: TensorFlow tensors do not support
# item assignment, so it is awkward to do inside the graph.
# input_y: the id-encoded targets for this batch, shape = (batch_size, max_seq_len)
target = tf.boolean_mask(input_y, mask)

decoder_matrix = tf.get_variable('decoder_matrix',
                                 shape=[lstm_cell_size, vocabulary_size],
                                 initializer=tf.random_uniform_initializer(-1., 1.))
logits = tf.matmul(output, decoder_matrix)
# To save memory and parameters, you can reuse the embedding matrix instead
# (use this line in place of the decoder_matrix version above):
# logits = tf.matmul(output, tf.transpose(embed_mat))

loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=target))
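
To convince myself that the padded positions really have no effect, here is a NumPy sketch of the same masked cross-entropy. The function `masked_lm_loss` and the toy inputs are illustrative assumptions; it takes logits for every position and applies the mask itself, rather than masking the LSTM outputs first as the TensorFlow code does.

```python
import numpy as np

def masked_lm_loss(logits, targets, length):
    """Mean cross-entropy over non-padding positions.

    logits: (batch, max_seq_len, vocab); targets: (batch, max_seq_len) int ids;
    length: (batch,) true sentence lengths.
    """
    max_seq_len = logits.shape[1]
    mask = np.arange(max_seq_len)[None, :] < length[:, None]
    flat_logits = logits[mask]                   # (n_valid, vocab), like tf.boolean_mask
    flat_targets = targets[mask]                 # (n_valid,)
    # numerically stable log-softmax
    shifted = flat_logits - flat_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(flat_targets)), flat_targets]
    return nll.mean()

logits = np.zeros((2, 3, 5))                     # uniform predictions over 5 words
targets = np.array([[1, 2, 0], [3, 0, 0]])
length = np.array([3, 1])
loss = masked_lm_loss(logits, targets, length)
print(loss, np.exp(loss))
```

With uniform predictions over 5 words the loss is log(5), so the perplexity exp(loss) is 5. Changing the logits at padded positions leaves the value unchanged.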