Understanding the official Keras example: lstm_seq2seq.py

This post records my personal understanding of, and notes on, the official Keras example lstm_seq2seq.py. The example implements a basic character-level seq2seq model and demonstrates it on a translation task. The dataset is the cmn-eng English-to-Chinese set I downloaded, about 1247 KB. The overall flow: the encoder encodes the input into an intermediate representation C, and the decoder decodes C into the translated output.

1. Data preprocessing

Each line of the data contains an English sentence and its Chinese translation, separated by '\t'. In the preprocessing step all English sentences are stored in input_texts; each Chinese sentence has a start token '\t' and an end token '\n' added and is stored in target_texts. input_characters and target_characters hold the distinct characters that appear in the English and Chinese text respectively.

# Vectorize the data
input_texts = []  # sentences to translate (English)
target_texts = []  # translated sentences (Chinese)
input_characters = set()  # distinct characters in the English / Chinese text
target_characters = set()

with open(data_path, 'r', encoding='utf-8') as f:
    # alternative way to read the file line by line: for line in f:
    lines = f.read().split('\n')
    print(type(lines),'  ',len(lines))

# Prepend the start token y0: given y0, h<0> and c, the decoder can compute the distribution of y1;
# append the end token: given y1, h<1> and c it computes y2, and so on until the end token is produced.
for line in lines[: min(num_samples, len(lines) - 1)]:
    # Each line is split by a tab: English on the left, Chinese on the right (some in traditional characters)
    input_text, target_text = line.split('\t')
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    # i.e. add the start and end markers
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)
            
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
# length of the longest input / output sentence (in characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of input len:', len(input_texts))
print('Number of output len:', len(target_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

Output:

<class 'list'>    20404  # the dataset has 20404 lines in total
Number of input len: 10000   # the corpus is truncated: only 10000 lines are used for training
Number of output len: 10000
Number of unique input tokens: 73   # 73 distinct characters in the input
Number of unique output tokens: 2617  # 2617 distinct characters in the output
Max sequence length for inputs: 30  # the longest English sentence has 30 characters
Max sequence length for outputs: 22  # the longest Chinese translation has 22 characters
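
The comment in the snippet above mentions reading the file line by line instead of read().split('\n'); a roughly equivalent sketch (note that split('\n') also leaves a trailing empty entry, which the loop skips via len(lines) - 1):

with open(data_path, 'r', encoding='utf-8') as f:
    lines = [line.rstrip('\n') for line in f]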

2. Embedding the input

Build character-to-index dictionaries and initialize the input/output tensors. This is effectively a simple one-hot embedding of the inputs and outputs.

# Build token-to-index maps; here a token is a single character, so the dict maps each character to a numeric id
input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

# np.zeros(shape, dtype)
# These are the encoder input data, the decoder input data, and the decoder target data
# In seq2seq the input and output lengths need not match, so the input length is set to the longest English sentence in the corpus and the output length to the longest Chinese one
# (is using only the corpus maximum length a bit questionable?)
# decoder_input_data feeds the decoder the target sequence so the intermediate representation can be decoded back into Chinese characters
encoder_input_data = np.zeros((len(input_texts),max_encoder_seq_length,num_encoder_tokens),dtype='float32')
decoder_input_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens),dtype='float32')
decoder_target_data = np.zeros((len(input_texts),max_decoder_seq_length,num_decoder_tokens),dtype='float32')


'''
 Here the tensors are filled in from the corpus. Why is decoder_target_data one step ahead of decoder_input_data?

 Ah, got it: see figure 2 in https://blog.csdn.net/jerr__y/article/details/53749693
 For encoder_input_data: i is the sample index (which line), t is the position of the character within that line,
 and char is that character, so input_token_index[char] is its index in the input vocabulary.
 For sample i, at timestep t, the index of that character in the vocabulary is set to 1.
 t = 0 is the start position. decoder_input_data corresponds to the WXYZ at the lower right of the figure,
 decoder_target_data to the WXYZ at the upper right; note they are offset by one timestep.
 The encoder and decoder timesteps are counted separately.

'''
# This is effectively a simple one-hot embedding of the inputs and outputs.
# The encoder input, decoder input and decoder target are aligned by the sample index i.

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1
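
A quick sanity check of the "ahead by one timestep" claim (a minimal sketch, not part of the original example): the target tensor should equal the decoder input shifted left by one step, i.e. with the '\t' start character dropped.

i = 0  # any sample index will do
assert np.array_equal(decoder_target_data[i, :-1], decoder_input_data[i, 1:])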

3. Encoder

The encoder produces h and c, which together serve as the decoder's initial state.

# Define an input sequence and process it.
# The input shape per sample is (None, num_encoder_tokens), i.e. (timesteps, vocabulary size); return_state=True makes the LSTM return its internal states
'''
https://blog.csdn.net/u011327333/article/details/78501054
explains what an LSTM returns depending on whether return_state and return_sequences are set
'''
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
# with return_sequences=True the output is a 3D tensor (samples, timesteps, output_dim); otherwise a 2D tensor (samples, output_dim)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]
# used as the decoder's initial state
print(encoder_states)

Output:

[<tf.Tensor 'lstm_3/while/Exit_2:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'lstm_3/while/Exit_3:0' shape=(?, 256) dtype=float32>]
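
To make return_sequences / return_state concrete, here is a minimal shape sketch (the timestep and feature sizes 5 and 8 are made up for illustration; only the 256 units match latent_dim in the example):

from keras.layers import Input, LSTM

x = Input(shape=(5, 8))
seq_out, h, c = LSTM(256, return_sequences=True, return_state=True)(x)
# seq_out: (None, 5, 256) -- one output per timestep
# h, c:    (None, 256)    -- the final hidden state and cell state
last_out = LSTM(256)(x)
# last_out: (None, 256)   -- only the last timestep's output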

4. Decoder

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# Return the decoder's full output sequence as well as its internal states. The states are not needed during training, but they are reused at inference time.
'''
The decoder's initial state is the combination of the encoder's state_h and state_c.

For Dense(units, activation), units is the output dimension of the layer.
'''

decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
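
At this point decoder_outputs holds one softmax distribution over the target vocabulary per decoder timestep; a quick way to confirm (a small check, run right after the lines above; the 2617 comes from the run shown earlier):

from keras import backend as K
print(K.int_shape(decoder_outputs))  # (None, None, 2617): (batch, timesteps, num_decoder_tokens)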

5. Build the model and train

The official example trains for 100 epochs; towards the end it should be overfitting, i.e. val_loss first decreases and then increases again. With 20000 samples, overfitting sets in at around epoch 28 (a sketch for stopping earlier is shown after the training log below).

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
# symbolic tensors
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Run training
# validation_split=0.2 holds out 20% of the training data for validation
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# the actual data
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)

Output:

Epoch 1/20
8000/8000 [==============================] - 8s 1ms/step - loss: 2.0236 - val_loss: 2.5357
Epoch 2/20
8000/8000 [==============================] - 6s 773us/step - loss: 1.8918 - val_loss: 2.4145
Epoch 3/20
8000/8000 [==============================] - 6s 773us/step - loss: 1.7785 - val_loss: 2.2746
Epoch 4/20
8000/8000 [==============================] - 6s 774us/step - loss: 1.6937 - val_loss: 2.2429
Epoch 5/20
8000/8000 [==============================] - 6s 774us/step - loss: 1.6348 - val_loss: 2.1701
Epoch 6/20
8000/8000 [==============================] - 6s 773us/step - loss: 1.5567 - val_loss: 2.1057
Epoch 7/20
8000/8000 [==============================] - 6s 773us/step - loss: 1.4995 - val_loss: 2.0756
Epoch 8/20
8000/8000 [==============================] - 6s 772us/step - loss: 1.4527 - val_loss: 2.0360
Epoch 9/20
8000/8000 [==============================] - 6s 773us/step - loss: 1.3995 - val_loss: 1.9788
Epoch 10/20
8000/8000 [==============================] - 6s 775us/step - loss: 1.3561 - val_loss: 1.9562
Epoch 11/20
8000/8000 [==============================] - 6s 774us/step - loss: 1.3147 - val_loss: 1.9274
Epoch 12/20
8000/8000 [==============================] - 6s 775us/step - loss: 1.2790 - val_loss: 1.8991
Epoch 13/20
8000/8000 [==============================] - 6s 776us/step - loss: 1.2433 - val_loss: 1.8901
Epoch 14/20
8000/8000 [==============================] - 6s 777us/step - loss: 1.2098 - val_loss: 1.8733
Epoch 15/20
8000/8000 [==============================] - 6s 773us/step - loss: 1.1786 - val_loss: 1.8560
Epoch 16/20
8000/8000 [==============================] - 6s 775us/step - loss: 1.1487 - val_loss: 1.8447
Epoch 17/20
8000/8000 [==============================] - 6s 776us/step - loss: 1.1199 - val_loss: 1.8459
Epoch 18/20
8000/8000 [==============================] - 6s 774us/step - loss: 1.0944 - val_loss: 1.8493
Epoch 19/20
8000/8000 [==============================] - 6s 775us/step - loss: 1.0676 - val_loss: 1.8287
Epoch 20/20
8000/8000 [==============================] - 6s 777us/step - loss: 1.0410 - val_loss: 1.8131
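
Since val_loss eventually starts rising, one way to stop before overfitting is an EarlyStopping callback (a minimal sketch, not part of the official example; the patience value is arbitrary):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3)
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2,
          callbacks=[early_stop])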

6. Prediction

Run inference and decode the predictions back into text; honestly, the generated translations are not great. One question I had: why set target_seq to 1 at the position of the sampled character? After reading the code again, it updates the one-hot encoding so that the previous output, together with the updated states, produces the next output.

# Next: inference mode (sampling).
# Here's the drill:
# 1) encode input and retrieve initial decoder state
# 2) run one step of decoder with this initial state
# and a "start of sequence" token as target.
# Output will be the next target token
# 3) Repeat with the current target token and current states

# Define sampling models
# Define the encoder model: the input is still encoder_inputs, the output is the states
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
# Set the decoder's input and initial state; decoder_outputs is the full sequence of hidden outputs
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

# Reverse-lookup token index to decode sequences back to
# something readable.
# map indices back to readable characters
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

# Decode an input sequence: run inference and turn the prediction into Chinese text
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    # first run the encoder to obtain states_value
    states_value = encoder_model.predict(input_seq)
    
    # shape: (samples, timesteps, decoder vocabulary size)
    # Generate an empty target sequence of length 1: np.zeros(shape)
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of the target sequence with the start character '\t'.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    # model.predict takes the inputs x and returns the predicted values
    while not stop_condition:
        # decoder_model already includes the Dense layer, so the output is a probability distribution over characters.
        # the predict input includes [target_seq], which on the first step is the start token '\t'
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token: argmax returns the index of the largest probability
        # output_tokens is the Dense softmax output with shape (1, 1, num_decoder_tokens),
        # so indexing the timestep axis with -1 is the same as 0 here
        sampled_token_index = np.argmax(output_tokens[0, -1, :])        
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char
        print(output_tokens)
        print(sampled_char)

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence


for seq_index in range(10):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)

Output:

[[[  3.80258228e-07   2.07275691e-04   1.18607735e-04 ...,   5.21344482e-05
     1.74693909e-04   3.67174550e-07]]]
shape (1, 1, 3381)
1173
我
[[[  7.62539809e-09   2.84402677e-06   1.82350595e-06 ...,   8.31331931e-07
     3.69949048e-05   7.80731568e-09]]]
shape (1, 1, 3381)
244
們
[[[  1.11417904e-07   3.50167174e-05   1.67706850e-04 ...,   1.05691170e-05
     6.63804763e-04   1.03496674e-07]]]
shape (1, 1, 3381)
71
。
[[[  1.02697162e-09   9.72577691e-01   1.30860519e-03 ...,   4.74533701e-09
     1.41450837e-05   1.07544440e-09]]]
shape (1, 1, 3381)
1


-
0
Input sentence: Hi.
Decoded sentence: 我們。

[[[  3.80258228e-07   2.07275691e-04   1.18607735e-04 ...,   5.21344482e-05
     1.74693909e-04   3.67174550e-07]]]
shape (1, 1, 3381)
1173
我
[[[  7.62539809e-09   2.84402677e-06   1.82350595e-06 ...,   8.31331931e-07
     3.69949048e-05   7.80731568e-09]]]
shape (1, 1, 3381)
244
們
[[[  1.11417904e-07   3.50167174e-05   1.67706850e-04 ...,   1.05691170e-05
     6.63804763e-04   1.03496674e-07]]]
shape (1, 1, 3381)
71
。
[[[  1.02697162e-09   9.72577691e-01   1.30860519e-03 ...,   4.74533701e-09
     1.41450837e-05   1.07544440e-09]]]
shape (1, 1, 3381)
1


-
1
Input sentence: Hi.
Decoded sentence: 我們。
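
The loop above only decodes sequences taken from the training data. To translate an arbitrary new sentence it first has to be one-hot encoded the same way (a minimal sketch, not in the original example; it assumes every character of the sentence already appears in input_token_index, otherwise a KeyError is raised):

def encode_sentence(sentence):
    # build a (1, max_encoder_seq_length, num_encoder_tokens) one-hot tensor for a single sentence
    seq = np.zeros((1, max_encoder_seq_length, num_encoder_tokens), dtype='float32')
    for t, char in enumerate(sentence[:max_encoder_seq_length]):
        seq[0, t, input_token_index[char]] = 1.
    return seq

print(decode_sequence(encode_sentence('Run.')))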
