These are my personal notes and annotations on lstm_seq2seq.py from the official Keras examples. The example implements a basic character-level seq2seq model, demonstrated on a translation task. The dataset is this one; I downloaded the cmn-eng English-to-Chinese set, 1247 KB. The overall flow: the encoder encodes the input into an intermediate representation C, and the decoder decodes C into the translated output.
Each line of the data contains an English sentence and its Chinese translation, separated by '\t'. In the preprocessing stage, all English sentences are stored in input_texts, and the Chinese sentences, after adding the start marker '\t' and the end marker '\n', are stored in target_texts. input_characters and target_characters hold the unique characters appearing in the English and Chinese text, respectively.
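As a concrete illustration, here is what happens to a single line (a minimal sketch; the sample line is made up for illustration, not copied from the dataset):
line = 'Hi.\t你好。'                        # raw line: English, a tab, then Chinese
input_text, target_text = line.split('\t')
target_text = '\t' + target_text + '\n'     # add the start ('\t') and end ('\n') markers
print(input_text)                           # Hi.
print(repr(target_text))                    # '\t你好。\n'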
# Vectorize the data
import numpy as np
from keras.models import Model
from keras.layers import Input, LSTM, Dense

batch_size = 64        # batch size for training (default in the official example)
epochs = 100           # number of training epochs
latent_dim = 256       # dimensionality of the LSTM hidden state
num_samples = 10000    # number of samples to train on
data_path = 'cmn.txt'  # path to the downloaded cmn-eng data file

input_texts = []           # the sentences to be translated, i.e. English
target_texts = []          # the translated sentences, i.e. Chinese
input_characters = set()   # unique characters in the English and Chinese text
target_characters = set()
with open(data_path, 'r', encoding='utf-8') as f:
    # an alternative way to read the file line by line: for line in f:
    lines = f.read().split('\n')
print(type(lines), ' ', len(lines))
# Add the start marker y0: given y0, h<0> and c, the decoder can compute the
# probability distribution of y1;
# add the end marker: given y1, h<1> and c, it can compute the distribution of y2,
# ... and so on until the end marker is produced.
for line in lines[: min(num_samples, len(lines) - 1)]:
    # each line is split by a tab: English on the left, Chinese on the right
    # (some entries use traditional characters)
    input_text, target_text = line.split('\t')
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    # i.e. add the start and end markers
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
# record the maximum number of characters in the input and output sentences
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])
print('Number of input len:', len(input_texts))
print('Number of output len:', len(target_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)
Output:
<class 'list'> 20404                    # the dataset has 20404 lines in total
Number of input len: 10000              # the corpus was truncated; only 10000 lines are used for training
Number of output len: 10000
Number of unique input tokens: 73       # 73 distinct characters in the input
Number of unique output tokens: 2617    # 2617 distinct Chinese characters in the output
Max sequence length for inputs: 30      # the longest English sentence has 30 characters
Max sequence length for outputs: 22     # the longest Chinese translation has 22 characters
Build the character-to-index dictionaries and initialize the input/output arrays. This is essentially a simple one-hot embedding of the inputs and outputs.
# Build the token index, where each token is a character; the dict maps a character
# to its numeric id.
input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])
# np.zeros(shape, dtype)
# These are the encoder input data, the decoder input data, and the decoder target data.
# In seq2seq the input and output lengths need not match, so the input arrays use the
# longest English sentence in the corpus and the output arrays use the longest Chinese one.
# (Is simply using the corpus maximum really appropriate?)
# decoder_input_data is what lets us turn the numbers back into readable text at the end,
# i.e. decode C into Chinese characters.
encoder_input_data = np.zeros((len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype='float32')
decoder_input_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
decoder_target_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
'''
Here the arrays are filled from the corpus. Why is decoder_target_data one step ahead
of decoder_input_data?
Oh, got it. See figure 2 in https://blog.csdn.net/jerr__y/article/details/53749693
Filling encoder_input_data: i is the row (sample) index, t is the position of the
character within that row, char is the character, so input_token_index[char] is the
character's index in the input vocabulary.
For row i, character t (i.e. timestep t), the position matching the character's
vocabulary index is set to 1.
t = 0 is the start position; decoder_input_data is the WXYZ at the bottom right of the
figure and decoder_target_data is the WXYZ at the top right, shifted by one position.
The encoder and decoder timesteps t are counted separately.
'''
# This is essentially a simple one-hot embedding of the inputs and outputs.
# The encoder input, decoder input and decoder target are aligned through i.
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1
The encoder produces h and c, which together serve as the decoder's initial state.
# Define an input sequence and process it.
# Here shape = (timesteps, num_tokens); the batch dimension is implicit.
# Also return the internal states.
'''
https://blog.csdn.net/u011327333/article/details/78501054
explains what LSTM returns depending on whether return_state and return_sequences are True.
'''
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
# with return_sequences=True, a 3D tensor of shape (samples, timesteps, output_dim)
# is returned; otherwise a 2D tensor of shape (samples, output_dim)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]
# used as the decoder's initial state
print(encoder_states)
Output:
[<tf.Tensor 'lstm_3/while/Exit_2:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'lstm_3/while/Exit_3:0' shape=(?, 256) dtype=float32>]
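The two printed tensors are the final hidden state h and cell state c, each of dimension latent_dim = 256. To see concretely what return_state and return_sequences change, here is a minimal sketch (my own addition, reusing num_encoder_tokens and latent_dim from above) that just compares the returned values:
from keras.layers import Input, LSTM

x = Input(shape=(None, num_encoder_tokens))
# default: only the output at the last timestep, shape (samples, latent_dim)
last_output = LSTM(latent_dim)(x)
# return_state=True: [last output, last hidden state h, last cell state c]
last_output, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
# return_sequences=True: outputs for all timesteps, shape (samples, timesteps, latent_dim)
all_outputs = LSTM(latent_dim, return_sequences=True)(x)
# both True: [full output sequence, last h, last c] -- this is what the decoder LSTM below uses
all_outputs, state_h, state_c = LSTM(latent_dim, return_sequences=True, return_state=True)(x)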
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# Return the decoder's full output sequence and its internal states. The returned states
# are not used during training, but they are needed later for inference.
'''
The decoder's initial state is the encoder's state_h and state_c.
latent_dim is the output dimensionality of the LSTM layer.
Dense(units, activation)
'''
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
The example trains for 100 epochs; towards the end it is presumably overfitting, i.e. val_loss first decreases and then increases. With 20000 samples, overfitting sets in at around epoch 28.
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
# symbolic tensors
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
# Run training
# validation_split=0.2 holds out 20% of the training data for validation
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# actual data
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
Output:
Epoch 1/20
8000/8000 [==============================] - 8s 1ms/step - loss: 2.0236 - val_loss: 2.5357
Epoch 2/20
8000/8000 [==============================] - 6s 773us/step - loss: 1.8918 - val_loss: 2.4145
Epoch 3/20
8000/8000 [==============================] - 6s 773us/step - loss: 1.7785 - val_loss: 2.2746
Epoch 4/20
8000/8000 [==============================] - 6s 774us/step - loss: 1.6937 - val_loss: 2.2429
Epoch 5/20
8000/8000 [==============================] - 6s 774us/step - loss: 1.6348 - val_loss: 2.1701
Epoch 6/20
8000/8000 [==============================] - 6s 773us/step - loss: 1.5567 - val_loss: 2.1057
Epoch 7/20
8000/8000 [==============================] - 6s 773us/step - loss: 1.4995 - val_loss: 2.0756
Epoch 8/20
8000/8000 [==============================] - 6s 772us/step - loss: 1.4527 - val_loss: 2.0360
Epoch 9/20
8000/8000 [==============================] - 6s 773us/step - loss: 1.3995 - val_loss: 1.9788
Epoch 10/20
8000/8000 [==============================] - 6s 775us/step - loss: 1.3561 - val_loss: 1.9562
Epoch 11/20
8000/8000 [==============================] - 6s 774us/step - loss: 1.3147 - val_loss: 1.9274
Epoch 12/20
8000/8000 [==============================] - 6s 775us/step - loss: 1.2790 - val_loss: 1.8991
Epoch 13/20
8000/8000 [==============================] - 6s 776us/step - loss: 1.2433 - val_loss: 1.8901
Epoch 14/20
8000/8000 [==============================] - 6s 777us/step - loss: 1.2098 - val_loss: 1.8733
Epoch 15/20
8000/8000 [==============================] - 6s 773us/step - loss: 1.1786 - val_loss: 1.8560
Epoch 16/20
8000/8000 [==============================] - 6s 775us/step - loss: 1.1487 - val_loss: 1.8447
Epoch 17/20
8000/8000 [==============================] - 6s 776us/step - loss: 1.1199 - val_loss: 1.8459
Epoch 18/20
8000/8000 [==============================] - 6s 774us/step - loss: 1.0944 - val_loss: 1.8493
Epoch 19/20
8000/8000 [==============================] - 6s 775us/step - loss: 1.0676 - val_loss: 1.8287
Epoch 20/20
8000/8000 [==============================] - 6s 777us/step - loss: 1.0410 - val_loss: 1.8131
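Since val_loss eventually stops improving, one option is to cut training off automatically. A minimal sketch using Keras's EarlyStopping callback (my own addition, not part of the official example):
from keras.callbacks import EarlyStopping

# stop once val_loss has not improved for 5 consecutive epochs (the patience value is a free choice)
early_stopping = EarlyStopping(monitor='val_loss', patience=5)
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2,
          callbacks=[early_stopping])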
Run inference and decode the predictions back into text. Honestly the generated translations are not great. One question: why set target_seq to 1 at the position of the sampled character? After reading the code again, it updates the one-hot encoding so that the previous timestep's output, together with the updated states, is used to generate the next timestep's output.
# Next: inference mode (sampling).
# Here's the drill:
# 1) encode input and retrieve initial decoder state
# 2) run one step of decoder with this initial state
# and a "start of sequence" token as target.
# Output will be the next target token
# 3) Repeat with the current target token and current states

# Define sampling models
# the encoder model: the input is still encoder_inputs, the output is the states
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
# set the decoder's inputs and initial state; decoder_outputs holds the outputs
# for all timesteps
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

# Reverse-lookup token index to decode sequences back to
# something readable.
# i.e. map the numeric index back to a readable character
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

# Given an input sequence, predict the translation and decode it into Chinese.
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    # first run the encoder to get states_value
    states_value = encoder_model.predict(input_seq)
    # dims: sample index, character index, size of the decoder vocabulary
    # Generate empty target sequence of length 1. np.zeros(shape)
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character '\t'.
    target_seq[0, 0, target_token_index['\t']] = 1.
    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    # model.predict -- takes x values and returns predictions
    while not stop_condition:
        # decoder_model already includes the dense layer, so it outputs class probabilities.
        # predict takes [target_seq], which initially holds the start marker '\t'
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)
        # Sample a token: argmax returns the index of the largest value
        # output_tokens is an array; -1 refers to the last timestep
        # I was not sure at first why -1; printing output_tokens.shape shows it is
        # (1, 1, num_decoder_tokens) -- the dense layer's probabilities -- so 0 and -1
        # index the same position here
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char
        print(output_tokens)
        print('shape', output_tokens.shape)
        print(sampled_token_index)
        print(sampled_char)
        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
                len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True
        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.
        # Update states
        states_value = [h, c]
    return decoded_sentence

for seq_index in range(10):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print(seq_index)
    print('Input sentence:', input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)
Output:
[[[ 3.80258228e-07 2.07275691e-04 1.18607735e-04 ..., 5.21344482e-05
1.74693909e-04 3.67174550e-07]]]
shape (1, 1, 3381)
1173
我
[[[ 7.62539809e-09 2.84402677e-06 1.82350595e-06 ..., 8.31331931e-07
3.69949048e-05 7.80731568e-09]]]
shape (1, 1, 3381)
244
們
[[[ 1.11417904e-07 3.50167174e-05 1.67706850e-04 ..., 1.05691170e-05
6.63804763e-04 1.03496674e-07]]]
shape (1, 1, 3381)
71
。
[[[ 1.02697162e-09 9.72577691e-01 1.30860519e-03 ..., 4.74533701e-09
1.41450837e-05 1.07544440e-09]]]
shape (1, 1, 3381)
1
-
0
Input sentence: Hi.
Decoded sentence: 我們。
[[[ 3.80258228e-07 2.07275691e-04 1.18607735e-04 ..., 5.21344482e-05
1.74693909e-04 3.67174550e-07]]]
shape (1, 1, 3381)
1173
我
[[[ 7.62539809e-09 2.84402677e-06 1.82350595e-06 ..., 8.31331931e-07
3.69949048e-05 7.80731568e-09]]]
shape (1, 1, 3381)
244
們
[[[ 1.11417904e-07 3.50167174e-05 1.67706850e-04 ..., 1.05691170e-05
6.63804763e-04 1.03496674e-07]]]
shape (1, 1, 3381)
71
。
[[[ 1.02697162e-09 9.72577691e-01 1.30860519e-03 ..., 4.74533701e-09
1.41450837e-05 1.07544440e-09]]]
shape (1, 1, 3381)
1
-
1
Input sentence: Hi.
Decoded sentence: 我們。