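A ten-minute introduction to sequence-to-sequence learning in Keras (translation)
Original article: A ten-minute introduction to sequence-to-sequence learning in Keras
A brief introduction to implementing RNN-based sequence-to-sequence (Seq2Seq) learning in Keras.
This article assumes that you already have some experience with recurrent networks and Keras.
Seq2Seq learning is about training models that convert sequences from one domain (e.g. sentences in English) into sequences in another domain (e.g. the same sentences in French).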
"the cat sat on the mat" -> [Seq2Seq model] -> "le chat etait assis sur le tapis"
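This technique can be used for machine translation or free-form question answering (generating an answer to a given question), and more generally for any application that needs to generate text.
There are several ways to handle this task, such as RNNs or 1D convnets. Here we focus only on RNNs.
When the input and output sequences have the same length, you can implement such a model simply with a Keras LSTM or GRU layer (or a stack of them). This is the case in the example script that teaches an RNN to add numbers encoded as character strings.
The caveat of this approach is that it assumes target[…t] can be generated from input[…t] alone. That works in some cases (e.g. adding strings of digits) but not in most. In the general case, information about the entire input sequence is needed before the target sequence can start to be generated.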
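To make the trivial case concrete, here is a minimal sketch of how addition problems might be encoded as fixed-length character strings, so that every input and every target has the same length. The helper name `encode_addition`, the `digits` parameter, and the space-padding scheme are assumptions for illustration and are not taken from the example script.

```python
def encode_addition(a, b, digits=3):
    """Return (question, answer) strings padded to one shared length."""
    width = 2 * digits + 1  # longest possible question, e.g. "999+999"
    question = f"{a}+{b}".ljust(width)
    answer = str(a + b).ljust(width)
    return question, answer

question, answer = encode_addition(535, 61)
print(repr(question))  # '535+61 '
print(repr(answer))    # '596    '
```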
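In the general case, input and output sequences have different lengths (as in machine translation), and the entire input sequence is required before the target can be predicted. This calls for a more advanced setup, which is what people usually mean by "sequence-to-sequence model". In outline, an encoder RNN reads the input sequence and returns its internal state, and a decoder RNN initialized with that state is trained to predict the next characters of the target sequence given the previous ones (a scheme known as "teacher forcing").
In inference mode, i.e. when we want to decode an unknown input sequence, the process is slightly different: the decoder generates the output one character at a time, and each prediction is fed back in as its next input.
The same process can also be used to train a Seq2Seq network without "teacher forcing", i.e. by reinjecting the decoder's predictions into the decoder.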
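As an illustration of teacher forcing, the sketch below builds the decoder's input and target from a single target sentence: the target is the input shifted one step ahead, so at each step the decoder sees the true previous character rather than its own prediction. The tab and newline markers follow the start and stop characters used in the decoding code later in this article; the helper name `teacher_forcing_pairs` is hypothetical.

```python
START, STOP = "\t", "\n"

def teacher_forcing_pairs(sentence):
    """Pair each decoder input character with the character it should predict."""
    full = START + sentence + STOP
    decoder_input = full[:-1]   # begins with the start character
    decoder_target = full[1:]   # same sequence, one timestep ahead
    return list(zip(decoder_input, decoder_target))

pairs = teacher_forcing_pairs("Va !")
for given, expected in pairs:
    print(repr(given), "->", repr(expected))
```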
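Let's illustrate these ideas with actual code.
For our example implementation, we use a dataset of English sentences paired with their French translations, which can be downloaded from manythings.org/anki; the file is named fra-eng.zip. We implement a character-level sequence-to-sequence model, which processes the input and generates the output character by character. Another option would be a word-level model, which is more widely used in machine translation. The end of this article includes notes on turning this model into a word-level model using Embedding layers.
The full code for this example is available on GitHub.
The process, in outline: prepare the training data, train the model, then decode sentences with a separate inference setup.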
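The data preparation part of this process can be sketched as follows: the sentence pairs are turned into three one-hot NumPy arrays named after the variables used in the training code (`encoder_input_data`, `decoder_input_data`, `decoder_target_data`). The two toy sentence pairs are placeholders for the contents of fra-eng.zip, and the helper variables are assumptions; treat this as a sketch rather than the exact code of the full script.

```python
import numpy as np

# Toy stand-in for the English-French pairs in fra-eng.zip (assumption).
input_texts = ["Go.", "Hi."]
# Tab marks the start of a target sequence, newline marks its end.
target_texts = ["\t" + t + "\n" for t in ["Va !", "Salut !"]]

input_chars = sorted(set("".join(input_texts)))
target_chars = sorted(set("".join(target_texts)))
num_encoder_tokens = len(input_chars)
num_decoder_tokens = len(target_chars)
max_encoder_seq_length = max(len(t) for t in input_texts)
max_decoder_seq_length = max(len(t) for t in target_texts)

input_token_index = {c: i for i, c in enumerate(input_chars)}
target_token_index = {c: i for i, c in enumerate(target_chars)}

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype="float32")
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype="float32")
decoder_target_data = np.zeros_like(decoder_input_data)

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.0
    for t, char in enumerate(target_text):
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # The target is one timestep ahead and omits the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0

print(encoder_input_data.shape, decoder_input_data.shape)
```

Shorter sentences are left zero-padded at the end, so all samples in an array share the same number of timesteps.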
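Because the training process and the inference process (decoding sentences) are quite different, we use different models for each, although they share the same inner layers.
This is our training model. It relies on three key features of Keras RNNs: the return_state constructor argument, the initial_state call argument, and the return_sequences constructor argument.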
from keras.models import Model
from keras.layers import Input, LSTM, Dense
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
Training the model takes two lines of code, and we monitor the loss on a held-out set of 20% of the samples.
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
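For reference, `validation_split=0.2` holds out the last 20% of the samples passed to `fit`, taken before any shuffling. The sketch below mimics that slicing on a plain list; the function name `split_like_validation_split` is hypothetical and this is not Keras code.

```python
def split_like_validation_split(samples, validation_split=0.2):
    """Mimic how the held-out samples are taken from the end of the data."""
    num_held_out = int(len(samples) * validation_split)
    split_at = len(samples) - num_held_out
    return samples[:split_at], samples[split_at:]

train, held_out = split_like_validation_split(list(range(10)))
print(train)     # [0, 1, 2, 3, 4, 5, 6, 7]
print(held_out)  # [8, 9]
```

Because the held-out samples come from the end of the arrays, the order of the sentence pairs in the dataset determines which ones are used for validation.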
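After about an hour of training on an Apple CPU, we are ready for inference. To decode a test sentence, we encode it into state vectors, start from a target sequence holding only the start character, and then repeatedly predict the next character from the current states and target sequence, sample it with argmax, append it to the output, and feed it back in, until the stop character is produced or the length limit is reached.
Here is the inference setup: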
encoder_model = Model(encoder_inputs, encoder_states)
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)
The following code implements the inference loop described above:
import numpy as np
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.
    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)
        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char
        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
                len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True
        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.
        # Update states
        states_value = [h, c]
    return decoded_sentence
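The control flow of the loop above can be exercised without a trained model. In the sketch below, `toy_decoder_step` is a hypothetical stand-in for `decoder_model.predict`: it maps the previous character and a state to scores over a three-character vocabulary. Only the greedy sampling, the state update, and the two exit conditions mirror `decode_sequence`; the lookup table itself is made up.

```python
VOCAB = ["a", "b", "\n"]
# Made-up transition scores: previous character -> score per VOCAB entry.
SCORES = {
    "\t": [0.7, 0.2, 0.1],
    "a": [0.1, 0.8, 0.1],
    "b": [0.1, 0.1, 0.8],
}

def toy_decoder_step(prev_char, state):
    """Hypothetical stand-in for decoder_model.predict; state counts steps."""
    return SCORES[prev_char], state + 1

def greedy_decode(max_length=10):
    prev_char, state = "\t", 0  # start character and initial state
    decoded = ""
    while True:
        scores, state = toy_decoder_step(prev_char, state)
        # Greedy sampling: pick the highest-scoring token (argmax).
        sampled = VOCAB[scores.index(max(scores))]
        decoded += sampled
        # Exit on the stop character or once the length limit is exceeded.
        if sampled == "\n" or len(decoded) > max_length:
            return decoded
        # Feed the prediction back in as the next input.
        prev_char = sampled

print(repr(greedy_decode()))  # 'ab\n'
```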
The results are good, which is unsurprising because the test sentences are taken from the training set.
Input sentence: Be nice.
Decoded sentence: Soyez gentil !
-
Input sentence: Drop it!
Decoded sentence: Laissez tomber !
-
Input sentence: Get out!
Decoded sentence: Sortez !
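This concludes our ten-minute introduction to seq2seq models in Keras. The code is available on GitHub.
Switching from LSTM to GRU layers is straightforward, because a GRU has only one state whereas an LSTM has two. Here is how to adapt the training model to use GRU layers: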
from keras.layers import GRU
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = GRU(latent_dim, return_state=True)
encoder_outputs, state_h = encoder(encoder_inputs)
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_gru = GRU(latent_dim, return_sequences=True)
decoder_outputs = decoder_gru(decoder_inputs, initial_state=state_h)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
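The difference in state count can be seen in a single recurrent step written out by hand. The NumPy sketch below uses the standard LSTM and GRU update equations with random weights; the function names, the weight layout, and the omission of biases are simplifying assumptions, and it is not how Keras implements these layers internally.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U):
    """One LSTM step: carries two states, hidden state h and cell state c."""
    i, f, g, o = np.split(x @ W + h @ U, 4, axis=-1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def gru_step(x, h, W, U):
    """One GRU step: carries a single state h."""
    W_z, W_r, W_h = np.split(W, 3, axis=-1)
    U_z, U_r, U_h = np.split(U, 3, axis=-1)
    z = sigmoid(x @ W_z + h @ U_z)  # update gate
    r = sigmoid(x @ W_r + h @ U_r)  # reset gate
    h_candidate = np.tanh(x @ W_h + (r * h) @ U_h)
    return z * h + (1.0 - z) * h_candidate

rng = np.random.default_rng(0)
input_dim, latent_dim = 5, 4
x = rng.normal(size=(1, input_dim))
h0 = np.zeros((1, latent_dim))
c0 = np.zeros((1, latent_dim))

lstm_states = lstm_step(x, h0, c0,
                        rng.normal(size=(input_dim, 4 * latent_dim)),
                        rng.normal(size=(latent_dim, 4 * latent_dim)))
gru_state = gru_step(x, h0,
                     rng.normal(size=(input_dim, 3 * latent_dim)),
                     rng.normal(size=(latent_dim, 3 * latent_dim)))
print(len(lstm_states), gru_state.shape)
```

This is why the GRU encoder above returns a single `state_h`, which is passed directly as `initial_state` instead of a two-element list.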
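What if the inputs are integer sequences (e.g. sequences of words encoded by their index in a dictionary)? You can process these integer tokens with an Embedding layer, as shown below: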
from keras.layers import Embedding
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None,))
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
x, state_h, state_c = LSTM(latent_dim,
                           return_state=True)(x)
encoder_states = [state_h, state_c]
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))
x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(x)
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
# Compile & run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# Note that `decoder_target_data` needs to be one-hot encoded,
# rather than sequences of integers like `decoder_input_data`!
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
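Since the comment above notes that `decoder_target_data` must be one-hot encoded while the inputs stay as integer sequences, the conversion might look like the sketch below. The toy integer sequences and the vocabulary size are made up for illustration.

```python
import numpy as np

num_decoder_tokens = 6
# Toy word-index sequences of shape (num_samples, timesteps);
# index 0 is used as padding here (an assumption).
target_ids = np.array([[1, 4, 2, 0],
                       [1, 3, 5, 2]])

# Indexing an identity matrix with integer ids yields one one-hot row per id.
decoder_target_data = np.eye(num_decoder_tokens, dtype="float32")[target_ids]

print(decoder_target_data.shape)  # (2, 4, 6)
```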
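In some cases you may not be able to use teacher forcing, because the full target sequences are not available, e.g. when doing online training on sequences so long that complete input-target pairs cannot be buffered. In that case, you may want to train by reinjecting the decoder's predictions into the decoder's input, just as is done for inference.
You can achieve this by building a model that hard-codes the output reinjection loop: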
from keras.layers import Lambda
from keras import backend as K
# The first part is unchanged
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
states = [state_h, state_c]
# Set up the decoder, which will only process one timestep at a time.
decoder_inputs = Input(shape=(1, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
all_outputs = []
inputs = decoder_inputs
for _ in range(max_decoder_seq_length):
    # Run the decoder on one timestep
    outputs, state_h, state_c = decoder_lstm(inputs,
                                             initial_state=states)
    outputs = decoder_dense(outputs)
    # Store the current prediction (we will concatenate all predictions later)
    all_outputs.append(outputs)
    # Reinject the outputs as inputs for the next loop iteration
    # as well as update the states
    inputs = outputs
    states = [state_h, state_c]
# Concatenate all predictions
decoder_outputs = Lambda(lambda x: K.concatenate(x, axis=1))(all_outputs)
# Define and compile model as previously
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# Prepare decoder input data that just contains the start character
# Note that we could have made it a constant hard-coded in the model
decoder_input_data = np.zeros((num_samples, 1, num_decoder_tokens))
decoder_input_data[:, 0, target_token_index['\t']] = 1.
# Train model as previously
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
If you have more questions, please reach out on Twitter.