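The previous posts covered the implementation of a retrieval-based chatbot, the seq2seq model, and its code. This post builds a seq2seq-based chatbot from scratch. With that, the models used in dialogue systems before reinforcement learning and memory models appeared have mostly been covered. Later posts will focus on how reinforcement learning and memory models are applied to dialogue systems.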
- Implementing a retrieval-based chatbot
- The seq2seq model explained
- The seq2seq code in TensorFlow explained
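There are many fun chit-chat bots online:
- cleverbot chit-chat bot
- 小黄鸡 (Xiaohuangji) chit-chat bot
They are not necessarily built with seq2seq, but the results are similar.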
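This post is mainly based on the GitHub project DeepQA, which supports interactive dialogue both on the command line and through a web front end.
The result is single-turn chit-chat, the simplest dialogue task, but it helps a great deal with understanding and implementing multi-turn dialogue and dialogue management later. DeepQA's implementation is fairly complex and not well suited to beginners trying to understand its structure, so I rewrote the code myself, following the structure I used for textCNN, and uploaded it to GitHub as Seq2seq-QA. Without further ado, we again proceed in four parts: data processing, model construction, model training, and training results.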
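Dataset: Cornell Movie-Dialogs Corpus
This movie dialogue dataset contains a collection of fictional conversations extracted from raw movie scripts:
- 220,579 conversational exchanges between 10,292 pairs of movie characters
- 9,035 characters from 617 movies
- 304,713 utterances in total
Training uses mainly movie_lines.txt and movie_conversations.txt.
Each row of movie_lines.txt has the following attributes: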
- lineID
- characterID (who uttered this phrase)
- movieID
- character name
- text of the utterance
In every row the attributes are separated by " +++$+++ ". The first attribute is the lineID and the last attribute is the utterance text.
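The row layout described above can be sketched with plain string splitting. This is only an illustration: the sample row is written in the corpus format rather than copied from the dataset, and `parse_movie_line` is a hypothetical helper name.

```python
SEPARATOR = " +++$+++ "

def parse_movie_line(raw_line):
    """Split one movie_lines.txt row into (lineID, utterance text)."""
    # At most 4 splits, so the utterance stays intact even if it contains the separator
    fields = raw_line.rstrip("\n").split(SEPARATOR, 4)
    return fields[0], fields[4]

# Illustrative row in the corpus format (hypothetical values)
sample = "L100 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ How are you?"
line_id, text = parse_movie_line(sample)
print(line_id, text)
```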
Each row of movie_conversations.txt has the following attributes:
- characterID of the first character involved in the conversation
- characterID of the second character involved in the conversation
- movieID of the movie in which the conversation occurred
- list of the utterances that make the conversation
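The same separator is used, and the last attribute of each row is the extracted conversation snippet. For example, the snippet in the first row is ['L194', 'L195', 'L196', 'L197'], where each element is a lineID. Replacing the lineIDs in movie_conversations.txt with the utterance text from movie_lines.txt yields the training set: after substitution, the adjacent pairs ['L194', 'L195'], ['L195', 'L196'], and ['L196', 'L197'] form three training samples.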
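The pairing rule can be sketched as a sliding window over adjacent lineIDs. The utterance texts below are placeholders, and `conversation_to_pairs` is a hypothetical name used only for this illustration.

```python
def conversation_to_pairs(line_ids, id_to_text):
    """Turn one conversation (a list of lineIDs) into (question, answer) text pairs."""
    # zip the list with itself shifted by one to get every adjacent pair
    return [(id_to_text[a], id_to_text[b]) for a, b in zip(line_ids, line_ids[1:])]

# Placeholder texts, only to show the shape of the result
id_to_text = {
    "L194": "utterance 194",
    "L195": "utterance 195",
    "L196": "utterance 196",
    "L197": "utterance 197",
}
pairs = conversation_to_pairs(["L194", "L195", "L196", "L197"], id_to_text)
print(len(pairs))
```

A conversation of n utterances therefore yields n - 1 samples, and every utterance except the first and last appears once as a question and once as an answer.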
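Now to the code. The three standard moves of data processing (read the data, build the vocabulary, construct the dataset) should be thoroughly familiar by now.
1. Reading the data, again with pandas:
read_csv treats a multi-character separator as a regular expression, so the "+++$+++" separator has to be escaped. From movie_lines.txt only two columns are needed, the lineID and the utterance text; the utterance text is then tokenized into a list of words.
From movie_conversations.txt only the conversation snippet, i.e. the line_ids column, is needed. It is read as a str, so eval or literal_eval is needed to turn it back into a list.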
# Read movie_lines.txt and movie_conversations.txt
# (requires: import ast; import pandas as pd)
print("Start reading data")
# sep is a regular expression, so "+" and "$" are escaped; a raw string keeps the backslashes intact
self.lines = pd.read_csv(self.args.line_path, sep=r" \+\+\+\$\+\+\+ ", usecols=[0, 4],
                         names=["line_id", "utterance"], dtype={"utterance": str}, engine="python")
self.conversations = pd.read_csv(self.args.conv_path, usecols=[3], names=["line_ids"],
                                 sep=r" \+\+\+\$\+\+\+ ", dtype={"line_ids": str}, engine="python")
# Tokenize each utterance into a list of words
self.lines.utterance = self.lines.utterance.apply(self.word_tokenizer)
# Parse the list-like strings back into Python lists; literal_eval is safer than eval
self.conversations.line_ids = self.conversations.line_ids.apply(ast.literal_eval)
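The two details in this step, the escaped separator and the string-to-list conversion, can be tried in isolation with the standard library. The row below is illustrative (the first three fields are assumed values), and the snippet is a sketch rather than part of the original code.

```python
import ast
import re

# Escaped separator: "+" and "$" are regex metacharacters
SEP_PATTERN = r" \+\+\+\$\+\+\+ "

# Illustrative row in the movie_conversations.txt format
row = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']"
fields = re.split(SEP_PATTERN, row)

# literal_eval only accepts Python literals, so it cannot execute arbitrary code
line_ids = ast.literal_eval(fields[3])
print(line_ids)

def is_safe_literal(text):
    """Return True if text parses as a plain Python literal."""
    try:
        ast.literal_eval(text)
        return True
    except (ValueError, SyntaxError):
        return False
```

Unlike eval, literal_eval rejects anything that is not a literal, for example a function call, which matters when the file contents are not fully trusted.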
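2. Building the vocabulary
For convenience, all words are lowercased, then sorted by frequency and assigned ids. Only words that occur more than once are kept in the vocab, which reduces the influence of the long tail on the generated results. The word2id and id2word tables are built as pandas Series.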
# Build the word2id and id2word tables
# (requires: from itertools import chain; import numpy as np; import pandas as pd)
print("Start building the vocabulary")
words = self.lines.utterance.values
words = list(chain(*words))
# Lowercase all words
print("Lowercasing")
words = list(map(str.lower, words))
print("Lowercasing finished")
# value_counts sorts by frequency in descending order
sr_words_count = pd.Series(words).value_counts()
# Keep the words whose count exceeds the threshold (here 1) as the vocabulary
sr_words_size = np.where(sr_words_count.values > self.args.vacab_filter)[0].size
sr_words_index = sr_words_count.index[0:sr_words_size]
# Regular words get ids starting at self.numToken; the lower ids are reserved for special tokens
self.sr_word2id = pd.Series(range(self.numToken, self.numToken + sr_words_size), index=sr_words_index)
self.sr_id2word = pd.Series(sr_words_index, index=range(self.numToken, self.numToken + sr_words_size))
self.sr_word2id[self.padToken] = 0
self.sr_word2id[self.goToken] = 1
self.sr_word2id[self.eosToken] = 2
self.sr_word2id[self.unknownToken] = 3
self.sr_id2word[0] = self.padToken
self.sr_id2word[1] = self.goToken
self.sr_id2word[2] = self.eosToken
self.sr_id2word[3] = self.unknownToken
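For readers who prefer to see the idea without pandas, the same vocabulary logic can be sketched with collections.Counter. The special-token spellings and the `build_vocab` name are assumptions made for this sketch; the original code stores them in self.padToken and friends.

```python
from collections import Counter

# Assumed spellings; ids 0-3 are reserved for these tokens
SPECIAL_TOKENS = ["<pad>", "<go>", "<eos>", "<unknown>"]

def build_vocab(tokenized_lines, min_count=1):
    """Return (word2id, id2word), keeping words that occur more than min_count times."""
    counts = Counter(w.lower() for line in tokenized_lines for w in line)
    word2id = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    # most_common() iterates from the most frequent word to the least frequent
    for word, count in counts.most_common():
        if count > min_count:
            word2id[word] = len(word2id)
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word

lines = [["Hi", "there"], ["hi", "you"], ["You", "there", "now"]]
word2id, id2word = build_vocab(lines)
print(len(word2id))
```

Words below the threshold ("now" in this toy input) get no id of their own and are later mapped to the unknown token.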
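3. Constructing the dataset
For a dialogue-generation dataset, only training samples need to be built. As mentioned earlier, the lineIDs in movie_conversations.txt have to be replaced by the utterance text from movie_lines.txt. For fast lookup we build a mapping with lineID as the key and the utterance text as the value, which is sr_line_id in the code. We then build samples of the form [first_conv, second_conv]. Attentive readers may notice that no padding is applied while constructing the dataset: padding is written in the batch-building code instead, which makes the code easier to reuse and is explained in detail when we build batches. That completes data processing.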
print("Start generating training samples")
# Map line_id to its tokenized utterance for fast lookup when building samples
self.sr_line_id = pd.Series(self.lines.utterance.values, index=self.lines.line_id.values)
for line_id in tqdm(self.conversations.line_ids.values, ncols=10):
    for i in range(len(line_id) - 1):
        first_conv = self.sr_line_id[line_id[i]]
        second_conv = self.sr_line_id[line_id[i+1]]
        # Lowercase the text, then replace each word with its id
        first_conv = self.replace_word_with_id(first_conv)
        second_conv = self.replace_word_with_id(second_conv)
        # Drop samples whose input or output is longer than max_length,
        # and samples whose output contains the UNK token
        valid = self.filter_conversations(first_conv, second_conv)
        if valid:
            temp = [first_conv, second_conv]
            self.train_samples.append(temp)
print("Finished generating training samples")

def filter_conversations(self, first_conv, second_conv):
    # First drop conversations whose encoder_input or decoder_input exceeds max_length,
    # then drop conversations whose target contains UNK
    valid = True
    valid &= len(first_conv) <= self.args.maxLength
    valid &= len(second_conv) <= self.args.maxLength
    valid &= second_conv.count(self.sr_word2id[self.unknownToken]) == 0
    return valid
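The code above calls replace_word_with_id, which is not shown in the post. A plausible minimal version, under the assumption that out-of-vocabulary words map to the unknown token, might look like the sketch below, together with a standalone variant of the filter. The names, the plain-dict vocabulary, and the "<unknown>" spelling are all assumptions.

```python
def replace_word_with_id(tokens, word2id, unknown_token="<unknown>"):
    """Lowercase each token and map it to its id; unseen words become the unknown id."""
    unk_id = word2id[unknown_token]
    return [word2id.get(tok.lower(), unk_id) for tok in tokens]

def is_valid_pair(question_ids, answer_ids, word2id, max_length, unknown_token="<unknown>"):
    """Keep a pair only if both sides fit max_length and the answer has no unknown id."""
    unk_id = word2id[unknown_token]
    return (len(question_ids) <= max_length
            and len(answer_ids) <= max_length
            and unk_id not in answer_ids)

# Toy vocabulary (hypothetical ids)
word2id = {"<pad>": 0, "<go>": 1, "<eos>": 2, "<unknown>": 3, "hi": 4, "there": 5}
question = replace_word_with_id(["Hi", "stranger"], word2id)
answer = replace_word_with_id(["hi", "there"], word2id)
print(question, answer, is_valid_pair(question, answer, word2id, max_length=10))
```

Unknown words are tolerated on the input side but rejected on the target side, so the decoder is never trained to emit the unknown token.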
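Model construction mainly relies on the embedding_rnn_seq2seq function from TensorFlow's tf.contrib.legacy_seq2seq interface, which is explained in detail in the post "The seq2seq code in TensorFlow explained". Note that each placeholder here is a list, i.e. a list of [batch_size,] tensors with one entry per time step. During training, each batch therefore has to be padded and reshaped to match the placeholder shapes. The model itself is not complicated, so it is not described in detail.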
import tensorflow as tf  # TensorFlow 1.x; tf.contrib is not available in 2.x

class seq2seq:
    def __init__(self, args, text_data):
        self.args = args
        self.text_data = text_data
        # Placeholders
        self.encoder_inputs = None
        self.decoder_inputs = None
        self.decoder_targets = None
        self.decoder_weights = None
        self.num_encoder_symbols = len(text_data.sr_word2id)
        self.num_decoder_symbols = self.num_encoder_symbols
        # self.num_encoder_symbols = 10000
        # self.num_decoder_symbols = 10000
        # important operation
        self.outputs = None
        self.loss = None
        self.build_model()

    def build_model(self):
        outputProjection = None

        # define multi RNN cell
        def create_cell():
            cell = tf.contrib.rnn.BasicLSTMCell(self.args.hidden_size)
            cell = tf.contrib.rnn.DropoutWrapper(
                cell,
                input_keep_prob=1.0,
                output_keep_prob=self.args.dropout)
            return cell

        self.cell = tf.contrib.rnn.MultiRNNCell([create_cell() for _ in range(self.args.rnn_layers)])

        # define placeholders: one [batch_size,] tensor per time step
        with tf.name_scope("encoder_placeholder"):
            self.encoder_inputs = [tf.placeholder(tf.int32, [None, ])
                                   for _ in range(self.args.maxLengthEnco)]
        with tf.name_scope("decoder_placeholder"):
            self.decoder_inputs = [tf.placeholder(tf.int32, [None, ], name='decoder_inputs')
                                   for _ in range(self.args.maxLengthDeco)]
            self.decoder_targets = [tf.placeholder(tf.int32, [None, ], name='decoder_targets')
                                    for _ in range(self.args.maxLengthDeco)]
            self.decoder_weights = [tf.placeholder(tf.float32, [None, ], name='decoder_weights')
                                    for _ in range(self.args.maxLengthDeco)]

        decoder_output, state = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
            self.encoder_inputs,
            self.decoder_inputs,
            self.cell,
            self.num_encoder_symbols,
            self.num_decoder_symbols,
            self.args.embedding_size,
            output_projection=None,
            feed_previous=bool(self.args.test),
            dtype=None,
            scope=None)

        # For testing only (truthiness check, consistent with feed_previous above and with step())
        if self.args.test:
            if not outputProjection:
                self.outputs = decoder_output
            else:
                self.outputs = [outputProjection(output) for output in decoder_output]
        else:
            self.loss = tf.contrib.legacy_seq2seq.sequence_loss(logits=decoder_output,
                                                                targets=self.decoder_targets,
                                                                weights=self.decoder_weights)
            tf.summary.scalar('loss', self.loss)  # Keep track of the cost
        print("Model built")
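To make the role of decoder_weights concrete, here is a simplified pure-Python sketch of a weight-masked, averaged cross-entropy over time-major lists. It follows my understanding of what sequence_loss computes with its default averaging, but it is an illustration and not the library implementation; `masked_sequence_loss` is a hypothetical name.

```python
import math

def masked_sequence_loss(logits, targets, weights):
    """logits[t][b] is a score list over the vocab; targets[t][b] an id; weights[t][b] a float."""
    batch_size = len(targets[0])
    total = 0.0
    for b in range(batch_size):
        weighted_sum, weight_sum = 0.0, 0.0
        for t in range(len(targets)):
            scores = logits[t][b]
            # Numerically stable log-sum-exp for the softmax normalizer
            m = max(scores)
            log_z = m + math.log(sum(math.exp(s - m) for s in scores))
            crossent = log_z - scores[targets[t][b]]
            weighted_sum += weights[t][b] * crossent
            weight_sum += weights[t][b]
        # Average over the non-padded time steps of this example
        total += weighted_sum / (weight_sum + 1e-12)
    # Average over the batch
    return total / batch_size

# Two time steps, batch of one, vocab of four; the second step is padding (weight 0)
logits = [[[0.0, 0.0, 0.0, 0.0]], [[10.0, -10.0, 0.0, 0.0]]]
targets = [[1], [1]]
weights = [[1.0], [0.0]]
loss = masked_sequence_loss(logits, targets, weights)
print(round(loss, 4))
```

Because the padded step has weight 0, its badly wrong prediction contributes nothing; the loss equals the cross-entropy of the single real step.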
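The main training loop is shown below. It has the same structure as the code in earlier posts. What needs a detailed explanation is how to produce batches that work for both training and prediction, i.e. get_next_batches(), and how to produce a feed_dict that works for both, i.e. self.seq2seq_model.step(next_batch).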
# (requires: import datetime; import math; from tqdm import tqdm)
try:
    for i in range(self.args.epoch_nums):
        # Generate batches
        tic = datetime.datetime.now()
        batches = self.text_data.get_next_batches()
        for next_batch in tqdm(batches, desc="Training"):
            # train_op, summaries, loss = self.seq2seq_model.step(next_batch)
            feed_dict = self.seq2seq_model.step(next_batch)
            _, summaries, loss = self.sess.run(
                (self.train_op, mergedSummaries, self.seq2seq_model.loss),
                feed_dict)
            self.global_step += 1
            self.writer.add_summary(summaries, self.global_step)
            # Output training status
            if self.global_step % 100 == 0:
                perplexity = math.exp(float(loss)) if loss < 300 else float("inf")
                tqdm.write("----- Step %d -- Loss %.2f -- Perplexity %.2f" % (self.global_step, loss, perplexity))
            if self.global_step % self.args.checkpoint_every == 0:
                self.save_session(self.sess, self.global_step)
        toc = datetime.datetime.now()
        print("Epoch finished in {}".format(toc - tic))
except (KeyboardInterrupt, SystemExit):  # If the user presses Ctrl+C during training
    print('Interruption detected, exiting the program...')
    # self.save_session(sess, self.global_step)  # Ultimate saving before complete exit
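The perplexity printed in the loop is simply the exponential of the average per-token cross-entropy, with a cap to avoid overflow for very large losses. A minimal standalone sketch (the function name and the `cap` parameter are illustrative):

```python
import math

def perplexity(loss, cap=300.0):
    """exp(loss), or infinity once the loss is too large to exponentiate meaningfully."""
    return math.exp(loss) if loss < cap else float("inf")

print(round(perplexity(3.0), 2))
print(perplexity(500.0))
```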
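get_next_batches(): whenever one epoch ends and the next begins, the samples are shuffled first. A generator (yield) then slices them into mini-batches, so the result is a shuffled list of ceil(len(train_samples) / batch_size) batches. Note that the raw samples are not padded and are not yet in the list of [batch_size,] layout, so create_batch pads and reshapes them.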
def get_next_batches(self):
    """Prepare the batches for the current epoch
    Return:
        list: Get a list of the batches for the next epoch
    """
    self.shuffle()
    batches = []

    def gen_next_samples():
        """ Generator over the mini-batch training samples
        """
        for i in range(0, len(self.train_samples), self.args.batch_size):
            yield self.train_samples[i:min(i + self.args.batch_size, len(self.train_samples))]

    # TODO: Should replace that by generator (better: by tf.queue)
    for samples in gen_next_samples():
        batch = self.create_batch(samples)
        batches.append(batch)
    return batches
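The shuffle-then-slice pattern can be sketched on its own as a lazy generator, which is roughly what the TODO in the code hints at. `iter_batches` and its `seed` parameter are hypothetical and exist only for this illustration.

```python
import math
import random

def iter_batches(samples, batch_size, seed=None):
    """Shuffle a copy of the samples, then yield consecutive slices of batch_size."""
    order = list(samples)
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        # Slicing past the end is safe in Python, so the last batch may be smaller
        yield order[start:start + batch_size]

batches = list(iter_batches(list(range(10)), batch_size=4, seed=0))
print([len(b) for b in batches])
```

With 10 samples and a batch size of 4 this produces ceil(10 / 4) = 3 batches, the last one holding the 2 leftover samples.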
create_batch(): pads and reshapes the data to meet the input requirements of the embedding_rnn_seq2seq function.
def create_batch(self, samples):
    batch = Batch()
    batch_size = len(samples)
    # Pad the data and build the contents of all four model placeholders
    for i in range(batch_size):
        # Unpack the sample
        sample = samples[i]
        # Reverse inputs (and not outputs), little trick as defined on the original seq2seq paper
        batch.encoderSeqs.append(list(reversed(sample[0])))
        # Add the <go> and <eos> tokens
        batch.decoderSeqs.append([self.sr_word2id[self.goToken]] + sample[1] + [self.sr_word2id[self.eosToken]])
        # Same as decoder, but shifted to the left (ignore the <go>)
        batch.targetSeqs.append(batch.decoderSeqs[-1][1:])
        # Long sentences should have been filtered during the dataset creation
        assert len(batch.encoderSeqs[i]) <= self.args.maxLengthEnco
        assert len(batch.decoderSeqs[i]) <= self.args.maxLengthDeco
        # TODO: Should use tf batch function to automatically add padding and batch samples
        # Add padding & define weight
        batch.encoderSeqs[i] = [self.sr_word2id[self.padToken]] * (
            self.args.maxLengthEnco - len(batch.encoderSeqs[i])) + batch.encoderSeqs[i]  # Left padding for the input
        batch.weights.append(
            [1.0] * len(batch.targetSeqs[i]) + [0.0] * (self.args.maxLengthDeco - len(batch.targetSeqs[i])))
        batch.decoderSeqs[i] = batch.decoderSeqs[i] + [self.sr_word2id[self.padToken]] * (
            self.args.maxLengthDeco - len(batch.decoderSeqs[i]))
        batch.targetSeqs[i] = batch.targetSeqs[i] + [self.sr_word2id[self.padToken]] * (
            self.args.maxLengthDeco - len(batch.targetSeqs[i]))

    # Reshape the data into the list of [batch_size,] layout
    encoderSeqsT = []  # Corrected orientation
    for i in range(self.args.maxLengthEnco):
        encoderSeqT = []
        for j in range(batch_size):
            encoderSeqT.append(batch.encoderSeqs[j][i])
        encoderSeqsT.append(encoderSeqT)
    batch.encoderSeqs = encoderSeqsT

    decoderSeqsT = []
    targetSeqsT = []
    weightsT = []
    for i in range(self.args.maxLengthDeco):
        decoderSeqT = []
        targetSeqT = []
        weightT = []
        for j in range(batch_size):
            decoderSeqT.append(batch.decoderSeqs[j][i])
            targetSeqT.append(batch.targetSeqs[j][i])
            weightT.append(batch.weights[j][i])
        decoderSeqsT.append(decoderSeqT)
        targetSeqsT.append(targetSeqT)
        weightsT.append(weightT)
    batch.decoderSeqs = decoderSeqsT
    batch.targetSeqs = targetSeqsT
    batch.weights = weightsT
    return batch
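A compact way to see what create_batch produces is the sketch below, which performs the same padding and uses zip(*rows) for the batch-major to time-major transposition. The token ids 0, 1, 2 follow the vocabulary section; the function name and the tuple return value are assumptions made for brevity.

```python
PAD, GO, EOS = 0, 1, 2  # ids assigned to the special tokens in the vocabulary step

def pad_and_transpose(samples, max_len_enc, max_len_dec):
    """samples: list of (question_ids, answer_ids). Returns time-major enc, dec, tgt, weights."""
    enc, dec, tgt, wts = [], [], [], []
    for question, answer in samples:
        q = list(reversed(question))                     # reversed encoder input
        enc.append([PAD] * (max_len_enc - len(q)) + q)   # left padding
        d = [GO] + answer + [EOS]
        t = d[1:]                                        # decoder input shifted left by one
        wts.append([1.0] * len(t) + [0.0] * (max_len_dec - len(t)))
        dec.append(d + [PAD] * (max_len_dec - len(d)))   # right padding
        tgt.append(t + [PAD] * (max_len_dec - len(t)))

    def transpose(rows):
        # [batch][time] -> [time][batch]
        return [list(col) for col in zip(*rows)]

    return transpose(enc), transpose(dec), transpose(tgt), transpose(wts)

samples = [([5, 6, 7], [8, 9]), ([10], [11])]
enc, dec, tgt, wts = pad_and_transpose(samples, max_len_enc=4, max_len_dec=5)
print(enc)
```

Each returned list has one entry per time step, and each entry holds one value per sample, which is exactly the shape the list-of-placeholders model expects.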
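self.seq2seq_model.step(next_batch): during training all four placeholders (encoder_inputs, decoder_inputs, decoder_targets, decoder_weights) must be fed, otherwise the loss cannot be computed and training is impossible. During prediction it is enough to feed encoder_inputs and the first time step of decoder_inputs, because with feed_previous=True the decoder feeds each predicted token back in as the next input.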
def step(self, batch):
    """ Build the feed dictionary for one forward/training step.
    Does not call sess.run itself; the caller runs the ops with the returned dictionary
    Args:
        batch (Batch): Input data only in testing mode, input and target in training mode
    Return:
        dict: The feed dictionary mapping the placeholders to the batch data
    """
    # Feed the dictionary
    feedDict = {}
    if not self.args.test:  # Training
        for i in range(self.args.maxLengthEnco):
            feedDict[self.encoder_inputs[i]] = batch.encoderSeqs[i]
        for i in range(self.args.maxLengthDeco):
            feedDict[self.decoder_inputs[i]] = batch.decoderSeqs[i]
            feedDict[self.decoder_targets[i]] = batch.targetSeqs[i]
            feedDict[self.decoder_weights[i]] = batch.weights[i]
        # ops = (self.optOp, self.lossFct)
    else:  # Testing (batchSize == 1)
        for i in range(self.args.maxLengthEnco):
            feedDict[self.encoder_inputs[i]] = batch.encoderSeqs[i]
        # Only the first decoder input (the go token) is fed; the rest comes from feed_previous
        feedDict[self.decoder_inputs[0]] = [self.text_data.sr_word2id[self.text_data.goToken]]
        # ops = (self.outputs,)
    # Return one pass operator
    return feedDict
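At prediction time the model's outputs are one score vector per decoder step. Turning them into a reply is not shown in the post; a minimal greedy sketch, assuming batch size 1 and an eos id of 2, could look like this (`decode_outputs` is a hypothetical helper):

```python
def decode_outputs(step_logits, id2word, eos_id=2):
    """Pick the highest-scoring word at each step and stop at the first eos."""
    words = []
    for logits in step_logits:  # one score vector per decoder time step
        best = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if best == eos_id:
            break
        words.append(id2word[best])
    return " ".join(words)

# Toy vocabulary and scores (hypothetical values)
id2word = {0: "<pad>", 1: "<go>", 2: "<eos>", 3: "<unknown>", 4: "hello", 5: "there"}
step_logits = [
    [0, 0, 0, 0, 9, 1],  # argmax -> "hello"
    [0, 0, 0, 0, 1, 9],  # argmax -> "there"
    [0, 0, 9, 0, 0, 0],  # argmax -> eos, decoding stops here
    [0, 0, 0, 0, 9, 0],  # ignored
]
reply = decode_outputs(step_logits, id2word)
print(reply)
```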
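Training results:
After a little over 7,000 training steps, the loss dropped to a bit above 2 and the perplexity to a bit above 20. The interactive prediction results are shown in the figure below: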