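This is a many-to-one (N-to-1) LSTM model: it takes N characters as input and predicts the probability distribution of the next character.
When generating text, how the next character is chosen matters a great deal. The simplest approach is greedy sampling, which always picks the character with the highest predicted probability as the next input character. However, this tends to produce repetitive, predictable strings that do not read like coherent language.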
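To make the contrast concrete, here is a minimal sketch that compares greedy selection with stochastic sampling on a single fixed distribution. The four-character vocabulary and the probabilities are made up for illustration; they are not output from the model in this article.

```python
import numpy as np

# Hypothetical next-character distribution over a toy 4-character vocabulary.
vocab = ['a', 'b', 'c', 'd']
probs = np.array([0.5, 0.3, 0.15, 0.05])

# Greedy sampling: always take the most probable character.
greedy_picks = [vocab[int(np.argmax(probs))] for _ in range(10)]

# Stochastic sampling: draw each character according to its probability.
rng = np.random.default_rng(0)  # fixed seed only for reproducibility
random_picks = [str(c) for c in rng.choice(vocab, size=10, p=probs)]

print('greedy:    ', ''.join(greedy_picks))   # the same character every time
print('stochastic:', ''.join(random_picks))   # varies from draw to draw
```

In a real model the distribution changes with the context, so greedy output is not literally one character repeated, but the same mechanism is what pushes it into loops.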
import numpy as np

# Sampling function: reweights the distribution by temperature.
# The higher the temperature, the higher the entropy.
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
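The sampling function is shown above: it takes the logarithm of the predictions, divides by temperature, exponentiates, and renormalizes. With a temperature of 1, characters are sampled from the model's original distribution. With a temperature below 1, characters that already have a higher probability become even more likely to be sampled; with a temperature above 1, the distribution flattens and the output becomes more random.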
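The effect of temperature is easier to see on numbers. The sketch below applies the same reweighting math as sample() to a made-up distribution (the probabilities are an assumption for illustration) and prints the reweighted distribution and its entropy for each temperature.

```python
import numpy as np

def reweight(preds, temperature):
    """Rescale a probability distribution by a temperature (same math as sample())."""
    logits = np.log(np.asarray(preds, dtype='float64')) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

def entropy(p):
    """Shannon entropy in nats."""
    return float(-np.sum(p * np.log(p)))

probs = np.array([0.5, 0.3, 0.15, 0.05])  # hypothetical model output
for t in [0.2, 0.5, 1.0, 1.2]:
    q = reweight(probs, t)
    print(t, np.round(q, 3), round(entropy(q), 3))
```

At temperature 1.0 the distribution comes back unchanged; at 0.2 almost all of the mass moves to the most likely character, which is close to greedy sampling.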
import keras
import numpy as np
path = keras.utils.get_file(
#path = './test.txt'
text = open(path,encoding='utf-8').read().lower()
print('Corpus length:', len(text))
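maxlen is the length of each training sequence, and step is the stride between the starting positions of consecutive samples. sentences holds the input sequences of length maxlen, and next_chars holds the character that follows each sequence, which serves as the prediction target. A dict mapping each character to its index is built as well.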
maxlen = 60
step = 3
sentences = []
next_chars = []
# Take a new 60-character sequence every 3 characters
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])  # the character that follows the sequence
print('Number of sentences:', len(sentences))
chars = sorted(list(set(text)))
print('Unique Char:', len(chars))
# Map each character to its index in chars
char_indices = {char: i for i, char in enumerate(chars)}
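The windowing step can be checked on a tiny string. This is a minimal sketch; make_windows is a hypothetical helper introduced only for this example, and the text, maxlen, and step values are chosen to keep the output short.

```python
def make_windows(text, maxlen, step):
    """Return (sentences, next_chars) for a sliding window over text."""
    sentences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sentences.append(text[i:i + maxlen])   # input sequence
        next_chars.append(text[i + maxlen])    # character to predict
    return sentences, next_chars

toy_sentences, toy_next_chars = make_windows('hello world', maxlen=5, step=2)
for s, c in zip(toy_sentences, toy_next_chars):
    print(repr(s), '->', repr(c))
# 'hello' -> ' '
# 'llo w' -> 'o'
# 'o wor' -> 'l'
```

A smaller step produces more, heavily overlapping samples; a larger step produces fewer samples and trains faster.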
Convert sentences and next_chars into vectors, using one-hot encoding here.
# One-hot encode the characters
# (np.bool was removed in recent NumPy releases; the built-in bool works everywhere)
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
# Build the training data
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
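As a small self-contained illustration of the encoding, the sketch below one-hot encodes a single string and decodes it back. The names toy_chars, toy_indices, one_hot, and decode are hypothetical and exist only in this example.

```python
import numpy as np

toy_chars = sorted(set('hello'))                       # ['e', 'h', 'l', 'o']
toy_indices = {c: i for i, c in enumerate(toy_chars)}

def one_hot(sentence):
    """Encode a string as a (len(sentence), vocab_size) boolean matrix."""
    encoded = np.zeros((len(sentence), len(toy_chars)), dtype=bool)
    for t, ch in enumerate(sentence):
        encoded[t, toy_indices[ch]] = True
    return encoded

def decode(encoded):
    """Recover the string by taking the index of the True entry in each row."""
    return ''.join(toy_chars[i] for i in encoded.argmax(axis=1))

matrix = one_hot('hello')
print(matrix.astype(int))
print(decode(matrix))
```

Each row contains exactly one True value, so the full training tensor x has shape (number of sequences, maxlen, number of unique characters).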
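Build the model. It consists of one LSTM layer with 128 units followed by one Dense layer with a softmax activation. The optimizer is RMSprop and the loss is categorical cross-entropy. The sampling function is defined here as well.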
# Build the network
from keras import layers
model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))
# Optimizer (newer Keras versions use learning_rate in place of the deprecated lr argument)
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
# Sampling function: reweights the distribution by temperature.
# The higher the temperature, the higher the entropy.
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
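For a sense of the size of the LSTM-plus-Dense network described above, the parameter count can be worked out by hand. This is a sketch under an assumption: the vocabulary size of 57 is hypothetical, since the real value is whatever len(chars) turns out to be for the corpus. The formulas are the standard ones for an LSTM layer with biases and a fully connected layer.

```python
def lstm_params(input_dim, units):
    # Three gates plus the candidate cell state, each with input weights,
    # recurrent weights, and a bias vector.
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, output_dim):
    # Weight matrix plus bias vector.
    return input_dim * output_dim + output_dim

vocab_size = 57   # hypothetical; use len(chars) for a real corpus
units = 128
total = lstm_params(vocab_size, units) + dense_params(units, vocab_size)
print(lstm_params(vocab_size, units), dense_params(units, vocab_size), total)
# 95232 7353 102585
```

Calling model.summary() on the compiled model reports the actual counts for a given corpus.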
import random
import sys
# Train for 60 epochs
for epoch in range(1, 61):
    print('epoch', epoch)
    model.fit(x, y, batch_size=128, epochs=1)  # train for one epoch
    # Pick a random seed text from the corpus
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index:start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')
    # Compare generation at four different temperatures
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('--- temperature:', temperature)
        sys.stdout.write(generated_text)
        # Generate 400 characters starting from the seed text
        for i in range(400):
            # One-hot encode the current window of text
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1
            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            # Append the new character and drop the oldest one,
            # so the window stays at maxlen characters
            next_char = chars[next_index]
            generated_text += next_char
            generated_text = generated_text[1:]
            # Print the result
            sys.stdout.write(next_char)
        print()
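The core of the generation loop, a window that slides forward by one character per step, can be tried without training anything. The sketch below is an illustration under assumptions: generate and toy_predict are hypothetical names, and toy_predict is a stand-in that returns a fixed distribution where the article's code calls model.predict.

```python
import numpy as np

def generate(predict_fn, seed, vocab, length, temperature=1.0, rng=None):
    """Generate `length` characters by feeding a sliding window to predict_fn."""
    if rng is None:
        rng = np.random.default_rng()
    window = seed
    out = []
    for _ in range(length):
        preds = np.asarray(predict_fn(window), dtype='float64')
        # Temperature reweighting, as in sample()
        logits = np.log(preds) / temperature
        probs = np.exp(logits) / np.sum(np.exp(logits))
        next_char = vocab[rng.choice(len(vocab), p=probs)]
        out.append(next_char)
        # Slide the window: drop the oldest character, append the new one
        window = window[1:] + next_char
    return ''.join(out)

toy_vocab = ['a', 'b']

def toy_predict(window):
    # Stand-in for model.predict: strongly prefers to alternate characters.
    return [0.1, 0.9] if window[-1] == 'a' else [0.9, 0.1]

result = generate(toy_predict, 'ab', toy_vocab, 10,
                  temperature=0.2, rng=np.random.default_rng(0))
print(result)
```

Raising the temperature makes the stand-in break its alternating pattern more often, which mirrors what happens with the trained model.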
With the seed text set to "without partiality for everything most abhorred is closel", the generated text is as follows:
--- temperature: 0.2
without partiality for everything most abhorred is closely the standard of the standard of the spirit of the sense of the predication of the strength of the spirit of the constantially the more and the world of the standing and such a problem of the spirit of the moral and the more and the moral of the states of the probably and the more things and the more and the moral and the spirit of the experience of the sense of the more according to the
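At a temperature of 0.2 the output is mediocre: most words are spelled correctly (except "constantially") and the grammar is fairly fluent, but there is no punctuation.
--- temperature: 1.2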
he youncifrig, totjection in liseral worrs, they were and origin–wesa–resperation. with some doys of , an find. are neverth follow therse what permiged thus from truth hard, too self? at frind; “short,” iumbilitubs and interpretateness magnluenism day for the tastes can be immemocious impigia’te vo to agreedeful howedlle midetely something and the wil knows thereforigment enteughaps.
At a temperature of 1.2 the output is worse: many more nonexistent words appear and the content reads strangely, but punctuation is generated.