Text Generation with LSTM
Let's walk through a small example to see how an LSTM can be used in practice.
We use a biography of Winston Churchill as the training corpus: given a window of preceding characters, the model predicts the character that comes next. A toy illustration of this sliding-window setup is sketched below, before the full script.
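Here is a tiny sketch (my own illustration, not part of the original code) of how a sliding window turns a string into (preceding characters -> next character) training pairs:

text = "winston"
window = 3
for i in range(len(text) - window):
    # each window of 3 characters is an input, the following character is the target
    print(text[i:i + window], "->", text[i + window])
# win -> s
# ins -> t
# nst -> o
# sto -> n

The real script below does exactly this, with a window of 100 characters over the whole book.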
# -*- coding: utf-8 -*-
'''
Text generation with an RNN (LSTM), using a biography of Winston Churchill as the training corpus.
The prediction task is simple: given the preceding characters, which character comes next?
For example: "importan" -> "t", "Winsto" -> "n", "Britai" -> "n".
'''
import numpy as np
from keras.models import Sequential
from keras.layers import Dense,Dropout,LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
raw_text = open("input/Winston_Churchil.txt",encoding="utf-8").read()
raw_text = raw_text.lower()
chars = sorted(list(set(raw_text)))
char_to_int = dict((c,i) for i,c in enumerate(chars))
int_to_char = dict((i,c) for i,c in enumerate(chars))
'''
Build the training data.
We need to turn the raw text into (x, y) pairs we can train on:
x is the window of preceding characters, y is the character that follows.
'''
seg_length = 100
x = []
y = []
for i in range(0,len(raw_text)-seg_length):
    given = raw_text[i:i+seg_length]
    predict = raw_text[i+seg_length]
    x.append([char_to_int[char] for char in given])
    y.append(char_to_int[predict])
'''
At this point x and y are just integer indices (a bag-of-characters style encoding).
Two things remain:
1. Reshape the input indices into the array format an LSTM expects: [samples, time steps, features].
2. For the output, as we saw with Word2Vec, predicting a one-hot vector works better than
   regressing a single integer label y.
'''
n_patterns = len(x)
n_vocab = len(chars)
# reshape x into the form the LSTM expects: [samples, time steps, features]
x = np.reshape(x,(n_patterns,seg_length,1))
# simple normalization into the 0-1 range
x = x / float(n_vocab)
# turn the output into one-hot vectors
y = np_utils.to_categorical(y)
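# sanity check (my own addition, not in the original script): x should be
# (n_patterns, seg_length, 1) floats in [0, 1], y should be (n_patterns, n_vocab) one-hot rows
print(x.shape, y.shape)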
'''
Build the model (LSTM)
'''
model = Sequential()
model.add(LSTM(256,input_shape=(x.shape[1],x.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1],activation="softmax"))
model.compile(loss="categorical_crossentropy",optimizer="adam")
model.fit(x,y,nb_epoch=50,batch_size=4096)
'''
Test helpers: see what the trained LSTM generates.
'''
def predict_next(input_array):
    # reshape the last seg_length character indices and run one forward pass
    x = np.reshape(input_array,(1,seg_length,1))
    x = x / float(n_vocab)
    y = model.predict(x)
    return y

def string_to_index(raw_input):
    # map the last seg_length characters to their integer indices
    res = []
    for c in raw_input[(len(raw_input)-seg_length):]:
        res.append(char_to_int[c])
    return res

def y_to_char(y):
    # take the character with the highest predicted probability
    largest_index = y.argmax()
    c = int_to_char[largest_index]
    return c

def generate_article(init, rounds=200):
    # repeatedly predict the next character and append it to the running string
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_char(predict_next(string_to_index(in_string)))
        in_string += n
    return in_string
init = 'His object in coming to New York was to engage officers for that service. He came at an opportune moment'
article = generate_article(init)
print(article)
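One aside: ModelCheckpoint is imported at the top but never used. Since 50 epochs at roughly 200 s each take a few hours, saving weights during training is worth doing; a minimal sketch of how it could be wired in (the filename pattern and monitor choice are my own, not from the original):

checkpoint = ModelCheckpoint('weights-{epoch:02d}-{loss:.4f}.hdf5', monitor='loss', save_best_only=True)
model.fit(x, y, nb_epoch=50, batch_size=4096, callbacks=[checkpoint])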
Training process
Epoch 1/50
276730/276730 [==============================] - 197s - loss: 3.1120
Epoch 2/50
276730/276730 [==============================] - 197s - loss: 3.0227
Epoch 3/50
276730/276730 [==============================] - 197s - loss: 2.9910
Epoch 4/50
276730/276730 [==============================] - 197s - loss: 2.9337
Epoch 5/50
276730/276730 [==============================] - 197s - loss: 2.8971
Epoch 6/50
276730/276730 [==============================] - 197s - loss: 2.8784
Epoch 7/50
276730/276730 [==============================] - 197s - loss: 2.8640
Epoch 8/50
276730/276730 [==============================] - 197s - loss: 2.8516
Epoch 9/50
276730/276730 [==============================] - 197s - loss: 2.8384
Epoch 10/50
276730/276730 [==============================] - 197s - loss: 2.8254
Epoch 11/50
276730/276730 [==============================] - 197s - loss: 2.8133
Epoch 12/50
276730/276730 [==============================] - 197s - loss: 2.8032
Epoch 13/50
276730/276730 [==============================] - 197s - loss: 2.7913
Epoch 14/50
276730/276730 [==============================] - 197s - loss: 2.7831
Epoch 15/50
276730/276730 [==============================] - 197s - loss: 2.7744
Epoch 16/50
276730/276730 [==============================] - 197s - loss: 2.7672
Epoch 17/50
276730/276730 [==============================] - 197s - loss: 2.7601
Epoch 18/50
276730/276730 [==============================] - 197s - loss: 2.7540
Epoch 19/50
276730/276730 [==============================] - 197s - loss: 2.7477
Epoch 20/50
276730/276730 [==============================] - 197s - loss: 2.7418
Epoch 21/50
276730/276730 [==============================] - 197s - loss: 2.7360
Epoch 22/50
276730/276730 [==============================] - 197s - loss: 2.7296
Epoch 23/50
276730/276730 [==============================] - 197s - loss: 2.7238
Epoch 24/50
276730/276730 [==============================] - 197s - loss: 2.7180
Epoch 25/50
276730/276730 [==============================] - 197s - loss: 2.7113
Epoch 26/50
276730/276730 [==============================] - 197s - loss: 2.7055
Epoch 27/50
276730/276730 [==============================] - 197s - loss: 2.7000
Epoch 28/50
276730/276730 [==============================] - 197s - loss: 2.6934
Epoch 29/50
276730/276730 [==============================] - 197s - loss: 2.6859
Epoch 30/50
276730/276730 [==============================] - 197s - loss: 2.6800
Epoch 31/50
276730/276730 [==============================] - 197s - loss: 2.6741
Epoch 32/50
276730/276730 [==============================] - 197s - loss: 2.6669
Epoch 33/50
276730/276730 [==============================] - 197s - loss: 2.6593
Epoch 34/50
276730/276730 [==============================] - 197s - loss: 2.6529
Epoch 35/50
276730/276730 [==============================] - 197s - loss: 2.6461
Epoch 36/50
276730/276730 [==============================] - 197s - loss: 2.6385
Epoch 37/50
276730/276730 [==============================] - 197s - loss: 2.6320
Epoch 38/50
276730/276730 [==============================] - 197s - loss: 2.6249
Epoch 39/50
276730/276730 [==============================] - 197s - loss: 2.6187
Epoch 40/50
276730/276730 [==============================] - 197s - loss: 2.6110
Epoch 41/50
276730/276730 [==============================] - 192s - loss: 2.6039
Epoch 42/50
276730/276730 [==============================] - 141s - loss: 2.5969
Epoch 43/50
276730/276730 [==============================] - 140s - loss: 2.5909
Epoch 44/50
276730/276730 [==============================] - 140s - loss: 2.5843
Epoch 45/50
276730/276730 [==============================] - 140s - loss: 2.5763
Epoch 46/50
276730/276730 [==============================] - 140s - loss: 2.5697
Epoch 47/50
276730/276730 [==============================] - 141s - loss: 2.5635
Epoch 48/50
276730/276730 [==============================] - 140s - loss: 2.5575
Epoch 49/50
276730/276730 [==============================] - 140s - loss: 2.5496
Epoch 50/50
276730/276730 [==============================] - 140s - loss: 2.5451
Generated output
his object in coming to new york was to engage officers for that service. he came at an opportune moment th the toote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soote of the carie and the soo
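The text above quickly falls into a loop ("the soote of the carie and ...") because generate_article always takes the argmax character, so once the model enters a high-probability cycle it never leaves. A common remedy is to sample from the predicted distribution, optionally sharpened or flattened by a temperature; a hedged sketch (my own addition, not in the original):

def sample_char(y, temperature=0.8):
    # sample an index from the predicted distribution instead of always taking the argmax
    p = np.log(y.flatten() + 1e-8) / temperature
    p = np.exp(p) / np.sum(np.exp(p))
    return int_to_char[np.random.choice(len(p), p=p)]

Swapping y_to_char for sample_char inside generate_article usually gives more varied, if noisier, text.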
Word-level generation with Word2Vec + LSTM

Predicting one character at a time is crude, so the next version works at the word level: each word is represented by its Word2Vec vector, and the model is trained to predict the vector of the next word.

import os
import numpy as np
import nltk
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from gensim.models.word2vec import Word2Vec
raw_text = ''
for file in os.listdir("../input/"):
    if file.endswith(".txt"):
        raw_text += open("../input/"+file, errors='ignore').read() + '\n\n'
# raw_text = open('../input/Winston_Churchil.txt').read()
raw_text = raw_text.lower()
sentensor = nltk.data.load('tokenizers/punkt/english.pickle')
# load the pre-trained English sentence tokenizer (Punkt)
sents = sentensor.tokenize(raw_text)
corpus = []
for sen in sents:
    corpus.append(nltk.word_tokenize(sen))
w2v_model = Word2Vec(corpus, size=128, window=5, min_count=5, workers=4)
raw_input = [item for sublist in corpus for item in sublist]
text_stream = []
vocab = w2v_model.vocab
for word in raw_input:
    if word in vocab:
        text_stream.append(word)
seq_length = 10
x = []
y = []
for i in range(0, len(text_stream) - seq_length):
    given = text_stream[i:i + seq_length]
    predict = text_stream[i + seq_length]
    x.append(np.array([w2v_model[word] for word in given]))
    y.append(w2v_model[predict])
x = np.reshape(x, (-1, seq_length, 128))
y = np.reshape(y, (-1,128))
model = Sequential()
model.add(LSTM(256, dropout_W=0.2, dropout_U=0.2, input_shape=(seq_length, 128)))
model.add(Dropout(0.2))
model.add(Dense(128, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam')
model.fit(x, y, nb_epoch=50, batch_size=4096)
def predict_next(input_array):
    x = np.reshape(input_array, (-1,seq_length,128))
    y = model.predict(x)
    return y

def string_to_index(raw_input):
    # despite the name, this returns the Word2Vec vectors of the last seq_length words
    raw_input = raw_input.lower()
    input_stream = nltk.word_tokenize(raw_input)
    res = []
    for word in input_stream[(len(input_stream)-seq_length):]:
        res.append(w2v_model[word])
    return res

def y_to_word(y):
    # find the vocabulary word whose vector is closest to the predicted vector
    word = w2v_model.most_similar(positive=y, topn=1)
    return word

def generate_article(init, rounds=30):
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_word(predict_next(string_to_index(in_string)))
        in_string += ' ' + n[0][0]
    return in_string
init = 'Language Models allow us to measure how likely a sentence is, which is an important for Machine'
article = generate_article(init)
print(article)
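Because y_to_word maps a predicted 128-dimensional vector back to a word with most_similar, it is worth sanity-checking that round trip before spending hours on training. A small sketch (my own addition, assuming the same older gensim API used above, where w2v_model[word] returns a word's vector and w2v_model.vocab is the vocabulary dict):

probe = list(w2v_model.vocab)[0]
vec = w2v_model[probe]
# the probe word itself should come back as (or near) the top neighbour of its own vector
print(probe, w2v_model.most_similar(positive=[vec], topn=3))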
Training process
Epoch 1/50
2058743/2058743 [==============================] - 150s - loss: 0.6839
Epoch 2/50
2058743/2058743 [==============================] - 150s - loss: 0.6670
Epoch 3/50
2058743/2058743 [==============================] - 150s - loss: 0.6625
Epoch 4/50
2058743/2058743 [==============================] - 150s - loss: 0.6598
Epoch 5/50
2058743/2058743 [==============================] - 150s - loss: 0.6577
Epoch 6/50
2058743/2058743 [==============================] - 150s - loss: 0.6562
Epoch 7/50
2058743/2058743 [==============================] - 150s - loss: 0.6549
Epoch 8/50
2058743/2058743 [==============================] - 150s - loss: 0.6537
Epoch 9/50
2058743/2058743 [==============================] - 150s - loss: 0.6527
Epoch 10/50
2058743/2058743 [==============================] - 150s - loss: 0.6519
Epoch 11/50
2058743/2058743 [==============================] - 150s - loss: 0.6512
Epoch 12/50
2058743/2058743 [==============================] - 150s - loss: 0.6506
Epoch 13/50
2058743/2058743 [==============================] - 150s - loss: 0.6500
Epoch 14/50
2058743/2058743 [==============================] - 150s - loss: 0.6496
Epoch 15/50
2058743/2058743 [==============================] - 150s - loss: 0.6492
Epoch 16/50
2058743/2058743 [==============================] - 150s - loss: 0.6488
Epoch 17/50
2058743/2058743 [==============================] - 151s - loss: 0.6485
Epoch 18/50
2058743/2058743 [==============================] - 150s - loss: 0.6482
Epoch 19/50
2058743/2058743 [==============================] - 150s - loss: 0.6480
Epoch 20/50
2058743/2058743 [==============================] - 150s - loss: 0.6477
Epoch 21/50
2058743/2058743 [==============================] - 150s - loss: 0.6475
Epoch 22/50
2058743/2058743 [==============================] - 150s - loss: 0.6473
Epoch 23/50
2058743/2058743 [==============================] - 150s - loss: 0.6471
Epoch 24/50
2058743/2058743 [==============================] - 150s - loss: 0.6470
Epoch 25/50
2058743/2058743 [==============================] - 150s - loss: 0.6468
Epoch 26/50
2058743/2058743 [==============================] - 150s - loss: 0.6466
Epoch 27/50
2058743/2058743 [==============================] - 150s - loss: 0.6464
Epoch 28/50
2058743/2058743 [==============================] - 150s - loss: 0.6463
Epoch 29/50
2058743/2058743 [==============================] - 150s - loss: 0.6462
Epoch 30/50
2058743/2058743 [==============================] - 150s - loss: 0.6461
Epoch 31/50
2058743/2058743 [==============================] - 150s - loss: 0.6460
Epoch 32/50
2058743/2058743 [==============================] - 150s - loss: 0.6458
Epoch 33/50
2058743/2058743 [==============================] - 150s - loss: 0.6458
Epoch 34/50
2058743/2058743 [==============================] - 150s - loss: 0.6456
Epoch 35/50
2058743/2058743 [==============================] - 150s - loss: 0.6456
Epoch 36/50
2058743/2058743 [==============================] - 150s - loss: 0.6455
Epoch 37/50
2058743/2058743 [==============================] - 150s - loss: 0.6454
Epoch 38/50
2058743/2058743 [==============================] - 150s - loss: 0.6453
Epoch 39/50
2058743/2058743 [==============================] - 150s - loss: 0.6452
Epoch 40/50
2058743/2058743 [==============================] - 150s - loss: 0.6452
Epoch 41/50
2058743/2058743 [==============================] - 150s - loss: 0.6451
Epoch 42/50
2058743/2058743 [==============================] - 150s - loss: 0.6450
Epoch 43/50
2058743/2058743 [==============================] - 150s - loss: 0.6450
Epoch 44/50
2058743/2058743 [==============================] - 150s - loss: 0.6449
Epoch 45/50
2058743/2058743 [==============================] - 150s - loss: 0.6448
Epoch 46/50
2058743/2058743 [==============================] - 150s - loss: 0.6447
Epoch 47/50
2058743/2058743 [==============================] - 150s - loss: 0.6447
Epoch 48/50
2058743/2058743 [==============================] - 150s - loss: 0.6446
Epoch 49/50
2058743/2058743 [==============================] - 150s - loss: 0.6446
Epoch 50/50
2058743/2058743 [==============================] - 150s - loss: 0.6445
Generated output
language models allow us to measure how likely a sentence is, which is an important for machine engagement . to-day good-for-nothing fit job job job job job . i feel thing job job job ; thing really done certainly job job ; but i need not say
A CNN example: cats vs. dogs image classification

For comparison with the recurrent models above, here is a convolutional network (a small VGG-style stack, as the loss-plot title below suggests) that classifies cat and dog photos using the same Keras workflow.

import os, cv2, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns
from keras.models import Sequential
from keras.layers import Input, Dropout, Flatten, Convolution2D, MaxPooling2D, Dense, Activation
from keras.optimizers import RMSprop
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping
from keras.utils import np_utils
from keras import backend as K
K.set_image_dim_ordering('th')
'''
Prepare the data:
load the images in, taking 200 cats and 200 dogs.
'''
train_dir = 'G:/KNNtest/RNN/train/'
test_dir = 'G:/KNNtest/RNN/test/'
# read dogs and cats separately (label: 1 = dog, 0 = cat)
train_dogs = [(train_dir+i,1) for i in os.listdir(train_dir) if 'dog' in i]
train_cats = [(train_dir+i,0) for i in os.listdir(train_dir) if 'cat' in i]
# -1 is just a placeholder label for the unlabeled test images
test_images = [(test_dir+i,-1) for i in os.listdir(test_dir)]
# build the training set; to keep memory use and runtime manageable, take 200 dogs and 200 cats
train_images = train_dogs[:200] + train_cats[:200]
# shuffle the training data
random.shuffle(train_images)
test_images = test_images[:10]
# read images with OpenCV and resize them all to 64x64
rows = 64
cols = 64
def read_image(tuple_set):
    file_path = tuple_set[0]
    label = tuple_set[1]
    # cv2.imread loads in color here; cv2.IMREAD_GRAYSCALE is the grayscale alternative
    img = cv2.imread(file_path)
    # interpolation: cv2.INTER_CUBIC / cv2.INTER_LINEAR for zooming, cv2.INTER_AREA for shrinking
    return cv2.resize(img,(rows,cols),interpolation=cv2.INTER_CUBIC),label
# preprocess the images into a single numpy array
channels = 3  # RGB
def prep_data(images):
    no_images = len(images)
    data = np.ndarray((no_images,channels,rows,cols),dtype=np.uint8)
    labels = []
    for i,image_file in enumerate(images):
        image,label = read_image(image_file)
        # transpose to channels-first (channels, rows, cols) to match the 'th' dim ordering
        data[i] = image.T
        labels.append(label)
    return data,labels
x_train,y_train = prep_data(train_images)
x_test,y_test = prep_data(test_images)
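Note that the pixels go into the network as raw 0-255 uint8 values. The character-level LSTM above scaled its inputs into the 0-1 range, and the same idea applies here; an optional tweak (my own addition, not part of the original script):

# optional: scale pixels to [0, 1], mirroring the normalization used in the LSTM example
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0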
'''
Build the model
'''
optimizer = RMSprop(lr=1e-4)
objective = 'binary_crossentropy'
model = Sequential()
model.add(Convolution2D(32, 3, 3, border_mode='same', input_shape=(3, rows, cols), activation='relu'))
model.add(Convolution2D(32, 3, 3, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(64, 3, 3, border_mode='same', activation='relu'))
model.add(Convolution2D(64, 3, 3, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(128, 3, 3, border_mode='same', activation='relu'))
model.add(Convolution2D(128, 3, 3, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(256, 3, 3, border_mode='same', activation='relu'))
model.add(Convolution2D(256, 3, 3, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss=objective, optimizer=optimizer, metrics=['accuracy'])
'''
Train and test
'''
nb_epoch = 10    # number of passes over the full training set
batch_size = 10  # number of samples per gradient update
# record the losses after each epoch so we can plot them later
class LossHistory(Callback):
    def on_train_begin(self, logs={}):
        self.losses = []
        self.val_losses = []

    def on_epoch_end(self, epoch, logs={}):
        # the loss for an epoch is only available once the epoch has finished
        self.losses.append(logs.get('loss'))
        self.val_losses.append(logs.get('val_loss'))
early_stopping = EarlyStopping(monitor='val_loss', patience=3, verbose=1, mode='auto')
# run the model
history = LossHistory()
model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          validation_split=0.2, verbose=0, shuffle=True, callbacks=[history, early_stopping])
predictions = model.predict(x_test, verbose=0)
loss = history.losses
val_loss = history.val_losses
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('VGG-16 Loss Trend')
plt.plot(loss, 'blue', label='Training Loss')
plt.plot(val_loss, 'green', label='Validation Loss')
plt.xticks(range(0,nb_epoch)[0::2])
plt.legend()
plt.show()
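The script ends with the loss plot, and predictions is never actually inspected. A minimal follow-up sketch (my own addition) for turning the sigmoid outputs into labels:

# each prediction is the probability of "dog" (label 1); threshold at 0.5
for (path, _), p in zip(test_images, predictions):
    print(path, 'dog' if p[0] > 0.5 else 'cat', float(p[0]))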