Keras learning notes: SimpleRNN with Keras — generating text


These notes were originally written in English; since most of the content is code, I didn't bother translating them into Chinese :)

RNNs have been used extensively by the natural language processing (NLP) community for various applications. One such application is building language models. A language model allows us to predict the probability of a word in a text given the previous words. Language models are important for various higher level tasks such as machine translation, spelling correction, and so on.
A side effect of the ability to predict the next word given previous words is a generative model that allows us to generate text by sampling from the output probabilities. In language modeling, our input is typically a sequence of words and the output is a sequence of predicted words. The training data used is existing unlabeled text, where we set the label yt at time t to be the input xt+1 at time t+1.
For our first example of using Keras to build RNNs, we will train a character-based language model on the text of Alice in Wonderland to predict the next character given the 10 previous characters.
We have chosen to build a character-based model here because it has a smaller vocabulary and trains quicker. The idea is the same as using a word-based language model, except we use characters instead of words. We will then use the trained model to generate some text in the same style.
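To make the labeling scheme above concrete before walking through the notebook, here is a tiny sketch (my addition, not part of the original notes) that slices a toy string into (10-character window, next character) pairs — the same slicing is applied to the full Alice text further below:

# Toy illustration of the y_t = x_{t+1} labeling scheme (not in the original notebook)
toy_text = "hello world"
SEQLEN = 10
for i in range(len(toy_text) - SEQLEN):
    print(repr(toy_text[i:i + SEQLEN]), "->", repr(toy_text[i + SEQLEN]))
# prints: 'hello worl' -> 'd'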

from __future__ import print_function
from keras.layers import Dense, Activation
from keras.layers.recurrent import SimpleRNN
from keras.models import Sequential
import numpy as np

from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
import matplotlib.pyplot as plt
%matplotlib inline

Using TensorFlow backend.

In this example, we use Project Gutenberg's Alice's Adventures in Wonderland as the dataset. With a trained RNN model, the computer can then generate text of its own.
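The notebook assumes the text has already been saved to data/Alice.txt. As a rough sketch (my addition, assuming Python 3 and the Project Gutenberg plain-text URL for ebook #11, which may change), the file could be fetched like this:

# Possible way to download the dataset (not part of the original notebook)
import os
import urllib.request

if not os.path.exists("data/Alice.txt"):
    os.makedirs("data", exist_ok=True)
    urllib.request.urlretrieve("https://www.gutenberg.org/files/11/11-0.txt",
                               "data/Alice.txt")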

fp = open("data/Alice.txt", 'rb')
lines = []
for line in fp:
    line = line.strip().lower()
    line = line.decode("ascii", "ignore")
    if len(line) == 0:
        continue
    lines.append(line)
fp.close()
text = " ".join(lines)
print(text[800:3000])
# print a sample of the text (characters 800 to 3000)
ing nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, and what is the use of a book, thought alice without pictures or conversations? so she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a white rabbit with pink eyes ran close by her. there was nothing so very remarkable in that; nor did alice think it so very much out of the way to hear the rabbit say to itself, oh dear! oh dear! i shall be late! (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge. in another moment down went alice after it, never once considering how in the world she was to get out again. the rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that alice had not a moment to think about stopping herself before she found herself falling down a very deep well. either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. first, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs. she took down a jar from one of the shelves as she passed; it was labelled orange marmalade, but to her great disappointment it was empty: she did not like to drop the jar for fear of killing somebody, 
# build the character vocabulary and lookup tables from the text
chars = set([c for c in text])
nb_chars = len(chars)
char2index = dict((c, i) for i, c in enumerate(chars))
index2char = dict((i, c) for i, c in enumerate(chars))
print('vocabulary size:', nb_chars, '\r\nvocabulary contents:\r\n', chars)
print('vocabulary index:', char2index)
#char2index
#index2char
vocabulary size: 55 
vocabulary contents:
 {'l', 'a', 'i', ',', '4', '?', 'c', 'b', 'd', 'u', '-', ']', 'z', 'p', 'h', '%', ' ', 'y', ';', 'g', '2', '@', '.', ':', '6', '9', 'k', '[', 'e', '*', 'v', '!', 'r', 'n', '#', '3', '8', '1', 'o', 'w', '5', '(', '/', '7', '$', 'x', 't', ')', 'm', 'j', '0', '_', 'f', 'q', 's'}
vocabulary index: {'l': 0, 'a': 1, 'i': 2, ',': 3, '4': 4, '?': 5, 'c': 6, 'b': 7, 'd': 8, 'u': 9, '-': 10, ']': 11, 'z': 12, 'p': 13, 'h': 14, '%': 15, ' ': 16, 'y': 17, ';': 18, 'g': 19, '2': 20, '@': 21, '.': 22, ':': 23, '6': 24, '9': 25, 'k': 26, '[': 27, 'e': 28, '*': 29, 'v': 30, '!': 31, 'r': 32, 'n': 33, '#': 34, '3': 35, '8': 36, '1': 37, 'o': 38, 'w': 39, '5': 40, '(': 41, '/': 42, '7': 43, '$': 44, 'x': 45, 't': 46, ')': 47, 'm': 48, 'j': 49, '0': 50, '_': 51, 'f': 52, 'q': 53, 's': 54}
SEQLEN = 10
STEP = 1
input_chars = []
label_chars = []
for i in range(0, len(text) - SEQLEN, STEP):
    input_chars.append(text[i:i + SEQLEN])
    label_chars.append(text[i + SEQLEN])

#input_chars
#label_chars
print('input_chars:', input_chars[:100])
print('\r\nlabel_chars:', label_chars[:100])

input_chars: ['project gu', 'roject gut', 'oject gute', 'ject guten', 'ect gutenb', 'ct gutenbe', 't gutenber', ' gutenberg', 'gutenbergs', 'utenbergs ', 'tenbergs a', 'enbergs al', 'nbergs ali', 'bergs alic', 'ergs alice', 'rgs alices', 'gs alices ', 's alices a', ' alices ad', 'alices adv', 'lices adve', 'ices adven', 'ces advent', 'es adventu', 's adventur', ' adventure', 'adventures', 'dventures ', 'ventures i', 'entures in', 'ntures in ', 'tures in w', 'ures in wo', 'res in won', 'es in wond', 's in wonde', ' in wonder', 'in wonderl', 'n wonderla', ' wonderlan', 'wonderland', 'onderland,', 'nderland, ', 'derland, b', 'erland, by', 'rland, by ', 'land, by l', 'and, by le', 'nd, by lew', 'd, by lewi', ', by lewis', ' by lewis ', 'by lewis c', 'y lewis ca', ' lewis car', 'lewis carr', 'ewis carro', 'wis carrol', 'is carroll', 's carroll ', ' carroll t', 'carroll th', 'arroll thi', 'rroll this', 'roll this ', 'oll this e', 'll this eb', 'l this ebo', ' this eboo', 'this ebook', 'his ebook ', 'is ebook i', 's ebook is', ' ebook is ', 'ebook is f', 'book is fo', 'ook is for', 'ok is for ', 'k is for t', ' is for th', 'is for the', 's for the ', ' for the u', 'for the us', 'or the use', 'r the use ', ' the use o', 'the use of', 'he use of ', 'e use of a', ' use of an', 'use of any', 'se of anyo', 'e of anyon', ' of anyone', 'of anyone ', 'f anyone a', ' anyone an', 'anyone any', 'nyone anyw']

label_chars: ['t', 'e', 'n', 'b', 'e', 'r', 'g', 's', ' ', 'a', 'l', 'i', 'c', 'e', 's', ' ', 'a', 'd', 'v', 'e', 'n', 't', 'u', 'r', 'e', 's', ' ', 'i', 'n', ' ', 'w', 'o', 'n', 'd', 'e', 'r', 'l', 'a', 'n', 'd', ',', ' ', 'b', 'y', ' ', 'l', 'e', 'w', 'i', 's', ' ', 'c', 'a', 'r', 'r', 'o', 'l', 'l', ' ', 't', 'h', 'i', 's', ' ', 'e', 'b', 'o', 'o', 'k', ' ', 'i', 's', ' ', 'f', 'o', 'r', ' ', 't', 'h', 'e', ' ', 'u', 's', 'e', ' ', 'o', 'f', ' ', 'a', 'n', 'y', 'o', 'n', 'e', ' ', 'a', 'n', 'y', 'w', 'h']
# The next step is to vectorize these input and label texts
X = np.zeros((len(input_chars), SEQLEN, nb_chars), dtype=np.bool)
y = np.zeros((len(input_chars), nb_chars), dtype=np.bool)
for i, input_char in enumerate(input_chars):
    for j, ch in enumerate(input_char):
        X[i, j, char2index[ch]] = 1
    y[i, char2index[label_chars[i]]] = 1
    
print('X[0]', X[0]*1)
print('y[0]', y[0]*1)
X[0] [[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
y[0] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]

Finally, we are ready to build our model. We define the RNN's output dimension to have a size of 128 (HIDDEN_SIZE below). This is a hyperparameter that needs to be determined by experimentation. In general, if we choose too small a size, the model does not have sufficient capacity to generate good text, and you will see long runs of repeating characters or repeating word groups. On the other hand, if the chosen value is too large, the model has too many parameters and needs a lot more data to train effectively. We want to return a single character as output, not a sequence of characters, so return_sequences=False. We have already seen that the input to the RNN is of shape (SEQLEN, nb_chars). In addition, we set unroll=True because it improves performance on the TensorFlow backend.

The RNN is connected to a dense (fully connected) layer with nb_chars units, which emits a score for each character in the vocabulary. The activation on the dense layer is a softmax, which normalizes the scores to probabilities; the character with the highest probability is chosen as the prediction. We compile the model with the categorical cross-entropy loss function, a good loss function for categorical outputs, and the RMSprop optimizer:
# build our model
HIDDEN_SIZE = 128
BATCH_SIZE = 128
NUM_ITERATIONS = 25
NUM_EPOCHS_PER_ITERATION = 1
NUM_PREDS_PER_EPOCH = 100
model = Sequential()
model.add(SimpleRNN(HIDDEN_SIZE, return_sequences=False, input_shape=(SEQLEN, nb_chars), unroll=True))
model.add(Dense(nb_chars))
model.add(Activation("softmax"))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
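
As a quick sanity check (my addition, not in the original notes), the parameter count follows directly from the shapes: the SimpleRNN layer has nb_chars x HIDDEN_SIZE input weights, HIDDEN_SIZE x HIDDEN_SIZE recurrent weights, and HIDDEN_SIZE biases; the Dense layer has HIDDEN_SIZE x nb_chars weights plus nb_chars biases. The snippet below just does the arithmetic; model.summary() should report the same totals.

# Expected parameter counts for the model above (sanity check, my addition)
rnn_params = nb_chars * HIDDEN_SIZE + HIDDEN_SIZE * HIDDEN_SIZE + HIDDEN_SIZE
dense_params = HIDDEN_SIZE * nb_chars + nb_chars
print(rnn_params, dense_params, rnn_params + dense_params)
# 23552 7095 30647 -- compare with model.summary()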

for iteration in range(NUM_ITERATIONS):
    print("=" * 50)
    print("Iteration #: %d" % (iteration))
    model.fit(X, y, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS_PER_ITERATION)
    test_idx = np.random.randint(len(input_chars))
    test_chars = input_chars[test_idx]
    print("Generating from seed: %s" % (test_chars))
    print(test_chars, end="")
    for i in range(NUM_PREDS_PER_EPOCH):
        Xtest = np.zeros((1, SEQLEN, nb_chars))
        for j, ch in enumerate(test_chars):
            Xtest[0, j, char2index[ch]] = 1
        pred = model.predict(Xtest, verbose=0)[0]
        ypred = index2char[np.argmax(pred)]
        print(ypred, end="")
        # move forward with test_chars + ypred
        test_chars = test_chars[1:] + ypred    
    print('\r\n')

==================================================
Iteration #: 0
Epoch 1/1
158773/158773 [==============================] - 18s - loss: 2.3445    
Generating from seed: n a fight 
n a fight he could the so the the the the the the the the the the the the the the the the the the the the the ==================================================
Iteration #: 1
Epoch 1/1
158773/158773 [==============================] - 17s - loss: 2.0517    
Generating from seed: suppose it
suppose it the was in the was in the was in the was in the was in the was in the was in the was in the was in ==================================================
Iteration #: 2
Epoch 1/1
158773/158773 [==============================] - 18s - loss: 1.9478    
Generating from seed: moment to 
moment to the mouth ore the mouth ore the mouth ore the mouth ore the mouth ore the mouth ore the mouth ore th==================================================
Iteration #: 3
Epoch 1/1
158773/158773 [==============================] - 17s - loss: 1.8648    
Generating from seed: guests to 
guests to her sood to the project gutenberg-t the roust i she had the mouse to great out to the project gutenb==================================================
Iteration #: 4
Epoch 1/1
158773/158773 [==============================] - 18s - loss: 1.7956    
Generating from seed:  are confi
 are confing and and and and and and and and and and and and and and and and and and and and and and and and a==================================================
Iteration #: 5
Epoch 1/1
158773/158773 [==============================] - 18s - loss: 1.7410    
Generating from seed: odes that 
odes that she was a the mouse the mock turtle she was a the mouse the mock turtle she was a the mouse the mock==================================================
Iteration #: 6
Epoch 1/1
158773/158773 [==============================] - 18s - loss: 1.6942    
Generating from seed: isplaying,
isplaying, and the more to the said to herself and looked the was the mack to the was the mack to the was the ==================================================
Iteration #: 7
Epoch 1/1
158773/158773 [==============================] - 19s - loss: 1.6559    
Generating from seed: began. you
began. you know it the gryphon a soom alice said the gryphon a soom alice said the gryphon a soom alice said t==================================================
Iteration #: 8
Epoch 1/1
158773/158773 [==============================] - 19s - loss: 1.6233    
Generating from seed: : what a c
: what a conting and be a little the project gutenberg-tm electronic works and the poor and the poor and the p==================================================
Iteration #: 9
Epoch 1/1
158773/158773 [==============================] - 18s - loss: 1.5940    
Generating from seed: on the gro
on the gropped the round the round the round the round the round the round the round the round the round the r==================================================
Iteration #: 10
Epoch 1/1
158773/158773 [==============================] - 19s - loss: 1.5691    
Generating from seed: again, for
again, for the mock turtle with the roust on the work of the mock turtle with the roust on the work of the moc==================================================
Iteration #: 11
Epoch 1/1
158773/158773 [==============================] - 18s - loss: 1.5472    
Generating from seed: g; and the
g; and the fill have in the fill have in the fill have in the fill have in the fill have in the fill have in t==================================================
Iteration #: 12
Epoch 1/1
158773/158773 [==============================] - 18s - loss: 1.5289    
Generating from seed: ot i! said
ot i! said alice, and the cours, and the cours, and the cours, and the cours, and the cours, and the cours, an==================================================
Iteration #: 13
Epoch 1/1
158773/158773 [==============================] - 19s - loss: 1.5119    
Generating from seed: very sleep
very sleep of the work out of the work out of the work out of the work out of the work out of the work out of ==================================================
Iteration #: 14
Epoch 1/1
158773/158773 [==============================] - 19s - loss: 1.4959    
Generating from seed: im sure im
im sure important the round of the eart the mouse was so the dormouse in the mouse was so the dormouse in the ==================================================
Iteration #: 15
Epoch 1/1
158773/158773 [==============================] - 19s - loss: 1.4822    
Generating from seed: olent shak
olent shake the parther alice was not a canten a consing and she had the parther alice was not a canten a cons==================================================
Iteration #: 16
Epoch 1/1
158773/158773 [==============================] - 19s - loss: 1.4691    
Generating from seed: escopes: t
escopes: the doom the door alice was not a song the door alice was not a song the door alice was not a song th==================================================
Iteration #: 17
Epoch 1/1
158773/158773 [==============================] - 19s - loss: 1.4588    
Generating from seed: to execute
to executer in a lough his by the cours, who was a little began in a lough his by the cours, who was a little ==================================================
Iteration #: 18
Epoch 1/1
158773/158773 [==============================] - 19s - loss: 1.4481    
Generating from seed: could let 
could let the door the sure, it was the caterpillar. the come to the could he was to the could he was to the c==================================================
Iteration #: 19
Epoch 1/1
158773/158773 [==============================] - 19s - loss: 1.4388    
Generating from seed: es, and th
es, and the mouse of the mouse of the mouse of the mouse of the mouse of the mouse of the mouse of the mouse o==================================================
Iteration #: 20
Epoch 1/1
158773/158773 [==============================] - 19s - loss: 1.4302    
Generating from seed: t, could n
t, could not me the said to the some to the some to the some to the some to the some to the some to the some t==================================================
Iteration #: 21
Epoch 1/1
158773/158773 [==============================] - 20s - loss: 1.4213    
Generating from seed: to say but
to say but no read to the party said to herself, and the door a peep and the party said to herself, and the do==================================================
Iteration #: 22
Epoch 1/1
158773/158773 [==============================] - 21s - loss: 1.4129    
Generating from seed: ir slates 
ir slates of the thing about the dormouse got the court, and she went on the stite alice had to do any with a ==================================================
Iteration #: 23
Epoch 1/1
158773/158773 [==============================] - 20s - loss: 1.4065    
Generating from seed: at is not 
at is not any all the rood a great did the crout to see it make to herself, and the roon as she went on and th==================================================
Iteration #: 24
Epoch 1/1
158773/158773 [==============================] - 18s - loss: 1.3987    
Generating from seed: y, said al
y, said alice, and the roon a great down on the gryphon remember the white rabbit heard and the words do an on

After 25 training iterations, the model can generate some text. It is still mostly nonsense, but it is clearly better than at the beginning :)
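One simple follow-up, not part of the original notebook: instead of always taking the argmax as the generation loop above does, the next character can be sampled from the softmax distribution with a temperature, which usually breaks up the long repeating loops visible in the output. A minimal sketch, assuming the model, char2index, index2char, SEQLEN and nb_chars defined earlier (sample_char and the seed string are made up for illustration):

# Hypothetical variation (not in the original notes): sample from the softmax
# output with a temperature instead of taking argmax
def sample_char(preds, temperature=0.5):
    preds = np.asarray(preds).astype("float64")
    preds = np.log(preds + 1e-8) / temperature      # rescale the distribution
    preds = np.exp(preds) / np.sum(np.exp(preds))   # re-normalize to probabilities
    return index2char[np.argmax(np.random.multinomial(1, preds, 1))]

seed = "alice was "          # exactly SEQLEN characters
generated = seed
for _ in range(100):
    Xtest = np.zeros((1, SEQLEN, nb_chars))
    for j, ch in enumerate(generated[-SEQLEN:]):
        Xtest[0, j, char2index[ch]] = 1
    generated += sample_char(model.predict(Xtest, verbose=0)[0])
print(generated)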
