The dataset is split evenly into positive and negative reviews, with 25,000 reviews in the training set and 25,000 in the test set.
import keras
keras.__version__
As usual, check the Keras version first.
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
This loads the IMDB dataset; num_words=10000 means only the 10,000 most frequently used words are kept, which keeps the vocabulary at a size that is reasonable to learn from.
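A quick way to confirm the split described at the top (25,000 reviews per split, half positive and half negative) is to count directly; a minimal sketch, not part of the original walkthrough:

# Sanity check (my own addition): split sizes and label balance.
print(len(train_data), len(test_data))                              # 25000 25000
print(sum(train_labels), len(train_labels) - sum(train_labels))    # 12500 positive, 12500 negative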
train_data[0]
train_labels[0]
max([max(sequence) for sequence in train_data])
Here we look at the first review, its label, and the largest word index. The first output is the first review as a sequence of word indices, the second is the corresponding label (1 for positive, 0 for negative), and the third is the largest index across all reviews; since only 10,000 words were kept, it is, unsurprisingly, 9999.
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])
These three lines decode a review back into text. The first fetches the word-to-index dictionary, the second inverts it into an index-to-word dictionary, and the last one does the decoding. The indices are shifted by 3 because the first three indices are reserved, so they are skipped.
decoded_review
Have a look at the decoded sentence and it becomes clear what the review says: "? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
Honestly, it is long enough that even a person would need a moment to classify it.
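A side note on the i - 3 offset used in the decoding above: with imdb.load_data's default arguments, the first three indices are reserved, so actual words only start at index 3. A minimal check (my own addition, not from the original walkthrough):

# Indices 0, 1 and 2 are reserved by imdb.load_data for padding, "start of
# sequence" and "unknown word", which is why the decoder shifts every index down by 3.
print(train_data[0][:5])    # every review begins with 1, the start-of-sequence marker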
import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    # Multi-hot encode each review: a row of 10,000 zeros with a 1 at every
    # word index that appears in that review.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
As usual, the data is one-hot encoded here to turn it into a matrix: each review, a sequence of indices between 0 and 9999, becomes a 10,000-dimensional vector of 0s and 1s, and the labels are vectorized as well (cast to float32).
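To see what this produces, it helps to inspect one sample; a small sketch (my own addition):

# Each review is now a fixed-length binary vector: position j is 1.0 if word j
# appears anywhere in the review and 0.0 otherwise; labels are float32 scalars.
print(x_train.shape)    # (25000, 10000)
print(x_train[0])       # e.g. array([0., 1., 1., ..., 0., 0., 0.])
print(y_train.dtype)    # float32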
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
Now the network is built. The first two layers use relu, a non-linear activation that yields a richer hypothesis space; without a non-linearity, stacking layers would add nothing over a single linear layer. The final layer uses sigmoid to output a value between 0 and 1, which is the final prediction.
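For reference, the two activations are simple element-wise functions; a small numpy sketch of what they compute (my own addition, not part of the model code):

import numpy as np

def relu(x):
    # relu zeroes out negative values; this non-linearity is what makes
    # stacking Dense layers more expressive than one linear layer
    return np.maximum(0.0, x)

def sigmoid(x):
    # sigmoid squashes any real number into (0, 1), so the last layer's output
    # can be read as the probability that the review is positive
    return 1.0 / (1.0 + np.exp(-x))

print(relu(np.array([-2.0, 0.0, 3.0])))       # [0. 0. 3.]
print(sigmoid(np.array([-2.0, 0.0, 3.0])))    # roughly [0.12 0.5 0.95]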
from keras import optimizers
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
The compile step was covered in the previous post, so I won't repeat it here. The loss and metrics can also be customized with your own functions; more on that later.
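As a taste of that customization, compile can also take function objects instead of string names; a sketch equivalent to the call above, using the keras.losses and keras.metrics modules of this same Keras version:

from keras import losses
from keras import metrics

# Same behaviour as the string-based compile above, but with explicit function
# objects, which is the form you would use when plugging in a custom loss or metric.
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])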
At this point the network is built and all the preparation is done. This example is very close to the first one, only with more data and a more refined evaluation procedure. That will be covered in the next post.