This post uses TensorFlow for text classification: deciding whether a movie review is positive or negative. The IMDB dataset contains 50,000 reviews, split into 25,000 for training and 25,000 for testing. The two sets are balanced, meaning each contains an equal number of positive and negative reviews.
The IMDB dataset comes preprocessed: each review (a sequence of words) has already been converted to a sequence of integers, where each integer stands for a specific word in a dictionary. The download code is shown below; the data is saved under /root/.keras/datasets with the filename imdb.npz.
The code is as follows:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
imdb=keras.datasets.imdb
(train_data,train_labels),(test_data,test_labels)=imdb.load_data(num_words=10000)
The argument num_words=10000 keeps only the 10,000 most frequently occurring words in the training data; rarer words are discarded (replaced by an out-of-vocabulary token).
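As a quick sanity check (a small sketch of my own, reusing the train_data loaded above), you can confirm that every word index in the data is indeed below 10000:
# Sanity check: with num_words=10000, every index in train_data should be < 10000.
print(max(max(sequence) for sequence in train_data))  # expected to print at most 9999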
Below we inspect one sample: an array of integers, each representing a word of the movie review. The label is 0 or 1, where 0 means a negative review and 1 a positive review.
Code:
print("Training entries:{},labels:{}".format(len(train_data),len(train_labels)))
print(train_data.shape)
print(test_labels.shape)
print(train_data[0])
Result:
Training entries:25000,labels:25000
(25000,)
(25000,)
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
As shown above, the array contains 25,000 samples; each sample is a list of integers giving the dictionary indices of the words in that review.
Code (different samples, i.e. different reviews, have different lengths):
print(len(train_data[0]),len(train_data[1]))
Result:
218 189
The first training sample has length 218 and the second has length 189.
The code below converts the integers in a review back into words:
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()
# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index[""] = 0
word_index[""] = 1
word_index[""] = 2 # unknown
word_index[""] = 3
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
return ' '.join([reverse_word_index.get(i, '?') for i in text])
print(decode_review(train_data[0]))
Result:
" this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert is an amazing actor and now the same being director father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also to the two little boy's that played the of norman and paul they were just brilliant children are often left out of the list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
These reviews must be converted to tensors before they can be fed into the neural network. There are a couple of ways to do this:
Option 1: since we limited the vocabulary to 10,000 words, a review such as [3, 5] can be turned into a 10,000-element vector that is 1 at indices 3 and 5 and 0 everywhere else (see the sketch below). This requires a matrix of size number-of-reviews × 10,000.
Option 2: pad every review to the same length and build an integer tensor of size number-of-reviews × max-length.
We use the second approach.
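For reference, here is a minimal sketch of what option 1 (multi-hot encoding) could look like; we do not use it in this post, and the helper name multi_hot_encode is just for illustration:
import numpy as np

def multi_hot_encode(sequences, dimension=10000):
    # One row per review: set column i to 1 if word index i occurs in the review.
    results = np.zeros((len(sequences), dimension))
    for row, sequence in enumerate(sequences):
        results[row, sequence] = 1.0
    return results

# e.g. multi_hot_encode([[3, 5]]) gives a (1, 10000) array with 1s only at columns 3 and 5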
Reviews shorter than 256 words are padded with zeros so that every review ends up exactly 256 tokens long.
Code:
train_data=keras.preprocessing.sequence.pad_sequences(train_data,
                                                      value=word_index["<PAD>"],
                                                      padding='post',
                                                      maxlen=256)
test_data=keras.preprocessing.sequence.pad_sequences(test_data,
                                                     value=word_index["<PAD>"],
                                                     padding='post',
                                                     maxlen=256)
print(len(train_data[0]),len(train_data[1]),len(test_data[1]))
print(train_data[0])
Result:
256 256 256
[ 1 14 22 16 43 530 973 1622 1385 65 458 4468 66 3941
4 173 36 256 5 25 100 43 838 112 50 670 2 9
35 480 284 5 150 4 172 112 167 2 336 385 39 4
172 4536 1111 17 546 38 13 447 4 192 50 16 6 147
2025 19 14 22 4 1920 4613 469 4 22 71 87 12 16
43 530 38 76 15 13 1247 4 22 17 515 17 12 16
626 18 2 5 62 386 12 8 316 8 106 5 4 2223
5244 16 480 66 3785 33 4 130 12 16 38 619 5 25
124 51 36 135 48 25 1415 33 6 22 12 215 28 77
52 5 14 407 16 82 2 8 4 107 117 5952 15 256
4 2 7 3766 5 723 36 71 43 530 476 26 400 317
46 7 4 2 1029 13 104 88 4 381 15 297 98 32
2071 56 26 141 6 194 7486 18 4 226 22 21 134 476
26 480 5 144 30 5535 18 51 36 28 224 92 25 104
4 226 65 16 38 1334 88 12 16 283 5 16 4472 113
103 32 15 16 5345 19 178 32 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0]
Next we build the model. The code is as follows:
vocab_size=10000
model=keras.Sequential()
model.add(keras.layers.Embedding(vocab_size,16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16,activation=tf.nn.relu))
model.add(keras.layers.Dense(1,activation=tf.nn.sigmoid))
model.summary()
Result:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, None, 16)          160000
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16)                0
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
To be honest, I don't fully understand this model's structure myself, especially the first layer; I read several blog posts and still couldn't quite figure it out...
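Here is my attempt to make the first two layers a bit more concrete (a sketch using the same vocab_size=10000 and 16-dimensional embeddings as above): the Embedding layer looks up a 16-dimensional vector for every integer word index, turning a (batch, sequence_length) input into (batch, sequence_length, 16), and GlobalAveragePooling1D then averages over the sequence dimension to produce a fixed-size (batch, 16) vector. This also matches the parameter counts in the summary: 10000 × 16 = 160000 for the embedding, 16 × 16 + 16 = 272 and 16 + 1 = 17 for the two dense layers.
# A throwaway "probe" model containing only the first two layers,
# so the intermediate shapes can be inspected with predict().
probe = keras.Sequential([
    keras.layers.Embedding(10000, 16),        # (batch, seq_len) -> (batch, seq_len, 16)
    keras.layers.GlobalAveragePooling1D(),    # (batch, seq_len, 16) -> (batch, 16)
])
batch = np.array([[1, 14, 22, 0, 0]])         # one padded "review" of length 5
print(probe.predict(batch).shape)             # (1, 16)
Because the pooling averages over however many time steps it receives, the sequence length does not have to be fixed, which is why the summary shows (None, None, 16) for the embedding output.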
We set the loss function to binary_crossentropy. It is not the only possible choice; mean_squared_error would also work, but binary_crossentropy handles probabilities better, so we use it here. Why cross-entropy rather than plain MSE?
A brief explanation follows (the full derivation is fairly involved, so it is worked out on scratch paper rather than reproduced here).
In general, sigmoid is used for binary classification and softmax for multi-class classification, with binary_crossentropy and categorical_crossentropy as the corresponding loss functions. Note that when building a model, a sigmoid activation can appear in the last layer or in intermediate layers, whereas softmax is only placed in the last layer (and is usually regarded as a layer of its own rather than a mere activation function). Softmax is also special in that it maps multiple inputs to multiple outputs, whereas the other activation functions map a single input to a single output.
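To get a feel for why cross-entropy suits probability outputs better than MSE, here is a small numpy sketch (my own illustration, not part of the tutorial) that evaluates both losses as the predicted probability for a positive example drifts away from 1; cross-entropy grows much faster, so confident wrong predictions are penalized far more heavily:
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # -[y*log(p) + (1-y)*log(1-p)], averaged over the samples
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0])
for p in [0.9, 0.5, 0.1, 0.01]:
    print(p,
          round(float(binary_crossentropy(y_true, np.array([p]))), 4),
          round(float(mean_squared_error(y_true, np.array([p]))), 4))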
The code is as follows:
model.compile(optimizer=tf.train.AdamOptimizer(),
loss='binary_crossentropy',
metrics=['accuracy'])
If you have taken a machine learning course you already know what a validation set is for; I won't discuss it here, see my earlier posts.
The code is as follows:
x_val=train_data[:10000]
partial_x_train=train_data[10000:]
y_val=train_labels[:10000]
partial_y_train=train_labels[10000:]
The code is as follows (the output is not shown; it is just 40 epochs of training and rather long). At the end of each epoch it prints the training loss and accuracy as well as the validation loss and accuracy.
history=model.fit(partial_x_train,
partial_y_train,
epochs=40,
batch_size=512,
validation_data=(x_val,y_val),
verbose=1
)
The code is as follows (it evaluates the loss and accuracy on the test set):
results=model.evaluate(test_data,test_labels)
print(results)
Result:
[0.30745334438323974, 0.8748]
Here we produce two plots: the first shows the loss against the number of epochs, the second the accuracy against the number of epochs.
Note that history is the object returned by fit; its history attribute is a dictionary of the metrics recorded during training, namely the training and validation loss and accuracy. Training values are plotted as blue dots ('bo') and validation values as a solid blue line ('b').
The code is as follows:
history_dict=history.history
print(history_dict.keys())
acc=history_dict['acc']
val_acc=history_dict['val_acc']
loss=history_dict['loss']
val_loss=history_dict['val_loss']
epochs=range(1,len(acc)+1)
plt.plot(epochs,loss,'bo',label='Training loss')
plt.plot(epochs,val_loss,'b',label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf()
plt.plot(epochs,acc,'bo',label='Training acc')
plt.plot(epochs,val_acc,'b',label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
Results:
In the two figures below you can see that on the training set the loss keeps decreasing and the accuracy keeps increasing, while on the validation set both level off after roughly 20 epochs. This is overfitting: the model reaches very high accuracy on the training data but generalizes poorly to unseen data. The usual remedy is simply to stop training after about 20 epochs (see the sketch after the figures).
Figure 1:
Figure 2:
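As a possible refinement (a sketch of my own, reusing the model and data defined above), instead of hard-coding roughly 20 epochs you can let Keras stop training automatically once the validation loss stops improving, via the EarlyStopping callback (restore_best_weights requires a relatively recent Keras version):
# Stop when val_loss has not improved for 2 consecutive epochs and
# roll back to the best weights seen so far.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss',
                                           patience=2,
                                           restore_best_weights=True)

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    callbacks=[early_stop],
                    verbose=1)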