Binary classification is probably the most widely used kind of machine learning problem. In the following example we will classify movie reviews as positive or negative.
The IMDB dataset contains 50,000 highly polarized reviews from the Internet Movie Database (IMDB). The dataset is split into 25,000 reviews for training and 25,000 reviews for testing, and both the training set and the test set consist of 50% positive and 50% negative reviews.
A model that performs well on its training data does not necessarily perform well on data it has never seen before, and what we really care about is the model's performance on new data. Evaluating on the training set would mask the overfitting phenomenon so often discussed in the literature, which is why a separate test set is kept.
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
The argument num_words=10000 means that only the 10,000 most frequently occurring words in the training data are kept. Rare words are discarded, so the resulting vector data stays at a manageable size.
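As a quick sanity check (a small illustrative snippet, not part of the original listing), you can confirm that no word index in the capped data exceeds 9999:
max_index = max(max(sequence) for sequence in train_data)  # each review is a list of word indices
print(max_index)  # expected: 9999, since indices are capped by num_words=10000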
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])  # invert the mapping: integer index -> word
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])  # indices 0, 1 and 2 are reserved for padding, start-of-sequence and unknown words
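To inspect the result (illustrative usage):
print(decoded_review)   # the first review, decoded back into words
print(train_labels[0])  # 1 stands for positive, 0 for negative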
Convert the integer sequences into binary matrices (multi-hot encoding):
import numpy as np
def vectorize_sequence(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))  # all-zero matrix of shape (len(sequences), dimension)
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set the positions of the words present in this review to 1
    return results
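A tiny illustrative example of what the encoding produces (the toy sequence and the dimension of 8 are made up for readability):
sample = vectorize_sequence([[1, 3, 5]], dimension=8)
print(sample)  # [[0. 1. 0. 1. 0. 1. 0. 0.]] -- 1s at positions 1, 3 and 5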
Vectorize the training data and the test data:
x_train = vectorize_sequence(train_data)
x_test = vectorize_sequence(test_data)
Vectorize the labels as well:
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
The input data are vectors and the labels are scalars (0s and 1s); this is about the simplest setup you will encounter. One kind of network that performs well on such a problem is a simple stack of fully connected (Dense) layers with relu activations.
output = relu(dot(W, input) + b)
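A minimal NumPy sketch of what that formula computes (naive_dense_relu is a hypothetical helper written for illustration, not part of Keras):
import numpy as np

def naive_dense_relu(W, x, b):
    # output = relu(dot(W, input) + b), spelled out in plain NumPy
    return np.maximum(0., np.dot(W, x) + b)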
The network chosen here has two intermediate layers with 16 hidden units each; the intermediate layers use relu as their activation function, and the last layer uses a sigmoid activation so that the output is a probability between 0 and 1.
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
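An optional sanity check: model.summary() prints the layer shapes and parameter counts, which for this architecture work out as follows:
model.summary()
# Dense(16) on a 10000-dim input: 10000*16 + 16 = 160,016 parameters
# Dense(16):                      16*16 + 16   = 272 parameters
# Dense(1):                       16*1 + 1     = 17 parameters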
Configure the loss function and the optimizer:
from keras import optimizers
from keras import losses
from keras import metrics
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])
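The same configuration can also be passed as strings, which is equivalent here (RMSprop's default learning rate is 0.001):
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])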
Set aside a validation set:
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
Now train the model for 20 epochs, in mini-batches of 512 samples.
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
Epoch 16/20
30/30 [==============================] - 0s 14ms/step - loss: 0.0126 - binary_accuracy: 0.9985 - val_loss: 0.5634 - val_binary_accuracy: 0.8710
Epoch 17/20
30/30 [==============================] - 0s 13ms/step - loss: 0.0131 - binary_accuracy: 0.9976 - val_loss: 0.5832 - val_binary_accuracy: 0.8702
Epoch 18/20
30/30 [==============================] - 0s 13ms/step - loss: 0.0057 - binary_accuracy: 0.9998 - val_loss: 0.6184 - val_binary_accuracy: 0.8673
Epoch 19/20
30/30 [==============================] - 0s 13ms/step - loss: 0.0092 - binary_accuracy: 0.9977 - val_loss: 0.6512 - val_binary_accuracy: 0.8681
Epoch 20/20
30/30 [==============================] - 0s 14ms/step - loss: 0.0030 - binary_accuracy: 0.9999 - val_loss: 0.6874 - val_binary_accuracy: 0.8647
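The call to fit returns a History object whose history attribute holds one value per metric per epoch; the exact keys depend on the metrics the model was compiled with:
print(history.history.keys())
# e.g. dict_keys(['loss', 'binary_accuracy', 'val_loss', 'val_binary_accuracy'])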
import matplotlib.pyplot as plt
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
acc = history_dict['binary_accuracy']
val_acc = history_dict['val_binary_accuracy']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.plot(epochs, acc, 'ro', label='Training acc')
plt.plot(epochs, val_acc, 'r', label='Validation acc')
plt.title('Training and validation loss and accuracy')
plt.ylabel('Loss / Accuracy')
plt.xlabel('Epochs')
plt.legend()
plt.show()
The training loss decreases with every epoch and the training accuracy increases with every epoch; that is exactly what you would expect from gradient-descent optimization. But the validation loss and accuracy stop improving after the first few epochs: the network is overfitting and fails to generalize to data it has never seen.
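A common follow-up, sketched here under the assumption that the validation curves peak within the first few epochs (the exact best epoch depends on your run): retrain a fresh model from scratch for fewer epochs and evaluate it on the test set.
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)  # stop before overfitting sets in
results = model.evaluate(x_test, y_test)  # [test loss, test accuracy]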