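Keras is fairly high-level, and these days attention mechanisms are more often implemented in PyTorch. Still, they can be built with Keras's functional API, and Keras also makes it convenient to process text and turn it into word vectors.
This post uses a food-delivery review dataset with labels 0 and 1, where 1 is a positive review and 0 a negative one. Twelve models are built: MLP, 1D CNN, RNN, GRU, LSTM, CNN+LSTM, TextCNN, BiLSTM, Attention, BiLSTM+Attention, BiGRU+Attention, and Attention*3 (three stacked attention layers).
Feel free to use this as a starting point and combine the pieces into better models. (Leave a comment if you need the dataset and the stop-word list.)
The code comments in this post are probably the most detailed of any post I have written.
Unlike English, Chinese has no spaces between words, so the text has to be segmented with the jieba library. After that, punctuation and function words that carry little meaning, such as “了”, “的”, “也”, “就”, “很”, are removed. This requires a stop-word list. You can find common stop-word files online or leave a comment to ask me for one. I have a fairly complete list and a simplified one; this post uses the simplified list.
The raw data has a label column and a review column.
Import the packages and data, read the stop words, and segment and clean the text with jieba: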
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['KaiTi']  # default font that can render Chinese (SimHei is an alternative)
plt.rcParams['axes.unicode_minus'] = False   # render the minus sign correctly in saved figures
import jieba

stop_list = pd.read_csv("stopwords_简略版.txt", index_col=False, quoting=3,
                        sep="\t", names=['stopword'], encoding='utf-8')
stop_words = set(stop_list['stopword'])  # a set makes the membership test fast

# jieba segmentation: cut the sentence, drop stop words, join the rest with spaces
def txt_cut(juzi):
    lis = [w for w in jieba.lcut(juzi) if w not in stop_words]
    return " ".join(lis)

df = pd.read_excel('外卖.xlsx')
data = pd.DataFrame()
data['label'] = df['label']
data['cutword'] = df['review'].astype('str').apply(txt_cut)
data
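The text is now segmented: data holds the label and the space-separated tokens in cutword.
Check the distribution of the label y:
data['label'].value_counts().plot(kind='bar')
There are nearly 8,000 negative reviews (0) and about 4,000 positive ones (1). The classes are imbalanced, so the train/test split should be stratified.
Next, the text is turned into arrays with Keras's Tokenizer class, which first maps every word to an index. The parameter num_words=6000 matters here: it caps the indexing vocabulary at roughly the 6,000 most frequent words (Keras keeps the num_words-1 most frequent ones), so the model never sees more words than that.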
from os import listdir
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# Fit the tokenizer on the segmented text and build the word-index dictionary
tok = Tokenizer(num_words=6000)
tok.fit_on_texts(data['cutword'].values)
print("Number of samples: ", tok.document_count)
print({k: tok.word_index[k] for k in list(tok.word_index)[:10]})
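To make the effect of num_words concrete, here is a minimal pure-Python sketch of the idea, not the Keras implementation itself. The helper names build_word_index and texts_to_sequences and the toy English tokens are made up for illustration. Words are ranked by frequency starting at index 1, and any word whose index is num_words or higher is dropped when the text is converted to sequences.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(w for t in texts for w in t.split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {w: i + 1 for i, (w, _) in enumerate(ranked)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping words whose index is >= num_words."""
    return [[word_index[w] for w in t.split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

# Hypothetical toy corpus, already segmented and space-joined
texts = ["tasty tasty fast", "tasty slow", "slow awful"]
word_index = build_word_index(texts)
seqs = texts_to_sequences(texts, word_index, num_words=3)
print(word_index)
print(seqs)
```

With num_words=3 only the two most frequent words survive, which is why rare words simply vanish from the sequences rather than being mapped to a placeholder.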
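Reviews differ in length, but training needs tensors of equal length (longer ones are truncated, shorter ones padded), so a maximum sequence length has to be chosen. This is the max_words parameter: it is the number of time steps for the recurrent networks and also the step dimension of the attention layer. If max_words is too small, many sentences lose information; if it is too large, the data matrix becomes very sparse and the computation grows. Let's look at the frequency distribution of the sequence lengths in X: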
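# Convert each review into a sequence of word indices
X = tok.texts_to_sequences(data['cutword'].values)
# Distribution of the sequence lengths
length = []
for i in X:
    length.append(len(i))
v_c = pd.Series(length).value_counts()
print(v_c[v_c > 20])  # only show lengths that occur more than 20 times
v_c[v_c > 20].plot(kind='bar', figsize=(12, 5))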
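The same length distribution can also be summarised as a coverage figure, which makes the choice of max_words less dependent on reading a bar chart. The sketch below is only an illustration: the helper coverage and the sample lengths are hypothetical, not taken from the dataset.

```python
def coverage(lengths, max_len):
    """Fraction of sequences that fit into max_len without truncation."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Hypothetical sequence lengths, standing in for the `length` list above
sample_lengths = [3, 5, 5, 4, 6, 5, 7, 30, 5, 4]
for max_len in (5, 7, 20, 30):
    print(max_len, coverage(sample_lengths, max_len))
```

A cut-off that covers most sequences while staying small keeps padding, and therefore wasted computation, low.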
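The vast majority of sentences are no longer than 10 words, and length 5 is the most common. Here max_words=20 is chosen, so every sentence is padded or truncated to a vector of length 20. The labels y are extracted as well: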
# Pad or truncate every sequence to the same length
X = sequence.pad_sequences(X, maxlen=20)
Y = data['label'].values
print("X.shape: ", X.shape)
print("Y.shape: ", Y.shape)
#X=np.array(X)
#Y=np.array(Y)
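For readers unfamiliar with pad_sequences, the following pure-Python sketch mirrors its default behaviour as I understand it (padding='pre' and truncating='pre'): zeros are added on the left, and overly long sequences keep only their last maxlen items. The function name pad_pre is made up for this illustration.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]                    # truncate from the front
    return [value] * (maxlen - len(seq)) + seq   # pad at the front

short = pad_pre([1, 2, 3], 5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], 5)
print(short)
print(long_)
```

Because 0 is used as the padding value, index 0 never stands for a real word, which is consistent with the tokenizer starting its word indices at 1.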
Then split into training and test sets (stratified on Y) and check the shapes:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=0)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
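What stratify=Y buys us can be shown with a small pure-Python sketch. It is a simplified stand-in, not scikit-learn's actual algorithm, and the function name stratified_split is hypothetical: each class is shuffled and split separately, so the test set keeps the same class ratio as the full data.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                        # shuffle within one class
        n_test = round(len(idxs) * test_size)    # same share taken from every class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the same 2:1 imbalance as described for this dataset
labels = [0] * 80 + [1] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx) / len(test_idx))
```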
One-hot encode y, keeping a copy of the original test labels for the evaluation later. Then print the first 3 rows of x and y:
Y_test_original = Y_test.copy()
Y_train = to_categorical(Y_train)
Y_test = to_categorical(Y_test)
print(X_train[:3])
print(Y_test[:3])
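As a small illustration of what the one-hot step does and why the original labels are kept, here is a pure-Python sketch (the helper names one_hot and argmax_rows are made up). Label k becomes a vector with a 1 at position k, and taking the argmax of each row recovers the class, which is how the predicted probabilities are turned back into labels during evaluation.

```python
def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1.0 at the label position."""
    return [[1.0 if c == y else 0.0 for c in range(num_classes)] for y in labels]

def argmax_rows(rows):
    """Index of the largest entry in each row."""
    return [max(range(len(r)), key=r.__getitem__) for r in rows]

encoded = one_hot([0, 1, 1], num_classes=2)
decoded = argmax_rows(encoded)
print(encoded)
print(decoded)
```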
Keras does not ship this particular attention-pooling layer (a learned weighted sum over time steps), so we define one ourselves:
# Custom attention layer
from keras import initializers, constraints, activations, regularizers
from keras import backend as K
from keras.layers import Layer

class Attention(Layer):
    # Output: not the attention weights, but the sum of the time-step vectors weighted by them.
    # Input: step_dim is the number of RNN time steps, i.e. the padded sequence length.
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')
        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)
        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)
        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.W = self.add_weight(shape=(input_shape[-1],), initializer=self.init, name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer, constraint=self.W_constraint)
        self.features_dim = input_shape[-1]
        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],), initializer='zero', name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer, constraint=self.b_constraint)
        else:
            self.b = None
        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None  # later layers no longer need the mask, so simply return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim  # the parameter we pass in; equals input_shape[1], the number of RNN time steps
        # After reshaping the input and the weights and taking the dot product, the tensor has shape
        # (batch_size*timesteps, 1). Each sample is normalized separately afterwards,
        # so it is reshaped back to (batch_size, timesteps).
        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))
        if self.bias:
            eij += self.b
        eij = K.tanh(eij)  # tanh is the usual RNN default; the choice matters little here because a softmax follows
        a = K.exp(eij)
        if mask is not None:  # time steps masked by earlier layers must not contribute, so their weights are set to 0
            a *= K.cast(mask, K.floatx())  # cast converts the mask to the float type used in the computation
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)  # axis defaults to -1: append one dimension, e.g. shape (3,) becomes (3, 1)
        # Now a.shape = (batch_size, timesteps, 1) and x.shape = (batch_size, timesteps, units)
        weighted_input = x * a
        # weighted_input has shape (batch_size, timesteps, units): each time-step vector is scaled by its weight.
        # Summing over axis=1 gives a tensor of shape (batch_size, units).
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):  # the result is the context vector c with shape (batch_size, units)
        return input_shape[0], self.features_dim
Don't worry about how complicated this class looks. There is no need to read it closely; later on it is simply used like a function.
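For anyone who does want to see what the layer computes, here is a minimal pure-Python sketch of its forward pass for a single sample, written under simplifying assumptions: no mask, no epsilon in the denominator, and plain lists instead of tensors. The function name attention_pool and the toy numbers are made up. Each time step gets a score tanh(x_t · W + b_t), the scores are normalized with a softmax over the time steps, and the output is the weighted sum of the time-step vectors.

```python
import math

def attention_pool(x, W, b):
    """x: T time-step vectors of length D; W: D weights; b: T biases (one per step)."""
    scores = [math.tanh(sum(xi * wi for xi, wi in zip(x_t, W)) + b_t)
              for x_t, b_t in zip(x, b)]
    exp_scores = [math.exp(s) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]    # softmax over the time steps
    context = [sum(w * x_t[d] for w, x_t in zip(weights, x))
               for d in range(len(W))]           # weighted sum, length D
    return weights, context

# Toy input: 3 time steps, 2 features each
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_pool(x, W=[0.5, -0.5], b=[0.0, 0.0, 0.0])
print(weights)
print(context)
```

The output has one value per feature and no time axis, which is why the Dense layers after the attention layer in the model below can be applied directly.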
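Next, import the commonly used Keras layers and define some parameters:
from keras.preprocessing import sequence
from keras.models import Sequential, Model
from keras.layers import Dense, Input, Dropout, Embedding, Flatten, MaxPooling1D, Conv1D, SimpleRNN, LSTM, GRU, Multiply
from keras.layers import Bidirectional, Activation, BatchNormalization
from keras.layers import concatenate  # the older keras.layers.merge path is not available in recent releases

seed = 10
np.random.seed(seed)  # fix the random seed
# Vocabulary size 6000, maximum sentence length 20
top_words = 6000
max_words = 20
num_labels = 2  # binary classification
Now the model-building function. It is fairly long because all 12 models are defined together for code reuse, but each model's block is clearly separated: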
def build_model(top_words=top_words, max_words=max_words, num_labels=num_labels, mode='LSTM', hidden_dim=[32]):
    if mode == 'RNN':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(SimpleRNN(32))
        model.add(Dropout(0.25))
        model.add(Dense(num_labels, activation="softmax"))
    elif mode == 'MLP':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(Flatten())
        model.add(Dense(256, activation="relu"))
        model.add(Dropout(0.25))
        model.add(Dense(num_labels, activation="softmax"))
    elif mode == 'LSTM':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(LSTM(32))
        model.add(Dropout(0.25))
        model.add(Dense(num_labels, activation="softmax"))
    elif mode == 'GRU':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(GRU(32))
        model.add(Dropout(0.25))
        model.add(Dense(num_labels, activation="softmax"))
    elif mode == 'CNN':  # 1D convolution
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(Conv1D(filters=32, kernel_size=3, padding="same", activation="relu"))
        model.add(MaxPooling1D(pool_size=2))
        model.add(Flatten())
        model.add(Dense(256, activation="relu"))
        model.add(Dropout(0.25))
        model.add(Dense(num_labels, activation="softmax"))
    elif mode == 'CNN+LSTM':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(Conv1D(filters=32, kernel_size=3, padding="same", activation="relu"))
        model.add(MaxPooling1D(pool_size=2))
        model.add(LSTM(64))
        model.add(Dropout(0.25))
        model.add(Dense(num_labels, activation="softmax"))
    elif mode == 'BiLSTM':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Bidirectional(LSTM(64)))
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.25))
        model.add(Dense(num_labels, activation='softmax'))
    # The networks below are built with the Functional API
    elif mode == 'TextCNN':
        inputs = Input(name='inputs', shape=[max_words,], dtype='float64')
        # Embedding layer. Note: no pretrained vectors are loaded here, so with
        # trainable=False the weights stay frozen at their random initialization.
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        # Convolution window sizes 3, 4 and 5
        cnn1 = Conv1D(32, 3, padding='same', strides=1, activation='relu')(layer)
        cnn1 = MaxPooling1D(pool_size=2)(cnn1)
        cnn2 = Conv1D(32, 4, padding='same', strides=1, activation='relu')(layer)
        cnn2 = MaxPooling1D(pool_size=2)(cnn2)
        cnn3 = Conv1D(32, 5, padding='same', strides=1, activation='relu')(layer)
        cnn3 = MaxPooling1D(pool_size=2)(cnn3)
        # Concatenate the outputs of the three branches
        cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)
        flat = Flatten()(cnn)
        drop = Dropout(0.2)(flat)
        main_output = Dense(num_labels, activation='softmax')(drop)
        model = Model(inputs=inputs, outputs=main_output)
    elif mode == 'Attention':
        inputs = Input(name='inputs', shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        attention_probs = Dense(32, activation='softmax', name='attention_vec')(layer)
        attention_mul = Multiply()([layer, attention_probs])
        mlp = Dense(64)(attention_mul)  # plain fully connected layer
        fla = Flatten()(mlp)
        output = Dense(num_labels, activation='softmax')(fla)
        model = Model(inputs=[inputs], outputs=output)
    elif mode == 'Attention*3':
        inputs = Input(name='inputs', shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        attention_probs = Dense(32, activation='softmax', name='attention_vec')(layer)
        attention_mul = Multiply()([layer, attention_probs])
        mlp = Dense(32, activation='relu')(attention_mul)
        attention_probs = Dense(32, activation='softmax', name='attention_vec1')(mlp)
        attention_mul = Multiply()([mlp, attention_probs])
        mlp2 = Dense(32, activation='relu')(attention_mul)
        attention_probs = Dense(32, activation='softmax', name='attention_vec2')(mlp2)
        attention_mul = Multiply()([mlp2, attention_probs])
        mlp3 = Dense(32, activation='relu')(attention_mul)
        fla = Flatten()(mlp3)
        output = Dense(num_labels, activation='softmax')(fla)
        model = Model(inputs=[inputs], outputs=output)
    elif mode == 'BiLSTM+Attention':
        inputs = Input(name='inputs', shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        bilstm = Bidirectional(LSTM(64, return_sequences=True))(layer)  # return_sequences=True keeps the output 3-D
        bilstm = Bidirectional(LSTM(64, return_sequences=True))(bilstm)
        layer = Dense(256, activation='relu')(bilstm)
        layer = Dropout(0.2)(layer)
        # Attention mechanism (the custom layer defined above)
        attention = Attention(step_dim=max_words)(layer)
        layer = Dense(128, activation='relu')(attention)
        output = Dense(num_labels, activation='softmax')(layer)
        model = Model(inputs=inputs, outputs=output)
    elif mode == 'BiGRU+Attention':
        inputs = Input(name='inputs', shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        attention_probs = Dense(32, activation='softmax', name='attention_vec')(layer)
        attention_mul = Multiply()([layer, attention_probs])
        mlp = Dense(64, activation='relu')(attention_mul)  # plain fully connected layer
        #bat=BatchNormalization()(mlp)
        #act=Activation('relu')
        gru = Bidirectional(GRU(32))(mlp)
        mlp = Dense(16, activation='relu')(gru)
        output = Dense(num_labels, activation='softmax')(mlp)
        model = Model(inputs=[inputs], outputs=output)
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
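The first few simple single models are defined in the most basic way, stacking layers like building blocks with Sequential. The more complex models further down are all built with the Functional API.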
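One detail of the 'Attention', 'Attention*3' and 'BiGRU+Attention' branches is worth seeing in isolation. Dense(32, activation='softmax') normalizes over the 32 features of each time step, not over the time steps, so the following Multiply acts as a per-position feature gate. The pure-Python sketch below illustrates that reading for one sample. The function names, the tiny weight matrix, and the row-per-output weight layout are conventions of this sketch only, not the layout Keras uses internally.

```python
import math

def softmax(v):
    m = max(v)                                  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    s = sum(exps)
    return [e / s for e in exps]

def feature_gate(x, W, b):
    """x: T vectors of length D; W: D rows of D weights (one row per output); b: D biases."""
    gated = []
    for x_t in x:
        logits = [sum(w * a for w, a in zip(row, x_t)) + b_j for row, b_j in zip(W, b)]
        probs = softmax(logits)                 # sums to 1 across the features of this step
        gated.append([a * p for a, p in zip(x_t, probs)])
    return gated

# With all-zero weights the gate is uniform, so every feature is scaled by 1/D
x = [[2.0, 4.0], [6.0, 8.0]]
gated = feature_gate(x, W=[[0.0, 0.0], [0.0, 0.0]], b=[0.0, 0.0])
print(gated)
```

The gating never mixes information across positions, which is one way these branches differ from the custom Attention layer used in 'BiLSTM+Attention'.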
Next, define the evaluation helpers: the loss and accuracy plots, the confusion matrix and its metrics.
# Loss/accuracy plots, confusion matrix and related metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import cohen_kappa_score

def plot_loss(history):
    # Plot training and validation loss
    plt.subplots(1, 2, figsize=(10, 3))
    plt.subplot(121)
    loss = history.history["loss"]
    epochs = range(1, len(loss) + 1)
    val_loss = history.history["val_loss"]
    plt.plot(epochs, loss, "bo", label="Training Loss")
    plt.plot(epochs, val_loss, "r", label="Validation Loss")
    plt.title("Training and Validation Loss")
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.legend()
    # Plot training and validation accuracy
    plt.subplot(122)
    acc = history.history["accuracy"]
    val_acc = history.history["val_accuracy"]
    plt.plot(epochs, acc, "b-", label="Training Acc")
    plt.plot(epochs, val_acc, "r--", label="Validation Acc")
    plt.title("Training and Validation Accuracy")
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.tight_layout()
    plt.show()

def plot_confusion_matrix(model, X_test, Y_test_original):
    # Predicted probabilities
    prob = model.predict(X_test)
    # Predicted classes
    pred = np.argmax(prob, axis=1)
    # Cross-tabulation, i.e. the confusion matrix
    table = pd.crosstab(Y_test_original, pred, rownames=['Actual'], colnames=['Predicted'])
    #print(table)
    sns.heatmap(table, cmap='Blues', fmt='.20g', annot=True)
    plt.tight_layout()
    plt.show()
    # Metrics derived from the confusion matrix
    print(classification_report(Y_test_original, pred))
    # Cohen's kappa
    print("Cohen's kappa: " + str(cohen_kappa_score(Y_test_original, pred)))
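Since Cohen's kappa is less familiar than accuracy, here is a small pure-Python sketch of the standard unweighted formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the label frequencies. The function name cohen_kappa and the toy labels are illustrative; in the post itself the value comes from scikit-learn's cohen_kappa_score.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa; undefined (division by zero) if p_e equals 1."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n           # observed agreement
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy example: 3 of 4 predictions are correct
kappa = cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0])
print(kappa)
```

On an imbalanced dataset like this one, kappa is a useful complement to accuracy because a model that always predicts the majority class scores a kappa of 0.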
Define the training function:
# Training function: build, train, evaluate and optionally plot
def train_fuc(max_words=max_words, mode='BiLSTM+Attention', batch_size=32, epochs=10, hidden_dim=[32], show_loss=True, show_confusion_matrix=True):
    # Build the model
    model = build_model(max_words=max_words, mode=mode)
    model.summary()  # summary() prints by itself, so no print() is needed
    history = model.fit(X_train, Y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2, verbose=1)
    print('------------ Training finished ------------')
    # Evaluate the model on the test set
    loss, accuracy = model.evaluate(X_test, Y_test)
    print("Test accuracy = {:.4f}".format(accuracy))
    if show_loss:
        plot_loss(history)
    if show_confusion_matrix:
        plot_confusion_matrix(model=model, X_test=X_test, Y_test_original=Y_test_original)
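Set some parameters:
top_words = 6000
max_words = 20
batch_size = 32
epochs = 4
show_confusion_matrix = True
show_loss = True
mode = 'MLP'
Only 4 epochs are used. The dataset is small and simple and every sentence is short, so the simple single models overfit easily. Four epochs are enough and also save time.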
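Instead of fixing the number of epochs by hand, a common alternative is to stop when the validation loss has not improved for a few epochs; Keras offers the EarlyStopping callback for this. The pure-Python sketch below only illustrates the patience rule itself. The function name early_stop_epoch and the loss values are hypothetical and not taken from the runs in this post.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based, for a patience-based rule."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch       # new best validation loss
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch                  # no improvement for `patience` epochs
    return len(val_losses), best_epoch                # never triggered: ran to the end

# Hypothetical validation losses: improving for 3 epochs, then overfitting
val_losses = [0.40, 0.33, 0.31, 0.34, 0.36, 0.39]
stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)
```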
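Now train and evaluate the models one by one:
train_fuc(mode='MLP', batch_size=batch_size, epochs=epochs)
The output shows the training loss and accuracy for every epoch together with the validation loss and accuracy, the plotted curves, the test-set accuracy, the confusion-matrix heatmap, the metrics derived from it, and Cohen's kappa. The MLP reaches a test accuracy of 0.8795.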
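# The original reshaped X to (samples, max_words, 1) at this point. Every model here starts with an
# Embedding layer, which expects 2-D integer input of shape (samples, max_words) and produces the
# 3-D tensor itself, so the reshape is not needed and is left commented out.
#X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
#X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))
train_fuc(mode='CNN', batch_size=batch_size, epochs=epochs)
About the same: test accuracy 0.8882.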
model = 'RNN'
train_fuc(mode=model, epochs=epochs)
The output looks similar, so it is not shown in full. Test accuracy: 0.8912.
train_fuc(mode='LSTM', epochs=epochs)
Similar again. Test accuracy: 0.8966 (the highest so far).
train_fuc(mode='GRU', epochs=epochs)
Test accuracy: 0.8912.
train_fuc(mode='CNN+LSTM', epochs=epochs)
Test accuracy: 0.8916.
train_fuc(mode='BiLSTM', epochs=epochs)
Test accuracy: 0.8816.
train_fuc(mode='TextCNN', epochs=30)
The number of epochs is increased here because the models from this point on are more complex, overfit less easily, and need more epochs to train.
Test accuracy: 0.8474.
train_fuc(mode='Attention', epochs=100)
Test accuracy: 0.8207.
train_fuc(mode='BiLSTM+Attention', epochs=30)
Test accuracy: 0.8236.
train_fuc(mode='BiGRU+Attention', epochs=100)
Test accuracy: 0.8607.
train_fuc(mode='Attention*3', epochs=50)
Test accuracy: 0.8057.
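For a side-by-side view, the test accuracies reported above can be collected and ranked with a few lines of Python. The numbers are exactly the ones quoted in this post; keep in mind that the models were trained for different numbers of epochs, so the ranking is only a rough comparison.

```python
# Test accuracies as reported above
reported = {
    "MLP": 0.8795, "CNN": 0.8882, "RNN": 0.8912, "LSTM": 0.8966,
    "GRU": 0.8912, "CNN+LSTM": 0.8916, "BiLSTM": 0.8816, "TextCNN": 0.8474,
    "Attention": 0.8207, "BiLSTM+Attention": 0.8236,
    "BiGRU+Attention": 0.8607, "Attention*3": 0.8057,
}

# Sort from best to worst test accuracy
ranking = sorted(reported.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<18}{acc:.4f}")
```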
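Clearly, the models with attention are less prone to overfitting during training. The single recurrent networks already overfit after four epochs, whereas the attention models need more epochs: their validation accuracy keeps rising and their loss keeps falling.
Their overall test accuracy ends up below that of the simple single networks, but my guess is that this comes from too few training epochs and too little data. Another possible factor, visible in the code above, is that the Functional API models freeze the Embedding layer (trainable=False) without loading pretrained vectors, so they train on fixed random embeddings.
The sentences in this food-delivery dataset are really short, the task is fairly simple, and the sample size is not large.
Also, compared with the Transformer, the attention used here has no residual connections, no normalization layers (the Transformer uses layer normalization rather than batch normalization), no encoder-decoder structure, and no deep stack of layers (the original Transformer has 6 encoder and 6 decoder layers, 18 attention sub-layers in total).
In the future, the attention models could be trained and tested on larger and more complex datasets, with bigger and deeper networks and more hyperparameter tuning. That of course requires more computing resources (time to buy a better computer).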