The basic pipeline for sentiment analysis with word vectors plus deep learning is:
1. Train word vectors.
2. Preprocess and tokenize: each sentence becomes a sequence of words; fix a maximum sequence length, truncating longer sentences and padding shorter ones; assign every word an index that maps to its word vector.
3. Define the network structure, e.g. one LSTM layer plus a fully connected layer, with dropout for better generalization, then train.
4. Tune parameters while watching training and validation loss and accuracy. When validation accuracy stops improving and starts dropping in a non-random way (usually accompanied by rising validation loss), the model is starting to overfit: stop training, then retrain on all the data with that epoch/iteration count and those parameters to get the final model.
A minimal sketch of the whole pipeline follows this list; the full, runnable code is in section 5.
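As a rough outline (a sketch only: segmented_corpus, texts and y are placeholder variables, the layer sizes are illustrative, and gensim 3.x / older Keras argument names are assumed to match the code in section 5):

from gensim.models import Word2Vec
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

# 1. word vectors trained on a pre-segmented corpus (a list of token lists)
w2v = Word2Vec(segmented_corpus, size=256, min_count=5)

# 2. word -> index (0 = unknown), then pad/truncate every sentence to a fixed length
word2index = {w: i + 1 for i, w in enumerate(w2v.wv.index2word)}
X = sequence.pad_sequences([[word2index.get(w, 0) for w in sent] for sent in texts], maxlen=150)

# 3. embedding + one LSTM layer + dropout + fully connected output
model = Sequential()
model.add(Embedding(input_dim=len(word2index) + 1, output_dim=256, input_length=150))
model.add(LSTM(32))
model.add(Dropout(0.4))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# 4. watch validation loss/accuracy for the point where overfitting starts
model.fit(X, y, batch_size=32, nb_epoch=5, validation_split=0.2)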
1. Word vectors
The corpus and the steps for training word vectors were covered in an earlier post; the sentiment corpus can simply be added to that corpus and trained together, so the method and code are skipped here. Two things are worth noting. First, the word-vector corpus should ideally come from the same domain as the sentiment task, and the more of it the better. Second, segmentation quality strongly affects word-vector quality, so some extra preprocessing helps, e.g. removing stop words or adding domain terms to the segmenter as a custom dictionary.
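A rough sketch of that preprocessing (file names such as user_dict.txt, stopwords.txt and domain_corpus.txt are placeholders, and the gensim 3.x-style parameters are only examples):

import jieba
from gensim.models import Word2Vec

jieba.load_userdict("user_dict.txt")                      # domain terms, one per line
stopwords = set(open("stopwords.txt").read().split())

def segment(line):
    # segment a sentence with jieba and drop stop words
    return [w for w in jieba.lcut(line.strip()) if w and w not in stopwords]

sentences = [segment(line) for line in open("domain_corpus.txt")]
w2v = Word2Vec(sentences, size=256, window=5, min_count=5, workers=4)
w2v.save("../models/word2vec.model")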
2. Turning text into numbers
The following example illustrates how text becomes numbers. The word vectors give us a vocabulary in which every word has an index (for example, the word's position in the vocabulary plus 1), and that index points to the word's vector; together these form an embedding matrix. Note that one special index such as 0 must be reserved for out-of-vocabulary words. A sentence is first tokenized: "I thought the movie was incredible and inspiring" becomes "I", "thought", "the", "movie", "was", "incredible", "and", "inspiring"; mapping each word to its index gives the vector [41 804 201534 1005 15 7446 5 13767]. The input must have a fixed length (max_len), say 10; this sentence has only 8 words, so the remaining 2 positions are padded with 0. Looking up [41 804 201534 1005 15 7446 5 13767 0 0] in the embedding matrix then yields a tensor of shape [batch_size = 1, max_len = 10, word2vec_dimension = 50], which is the model input.
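A small self-contained sketch of this step (the toy vocabulary, vocabulary size and all-zero embedding matrix are made up for illustration; the real code below builds the index from the word2vec vocabulary):

import numpy as np

max_len = 10
vocab_size, dim = 250000, 50                              # toy sizes, large enough to cover the example indices
word2index = {"I": 41, "thought": 804, "the": 201534, "movie": 1005,
              "was": 15, "incredible": 7446, "and": 5, "inspiring": 13767}

sentence = "I thought the movie was incredible and inspiring".split()
ids = [word2index.get(w, 0) for w in sentence]            # 0 = out-of-vocabulary
ids = (ids + [0] * max_len)[:max_len]                     # pad with 0, then truncate to max_len
print(ids)                                                # [41, 804, 201534, 1005, 15, 7446, 5, 13767, 0, 0]

embedding_matrix = np.zeros((vocab_size, dim), dtype="float32")   # row i = vector of word i, row 0 all zeros
batch = embedding_matrix[np.array([ids])]                 # shape (1, 10, 50) = (batch_size, max_len, dimension)
print(batch.shape)

Note that keras's sequence.pad_sequences, used in the code below, pads at the front by default (padding='pre'), whereas this toy pads at the end as in the description above; either works as long as training and prediction are consistent.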
3. Network architecture
Step 2 describes the input; the output is a one-hot vector. With 3 classes (positive, negative, neutral) the targets are [1 0 0], [0 1 0] and [0 0 1], and a softmax output can be read as per-class probabilities. For binary classification the target can simply be 0 or 1, and a sigmoid maps the output into the 0~1 range so it can also be read as a probability. With inputs and outputs defined, we define the model/network structure and let it learn its parameters. The corpus here is small, so the model should be as simple as possible: a single CNN layer (including pooling, of course) or an RNN (LSTM, GRU, bidirectional LSTM), followed by one fully connected layer. In experiments the bidirectional LSTM worked best, reaching over 95% accuracy on the test set. This matches intuition: a CNN only extracts local word windows and ignores wider context, and a unidirectional LSTM reads the sentence left to right and cannot use information to its right, so the bi-LSTM adds a reverse pass over the sentence.
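For comparison, a one-layer CNN variant of the same idea might look like the sketch below (filter count and window size are illustrative; n_symbols, embedding_weights, vocab_dim and input_length are the values defined in the training code of section 5, and Keras 2-style layer names are assumed):

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout

cnn = Sequential()
cnn.add(Embedding(input_dim=n_symbols, output_dim=vocab_dim,
                  weights=[embedding_weights], input_length=input_length))
cnn.add(Conv1D(64, 3, activation='relu'))    # 3-word windows, i.e. local n-gram features
cnn.add(GlobalMaxPooling1D())                # pool over the whole sequence
cnn.add(Dropout(0.4))
cnn.add(Dense(1, activation='sigmoid'))
cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])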
4. Training
Split the data into a training set and a validation set (0.2 validation ratio), train on the training set, and also compute loss and accuracy on the validation set. Normally the training loss keeps falling and training accuracy keeps rising until convergence; the validation set behaves the same at first, but at some point its loss starts rising and its accuracy starts falling, which signals overfitting, and that is the moment for early stopping. Then retrain on the whole dataset with the chosen parameters to get the final model.
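In Keras this can be automated with the EarlyStopping callback instead of watching the curves by hand (a sketch; model, x_train, y_train, x_test, y_test, batch_size and n_epoch refer to the objects defined in the training code below, and patience=2 is just an example):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=2)   # stop after 2 epochs without improvement
history = model.fit(x_train, y_train,
                    batch_size=batch_size, nb_epoch=n_epoch,
                    validation_data=(x_test, y_test),
                    callbacks=[early_stop])
n_epoch_final = len(history.history['val_loss'])             # roughly the epoch count to reuse when retraining on all data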
5. Python code
Keras training
# -*- coding: utf-8 -*-
import time
import yaml
import sys
from sklearn.model_selection import train_test_split
import multiprocessing
import numpy as np
from gensim.models import Word2Vec
from gensim.corpora.dictionary import Dictionary
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Bidirectional
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Dropout,Activation
from keras.models import model_from_yaml
np.random.seed(35) # For Reproducibility
import jieba
import pandas as pd
sys.setrecursionlimit(1000000)
# set parameters:
vocab_dim = 256
maxlen = 150
batch_size = 32
n_epoch = 5
input_length = 150
validation_rate = 0.2  # fraction of the data held out for validation
cpu_count = multiprocessing.cpu_count()
def read_txt(filename):
    # read one document per line; the first line is dropped (assumed to be a header)
    with open(filename) as f:
        res = [line.replace("\n", "") for line in f]
    del(res[0])
    return res
# Load the training files (positive and negative examples)
def loadfile():
    neg = read_txt("./bida_neg.txt")
    pos = read_txt('./bida_pos.txt')
    combined = np.concatenate((pos, neg))
    y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neg), dtype=int)))
    return combined, y
# Segment each sentence with jieba and strip newlines
def tokenizer(text):
    '''Removes line breaks and segments each document with jieba.'''
    text = [jieba.lcut(document.replace('\n', '')) for document in text]
    return text
def create_dictionaries(model=None, combined=None):
    '''Does three jobs:
        1- creates a word-to-index mapping
        2- creates a word-to-vector mapping
        3- transforms the documents into padded index sequences
    '''
    if (combined is not None) and (model is not None):
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(), allow_update=True)
        # index of every word kept in the word2vec vocabulary (index 0 is reserved for unknown/low-frequency words)
        w2indx = {v: k+1 for k, v in gensim_dict.items()}
        # word vector of every word in that vocabulary
        w2vec = {word: model[word] for word in w2indx.keys()}

        def parse_dataset(combined):
            '''Words become integers'''
            data = []
            for sentence in combined:
                new_txt = []
                for word in sentence:
                    try:
                        new_txt.append(w2indx[word])
                    except KeyError:
                        new_txt.append(0)  # out-of-vocabulary words map to 0
                data.append(new_txt)
            return data

        combined = parse_dataset(combined)
        # pad/truncate every index sequence to maxlen
        combined = sequence.pad_sequences(combined, maxlen=maxlen)
        return w2indx, w2vec, combined
    else:
        print('No data provided...')
def get_data(index_dict, word_vectors, combined, y):
    n_symbols = len(index_dict) + 1  # number of symbols: every word index plus 1, since index 0 is reserved for unknown words
    embedding_weights = np.zeros((n_symbols, vocab_dim))  # row 0 (unknown words) stays all zeros
    for word, index in index_dict.items():  # from index 1 on, fill in each word's vector
        embedding_weights[index, :] = word_vectors[word]
    x_train, x_test, y_train, y_test = train_test_split(combined, y, test_size=validation_rate)
    return n_symbols, embedding_weights, x_train, y_train, x_test, y_test
def word2vec_train(model, combined):
    index_dict, word_vectors, combined = create_dictionaries(model=model, combined=combined)
    return index_dict, word_vectors, combined
## Define the network structure
def train_lstm(n_symbols, embedding_weights, x_train, y_train, x_test, y_test):
    model = Sequential()
    model.add(Embedding(output_dim=vocab_dim,
                        input_dim=n_symbols,
                        mask_zero=True,
                        weights=[embedding_weights],
                        input_length=input_length))  # embedding layer initialized from word2vec
    model.add(Bidirectional(LSTM(32, activation='sigmoid', inner_activation='sigmoid')))
    model.add(Dropout(0.4))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    print('Compiling the Model...')
    model.compile(loss='binary_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    print("Train...")
    model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=n_epoch,
              verbose=1, validation_data=(x_test, y_test))
    print("Evaluate...")
    score = model.evaluate(x_test, y_test, batch_size=batch_size)
    # save the architecture as YAML and the weights as HDF5
    yaml_string = model.to_yaml()
    with open('lstm_data/lstm.yml', 'w') as outfile:
        outfile.write(yaml.dump(yaml_string, default_flow_style=True))
    model.save_weights('lstm_data/lstm.h5')
    print('Test score:', score)
# Train the model and save it
def train():
    combined, y = loadfile()
    combined = tokenizer(combined)
    model = Word2Vec.load("../models/word2vec.model")
    index_dict, word_vectors, combined = create_dictionaries(model, combined)
    n_symbols, embedding_weights, x_train, y_train, x_test, y_test = get_data(index_dict, word_vectors, combined, y)
    train_lstm(n_symbols, embedding_weights, x_train, y_train, x_test, y_test)

if __name__ == '__main__':
    train()
The code above is for binary classification, with the output mapped to the 0~1 range. For multi-class classification, replace the sigmoid activation with softmax (and the Dense(1) output with a Dense layer over the number of classes), change loss='binary_crossentropy' to loss='categorical_crossentropy', and one-hot-encode the labels with y = to_categorical(y, num_classes=classes).
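Concretely, the changed lines for a 3-class (positive/negative/neutral) model might look like this sketch (classes = 3 and the integer label encoding 0/1/2 are assumptions about how the labels would be prepared; to_categorical is imported from keras.utils in Keras 2-style code):

from keras.utils import to_categorical

classes = 3
y = to_categorical(y, num_classes=classes)    # e.g. 0/1/2 -> [1 0 0], [0 1 0], [0 0 1]

model.add(Dense(classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])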
Prediction
# -*- coding: utf-8 -*-
import time
import yaml
import sys
from sklearn.model_selection import train_test_split
import multiprocessing
import numpy as np
from gensim.models import Word2Vec
from gensim.corpora.dictionary import Dictionary
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Dropout,Activation
from keras.models import model_from_yaml
import jieba
import pandas as pd
# set parameters:
vocab_dim = 256
maxlen = 150
batch_size = 32
n_epoch = 5
input_length = 150
cpu_count = multiprocessing.cpu_count()
def init_dictionaries(w2v_model):
    # rebuild the same word-to-index / word-to-vector mappings that were used during training
    gensim_dict = Dictionary()
    gensim_dict.doc2bow(w2v_model.wv.vocab.keys(), allow_update=True)
    w2indx = {v: k+1 for k, v in gensim_dict.items()}
    w2vec = {word: w2v_model[word] for word in w2indx.keys()}
    return w2indx, w2vec
def process_words(w2indx, words):
    # map words to indices (0 for unknown words) and pad to maxlen
    temp = []
    for word in words:
        try:
            temp.append(w2indx[word])
        except KeyError:
            temp.append(0)
    res = sequence.pad_sequences([temp], maxlen=maxlen)
    return res
def input_transform(string, w2index):
    words = jieba.lcut(string)
    return process_words(w2index, words)
def load_model():
    print('loading model......')
    # load the architecture (YAML) and the weights (HDF5) saved during training
    with open('lstm_data/lstm.yml', 'r') as f:
        yaml_string = yaml.load(f)
    model = model_from_yaml(yaml_string)
    model.load_weights('lstm_data/lstm.h5')
    model.compile(loss='binary_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    w2v_model = Word2Vec.load('../models/word2vec.model')
    return model, w2v_model
def lstm_predict(string, model, w2index):
    data = input_transform(string, w2index)  # shape (1, maxlen)
    result = model.predict_classes(data)
    prob = model.predict_proba(data)
    print(string)
    print("prob:" + str(prob))
    if result[0][0] == 1:
        # print(string, ' positive')
        return 1
    else:
        # print(string, ' negative')
        return -1
if __name__ == '__main__':
    model, w2v_model = load_model()
    w2index, _ = init_dictionaries(w2v_model)
    lstm_predict("平安大跌", model, w2index)
TensorFlow training
# -*- coding: utf-8 -*-
from gensim.corpora import Dictionary
from gensim.models import Word2Vec
import numpy as np
from random import randint
from sklearn.model_selection import train_test_split
import tensorflow as tf
import jieba
def read_txt(filename):
    # read one document per line; the first line is dropped (assumed to be a header)
    with open(filename) as f:
        res = [line.replace("\n", "") for line in f]
    del(res[0])
    return res
def loadfile():
    neg = read_txt("../data/bida_neg.txt")
    pos = read_txt('../data/bida_pos.txt')
    combined = np.concatenate((pos, neg))
    y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neg), dtype=int)))
    return combined, y
def create_dictionaries(model=None):
    # build the word-to-index mapping and the embedding matrix from the word2vec model
    if model is not None:
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(), allow_update=True)
        w2index = {v: k+1 for k, v in gensim_dict.items()}
        vectors = np.zeros((len(w2index) + 1, num_dimensions), dtype='float32')  # row 0 is for unknown words
        for k, v in gensim_dict.items():
            vectors[k+1] = model[v]
        return w2index, vectors
def get_train_batch(batch_size):
    # sample a random batch from the training set
    labels = []
    arr = np.zeros([batch_size, max_seq_length])
    for i in range(batch_size):
        num = randint(0, len(X_train) - 1)
        labels.append(y_train[num])
        arr[i] = X_train[num]
    return arr, labels

def get_test_batch(batch_size):
    # sample a random batch from the test set
    labels = []
    arr = np.zeros([batch_size, max_seq_length])
    for i in range(batch_size):
        num = randint(0, len(X_test) - 1)
        labels.append(y_test[num])
        arr[i] = X_test[num]
    return arr, labels
def get_all_batches(batch_size=32, mode="train"):
    # split the whole train or test set into consecutive batches (plus a smaller remainder batch if needed)
    X, y = None, None
    if mode == "train":
        X = X_train
        y = y_train
    elif mode == "test":
        X = X_test
        y = y_test
    batches = int(len(y) / batch_size)
    arrs = [X[i*batch_size:i*batch_size + batch_size] for i in range(batches)]
    labels = [y[i*batch_size:i*batch_size + batch_size] for i in range(batches)]
    if batches * batch_size < len(y):  # skip the remainder batch when it would be empty
        arrs.append(X[batches*batch_size:len(y)])
        labels.append(y[batches*batch_size:len(y)])
    return arrs, labels
def parse_dataset(sentences, w2index, max_len):
    # segment each sentence with jieba and turn it into a fixed-length index vector
    data = []
    for sentence in sentences:
        words = jieba.lcut(sentence.replace('\n', ''))
        new_txt = np.zeros((max_len), dtype='int32')
        index = 0
        for word in words:
            try:
                new_txt[index] = w2index[word]
            except KeyError:
                new_txt[index] = 0  # unknown word
            index += 1
            if index >= max_len:
                break
        data.append(new_txt)
    return data
batch_size = 32
lstm_units = 64
num_classes = 2
iterations = 50000
num_dimensions = 256
max_seq_len = 150
max_seq_length = 150
validation_rate = 0.2
random_state = 9876
output_keep_prob = 0.5
learning_rate = 0.001
combined, y = loadfile()
model = Word2Vec.load("../models/word2vec.model")
w2index, vectors = create_dictionaries(model)
X = parse_dataset(combined, w2index, max_seq_len)
y = [[1,0] if yi == 1 else [0,1] for yi in y]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_rate, random_state=random_state)
tf.reset_default_graph()
labels = tf.placeholder(tf.float32, [None, num_classes])
input_data = tf.placeholder(tf.int32, [None, max_seq_length])
# look up the (fixed) word2vec embedding matrix for each index sequence
data = tf.nn.embedding_lookup(vectors, input_data)
#bidirectional lstm
lstm_fw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_fw = tf.contrib.rnn.DropoutWrapper(cell=lstm_fw, output_keep_prob=output_keep_prob)
lstm_bw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_bw = tf.contrib.rnn.DropoutWrapper(cell=lstm_bw, output_keep_prob=output_keep_prob)
(output_fw, output_bw),_ = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw, cell_bw=lstm_bw,inputs = data, dtype=tf.float32)
outputs = tf.concat([output_fw, output_bw], axis=2)
# Fully connected layer.
weight = tf.get_variable(name="W", shape=[2 * lstm_units, num_classes],
dtype=tf.float32)
bias = tf.get_variable(name="b", shape=[num_classes], dtype=tf.float32,
initializer=tf.zeros_initializer())
# take the output at the last time step: transpose to [time, batch, 2*units] and gather the final step
last = tf.transpose(outputs, [1, 0, 2])
last = tf.gather(last, int(last.get_shape()[0]) - 1)
logits = (tf.matmul(last, weight) + bias)
prediction = tf.nn.softmax(logits)
correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
sess = tf.InteractiveSession()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())
cal_iter = 500
loss_train, loss_test = 0.0, 0.0
acc_train, acc_test = 0.0, 0.0
print("start training...")
for i in range(iterations):
    # next random batch of examples
    next_batch, next_batch_labels = get_train_batch(batch_size)
    sess.run(optimizer, {input_data: next_batch, labels: next_batch_labels})
    # every cal_iter iterations, save a checkpoint and report loss/accuracy over the full train and test sets
    if (i % cal_iter == 0):
        save_path = saver.save(sess, "models/pretrained_lstm.ckpt")
        print("iteration: " + str(i))
        train_acc, train_loss = 0.0, 0.0
        test_acc, test_loss = 0.0, 0.0
        train_arrs, train_labels = get_all_batches(300)
        test_arrs, test_labels = get_all_batches(300, "test")
        for k in range(len(train_labels)):
            temp1, temp2 = sess.run([accuracy, loss], {input_data: train_arrs[k], labels: train_labels[k]})
            train_acc += temp1
            train_loss += temp2
        train_acc /= len(train_labels)
        train_loss /= len(train_labels)
        for k in range(len(test_labels)):
            temp1, temp2 = sess.run([accuracy, loss], {input_data: test_arrs[k], labels: test_labels[k]})
            test_acc += temp1
            test_loss += temp2
        test_acc /= len(test_labels)
        test_loss /= len(test_labels)
        print("train accuracy: " + str(train_acc) + ", train loss: " + str(train_loss))
        print("test accuracy: " + str(test_acc) + ", test loss: " + str(test_loss))
Prediction
import tensorflow as tf
from gensim.models import Word2Vec
from gensim.corpora.dictionary import Dictionary
import numpy as np
import jieba
def create_dictionaries(model=None):
    # same mapping as in training: word -> index, plus the embedding matrix
    if model is not None:
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(), allow_update=True)
        w2index = {v: k+1 for k, v in gensim_dict.items()}
        vectors = np.zeros((len(w2index) + 1, num_dimensions), dtype='float32')
        for k, v in gensim_dict.items():
            vectors[k+1] = model[v]
        return w2index, vectors
def parse_dataset(sentence, w2index, max_len):
    # segment one sentence and turn it into a single fixed-length index vector
    words = jieba.lcut(sentence.replace('\n', ''))
    new_txt = np.zeros((max_len), dtype='int32')
    index = 0
    for word in words:
        try:
            new_txt[index] = w2index[word]
        except KeyError:
            new_txt[index] = 0
        index += 1
        if index >= max_len:
            break
    return [new_txt]
batch_size = 32
lstm_units = 64
num_classes = 2
iterations = 100000
num_dimensions = 256
max_seq_len = 150
max_seq_length = 150
validation_rate = 0.2
random_state = 333
output_keep_prob = 0.5
model = Word2Vec.load("../models/word2vec.model")
w2index, vectors = create_dictionaries(model)
tf.reset_default_graph()
labels = tf.placeholder(tf.float32, [None, num_classes])
input_data = tf.placeholder(tf.int32, [None, max_seq_length])
data = tf.nn.embedding_lookup(vectors, input_data)
"""
bi-lstm
"""
#bidirectional lstm
lstm_fw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_fw = tf.contrib.rnn.DropoutWrapper(cell=lstm_fw, output_keep_prob=output_keep_prob)
lstm_bw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_bw = tf.contrib.rnn.DropoutWrapper(cell=lstm_bw, output_keep_prob=output_keep_prob)
(output_fw, output_bw),_ = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw, cell_bw=lstm_bw,inputs = data, dtype=tf.float32)
outputs = tf.concat([output_fw, output_bw], axis=2)
# Fully connected layer.
weight = tf.get_variable(name="W", shape=[2 * lstm_units, num_classes],
dtype=tf.float32)
bias = tf.get_variable(name="b", shape=[num_classes], dtype=tf.float32,
initializer=tf.zeros_initializer())
# last = tf.reshape(outputs, [-1, 2 * lstm_units])
# take the output at the last time step
last = tf.transpose(outputs, [1, 0, 2])
last = tf.gather(last, int(last.get_shape()[0]) - 1)
logits = (tf.matmul(last, weight) + bias)
prediction = tf.nn.softmax(logits)
correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
sess = tf.InteractiveSession()
saver = tf.train.Saver()
#saver.restore(sess, 'models/pretrained_lstm.ckpt-27000.data-00000-of-00001')
saver.restore(sess, tf.train.latest_checkpoint('models'))
l = ["平安银行大跌", "平安银行暴跌", "平安银行扭亏为盈","小米将加深与TCL合作",
"苹果手机现在卖的不如以前了","苹果和三星的糟糕业绩预示着全球商业领域将经历更加严
峻的考验。"
,"这道菜不好吃"]
for s in l:
print(s)
X = parse_dataset(s, w2index, max_seq_len)
predictedSentiment = sess.run(prediction, {input_data: X})[0]
print(predictedSentiment[0], predictedSentiment[1])
References:
https://github.com/adeshpande3/LSTM-Sentiment-Analysis/blob/master/Oriole%20LSTM.ipynb
https://buptldy.github.io/2016/07/20/2016-07-20-sentiment%20analysis/
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
https://arxiv.org/abs/1408.5882