Deep Learning for Sentiment Analysis

The basic idea of sentiment analysis with word vectors plus deep learning is:
1. Train word vectors.
2. Preprocess and tokenize each sentence into a sequence of words; fix a maximum sequence length (truncate longer sentences, pad shorter ones); assign each word an index that maps to its word vector.
3. Define the network structure, for example one LSTM layer plus a fully connected layer, with dropout for better generalization, then start training.
4. Tune parameters while watching training and validation loss/accuracy. When validation accuracy stops improving and starts dropping non-randomly (usually accompanied by rising validation loss), the model is starting to overfit: stop training there, then retrain on all the data with that epoch/iteration count and those parameters to get the final model.

1. Word vectors
The corpus and steps for training word vectors were covered in an earlier article; you can add the sentiment-analysis corpus to that training corpus, and the method and code are omitted here. Two things are worth noting: the word-vector corpus should ideally come from the same domain as the sentiment task, and the more of it the better; also, segmentation quality strongly affects word-vector quality, so extra preprocessing such as removing stop words or adding domain terms as a custom jieba dictionary helps.
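Since that code is omitted here, the following is only a minimal sketch of word-vector training with gensim (the corpus path ../data/corpus.txt is hypothetical; size and min_count are chosen to match the vocab_dim and the frequency threshold assumed by the code below):

import multiprocessing
import jieba
from gensim.models import Word2Vec

# one raw document per line, segmented with jieba
with open("../data/corpus.txt") as f:
    sentences = [jieba.lcut(line.strip()) for line in f]

w2v_model = Word2Vec(sentences,
                     size=256,        # must match vocab_dim used later
                     window=5,
                     min_count=10,    # words seen fewer than 10 times are dropped
                     workers=multiprocessing.cpu_count())
w2v_model.save("../models/word2vec.model")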

2. From text to numbers
The figure below illustrates the text-to-numbers process well. From the word vectors we get a vocabulary in which every word has an index (for example, the word's position in the vocabulary + 1), and that index maps to the word's vector, giving an embedding matrix like the one in the figure. Remember to reserve a special index such as 0 for out-of-vocabulary words. After segmentation a sentence becomes a sequence of words; for example "I thought the movie was incredible and inspiring" becomes "I", "thought", "the", "movie", "was", "incredible", "and", "inspiring", and mapping each word to its index gives the vector [41 804 201534 1005 15 7446 5 13767]. The input must be converted to a fixed length (max_len), say 10; this sentence has only 8 words, so the remaining 2 slots are padded with 0. Looking up [41 804 201534 1005 15 7446 5 13767 0 0] in the embedding matrix then yields a tensor of shape [batch_size = 1, max_len = 10, word2vec_dimension = 50], which is the model input.


[Figure: converting text to numeric input (由文本到数字输入.png)]
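To make the mapping concrete, here is a small illustrative sketch of the steps above (toy vocabulary and a random embedding matrix, not the real ones):

import numpy as np

max_len, word2vec_dimension = 10, 50
vocab = ["I", "thought", "the", "movie", "was", "incredible", "and", "inspiring"]
word2index = {w: i + 1 for i, w in enumerate(vocab)}      # index 0 is reserved for unknown/padding
embedding_matrix = np.random.rand(len(vocab) + 1, word2vec_dimension)  # row 0 is the padding/OOV row

tokens = "I thought the movie was incredible and inspiring".split()
indices = [word2index.get(w, 0) for w in tokens]          # words -> indices
indices = (indices + [0] * max_len)[:max_len]             # pad with 0 (or truncate) to max_len
x = embedding_matrix[np.array(indices)][np.newaxis, ...]  # look up rows of the embedding matrix
print(x.shape)  # (1, 10, 50) = [batch_size, max_len, word2vec_dimension]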

3. Network structure
Step 2 describes the input; the output is a one-hot vector. For 3-class classification (positive, negative, neutral) the outputs are [1 0 0], [0 1 0] and [0 0 1], and a softmax output can then be read as the probability of each class. For binary classification the output can simply be 0 or 1, with a sigmoid squashing the output into the 0-1 range so it can likewise be read as a probability. With input and output defined, we define the model/network structure and let it learn its own parameters. The corpus here is small, so the model should be kept as simple as possible: one CNN layer (including a pooling layer, of course) or one RNN layer (LSTM, GRU, bidirectional LSTM), followed by a fully connected layer. Experiments found the bidirectional LSTM worked best, reaching over 95% accuracy on the test set. This matches the usual intuition: a CNN only extracts local windows of words and ignores wider context, while an LSTM reads the sentence left to right and cannot use information to its right, so a bi-LSTM adds a reverse pass to capture it.
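For comparison, a minimal sketch of the one-layer CNN alternative mentioned above (same Keras-style API as the code below; the 64 filters of width 3 are arbitrary choices, not tuned values, and mask_zero is omitted because the convolution layer does not support masking):

from keras.models import Sequential
from keras.layers import Embedding, Convolution1D, GlobalMaxPooling1D, Dense, Dropout, Activation

def build_cnn(n_symbols, embedding_weights, vocab_dim=256, input_length=150):
    model = Sequential()
    model.add(Embedding(output_dim=vocab_dim, input_dim=n_symbols,
                        weights=[embedding_weights], input_length=input_length))
    # one convolution layer extracts features over windows of 3 consecutive words,
    # then global max pooling keeps the strongest response of each filter
    model.add(Convolution1D(64, 3, activation='relu'))
    model.add(GlobalMaxPooling1D())
    model.add(Dropout(0.4))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model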

4. Training
Split the data into training and validation sets (e.g. at a 0.2 ratio), train on the training set, and compute loss and accuracy on the validation set as well. Normally the training loss keeps dropping and training accuracy keeps rising until convergence; the validation set behaves the same at first, but at some point its loss starts rising and its accuracy starts dropping, which signals overfitting, so stop at that point (early stopping). Then retrain on all the data with the chosen parameters to get the final model.
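A minimal sketch of doing this early stopping automatically with a Keras callback (the fixed n_epoch loop in train_lstm below does not use it; patience=2 is an arbitrary choice):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=2, verbose=1)
# model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=50,
#           validation_data=(x_val, y_val), callbacks=[early_stop])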

5. Python code
Training with Keras

# -*- coding: utf-8 -*-
import time
import yaml
import sys
from sklearn.model_selection import train_test_split
import multiprocessing
import numpy as np
from gensim.models import Word2Vec
from gensim.corpora.dictionary import Dictionary

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Bidirectional
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Dropout,Activation
from keras.models import model_from_yaml
np.random.seed(35)  # For Reproducibility
import jieba
import pandas as pd
import sys
sys.setrecursionlimit(1000000)
# set parameters:
vocab_dim = 256
maxlen = 150
batch_size = 32
n_epoch = 5
input_length = 150
validation_rate = 0.2  # fraction held out for validation; set to 0.0 only when retraining the final model on all data
cpu_count = multiprocessing.cpu_count()

# Read one document per line, stripping newlines; the first line is dropped.
def read_txt(filename):
    with open(filename) as f:
        res = [line.replace("\n", "") for line in f]
    del res[0]
    return res


# Load the training files: positive examples labelled 1, negative examples labelled 0.
def loadfile():
    neg = read_txt("./bida_neg.txt")
    pos = read_txt('./bida_pos.txt')
    combined = np.concatenate((pos, neg))
    y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neg), dtype=int)))
    return combined, y

# Tokenize each sentence with jieba and remove newline characters.
def tokenizer(text):
    ''' Simple parser: removes line breaks from each document and
        segments it into a list of words with jieba
    '''
    text = [jieba.lcut(document.replace('\n', '')) for document in text]
    return text

def create_dictionaries(model=None,
                        combined=None):
    ''' Does three jobs:
        1- creates a word-to-index mapping
        2- creates a word-to-vector mapping
        3- transforms the tokenized documents into padded index sequences
    '''
    if (combined is not None) and (model is not None):
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(),
                            allow_update=True)
        w2indx = {v: k+1 for k, v in gensim_dict.items()}  # index of every word kept by word2vec (frequency above min_count, 10)
        w2vec = {word: model[word] for word in w2indx.keys()}  # word vector of every such word

        def parse_dataset(combined):
            ''' Words become integers (0 for out-of-vocabulary words)
            '''
            data = []
            for sentence in combined:
                new_txt = []
                for word in sentence:
                    try:
                        new_txt.append(w2indx[word])
                    except KeyError:
                        new_txt.append(0)
                data.append(new_txt)
            return data

        combined = parse_dataset(combined)
        combined = sequence.pad_sequences(combined, maxlen=maxlen)  # pad/truncate each index sequence to maxlen
        return w2indx, w2vec, combined
    else:
        print('No data provided...')



def get_data(index_dict, word_vectors, combined, y):

    n_symbols = len(index_dict) + 1  # number of word indices; low-frequency words map to index 0, hence the +1
    embedding_weights = np.zeros((n_symbols, vocab_dim))  # row 0 (unknown/padding) stays all zeros
    for word, index in index_dict.items():  # fill rows from index 1 onward with each word's vector
        embedding_weights[index, :] = word_vectors[word]
    x_train, x_test, y_train, y_test = train_test_split(combined, y, test_size=validation_rate)

    return n_symbols, embedding_weights, x_train, y_train, x_test, y_test


def word2vec_train(model, combined):

    index_dict, word_vectors,combined = create_dictionaries(model=model,combined=combined)
    return   index_dict, word_vectors, combined

## Define the network structure and train the model
def train_lstm(n_symbols,embedding_weights,x_train,y_train,x_test,y_test):


    model = Sequential()
    model.add(Embedding(output_dim=vocab_dim,
                        input_dim=n_symbols,
                        mask_zero=True,
                        weights=[embedding_weights],
                        input_length=input_length))  # Adding Input Length
 
    model.add(Bidirectional(LSTM(32, activation='sigmoid',inner_activation='sigmoid')))

    model.add(Dropout(0.4))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    print('Compiling the Model...')
    model.compile(loss='binary_crossentropy',
                  optimizer='adam', metrics=['accuracy'])

    print("Train...")
    model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=n_epoch,verbose=1, validation_data=(x_test, y_test))

    print("Evaluate...")
    score = model.evaluate(x_test, y_test,
                                batch_size=batch_size)

    yaml_string = model.to_yaml()
    with open('lstm_data/lstm.yml', 'w') as outfile:
        outfile.write( yaml.dump(yaml_string, default_flow_style=True) )
    model.save_weights('lstm_data/lstm.h5')
    print('Test score:', score)

# Train the model and save it
def train():
    combined,y=loadfile()
    combined = tokenizer(combined)
    model = Word2Vec.load("../models/word2vec.model")
    index_dict, word_vectors,combined=create_dictionaries(model, combined)
    n_symbols,embedding_weights,x_train,y_train,x_test,y_test=get_data(index_dict, word_vectors,combined,y)
    train_lstm(n_symbols,embedding_weights,x_train,y_train,x_test,y_test)


if __name__=='__main__':
    train()


The code above is for binary classification, with the output mapped into the 0~1 range. For multi-class classification, replace the sigmoid activation with softmax, change loss='binary_crossentropy' to loss='categorical_crossentropy', and one-hot encode the labels with y = to_categorical(y, num_classes=classes); a short sketch of these changes follows.
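A minimal sketch of the multi-class changes, assuming 3 classes and the same Keras-style API as above (classes is a hypothetical variable name; the embedding/Bi-LSTM/dropout layers stay as in train_lstm):

from keras.utils.np_utils import to_categorical

classes = 3                                    # e.g. positive / negative / neutral
y = to_categorical(y, num_classes=classes)     # integer labels -> one-hot vectors

# ...same Embedding / Bidirectional LSTM / Dropout layers as in train_lstm()...
model.add(Dense(classes))
model.add(Activation('softmax'))               # replaces Dense(1) + sigmoid
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])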

Prediction

# -*- coding: utf-8 -*-
import time
import yaml
import sys
from sklearn.model_selection import train_test_split
import multiprocessing
import numpy as np
from gensim.models import Word2Vec
from gensim.corpora.dictionary import Dictionary

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Dropout,Activation
from keras.models import model_from_yaml
import jieba
import pandas as pd

# set parameters:
vocab_dim = 256
maxlen = 150
batch_size = 32
n_epoch = 5
input_length = 150
cpu_count = multiprocessing.cpu_count()


def init_dictionaries(w2v_model):
    gensim_dict = Dictionary()
    gensim_dict.doc2bow(w2v_model.wv.vocab.keys(),
                            allow_update=True)
    w2indx = {v: k+1 for k, v in gensim_dict.items()}
    w2vec = {word: w2v_model[word] for word in w2indx.keys()}
    return w2indx, w2vec

# Map words to indices (0 for unknown words) and pad to maxlen.
def process_words(w2indx, words):

    temp = []
    for word in words:
        try:
            temp.append(w2indx[word])
        except KeyError:
            temp.append(0)
    res = sequence.pad_sequences([temp], maxlen=maxlen)
    return res

def input_transform(string, w2index):
    words=jieba.lcut(string)
    return process_words(w2index, words)


def load_model():
    print('loading model......')
    with open('lstm_data/lstm.yml', 'r') as f:
        yaml_string = yaml.load(f)
    model = model_from_yaml(yaml_string)

    model.load_weights('lstm_data/lstm.h5')
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',metrics=['accuracy'])

    w2v_model=Word2Vec.load('../models/word2vec.model')
    return model,w2v_model

def lstm_predict(string, model, w2index):
    data = input_transform(string, w2index)  # already shaped (1, maxlen) by pad_sequences
    result = model.predict_classes(data)
    prob = model.predict_proba(data)
    print(string)
    print("prob:" + str(prob))
    if result[0][0] == 1:
        # print(string, ' positive')
        return 1
    else:
        # print(string, ' negative')
        return -1

if __name__=='__main__':
    model,w2v_model = load_model()
    w2index, _ = init_dictionaries(w2v_model)
    lstm_predict("平安大跌", model, w2index)

Training with TensorFlow

# -*- coding: utf-8 -*-

from gensim.corpora import Dictionary
from gensim.models import Word2Vec
import numpy as np
from random import randint
from sklearn.model_selection import train_test_split
import tensorflow as tf
import jieba

# Read one document per line, stripping newlines; the first line is dropped.
def read_txt(filename):
    with open(filename) as f:
        res = [line.replace("\n", "") for line in f]
    del res[0]
    return res


def loadfile():
    neg = read_txt("../data/bida_neg.txt")
    pos = read_txt('../data/bida_pos.txt')
    combined=np.concatenate((pos, neg))
    y = np.concatenate((np.ones(len(pos),dtype=int),np.zeros(len(neg),dtype=int)))

    return combined,y


# Build a word -> index table and an embedding matrix from the word2vec model;
# row 0 is reserved for unknown/padding words.
def create_dictionaries(model=None):

    if model is not None:
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(),
                            allow_update=True)
        w2index = {v: k+1 for k, v in gensim_dict.items()}
        vectors = np.zeros((len(w2index) + 1, num_dimensions), dtype='float32')
        for k, v in gensim_dict.items():
            vectors[k+1] = model[v]

    return w2index, vectors

# Sample a random batch from the training set.
def get_train_batch(batch_size):
    labels = []
    arr = np.zeros([batch_size, max_seq_length])
    for i in range(batch_size):
        num = randint(0, len(X_train) - 1)
        labels.append(y_train[num])
        arr[i] = X_train[num]

    return arr, labels

# Sample a random batch from the test set.
def get_test_batch(batch_size):
    labels = []
    arr = np.zeros([batch_size, max_seq_length])
    for i in range(batch_size):
        num = randint(0, len(X_test) - 1)
        labels.append(y_test[num])
        arr[i] = X_test[num]
    return arr, labels

# Split the whole train or test set into consecutive batches (used for evaluation).
def get_all_batches(batch_size = 32, mode = "train"):
    X, y = None, None
    if mode == "train":
        X = X_train
        y = y_train
    elif mode == "test":
        X = X_test
        y = y_test

    batches = int(len(y) / batch_size)
    arrs = [X[i*batch_size:i*batch_size + batch_size] for i in range(batches)]
    labels = [y[i*batch_size:i*batch_size + batch_size] for i in range(batches)]
    if batches * batch_size < len(y):  # keep the final partial batch, but avoid appending an empty one
        arrs.append(X[batches*batch_size:len(y)])
        labels.append(y[batches*batch_size:len(y)])
    return arrs, labels

# Convert each raw sentence into a fixed-length index array (0 = unknown/padding).
def parse_dataset(sentences, w2index, max_len):
    data=[]
    for sentence in sentences:
        words = jieba.lcut(sentence.replace('\n', ''))
        new_txt = np.zeros((max_len), dtype='int32')
        index = 0
        for word in words:
            try:
                new_txt[index] = w2index[word]
            except:
                new_txt[index] = 0

            index += 1
            if index >= max_len:
                break

        data.append(new_txt)
    return data

batch_size = 32
lstm_units = 64
num_classes = 2
iterations = 50000
num_dimensions = 256
max_seq_len = 150
max_seq_length = 150
validation_rate = 0.2
random_state = 9876
output_keep_prob = 0.5
learning_rate = 0.001

combined, y = loadfile()

model = Word2Vec.load("../models/word2vec.model")
w2index, vectors = create_dictionaries(model)

X = parse_dataset(combined, w2index, max_seq_len)
y = [[1,0] if yi == 1 else [0,1] for yi in y]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_rate, random_state=random_state)


tf.reset_default_graph()

labels = tf.placeholder(tf.float32, [None, num_classes])
input_data = tf.placeholder(tf.int32, [None, max_seq_length])

# Look up the pre-trained word vectors for each index: [batch, max_seq_length, num_dimensions].
data = tf.nn.embedding_lookup(vectors, input_data)

#bidirectional lstm
lstm_fw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_fw = tf.contrib.rnn.DropoutWrapper(cell=lstm_fw, output_keep_prob=output_keep_prob)
lstm_bw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_bw = tf.contrib.rnn.DropoutWrapper(cell=lstm_bw, output_keep_prob=output_keep_prob)
(output_fw, output_bw),_ = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw, cell_bw=lstm_bw,inputs = data, dtype=tf.float32)

outputs = tf.concat([output_fw, output_bw], axis=2)
# Fully connected layer.
weight = tf.get_variable(name="W", shape=[2 * lstm_units, num_classes],
                dtype=tf.float32)

bias = tf.get_variable(name="b", shape=[num_classes], dtype=tf.float32,
                initializer=tf.zeros_initializer())

# Take the outputs at the last time step as the sentence representation.
last = tf.transpose(outputs, [1,0,2])
last = tf.gather(last, int(last.get_shape()[0]) - 1)

logits = (tf.matmul(last, weight) + bias)
prediction = tf.nn.softmax(logits)

correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

sess = tf.InteractiveSession()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())


cal_iter = 500


loss_train, loss_test = 0.0, 0.0
acc_train, acc_test = 0.0, 0.0
print("start training...")
for i in range(iterations):

    # Next batch of documents
    next_batch, next_batch_labels = get_train_batch(batch_size)
    sess.run(optimizer, {input_data: next_batch, labels: next_batch_labels})

    # Save the network and report train/test metrics every cal_iter (500) iterations
    if (i % cal_iter == 0):
        save_path = saver.save(sess, "models/pretrained_lstm.ckpt")
        print("iteration: " + str(i))
        train_acc, train_loss = 0.0, 0.0
        test_acc, test_loss = 0.0, 0.0
        train_arrs, train_labels = get_all_batches(300)
        test_arrs, test_labels = get_all_batches(300, "test")

        for k in range(len(train_labels)):
            temp1, temp2 = sess.run([accuracy, loss], {input_data: train_arrs[k], labels : train_labels[k]})
            train_acc += temp1
            train_loss += temp2
        train_acc /= len(train_labels)
        train_loss /= len(train_labels)

        for k in range(len(test_labels)):
            temp1, temp2 = sess.run([accuracy, loss], {input_data: test_arrs[k], labels : test_labels[k]})
            test_acc += temp1
            test_loss += temp2
        test_acc /= len(test_labels)
        test_loss /= len(test_labels)

        print("train accuracy: " + str(train_acc) + ", train loss: " + str(train_loss))
        print("test accucary: " + str(test_acc) + ", test loss: " + str(test_loss))

          

Prediction

import tensorflow as tf
from gensim.models import Word2Vec
from gensim.corpora.dictionary import Dictionary
import numpy as np
import jieba

def create_dictionaries(model=None):

    if model is not None:
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(),
                            allow_update=True)
        w2index = {v: k+1 for k, v in gensim_dict.items()}
        vectors = np.zeros((len(w2index) + 1, num_dimensions), dtype='float32')
        for k, v in gensim_dict.items():
            vectors[k+1] = model[v]

    return w2index, vectors

def parse_dataset(sentence, w2index, max_len):
    words = jieba.lcut(sentence.replace('\n', ''))
    new_txt = np.zeros((max_len), dtype='int32')
    index = 0
    for word in words:
        try:
            new_txt[index] = w2index[word]
        except KeyError:
            new_txt[index] = 0

        index += 1
        if index >= max_len:
            break

    return [new_txt]

batch_size = 32
lstm_units = 64
num_classes = 2
iterations = 100000
num_dimensions = 256
max_seq_len = 150
max_seq_length = 150
validation_rate = 0.2
random_state = 333
output_keep_prob = 0.5


model = Word2Vec.load("../models/word2vec.model")
w2index, vectors = create_dictionaries(model)




tf.reset_default_graph()

labels = tf.placeholder(tf.float32, [None, num_classes])
input_data = tf.placeholder(tf.int32, [None, max_seq_length])

# Look up the pre-trained word vectors for each index: [batch, max_seq_length, num_dimensions].
data = tf.nn.embedding_lookup(vectors, input_data)


"""
bi-lstm
"""
#bidirectional lstm
lstm_fw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_fw = tf.contrib.rnn.DropoutWrapper(cell=lstm_fw, output_keep_prob=output_keep_prob)
lstm_bw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
lstm_bw = tf.contrib.rnn.DropoutWrapper(cell=lstm_bw, output_keep_prob=output_keep_prob)
(output_fw, output_bw),_ = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw, cell_bw=lstm_bw,inputs = data, dtype=tf.float32)

outputs = tf.concat([output_fw, output_bw], axis=2)

# Fully connected layer.
weight = tf.get_variable(name="W", shape=[2 * lstm_units, num_classes],
                dtype=tf.float32)

bias = tf.get_variable(name="b", shape=[num_classes], dtype=tf.float32,
                initializer=tf.zeros_initializer())

#last = tf.reshape(outputs, [-1, 2 * lstm_units])
last = tf.transpose(outputs, [1,0,2])
last = tf.gather(last, int(last.get_shape()[0]) - 1)

logits = (tf.matmul(last, weight) + bias)
prediction = tf.nn.softmax(logits)


correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))

sess = tf.InteractiveSession()
saver = tf.train.Saver()
#saver.restore(sess, 'models/pretrained_lstm.ckpt-27000.data-00000-of-00001')
saver.restore(sess, tf.train.latest_checkpoint('models'))


l = ["平安银行大跌", "平安银行暴跌", "平安银行扭亏为盈", "小米将加深与TCL合作",
     "苹果手机现在卖的不如以前了",
     "苹果和三星的糟糕业绩预示着全球商业领域将经历更加严峻的考验。",
     "这道菜不好吃"]
for s in l:
    print(s)
    X = parse_dataset(s, w2index, max_seq_len)
    predictedSentiment = sess.run(prediction, {input_data: X})[0]
    print(predictedSentiment[0], predictedSentiment[1])

References:
https://github.com/adeshpande3/LSTM-Sentiment-Analysis/blob/master/Oriole%20LSTM.ipynb
https://buptldy.github.io/2016/07/20/2016-07-20-sentiment%20analysis/
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
https://arxiv.org/abs/1408.5882
