【NLP】8 Chinese Sentence Sentiment Analysis in Practice: Processing Nine Datasets (Hotel, Weibo, Food Delivery, Online Shopping, etc.) and Training SVM and SGD Classifiers

Obtaining sentiment analysis datasets and generating sentence vectors

  • I. Sentiment analysis dataset processing
    • 1. NLPCC 2014 evaluation task test data and answers
    • 2. Hotel review data ChnSentiCorp_htl_all
    • 3. Food-delivery platform user reviews waimai_10k
    • 4. Online shopping review data online_shopping_10_cats
    • 5. Sina Weibo sentiment annotations weibo_senti_100k
    • 6. Sina Weibo sentiment annotations simplifyweibo_4_moods
    • 7. Movie review dataset dmsc_v2
    • 8. Restaurant user review data yf_dianping
    • 9. Product review data yf_amazon
    • 10. Merging the files
  • II. Vector representation of sentences
    • 1. n_similarity cannot be used directly to compute sentence similarity
    • 2. Summing and averaging the word vectors
    • 3. Converting the sentiment and test datasets to sentence vectors
    • 4. Reading the sentence vectors and counting them
    • 5. Support vector machine (SVM)
    • 6. Stochastic gradient descent (SGD)
  • Summary

I. Sentiment analysis dataset processing

1. NLPCC 2014 evaluation task test data and answers

The 2nd CCF Conference on Natural Language Processing and Chinese Computing (NLP&CC 2013) provides 10,000 Weibo posts, but they overlap with the 2014 data, so the 2014 evaluation data is used instead.

NLPCC 2014 Evaluation Tasks Test Data: 14,000 Weibo posts, 45,421 sentences (see the website).

The Weibo corpus is annotated with 7 emotions: like, disgust, happiness, sadness, anger, surprise, fear.

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

def emotion_convert(string):
    dictionary = {
        'like':'POS',
        'disgust':'NEG',
        'happiness':'POS',
        'sadness':'NEG',
        'anger':'NEG',
        'surprise':'POS',
        'fear':'NEG'
    }
    return dictionary.get(string, None)

NLPCC_2014_path = '...your path/NLPCC/evtestdata1/Training data for Emotion Classification.xml'
out_path = '...your path/test/NLPCC_2014.txt'

file = open(NLPCC_2014_path, 'r', encoding='utf-8')
txt = file.read()
file.close()
file = open(out_path, 'a', encoding='utf-8')

soup = BeautifulSoup(txt,'html.parser')

for tag in soup.find_all('sentence'):
    file.write(tag.string + ' ')
    if tag.attrs['opinionated'] == 'N':
        file.write('NORM\n')
    elif tag.attrs['opinionated'] == 'Y':
        file.write(emotion_convert(tag.attrs['emotion-1-type'])+'\n')

2. Hotel review data ChnSentiCorp_htl_all

Reference: see here. More than 7,000 hotel reviews: over 5,000 positive and over 2,000 negative.

For iterating over a dataset with pandas, see here.

import pandas as pd

def emotion_convert(string):
    dictionary = {
        1:'POS',
        0:'NEG'
    }
    return dictionary.get(string, None)

path = '...your path/情感观点评论 倾向性分析/ChnSentiCorp_htl_all/'
pd_all = pd.read_csv(path + 'ChnSentiCorp_htl_all.csv')

print('评论数目(总体):%d' % pd_all.shape[0])
print('评论数目(正向):%d' % pd_all[pd_all.label==1].shape[0])
print('评论数目(负向):%d' % pd_all[pd_all.label==0].shape[0])

# print(pd_all.sample(2))

# Build a balanced corpus

out_path = '...your path/test/ChnSentiCorp_htl_all.txt'
file = open(out_path, 'a', encoding='utf-8')

for row in pd_all.itertuples():
    # print(emotion_convert(getattr(row, 'label')),getattr(row, 'review'))
    try:
        file.write(getattr(row, 'review') + ' ' + emotion_convert(getattr(row, 'label')) + '\n')
    except:
        print('Error!')

Result:

评论数目(总体):7766
评论数目(正向):5322
评论数目(负向):2444
Error!
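
The single Error! line most likely comes from a row whose review field is empty: pandas reads a missing review as NaN (a float), so the string concatenation inside try raises a TypeError. A minimal sketch (an assumption about the cause) that makes this explicit instead of relying on a bare except:

# Count rows whose review field is missing (NaN), then drop them before writing
print(pd_all['review'].isna().sum())
pd_valid = pd_all.dropna(subset=['review'])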

3. Food-delivery platform user reviews waimai_10k

Reference: see here. User reviews collected from a food-delivery platform: 4,000 positive and about 8,000 negative.

The code is the same as above; only the following lines need to change:

path = '...your path/情感观点评论 倾向性分析/waimai_10k/'
pd_all = pd.read_csv(path + 'waimai_10k.csv')

out_path = '...your path/test/waimai_10k.txt'

Result:

评论数目(总体):11987
评论数目(正向):4000
评论数目(负向):7987

4. Online shopping review data online_shopping_10_cats

Reference: see here. 10 categories, more than 60,000 reviews in total, roughly 30,000 positive and 30,000 negative, covering books, tablets, mobile phones, fruit, shampoo, water heaters, Mengniu dairy products, clothing, computers, and hotels.

The code is the same as above. Result:

评论数目(总体):62774
评论数目(正向):31728
评论数目(负向):31046
Error!

5. Sina Weibo sentiment annotations weibo_senti_100k

Reference: see here. More than 100,000 sentiment-annotated Sina Weibo posts, roughly 50,000 positive and 50,000 negative.

The code is the same as above. Result:

评论数目(总体):119988
评论数目(正向):59993
评论数目(负向):59995

6. Sina Weibo sentiment annotations simplifyweibo_4_moods

Reference: see here. More than 360,000 sentiment-annotated Sina Weibo posts in 4 emotion classes: about 200,000 joy, and about 50,000 each of anger, disgust, and depression.

The parts of the code that change:

def emotion_convert(string):
    dictionary = {
        0: 'POS',
        1: 'NEG',
        2: 'NEG',
        3: 'NEG'
    }
    return dictionary.get(string, None)
    
print('评论数目(正向):%d' % pd_all[pd_all.label==0].shape[0])
print('评论数目(负向):%d' % pd_all[pd_all.label!=0].shape[0])

Result:

评论数目(总体):361744
评论数目(正向):199496
评论数目(负向):162248

7. Movie review dataset dmsc_v2

Reference: see here. 28 movies, over 700,000 users, over 2 million ratings/reviews.

The modified parts of the code are as follows:

def emotion_convert(string):
    dictionary = {
        5: 'POS',
        1: 'NEG'
    }
    return dictionary.get(string, None)

print('评论数目(正向):%d' % pd_all[pd_all.rating==5].shape[0])
print('评论数目(负向):%d' % pd_all[pd_all.rating==1].shape[0])

for row in pd_all.itertuples():
    try:
        if getattr(row, 'rating') == 1 or getattr(row, 'rating') == 5:
            file.write(getattr(row, 'comment') + ' ' + emotion_convert(getattr(row, 'rating')) + '\n')
    except:
        print('Error!')

Result:

评论数目(总体):2125056
评论数目(正向):638106
评论数目(负向):190927

8. Restaurant user review data yf_dianping

Reference: see here. 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings.

import pandas as pd

def emotion_convert(string):
    dictionary = {
        5: 'POS',
        1: 'NEG',
        0: 'NEG'
    }
    return dictionary.get(string, None)

path = '...your path/情感观点评论 倾向性分析/yf_dianping/ratings/'
pd_all = pd.read_csv(path + 'ratings.csv')

print('评论数目(总体):%d' % pd_all.shape[0])
print('评论数目(正向):%d' % pd_all[pd_all.rating==5].shape[0])
print('评论数目(负向):%d' % pd_all[(pd_all.rating==0) | (pd_all.rating==1)].shape[0])

out_path = '...your path/test/yf_dianping.txt'
file = open(out_path, 'a', encoding='utf-8')

for row in pd_all.itertuples():
    # print(emotion_convert(getattr(row, 'label')),getattr(row, 'review'))
    try:
        if getattr(row, 'rating') == 0 or getattr(row, 'rating') == 5 or getattr(row, 'rating') == 1:
            file.write((getattr(row, 'comment')).replace('\n',' ') + ' ' + emotion_convert(getattr(row, 'rating')) + '\n')
    except:
        print('Error!')

9. Product review data yf_amazon

Reference: see here. 520,000 products, more than 1,100 categories, 1.42 million users, 7.2 million reviews/ratings.

Note: if a CSV line ends with a comma (i.e. the comment field is empty), the code inside try fails and falls into the except branch.
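
The processing code for yf_amazon is not shown above; a minimal sketch following the dmsc_v2/yf_dianping pattern (assuming, as for yf_dianping, a ratings.csv with rating and comment columns, and skipping rows whose comment is missing):

import pandas as pd

def emotion_convert(rating):
    # Assumption, mirroring dmsc_v2: 5-star ratings are positive, 1-star ratings are negative
    return {5: 'POS', 1: 'NEG'}.get(rating, None)

path = '...your path/情感观点评论 倾向性分析/yf_amazon/'
pd_all = pd.read_csv(path + 'ratings.csv')

out_path = '...your path/test/yf_amazon.txt'
file = open(out_path, 'a', encoding='utf-8')

for row in pd_all.itertuples():
    rating = getattr(row, 'rating')
    comment = getattr(row, 'comment')
    # Rows with an empty comment are read by pandas as NaN (a float), so skip non-string comments
    if rating not in (1, 5) or not isinstance(comment, str):
        continue
    file.write(comment.replace('\n', ' ') + ' ' + emotion_convert(rating) + '\n')
file.close()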

Result:

评论数目(总体):7202921
评论数目(正向):4184629
评论数目(负向):293751

10. Merging the files

The code is as follows:

Catalog = ['ChnSentiCorp_htl_all', 'dmsc_v2', 'NLPCC_2014', 'online_shopping_10_cats', 'simplifyweibo_4_moods', 'waimai_10k', 'weibo_senti_100k', 'yf_amazon', 'yf_dianping']
path = '...your path/test/'

Data = open(path + 'Data.txt', 'a', encoding='utf-8')

for item in Catalog:
    file = open(path + '{}.txt'.format(item), 'r', encoding='utf-8')
    txt = file.read().strip('\n').strip(' ')
    Data.write(txt + '\n')
    file.close()
    print("{}文件合并完毕".format(item))

Data.close()

After merging, the combined file is 741 MB in total.

II. Vector representation of sentences

1. n_similarity cannot be used directly to compute sentence similarity

Consider the following code:

from gensim.models import KeyedVectors
# import jieba

word_vectors = KeyedVectors.load('vectors.kv')

str1 = '如何更换花呗绑定银行卡'
str2 = '花呗更改绑定银行卡'
# str1list = ' '.join(jieba.cut(str1)).split(' ')
# str2list = ' '.join(jieba.cut(str2)).split(' ')
#
# print(str1list)
# print(str2list)

def get_sentence_vec(list1,list2):
    from numpy import array,dot,sum
    from gensim import matutils
    tmp = []
    for item in list1:
         tmp.append(word_vectors[item])
    tmp = array(tmp).mean(axis=0)
    print(tmp)
    print(sum(tmp))
    print(matutils.unitvec(tmp))
    print(sum(matutils.unitvec(tmp)))
    print(sum((matutils.unitvec(tmp))**2))      # sum of squares of the list (1.0 after unitvec)
    tmp2 = []
    for item in list2:
         tmp2.append(word_vectors[item])
    tmp2 = array(tmp2).mean(axis=0)
    return dot(matutils.unitvec(tmp),matutils.unitvec(tmp2))


# Note: str1 and str2 are passed as raw strings here, so the function iterates over them character by character
list_ = get_sentence_vec(str1,str2)
print(list_)

str1sum = [0] * word_vectors.vector_size
cnt1 = 0
for word in str1:
    # print(word)
    cnt1 += 1
    str1sum = str1sum + word_vectors[word]

cnt2 = 0
str2sum = [0] * word_vectors.vector_size
for word in str2:
    cnt2 += 1
    str2sum = str2sum + word_vectors[word]

print('求和',str1sum)
print(sum(str1sum))
print('求平均',str1sum/cnt1)
print(sum(str1sum/cnt1))

Result:

[-2.19413742e-01...300-dim list]		# tmp
3.1234498
[-2.19141953e-02...300-dim list]		# after L2 normalization
0.3119583
1.0
0.93513715		# summing, averaging, and L2 normalization all give the same similarity
求和 [-2.41355096e+00...300-dim list]
34.3579681138508
求平均 [-2.19413723e-01...300-dim list]		# same as tmp
3.1234516467137103

Note a problem in the earlier article on generating sentence vectors:

for word in str1		# should be str1list

Also, matutils.unitvec() is essentially an L2 normalization: it scales the vector so that the sum of the squares of its elements equals 1. And when n_similarity is fed a raw string it reads it character by character, which is obviously not equivalent to the result after jieba word segmentation:

def n_similarity(self, ws1, ws2):
    """Compute cosine similarity between two sets of words.

    Parameters
    ----------
    ws1 : list of str
        Sequence of words.
    ws2: list of str
        Sequence of words.

    Returns
    -------
    numpy.ndarray
        Similarities between `ws1` and `ws2`.

    """
    if not(len(ws1) and len(ws2)):
        raise ZeroDivisionError('At least one of the passed list is empty.')
    v1 = [self[word] for word in ws1]
    v2 = [self[word] for word in ws2]
    return dot(matutils.unitvec(array(v1).mean(axis=0)), matutils.unitvec(array(v2).mean(axis=0)))

2. Summing and averaging the word vectors

The code is as follows:

from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load('vectors.kv')

str1 = '如何更换花呗绑定银行卡'
str2 = '花呗更改绑定银行卡'


def get_sentence_vec(sentence):
    import jieba
    sentence_list = ' '.join(jieba.cut(sentence)).split(' ')
    vecsum = [0] * word_vectors.vector_size
    cnt = 0
    for word in sentence_list:
        vecsum = vecsum + word_vectors[word]
        cnt += 1
    return vecsum/cnt


vec1 = get_sentence_vec(str1)
vec2 = get_sentence_vec(str2)

from scipy.spatial.distance import cosine
print(cosine(vec1, vec2),1-cosine(vec1, vec2))

Result:

0.06521224543237958 0.9347877545676204

This is not far from the result of n_similarity, which here sums and averages over individual characters.
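
This is expected: cosine similarity is invariant to positive scaling, so whether the word vectors are summed, averaged, or L2-normalized, the similarity score is the same. A minimal check (a sketch with toy vectors):

import numpy as np
from scipy.spatial.distance import cosine

# Toy vectors standing in for summed sentence vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])
print(1 - cosine(a, b))          # similarity of the summed vectors
print(1 - cosine(a / 5, b / 7))  # same value after dividing by (different) word counts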

3. Converting the sentiment and test datasets to sentence vectors

On selecting fixed elements from a string, see here.

Error reported:

TypeError: unsupported operand type(s) for /: 'list' and 'int'

Fix: use numpy. Afterwards, a warning appears:

RuntimeWarning: invalid value encountered in true_divide   return vecsum/cnt

See here; it turns out the data contains lines like:

Couldn’t be better. POS

For such lines (English text with no tokens in the word-vector vocabulary) cnt stays 0, so the array division becomes 0/0.

The generated 'Vec.txt' ends up being 17 GB, which is clearly too large, so for now these three datasets are left out (a sketch of the reduced merge follows the list below):

  • Movie review dataset dmsc_v2
  • Restaurant user review data yf_dianping
  • Product review data yf_amazon
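
How the reduced corpus file is produced is not shown below; a minimal sketch, assuming it is simply the merge script from I.10 restricted to the six remaining datasets and written to Data_Small.txt:

# Merge only the six smaller datasets into Data_Small.txt (assumed file naming)
Catalog_small = ['ChnSentiCorp_htl_all', 'NLPCC_2014', 'online_shopping_10_cats',
                 'simplifyweibo_4_moods', 'waimai_10k', 'weibo_senti_100k']
path = '...your path/test/'

Data = open(path + 'Data_Small.txt', 'a', encoding='utf-8')
for item in Catalog_small:
    with open(path + '{}.txt'.format(item), 'r', encoding='utf-8') as file:
        Data.write(file.read().strip('\n').strip(' ') + '\n')
Data.close()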

Handle the case where cnt is zero; everything else is the same as the steps above.

The code is as follows:

from gensim.models import KeyedVectors


def get_sentence_vec(sentence):
    import jieba
    import numpy as np
    sentence_list = ' '.join(jieba.cut(sentence)).split(' ')
    # vecsum = [0] * word_vectors.vector_size
    vecsum = np.zeros(word_vectors.vector_size)
    cnt = 0
    for word in sentence_list:
        try:
            vecsum = vecsum + word_vectors[word]
            cnt += 1
        except:
            continue
    if cnt == 0: return vecsum
    return vecsum/cnt


word_vectors = KeyedVectors.load('vectors.kv')
path = '...your path/Code/test/'
file = open(path + 'Data_Small.txt', 'r', encoding='utf-8')
output = open(path + 'Vec_Small.txt', 'a', encoding='utf-8')

for line in file.readlines():
    vec = get_sentence_vec(line[:-4])
    emotion = line[-4:-1]
    if vec.any() != 0:
        output.write(str(vec).replace('\n','') + ' ' + emotion + '\n')

The generated 'Data_Small.txt' is 106 MB, the generated 'Vec_Small.txt' is 2.38 GB, and the program took 29 minutes to run.
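
Much of that 2.38 GB is the overhead of storing floats as formatted text via str(vec), which also forces the regex-and-eval parsing used later. A binary format would be considerably smaller and faster to load; a sketch of an alternative (not what the code below uses), assuming the vectors and labels are first collected into lists:

import numpy as np

vectors = []   # one vector (np.ndarray) per sentence, filled while reading Data_Small.txt
labels = []    # corresponding 'POS' / 'NEG' / 'NORM' strings
# ... fill vectors and labels here ...
np.save('Vec_Small_vectors.npy', np.array(vectors, dtype=np.float32))
np.save('Vec_Small_labels.npy', np.array(labels))
# Later: X = np.load('Vec_Small_vectors.npy'); y = np.load('Vec_Small_labels.npy')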

The code for converting the test dataset to vectors is the same as above; the differences are:

word_vectors = KeyedVectors.load('vectors.kv')
path = '...your path/chinese-review-datasets/Chinese review datasets/'
file1 = open(path + 'phone_sentence.txt', 'r', encoding='utf-8')
file2 = open(path + 'phone_label.txt', 'r', encoding='utf-8')
list = file2.read().split('\n')
file2.close()
output = open(path + 'Vec_test.txt', 'a', encoding='utf-8')

i = 0
for line in file1.readlines():
    vec = get_sentence_vec(line.strip('\n'))
    if list[i] == '1':
        emotion = 'POS'
    elif list[i] == '0':
        emotion = 'NEG'
    if vec.any() != 0:
        output.write(str(vec).replace('\n','') + ' ' + emotion + '\n')
    else:
        print(i)
        # Sentences whose vector is all zeros (index and text):
        # 1359 屏不比屏差(2231)
        # 20 声噪大(1172)
        #
        # 961 字太大 970 字太大
    i += 1

4. Reading the sentence vectors and counting them

There is a problem here: when the dataset was converted, the label was taken as the last three letters of each line, so the four-letter NORM got truncated to ORM (and the leading 'N' stayed attached to the sentence text)…
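
A more robust alternative (a sketch, not what the counting code below does) is to split the label off from the right of each line, so the three-letter POS/NEG and the four-letter NORM are handled uniformly:

# Take the label as the last whitespace-separated token of the line
line = '[0.12 -0.34 ...] NORM\n'          # hypothetical line from Vec_Small.txt
vector_part, label = line.rstrip('\n').rsplit(' ', 1)
print(label)                               # 'NORM', not truncated to 'ORM'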

First, check how many sentence vectors there are. The code is as follows:

path = '...your path/test/Vec_Small.txt'
file = open(path, 'r', encoding='utf-8')
cnt_POS = 0
cnt_NEG = 0
cnt_NORM = 0
for line in file.readlines():
    if line[-4:-1] == 'POS':
        cnt_POS += 1
    elif line[-4:-1] == 'NEG':
        cnt_NEG += 1
    elif line[-4:-1] == 'ORM':
        cnt_NORM += 1
print('句向量长度:{}'.format(cnt_POS + cnt_NEG + cnt_NORM))
print('积极句向量个数:%s' % cnt_POS)
print('消极句向量个数:%s' % cnt_NEG)
print('正常句向量个数:%s' % cnt_NORM)

Result:

句向量长度:608133
积极句向量个数:307547
消极句向量个数:270998
正常句向量个数:29588

There may be an imbalanced-corpus problem here. Two possible measures:

  1. Ignore it for now and continue;
  2. Use only the positive and negative data for now

The second option is chosen here.

When converting the string to a list, the following error appears:

SyntaxError: invalid syntax

Cause:

list = []
str = '[1 2 3]'
list.append(eval(str))      # SyntaxError: '[1 2 3]' has no commas, so it is not a valid Python literal
print(list)

Fix:

str = '[1, 2, 3]'
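
This is why the SVM code below rewrites the runs of spaces produced by str(vec) into commas before calling eval. An alternative that avoids eval entirely (a sketch, assuming the same bracketed text written by str(vec)):

import numpy as np

line_vec = '[ 0.1  -0.25  3.0 ]'           # hypothetical vector text as written by str(vec)
values = np.array([float(x) for x in line_vec.strip('[] \n').split()])
print(values)                               # parsed back into a float array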

5. Support vector machine (SVM)

The SVM code is as follows:

import re

path = '...your path/test/Vec_Small.txt'
file = open(path, 'r', encoding='utf-8')
train_data = []
train_label = []
i = 0
for line in file.readlines():
    if line[-4:-1] == 'POS':
        train_label.append(1)
    elif line[-4:-1] == 'NEG':
        train_label.append(-1)
    elif line[-4:-1] == 'ORM':
        continue
    train_data.append(eval(re.sub(r'[ ]+', ', ', (line[:-5]).replace('[ ', '[').replace(' ]', ']'))))
    i += 1
    if i % 10000 == 0: print(i)
file.close()
print(len(train_data) == len(train_label))
print('总训练句向量数据:%d' % len(train_data))

path = '...your path/chinese-review-datasets/Chinese review datasets/Vec_test.txt'
file = open(path, 'r', encoding='utf-8')
test_data = []
test_label = []

for line in file.readlines():
    if line[-4:-1] == 'POS':
        test_label.append(1)
    elif line[-4:-1] == 'NEG':
        test_label.append(-1)
    elif line[-4:-1] == 'ORM':
        continue
    test_data.append(eval(re.sub(r'[ ]+', ', ', (line[:-5]).replace('[ ', '[').replace(' ]', ']'))))
file.close()
print(len(test_data) == len(test_label))
print('总测试句向量数据:%d' % len(test_data))


def svm(X_train, y_train, X_test, y_test):  # support vector machine
    from sklearn.svm import SVC  # import the SVC classifier
    svm = SVC()  # *, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, tol=0.001, cache_size=200
    svm.fit(X_train, y_train)  # train the model
    print('Accuracy of svm on training set:{:.2f}'.format(svm.score(X_train, y_train)))  # training-set accuracy
    print('Accuracy of svm on test set:{:.2f}'.format(svm.score(X_test, y_test)))  # test-set accuracy
    predict = svm.predict(X_test)  # predict labels
    return predict  # return the predicted labels


def cal_accuracy(predict, testing_labels):  # compute accuracy from predicted and true labels
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0  # count of correctly classified samples
    for i in range(0, len(predict)):  # for each test sample
        if testing_labels[i] == predict[i]:
            correct_classification += 1  # increment if classified correctly
    # print("The accuracy rate is:" + str(correct_classification / testing_data_num))       # optionally print the accuracy
    return correct_classification / len(predict)  # return the accuracy


predict = svm(train_data, train_label, test_data, test_label)
print(cal_accuracy(predict, test_label))

Result:

……
560000
570000
True
总训练句向量数据:578545
True
总测试句向量数据:6578

Process finished with exit code -1

Problem: it ran all night without producing a result.

Possible fixes:

  1. Further reduce the dataset size, or keep fewer decimal places
  2. Use LinearSVC() instead of SVC()

The second option is adopted: kernel SVC training scales poorly to hundreds of thousands of samples, while the liblinear-based LinearSVC scales roughly linearly. The code changes to:

from sklearn.svm import LinearSVC  # import the LinearSVC classifier
svm = LinearSVC()  # max_iter defaults to 1000

Result:

E:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\_base.py:976: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn("Liblinear failed to converge, increase "
Accuracy of svm on training set:0.74
Accuracy of svm on test set:0.79
0.7902097902097902

To address the warning, set max_iter=10000.

For saving and reloading the model, see here; note that that article has an error, simply use:

import joblib
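
For reference, a minimal sketch of saving and reloading the fitted classifier with joblib (the file name svm_model.m is hypothetical; svm stands for the fitted LinearSVC from the function above):

import joblib

joblib.dump(svm, 'svm_model.m')            # save the fitted model to disk
model = joblib.load('svm_model.m')         # load it back later
print(model.score(test_data, test_label))  # evaluate the reloaded model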

6. Stochastic gradient descent (SGD)

See the official scikit-learn SGD documentation for details.

1. Default parameters, max_iter=10000

from sklearn.linear_model import SGDClassifier

def SGD(X_train, y_train, X_test, y_test):
    from sklearn.linear_model import SGDClassifier
    import joblib
    sgd = SGDClassifier(max_iter=10000)
    sgd.fit(X_train, y_train)  # train the model
    joblib.dump(sgd,'sgd_model.m')
    print('Accuracy of sgd on training set:{:.2f}'.format(sgd.score(X_train, y_train)))  # training-set accuracy
    print('Accuracy of sgd on test set:{:.2f}'.format(sgd.score(X_test, y_test)))  # test-set accuracy
    predict = sgd.predict(X_test)  # predict labels
    return predict  # return the predicted labels

Result:

Accuracy of sgd on training set:0.74
Accuracy of sgd on test set:0.78

2. Early stopping (validation_fraction=0.1), with scaled data

def SGD(X_train, y_train, X_test, y_test):
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler
    import joblib
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)  # apply the same transformation to the test data
    sgd = SGDClassifier(early_stopping=True, max_iter=10000)        # validation_fraction=0.1
    sgd.fit(X_train, y_train)  # train the model
    joblib.dump(sgd,'sgd_model_2.m')        # early stopping, scaled data
    print('Accuracy of sgd on training set:{:.2f}'.format(sgd.score(X_train, y_train)))  # training-set accuracy
    print('Accuracy of sgd on test set:{:.2f}'.format(sgd.score(X_test, y_test)))  # test-set accuracy
    predict = sgd.predict(X_test)  # predict labels
    return predict  # return the predicted labels

Result:

Accuracy of sgd on training set:0.70
Accuracy of sgd on test set:0.66

3. Early stopping (validation_fraction=0.2)

Accuracy of sgd on training set:0.70
Accuracy of sgd on test set:0.75

4. Early stopping (validation_fraction=0.1), loss='modified_huber'

Accuracy of sgd on training set:0.68
Accuracy of sgd on test set:0.74

5. Early stopping (validation_fraction=0.1), loss='log'

Accuracy of sgd on training set:0.71
Accuracy of sgd on test set:0.78

'sgd_model_5.m' is used here; note that predict_proba is only available for probabilistic losses such as 'log' or 'modified_huber':

  • Discarding samples with confidence below 0.7 gives an accuracy of 0.8445497630331753
  • Discarding samples with confidence below 0.75 gives an accuracy of 0.8605553287055941
  • Discarding samples with confidence below 0.8 gives an accuracy of 0.8764931259860266

The evaluation code is as follows:

import re

path = '...your path/chinese-review-datasets/Chinese review datasets/Vec_test.txt'
file = open(path, 'r', encoding='utf-8')
test_data = []
test_label = []

for line in file.readlines():
    if line[-4:-1] == 'POS':
        test_label.append(1)
    elif line[-4:-1] == 'NEG':
        test_label.append(-1)
    elif line[-4:-1] == 'ORM':
        continue
    test_data.append(eval(re.sub(r'[ ]+', ', ', (line[:-5]).replace('[ ', '[').replace(' ]', ']'))))
file.close()
print(len(test_data) == len(test_label))
print('总测试句向量数据:%d' % len(test_data))


def cal_accuracy(predict, testing_labels):  # compute accuracy from predicted and true labels
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0  # count of correctly classified samples
    for i in range(0, len(predict)):  # for each test sample
        if testing_labels[i] == predict[i]:
            correct_classification += 1  # increment if classified correctly
    # print("The accuracy rate is:" + str(correct_classification / testing_data_num))       # optionally print the accuracy
    return correct_classification / len(predict)  # return the accuracy


def Show(test_data, predict, testing_labels):  # accuracy after discarding low-confidence predictions
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0  # count of correctly classified samples
    uncertain_classification = 0  # count of predictions below the confidence threshold
    proba = model.predict_proba(test_data)
    for i in range(0, len(predict)):  # for each test sample
        if proba[i][0] < 0.8 and proba[i][1] < 0.8:
            # print('置信度低于0.8:%d' % i)
            uncertain_classification += 1
            continue
        if testing_labels[i] == predict[i]:
            correct_classification += 1  # increment if classified correctly
        else:
            print('分类错误:%d' % i)
    print(uncertain_classification)
    return correct_classification/(len(predict)-uncertain_classification)


import joblib

model = joblib.load('sgd_model_5.m')

predict = model.predict(test_data)
print(cal_accuracy(predict, test_label))
# print(model.score(test_data, test_label))
# print(model.predict_proba(test_data))
print(Show(test_data, predict, test_label))

Summary

  1. Nine sentiment datasets were processed. Due to memory limits, the movie review dataset dmsc_v2, the restaurant review data yf_dianping, and the product review data yf_amazon are set aside for now; the other six datasets (the NLPCC 2014 evaluation test data and answers, the hotel review data ChnSentiCorp_htl_all, the food-delivery reviews waimai_10k, the online shopping reviews online_shopping_10_cats, and the Sina Weibo datasets weibo_senti_100k and simplifyweibo_4_moods) are used for the subsequent steps.
  2. Sentence vectors were generated by summing and averaging word vectors, and SVM (support vector machine) and SGD (stochastic gradient descent) classifiers were tested on the sentiment dataset from Learning multi-grained aspect target sequence for Chinese sentiment analysis, reaching accuracies of 0.79 and 0.78 respectively.

Future work:

  1. Collect more speech recognition and text recognition outputs, and apply the above work in practice
  2. Accuracy of the sentence-vector representation:
    • Remove the stop-word list and observe the results
    • Try doc2vec
    • Later:
      • Weight the word vectors with tf-idf to represent sentence vectors
      • Use a neural network to represent sentence vectors
  3. Accuracy on the sentiment datasets:
    • Process other datasets, for example the NLPCC 2012 evaluation data, which has published evaluation results for comparison
    • Try other training methods
    • Later: train with more sentiment datasets
  4. Visualization of results (human-computer interaction interface design)
