The 2nd Conference on Natural Language Processing and Chinese Computing (NLP&CC 2013): 10,000 Weibo posts. This data overlaps with the 2014 release, so the data from the 2014 conference is used instead.
NLPCC 2014 Evaluation Tasks Test Data: 14,000 Weibo posts, 45,421 sentences (website).
A Weibo corpus annotated with 7 emotions: like, disgust, happiness, sadness, anger, surprise, fear.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

def emotion_convert(string):
    # map the 7 NLPCC emotion labels onto binary POS/NEG labels
    dictionary = {
        'like': 'POS',
        'disgust': 'NEG',
        'happiness': 'POS',
        'sadness': 'NEG',
        'anger': 'NEG',
        'surprise': 'POS',
        'fear': 'NEG'
    }
    return dictionary.get(string, None)

NLPCC_2014_path = '...your path/NLPCC/evtestdata1/Training data for Emotion Classification.xml'
out_path = '...your path/test/NLPCC_2014.txt'

file = open(NLPCC_2014_path, 'r', encoding='utf-8')
txt = file.read()
file.close()

file = open(out_path, 'a', encoding='utf-8')
soup = BeautifulSoup(txt, 'html.parser')
for tag in soup.find_all('sentence'):
    file.write(tag.string + ' ')
    if tag.attrs['opinionated'] == 'N':        # non-opinionated sentences are labeled NORM
        file.write('NORM\n')
    elif tag.attrs['opinionated'] == 'Y':      # opinionated sentences get the converted POS/NEG label
        file.write(emotion_convert(tag.attrs['emotion-1-type']) + '\n')
file.close()
Reference (see the linked dataset): ChnSentiCorp_htl_all, over 7,000 hotel reviews, with more than 5,000 positive and more than 2,000 negative.
For how to iterate over a dataset with the pandas library, see the linked reference.
import pandas as pd

def emotion_convert(string):
    # map the numeric label (1 = positive, 0 = negative) to POS/NEG
    dictionary = {
        1: 'POS',
        0: 'NEG'
    }
    return dictionary.get(string, None)

path = '...your path/情感观点评论 倾向性分析/ChnSentiCorp_htl_all/'
pd_all = pd.read_csv(path + 'ChnSentiCorp_htl_all.csv')

print('评论数目(总体):%d' % pd_all.shape[0])
print('评论数目(正向):%d' % pd_all[pd_all.label==1].shape[0])
print('评论数目(负向):%d' % pd_all[pd_all.label==0].shape[0])
# print(pd_all.sample(2))

# build a balanced corpus
out_path = '...your path/test/ChnSentiCorp_htl_all.txt'
file = open(out_path, 'a', encoding='utf-8')
for row in pd_all.itertuples():
    # print(emotion_convert(getattr(row, 'label')), getattr(row, 'review'))
    try:
        file.write(getattr(row, 'review') + ' ' + emotion_convert(getattr(row, 'label')) + '\n')
    except:
        # rows with a missing (NaN) review or an unmapped label end up here
        print('Error!')
file.close()
Result:
评论数目(总体):7766
评论数目(正向):5322
评论数目(负向):2444
Error!
Reference: waimai_10k, user reviews collected from a food-delivery platform, with 4,000 positive and about 8,000 negative.
The code is the same as above; only the following lines need to change:
path = '...your path/情感观点评论 倾向性分析/waimai_10k/'
pd_all = pd.read_csv(path + 'waimai_10k.csv')
out_path = '...your path/test/waimai_10k.txt'
Result:
评论数目(总体):11987
评论数目(正向):4000
评论数目(负向):7987
Reference: online_shopping_10_cats, 10 categories and over 60,000 reviews in total, with roughly 30,000 positive and 30,000 negative, covering books, tablets, mobile phones, fruit, shampoo, water heaters, Mengniu dairy, clothes, computers, and hotels.
The code is the same as above. Result:
评论数目(总体):62774
评论数目(正向):31728
评论数目(负向):31046
Error!
Reference: weibo_senti_100k, over 100,000 sentiment-labeled Sina Weibo posts, with about 50,000 positive and 50,000 negative.
The code is the same as above. Result:
评论数目(总体):119988
评论数目(正向):59993
评论数目(负向):59995
Reference: simplifyweibo_4_moods, over 360,000 sentiment-labeled Sina Weibo posts covering 4 emotions: about 200,000 joyful posts and about 50,000 each for anger, disgust, and depression.
The modified parts of the code:
def emotion_convert(string):
    # 0 = joy -> POS; 1/2/3 = anger/disgust/depression -> NEG
    dictionary = {
        0: 'POS',
        1: 'NEG',
        2: 'NEG',
        3: 'NEG'
    }
    return dictionary.get(string, None)

print('评论数目(正向):%d' % pd_all[pd_all.label==0].shape[0])
print('评论数目(负向):%d' % pd_all[pd_all.label!=0].shape[0])
Result:
评论数目(总体):361744
评论数目(正向):199496
评论数目(负向):162248
Reference: dmsc_v2, 28 movies, over 700,000 users, and over 2 million ratings/reviews.
The modified parts of the code are as follows:
def emotion_convert(string):
    # only 5-star and 1-star ratings are used, as clear positive/negative examples
    dictionary = {
        5: 'POS',
        1: 'NEG'
    }
    return dictionary.get(string, None)

print('评论数目(正向):%d' % pd_all[pd_all.rating==5].shape[0])
print('评论数目(负向):%d' % pd_all[pd_all.rating==1].shape[0])

for row in pd_all.itertuples():
    try:
        if getattr(row, 'rating') == 1 or getattr(row, 'rating') == 5:
            file.write(getattr(row, 'comment') + ' ' + emotion_convert(getattr(row, 'rating')) + '\n')
    except:
        print('Error!')
Result:
评论数目(总体):2125056
评论数目(正向):638106
评论数目(负向):190927
Reference: yf_dianping, 240,000 restaurants, 540,000 users, and 4.4 million reviews/ratings.
import pandas as pd

def emotion_convert(string):
    # 5-star ratings -> POS; 0- and 1-star ratings -> NEG
    dictionary = {
        5: 'POS',
        1: 'NEG',
        0: 'NEG'
    }
    return dictionary.get(string, None)

path = '...your path/情感观点评论 倾向性分析/yf_dianping/ratings/'
pd_all = pd.read_csv(path + 'ratings.csv')

print('评论数目(总体):%d' % pd_all.shape[0])
print('评论数目(正向):%d' % pd_all[pd_all.rating==5].shape[0])
print('评论数目(负向):%d' % pd_all[(pd_all.rating==0) | (pd_all.rating==1)].shape[0])  # combine the 0- and 1-star rows with a boolean OR

out_path = '...your path/test/yf_dianping.txt'
file = open(out_path, 'a', encoding='utf-8')
for row in pd_all.itertuples():
    try:
        if getattr(row, 'rating') == 0 or getattr(row, 'rating') == 5 or getattr(row, 'rating') == 1:
            file.write((getattr(row, 'comment')).replace('\n', ' ') + ' ' + emotion_convert(getattr(row, 'rating')) + '\n')
    except:
        print('Error!')
Reference: yf_amazon, 520,000 products, over 1,100 categories, 1.42 million users, and 7.2 million reviews/ratings.
Note: if a line ends with a comma, the statement inside the try block fails and the except branch is triggered, printing 'Error!'.
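As a sketch of a more explicit alternative to the bare except (reusing pd_all, file, and emotion_convert as defined above; the comment/rating column names are assumed to match the dataset), malformed rows can be skipped up front. The counts below were still produced with the original bare-except version.

for row in pd_all.itertuples():
    comment = getattr(row, 'comment')
    rating = getattr(row, 'rating')
    # skip rows whose comment is missing (NaN) or whose rating has no POS/NEG mapping
    if not isinstance(comment, str) or emotion_convert(rating) is None:
        continue
    file.write(comment.replace('\n', ' ') + ' ' + emotion_convert(rating) + '\n')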
Result:
评论数目(总体):7202921
评论数目(正向):4184629
评论数目(负向):293751
The code for merging the files is as follows:
Catalog = ['ChnSentiCorp_htl_all', 'dmsc_v2', 'NLPCC_2014', 'online_shopping_10_cats', 'simplifyweibo_4_moods', 'waimai_10k', 'weibo_senti_100k', 'yf_amazon', 'yf_dianping']
path = '...your path/test/'
Data = open(path + 'Data.txt', 'a', encoding='utf-8')
for item in Catalog:
    file = open(path + '{}.txt'.format(item), 'r', encoding='utf-8')
    txt = file.read().strip('\n').strip(' ')
    Data.write(txt + '\n')
    file.close()
    print("{}文件合并完毕".format(item))
Data.close()
After merging, the combined file is 741 MB in total.
Running the following code:
from gensim.models import KeyedVectors
# import jieba

word_vectors = KeyedVectors.load('vectors.kv')
str1 = '如何更换花呗绑定银行卡'
str2 = '花呗更改绑定银行卡'
# str1list = ' '.join(jieba.cut(str1)).split(' ')
# str2list = ' '.join(jieba.cut(str2)).split(' ')
#
# print(str1list)
# print(str2list)

def get_sentence_vec(list1, list2):
    from numpy import array, dot, sum
    from gensim import matutils
    tmp = []
    for item in list1:                        # iterating a raw string yields one character at a time
        tmp.append(word_vectors[item])
    tmp = array(tmp).mean(axis=0)
    print(tmp)
    print(sum(tmp))
    print(matutils.unitvec(tmp))
    print(sum(matutils.unitvec(tmp)))
    print(sum((matutils.unitvec(tmp))**2))    # sum of squares of the normalized vector (should be 1)
    tmp2 = []
    for item in list2:
        tmp2.append(word_vectors[item])
    tmp2 = array(tmp2).mean(axis=0)
    return dot(matutils.unitvec(tmp), matutils.unitvec(tmp2))

list_ = get_sentence_vec(str1, str2)          # note: the raw strings are passed here, not the jieba token lists
print(list_)

str1sum = [0] * word_vectors.vector_size
cnt1 = 0
for word in str1:
    # print(word)
    cnt1 += 1
    str1sum = str1sum + word_vectors[word]

cnt2 = 0
str2sum = [0] * word_vectors.vector_size
for word in str2:
    cnt2 += 1
    str2sum = str2sum + word_vectors[word]

print('求和', str1sum)
print(sum(str1sum))
print('求平均', str1sum/cnt1)
print(sum(str1sum/cnt1))
Result:
[-2.19413742e-01...300-dim vector] # tmp
3.1234498
[-2.19141953e-02...300-dim vector] # after L2 normalization
0.3119583
1.0
0.93513715 # summing, averaging, and L2-normalizing all give the same similarity
求和 [-2.41355096e+00...300-dim vector]
34.3579681138508
求平均 [-2.19413723e-01...300-dim vector] # same as tmp
3.1234516467137103
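The similarity comes out the same whether the word vectors are summed, averaged, or L2-normalized first, because cosine similarity is invariant to positive rescaling of either vector (the mean is just the sum divided by the word count, and matutils.unitvec only divides by the L2 norm). A minimal sketch with hypothetical random vectors:

import numpy as np
from gensim import matutils

v1, v2 = np.random.rand(300), np.random.rand(300)
cos = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# rescaling either vector (e.g. sum -> mean) leaves the cosine similarity unchanged
print(np.isclose(cos(v1, v2), cos(v1 / 7, v2 * 3)))                  # True
# unitvec is simply division by the L2 norm
print(np.allclose(matutils.unitvec(v1), v1 / np.linalg.norm(v1)))    # True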
Two problems in the earlier post on generating sentence vectors are worth noting:
for word in str1  # should be for word in str1list
and matutils.unitvec(), which is essentially an L2 normalization: it rescales the vector so that the sum of the squares of its elements is 1. In addition, when given a raw string, n_similarity reads it in one character at a time, which is clearly not equivalent to using the jieba-segmented tokens:
def n_similarity(self, ws1, ws2):
    """Compute cosine similarity between two sets of words.

    Parameters
    ----------
    ws1 : list of str
        Sequence of words.
    ws2: list of str
        Sequence of words.

    Returns
    -------
    numpy.ndarray
        Similarities between `ws1` and `ws2`.

    """
    if not(len(ws1) and len(ws2)):
        raise ZeroDivisionError('At least one of the passed list is empty.')
    v1 = [self[word] for word in ws1]
    v2 = [self[word] for word in ws2]
    return dot(matutils.unitvec(array(v1).mean(axis=0)), matutils.unitvec(array(v2).mean(axis=0)))
The code is as follows:
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load('vectors.kv')
str1 = '如何更换花呗绑定银行卡'
str2 = '花呗更改绑定银行卡'

def get_sentence_vec(sentence):
    import jieba
    sentence_list = ' '.join(jieba.cut(sentence)).split(' ')
    vecsum = [0] * word_vectors.vector_size
    cnt = 0
    for word in sentence_list:
        vecsum = vecsum + word_vectors[word]
        cnt += 1
    return vecsum/cnt

vec1 = get_sentence_vec(str1)
vec2 = get_sentence_vec(str2)

from scipy.spatial.distance import cosine
print(cosine(vec1, vec2), 1 - cosine(vec1, vec2))   # cosine() returns the distance; 1 - distance is the similarity
Result:
0.06521224543237958 0.9347877545676204
This is close to the result from n_similarity, which sums and averages over the individual characters.
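For comparison, calling n_similarity directly reproduces both numbers; a small sketch (assuming every character and every jieba token of the two sentences is present in vectors.kv):

import jieba
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load('vectors.kv')
str1 = '如何更换花呗绑定银行卡'
str2 = '花呗更改绑定银行卡'
# passing raw strings makes n_similarity look up one character at a time (~0.9351, as above)
print(word_vectors.n_similarity(str1, str2))
# passing the jieba token lists gives the word-level similarity (~0.9348, as above)
print(word_vectors.n_similarity(list(jieba.cut(str1)), list(jieba.cut(str2))))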
For how to pick out fixed elements from a string, see the linked reference.
Error reported:
TypeError: unsupported operand type(s) for /: 'list' and 'int'
Fix: use numpy. Afterwards, a warning is reported:
RuntimeWarning: invalid value encountered in true_divide return vecsum/cnt
See the linked reference; the cause is data like:
Couldn’t be better. POS
For such lines none of the tokens are found in the (Chinese) vocabulary, so cnt stays 0 and the array ends up with a 0/0 division.
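A minimal sketch of both failure modes (the list-vs-numpy TypeError above and the 0/0 warning), with hypothetical 3-dimensional vectors:

import numpy as np

vecsum = [0] * 3
# vecsum / 2        # TypeError: a plain Python list cannot be divided by an int
vecsum = np.zeros(3)
print(vecsum / 2)    # fine with numpy: [0. 0. 0.]
cnt = 0
print(vecsum / cnt)  # 0/0 -> nan, with the "invalid value encountered" RuntimeWarning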
The final 'Vec.txt' file ends up being 17 GB, which is obviously too large, so these three datasets are not used for now:
Add handling for the case where cnt is zero; everything else is the same as the steps above.
The code is as follows:
from gensim.models import KeyedVectors

def get_sentence_vec(sentence):
    import jieba
    import numpy as np
    sentence_list = ' '.join(jieba.cut(sentence)).split(' ')
    # vecsum = [0] * word_vectors.vector_size
    vecsum = np.zeros(word_vectors.vector_size)
    cnt = 0
    for word in sentence_list:
        try:
            vecsum = vecsum + word_vectors[word]
            cnt += 1
        except:
            # skip words that are not in the vocabulary
            continue
    if cnt == 0:                # nothing was found: return the zero vector instead of dividing by zero
        return vecsum
    return vecsum/cnt

word_vectors = KeyedVectors.load('vectors.kv')
path = '...your path/Code/test/'
file = open(path + 'Data_Small.txt', 'r', encoding='utf-8')
output = open(path + 'Vec_Small.txt', 'a', encoding='utf-8')
for line in file.readlines():
    vec = get_sentence_vec(line[:-4])    # everything except the trailing label
    emotion = line[-4:-1]                # the 3 characters before the newline (NORM loses its leading N)
    if vec.any() != 0:
        output.write(str(vec).replace('\n', '') + ' ' + emotion + '\n')
file.close()
output.close()
The generated 'Data_Small.txt' is 106 MB, the generated 'Vec_Small.txt' is 2.38 GB, and the program took 29 minutes to run.
The code for converting the test dataset to vectors is the same as above; the differences are:
word_vectors = KeyedVectors.load('vectors.kv')
path = '...your path/chinese-review-datasets/Chinese review datasets/'
file1 = open(path + 'phone_sentence.txt', 'r', encoding='utf-8')
file2 = open(path + 'phone_label.txt', 'r', encoding='utf-8')
labels = file2.read().split('\n')        # one 0/1 label per sentence
file2.close()
output = open(path + 'Vec_test.txt', 'a', encoding='utf-8')
i = 0
for line in file1.readlines():
    vec = get_sentence_vec(line.strip('\n'))
    if labels[i] == '1':
        emotion = 'POS'
    elif labels[i] == '0':
        emotion = 'NEG'
    if vec.any() != 0:
        output.write(str(vec).replace('\n', '') + ' ' + emotion + '\n')
    else:
        # sentences whose vector is all zeros (no word found) are skipped and their index printed
        print(i)
        # 1359 屏不比屏差(2231)
        # 20 声噪大(1172)
        #
        # 961 字太大 970 字太大
    i += 1
There is a problem here: the label at the end of each line is read as exactly three characters, which truncates NORM to 'ORM'...
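A more robust alternative (a sketch; the 'sentence LABEL' line format is assumed) would be to split on the last space instead of hard-coding three characters:

line = '今天天气不错 NORM\n'   # hypothetical example line
sentence, emotion = line.rstrip('\n').rsplit(' ', 1)
print(sentence, emotion)        # the full 'NORM' label is preserved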
First, check how many sentence vectors there are; the code is as follows:
path = '...your path/test/Vec_Small.txt'
file = open(path, 'r', encoding='utf-8')
cnt_POS = 0
cnt_NEG = 0
cnt_NORM = 0
for line in file.readlines():
    if line[-4:-1] == 'POS':
        cnt_POS += 1
    elif line[-4:-1] == 'NEG':
        cnt_NEG += 1
    elif line[-4:-1] == 'ORM':      # NORM was truncated to 'ORM' when the vectors were written
        cnt_NORM += 1
print('句向量长度:{}'.format(cnt_POS + cnt_NEG + cnt_NORM))
print('积极句向量个数:%s' % cnt_POS)
print('消极句向量个数:%s' % cnt_NEG)
print('正常句向量个数:%s' % cnt_NORM)
Result:
句向量长度:608133
积极句向量个数:307547
消极句向量个数:270998
正常句向量个数:29588
There may be a class-imbalance problem here; two possible measures exist:
The second one is chosen here.
When converting the strings back to lists, the following appeared:
SyntaxError: invalid syntax
Cause:
list = []
str = '[1 2 3]'
list.append(eval(str))   # eval fails here: '[1 2 3]' has no commas, so it is not valid Python syntax
print(list)
Fix: make the elements comma-separated, e.g.:
str = '[1, 2, 3]'
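Applied to the space-separated text that str() of a numpy vector produces, the same fix can be done programmatically; this is the conversion used in the SVM code below (a sketch with a hypothetical 3-dimensional vector):

import re
import numpy as np

s = str(np.array([0.1, -0.2, 0.3]))            # '[ 0.1 -0.2  0.3]'
s = s.replace('[ ', '[').replace(' ]', ']')    # drop the padding next to the brackets
s = re.sub(r'[ ]+', ', ', s)                   # '[0.1, -0.2, 0.3]' -- now valid Python syntax
print(eval(s))                                 # [0.1, -0.2, 0.3]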
The SVM code is as follows:
import re

path = '...your path/test/Vec_Small.txt'
file = open(path, 'r', encoding='utf-8')
train_data = []
train_label = []
i = 0
for line in file.readlines():
    if line[-4:-1] == 'POS':
        train_label.append(1)
    elif line[-4:-1] == 'NEG':
        train_label.append(-1)
    elif line[-4:-1] == 'ORM':
        continue                  # neutral (NORM) vectors are skipped
    train_data.append(eval(re.sub(r'[ ]+', ', ', (line[:-5]).replace('[ ', '[').replace(' ]', ']'))))
    i += 1
    if i % 10000 == 0: print(i)
file.close()
print(len(train_data) == len(train_label))
print('总训练句向量数据:%d' % len(train_data))

path = '...your path/chinese-review-datasets/Chinese review datasets/Vec_test.txt'
file = open(path, 'r', encoding='utf-8')
test_data = []
test_label = []
for line in file.readlines():
    if line[-4:-1] == 'POS':
        test_label.append(1)
    elif line[-4:-1] == 'NEG':
        test_label.append(-1)
    elif line[-4:-1] == 'ORM':
        continue
    test_data.append(eval(re.sub(r'[ ]+', ', ', (line[:-5]).replace('[ ', '[').replace(' ]', ']'))))
file.close()
print(len(test_data) == len(test_label))
print('总测试句向量数据:%d' % len(test_data))

def svm(X_train, y_train, X_test, y_test):   # support vector machine
    from sklearn.svm import SVC              # import the SVC classifier
    svm = SVC()                              # *, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, tol=0.001, cache_size=200
    svm.fit(X_train, y_train)                # train the model
    print('Accuracy of svm on training set:{:.2f}'.format(svm.score(X_train, y_train)))   # training-set accuracy
    print('Accuracy of svm on test set:{:.2f}'.format(svm.score(X_test, y_test)))         # test-set accuracy
    predict = svm.predict(X_test)            # predict labels
    return predict                           # return the predicted labels

def cal_accuracy(predict, testing_labels):   # compute accuracy from predicted and true labels
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0               # count of correctly classified samples
    for i in range(0, len(predict)):         # for every test sample
        if testing_labels[i] == predict[i]:
            correct_classification += 1      # increment when the prediction is correct
    # print("The accuracy rate is:" + str(correct_classification / testing_data_num))
    return correct_classification / len(predict)   # return the accuracy

predict = svm(train_data, train_label, test_data, test_label)
print(cal_accuracy(predict, test_label))
Result:
……
560000
570000
True
总训练句向量数据:578545
True
总测试句向量数据:6578
Process finished with exit code -1
Problem: the job ran all night without producing a result. A kernel SVC scales roughly quadratically to cubically with the number of training samples, so fitting it on ~580,000 sentence vectors is impractical.
There are the following possible fixes:
Using the second one, the code is modified to:
from sklearn.svm import LinearSVC   # import the linear support vector classifier LinearSVC
svm = LinearSVC()                   # max_iter = 1000 by default
Result:
E:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\_base.py:976: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
warnings.warn("Liblinear failed to converge, increase "
Accuracy of svm on training set:0.74
Accuracy of svm on test set:0.79
0.7902097902097902
To get rid of the warning, set max_iter=10000.
For saving and reloading the model see the linked article; note that that article has an error, just use:
import joblib
See the official scikit-learn SGDClassifier documentation for details.
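A minimal save/reload sketch with joblib (hypothetical file name):

import joblib
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(max_iter=10000)
# ... fit the model here ...
joblib.dump(sgd, 'sgd_model.m')     # write the (trained) model to disk
sgd = joblib.load('sgd_model.m')    # load it back later for prediction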
1. Default parameters, max_iter=10000
from sklearn.linear_model import SGDClassifier

def SGD(X_train, y_train, X_test, y_test):
    from sklearn.linear_model import SGDClassifier
    import joblib
    sgd = SGDClassifier(max_iter=10000)
    sgd.fit(X_train, y_train)                     # train the model
    joblib.dump(sgd, 'sgd_model.m')               # save the trained model
    print('Accuracy of sgd on training set:{:.2f}'.format(sgd.score(X_train, y_train)))   # training-set accuracy
    print('Accuracy of sgd on test set:{:.2f}'.format(sgd.score(X_test, y_test)))         # test-set accuracy
    predict = sgd.predict(X_test)                 # predict labels
    return predict                                # return the predicted labels
Accuracy of sgd on training set:0.74
Accuracy of sgd on test set:0.78
2. Early stopping (validation_fraction=0.1), with scaled data
def SGD(X_train, y_train, X_test, y_test):
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler
    import joblib
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)             # apply the same transformation to the test vectors
    sgd = SGDClassifier(early_stopping=True, max_iter=10000)   # validation_fraction=0.1 by default
    sgd.fit(X_train, y_train)                     # train the model
    joblib.dump(sgd, 'sgd_model_2.m')             # early stopping + scaled data
    print('Accuracy of sgd on training set:{:.2f}'.format(sgd.score(X_train, y_train)))   # training-set accuracy
    print('Accuracy of sgd on test set:{:.2f}'.format(sgd.score(X_test, y_test)))         # test-set accuracy
    predict = sgd.predict(X_test)                 # predict labels
    return predict                                # return the predicted labels
Accuracy of sgd on training set:0.70
Accuracy of sgd on test set:0.66
3. Early stopping (validation_fraction=0.2)
Accuracy of sgd on training set:0.70
Accuracy of sgd on test set:0.75
4. Early stopping (validation_fraction=0.1), loss='modified_huber'
Accuracy of sgd on training set:0.68
Accuracy of sgd on test set:0.74
5. Early stopping (validation_fraction=0.1), loss='log'
Accuracy of sgd on training set:0.71
Accuracy of sgd on test set:0.78
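Variants 3-5 only change the SGDClassifier constructor relative to variant 2 (whether the StandardScaler step was kept is not stated); a sketch of the presumed constructor calls:

from sklearn.linear_model import SGDClassifier

# 3. early stopping with a larger validation split
sgd = SGDClassifier(early_stopping=True, validation_fraction=0.2, max_iter=10000)
# 4. early stopping with the modified_huber loss
sgd = SGDClassifier(early_stopping=True, validation_fraction=0.1, loss='modified_huber', max_iter=10000)
# 5. early stopping with logistic loss -- this is what enables predict_proba in the evaluation code below
sgd = SGDClassifier(early_stopping=True, validation_fraction=0.1, loss='log', max_iter=10000)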
Here 'sgd_model_5.m' is used:
import re

path = '...your path/chinese-review-datasets/Chinese review datasets/Vec_test.txt'
file = open(path, 'r', encoding='utf-8')
test_data = []
test_label = []
for line in file.readlines():
    if line[-4:-1] == 'POS':
        test_label.append(1)
    elif line[-4:-1] == 'NEG':
        test_label.append(-1)
    elif line[-4:-1] == 'ORM':
        continue
    test_data.append(eval(re.sub(r'[ ]+', ', ', (line[:-5]).replace('[ ', '[').replace(' ]', ']'))))
file.close()
print(len(test_data) == len(test_label))
print('总测试句向量数据:%d' % len(test_data))

def cal_accuracy(predict, testing_labels):        # compute accuracy from predicted and true labels
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0                    # count of correctly classified samples
    for i in range(0, len(predict)):              # for every test sample
        if testing_labels[i] == predict[i]:
            correct_classification += 1           # increment when the prediction is correct
    # print("The accuracy rate is:" + str(correct_classification / testing_data_num))
    return correct_classification / len(predict)  # return the accuracy

def Show(test_data, predict, testing_labels):     # accuracy on the confidently classified samples only
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0                    # count of correctly classified samples
    uncertain_classification = 0                  # count of low-confidence predictions
    proba = model.predict_proba(test_data)        # class probabilities (requires loss='log')
    for i in range(0, len(predict)):              # for every test sample
        if proba[i][0] < 0.8 and proba[i][1] < 0.8:
            # print('置信度低于0.8:%d' % i)
            uncertain_classification += 1         # neither class reaches 0.8 confidence: treat as uncertain
            continue
        if testing_labels[i] == predict[i]:
            correct_classification += 1
        else:
            print('分类错误:%d' % i)               # index of a confidently misclassified sample
    print(uncertain_classification)
    return correct_classification/(len(predict)-uncertain_classification)

import joblib
model = joblib.load('sgd_model_5.m')
predict = model.predict(test_data)
print(cal_accuracy(predict, test_label))
# print(model.score(test_data, test_label))
# print(model.predict_proba(test_data))
print(Show(test_data, predict, test_label))
Follow-up work: