sentence,label
游戏太坑,暴率太低,太克金,平民不能玩,negative
让人失望,negative
能解决一下服务器问题?网络正常老掉线,换手机也一样。。。,negative
期待,positive
一星也不想给,这特么简直龟速,炫舞老年版?,negative
衣服不好看游戏内容无特色,界面乱糟糟的,negative
喜欢喜欢,positive
从有了这个手游就一直玩,很喜欢呀,希望更多漂漂衣服,positive
因违反评价条例规定被折叠,negative
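The sample rows above contain commas inside the sentence itself, while the label is always the last field. A minimal sketch, assuming the file really stores those commas unquoted (if the real data uses full-width commas or quoted fields, a plain `pd.read_csv` is enough), is to split each line at its last comma. The helper name `split_sentence_label` is hypothetical and not part of the original code.

```python
def split_sentence_label(line):
    """Split one raw line into (sentence, label) at the LAST comma."""
    sentence, label = line.rstrip("\n").rsplit(",", 1)
    return sentence, label


# Two rows taken from the sample data above.
rows = [
    "期待,positive",
    "游戏太坑,暴率太低,太克金,平民不能玩,negative",
]

parsed = [split_sentence_label(r) for r in rows]
for sentence, label in parsed:
    print(label, "<-", sentence)
```

The resulting pairs could then be passed to `pd.DataFrame(parsed, columns=['sentence', 'label'])` to get the same two columns the training code expects.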
import jieba
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
def get_stop_words():
    # Load the stop-word list: one word per line, UTF-8 encoded.
    filename = "your stop words file path"
    stop_word_list = []
    with open(filename, encoding='utf-8') as f:
        for line in f.readlines():
            stop_word_list.append(line.strip())
    return stop_word_list
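def processing_sentence(x, stop_words):
    # Tokenize with jieba, drop stop words and bare spaces,
    # then join the tokens with spaces so TfidfVectorizer can split them.
    cut_word = jieba.cut(str(x).strip())
    words = [word for word in cut_word if word not in stop_words and word != ' ']
    return ' '.join(words)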
def data_processing():
    train_file = "your train file path"
    df = pd.read_csv(train_file)
    # Hold out 10% of the rows as the test set.
    x_train, x_test, y_train, y_test = train_test_split(df['sentence'], df['label'], test_size=0.1)
    stop_words = get_stop_words()
    x_train = x_train.apply(lambda x: processing_sentence(x, stop_words))
    x_test = x_test.apply(lambda x: processing_sentence(x, stop_words))
    # Fit TF-IDF on the training text only, then reuse it for the test text.
    tf = TfidfVectorizer()
    x_train = tf.fit_transform(x_train)
    x_test = tf.transform(x_test)
    # Convert the sparse matrices to dense arrays.
    x_train_weight = x_train.toarray()
    x_test_weight = x_test.toarray()
    return x_train_weight, x_test_weight, y_train, y_test
As before, the text is first tokenized (here with jieba, dropping stop words) and then converted into TF-IDF features.
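To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting applied to already-tokenized, space-joined text. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is what `TfidfVectorizer` does with its default settings. The function name `tfidf` and the two toy documents are illustrative and not part of the original code.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocab, rows): one L2-normalised TF-IDF row per document.

    Assumes smoothed idf: ln((1 + n) / (1 + df)) + 1.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Toy documents in the same space-joined form processing_sentence() returns.
docs = ["游戏 失望", "游戏 喜欢 喜欢"]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

A token that occurs in every document ("游戏" here) gets the lowest idf, so it contributes less than a rarer token in the same sentence. One thing to keep in mind with the real `TfidfVectorizer`: its default `token_pattern` only keeps tokens of two or more word characters, so single-character words produced by jieba are dropped unless that parameter is changed.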
def model_train():
    x_train_weight, x_test_weight, y_train, y_test = data_processing()
    # L2-regularized logistic regression.
    lr = LogisticRegression(C=1.0, penalty='l2', tol=0.01)
    lr.fit(x_train_weight, y_train)
    train_score = lr.score(x_train_weight, y_train)
    print("训练集准确率: ", train_score)  # training-set accuracy
    y_predict = lr.predict(x_test_weight)
    confusion_mat = metrics.confusion_matrix(y_test, y_predict)
    print('测试集准确率:', metrics.accuracy_score(y_test, y_predict))  # test-set accuracy
    print("confusion_matrix is: ", confusion_mat)
    print('分类报告:', metrics.classification_report(y_test, y_predict))  # classification report
if __name__ == '__main__':
    model_train()
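Running the code prints the following results (训练集准确率 = training-set accuracy, 测试集准确率 = test-set accuracy, 分类报告 = classification report):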
训练集准确率: 0.8926945588554086
测试集准确率: 0.746588693957115
confusion_matrix is:  [[177  64]
 [ 66 206]]
分类报告:               precision    recall  f1-score   support

    negative       0.73      0.73      0.73       241
    positive       0.76      0.76      0.76       272

    accuracy                           0.75       513
   macro avg       0.75      0.75      0.75       513
weighted avg       0.75      0.75      0.75       513
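The figures in the classification report can be recomputed by hand from the confusion matrix, which is a handy sanity check. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, with labels in sorted order (negative, positive). The variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows are true classes, columns are predicted classes.
confusion = [[177, 64],
             [66, 206]]
labels = ["negative", "positive"]

total = sum(sum(row) for row in confusion)
# Accuracy is the share of samples on the diagonal.
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total

report = {}
for i, label in enumerate(labels):
    tp = confusion[i][i]                      # correctly predicted as this class
    predicted = sum(row[i] for row in confusion)  # column sum: all predicted as this class
    actual = sum(confusion[i])                # row sum: the support of this class
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1, actual)
    print(f"{label:>8}  {precision:.2f}  {recall:.2f}  {f1:.2f}  {actual}")

print(f"accuracy  {accuracy:.4f}")
```

The row sums give the support column (241 and 272), and the diagonal over the total reproduces the test-set accuracy printed earlier.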