Text Classification with Logistic Regression

1. Data format

sentence,label
游戏太坑,暴率太低,太克金,平民不能玩,negative
让人失望,negative
能解决一下服务器问题?网络正常老掉线,换手机也一样。。。,negative
期待,positive
一星也不想给,这特么简直龟速,炫舞老年版?,negative
衣服不好看游戏内容无特色,界面乱糟糟的,negative
喜欢喜欢,positive
从有了这个手游就一直玩,很喜欢呀,希望更多漂漂衣服,positive
因违反评价条例规定被折叠,negative
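
Several of the sample rows above contain commas inside the sentence itself. If the real file uses unquoted ASCII commas as shown (an assumption; the original file may use full-width commas or quoting), a standard CSV parser would split those rows into too many fields. One possible workaround is to split each line on its last comma only. `parse_line` below is a hypothetical helper, not part of the original code:

```python
def parse_line(line):
    """Split a raw 'sentence,label' line on the LAST comma only."""
    sentence, label = line.rstrip("\n").rsplit(",", 1)
    return sentence, label


# Two rows copied from the sample data above.
rows = [
    "游戏太坑,暴率太低,太克金,平民不能玩,negative",
    "期待,positive",
]
parsed = [parse_line(r) for r in rows]
for sentence, label in parsed:
    print(label, sentence)
```

The resulting list of tuples could then be passed to `pd.DataFrame(parsed, columns=['sentence', 'label'])` in place of the `pd.read_csv` call.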

2. Data preprocessing

import jieba
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics


def get_stop_words():
    # Load stop words (one per line) into a set for fast membership tests.
    filename = "your stop words file path"
    with open(filename, encoding='utf-8') as f:
        return {line.strip() for line in f}


def processing_sentence(x, stop_words):
    cut_word = jieba.cut(str(x).strip())
    words = [word for word in cut_word if word not in stop_words and word != ' ']
    return ' '.join(words)


def data_processing():
    train_file = "your train file path"
    df = pd.read_csv(train_file)
    x_train, x_test, y_train, y_test = train_test_split(df['sentence'], df['label'], test_size=0.1)
    stop_words = get_stop_words()
    x_train = x_train.apply(lambda x: processing_sentence(x, stop_words))
    x_test = x_test.apply(lambda x: processing_sentence(x, stop_words))

    tf = TfidfVectorizer()
    x_train = tf.fit_transform(x_train)
    x_test = tf.transform(x_test)
    x_train_weight = x_train.toarray()
    x_test_weight = x_test.toarray()

    return x_train_weight, x_test_weight, y_train, y_test

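In short, each sentence is segmented into words with jieba, stop words are removed, and the result is converted into TF-IDF features.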
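
To make the TF-IDF step concrete, here is a minimal pure-Python sketch that computes the weights by hand for space-joined tokens like those returned by `processing_sentence`. It uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which, as far as I know, matches the default settings of `TfidfVectorizer`. The `tfidf` function and the toy documents are illustrative only. One difference worth noting: the default `token_pattern` of `TfidfVectorizer` keeps only tokens of two or more characters, so single-character Chinese words are dropped unless that parameter is changed, whereas this sketch keeps every token.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (space-separated tokens)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors


docs = ["喜欢 喜欢", "喜欢 游戏", "游戏 失望"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the third document, the rarer word 失望 receives a higher weight than 游戏, which appears in two of the three documents.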

3. Building the LR model

def model_train():
    x_train_weight, x_test_weight, y_train, y_test = data_processing()
    lr = LogisticRegression(C=1.0, penalty='l2', tol=0.01)
    lr.fit(x_train_weight, y_train)

    train_score = lr.score(x_train_weight, y_train)
    print("Training accuracy: ", train_score)

    y_predict = lr.predict(x_test_weight)

    confusion_mat = metrics.confusion_matrix(y_test, y_predict)
    print('Test accuracy:', metrics.accuracy_score(y_test, y_predict))
    print("confusion_matrix is: ", confusion_mat)
    print('Classification report:', metrics.classification_report(y_test, y_predict))

Running the code prints the following results:

Training accuracy:  0.8926945588554086
Test accuracy: 0.746588693957115
confusion_matrix is:  [[177  64]
 [ 66 206]]
Classification report:               precision    recall  f1-score   support

    negative       0.73      0.73      0.73       241
    positive       0.76      0.76      0.76       272

    accuracy                           0.75       513
   macro avg       0.75      0.75      0.75       513
weighted avg       0.75      0.75      0.75       513
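
To show how the report follows from the confusion matrix, the sketch below recomputes accuracy, precision, recall, and F1 from the four counts printed above. It assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels, labels in sorted order), which agrees with the support values of 241 and 272. The variable names are illustrative.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted.
# Label order is alphabetical: negative, positive.
confusion = [[177, 64],
             [66, 206]]
labels = ["negative", "positive"]

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total
print("accuracy:", round(accuracy, 4))

report = {}
for i, label in enumerate(labels):
    true_pos = confusion[i][i]
    predicted = sum(row[i] for row in confusion)  # column sum
    actual = sum(confusion[i])                    # row sum (support)
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (round(precision, 2), round(recall, 2), round(f1, 2), actual)
    print(label, report[label])
```

The rounded values reproduce the per-class rows of the report. The gap between training accuracy (about 0.89) and test accuracy (about 0.75) suggests some overfitting.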
