文本分类(上)- 基于传统机器学习方法进行文本分类

简介

自己由于最近参加了一个比赛“达观杯”文本智能处理挑战赛,上一周主要在做这一个比赛,看了一写论文和资料,github上搜刮下。。感觉一下子接触的知识很多,自己乘热打铁整理下吧。

接着上一篇文章20 newsgroups数据介绍以及文本分类实例,我们继续探讨下文本分类方法。文本分类作为NLP领域最为经典场景之一,当目前为止在业界和学术界已经积累了很多方法,主要分为两大类:

  • 基于传统机器学习的文本分类
  • 基于深度学习的文本分类

传统机器学习的文本分类通常提取tfidf或者词袋特征,然后给LR模型进行训练;这里模型有很多,比如贝叶斯、svm等;深度学习的文本分类,主要采用CNN、RNN、LSTM、Attention等。

利用传统机器学习和深度学习进行文本分类

  • 基于传统机器学习方法进行文本分类
    基本思路是:提取tfidf特征,然后喂给各种分类模型进行训练
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network.multilayer_perceptron import MLPClassifier
from sklearn.svm import SVC,LinearSVC,LinearSVR
from sklearn.linear_model.stochastic_gradient import SGDClassifier
from sklearn.linear_model.logistic import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 选取下面的8类
selected_categories = [
    'comp.graphics',
    'rec.motorcycles',
    'rec.sport.baseball',
    'misc.forsale',
    'sci.electronics',
    'sci.med',
    'talk.politics.guns',
    'talk.religion.misc']

# 加载数据集
newsgroups_train=fetch_20newsgroups(subset='train',
                                    categories=selected_categories,
                                    remove=('headers','footers','quotes'))
newsgroups_test=fetch_20newsgroups(subset='train',
                                    categories=selected_categories,
                                    remove=('headers','footers','quotes'))

train_texts=newsgroups_train['data']
train_labels=newsgroups_train['target']
test_texts=newsgroups_test['data']
test_labels=newsgroups_test['target']
print(len(train_texts),len(test_texts))

# 贝叶斯
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',MultinomialNB())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("MultinomialNB准确率为:",np.mean(predicted==test_labels))

# SGD
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',SGDClassifier())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("SGDClassifier准确率为:",np.mean(predicted==test_labels))

# LogisticRegression
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',LogisticRegression())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("LogisticRegression准确率为:",np.mean(predicted==test_labels))

# SVM
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',SVC())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("SVC准确率为:",np.mean(predicted==test_labels))

text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',LinearSVC())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("LinearSVC准确率为:",np.mean(predicted==test_labels))

text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',LinearSVR())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("LinearSVR准确率为:",np.mean(predicted==test_labels))

# MLPClassifier
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',MLPClassifier())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("MLPClassifier准确率为:",np.mean(predicted==test_labels))

# KNeighborsClassifier
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',KNeighborsClassifier())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("KNeighborsClassifier准确率为:",np.mean(predicted==test_labels))

# RandomForestClassifier
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',RandomForestClassifier(n_estimators=8))])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("RandomForestClassifier准确率为:",np.mean(predicted==test_labels))

# GradientBoostingClassifier
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',GradientBoostingClassifier())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("GradientBoostingClassifier准确率为:",np.mean(predicted==test_labels))

# AdaBoostClassifier
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',AdaBoostClassifier())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("AdaBoostClassifier准确率为:",np.mean(predicted==test_labels))

# DecisionTreeClassifier
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',DecisionTreeClassifier())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("DecisionTreeClassifier准确率为:",np.mean(predicted==test_labels))

输出结果为:

MultinomialNB准确率为: 0.8960196779964222
SGDClassifier准确率为: 0.9724955277280859
LogisticRegression准确率为: 0.9304561717352415
SVC准确率为: 0.13372093023255813
LinearSVC准确率为: 0.9749552772808586
LinearSVR准确率为: 0.00022361359570661896
MLPClassifier准确率为: 0.9758497316636852
KNeighborsClassifier准确率为: 0.45840787119856885
RandomForestClassifier准确率为: 0.9680232558139535
GradientBoostingClassifier准确率为: 0.9186046511627907
AdaBoostClassifier准确率为: 0.5916815742397138
DecisionTreeClassifier准确率为: 0.9758497316636852

从上面结果可以看出,不同分类器在改数据集上的表现差别是比较大的,所以在做文本分类的时候要多尝试几种方法,说不定有意外收获;另外TfidfVectorizer、LogisticRegression等方法,我们可以设置很多参数,这里对实验的效果也影响比较大,比如TfidfVectorizer中一个参数ngram_range直接影响提取的特征,这里也是需要多磨多练;
更多请见:https://github.com/yanqiangmiffy/20newsgroups-text-classification

参考资料

中文文本分类对比(经典方法和CNN)
sklearn 中的 Pipeline 机制

你可能感兴趣的:(文本分类(上)- 基于传统机器学习方法进行文本分类)