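Introduction
I recently entered a competition, the "达观杯" text intelligent processing challenge, and spent most of last week on it: reading some papers and materials and digging through GitHub. That was a lot of new material in a short time, so I am writing it up while it is still fresh.
Following on from the previous post, "20 newsgroups data introduction and a text classification example", let's continue exploring text classification methods. Text classification is one of the most classic scenarios in NLP, and industry and academia have accumulated many methods for it so far. They fall into two broad groups:
- Text classification based on traditional machine learning
- Text classification based on deep learning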
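Text classification with traditional machine learning usually extracts tfidf or bag-of-words features and then feeds them to a model such as LR for training. Many models can be used here, for example naive Bayes and SVM. Text classification with deep learning mainly uses CNN, RNN, LSTM, Attention, and so on.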
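To make the two feature types concrete, here is a minimal sketch that builds bag-of-words counts and tfidf weights for the same text. The three-sentence corpus is hypothetical and made up for illustration; it is not taken from the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus (an assumption for illustration only).
docs = [
    "the motorcycle engine stalled",
    "the pitcher threw a baseball",
    "the engine needs oil",
]

# Bag of words: one column per vocabulary term, each cell is a raw count.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(docs)

# tfidf: same columns, but counts are reweighted so that a term found in
# every document ("the") gets a lower weight than a rarer term ("pitcher").
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)

# Compare the two weights inside the second document.
row = tfidf_matrix[1].toarray().ravel()
the_col = tfidf.vocabulary_["the"]
pitcher_col = tfidf.vocabulary_["pitcher"]

print(sorted(bow.vocabulary_))
print(bow_matrix.toarray())
print("the:", row[the_col], "pitcher:", row[pitcher_col])
```

Note that the default tokenizer drops one-character tokens such as "a", and that each tfidf row is scaled to unit length by default.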
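Text classification with traditional machine learning and deep learning
- Text classification with traditional machine learning methods
The basic idea: extract tfidf features, then feed them to various classification models for training.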
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC, LinearSVR
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# Use the following 8 categories
selected_categories = [
    'comp.graphics',
    'rec.motorcycles',
    'rec.sport.baseball',
    'misc.forsale',
    'sci.electronics',
    'sci.med',
    'talk.politics.guns',
    'talk.religion.misc']
# Load the dataset
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=selected_categories,
                                      remove=('headers', 'footers', 'quotes'))
# Evaluate on the held-out 'test' split. Loading subset='train' here as well
# would score each model on the data it was fitted on (training accuracy).
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=selected_categories,
                                     remove=('headers', 'footers', 'quotes'))
train_texts = newsgroups_train['data']
train_labels = newsgroups_train['target']
test_texts = newsgroups_test['data']
test_labels = newsgroups_test['target']
print(len(train_texts), len(test_texts))
# Classifiers to compare; each one is trained on the same tfidf features
classifiers = [
    MultinomialNB(),            # naive Bayes
    SGDClassifier(),            # SGD
    LogisticRegression(),       # LogisticRegression
    SVC(),                      # SVM
    LinearSVC(),
    # LinearSVR is a regressor: it predicts real values, so comparing them
    # with the integer labels gives an "accuracy" close to zero
    LinearSVR(),
    MLPClassifier(),
    KNeighborsClassifier(),
    RandomForestClassifier(n_estimators=8),
    GradientBoostingClassifier(),
    AdaBoostClassifier(),
    DecisionTreeClassifier(),
]
for clf in classifiers:
    name = type(clf).__name__
    # tfidf feature extraction followed by the classifier
    text_clf = Pipeline([('tfidf', TfidfVectorizer(max_features=10000)),
                         ('clf', clf)])
    text_clf = text_clf.fit(train_texts, train_labels)
    predicted = text_clf.predict(test_texts)
    print(name + " accuracy:", np.mean(predicted == test_labels))
The output is shown below. These figures come from a run in which the test set was also loaded with subset='train', so they are accuracy on the training data rather than on held-out data:
MultinomialNB accuracy: 0.8960196779964222
SGDClassifier accuracy: 0.9724955277280859
LogisticRegression accuracy: 0.9304561717352415
SVC accuracy: 0.13372093023255813
LinearSVC accuracy: 0.9749552772808586
LinearSVR accuracy: 0.00022361359570661896
MLPClassifier accuracy: 0.9758497316636852
KNeighborsClassifier accuracy: 0.45840787119856885
RandomForestClassifier accuracy: 0.9680232558139535
GradientBoostingClassifier accuracy: 0.9186046511627907
AdaBoostClassifier accuracy: 0.5916815742397138
DecisionTreeClassifier accuracy: 0.9758497316636852
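The near-zero LinearSVR figure has a simple explanation: LinearSVR is a regression model, so its predictions are real numbers that almost never equal an integer class label exactly. A minimal sketch of the difference, assuming a hypothetical six-sentence corpus (the texts and labels are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC, LinearSVR

# Hypothetical toy data: three classes, two sentences each.
texts = [
    "cheap bike for sale", "bike engine repair",
    "baseball game tonight", "baseball pitcher injured",
    "gun law debate", "gun control vote",
]
labels = np.array([0, 0, 1, 1, 2, 2])

X = TfidfVectorizer().fit_transform(texts)

# A classifier returns one of the known class labels for every sample.
clf_pred = LinearSVC(random_state=0).fit(X, labels).predict(X)

# A regressor returns a real number per sample, treating labels as quantities.
reg_pred = LinearSVR(random_state=0).fit(X, labels).predict(X)

print("LinearSVC:", clf_pred.dtype, clf_pred)
print("LinearSVR:", reg_pred.dtype, reg_pred)
# Exact equality between floats and integer labels is what the accuracy
# expression np.mean(predicted == labels) checks.
print("exact matches for LinearSVR:", np.mean(reg_pred == labels))
```

For classification, LinearSVC is the counterpart to use; LinearSVR only makes sense when the target is a continuous quantity.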
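As the results above show, different classifiers perform quite differently on this dataset, so when doing text classification it pays to try several methods; you may get a pleasant surprise. In addition, TfidfVectorizer, LogisticRegression, and the other components expose many parameters, and these also have a considerable effect on the results. For example, the ngram_range parameter of TfidfVectorizer directly determines which features are extracted. Getting these right takes practice.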
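As a rough illustration of how ngram_range changes the extracted features, the sketch below fits TfidfVectorizer twice on a hypothetical two-sentence corpus (an assumption, not the 20 newsgroups data) and compares the vocabularies:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus for illustration only.
docs = [
    "new york is not new jersey",
    "machine learning is not magic",
]

# ngram_range=(1, 1) is the default: single words only.
unigram = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)

# ngram_range=(1, 2): single words plus pairs of adjacent words.
uni_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

unigram_terms = sorted(unigram.vocabulary_)
uni_bi_terms = sorted(uni_bi.vocabulary_)

print(len(unigram_terms), unigram_terms)
print(len(uni_bi_terms), uni_bi_terms)
```

With bigrams included, phrases such as "new york" become features of their own, at the cost of a larger vocabulary. On a real dataset this interacts with max_features, which keeps only the most frequent terms.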
For more, see: https://github.com/yanqiangmiffy/20newsgroups-text-classification
References
A comparison of Chinese text classification methods (classic methods and CNN)
The Pipeline mechanism in sklearn