After installing Scikit-learn, let's use it for text classification.
Background:
There are many text classification algorithms today; common ones include Naïve Bayes, SVM, KNN, and logistic regression. Among these, SVM is said in the literature to perform well in both industry and academia.
Resources and programs:
1. http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html introduces how the Naive Bayes method is applied to text classification.
2. http://blog.163.com/jiayouweijiewj@126/blog/static/17123217720113115027394/ analyzes in detail how Naïve Bayes is implemented in Mahout.
3. http://www.csie.ntu.edu.tw/~cjlin/libsvm/ LIBSVM is an open-source tool for SVM training and prediction. It works straight out of the box once downloaded, and the authors' documentation is very thorough.
4. http://www.blogjava.net/zhenandaci/category/31868.html a step-by-step, formulaic introduction to SVM, explained in plain language.
5. http://blog.pluskid.org/?page_id=683 an introduction to support vector machines.
6. https://code.google.com/p/tmsvm/ Tmsvm is a program I wrote earlier for text classification with SVM; it covers the entire text classification workflow.
7. http://www.blogjava.net/zhenandaci/category/31868.html?Show=All an introductory series on text classification, covered in fair detail.
8. 《文本挖掘中若干关键问题研究》 (Research on Several Key Problems in Text Mining): a thin book, but written in depth; it discusses several key problems in text mining.
Getting to the main topic
This post covers four parts:
- Downloading the data
- Feature extraction
- Training a model with Pipeline
- Finding the best parameters with GridSearchCV
1. The scikit-learn text classification dataset: 20news-19997.tar (the 20 Newsgroups corpus)

```python
from sklearn.datasets import fetch_20newsgroups

# the five categories used throughout this post
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics',
              'comp.sys.mac.hardware', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                 shuffle=True, random_state=42)
```
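As a quick sanity check, the returned bunch objects can be inspected (a minimal sketch using the standard fetch_20newsgroups attributes):

```python
# inspect what was loaded
print(twenty_train.target_names)   # class labels, in sorted order
print(len(twenty_train.data))      # number of training documents
print(len(twenty_test.data))       # number of test documents
print(twenty_train.data[0][:200])  # first 200 characters of the first post
```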
2. Feature extraction
1) A corpus can be represented as a document-term matrix: each row is a document and each column is a token (i.e., a word). The general process of turning text documents into numerical feature vectors is called vectorization. This particular strategy (tokenization, counting, and normalization) is called the bag-of-words (or bag-of-n-grams) representation. Documents are described by word frequencies, while the relative positions of words within a document are completely ignored.
CountVectorizer implements both tokenization and counting in a single class:
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)  # build the vectorizer
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)  # learn the vocabulary and count tokens
vectorizer.get_feature_names()        # the learned features (only after fitting)
```
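To see what the vectorizer produced, the toy matrix can be printed densely; the expected output below assumes the four-sentence corpus above:

```python
print(vectorizer.get_feature_names())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(X.toarray())
# each row is a document, each column a token count; note the 2 in the
# 'second' column of the second row:
# [[0 1 1 1 0 0 1 0 1]
#  [0 1 0 1 0 2 1 0 1]
#  [1 0 0 0 1 0 1 1 0]
#  [0 1 1 1 0 0 1 0 1]]
```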
2) TF-IDF: computing term weights

```python
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)  # reweight the count matrix X by TF-IDF
```
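If the intermediate count matrix is not needed, scikit-learn also provides TfidfVectorizer, which combines CountVectorizer and TfidfTransformer in one step; a minimal equivalent sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# one-step equivalent of CountVectorizer followed by TfidfTransformer
tfidf_vectorizer = TfidfVectorizer(min_df=1)
tfidf_direct = tfidf_vectorizer.fit_transform(corpus)  # same matrix as tfidf above
```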
*For large corpora, consider the hashing vectorizer, which bounds the number of features.
See the official documentation for HashingVectorizer details:

```python
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(n_features=10)  # cap the feature space at 10 dimensions
hv.transform(corpus)
```
Limitations of HashingVectorizer:
- The model cannot be inverted (there is no inverse_transform method), and the original string representation cannot be recovered, because the hash function that performs the mapping is one-way by nature.
- It provides no IDF weighting, since that would introduce state into the model. If needed, a TfidfTransformer can be appended to it in a pipeline (see the sketch after this list).
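A minimal sketch of that pipeline workaround, written against the same older scikit-learn API used in the rest of this post (in newer releases non_negative was replaced by alternate_sign=False):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# hashing keeps memory bounded; TF-IDF reweighting is recovered via the pipeline
hash_tfidf = Pipeline([
    ('hash', HashingVectorizer(n_features=2 ** 18, non_negative=True)),
    ('tfidf', TfidfTransformer()),
])
X_hashed = hash_tfidf.fit_transform(corpus)
```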
3. A simple test: model training + prediction

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# get count vectors
vect = CountVectorizer()
X_train = vect.fit_transform(twenty_train.data)
# get word tf-idf
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train)
# model train
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new = vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new)
# predict
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
```
4. Pipeline: chaining processors

```python
from sklearn.pipeline import Pipeline

# the pipeline chains three processors
def test():
    docs_new = ['God is love', 'OpenGL on the GPU is fast']
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB()),
                         ])
    # train
    text_clf.fit(twenty_train.data, twenty_train.target)
    # predict
    new_predicted = text_clf.predict(docs_new)
    for doc, category in zip(docs_new, new_predicted):
        # print document => category
        print('%r => %s' % (doc, twenty_train.target_names[category]))
```
5. Model training + prediction: comparing classifiers

```python
import numpy as np
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

def testPipeline():
    # 1. MultinomialNB
    print('*************************\nNB\n*************************')
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB()),
                         ])
    text_clf.fit(twenty_train.data, twenty_train.target)
    docs_test = twenty_test.data
    nb_predicted = text_clf.predict(docs_test)
    accuracy = np.mean(nb_predicted == twenty_test.target)
    print("The accuracy of twenty_test is %s" % accuracy)
    print(metrics.classification_report(twenty_test.target, nb_predicted,
                                        target_names=twenty_test.target_names))

    # 2. KNN
    print('*************************\nKNN\n*************************')
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', KNeighborsClassifier()),
                         ])
    text_clf.fit(twenty_train.data, twenty_train.target)
    knn_predicted = text_clf.predict(docs_test)
    accuracy = np.mean(knn_predicted == twenty_test.target)
    print("The accuracy of twenty_test is %s" % accuracy)
    print(metrics.classification_report(twenty_test.target, knn_predicted,
                                        target_names=twenty_test.target_names))

    # 3. SVM (a linear SVM trained with SGD: hinge loss, L2 penalty)
    print('*************************\nSVM\n*************************')
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, n_iter=5,
                                               random_state=42)),
                         ])
    text_clf.fit(twenty_train.data, twenty_train.target)
    svm_predicted = text_clf.predict(docs_test)
    accuracy = np.mean(svm_predicted == twenty_test.target)
    print("The accuracy of twenty_test is %s" % accuracy)
    print(metrics.classification_report(twenty_test.target, svm_predicted,
                                        target_names=twenty_test.target_names))

    # 4. a reduced feature set via hashing
    print('*************************\nHashingVectorizer\n*************************')
    text_clf = Pipeline([('vect', HashingVectorizer(stop_words='english',
                                                    non_negative=True,
                                                    n_features=10000)),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, n_iter=5,
                                               random_state=42)),
                         ])
    text_clf.fit(twenty_train.data, twenty_train.target)
    svm_predicted = text_clf.predict(docs_test)
    accuracy = np.mean(svm_predicted == twenty_test.target)
    print("The accuracy of twenty_test is %s" % accuracy)
    print(metrics.classification_report(twenty_test.target, svm_predicted,
                                        target_names=twenty_test.target_names))
```
*Result analysis
Comparing the output of the four runs:

```
*************************
NB
*************************
The accuracy of twenty_test is 0.838897721251
                        precision    recall  f1-score   support

           alt.atheism       0.97      0.58      0.73       319
         comp.graphics       0.95      0.85      0.89       389
 comp.sys.mac.hardware       0.93      0.92      0.93       385
               sci.med       0.96      0.81      0.87       396
soc.religion.christian       0.62      0.99      0.76       398

           avg / total       0.88      0.84      0.84      1887

*************************
KNN
*************************
The accuracy of twenty_test is 0.746157922629
                        precision    recall  f1-score   support

           alt.atheism       0.56      0.86      0.68       319
         comp.graphics       0.84      0.73      0.78       389
 comp.sys.mac.hardware       0.82      0.75      0.78       385
               sci.med       0.87      0.58      0.69       396
soc.religion.christian       0.75      0.84      0.79       398

           avg / total       0.78      0.75      0.75      1887

*************************
SVM
*************************
The accuracy of twenty_test is 0.912559618442
                        precision    recall  f1-score   support

           alt.atheism       0.94      0.81      0.87       319
         comp.graphics       0.89      0.92      0.91       389
 comp.sys.mac.hardware       0.92      0.96      0.94       385
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.88      0.96      0.92       398

           avg / total       0.91      0.91      0.91      1887

*************************
HashingVectorizer
*************************
The accuracy of twenty_test is 0.897191308956
                        precision    recall  f1-score   support

           alt.atheism       0.91      0.77      0.84       319
         comp.graphics       0.89      0.91      0.90       389
 comp.sys.mac.hardware       0.92      0.95      0.93       385
               sci.med       0.91      0.89      0.90       396
soc.religion.christian       0.87      0.94      0.90       398

           avg / total       0.90      0.90      0.90      1887
```
Comparing CountVectorizer with HashingVectorizer: the full feature set gives somewhat better results, although it increases memory pressure.
Comparing the NB, SVM, and KNN results, SVM performs best, so this algorithm is used in what follows.
6. GridSearchCV: searching for the best parameters (see the comments in the code)
See the official documentation for the detailed definition of GridSearchCV.
```python
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in very old versions

# GridSearchCV: search for the best parameters
def testGridSearch():
    print('*************************\nPipeline+GridSearch+CV\n*************************')
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier()),
                         ])
    parameters = {
        'vect__ngram_range': [(1, 1), (1, 2)],
        'vect__max_df': (0.5, 0.75),
        'vect__max_features': (None, 5000, 10000),
        'tfidf__use_idf': (True, False),
        # 'tfidf__norm': ('l1', 'l2'),
        'clf__alpha': (0.00001, 0.000001),
        # 'clf__penalty': ('l2', 'elasticnet'),
        'clf__n_iter': (10, 50),
    }
    # the search itself; set flag to a non-zero value to rerun it
    flag = 0
    if flag != 0:
        grid_search = GridSearchCV(text_clf, parameters, n_jobs=1, verbose=1)
        grid_search.fit(twenty_train.data, twenty_train.target)
        print("Best score: %0.3f" % grid_search.best_score_)
        best_parameters = grid_search.best_estimator_.get_params()
        print("Out the best parameters")
        for param_name in sorted(parameters.keys()):
            print("\t%s: %r" % (param_name, best_parameters[param_name]))
    # once the best parameters are known, train the model with them
    text_clf.set_params(clf__alpha=1e-05,
                        clf__n_iter=50,
                        tfidf__use_idf=True,
                        vect__max_df=0.5,
                        vect__max_features=None)
    text_clf.fit(twenty_train.data, twenty_train.target)
    # predict
    pred = text_clf.predict(twenty_test.data)
    # report the results
    accuracy = np.mean(pred == twenty_test.target)
    print("The accuracy of twenty_test is %s" % accuracy)
    print(metrics.classification_report(twenty_test.target, pred,
                                        target_names=twenty_test.target_names))
    array = metrics.confusion_matrix(twenty_test.target, pred)
    print(array)
```
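Note that even this modest grid contains 2 × 2 × 3 × 2 × 2 × 2 = 96 parameter combinations, and GridSearchCV fits each one once per cross-validation fold (3 folds by default in scikit-learn versions of this era), so the search is deliberately gated behind the flag and the best parameters it found are hard-coded afterwards.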
*Result analysis

```
*************************
Pipeline+GridSearch+CV
*************************
The accuracy of twenty_test is 0.918388977213
                        precision    recall  f1-score   support

           alt.atheism       0.95      0.84      0.89       319
         comp.graphics       0.90      0.92      0.91       389
 comp.sys.mac.hardware       0.92      0.95      0.93       385
               sci.med       0.95      0.91      0.93       396
soc.religion.christian       0.89      0.96      0.92       398

           avg / total       0.92      0.92      0.92      1887
```
1) Each algorithm prints a classification report, where:
- precision = number of records correctly classified into the class / number of records classified into the class
- recall = number of records correctly classified into the class / total number of records of the class in the test set
- F1-score = 2 × (precision × recall) / (precision + recall); the F1-score is the special case of the F-measure (also called F-score) with beta = 1 (a quick numeric check follows this list)
- support = total number of records of the class in the test set
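As a quick numeric check, plugging the alt.atheism row of the NB report above into the F1 formula reproduces the reported value:

```python
# alt.atheism row from the NB report: precision 0.97, recall 0.58
precision, recall = 0.97, 0.58
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 2))  # 0.73, matching the f1-score column
```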
2) The confusion matrix
The confusion matrix for the SVM results: with n classes it is an n×n matrix, where the sum over each row equals the total number of test records in that class, i.e. the support value in the report.

```python
array = metrics.confusion_matrix(twenty_test.target, pred)
print(array)
```

The diagonal elements are the counts of correct classifications. For example, alt.atheism (the first row) has 319 documents in the test set; 268 of them are classified correctly here, while the rest are scattered across the other classes.

```
[[268   7   1   7  36]
 [  5 359  17   3   5]
 [  0  12 366   6   1]
 [  4  16  13 359   4]
 [  6   6   1   4 381]]
```

The rows and columns follow the sorted label order of twenty_test.target_names: ['alt.atheism', 'comp.graphics', 'comp.sys.mac.hardware', 'sci.med', 'soc.religion.christian'].
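Consistent with this, dividing the diagonal by the row sums recovers the recall column of the report (a small sketch reusing `array` from the snippet above):

```python
import numpy as np

# per-class recall = correctly classified / total records in the class
recalls = np.diag(array) / array.sum(axis=1).astype(float)
print(recalls)  # e.g. 268/319 ≈ 0.84 for alt.atheism, matching the report
```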
Full code download
Reference blog: http://blog.csdn.net/zhzhl202/article/details/8197109 (Text Classification and SVM)
That's all for this post for now; more to come.