Naive Bayes text classification with sklearn

The script reads local files for analysis. Both Chinese and English tokenization are supported, and you can swap in the jieba segmenter for Chinese text.
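To use a custom segmenter, pass a tokenizer function to `CountVectorizer`. A minimal sketch with a plain whitespace tokenizer (the sample strings are hypothetical); for Chinese you could drop in `jieba.lcut` as the tokenizer instead:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tokenizer: plain whitespace split.
# For Chinese text, jieba.lcut could be used here instead.
def tokenize(text):
    return text.split()

# token_pattern=None silences the warning about the unused default pattern.
vect = CountVectorizer(tokenizer=tokenize, token_pattern=None)
X = vect.fit_transform(["danger_degree:1 event", "breaking_sighn:0 event"])
```
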

You can define your own training samples. The directory layout is the project's data_log folder: first-level directories are the class names, and second-level entries are the sample files.

The author's training set, for reference only: http://download.csdn.net/download/yl3395017/10236998

from sklearn.datasets import load_files
# Load the dataset
training_data = load_files('./data_log', encoding='utf-8')
'''
Extract features: here the features are word-frequency counts.
'''
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training_data.data)
'''
Extract features: here the features are TF-IDF weights.
'''
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
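The two-step `CountVectorizer` + `TfidfTransformer` sequence above can be collapsed into sklearn's single `TfidfVectorizer`; a minimal sketch on a toy corpus (the corpus strings are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration; in the script this would be training_data.data.
corpus = ["danger_degree:1 event", "breaking_sighn:0 event", "danger_degree:2 alarm"]

# TfidfVectorizer tokenizes, counts, and applies TF-IDF weighting in one fit.
tfidf_vect = TfidfVectorizer()
X_tfidf = tfidf_vect.fit_transform(corpus)
```
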

'''
Train a Naive Bayes classifier and make a simple prediction.
'''
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, training_data.target)
docs_new = ['danger_degree:1;breaking_sighn:0;event']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, training_data.target_names[category]))
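The vectorize → TF-IDF → classify steps can also be chained with sklearn's `Pipeline`, which keeps the fitted vectorizer and transformer together with the model. A sketch on hypothetical toy data (the documents and labels are made up for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: 1 = alarm log line, 0 = normal log line.
train_docs = ["danger_degree:1 event", "danger_degree:3 breaking event",
              "info heartbeat ok", "info status ok"]
train_labels = [1, 1, 0, 0]

# Each step feeds its output to the next; fit/predict run the whole chain.
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(train_docs, train_labels)
pred = text_clf.predict(["info heartbeat ok"])
```

In the script above, `train_docs`/`train_labels` would be `training_data.data` and `training_data.target` from `load_files`.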

'''
Evaluate the model on a test set.
'''
from sklearn import metrics
import numpy as np
testing_data = load_files('./predict_test_log', encoding='utf-8')
docs_test = testing_data.data
X_test_counts = count_vect.transform(docs_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted = clf.predict(X_test_tfidf)
print(metrics.classification_report(testing_data.target, predicted, target_names=testing_data.target_names))
print("accuracy\t" + str(np.mean(predicted == testing_data.target)))
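Taking `np.mean` over the boolean match array is exactly what sklearn's `metrics.accuracy_score` computes; the two agree, as this small check shows (the label arrays are made up for illustration):

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels; 3 of 4 match.
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])

acc_mean = np.mean(y_pred == y_true)            # fraction of matches
acc_sklearn = metrics.accuracy_score(y_true, y_pred)
```
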

