Bayes' theorem describes the relationship between conditional probabilities:
$$P(A|B) = \cfrac{P(B|A)\,P(A)}{P(B)}$$
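A small numeric check can make the formula concrete. The sketch below uses made-up numbers for a hypothetical spam filter (the prior and the two word frequencies are assumptions chosen for illustration, not values from any dataset in this notebook).

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
p_a = 0.2                # P(A): prior probability that a message is spam
p_b_given_a = 0.6        # P(B|A): the word "free" appears, given spam
p_b_given_not_a = 0.05   # P(B|not A): the word "free" appears, given not spam

# The law of total probability gives the evidence term P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))
```

With these assumed inputs the posterior is much larger than the prior, because the word is far more frequent in spam than elsewhere.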
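The naive Bayes classifier is a probability-based classifier. We make the following definitions: $A$ denotes the class and $B$ denotes the feature vector, with components $B_i$.
With these definitions, Bayes' formula can be read as follows: $P(A)$ is the prior probability of class $A$, $P(B|A)$ is the probability of observing the features $B$ within class $A$, and $P(A|B)$ is the probability of class $A$ given the observed features $B$.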
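The naive Bayes classifier further assumes that the individual features $B_i$ are conditionally independent given the class $A$, so that $P(B|A) = \prod_i P(B_i|A)$. The classifier can then be written as
$$P(A|B) = \cfrac{P(A)\prod_{i} P(B_{i}|A)}{P(B)}$$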
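Every quantity on the right-hand side can be estimated from the training samples. To predict, compute this probability for each class and choose the class with the highest value. Since $P(B)$ is the same for every class, it can be left out of the comparison.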
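The counting procedure described above can be sketched in plain Python. The toy samples, the labels, and the helper names `score` and `predict` are all hypothetical; this is a minimal illustration without smoothing, not the implementation scikit-learn uses.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set: each sample is (feature tuple, class label).
samples = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

# class_counts[label] = number of training samples in that class.
class_counts = Counter(label for _, label in samples)

# feature_counts[label][i][value] = number of samples of class `label`
# whose i-th feature equals `value`.
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in samples:
    for i, value in enumerate(features):
        feature_counts[label][i][value] += 1

def score(features, label):
    """Unnormalized posterior P(A) * prod_i P(B_i|A); P(B) is omitted."""
    result = class_counts[label] / len(samples)
    for i, value in enumerate(features):
        result *= feature_counts[label][i][value] / class_counts[label]
    return result

def predict(features):
    """Return the class with the highest unnormalized posterior."""
    return max(class_counts, key=lambda label: score(features, label))

print(predict(("rainy", "cool")))
```

Because no smoothing is applied in this sketch, a feature value never seen with a class drives that class's score to zero.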
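Continuous-valued features are commonly handled in one of two ways: discretize them into bins and count as before, or assume a distribution (for example a Gaussian) for each feature within each class and estimate its parameters from the training samples.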
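Two common ways of turning a continuous feature into a per-class likelihood are binning and a Gaussian density. The sketch below shows both under assumed data: the height values, the bin width, and the helper names `to_bin` and `gaussian_pdf` are hypothetical.

```python
import math
from statistics import mean, pstdev

# Hypothetical heights (cm) observed for one class in the training set.
heights = [168.0, 172.0, 175.0, 180.0, 185.0]

# Option 1: discretize into fixed-width bins, then estimate P(bin | class) by counting.
def to_bin(x, width=10.0):
    return int(x // width)  # e.g. 172.0 falls in bin 17 (170 <= x < 180)

bin_counts = {}
for h in heights:
    bin_counts[to_bin(h)] = bin_counts.get(to_bin(h), 0) + 1
p_bin = bin_counts.get(to_bin(172.0), 0) / len(heights)

# Option 2: assume the feature is Gaussian within the class and use the
# density at x as the likelihood term.
mu, sigma = mean(heights), pstdev(heights)

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

p_density = gaussian_pdf(172.0, mu, sigma)
print(p_bin, p_density)
```

Binning keeps the counting logic unchanged but depends on the chosen bin width, while the Gaussian option needs only a mean and a standard deviation per class and feature.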
# Alternative: fetch the dataset through scikit-learn instead of reading local files.
# from sklearn.datasets import fetch_20newsgroups
# news = fetch_20newsgroups(subset='all')
# print(len(news.data))
# print(news.data[0])
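from sklearn import datasets
# Load the local "bydate" copy of 20 Newsgroups: one sub-folder per category.
# No encoding is given, so the documents are kept as raw bytes.
train = datasets.load_files("./20newsbydate/20news-bydate-train")
test = datasets.load_files("./20newsbydate/20news-bydate-test")
print(train.DESCR)      # None, because load_files was given no description
print(len(train.data))  # number of training documents
print(train.data[0])    # first training document, as bytes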
None
11314
b"From: [email protected] ( )\nSubject: Re: Cubs behind Marlins? How?\nArticle-I.D.: agate.1pt592$f9a\nOrganization: University of California, Berkeley\nLines: 12\nNNTP-Posting-Host: garnet.berkeley.edu\n\n\[email protected] writes:\n\nmorgan and guzman will have era's 1 run higher than last year, and\n the cubs will be idiots and not pitch harkey as much as hibbard.\n castillo won't be good (i think he's a stud pitcher)\n\n This season so far, Morgan and Guzman helped to lead the Cubs\n at top in ERA, even better than THE rotation at Atlanta.\n Cubs ERA at 0.056 while Braves at 0.059. We know it is early\n in the season, we Cubs fans have learned how to enjoy the\n short triumph while it is still there.\n"
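from sklearn.feature_extraction.text import CountVectorizer
# Bag-of-words counts with English stop words removed; bytes that cannot be decoded are ignored.
vec = CountVectorizer(stop_words="english", decode_error='ignore')
# Learn the vocabulary from the training set only, then reuse it for the test set.
train_vec = vec.fit_transform(train.data)
test_vec = vec.transform(test.data)
print(train_vec.shape)  # (number of documents, vocabulary size)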
(11314, 129782)
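The shape above means one row per training document and one column per vocabulary term. A rough pure-Python sketch of the bag-of-words idea follows. The two documents, the tiny stop list, and the helper names `tokenize` and `transform` are hypothetical, and `CountVectorizer` itself returns a sparse matrix and uses a much longer English stop list.

```python
import re
from collections import Counter

# Hypothetical mini-corpus and stop list, for illustration only.
docs = ["The Cubs pitch well", "the Braves pitch and the Cubs hit"]
stop_words = {"the", "and"}

def tokenize(text):
    # Lowercase, keep tokens of two or more word characters, drop stop words.
    return [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in stop_words]

# "fit": build a sorted vocabulary that maps each term to a column index.
terms = sorted({t for d in docs for t in tokenize(d)})
vocabulary = {term: i for i, term in enumerate(terms)}

def transform(text):
    # "transform": count known terms; terms outside the vocabulary are ignored.
    counts = Counter(tokenize(text))
    return [counts.get(term, 0) for term in vocabulary]

matrix = [transform(d) for d in docs]
print(vocabulary)
print(matrix)
```

Fitting the vocabulary on the training documents only, as the notebook does, means words that appear only in the test set contribute nothing to the test vectors.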
from sklearn.naive_bayes import MultinomialNB
# Multinomial naive Bayes works directly on the word-count features.
bays = MultinomialNB()
bays.fit(train_vec, train.target)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
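The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter. The sketch below shows its effect on the per-class word probabilities, using a hypothetical four-word vocabulary and made-up counts.

```python
from collections import Counter

# Hypothetical word counts for one class over a four-word vocabulary.
vocabulary = ["cubs", "pitch", "era", "windows"]
counts = Counter({"cubs": 3, "pitch": 2, "era": 1})  # "windows" never occurs in this class
alpha = 1.0  # same value as the alpha shown in the fitted model's output

total = sum(counts[w] for w in vocabulary)

# Without smoothing, an unseen word gets probability 0 and zeroes out the whole product.
unsmoothed = {w: counts[w] / total for w in vocabulary}

# With additive smoothing: (count + alpha) / (total + alpha * vocabulary size).
smoothed = {w: (counts[w] + alpha) / (total + alpha * len(vocabulary)) for w in vocabulary}

print(unsmoothed["windows"], smoothed["windows"])
```

The smoothed probabilities still sum to one, and every word keeps a small non-zero probability in every class.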
bays.score(test_vec, test.target)  # mean accuracy on the test set
0.80244291024960168
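For a classifier, `score` reports mean accuracy: the fraction of test documents whose predicted class matches the true class. A minimal sketch with hypothetical labels:

```python
# Hypothetical true and predicted class indices for five documents.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1]

# Accuracy = number of correct predictions / number of predictions.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)
```

Accuracy is a single number over all classes, so it can hide a class that is handled badly, which is why a per-class report is useful.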
from sklearn.metrics import classification_report
# Predict a class for every test document, then report per-class precision, recall and F1.
y = bays.predict(test_vec)
print(classification_report(test.target, y, target_names=test.target_names))
                          precision    recall  f1-score   support

             alt.atheism       0.80      0.81      0.80       319
           comp.graphics       0.65      0.80      0.72       389
 comp.os.ms-windows.misc       0.80      0.04      0.08       394
comp.sys.ibm.pc.hardware       0.55      0.80      0.65       392
   comp.sys.mac.hardware       0.85      0.79      0.82       385
          comp.windows.x       0.69      0.84      0.76       395
            misc.forsale       0.89      0.74      0.81       390
               rec.autos       0.89      0.92      0.91       396
         rec.motorcycles       0.95      0.94      0.95       398
      rec.sport.baseball       0.95      0.92      0.93       397
        rec.sport.hockey       0.92      0.97      0.94       399
               sci.crypt       0.80      0.96      0.87       396
         sci.electronics       0.79      0.70      0.74       393
                 sci.med       0.88      0.87      0.87       396
               sci.space       0.84      0.92      0.88       394
  soc.religion.christian       0.81      0.95      0.87       398
      talk.politics.guns       0.72      0.93      0.81       364
   talk.politics.mideast       0.93      0.94      0.94       376
      talk.politics.misc       0.68      0.62      0.65       310
      talk.religion.misc       0.88      0.44      0.59       251

             avg / total       0.81      0.80      0.78      7532
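The columns of the report are related by simple formulas. The sketch below computes precision and recall from hypothetical counts, then recomputes F1 from the rounded precision and recall printed for `comp.os.ms-windows.misc`. The counts `tp`, `fp`, `fn` and the helper name `f1` are assumptions for illustration.

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one class: true positives, false positives, false negatives.
tp, fp, fn = 8, 2, 12
precision = tp / (tp + fp)  # share of predicted members that are correct
recall = tp / (tp + fn)     # share of true members that were found

# Rounded values read from the comp.os.ms-windows.misc row of the report.
windows_f1 = f1(0.80, 0.04)
print(precision, recall, round(windows_f1, 2))
```

Because F1 is a harmonic mean, the very low recall of that class pulls its F1 down to roughly the same low level even though its precision is high.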