Naive Bayes
From the definition of conditional probability:

The probability of A given B: $P(A \mid B) = \frac{P(AB)}{P(B)}$

The probability of B given A: $P(B \mid A) = \frac{P(AB)}{P(A)}$

Therefore

$P(A \mid B)P(B) = P(AB) = P(B \mid A)P(A)$

which gives Bayes' theorem:

$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$
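As a quick numeric check (numbers invented for illustration): if a disease has prevalence $P(A) = 0.01$, a test detects it with $P(B \mid A) = 0.9$, and the overall positive rate is $P(B) = 0.05$, then

$P(A \mid B) = \frac{0.9 \times 0.01}{0.05} = 0.18$

so a positive result still leaves only an 18% chance of disease.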
Bayes' theorem + the conditional independence assumption of the features ⇒ the naive Bayes method.

The conditional independence assumption:
1. The features are mutually independent (given the class).
2. Every feature is equally important.
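Under the first assumption the class-conditional joint distribution factorizes, which is exactly what makes the method "naive":

$P(X = x \mid Y = c_k) = P(X^{(1)} = x^{(1)}, \ldots, X^{(n)} = x^{(n)} \mid Y = c_k) = \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)$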
The naive Bayes algorithm:

Input: training data $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)} \in \{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$, $a_{jl}$ is the $l$-th value that the $j$-th feature can take, $j = 1, 2, \ldots, n$; $l = 1, 2, \ldots, S_j$; $y_i \in \{c_1, c_2, \ldots, c_K\}$; an instance $x$.

Output: the class of instance $x$.

1. Compute the prior and conditional probabilities:

$P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1, 2, \ldots, K$

$P(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\ y_i = c_k)}{\sum_{i=1}^{N} I(y_i = c_k)}, \quad j = 1, \ldots, n;\ l = 1, \ldots, S_j;\ k = 1, \ldots, K$

2. For the given instance $x = (x^{(1)}, x^{(2)}, \ldots, x^{(n)})^T$, compute

$P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k), \quad k = 1, 2, \ldots, K$

3. Determine the class of instance $x$:

$y = \arg\max_{c_k} P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)$
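To make the three steps concrete, here is a from-scratch sketch for categorical features (my own illustration, not library code; the function names are hypothetical, and the toy data reproduces Example 4.1 from Li Hang's book):

from collections import Counter, defaultdict

def fit_nb(X, y):
    # Step 1: estimate P(Y=c_k) and P(X^(j)=a_jl | Y=c_k) by counting
    N = len(y)
    class_count = Counter(y)
    prior = {c: n / N for c, n in class_count.items()}
    cond = defaultdict(Counter)            # (class, feature j) -> value counts
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            cond[(yi, j)][v] += 1
    return prior, cond, class_count

def predict_nb(x, prior, cond, class_count):
    # Steps 2-3: score P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k), take the argmax
    best_c, best_score = None, -1.0
    for c, pc in prior.items():
        score = pc
        for j, v in enumerate(x):
            score *= cond[(c, j)][v] / class_count[c]
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Toy data following Example 4.1 in Li Hang's book
X = [(1, 'S'), (1, 'M'), (1, 'M'), (1, 'S'), (1, 'S'),
     (2, 'S'), (2, 'M'), (2, 'M'), (2, 'L'), (2, 'L'),
     (3, 'L'), (3, 'M'), (3, 'M'), (3, 'L'), (3, 'L')]
y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
prior, cond, counts = fit_nb(X, y)
print(predict_nb((2, 'S'), prior, cond, counts))   # -> -1, matching the book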
Naive Bayes classifiers come in three main models:
1. the Gaussian model
2. the multinomial model
3. the Bernoulli model
GaussianNB(): the Gaussian naive Bayes model.

For continuous attributes we can use a probability density; assume $P(x_i \mid y) \sim N(\mu_y, \sigma_y^2)$, where $\mu_y$ and $\sigma_y^2$ are the mean and variance of the $i$-th attribute over the samples of class $y$:

$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$

The parameters $\mu_y$ and $\sigma_y$ are estimated by maximum likelihood.
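As a minimal sketch of what that estimation looks like (my own illustration with made-up numbers; GaussianNB performs the equivalent computation per class and per feature internally):

import numpy as np

def gaussian_mle(X, y, target_class):
    # MLE for mu_y and sigma_y^2: per-feature sample mean and (biased)
    # variance over the training rows whose label equals target_class
    Xc = X[y == target_class]
    return Xc.mean(axis=0), Xc.var(axis=0)   # np.var defaults to the 1/N MLE form

# Tiny made-up example: 4 samples, 2 features, 2 classes
X = np.array([[1.0, 10.0], [2.0, 11.0], [8.0, 3.0], [9.0, 2.0]])
y = np.array([0, 0, 1, 1])
mu, var = gaussian_mle(X, y, 0)
print(mu, var)   # [ 1.5 10.5] [0.25 0.25]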
A classification test of the Gaussian naive Bayes model on sklearn's built-in breast cancer dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the breast cancer dataset
cancerData = load_breast_cancer()
# Randomly split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(cancerData.data, cancerData.target, test_size=0.33, random_state=1)
# Fit the naive Bayes classifier
clf = GaussianNB()
clf.fit(x_train, y_train)
error = 0
# Classify each test vector
for i in range(len(x_test)):
    # predict expects a 2-D array, so reshape the single sample
    if clf.predict(x_test[i].reshape(1, -1))[0] != y_test[i]:
        # Count the misclassifications
        error += 1
print('the total number of errors: %d' % error)
print('the total error rate: %.4f' % (error / float(len(x_test))))
Output:
the total number of errors: 10
the total error rate: 0.0532
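As a side note, the per-sample loop can be replaced by a single vectorized call; an equivalent ending using sklearn's standard predict/score API:

# Vectorized alternative to the loop: predict all rows, then compare
y_pred = clf.predict(x_test)
error = (y_pred != y_test).sum()
print('the total error rate: %.4f' % (1 - clf.score(x_test, y_test)))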
MultinomialNB(): the multinomial naive Bayes model.

The multinomial model is used when the features are discrete.

Maximum-likelihood estimation can yield probability estimates of zero when an attribute value never appears in the training set for some class. To keep a single zero from wiping out the information carried by the other attributes, the probability estimates are usually smoothed.
For example, in spam classification:

$P(x_i \mid y)$ is the probability that word $x_i$ appears in the spam class $y$, where $x_i$ is a word from the mail under examination.

$n_{x_i}$ is the number of times word $x_i$ appears across all spam mails; if $x_i$ never appeared, $n_{x_i} = 0$.

$N_y$ is the total number of word occurrences in all mails belonging to class $y$.

$P(x_i \mid y) = \frac{n_{x_i} + \lambda}{N_y + N\lambda}$

$N$ is the number of distinct words overall; correcting the denominator this way keeps the probabilities summing to 1.
$\lambda \geq 0$: taking $\lambda > 0$ is equivalent to adding a positive count $\lambda$ to the empirical frequency of every value of the random variable.
$\lambda = 0$ gives the maximum-likelihood estimate;
$\lambda = 1$ is called Laplace smoothing;
$\lambda < 1$ is called Lidstone smoothing.
The Bayesian estimate of the prior probability is smoothed the same way:

$P(y) = \frac{n_{y x_i} + \lambda}{n_{x_i} + K\lambda}$

where $n_{y x_i}$ is the number of occurrences of word $x_i$ in all mails of class $y$, and $K$ is the number of mail classes.
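A quick numeric sanity check of the smoothed word estimate above, with counts invented purely for illustration:

# Laplace-smoothed estimate P(x_i | y) = (n_xi + lam) / (N_y + N * lam)
lam = 1.0
n_xi = 0        # word never seen in spam during training
N_y = 5000      # total word occurrences in spam mails
N = 2000        # vocabulary size
p = (n_xi + lam) / (N_y + N * lam)
print(p)        # ~1.4e-4 instead of 0, so the product is not wiped out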
The multinomial model is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word-count vectors, although TF-IDF vectors are also known to work well in practice).

[TF-IDF (term frequency-inverse document frequency) is a statistic for judging how important a word is to a document. TF-IDF = TF * IDF, where TF is the term frequency and IDF is the inverse document frequency.]
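Since the demo below reuses the breast cancer data, here is also a hedged sketch of the intended text-classification use, with a toy corpus of my own invention; MultinomialNB's alpha parameter plays the role of $\lambda$ above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus invented for illustration: 1 = spam, 0 = ham
docs = ['win cash now', 'cheap pills cheap', 'meeting at noon', 'lunch with peter']
labels = [1, 1, 0, 0]

vec = CountVectorizer()            # word-count features, as the text describes
X = vec.fit_transform(docs)
clf = MultinomialNB(alpha=1.0)     # alpha = 1.0 is Laplace smoothing
clf.fit(X, labels)
print(clf.predict(vec.transform(['cheap cash'])))   # -> [1]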
#!/usr/bin/python
# -*- coding: utf-8 -*-
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load the breast cancer dataset
cancerData = load_breast_cancer()
# Randomly split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(cancerData.data, cancerData.target, test_size=0.35, random_state=1)
# Fit the naive Bayes classifier
clf = MultinomialNB()
clf.fit(x_train, y_train)
error = 0
# Classify each test vector
for i in range(len(x_test)):
    # predict expects a 2-D array, so reshape the single sample
    if clf.predict(x_test[i].reshape(1, -1))[0] != y_test[i]:
        # Count the misclassifications
        error += 1
print('the total number of errors: %d' % error)
print('the total error rate: %.4f' % (error / float(len(x_test))))
Output:
the total number of errors: 21
the total error rate: 0.1050
BernoulliNB(): the multivariate Bernoulli naive Bayes model.

Like MultinomialNB, BernoulliNB is suited to discrete data. The difference is that the Bernoulli model trains on and classifies data distributed according to multivariate Bernoulli distributions; that is, there may be many features, but each one is assumed to be a binary (Bernoulli, boolean) variable. The class therefore requires samples to be represented as binary feature vectors; given any other kind of data, a BernoulliNB instance may binarize it (depending on the binarize parameter).

For example, in text classification a feature takes the value 1 if the word occurs in the document and 0 otherwise.

When the feature value $x_i$ is 1, $P(x_i \mid y) = P(x_i = 1 \mid y)$; when $x_i$ is 0, $P(x_i \mid y) = P(x_i = 0 \mid y)$.

BernoulliNB may do better on some datasets, especially those with shorter documents.
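A small sketch of the binarize behavior, on toy data of my own invention: continuous inputs are thresholded into booleans before the Bernoulli likelihoods apply.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# With binarize=2.5 each feature becomes the boolean "value > 2.5"
# before training, as described above
X = np.array([[1.0, 3.0], [0.5, 4.0], [3.0, 0.2], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
clf = BernoulliNB(binarize=2.5)
clf.fit(X, y)
print(clf.predict(np.array([[0.3, 3.7]])))   # -> [0]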
A classification test of the multivariate Bernoulli naive Bayes model on sklearn's built-in iris dataset:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# BernoulliNB(): the multivariate Bernoulli naive Bayes model
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Load the iris dataset
irisData = load_iris()
# Randomly split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(irisData.data, irisData.target, test_size=0.4, random_state=1)
# Fit the naive Bayes classifier
clf = BernoulliNB()
clf.fit(x_train, y_train)
error = 0
# Classify each test vector
for i in range(len(x_test)):
    # predict expects a 2-D array, so reshape the single sample
    if clf.predict(x_test[i].reshape(1, -1))[0] != y_test[i]:
        # Count the misclassifications
        error += 1
print('the total number of errors: %d' % error)
print('the total error rate: %.4f' % (error / float(len(x_test))))
Output:
the total number of errors: 41
the total error rate: 0.6833
The poor result is expected here: iris features are all positive, so with the default binarize=0.0 every feature binarizes to 1 and the classifier can only fall back on the class priors.
Classification based on probability theory: naive Bayes applied to document classification.
A Bernoulli-style (set-of-words) model is used here: repeated words are counted as appearing once.
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Machine Learning in Action, Chapter 04: Bayes
# Classification based on probability theory: naive Bayes for document classification
# A Bernoulli (set-of-words) model is used: repeated words count as appearing once
from numpy import *

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = abusive text, 0 = normal speech
    return postingList, classVec
# Elements of a set cannot repeat; duplicates are filtered out automatically --
# use this property to drop all duplicate tokens across the documents
# and build a duplicate-free vocabulary
def createVocabList(dataSet):
    vocabSet = set([])  # create an empty set
    for document in dataSet:
        # convert each list in dataSet to a set and union it into vocabSet
        vocabSet = vocabSet | set(document)
    return list(vocabSet)  # return the list of unique tokens

# Given the vocabulary and a document to convert, turn the document into a
# vector: 1/0 marks whether each vocabulary word is present/absent
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            # the word appears in the vocabulary: set its slot to 1
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
"""
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0]
[1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1]
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
[0, 1, 0, 1, 0, 1]
"""
# Naive Bayes training function
def trainNB0(trainMatrix, trainCategory):
    # total number of documents in trainMatrix
    numTrainDocs = len(trainMatrix)
    # vocabulary size, taken from row 0 of trainMatrix
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # initialize counts for Laplace smoothing
    p0Num = ones(numWords); p1Num = ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    # iterate over the documents in the training set
    for i in range(numTrainDocs):
        # when a word appears in a document, increment its count
        # (p1Num or p0Num) for the document's class
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # take logs to prevent underflow
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # compare log P(w|c1) + log P(c1) against log P(w|c0) + log P(c0)
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
# Bag-of-words document model: counts occurrences instead of presence
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

# Accept a big string and parse it into a list of tokens
import re
def textParse(bigString):
    # split on runs of non-word characters (r'\W+'; the book's r'\W*' can
    # match the empty string and misbehaves on Python 3)
    listOfTokens = re.split(r'\W+', bigString)
    # lower-case the tokens and drop strings shorter than 3 characters
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
# Automated test of the naive Bayes spam classifier
def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        # import and parse the text files,
        # converting each text into a list of tokens
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        # append as one of the 25 spam documents
        docList.append(wordList)
        # extend the flat 1-D token list
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    # classList = [1, 0, 1, 0, ..., 1, 0, 1, 0, 1, 0]
    # build the duplicate-free vocabulary vocabList
    vocabList = createVocabList(docList)
    # list() is needed on Python 3, where range() is not a list
    trainingSet = list(range(50)); testSet = []
    # randomly build the test set
    for i in range(10):
        # draw a random index between 0 and len(trainingSet)
        randIndex = int(random.uniform(0, len(trainingSet)))
        # move the element at randIndex into the test set
        testSet.append(trainingSet[randIndex])
        # delete it from trainingSet to avoid picking it twice
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    # build the training matrix
    for docIndex in trainingSet:
        # convert each remaining document to a 1/0 presence vector
        # over the vocabulary
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        # trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        # and record its class label
        trainClasses.append(classList[docIndex])
    # compute the probabilities needed for classification
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    # classify the test set
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error", docList[docIndex])
    print('the error rate is: ', float(errorCount) / len(testSet))
Converting documents to token vectors
if __name__ == '__main__':
    # listClasses = [0, 1, 0, 1, 0, 1]
    listOPosts, listClasses = loadDataSet()
    # build the duplicate-free vocabulary myVocabList
    myVocabList = createVocabList(listOPosts)
    # convert documents to vectors: 1/0 marks presence/absence in the vocabulary
    print(setOfWords2Vec(myVocabList, listOPosts[0]))
    print(setOfWords2Vec(myVocabList, listOPosts[3]))
Output:
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
Building the training matrix trainMat
if __name__ == '__main__':
    # listClasses = [0, 1, 0, 1, 0, 1]
    listOPosts, listClasses = loadDataSet()
    # build the duplicate-free vocabulary myVocabList
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        # convert each document in listOPosts to a vector
        # and append it to the training matrix trainMat
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0Vect, p1Vect, pAbusive = trainNB0(trainMat, listClasses)
    print(trainMat)
    print(p0Vect, '\n')
    print(p1Vect, '\n')
    print(pAbusive)
Output:
[[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1], [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]
[-2.56494936 -2.56494936 -2.56494936 -3.25809654 -3.25809654 -2.56494936
-2.56494936 -2.56494936 -3.25809654 -2.56494936 -2.56494936 -2.56494936
-2.56494936 -3.25809654 -3.25809654 -2.15948425 -3.25809654 -3.25809654
-2.56494936 -3.25809654 -2.56494936 -2.56494936 -3.25809654 -2.56494936
-2.56494936 -2.56494936 -3.25809654 -2.56494936 -3.25809654 -2.56494936
-2.56494936 -1.87180218]
[-3.04452244 -3.04452244 -3.04452244 -2.35137526 -2.35137526 -3.04452244
-3.04452244 -3.04452244 -2.35137526 -2.35137526 -3.04452244 -3.04452244
-3.04452244 -2.35137526 -2.35137526 -2.35137526 -2.35137526 -2.35137526
-3.04452244 -1.94591015 -3.04452244 -2.35137526 -2.35137526 -3.04452244
-1.94591015 -3.04452244 -1.65822808 -3.04452244 -2.35137526 -3.04452244
-3.04452244 -3.04452244]
0.5
Tokenizing a document
if __name__ == '__main__':
    emailText = open('email/ham/1.txt').read()
    print(emailText)
    # split on runs of non-word characters
    regEx = re.compile(r'\W+')
    listOfTokens = regEx.split(emailText)
    print([tok for tok in listOfTokens if len(tok) > 0])
    print([tok.lower() for tok in listOfTokens if len(tok) > 0])
Output:
Hi Peter,
With Jose out of town, do you want to
meet once in a while to keep things
going and do some interesting stuff?
Let me know
Eugene
['Hi', 'Peter', 'With', 'Jose', 'out', 'of', 'town', 'do', 'you', 'want', 'to', 'meet', 'once', 'in', 'a', 'while', 'to', 'keep', 'things', 'going', 'and', 'do', 'some', 'interesting', 'stuff', 'Let', 'me', 'know', 'Eugene']
['hi', 'peter', 'with', 'jose', 'out', 'of', 'town', 'do', 'you', 'want', 'to', 'meet', 'once', 'in', 'a', 'while', 'to', 'keep', 'things', 'going', 'and', 'do', 'some', 'interesting', 'stuff', 'let', 'me', 'know', 'eugene']
Test:
if __name__ == '__main__':
    print(spamTest())
Output:
the error rate is: 0.0
None
(The test set is drawn at random, so the measured error rate varies from run to run; 0.0 is one lucky draw. The trailing None is the return value of spamTest(), which prints its results rather than returning them.)
References:
Li Hang, Statistical Learning Methods (统计学习方法)
Zhou Zhihua, Machine Learning (机器学习)
Peter Harrington, Machine Learning in Action (机器学习实战)
scikit-learn documentation