$P(c_j)$ denotes the initial probability of class $c_j$, estimated from historical data or experience before any model is trained. $P(c_j)$ is commonly called the prior probability of $c_j$: it reflects the distribution of $c_j$ independently of any observed sample.
The formula is:
$$P(c_j)=\frac{|c_j|}{|D|}$$
where $|c_j|$ is the number of examples belonging to class $c_j$ and $|D|$ is the total number of examples.
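As a quick check of this formula, here is a minimal sketch (with a made-up label list, not data from this chapter) that computes each class prior as a relative frequency:

from collections import Counter

labels = [0, 1, 0, 1, 0, 1]  # hypothetical class labels for six samples
counts = Counter(labels)
priors = {c: counts[c] / len(labels) for c in counts}
print(priors)  # {0: 0.5, 1: 0.5}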
The probability $P(c_j|x)$ that $c_j$ holds given a data sample $x$ is called the posterior probability, because it reflects our confidence that $c_j$ holds after seeing the sample $x$. The posterior is an estimate of the outcome $y$ after observing $x$, and most machine learning models try to obtain it.
Given two events $A$ and $B$, the probability that $A$ occurs given that $B$ has occurred is written $P(A|B)$, and it can be derived as follows:
$$P(A|B)=\frac{P(A,B)}{P(B)}$$
$$\Rightarrow P(A,B)=P(B)\,P(A|B)=P(A)\,P(B|A)$$
$$\Rightarrow P(A|B)=\frac{P(A)\,P(B|A)}{P(B)}$$
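A quick numeric sanity check of the last line, with numbers invented purely for illustration:

pA, pB_given_A, pB = 0.3, 0.8, 0.48  # hypothetical probabilities
pA_given_B = pA * pB_given_A / pB    # Bayes' rule
print(pA_given_B)                    # ≈ 0.5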
The naive Bayes classifier adopts the attribute conditional independence assumption: each attribute influences the classification result independently of the others.
Under this assumption, Bayes' formula can be rewritten as:
$$P(c|x)=\frac{P(c)\,P(x|c)}{P(x)}=\frac{P(c)}{P(x)}\prod_{i=1}^{d}P(x_i|c)$$
where $d$ is the number of attributes and $x_i$ is the value of $x$ on the $i$-th attribute.
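A minimal sketch of how the factorized form is used, with made-up per-attribute conditionals; note that $P(x)$ is the same for every class, so it can be dropped when comparing classes:

p_c = 0.5                       # hypothetical class prior P(c)
p_xi_given_c = [0.2, 0.6, 0.5]  # hypothetical P(x_i|c) for d = 3 attributes
score = p_c
for p in p_xi_given_c:
    score *= p                  # P(c) * prod_i P(x_i|c)
print(score)                    # ≈ 0.03: an unnormalized posterior for class c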
Training a naive Bayes classifier therefore amounts to estimating the class prior $P(c)$ from the training set $D$ and estimating the conditional probability $P(x_i|c)$ for every attribute.
Let $D_c$ denote the set of class-$c$ samples in the training set $D$; the class prior is then:
$$P(c)=\frac{|D_c|}{|D|}$$
If some attribute value never appears together with a certain class in the training set, the trained model will over-fit: the zero probability estimate wipes out whatever the other attributes say. To keep the information carried by the other attributes from being "erased" by attribute values that happen to be absent from the training set, the probability estimates are usually given a Laplace correction.
Let $N$ be the number of possible classes in the training set $D$ and $N_i$ the number of possible values of the $i$-th attribute. The estimates are then corrected to:
$$\hat{P}(c)=\frac{|D_c|+1}{|D|+N}$$
$$\hat{P}(x_i|c)=\frac{|D_{c,x_i}|+1}{|D_c|+N_i}$$
where $|D_{c,x_i}|$ is the number of class-$c$ samples taking value $x_i$ on the $i$-th attribute.
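A minimal numeric sketch of the correction, with made-up counts; note how an attribute value that was never observed with class c still gets a small non-zero estimate:

D = 14      # |D|: total number of training examples (hypothetical)
Dc = 6      # |D_c|: examples belonging to class c
N = 2       # number of classes
Ni = 3      # number of possible values of the i-th attribute
Dc_xi = 0   # |D_{c,x_i}|: class-c examples with value x_i (never observed)

p_c = (Dc + 1) / (D + N)          # (6+1)/(14+2) = 0.4375
p_xi_c = (Dc_xi + 1) / (Dc + Ni)  # (0+1)/(6+3) ≈ 0.111 instead of 0
print(p_c, p_xi_c)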
When the conditional probabilities are multiplied together, each factor is small (a real number less than 1), so as the number of attributes grows the running product underflows to zero.
Since $\ln(a \cdot b)=\ln(a)+\ln(b)$, the product of conditional probabilities can be turned into a sum of logarithms. To decide the class, it is enough to compare the log-sums computed this way.
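The underflow and the log-space fix are easy to demonstrate (a self-contained sketch):

import math

probs = [1e-5] * 80           # 80 small conditional probabilities
product = 1.0
for p in probs:
    product *= p
print(product)                # 0.0: the product has underflowed

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                # ≈ -921.03: easily representable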
Converting word lists to vectors
from numpy import *
import pandas as pd

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = abusive text, 0 = normal text
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])  # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # | is set union
    return list(vocabSet)

def setOfwords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)  # zero vector of the same length as the vocabulary
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  # set to 1 if the vocabulary word appears
        else:
            print("the word: %s is not in my vocabulary!" % word)
    return returnVec
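A quick usage check of the three functions (the vocabulary order depends on Python's set iteration, so the exact positions of the ones vary between runs):

listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
print(setOfwords2Vec(myVocabList, listOPosts[0]))
# a 0/1 vector with ones at the positions of 'my', 'dog', 'has', 'flea', ...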
Naive Bayes classifier training functions
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)  # prior of the abusive class
    p0Num = zeros(numWords)  # per-word counts for class 0
    p1Num = zeros(numWords)  # per-word counts for class 1
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num/p1Denom  # raw P(w_i|c=1); zero for words never seen in class 1
    p0Vect = p0Num/p0Denom  # raw P(w_i|c=0)
    return p0Vect, p1Vect, pAbusive
def trainNB1(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords)  # initialize counts to 1 (Laplace correction)
    p1Num = ones(numWords)
    p0Denom = 2.0           # and denominators to 2
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:  # class 1
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:                      # class 0
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)  # log-probabilities to avoid underflow
    p0Vect = log(p0Num/p0Denom)
    return p0Vect, p1Vect, pAbusive
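The difference between the two training functions is easy to see on the toy data: under trainNB0, any word that never occurs in a class gets probability exactly 0, which would annihilate the whole product, while trainNB1's smoothed log-probabilities are always finite. A small check:

listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
trainMat = [setOfwords2Vec(myVocabList, doc) for doc in listOPosts]
p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
print((p1V == 0).sum())  # many words have zero probability in the abusive class
p0V, p1V, pAb = trainNB1(array(trainMat), array(listClasses))
print(isinf(p0V).sum(), isinf(p1V).sum())  # 0 0: no -inf log-probabilities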
Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # log P(c) + sum_i log P(w_i|c), so the inputs must be log-probabilities
    p1 = sum(vec2Classify*p1Vec) + log(pClass1)
    p0 = sum(vec2Classify*p0Vec) + log(1.0-pClass1)
    if p1 > p0:  # compare the class log-probabilities
        return 1
    else:
        return 0
def testingNB():
    listOposts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOposts)
    trainMat = []
    for postinDoc in listOposts:
        trainMat.append(setOfwords2Vec(myVocabList, postinDoc))
    # classifyNB expects log-probabilities, so use the smoothed trainNB1 here
    p0V, p1V, pAb = trainNB1(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfwords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfwords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
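Running the test function (with trainNB1 as above, the expected classifications are):

testingNB()
# ['love', 'my', 'dalmation'] classified as: 0
# ['stupid', 'garbage'] classified as: 1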
Document bag-of-words model
def bagOfWords2VecMN(vocabList, inputSet):  # bag-of-words model
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  # count occurrences, not just presence
    return returnVec
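The difference from setOfwords2Vec shows up as soon as a word repeats; a small check on a toy vocabulary:

vocab = ['dog', 'stupid', 'my']      # toy vocabulary, for illustration only
doc = ['stupid', 'dog', 'stupid']
print(setOfwords2Vec(vocab, doc))    # [1, 1, 0]: presence/absence only
print(bagOfWords2VecMN(vocab, doc))  # [1, 2, 0]: occurrence counts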
Preparing the data: tokenizing text
def textParse(bigString):  # tokenize the text
    import re
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]  # drop tokens of 1-2 characters
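For example, on a sample sentence (made up here) the function splits on non-word characters, lower-cases, and drops the short tokens:

print(textParse('Naive Bayes is a simple but surprisingly effective classifier, e.g. for spam.'))
# ['naive', 'bayes', 'simple', 'but', 'surprisingly', 'effective', 'classifier', 'for', 'spam']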
Spam email test function
def spamTest():
    docList = []
    classList = []
    fullText = []
    for i in range(1, 26):  # 25 spam and 25 ham emails
        wordList = textParse(open('D:/machinelearning/machinelearningsource/email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)  # spam emails are labeled 1
        wordList = textParse(open('D:/machinelearning/machinelearningsource/email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)  # normal emails are labeled 0
    vocabList = createVocabList(docList)
    trainingSet = list(range(50))
    testSet = []
    for i in range(10):  # hold out 10 emails at random for testing
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:  # iterate over the training set
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB1(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:  # iterate over the test set
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is:', float(errorCount)/len(testSet))  # report the error rate
    return float(errorCount)/len(testSet)
Running result:
The classification error rate here is 0.1. Because the held-out emails are chosen at random, the output may differ from run to run; for a better estimate of the error rate we can repeat the test several times and average:
spamTest()

error = 0.0
for i in range(10):
    error += spamTest()
error = error / 10.0
print(error)
The SMS dataset provides messages of two types, ham and spam. As with spam email classification, we first need to process the raw text, then split it into documents, and finally train and test.
Each line of the dataset starts with that message's label, followed by a tab (\t) and then the message text. So we split each line apart, inspect the label token, and if it is spam we put the message content into the document list and set its label to 1; likewise, if it is ham we put the content into the list and set its label to 0, as in the parsing code below.
import re

docList = []
classList = []
fullText = []
f = open('D:/machinelearning/machinelearningsource/smsspamcollection/SMSSpamCollection.txt', 'r')
for line in f:
    listOfTokens = re.split(r'\W+', line)
    if listOfTokens[0] == 'spam':
        listOfTokens.remove(listOfTokens[0])  # drop the label token
        docList.append(listOfTokens)
        fullText.extend(listOfTokens)
        classList.append(1)  # spam messages are labeled 1
    elif listOfTokens[0] == 'ham':
        listOfTokens.remove(listOfTokens[0])  # drop the label token
        docList.append(listOfTokens)
        fullText.extend(listOfTokens)
        classList.append(0)  # ham messages are labeled 0
f.close()
The full SMS program reuses createVocabList, trainNB1, classifyNB, and bagOfWords2VecMN exactly as defined above; only spamTest changes:
def spamTest():
    docList = []
    classList = []
    fullText = []
    import re
    f = open('D:/machinelearning/machinelearningsource/smsspamcollection/SMSSpamCollection.txt', 'r')
    for line in f:
        listOfTokens = re.split(r'\W+', line)
        if listOfTokens[0] == 'spam':
            listOfTokens.remove(listOfTokens[0])  # drop the label token
            docList.append(listOfTokens)
            fullText.extend(listOfTokens)
            classList.append(1)  # spam -> 1
        elif listOfTokens[0] == 'ham':
            listOfTokens.remove(listOfTokens[0])  # drop the label token
            docList.append(listOfTokens)
            fullText.extend(listOfTokens)
            classList.append(0)  # ham -> 0
    f.close()
    vocabList = createVocabList(docList)
    trainingSet = list(range(50))  # as in the email test, only the first 50 messages are used
    testSet = []
    for i in range(10):  # hold out 10 messages at random for testing
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:  # build the training matrix
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB1(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:  # evaluate on the held-out messages
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is:', float(errorCount)/len(testSet))
    return float(errorCount)/len(testSet)
In these naive Bayes classification experiments, the first requirement is a thorough understanding of the underlying theory and of Bayes' formula; the other key skill is learning how to manipulate text, which makes the experiments much easier to carry out.