我们将充分利用python的文本处理能力将文档切分成词向量,然后利用词向量对文档进行分类。还将构造分类器观察其在真实的垃圾邮件数据集中的过滤效果。
基于贝叶斯决策理论的分类方法
假设现在我们有一个数据集,它由两类数据组成,数据分布如图4-1所示。
我们现在用 p1(x,y) 表示数据点(x,y)属于类别1(图中用圆点表示的类别)的概率,用 p2(x,y) 表示数据点(x,y)属于类别2(图中用三角形表示的类别)的概率,那么对于一个新数据点(x,y),可以用下面的规则来判断它的类别:
- 如果 p1(x,y) > p2(x,y) ,那么类别为1。
- 如果 p2(x,y) > p1(x,y) ,那么类别为2。
我们这次的实验是使用朴素贝叶斯进行文档分类,我们以垃圾邮件的识别为例。
问题背景:以在线社区的留言板为例。为了不影响社区的发展,我们要屏蔽侮辱性的言论,所以要构建一个快速过滤器,如果某条留言使用了负面或者侮辱性的语言,那么就将该留言标识为内容不当。过滤这类内容是一个很常见的需求。对此问题建立两个类别:侮辱类和非侮辱类,使用1和0分别表示。
准备数据:从文本中构建词向量
给一段文本,根据词的出现与否构建词向量。
def loadDataSet():
oldPostingList = [['my dog has flea problems help please'], ['maybe not take him to dog park stupid'],
['my dalmation is so cute I love him'], ['stop posting stupid worthless garbage'],
['mr licks ate my steak how to stop him']]
postingList = []
for line in oldPostingList:
newline = line[0].split()
postingList.append(newline)
classVec = [0, 1, 0, 1, 0, 1]
return postingList,classVec
def createVocabList(dataSet):
vocabSet = set([])
for document in dataSet:
vocabSet = vocabSet | set(document)
return list(vocabSet)
def setOfWords2Vec(vocabList, inputSet):
returnVec = [0]*len(vocabList)
for word in inputSet:
if word in vocabList:
returnVec[vocabList.index(word)] = 1
else:
print "the word: %s is not in my Vocabulary"%word
return returnVec
训练算法:从词向量计算概率
p ( c i | w ) = p ( w | c i ) p ( c i ) p ( w )
朴素贝叶斯分类器训练函数
from numpy import *
def trainNB0(trainMatrix, trainCategory):
numTrainDocs = len(trainMatrix)
numWords = len(trainMatrix[0])
pAbusive = sum(trainCategory)/float(numTrainDocs)
# p0Num =zeros(numWords); p1Num =zeros(numWords)
# p0Denom = 0.0; p1Denom = 0.0
p0Num =ones(numWords); p1Num = ones(numWords)
p0Denom = 2.0; p1Denom = 2.0
for i in range(numTrainDocs):
if trainCategory[i] == 1:
p1Num += trainMatrix[i]
p1Denom += sum(trainMatrix[i])
else:
p0Num += trainMatrix[i]
p0Denom += sum(trainMatrix[i])
# p1Vect = p1Num/p1Denom
# p0Vect = p0Num/p0Denom
p1Vect = log(p1Num/p1Denom)
p0Vect = log(p0Num/p0Denom)
return p0Vect,p1Vect,pAbusive
朴素贝叶斯分类函数:
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
p1 = sum(vec2Classify * p1Vec) + log(pClass1)
p0 = sum(vec2Classify * p0Vec) + log(1-pClass1)
if p1 > p0:
return 1
else:
return 0
def testingNB():
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
trainMat = []
for postinDoc in listOPosts:
trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
p0V,p1V,pAb = trainNB0(array(trainMat), array(listClasses))
testEntry = ['love', 'my', 'dalmation']
thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
print testEntry,'classified as: ', classifyNB(thisDoc,p0V,p1V,pAb)
testEntry = ['stupid', 'garbage']
thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
print testEntry, 'classified as :', classifyNB(thisDoc, p0V, p1V, pAb)
准备数据:文档词袋模
如果一个词在文档中出现不止一次,这可能意味着包含该词是否出现在文档中所不能表达的某种信息,这种方法被称为词袋模型(bag-of-words model)
def bagOfWords2VecMN(vocabList, inputSet):
returnVec = [0]*len(vocabList)
for word in inputSet:
if word in vocabList:
returnVec[vocabList.index(word)] += 1
return returnVec
使用朴素贝叶斯过滤垃圾邮件
def textParse(bigString):
import re
listOfTokens = re.split(r'\W*', bigString)
return [tok.lower() for tok in listOfTokens if len(tok) > 2]
def spamTest():
docList = []; classList = []; fullText = []
for i in range(1,26):
wordList = textParse(open('email/spam/%d.txt'%i).read())
docList.append(wordList)
fullText.extend(wordList)
classList.append(1)
wordList = textParse(open('email/ham/%d.txt'%i).read())
docList.append(wordList)
fullText.extend(wordList)
classList.append(0)
vocabList = createVocabList(docList)
trainingSet = range(50); testSet = []
for i in range(10):
randIndex = int(random.uniform(0, len(trainingSet)))
testSet.append(trainingSet[randIndex])
del(trainingSet[randIndex])
trainMat = []; trainClasses = []
for docIndex in trainingSet:
trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
trainClasses.append(classList[docIndex])
p0V,p1V,pSpam = trainNB0(array(trainMat), array(trainClasses))
errorCount = 0
for docIndex in testSet:
wordVector = setOfWords2Vec(vocabList, docList[docIndex])
if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
errorCount += 1
print 'the error rate is :', float(errorCount)/len(testSet)
函数 spamTest() 会输出在10封随机选择的电子邮件上的分类错误率。既然这些电子邮件是随机选择的,所以每次的输出结果可能有些差别。如果发现错误的话,函数会输出错分文档的词表,这样就可以了解到底是哪篇文档发生了错误。如果想要更好地估计错误率,那么就应该将上述过程重复多次,比如说10次,然后求平均值。