4. Pros and cons of naive Bayes:
Pros: remains effective even with small amounts of data, and can handle multi-class problems.
Cons: sensitive to how the input data is prepared.
Works with: nominal (categorical) data.
5. Text classification with Python
5.1 Preparing the data: building word vectors from text
We will treat each text as a vector of words (or tokens). Looking at all the words that appear across all documents, we first decide which of them to include in the vocabulary (the word set we want), and then convert each document into a vector over that vocabulary. Below is the example given in Machine Learning in Action.
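Before the full implementation, here is a minimal sketch of the idea using a made-up three-word vocabulary (toy data, not the book's example): each document becomes a 0/1 vector marking which vocabulary words it contains.

```python
# Toy illustration of the set-of-words model (hypothetical vocabulary)
vocab = ['my', 'dog', 'stupid']
doc = ['stupid', 'dog', 'dog']
# 1 if the vocabulary word appears in the document, else 0
vec = [1 if w in doc else 0 for w in vocab]
print(vec)  # [0, 1, 1]
```

Note that repeated words ('dog' appears twice) still map to a single 1 — the set-of-words model only records presence, not frequency.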
def loadDataSet():
    # Training examples: each entry is a tokenized post
    postingList = [['my', 'dog', 'has', 'flea', 'problem', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    # Label for each post: 1 = abusive, 0 = not abusive
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec
def createVocabList(dataSet):
    vocabSet = set([])  # deduplicated vocabulary built from the data set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union with each document's words
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    # Convert one document into a binary vector over the vocabulary
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my vocabulary!" % word)
    return returnVec

5.2 Training the algorithm: computing probabilities from word vectors
The previous section showed how to turn a set of words into a set of numbers; now let's see how to use those numbers to compute probabilities.
Pseudocode for the training function:
Count the number of documents in each class
For each training document:
    For each class:
        If a token appears in the document, increment that token's count
        Increment the total token count
For each class:
    For each token:
        Divide the token's count by the total token count to get the conditional probability
Return the conditional probabilities for each class
from numpy import ones, log

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)   # total number of training documents
    numWords = len(trainMatrix[0])    # vocabulary size of each document vector
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # fraction of abusive documents
    # Initialize every word's count to 1 (Laplace smoothing)
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    # Denominators: number of documents in each class, plus 2 for smoothing
    p0Denom = numTrainDocs - sum(trainCategory) + 2.0  # non-abusive documents
    p1Denom = sum(trainCategory) + 2.0                 # abusive documents
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]  # add word occurrences from abusive documents
        else:
            p0Num += trainMatrix[i]  # add word occurrences from non-abusive documents
    # Take logs to avoid underflow when multiplying many small probabilities
    p1Vec = log(p1Num / p1Denom)
    p0Vec = log(p0Num / p0Denom)
    return p0Vec, p1Vec, pAbusive
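Why take logs at the end of trainNB0? Multiplying many probabilities, each well below 1, quickly underflows to 0.0 in floating point, while summing their logs stays well-scaled. A quick demonstration (illustrative numbers only):

```python
import numpy as np

# 500 factors of 0.01: the product is 1e-1000, far below
# the smallest representable double (~1e-308), so it underflows
probs = np.full(500, 0.01)
print(np.prod(probs))         # 0.0 -- underflow
print(np.sum(np.log(probs)))  # about -2302.6, still usable
```

Since log is monotonic, comparing log-probabilities gives the same classification decision as comparing the probabilities themselves.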
from numpy import log

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # Log-posterior (up to a shared constant) for each class
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)        # abusive
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)  # not abusive
    if p1 > p0:
        return 1
    else:
        return 0
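The decision rule in classifyNB is just a dot product of the document's binary vector with each class's log-probability vector, plus the log prior. A small numeric walkthrough with made-up values (a hypothetical three-word vocabulary, not the trained probabilities above):

```python
import numpy as np

# Hypothetical per-class word probabilities for a 3-word vocabulary
p0Vec = np.log(np.array([0.5, 0.3, 0.05]))  # P(word | not abusive)
p1Vec = np.log(np.array([0.1, 0.1, 0.6]))   # P(word | abusive)
pClass1 = 0.5                               # prior P(abusive)

vec = np.array([0, 1, 1])  # document contains words 1 and 2
p1 = np.sum(vec * p1Vec) + np.log(pClass1)        # about -3.51
p0 = np.sum(vec * p0Vec) + np.log(1.0 - pClass1)  # about -4.89
print(1 if p1 > p0 else 0)  # 1 -- classified as abusive
```

Multiplying by the 0/1 vector simply selects the log-probabilities of the words that actually occur, so the sums are the log-likelihoods of the document under each class.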
from numpy import array

def testingNB():
    listOPosts, listClasses = loadDataSet()    # load the training posts and labels
    myVocabList = createVocabList(listOPosts)  # build the deduplicated vocabulary
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))  # documents -> binary vectors
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))  # train on the sample data
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, " classified as: ", classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, " classified as: ", classifyNB(thisDoc, p0V, p1V, pAb))