The basic idea of naive Bayes is to choose the class with the higher probability. That is, if there are two classes:

If p1(x,y) > p2(x,y), the predicted class is 1
If p1(x,y) < p2(x,y), the predicted class is 2

Characteristics
Advantages: still effective when training data is scarce; can handle multi-class problems
Disadvantages: sensitive to how the input data is prepared
Suitable data type: nominal values

General process
(1) Collect data: any method can be used; here an RSS feed is used
(2) Prepare data: numeric or Boolean values are required
(3) Analyze data: with a large number of features, plotting individual features is of little use; histograms work better in that case
(4) Train the algorithm: compute the conditional probability of each independent feature
(5) Test the algorithm: compute the error rate
(6) Use the algorithm: a common application of naive Bayes is document classification (see Example 1), but the classifier can be used in any classification setting, not just text

Example 1: Automatic document classification
(1) Build word vectors from text

def loadDataSet():
    # six example posts and their class labels: 1 = abusive, 0 = not abusive
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec
def createVocabList(dataset):
    # build the vocabulary: the set of all unique words across the documents
    vocabSet = set([])
    for doc in dataset:
        vocabSet = vocabSet | set(doc)  # union with this document's words
    return list(vocabSet)
def setOfWords2Vec(vocabList, inputSet):
    # convert a document into a 0/1 vector over the vocabulary (set-of-words model)
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("The word %s is not in the vocabulary list" % word)
    return returnVec
if __name__ == '__main__':
    postingList, classVec = loadDataSet()
    vocabList = createVocabList(postingList)
    returnVec = setOfWords2Vec(vocabList, postingList[0])
    print(returnVec)
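Before step (2) can run, every training post has to be converted into a word vector and collected into the training matrix that trainNB0 (defined below) expects. The original notes skip this glue step; here is a minimal sketch of it, where trainMat and the surrounding flow are my assumption of how the pieces connect:

# Sketch: turn each training post into a word vector, then train.
postingList, classVec = loadDataSet()
vocabList = createVocabList(postingList)
trainMat = [setOfWords2Vec(vocabList, doc) for doc in postingList]
p0Vec, p1Vec, pAbusive = trainNB0(trainMat, classVec)
print(pAbusive)  # 3 of the 6 posts are abusive, so this prints 0.5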
(2) Compute probabilities from the word vectors (***)

The pseudocode is as follows:

Count the number of documents in each class
For each training document:
    For each class:
        If a token appears in the document, increment the count for that token
        Increment the count of total tokens
For each class:
    For each token:
        Divide the token count by the total token count to get the conditional probability
Return the conditional probabilities of each class
Bayes' formula is as follows:

p(c_i|\omega) = \frac{p(\omega|c_i)\,p(c_i)}{p(\omega)}
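The "naive" part is the conditional-independence assumption hiding inside p(\omega|c_i): \omega is the whole word vector (\omega_1, \ldots, \omega_N), and each word is assumed independent of the others given the class, so the likelihood factors into exactly the per-word probabilities that the pseudocode above counts:

p(\omega|c_i) = p(\omega_1, \omega_2, \ldots, \omega_N | c_i) = \prod_{k=1}^{N} p(\omega_k|c_i)

Because the denominator p(\omega) is the same for every class, comparing classes only requires comparing the numerators p(\omega|c_i)\,p(c_i).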
Implemented in code:

from numpy import zeros

"""
Function:
    trainNB0: train the naive Bayes classifier
Parameters:
    trainMatrix: the training documents, already converted to word vectors
    trainCategory: the training labels; 1 means an abusive document
Returns:
    p0Vec: per-word conditional probabilities given the non-abusive class
    p1Vec: per-word conditional probabilities given the abusive class
    pAbusive: the probability that a document is abusive
"""
def trainNB0(trainMatrix, trainCategory):
    num_training = len(trainMatrix)  # the number of training documents
    num_words = len(trainMatrix[0])  # the number of words in the vocabulary
    pAbusive = sum(trainCategory) / float(num_training)  # the probability of the abusive class
    p0Num = zeros(num_words)
    p1Num = zeros(num_words)
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(num_training):
        if trainCategory[i] == 1:  # an abusive document
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p0Vec = p0Num / p0Denom  # per-word conditional probabilities, class 0
    p1Vec = p1Num / p1Denom  # per-word conditional probabilities, class 1
    return p0Vec, p1Vec, pAbusive
Note, however, that when the Bayesian classifier classifies a document, it multiplies many probabilities together to obtain the probability that the document belongs to a given class. If any one of those probabilities is 0, the final product is also 0. To avoid this, initialize every word's occurrence count to 1 and each denominator to 2.

In addition, there is an underflow problem, caused by multiplying too many very small numbers. Taking the natural logarithm of the product avoids underflow and floating-point round-off errors, since ln(a*b) = ln(a) + ln(b) turns the long product into a sum.
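Combining the two fixes, here is a minimal sketch of what the smoothed, log-space training function and the matching classification step could look like. The names trainNB1 and classifyNB are my own; only the count/denominator initialization and the logarithm are what the notes above prescribe:

from numpy import ones, log, array

def trainNB1(trainMatrix, trainCategory):
    # same counting as trainNB0, but smoothed and in log space
    num_training = len(trainMatrix)
    num_words = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(num_training)
    p0Num = ones(num_words)   # every word count starts at 1 ...
    p1Num = ones(num_words)
    p0Denom = 2.0             # ... and every denominator starts at 2
    p1Denom = 2.0
    for i in range(num_training):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p0Vec = log(p0Num / p0Denom)  # logs turn the tiny product into a safe sum
    p1Vec = log(p1Num / p1Denom)
    return p0Vec, p1Vec, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pAbusive):
    # pick the class with the larger log posterior: sum of log-likelihoods + log prior
    p1 = sum(array(vec2Classify) * p1Vec) + log(pAbusive)
    p0 = sum(array(vec2Classify) * p0Vec) + log(1.0 - pAbusive)
    return 1 if p1 > p0 else 0

With the six training posts above, the word vector for ['stupid', 'garbage'] comes out as class 1 (abusive), while ['love', 'my', 'dalmation'] comes out as class 0.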
(3) Testing process

To be continued...