4-5节朴素贝叶斯|使用朴素贝叶斯过滤垃圾邮件项目汇总|机器学习实战-学习笔记

文章原创,最近更新：2018-08-21

前言:
本文介绍机器学习分类算法中的朴素贝叶斯分类算法并给出伪代码，Python代码实现。

学习参考链接:
1.第4章基于概率论的分类方法：朴素贝叶斯

本章节的主要内容是:
重点介绍项目案例2:使用朴素贝叶斯过滤垃圾邮件项目汇总代码。

1.朴素贝叶斯项目案例介绍:

项目案例2:

使用朴素贝叶斯过滤垃圾邮件

项目概述:

完成朴素贝叶斯的一个最著名的应用: 电子邮件垃圾过滤。

朴素贝叶斯工作原理:

提取所有文档中的词条并进行去重
获取文档的所有类别
计算每个类别中的文档数目
对每篇训练文档: 
    对每个类别: 
        如果词条出现在文档中-->增加该词条的计数值（for循环或者矩阵相加）
        增加所有词条的计数值（此类别下词条总数）
对每个类别: 
    对每个词条: 
        将该词条的数目除以总词条数目得到的条件概率（P(词条|类别)）
返回该文档属于每个类别的条件概率（P(类别|文档的所有词条)）

开发流程：

使用朴素贝叶斯对电子邮件进行分类

收集数据: 提供文本文件
准备数据: 将文本文件解析成词条向量
分析数据: 检查词条确保解析的正确性
训练算法: 使用我们之前建立的 trainNB() 函数
测试算法: 使用朴素贝叶斯进行交叉验证
使用算法: 构建一个完整的程序对一组文档进行分类，将错分的文档输出到屏幕上

数据集介绍

数据集包含两部分,spam(垃圾邮件)文件夹,以及ham(正常邮件)文件夹.spam/ham文件夹分别有25个邮件.代码块的内容是ham邮件其中的一个1.txt邮件数据集,图1是ham(正常邮件)文件夹所有的数据集.

Hi Peter,

With Jose out of town, do you want to
meet once in a while to keep things
going and do some interesting stuff?

Let me know
Eugene

4-5节朴素贝叶斯|使用朴素贝叶斯过滤垃圾邮件项目汇总|机器学习实战-学习笔记_第1张图片

图1

2.文档词袋模型的介绍

目前为止，我们将每个词的出现与否作为一个特征，这可以被描述为词集模型（set-of-wordsmodl）.如果一个词在文档中出现不止一次，这可能意味着包含该词是否出现在文档中所不能表达的某种信息，这种方法被称为词袋模型（bag-of-words model）。在词袋中，每个单词可以出现多次，而在词集中，每个词只能出现一次。为适应词袋模型，需要对函数 setofWords2vec（）稍加修改，修改后的函数称为 bagofwords2vec（）.

下面的程序清单给出了基于词袋模型的朴素贝叶斯代码。它与函数 setofwords2vec（）几乎完全相同，唯一不同的是每当遇到一个单词时，它会增加词向量中的对应值，而不只是将对应的数值设为1.

def bagOfWords2Vec(vocabList, inputSet):
    """
    遍历查看该单词是否出现，出现该单词则将该单词置1,否则该单词置0
    :param vocabList: 所有单词集合列表
    :param inputSet: 输入数据集
    :return: 匹配列表[0,1,0,1...]，其中 1与0 表示词汇表中的单词是否出现在输入的数据集中
    """
    # 创建一个和词汇表vocabList等长的向量returnVec,向量中每一元素都为0
    returnVec = [0]*len(vocabList)# [0,0......]
    #用变量word遍历输入文档inputSet中的所有单词
    for word in inputSet:
        # 如果单词在词汇表vocabList中
        if word in vocabList:
            # 则将输出文档向量中的值设为1
            returnVec[vocabList.index(word)] +=1
    return returnVec

3.相关代码:

3.1准备数据: 将文本文件解析成词条向量

之前介绍了如何创建词向量，并基于这些词向量进行朴素贝叶斯分类的过程。前一节中的词向量是预先给定的，下面介绍如何从文本文档中构建自己的词列表。

对于一个文本字符串，可以使用 Python的 string.split（）方法将其切分。下面看看实际的运行结果

mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'

mySent.split()
Out[244]: 
['This', 'book', 'is', 'the','best', 'book','on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever','laid', 'eyes', 'upon.']

可以看到，切分的结果不错，但是标点符号也被当成了词的一部分。可以使用正则表示式来切分句子，其中分隔符是除单词、数字外的任意字符串.

使用正则表达式来切分文本

import re
regEx = re.compile("\\W*")
listOfTokens=regEx.split(mySent)

listOfTokens
Out[248]: 
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I','have', 'ever', 'laid', 'eyes', 'upon', '']

现在得到了一系列词组成的词表，但是里面的空字符串需要去掉。可以计算每个字符串的长度只返回长度大于0的字符串。

[tok for tok in listOfTokens if len(tok)>0]
Out[250]: 
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']

最后，我们发现句子中的第一个单词是大写的。如果目的是句子查找，那么这个特点会很有用。

但这里的文本只看成词袋，所以我们希望所有词的形式都是统一的，不论它们出现在句子中间、结尾还是开头。

Python中有一些内嵌的方法，可以将字符串全部转换成小写（.1ower()）或者大写（.upper（），借助这些方法可以达到目的。于是，可以进行如下处理：

[tok.lower() for tok in listOfTokens if len(tok)>0]
Out[255]: 
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']

现在来看数据集中一封完整的电子邮件的实际处理结果。

emailText=open("6.txt").read()

emailText
Out[259]: 'Hello,\n\nSince you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups -- mailing lists and forum discussions.  Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.\n\nFor example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you抮e just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.\n\nyou have received this mandatory email service announcement to update you about important changes to Google Groups.'

listOfTokens=regEx.split(emailText)

listOfTokens
Out[261]: 
['Hello', 'Since', 'you', 'are', 'an', 'owner', 'of', 'at', 'least', 'one', 'Google', 'Groups', 'group', 'that', 'uses', 'the', 'customized', 'welcome', 'message', 'pages', 'or', 'files', 'we', 'are', 'writing', 'to', 'inform', 'you', 'that', 'we', 'will', 'no', 'longer', 'be', 'supporting', 'these', 'features', 'starting', 'February', '2011', 'We', 'made', 'this', 'decision', 'so', 'that', 'we', 'can', 'focus', 'on', 'improving', 'the', 'core', 'functionalities', 'of', 'Google', 'Groups', 'mailing', 'lists', 'and', 'forum', 'discussions', 'Instead', 'of', 'these', 'features', 'we', 'encourage', 'you', 'to', 'use', 'products', 'that', 'are', 'designed', 'specifically', 'for', 'file', 'storage', 'and', 'page', 'creation', 'such', 'as', 'Google', 'Docs', 'and', 'Google', 'Sites', 'For', 'example', 'you', 'can', 'easily', 'create', 'your', 'pages', 'on', 'Google', 'Sites', 'and', 'share', 'the', 'site', 'http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '174623', 'with', 'the', 'members', 'of', 'your', 'group', 'You', 'can', 'also', 'store', 'your', 'files', 'on', 'the', 'site', 'by', 'attaching', 'files', 'to', 'pages', 'http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '90563', 'on', 'the', 'site', 'If', 'you抮e', 'just', 'looking', 'for', 'a', 'place', 'to', 'upload', 'your', 'files', 'so', 'that', 'your', 'group', 'members', 'can','download', 'them', 'we', 'suggest', 'you', 'try', 'Google', 'Docs', 'You', 'can', 'upload', 'files', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '50092', 'and', 'share', 'access', 'with', 'either', 'a', 'group', 'http', 'docs', 'google','com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '66343', 'or', 'an', 'individual', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '86152', 'assigning', 'either', 'edit', 'or', 'download', 'only', 'access', 'to', 'the', 'files', 'you', 'have','received', 'this', 'mandatory', 'email', 'service', 'announcement', 'to', 'update', 'you', 'about', 'important', 'changes', 'to', 'Google', 'Groups', '']

通过结果可以知道,这里对url进行了切分,得到了很多词,如果我们想去掉这些单词,因此在实现时会过滤掉长度小于3的字符串.

3.2训练算法: 使用我们之前建立的 trainNB0() 函数

def trainNB0(trainMatrix, trainCategory):
    """
    训练数据原版
    :param trainMatrix: 文件单词矩阵 [[1,0,1,1,1....],[],[]...]
    :param trainCategory: 文件对应的标签类别[0,1,1,0....]，列表长度等于单词矩阵数，其中的1代表对应的文件是侮辱性文件，0代表不是侮辱性矩阵
    :return:
        p0Vect:    各单词在分类0的条件下出现的概率
        p1Vect:    各单词在分类1的条件下出现的概率
        pAbusive:    文档属于分类1的概率
    """
    # 总文件数
    numTrainDocs = len(trainMatrix)
    # 每个文件中的单词数
    numWords = len(trainMatrix[0])
    # 侮辱性文件的出现概率,即trainCategory中所有的1的个数
    # 代表的就是多少个侮辱性文件,与文件的总数相除就得到了侮辱性文件的出现概率
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # p0Num 正常的统计,p1Num 侮辱的统计
    p0Num = np.ones(numWords); p1Num =np.ones(numWords)
    # 整个数据集单词出现总数，2.0根据样本/实际调查结果调整分母的值（2主要是避免分母为0，当然值可以调整）
    # p0Num 正常的统计
    # p1Num 侮辱的统计
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i]==1:
            # 累加辱骂词的频次
            p1Num += trainMatrix[i] 
            # 对每篇文章的辱骂的频次 进行统计汇总
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # 类别1，即侮辱性文档的[log(P(F1|C1)),log(P(F2|C1)),log(P(F3|C1)),log(P(F4|C1)),log(P(F5|C1))....]列表
    # log下什么都不写默认是自然对数 
    p1Vect = np.log(p1Num/p1Denom)
    # 类别0，即正常文档的[log(P(F1|C0)),log(P(F2|C0)),log(P(F3|C0)),log(P(F4|C0)),log(P(F5|C0))....]列表
    p0Vect = np.log(p0Num/p0Denom)
    return p0Vect, p1Vect, pAbusive

3.3测试算法: 使用朴素贝叶斯进行交叉验证

文件解析及完整的垃圾邮件测试函数

打开文本编辑器，将下面程序清单中的代码添加到bayes.py文件中.

# 切分文本
def textParse(bigString):
    '''
    Desc:
        接收一个大字符串并将其解析为字符串列表
    Args:
        bigString -- 大字符串
    Returns:
        去掉少于 2 个字符的字符串，并将所有字符串转换为小写，返回字符串列表
    '''  
    import re
    # 使用正则表达式来切分句子，其中分隔符是除单词、数字外的任意字符串
    listOfTokens =re.split(r'\W*',bigString)
    return [tok.lower() for tok in listOfTokens if len(tok)>2]

def spamTest():
    '''
    Desc:
        对贝叶斯垃圾邮件分类器进行自动化处理。
    Args:
        none
    Returns:
        对测试集中的每封邮件进行分类，若邮件分类错误，则错误数加 1，最后返回总的错误百分比。
    '''
    import random
    # 定义docList文档列表,classList类别列表,fullText所有文档词汇
    docList=[];classList=[];fullText=[]
    #spam和ham文件夹里的邮件都是25封,所以用for循环25次
    for i in range(1,26):
        # 打开并读取第i封垃圾邮件到列表worldList里
        wordList = textParse(open('email/spam/%d.txt' % i, "rb").read().decode('GBK','ignore') )
        # append()向列表尾部追加一个新元素,列表只占一个索引位
        docList.append(wordList)
        # extend()向列表尾部追加一个列表,将列表中的每个元素都追加进来
        fullText.extend(wordList)
        # classList里面追加加个1(1代表是类标签,垃圾邮件)
        classList.append(1)    
        
        # 打开并读取第i封正常邮件到列表worldList里
        wordList = textParse(open('email/ham/%d.txt' % i,  "rb").read().decode('GBK','ignore') )
        # append()向列表尾部追加一个新元素,列表只占一个索引位
        docList.append(wordList)
        # extend()向列表尾部追加一个列表,将列表中的每个元素都追加进来
        fullText.extend(wordList)
        # classList里面追加加个0(0代表是类标签,正常邮件)
        classList.append(0)
        # 创建词列表，调用自定义函数createVocabList()剔除列表docList中重复的词，返回含有docList中所有不重复词的列表
        vocabList=createVocabList(docList)
        # 创建测试集与训练集
        trainingSet = list(range(50))
        testSet=[]
        # 用for循环随机选择10个文档作为测试集,其余作为训练集
    for i in range(10):
        # 生成随机整数randIndex,random.uniform(a,b)用于生成指定范围内的随机浮点数
        randIndex=int(random.uniform(0,len(trainingSet)))
        # 将trainingSet列表中的第randIndex个元素添加到testSet里
        testSet.append(trainingSet[randIndex])
        # 删除trainingSet列表中的第randIndex个元素，其余作为训练集
        del(trainingSet[randIndex])
            
    # 创建空训练集列表以及空的训练集标签列表
    trainMat =[];trainClasses=[]
    # 遍历训练集,求得先验概率和条件概率
    for docIndex in trainingSet:
        # 将词汇列表变为向量放到trainMat
        trainMat.append(bagOfWords2Vec(vocabList, docList[docIndex]))
        # 训练集的类别标签
        trainClasses.append(classList[docIndex])
        # 计算先验概率,条件概率
        p0V,p1V,pSpam=trainNB0(np.array(trainMat), np.array(trainClasses))
        # 定义错误计数
        errorCount =  0
        # 对测试集进行分类
    for docIndex in testSet:
        # 将测试集词汇向量化
        wordVector =bagOfWords2Vec(vocabList, docList[docIndex])
        # 对测试数据进行分类
        if classifyNB(np.array(wordVector),p0V,p1V,pSpam)!=classList[docIndex]:
            #分类不正确,错误计数+1
            errorCount +=  1
            # 输出分类错误的文档
            print("classification error",docList[docIndex])
    #输出分类错误率
    print('the error rate is: ',float(errorCount)/len(testSet))

测试代码及其结果如下:

import bayes

bayes.spamTest()
the error rate is:  0.0
D:\360Downloads\Software\Anaconda\lib\re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)

import bayes

bayes.spamTest()
classification error ['benoit', 'mandelbrot', '1924', '2010', 'benoit', 'mandelbrot', '1924', '2010', 'wilmott', 'team', 'benoit', 'mandelbrot', 'the', 'mathematician', 'the', 'father', 'fractal', 'mathematics', 'and', 'advocate', 'more', 'sophisticated', 'modelling', 'quantitative', 'finance', 'died', '14th', 'october', '2010', 'aged', 'wilmott', 'magazine', 'has', 'often', 'featured', 'mandelbrot', 'his', 'ideas', 'and', 'the', 'work', 'others', 'inspired', 'his', 'fundamental', 'insights', 'you', 'must', 'logged', 'view', 'these', 'articles', 'from', 'past', 'issues', 'wilmott', 'magazine']
the error rate is:  0.1
D:\360Downloads\Software\Anaconda\lib\re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)

3.4与之关联相关的代码

def createVocabList(dataSet):
    """
    获取所有单词的集合
    :param dataSet: 数据集
    :return: 所有单词的集合(即不含重复元素的单词列表)
    """
    vocabSet =  set()
    for document in dataSet:
        # 操作符 | 用于求两个集合的并集
        vocabSet=set(document)|vocabSet
    return list(vocabSet)

def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
    """
    使用算法：
        # 将乘法转换为加法
        乘法：P(C|F1F2...Fn) = P(F1F2...Fn|C)P(C)/P(F1F2...Fn)
        加法：P(F1|C)*P(F2|C)....P(Fn|C)P(C) -> log(P(F1|C))+log(P(F2|C))+....+log(P(Fn|C))+log(P(C))
    :param vec2Classify: 待测数据[0,1,1,1,1...]，即要分类的向量
    :param p0Vec: 类别0，即正常文档的[log(P(F1|C0)),log(P(F2|C0)),log(P(F3|C0)),log(P(F4|C0)),log(P(F5|C0))....]列表
    :param p1Vec: 类别1，即侮辱性文档的[log(P(F1|C1)),log(P(F2|C1)),log(P(F3|C1)),log(P(F4|C1)),log(P(F5|C1))....]列表
    :param pClass1: 类别1，侮辱性文件的出现概率
    :return: 类别1 or 0
    """
    # 计算公式  log(P(F1|C))+log(P(F2|C))+....+log(P(Fn|C))+log(P(C))
    # 大家可能会发现，上面的计算公式，没有除以贝叶斯准则的公式的分母，也就是 P(w) （P(w) 指的是此文档在所有的文档中出现的概率）就进行概率大小的比较了，
    # 因为 P(w) 针对的是包含侮辱和非侮辱的全部文档，所以 P(w) 是相同的。
    # 使用 NumPy 数组来计算两个向量相乘的结果，这里的相乘是指对应元素相乘，即先将两个向量中的第一个元素相乘，然后将第2个元素相乘，以此类推。
    # 我的理解是：这里的 vec2Classify * p1Vec 的意思就是将每个词与其对应的概率相关联起来
    # P(w|c1) * P(c1) ，即贝叶斯准则的分子
    p1=sum(vec2Classify*p1Vec)+np.log(pClass1)
    # P(w|c0) * P(c0) ，即贝叶斯准则的分子·
    p0=sum(vec2Classify*p0Vec)+np.log(1.0-pClass1)
    if p1>p0:
        return 1
    else:
        return 0

def trainNB0(trainMatrix, trainCategory):
    """
    训练数据原版
    :param trainMatrix: 文件单词矩阵 [[1,0,1,1,1....],[],[]...]
    :param trainCategory: 文件对应的标签类别[0,1,1,0....]，列表长度等于单词矩阵数，其中的1代表对应的文件是侮辱性文件，0代表不是侮辱性矩阵
    :return:
        p0Vect:    各单词在分类0的条件下出现的概率
        p1Vect:    各单词在分类1的条件下出现的概率
        pAbusive:    文档属于分类1的概率
    """
    # 总文件数
    numTrainDocs = len(trainMatrix)
    # 每个文件中的单词数
    numWords = len(trainMatrix[0])
    # 侮辱性文件的出现概率,即trainCategory中所有的1的个数
    # 代表的就是多少个侮辱性文件,与文件的总数相除就得到了侮辱性文件的出现概率
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # p0Num 正常的统计,p1Num 侮辱的统计
    p0Num = np.ones(numWords); p1Num =np.ones(numWords)
    # 整个数据集单词出现总数，2.0根据样本/实际调查结果调整分母的值（2主要是避免分母为0，当然值可以调整）
    # p0Num 正常的统计
    # p1Num 侮辱的统计
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i]==1:
            # 累加辱骂词的频次
            p1Num += trainMatrix[i] 
            # 对每篇文章的辱骂的频次 进行统计汇总
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # 类别1，即侮辱性文档的[log(P(F1|C1)),log(P(F2|C1)),log(P(F3|C1)),log(P(F4|C1)),log(P(F5|C1))....]列表
    # log下什么都不写默认是自然对数 
    p1Vect = np.log(p1Num/p1Denom)
    # 类别0，即正常文档的[log(P(F1|C0)),log(P(F2|C0)),log(P(F3|C0)),log(P(F4|C0)),log(P(F5|C0))....]列表
    p0Vect = np.log(p0Num/p0Denom)
    return p0Vect, p1Vect, pAbusive

4.相关知识点

知识点1:re正则表达式

学习参考链接:(三)正则表达式入门学习笔记|Python网络爬虫与信息提取

4-5节朴素贝叶斯|使用朴素贝叶斯过滤垃圾邮件项目汇总|机器学习实战-学习笔记_第2张图片

4-5节朴素贝叶斯|使用朴素贝叶斯过滤垃圾邮件项目汇总|机器学习实战-学习笔记_第3张图片

4-5节朴素贝叶斯|使用朴素贝叶斯过滤垃圾邮件项目汇总|机器学习实战-学习笔记_第4张图片

知识点2:Python uniform() 函数

uniform() 方法将随机生成下一个实数，它在 [x, y) 范围内。
具体语法如下:

import random

random.uniform(x, y)

注意：uniform()是不能直接访问的，需要导入 random 模块，然后通过 random 静态对象调用该方法。

参数
- x -- 随机数的最小值，包含该值。
- y -- 随机数的最大值，不包含该值。
返回值:返回一个浮点数。

具体小案例如下:

import random

random.uniform(5, 10)
Out[279]: 9.171064277882996

random.uniform(7, 14)
Out[280]: 10.991165154571751

知识点3:删除列表元素

可以使用 del 语句来删除列表的元素，如下实例：

list1 = ['physics', 'chemistry', 1997, 2000]
del list1[2]

list1
Out[283]: ['physics', 'chemistry', 2000]

知识点4:代码运行报错汇总

学习参考链接:机器学习实战Py3.x填坑记—朴素贝叶斯
代码运行报错1

问题点:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 199: illegal multibyte sequence
原因:
因为有可能文件中存在类似“�”非法字符,按以下方法更改后,代码能有效运行.
更改前:

wordList = textParse(open('email/spam/%d.txt' % i).read()
wordList = textParse(open('email/ham/%d.txt' % i).read()

更改后:

wordList = textParse(open('email/spam/%d.txt' % i, "rb").read().decode('GBK','ignore') )
wordList = textParse(open('email/ham/%d.txt' % i,  "rb").read().decode('GBK','ignore') )

代码运行报错2

问题点:
del(trainingSet[randIndex])
TypeError: 'range' object doesn't support item deletion
原因:
因为是python3中range不返回数组对象，而是返回range对象.方法是将代码del(trainingSet[randIndex])上面第4行代码进行修改.按以下方法更改后,代码能有效运行.
更改前:

trainingSet = range(50)

更改后:

trainingSet = list(range(50))

4-5节 朴素贝叶斯|使用朴素贝叶斯过滤垃圾邮件项目汇总|机器学习实战-学习笔记