利用贝叶斯公式,我们可以在计算机上验证和测试垃圾邮件的字词统计,计算词语出现的概率。并根据的得出来的概率来判断一封新的邮件是垃圾邮件还是正常邮件。
公式:
P ( A ∣ B ) = P ( B ∣ A ) ⋅ P ( A ) P ( B ) P (A|B)= \frac {P (B|A) \cdot P (A)}{P (B)} P(A∣B)=P(B)P(B∣A)⋅P(A)
解释:在B情况出现的条件下A情况出现的概率等于在A情况出现的情况下B情况出现的概率乘以A情况出现的概率,除以B情况出现的概率。
应用:看起来公式很麻烦,但在本问题的解决中并不会直接试用,后续代码中的使用是P(A|B)=(A和B同时满足的个数)/B的个数。这样使用是同等效力的,但是会更加简单。
首先我们需要使用60000条Email的数据集,数据集连接如下:
60000条Email数据集
数据集的结构如下:
full文件夹下的index:包含每一条Email的标签和对应的文件位置。
data文件夹下的每个文件夹都有若干条Email数据。
import re
import jieba
import numpy as np
re:正则表达式用于过滤非中文字符
jieba:用于做中文分词
numpy:用于划分训练集和测试集
在进行统计之前我们要先进行数据预处理以保证统计能够顺利进行。
EmailsIndex = "./full/index"
def filterEmail(email):
email = re.sub("[a-zA-Z0-9_]+|\W","",email)
return email
使用正则表达式过滤掉非中文词
原理:按照字母,数字和下划线划分email中的文本
def readEmail(filename):
with open(filename, "r",encoding='GB2312', errors='ignore') as fp:
content=fp.read()
content = filterEmail(content) #把一封邮件提取所有中文内容
words = list(jieba.cut(content))
return words
读取一封Email:首先按照传进来的filename(文件路径)打开txt文件,然后调用filterEmail函数过滤掉非中文字符,过滤后content中不含非中文字符。最后使用jieba.cut对中文进行分词,分词后words中为中文词组。
注:数据集的编码方式要按照GB2312打开
def loadAllEmails(IndexFile):
Emails=[]
with open(IndexFile, "r") as fp:
lines=fp.readlines()
for line in lines:
spam,filename = line.split()
Emails.append((spam,readEmail(filename)))
return Emails
传进来的IndexFile参数是index的路径
打开index文件后读取每一行数据,每一行数据包含邮件类别和邮件路径,使用split将他们分离开,调用readEmail获得邮件路径对应的邮件的中文分词。
def train_test_split(Emails,testSize):
arrayEmails = np.array(Emails)
test_size = int(len(Emails)*testSize)
shuffle_indexes = np.random.permutation(len(arrayEmails))
test_indexs,train_indexs= np.split(shuffle_indexes,[test_size])
return arrayEmails[train_indexs],arrayEmails[test_indexs]
def calWordsFreqTable(Emails):
wordsSet = []
spamWords = dict()
hamWords = dict()
spamlength = 0
hamlength = 0
for elem in Emails:
if elem[0] == "spam":
spamlength += 1
tempset = set()
for word in elem[1]:
if word not in tempset:
tempset.add(word)
if word in wordsSet:
spamWords[word] += 1
else:
wordsSet.append(word)
spamWords[word] = 1
hamWords[word] = 0
elif elem[0] == "ham":
hamlength += 1
tempset = set()
for word in elem[1]:
if word not in tempset:
tempset.add(word)
if word in wordsSet:
hamWords[word] += 1
else:
wordsSet.append(word)
hamWords[word] = 1
spamWords[word] = 0
wordsProbs = [] # 格式(word, spamprob, hamprob)
for word in wordsSet:
wordsProbs.append((word, spamWords[word] * 1.0 / (spamWords[word] + hamWords[word]), hamWords[word] * 1.0 / (spamWords[word] + hamWords[word])))
return wordsProbs
变量含义:
wordsSet:词典,在遍历Email时逐渐将词典中没有的词添加到词典
spamWords:分词与其在垃圾邮件中出现的次数组成的字典
hamWords:分词与其在正常邮件中出现的次数组成的字典
spamlength:总垃圾邮件数
hamlength:总正常邮件数
tempset:用来判断在一封Email内某词是否已经出现过,如果已经出现,则不予统计。
def saveWordsTable(wordsProbs, filepath):
import os
with open(os.path.join(filepath, "probs.txt"), "w", encoding="utf-8") as f:
for prob in wordsProbs:
f.write(prob[0])
f.write("\t")
f.write(str(prob[1]))
f.write("\t")
f.write(str(prob[2]))
f.write('\n')
def getProbDict_WordsSet(filepath):
ProbsDict = dict()
wordsSet = list()
import os
with open(os.path.join(filepath, "probs.txt"), "r", encoding="utf-8") as f:
lines = f.readlines()
for line in lines:
word, spamprob, hamprob = line.split()
wordsSet.append(word)
ProbsDict[word] = (float(spamprob), float(hamprob))
return ProbsDict, wordsSet
def checkOneEmail(ProbsDict, wordsSet, email):
spamprob = 1
hamprob = 1
for word in email:
if word in wordsSet:
hamprob *= max((1 - ProbsDict[word][0]), 0.01)
spamprob *= max((1 - ProbsDict[word][1]), 0.01)
if hamprob == 0 or spamprob == 0:
break
if spamprob > hamprob:
return spamprob, 1
elif spamprob < hamprob:
return hamprob, -1
else:
return 0, 0
hamprob *= max((1 - ProbsDict[word][0]), 0.01)中1 - ProbsDict[word][0]的含义是该词不是来着垃圾邮件的概率,使用max是为了防止获得的概率过低,同时用1来表示垃圾邮件,-1表示正常邮件。当垃圾邮件和正常邮件的概率相等时用0表示未能判断出是否为垃圾邮件。
def checkEmails(ProbsDict, wordsSet, emails):
rates = []
predict_flags = []
for email in emails:
rate, ans = checkOneEmail(ProbsDict, wordsSet, email)
rates.append(rate)
predict_flags.append(ans)
return rates, predict_flags # 返回准确率和每一封邮件的预测值
if __name__ == "__main__":
opt = int(input("1.通过训练集获取概率并在测试集测试 2.单个邮件预测"))
if opt == 1:
Emails = loadAllEmails(EmailsIndex)
trainEmails, testEmails = train_test_split(Emails, 0.2)
wordsProbs = calWordsFreqTable(trainEmails)
saveWordsTable(wordsProbs, filepath='')
ProbsDict, wordsSet = getProbDict_WordsSet(filepath='')
rate, ans = checkEmails(ProbsDict, wordsSet, testEmails[:,1])
t = 0
for i in range(len(ans)):
if ans[i] == 1 and testEmails[i][0] == 'spam':
t += 1
elif ans[i] == -1 and testEmails[i][0] == 'ham':
t += 1
print("精度:", t / len(ans))
else:
opt = int(input("1.手动输入 2.打开"))
if opt == 1:
text = input("输入邮件:")
Email = filterEmail(text)
ProbsDict, wordsSet = getProbDict_WordsSet(filepath='')
rate, ans = checkOneEmail(ProbsDict, wordsSet, Email)
if ans == 1:
print("垃圾邮件")
elif ans == -1:
print("有效邮件")
else:
print("未知邮件")
else:
path = input("请输入邮件的路径:") # 路径用”/“而不是”\’“
text = ""
with open(path, encoding="utf-8") as f:
for line in f:
text += line
Email = filterEmail(text)
ProbsDict, wordsSet = getProbDict_WordsSet(filepath='')
rate, ans = checkOneEmail(ProbsDict, wordsSet, Email)
if ans == 1:
print("垃圾邮件")
elif ans == -1:
print("有效邮件")
else:
print("未知邮件")
注:在首次使用时必须先选择1来获取概率,生成时间比较长,可以在下方下载,下载后可以直接使用。
分词概率下载