Hand-Writing a Naive Bayes Classifier in Python to Detect Spam Email/SMS

Writing these classic algorithms from scratch, without calling sklearn or other libraries, and tuning the knobs by hand, turns out to be quite rewarding and illuminating.

Dataset

Overview: 5,572 SMS messages, 13% spam.

Why this dataset: preprocessing SMS text is simpler than email, and the computation is small, which makes it easier to focus on the algorithm itself.

The dataset comes from Kaggle; its sampling is relatively sound, so it reflects the algorithm's real effectiveness more accurately.

My backup of the data: spam.csv (github.com)

How the algorithm works

Objective: given a document d, compute the probability that it belongs to each class c, and take the most probable class as the final result:

    ĉ = argmax_{c in C} P(c | d)

In the spam email/SMS detection case there are only 2 classes: spam and not-spam.

Within the spam-detection domain, not-spam is conventionally called ham. There is no deep reason; the early practitioners simply coined the name on a whim. So the class names become spam and ham.

By Bayes' rule, and dropping the denominator P(d) (it is identical for every class, so it does not affect the argmax), this becomes

    ĉ = argmax_{c in C} P(d | c) · P(c) = argmax_{c in C} P(f1, f2, ..., fn | c) · P(c)

where f1, f2, ..., fn are the features of the document.

There are many ways to choose features: word counts, TF/IDF values, word counts after removing stop-words, and so on. The choice of features has nothing to do with Bayes itself; it is determined by the problem we are solving. Different features are discussed and compared experimentally later.

Assume the features are mutually independent (the "naive" assumption). In practice, even when they are not truly independent, using the model directly still works well.

This gives the new formula:

    ĉ = argmax_{c in C} P(c) · P(f1 | c) · P(f2 | c) · ... · P(fn | c)

As with other language models, switch to log space (a product of many small probabilities underflows; a sum of logs does not):

    ĉ = argmax_{c in C} ( log P(c) + log P(f1 | c) + ... + log P(fn | c) )
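A tiny numeric sketch of this decision rule. The priors and per-word likelihoods below are invented for illustration only; in the real model they are estimated from the training set later.

```python
import math

# Invented class priors and smoothed per-word likelihoods (illustration only)
log_priors = {"spam": math.log(0.13), "ham": math.log(0.87)}
log_likelihoods = {
    "spam": {"free": math.log(0.05), "call": math.log(0.02), "now": math.log(0.03)},
    "ham":  {"free": math.log(0.005), "call": math.log(0.01), "now": math.log(0.01)},
}

msg = ["free", "call", "now"]
# score(c) = log P(c) + sum over words of log P(w | c)
scores = {c: log_priors[c] + sum(log_likelihoods[c][w] for w in msg)
          for c in log_priors}
print(max(scores, key=scores.get))  # prints "spam"
```

Here the much larger word likelihoods under spam outweigh the 0.87 ham prior, so the message is classified as spam.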

Feature selection

Spam email detection is already a mature field; commercial versions use non-text features such as the sender's address.

Here we restrict the discussion to text/natural-language features.

Before machine learning, there were already rule-based filters, built from rules people had discovered, such as: online pharmaceutical

WITHOUT ANY COST

Dear Winner

In summary, spam likes to write certain keywords in ALL CAPS, and its phrasing habits differ from ordinary text.

Common NLP preprocessing pipelines, e.g. lowercasing everything or TF/IDF, would actually destroy some of these features, so such preprocessing may not be appropriate here.
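A quick hypothetical illustration of the point (the tokens and counts here are invented): lowercasing merges an ALL-CAPS spam token with its ordinary-case counterpart, erasing exactly the signal the rules above exploit.

```python
def word_counts(doc, lowercase):
    # count raw whitespace-separated tokens, optionally lowercasing first
    words = (doc.lower() if lowercase else doc).split()
    return {w: words.count(w) for w in set(words)}

spam_doc = "WINNER!! claim your FREE prize now FREE"

print(word_counts(spam_doc, lowercase=False))  # 'FREE' stays a distinct feature
print(word_counts(spam_doc, lowercase=True))   # 'FREE' collapses into 'free'
```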

We take the simplest possible metric, raw word counts, as our feature.

All words from all classes in the training corpus form a vocabulary; then, within each class, we count the occurrences of each word.

A word that never appears under some class gets probability 0, which forces the final product to 0 as well. To fix this, use add-one (Laplace) smoothing:

    P(w | c) = (count(w, c) + 1) / (Σ_{w' in V} count(w', c) + |V|)

where count(w, c) is the number of times w occurs in class c's training documents and V is the vocabulary.
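A minimal sketch of the smoothed estimate, using invented per-label training words:

```python
from collections import Counter
import math

# invented per-label "big documents" (all words from that label's training docs)
bigdoc_words = {
    "spam": ["free", "winner", "free", "cash"],
    "ham":  ["see", "you", "at", "lunch", "you"],
}
vocabulary = {w for words in bigdoc_words.values() for w in words}

def log_likelihood(word, label):
    counts = Counter(bigdoc_words[label])
    # add-one smoothing: (count + 1) / (total words in class + |V|)
    denom = len(bigdoc_words[label]) + len(vocabulary)
    return math.log((counts[word] + 1) / denom)

# "lunch" never appears under spam, yet gets a small non-zero probability
print(math.exp(log_likelihood("lunch", "spam")))  # 1/11 ≈ 0.0909
```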

Pseudocode
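A sketch of the train/predict procedure described above, in pseudocode:

```
TRAIN(documents D, classes C):
  for each class c in C:
    log_prior[c] = log(number of documents labeled c / total documents)
    bigdoc[c]    = all words from documents labeled c
  V = set of all distinct words seen in training
  for each class c in C, each word w in V:
    log_likelihood[w][c] = log( (count(w, bigdoc[c]) + 1)
                                / (total words in bigdoc[c] + |V|) )

PREDICT(document d):
  for each class c in C:
    score[c] = log_prior[c] + sum of log_likelihood[w][c] over words w in d
  return the class c with the highest score
```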

Model evaluation

Treat the spam class as the positive class: spam predicted as spam is a true positive, ham predicted as spam is a false positive, and spam predicted as ham is a false negative. Then Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
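As a quick sketch, precision and recall computed from one set of example confusion counts (any counts work; these are just illustrative numbers):

```python
# example confusion counts, with spam as the positive class
tp = 156  # spam correctly flagged as spam
fn = 23   # spam that slipped through as ham
fp = 1    # ham wrongly flagged as spam

precision = tp / (tp + fp)  # of everything flagged as spam, how much really is
recall = tp / (tp + fn)     # of all real spam, how much was caught
print('Precision: %.2f%%, Recall: %.2f%%' % (100 * precision, 100 * recall))
# prints "Precision: 99.36%, Recall: 87.15%"
```

Note the asymmetry: a filter can be very precise (it rarely flags legitimate messages) while still missing a fair amount of spam.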

Python implementation

import csv
import math
import string
from collections import Counter

import numpy as np

def load_data(filename, train_ratio):
    # the Kaggle file contains non-UTF-8 bytes; latin-1 decodes all of them
    with open(filename, encoding='latin-1') as f:
        csv_reader = csv.reader(f)
        next(csv_reader)  # skip the header row
        dataset = [(line[0], line[1]) for line in csv_reader]
    np.random.shuffle(dataset)
    train_size = int(len(dataset) * train_ratio)
    return dataset[:train_size], dataset[train_size:]

def train(train_set):
    total_doc_cnt = len(train_set)
    label_doc_cnt = {}  # label -> number of documents
    bigdoc_words = {}   # label -> all words from that label's documents
    for label, doc in train_set:
        if label not in label_doc_cnt:
            # init
            label_doc_cnt[label] = 0
            bigdoc_words[label] = []
        label_doc_cnt[label] += 1
        bigdoc_words[label].extend(
            [w.strip(string.punctuation) for w in doc.split()])

    vocabulary = set()
    for words in bigdoc_words.values():
        vocabulary |= set(words)
    V = len(vocabulary)

    log_priors = {label: math.log(cnt / total_doc_cnt)
                  for label, cnt in label_doc_cnt.items()}

    # add-one (Laplace) smoothed log likelihoods: label -> {word: log P(w|label)}
    log_likelihoods = {}
    for label, words in bigdoc_words.items():
        word_cnt = len(words) + V
        counts = Counter(words)
        log_likelihoods[label] = {
            w: math.log((1 + counts[w]) / word_cnt) for w in vocabulary}

    return log_priors, log_likelihoods, vocabulary

def predict(log_priors, log_likelihoods, vocabulary, input_text, expect_label=None):
    words = {w.strip(string.punctuation) for w in input_text.split()}
    prob_max = None
    label_max = None
    probs = {}  # kept only for logging misclassified samples
    for label, likelihood in log_likelihoods.items():
        # words outside the training vocabulary are simply ignored
        prob = log_priors[label] + sum(
            likelihood[w] for w in words if w in likelihood)
        probs[label] = prob
        if prob_max is None or prob > prob_max:
            prob_max = prob
            label_max = label
    if expect_label and expect_label != label_max:
        print('---')
        print('expect: %s, got: %s' % (expect_label, label_max))
        print(probs)
        print(input_text)
    return label_max

def main():
    filename = 'input/spam.csv'
    train_ratio = 0.75
    train_data, test_data = load_data(filename, train_ratio)
    print('data loaded. train: {}, test: {}'.format(
        len(train_data), len(test_data)))

    # train the model
    log_priors, log_likelihoods, vocabulary = train(train_data)
    print('model trained. log_priors: {}, V(vocabulary word count): {}'.format(
        log_priors, len(vocabulary)))

    pos_true = 0   # spam classified as spam (TP)
    pos_false = 0  # spam misclassified as ham (FN)
    neg_false = 0  # ham misclassified as spam (FP)
    neg_true = 0   # ham classified as ham (TN)
    for label, text in test_data:
        got = predict(log_priors, log_likelihoods, vocabulary, text, label)
        if label != got:
            if label == 'spam':
                pos_false += 1
            else:
                neg_false += 1
        else:
            if label == 'spam':
                pos_true += 1
            else:
                neg_true += 1

    print('positive(spam) true: %s, false: %s' % (pos_true, pos_false))
    print('negative true: %s, false: %s' % (neg_true, neg_false))
    print('Precision: %.2f%%, Recall: %.2f%%' % (
        100.0 * pos_true / (pos_true + neg_false),  # TP / (TP + FP)
        100.0 * pos_true / (pos_true + pos_false),  # TP / (TP + FN)
    ))

if __name__ == '__main__':
    main()

Results

Because load_data shuffles the dataset, every run gives slightly different numbers, but Precision stays roughly in the 95-99% range and Recall around 87-93%.

data loaded. train: 4179, test: 1393

model trained. log_priors: {'ham': -0.14608724117045765, 'spam': -1.995705843726764}, V(vocabulary word count): 9996

positive(spam) true: 156, false: 23

negative true: 1213, false: 1

Precision: 99.36%, Recall: 87.15%

Details of the misclassified samples

---

expect: spam, got: ham

{'ham': -84.45531071975456, 'spam': -90.70055052987975}

You can donate £2.50 to UNICEF's Asian Tsunami disaster support fund by texting DONATE to 864233. £2.50 will be added to your next bill

---

expect: spam, got: ham

{'ham': -139.40891845750357, 'spam': -147.473600840229}

Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!

---

expect: spam, got: ham

{'ham': -52.676090263258466, 'spam': -56.80155705621162}

Are you unique enough? Find out from 30th August. www.areyouunique.co.uk

---

expect: spam, got: ham

{'ham': -70.17115950997167, 'spam': -72.51873090685471}

This message is brought to you by GMW Ltd. and is not connected to the

---

expect: spam, got: ham

{'ham': -139.40891845750357, 'spam': -147.473600840229}

Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!

---

expect: spam, got: ham

{'ham': -168.26318051822577, 'spam': -178.0225162634835}

Will u meet ur dream partner soon? Is ur career off 2 a flyng start? 2 find out free, txt HORO followed by ur star sign, e. g. HORO ARIES

---

expect: spam, got: ham

{'ham': -142.83829011312517, 'spam': -160.32809534946767}

ROMCAPspam Everyone around should be responding well to your presence since you are so warm and outgoing. You are bringing in a real breath of sunshine.

---

expect: spam, got: ham

{'ham': -153.87952746055848, 'spam': -157.51773572622523}

How about getting in touch with folks waiting for company? Just txt back your NAME and AGE to opt in! Enjoy the community (150p/SMS)

---

expect: spam, got: ham

{'ham': -70.49633709940345, 'spam': -71.4324288990577}

Latest News! Police station toilet stolen, cops have nothing to go on!

---

expect: spam, got: ham

{'ham': -167.8034762079193, 'spam': -181.29820672852267}

Guess who am I?This is the first time I created a web page WWW.ASJESUS.COM read all I wrote. I'm waiting for your opinions. I want to be your friend 1/1

---

expect: spam, got: ham

{'ham': -201.64646891761794, 'spam': -221.190966006233}

Babe: U want me dont u baby! Im nasty and have a thing 4 filthyguys. Fancy a rude time with a sexy bitch. How about we go slo n hard! Txt XXX SLO(4msgs)

---

expect: spam, got: ham

{'ham': -179.72840292764013, 'spam': -198.04988429454147}

Hi ya babe x u 4goten bout me?' scammers getting smart..Though this is a regular vodafone no, if you respond you get further prem rate msg/subscription. Other nos used also. Beware!

---

expect: spam, got: ham

{'ham': -169.32983912276305, 'spam': -171.08314763971316}

Talk sexy!! Make new friends or fall in love in the worlds most discreet text dating service. Just text VIP to 83110 and see who you could meet.

---

expect: spam, got: ham

{'ham': -94.5052592368405, 'spam': -100.35544827044257}

Reminder: You have not downloaded the content you have already paid for. Goto http://doit. mymoby. tv/ to collect your content.

---

expect: spam, got: ham

{'ham': -92.32667266264504, 'spam': -98.95596194645321}

Dont forget you can place as many FREE Requests with 1stchoice.co.uk as you wish. For more Information call 08707808226.

---

expect: spam, got: ham

{'ham': -72.48756276984723, 'spam': -76.59107843644884}

Missed call alert. These numbers called but left no message. 07008009200

---

expect: spam, got: ham

{'ham': -77.48695175645791, 'spam': -93.71448200582458}

Did you hear about the new \Divorce Barbie\"? It comes with all of Ken's stuff!"

---

expect: spam, got: ham

{'ham': -178.2932238903798, 'spam': -184.67604680539299}

Am new 2 club & dont fink we met yet Will B gr8 2 C U Please leave msg 2day wiv ur area 09099726553 reply promised CARLIE x Calls£1/minMobsmore LKPOBOX177HP51FL

---

expect: spam, got: ham

{'ham': -146.47067170071077, 'spam': -149.22917007660618}

Goal! Arsenal 4 (Henry, 7 v Liverpool 2 Henry scores with a simple shot from 6 yards from a pass by Bergkamp to give Arsenal a 2 goal margin after 78 mins.

---

expect: spam, got: ham

{'ham': -50.71127368750657, 'spam': -55.32646466594103}

Money i have won wining number 946 wot do i do next

---

expect: spam, got: ham

{'ham': -86.61673741525858, 'spam': -100.67904461391855}

Sorry I missed your call let's talk when you have the time. I'm on 07090201529

---

expect: spam, got: ham

{'ham': -135.8452563916395, 'spam': -138.23271598529016}

Download as many ringtones as u like no restrictions, 1000s 2 choose. U can even send 2 yr buddys. Txt Sir to 80082 £3

---

expect: spam, got: ham

{'ham': -106.60938155575177, 'spam': -115.73546864261806}

INTERFLORA - It's not too late to order Interflora flowers for christmas call 0800 505060 to place your order before Midnight tomorrow.

---

expect: ham, got: spam

{'ham': -109.18333505635592, 'spam': -107.80307740800048}

MAKE SURE ALEX KNOWS HIS BIRTHDAY IS OVER IN FIFTEEN MINUTES AS FAR AS YOU'RE CONCERNED

Tuning

We run just one very simple experiment: does converting everything to lowercase before counting improve the results?

Three consecutive runs with the current code:

positive(spam) true: 188, false: 21

negative true: 1182, false: 2

Precision: 98.95%, Recall: 89.95%

---

positive(spam) true: 160, false: 16

negative true: 1209, false: 8

Precision: 95.24%, Recall: 90.91%

---

positive(spam) true: 167, false: 13

negative true: 1208, false: 5

Precision: 97.09%, Recall: 92.78%

Code change: in load_data, change

dataset = [(line[0], line[1]) for line in csv_reader]

to

dataset = [(line[0], line[1].lower()) for line in csv_reader]

Three consecutive runs after the change:

positive(spam) true: 164, false: 18

negative true: 1205, false: 6

Precision: 96.47%, Recall: 90.11%

---

positive(spam) true: 174, false: 20

negative true: 1197, false: 2

Precision: 98.86%, Recall: 89.69%

---

positive(spam) true: 162, false: 23

negative true: 1204, false: 4

Precision: 87.57%, Recall: 97.59%

结果变化不大,Precision 略有降低,Recall 略微提升。

With only 3 runs each, the randomness is too large to conclude which feature is better; on the other hand, we see no obvious improvement or degradation either.

Summary

The training stage mainly computes the prior and the likelihood. Neither depends on the text of any individual document; instead, document counts and word counts are aggregated over the entire training set for each label.

What drives the result is not the relative frequency of word A versus word B within one label's training set, but the difference in a given word's probability across the different label sets.
