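Writing these classic algorithms from scratch, without calling sklearn or similar APIs, and then tuning a few parameters turned out to be rewarding and instructive.
Dataset
Summary: 5572 SMS messages, 13% of which are spam.
Why this dataset: text preprocessing for SMS is simpler than for email and the computation is light, which makes it easier to focus on the algorithm itself.
The dataset comes from Kaggle. Its sampling is relatively sound, so it should reflect the algorithm's performance more accurately.
My backup of the data: spam.csv (github.com)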
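Algorithm
Objective: given a document (d), compute the probability that it belongs to each class (c), and take the class with the highest probability as the result.

$$\hat{c} = \operatorname*{argmax}_{c} P(c \mid d)$$

In spam email/SMS detection there are only two classes: spam and not-spam.
In the spam detection field, not-spam is usually called ham. There is no particular reason: whoever coined it simply picked the name on a whim. So the class names become spam and ham.
Applying Bayes' rule, this becomes

$$\hat{c} = \operatorname*{argmax}_{c} \frac{P(d \mid c)\, P(c)}{P(d)} = \operatorname*{argmax}_{c} P(d \mid c)\, P(c) = \operatorname*{argmax}_{c} P(f_1, f_2, \ldots, f_n \mid c)\, P(c)$$

where f1, f2, ..., fn are the features of the document. P(d) is the same for every class, so it can be dropped from the argmax.
There are many ways to choose features, for example word frequency, the TF-IDF value of each word, or word frequency after removing stop words. Which features to use has nothing to do with Bayes; it is determined by the problem we are trying to solve. Different features are discussed and compared experimentally later on.
We assume the features are independent of each other given the class. In practice, using this assumption directly works well even when the features are not independent.
This gives a new formula:

$$\hat{c} = \operatorname*{argmax}_{c} P(c) \prod_{i=1}^{n} P(f_i \mid c)$$

As with other language models, we move to log space:

$$\hat{c} = \operatorname*{argmax}_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(f_i \mid c) \right]$$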
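One practical reason for working in log space is floating-point underflow. The sketch below is only an illustration: the 400 likelihood values are made up, not taken from the dataset.

```python
import math

# Hypothetical per-word likelihoods for one class: 400 words, each with P = 0.001.
word_probs = [1e-3] * 400

# Multiplying the probabilities directly underflows to 0.0 in double precision.
product = 1.0
for p in word_probs:
    product *= p

# Summing the logs keeps the score finite, so classes can still be compared.
log_score = sum(math.log(p) for p in word_probs)

print(product, log_score)
```

Because log is monotonic, the class with the largest sum of logs is also the class with the largest product.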
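Feature selection
Spam email detection is a mature field, and commercial systems also use non-text features such as the sender's address.
Here we restrict the discussion to text / natural-language features.
Before machine learning there were already rule-based filters, built on patterns people had noticed, for example:
online pharmaceutical
WITHOUT ANY COST
Dear Winner
In short, spam likes to write certain keywords in all capitals, and its phrasing differs from ordinary text.
Common NLP preprocessing steps, such as lowercasing everything or TF-IDF weighting, can therefore remove useful features, and may not be appropriate here.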
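To make the point about lowercasing concrete, here is a small sketch with two made-up messages. The helper name `tokens` is hypothetical and not part of the implementation in this post.

```python
def tokens(text, lowercase=False):
    """Split on whitespace, optionally lowercasing first."""
    if lowercase:
        text = text.lower()
    return text.split()

ham = "it comes without any cost to you"
spam = "Claim your prize WITHOUT ANY COST"

# Case preserved: the shouted keywords are tokens the ham message does not contain.
only_in_spam_cased = set(tokens(spam)) - set(tokens(ham))

# Lowercased: "without", "any" and "cost" now also occur in the ham message.
only_in_spam_lower = set(tokens(spam, lowercase=True)) - set(tokens(ham, lowercase=True))

print(sorted(only_in_spam_cased))
print(sorted(only_in_spam_lower))
```

With case preserved, WITHOUT, ANY and COST are available as spam-only features. After lowercasing they merge with ordinary words.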
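We use the simplest possible feature: the number of times each word occurs.
All words from all classes in the training corpus form one vocabulary. Then, for each class separately, we count how often each word occurs.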
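A minimal sketch of this counting step on a toy corpus. The three messages are invented, and `collections.Counter` is used here for brevity rather than because the post's implementation does so.

```python
from collections import Counter

# Toy training set in (label, text) form; the messages are made up.
train_set = [
    ("spam", "WIN cash now"),
    ("ham", "see you now"),
    ("ham", "call me now"),
]

counts = {}  # label -> Counter of word occurrences within that class
for label, text in train_set:
    counts.setdefault(label, Counter()).update(text.split())

# One vocabulary shared by all classes.
vocabulary = set().union(*counts.values())

print(len(vocabulary), counts["ham"]["now"], counts["spam"]["see"])
```

A word that never appears in a class, such as "see" in spam, has a count of 0 there.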
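A word that never occurs in a given class has probability 0 there, which makes the final probability for that class 0 as well. To solve this, we use add-one (Laplace) smoothing:

$$\hat{P}(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} \mathrm{count}(w, c) + |V|}$$

where V is the vocabulary and count(w, c) is the number of times word w occurs in the training documents of class c.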
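As a small worked example of add-one smoothing, assume a class with 3 words in total and a vocabulary of 7 words. These numbers and the function name are illustrative only.

```python
import math

def smoothed_log_likelihood(count_w, total_words, vocab_size):
    """log of (count + 1) / (total words in the class + vocabulary size)."""
    return math.log((count_w + 1) / (total_words + vocab_size))

# A word never seen in the class still gets probability 1 / (3 + 7) = 0.1.
unseen = smoothed_log_likelihood(0, total_words=3, vocab_size=7)

# A word seen once gets (1 + 1) / (3 + 7) = 0.2.
seen_once = smoothed_log_likelihood(1, total_words=3, vocab_size=7)

print(math.exp(unseen), math.exp(seen_once))
```

The unseen word no longer zeroes out the class score. It only contributes a smaller log-likelihood than words that were observed.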
Pseudocode
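An outline of training and prediction, paraphrased from the Python implementation in this post. It is a sketch, not necessarily the pseudocode the author originally showed.

```text
train(documents):
    for each class c:
        log_prior[c] = log(number of documents in c / total number of documents)
        bigdoc[c]    = all words of all documents labelled c
    V = set of distinct words over all bigdocs
    for each class c and each word w in V:
        log_likelihood[w, c] = log((count(w, bigdoc[c]) + 1) / (len(bigdoc[c]) + |V|))

predict(document):
    for each class c:
        score[c] = log_prior[c]
                   + sum of log_likelihood[w, c] over the distinct words w of the document that are in V
    return the class with the highest score
```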
Model evaluation
We treat spam as the positive class.
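With spam as the positive class, the standard definitions are precision = TP / (TP + FP) and recall = TP / (TP + FN). Here TP is spam classified as spam, FP is ham classified as spam, and FN is spam classified as ham. A minimal sketch with made-up counts:

```python
def precision_recall(tp, fp, fn):
    """Standard precision and recall for the positive class."""
    precision = tp / (tp + fp)  # of everything flagged as spam, how much is spam
    recall = tp / (tp + fn)     # of all real spam, how much was flagged
    return precision, recall

# Hypothetical counts: 90 spam caught, 10 ham wrongly flagged, 30 spam missed.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(precision, recall)
```

For a spam filter, a false positive (a real message hidden from the user) is usually the more costly error, so precision deserves particular attention.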
Python implementation
import csv
import math
import string
from collections import Counter

import numpy as np


def load_data(filename, train_ratio):
    # latin-1 accepts any byte, so non-UTF-8 characters in the file do not raise.
    with open(filename, encoding="latin-1", newline="") as f:
        csv_reader = csv.reader(f)
        next(csv_reader)  # skip the header row
        dataset = [(line[0], line[1]) for line in csv_reader]
    np.random.shuffle(dataset)
    train_size = int(len(dataset) * train_ratio)
    return dataset[:train_size], dataset[train_size:]


def train(train_set):
    total_doc_cnt = len(train_set)
    label_doc_cnt = Counter()  # label -> number of documents
    bigdoc_words = {}          # label -> Counter of word occurrences in that class
    for label, doc in train_set:
        label_doc_cnt[label] += 1
        bigdoc_words.setdefault(label, Counter()).update(
            w.strip(string.punctuation) for w in doc.split())

    # the vocabulary is the union of the words seen in every class
    vocabulary = set()
    for words in bigdoc_words.values():
        vocabulary |= set(words)
    V = len(vocabulary)

    log_priors = {label: math.log(cnt / total_doc_cnt)
                  for label, cnt in label_doc_cnt.items()}
    log_likelihoods = {}
    for label, words in bigdoc_words.items():
        # add-one (Laplace) smoothing: count + 1 over total words + |V|
        word_cnt = sum(words.values()) + V
        log_likelihoods[label] = {
            w: math.log((1 + words[w]) / word_cnt) for w in vocabulary}
    return log_priors, log_likelihoods, vocabulary


def predict(log_priors, log_likelihoods, vocabulary, input_text, expect_label=None):
    # each distinct word counts once, however often it occurs in the message
    words = {w.strip(string.punctuation) for w in input_text.split()}
    prob_max = None
    label_max = None
    probs = {}  # per-label scores, kept for the debug output
    for label, likelihood in log_likelihoods.items():
        # words outside the vocabulary are ignored
        prob = log_priors[label] + sum(likelihood[w] for w in words if w in vocabulary)
        probs[label] = prob
        if prob_max is None or prob > prob_max:
            prob_max = prob
            label_max = label
    if expect_label and expect_label != label_max:
        # print the details of every misclassified message
        print('---')
        print('expect: %s, got: %s' % (expect_label, label_max))
        print(probs)
        print(input_text)
    return label_max


def main():
    filename = 'input/spam.csv'
    train_ratio = 0.75
    train_data, test_data = load_data(filename, train_ratio)
    print('data loaded. train: {}, test: {}'.format(
        len(train_data), len(test_data)))

    # train the model
    log_priors, log_likelihoods, vocabulary = train(train_data)
    print('model trained. log_priors: {}, V(vocabulary word count): {}'.format(
        log_priors, len(vocabulary)))

    # NOTE: pos_false counts spam predicted as ham (false negatives) and
    # neg_false counts ham predicted as spam (false positives). The value
    # printed as "Precision" is therefore TP / (TP + FN), conventionally called
    # recall, and the value printed as "Recall" is TP / (TP + FP),
    # conventionally called precision. The labels are left unchanged so that
    # the output matches the recorded runs.
    pos_true = 0
    pos_false = 0
    neg_false = 0
    neg_true = 0
    for label, text in test_data:
        got = predict(log_priors, log_likelihoods, vocabulary, text, label)
        if label != got:
            if label == 'spam':
                pos_false += 1
            else:
                neg_false += 1
        else:
            if label == 'spam':
                pos_true += 1
            else:
                neg_true += 1
    print('positive(spam) true: %s, false: %s' % (pos_true, pos_false))
    print('negative true: %s, false: %s' % (neg_true, neg_false))
    print('Precision: %.2f%%, Recall: %.2f%%' % (
        100.0 * pos_true / (pos_true + pos_false),
        100.0 * pos_true / (pos_true + neg_false),
    ))


if __name__ == '__main__':
    main()
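Results
load_data shuffles the dataset, so the results differ from run to run, but the printed Precision stays around 90% and the printed Recall around 96%.
Note that the script's labels are swapped relative to the usual definitions. Its "positive false" counts spam that was classified as ham, so the figure printed as Precision is TP / (TP + FN), normally called recall, and the figure printed as Recall is TP / (TP + FP), normally called precision. In standard terms, roughly 90% of the spam is caught, and roughly 96% of the messages flagged as spam really are spam.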
data loaded. train: 4179, test: 1393
model trained. log_priors: {'ham': -0.14608724117045765, 'spam': -1.995705843726764}, V(vocabulary word count): 9996
positive(spam) true: 156, false: 23
negative true: 1213, false: 1
Precision: 87.15%, Recall: 99.36%
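If repeatable numbers are wanted, one option is to shuffle with a fixed seed. The sketch below uses the standard library's `random.Random` instead of `np.random.shuffle`. The function name `split_dataset` and the seed value are assumptions for illustration, not part of the script.

```python
import random

def split_dataset(dataset, train_ratio, seed):
    """Shuffle a copy of dataset with a fixed seed, then split it."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)
    train_size = int(len(shuffled) * train_ratio)
    return shuffled[:train_size], shuffled[train_size:]

# Made-up data in the same (label, text) shape as the real dataset.
data = [("ham", "msg %d" % i) for i in range(8)]
first = split_dataset(data, 0.75, seed=42)
second = split_dataset(data, 0.75, seed=42)

print(first == second)
```

The same seed gives the same split, so a change in the metrics can be attributed to a change in the code rather than to the shuffle.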
Details of the misclassified messages
---
expect: spam, got: ham
{'ham': -84.45531071975456, 'spam': -90.70055052987975}
You can donate £2.50 to UNICEF's Asian Tsunami disaster support fund by texting DONATE to 864233. £2.50 will be added to your next bill
---
expect: spam, got: ham
{'ham': -139.40891845750357, 'spam': -147.473600840229}
Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!
---
expect: spam, got: ham
{'ham': -52.676090263258466, 'spam': -56.80155705621162}
Are you unique enough? Find out from 30th August. www.areyouunique.co.uk
---
expect: spam, got: ham
{'ham': -70.17115950997167, 'spam': -72.51873090685471}
This message is brought to you by GMW Ltd. and is not connected to the
---
expect: spam, got: ham
{'ham': -139.40891845750357, 'spam': -147.473600840229}
Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!
---
expect: spam, got: ham
{'ham': -168.26318051822577, 'spam': -178.0225162634835}
Will u meet ur dream partner soon? Is ur career off 2 a flyng start? 2 find out free, txt HORO followed by ur star sign, e. g. HORO ARIES
---
expect: spam, got: ham
{'ham': -142.83829011312517, 'spam': -160.32809534946767}
ROMCAPspam Everyone around should be responding well to your presence since you are so warm and outgoing. You are bringing in a real breath of sunshine.
---
expect: spam, got: ham
{'ham': -153.87952746055848, 'spam': -157.51773572622523}
How about getting in touch with folks waiting for company? Just txt back your NAME and AGE to opt in! Enjoy the community (150p/SMS)
---
expect: spam, got: ham
{'ham': -70.49633709940345, 'spam': -71.4324288990577}
Latest News! Police station toilet stolen, cops have nothing to go on!
---
expect: spam, got: ham
{'ham': -167.8034762079193, 'spam': -181.29820672852267}
Guess who am I?This is the first time I created a web page WWW.ASJESUS.COM read all I wrote. I'm waiting for your opinions. I want to be your friend 1/1
---
expect: spam, got: ham
{'ham': -201.64646891761794, 'spam': -221.190966006233}
Babe: U want me dont u baby! Im nasty and have a thing 4 filthyguys. Fancy a rude time with a sexy bitch. How about we go slo n hard! Txt XXX SLO(4msgs)
---
expect: spam, got: ham
{'ham': -179.72840292764013, 'spam': -198.04988429454147}
Hi ya babe x u 4goten bout me?' scammers getting smart..Though this is a regular vodafone no, if you respond you get further prem rate msg/subscription. Other nos used also. Beware!
---
expect: spam, got: ham
{'ham': -169.32983912276305, 'spam': -171.08314763971316}
Talk sexy!! Make new friends or fall in love in the worlds most discreet text dating service. Just text VIP to 83110 and see who you could meet.
---
expect: spam, got: ham
{'ham': -94.5052592368405, 'spam': -100.35544827044257}
Reminder: You have not downloaded the content you have already paid for. Goto http://doit. mymoby. tv/ to collect your content.
---
expect: spam, got: ham
{'ham': -92.32667266264504, 'spam': -98.95596194645321}
Dont forget you can place as many FREE Requests with 1stchoice.co.uk as you wish. For more Information call 08707808226.
---
expect: spam, got: ham
{'ham': -72.48756276984723, 'spam': -76.59107843644884}
Missed call alert. These numbers called but left no message. 07008009200
---
expect: spam, got: ham
{'ham': -77.48695175645791, 'spam': -93.71448200582458}
Did you hear about the new \Divorce Barbie\"? It comes with all of Ken's stuff!"
---
expect: spam, got: ham
{'ham': -178.2932238903798, 'spam': -184.67604680539299}
Am new 2 club & dont fink we met yet Will B gr8 2 C U Please leave msg 2day wiv ur area 09099726553 reply promised CARLIE x Calls£1/minMobsmore LKPOBOX177HP51FL
---
expect: spam, got: ham
{'ham': -146.47067170071077, 'spam': -149.22917007660618}
Goal! Arsenal 4 (Henry, 7 v Liverpool 2 Henry scores with a simple shot from 6 yards from a pass by Bergkamp to give Arsenal a 2 goal margin after 78 mins.
---
expect: spam, got: ham
{'ham': -50.71127368750657, 'spam': -55.32646466594103}
Money i have won wining number 946 wot do i do next
---
expect: spam, got: ham
{'ham': -86.61673741525858, 'spam': -100.67904461391855}
Sorry I missed your call let's talk when you have the time. I'm on 07090201529
---
expect: spam, got: ham
{'ham': -135.8452563916395, 'spam': -138.23271598529016}
Download as many ringtones as u like no restrictions, 1000s 2 choose. U can even send 2 yr buddys. Txt Sir to 80082 £3
---
expect: spam, got: ham
{'ham': -106.60938155575177, 'spam': -115.73546864261806}
INTERFLORA - It's not too late to order Interflora flowers for christmas call 0800 505060 to place your order before Midnight tomorrow.
---
expect: ham, got: spam
{'ham': -109.18333505635592, 'spam': -107.80307740800048}
MAKE SURE ALEX KNOWS HIS BIRTHDAY IS OVER IN FIFTEEN MINUTES AS FAR AS YOU'RE CONCERNED
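Tuning
We run just one very simple experiment: if everything is converted to lowercase before counting, does the result improve?
Three consecutive runs of the current code: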
positive(spam) true: 188, false: 21
negative true: 1182, false: 2
Precision: 89.95%, Recall: 98.95%
---
positive(spam) true: 160, false: 16
negative true: 1209, false: 8
Precision: 90.91%, Recall: 95.24%
---
positive(spam) true: 167, false: 13
negative true: 1208, false: 5
Precision: 92.78%, Recall: 97.09%
Code change:
In load_data, change the line
dataset = [(line[0], line[1]) for line in csv_reader]
to
dataset = [(line[0], line[1].lower()) for line in csv_reader]
Three consecutive runs:
positive(spam) true: 164, false: 18
negative true: 1205, false: 6
Precision: 90.11%, Recall: 96.47%
---
positive(spam) true: 174, false: 20
negative true: 1197, false: 2
Precision: 89.69%, Recall: 98.86%
---
positive(spam) true: 162, false: 23
negative true: 1204, false: 4
Precision: 87.57%, Recall: 97.59%
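The results barely change: the printed Precision drops slightly and the printed Recall rises slightly.
With only 3 runs the randomness is too large to conclude which feature is better.
We also did not see any clear improvement or degradation.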
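One rough way to see how noisy three runs are is to compare the gap between the two variants with the run-to-run spread. The sketch below uses the "Precision" figures printed by the six runs in this post. The helper name `summarize` is hypothetical.

```python
from statistics import mean, stdev

# "Precision" figures as printed by the three runs of each variant.
original_runs = [89.95, 90.91, 92.78]
lowercase_runs = [90.11, 89.69, 87.57]

def summarize(values):
    """Return (mean, sample standard deviation) of a list of percentages."""
    return mean(values), stdev(values)

orig_mean, orig_std = summarize(original_runs)
lower_mean, lower_std = summarize(lowercase_runs)

# Gap between the two means, to be read against the spread within each variant.
gap = orig_mean - lower_mean

print("original:  %.2f +/- %.2f" % (orig_mean, orig_std))
print("lowercase: %.2f +/- %.2f" % (lower_mean, lower_std))
print("gap: %.2f" % gap)
```

The gap between the means is of the same order as the spread within each variant, which is consistent with not drawing a conclusion from so few runs. More runs, or a fixed train/test split, would be needed for a firmer comparison.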
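Summary
The training phase mainly computes the priors and the likelihoods. Neither depends on the text of any individual document. Both come from document counts and word counts aggregated over the entire training set of each label.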
What affects the result is not how frequent word A is relative to word B within one training set, but how much a word's probability differs between the label sets.
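This last point can be illustrated with a per-word log-likelihood difference between the two classes. All counts below are invented for the example, and the function names are hypothetical.

```python
import math

# Hypothetical per-class word counts, total words per class, and vocabulary size.
counts = {"spam": {"free": 30, "the": 50}, "ham": {"free": 2, "the": 300}}
totals = {"spam": 1000, "ham": 5000}
V = 2000

def log_likelihood(word, label):
    """Add-one smoothed log P(word | label)."""
    return math.log((counts[label].get(word, 0) + 1) / (totals[label] + V))

def spam_evidence(word):
    """log P(word | spam) - log P(word | ham); positive values favour spam."""
    return log_likelihood(word, "spam") - log_likelihood(word, "ham")

print(spam_evidence("free"), spam_evidence("the"))
```

In this toy data "the" occurs more often than "free" within spam, yet it pushes the decision toward ham, because it is relatively even more common in ham. Only the difference across classes moves the result.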