Shown below is a typical email. Its body contains a URL (http://www.rackspace.com/), an email address ([email protected]), several numbers, a dollar amount, and so on:
Anyone knows how much it costs to host a web portal ?
Well, it depends on how many visitors youre expecting. This can be anywhere from less than 10 bucks a month to a couple of $100. You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 if youre running something big..
To unsubscribe yourself from this mailing list, send an email to: [email protected]
Many other emails contain the same kinds of tokens, so the email-preprocessing stage "normalizes" them. For example, we replace every URL with the token httpaddr, so that the decision boundary the spam classifier learns does not depend on any particular URL. This normalization noticeably improves classifier performance: spammers often generate URLs at random, so the chance of seeing a previously observed URL again in new spam is low.
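As a standalone illustration (the full chain appears later in processEmail), the URL replacement alone can be expressed with Python's built-in re module, using the same pattern the project code uses:

import re

text = 'You should checkout http://www.rackspace.com/ or perhaps Amazon EC2'
# Replace any http/https URL with the placeholder token "httpaddr"
print(re.sub(r'(http|https)://[^\s]*', 'httpaddr', text))
# -> You should checkout httpaddr or perhaps Amazon EC2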
During the preprocessing stage we perform the following replacements:

- replace every URL with httpaddr
- replace every email address with emailaddr
- replace every number with number
- replace every dollar sign ($) with dollar

In addition, HTML tags are stripped, the text is lower-cased, and every word is reduced to its stem (see the processEmail code below). After preprocessing, the email above reads:
anyon know how much it cost to host a web portal well it depend on how mani visitor your expect thi can be anywher from less than number buck a month to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb if your run someth big to unsubscrib yourself from thi mail list send an email to emailaddr
Next we need to decide which words the spam classifier will use; that is, we need to build a vocabulary list for it. In this example we keep only words that appear at least 100 times in the spam corpus: if we admitted rarely-seen words into the training samples, the model could overfit them. The final vocabulary is stored in the file vocab.txt and contains 1899 words in total:
1 aa
2 ab
3 abil
...
86 anyon
...
916 know
...
1898 zero
1899 zip
In production systems, vocabulary lists typically run from 10,000 to 50,000 words.
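vocab.txt is provided with the exercise data; as a rough sketch, such a list could be built by frequency-counting a preprocessed corpus. The corpus argument and the threshold of 100 below mirror the description above, but this helper is an illustration, not the code that produced vocab.txt:

from collections import Counter

def buildVocab(corpus, minCount=100):
    """Build a vocabulary from preprocessed (tokenized, stemmed) emails.

    corpus: list of emails, each a list of word tokens
    minCount: keep only words occurring at least this many times
    """
    counts = Counter(word for email in corpus for word in email)
    # Sort alphabetically so indices stay stable, like vocab.txt
    return sorted(word for word, n in counts.items() if n >= minCount)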
With the vocabulary in hand, we can now map the preprocessed email to a set of indices, each recording the position in the vocabulary of a word appearing in the email:
86 916 794 1077 883
370 1699 790 1822
1831 883 431 1171
794 1002 1893 1364
592 1676 238 162 89
688 945 1663 1120
1062 1699 375 1162
479 1893 1510 799
1182 1237 810 1895
1440 1547 181 1699
1758 1896 688 1676
992 961 1477 71 530
1699 531
In particular, "anyone" has been normalized to "anyon", the 86th word in the vocabulary.
The preprocessing code follows; it uses the functional-programming library pydash and the word-stemming library stemming:
# coding: utf8
# svm/spam.py
# ...
def processEmail(email):
    """Preprocess an email.

    Args:
        email: raw email body
    Returns:
        indices: positions of the email's words in the vocabulary
    """
    # Strip HTML tags --> normalize URLs --> normalize email addresses
    # --> normalize numbers --> normalize dollar signs
    # --> lower-case --> trim --> split into words --> stem each word
    # --> map each word to its 1-based vocabulary index
    # (words missing from the vocabulary map to 0 and are ignored downstream)
    return py_(email) \
        .strip_tags() \
        .reg_exp_replace(r'(http|https)://[^\s]*', 'httpaddr') \
        .reg_exp_replace(r'[^\s]+@[^\s]+', 'emailaddr') \
        .reg_exp_replace(r'\d+', 'number') \
        .reg_exp_replace(r'[$]+', 'dollar') \
        .lower_case() \
        .trim() \
        .words() \
        .map(stem) \
        .map(lambda word: py_.index_of(vocabList, word) + 1) \
        .value()
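The elided header of svm/spam.py must define vocabList and stem. A plausible version, assuming vocab.txt is laid out as shown earlier and picking the porter2 stemmer from the stemming package (the file path and stemmer choice here are my assumptions, not the author's exact code):

# coding: utf8
# A hypothetical header for svm/spam.py (the original is elided above)
import numpy as np
from pydash import py_
from stemming.porter2 import stem  # assumed stemmer choice

# vocab.txt lines look like "86 anyon"; keep only the word column
with open('data/vocab.txt') as f:
    vocabList = [line.split()[1] for line in f if line.strip()]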
We define the feature $x^{(i)}$ as:

$$x^{(i)} = \begin{cases} 1, & \text{if the } i\text{-th vocabulary word appears in the email} \\ 0, & \text{otherwise} \end{cases}$$

Hence $x \in \mathbb{R}^n$, where $n$ is the length of the vocabulary.
The feature-extraction code also lives in the project's svm/spam.py:
# coding: utf8
# svm/spam.py
# ....
def extractFeatures(indices):
    """Extract a feature vector from word indices.

    Args:
        indices: word indices returned by processEmail
    Returns:
        feature: binary feature vector for the email
    """
    # feature[i-1] = 1 iff vocabulary word i occurs in the email
    feature = py_.map(range(1, len(vocabList) + 1),
                      lambda index: py_.index_of(indices, index) > -1)
    return np.array(feature, dtype=np.uint)
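Scanning indices once per vocabulary word costs O(n·m); as a design note, the same binary vector can be built in a single pass with plain numpy. This equivalent sketch is mine, not part of the project:

import numpy as np

def extractFeaturesNumpy(indices, vocabSize):
    """One-pass equivalent: set feature[i-1] = 1 for each index i > 0."""
    feature = np.zeros(vocabSize, dtype=np.uint)
    for i in indices:
        if i > 0:  # index 0 means the word was not in the vocabulary
            feature[i - 1] = 1
    return feature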
In the model obtained after SVM training, the weight vector $w$ measures how much each feature matters. Mapping the largest weights back to their vocabulary positions tells us which words are most indicative of spam; these are called the top predictors.
# coding: utf8
# svm/spam.py
# ....
def getTopPredictors(weights, count):
    """Get the words that best identify spam.

    Args:
        weights: feature weights from the trained model
        count: how many top words to return
    Returns:
        predictors: list of (word, weight) pairs, heaviest first
    """
    return py_(vocabList) \
        .map(lambda word, idx: (word, weights[idx])) \
        .sort_by(lambda item: item[1], reverse=True) \
        .take(count) \
        .value()
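With the weights computed in the test script below, getTopPredictors(weights, 15) returns pairs such as ('click', 0.477352). The same top-k selection (up to tie order) can be done with numpy's argsort; this alternative sketch is mine and assumes the same module-level vocabList:

import numpy as np

def getTopPredictorsNumpy(weights, count):
    """Pick the `count` largest weights via argsort and map back to words."""
    order = np.argsort(weights)[::-1][:count]
    return [(vocabList[i], weights[i]) for i in order]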
Next, to speed up training, we use the SVM model provided by sklearn; it is built on libsvm and offers solid computational performance:
# coding: utf8
# svm/test_spam.py
import spam
import numpy as np
from scipy.io import loadmat
from sklearn.svm import SVC

# Spam classifier
data = loadmat('data/spamTrain.mat')
X = data['X']
y = data['y'].ravel()
C = 0.1
tol = 1e-3

# Train the classifier on the training set
clf = SVC(C=C, kernel='linear', tol=tol)
clf.fit(X, y)
predictions = clf.predict(X)
accuracy = 100 * np.mean(predictions == y)
print('Training set accuracy: %0.2f %%' % accuracy)

# Evaluate the trained classifier on the test set
# (predict only -- refitting on the test set would invalidate the evaluation)
data = loadmat('data/spamTest.mat')
XTest = data['Xtest']
yTest = data['ytest'].ravel()
print('Evaluating the trained Linear SVM on a test set ...')
predictions = clf.predict(XTest)
accuracy = 100 * np.mean(predictions == yTest)
print('Test set accuracy: %0.2f %%' % accuracy)

# Get the words most indicative of spam (those with large weights in the model)
weights = abs(clf.coef_.flatten())
top = 15
predictors = spam.getTopPredictors(weights, top)
print('Top %d predictors of spam:' % top)
for word, weight in predictors:
    print('%-15s (%f)' % (word, weight))

# Test with our own emails
def genExample(f):
    email = open(f).read()
    indices = spam.processEmail(email)
    features = spam.extractFeatures(indices)
    return features

files = [
    'data/emailSample1.txt',
    'data/emailSample2.txt',
    'data/spamSample1.txt',
    'data/spamSample2.txt'
]
emails = np.array([genExample(f) for f in files], dtype=np.uint8)
labels = np.array([0, 0, 1, 1])
predictions = clf.predict(emails)
accuracy = 100 * np.mean(predictions == labels)
print('Test set accuracy for own datasets: %0.2f %%' % accuracy)
The test results are:
Training set accuracy: 99.83 %
Test set accuracy: 99.90 %
Top 15 predictors of spam:
click (0.477352)
url (0.332529)
date (0.316182)
wrote (0.282825)
spamassassin (0.266681)
remov (0.250697)
there (0.248973)
numbertnumb (0.240243)
the (0.238051)
pleas (0.235655)
our (0.233895)
do (0.226717)
basenumb (0.218456)
we (0.218421)
free (0.214057)
Test set accuracy for own datasets: 100.00 %