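Programming Exercise 6 falls into two parts: (1) Support Vector Machines and (2) Spam Classification. The first part walks through the implementation of the SVM algorithm and is the foundation for this week's material. The second part, spam classification, is a practical application built on the code from the first part. The first part was covered in the previous post, see: Part 1: Support Vector Machines. This post covers the implementation of the second part, Spam Classification.
Without further ado, let's get started.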
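Datasets: spamTrain.mat --- training set, used to train the spam classifier; 4000 training examples
spamTest.mat --- test set, used to check how well the trained classifier generalizes to new examples; 1000 test examples
vocab.txt --- the vocabulary used by the classifier, stored as a text file of index/word pairs; here it holds 1899 common words
emailSample1.txt, emailSample2.txt --- sample emails, used to inspect what the email preprocessing does to a message
spamSample1.txt, spamSample2.txt --- sample spam emails, used to check whether the trained classifier flags them as spam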
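Function files: ex6_spam.m --- driver script for the exercise; it lays out the steps of the experiment and handles input and output. No changes needed
getVocabList.m --- reads vocab.txt into MATLAB and returns it as a cell array of strings, vocabList. No changes needed; call it directly
porterStemmer.m --- English word stemming function (Porter stemmer). No changes needed; call it directly
readFile.m --- reads the text of an email file. No changes needed; call it directly
processEmail.m --- preprocesses the email string: strips punctuation, splits it into words, normalizes URLs and numbers, and so on. Needs to be completed
emailFeatures.m --- for each word found by processEmail, sets the corresponding entry of the feature vector to 1 if the word is in the vocabulary vocabList. Needs to be completed
svmTrain.m --- trains the SVM classifier on the training examples. No changes needed; call it directly. Explained in the previous post
linearKernel.m --- linear kernel function. No changes needed; call it directly. Explained in the previous post
svmPredict.m --- uses the trained model to predict labels for new examples. No changes needed; call it directly. Mentioned in the previous post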
1. ex6_spam.m implements the overall flow of the program. The code is as follows:
%% Initialization
clear ; close all; clc
%% ==================== Part 1: Email Preprocessing ====================
% To use an SVM to classify emails into Spam v.s. Non-Spam, you first need
% to convert each email into a vector of features. In this part, you will
% implement the preprocessing steps for each email. You should
% complete the code in processEmail.m to produce a word indices vector
% for a given email.
fprintf('\nPreprocessing sample email (emailSample1.txt)\n');
% Extract Features
file_contents = readFile('emailSample1.txt');
word_indices = processEmail(file_contents);
% Print Stats
fprintf('Word Indices: \n');
fprintf(' %d', word_indices);
fprintf('\n\n');
fprintf('Program paused. Press enter to continue.\n');
pause;
%% ==================== Part 2: Feature Extraction ====================
% Now, you will convert each email into a vector of features in R^n.
% You should complete the code in emailFeatures.m to produce a feature
% vector for a given email.
fprintf('\nExtracting features from sample email (emailSample1.txt)\n');
% Extract Features
file_contents = readFile('emailSample1.txt');
word_indices = processEmail(file_contents);
features = emailFeatures(word_indices);
% Print Stats
fprintf('Length of feature vector: %d\n', length(features));
fprintf('Number of non-zero entries: %d\n', sum(features > 0));
fprintf('Program paused. Press enter to continue.\n');
pause;
%% =========== Part 3: Train Linear SVM for Spam Classification ========
% In this section, you will train a linear classifier to determine if an
% email is Spam or Not-Spam.
% Load the Spam Email dataset
% You will have X, y in your environment
load('spamTrain.mat');
fprintf('\nTraining Linear SVM (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')
C = 0.1;
model = svmTrain(X, y, C, @linearKernel);
p = svmPredict(model, X);
fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);
%% =================== Part 4: Test Spam Classification ================
% After training the classifier, we can evaluate it on a test set. We have
% included a test set in spamTest.mat
% Load the test dataset
% You will have Xtest, ytest in your environment
load('spamTest.mat');
fprintf('\nEvaluating the trained Linear SVM on a test set ...\n')
p = svmPredict(model, Xtest);
fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);
pause;
%% ================= Part 5: Top Predictors of Spam ====================
% Since the model we are training is a linear SVM, we can inspect the
% weights learned by the model to understand better how it is determining
% whether an email is spam or not. The following code finds the words with
% the highest weights in the classifier. Informally, the classifier
% 'thinks' that these words are the most likely indicators of spam.
%
% Sort the weights and obtain the vocabulary list
[weight, idx] = sort(model.w, 'descend');
vocabList = getVocabList();
fprintf('\nTop predictors of spam: \n');
for i = 1:15
    fprintf(' %-15s (%f) \n', vocabList{idx(i)}, weight(i));
end
fprintf('\n\n');
fprintf('\nProgram paused. Press enter to continue.\n');
pause;
%% =================== Part 6: Try Your Own Emails =====================
% Now that you've trained the spam classifier, you can use it on your own
% emails! In the starter code, we have included spamSample1.txt,
% spamSample2.txt, emailSample1.txt and emailSample2.txt as examples.
% The following code reads in one of these emails and then uses your
% learned SVM classifier to determine whether the email is Spam or
% Not Spam
% Set the file to be read in (change this to spamSample2.txt,
% emailSample1.txt or emailSample2.txt to see different predictions on
% different emails types). Try your own emails as well!
filename = 'spamSample1.txt';
% Read and predict
file_contents = readFile(filename);
word_indices = processEmail(file_contents);
x = emailFeatures(word_indices);
p = svmPredict(model, x);
fprintf('\nProcessed %s\n\nSpam Classification: %d\n', filename, p);
fprintf('(1 indicates spam, 0 indicates not spam)\n\n');
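Part 1: Email Preprocessing --- reads the sample email and runs processEmail on it, normalizing the text into stemmed English words and mapping them to vocabulary indices
Part 2: Feature Extraction --- builds a vector x with the same length as the dictionary vocabList; for every word index produced in Part 1, the corresponding entry of x is set to 1
Part 3: Train Linear SVM for Spam Classification --- trains an SVM with a linear kernel on (X, y) from spamTrain.mat, producing model
Part 4: Test Spam Classification --- evaluates the model obtained in Part 3 on (Xtest, ytest) from spamTest.mat and reports its accuracy
Part 5: Top Predictors of Spam --- lists the 15 words with the largest weights in the trained model, i.e. the words the classifier treats as the strongest indicators of spam
Part 6: Try Your Own Emails --- classifies a single email; an output of 0 means it is not spam, an output of 1 means it is spam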
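Part 5 is the step that is easiest to misread, so here is a minimal sketch of the idea on toy data. The weights w and the four-word vocabulary below are made up for illustration and do not come from the trained model; only the sort call mirrors what the script does with model.w.

```matlab
% Hypothetical toy weights and vocabulary (NOT taken from the trained model)
w = [0.10; -0.30; 0.50; 0.20];
vocab = {'meet'; 'thank'; 'click'; 'free'};

% Sort the weights from largest to smallest, keeping the original positions
[weight, idx] = sort(w, 'descend');

% Words with the two largest weights, i.e. the strongest "spam" indicators
topWords = vocab(idx(1:2));
for i = 1:2
    fprintf(' %-15s (%f) \n', vocab{idx(i)}, weight(i));
end
```

A large positive weight pushes the linear decision function toward the spam class, which is why the ranking is by weight and not by how often a word occurs.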
2. Completing processEmail.m (code needs to be written here!)
function word_indices = processEmail(email_contents)
% Load Vocabulary
vocabList = getVocabList();
% Init return value (a numeric array of vocabulary indices)
word_indices = [];
% Lower case: convert all upper-case letters to lower case
email_contents = lower(email_contents);
% Strip all HTML
% Looks for any expression that starts with < and ends with >, does not
% have any < or > inside the tag, and replaces it with a space
email_contents = regexprep(email_contents, '<[^<>]+>', ' '); % replace every HTML tag with a space
% Handle Numbers
% Look for one or more characters between 0-9
email_contents = regexprep(email_contents, '[0-9]+', 'number'); % replace every run of digits with the string 'number'
% Handle URLS
% Look for strings starting with http:// or https://
email_contents = regexprep(email_contents, ...
    '(http|https)://[^\s]*', 'httpaddr'); % replace every URL with the string 'httpaddr'
% Handle Email Addresses
% Look for strings with @ in the middle
email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr'); % replace every email address with the string 'emailaddr'
% Handle $ sign
email_contents = regexprep(email_contents, '[$]+', 'dollar'); % replace $ signs with the string 'dollar'
% ========================== Tokenize Email ===========================
% Output the email to screen as well
fprintf('\n==== Processed Email ====\n\n');
% Process file
l = 0;
while ~isempty(email_contents)
    % Tokenize and also get rid of any punctuation
    [str, email_contents] = ...
        strtok(email_contents, ...
               [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);
    % Remove any non alphanumeric characters
    str = regexprep(str, '[^a-zA-Z0-9]', '');
    % Stem the word, trimming surrounding whitespace first
    % (the porterStemmer sometimes has issues, so we use a try catch
    % block)
    try
        str = porterStemmer(strtrim(str));
    catch
        str = '';
        continue;
    end
    % Skip the word if it is too short
    if length(str) < 1
        continue;
    end
    % ====================== YOUR CODE HERE ======================
    % Compare str with every entry of vocabList; on a match, append that
    % entry's index to word_indices
    vocab_length = length(vocabList);
    for i = 1:vocab_length
        if strcmp(str, vocabList{i})
            word_indices = [word_indices; i];
        end
    end
    % =============================================================
    % Print to screen, ensuring that the output lines are not too long
    if (l + length(str) + 1) > 78
        fprintf('\n');
        l = 0;
    end
    fprintf('%s ', str);
    l = l + length(str) + 1;
end
% Print footer
fprintf('\n\n=========================\n');
end
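This file normalizes the email text that was read in. In order, it: 1. converts upper-case characters to lower case; 2. strips all HTML tags; 3. replaces all numbers with the string number; 4. replaces all http addresses with the string httpaddr; 5. replaces all email addresses with emailaddr; 6. replaces all $ signs with the string dollar; 7. splits the text on punctuation, removes non-alphanumeric characters, trims extra whitespace, and reduces each word to its stem.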
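To see what steps 1 to 6 do to a concrete string, the following sketch applies the same regexprep calls, in the same order, to a made-up sentence. The sample text is an assumption chosen for illustration; it is not one of the exercise's email files.

```matlab
% Hypothetical sample text (not from emailSample1.txt)
s = 'Visit <b>http://example.com</b> now, only $100! Mail me@example.org';

s = lower(s);                                          % 1. lower case
s = regexprep(s, '<[^<>]+>', ' ');                     % 2. strip HTML tags
s = regexprep(s, '[0-9]+', 'number');                  % 3. numbers -> 'number'
s = regexprep(s, '(http|https)://[^\s]*', 'httpaddr'); % 4. URLs -> 'httpaddr'
s = regexprep(s, '[^\s]+@[^\s]+', 'emailaddr');        % 5. emails -> 'emailaddr'
s = regexprep(s, '[$]+', 'dollar');                    % 6. $ -> 'dollar'

disp(s);
```

Note that $100 ends up as the single token dollarnumber, because the digits are rewritten first and the $ sign afterwards with no space in between.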
Finally, str is compared with every entry of vocabList; if it matches one, its position in vocabList is appended to the numeric array word_indices.
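The lookup can also be done for many words at once. The sketch below is an alternative to the loop in the post, not the post's own method: it uses the built-in ismember on cell arrays of strings. The four-word vocabulary and the token list are hypothetical stand-ins for vocabList and the stemmed words.

```matlab
% Hypothetical mini vocabulary and already-stemmed tokens
vocabList = {'anyon'; 'know'; 'much'; 'cost'};
tokens = {'know', 'price', 'cost', 'know'};

% tf(k) is true when tokens{k} is in the vocabulary;
% loc(k) is its position in vocabList (0 when absent)
[tf, loc] = ismember(tokens, vocabList);

% Keep only the positions of words that were found, as a column vector
found = loc(tf);
word_indices = found(:);

disp(word_indices');
```

Words that are not in the vocabulary ('price' here) are simply dropped, and a word that occurs twice contributes its index twice, which matches the behaviour of the loop version.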
3. Completing emailFeatures.m (code needs to be written here!)
function x = emailFeatures(word_indices)
% Total number of words in the dictionary
n = 1899;
% You need to return the following variables correctly.
x = zeros(n, 1);
% ====================== YOUR CODE HERE ======================
k = length(word_indices);
for i = 1:k
    % Mark the word as present; repeated occurrences leave the entry at 1
    if x(word_indices(i)) == 0
        x(word_indices(i)) = x(word_indices(i)) + 1;
    end
end
% =========================================================================
end
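The role of this file is to convert the word_indices obtained in step 2 into the vector x. word_indices stores the positions in the vocabulary list vocabList of the words that appear in the email, so we build a vector x with the same length as vocabList and set to 1 the entry of every vocabulary word that appears. This turns the input email into the standard feature-vector input for the classifier.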
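The same binary feature vector can be built without a loop by using word_indices directly as an index. This is a minimal sketch with a hypothetical vocabulary size of 10 (the exercise uses 1899) and made-up indices.

```matlab
n = 10;                       % hypothetical vocabulary size (exercise: 1899)
word_indices = [2; 4; 2; 7];  % hypothetical indices; word 2 occurs twice

x = zeros(n, 1);
x(word_indices) = 1;          % repeated indices still yield a single 1

fprintf('Number of non-zero entries: %d\n', sum(x > 0));
```

Because the assignment sets entries to 1 instead of incrementing them, a word that appears several times is still recorded only as present, which is what the loop with the == 0 check achieves.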
from:http://blog.csdn.net/a1015553840/article/details/50826728