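Programming Exercise 6 consists of two parts: 1. Support Vector Machines and 2. Spam Classification. The first part walks through the implementation of the SVM algorithm and is the foundation of this week's material; the second part, spam classification, is a practical application built on the code from the first. The first part was covered in the previous post, "Part 1: Support Vector Machines". This post covers the second part, Spam Classification.
Without further ado, let's get started.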
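Data sets:
spamTrain.mat --- training set, used to train the spam classifier; 4000 training examples
spamTest.mat --- test set, used to check how well the trained classifier generalizes to new examples; 1000 test examples
vocab.txt --- the vocabulary used by the classifier, stored as plain text with an index and the corresponding word; 1899 common words are used here
emailSample1.txt, emailSample2.txt --- sample emails, used to see what the email-processing code does to a message
spamSample1.txt, spamSample2.txt --- spam samples, fed to the trained classifier to see whether it flags them as spam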
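Function files:
ex6_spam.m --- the driver script; it runs the steps of the exercise and handles input, output and plotting. No changes needed
getVocabList.m --- reads vocab.txt into MATLAB and stores it in vocabList, a one-dimensional cell array of strings. No changes needed, call it directly
porterStemmer.m --- English stemming function (reduces each word to its stem). No changes needed, call it directly
readFile.m --- reads the text of an email. No changes needed, call it directly
processEmail.m --- processes the string that was read in: strips punctuation, splits it into words, replaces URLs and numbers, and so on. Needs to be completed
emailFeatures.m --- checks whether each processed word is in the vocabulary vocabList and, if so, sets the corresponding entry of the feature vector to 1. Needs to be completed
svmTrain.m --- trains the SVM classifier on the training set. No changes needed, call it directly. Explained in the previous post
linearKernel.m --- the linear kernel function. No changes needed, call it directly. Explained in the previous post
svmPredict.m --- uses the trained model to predict labels for new examples. No changes needed, call it directly. Mentioned in the previous post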
1. ex6_spam.m lays out the overall flow of the program. The code is as follows:
%% Initialization
clear ; close all; clc

%% ==================== Part 1: Email Preprocessing ====================
%  To use an SVM to classify emails into Spam v.s. Non-Spam, you first need
%  to convert each email into a vector of features. In this part, you will
%  implement the preprocessing steps for each email. You should
%  complete the code in processEmail.m to produce a word indices vector
%  for a given email.

fprintf('\nPreprocessing sample email (emailSample1.txt)\n');

% Extract Features
file_contents = readFile('emailSample1.txt');
word_indices  = processEmail(file_contents);

% Print Stats
fprintf('Word Indices: \n');
fprintf(' %d', word_indices);
fprintf('\n\n');

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ==================== Part 2: Feature Extraction ====================
%  Now, you will convert each email into a vector of features in R^n.
%  You should complete the code in emailFeatures.m to produce a feature
%  vector for a given email.

fprintf('\nExtracting features from sample email (emailSample1.txt)\n');

% Extract Features
file_contents = readFile('emailSample1.txt');
word_indices  = processEmail(file_contents);
features      = emailFeatures(word_indices);

% Print Stats
fprintf('Length of feature vector: %d\n', length(features));
fprintf('Number of non-zero entries: %d\n', sum(features > 0));

fprintf('Program paused. Press enter to continue.\n');
pause;

%% =========== Part 3: Train Linear SVM for Spam Classification ========
%  In this section, you will train a linear classifier to determine if an
%  email is Spam or Not-Spam.

% Load the Spam Email dataset
% You will have X, y in your environment
load('spamTrain.mat');

fprintf('\nTraining Linear SVM (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')

C = 0.1;
model = svmTrain(X, y, C, @linearKernel);

p = svmPredict(model, X);

fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);

%% =================== Part 4: Test Spam Classification ================
%  After training the classifier, we can evaluate it on a test set. We have
%  included a test set in spamTest.mat

% Load the test dataset
% You will have Xtest, ytest in your environment
load('spamTest.mat');

fprintf('\nEvaluating the trained Linear SVM on a test set ...\n')

p = svmPredict(model, Xtest);

fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);
pause;

%% ================= Part 5: Top Predictors of Spam ====================
%  Since the model we are training is a linear SVM, we can inspect the
%  weights learned by the model to understand better how it is determining
%  whether an email is spam or not. The following code finds the words with
%  the highest weights in the classifier. Informally, the classifier
%  'thinks' that these words are the most likely indicators of spam.

% Sort the weights and obtain the vocabulary list
[weight, idx] = sort(model.w, 'descend');
vocabList = getVocabList();

fprintf('\nTop predictors of spam: \n');
for i = 1:15
    fprintf(' %-15s (%f) \n', vocabList{idx(i)}, weight(i));
end

fprintf('\n\n');
fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% =================== Part 6: Try Your Own Emails =====================
%  Now that you've trained the spam classifier, you can use it on your own
%  emails! In the starter code, we have included spamSample1.txt,
%  spamSample2.txt, emailSample1.txt and emailSample2.txt as examples.
%  The following code reads in one of these emails and then uses your
%  learned SVM classifier to determine whether the email is Spam or
%  Not Spam

% Set the file to be read in (change this to spamSample2.txt,
% emailSample1.txt or emailSample2.txt to see different predictions on
% different emails types). Try your own emails as well!
filename = 'spamSample1.txt';

% Read and predict
file_contents = readFile(filename);
word_indices  = processEmail(file_contents);
x             = emailFeatures(word_indices);
p = svmPredict(model, x);

fprintf('\nProcessed %s\n\nSpam Classification: %d\n', filename, p);
fprintf('(1 indicates spam, 0 indicates not spam)\n\n');
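part1: Email Preprocessing --- reads the sample email and runs processEmail on it, turning the raw text into normalized English words that are easy to work with and returning their indices in the vocabulary
part2: Feature Extraction --- builds a vector x with the same length as the vocabulary vocabList; for every word index produced in part1, the corresponding entry of x is set to 1
part3: Train Linear SVM for Spam Classification --- trains an SVM with a linear kernel on (X, y) from spamTrain.mat, producing model
part4: Test Spam Classification --- evaluates the model obtained in part3 on (Xtest, ytest) from spamTest.mat and reports its accuracy
part5: Top Predictors of Spam --- lists the 15 words with the largest weights in the trained model, i.e. the words the classifier treats as the strongest indicators of spam
part6: Try Your Own Emails --- classifies a single email; an output of 0 means it is not spam, an output of 1 means it is spam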
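The accuracy computation in part3 and part4 and the weight ranking in part5 can be tried on toy data without running the whole exercise. The sketch below is only an illustration: the predictions, labels, weights and the four-word vocabulary are made up, and `top_k` is a hypothetical name standing in for the 15 used in ex6_spam.m.

```matlab
% Toy illustration of part3 to part5; every value below is made up.
p = [1; 0; 1; 1; 0];                    % hypothetical predictions
y = [1; 0; 0; 1; 0];                    % hypothetical true labels

% Same expression as in ex6_spam.m: percentage of predictions equal to the labels
accuracy = mean(double(p == y)) * 100;
fprintf('Accuracy: %f\n', accuracy);

% Hypothetical linear-SVM weights, one per vocabulary word
w     = [0.10; -0.30; 0.50; 0.05];
vocab = {'click', 'thank', 'guarante', 'list'};   % toy vocabulary, not vocab.txt

% Sort weights from largest to smallest; idx maps back into the vocabulary
[weight, idx] = sort(w, 'descend');

top_k = 2;                              % ex6_spam.m uses 15
for i = 1:top_k
    fprintf(' %-15s (%f)\n', vocab{idx(i)}, weight(i));
end
```

Note that the ranking is by weight in the linear model, not by how often a word occurs in the training emails.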
2. Complete processEmail.m (you need to write code here!)
function word_indices = processEmail(email_contents)

% Load Vocabulary
vocabList = getVocabList();

% Init return value: indices of the words found in the vocabulary
word_indices = [];

% Lower case: convert all upper-case letters to lower case
email_contents = lower(email_contents);

% Strip all HTML
% Look for any expression that starts with <, ends with > and has no
% < or > inside the tag, and replace it with a space
email_contents = regexprep(email_contents, '<[^<>]+>', ' ');

% Handle Numbers
% Look for one or more characters between 0-9 and replace them with 'number'
email_contents = regexprep(email_contents, '[0-9]+', 'number');

% Handle URLS
% Look for strings starting with http:// or https:// and replace them
% with 'httpaddr'
email_contents = regexprep(email_contents, ...
                           '(http|https)://[^\s]*', 'httpaddr');

% Handle Email Addresses
% Look for strings with @ in the middle and replace them with 'emailaddr'
email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');

% Handle $ sign: replace it with 'dollar'
email_contents = regexprep(email_contents, '[$]+', 'dollar');

% ========================== Tokenize Email ===========================

% Output the email to screen as well
fprintf('\n==== Processed Email ====\n\n');

% Process file
l = 0;

while ~isempty(email_contents)

    % Tokenize and also get rid of any punctuation
    [str, email_contents] = ...
       strtok(email_contents, ...
              [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);

    % Remove any non alphanumeric characters
    str = regexprep(str, '[^a-zA-Z0-9]', '');

    % Stem the word
    % (the porterStemmer sometimes has issues, so we use a try catch
    % block; strtrim removes surrounding whitespace first)
    try
        str = porterStemmer(strtrim(str));
    catch
        str = '';
        continue;
    end

    % Skip the word if it is too short
    if length(str) < 1
        continue;
    end

    % ====================== YOUR CODE HERE ======================
    % Compare str with every entry of vocabList
    vocab_length = length(vocabList);
    for i = 1:vocab_length
        if strcmp(str, vocabList{i})
            % Found: append the word's index in the vocabulary
            word_indices = [word_indices; i];
        end
    end
    % =============================================================

    % Print to screen, ensuring that the output lines are not too long
    if (l + length(str) + 1) > 78
        fprintf('\n');
        l = 0;
    end
    fprintf('%s ', str);
    l = l + length(str) + 1;

end

% Print footer
fprintf('\n\n=========================\n');

end
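This file normalizes the email that was read in. In order, it: 1. converts upper-case characters to lower case; 2. strips all HTML tags; 3. replaces every number with the string number; 4. replaces every http address with the string httpaddr; 5. replaces every email address with emailaddr; 6. replaces every $ sign with the string dollar; 7. removes punctuation, trims extra whitespace, drops non-alphanumeric characters and reduces each word to its stem.
Finally, str is compared with every entry of vocabList; if str is found, its position in vocabList is appended to the numeric array word_indices.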
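To see what these steps do to a concrete string, here is a small sketch that can be run on its own in MATLAB or Octave. It rests on several assumptions: the sample text and the five-word vocabulary are made up, stemming is skipped because porterStemmer.m is not available outside the exercise, tokens are extracted with a single `regexp` call rather than the `strtok` loop, and the lookup uses `ismember` in place of the explicit loop over vocabList.

```matlab
% Made-up sample email and toy vocabulary (not vocab.txt)
email = 'Visit <b>http://example.com</b> NOW, pay $10 or mail me@example.com';
vocab = {'dollarnumber', 'emailaddr', 'httpaddr', 'now', 'visit'};

% Normalization steps, in the same order as processEmail.m
s = lower(email);                                       % 1. lower case
s = regexprep(s, '<[^<>]+>', ' ');                      % 2. strip HTML tags
s = regexprep(s, '[0-9]+', 'number');                   % 3. numbers
s = regexprep(s, '(http|https)://[^\s]*', 'httpaddr');  % 4. URLs
s = regexprep(s, '[^\s]+@[^\s]+', 'emailaddr');         % 5. email addresses
s = regexprep(s, '[$]+', 'dollar');                     % 6. dollar signs

% 7. Simplified tokenization: keep runs of letters and digits (no stemming)
tokens = regexp(s, '[a-z0-9]+', 'match');

% Vocabulary lookup: loc holds the position in vocab, or 0 when absent
[tf, loc] = ismember(tokens, vocab);
word_indices = loc(tf)';                                % column vector

fprintf('%s ', tokens{:});
fprintf('\n');
fprintf('%d ', word_indices);
fprintf('\n');
```

Because the number replacement runs before the dollar replacement, `$10` ends up as the single token `dollarnumber`; words such as `pay` that are missing from the toy vocabulary simply contribute no index.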
3. Complete emailFeatures.m (you need to write code here!)
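function x = emailFeatures(word_indices)

% Total number of words in the dictionary
n = 1899;

% You need to return the following variables correctly.
x = zeros(n, 1);

% ====================== YOUR CODE HERE ======================
% Mark every vocabulary word that occurs in the email with a 1
k = length(word_indices);
for i = 1:k
    if x(word_indices(i)) == 0
        x(word_indices(i)) = x(word_indices(i)) + 1;
    end
end
% =========================================================================

end

This file converts the word_indices obtained in step 2 into the vector x. word_indices stores the positions in the vocabulary vocabList of the words that occur in the email, so we create a vector x with the same length as vocabList and set the entry for every vocabulary word that occurs to 1. The input example is now in the standard form the classifier expects.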
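The same binary vector can also be built without a loop by indexing with word_indices directly. The sketch below is only an illustration on assumed toy values: a six-word dictionary instead of 1899, and a made-up word_indices in which one index repeats.

```matlab
% Toy sizes: 6 dictionary words instead of the 1899 in vocab.txt
n = 6;
word_indices = [2; 5; 2; 6];   % hypothetical processEmail output; index 2 repeats

x = zeros(n, 1);
x(word_indices) = 1;           % a repeated index still yields a single 1

disp(x');
fprintf('Number of non-zero entries: %d\n', sum(x > 0));
```

Since each entry only records whether a word occurs, the repeated index changes nothing, which matches the `== 0` check in the loop version.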
Source: http://blog.csdn.net/a1015553840/article/details/50826728