Programming Exercise 6: Support Vector Machines, Part 2

Hello everyone, I am Mac Jiang. Today I would like to share my work on the sixth programming assignment of Stanford University's Machine Learning course on Coursera (taught by Andrew Ng): Programming Exercise 6: Support Vector Machines. I am writing this post to offer some help to students who run into difficulties with the course, and also to consolidate this week's material for myself. You are welcome to repost this article, but please contact me beforehand and cite the source. Thank you!

Programming Exercise 6 consists of two parts: (1) Support Vector Machines and (2) Spam Classification. The first part walks through the concrete implementation of the SVM algorithm and is the foundation of this week's material; the second part, spam classification, is a practical application built on top of the code from the first part. The first part was covered in the previous post: Part 1: Support Vector Machines. Below is the implementation of the second part, Spam Classification.


Alright, without further ado, let's get started.

Data files: spamTrain.mat --- training set used to train the spam classifier; contains 4000 training examples

            spamTest.mat --- test set used to check how well the trained classifier generalizes to new examples; contains 1000 test examples

            vocab.txt --- vocabulary used by the classifier, stored as a text file of index/word pairs; here it contains 1899 common words

            emailSample1.txt, emailSample2.txt --- sample emails used to check how the email-preprocessing code handles an email

            spamSample1.txt, spamSample2.txt --- spam emails used to run the trained classifier and see whether it flags them as spam
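As a quick sanity check (a minimal sketch, assuming the .mat files sit in the current Octave/MATLAB working directory), you can load both datasets and confirm their dimensions:

% Sanity-check the provided datasets (run from the exercise folder)
load('spamTrain.mat');   % provides X (features) and y (labels)
load('spamTest.mat');    % provides Xtest and ytest
size(X)                  % expected: 4000 x 1899 (4000 emails, 1899 vocabulary words)
size(Xtest)              % expected: 1000 x 1899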

Code files: ex6_spam.m --- driver script for the exercise; it lays out the workflow and handles input, output, plotting, and so on. No changes needed.

            getVocabList.m --- reads vocab.txt into MATLAB and stores it in a cell array vocabList. No changes needed; call directly.

            porterStemmer.m --- English word-stemming function. No changes needed; call directly.

            readFile.m --- reads an email text file. No changes needed; call directly.

            processEmail.m --- normalizes the raw email string: removes punctuation, tokenizes, and replaces URLs, numbers, etc. Needs to be completed.

            emailFeatures.m --- checks whether each extracted word appears in the vocabulary vocabList and, if it does, sets the corresponding entry of the feature vector to 1. Needs to be completed.

            svmTrain.m --- trains the SVM classifier on the training set. No changes needed; call directly. Explained in the previous post.

            linearKernel.m --- linear kernel function. No changes needed; call directly. Explained in the previous post.

            svmPredict.m --- uses the trained model to predict labels for new examples. No changes needed; call directly. Mentioned in the previous post.
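As a small illustration (a sketch, assuming you run it from the exercise folder), you can load the vocabulary and inspect it; the word indices used throughout the exercise are simply positions in this cell array:

% Inspect the vocabulary returned by getVocabList
vocabList = getVocabList();                    % cell array of 1899 words
fprintf('Vocabulary size: %d\n', length(vocabList));
fprintf('First word: %s\n', vocabList{1});     % individual words are accessed with curly braces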

1. ex6_spam.m lays out the overall workflow of the exercise. The code is as follows:

%% Initialization
clear ; close all; clc
%% ==================== Part 1: Email Preprocessing ====================
%  To use an SVM to classify emails into Spam vs. Non-Spam, you first need
%  to convert each email into a vector of features. In this part, you will
%  implement the preprocessing steps for each email. You should
%  complete the code in processEmail.m to produce a word indices vector
%  for a given email.
fprintf('\nPreprocessing sample email (emailSample1.txt)\n');
% Extract Features
file_contents = readFile('emailSample1.txt');
word_indices  = processEmail(file_contents);
% Print Stats
fprintf('Word Indices: \n');
fprintf(' %d', word_indices);
fprintf('\n\n');
fprintf('Program paused. Press enter to continue.\n');
pause;

%% ==================== Part 2: Feature Extraction ====================
%  Now, you will convert each email into a vector of features in R^n. 
%  You should complete the code in emailFeatures.m to produce a feature
%  vector for a given email.

fprintf('\nExtracting features from sample email (emailSample1.txt)\n');

% Extract Features
file_contents = readFile('emailSample1.txt');
word_indices  = processEmail(file_contents);
features      = emailFeatures(word_indices);
% Print Stats
fprintf('Length of feature vector: %d\n', length(features));
fprintf('Number of non-zero entries: %d\n', sum(features > 0));
fprintf('Program paused. Press enter to continue.\n');
pause;

%% =========== Part 3: Train Linear SVM for Spam Classification ========
%  In this section, you will train a linear classifier to determine if an
%  email is Spam or Not-Spam.

% Load the Spam Email dataset
% You will have X, y in your environment
load('spamTrain.mat');

fprintf('\nTraining Linear SVM (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')

C = 0.1;
model = svmTrain(X, y, C, @linearKernel);
p = svmPredict(model, X);
fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);

%% =================== Part 4: Test Spam Classification ================
%  After training the classifier, we can evaluate it on a test set. We have
%  included a test set in spamTest.mat

% Load the test dataset
% You will have Xtest, ytest in your environment
load('spamTest.mat');
fprintf('\nEvaluating the trained Linear SVM on a test set ...\n')
p = svmPredict(model, Xtest);
fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);
pause;

%% ================= Part 5: Top Predictors of Spam ====================
%  Since the model we are training is a linear SVM, we can inspect the
%  weights learned by the model to understand better how it is determining
%  whether an email is spam or not. The following code finds the words with
%  the highest weights in the classifier. Informally, the classifier
%  'thinks' that these words are the most likely indicators of spam.
%
% Sort the weights and obtain the vocabulary list
[weight, idx] = sort(model.w, 'descend');
vocabList = getVocabList();
fprintf('\nTop predictors of spam: \n');
for i = 1:15
    fprintf(' %-15s (%f) \n', vocabList{idx(i)}, weight(i));
end
fprintf('\n\n');
fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% =================== Part 6: Try Your Own Emails =====================
%  Now that you've trained the spam classifier, you can use it on your own
%  emails! In the starter code, we have included spamSample1.txt,
%  spamSample2.txt, emailSample1.txt and emailSample2.txt as examples. 
%  The following code reads in one of these emails and then uses your 
%  learned SVM classifier to determine whether the email is Spam or 
%  Not Spam
% Set the file to be read in (change this to spamSample2.txt,
% emailSample1.txt or emailSample2.txt to see different predictions on
% different emails types). Try your own emails as well!
filename = 'spamSample1.txt';
% Read and predict
file_contents = readFile(filename);
word_indices  = processEmail(file_contents);
x             = emailFeatures(word_indices);
p = svmPredict(model, x);
fprintf('\nProcessed %s\n\nSpam Classification: %d\n', filename, p);
fprintf('(1 indicates spam, 0 indicates not spam)\n\n');

Part 1: Email Preprocessing --- read the sample email and use processEmail to turn the raw text into a list of normalized word tokens.

Part 2: Feature Extraction --- build a vector x with the same length as the vocabulary vocabList; for each token from Part 1 that appears in the vocabulary, set the corresponding entry of x to 1.

Part 3: Train Linear SVM for Spam Classification --- build an SVM with a linear kernel and train it on (X, y) from spamTrain.mat to obtain the model.

Part 4: Test Spam Classification --- evaluate the model obtained in Part 3 on (Xtest, ytest) from spamTest.mat and report its accuracy.

Part 5: Top Predictors of Spam --- list the 15 words with the largest weights in the trained model, i.e., the words the classifier treats as the strongest indicators of spam.

Part 6: Try Your Own Emails --- classify a single email; an output of 0 means it is not spam, and 1 means it is spam.

2. Complete processEmail.m (code needs to be written here!)

function word_indices = processEmail(email_contents)
% Load Vocabulary (read the word list into a cell array)
vocabList = getVocabList();
% Init return value (the array of matched word indices)
word_indices = [];
% Lower case (convert all upper-case letters to lower case)
email_contents = lower(email_contents);
% Strip all HTML
% Looks for any expression that starts with < and ends with > and replace
% and does not have any < or > in the tag it with a space
email_contents = regexprep(email_contents, '<[^<>]+>', ' ');  % replace every HTML tag with a space
% Handle Numbers
% Look for one or more characters between 0-9
email_contents = regexprep(email_contents, '[0-9]+', 'number');  % replace every run of digits with the string "number"

% Handle URLS
% Look for strings starting with http:// or https://
email_contents = regexprep(email_contents, ...
                           '(http|https)://[^\s]*', 'httpaddr');  % replace every http/https URL with the string "httpaddr"

% Handle Email Addresses
% Look for strings with @ in the middle
email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');  % replace every email address with the string "emailaddr"

% Handle $ sign
email_contents = regexprep(email_contents, '[$]+', 'dollar');  % replace every "$" sign with the string "dollar"


% ========================== Tokenize Email ===========================

% Output the email to screen as well
fprintf('\n==== Processed Email ====\n\n');

% Process file
l = 0;

while ~isempty(email_contents)

    % Tokenize and also get rid of any punctuation
    [str, email_contents] = ...
       strtok(email_contents, ...
              [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);
   
    % Remove any non-alphanumeric characters
    str = regexprep(str, '[^a-zA-Z0-9]', '');

    % Stem the word (strtrim first removes surrounding whitespace)
    % (the porterStemmer sometimes has issues, so we use a try/catch
    % block)
    try str = porterStemmer(strtrim(str)); 
    catch str = ''; continue;
    end;

    % Skip the word if it is too short
    if length(str) < 1
       continue;
    end
    % ====================== YOUR CODE HERE ======================

    vocab_length = length(vocabList);
    for i = 1:vocab_length,                               % compare str with every word in vocabList
        if(strcmp(str,vocabList(i)) == 1)
            word_indices = [word_indices;i];              % if it matches, append the word's index to word_indices
        end
    end
    % =============================================================
    % Print to screen, ensuring that the output lines are not too long
    if (l + length(str) + 1) > 78
        fprintf('\n');
        l = 0;
    end
    fprintf('%s ', str);
    l = l + length(str) + 1;
end
% Print footer
fprintf('\n\n=========================\n');

end

This function normalizes the raw email text. In order it: (1) converts all upper-case letters to lower case, (2) strips all HTML tags, (3) replaces every number with the string "number", (4) replaces every http address with the string "httpaddr", (5) replaces every email address with "emailaddr", (6) replaces every $ sign with the string "dollar", and (7) removes punctuation and extra whitespace, stems each word, and drops any remaining non-alphanumeric characters.
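To make the substitutions concrete, here is a small sketch (the sample string is made up for illustration) showing what the same regexprep rules, applied in the same order, do to a short piece of text:

% Hypothetical example of the normalization rules used in processEmail.m
s = lower('Visit HTTP://deal.example.com or mail me@example.com for only $100!');
s = regexprep(s, '<[^<>]+>', ' ');                        % strip HTML tags (none here)
s = regexprep(s, '[0-9]+', 'number');                     % digits  -> "number"
s = regexprep(s, '(http|https)://[^\s]*', 'httpaddr');    % URLs    -> "httpaddr"
s = regexprep(s, '[^\s]+@[^\s]+', 'emailaddr');           % emails  -> "emailaddr"
s = regexprep(s, '[$]+', 'dollar');                       % "$"     -> "dollar"
% s is now: 'visit httpaddr or mail emailaddr for only dollarnumber!'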

Finally, each token str is compared against every entry of vocabList; if str appears in the vocabulary, its position in vocabList is appended to the numeric array word_indices.
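As a side note, the linear scan over vocabList can also be written as a single strcmp over the whole cell array; the following is a sketch of an equivalent (and typically faster) lookup, not part of the official starter code:

% Equivalent vectorized lookup: strcmp between a string and a cell array
% returns a logical vector, and find gives the matching index (if any).
idx = find(strcmp(str, vocabList));
if ~isempty(idx)
    word_indices = [word_indices; idx];
end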

3. Complete emailFeatures.m (code needs to be written here!)

function x = emailFeatures(word_indices)
% Total number of words in the dictionary
n = 1899;
% You need to return the following variables correctly.
x = zeros(n, 1);
% ====================== YOUR CODE HERE ======================
k = length(word_indices);
for i = 1:k,
    % mark the vocabulary position of each word found in the email with a 1
    if(x(word_indices(i)) == 0)
        x(word_indices(i)) = x(word_indices(i)) + 1;
    end
end
% =========================================================================
end
This function converts the word_indices produced in step 2 into a feature vector x. Since word_indices stores the positions (within the vocabulary vocabList) of the words that occur in the email, we build a vector x with the same length as vocabList and set to 1 every entry whose word appears in the email. This turns the raw email into a standard input vector for the classifier.
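Because x is binary, the loop above can also be collapsed into a single indexed assignment; this one-liner (a sketch that is equivalent for the same inputs) does the same job:

% Vectorized alternative: set every listed vocabulary index to 1
% (duplicate entries in word_indices are harmless here)
x = zeros(n, 1);
x(word_indices) = 1;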


From: http://blog.csdn.net/a1015553840/article/details/50826728




