Naive Bayes algorithm for spam classification (Matlab implementation)

The materials, data, and algorithm come from Problem Set 2 (Q3) of Andrew Ng's Stanford Machine Learning course.

1. Preprocessing

(1) Keep only the subject and body of each email in the dataset.

(2) Convert all words to lowercase.

(3) Replace each email address with the word EMAILADDR, and similarly replace web addresses (HTTPADDR), currency (DOLLAR), and numbers (NUMBER).

(4) Build the vocabulary. Stem the words with a standard stemming algorithm, then keep only the medium-frequency tokens in the vocabulary (tokens that occur very often or very rarely are both dropped).

(5) Build the document-word matrix. The ith row represents the ith document/email, and the jth column represents the jth distinct token. Thus, the (i, j)-entry of this matrix is the number of occurrences of the jth token in the ith document.
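Steps (2), (3), and (5) can be sketched in a few lines of Matlab. This is only an illustration under stated assumptions: the two emails are made-up toy data, the regular expressions are simple guesses rather than the patterns used to prepare the course data, and stemming and the frequency filter of step (4) are skipped entirely.

```matlab
% Hypothetical toy emails; the real data comes from the course files.
emails = {'Win $100 now! Mail bob@example.com or visit http://spam.example.com', ...
          'Meeting at 10, see http://intranet.example.com/agenda'};

docs = cell(size(emails));
for d = 1:numel(emails)
    s = lower(emails{d});                            % step (2): lowercase
    s = regexprep(s, '\S+@\S+', ' EMAILADDR ');      % step (3): email addresses
    s = regexprep(s, 'https?://\S+', ' HTTPADDR ');  % web addresses
    s = regexprep(s, '\$+', ' DOLLAR ');             % currency signs
    s = regexprep(s, '\d+', ' NUMBER ');             % numbers
    docs{d} = regexp(s, '[A-Za-z]+', 'match');       % split into letter-only tokens
end

% Vocabulary: every distinct token (no stemming, no frequency filter here).
vocab = unique([docs{:}]);

% Step (5): X(i, j) = number of times token j occurs in document i.
X = zeros(numel(docs), numel(vocab));
for d = 1:numel(docs)
    [found, col] = ismember(docs{d}, vocab);
    X(d, :) = accumarray(col(:), 1, [numel(vocab) 1])';
end
```

Each row of X sums to the number of tokens in the corresponding email, which is a quick sanity check on the counting.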


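With the data prepared, the classifier can now be implemented in Matlab. (Note: the program below uses the second method from another blog post, Naive Bayes Classifier.)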
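For reference, the quantities that the scripts compute can be written out as follows. The notation is introduced here to mirror the code and is an assumption, not taken from the problem set: $X$ is the document-word matrix, $m$ the number of training documents, $V$ the vocabulary size, $y^{(i)}$ the label of document $i$, and $x_k$ the count of token $k$ in a test document.

```latex
\begin{aligned}
\phi_{k \mid y=c} &= \frac{\sum_{i:\,y^{(i)}=c} X_{ik} + 1}
                         {\sum_{i:\,y^{(i)}=c} \sum_{j=1}^{V} X_{ij} + V},
  \qquad c \in \{0, 1\}, \\
p(y=c) &= \frac{\bigl|\{\, i : y^{(i)} = c \,\}\bigr|}{m}, \\
\hat{y} &= \arg\max_{c} \Bigl( \log p(y=c) + \sum_{k=1}^{V} x_k \log \phi_{k \mid y=c} \Bigr).
\end{aligned}
```

The added 1 in the numerator and the added $V$ in the denominator are Laplace smoothing, so a token that never occurs in one class still gets a nonzero probability.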
nb_train.m
clear 
clc
[spmatrix, tokenlist, trainCategory] = readMatrix('MATRIX.TRAIN');

trainMatrix = full(spmatrix); % rows are documents, columns are tokens; each entry is the number of times the token occurs in the document
numTrainDocs = size(trainMatrix, 1);
numTokens = size(trainMatrix, 2);

% trainMatrix is now a (numTrainDocs x numTokens) matrix.
% Each row represents a unique document (email).
% The j-th column of the row $i$ represents the number of times the j-th
% token appeared in email $i$. 

% tokenlist is a long string containing the list of all tokens (words).
% These tokens are easily known by position in the file TOKENS_LIST

% trainCategory is a (1 x numTrainDocs) vector containing the true 
% classifications for the documents just read in. The i-th entry gives the 
% correct class for the i-th email (which corresponds to the i-th row in 
% the document word matrix).

% Spam documents are indicated as class 1, and non-spam as class 0.
% Note that for the SVM, you would want to convert these to +1 and -1.
%-----------------------------
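V = size(trainMatrix, 2); % total number of tokens
neg = trainMatrix(trainCategory == 0, :); % non-spam examples
pos = trainMatrix(trainCategory == 1, :); % spam examples
neg_words = sum(sum(neg)); % total occurrences of vocabulary tokens in the non-spam documents, not the total word count of those documents as in the tutorial
pos_words = sum(sum(pos));
neg_log_prior = log(size(neg,1) / numTrainDocs); % prior = number of non-spam examples / total number of examples
pos_log_prior = log(size(pos,1) / numTrainDocs); % prior = number of spam examples / total number of examples
neg_log_phi = zeros(1, V); % preallocate the per-token log-probabilities
pos_log_phi = zeros(1, V);
for k = 1:V
    neg_log_phi(k) = log((sum(neg(:,k)) + 1) / (neg_words + V)); % column k holds the counts of token k in every document, so sum down the column
    % Numerator: total occurrences of token k across all non-spam documents (plus 1 for Laplace smoothing).
    % Denominator neg_words + V: total word count of the non-spam documents, counting only words that are in the vocabulary rather than all words.
    pos_log_phi(k) = log((sum(pos(:,k)) + 1) / (pos_words + V));
end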
%----------------------------
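% Below, try to get an informal sense of how indicative token $i$ is for the SPAM
% class
compare_log = pos_log_phi - neg_log_phi; % equals log(exp(pos_log_phi)./exp(neg_log_phi)) without leaving log space
[i,j] = sort(compare_log); % i holds the values in ascending order; j holds each value's position in the original vector
j(end-4:end) % the last 5 entries of j are the positions of the 5 largest values of compare_log (no semicolon, so they are displayed). Look them up in the token list: these are the 5 tokens most indicative of the SPAM class.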
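The per-token loop in nb_train.m can also be written without a loop. The sketch below shows one way to do this on a made-up 4-document, 3-token matrix; the data and the variable names pos_counts, neg_counts, and order are assumptions for illustration and do not appear in the course files.

```matlab
% Toy document-word matrix (hypothetical data): 4 documents, 3 tokens.
trainMatrix   = [2 0 1; 0 3 0; 1 0 2; 0 1 1];
trainCategory = [1 0 1 0];            % 1 = spam, 0 = non-spam

V = size(trainMatrix, 2);
pos_counts = sum(trainMatrix(trainCategory == 1, :), 1);  % per-token counts in spam
neg_counts = sum(trainMatrix(trainCategory == 0, :), 1);  % per-token counts in non-spam

% Laplace-smoothed log-probabilities, computed for all tokens at once.
pos_log_phi = log((pos_counts + 1) / (sum(pos_counts) + V));
neg_log_phi = log((neg_counts + 1) / (sum(neg_counts) + V));

% Log priors: fraction of documents in each class.
pos_log_prior = log(mean(trainCategory == 1));
neg_log_prior = log(mean(trainCategory == 0));

% Rank tokens by log-ratio, most spam-indicative first.
[~, order] = sort(pos_log_phi - neg_log_phi, 'descend');
```

On this toy data, token 1 occurs only in spam and token 2 only in non-spam, so order comes out as [1 3 2]. Because of the smoothing, exp(pos_log_phi) and exp(neg_log_phi) each still sum to 1.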

nb_test.m (run immediately after nb_train.m)
[spmatrix, tokenlist, category] = readMatrix('MATRIX.TEST');

testMatrix = full(spmatrix);
numTestDocs = size(testMatrix, 1);
numTokens = size(testMatrix, 2);

% Assume nb_train.m has just been executed, and all the parameters computed/needed
% by your classifier are in memory through that execution. You can also assume 
% that the columns in the test set are arranged in exactly the same way as for the
% training set (i.e., the j-th column represents the same token in the test data 
% matrix as in the original training data matrix).

% Write code below to classify each document in the test set (ie, each row
% in the current document word matrix) as 1 for SPAM and 0 for NON-SPAM.

% Construct the (numTestDocs x 1) vector 'output' such that the i-th entry 
% of this vector is the predicted class (1/0) for the i-th  email (i-th row 
% in testMatrix) in the test set.
output = zeros(numTestDocs, 1);

%---------------
for k = 1:numTestDocs
    [i,j,v] = find(testMatrix(k,:)); % find the nonzero entries: (i,j) are their positions, v the values at those positions
    % p(y=1|x) and p(y=0|x) share the same denominator, so only the numerators need to be compared.
    neg_posterior = sum(v .* neg_log_phi(j)) + neg_log_prior; % the probabilities were stored as logs during training, so the product becomes a sum
    pos_posterior = sum(v .* pos_log_phi(j)) + pos_log_prior;
    if (neg_posterior > pos_posterior)
        output(k) = 0;
    else
        output(k) = 1;
    end
end
%---------------

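% Compute the error on the test set
% (the counter is named numErrors so that it does not shadow Matlab's built-in error function)
numErrors = 0;
for i = 1:numTestDocs
  if (category(i) ~= output(i))
    numErrors = numErrors + 1;
  end
end

% Print out the classification error on the test set
numErrors/numTestDocs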
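The classification loop in nb_test.m is equivalent to a matrix-vector product, since each document's score is its count vector times the log-probabilities. A minimal sketch under stated assumptions follows: the parameters, test matrix, and labels are made-up toy values, and the names pos_score, neg_score, and test_error are introduced only for this example.

```matlab
% Hypothetical parameters, as a training step might produce them (3 tokens).
pos_log_phi   = log([4 1 4] / 9);
neg_log_phi   = log([1 5 2] / 8);
pos_log_prior = log(0.5);
neg_log_prior = log(0.5);

testMatrix = [1 0 1; 0 2 0; 0 1 1];   % toy test documents (rows) by tokens (columns)
category   = [1 0 1];                 % toy true labels

% Unnormalized log-posterior of every document for each class (numTestDocs x 1).
pos_score = testMatrix * pos_log_phi' + pos_log_prior;
neg_score = testMatrix * neg_log_phi' + neg_log_prior;

% Ties go to spam, matching the else branch of the loop version.
output = double(pos_score >= neg_score);

% Fraction of misclassified test documents.
test_error = mean(output ~= category(:));
```

Zero counts contribute nothing to the product, so this gives the same scores as summing over the nonzero entries returned by find. On the toy data the third document is misclassified, giving a test error of 1/3.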



