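The reason this is implemented in MATLAB is that it is one of the major assignments in my data mining course and the assignment requires it; otherwise I would never have put myself through the pain of using MATLAB... (because I don't know MATLAB...)
The principle behind naive Bayes is very simple; the key ingredient is the probability formula: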
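For reference, the formula in question is presumably Bayes' rule combined with the conditional-independence assumption. In standard textbook notation (the symbols c for the class and x_1, ..., x_n for the features are chosen here for illustration, not taken from this post):

```latex
\[
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\,\prod_{i=1}^{n} P(x_i \mid c)}{P(x_1, \dots, x_n)}
\]
```

Since the denominator is the same for every class, a classifier only needs to compare the numerators.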
For the rest of the background, see: http://zh.wikipedia.org/wiki/%E6%9C%B4%E7%B4%A0%E8%B4%9D%E5%8F%B6%E6%96%AF%E5%88%86%E7%B1%BB%E5%99%A8
Here is the concrete MATLAB implementation:
First, the readMatrix.m code:
function [matrix, tokenlist, category] = readMatrix(filename)

fid = fopen(filename);

% Read the header line
headerline = fgetl(fid);

% Read number of documents and tokens
rowscols = fscanf(fid, '%d %d\n', 2);

% Read the list of tokens - just a long string!
% blah = fscanf(fid, '%s', 1); % required for octave
tokenlist = fgetl(fid);

% Document word matrix
% Each row represents a document (mail)
% Each column represents a distinct token
% The (i,j)-th element represents the number of times token j appeared in
% document i
matrix = sparse(1, 1, 0, rowscols(2), rowscols(1)); % the transpose!

% Vector containing the categories corresponding to each row in the
% document word matrix
% The i-th component is 1 if the i-th document (row) in the document word
% matrix is SPAM, and 0 otherwise.
% This starts as a sparse scalar 0 and grows into a row vector as the
% entries are assigned in the loop below.
category = matrix(rowscols(1));

% Read in the matrix and the categories
for m = 1:rowscols(1) % as many rows as number of documents
    line = fgetl(fid);
    nums = sscanf(line, '%d');
    category(m) = nums(1);
    matrix(1 + cumsum(nums(2:2:end - 1)), m) = nums(3:2:end - 1);
end

matrix = matrix'; % flip it back

fclose(fid);
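To make the indexing inside that loop concrete, here is a small sketch that decodes a single made-up data line the same way readMatrix does. The sample values and the names offsets, counts, tokenIdx, and row are invented for illustration, and the trailing -1 is an assumption about the file format (the code simply skips the last number on each line).

```matlab
% One made-up line: category, then (offset, count) pairs, then a final
% number that is skipped (assumed here to be a -1 terminator).
line = '1 0 2 3 1 -1';
nums = sscanf(line, '%d');

category = nums(1);               % 1 means spam
offsets  = nums(2:2:end - 1);     % gaps between consecutive token indices
counts   = nums(3:2:end - 1);     % occurrences of each listed token
tokenIdx = 1 + cumsum(offsets);   % absolute 1-based token indices

numTokens = 5;                    % hypothetical vocabulary size
row = zeros(1, numTokens);
row(tokenIdx) = counts;           % dense version of this document's row
disp(row)
```

In this example the offsets 0 and 3 accumulate to token indices 1 and 4, so only those two columns of the row are non-zero.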
Training phase:
[spmatrix, tokenlist, trainCategory] = readMatrix('MATRIX.TRAIN');

trainMatrix = full(spmatrix);
numTrainDocs = size(trainMatrix, 1);
numTokens = size(trainMatrix, 2);

% trainMatrix is now a (numTrainDocs x numTokens) matrix.
% Each row represents a unique document (email).
% The j-th column of the row $i$ represents the number of times the j-th
% token appeared in email $i$.

% tokenlist is a long string containing the list of all tokens (words).
% These tokens are easily known by position in the file TOKENS_LIST

% trainCategory is a (1 x numTrainDocs) vector containing the true
% classifications for the documents just read in. The i-th entry gives the
% correct class for the i-th email (which corresponds to the i-th row in
% the document word matrix).

% Spam documents are indicated as class 1, and non-spam as class 0.
% Note that for the SVM, you would want to convert these to +1 and -1.

% YOUR CODE HERE

% Class priors (computed here, but not used by the test code).
positiveSize = length(find(trainCategory));
negitiveSize = length(trainCategory) - positiveSize;
p1 = positiveSize / numTrainDocs;
p0 = negitiveSize / numTrainDocs;

trainCategory = full(trainCategory);

% Accumulate the per-token counts separately for each class.
trainMatrixResult1 = zeros(1, numTokens);
trainMatrixResult0 = zeros(1, numTokens);
for i = 1:numTrainDocs
    if abs(trainCategory(1, i) - 1) <= 1e-10
        trainMatrixResult1 = trainMatrixResult1 + trainMatrix(i, :);
    else
        trainMatrixResult0 = trainMatrixResult0 + trainMatrix(i, :);
    end
end

% Relative frequency of each token within its class, scaled by 1000.
class1sum = sum(trainMatrixResult1);
class0sum = sum(trainMatrixResult0);
trainMatrixResult1 = 1000 * trainMatrixResult1 / class1sum;
trainMatrixResult0 = 1000 * trainMatrixResult0 / class0sum;
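The training code above stores plain relative frequencies, so a token that never occurs in one class gets a value of 0 for that class. A common variant (not what my assignment code does) adds Laplace smoothing, works with log-probabilities, and includes the class priors when predicting. Below is a minimal sketch of that variant on invented toy data; every name in it (X, y, logPhi1, logPhi0, doc, and so on) is hypothetical and unrelated to the assignment files.

```matlab
% Toy data (invented): 4 documents, 3 tokens; label 1 = spam, 0 = non-spam.
X = [2 0 1;
     1 0 0;
     0 3 0;
     0 1 1];
y = [1 1 0 0];
numTokens = size(X, 2);

% Class priors estimated from the label frequencies.
prior1 = mean(y == 1);
prior0 = mean(y == 0);

% Per-class token counts with Laplace (add-one) smoothing, so that no
% token ends up with probability zero.
count1 = sum(X(y == 1, :), 1);
count0 = sum(X(y == 0, :), 1);
logPhi1 = log((count1 + 1) / (sum(count1) + numTokens));
logPhi0 = log((count0 + 1) / (sum(count0) + numTokens));

% Classify a new document in log space: token counts weight the
% log-probabilities and the log prior is added, instead of multiplying
% many small numbers together.
doc = [1 0 2];
score1 = log(prior1) + doc * logPhi1';
score0 = log(prior0) + doc * logPhi0';
predicted = double(score1 > score0);
disp(predicted)
```

Working with sums of logarithms avoids the underflow that long products of small probabilities can cause, which is the usual reason for preferring this form.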
Test phase:
[spmatrix, tokenlist, category] = readMatrix('MATRIX.TEST');

testMatrix = full(spmatrix);
numTestDocs = size(testMatrix, 1);
numTokens = size(testMatrix, 2);

% Assume nb_train.m has just been executed, and all the parameters computed/needed
% by your classifier are in memory through that execution. You can also assume
% that the columns in the test set are arranged in exactly the same way as for the
% training set (i.e., the j-th column represents the same token in the test data
% matrix as in the original training data matrix).

% Write code below to classify each document in the test set (ie, each row
% in the current document word matrix) as 1 for SPAM and 0 for NON-SPAM.

% Construct the (numTestDocs x 1) vector 'output' such that the i-th entry
% of this vector is the predicted class (1/0) for the i-th email (i-th row
% in testMatrix) in the test set.
output = zeros(numTestDocs, 1);

%---------------
% YOUR CODE HERE
%---------------
for i = 1:numTestDocs
    belongTo1 = 1;
    belongTo0 = 1;
    % Multiply the per-class token scores for every token that occurs in
    % the document (each token counts once, whatever its frequency).
    for j = 1:numTokens
        if testMatrix(i, j) ~= 0
            tokenIndex = j;
            belongTo1 = belongTo1 * trainMatrixResult1(tokenIndex);
            belongTo0 = belongTo0 * trainMatrixResult0(tokenIndex);
        end
    end
    % Pick the class with the larger score.
    if belongTo1 > belongTo0
        output(i) = 1;
    else
        output(i) = 0;
    end
end

% Compute the error on the test set
error = 0;
for i = 1:numTestDocs
    if (category(i) ~= output(i))
        error = error + 1;
    end
end

% Print out the classification error on the test set
error/numTestDocs

Classification results:
Graphical display:
Please, everyone, don't just copy and paste this and hand it in~