This algorithm implements the naive Bayes classifier as described in *Data Mining: Concepts and Techniques*, 3rd edition (Jiawei Han). The classification process has four steps (only a brief outline is given here; see the book for the detailed formulas):
(1) Build the training-tuple matrix together with its corresponding class labels
(2) Compute the prior probability of each class
(3) Compute the conditional probability of each attribute value under each class
(4) Predict the class label (choose the class with the maximum posterior probability)
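The four steps amount to frequency counting followed by a product of probabilities. As a rough illustration in Python (an independent sketch, not the MATLAB implementation below; it uses the same 14-tuple training set and, like the MATLAB code, applies no Laplace smoothing):

```python
from collections import Counter, defaultdict

# The 14-tuple buys-computer training set used throughout this post.
DATA = [
    ('youth', 'high', 'no', 'fair', 'no'),
    ('youth', 'high', 'no', 'excellent', 'no'),
    ('middleaged', 'high', 'no', 'fair', 'yes'),
    ('senior', 'medium', 'no', 'fair', 'yes'),
    ('senior', 'low', 'yes', 'fair', 'yes'),
    ('senior', 'low', 'yes', 'excellent', 'no'),
    ('middleaged', 'low', 'yes', 'excellent', 'yes'),
    ('youth', 'medium', 'no', 'fair', 'no'),
    ('youth', 'low', 'yes', 'fair', 'yes'),
    ('senior', 'medium', 'yes', 'fair', 'yes'),
    ('youth', 'medium', 'yes', 'excellent', 'yes'),
    ('middleaged', 'medium', 'no', 'excellent', 'yes'),
    ('middleaged', 'high', 'yes', 'fair', 'yes'),
    ('senior', 'medium', 'no', 'excellent', 'no'),
]

def train(data):
    """Steps 1-3: class counts, priors P(C), and per-attribute value counts."""
    class_counts = Counter(row[-1] for row in data)
    n = len(data)
    priors = {c: k / n for c, k in class_counts.items()}
    cond = defaultdict(lambda: defaultdict(Counter))  # cond[c][j][value] -> count
    for row in data:
        c = row[-1]
        for j, v in enumerate(row[:-1]):
            cond[c][j][v] += 1
    return priors, cond, class_counts

def predict(x, priors, cond, class_counts):
    """Step 4: choose the class maximizing P(C) * prod_j P(x_j | C)."""
    best_c, best_p = None, -1.0
    for c, prior in priors.items():
        p = prior
        for j, v in enumerate(x):
            p *= cond[c][j][v] / class_counts[c]  # 0 if value never seen with class c
        if p > best_p:
            best_c, best_p = c, p
    return best_c

priors, cond, counts = train(DATA)
label = predict(('youth', 'high', 'no', 'fair'), priors, cond, counts)  # -> 'no'
```

This mirrors what the MATLAB program computes, with the same behavior on attribute values never seen for a class: the whole product becomes zero.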
Main program of the naive Bayes classification algorithm:
clc;
clear;
%%%% First step: build the probability tables. The training data is loaded
%%%% inside ConstructProbability.m; note that to classify a different
%%%% dataset you must first retrain on that dataset.
% 'result' contains the probability tables, the attribute list, and the class attribute name
result=ConstructProbability();
PT=result{1,1};
attributeList=result{1,2};
classAttr=result{1,3};
%%%% Second step: load the tuples to classify
% read tuple file
fileID = fopen('D:\matlabFile\NaiveBayesian\NaiveBaysian.txt');
% read as string
D=textscan(fileID,'%s %s %s %s');
fclose(fileID);
%%%%% Third step: make the decisions
conclusion=cell(1,1);
% get attributes from D
for i=1:size(attributeList,1)
conclusion{1,i}=attributeList{i,1};
end
if size(D{1,1},1)>1 % there is at least one data row after the header
for i=2:size(D{1,1},1)
tuple=conclusion(1,:);
for j=1:size(D,2)
tuple{2,j}=D{1,j}{i,1};
end
decision=ErgodicPT(PT,attributeList,tuple);
% after the inner loop j equals size(D,2), so the decision fills the next column
tuple{2,j+1}=decision;
conclusion(size(conclusion,1)+1,:)=tuple(2,:);
end
end
FID=fopen('conclusion.txt','wt');
for i=1:size(conclusion,1)
for j=1:size(conclusion,2)
fprintf(FID, '%s ', conclusion{i,j});
end
fprintf(FID,'\n');
end
fclose(FID);
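For readers porting the script: `textscan` returns one cell array per column of the file. A rough Python equivalent of this parsing step (`read_tuples` is a hypothetical helper, and it returns rows rather than columns) might look like:

```python
import io

def read_tuples(stream):
    """Split a whitespace-delimited file into a header row and data rows.
    Note: textscan(fileID,'%s %s %s %s') yields one cell array per column;
    this helper yields rows instead."""
    lines = [ln.split() for ln in stream if ln.strip()]
    return lines[0], lines[1:]

sample = io.StringIO(
    "age income student creditrating\n"
    "youth high no fair\n"
    "senior medium no excellent\n")
header, rows = read_tuples(sample)  # header = ['age', 'income', 'student', 'creditrating']
```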
Implementation of the ConstructProbability function:
%construct the probability
function result=ConstructProbability()
% read training tuple file
fileID = fopen('D:\matlabFile\NaiveBayesian\TrainingSet.txt');
% read as string
Dataset=textscan(fileID,'%s %s %s %s %s');
fclose(fileID);
%designate the class attribute
classA='buys-computer';
attrs={0,0};
% remember the column id of the class attribute
id=0;
% get attribute list from D
for i=1:size(Dataset,2)
% find the class attribute
if strcmp(classA,Dataset{1,i}{1,1})==1
id=i;
end
attrs{i,1}=Dataset{1,i}{1,1};
% initialize the list of distinct values for this attribute
attr=cell(1,1);
for j=2:size(Dataset{1,i},1)
% check whether this attribute value has already been seen
flag_attr=0;
for k=1:size(attr,1)
if strcmp(attr{k,1},Dataset{1,i}{j,1})
Dataset{1,i}{j,1}=k-1;
flag_attr=1;
break;
end
end
% if the value has not been seen before, add it as a new one
if flag_attr==0
attr{k+1,1}=Dataset{1,i}{j,1};
Dataset{1,i}{j,1}=k;
end
end
% drop the empty placeholder row created at initialization
attr(1,:)=[];
% record the list of distinct values for this attribute
attrs{i,2}=attr;
% drop the header row (the attribute name)
Dataset{1,i}(1,:)=[];
end
% create a new matrix
DS=zeros(size(Dataset{1,1},1),1);
% convert the cell array to a matrix
for i=1:size(Dataset,2)
DataTemp=cell2mat(Dataset{1,i});
DS=cat(2,DS,DataTemp);
end
DS(:,1)=[];
% move the class attribute column to the last position (circshift would
% rotate every column, which only matches the attribute-list reordering
% below when the class attribute is already last)
DS=[DS(:,[1:id-1,id+1:end]),DS(:,id)];
% adjust the attribute list, to make sure the class attribute at the last
% position
p_temp=attrs(id,:);
attrs(id,:)=[];
attrs(size(attrs,1)+1,:)=p_temp;
% compute the class priors and the conditional probabilities of all
% attribute values under each class
rows=unique(DS(:,size(DS,2)),'rows');
% sort the values so that they map onto the attribute-list order
rows=sortrows(rows);
ProbabilityTree=cell(1,2);
for i=1:size(rows,1)
D=DS;
r=find(DS(:,size(DS,2))~=rows(i,1));
D(r,:)=[];
ProbabilityTree{i,1}=size(D,1)/size(DS,1);
% add node to Probability tree
node=cell(1,1);
% compute the conditional probability of every value of each attribute
for j=1:size(D,2)-1
% use a separate variable here: reusing 'rows' would clobber the class
% values still needed by the outer loop
vals=unique(D(:,j),'rows');
subNode=cell(1,2);
% sort the values
vals=sortrows(vals);
for k=1:size(vals,1)
subD=D;
subNode{k,1}=vals(k,1);
r=find(D(:,j)~=vals(k,1));
subD(r,:)=[];
subNode{k,2}=size(subD,1)/size(D,1);
end
node{j,1}=subNode;
end
ProbabilityTree{i,2}=node;
end
result={ProbabilityTree,attrs,classA};
end
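Because values are encoded in order of first appearance, 'no' is mapped to 1 and 'yes' to 2 for this training set, so PT{1,1} should come out as 5/14 and PT{2,1} as 9/14; each PT{i,2}{j,1} cell then holds (encoded value, conditional probability) pairs for attribute j. Two of these entries can be checked independently in Python:

```python
from fractions import Fraction as F

# Class labels and the 'age' column of the 14-row training set, in file order.
labels = ['no','no','yes','yes','yes','no','yes','no','yes','yes','yes','yes','yes','no']
ages   = ['youth','youth','middleaged','senior','senior','senior','middleaged',
          'youth','youth','senior','youth','middleaged','middleaged','senior']

# Prior of the first class ('no'): the value stored in PT{1,1}
prior_no = F(labels.count('no'), len(labels))  # 5/14

# P(age = 'youth' | class = 'no'): one entry of PT{1,2}{1,1}
youth_given_no = F(sum(1 for a, c in zip(ages, labels) if a == 'youth' and c == 'no'),
                   labels.count('no'))  # 3/5
```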
The ErgodicPT function is implemented as follows:
function result=ErgodicPT(PT,attributeList,tuple)
% translate the tuple's attribute values into integer codes
t=zeros(1,1);
for i=1:size(tuple,2)
for j=1:size(attributeList{i,2},1)
if strcmp(attributeList{i,2}{j,1},tuple{2,i})
t(1,i)=j;
break;
end
end
end
% compute the posterior score for each class
r=zeros(1,2);
for i=1:size(PT,1)
r(i,1)=i;
R=1;
for j=1:size(t,2)
flag=0;
for k=1:size(PT{i,2}{j,1},1)
if PT{i,2}{j,1}{k,1}==t(1,j)
R=R*PT{i,2}{j,1}{k,2};
flag=1;
break;
end
end
% this tuple value never co-occurred with this class: the product collapses
% to 0 (no Laplace smoothing is applied)
if flag==0
R=0;
end
end
R=R*PT{i,1};
r(i,2)=R;
end
r=sortrows(r,-2);
result=attributeList{size(attributeList,1),2}{r(1,1),1};
end
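As a sanity check on ErgodicPT, the two posterior scores for the first tuple to classify, X = (youth, high, student=no, fair), can be worked out by hand from the 14 training tuples; in exact fractions (Python):

```python
from fractions import Fraction as F

# Class priors counted from the 14-row training set
p_yes, p_no = F(9, 14), F(5, 14)

# Posterior scores for X = (youth, high, student=no, fair):
# P(youth|yes)=2/9, P(high|yes)=2/9, P(no|yes)=3/9, P(fair|yes)=6/9
post_yes = p_yes * F(2, 9) * F(2, 9) * F(3, 9) * F(6, 9)
# P(youth|no)=3/5, P(high|no)=2/5, P(no|no)=4/5, P(fair|no)=2/5
post_no  = p_no  * F(3, 5) * F(2, 5) * F(4, 5) * F(2, 5)

prediction = 'yes' if post_yes > post_no else 'no'  # -> 'no'
```

Since post_no = 24/875 ≈ 0.0274 exceeds post_yes = 4/567 ≈ 0.0071, the first output row is classified 'no', matching the results file at the end of this post.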
Format of the TrainingSet.txt training data; copy the block below and save it as a .txt file:
age income student creditrating buys-computer
youth high no fair no
youth high no excellent no
middleaged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middleaged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middleaged medium no excellent yes
middleaged high yes fair yes
senior medium no excellent no
NaiveBaysian.txt, the data to classify; copy the block below and save it as a .txt file:
age income student creditrating
youth high no fair
youth high no excellent
middleaged high no fair
senior medium no fair
senior low yes fair
senior low yes excellent
middleaged low yes excellent
youth medium no fair
youth low yes fair
senior medium yes fair
youth medium yes excellent
middleaged medium no excellent
middleaged high yes fair
senior medium no excellent
Classification results, for reference:
age income student creditrating buys-computer
youth high no fair no
youth high no excellent no
middleaged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent yes
middleaged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middleaged medium no excellent yes
middleaged high yes fair yes
senior medium no excellent no