Naive Bayes Classification (Naive Bayesian)

This algorithm is implemented following the naive Bayesian classification described in Data Mining: Concepts and Techniques, 3rd Edition (Jiawei Han). The classification process breaks down into four steps (only a brief outline is given here; see the book for the detailed formulas), and the core decision rule is shown after the list:

(1) Build the matrix of training tuples and their corresponding class labels

(2) Compute the prior probability of each class

(3) Compute the conditional probability of each attribute value given each class

(4) Predict the class label by choosing the class with the largest posterior probability
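
In the book's notation, for a tuple X = (x1, ..., xn) and class Ci, the classifier picks the class that maximizes the posterior

P(Ci|X) = P(X|Ci)P(Ci) / P(X),  with the naive class-conditional independence assumption P(X|Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci).

Since P(X) is identical for every class, it suffices to compare P(X|Ci)P(Ci), which is exactly what the code below computes.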


Main program of the Naive Bayes classification algorithm:
clc;
clear;
%%%% First step: construct the probability tree. The training tuples are
%%%% loaded inside ConstructProbability.m; note that if you want to
%%%% classify a different data set, you must retrain on it first.
% 'result' contains the probability tree, the attribute list, and the
% class attribute name
result=ConstructProbability();
PT=result{1,1};
attributeList=result{1,2};
classAttr=result{1,3};
%%%% Second step: load the tuples to be classified
% read the tuple file
fileID = fopen('D:\matlabFile\NaiveBayesian\NaiveBaysian.txt');
% read as string
D=textscan(fileID,'%s %s %s %s');
fclose(fileID); 
%%%% Third step: classify each tuple
conclusion=cell(1,1);
% use the attribute names (including the class attribute) as the header row
for i=1:size(attributeList,1)
    conclusion{1,i}=attributeList{i,1};
end
% classify every data row (row 1 of D holds the attribute names)
if size(D{1,1},1)>1
    for i=2:size(D{1,1},1)
        tuple=conclusion(1,:);
       for j=1:size(D,2)
           tuple{2,j}=D{1,j}{i,1};
       end 
       % predict the class label for this tuple
       decision=ErgodicPT(PT,attributeList,tuple);
       tuple{2,j+1}=decision;
       % append the classified row to the results
       conclusion(size(conclusion,1)+1,:)=tuple(2,:);
    end
end

%%%% Fourth step: write the classified tuples to conclusion.txt
FID=fopen('conclusion.txt','wt');
for i=1:size(conclusion,1)
    for j=1:size(conclusion,2)
        fprintf(FID, '%s ', conclusion{i,j});
    end
    fprintf(FID,'\n');
end
fclose(FID);
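
Both scripts read from hard-coded absolute paths under D:\matlabFile\NaiveBayesian\. A minimal, more defensive variant (a sketch only, assuming the .txt files sit in MATLAB's current working directory):
% Illustrative sketch: open the tuple file from the current directory and
% fail with a readable error instead of crashing when fopen returns -1
fileID = fopen('NaiveBaysian.txt');
if fileID == -1
    error('Cannot open NaiveBaysian.txt; check the current working directory.');
end
D = textscan(fileID,'%s %s %s %s');
fclose(fileID);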
Implementation of the ConstructProbability function:
% construct the probability tree, attribute list, and class name from the training set
function result=ConstructProbability()
    % read training tuple file
    fileID = fopen('D:\matlabFile\NaiveBayesian\TrainingSet.txt');
    % read as string
    Dataset=textscan(fileID,'%s %s %s %s %s');
    fclose(fileID);
    % designate which attribute is the class label
    classA='buys-computer';
    % attribute list: column 1 holds names, column 2 holds value dictionaries
    attrs={0,0};
    % remember the column index of the class attribute
    id=0;
    % get attribute list from D
    for i=1:size(Dataset,2)
        % find the class attribute
        if strcmp(classA,Dataset{1,i}{1,1})==1
            id=i;
        end
        attrs{i,1}=Dataset{1,i}{1,1};
        % initialize the value dictionary with one empty dummy cell; it is
        % removed below so that a value's code equals its final row position
        attr=cell(1,1);
        for j=2:size(Dataset{1,i},1)
            % check whether this attribute value has been seen before
            flag_attr=0;
            for k=1:size(attr,1)
                if strcmp(attr{k,1},Dataset{1,i}{j,1})
                    % replace the string with its integer code
                    Dataset{1,i}{j,1}=k-1;
                    flag_attr=1;
                    break;
                end
            end
            % if the value is new, append it and assign the next code
            if flag_attr==0
                attr{k+1,1}=Dataset{1,i}{j,1};
                Dataset{1,i}{j,1}=k;
            end
        end
        % drop the dummy first cell so codes align with row positions
        attr(1,:)=[];
        % store the value dictionary for this attribute
        attrs{i,2}=attr;
        % drop the header (attribute-name) row from the data
        Dataset{1,i}(1,:)=[];
    end
    % create a new matrix (the first column is a placeholder, deleted below)
    DS=zeros(size(Dataset{1,1},1),1);
    % convert each cell column to a numeric column and append it
    for i=1:size(Dataset,2)
        DataTemp=cell2mat(Dataset{1,i});
        DS=cat(2,DS,DataTemp);
    end
    DS(:,1)=[];
    % rotate the columns so the class attribute ends up in the last column
    DS=circshift(DS,[0,size(DS,2)-id]);
    % move the class attribute to the last row of the attribute list as well
    p_temp=attrs(id,:);
    attrs(id,:)=[];
    attrs(size(attrs,1)+1,:)=p_temp;    
    % compute the class priors and per-class conditional probabilities
    rows=unique(DS(:,size(DS,2)),'rows');
    % sort the class values so they match the value-dictionary order
    rows=sortrows(rows);
    ProbabilityTree=cell(1,2);
    for i=1:size(rows,1)
        D=DS;
        % keep only the training tuples that belong to class i
        r=find(DS(:,size(DS,2))~=rows(i,1));
        D(r,:)=[];
        % prior probability P(Ci)
        ProbabilityTree{i,1}=size(D,1)/size(DS,1);
        % node holds the conditional probability table for each attribute
        node=cell(1,1);
        % compute P(value | class i) for every value of each non-class attribute
        for j=1:size(D,2)-1
            vals=unique(D(:,j),'rows');
            subNode=cell(1,2);
            % sort the values so indices match the value dictionary order
            vals=sortrows(vals);
            for k=1:size(vals,1)
                subD=D;
                subNode{k,1}=vals(k,1);
                % keep only the class-i tuples that take this value
                r=find(D(:,j)~=vals(k,1));
                subD(r,:)=[];
                subNode{k,2}=size(subD,1)/size(D,1);
            end
            node{j,1}=subNode;
        end
        ProbabilityTree{i,2}=node;
    end
    result={ProbabilityTree,attrs,classA};
end
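
One caveat from the book: subNode{k,2} stores a raw relative frequency, so an attribute value that never co-occurs with a class gets conditional probability 0 and zeroes out the whole product in ErgodicPT (the flag branch there). The book's Laplacian (add-one) correction avoids this; a minimal sketch with hypothetical counts (count, classCount, and numValues are illustrative, not variables from the code above):
% Hypothetical sketch of the Laplacian correction:
% count      - times this value co-occurs with class Ci
% classCount - number of training tuples in class Ci
% numValues  - number of distinct values of this attribute
count=0; classCount=9; numValues=3;
smoothed=(count+1)/(classCount+numValues)   % about 0.0833 instead of 0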
The ErgodicPT function, which traverses the probability tree to classify one tuple, is implemented as follows:
function result=ErgodicPT(PT,attributeList,tuple)
 % translate the tuple's attribute values into their integer codes
 t=zeros(1,1);
 for i=1:size(tuple,2)
     for j=1:size(attributeList{i,2},1)
         if strcmp(attributeList{i,2}{j,1},tuple{2,i})
             t(1,i)=j;
             break;
         end
     end
 end
 % compute P(X|Ci)P(Ci) for each class
r=zeros(1,2);
for i=1:size(PT,1)
    r(i,1)=i;
    R=1;
    for j=1:size(t,2)
        % look up P(xj | Ci) in this class's probability table
        flag=0;
        for k=1:size(PT{i,2}{j,1},1)
            if PT{i,2}{j,1}{k,1}==t(1,j)
                R=R*PT{i,2}{j,1}{k,2};
                flag=1;
                break;
            end
        end
        % a value never seen with this class contributes probability 0
        if flag==0
            R=0;
        end
    end
    % multiply by the class prior P(Ci)
    R=R*PT{i,1};
    r(i,2)=R;
end
% sort the classes by score in descending order and return the best label
r=sortrows(r,-2);
result=attributeList{size(attributeList,1),2}{r(1,1),1};
end 
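
With only four attributes the product of probabilities stays well above machine precision, but with many attributes it can underflow. A common variant (not used in the code above) sums log-probabilities instead; a minimal sketch, assuming every looked-up probability is positive:
% Illustrative sketch: accumulate log-probabilities instead of a product
p=[3/9 3/9 6/9 3/9];            % example conditional probabilities
prior=9/14;                     % example class prior
logScore=log(prior)+sum(log(p));
score=exp(logScore)             % equals prod(p)*prior, about 0.0159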
Format of the TrainingSet.txt training data; copy it and save it as a .txt file:
age income student creditrating buys-computer
youth high no fair no
youth high no excellent no
middleaged high no fair yes
senior medium no fair yes
senior low yes fair yes
senior low yes excellent no
middleaged low yes excellent yes
youth medium no fair no
youth low yes fair yes
senior medium yes fair yes
youth medium yes excellent yes
middleaged medium no excellent yes
middleaged high yes fair yes
senior medium no excellent no
NaiveBaysian.txt, the data to be classified; copy it and save it as a .txt file:
age income student creditrating
youth high no fair
youth high no excellent
middleaged high no fair
senior medium no fair
senior low yes fair
senior low yes excellent
middleaged low yes excellent
youth medium no fair
youth low yes fair
senior medium yes fair
youth medium yes excellent
middleaged medium no excellent
middleaged high yes fair
senior medium no excellent
The classification results should look like the following:
age income student creditrating buys-computer 
youth high no fair no 
youth high no excellent no 
middleaged high no fair yes 
senior medium no fair yes 
senior low yes fair yes 
senior low yes excellent yes 
middleaged low yes excellent yes 
youth medium no fair no 
youth low yes fair yes 
senior medium yes fair yes 
youth medium yes excellent yes 
middleaged medium no excellent yes 
middleaged high yes fair yes 
senior medium no excellent no 
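
Note that the sixth tuple, (senior, low, yes, excellent), is predicted as yes even though the identical training tuple is labeled no. This can be checked by hand from the training counts (9 yes tuples, 5 no tuples):
% score for 'yes': P(senior|yes)*P(low|yes)*P(student=yes|yes)*P(excellent|yes)*P(yes)
p_yes=(3/9)*(3/9)*(6/9)*(3/9)*(9/14)   % about 0.0159
% score for 'no'
p_no=(2/5)*(1/5)*(1/5)*(3/5)*(5/14)    % about 0.0034
Since p_yes > p_no, the classifier outputs yes.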