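The naive Bayes classifier applies to learning tasks in which each instance x is described by a conjunction of attribute values and the target function f(x) takes its values from a finite set V. The learner is given a set of training examples of the target function, together with a new instance described by the tuple of attribute values <a1, a2, ..., an>, and is asked to predict the target value (the classification) of the new instance.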
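The Bayesian approach to classifying the new instance is to find the most probable target value, v_MAP, given the attribute values <a1, a2, ..., an> that describe the instance.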
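The definition above can be written out with Bayes' theorem. The following is a sketch of the standard derivation; the symbol v_NB for the naive Bayes decision is a name introduced here and does not appear in the original text. The denominator is dropped because it does not depend on v_j, and the last line applies the assumption that the attributes are conditionally independent given the target value.

```latex
\begin{aligned}
v_{MAP} &= \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) \\
        &= \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} \\
        &= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j) \\
v_{NB}  &= \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)
\end{aligned}
```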
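Stated as a classification problem: given the sets C = {y1, y2, ..., yn} and I = {x1, x2, ..., xm, ...}, find a mapping rule y = f(x) such that for every xi ∈ I there is exactly one yi ∈ C for which yi = f(xi) holds.
The naive Bayes classifier in MATLAB (NaiveBayes class)
MATLAB R2016b: NaiveBayes class (will be removed in a future release)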
A NaiveBayes object defines a Naive Bayes classifier. A Naive Bayes classifier assigns a new observation to the most probable class, assuming the features are conditionally independent given the class value.
NaiveBayes: Create a NaiveBayes object;
disp: Display a NaiveBayes classifier object;
fit: Create a NaiveBayes object by fitting training data;
posterior: Compute the posterior probability of each class for test data;
predict: Predict the class labels of test data;
subsasgn: Subscripted assignment for a NaiveBayes object;
subsref: Subscripted reference for a NaiveBayes object
ClsNonEmpty: Flag for non-empty classes
ClassLevels: Class levels (the classes found in the data)
Dist: Distribution names of the features
'normal' Normal distribution
'kernel' Kernel smoothing density estimate
'mvmn' Multivariate multinomial distribution
'mn' Multinomial bag-of-tokens model
NClasses: Number of classes
NDims: Number of dimensions (features)
Params: Parameter estimates
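Params is an NClasses-by-NDims cell array (each cell holds an array) that contains the parameter estimates, excluding the class priors. Params{i,j} holds the parameter estimates for the jth feature in the ith class. Its content depends on the distribution:
'normal' A vector of length two; the first element is the mean of the normal distribution and the second is the standard deviation;
'kernel' A ProbDistUnivKernel object
'mvmn' A vector containing the probability of each possible value of the jth feature in the ith class
'mn' A scalar representing the probability of the jth token appearing in the ith class, Prob(token j | class i). It is estimated as (1 + the number of occurrences of token j in class i)/(NDims + the total number of token occurrences in class i).
Prior: Class priors, a vector of length NClasses containing the probability of each class in the sample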
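The 'mn' estimate described above is an add-one (Laplace) smoothed relative frequency. The following is a minimal sketch of that formula in plain MATLAB, not the toolbox's internal code; the count matrix X, the labels y, and the variable name tokenProb are hypothetical and chosen only for illustration.

```matlab
% Hypothetical toy data: 4 documents, 3 tokens (NDims = 3), 2 classes.
X = [2 1 0;                    % token counts of document 1
     1 2 0;                    % token counts of document 2
     0 1 3;                    % token counts of document 3
     0 0 4];                   % token counts of document 4
y = [1; 1; 2; 2];              % class label of each document

classes = unique(y);
NDims = size(X, 2);
tokenProb = zeros(numel(classes), NDims);
for i = 1:numel(classes)
    % Occurrences of each token in class i
    counts = sum(X(y == classes(i), :), 1);
    % (1 + occurrences of token j in class i) / (NDims + all token occurrences in class i)
    tokenProb(i, :) = (1 + counts) ./ (NDims + sum(counts));
end
disp(tokenProb)
```

Each row of tokenProb sums to 1, and a token that never occurs in a class (token 3 in class 1 here) still receives a small non-zero probability, which is the point of adding one to every count.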
1. The normal distribution case ('normal')
load fisheriris
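X = meas(:,3:4); % Use the 3rd and 4th features (petal length and width) as training features
Y = species; % Y holds the class labels
tabulate(Y); % Tabulate the count and percentage of each species in the sample
NBModel = fitNaiveBayes(X,Y); % Train a NaiveBayes object with the default 'normal' distribution
setosaIndex = strcmp(NBModel.ClassLevels,'setosa'); % Index of the parameters for class setosa
estimates = NBModel.Params{setosaIndex,1} % Parameter estimates of the first feature
gscatter(X(:,1),X(:,2),Y); % Scatter plot grouped by class
h = gca; % Handle to the current axes
xylim = [h.XLim h.YLim];
hold on; % Keep the current plot
Params = cell2mat(NBModel.Params); % Convert the cell array of parameters to a matrix
Mu = Params(2*(1:3)-1,1:2); % Extract the mean of each normal distribution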
Sigma = zeros(2,2,3);
for j = 1:3
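    % mvnpdf expects a covariance matrix, so the standard deviations are squared
    Sigma(:,:,j) = diag(Params(2*j,:)).^2;
    % ezcontour: draws the contour plot of the function passed to it
    % mvnpdf: multivariate normal probability density function
    % The handle @(x1,x2)mvnpdf([x1,x2],Mu(j,:),Sigma(:,:,j)) is the bivariate normal density of class j
    % (two features are used so that the density can be drawn); the plotting range xylim is assumed here
    ezcontour(@(x1,x2)mvnpdf([x1,x2],Mu(j,:),Sigma(:,:,j)),xylim); % Draws contours for the multivariate normal distributions
end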
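To make the role of the 'normal' parameters concrete, here is a minimal sketch of how class posteriors can be computed by hand from per-feature means, standard deviations, and class priors. It uses plain MATLAB only. The numbers in Mu, Sd, Prior, and x are hypothetical and are not fitted to the iris data, and the code is an illustration of the naive Bayes rule rather than what predict does internally.

```matlab
% Hypothetical parameters for two classes and two features (not fitted to any data).
Mu    = [1.5 0.2; 4.3 1.3];    % Mu(i,j): mean of feature j in class i
Sd    = [0.2 0.1; 0.5 0.2];    % Sd(i,j): standard deviation of feature j in class i
Prior = [0.5 0.5];             % class priors
x     = [1.4 0.3];             % new observation to classify

logPost = zeros(1, 2);
for i = 1:2
    % Per-feature log normal densities; independence lets us simply add them
    logLik = -0.5*log(2*pi) - log(Sd(i,:)) - 0.5*((x - Mu(i,:))./Sd(i,:)).^2;
    logPost(i) = log(Prior(i)) + sum(logLik);
end
post = exp(logPost - max(logPost));   % subtract the maximum for numerical stability
post = post / sum(post);              % normalize so the posteriors sum to 1
[~, predicted] = max(post);           % index of the most probable class
disp(post)
disp(predicted)
```

Working in log space avoids underflow when many features are multiplied together, which is why the densities are summed as logarithms before normalizing.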
load fisheriris
X = meas;
Y = species;
NBModel1 = fitNaiveBayes(X,Y);
NBModel1.ClassLevels % Display the class order
predictLabels1 = predict(NBModel1,X); % Predict the labels of the training sample with the trained model
[ConfusionMat1,labels] = confusionmat(Y,predictLabels1) % Compare the true labels with the predicted labels
%Element (j, k) of ConfusionMat1 represents the number of observations that the software classifies as k, but the data show as being in class j.
NBModel2 = fitNaiveBayes(X,Y,...
3. Naive Bayes Classifiers Using Multinomial Predictors
n = 1000;
rng(1); % For reproducibility
y = randsample([-1 1],n,true); % Random labels, sampled with replacement
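To build the predictor data, suppose the vocabulary contains 5 tokens and each email contains 20 tokens, whose counts serve as the features. The token frequencies in spam emails differ from those in non-spam emails.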
tokenProbs = [0.2 0.3 0.1 0.15 0.25;...
0.4 0.1 0.3 0.05 0.15]; % Token relative frequencies
tokensPerEmail = 20;
X = zeros(n,5);
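%% Randomly generate the email data
% mnrnd draws random numbers from a multinomial distribution
%% r = mnrnd(n,p)
% returns a random vector r drawn from the multinomial distribution with parameters n and p,
% where n is the number of trials for each draw (so every element of r is at most n);
% p is a 1-by-k vector of outcome probabilities, which must sum to 1;
% r is a 1-by-k row vector holding how many of the n trials fell into each of the k outcomes;
%% R = mnrnd(n,p,m) returns m such 1-by-k vectors as the rows of an m-by-k matrix;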
X(y==1,:) = mnrnd(tokensPerEmail,tokenProbs(1,:),sum(y==1));
X(y==-1,:) = mnrnd(tokensPerEmail,tokenProbs(2,:),sum(y==-1));
%% Train the classifier
NBModel = fitNaiveBayes(X,y,'Distribution','mn');
% Compute the in-sample misclassification rate
predSpam = predict(NBModel,X);
misclass = sum(y'~=predSpam)/n
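The decision made by a multinomial naive Bayes model can also be sketched by hand. The example below reuses the token frequencies from tokenProbs above as if they were the fitted parameters; the equal priors and the count vector x are hypothetical, and the code illustrates the decision rule rather than the toolbox implementation.

```matlab
% Token probabilities per class (rows), copied from tokenProbs in the example above
tokenProbs = [0.2 0.3 0.1 0.15 0.25;
              0.4 0.1 0.3 0.05 0.15];
classes = [1 -1];          % row 1 was used for y == 1, row 2 for y == -1
prior   = [0.5 0.5];       % assumed equal class priors (hypothetical)
x = [8 2 6 1 3];           % hypothetical token counts of one email (20 tokens)

% log P(class) + sum_j x_j * log P(token j | class).
% The multinomial coefficient is identical for every class, so it is omitted.
logPost = log(prior) + x * log(tokenProbs)';
[~, idx] = max(logPost);
predictedLabel = classes(idx);
disp(predictedLabel)
```

Because tokens 1 and 3 dominate this email and both are more frequent in the second row of tokenProbs, the second class receives the larger log-posterior.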
Input Arguments
X: predictor data
Y: Class labels
'Distribution': Data distributions
'kernel': Kernel smoothing density estimate
'mn': Multinomial distribution. If you specify mn, then all features are components of a multinomial distribution. Therefore, you cannot include 'mn' as an element of a cell array of character vectors
'mvmn': Multivariate multinomial distribution
'normal': Normal (Gaussian) distribution.
'KSSupport': Kernel smoothing density support
'unbounded' (default) | 'positive' | cell array | numeric row vector
'KSType': Kernel smoother type
'normal' (default) | 'box' | 'epanechnikov' | 'triangle' | cell array of character vectors
'KSWidth': Kernel smoothing window bandwidth
'Prior': Class prior probabilities
'empirical': uses the class relative frequencies distribution for the prior probabilities
'numeric vector': A numeric vector of length K specifying the prior probabilities for each class. The order of the elements of Prior should correspond to the order of the class levels
'structure array': A structure array S containing class levels and their prior probabilities. S must have two fields:
S.prob: A numeric vector of prior probabilities. The software normalizes prior probabilities to sum to 1.
S.group: A vector of the same type as Y containing unique class levels indicating the class for the corresponding element of S.prob. S.group must contain all the K levels in Y. It can also contain classes that do not appear in Y. This can be useful if X is a subset of a larger training set. The software ignores any classes that appear in S.group but not in Y.
'uniform': The prior probabilities are equal for all classes.
Output Arguments
'NBModel': Trained naive Bayes classifier.