Date: 2022/5/11
For numeric attributes we again use the iris dataset, which needs no further introduction here; for nominal attributes we use the weather.arff dataset, shown below. It consists of five attributes, the last of which (whether to go out and play) is the decision attribute. All attributes are nominal and the dataset is small, which makes it well suited to learning the algorithm.
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
Naive Bayes is a classification algorithm built on Bayes' theorem, which states:
P(H|X) = \frac{P(X|H)\,P(H)}{P(X)} \tag{1}
Here X is the measured value of an n-attribute tuple and H is some hypothesis. P(H|X) is the posterior probability of H conditioned on X; in the weather dataset, for example, the posterior probability of playing given a sunny day is P(yes|sunny). P(H) is the prior probability, e.g. the probability of playing, P(yes). The prior takes fewer factors into account; put differently, the posterior is based on more information than the prior. The prior P(H) is independent of X.
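To make this concrete, P(yes|sunny) can be obtained both by direct counting and via Bayes' theorem, and the two must agree. A minimal sketch, with the counts read off the 14 rows above (the class and method names are illustrative):

```java
public class BayesCheck {
    // P(yes|sunny) computed directly: 2 "yes" rows among the 5 sunny rows
    static double posteriorDirect() {
        return 2.0 / 5.0;
    }

    // The same value via Bayes' theorem: P(sunny|yes) * P(yes) / P(sunny)
    static double posteriorBayes() {
        double pSunnyGivenYes = 2.0 / 9.0;  // 2 sunny rows among the 9 "yes" rows
        double pYes = 9.0 / 14.0;           // 9 "yes" rows out of 14
        double pSunny = 5.0 / 14.0;         // 5 sunny rows out of 14
        return pSunnyGivenYes * pYes / pSunny;
    }

    public static void main(String[] args) {
        System.out.println(posteriorDirect()); // 0.4
        System.out.println(posteriorBayes());  // 0.4
    }
}
```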
So how does naive Bayes use Bayes' theorem to classify?
Let X = x1 ∧ x2 ∧ ⋯ ∧ xm denote a combination of attribute values, e.g. outlook=sunny ∧ temperature=hot ∧ humidity=high ∧ windy=false, corresponding to one data row. Let Di be a hypothesis about playing, where i ranges over the two classes yes/no. Then (1) gives:
P(D_i|X) = \frac{P(X|D_i)\,P(D_i)}{P(X)} \tag{2}
Given the observed data X, we can compute the probability of each hypothesis; the one with the largest probability is the most likely to hold, and we call it the maximum a posteriori hypothesis. Since P(X) is the same constant for every hypothesis, it suffices to maximize P(X|D_i)P(D_i). If the priors P(D_i) are unknown, all hypotheses are assumed equally likely, and we simply maximize P(X|D_i).
To reduce the cost of computing P(X|D_i), naive Bayes makes the "naive" assumption of class-conditional independence: all attributes are assumed mutually independent given the class. Therefore:
P(X|D_i) = \prod_{j=1}^m P(x_j|D_i) = P(x_1|D_i)\,P(x_2|D_i)\ldots P(x_m|D_i) \tag{3}
Then (2) becomes:
P(D_i|X) = \frac{P(D_i)\prod_{j=1}^m P(x_j|D_i)}{P(X)} \tag{4}
Taking logarithms turns the product into a sum, giving the final prediction function:
D(X) = \arg\max_{1\le i \le k} P(D_i|X) = \arg\max_{1\le i \le k} P(D_i)\prod_{j=1}^m P(x_j|D_i) = \arg\max_{1\le i \le k}\Big(\log P(D_i) + \sum_{j=1}^m \log P(x_j|D_i)\Big) \tag{5}
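As a worked example of prediction function (5), consider classifying X = (sunny, cool, high, TRUE) using the raw (unsmoothed) relative frequencies from the weather data above. A minimal stdlib-only sketch (the class name, method name, and hard-coded frequency tables are illustrative):

```java
public class MapDecision {
    // Returns the MAP class index for X = (sunny, cool, high, TRUE): 0 = yes, 1 = no
    static int predict() {
        double[] prior = {9.0 / 14.0, 5.0 / 14.0}; // P(yes), P(no)
        double[][] cond = {
            {2.0 / 9, 3.0 / 9, 3.0 / 9, 3.0 / 9},  // P(sunny|yes), P(cool|yes), P(high|yes), P(TRUE|yes)
            {3.0 / 5, 1.0 / 5, 4.0 / 5, 3.0 / 5}   // same conditionals given "no"
        };
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < 2; i++) {
            // log P(D_i) + sum_j log P(x_j|D_i), as in formula (5)
            double score = Math.log(prior[i]);
            for (double p : cond[i]) {
                score += Math.log(p);
            }
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(predict() == 0 ? "yes" : "no"); // prints "no"
    }
}
```

The "no" scores higher here (roughly 0.0206 vs. 0.0053 before taking logs), so the instance is classified as no.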
When evaluating (3), if any single conditional probability is 0, the entire product becomes 0 and the computation breaks down. To avoid such zero probabilities, the French mathematician Laplace proposed a smoothing technique, Laplace smoothing, given by:
P^L(x_j|D_i) = \frac{nP(x_j D_i) + 1}{nP(D_i) + v_j} \tag{6}
where n is the number of instances and v_j is the number of possible values of attribute j. Adding 1 to the numerator eliminates the zero-probability case.
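For instance, P(overcast|no) is 0/5 in the weather data; formula (6), with v_j = 3 possible outlook values, smooths it to 1/8. A tiny sketch (names are illustrative; nP(x_j D_i) is the within-class value count and nP(D_i) is the class size):

```java
public class LaplaceDemo {
    // Formula (6): (count of value within class + 1) / (class size + number of values)
    static double smoothed(int valueCount, int classCount, int numValues) {
        return (valueCount + 1.0) / (classCount + numValues);
    }

    public static void main(String[] args) {
        // P(overcast|no): 0 occurrences among 5 "no" rows, 3 outlook values -> 1/8
        System.out.println(smoothed(0, 5, 3)); // 0.125
    }
}
```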
For numeric data we cannot use a point probability such as P(humidity=87): the probability that humidity is exactly 87 is vanishingly small, and the frequency of such a small-probability event tends to 0 over a large number of random trials. An interval probability such as P(80<humidity<90), however, is meaningful. We therefore assume the attribute follows a Gaussian distribution and use its probability density instead:
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{7}
Substituting into (5) gives:
D(X) = \arg\max_{1\le i \le k}\Big(\log P(D_i) + \sum_{j=1}^m\Big(-\log\sigma_{ij} - \frac{(x_j-\mu_{ij})^2}{2\sigma_{ij}^2}\Big)\Big) \tag{8}
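Each summand in (8) can be computed directly: a value at the class mean contributes more to the score than one a standard deviation away. A minimal sketch (class and method names are illustrative):

```java
public class GaussianTerm {
    // One per-attribute term of formula (8): -log(sigma) - (x - mu)^2 / (2 sigma^2)
    static double logDensityTerm(double x, double mu, double sigma) {
        return -Math.log(sigma) - (x - mu) * (x - mu) / (2 * sigma * sigma);
    }

    public static void main(String[] args) {
        System.out.println(logDensityTerm(5.0, 5.0, 1.0)); // 0.0  (at the mean)
        System.out.println(logDensityTerm(6.0, 5.0, 1.0)); // -0.5 (one sigma away)
    }
}
```

Note that the constant term -log(sqrt(2*pi)) of (7) is dropped: it is the same for every class, so it cannot change the arg max.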
Model construction
/**
* MyNaiveBayes.java
*
* @author zjy
* @date 2022/5/10
* @Description:
* @version V1.0
*/
package swpu.zjy.ML.NB;
import weka.core.Instance;
import weka.core.Instances;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Arrays;
public class MyNaiveBayes {
/**
 * Inner class storing the mean mu and standard deviation sigma of a
 * Gaussian distribution, used for Naive Bayes on numeric datasets.
 */
private class GaussianParamters {
double mu;
double sigma;
public GaussianParamters(double mu, double sigma) {
this.mu = mu;
this.sigma = sigma;
}
@Override
public String toString() {
return "GaussianParamters{" +
"mu=" + mu +
", sigma=" + sigma +
'}';
}
}
//Nominal attributes
public static final int NOMINAL = 0;
//Numeric attributes
public static final int NUMERICAL = 1;
//Attribute type of the dataset
private int dataType = NOMINAL;
//The dataset
Instances dataset;
//Number of values of the decision attribute
int numClasses;
//Number of instances in the dataset
int numInstances;
//Number of conditional attributes
int numConditions;
//Predicted labels, length = numInstances
int[] predicts;
//Class distribution of the decision attribute, length = numClasses
double[] classDistribution;
//Laplacian-smoothed class distribution, length = numClasses
double[] classDistributionLaplacian;
/**
 * Counts of conditional attribute values, used to compute P(x_j|D_i).
 * First dimension: decision class, length = numClasses
 * Second dimension: conditional attribute, length = numConditions
 * Third dimension: occurrence count of each value, length = number of possible values
 */
double[][][] conditionalCounts;
/**
 * Laplacian-smoothed conditional probabilities; dimensions as above,
 * the third dimension holding the probability of each attribute value.
 */
double[][][] conditionalLaplacianOdds;
/**
 * Gaussian parameters, used to predict numeric data.
 * First dimension: decision class, length = numClasses
 * Second dimension: conditional attribute, length = numConditions
 */
GaussianParamters[][] gaussianParameters;
/**
 * Constructor: read the dataset from the given file path and
 * initialize the dataset object and related parameters.
 *
 * @param dataSetFileName path of the dataset file
 */
public MyNaiveBayes(String dataSetFileName) {
try {
FileReader fileReader = new FileReader(dataSetFileName);
dataset = new Instances(fileReader);
fileReader.close();
} catch (Exception e) {
e.printStackTrace();
}
dataset.setClassIndex(dataset.numAttributes() - 1);
//Initialize dataset parameters
numClasses = dataset.numClasses();
numConditions = dataset.numAttributes() - 1;
numInstances = dataset.numInstances();
}
/**
 * Set the attribute type of the dataset.
 *
 * @param dataType the attribute type (NOMINAL or NUMERICAL)
 */
public void setDataType(int dataType) {
this.dataType = dataType;
}
/**
 * Compute the class distribution and its Laplacian-smoothed version.
 */
public void calculateClassDistribution() {
//step1. Initialize the vectors
classDistribution = new double[numClasses];
classDistributionLaplacian = new double[numClasses];
//step2. Count the occurrences of each class
double[] tempCnt = new double[numClasses];
int tempType;
for (int i = 0; i < numInstances; i++) {
tempType = (int) dataset.instance(i).classValue();
tempCnt[tempType]++;
}
//step3. Compute the class distribution and its Laplacian smoothing
for (int i = 0; i < numClasses; i++) {
classDistribution[i] = tempCnt[i] / numInstances;
classDistributionLaplacian[i] = (tempCnt[i] + 1) / (numInstances + numClasses);
}
System.out.println("Class distribution: " + Arrays.toString(classDistribution));
System.out.println("Class distribution Laplacian: " + Arrays.toString(classDistributionLaplacian));
}
/**
 * Compute the conditional probabilities P(x_j|D_i).
 */
public void calculateConditionalOdds() {
//step1. Initialize the arrays; the last dimension is allocated later
conditionalCounts = new double[numClasses][numConditions][];
conditionalLaplacianOdds = new double[numClasses][numConditions][];
//step2. Allocate the last dimension according to the number of values of each attribute
int tempCnt;
for (int i = 0; i < numClasses; i++) {
for (int j = 0; j < numConditions; j++) {
tempCnt = dataset.attribute(j).numValues();
conditionalCounts[i][j] = new double[tempCnt];
conditionalLaplacianOdds[i][j] = new double[tempCnt];
}
}
//step3. Count the occurrences of each conditional attribute value in the training set
int[] tempClassCount = new int[numClasses];
int tempClass, tempValue;
for (int i = 0; i < numInstances; i++) {
tempClass = (int) dataset.instance(i).classValue();
tempClassCount[tempClass]++;
for (int j = 0; j < numConditions; j++) {
tempValue = (int) dataset.instance(i).value(j);
conditionalCounts[tempClass][j][tempValue]++;
}
}
//step4. Compute the Laplacian-smoothed conditional probabilities
for (int i = 0; i < numClasses; i++) {
for (int j = 0; j < numConditions; j++) {
int tempNumvalue = dataset.attribute(j).numValues();
for (int k = 0; k < tempNumvalue; k++) {
conditionalLaplacianOdds[i][j][k] = (conditionalCounts[i][j][k] + 1) / (tempClassCount[i] + tempNumvalue);
}
}
}
System.out.println("Conditional probabilities: " + Arrays.deepToString(conditionalLaplacianOdds));
}
/**
 * Compute the Gaussian parameters mu and sigma for numeric data.
 */
public void calculateGausssianParameters() {
gaussianParameters = new GaussianParamters[numClasses][numConditions];
double[] tempValuesArray = new double[numInstances];
int tempNumValues = 0;
double tempSum = 0;
for (int i = 0; i < numClasses; i++) {
for (int j = 0; j < numConditions; j++) {
tempSum = 0;
//Sum the values of this class and count them
tempNumValues = 0;
for (int k = 0; k < numInstances; k++) {
if ((int) dataset.instance(k).classValue() != i) {
continue;
}
tempValuesArray[tempNumValues] = dataset.instance(k).value(j);
tempSum += tempValuesArray[tempNumValues];
tempNumValues++;
}
//Compute the mean
double tempMu = tempSum / tempNumValues;
//Compute the variance, then the standard deviation
double tempSigma = 0;
for (int k = 0; k < tempNumValues; k++) {
tempSigma += (tempValuesArray[k] - tempMu) * (tempValuesArray[k] - tempMu);
}
tempSigma /= tempNumValues;
tempSigma = Math.sqrt(tempSigma);
gaussianParameters[i][j] = new GaussianParamters(tempMu, tempSigma);
}
}
System.out.println(Arrays.deepToString(gaussianParameters));
}
/**
 * Classify an instance with nominal attributes.
 *
 * @param instance the data tuple
 * @return the predicted label
 */
public int classifyNominal(Instance instance) {
//The maximum log-score so far
double tempMaxOdds = -Double.MAX_VALUE;
//The label achieving it
int classIndex = 0;
for (int i = 0; i < numClasses; i++) {
//log Pl(Di)
double tempClassfiyOdds = Math.log(classDistributionLaplacian[i]);
for (int j = 0; j < numConditions; j++) {
int tempConditionValue = (int) instance.value(j);
//+= log Pl(xj|Di)
tempClassfiyOdds += Math.log(conditionalLaplacianOdds[i][j][tempConditionValue]);
}
if (tempClassfiyOdds > tempMaxOdds) {
tempMaxOdds = tempClassfiyOdds;
classIndex = i;
}
}
return classIndex;
}
/**
 * Classify an instance with numeric attributes. Same idea as the nominal
 * case, except the probability density p replaces the probability P.
 *
 * @param instance the data tuple
 * @return the predicted label
 */
public int classifyNumerical(Instance instance) {
// Find the biggest log-score
double tempBiggest = -Double.MAX_VALUE;
int resultBestIndex = 0;
for (int i = 0; i < numClasses; i++) {
double tempClassProbabilityLaplacian = Math.log(classDistributionLaplacian[i]);
double tempPseudoProbability = tempClassProbabilityLaplacian;
//Accumulate the log-density terms
for (int j = 0; j < numConditions; j++) {
double tempAttributeValue = instance.value(j);
double tempSigma = gaussianParameters[i][j].sigma;
double tempMu = gaussianParameters[i][j].mu;
tempPseudoProbability += -Math.log(tempSigma) - (tempAttributeValue - tempMu)
* (tempAttributeValue - tempMu) / (2 * tempSigma * tempSigma);
}
if (tempBiggest < tempPseudoProbability) {
tempBiggest = tempPseudoProbability;
resultBestIndex = i;
}
}
return resultBestIndex;
}
/**
 * Classify one instance, dispatching on the attribute type.
 *
 * @param paraInstance the data tuple to classify
 * @return the predicted label
 */
public int classify(Instance paraInstance) {
if (dataType == NOMINAL) {
return classifyNominal(paraInstance);
} else if (dataType == NUMERICAL) {
return classifyNumerical(paraInstance);
}
return -1;
}
/**
 * Classify every instance in the dataset.
 */
public void classify() {
predicts = new int[numInstances];
for (int i = 0; i < numInstances; i++) {
predicts[i] = classify(dataset.instance(i));
}
}
/**
 * Compute the accuracy.
 *
 * @return the accuracy
 */
public double computeAccuracy() {
double tempCorrect = 0;
for (int i = 0; i < numInstances; i++) {
if (predicts[i] == (int) dataset.instance(i).classValue()) {
tempCorrect++;
}
}
double resultAccuracy = tempCorrect / numInstances;
return resultAccuracy;
}
/**
 * Test classification on nominal data.
 */
public static void testNominal() {
System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
String tempFilename = "E:\\DataSet\\weather.arff";
MyNaiveBayes tempLearner = new MyNaiveBayes(tempFilename);
tempLearner.setDataType(NOMINAL);
tempLearner.calculateClassDistribution();
tempLearner.calculateConditionalOdds();
tempLearner.classify();
System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
}
/**
 * Test classification on numeric data.
 */
public static void testNumerical() {
System.out.println(
"Hello, Naive Bayes. I only want to test the numerical data with Gaussian assumption.");
String tempFilename = "E:\\JAVA项目\\mytest\\src\\main\\java\\swpu\\zjy\\ML\\DataSet\\iris.arff";
MyNaiveBayes tempLearner = new MyNaiveBayes(tempFilename);
tempLearner.setDataType(NUMERICAL);
tempLearner.calculateClassDistribution();
tempLearner.calculateGausssianParameters();
tempLearner.classify();
System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
}
public static void main(String[] args) {
// testNominal();
testNumerical();
}
}