Conditional probability: let $A$ and $B$ be two events with $P(A)>0$. Then $P(B|A)=\frac{P(AB)}{P(A)}$ is called the conditional probability of $B$ given that $A$ has occurred.
Independence of events: if $A$ and $B$ are mutually independent, then $P(AB)=P(A)P(B)$.
Bayes' formula: let $B_1,B_2,\cdots,B_n$ be a complete group of events in the sample space (a partition of it), each with nonzero probability. Then for any event $A$ with $P(A)>0$:
$$P(B_j|A)=\frac{P(B_j)P(A|B_j)}{\sum\limits_{i=1}^nP(B_i)P(A|B_i)},\quad j=1,2,\cdots,n$$
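As a numeric illustration of Bayes' formula, here is a minimal Java sketch. The class name `BayesRuleSketch`, the method `posterior`, and the two-machine numbers are hypothetical and are not part of the original code.

```java
import java.util.Arrays;

final class BayesRuleSketch {
    /** Returns P(B_j | A) for every j, given priors P(B_j) and likelihoods P(A | B_j). */
    static double[] posterior(double[] priors, double[] likelihoods) {
        // Denominator: total probability P(A) = sum_i P(B_i) * P(A | B_i).
        double evidence = 0;
        for (int i = 0; i < priors.length; i++) {
            evidence += priors[i] * likelihoods[i];
        }
        double[] result = new double[priors.length];
        for (int j = 0; j < priors.length; j++) {
            result[j] = priors[j] * likelihoods[j] / evidence;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: two machines produce 60% and 40% of the parts,
        // with defect rates of 1% and 2%. A is the event "the part is defective".
        double[] post = posterior(new double[] {0.6, 0.4}, new double[] {0.01, 0.02});
        System.out.println(Arrays.toString(post));
    }
}
```

With these numbers the posteriors are $3/7$ and $4/7$: the second machine produces fewer parts but is the more likely source of a defective one.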
The Naive Bayes algorithm assumes that the features are mutually independent (conditionally on the class, as the factorization below shows). Given a feature vector $\mathbf{x}$, it uses the probability formulas above to compute $P(c|\mathbf{x})$ and selects the class label with the largest probability.
$$P(c|\mathbf{x})=\frac{P(c)P(\mathbf x|c)}{P(\mathbf x)}=\frac{P(c)}{P(\mathbf x)}\prod\limits_{i=1}^dP(x_i|c)$$
where $d$ is the number of attributes and $x_i$ is the value of $\mathbf x$ on the $i$-th attribute.
Since $P(\mathbf x)$ is the same for all classes, we obtain the Bayes decision rule:
$$h_{nb}(\mathbf x)=\arg\max_{c\in \mathbf y}P(c)\prod\limits_{i=1}^dP(x_i|c)$$
where $N$ is the total number of class labels and $\mathbf{y}=\left\{c_1,c_2,\cdots,c_N\right\}$.
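The decision rule can be sketched directly as a loop over the classes. The following Java sketch assumes the probabilities are already known and stored in plain arrays. The class `NaiveBayesRuleSketch`, the method `predict`, and the probability tables are hypothetical illustrations, not the article's implementation.

```java
final class NaiveBayesRuleSketch {
    /**
     * Returns the index of the class maximizing P(c) * prod_i P(x_i | c).
     * prior[c] is P(c); cond[c][i][v] is P(attribute i takes value v | class c).
     */
    static int predict(double[] prior, double[][][] cond, int[] x) {
        int best = 0;
        double bestScore = -1; // Every score is >= 0, so -1 is a safe start.
        for (int c = 0; c < prior.length; c++) {
            double score = prior[c];
            for (int i = 0; i < x.length; i++) {
                score *= cond[c][i][x[i]];
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical two-class problem with two binary attributes.
        double[] prior = {0.5, 0.5};
        double[][][] cond = {
            {{0.9, 0.1}, {0.8, 0.2}}, // class 0
            {{0.3, 0.7}, {0.4, 0.6}}  // class 1
        };
        System.out.println(predict(prior, cond, new int[] {0, 0})); // prints 0
        System.out.println(predict(prior, cond, new int[] {1, 1})); // prints 1
    }
}
```

For $\mathbf x=(0,0)$ the scores are $0.5\cdot0.9\cdot0.8=0.36$ and $0.5\cdot0.3\cdot0.4=0.06$, so class 0 wins. $P(\mathbf x)$ is never computed because it does not affect the arg max.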
With Laplacian smoothing, the probabilities are estimated as:
$$\hat P(c)=\frac{|D_c|+1}{|D|+N}$$
$$\hat P(x_i|c)=\frac{|D_{c,x_i}|+1}{|D_c|+N_i}$$
where $N$ is the number of possible classes in $D$ and $N_i$ is the number of possible values of the $i$-th attribute. Here $D$ is the training set, $D_c$ is the set of samples in $D$ with class $c$, and $D_{c,x_i}$ is the set of samples in $D_c$ whose $i$-th attribute takes the value $x_i$.
These estimates satisfy $\hat P(c)>0$ and $\sum_{i=1}^N\hat P(c_i)=1$.
The same holds for $\hat P(x_i|c)$.
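The two properties (strictly positive estimates that still sum to 1) can be checked on a small count vector. This is a minimal sketch: the class `LaplaceSmoothingSketch`, the method `smooth`, and the counts are hypothetical. The same function covers both estimates above, assuming every sample contributes exactly one count (no missing values).

```java
final class LaplaceSmoothingSketch {
    /** Returns (count + 1) / (total + numOutcomes) for each outcome. */
    static double[] smooth(int[] counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        double[] result = new double[counts.length];
        for (int k = 0; k < counts.length; k++) {
            result[k] = (counts[k] + 1.0) / (total + counts.length);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical counts: the second outcome was never observed.
        double[] smoothed = smooth(new int[] {3, 0, 1});
        double sum = 0;
        for (double p : smoothed) {
            System.out.println(p);
            sum += p;
        }
        System.out.println("Sum: " + sum);
    }
}
```

For the counts $(3,0,1)$ the estimates are $4/7$, $1/7$ and $2/7$. The unobserved outcome gets a small nonzero probability instead of zeroing out the whole product in the decision rule.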
In logarithmic form, the decision rule becomes:
$$h_{nb}(\mathbf x)=\arg\max_{c\in \mathbf y}\left[\log \hat P(c)+\sum\limits_{i=1}^d\log\hat P(x_i|c)\right]$$
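One common motivation for the logarithmic form (not stated explicitly in the text above) is numerical. A product of many small probabilities underflows to zero in double precision, while the sum of their logarithms stays finite. The logarithm is monotonic, so the arg max is unchanged. The sketch below uses a hypothetical class `LogSpaceSketch` and an artificial vector of 400 factors equal to 0.01.

```java
import java.util.Arrays;

final class LogSpaceSketch {
    /** Multiplies the probabilities directly. */
    static double product(double[] probs) {
        double result = 1.0;
        for (double p : probs) {
            result *= p;
        }
        return result;
    }

    /** Sums the logarithms of the probabilities. */
    static double logSum(double[] probs) {
        double result = 0.0;
        for (double p : probs) {
            result += Math.log(p);
        }
        return result;
    }

    public static void main(String[] args) {
        // Artificial input: 400 factors of 0.01, i.e. a true product of 1e-800.
        double[] probs = new double[400];
        Arrays.fill(probs, 0.01);
        System.out.println("Direct product: " + product(probs)); // underflows to 0.0
        System.out.println("Sum of logs:    " + logSum(probs));
    }
}
```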
For numerical attributes under a Gaussian assumption, where $\mu_{ij}$ and $\sigma_{ij}$ are the mean and standard deviation of the $j$-th attribute within class $c_i$:
$$h_{nb}(\mathbf x)=\arg\max_{1\le i\le N}\left[\log \hat P(c_i)+\sum\limits_{j=1}^d\left(-\log\sigma_{ij}-\frac{(x_j-\mu_{ij})^2}{2\sigma^2_{ij}}\right)\right]$$
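The per-attribute term in this rule can be recovered from the Gaussian density. A short derivation, assuming the $j$-th attribute within class $c_i$ follows a normal distribution with mean $\mu_{ij}$ and standard deviation $\sigma_{ij}$:

$$
\begin{aligned}
p(x_j|c_i) &= \frac{1}{\sqrt{2\pi}\,\sigma_{ij}}\exp\left(-\frac{(x_j-\mu_{ij})^2}{2\sigma_{ij}^2}\right)\\
\log p(x_j|c_i) &= -\log\sqrt{2\pi}-\log\sigma_{ij}-\frac{(x_j-\mu_{ij})^2}{2\sigma_{ij}^2}
\end{aligned}
$$

The constant $-\log\sqrt{2\pi}$ is identical for every class, so dropping it does not change the arg max. What remains is exactly $-\log\sigma_{ij}-\frac{(x_j-\mu_{ij})^2}{2\sigma_{ij}^2}$.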
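// Imports required by the NaiveBayes class; the members below all belong to that class.
import java.io.FileReader;
import java.util.Arrays;

import weka.core.Instance;
import weka.core.Instances;

/**
 * An inner class to store parameters.
 */
private class GaussianParameters {
    double mu;
    double sigma;

    public GaussianParameters(double paraMu, double paraSigma) {
        mu = paraMu;
        sigma = paraSigma;
    }// Of the constructor

    public String toString() {
        return "(" + mu + "," + sigma + ")";
    }// Of toString
}// Of GaussianParameters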
/** The data. */
Instances dataset;

/** The number of classes. */
int numClasses;

/** The number of instances. */
int numInstances;

/** The number of conditional attributes. */
int numConditions;

/** The predictions, including queried and predicted labels. */
int[] predicts;

/** Class distribution. */
double[] classDistribution;

/** Class distribution with Laplacian smooth. */
double[] classDistributionLaplacian;

/** The counts for all classes over all attributes on all values. */
double[][][] conditionalCounts;

/** The conditional probabilities with Laplacian smooth. */
double[][][] conditionalProbabilitiesLaplacian;

/** The Gaussian parameters. */
GaussianParameters[][] gaussianParameters;

/** Data type. */
int dataType;

/** Nominal. */
public static final int NOMINAL = 0;

/** Numerical. */
public static final int NUMERICAL = 1;
/**
 * The constructor.
 *
 * @param paraFilename The given file.
 */
public NaiveBayes(String paraFilename) {
    dataset = null;
    try {
        FileReader fileReader = new FileReader(paraFilename);
        dataset = new Instances(fileReader);
        fileReader.close();
    } catch (Exception ee) {
        System.out.println("Cannot open the file: " + paraFilename + "\r\n" + ee);
        // Without a dataset nothing below can run, so stop here.
        System.exit(0);
    } // Of try

    // The last attribute is the class label.
    dataset.setClassIndex(dataset.numAttributes() - 1);
    numConditions = dataset.numAttributes() - 1;
    numInstances = dataset.numInstances();
    numClasses = dataset.attribute(numConditions).numValues();
}// Of the constructor
/**
 * Set the data type.
 */
public void setDataType(int paraDataType) {
    dataType = paraDataType;
}// Of setDataType
Computing $P(c)$ and $\hat P(c)$:
The array tempCounts records the number of instances of each class label, which corresponds to $|D_c|$ in the formulas above.
/**
 * Calculate the class distribution with Laplacian smooth.
 */
public void calculateClassDistribution() {
    classDistribution = new double[numClasses];
    classDistributionLaplacian = new double[numClasses];

    // Count the instances of each class, i.e., |D_c|.
    double[] tempCounts = new double[numClasses];
    for (int i = 0; i < numInstances; i++) {
        int tempClassValue = (int) dataset.instance(i).classValue();
        tempCounts[tempClassValue]++;
    } // Of for i

    for (int i = 0; i < numClasses; i++) {
        classDistribution[i] = tempCounts[i] / numInstances;
        classDistributionLaplacian[i] = (tempCounts[i] + 1) / (numInstances + numClasses);
    } // Of for i

    System.out.println("Class distribution: " + Arrays.toString(classDistribution));
    System.out.println("Class distribution Laplacian: " + Arrays.toString(classDistributionLaplacian));
}// Of calculateClassDistribution
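Computing $\hat P(x_i|c)$:
In the three-dimensional array conditionalCounts, the first dimension i is the class, the second dimension j is the attribute, and the third dimension k is the value of that attribute. Each entry is the number of samples of class i whose attribute j takes the value k.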
/**
 * Calculate the conditional probabilities with Laplacian smooth. Only scan the
 * data set once.
 */
public void calculateConditionalProbabilities() {
    conditionalCounts = new double[numClasses][numConditions][];
    conditionalProbabilitiesLaplacian = new double[numClasses][numConditions][];

    // Allocate space.
    for (int i = 0; i < numClasses; i++) {
        for (int j = 0; j < numConditions; j++) {
            int tempNumValues = (int) dataset.attribute(j).numValues();
            conditionalCounts[i][j] = new double[tempNumValues];
            conditionalProbabilitiesLaplacian[i][j] = new double[tempNumValues];
        } // Of for j
    } // Of for i

    // Count the numbers.
    int[] tempClassCounts = new int[numClasses];
    for (int i = 0; i < numInstances; i++) {
        int tempClass = (int) dataset.instance(i).classValue();
        tempClassCounts[tempClass]++;
        for (int j = 0; j < numConditions; j++) {
            int tempValue = (int) dataset.instance(i).value(j);
            conditionalCounts[tempClass][j][tempValue]++;
        } // Of for j
    } // Of for i

    // Now for the real probability with Laplacian.
    for (int i = 0; i < numClasses; i++) {
        for (int j = 0; j < numConditions; j++) {
            int tempNumValues = (int) dataset.attribute(j).numValues();
            for (int k = 0; k < tempNumValues; k++) {
                conditionalProbabilitiesLaplacian[i][j][k] = (conditionalCounts[i][j][k] + 1)
                        / (tempClassCounts[i] + tempNumValues);
            } // Of for k
        } // Of for j
    } // Of for i

    System.out.println("Conditional probabilities: " + Arrays.deepToString(conditionalCounts));
}// Of calculateConditionalProbabilities
The procedure for computing $h_{nb}(\mathbf x)=\arg\max_{c\in \mathbf y}\left[\log \hat P(c)+\sum\limits_{i=1}^d\log\hat P(x_i|c)\right]$ from $\hat P(c)$ and $\hat P(x_i|c)$:
/**
 * Classify an instance with nominal data.
 */
public int classifyNominal(Instance paraInstance) {
    // Find the biggest one
    double tempBiggest = -10000;
    int resultBestIndex = 0;
    for (int i = 0; i < numClasses; i++) {
        double tempPseudoProbability = Math.log(classDistributionLaplacian[i]);
        for (int j = 0; j < numConditions; j++) {
            int tempAttributeValue = (int) paraInstance.value(j);
            tempPseudoProbability += Math.log(conditionalProbabilitiesLaplacian[i][j][tempAttributeValue]);
        } // Of for j

        if (tempBiggest < tempPseudoProbability) {
            tempBiggest = tempPseudoProbability;
            resultBestIndex = i;
        } // Of if
    } // Of for i

    return resultBestIndex;
}// Of classifyNominal
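For numerical data, the array gaussianParameters = new GaussianParameters[numClasses][numConditions] stores one pair $(\mu,\sigma)$ for each class and each attribute.
$\mu$ is the mean. For a given class label, first sum the values of a feature over the samples of that class, then divide by the number of samples with that label to obtain $\mu$. Then use $\mu$ to compute $\sigma$.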
/**
 * Calculate the Gaussian parameters (mu and sigma) for each class and each
 * attribute.
 */
public void calculateGaussianParameters() {
    gaussianParameters = new GaussianParameters[numClasses][numConditions];

    double[] tempValuesArray = new double[numInstances];
    int tempNumValues = 0;
    double tempSum = 0;

    for (int i = 0; i < numClasses; i++) {
        for (int j = 0; j < numConditions; j++) {
            tempSum = 0;

            // Obtain values for this class.
            tempNumValues = 0;
            for (int k = 0; k < numInstances; k++) {
                if ((int) dataset.instance(k).classValue() != i) {
                    continue;
                } // Of if

                tempValuesArray[tempNumValues] = dataset.instance(k).value(j);
                tempSum += tempValuesArray[tempNumValues];
                tempNumValues++;
            } // Of for k

            // Obtain parameters.
            double tempMu = tempSum / tempNumValues;

            double tempSigma = 0;
            for (int k = 0; k < tempNumValues; k++) {
                tempSigma += (tempValuesArray[k] - tempMu) * (tempValuesArray[k] - tempMu);
            } // Of for k
            tempSigma /= tempNumValues;
            tempSigma = Math.sqrt(tempSigma);

            gaussianParameters[i][j] = new GaussianParameters(tempMu, tempSigma);
        } // Of for j
    } // Of for i
}// Of calculateGaussianParameters
/**
 * Classify an instance with numerical data.
 */
public int classifyNumerical(Instance paraInstance) {
    // Find the biggest one
    double tempBiggest = -10000;
    int resultBestIndex = 0;

    for (int i = 0; i < numClasses; i++) {
        double tempPseudoProbability = Math.log(classDistributionLaplacian[i]);
        for (int j = 0; j < numConditions; j++) {
            double tempAttributeValue = paraInstance.value(j);
            double tempSigma = gaussianParameters[i][j].sigma;
            double tempMu = gaussianParameters[i][j].mu;

            tempPseudoProbability += -Math.log(tempSigma)
                    - (tempAttributeValue - tempMu) * (tempAttributeValue - tempMu) / (2 * tempSigma * tempSigma);
        } // Of for j

        if (tempBiggest < tempPseudoProbability) {
            tempBiggest = tempPseudoProbability;
            resultBestIndex = i;
        } // Of if
    } // Of for i

    return resultBestIndex;
}// Of classifyNumerical
/**
 * Classify all instances, the results are stored in predicts[].
 */
public void classify() {
    predicts = new int[numInstances];
    for (int i = 0; i < numInstances; i++) {
        predicts[i] = classify(dataset.instance(i));
    } // Of for i
}// Of classify

/**
 * Classify an instance.
 */
public int classify(Instance paraInstance) {
    if (dataType == NOMINAL) {
        return classifyNominal(paraInstance);
    } else if (dataType == NUMERICAL) {
        return classifyNumerical(paraInstance);
    } // Of if

    return -1;
}// Of classify
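/**
 * Test nominal data.
 */
public static void testNominal() {
    System.out.println("Hello, Naive Bayes. I only want to test the nominal data.");
    String tempFilename = "F:/sampledataMain/mushroom.arff";

    NaiveBayes tempLearner = new NaiveBayes(tempFilename);
    // Train on the nominal data, then classify before measuring accuracy.
    tempLearner.setDataType(NOMINAL);
    tempLearner.calculateClassDistribution();
    tempLearner.calculateConditionalProbabilities();
    tempLearner.classify();

    System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
}// Of testNominal

/**
 * Test numerical data.
 */
public static void testNumerical() {
    System.out.println("Hello, Naive Bayes. I only want to test the numerical data with Gaussian assumption.");
    String tempFilename = "F:/sampledataMain/iris.arff";

    NaiveBayes tempLearner = new NaiveBayes(tempFilename);
    // Train on the numerical data, then classify before measuring accuracy.
    tempLearner.setDataType(NUMERICAL);
    tempLearner.calculateClassDistribution();
    tempLearner.calculateGaussianParameters();
    tempLearner.classify();

    System.out.println("The accuracy is: " + tempLearner.computeAccuracy());
}// Of testNumerical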
/**
 * Compute accuracy.
 */
public double computeAccuracy() {
    double tempCorrect = 0;
    for (int i = 0; i < numInstances; i++) {
        if (predicts[i] == (int) dataset.instance(i).classValue()) {
            tempCorrect++;
        } // Of if
    } // Of for i

    double resultAccuracy = tempCorrect / numInstances;
    return resultAccuracy;
}// Of computeAccuracy
/**
 * The entrance of the program.
 *
 * @param args Not used now.
 */
public static void main(String[] args) {
    testNominal();
    testNumerical();
}// Of main