Naive Bayes is a classification algorithm based on Bayes' theorem; it is a generative model.
Note: the core assumption of naive Bayes is that the features are conditionally independent given the class.
Naive Bayes requires no iterative parameter training; it estimates the joint probability distribution directly from counts over the training samples. The details are as follows:
Training set: $T=\{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\}$
Naive Bayes learns the joint probability distribution $P(X,Y)$ from the training set.
First, the prior distribution is estimated from the training data: $P(Y=c_k),\ k=1,2,...,K$
The class-conditional distribution is:
$$P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)},...,X^{(n)}=x^{(n)} \mid Y=c_k),\quad k=1,2,...,K$$
Naive Bayes imposes the conditional-independence assumption on this distribution:
$$P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)},...,X^{(n)}=x^{(n)} \mid Y=c_k)=\prod_{i=1}^n P(X^{(i)}=x^{(i)} \mid Y=c_k)$$
This assumption makes naive Bayes simple, but it sacrifices some classification accuracy.
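As a tiny numeric sketch of the factorization (all numbers below are invented purely for illustration): with two binary features, the class-conditional joint is replaced by the product of per-feature conditionals.

```python
# Hypothetical per-feature conditionals for one class c_k (made-up numbers)
p_x1_given_c = {0: 0.3, 1: 0.7}   # P(X^(1) = v | Y = c_k)
p_x2_given_c = {0: 0.6, 1: 0.4}   # P(X^(2) = v | Y = c_k)

# Under the naive assumption, P(X = (1, 0) | Y = c_k) factorizes:
p_joint = p_x1_given_c[1] * p_x2_given_c[0]
print(p_joint)  # 0.42
```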
Next, how naive Bayes carries out classification:
From the training data we can estimate the prior $P(Y=c_k)$ and the conditional probabilities $P(X^{(i)}=x^{(i)} \mid Y=c_k)$.
From these two quantities we can compute the posterior probability $P(Y=c_k \mid X=x)$, then choose the class with the largest posterior as the prediction.
The posterior is computed via Bayes' theorem:
$$P(Y=c_k \mid X=x)=\frac{P(X=x \mid Y=c_k)\,P(Y=c_k)}{\sum_{k}P(X=x \mid Y=c_k)\,P(Y=c_k)}=\frac{P(Y=c_k)\prod_{i=1}^n P(X^{(i)}=x^{(i)} \mid Y=c_k)}{\sum_{k}P(Y=c_k)\prod_{i=1}^n P(X^{(i)}=x^{(i)} \mid Y=c_k)}$$
So the naive Bayes classifier can be written as:
$$y=f(x)=\mathop{\arg\max}\limits_{c_k}\frac{P(Y=c_k)\prod_{i=1}^n P(X^{(i)}=x^{(i)} \mid Y=c_k)}{\sum_{k}P(Y=c_k)\prod_{i=1}^n P(X^{(i)}=x^{(i)} \mid Y=c_k)}$$
Since the denominator is the same for every $c_k$, it can be dropped when taking the argmax.
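The decision rule above can be sketched numerically (the priors and likelihoods below are made up for illustration); normalizing by the shared denominator never changes which class wins:

```python
import numpy as np

# Made-up priors and likelihoods for K = 2 classes on one sample x
prior = np.array([0.6, 0.4])          # P(Y = c_k)
likelihood = np.array([0.02, 0.09])   # prod_i P(X^(i) = x^(i) | Y = c_k)

numerator = prior * likelihood        # unnormalized posterior
posterior = numerator / numerator.sum()  # full Bayes rule

# The shared denominator does not affect the argmax:
assert np.argmax(numerator) == np.argmax(posterior)
print(np.argmax(numerator))  # 1
```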
Both quantities can be obtained by maximum likelihood estimation:
$$P(Y=c_k)=\frac{\sum_{j=1}^N I(y_j=c_k)}{N}$$
$$P(X^{(i)}=a_{il} \mid Y=c_k)=\frac{\sum_{j=1}^N I(x_j^{(i)}=a_{il},\ y_j=c_k)}{\sum_{j=1}^N I(y_j=c_k)}$$
$$i=1,2,...,n;\quad k=1,2,...,K$$
where $a_{il}$ denotes the $l$-th possible value of the $i$-th feature and $I(\cdot)$ is the indicator function.
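These estimates are just relative frequencies; a minimal sketch on a toy dataset (the data below is invented):

```python
import numpy as np

# Toy training set: one binary feature, labels in {0, 1}
x = np.array([0, 1, 1, 0, 1, 0])
y = np.array([0, 0, 1, 1, 1, 1])

N = len(y)
prior_c1 = np.sum(y == 1) / N                        # P(Y=1) = 4/6
cond = np.sum((x == 1) & (y == 1)) / np.sum(y == 1)  # P(X=1 | Y=1) = 2/4
print(prior_c1, cond)
```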
Naive Bayes takes the class with the largest posterior probability as the classification result, which is equivalent to expected risk minimization.
Assume the 0-1 loss function:
$$L(Y,f(X))=\begin{cases}1, & Y\ne f(X)\\ 0, & Y=f(X)\end{cases}$$
Then the expected risk of a classifier $f$ is:
$$R_{exp}(f)=E[L(Y,f(X))]$$
Written as a conditional expectation:
$$R_{exp}(f)=E_X\sum_{k=1}^K L(c_k,f(X))\,P(c_k \mid X)$$
To minimize the risk it suffices to minimize $R_{exp}(f)$ pointwise for each sample $x$, which determines $f$:
$$\begin{aligned}f(x)&=\mathop{\arg\min}\limits_{y\in\mathcal{Y}}\sum_{k=1}^K L(c_k,y)\,P(c_k \mid X=x)\\&=\mathop{\arg\min}\limits_{y\in\mathcal{Y}} P(y\ne Y \mid X=x)\\&=\mathop{\arg\min}\limits_{y\in\mathcal{Y}}\left(1-P(Y=y \mid X=x)\right)\\&=\mathop{\arg\max}\limits_{y\in\mathcal{Y}} P(Y=y \mid X=x)\end{aligned}$$
Thus the expected-risk-minimization criterion yields the maximum-posterior criterion:
$$f(x)=\mathop{\arg\max}\limits_{c_k} P(c_k \mid X=x)$$
which is exactly the principle naive Bayes uses.
Depending on the application, there are three common variants:
Gaussian Naive Bayes: suitable when the features are continuous; it additionally assumes that, within each class, every feature follows a Gaussian (normal) distribution. The per-class mean and standard deviation of each feature are computed from the samples, which fixes the normal density function; at prediction time the feature value is plugged into that density to obtain its likelihood.
Multinomial Naive Bayes: in contrast to Gaussian naive Bayes, the multinomial model is better suited to discrete features. When computing the prior $P(c_k)$ and the conditional probabilities $P(X^{(i)}=a_{il} \mid Y=c_k)$, it applies Laplace smoothing:
$$P(Y=c_k)=\frac{\sum_{j=1}^N I(y_j=c_k)+1}{N+K}$$
$$P(X^{(i)}=a_{il} \mid Y=c_k)=\frac{\sum_{j=1}^N I(x_j^{(i)}=a_{il},\ y_j=c_k)+1}{\sum_{j=1}^N I(y_j=c_k)+S_i}$$
where $S_i$ is the number of distinct values the $i$-th feature can take (adding $S_i$ in the denominator keeps the smoothed conditional probabilities summing to 1). The idea is simply to add 1 to every count within each class: with enough training data this barely affects the result, and it eliminates the case where $P(X^{(i)}=a_{il} \mid Y=c_k)$ has frequency 0 (a feature value that never appears under some class, which would otherwise badly hurt the classifier).
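A minimal sketch of what the smoothing does (counts below are invented): without it, a feature value never seen under a class gets probability 0, which zeroes out the whole product.

```python
import numpy as np

# Occurrence counts of each of S_i = 3 feature values within one class;
# the third value never occurs in the training data
counts = np.array([3, 5, 0])
S_i = len(counts)

raw = counts / counts.sum()                     # unsmoothed MLE: last entry is 0
smoothed = (counts + 1) / (counts.sum() + S_i)  # Laplace smoothing
print(raw, smoothed)
```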
Bernoulli Naive Bayes: suited to scenarios where the feature attributes are binary, i.e. each feature takes a boolean value; a typical example is judging whether a word occurs in a document or not.
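For reference, scikit-learn ships all three variants as `GaussianNB`, `MultinomialNB`, and `BernoulliNB`; a minimal sketch on synthetic data (the feature distributions are chosen only to match each model's assumption, and the accuracies are training-set scores, not a benchmark):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)

X_cont = rng.normal(loc=y[:, None], scale=1.0, size=(100, 3))  # continuous
X_count = rng.poisson(lam=y[:, None] + 1, size=(100, 3))       # counts
X_bin = (X_count > 1).astype(int)                              # binary

acc_g = GaussianNB().fit(X_cont, y).score(X_cont, y)
acc_m = MultinomialNB().fit(X_count, y).score(X_count, y)
acc_b = BernoulliNB().fit(X_bin, y).score(X_bin, y)
print(acc_g, acc_m, acc_b)
```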
First, the formulas behind the Gaussian NB Python implementation below (Laplace-smoothed prior, Gaussian likelihood):
$$P(Y=c_k)=\frac{\sum_{j=1}^N I(y_j=c_k)+1}{N+K}$$
$$P(X^{(i)}=x^{(i)} \mid Y=c_k)=\frac{1}{\sqrt{2\pi\sigma_{ik}^2}}\exp\left(-\frac{(x^{(i)}-\mu_{ik})^2}{2\sigma_{ik}^2}\right)$$
where $\mu_{ik}$ and $\sigma_{ik}^2$ are the mean and variance of feature $i$ within class $c_k$.
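A quick sanity check of the density formula (note that the variance $\sigma^2$, not $\sigma$, appears both under the square root and in the exponent): at $x=\mu$ the density must equal $1/\sqrt{2\pi\sigma^2}$.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    # N(mu, var) density; the variance appears in both places
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu, var = 0.0, 4.0
print(gaussian_pdf(mu, mu, var))  # equals 1/sqrt(8*pi)
```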
Load one of sklearn's built-in datasets:
```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

def load_data():
    cancer = datasets.load_breast_cancer()
    X = cancer.data
    y = cancer.target
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    return X_train, X_test, y_train, y_test
```
Split the data by class:
```python
def SepByClass(self, X, y):
    # Split the data by class.
    # Input: unseparated features and targets; output: a dict keyed by class.
    self.num_of_samples = len(y)  # total number of samples
    y = y.reshape(X.shape[0], 1)
    data = np.hstack((X, y))  # merge features and targets into one array
    data_byclass = {}  # key = class label, value = that class's rows
    for label in np.unique(data[:, -1]):
        data_byclass[label] = data[data[:, -1] == label]
    self.class_name = list(data_byclass.keys())  # class labels
    self.num_of_class = len(data_byclass.keys())  # number of classes
    return data_byclass
```
Compute each feature's Gaussian parameters $\mu$, $\sigma^2$:
```python
def CalMeanPerFeature(self, X_byclass):
    # Per-class, per-dimension feature means.
    X_mean = []
    for i in range(X_byclass.shape[1]):
        X_mean.append(np.mean(X_byclass[:, i]))
    return X_mean

def CalVarPerFeature(self, X_byclass):
    # Per-class, per-dimension feature variances.
    X_var = []
    for i in range(X_byclass.shape[1]):
        X_var.append(np.var(X_byclass[:, i]))
    return X_var
```
Compute each class's prior probability (with Laplace smoothing):
```python
def CalPriorProb(self, y_byclass):
    # Laplace-smoothed prior of y:
    # (samples in this class + 1) / (total samples + number of classes)
    return (len(y_byclass) + 1) / (self.num_of_samples + self.num_of_class)
```
Train the model:
```python
def fit(self, X, y):
    # Input: training features and targets.
    # Estimates each class's prior and per-feature means and variances.
    X, y = np.asarray(X, np.float32), np.asarray(y, np.float32)
    data_byclass = self.SepByClass(X, y)  # split the data by class
    for data in data_byclass.values():
        X_byclass = data[:, :-1]
        y_byclass = data[:, -1]
        self.prior_list.append(self.CalPriorProb(y_byclass))
        self.mean_list.append(self.CalMeanPerFeature(X_byclass))
        self.var_list.append(self.CalVarPerFeature(X_byclass))
```
Predict:
```python
def predict(self, X_new):
    # Input: one new sample's features; output: its most likely class.
    X_new = np.asarray(X_new, np.float32)
    log_posteriori = []  # log posterior (up to a shared constant) per class
    for prior, mean, var in zip(self.prior_list, self.mean_list, self.var_list):
        gaussian = self.CalGaussianProb(X_new, mean, var)
        # sum of logs instead of a product of many small densities,
        # to avoid floating-point underflow
        log_posteriori.append(np.log(prior) + np.sum(np.log(gaussian)))
    idx = np.argmax(log_posteriori)
    return self.class_name[idx]
```
The consolidated code is as follows:
```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

def load_data():
    cancer = datasets.load_breast_cancer()
    X = cancer.data
    y = cancer.target
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    return X_train, X_test, y_train, y_test

class Gaussian_NB:
    def __init__(self):
        self.num_of_samples = None
        self.num_of_class = None
        self.class_name = []
        self.prior_list = []
        self.mean_list = []
        self.var_list = []

    def SepByClass(self, X, y):
        # Split the data by class; returns a dict keyed by class label.
        self.num_of_samples = len(y)  # total number of samples
        y = y.reshape(X.shape[0], 1)
        data = np.hstack((X, y))  # merge features and targets
        data_byclass = {}
        for label in np.unique(data[:, -1]):
            data_byclass[label] = data[data[:, -1] == label]
        self.class_name = list(data_byclass.keys())  # class labels
        self.num_of_class = len(data_byclass.keys())  # number of classes
        return data_byclass

    def CalPriorProb(self, y_byclass):
        # Laplace-smoothed prior:
        # (samples in this class + 1) / (total samples + number of classes)
        return (len(y_byclass) + 1) / (self.num_of_samples + self.num_of_class)

    def CalMeanPerFeature(self, X_byclass):
        # Per-dimension feature means for one class.
        X_mean = []
        for i in range(X_byclass.shape[1]):
            X_mean.append(np.mean(X_byclass[:, i]))
        return X_mean

    def CalVarPerFeature(self, X_byclass):
        # Per-dimension feature variances for one class.
        X_var = []
        for i in range(X_byclass.shape[1]):
            X_var.append(np.var(X_byclass[:, i]))
        return X_var

    def CalGaussianProb(self, X_new, mean, var):
        # Gaussian density of each feature of X_new under one class's
        # per-feature mean and variance:
        # (1 / np.sqrt(2*np.pi*var)) * np.exp(-(X_new-mean)**2 / (2*var))
        gaussian_prob = []
        for a, b, c in zip(X_new, mean, var):
            formula1 = np.exp(-(a - b) ** 2 / (2 * c))
            formula2 = 1 / np.sqrt(2 * np.pi * c)
            gaussian_prob.append(formula2 * formula1)
        return gaussian_prob

    def fit(self, X, y):
        # Estimate each class's prior and per-feature means and variances.
        X, y = np.asarray(X, np.float32), np.asarray(y, np.float32)
        data_byclass = self.SepByClass(X, y)  # split the data by class
        for data in data_byclass.values():
            X_byclass = data[:, :-1]
            y_byclass = data[:, -1]
            self.prior_list.append(self.CalPriorProb(y_byclass))
            self.mean_list.append(self.CalMeanPerFeature(X_byclass))
            self.var_list.append(self.CalVarPerFeature(X_byclass))
        return self.prior_list, self.mean_list, self.var_list

    def predict(self, X_new):
        # Return the most likely class for one new sample.
        X_new = np.asarray(X_new, np.float32)
        log_posteriori = []  # log posterior (up to a constant) per class
        for prior, mean, var in zip(self.prior_list, self.mean_list, self.var_list):
            gaussian = self.CalGaussianProb(X_new, mean, var)
            # sum of logs avoids underflow from multiplying many densities
            log_posteriori.append(np.log(prior) + np.sum(np.log(gaussian)))
        idx = np.argmax(log_posteriori)
        return self.class_name[idx]

if __name__ == "__main__":
    X_train, X_test, y_train, y_test = load_data()
    model = Gaussian_NB()  # instantiate Gaussian_NB
    model.fit(X_train, y_train)  # train the model
    acc = 0
    TP = 0
    FP = 0
    FN = 0
    for i in range(len(X_test)):
        predict = model.predict(X_test[i, :])
        target = np.array(y_test)[i]
        if predict == target:
            acc += 1
        if predict == 1 and target == 1:
            TP += 1
        if predict == 1 and target == 0:
            FP += 1  # predicted positive, actually negative
        if predict == 0 and target == 1:
            FN += 1  # predicted negative, actually positive
    print("Accuracy:", acc / len(X_test))
    print("Precision:", TP / (TP + FP))
    print("Recall:", TP / (TP + FN))
    print("F1:", 2 * TP / (2 * TP + FP + FN))
```
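As a sanity check, the from-scratch model can be compared against scikit-learn's built-in `GaussianNB` on the same dataset; a minimal sketch (a fixed `random_state` is used here so the split is reproducible, and the exact accuracy varies with the split):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(
    *datasets.load_breast_cancer(return_X_y=True), random_state=0)

clf = GaussianNB().fit(X_train, y_train)
print("sklearn GaussianNB accuracy:", clf.score(X_test, y_test))
```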