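Naive Bayes (NB) is a classification method based on Bayes' theorem and the assumption that the features are conditionally independent given the class. Given a training data set, it first learns the joint probability distribution of the input and output under this conditional independence assumption; then, for a given input $x$, it uses Bayes' theorem to find the output $y$ with the largest posterior probability.
NB includes the following algorithms:
Advantages and disadvantages of naive Bayes: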
Let the input space $\mathcal{X} \subseteq \mathbf{R}^{n}$ be a set of $n$-dimensional vectors, and let the output space be the set of class labels $\mathcal{Y}=\left\{c_{1}, c_{2}, \cdots, c_{K}\right\}$. The input is a feature vector $x \in \mathcal{X}$ and the output is a class label $y \in \mathcal{Y}$. $X$ is a random vector defined on the input space $\mathcal{X}$, $Y$ is a random variable defined on the output space $\mathcal{Y}$, and $P(X, Y)$ is the joint probability distribution of $X$ and $Y$. The training data set

$$T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}$$

is generated independently and identically distributed (i.i.d.) according to $P(X, Y)$.
Naive Bayes learns the joint probability distribution $P(X, Y)$ from the training data set. Specifically, it learns the following prior and conditional probability distributions. The prior probability distribution:

$$P\left(Y=c_{k}\right), \quad k=1,2, \cdots, K$$

The conditional probability distribution:

$$P\left(X=x \mid Y=c_{k}\right)=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} \mid Y=c_{k}\right), \quad k=1,2, \cdots, K$$

Together, these give the joint probability distribution $P(X, Y)$.
The conditional probability distribution $P\left(X=x \mid Y=c_{k}\right)$ has an exponential number of parameters, so estimating it directly is infeasible in practice. Indeed, if $x^{(j)}$ can take $S_{j}$ values, $j=1,2, \cdots, n$, and $Y$ can take $K$ values, then the number of parameters is $K \prod_{j=1}^{n} S_{j}$.
Naive Bayes makes a conditional independence assumption about the conditional probability distribution. Because this is a rather strong assumption, the method is called "naive". Specifically, the conditional independence assumption is:

$$\begin{aligned} P\left(X=x \mid Y=c_{k}\right) &=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} \mid Y=c_{k}\right) \\ &=\prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right) \end{aligned}$$

Naive Bayes actually learns the mechanism that generates the data, so it is a generative model. The conditional independence assumption says that the features used for classification are conditionally independent once the class is fixed. This assumption makes naive Bayes simple, but sometimes at the cost of some classification accuracy.
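To get a feel for how much the independence assumption shrinks the model, the sketch below counts probability-table entries with and without it. The function names and the setting (10 features with 3 values each, 2 classes) are hypothetical choices for illustration; under conditional independence the count becomes $K \sum_{j=1}^{n} S_{j}$ instead of $K \prod_{j=1}^{n} S_{j}$.

```python
from math import prod

def joint_param_count(K, S):
    """Table entries for the full conditional P(X=x | Y=c_k): K * prod(S_j)."""
    return K * prod(S)

def naive_param_count(K, S):
    """Table entries under conditional independence: K * sum(S_j)."""
    return K * sum(S)

# Hypothetical setting: 10 features with 3 values each, 2 classes.
K, S = 2, [3] * 10
print(joint_param_count(K, S))  # 2 * 3**10 = 118098
print(naive_param_count(K, S))  # 2 * 30 = 60
```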
To classify a given input $x$, naive Bayes uses the learned model to compute the posterior probability distribution $P\left(Y=c_{k} \mid X=x\right)$ and outputs the class with the largest posterior probability. The posterior probability is computed with Bayes' theorem:

$$P\left(Y=c_{k} \mid X=x\right)=\frac{P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)}{\sum_{k} P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)}$$

Substituting the conditional independence assumption into this formula gives:

$$P\left(Y=c_{k} \mid X=x\right)=\frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}, \quad k=1,2, \cdots, K$$

This is the basic formula of naive Bayes classification. The naive Bayes classifier can therefore be written as:

$$y=f(x)=\arg \max _{c_{k}} \frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}$$

Note that the denominator in this expression is the same for every $c_{k}$, so:

$$y=\arg \max _{c_{k}} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$$
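The decision rule can be sketched in a few lines. The snippet below is only an illustration: `nb_classify`, the dictionary layout of `priors` and `cond`, and all the probabilities are made up for the example. It sums logarithms instead of multiplying probabilities, which leaves the argmax unchanged and avoids underflow when there are many features.

```python
import math

def nb_classify(priors, cond, x):
    """Return argmax_k P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k), scored in log space.

    priors: dict mapping class -> P(Y=c_k)
    cond:   dict mapping class -> list with one dict per feature,
            each mapping a feature value -> P(X^(j)=value | Y=c_k)
    """
    scores = {}
    for c, prior in priors.items():
        probs = [prior] + [cond[c][j][v] for j, v in enumerate(x)]
        # A zero factor makes the whole product zero, i.e. a log score of -inf.
        scores[c] = -math.inf if 0 in probs else sum(math.log(p) for p in probs)
    return max(scores, key=scores.get), scores

# Hypothetical two-class, two-feature model (numbers made up for illustration).
priors = {"a": 0.6, "b": 0.4}
cond = {
    "a": [{0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}],
    "b": [{0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7}],
}
label, scores = nb_classify(priors, cond, x=(1, 1))
print(label)  # b, since 0.4 * 0.8 * 0.7 > 0.6 * 0.5 * 0.1
```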
Naive Bayes assigns an instance to the class with the largest posterior probability, which is equivalent to minimizing the expected risk. Suppose we choose the 0-1 loss function:

$$L(Y, f(X))=\left\{\begin{array}{ll}{1,} & {Y \neq f(X)} \\ {0,} & {Y=f(X)}\end{array}\right.$$

where $f(X)$ is the classification decision function. The expected risk is then:

$$R_{\mathrm{exp}}(f)=E[L(Y, f(X))]$$

The expectation is taken with respect to the joint distribution $P(X, Y)$. Taking the conditional expectation gives:

$$R_{\mathrm{exp}}(f)=E_{X} \sum_{k=1}^{K}\left[L\left(c_{k}, f(X)\right)\right] P\left(c_{k} \mid X\right)$$

To minimize the expected risk, it suffices to minimize it pointwise for each $X=x$. Since the posterior probabilities sum to 1, this gives:

$$\begin{aligned} f(x) &=\arg \min _{y \in \mathcal{Y}} \sum_{k=1}^{K} L\left(c_{k}, y\right) P\left(c_{k} \mid X=x\right) \\ &=\arg \min _{y \in \mathcal{Y}} \sum_{k:\, c_{k} \neq y} P\left(c_{k} \mid X=x\right) \\ &=\arg \min _{y \in \mathcal{Y}}\left(1-P\left(Y=y \mid X=x\right)\right) \\ &=\arg \max _{y \in \mathcal{Y}} P\left(Y=y \mid X=x\right) \end{aligned}$$

In this way, the expected risk minimization criterion yields the posterior probability maximization criterion:

$$f(x)=\arg \max _{c_{k}} P\left(c_{k} \mid X=x\right)$$

which is the principle used by naive Bayes.
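The equivalence can be checked numerically for a single point $x$. In the sketch below the posterior values and the helper name `conditional_risk` are hypothetical; the point is that the expected 0-1 loss of predicting $y$ equals one minus the posterior of $y$, so the minimum-risk class is also the maximum-posterior class.

```python
def conditional_risk(posterior, y):
    """Expected 0-1 loss of predicting y when the true class follows `posterior`."""
    return sum(p for c, p in posterior.items() if c != y)

# Hypothetical posterior P(c_k | X = x) over three classes.
posterior = {"c1": 0.2, "c2": 0.5, "c3": 0.3}

risks = {y: conditional_risk(posterior, y) for y in posterior}
best_by_risk = min(risks, key=risks.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_risk, best_by_posterior)  # c2 c2
```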
In naive Bayes, learning means estimating $P\left(Y=c_{k}\right)$ and $P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$. Maximum likelihood estimation can be used to estimate these probabilities. The maximum likelihood estimate of the prior probability $P\left(Y=c_{k}\right)$ is:

$$P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K$$

Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\left\{a_{j 1}, a_{j 2}, \cdots, a_{j S_{j}}\right\}$. The maximum likelihood estimate of the conditional probability $P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)$ is:

$$P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l},\ y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}$$

$$j=1,2, \cdots, n ; \quad l=1,2, \cdots, S_{j} ; \quad k=1,2, \cdots, K$$

where $x_{i}^{(j)}$ is the $j$-th feature of the $i$-th sample, $a_{j l}$ is the $l$-th possible value of the $j$-th feature, and $I$ is the indicator function.
Input: training data $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}$, where $x_{i}=\left(x_{i}^{(1)}, x_{i}^{(2)}, \cdots, x_{i}^{(n)}\right)^{\mathrm{T}}$, $x_{i}^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_{i}^{(j)} \in\left\{a_{j 1}, a_{j 2}, \cdots, a_{j S_{j}}\right\}$, $a_{j l}$ is the $l$-th possible value of the $j$-th feature, $j=1,2, \cdots, n$, $l=1,2, \cdots, S_{j}$, $y_{i} \in\left\{c_{1}, c_{2}, \cdots, c_{K}\right\}$; and an instance $x$.
Output: the class of instance $x$.
First, compute the prior probabilities and the conditional probabilities:

$$\begin{aligned} &P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K \\ &P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l},\ y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)} \\ &j=1,2, \cdots, n ; \quad l=1,2, \cdots, S_{j} ; \quad k=1,2, \cdots, K \end{aligned}$$

Then, for the given instance $x=\left(x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right)^{\mathrm{T}}$, compute:

$$P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right), \quad k=1,2, \cdots, K$$

Finally, determine the class of instance $x$:

$$y=\arg \max _{c_{k}} P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$$
Example 1: Learn a naive Bayes classifier from the training data in the table below and determine the class label $y$ of $x=(2, S)^{\mathrm{T}}$. In the table, $X^{(1)}$ and $X^{(2)}$ are the features, with value sets $A_{1}=\{1,2,3\}$ and $A_{2}=\{S, M, L\}$ respectively, and $Y$ is the class label, $Y \in C=\{1,-1\}$.

Sample | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
$X^{(1)}$ | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 |
$X^{(2)}$ | S | M | M | S | S | S | M | M | L | L | L | M | M | L | L |
$Y$ | -1 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | -1 |
from IPython.display import Image
Image(filename="./data/4_2.png",width=500)
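The computation for Example 1 can also be reproduced in code. The following is a minimal sketch of the three steps of the algorithm applied to the table above; the helper names `fit_mle` and `classify` are introduced here for illustration only, and exact fractions are used so the results can be compared with a hand calculation.

```python
from collections import Counter
from fractions import Fraction

# Training data from the table in Example 1.
X1 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
X2 = list("SMMSSSMMLLLMMLL")
Y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]

def fit_mle(features, labels):
    """Step 1: maximum likelihood estimates of the prior and conditional probabilities."""
    n = len(labels)
    class_count = Counter(labels)
    prior = {c: Fraction(m, n) for c, m in class_count.items()}
    # joint[j][(value, c)] counts samples with X^(j) = value and Y = c.
    joint = [Counter(zip(col, labels)) for col in features]

    def cond(j, value, c):
        return Fraction(joint[j][(value, c)], class_count[c])

    return prior, cond

def classify(prior, cond, x):
    """Steps 2 and 3: score every class, then take the argmax."""
    scores = {}
    for c, p in prior.items():
        for j, value in enumerate(x):
            p *= cond(j, value, c)
        scores[c] = p
    return max(scores, key=scores.get), scores

prior, cond = fit_mle([X1, X2], Y)
label, scores = classify(prior, cond, (2, "S"))
print(scores)  # {-1: Fraction(1, 15), 1: Fraction(1, 45)}
print(label)   # -1
```

Since $\frac{1}{15} > \frac{1}{45}$, the instance $x=(2, S)^{\mathrm{T}}$ is assigned to $y=-1$.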
With maximum likelihood estimation, an estimated probability may turn out to be 0. This affects the computation of the posterior probability and biases the classification. The remedy is to use Bayesian estimation. Specifically, the Bayesian estimate of the conditional probability is:

$$P_{\lambda}\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l},\ y_{i}=c_{k}\right)+\lambda}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+S_{j} \lambda}$$

where $\lambda \geqslant 0$. For $\lambda>0$ this is equivalent to adding a positive number $\lambda$ to the count of each value of the random variable. When $\lambda=0$ it reduces to the maximum likelihood estimate. A common choice is $\lambda=1$, which is called Laplace smoothing. Clearly, for $\lambda>0$ and any $l=1,2, \cdots, S_{j}$, $k=1,2, \cdots, K$:

$$\begin{aligned} &P_{\lambda}\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)>0 \\ &\sum_{l=1}^{S_{j}} P_{\lambda}\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=1 \end{aligned}$$

Similarly, the Bayesian estimate of the prior probability is:

$$P_{\lambda}\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+\lambda}{N+K \lambda}$$

Example 2: For the problem in Example 1, estimate the probabilities with Laplace smoothing, that is, with $\lambda=1$.
Image(filename="./data/4_1.png",width=500)
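Example 2 can be checked the same way. The sketch below applies the Bayesian estimates with $\lambda=1$ to the data of Example 1; the helper names `prior` and `cond` are illustrative, and $S_{j}$ is read off the training data, which assumes that every possible value of each feature appears at least once (true for this table).

```python
from collections import Counter
from fractions import Fraction

# Training data from the table in Example 1.
X1 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
X2 = list("SMMSSSMMLLLMMLL")
Y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
lam = 1  # lambda = 1 corresponds to Laplace smoothing

n, classes = len(Y), sorted(set(Y))
class_count = Counter(Y)
features = [X1, X2]
# S_j, assuming every possible value occurs in the training data.
n_values = [len(set(col)) for col in features]

def prior(c):
    """Bayesian estimate of P(Y = c)."""
    return Fraction(class_count[c] + lam, n + len(classes) * lam)

def cond(j, value, c):
    """Bayesian estimate of P(X^(j) = value | Y = c)."""
    joint = sum(1 for v, y in zip(features[j], Y) if v == value and y == c)
    return Fraction(joint + lam, class_count[c] + n_values[j] * lam)

x = (2, "S")
scores = {c: prior(c) * cond(0, x[0], c) * cond(1, x[1], c) for c in classes}
label = max(scores, key=scores.get)
print(scores)  # {-1: Fraction(28, 459), 1: Fraction(5, 153)}
print(label)   # -1
```

With smoothing the scores are $\frac{28}{459} \approx 0.0610$ for $y=-1$ and $\frac{5}{153} \approx 0.0327$ for $y=1$, so the predicted class is still $y=-1$.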
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter
import math
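def load_data():
    # Load the iris data and keep only the first 100 rows (classes 0 and 1)
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df["label"] = iris.target
    df.columns = ["sepal length", "sepal width", "petal length", "petal width", "label"]
    data = np.array(df.iloc[:100, :])
    # Features are all columns but the last; the label is the last column
    return data[:, :-1], data[:, -1]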
X,y=load_data()
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3)
X_test[0],y_test[0]
(array([4.5, 2.3, 1.3, 0.3]), 0.0)
The likelihood of each feature is assumed to be Gaussian.
Probability density function:

$$P\left(x_{i} \mid y_{k}\right)=\frac{1}{\sqrt{2 \pi \sigma_{y_{k}}^{2}}} \exp \left(-\frac{\left(x_{i}-\mu_{y_{k}}\right)^{2}}{2 \sigma_{y_{k}}^{2}}\right)$$

Mean: $\mu$; variance: $\sigma^{2}=\frac{\sum(X-\mu)^{2}}{N}$
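As a quick sanity check of these two formulas, the sketch below evaluates the density at a point where the value is known in closed form. The function names are illustrative; the only facts relied on are that the standard normal density at 0 equals $1/\sqrt{2\pi}$ and that the variance uses the denominator $N$.

```python
import math

def gaussian_pdf(x, mean, stdev):
    """Density of N(mean, stdev**2) evaluated at x."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

def population_variance(values):
    """Variance with denominator N, matching the formula above."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

print(gaussian_pdf(0.0, 0.0, 1.0))           # about 0.3989, i.e. 1 / sqrt(2 * pi)
print(population_variance([1.0, 2.0, 3.0]))  # about 0.6667, i.e. 2 / 3
```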
class NaiveBayes:
    def __init__(self):
        self.model = None

    # Mean
    @staticmethod
    def mean(X):
        return sum(X) / float(len(X))

    # Standard deviation (square root of the population variance)
    def stdev(self, X):
        avg = self.mean(X)
        return math.sqrt(sum([pow(x - avg, 2) for x in X]) / float(len(X)))

    # Gaussian probability density function
    def gaussian_probability(self, x, mean, stdev):
        exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
        # The normalizing constant is 1 / (sqrt(2 * pi) * stdev)
        return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

    # Summarize X_train: one (mean, stdev) pair per feature column
    def summarize(self, train_data):
        summaries = [(self.mean(i), self.stdev(i)) for i in zip(*train_data)]
        return summaries

    # Compute the mean and standard deviation of each feature, per class
    def fit(self, X, y):
        labels = list(set(y))
        data = {label: [] for label in labels}
        for f, label in zip(X, y):
            data[label].append(f)
        self.model = {label: self.summarize(value) for label, value in data.items()}
        return "GaussianNB train done"

    # Compute the class-conditional likelihood of the input for each label
    # (the class prior P(Y=c_k) is not multiplied in)
    def calculate_probabilities(self, input_data):
        probabilities = {}
        for label, value in self.model.items():
            probabilities[label] = 1
            for i in range(len(value)):
                mean, stdev = value[i]
                probabilities[label] *= self.gaussian_probability(input_data[i], mean, stdev)
        return probabilities

    # Predict the label with the largest likelihood
    def predict(self, X_test):
        label = sorted(self.calculate_probabilities(X_test).items(), key=lambda x: x[-1])[-1][0]
        return label

    # Accuracy on a test set
    def score(self, X_test, y_test):
        right = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right += 1
        return right / float(len(X_test))
model=NaiveBayes()
model.fit(X_train,y_train)
'GaussianNB train done'
print(model.predict([4.4,3.2,1.3,0.2]))
0.0
model.score(X_test,y_test)
1.0
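The class above starts each product from 1 and multiplies only the Gaussian likelihoods, so the class prior from the decision rule is left out. A possible variant, sketched below under stated assumptions, adds the log prior and sums log densities instead of multiplying them, which also avoids underflow with many features. The function names, the tiny data set and the constant `1e-9` added to each variance are hypothetical choices for this illustration, not taken from the code above or from scikit-learn.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Per class: log prior plus (mean, variance) of each feature."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    model = {}
    for label, rows in groups.items():
        stats = []
        for col in zip(*rows):
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col)
            stats.append((mu, var + 1e-9))  # small constant avoids division by zero
        model[label] = (math.log(len(rows) / len(y)), stats)
    return model

def log_scores(model, x):
    """log P(Y=c_k) + sum_j log N(x^(j); mu, var): the log of the unnormalized posterior."""
    scores = {}
    for label, (log_prior, stats) in model.items():
        s = log_prior
        for v, (mu, var) in zip(x, stats):
            s += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        scores[label] = s
    return scores

# Tiny made-up data set: two well separated classes.
X = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.1], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
scores = log_scores(model, [1.1, 2.0])
print(max(scores, key=scores.get))  # 0
```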
from sklearn.naive_bayes import GaussianNB
clf=GaussianNB()
clf.fit(X_train,y_train)
GaussianNB(priors=None, var_smoothing=1e-09)
clf.score(X_test,y_test)
1.0
clf.predict([[4.4,3.2,1.3,0.2]])
array([0.])