Statistical Learning Methods (《统计学习方法》), Chapter 4: Naive Bayes (朴素贝叶斯)

Naive Bayes

Naive Bayes Theory

Basic Method

Input space: $\mathcal{X} \subseteq \mathbf{R}^n$, a space of $n$-dimensional discrete feature vectors (this article covers only naive Bayes over discrete feature spaces).

Output space: $\mathcal{Y} = \{c_1, c_2, \dots, c_K\}$.

Training set: $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, generated i.i.d. from the joint distribution $P(X,Y)$.

From the training set, naive Bayes learns the prior distribution and the conditional distribution, and thereby the joint distribution $P(X,Y)$.

Prior distribution: $P(Y=c_k),\ k=1,2,\dots,K$

Conditional distribution: $P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)} \mid Y=c_k),\ k=1,2,\dots,K$

The conditional distribution has exponentially many configurations: if $x^{(j)}$ can take $S_j$ values, $j=1,2,\dots,n$, and $Y$ can take $K$ values, the number of parameters is $K\prod_{j=1}^{n}S_j$, so estimating it directly is infeasible. Naive Bayes therefore imposes a conditional-independence assumption on the conditional distribution; this is where the "naive" in its name comes from.

Conditional independence assumption:
$$\begin{aligned} P(X=x \mid Y=c_k)&=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)} \mid Y=c_k)\\ &=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k) \end{aligned}$$

The assumption says that, given the class, the features used for classification are conditionally independent of one another. It reduces the number of parameters of the conditional distribution from $K\prod_{j=1}^{n}S_j$ to $K\sum_{j=1}^{n}S_j$.
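The reduction is easy to check numerically. A minimal sketch, using hypothetical feature sizes that happen to match the synthetic data built in the code section below ($n=4$ features with $2,3,4,5$ possible values, $K=3$ classes):

```python
import numpy as np

# Hypothetical sizes for illustration: S_j values per feature, K classes
S = np.array([2, 3, 4, 5])
K = 3

full_joint = K * np.prod(S)  # K * prod(S_j): without the independence assumption
naive = K * np.sum(S)        # K * sum(S_j): with the independence assumption

print(full_joint, naive)  # 360 vs 42
```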

At prediction time, for an input $x$, naive Bayes uses the learned model to compute the posterior distribution $P(Y=c_k \mid X=x)$ and outputs the class with the largest posterior probability. The posterior is computed via Bayes' theorem:
$$P(Y=c_k \mid X=x)=\frac{P(X=x \mid Y=c_k)P(Y=c_k)}{\sum_{k}P(X=x \mid Y=c_k)P(Y=c_k)}$$
Substituting the conditional independence assumption into the equation above gives:
$$P(Y=c_k \mid X=x)=\frac{P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_{k}P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)} \mid Y=c_k)},\quad k=1,2,\dots,K$$
This is the basic formula of naive Bayes. Since we ultimately pick the class with the largest posterior probability and the denominator is the same for every class, the denominator can be dropped, so the classifier reduces to:
$$y=f(x)=\underset{c_k}{\arg\max}\, P(Y=c_k)\prod_j P(X^{(j)}=x^{(j)} \mid Y=c_k)$$

The meaning of posterior maximization:

Assigning an instance to the class with the largest posterior probability is equivalent to minimizing the expected risk under 0-1 loss.
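A short derivation of this equivalence under 0-1 loss, following the book's argument:

```latex
% 0-1 loss: L(Y, f(X)) = I(Y \neq f(X)); the expected risk is
R_{\mathrm{exp}}(f) = E_X \sum_{k=1}^{K} I\bigl(c_k \neq f(X)\bigr)\, P(c_k \mid X)
% Minimizing pointwise at each x:
f(x) = \underset{y \in \mathcal{Y}}{\arg\min} \sum_{k=1}^{K} I(c_k \neq y)\, P(c_k \mid X = x)
     = \underset{y}{\arg\min}\, \bigl( 1 - P(Y = y \mid X = x) \bigr)
     = \underset{c_k}{\arg\max}\; P(Y = c_k \mid X = x)
```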

Parameter Estimation

In naive Bayes, learning means estimating $P(Y=c_k)$ and $P(X^{(j)}=x^{(j)} \mid Y=c_k)$ from the training samples. For discrete feature spaces, this article covers two estimators: maximum likelihood estimation and Bayesian estimation. For continuous feature spaces, a Gaussian model can be used instead.

Maximum Likelihood Estimation

The maximum likelihood estimate of the prior $P(Y=c_k)$ is:
$$P(Y=c_k)=\frac{\sum_{i=1}^{N}I(y_i=c_k)}{N},\quad k=1,2,\dots,K$$
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\{a_{j1},a_{j2},\dots,a_{jS_j}\}$.

The maximum likelihood estimate of the conditional probability $P(X^{(j)}=a_{jl} \mid Y=c_k)$ is:
$$P(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^{N}I(x_i^{(j)}=a_{jl},\,y_i=c_k)}{\sum_{i=1}^{N}I(y_i=c_k)},\quad l=1,\dots,S_j;\; j=1,\dots,n;\; k=1,\dots,K$$
where $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample and $a_{jl}$ is the $l$-th possible value of the $j$-th feature.
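A tiny worked example makes the counting concrete. The 6-sample dataset below is made up purely for illustration:

```python
import numpy as np

y  = np.array([0, 0, 0, 1, 1, 1])   # class labels y_i
x1 = np.array([0, 1, 1, 0, 0, 0])   # first feature x_i^(1) of each sample

# MLE of the prior P(Y=0): count of y_i == 0 divided by N
p_prior = (y == 0).sum() / len(y)   # 3/6 = 0.5

# MLE of P(X^(1)=1 | Y=0): joint count divided by the class count
# (of the three Y=0 samples, two have x1 == 1, so 2/3)
p_cond = ((x1 == 1) & (y == 0)).sum() / (y == 0).sum()

print(p_prior, p_cond)  # 0.5 and 0.666...
```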

Bayesian Estimation

Maximum likelihood estimation may produce probability estimates of exactly $0$, which distorts the posterior computation and biases the classification. Bayesian estimation avoids this problem, and it can be viewed as a generalization of maximum likelihood estimation.

The Bayesian estimate of the prior is:
$$P_{\lambda}(Y=c_k)=\frac{\sum_{i=1}^{N}I(y_i=c_k)+\lambda}{N+K\lambda}$$
The Bayesian estimate of the conditional probability is:
$$P_{\lambda}(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^{N}I(x_i^{(j)}=a_{jl},\,y_i=c_k)+\lambda}{\sum_{i=1}^{N}I(y_i=c_k)+S_j\lambda},\quad l=1,\dots,S_j;\; j=1,\dots,n;\; k=1,\dots,K$$
Here $\lambda \geq 0$; it is commonly set to $1$ (Laplace smoothing). With $\lambda=0$ this reduces to maximum likelihood estimation.
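A quick sketch of why smoothing matters (the counts below are assumed for illustration): suppose feature value $a_{jl}$ never co-occurs with class $c_k$ in the training data.

```python
count_joint = 0    # sum_i I(x_i^(j) = a_jl, y_i = c_k): never observed together
count_class = 10   # sum_i I(y_i = c_k)
S_j = 4            # number of values the j-th feature can take
lam = 1            # lambda = 1: Laplace smoothing

mle     = count_joint / count_class                        # 0.0 -> zeroes out the whole product
laplace = (count_joint + lam) / (count_class + S_j * lam)  # 1/14: small but nonzero

print(mle, laplace)
```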

Algorithm

Since Bayesian estimation can be viewed as a generalization of maximum likelihood estimation, the algorithm below uses Bayesian estimation for generality.

Naive Bayes algorithm

Input: training data $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(n)})^T$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)} \in \{a_{j1},a_{j2},\dots,a_{jS_j}\}$ (discrete-valued), and $y_i \in \{c_1,c_2,\dots,c_K\}$; an instance $x$.

Output: the class of instance $x$.

(1) Compute the prior and conditional probabilities.

Prior:
$$P_{\lambda}(Y=c_k)=\frac{\sum_{i=1}^{N}I(y_i=c_k)+\lambda}{N+K\lambda}$$
Conditional probability:
$$P_{\lambda}(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^{N}I(x_i^{(j)}=a_{jl},\,y_i=c_k)+\lambda}{\sum_{i=1}^{N}I(y_i=c_k)+S_j\lambda},\quad l=1,\dots,S_j;\; j=1,\dots,n;\; k=1,\dots,K$$
(2) For the given instance $x=(x^{(1)},x^{(2)},\dots,x^{(n)})^T$, compute:
$$P_{\lambda}(Y=c_k)\prod_{j=1}^{n}P_{\lambda}(X^{(j)}=x^{(j)} \mid Y=c_k),\quad k=1,2,\dots,K$$
(3) Determine the class of $x$:
$$y=\underset{c_k}{\arg\max}\; P_{\lambda}(Y=c_k)\prod_{j=1}^{n}P_{\lambda}(X^{(j)}=x^{(j)} \mid Y=c_k)$$

Code Example

import numpy as np
# Build synthetic data:
# discrete, 4 features, 100 samples
# random training data with no statistical meaning or practical value
x1 = np.random.randint(0,2,100)
x2 = np.random.randint(0,3,100)
x3 = np.random.randint(0,4,100)
x4 = np.random.randint(0,5,100)
x = np.c_[x1,x2,x3,x4]
y = np.random.randint(0,3,100)

Scikit-learn version

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(x, y)
x_test = np.array([[1,2,3,4]])
y_pre = clf.predict(x_test)[0]
print('Predicted class: %r' % y_pre)
Predicted class: 0
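One caveat: MultinomialNB models count features, while the features here are categorical codes. Scikit-learn's CategoricalNB is a closer match to the textbook model, since its alpha parameter plays the role of $\lambda$ (alpha=1 gives Laplace smoothing). A self-contained sketch, rebuilding the same kind of random data:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Random categorical data of the same shape as above (no statistical meaning)
rng = np.random.default_rng(0)
x = np.c_[rng.integers(0, 2, 100), rng.integers(0, 3, 100),
          rng.integers(0, 4, 100), rng.integers(0, 5, 100)]
y = rng.integers(0, 3, 100)

clf = CategoricalNB(alpha=1.0)  # alpha plays the role of lambda
clf.fit(x, y)
y_pre = clf.predict(np.array([[1, 2, 3, 4]]))[0]
print('Predicted class: %r' % y_pre)
```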

From-scratch version (Bayesian estimation)

Define a counting function that computes $\sum_{i=1}^{N}I(y_i=c_k)$:

def sum_label(Y,label):
    return (Y==label).sum()

Define a counting function that computes $\sum_{i=1}^{N}I(x_i^{(j)}=a_{jl},\,y_i=c_k)$:

def sum_unit(X,Y,j,x_j,label):
    a = (X[:,j] == x_j)
    b = (Y == label)
    out = a&b
    return out.sum()

Compute the prior $P_{\lambda}(Y=c_k)$:
$$P_{\lambda}(Y=c_k)=\frac{\sum_{i=1}^{N}I(y_i=c_k)+\lambda}{N+K\lambda}$$

def prior(Y,label,lbd):
    P_prior = (sum_label(Y,label)+lbd)/(len(Y)+len(np.unique(Y))*lbd)
    return P_prior

Compute the conditional probability $P_{\lambda}(X^{(j)}=a_{jl} \mid Y=c_k)$:
$$P_{\lambda}(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^{N}I(x_i^{(j)}=a_{jl},\,y_i=c_k)+\lambda}{\sum_{i=1}^{N}I(y_i=c_k)+S_j\lambda},\quad l=1,\dots,S_j;\; j=1,\dots,n;\; k=1,\dots,K$$

def cond(X,Y,j,x_j,label,lbd):
    S_j = len(np.unique(X[:,j]))
    P_cond = (sum_unit(X,Y,j,x_j,label)+lbd)/(sum_label(Y,label)+S_j*lbd)
    return P_cond

Compute
$$P_{\lambda}(Y=c_k)\prod_{j=1}^{n}P_{\lambda}(X^{(j)}=x^{(j)} \mid Y=c_k),\quad k=1,2,\dots,K$$

def p_in_cls_k(X,Y,x,label,lbd):
    P_prior = prior(Y, label, lbd)
    dim = X.shape[1]
    P_cond_all = 1
    for j in range(dim):
        P_cond_all *= cond(X,Y,j,x[j],label,lbd)
    return P_prior*P_cond_all

The full naive Bayes classifier:

def naive_bayes(X,Y,x,lbd):
    all_class = np.unique(Y)  # np.unique returns the class labels sorted
    all_P = np.zeros(len(all_class))
    for i,cls in enumerate(all_class):
        all_P[i] = p_in_cls_k(X,Y,x,cls,lbd)
    cls_index = np.argmax(all_P)
    return all_class[cls_index]
x_test = np.array([1,2,3,4])
y_pre = naive_bayes(x,y,x_test,1)
print('Predicted class: %r' % y_pre)
Predicted class: 0
