《统计学习方法》 Series (4)

This post corresponds to Chapter 4 of the book, which covers the naive Bayes method. Naive Bayes is a classification method based on Bayes' theorem and the assumption of conditional independence among features. For a given training data set, it first learns the joint probability distribution of input and output under the conditional independence assumption; then, based on this model, for a given input $x$ it uses Bayes' theorem to find the output $y$ with the largest posterior probability.


1. Theory

1.1 Model Principle

Let the input space $\mathbf X \subseteq \mathbf R^n$ be a set of $n$-dimensional vectors, and let the output space be the set of class labels $\mathbf Y=\{c_1,c_2,\dots,c_K\}$. The input is a feature vector $x \in \mathbf X$ and the output is a class label $y \in \mathbf Y$. $X$ is a random vector defined on the input space $\mathbf X$, and $Y$ is a random variable defined on the output space $\mathbf Y$; $P(X,Y)$ is the joint probability distribution of $X$ and $Y$. The training set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$ is generated independently and identically distributed according to $P(X,Y)$.
Naive Bayes learns the joint probability distribution $P(X,Y)$ from the training set. Specifically, it learns the following prior probability distribution and conditional probability distribution. Prior probability distribution:
$$P(Y=c_k),\quad k=1,2,\dots,K$$
Conditional probability distribution:
$$P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)}|Y=c_k),\quad k=1,2,\dots,K$$
From these, the joint probability distribution $P(X,Y)$ is learned.
Naive Bayes makes a conditional independence assumption on this conditional probability distribution: given the class, the features used for classification are all conditionally independent. Because this is a rather strong assumption, the method owes the word "naive" in its name to it. Concretely, the conditional independence assumption is
$$P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)}|Y=c_k)=\prod^n_{j=1}P(X^{(j)}=x^{(j)}|Y=c_k)$$
This assumption makes the model simple, but it sometimes sacrifices some classification accuracy.
When classifying, for a given input $x$, naive Bayes uses the learned model to compute the posterior probability distribution $P(Y=c_k|X=x)$ and outputs the class with the largest posterior probability:
$$y=f(x)=\arg\max_{c_k}\frac{P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)}|Y=c_k)}{\sum_k P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)}|Y=c_k)}=\arg\max_{c_k}P(Y=c_k)\prod_j P(X^{(j)}=x^{(j)}|Y=c_k),\quad k=1,2,\dots,K$$
The denominator is the same for every class, so it can be dropped when taking the argmax. Maximizing the posterior probability in this way is essentially minimizing the expected risk under the 0-1 loss function.
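As a tiny illustration of this decision rule (the numbers below are made up for illustration, not taken from the book), suppose there are two classes and two binary features; the prediction is simply the class whose prior times per-feature conditionals is largest, and the shared denominator never needs to be computed:

import numpy as np

# Hypothetical estimates for two classes c1, c2 and one input x = (x1, x2).
prior = {"c1": 0.6, "c2": 0.4}          # P(Y = c_k)
cond = {"c1": [0.3, 0.7],               # P(X^(j) = x^(j) | Y = c_k), j = 1, 2
        "c2": [0.8, 0.2]}

# Unnormalized posteriors P(Y = c_k) * prod_j P(X^(j) = x^(j) | Y = c_k);
# the denominator is identical for every class, so the argmax is unaffected.
scores = {ck: prior[ck] * np.prod(cond[ck]) for ck in prior}
print(scores)                           # c1: ~0.126, c2: ~0.064
print(max(scores, key=scores.get))      # 'c1'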

1.2 Parameter Estimation

In naive Bayes, learning means estimating $P(Y=c_k)$ and $P(X^{(j)}=x^{(j)}|Y=c_k)$, which can be done with either maximum likelihood estimation or Bayesian estimation.

1.2.1 Maximum Likelihood Estimation

The maximum likelihood estimate of the prior probability $P(Y=c_k)$ is
$$P(Y=c_k)=\frac{\sum^{N}_{i=1}I(y_i=c_k)}{N},\quad k=1,2,\dots,K$$
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\{a_{j1},a_{j2},\dots,a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)}=a_{jl}|Y=c_k)$ is
$$P(X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum^N_{i=1}I(x^{(j)}_i=a_{jl},y_i=c_k)}{\sum^N_{i=1}I(y_i=c_k)},\quad j=1,2,\dots,n;\ l=1,2,\dots,S_j;\ k=1,2,\dots,K$$
where $x^{(j)}_i$ is the $j$-th feature of the $i$-th sample and $a_{jl}$ is the $l$-th value that the $j$-th feature can take.
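A quick sketch of these two estimates (the tiny discrete data set below is made up for illustration): both are just relative frequencies obtained by counting.

import numpy as np

# Hypothetical training data: one discrete feature x^(1) and labels y.
x1 = np.array([1, 1, 2, 2, 2, 3])
y = np.array([1, 1, 1, -1, -1, -1])

# MLE of the prior P(Y = 1): fraction of samples with label 1.
prior_pos = np.mean(y == 1)                               # 3/6 = 0.5

# MLE of the conditional P(X^(1) = 2 | Y = -1):
# among samples with label -1, the fraction whose feature equals 2.
cond = np.sum((x1 == 2) & (y == -1)) / np.sum(y == -1)    # 2/3
print(prior_pos, cond)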

1.2.2 Bayesian Estimation

With maximum likelihood estimation, some of the estimated probabilities may turn out to be zero. This distorts the computation of the posterior probabilities and biases the classification. The remedy is Bayesian estimation.
The Bayesian estimate of the prior probability $P(Y=c_k)$ is
$$P(Y=c_k)=\frac{\sum^{N}_{i=1}I(y_i=c_k)+\lambda}{N+K\lambda},\quad k=1,2,\dots,K$$
The Bayesian estimate of the conditional probability $P(X^{(j)}=a_{jl}|Y=c_k)$ is
$$P(X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum^N_{i=1}I(x^{(j)}_i=a_{jl},y_i=c_k)+\lambda}{\sum^N_{i=1}I(y_i=c_k)+S_j\lambda},\quad j=1,2,\dots,n;\ l=1,2,\dots,S_j;\ k=1,2,\dots,K$$
where $\lambda \ge 0$. When $\lambda=0$ this reduces to the maximum likelihood estimate; the common choice $\lambda=1$ is called Laplace smoothing.
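Continuing with counts, a minimal sketch of the Bayesian estimate of a conditional probability (the helper function and the counts are hypothetical, only for illustration); setting $\lambda=0$ recovers the maximum likelihood estimate, while $\lambda=1$ gives Laplace smoothing and avoids zero probabilities:

def smoothed_cond(count_xy, count_y, S_j, lam=1.0):
    # (#{x^(j) = a_jl and y = c_k} + lambda) / (#{y = c_k} + S_j * lambda)
    return (count_xy + lam) / (count_y + S_j * lam)

# Hypothetical counts: feature x^(1) can take S_1 = 3 values, class c_k has
# 3 training samples, and the value a_11 never occurs among them.
print(smoothed_cond(0, 3, 3, lam=0))    # 0.0           -> maximum likelihood
print(smoothed_cond(0, 3, 3, lam=1))    # 1/6 ~ 0.167   -> Laplace smoothing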

1.3 Summary

1. Naive Bayes is a typical generative learning method;
2. Thanks to the conditional independence assumption, naive Bayes is efficient and easy to implement, but for the same reason it gives up some classification performance;
3. If the input features are allowed to depend on each other probabilistically, i.e., the conditional independence assumption is dropped, the model becomes a Bayesian network;
4. In practice, depending on the assumption made about the conditional distribution $P(X^{(j)}=x^{(j)}|Y=c_k)$, naive Bayes comes in three flavors: the Gaussian model, the multinomial model, and the Bernoulli model. The Gaussian model is usually applied when the features are continuous, while the multinomial and Bernoulli models are commonly used for text classification; see the references for details and the short sklearn sketch after this list.
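As mentioned in point 4, scikit-learn exposes all three variants. Below is a minimal sketch on toy data (the arrays are invented purely for illustration); the alpha parameter of MultinomialNB and BernoulliNB plays the role of the $\lambda$ above, with alpha=1 corresponding to Laplace smoothing:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous features -> Gaussian model.
X_cont = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 0.4], [3.0, 0.6]])
print(GaussianNB().fit(X_cont, y).predict([[1.0, 2.0]]))            # [0]

# Count features (e.g. word counts) -> multinomial model, alpha = lambda.
X_cnt = np.array([[2, 0, 1], [3, 1, 0], [0, 2, 4], [1, 3, 5]])
print(MultinomialNB(alpha=1.0).fit(X_cnt, y).predict([[2, 0, 1]]))

# Binary occurrence features -> Bernoulli model.
X_bin = (X_cnt > 0).astype(int)
print(BernoulliNB(alpha=1.0).fit(X_bin, y).predict([[1, 0, 1]]))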

2. Code Implementation

2.1 From-Scratch Implementation
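The hand-written version below implements the Gaussian model: under the conditional independence assumption, each feature given the class is modeled as a one-dimensional normal distribution,
$$P(X^{(j)}=x^{(j)}|Y=c_k)=\frac{1}{\sqrt{2\pi\sigma^2_{kj}}}\exp\left(-\frac{(x^{(j)}-\mu_{kj})^2}{2\sigma^2_{kj}}\right)$$
where the mean $\mu_{kj}$ and variance $\sigma^2_{kj}$ are estimated from the training samples of class $c_k$ (stored as theta_ and sigma_ in the code below).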

from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter


class GaussianNB:
    """Gaussian naive Bayes: each feature, given the class, is modeled as a 1-D normal distribution."""

    def __init__(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train
        self.get_params()

    def get_params(self):
        # Class priors P(Y=c_k), estimated by class frequencies.
        samples_num = len(self.y_train)
        label_count = Counter(self.y_train).items()
        self.class_ = np.array([label for label, _ in label_count])
        self.class_count_ = np.array([count for _, count in label_count])
        self.class_prior_ = self.class_count_ / samples_num

        # Per-class mean and variance of every feature (the Gaussian parameters).
        sample_split = [self.X_train[self.y_train == c] for c in self.class_]
        self.theta_ = np.array([samples.mean(axis=0) for samples in sample_split])
        self.sigma_ = np.array([samples.var(axis=0) for samples in sample_split])

    def gaussian_func(self, X):
        # Unnormalized posterior for each class:
        # P(Y=c_k) * prod_j N(x^(j); theta_kj, sigma_kj).
        prob_list = []
        for theta, sigma, prior in zip(self.theta_, self.sigma_, self.class_prior_):
            prob_ = np.prod(1 / np.sqrt(2 * np.pi * sigma) * np.exp(-np.square(X - theta) / (2 * sigma))) * prior
            prob_list.append(prob_)
        return np.array(prob_list)

    def predict(self, X_test):
        # For every test sample, output the class with the largest posterior.
        label_list = []
        for X in X_test:
            prob_list = self.gaussian_func(X)
            label = self.class_[np.argmax(prob_list)]
            label_list.append(label)
        return np.array(label_list)

    def score(self, X_test, y_test):
        # Classification accuracy on the test set.
        total_num = len(X_test)
        pre = (self.predict(X_test) == y_test).sum()
        score = pre / total_num
        return score


if __name__ == "__main__":
    # Use the first two features of the first two iris classes as a binary
    # problem, relabeling class 0 as -1.
    iris = load_iris()
    X = iris.data[:100, :2]
    y = iris.target[:100]
    y[y == 0] = -1
    xlabel = iris.feature_names[0]
    ylabel = iris.feature_names[1]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Plot the training samples of the two classes.
    X_0 = X_train[y_train == -1]
    X_1 = X_train[y_train == 1]

    plt.figure("nb-mine")
    plt.scatter(X_0[:, 0], X_0[:, 1], label='-1')
    plt.scatter(X_1[:, 0], X_1[:, 1], label='1')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.legend()

    # Fit the hand-written Gaussian naive Bayes and evaluate on the test set.
    clf = GaussianNB(X_train, y_train)
    score = clf.score(X_test, y_test)
    print("score : %s" % score)

    # Mark the predicted labels of the test samples on the same figure.
    y_pre = clf.predict(X_test)
    X_test_pre_0 = X_test[y_pre == -1]
    X_test_pre_1 = X_test[y_pre == 1]
    plt.scatter(X_test_pre_0[:, 0], X_test_pre_0[:, 1], color='r', label='pre -1')
    plt.scatter(X_test_pre_1[:, 0], X_test_pre_1[:, 1], color='k', label='pre 1')
    plt.legend()
    plt.show()

2.2 sklearn Implementation

from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split


if __name__ == "__main__":
    # Same binary subset of the iris data as in the from-scratch version.
    iris = load_iris()
    X = iris.data[:100, :2]
    y = iris.target[:100]
    y[y == 0] = -1
    xlabel = iris.feature_names[0]
    ylabel = iris.feature_names[1]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Plot the training samples of the two classes.
    X_0 = X_train[y_train == -1]
    X_1 = X_train[y_train == 1]

    plt.figure("GaussianNB-sklearn")
    plt.scatter(X_0[:, 0], X_0[:, 1], label='-1')
    plt.scatter(X_1[:, 0], X_1[:, 1], label='1')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.legend()

    # Fit sklearn's GaussianNB and evaluate on the test set.
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print("score : %s" % score)

    # Mark the predicted labels of the test samples on the same figure.
    y_pre = clf.predict(X_test)
    X_test_pre_0 = X_test[y_pre == -1]
    X_test_pre_1 = X_test[y_pre == 1]
    plt.scatter(X_test_pre_0[:, 0], X_test_pre_0[:, 1], color='r', label='pre -1')
    plt.scatter(X_test_pre_1[:, 0], X_test_pre_1[:, 1], color='k', label='pre 1')
    plt.legend()
    plt.show()

The code has been uploaded to GitHub: https://github.com/xiongzwfire/statistical-learning-method


# References

[1] http://scikit-learn.org/stable/modules/naive_bayes.html
[2] https://www.letiantian.me/2014-10-12-three-models-of-naive-nayes/
The above are all the references for this article; thanks to the original authors.
