Notes on Statistical Learning Methods (Li Hang), Chapter 4: The Naive Bayes Method

Recommended reading: "A Plain-Language Introduction to Bayes' Theorem for Beginners" (小白之通俗易懂的贝叶斯定理)

Naive Bayes is a multi-class classification algorithm built on the "naive Bayes assumption" (the features of an instance are assumed to be conditionally independent given the class). It estimates the model's prior and conditional probabilities from the training set, then predicts the class of an input instance by the criterion of posterior probability maximization. The algorithm is simple to implement, but its theoretical justification is not trivial.

Specifically, the prior and conditional probabilities are estimated by maximum likelihood; the derivation is somewhat involved and is not given in the book. This chapter is organized into three parts:

  • The naive Bayes classifier: its basic assumption and implementation;
  • Maximum likelihood estimation of the prior and conditional probabilities;
  • Bayesian estimation and Laplace smoothing.

I. The Naive Bayes Classifier

Input: training set $T=\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where

$x_i=(x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T$, with $x_i^{(j)}$ the $j$-th feature of the $i$-th instance, and $x_i^{(j)} \in \{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$,

i.e. each feature takes values in a finite set ($S_j$ possible values for feature $j$; the sets may differ across features), and $y_i \in \{c_1, c_2, \ldots, c_K\}$, i.e. the class label belongs to a finite set of $K$ classes.

Output: the predicted class of instance $x$.

1. Compute the prior and conditional probabilities

$$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N}, \quad k=1,2,\ldots,K$$

$$P(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\ y_i=c_k)}{\sum_{i=1}^{N} I(y_i=c_k)}$$

$$j=1,2,\ldots,n;\quad l=1,2,\ldots,S_j;\quad k=1,2,\ldots,K$$
These formulas estimate the prior and conditional probabilities of the population distribution from the known data (the training set), where $N$ is the number of training instances, $K$ the number of classes, $n$ the dimension of the feature space, and $S_j$ the number of values taken by the $j$-th feature.

Remarks:

The prior probability $P(Y=c_k)$ is the proportion of training instances labeled $c_k$: the indicator function $I(y_i=c_k)$ counts the instances of class $c_k$, and dividing by the total number of instances $N$ gives the proportion.

The conditional probability $P(X^{(j)}=a_{jl} \mid Y=c_k)$ equals the joint probability divided by the prior probability, and the joint probability can in turn be expressed through the number of training instances whose $X$ and $Y$ values satisfy both conditions at once. In fact, the conditional probability can be written as

$$P(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\ y_i=c_k)/N}{\sum_{i=1}^{N} I(y_i=c_k)/N}$$

where the denominator is the prior probability and the numerator is the joint probability. The joint probability equals the proportion of training instances for which $x_i^{(j)}=a_{jl}$ and $y_i=c_k$ hold simultaneously.
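
To make the counting concrete, here is a minimal sketch (mine, not the book's; the tiny data set below is hypothetical) that computes the prior and conditional probability estimates directly from the indicator sums:

import numpy as np

# Hypothetical training set: N = 5 instances, n = 2 discrete features, K = 2 classes
X = np.array([[0, 1], [0, 0], [1, 1], [1, 1], [0, 0]])
y = np.array([0, 0, 1, 1, 0])

classes = np.unique(y)
prior = {c: np.mean(y == c) for c in classes}  # P(Y = c_k) = N_k / N

def cond_prob(j, a, c):
    # P(X^(j) = a | Y = c): ratio of the two indicator sums
    mask = (y == c)
    return np.sum(X[mask, j] == a) / np.sum(mask)

print(prior)               # class proportions: 0.6 and 0.4
print(cond_prob(0, 1, 1))  # 1.0, since both class-1 instances have X^(1) = 1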

Supplement: the relationships among joint, marginal, and conditional probability, and Bayes' formula

2. For a given instance $x=(x^{(1)}, x^{(2)}, \ldots, x^{(n)})^T$, compute

$$f(c_k)=P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k), \quad k=1,2,\ldots,K$$

$f(c_k)$ is proportional to the posterior probability $P(Y=c_k \mid X=x)$, because by Bayes' formula

$$P(Y=c_k \mid X=x)=\frac{P(X=x \mid Y=c_k)\,P(Y=c_k)}{P(X=x)}=\frac{\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)\,P(Y=c_k)}{P(X=x)}$$

Here $P(X=x \mid Y=c_k)=\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)$ is precisely the "naive Bayes assumption": conditional on $Y=c_k$, the probability that the feature vector equals $x$ is the product of the probabilities that its components take the corresponding values. The naive Bayes assumption (also called the conditional independence assumption) serves to simplify computation; in practice the features may be correlated, so the assumption can increase the error.

Since the denominator $P(X=x)$ does not change as $c_k$ varies, it can be treated as a constant.

3. Determine the class of instance $x$
$$y=\arg\max_{c_k} P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)=\arg\max_{c_k} f(c_k)$$

Since $f(c_k)$ is proportional to the posterior probability, it suffices to find the $c_k$ that maximizes $f(c_k)$; this is exactly maximizing the posterior probability.
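
Continuing the sketch above, the classification step is just an argmax of $f(c_k)$ over the $K$ classes (again an illustrative sketch, not the book's code):

def predict(x):
    # f(c_k) = P(Y = c_k) * prod_j P(X^(j) = x^(j) | Y = c_k); pick the largest
    scores = {c: prior[c] * np.prod([cond_prob(j, a, c) for j, a in enumerate(x)])
              for c in classes}
    return max(scores, key=scores.get)

print(predict([1, 1]))  # 1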

II. Maximum Likelihood Estimation of the Parameters

Maximum likelihood estimation is normally used only once the form of the population distribution has been fixed. How, then, should the parameters be estimated when the population distribution is unknown? What are the parameters to be estimated, and what is the likelihood function?

1. Maximum likelihood estimation of the prior probabilities
The "prediction system" of the naive Bayes algorithm is a classifier: feed it an instance $x$ and it returns a class $y$, one of the $K$ classes $c_k$. Picture it as a black box holding balls of $K$ colors; drawing one ball completes one classification prediction. The proportions of the colors inside the box determine the prediction probabilistically, so it is natural to take these "ball proportions" (i.e. the prior probabilities of the classes) as the parameters:

$$\theta_1=P(Y=c_1),\ \theta_2=P(Y=c_2),\ \ldots,\ \theta_K=P(Y=c_K)$$

The $N$ instances of the training set are the outcomes of drawing $N$ balls from the black box. The estimation problem: given that these $N$ outcomes occurred, what should the proportions of balls inside the box be? This yields the likelihood function:

$$L(Y;\theta)=P(Y=y_1, Y=y_2, \ldots, Y=y_N;\ \theta_1,\ldots,\theta_K), \qquad y_i \in \{c_1, c_2, \ldots, c_K\}$$

Since the training instances are independent, $L(Y;\theta)=\prod_{i=1}^{N} P(Y=y_i;\theta)$.

Here $P(Y=y_i)=P(Y=c_k)=\theta_k$ whenever $y_i=c_k$, so, grouping identical factors,

$$L(Y;\theta)=\prod_{i=1}^{N} P(Y=y_i;\theta)=\prod_{k=1}^{K}\theta_k^{N_k}$$

Its log-likelihood is: $l(Y;\theta)=\ln\prod_{k=1}^{K}\theta_k^{N_k}=\sum_{k=1}^{K} N_k\ln\theta_k$,

where $N_k$ is the number of training instances with class $c_k$.
Note that two constraints are implicit here:

(1) $\sum_{k=1}^{K}\theta_k=\theta_1+\theta_2+\ldots+\theta_K=1$, i.e. the class proportions sum to 1;

(2) $\sum_{k=1}^{K} N_k=N_1+N_2+\ldots+N_K=N$, i.e. the per-class instance counts sum to the total number of instances.

Because of the equality constraint, a Lagrange multiplier is introduced to find the constrained extremum:
$$l(\theta_k,\lambda)=\sum_{k=1}^{K} N_k\ln\theta_k+\lambda\Big(\sum_{k=1}^{K}\theta_k-1\Big),$$ whose maximum lies where the partial derivatives vanish:

$$\frac{\partial l}{\partial \theta_k}=\frac{\partial}{\partial \theta_k}\Big[(N_1\ln\theta_1+N_2\ln\theta_2+\ldots+N_K\ln\theta_K)+\lambda(\theta_1+\theta_2+\ldots+\theta_K-1)\Big]=\frac{N_k}{\theta_k}+\lambda=0,$$
which rearranges to $N_k=-\lambda\theta_k$ for $k=1,2,\ldots,K$, i.e. $K$ equations in total.

Adding the left and right sides of these $K$ equations gives $N_1+N_2+\ldots+N_K=-\lambda(\theta_1+\theta_2+\ldots+\theta_K)$.

By the two constraints, $\lambda=-N$, so the prior probability is

$$P(Y=c_k)=\theta_k=\frac{N_k}{N}=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N}$$
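
As a numerical sanity check (not in the book), one can maximize the constrained log-likelihood with an off-the-shelf optimizer and confirm that it recovers $\theta_k=N_k/N$; the counts below are hypothetical:

import numpy as np
from scipy.optimize import minimize

N_k = np.array([30.0, 50.0, 20.0])  # hypothetical per-class counts, N = 100

res = minimize(
    lambda t: -np.sum(N_k * np.log(t)),  # negative log-likelihood
    x0=np.full(3, 1 / 3),
    bounds=[(1e-9, 1.0)] * 3,
    constraints=[{"type": "eq", "fun": lambda t: t.sum() - 1.0}],
)
print(res.x)            # ≈ [0.3, 0.5, 0.2]
print(N_k / N_k.sum())  # the closed-form MLE N_k / N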

2. Maximum likelihood estimation of the conditional probabilities

The prior probabilities are known: $\theta_k=P(Y=c_k)$.

Let the conditional probability be $\mu_{lk}=P(X^{(j)}=a_{jl} \mid Y=c_k)$; note that only the $j$-th feature is considered here, the other features being handled in the same way.

$$P(Y=c_k)\,P(X^{(j)}=a_{jl} \mid Y=c_k)=P(X^{(j)}=a_{jl},\ Y=c_k)=\theta_k\mu_{lk}$$

The likelihood function is $$L\big((X^{(j)}=x_1^{(j)}, Y=y_1),\ldots,(X^{(j)}=x_N^{(j)}, Y=y_N);\ \mu,\theta\big)=\prod_{i=1}^{N} P\big(X^{(j)}=x_i^{(j)},\ Y=y_i\big)$$

where $x_i^{(j)}\in\{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$ and $l\in\{1,2,\ldots,S_j\}$.

Grouping the factors by the value pair $(a_{jl}, c_k)$,
$$L(X,Y;\mu,\theta)=\prod_{i=1}^{N} P\big(X^{(j)}=x_i^{(j)},\ Y=y_i\big)=\prod_{l=1}^{S_j}\prod_{k=1}^{K}\left(\theta_k\mu_{lk}\right)^{N_{lk}}$$

where $N_{lk}$ denotes the number of instances satisfying both $X^{(j)}=a_{jl}$ and $Y=c_k$.

The log-likelihood is: $$l(\mu,\theta)=\sum_{l=1}^{S_j}\sum_{k=1}^{K} N_{lk}\left(\ln\mu_{lk}+\ln\theta_k\right)$$

A new constraint (3) is added:

$$\sum_{l=1}^{S_j}\mu_{lk}=1,$$ since $\mu_{lk}=P(X^{(j)}=a_{jl} \mid Y=c_k)$ and $x_i^{(j)}\in\{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$ is a finite set: summing the probabilities of all possible values gives 1, i.e.
$$P(X^{(j)}=a_{j1} \mid Y=c_k)+P(X^{(j)}=a_{j2} \mid Y=c_k)+\ldots+P(X^{(j)}=a_{jS_j} \mid Y=c_k)=1$$

Introduce a Lagrange multiplier to find the constrained extremum:

$$l(\mu_{lk},\theta_k,\lambda)=\sum_{l=1}^{S_j}\sum_{k=1}^{K} N_{lk}\left(\ln\mu_{lk}+\ln\theta_k\right)+\lambda\Big(\sum_{l=1}^{S_j}\mu_{lk}-1\Big)$$

μ l k \mu_{l k} μlk 的偏导数等于0,过程与之前大同小异,只有下标恰好是 l k lk lk的项留下了,其他均为零.

$$\frac{\partial l}{\partial \mu_{lk}}=\frac{N_{lk}}{\mu_{lk}}+\lambda=0,$$ which rearranges to $N_{lk}=-\lambda\mu_{lk}$; summing over $l$ gives

$$N_{1k}+N_{2k}+\ldots+N_{S_j k}=-\lambda\left(\mu_{1k}+\ldots+\mu_{S_j k}\right),$$ i.e. $N_k=-\lambda$, and therefore the conditional probability is

$$P(X^{(j)}=a_{jl} \mid Y=c_k)=\mu_{lk}=\frac{N_{lk}}{N_k}=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\ y_i=c_k)}{\sum_{i=1}^{N} I(y_i=c_k)}$$

III. Bayesian Estimation of the Parameters

The maximum likelihood estimates obtained above are:

Prior probability: $P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N}$

Conditional probability: $P(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\ y_i=c_k)}{\sum_{i=1}^{N} I(y_i=c_k)}$

But if some class has zero instances in the training set, the denominator of the conditional probability becomes 0; likewise, if some feature value never co-occurs with a class, the corresponding conditional probability is 0, which zeroes out the entire product $f(c_k)$.
To solve this problem, a positive number $\lambda>0$ is introduced, so that

$$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)+\lambda}{N+K\lambda}$$

$$P(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\ y_i=c_k)+\lambda}{\sum_{i=1}^{N} I(y_i=c_k)+S_j\lambda}$$
These formulas are called the "Bayesian estimates" of the prior and conditional probabilities; the case $\lambda=1$ is known as Laplace smoothing.
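
A minimal sketch of these smoothed estimates with $\lambda=1$, reusing the hypothetical X, y from the earlier sketch:

lam = 1.0
K = len(classes)

def prior_smooth(c):
    # (count of class c + lambda) / (N + K * lambda)
    return (np.sum(y == c) + lam) / (len(y) + K * lam)

def cond_prob_smooth(j, a, c):
    S_j = len(np.unique(X[:, j]))  # number of values feature j can take
    mask = (y == c)
    return (np.sum(X[mask, j] == a) + lam) / (np.sum(mask) + S_j * lam)

print(prior_smooth(0))            # (3 + 1) / (5 + 2) ≈ 0.571
print(cond_prob_smooth(0, 1, 0))  # (0 + 1) / (3 + 2) = 0.2, no longer exactly 0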

Example:

Loading the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from collections import Counter
import math

# load the first two iris classes (100 instances) as a binary data set
def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
    data = np.array(df.iloc[:100, :])
    # print(data)
    return data[:,:-1], data[:,-1]
    
X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

X_test[0], y_test[0]
#(array([5.1, 3.8, 1.9, 0.4]), 0.0)

GaussianNB: Gaussian Naive Bayes

Each feature's class-conditional likelihood is assumed to be Gaussian, with probability density function:

$$P(x_i \mid y_k)=\frac{1}{\sqrt{2\pi\sigma_{y_k}^2}}\exp\left(-\frac{(x_i-\mu_{y_k})^2}{2\sigma_{y_k}^2}\right)$$

Mean: $\mu$

Variance: $\sigma^2=\frac{\sum(X-\mu)^2}{N}$


class NaiveBayes:
    def __init__(self):
        self.model = None

    # Sample mean of a feature
    @staticmethod
    def mean(X):
        return sum(X) / float(len(X))

    # Sample standard deviation of a feature
    def stdev(self, X):
        avg = self.mean(X)
        return math.sqrt(sum([pow(x - avg, 2) for x in X]) / float(len(X)))

    # Gaussian probability density function
    def gaussian_probability(self, x, mean, stdev):
        exponent = math.exp(-(math.pow(x - mean, 2) /
                              (2 * math.pow(stdev, 2))))
        return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

    # Summarize each feature column of the training data as (mean, stdev)
    def summarize(self, train_data):
        summaries = [(self.mean(i), self.stdev(i)) for i in zip(*train_data)]
        return summaries

    # Compute the mean and stdev of each feature, separately for each class
    def fit(self, X, y):
        labels = list(set(y))
        data = {label: [] for label in labels}
        for f, label in zip(X, y):
            data[label].append(f)
        self.model = {
            label: self.summarize(value)
            for label, value in data.items()
        }
        return 'gaussianNB train done!'

    # Multiply the Gaussian densities of all features, for each class
    def calculate_probabilities(self, input_data):
        # summaries:{0.0: [(5.0, 0.37),(3.42, 0.40)], 1.0: [(5.8, 0.449),(2.7, 0.27)]}
        # input_data:[1.1, 2.2]
        probabilities = {}
        for label, value in self.model.items():
            probabilities[label] = 1
            for i in range(len(value)):
                mean, stdev = value[i]
                probabilities[label] *= self.gaussian_probability(
                    input_data[i], mean, stdev)
        return probabilities

    # Predict the class with the largest (unnormalized) likelihood
    def predict(self, X_test):
        # {0.0: 2.9680340789325763e-27, 1.0: 3.5749783019849535e-26}
        label = sorted(
            self.calculate_probabilities(X_test).items(),
            key=lambda x: x[-1])[-1][0]
        return label

    def score(self, X_test, y_test):
        right = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right += 1

        return right / float(len(X_test))

model = NaiveBayes()
model.fit(X_train, y_train)
#'gaussianNB train done!'

print(model.predict([4.4,  3.2,  1.3,  0.2]))
#0.0

model.score(X_test, y_test)
#1.0

A scikit-learn example

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(X_train, y_train)
#GaussianNB(priors=None, var_smoothing=1e-09)

clf.score(X_test, y_test)
#1.0

clf.predict([[4.4,  3.2,  1.3,  0.2]])
#array([0.])
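
For discrete features, scikit-learn's CategoricalNB uses smoothed estimates of the same form as in Part III, with its alpha parameter playing the role of $\lambda$ (alpha=1.0 corresponds to Laplace smoothing). A minimal sketch, where X_cat_train is an assumed matrix of non-negative integer feature codes:

from sklearn.naive_bayes import CategoricalNB

cat_clf = CategoricalNB(alpha=1.0)  # alpha corresponds to lambda above
# cat_clf.fit(X_cat_train, y_train)  # X_cat_train: hypothetical integer-encoded features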

Exercises

Exercise 4.1: Use maximum likelihood estimation to derive the naive Bayes probability estimates, formula (4.8) and formula (4.9).

Step 1: prove formula (4.8): $P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N}$

Let $p=P(Y=c_k)$ and let $m=\sum_{i=1}^{N} I(y_i=c_k)$ be the number of instances of class $c_k$; then $m$ follows a binomial distribution, and the likelihood is
$$L(p)=f_D(y_1, y_2, \cdots, y_N \mid \theta)=\binom{N}{m} p^m (1-p)^{N-m}$$

To find the extremum, differentiate with respect to $p$ and set the derivative to zero:
$$0=\binom{N}{m}\left[m p^{m-1}(1-p)^{N-m}-(N-m)p^m(1-p)^{N-m-1}\right]=\binom{N}{m}\left[p^{m-1}(1-p)^{N-m-1}(m-Np)\right]$$
The solutions are $p=0$, $p=1$, and $p=\frac{m}{N}$; the first two give zero likelihood, so the maximum is at $p=\frac{m}{N}$.

Hence $P(Y=c_k)=p=\frac{m}{N}=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N}$, which proves formula (4.8).

References:
Huang Bo, Machine Learning for Beginners (机器学习初学者)
Solutions to the Exercises of Statistical Learning Methods (统计学习方法习题解答)
