Recommended reading: a beginner-friendly, plain-language explanation of Bayes' Theorem.
Naive Bayes is a multi-class classification algorithm built on the "naive Bayes assumption": the features of an instance are assumed to be conditionally independent given the class. It estimates the model's prior and conditional probabilities from the training set, then predicts the class of an input instance by the criterion of maximizing the posterior probability. The algorithm is simple to implement, but proving it correct is less easy.

In particular, the prior and conditional probabilities are estimated by maximum likelihood; the derivation is fairly involved and is not given in the book. This chapter has three parts: the naive Bayes algorithm itself, the maximum likelihood derivation of its probability estimates, and Bayesian estimation (Laplace smoothing) together with a code implementation.
Input: training set $T=\{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$, where $x_i=\left(x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)}\right)^T$ and $x_i^{(j)} \in \{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$ is the $j$-th feature of the $i$-th sample. That is, each feature takes values in a finite set ($S_j$ values in total, possibly different for different features), and $y_i \in \{c_1, c_2, \ldots, c_K\}$, i.e., the set of class labels is finite (with $K$ elements).

Output: the predicted class of an instance $x$.
1. Compute the prior and conditional probabilities
$$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N}, \quad k=1,2,\ldots,K$$

$$P\left(X^{(j)}=a_{jl} \mid Y=c_k\right)=\frac{\sum_{i=1}^{N} I\left(x_i^{(j)}=a_{jl},\ y_i=c_k\right)}{\sum_{i=1}^{N} I\left(y_i=c_k\right)}$$

$$j=1,2,\ldots,n; \quad l=1,2,\ldots,S_j; \quad k=1,2,\ldots,K$$

These formulas estimate the population's prior and conditional probabilities from the known data (the training set), where $N$ is the number of training instances, $K$ the number of classes, $n$ the dimension of the feature space, and $S_j$ the number of values the $j$-th feature can take.
Notes:

The prior $P(Y=c_k)$ equals the proportion of training instances labeled $c_k$: the indicator function $I(y_i=c_k)$ counts the instances of class $c_k$, and dividing by the total count $N$ gives that proportion.
The conditional probability $P(X^{(j)}=a_{jl} \mid Y=c_k)$ equals the joint probability divided by the prior, and the joint probability can in turn be expressed through the number of training instances whose $X$ and $Y$ values both match the condition. Indeed, the conditional probability can be written as

$$P\left(X^{(j)}=a_{jl} \mid Y=c_k\right)=\frac{\sum_{i=1}^{N} I\left(x_i^{(j)}=a_{jl},\ y_i=c_k\right)/N}{\sum_{i=1}^{N} I\left(y_i=c_k\right)/N}$$

where the denominator is the prior and the numerator is the joint probability, i.e., the proportion of training instances satisfying both $x_i^{(j)}=a_{jl}$ and $y_i=c_k$.
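To make the counting concrete, here is a minimal sketch of the two estimates above (the toy dataset and helper names are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical toy training set: n = 2 discrete features, K = 2 classes
X = np.array([[1, 'S'], [1, 'M'], [2, 'S'], [2, 'M'], [2, 'M']], dtype=object)
y = np.array([0, 0, 1, 1, 1])
N = len(y)

# P(Y = c_k): proportion of instances labeled c_k
prior = {c: np.sum(y == c) / N for c in np.unique(y)}

# P(X^(j) = a | Y = c_k): co-occurrence count divided by class count
def conditional(j, a, c):
    in_class = (y == c)
    return np.sum((X[:, j] == a) & in_class) / np.sum(in_class)

print(prior)                   # class 0: 0.4, class 1: 0.6
print(conditional(1, 'M', 1))  # 2 of the 3 class-1 instances have 'M': 0.666...
```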
Supplement: the relationships among joint, marginal, and conditional probability, and Bayes' formula.
2. For a given instance $x=\left(x^{(1)}, x^{(2)}, \ldots, x^{(n)}\right)^T$, compute

$$f(c_k)=P(Y=c_k) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_k\right), \quad k=1,2,\ldots,K$$

$f(c_k)$ is proportional to the posterior probability $P(Y=c_k \mid X=x)$, because by Bayes' formula

$$P\left(Y=c_k \mid X=x\right)=\frac{P\left(X=x \mid Y=c_k\right) P\left(Y=c_k\right)}{P(X=x)}=\frac{\prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_k\right) P\left(Y=c_k\right)}{P(X=x)}$$

Here $P\left(X=x \mid Y=c_k\right)=\prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_k\right)$ is exactly the "naive Bayes assumption": conditioned on $Y=c_k$, the probability that the feature vector equals $x$ is the product of the probabilities of its individual components taking their respective values. The naive Bayes assumption (also called the conditional-independence assumption) exists to simplify computation; in reality features may be correlated, so the assumption can increase the error.

Since the denominator $P(X=x)$ does not change as $c_k$ varies, it can be treated as a constant.
3. Determine the class of instance $x$:

$$y=\underset{c_k}{\arg\max}\ P\left(Y=c_k\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_k\right)=\underset{c_k}{\arg\max}\ f\left(c_k\right)$$

Since $f(c_k)$ is proportional to the posterior, it suffices to find the $c_k$ that maximizes $f(c_k)$; this maximizes the posterior probability.
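Continuing the hypothetical toy sketch above (reusing its `prior` and `conditional` helpers), steps 2 and 3 reduce to a few lines:

```python
# f(c_k) = P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k)
def f(x, c):
    p = prior[c]
    for j, a in enumerate(x):
        p *= conditional(j, a, c)
    return p

x = [2, 'S']                          # query instance
scores = {c: f(x, c) for c in prior}  # one unnormalized posterior per class
y_hat = max(scores, key=scores.get)   # arg max_k f(c_k)
print(scores, y_hat)                  # class 1 wins: f(0) = 0, f(1) = 0.2
```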
Maximum likelihood estimation is normally applied only when the form of the population distribution is already fixed. When the population distribution is unknown, how should the parameters be estimated? What are the parameters to estimate, and what is the likelihood function?
1. Maximum likelihood estimation of the prior probability
The "prediction system" of naive Bayes is a classifier: feed it an instance $x$ and it returns a class $y$, one of the $K$ classes $c_k$. Picture it as a black box holding balls of $K$ colors; drawing one ball completes one prediction. The proportions of the colors inside the box determine, probabilistically, the prediction, so it is reasonable to take the "ball proportions" (i.e., the prior probability of each class) as the parameters:

$$\theta_1=P\left(Y=c_1\right),\ \theta_2=P\left(Y=c_2\right),\ \ldots,\ \theta_K=P\left(Y=c_K\right)$$
The $N$ instances in the training set are the outcomes of drawing $N$ balls from the black box. The estimation problem: given these $N$ outcomes, what should the proportions of the balls in the box be? This yields the likelihood function

$$L(Y;\theta)=P\left(Y=y_1, Y=y_2, \ldots, Y=y_N;\ \theta_1,\ldots,\theta_K\right), \quad y_i \in \{c_1, c_2, \ldots, c_K\}$$

Since the training instances are independent, $L(Y;\theta)=\prod_{i=1}^{N} P\left(Y=y_i;\theta\right)$, and $P\left(Y=y_i\right)=P\left(Y=c_k\right)=\theta_k$ whenever $y_i=c_k$, so

$$L(Y;\theta)=\prod_{i=1}^{N} P\left(Y=y_i;\theta\right)=\theta_1 \theta_1 \cdots \theta_k \cdots \theta_K=\prod_{k=1}^{K} \theta_k^{N_k}$$

Its log-likelihood is

$$l(Y;\theta)=\ln \prod_{k=1}^{K} \theta_k^{N_k}=\sum_{k=1}^{K} N_k \ln \theta_k$$

where $N_k$ is the number of training instances of class $c_k$.
Note the two implicit constraints:

(1) $\sum_{k=1}^{K}\theta_k=\theta_1+\theta_2+\ldots+\theta_K=1$, i.e., the class proportions sum to 1;

(2) $\sum_{k=1}^{K} N_k=N_1+N_2+\ldots+N_K=N$, i.e., the per-class instance counts sum to the total number of instances.
Because of the equality constraint, we introduce a Lagrange multiplier and seek a constrained extremum of

$$l\left(\theta_k, \lambda\right)=\sum_{k=1}^{K} N_k \ln \theta_k+\lambda\left(\sum_{k=1}^{K} \theta_k-1\right)$$

whose maximum lies where the partial derivatives vanish:

$$\frac{\partial l}{\partial \theta_k}=\frac{\partial}{\partial \theta_k}\left[\left(N_1 \ln \theta_1+N_2 \ln \theta_2+\ldots+N_K \ln \theta_K\right)+\lambda\left(\theta_1+\theta_2+\ldots+\theta_K-1\right)\right]=\frac{N_k}{\theta_k}+\lambda=0$$

Rearranging gives $N_k=-\lambda \theta_k$ for $k=1,2,\ldots,K$, a system of $K$ equations. Summing the $K$ equations side by side,

$$N_1+N_2+\ldots+N_K=-\lambda\left(\theta_1+\theta_2+\ldots+\theta_K\right)$$

and by the two constraints, $\lambda=-N$. Therefore the prior is

$$P\left(Y=c_k\right)=\theta_k=\frac{N_k}{N}=\frac{\sum_{i=1}^{N} I\left(y_i=c_k\right)}{N}$$
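As a quick numerical sanity check (a hypothetical example with $K=2$, $N_1=7$, $N_2=3$), the log-likelihood $\sum_k N_k \ln \theta_k$ is indeed maximized at $\theta_1=N_1/N$:

```python
import numpy as np

# For K = 2, theta_2 = 1 - theta_1, so l(theta) = N1*ln(theta) + N2*ln(1 - theta)
N1, N2 = 7, 3
theta = np.linspace(1e-6, 1 - 1e-6, 100001)
loglik = N1 * np.log(theta) + N2 * np.log(1 - theta)
print(theta[np.argmax(loglik)])  # ~0.7 = N1 / (N1 + N2)
```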
2. Maximum likelihood estimation of the conditional probability
The prior is known to be $\theta_k=P\left(Y=c_k\right)$.

Let the conditional probability be $\mu_{lk}=P\left(X^{(j)}=a_{jl} \mid Y=c_k\right)$; note this treats only the $j$-th feature, and the other features are handled the same way. The joint probability is then

$$P\left(Y=c_k\right) P\left(X^{(j)}=a_{jl} \mid Y=c_k\right)=P\left(X^{(j)}=a_{jl},\ Y=c_k\right)=\theta_k \mu_{lk}$$
The likelihood function is

$$L\left(\left(X^{(j)}=x_1^{(j)}, Y=y_1\right), \ldots,\left(X^{(j)}=x_N^{(j)}, Y=y_N\right);\ \mu, \theta\right)=\prod_{i=1}^{N} P\left(X^{(j)}=x_i^{(j)},\ Y=y_i\right)$$

where $x_i^{(j)} \in \{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$ and $l \in \{1,2,\ldots,S_j\}$. Grouping identical factors,

$$L(X, Y; \mu, \theta)=\prod_{i=1}^{N} P\left(X^{(j)}=x_i^{(j)},\ Y=y_i\right)=\prod_{l=1}^{S_j} \prod_{k=1}^{K}\left(\theta_k \mu_{lk}\right)^{N_{lk}}$$

where $N_{lk}$ is the number of instances satisfying both $X^{(j)}=a_{jl}$ and $Y=c_k$.

The log-likelihood is

$$l(\mu, \theta)=\sum_{l=1}^{S_j} \sum_{k=1}^{K} N_{lk}\left(\ln \mu_{lk}+\ln \theta_k\right)$$
Add constraint (3):

$$\sum_{l=1}^{S_j} \mu_{lk}=1$$

since $\mu_{lk}=P\left(X^{(j)}=a_{jl} \mid Y=c_k\right)$ and $x_i^{(j)} \in \{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$ is a finite set, so summing the probabilities of all possible values gives 1:

$$P\left(X^{(j)}=a_{j1} \mid Y=c_k\right)+P\left(X^{(j)}=a_{j2} \mid Y=c_k\right)+\ldots+P\left(X^{(j)}=a_{jS_j} \mid Y=c_k\right)=1$$
Introduce a Lagrange multiplier and seek the constrained extremum of

$$l\left(\mu_{lk}, \theta_k, \lambda\right)=\sum_{l=1}^{S_j} \sum_{k=1}^{K} N_{lk}\left(\ln \mu_{lk}+\ln \theta_k\right)+\lambda\left(\sum_{l=1}^{S_j} \mu_{lk}-1\right)$$

Setting the partial derivative with respect to $\mu_{lk}$ to zero works much as before: only the term whose subscripts are exactly $lk$ survives, and all others vanish.

$$\frac{\partial l}{\partial \mu_{lk}}=\frac{N_{lk}}{\mu_{lk}}+\lambda=0$$

Rearranging gives $N_{lk}=-\lambda \mu_{lk}$; summing over $l$,

$$N_{1k}+N_{2k}+\ldots+N_{S_j k}=-\lambda\left(\mu_{1k}+\ldots+\mu_{S_j k}\right)$$

i.e., $N_k=-\lambda$. Therefore the conditional probability is

$$P\left(X^{(j)}=a_{jl} \mid Y=c_k\right)=\mu_{lk}=\frac{N_{lk}}{N_k}=\frac{\sum_{i=1}^{N} I\left(x_i^{(j)}=a_{jl},\ y_i=c_k\right)}{\sum_{i=1}^{N} I\left(y_i=c_k\right)}$$
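As a concrete check against the hypothetical toy dataset from the counting sketch earlier: class $c_k=1$ has $N_k=3$ instances, $N_{lk}=2$ of which take the value M on the second feature, so the closed-form estimate reproduces the count ratio computed there:

$$P\left(X^{(2)}=\mathrm{M} \mid Y=1\right)=\frac{N_{lk}}{N_k}=\frac{2}{3}$$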
To summarize, the maximum likelihood estimates are

prior: $P\left(Y=c_k\right)=\frac{\sum_{i=1}^{N} I\left(y_i=c_k\right)}{N}$

conditional probability: $P\left(X^{(j)}=a_{jl} \mid Y=c_k\right)=\frac{\sum_{i=1}^{N} I\left(x_i^{(j)}=a_{jl},\ y_i=c_k\right)}{\sum_{i=1}^{N} I\left(y_i=c_k\right)}$

But if some class has no instances in the training set, the denominator of the conditional probability becomes 0; likewise, a feature value that never co-occurs with a class gets estimated probability 0, which wipes out the entire product $f(c_k)$.
To resolve this, introduce Laplace smoothing: add a positive number $\lambda>0$ so that

$$P\left(Y=c_k\right)=\frac{\sum_{i=1}^{N} I\left(y_i=c_k\right)+\lambda}{N+K \lambda}$$

$$P\left(X^{(j)}=a_{jl} \mid Y=c_k\right)=\frac{\sum_{i=1}^{N} I\left(x_i^{(j)}=a_{jl},\ y_i=c_k\right)+\lambda}{\sum_{i=1}^{N} I\left(y_i=c_k\right)+S_j \lambda}$$

These formulas are called the "Bayesian estimates" of the prior and conditional probabilities.
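Continuing the hypothetical toy sketch from the counting example, the smoothed estimates change only the counts (a sketch with the common default $\lambda=1$):

```python
# Bayesian (Laplace-smoothed) estimates on the toy data from the counting sketch
def smoothed_prior(c, K=2, lam=1.0):
    return (np.sum(y == c) + lam) / (N + K * lam)

def smoothed_conditional(j, a, c, S_j, lam=1.0):
    in_class = (y == c)
    return (np.sum((X[:, j] == a) & in_class) + lam) / (np.sum(in_class) + S_j * lam)

print(smoothed_prior(0))                   # (2+1) / (5+2)   = 3/7
print(smoothed_conditional(1, 'M', 0, 2))  # (1+1) / (2+2*1) = 0.5, never exactly 0
```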
Data loading
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from collections import Counter
import math

# data: keep the first 100 iris samples (classes 0 and 1) for binary classification
def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
    data = np.array(df.iloc[:100, :])
    # print(data)
    return data[:, :-1], data[:, -1]

X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

X_test[0], y_test[0]
# (array([5.1, 3.8, 1.9, 0.4]), 0.0)
```
GaussianNB: Gaussian naive Bayes

Each feature is assumed to follow a Gaussian distribution within each class, with probability density function

$$P\left(x_i \mid y_k\right)=\frac{1}{\sqrt{2 \pi \sigma_{y_k}^2}} \exp \left(-\frac{\left(x_i-\mu_{y_k}\right)^2}{2 \sigma_{y_k}^2}\right)$$

mean: $\mu$

variance: $\sigma^2=\frac{\sum(X-\mu)^2}{N}$
```python
class NaiveBayes:
    def __init__(self):
        self.model = None

    # mean
    @staticmethod
    def mean(X):
        return sum(X) / float(len(X))

    # standard deviation
    def stdev(self, X):
        avg = self.mean(X)
        return math.sqrt(sum([pow(x - avg, 2) for x in X]) / float(len(X)))

    # Gaussian probability density function
    def gaussian_probability(self, x, mean, stdev):
        exponent = math.exp(-(math.pow(x - mean, 2) /
                              (2 * math.pow(stdev, 2))))
        return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

    # summarize X_train: (mean, stdev) for each feature column
    def summarize(self, train_data):
        summaries = [(self.mean(i), self.stdev(i)) for i in zip(*train_data)]
        return summaries

    # fit: compute the mean and standard deviation per class
    def fit(self, X, y):
        labels = list(set(y))
        data = {label: [] for label in labels}
        for f, label in zip(X, y):
            data[label].append(f)
        self.model = {
            label: self.summarize(value)
            for label, value in data.items()
        }
        return 'gaussianNB train done!'

    # per-class likelihood of one input
    # (class priors are omitted; the two iris classes here are roughly balanced)
    def calculate_probabilities(self, input_data):
        # summaries: {0.0: [(5.0, 0.37), (3.42, 0.40)], 1.0: [(5.8, 0.449), (2.7, 0.27)]}
        # input_data: [1.1, 2.2]
        probabilities = {}
        for label, value in self.model.items():
            probabilities[label] = 1
            for i in range(len(value)):
                mean, stdev = value[i]
                probabilities[label] *= self.gaussian_probability(
                    input_data[i], mean, stdev)
        return probabilities

    # predicted class: the label with the largest likelihood
    def predict(self, X_test):
        # {0.0: 2.9680340789325763e-27, 1.0: 3.5749783019849535e-26}
        label = sorted(
            self.calculate_probabilities(X_test).items(),
            key=lambda x: x[-1])[-1][0]
        return label

    # accuracy on a test set
    def score(self, X_test, y_test):
        right = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right += 1
        return right / float(len(X_test))
```
```python
model = NaiveBayes()
model.fit(X_train, y_train)
# 'gaussianNB train done!'

print(model.predict([4.4, 3.2, 1.3, 0.2]))
# 0.0

model.score(X_test, y_test)
# 1.0
```
scikit-learn example
```python
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(X_train, y_train)
# GaussianNB(priors=None, var_smoothing=1e-09)

clf.score(X_test, y_test)
# 1.0

clf.predict([[4.4, 3.2, 1.3, 0.2]])
# array([0.])
```
Exercise 4.1: Use maximum likelihood estimation to derive the probability estimates, formulas (4.8) and (4.9), of the naive Bayes method.
Step 1: prove formula (4.8): $P\left(Y=c_k\right)=\frac{\sum_{i=1}^{N} I\left(y_i=c_k\right)}{N}$

Fix a class $c_k$, write $p=P(Y=c_k)$, and let $m$ be the number of the $N$ labels equal to $c_k$. The likelihood is binomial:

$$L(p)=f_D\left(y_1, y_2, \cdots, y_N \mid \theta\right)=\binom{N}{m} p^m(1-p)^{N-m}$$

Find the extremum by differentiating both sides with respect to $p$ and setting the derivative to zero:

$$0=\binom{N}{m}\left[m p^{m-1}(1-p)^{N-m}-(N-m) p^m(1-p)^{N-m-1}\right]=\binom{N}{m}\left[p^{m-1}(1-p)^{N-m-1}(m-N p)\right]$$

The solutions are $p=0$, $p=1$, and $p=\frac{m}{N}$; the boundary solutions give zero likelihood whenever $0<m<N$, so the maximizer is $p=\frac{m}{N}$. Clearly

$$P\left(Y=c_k\right)=p=\frac{m}{N}=\frac{\sum_{i=1}^{N} I\left(y_i=c_k\right)}{N}$$

and formula (4.8) is proved. Formula (4.9), the conditional probability estimate, follows from the Lagrange-multiplier derivation given in the maximum likelihood section above.
References:

Dr. Huang's Machine Learning for Beginners (机器学习初学者)

Solutions to the exercises in Statistical Learning Methods (统计学习方法习题解答)