In machine learning, naive Bayes is a classification model whose predictions are discrete values. Before discussing the model, it is worth first reviewing Bayes' theorem. The Bayesian school of statistics, built on this theorem, occupies an important place in the field: it takes the observer's point of view, where the amount of information the observer holds shapes their belief about an event.
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{\sum_{A}P(B|A)P(A)}$$
Here $P(B|A)$ is the probability of event $B$ given that event $A$ has already occurred, and $\sum_{A}P(B|A)P(A)$ sums over all possible outcomes of $A$ (the law of total probability, which recovers $P(B)$). What we want is the conditional probability of $A$ given that $B$ has occurred, $P(A|B)$, also known as the posterior probability.
An example: a factory produces products. The probability that a product is qualified is 0.9 (event $Q$), the probability that a product is misinspected is 0.2 (event $M$), and among misinspected products the probability of actually being qualified is 0.1. What, then, is the probability that a qualified product is misinspected?
Plugging directly into Bayes' theorem gives:
$$P(M|Q) = \frac{P(Q|M)P(M)}{P(Q)} = \frac{0.1 \times 0.2}{0.9} = \frac{1}{45}$$
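As a quick sanity check, the same computation in Python (a trivial sketch; the variable names simply mirror the events $Q$ and $M$ defined above):

p_q = 0.9          # P(Q): the product is qualified
p_m = 0.2          # P(M): the product is misinspected
p_q_given_m = 0.1  # P(Q|M): a misinspected product is actually qualified
p_m_given_q = p_q_given_m * p_m / p_q  # Bayes' theorem
print(p_m_given_q)  # 0.0222..., i.e. 1/45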
Now suppose there is a serious disease with a prevalence of one in ten thousand. Someone rather anxious suspects he has it, goes to the hospital for a test, and the result comes back positive. The test, however, is only 99% accurate (likewise, 99% of healthy people test negative and 1% test positive by mistake). How great is his risk, really?
At first glance it looks like he should be saying goodbye to the world, but this is exactly where Bayes' theorem comes in. Let $D$ denote the disease ($D=1$ diseased, $D=0$ healthy) and $T$ denote the test result ($T=1$ positive, $T=0$ negative). Then:
$$P(D=1|T=1) = \frac{P(T=1|D=1)P(D=1)}{P(T=1|D=1)P(D=1)+P(T=1|D=0)P(D=0)} = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} \approx 0.0098$$
As you can see, the probability that he actually has the disease is tiny; the situation is nowhere near as frightening as it first appears.
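The same arithmetic in a few lines of Python, with the denominator expanded by the law of total probability (a minimal sketch):

p_d1 = 0.0001          # prior P(D=1): prevalence of the disease
p_t1_given_d1 = 0.99   # P(T=1|D=1): sick people test positive
p_t1_given_d0 = 0.01   # P(T=1|D=0): healthy people falsely test positive
posterior = p_t1_given_d1 * p_d1 / (p_t1_given_d1 * p_d1 + p_t1_given_d0 * (1 - p_d1))
print(posterior)       # ~0.0098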
Using a naive Bayes classifier rests on a premise: all features are assumed to be equally important yet mutually independent given the class. This is clearly unrealistic in practice, but we try to satisfy this "naive" requirement as far as possible.
The naive idea behind naive Bayes: for a given item to be classified, compute the probability of each class conditioned on the item's features, and assign the item to whichever class has the largest probability.
Given the feature set $F=\{f_{1}, f_{2}, f_{3}, \cdots, f_{m}\}$ and the class set $C=\{c_{1}, c_{2}, c_{3}, \cdots, c_{n}\}$, Bayes' theorem gives:
$$P(c_{k}|f_{1}, f_{2}, f_{3}, \cdots, f_{m}) = \frac{P(c_{k})P(f_{1}, f_{2}, f_{3}, \cdots, f_{m}|c_{k})}{P(f_{1}, f_{2}, f_{3}, \cdots, f_{m})} \quad (k=1,2,3,\cdots,n)$$
Since every class is divided by the same denominator $P(f_{1}, f_{2}, f_{3}, \cdots, f_{m})$, it can be dropped when comparing classes, and the formula simplifies to a proportionality:

$$P(c_{k}|f_{1}, f_{2}, f_{3}, \cdots, f_{m}) \propto P(c_{k})P(f_{1}, f_{2}, f_{3}, \cdots, f_{m}|c_{k})$$
With the features assumed mutually independent given the class, this becomes:

$$P(c_{k}|f_{1}, f_{2}, f_{3}, \cdots, f_{m}) \propto P(c_{k})P(f_{1}|c_{k})P(f_{2}|c_{k})P(f_{3}|c_{k})\cdots P(f_{m}|c_{k}) = P(c_{k})\prod_{j=1}^{m}P(f_{j}|c_{k})$$
What we ultimately want is $\arg\max_{c_{k}} P(c_{k}|f_{1}, f_{2}, f_{3}, \cdots, f_{m})$. For discrete features it suffices to estimate each $P(f_{j}|c_{k})$ from frequency counts in the training set, as in the sketch below; continuous features are handled afterwards.
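A minimal sketch of the frequency-count estimate (the toy samples and the helper name here are purely hypothetical):

from collections import Counter, defaultdict

# samples: list of (feature_tuple, label); estimate P(f_j|c_k) by counting
samples = [(("sunny", "hot"), "no"),
           (("rainy", "cool"), "yes"),
           (("sunny", "cool"), "yes")]
class_counts = Counter(label for _, label in samples)
feature_counts = defaultdict(Counter)  # keyed by (class, feature position)
for features, label in samples:
    for j, value in enumerate(features):
        feature_counts[(label, j)][value] += 1

def p_feature_given_class(j, value, label):
    # relative frequency of feature j taking `value` among samples of class `label`
    return feature_counts[(label, j)][value] / class_counts[label]

print(p_feature_given_class(0, "sunny", "yes"))  # 0.5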
For continuous-valued features, a Gaussian distribution can be used instead, with density:

$$P(x_{i}|c)=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{(x_{i}-\mu)^{2}}{2\sigma^{2}}\right)$$
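The density translates directly into a small helper (a sketch using only the standard library):

import math

def gaussian_pdf(x, mu, sigma2):
    # P(x|c) under a Gaussian with mean mu and variance sigma2
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(gaussian_pdf(0.0, 0.0, 1.0))  # 0.3989..., the standard normal peak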
The key is then to solve for $\mu$ and $\sigma^{2}$ in this formula. A viable route is maximum likelihood estimation (the frequentist approach: frequentists hold that the world is deterministic, that there is an underlying true value that does not change, and that our goal is to find this true value or the range it lies in), so the $\mu$ and $\sigma^{2}$ in the Gaussian can indeed be solved for.
The likelihood function of the parameters $\mu, \sigma^{2}$ is written $L(\mu,\sigma^{2})$; it expresses the joint probability, over the $m$ samples $X_{1}, X_{2}, X_{3}, \cdots, X_{m}$ belonging to class $c_{k}$, of their values on feature $i$ (note that $m$ here counts the samples of that class, not the features):
$$L(\mu,\sigma^{2})=\prod_{j=1}^{m}P(x_{i}^{j};\mu,\sigma^{2})=\prod_{j=1}^{m}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{(x_{i}^{j}-\mu)^{2}}{2\sigma^{2}}\right)$$
where $x_{i}^{j}$ denotes the $i$-th feature of the $j$-th sample. To ease the solution, we take the logarithm of both sides:
$$\ln L(\mu,\sigma^{2})=\ln\prod_{j=1}^{m}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{(x_{i}^{j}-\mu)^{2}}{2\sigma^{2}}\right)$$
Expanding and simplifying gives:
$$\ln L(\mu,\sigma^{2})=-\frac{m}{2}\ln 2\pi-\frac{m}{2}\ln\sigma^{2}-\frac{1}{2\sigma^{2}}\sum_{j=1}^{m}(x_{i}^{j}-\mu)^{2}$$
Taking partial derivatives with respect to $\mu$ and $\sigma^{2}$ and setting them to zero gives:
$$\frac{\partial \ln L(\mu,\sigma^{2})}{\partial\mu} =\frac{1}{\sigma^{2}}\sum_{j=1}^{m}(x_{i}^{j}-\mu)=0$$

$$\frac{\partial \ln L(\mu,\sigma^{2})}{\partial\sigma^{2}} =-\frac{1}{2\sigma^{2}}\sum_{j=1}^{m}\left(1-\frac{(x_{i}^{j}-\mu)^{2}}{\sigma^{2}}\right)=0$$
Solving yields:

$$\mu=\frac{1}{m}\sum_{j=1}^{m}x_{i}^{j}$$

$$\sigma^{2}=\frac{1}{m}\sum_{j=1}^{m}(x_{i}^{j}-\mu)^{2}$$
In other words, for each class we simply compute the mean and variance of every feature. With the parameters $\mu, \sigma^{2}$ in hand, it is easy to estimate the distribution of the $i$-th feature and hence evaluate the probability of $x_{i}$.
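In code, fitting reduces to a group-by: for each class, take the mean and variance of every feature column. A sketch with pandas on a tiny made-up table shaped like the iris data below (note that pandas' var() defaults to the unbiased $1/(m-1)$ estimator, so ddof=0 is passed to match the MLE $1/m$ form):

import pandas as pd

df = pd.DataFrame({
    "Petal.Length": [1.4, 1.3, 4.7, 4.5],
    "Species": ["setosa", "setosa", "versicolor", "versicolor"],
})
# per-class MLE parameters: mu = group mean, sigma^2 = group variance (ddof=0)
mu = df.groupby("Species")["Petal.Length"].mean()
sigma2 = df.groupby("Species")["Petal.Length"].var(ddof=0)
print(mu["setosa"], sigma2["setosa"])  # 1.35 0.0025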
Consider some iris data with 150 rows in total. There are four features: sepal length, sepal width, petal length, and petal width; and three classes: setosa, versicolor, and virginica. Given a data point to predict, [3.1, 4.4, 2.1, 3.1], which class does it belong to? (Sample data: https://download.csdn.net/download/zhuqiang9607/10764313)
Code (assuming the downloaded sample is saved as iris.csv):
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# load the 150-row iris sample (the filename is an assumption, see above)
data = pd.read_csv("iris.csv")
class_features = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]
X_train = data[class_features]
y_train = data["Species"]

# fit a Gaussian naive Bayes model and classify the query point
gnb = GaussianNB()
gnb.fit(X_train, y_train)
X_test = [[3.1, 4.4, 2.1, 3.1]]
print(gnb.predict(X_test))
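Besides the hard label, scikit-learn's GaussianNB also exposes the per-class posteriors through predict_proba, which makes the argmax above explicit:

print(gnb.classes_)               # class labels, in the column order of the output below
print(gnb.predict_proba(X_test))  # posterior P(c_k | features) for each class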
If you want a deeper look at the code behind Gaussian naive Bayes, you can download it from GitHub: https://github.com/AutumnBoat/MachineLearning/blob/master/GaussianNaiveBayes.ipynb
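For reference, here is a minimal from-scratch sketch that ties the pieces together, fitting the MLE mean and variance per class and scoring with $\ln P(c_{k}) + \sum_{j}\ln P(f_{j}|c_{k})$ (this is not the repository's code, just the formulas above; it assumes numeric numpy inputs):

import numpy as np

class SimpleGaussianNB:
    def fit(self, X, y):
        # per-class prior plus MLE mean/variance of each feature (ddof=0, i.e. 1/m)
        self.classes = np.unique(y)
        self.prior = {c: np.mean(y == c) for c in self.classes}
        self.mu = {c: X[y == c].mean(axis=0) for c in self.classes}
        self.sigma2 = {c: X[y == c].var(axis=0) + 1e-9  # tiny variance floor for stability
                       for c in self.classes}
        return self

    def predict(self, X):
        # score each class in log space and return the argmax per sample
        def log_score(x, c):
            log_pdf = (-0.5 * np.log(2 * np.pi * self.sigma2[c])
                       - (x - self.mu[c]) ** 2 / (2 * self.sigma2[c]))
            return np.log(self.prior[c]) + log_pdf.sum()
        return [max(self.classes, key=lambda c, x=x: log_score(x, c)) for x in X]

Calling SimpleGaussianNB().fit(X_train.values, y_train.values).predict([[3.1, 4.4, 2.1, 3.1]]) should mirror the scikit-learn result above.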