Naive Bayes is a classification method based on Bayes' theorem together with a conditional-independence assumption on the features.
Let the input space $\mathcal{X} \subseteq \mathbf{R}^n$ be a set of $n$-dimensional vectors and the output space $\mathcal{Y}=\{c_1,c_2,\cdots,c_K\}$ the set of class labels. $X$ is a random vector on the input space $\mathcal{X}$, $Y$ is a random variable on the output space $\mathcal{Y}$, and $P(X,Y)$ is their joint distribution. The training set
$$T=\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$$
is generated i.i.d. from $P(X,Y)$.
The prior distribution is
$$P(Y=c_k),\quad k=1,2,\cdots,K$$
and the conditional distribution is
$$P(X=x\mid Y=c_k)=P(X^{(1)}=x^{(1)},X^{(2)}=x^{(2)},\cdots,X^{(n)}=x^{(n)}\mid Y=c_k),\quad k=1,2,\cdots,K$$
Naive Bayes imposes a conditional-independence assumption on this conditional distribution:
$$P(X=x\mid Y=c_k)=P(X^{(1)}=x^{(1)},X^{(2)}=x^{(2)},\cdots,X^{(n)}=x^{(n)}\mid Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k)$$
The assumption can be read as: once the class is fixed, the features used for classification are conditionally independent of one another.
The posterior distribution is
$$P(Y=c_k\mid X=x)=\frac{P(X=x\mid Y=c_k)P(Y=c_k)}{\sum\limits_{k}P(X=x\mid Y=c_k)P(Y=c_k)}$$
Substituting the conditional-independence assumption gives
$$P(Y=c_k\mid X=x)=\frac{P(X=x\mid Y=c_k)P(Y=c_k)}{\sum\limits_{k}P(X=x\mid Y=c_k)P(Y=c_k)}=\frac{P(Y=c_k)\prod\limits_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k)}{\sum\limits_{k}P(Y=c_k)\prod\limits_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k)},\quad k=1,2,\cdots,K$$
The naive Bayes classifier can therefore be written as
$$y=f(x)=\arg\max_{c_k}\frac{P(Y=c_k)\prod\limits_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k)}{\sum\limits_{k}P(Y=c_k)\prod\limits_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k)}$$
Since the denominator is the same for every $c_k$, it can be dropped:
$$y=f(x)=\arg\max_{c_k}P(Y=c_k)\prod\limits_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k)$$
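Because the denominator is shared by all classes, normalizing cannot change which class wins the argmax. A minimal sketch with made-up numerators (the values are hypothetical, chosen only to illustrate):

```python
# Hypothetical unnormalized scores P(Y=c_k) * prod_j P(x^(j) | Y=c_k) for two classes:
numerators = {1: 1 / 45, -1: 1 / 15}

# Dividing every score by the shared denominator yields proper posteriors ...
total = sum(numerators.values())
posteriors = {c: v / total for c, v in numerators.items()}

# ... but the argmax is the same either way.
assert max(numerators, key=numerators.get) == max(posteriors, key=posteriors.get)
```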
In naive Bayes, learning means estimating $P(Y=c_k)$ and $P(X^{(j)}=x^{(j)}\mid Y=c_k)$.
Maximum likelihood estimation of these probabilities gives
$$P(Y=c_k)=\frac{\sum\limits_{i=1}^{N}I(y_i=c_k)}{N},\quad k=1,2,\cdots,K$$
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\{a_{j1},a_{j2},\cdots,a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)}=x^{(j)}\mid Y=c_k)$ is
$$P(X^{(j)}=a_{jl}\mid Y=c_k)=\frac{\sum\limits_{i=1}^{N}I(x_i^{(j)}=a_{jl},y_i=c_k)}{\sum\limits_{i=1}^{N}I(y_i=c_k)}$$
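Both MLE formulas are plain relative-frequency counts. A small pure-Python sketch, using the same toy data that appears in the implementation later in this post:

```python
from collections import Counter

# Feature X1 and label Y from the toy dataset used later in this post.
x1 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
N = len(y)

# MLE of the prior: P(Y = c_k) = #{i : y_i = c_k} / N
class_counts = Counter(y)
prior = {c: n / N for c, n in class_counts.items()}

# MLE of the conditional: P(X1 = a | Y = c_k) = #{i : x1_i = a, y_i = c_k} / #{i : y_i = c_k}
joint = Counter(zip(x1, y))
cond = {(a, c): n / class_counts[c] for (a, c), n in joint.items()}

assert prior[1] == 9 / 15        # 9 of 15 samples have y = 1
assert cond[(2, 1)] == 3 / 9     # 3 of those 9 have x1 = 2
```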
Input: training data $T=\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$, where $x_i=(x_i^{(1)},x_i^{(2)},\cdots,x_i^{(n)})^T$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)}\in\{a_{j1},a_{j2},\cdots,a_{jS_j}\}$, $a_{jl}$ is the $l$-th value the $j$-th feature can take, $j=1,2,\cdots,n$, $l=1,2,\cdots,S_j$, $y_i\in\{c_1,c_2,\cdots,c_K\}$; and an instance $x$.
Output: the class of instance $x$.
(1) Compute the prior and conditional probabilities:
$$P(Y=c_k)=\frac{\sum\limits_{i=1}^{N}I(y_i=c_k)}{N},\quad k=1,2,\cdots,K$$
$$P(X^{(j)}=a_{jl}\mid Y=c_k)=\frac{\sum\limits_{i=1}^{N}I(x_i^{(j)}=a_{jl},y_i=c_k)}{\sum\limits_{i=1}^{N}I(y_i=c_k)},\quad j=1,2,\cdots,n;\ l=1,2,\cdots,S_j;\ k=1,2,\cdots,K$$
(2) For the given instance $x=(x^{(1)},x^{(2)},\cdots,x^{(n)})^T$, compute
$$P(Y=c_k)\prod\limits_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k),\quad k=1,2,\cdots,K$$
(3) Determine the class of $x$:
$$y=\arg\max_{c_k}P(Y=c_k)\prod\limits_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k)$$
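The three steps above can be sketched end-to-end in a few lines of pure Python (names and the tiny four-sample dataset here are made up for illustration):

```python
from collections import Counter

def naive_bayes_fit_predict(X, y, x_new):
    """Steps (1)-(3): count-based MLE estimates, then argmax over classes."""
    N = len(y)
    class_counts = Counter(y)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / N                                # step (1): prior P(Y = c)
        for j, v in enumerate(x_new):
            n_match = sum(1 for xi, yi in zip(X, y) if xi[j] == v and yi == c)
            score *= n_match / n_c                     # step (1): P(X^(j) = v | Y = c)
        scores[c] = score                              # step (2): product for x_new
    return max(scores, key=scores.get)                 # step (3): argmax over c_k

# Hypothetical usage on a tiny dataset:
X = [(1, 'S'), (1, 'M'), (2, 'S'), (2, 'L')]
y = [-1, -1, 1, 1]
naive_bayes_fit_predict(X, y, (2, 'S'))
```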
With maximum likelihood estimation, some of the estimated probabilities can be exactly zero; a single zero factor wipes out the whole posterior product and biases the classification. The remedy is Bayesian estimation:
$$P_{\lambda}(X^{(j)}=a_{jl}\mid Y=c_k)=\frac{\sum\limits_{i=1}^{N}I(x_i^{(j)}=a_{jl},y_i=c_k)+\lambda}{\sum\limits_{i=1}^{N}I(y_i=c_k)+S_j\lambda}$$
where $\lambda\geq 0$. When $\lambda=0$ this reduces to maximum likelihood estimation; when $\lambda=1$ it is called Laplace smoothing.
The Bayesian estimate of the prior is
$$P_{\lambda}(Y=c_k)=\frac{\sum\limits_{i=1}^{N}I(y_i=c_k)+\lambda}{N+K\lambda}$$
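The effect is easiest to see on a class in which some feature value never occurs: the MLE assigns it probability 0, while the Bayesian estimate with $\lambda=1$ does not, and the smoothed values still sum to 1. A minimal sketch with hypothetical counts:

```python
# Hypothetical within-class counts of one feature's S_j = 3 possible values.
counts = {'S': 0, 'M': 2, 'L': 4}
n_k = sum(counts.values())   # number of samples in class c_k
S_j = len(counts)            # number of possible feature values
lam = 1.0                    # lambda = 1: Laplace smoothing

mle = {a: n / n_k for a, n in counts.items()}
smoothed = {a: (n + lam) / (n_k + S_j * lam) for a, n in counts.items()}

assert mle['S'] == 0.0        # MLE zeroes out the unseen value ...
assert smoothed['S'] == 1 / 9 # ... the Laplace-smoothed estimate does not
```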
For the example below, the probabilities are estimated first by maximum likelihood and then with Laplace smoothing.
## Naive Bayes

```python
import pandas as pd

# Data for the worked example: two features X1, X2 and a binary label Y.
x1 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
x2 = ['S', 'M', 'M', 'S', 'S', 'S', 'M', 'M', 'L', 'L', 'L', 'M', 'M', 'L', 'L']
y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
df = pd.DataFrame({'X1': x1, 'X2': x2, 'Y': y})

A1 = {1, 2, 3}        # possible values of X1
A2 = {'S', 'M', 'L'}  # possible values of X2
C = {1, -1}           # class labels


def priorPro(y):
    '''Prior probability P(Y = c_k), estimated by relative frequency.'''
    pro_y = {}
    for c_k in y.unique():
        pro_y[c_k] = sum(y == c_k) / len(y)
    return pro_y


def conditionalPro(x, y):
    '''Conditional probability P(X = a | Y = c_k), estimated by relative frequency.'''
    inter = pd.concat([x, y], axis=1)
    conditionalpro = {}
    for c_k in y.unique():
        subpro = {}
        num = len(inter[inter.iloc[:, 1] == c_k])  # count of samples in class c_k
        for a_j in x.unique():
            num1 = len(inter[(inter.iloc[:, 0] == a_j) & (inter.iloc[:, 1] == c_k)])
            subpro[a_j] = num1 / num
        conditionalpro[c_k] = subpro
    return pd.DataFrame(conditionalpro)
```
Collect the results:
```python
priorpro = priorPro(df['Y'])
a1 = conditionalPro(df['X1'], df['Y'])
a2 = conditionalPro(df['X2'], df['Y'])
a1['变量'] = 1  # feature index: 1 -> X1
a2['变量'] = 2  # feature index: 2 -> X2
conpro = pd.concat([a1, a2]).reset_index()
conpro.rename(columns={'index': 'X_value'}, inplace=True)


def pred(x):
    '''Predict the class of instance x; also return the posterior numerators.'''
    postpros = {}
    for c_k in C:
        postpro = priorpro[c_k]
        for i in range(len(x)):
            postpro *= conpro.loc[(conpro['X_value'] == x[i]) & (conpro['变量'] == i + 1), c_k].values[0]
        postpros[c_k] = postpro
    return max(postpros, key=postpros.get), postpros


x_sample = [2, 'S']
pred(x_sample)
```
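As a sanity check, the posterior numerators for $x=(2,\mathrm{S})$ can be computed by hand from the data: 9 of the 15 samples have $y=1$, of which 3 have $X_1=2$ and 1 has $X_2=\mathrm{S}$; 6 samples have $y=-1$, of which 2 have $X_1=2$ and 3 have $X_2=\mathrm{S}$:

```python
from fractions import Fraction as F

p_pos = F(9, 15) * F(3, 9) * F(1, 9)  # P(Y=1)  * P(X1=2 | Y=1)  * P(X2='S' | Y=1)
p_neg = F(6, 15) * F(2, 6) * F(3, 6)  # P(Y=-1) * P(X1=2 | Y=-1) * P(X2='S' | Y=-1)

assert p_pos == F(1, 45)
assert p_neg == F(1, 15)
assert p_neg > p_pos  # so pred([2, 'S']) returns class -1
```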
Now introduce $\lambda$.
## Laplace smoothing

```python
def priorPro_lap(y, lam=1):
    '''Prior probability P(Y = c_k) with Laplace smoothing (lambda = lam).'''
    C = y.unique()
    pro_y = {}
    for c_k in C:
        pro_y[c_k] = (sum(y == c_k) + lam) / (len(y) + len(C) * lam)
    return pro_y


def conditionalPro_lap(x, y, lam=1):
    '''Conditional probability P(X = a | Y = c_k) with Laplace smoothing.'''
    a = list(x.unique())  # len(a) stands in for S_j; exact here because every possible value occurs in the data
    inter = pd.concat([x, y], axis=1)
    conditionalpro = {}
    for c_k in y.unique():
        subpro = {}
        num = len(inter[inter.iloc[:, 1] == c_k])
        for a_j in a:
            num1 = len(inter[(inter.iloc[:, 0] == a_j) & (inter.iloc[:, 1] == c_k)])
            subpro[a_j] = (num1 + lam) / (num + len(a) * lam)
        conditionalpro[c_k] = subpro
    return pd.DataFrame(conditionalpro)
```
```python
priorpro_lap = priorPro_lap(df['Y'], lam=1)
a1 = conditionalPro_lap(df['X1'], df['Y'])
a2 = conditionalPro_lap(df['X2'], df['Y'])
a1['变量'] = 1
a2['变量'] = 2
conpro = pd.concat([a1, a2]).reset_index()
conpro.rename(columns={'index': 'value'}, inplace=True)


def pred(x):
    '''Predict with the Laplace-smoothed estimates.'''
    postpros = {}
    for c_k in C:
        postpro = priorpro_lap[c_k]
        for i in range(len(x)):
            postpro *= conpro.loc[(conpro['value'] == x[i]) & (conpro['变量'] == i + 1), c_k].values[0]
        postpros[c_k] = postpro
    return max(postpros, key=postpros.get), postpros
```
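With Laplace smoothing ($\lambda=1$), the same instance $x=(2,\mathrm{S})$ uses the smoothed counts $(3+1)/(9+3)$ and $(1+1)/(9+3)$ for class $1$, and $(2+1)/(6+3)$ and $(3+1)/(6+3)$ for class $-1$; the prediction is still $-1$:

```python
from fractions import Fraction as F

p_pos = F(9 + 1, 15 + 2) * F(3 + 1, 9 + 3) * F(1 + 1, 9 + 3)  # class  1: prior and conditionals smoothed
p_neg = F(6 + 1, 15 + 2) * F(2 + 1, 6 + 3) * F(3 + 1, 6 + 3)  # class -1

assert p_pos == F(5, 153)
assert p_neg == F(28, 459)
assert p_neg > p_pos  # the Laplace-smoothed classifier also predicts -1
```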
The value of $c$ that maximizes $P(Y=c\mid X=x)$ is the predicted label $Y$ for $X$. Applying Bayes' theorem and then the feature conditional-independence assumption,
$$P(Y=c\mid X=x)=\frac{P(X=x,Y=c)}{P(X=x)}=\frac{P(X=x\mid Y=c)P(Y=c)}{\sum\limits_{c}P(X=x\mid Y=c)P(Y=c)}=\frac{\prod\limits_{j}P(X^{(j)}=x^{(j)}\mid Y=c)P(Y=c)}{\sum\limits_{c}\prod\limits_{j}P(X^{(j)}=x^{(j)}\mid Y=c)P(Y=c)}$$