Naive Bayes is a classification method based on Bayes' theorem together with the assumption of conditional independence among features.
Setup: the input space is $X \subseteq \mathbf{R}^n$, the output space is $Y=\{c_1,c_2,\cdots,c_K\}$, and $P(X,Y)$ is the joint distribution of $X$ and $Y$. We impose the conditional-independence assumption on the conditional distribution, i.e. $P(X|Y)=P(X^{(1)}|Y)P(X^{(2)}|Y)\cdots P(X^{(n)}|Y)$: given the class, the features used for classification are conditionally independent of one another.
Learn the joint probability distribution $P(X,Y)$ from the training set $T$.
Compute the posterior distribution with the learned model:
$$P(Y=c_k|X=x)=\frac{P(X=x|Y=c_k)P(Y=c_k)}{P(X=x)}$$
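As a quick sanity check, the posterior can be computed from empirical joint counts on a toy contingency table (all numbers below are hypothetical, just to illustrate the formula):

```python
import numpy as np

# Hypothetical joint counts n(X=x, Y=c_k): 2 feature values x 2 classes
joint = np.array([[3.0, 1.0],   # X = x1
                  [2.0, 4.0]])  # X = x2
P_xy = joint / joint.sum()          # P(X, Y)
P_y = P_xy.sum(axis=0)              # P(Y = c_k), marginal over X
P_x = P_xy.sum(axis=1)              # P(X = x), marginal over Y
P_x_given_y = P_xy / P_y            # P(X = x | Y = c_k)

# Bayes' rule: P(Y=c_k | X=x1) = P(X=x1 | Y=c_k) P(Y=c_k) / P(X=x1)
posterior = P_x_given_y[0] * P_y / P_x[0]
print(posterior)                    # same as P_xy[0] / P_x[0]
```

The last comment is the point: the Bayes-rule expression is algebraically just the joint row renormalized by the marginal.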
The naive Bayes classifier:
$$y=f(x)=\arg\max_{c_k}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
(The denominator $P(X=x)$ is the same for every class $c_k$, so it can be dropped from the maximization.)
Choice of loss function: for classification, the 0-1 loss is the usual choice:
$$L(Y,f(X))=\begin{cases}1,& Y\neq f(X)\\ 0,& Y=f(X)\end{cases}$$
Expected risk:
For a discrete variable the expectation is $\sum x P(x)$, so
$$R_{exp}(f)=E[L(Y,f(X))]=\sum_X\sum_Y L(Y,f(X))P(X,Y)$$
$$=\sum_X\sum_Y L(Y,f(X))P(Y|X)P(X)=\sum_X\Bigl(\sum_Y L(Y,f(X))P(Y|X)\Bigr)P(X)$$
To minimize the expected risk, it suffices to minimize $\sum_Y L(Y,f(X))P(Y|X)$ pointwise for each $X$:
$$\min\sum_Y L(Y,f(X))P(Y|X)=\min\sum_k L(Y=c_k,f(X))P(Y=c_k|X)$$
When $Y\neq f(x)$, i.e. $f(x)\neq c_k$, the loss $L(Y=c_k,f(X))$ equals 1, so the expression above is equivalent to
$$\min\sum_k I(f(X)\neq c_k)P(Y=c_k|X)=\min\sum_k\bigl(1-I(f(X)=c_k)\bigr)P(Y=c_k|X)$$
$$=\min\Bigl\{\sum_k P(Y=c_k|X)-\sum_k I(f(X)=c_k)P(Y=c_k|X)\Bigr\}=\max\sum_k I(f(X)=c_k)P(Y=c_k|X)$$
Thus minimizing the expected 0-1 risk is the same as predicting the class with the largest posterior probability $P(Y=c_k|X)$.
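The conclusion above — minimizing expected 0-1 loss is equivalent to choosing the class with maximal posterior — can be checked numerically on a hypothetical posterior vector:

```python
import numpy as np

posterior = np.array([0.2, 0.5, 0.3])   # hypothetical P(Y=c_k | X=x)

# Expected 0-1 loss of predicting class k is sum over j != k of
# P(Y=c_j|X), i.e. 1 - P(Y=c_k|X)
risk = 1.0 - posterior
best = risk.argmin()
print(best, posterior.argmax())         # both select class index 1
```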
The prior is estimated by maximum likelihood:
$$P(Y=c_k)=\frac{\sum_{i=1}^{N}I(y_i=c_k)}{N}$$
Let the output space be $Y=\{c_1,c_2,\cdots,c_K\}$ with parameter vector $\theta=(\theta_1,\theta_2,\cdots,\theta_K)$, where each value has probability $P(Y=c_i|\theta)=\theta_i$, subject to $\sum_{i=1}^K\theta_i=1$.
Proof:
$$\max_\theta P(y_1y_2\cdots y_N|\theta)=P(y_1|\theta)P(y_2|\theta)\cdots P(y_N|\theta)=\theta_1^{m_1}\theta_2^{m_2}\cdots\theta_K^{m_K}$$
where $m_i$ is the number of samples with label $c_i$. Taking logs,
$$\max_\theta \ln P(y_1y_2\cdots y_N|\theta)=m_1\ln\theta_1+m_2\ln\theta_2+\cdots+m_K\ln\theta_K$$
Introduce a Lagrange multiplier for the constraint:
$$L=\ln P(y_1y_2\cdots y_N|\theta)+\lambda(\theta_1+\theta_2+\cdots+\theta_K-1)=m_1\ln\theta_1+\cdots+m_K\ln\theta_K+\lambda(\theta_1+\cdots+\theta_K-1)$$
$$\frac{\partial L}{\partial\theta_i}=\frac{m_i}{\theta_i}+\lambda=0\quad\Rightarrow\quad\theta_i=-\frac{m_i}{\lambda}$$
$$\sum_{i=1}^K-\frac{m_i}{\lambda}=1\quad\Rightarrow\quad\lambda=-\sum_{i=1}^K m_i=-N,\qquad\theta_i=\frac{m_i}{N}$$
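The MLE result $\theta_i=m_i/N$ is just the empirical class frequency; a minimal sketch on made-up labels:

```python
from collections import Counter

labels = [1, -1, 1, 1, -1]          # hypothetical training labels y_1..y_N
N = len(labels)
counts = Counter(labels)            # m_i = number of samples with y = c_i
theta = {c: m / N for c, m in counts.items()}
print(theta)                        # {1: 0.6, -1: 0.4}
```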
Likewise for the conditional probabilities:
$$P(X^{(j)}=a|Y=c_k)=\frac{\sum_{i=1}^N I(x_i^{(j)}=a,y_i=c_k)}{\sum_{i=1}^N I(y_i=c_k)}$$
When the sample is small, some count may be zero, making the estimated conditional probability zero; that zero then wipes out the whole posterior product. Bayesian estimation is used to avoid this.
$$P(Y=c_k)=\frac{\sum_{i=1}^N I(y_i=c_k)+\lambda}{N+K\lambda}$$
$$P(X^{(j)}=a|Y=c_k)=\frac{\sum_{i=1}^N I(x_i^{(j)}=a,y_i=c_k)+\lambda}{\sum_{i=1}^N I(y_i=c_k)+S\lambda}$$
where $K$ is the number of classes and $S$ is the number of possible values of the $j$-th input feature. Taking $\lambda=1$ gives Laplace smoothing.
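A sketch of how the smoothed estimate (with $\lambda=1$, i.e. Laplace smoothing) removes the zero-probability problem for a feature value that never co-occurs with some class; the data below are hypothetical:

```python
# Hypothetical (feature value, label) pairs; value 'l' never occurs with class -1
pairs = [('s', -1), ('s', -1), ('m', 1), ('l', 1), ('l', 1)]
values = ['s', 'm', 'l']            # S = 3 possible feature values
lam = 1.0                           # smoothing parameter lambda

def cond_prob(a, c):
    """Smoothed estimate of P(X = a | Y = c)."""
    num = sum(1 for v, y in pairs if v == a and y == c) + lam
    den = sum(1 for _, y in pairs if y == c) + len(values) * lam
    return num / den

print(cond_prob('l', -1))           # (0 + 1) / (2 + 3) = 0.2, not 0
```

Note that the smoothed conditional probabilities still sum to 1 over the $S$ feature values, so they remain a valid distribution.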
Proof:
For the multinomial distribution over $Y=\{c_1,c_2,\cdots,c_K\}$, take a Dirichlet prior
$$\pi(\theta)=\frac{\Gamma(\alpha_1+\alpha_2+\cdots+\alpha_K)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_K)}\theta_1^{\alpha_1-1}\theta_2^{\alpha_2-1}\cdots\theta_K^{\alpha_K-1}$$
With no reason to favor any class, take $\alpha_1=\alpha_2=\cdots=\alpha_K=\alpha$. Then
$$P(\theta|y_1\cdots y_N)=\frac{P(\theta,y_1\cdots y_N)}{P(y_1\cdots y_N)}\propto\pi(\theta)P(y_1\cdots y_N|\theta)\propto\theta_1^{\alpha-1}\cdots\theta_K^{\alpha-1}\,\theta_1^{m_1}\cdots\theta_K^{m_K}$$
Write the unnormalized posterior as
$$L=\theta_1^{m_1+\alpha-1}\theta_2^{m_2+\alpha-1}\cdots\theta_K^{m_K+\alpha-1}$$
and maximize its log with a Lagrange multiplier:
$$L_1=\ln L+\lambda(\theta_1+\cdots+\theta_K-1)=\sum_{i=1}^K\bigl[(m_i+\alpha-1)\ln\theta_i+\lambda\theta_i\bigr]-\lambda$$
$$\frac{\partial L_1}{\partial\theta_i}=\frac{m_i+\alpha-1}{\theta_i}+\lambda=0\quad\Rightarrow\quad\theta_i=-\frac{m_i+\alpha-1}{\lambda}$$
$$\sum_{i=1}^K-\frac{m_i+\alpha-1}{\lambda}=1\quad\Rightarrow\quad\lambda=-\sum_{i=1}^K(m_i+\alpha-1)=-N-K(\alpha-1)$$
$$\theta_i=\frac{m_i+\alpha-1}{N+K(\alpha-1)}$$
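Setting $\alpha=\lambda+1$ in this posterior-mode formula recovers the smoothed estimate: e.g. $\alpha=2$ gives $\theta_i=(m_i+1)/(N+K)$, which is Laplace smoothing with $\lambda=1$. A quick numeric check (the counts are hypothetical):

```python
m = [3, 1, 0]                       # hypothetical class counts m_i; K = 3, N = 4
N, K = sum(m), len(m)
alpha, lam = 2.0, 1.0               # alpha = lam + 1

# Posterior mode under the Dirichlet prior vs. Laplace-smoothed estimate
posterior_mode = [(mi + alpha - 1) / (N + K * (alpha - 1)) for mi in m]
laplace = [(mi + lam) / (N + K * lam) for mi in m]
print(posterior_mode == laplace)    # True: both are [4/7, 2/7, 1/7]
```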
```python
import pandas as pd
from pandas import DataFrame


class naive_bayes:
    def __init__(self, X, Y):
        """
        X: samples, one row per sample and one column per feature;
        Y: class labels of the samples.
        """
        self.X = X
        self.Y = Y
        self.model = self.training()

    def training(self):
        data = DataFrame(self.X)
        data['yvalue'] = self.Y
        # count of each class y
        d_y = data['yvalue'].value_counts()
        # P(y) = class count / total count
        m_y = d_y / len(data)
        # P(x|y); in the final m_xy the first index level is the feature
        # index, the second level is the feature value, and the columns
        # are the classes
        m_xy = DataFrame([], columns=data['yvalue'].unique())
        for i in range(len(data.columns) - 1):
            # counts by class y and the i-th feature
            d_xy = pd.crosstab(data[i], data['yvalue'])
            # P(x|y)
            d_xy = d_xy / d_y
            d_xy['feature'] = i
            m_xy = pd.concat([m_xy, d_xy])
        m_xy.set_index(['feature', m_xy.index], inplace=True)
        return {'m_y': m_y, 'm_xy': m_xy}

    def predict(self, x):
        # predicted class and its best (unnormalized) posterior score
        py = ''
        maxp = 0
        # for each class, compute P(y) * prod_j P(x_j | y)
        for y in self.model['m_y'].index:
            p = self.model['m_y'][y]
            for i_feature in range(len(x)):
                p *= self.model['m_xy'].loc[(i_feature, x[i_feature]), y]
            if maxp < p:
                py = y
                maxp = p
        print('predicted class:', py, ', score:', maxp)
        return py, maxp


x = [[1, 's'], [1, 'm'], [1, 'm'], [1, 's'], [1, 's'],
     [2, 's'], [2, 'm'], [2, 'm'], [2, 'l'], [2, 'l'],
     [3, 'l'], [3, 'm'], [3, 'm'], [3, 'l'], [3, 'l']]
y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
n = naive_bayes(x, y)
n.predict([2, 's'])  # scores: 6/15 * 2/6 * 3/6 = 1/15 for -1 vs 1/45 for 1, so predicts -1
```