Statistical Learning Methods: Naive Bayes (Part 1)

Naive Bayes

  • Naive Bayes
    • Bayes' Theorem
    • Learning and Classification with Naive Bayes
        • Basic Method
        • Meaning of Posterior Probability Maximization
    • Parameter Estimation for Naive Bayes
        • Maximum Likelihood Estimation
        • Learning and Classification Algorithm
        • Bayesian Estimation
    • References

Naive Bayes

Naive Bayes is a classification method based on Bayes' theorem together with the assumption of conditional independence between features; it is a generative model.

Bayes' Theorem

First, we state Bayes' theorem:
$$P(B_i \mid A) = \frac{P(B_i)\,P(A \mid B_i)}{\sum_{j=1}^{n} P(B_j)\,P(A \mid B_j)}$$
where $P(\cdot)$ denotes the probability that an event occurs, and $P(A \mid B)$ denotes the probability that $A$ occurs given that $B$ has occurred.

Applied to classification (class given features), Bayes' theorem reads:
$$P(\text{class} \mid \text{features}) = \frac{P(\text{features} \mid \text{class})\,P(\text{class})}{P(\text{features})}$$
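
As a quick numeric check, here is a minimal Python sketch of the theorem on a made-up spam-filter example (the prior and likelihood numbers are invented purely for illustration):

```python
# Bayes' theorem on a toy two-class problem; all probabilities are hypothetical.
priors = {"spam": 0.3, "ham": 0.7}        # P(B_i)
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(A | B_i), e.g. P("free" appears | class)

# Total probability of the evidence: P(A) = sum_j P(B_j) * P(A | B_j)
evidence = sum(priors[c] * likelihoods[c] for c in priors)

# Posterior P(B_i | A) for each class
for c in priors:
    print(c, priors[c] * likelihoods[c] / evidence)
# spam: 0.24 / 0.31 ~ 0.774, ham: 0.07 / 0.31 ~ 0.226
```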

I will not belabor this further (in my view, knowing this formula is enough). Now let us introduce Naive Bayes.

Learning and Classification with Naive Bayes

Basic Method

I will try to keep this as simple as possible.

The training set $T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$ is generated independently and identically distributed according to $P(X, Y)$. Here $X$ is a random variable on the input space $\mathcal{X}$, and $Y$ is a random variable on the output space $\mathcal{Y} = \{c_1, c_2, \ldots, c_K\}$.

The Naive Bayes method learns the joint probability distribution $P(X, Y)$ from the training set.

Let us walk through the details:

  1. Compute the prior probability distribution: $P(Y = c_k),\ k = 1, 2, \ldots, K$
  2. Compute the conditional probability distribution:
     $$P(X = x \mid Y = c_k) = P(X^{(1)} = x^{(1)}, X^{(2)} = x^{(2)}, \cdots, X^{(n)} = x^{(n)} \mid Y = c_k),\ k = 1, 2, \cdots, K$$
     In words: the probability of observing sample $x$ given label $c_k$ is the probability that every feature takes its corresponding value given label $c_k$. [This is hard to compute directly: the number of parameters is exponential in the number of features.]
  3. Because we have assumed conditional independence between features, the expression above can be rewritten as:
     $$P(X = x \mid Y = c_k) = \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)$$
     This simplifies the model, but may cost some classification accuracy.
  4. To classify a given input $x$, use the learned model to compute the posterior probability distribution $P(Y = c_k \mid X = x)$ and output the class with the largest posterior:
     $$P(Y = c_k \mid X = x) = \frac{P(X = x \mid Y = c_k)\,P(Y = c_k)}{\sum_k P(X = x \mid Y = c_k)\,P(Y = c_k)} = \frac{P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_k P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)}$$
  5. The Naive Bayes classifier can therefore be written as:
     $$y = f(x) = \arg\max_{c_k} \frac{P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_k P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)}$$
     Since the denominator is the same for every $c_k$, it can be dropped, leaving: [if you skip the theory, this is the one formula to remember]
     $$y = \arg\max_{c_k} P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)$$
     (A small code sketch of this decision rule follows the list.)
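
To make the decision rule concrete, here is a minimal Python sketch that applies it to hand-specified probability tables; all names and numbers are hypothetical, standing in for quantities a trained model would provide:

```python
import math

# Hypothetical learned quantities for classes c1, c2.
prior = {"c1": 0.6, "c2": 0.4}                      # P(Y = c_k)
# cond[c][j][value] = P(X^(j) = value | Y = c), one table per feature j.
cond = {
    "c1": [{"low": 0.7, "high": 0.3}, {"A": 0.5, "B": 0.5}],
    "c2": [{"low": 0.2, "high": 0.8}, {"A": 0.9, "B": 0.1}],
}

def predict(x):
    """y = argmax_{c_k} P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k).

    Only the numerator of the posterior is computed, since the
    denominator is constant across classes.
    """
    return max(
        prior,
        key=lambda c: prior[c] * math.prod(cond[c][j][v] for j, v in enumerate(x)),
    )

print(predict(("high", "A")))  # c2: 0.4*0.8*0.9 = 0.288 beats c1: 0.6*0.3*0.5 = 0.09
```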

Meaning of Posterior Probability Maximization

Naive Bayes classifies an instance into the class with the largest posterior probability, which is equivalent to minimizing the expected risk.

The expected risk functional is:
$$R_{\exp}(f) = E\left[ L(Y, f(X)) \right]$$
where $L(\cdot)$ is the loss function, here taken to be the 0-1 loss ($L(Y, f(X)) = 1$ if $Y \neq f(X)$, and $0$ otherwise), and $f(\cdot)$ is the classification decision function.

Taking the expectation conditionally on $X$ gives:
$$R_{\exp}(f) = E_X \sum_{k=1}^{K} \left[ L(c_k, f(X)) \right] P(c_k \mid X)$$

To minimize the expected risk, it suffices to minimize the inner sum pointwise for each $X = x$, which is exactly maximizing the posterior probability:
$$f(x) = \arg\max_{c_k} P(c_k \mid X = x)$$
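
For completeness, the intermediate steps of this pointwise minimization (the standard derivation, written out under the 0-1 loss) are:
$$f(x) = \arg\min_{y \in \mathcal{Y}} \sum_{k=1}^{K} L(c_k, y)\, P(c_k \mid X = x) = \arg\min_{y \in \mathcal{Y}} \sum_{c_k \neq y} P(c_k \mid X = x) = \arg\min_{y \in \mathcal{Y}} \left( 1 - P(y \mid X = x) \right) = \arg\max_{y \in \mathcal{Y}} P(y \mid X = x)$$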

Parameter Estimation for Naive Bayes

Maximum Likelihood Estimation

Learning here means estimating $P(Y = c_k)$ and $P(X^{(j)} = x^{(j)} \mid Y = c_k)$.

  • The maximum likelihood estimate of the prior probability $P(Y = c_k)$ is:
    $$P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N},\quad k = 1, 2, \cdots, K$$
  • Suppose the $j$-th feature $x^{(j)}$ takes values in the set $\{a_{j1}, a_{j2}, \cdots, a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)} = a_{jl} \mid Y = c_k)$ is:
    $$P(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k)}{\sum_{i=1}^{N} I(y_i = c_k)},\quad j = 1, 2, \cdots, n;\ l = 1, 2, \cdots, S_j;\ k = 1, 2, \cdots, K$$
    where $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $a_{jl}$ is the $l$-th possible value of feature $j$, and $I$ is the indicator function. (A counting sketch of these two estimates follows this list.)
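
As a concrete illustration, here is a minimal Python sketch of both estimates computed by direct counting (the indicator function $I$ becomes an equality test); the tiny dataset is hypothetical, chosen only for illustration:

```python
from collections import Counter

# Hypothetical training set: N = 8 samples, each with n = 2 discrete features.
X = [(1, "S"), (1, "M"), (1, "M"), (2, "S"), (2, "M"), (2, "L"), (3, "L"), (3, "M")]
y = [-1, -1, 1, -1, 1, 1, 1, 1]
N = len(y)

# sum_i I(y_i = c_k), per class
label_count = Counter(y)
prior = {c: cnt / N for c, cnt in label_count.items()}   # P(Y = c_k)

# sum_i I(x_i^(j) = a_jl, y_i = c_k), per (feature index, value, class)
joint_count = Counter((j, a, yi) for xi, yi in zip(X, y) for j, a in enumerate(xi))

def cond(j, a, c):
    """Maximum likelihood estimate of P(X^(j) = a | Y = c)."""
    return joint_count[(j, a, c)] / label_count[c]

print(prior[1])         # 5/8 = 0.625
print(cond(1, "M", 1))  # 3/5 = 0.6
```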

Learning and Classification Algorithm

A formal statement of the algorithm:

  • Input: training set $T = \{(x_1, y_1), \cdots, (x_N, y_N)\}$, where $x_i = (x_i^{(1)}, \cdots, x_i^{(n)})^T$, $x_i^{(j)} \in \{a_{j1}, \cdots, a_{jS_j}\}$, $y_i \in \{c_1, c_2, \cdots, c_K\}$; and an instance $x$
  • Output: the class of instance $x$
  • Procedure
    • Compute the prior and conditional probabilities [the formulas are given above]
    • For the given instance $x = (x^{(1)}, \cdots, x^{(n)})^T$, compute
      $$P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k),\quad k = 1, 2, \cdots, K$$
    • Determine the class of $x$ (an end-to-end sketch in code follows this list):
      $$y = \arg\max_{c_k} P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)$$
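
Putting the pieces together, here is a hedged end-to-end sketch of the algorithm in Python, on the same hypothetical dataset as the earlier counting sketch (the data is made up for illustration, not taken from any textbook example):

```python
import math
from collections import Counter

# Hypothetical training set, as above.
X = [(1, "S"), (1, "M"), (1, "M"), (2, "S"), (2, "M"), (2, "L"), (3, "L"), (3, "M")]
y = [-1, -1, 1, -1, 1, 1, 1, 1]

# Step 1: estimate priors and conditionals by maximum likelihood counting.
N = len(y)
label_count = Counter(y)
joint_count = Counter((j, a, yi) for xi, yi in zip(X, y) for j, a in enumerate(xi))

def prior(c):
    return label_count[c] / N                      # P(Y = c)

def cond(j, a, c):
    return joint_count[(j, a, c)] / label_count[c]  # P(X^(j) = a | Y = c)

# Steps 2-3: score every class for the instance x, then take the argmax.
def classify(x):
    scores = {c: prior(c) * math.prod(cond(j, a, c) for j, a in enumerate(x))
              for c in label_count}
    return max(scores, key=scores.get)

print(classify((2, "S")))  # -1: 3/8 * 1/3 * 2/3 = 1/12 beats class 1's score of 0
```

Note that class $1$ scores exactly $0$ here because the value $S$ was never observed with that class, so a single factor wipes out the whole product; this is precisely the problem Bayesian estimation (next section) addresses.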

Bayesian Estimation

Maximum likelihood estimation may yield probability estimates that are exactly $0$ (for feature values never observed with a class), so Bayesian estimation is used instead.

The Bayesian estimate of the conditional probability is:
$$P_\lambda(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k) + \lambda}{\sum_{i=1}^{N} I(y_i = c_k) + S_j \lambda}$$
where $\lambda \ge 0$: taking $\lambda = 0$ recovers the maximum likelihood estimate, and $\lambda = 1$ gives Laplace smoothing. Here $N$ is the total number of samples, $K$ is the number of classes, and $S_j$ is the number of possible values of the feature under consideration.

The Bayesian estimate of the prior probability is:
$$P_\lambda(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k) + \lambda}{N + K\lambda}$$
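
Continuing the hypothetical example from the earlier sketches, here is a minimal Python sketch of both smoothed estimates with $\lambda = 1$ (Laplace smoothing); the value counts $S_j$ are read off the toy data:

```python
from collections import Counter

# Same hypothetical data as in the earlier sketches.
X = [(1, "S"), (1, "M"), (1, "M"), (2, "S"), (2, "M"), (2, "L"), (3, "L"), (3, "M")]
y = [-1, -1, 1, -1, 1, 1, 1, 1]
N, K, lam = len(y), 2, 1.0   # K classes; lambda = 1 -> Laplace smoothing
S = [3, 3]                   # S_j: feature 0 takes {1,2,3}, feature 1 takes {S,M,L}

label_count = Counter(y)
joint_count = Counter((j, a, yi) for xi, yi in zip(X, y) for j, a in enumerate(xi))

def prior(c):
    """P_lambda(Y = c) = (count(c) + lambda) / (N + K*lambda)."""
    return (label_count[c] + lam) / (N + K * lam)

def cond(j, a, c):
    """P_lambda(X^(j) = a | Y = c) = (count(j,a,c) + lambda) / (count(c) + S_j*lambda)."""
    return (joint_count[(j, a, c)] + lam) / (label_count[c] + S[j] * lam)

print(prior(1))         # (5 + 1) / (8 + 2) = 0.6
# A feature value never seen with class 1 now gets positive probability:
print(cond(1, "S", 1))  # (0 + 1) / (5 + 3) = 0.125 instead of 0
```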

References

《统计学习方法》 (*Statistical Learning Methods*)
《机器学习》 (*Machine Learning*)
https://baike.baidu.com/item/贝叶斯定理/1185949?fr=aladdin
