Statistical Learning Methods, Chapter 4: Formula Derivations for the Naive Bayes Method

Table of Contents

    • Chapter 4: The Naive Bayes Method
      • Learning and Classification with Naive Bayes
    • Posterior Probability Maximization
    • Parameter Estimation for Naive Bayes
    • The Naive Bayes Algorithm

Chapter 4: The Naive Bayes Method

  • A classification method based on Bayes' theorem and the assumption of conditional independence among features.

  • Note that the naive Bayes method and Bayesian estimation are different concepts.

  • Generative models vs. discriminative models:
    $$
    \begin{cases}
    \text{generative model:} & P(Y \mid X) = \dfrac{P(X, Y)}{P(X)}, \quad X, Y \text{ random variables} \\[4pt]
    \text{discriminative model:} & Y = f(X), \ \text{or } P(Y \mid X) \text{ modeled directly}
    \end{cases}
    $$
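    Typical generative models include naive Bayes and hidden Markov models; typical discriminative models include logistic regression and support vector machines.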

Learning and Classification with Naive Bayes

  • Input: a feature vector $x \in \mathcal{X} \subseteq \mathbf{R}^{n}$, the feature vector of an instance;

    Output: a class label $y \in \mathcal{Y} = \{c_1, c_2, \cdots, c_K\}$.

    $X$ is a random vector defined on the input space $\mathcal{X}$, and $Y$ is a random variable defined on the output space $\mathcal{Y}$; $P(X, Y)$ is the joint probability distribution of $X$ and $Y$. The training data set
    $$
    \begin{aligned}
    T &= \left\{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\right\} \\
      &= \left\{(x_i, y_i)\right\}_{i=1}^{N}
    \end{aligned}
    $$
    is generated i.i.d. from $P(X, Y)$.

  • Prior probability distribution:
    $$
    P(Y = c_k), \quad k = 1, 2, \ldots, K
    $$

  • Conditional probability distribution:
    $$
    P(X = x \mid Y = c_k) = P(X^{(1)} = x^{(1)}, \cdots, X^{(n)} = x^{(n)} \mid Y = c_k), \quad k = 1, 2, \cdots, K
    $$

  • Joint probability distribution:

    By the multiplication rule $P(AB) = P(B)\,P(A \mid B)$, combining the prior and conditional distributions above yields the joint distribution $P(X, Y)$ (equivalently written $P(Y, X)$):
    $$
    \begin{aligned}
    P(Y, X) &= P(Y = c_k, X = x) \\
            &= P(Y = c_k)\, P(X = x \mid Y = c_k)
    \end{aligned}
    $$

  • Generative model (posterior probability):

    By the law of total probability and Bayes' theorem (with $\{B_i\}$ a partition of the sample space):
    $$
    \begin{aligned}
    P(B_i \mid A) &= \frac{P(A B_i)}{P(A)} = \frac{P(B_i)\,P(A \mid B_i)}{P(A)} = \frac{P(B_i)\,P(A \mid B_i)}{\sum_i P(B_i)\,P(A \mid B_i)} \\[6pt]
    \Rightarrow\; P(Y = c_k \mid X = x) &= \frac{P(Y = c_k, X = x)}{P(X = x)} \\
    &= \frac{P(X = x \mid Y = c_k)\, P(Y = c_k)}{P(X = x)} \\
    &= \frac{P(X = x \mid Y = c_k)\, P(Y = c_k)}{\sum_k P(X = x \mid Y = c_k)\, P(Y = c_k)}
    \end{aligned}
    $$

  • Model assumption: conditional independence

    The conditional distribution $P(X = x \mid Y = c_k)$ has exponentially many parameters, so estimating it without assuming conditional independence among the features is infeasible in practice.
    $$
    \begin{aligned}
    P(X = x \mid Y = c_k) &= P(X^{(1)} = x^{(1)}, \cdots, X^{(n)} = x^{(n)} \mid Y = c_k) \\
    &= P(X^{(1)} = x^{(1)} \mid Y = c_k) \cdot P(X^{(2)} = x^{(2)} \mid Y = c_k) \cdots \\
    &= \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)
    \end{aligned}
    $$
    Indeed, if $x^{(j)}$ can take $S_j$ values, $j = 1, 2, \cdots, n$, and $Y$ can take $K$ values, the unrestricted conditional distribution has $K \prod\limits_{j=1}^{n} S_j$ parameters (a numeric illustration follows this bullet).

    The naive Bayes method imposes this conditional independence assumption on the conditional distribution. Because it is a rather strong assumption, the method is called "naive".
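    As a concrete illustration (numbers chosen for illustration only): with $n = 30$ binary features ($S_j = 2$ for all $j$) and $K = 2$ classes, the unrestricted conditional distribution has $K \prod_{j=1}^{n} S_j = 2 \cdot 2^{30} \approx 2 \times 10^{9}$ parameters, while under conditional independence only $K \sum_{j=1}^{n} S_j = 2 \cdot 60 = 120$ conditional probabilities need to be estimated.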

  • Prediction criterion: maximize the posterior probability (proved in the next section)

    Combining Bayes' theorem with conditional independence, the posterior probability is
    $$
    \begin{aligned}
    P(Y = c_k \mid X = x) &= \frac{P(Y = c_k, X = x)}{P(X = x)} \\
    &= \frac{P(X = x \mid Y = c_k)\, P(Y = c_k)}{P(X = x)} \\
    &= \frac{P(X = x \mid Y = c_k)\, P(Y = c_k)}{\sum_k P(X = x \mid Y = c_k)\, P(Y = c_k)} \\
    &= \frac{P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_k P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)}, \quad k = 1, 2, \cdots, K
    \end{aligned}
    $$
    This is the basic formula of naive Bayes classification. The naive Bayes classifier can therefore be written as
    $$
    y = f(x) = \arg\max_{c_k} \frac{P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_k P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)}
    $$
    A minimal code sketch of this decision rule follows.
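    As a minimal sketch of the decision rule (the probability tables `prior` and `cond` below are hypothetical values, not estimated from any real data):

    ```python
    # A minimal sketch of the naive Bayes decision rule, assuming the prior
    # P(Y=c_k) and the per-feature conditionals P(X^(j)=x^(j) | Y=c_k) have
    # already been estimated. All probability values here are hypothetical.

    prior = {"c1": 0.6, "c2": 0.4}      # P(Y = c_k)
    cond = {                            # cond[ck][j][v] = P(X^(j)=v | Y=ck)
        "c1": [{"a": 0.5, "b": 0.5}, {"s": 0.2, "t": 0.8}],
        "c2": [{"a": 0.9, "b": 0.1}, {"s": 0.7, "t": 0.3}],
    }

    def classify(x):
        # Score P(Y=ck) * prod_j P(X^(j)=x^(j) | Y=ck); the denominator
        # P(X=x) is the same for every ck, so it can be ignored.
        scores = {}
        for ck in prior:
            s = prior[ck]
            for j, v in enumerate(x):
                s *= cond[ck][j][v]
            scores[ck] = s
        return max(scores, key=scores.get)

    print(classify(["a", "s"]))  # c1: 0.6*0.5*0.2 = 0.06; c2: 0.4*0.9*0.7 = 0.252 -> "c2"
    ```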


Posterior Probability Maximization

  • Naive Bayes assigns an instance to the class with the largest posterior probability, which is equivalent to minimizing the expected risk. Suppose we choose the 0-1 loss function:
    $$
    L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}
    $$
    where $f(X)$ is the classification decision function. The expected risk is then
    $$
    R_{\exp}(f) = E[L(Y, f(X))]
    $$
    Minimizing this expected loss is equivalent to maximizing the posterior probability:
    $$
    y = \arg\max_{c_k} P(Y = c_k \mid X = x)
    $$

  • Minimizing the expected risk
    $$
    \begin{aligned}
    R_{\exp}(f) &= E[L(Y, f(X))] \\
    &= \sum_X \sum_Y L(Y, f(X))\, P(X, Y) \\
    &= \sum_X \sum_Y L(Y, f(X))\, P(Y \mid X)\, P(X) \\
    &= \sum_X \Big\{ \sum_Y L(Y, f(X))\, P(Y \mid X) \Big\}\, P(X) \\
    &= E_X \sum_{k=1}^{K} L(Y = c_k, f(X))\, P(Y = c_k \mid X)
    \end{aligned}
    $$
    That is, the expectation is taken with respect to the joint distribution $P(X, Y)$, and the last line rewrites it as an outer expectation over $X$ of the conditional expectation:
    $$
    R_{\exp}(f) = E_X \sum_{k=1}^{K} L(Y = c_k, f(X))\, P(Y = c_k \mid X)
    $$
    To minimize the expected risk it suffices to minimize pointwise for each $X = x$, which gives:
    $$
    \begin{aligned}
    f(x) &= \arg\min_{y \in \mathcal{Y}} \sum_{k=1}^{K} L(Y = c_k, y)\, P(Y = c_k \mid X = x) \\
    &= \arg\min_{y \in \mathcal{Y}} \sum_{k=1}^{K} I(y \neq c_k)\, P(Y = c_k \mid X = x)
      && \text{since } L(Y = c_k, y) = I(y \neq c_k) \text{ for 0-1 loss} \\
    &= \arg\min_{y \in \mathcal{Y}} \sum_{k=1}^{K} \big[\, 1 - I(y = c_k) \,\big]\, P(Y = c_k \mid X = x) \\
    &= \arg\min_{y \in \mathcal{Y}} \Big\{ \sum_{k=1}^{K} P(Y = c_k \mid X = x) - \sum_{k=1}^{K} I(y = c_k)\, P(Y = c_k \mid X = x) \Big\} \\
    &= \arg\min_{y \in \mathcal{Y}} \Big\{ 1 - \sum_{k=1}^{K} I(y = c_k)\, P(Y = c_k \mid X = x) \Big\}
      && \text{since } \sum_{k=1}^{K} P(Y = c_k \mid X = x) = 1 \\
    &= \arg\max_{y \in \mathcal{Y}} \sum_{k=1}^{K} I(y = c_k)\, P(Y = c_k \mid X = x)
    \end{aligned}
    $$
    Naive Bayes assigns the instance to the class with the largest posterior; this is equivalent to minimizing the expected risk:
    $$
    \begin{aligned}
    f(x) &= \arg\min_{f} R_{\exp}(f) \\
    &= \arg\max_{y \in \mathcal{Y}} \sum_{k=1}^{K} I(y = c_k)\, P(Y = c_k \mid X = x)
    \end{aligned}
    $$
    The sum selects exactly one term, namely the posterior of the chosen class, so maximizing it means choosing the $c_k$ whose posterior is largest. The risk minimization criterion thus yields the posterior maximization criterion:
    $$
    f(x) = \arg\max_{c_k} P(Y = c_k \mid X = x)
    $$
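    For instance (an illustrative calculation): if $K = 2$ and at some $x$ the posteriors are $P(Y = c_1 \mid X = x) = 0.7$ and $P(Y = c_2 \mid X = x) = 0.3$, then predicting $c_1$ incurs conditional risk $1 - 0.7 = 0.3$ while predicting $c_2$ incurs $1 - 0.3 = 0.7$; the risk-minimizing choice is exactly the posterior-maximizing one.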


Parameter Estimation for Naive Bayes

  • From the derivation above,
    $$
    \begin{aligned}
    y &= f(x) \\
    &= \arg\max_{c_k} \frac{P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_k P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)} \\
    &= \arg\max_{c_k} P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k)
    \end{aligned}
    $$
    where the last step holds because the denominator $\sum_k P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} \mid Y = c_k) = P(X = x)$ is the same for every $c_k$ and therefore does not affect the argmax.

    In naive Bayes, learning means estimating $P(Y = c_k)$ and $P(X^{(j)} = x^{(j)} \mid Y = c_k)$.

  • Maximum likelihood estimation:

    1. The maximum likelihood estimate of the prior probability $P(Y = c_k)$ is
      $$
      P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1, 2, \cdots, K
      $$

    2. Suppose the $j$-th feature $x^{(j)}$ takes values in the set $\{a_{j1}, a_{j2}, \cdots, a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)} = a_{jl} \mid Y = c_k)$ is
      $$
      \begin{aligned}
      &P(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k)}{\sum_{i=1}^{N} I(y_i = c_k)} \\
      &j = 1, 2, \cdots, n; \quad l = 1, 2, \cdots, S_j; \quad k = 1, 2, \cdots, K
      \end{aligned}
      $$
      where $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, and $I$ is the indicator function. A code sketch of both estimates follows.
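    A short sketch of these two counting estimates, on a small hypothetical dataset (the data and variable names are illustrative, not from the book):

    ```python
    from collections import Counter, defaultdict

    # Hypothetical training data: N = 5 samples, n = 2 discrete features each.
    X = [("1", "S"), ("1", "M"), ("2", "M"), ("2", "S"), ("3", "L")]
    y = ["-1", "-1", "1", "1", "1"]
    N = len(y)

    # 1. MLE of the prior: P(Y=ck) = count(y_i = ck) / N
    class_count = Counter(y)
    prior = {ck: cnt / N for ck, cnt in class_count.items()}

    # 2. MLE of the conditionals: joint count of (class, feature j, value v)
    #    divided by the class count.
    joint_count = defaultdict(int)
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            joint_count[(yi, j, v)] += 1

    def cond_mle(ck, j, v):
        return joint_count[(ck, j, v)] / class_count[ck]

    print(prior["1"])             # 3/5 = 0.6
    print(cond_mle("1", 0, "2"))  # 2/3: among the three class-"1" samples,
                                  # feature 0 equals "2" twice
    ```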

  • Bayesian estimation

    Maximum likelihood estimation may produce probability estimates that are exactly 0, which distorts the computation of the posterior probabilities and biases the classification. The remedy is Bayesian estimation. Concretely:

    1. Bayesian estimate of the conditional probabilities
      $$
      P_{\lambda}(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k) + \lambda}{\sum_{i=1}^{N} I(y_i = c_k) + S_j \lambda}
      $$
      where $\lambda \geqslant 0$. This is equivalent to adding a positive count $\lambda > 0$ to the frequency of each possible value of the random variable; $\lambda = 0$ recovers maximum likelihood estimation. A common choice is $\lambda = 1$, known as Laplace smoothing. Clearly, for any $l = 1, 2, \cdots, S_j$ and $k = 1, 2, \cdots, K$,
      $$
      \begin{aligned}
      &P_{\lambda}(X^{(j)} = a_{jl} \mid Y = c_k) > 0 \\
      &\sum_{l=1}^{S_j} P_{\lambda}(X^{(j)} = a_{jl} \mid Y = c_k) = 1
      \end{aligned}
      $$
      (The second identity holds because summing the numerators over $l$ gives $\sum_{i=1}^{N} I(y_i = c_k) + S_j \lambda$, exactly the denominator.) Similarly:

    2. Bayesian estimate of the prior probability
      $$
      P_{\lambda}(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k) + \lambda}{N + K \lambda}
      $$
      A smoothed version of the earlier code sketch follows.
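      Continuing the earlier sketch with the smoothed (Bayesian) estimates; the data are the same hypothetical values, and $\lambda$, $S_j$, $K$ are as defined in the text:

      ```python
      from collections import Counter, defaultdict

      # Same hypothetical data as in the MLE sketch above.
      X = [("1", "S"), ("1", "M"), ("2", "M"), ("2", "S"), ("3", "L")]
      y = ["-1", "-1", "1", "1", "1"]
      N, lam = len(y), 1.0                  # lambda = 1: Laplace smoothing

      classes = sorted(set(y))
      K = len(classes)
      S = [len({xi[j] for xi in X}) for j in range(len(X[0]))]  # S_j per feature

      class_count = Counter(y)
      joint_count = defaultdict(int)
      for xi, yi in zip(X, y):
          for j, v in enumerate(xi):
              joint_count[(yi, j, v)] += 1

      def prior_bayes(ck):
          # P_lambda(Y=ck) = (count(ck) + lambda) / (N + K*lambda)
          return (class_count[ck] + lam) / (N + K * lam)

      def cond_bayes(ck, j, v):
          # P_lambda(X^(j)=v | Y=ck) = (joint count + lambda) / (class count + S_j*lambda)
          return (joint_count[(ck, j, v)] + lam) / (class_count[ck] + S[j] * lam)

      print(prior_bayes("1"))          # (3+1) / (5+2) = 0.571...
      print(cond_bayes("-1", 1, "L"))  # (0+1) / (2+3) = 0.2, nonzero despite a zero count
      ```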


The Naive Bayes Algorithm

  • Algorithm 4.1 (naive Bayes algorithm)
    Input: training data $T = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i = (x_i^{(1)}, x_i^{(2)}, \cdots, x_i^{(n)})^{\mathrm{T}}$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)} \in \{a_{j1}, a_{j2}, \cdots, a_{jS_j}\}$, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, $j = 1, 2, \cdots, n$, $l = 1, 2, \cdots, S_j$, $y_i \in \{c_1, c_2, \cdots, c_K\}$; an instance $x$;

    Output: the class of instance $x$.
    (1) Compute the prior and conditional probabilities

    1. Prior probability

    $$
    P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1, 2, \cdots, K
    $$
    2. Conditional probability

    $$
    \begin{aligned}
    &P(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k)}{\sum_{i=1}^{N} I(y_i = c_k)} \\
    &j = 1, 2, \cdots, n; \quad l = 1, 2, \cdots, S_j; \quad k = 1, 2, \cdots, K
    \end{aligned}
    $$
    (2) For the given instance $x = (x^{(1)}, x^{(2)}, \cdots, x^{(n)})^{\mathrm{T}}$, compute
    $$
    P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k), \quad k = 1, 2, \cdots, K
    $$
    (3) Determine the class of instance $x$:
    $$
    y = \arg\max_{c_k} P(Y = c_k) \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)
    $$
    A self-contained implementation sketch follows.
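    Putting Algorithm 4.1 together as a self-contained sketch (the toy data and the helper names `fit`/`predict` are illustrative; with `lam=0` this reduces to the maximum likelihood version):

    ```python
    from collections import Counter, defaultdict

    def fit(X, y, lam=1.0):
        """Step (1): estimate the prior and conditional probabilities
        with lambda-smoothing (lam=0 gives the MLE version)."""
        N, n = len(y), len(X[0])
        classes = sorted(set(y))
        K = len(classes)
        S = [len({xi[j] for xi in X}) for j in range(n)]   # S_j per feature
        class_count = Counter(y)
        joint = defaultdict(int)
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                joint[(yi, j, v)] += 1
        prior = {c: (class_count[c] + lam) / (N + K * lam) for c in classes}
        def cond(c, j, v):
            return (joint[(c, j, v)] + lam) / (class_count[c] + S[j] * lam)
        return classes, prior, cond

    def predict(x, classes, prior, cond):
        """Steps (2)-(3): score each class and return the argmax."""
        scores = {}
        for c in classes:
            s = prior[c]
            for j, v in enumerate(x):
                s *= cond(c, j, v)
            scores[c] = s
        return max(scores, key=scores.get)

    # Hypothetical usage on toy data
    X = [("1", "S"), ("1", "M"), ("2", "M"), ("2", "S"), ("3", "L"), ("3", "M")]
    y = ["-1", "-1", "1", "1", "1", "1"]
    model = fit(X, y, lam=1.0)
    print(predict(("2", "S"), *model))  # -> "1"
    ```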

