A classification method based on Bayes' theorem and the assumption that features are conditionally independent given the class.
Note that the naive Bayes method and Bayesian estimation are distinct concepts.
Generative models vs. discriminative models
$$
\left\{
\begin{aligned}
&\text{generative model: } P(Y \mid X) = \frac{P(X,Y)}{P(X)}, \quad X, Y \text{ are random variables} \\
&\text{discriminative model: } Y=f(X) \text{ or } P(Y \mid X)
\end{aligned}
\right.
$$
Input:
a feature vector $x \in \mathcal{X} \subseteq \mathrm{R}^{n}$ of an instance,
Output:
a class label $y \in \mathcal{Y}=\left\{c_{1}, c_{2}, \cdots, c_{K}\right\}$
$X$ is a random vector on the input space $\mathcal{X}$ and $Y$ is a random variable on the output space $\mathcal{Y}$; $P(X, Y)$ is the joint probability distribution of $X$ and $Y$. The training data set
$$
\begin{aligned} T &= \left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\} \\ &= \left\{ (x_i,y_i)\right\}_{i=1}^N \end{aligned}
$$
is generated independently and identically distributed according to $P(X, Y)$.
Prior probability distribution:
$$
P(Y=c_k),\quad k =1,2,\cdots,K
$$
Conditional probability distribution:
$$
P\left(X=x \mid Y=c_{k}\right)=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} \mid Y=c_{k}\right), \quad k=1,2, \cdots, K
$$
Joint probability distribution:
By the multiplication rule $P(AB) = P(B)P(A\mid B)$, combining the prior distribution and the conditional distribution above yields the joint probability distribution $P(X,Y)$ (equivalently written $P(Y,X)$):
$$
\begin{aligned} P(Y,X) &= P(Y=c_k,X=x) \\ &= P(Y=c_k)P(X=x\mid Y=c_k) \end{aligned}
$$
Generative model (posterior probability):
By the law of total probability and Bayes' formula:
$$
\begin{aligned} P(B\mid A) &= \frac{P(AB)}{P(A)} = \frac{P(B)P(A\mid B)}{P(A)} = \frac{P(B)P(A\mid B)}{\sum P(B)P(A\mid B)} \\ \Rightarrow P\left(Y=c_{k} \mid X=x\right) &= \frac{P\left(Y=c_{k} ,X=x\right)}{P(X=x)} \\ &=\frac{P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)}{P(X=x)} \\ &=\frac{P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)}{\sum_{k} P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)} \end{aligned}
$$
Model assumption: conditional independence
The conditional probability distribution $P\left(X=x \mid Y=c_{k}\right)$ has an exponential number of parameters, so estimating it is infeasible in practice unless the features are assumed conditionally independent:
$$
\begin{aligned} P\left(X=x \mid Y=c_{k}\right) &=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} \mid Y=c_{k}\right) \\ &= P(X^{(1)}=x^{(1)} \mid Y=c_k) \cdot P(X^{(2)}=x^{(2)} \mid Y=c_k) \cdots \\ &=\prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right) \end{aligned}
$$
In fact, if $x^{(j)}$ can take $S_{j}$ values, $j=1,2, \cdots, n$, and $Y$ can take $K$ values, then the number of parameters is $K \prod_{j=1}^{n} S_{j}$.
The naive Bayes method imposes this conditional independence assumption on the conditional probability distribution. Because this is a rather strong assumption, the method is called "naive".
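To see the size of this saving, a minimal sketch (the sizes $n$, $K$, $S_j$ are made-up examples; the $K\sum_j S_j$ count for the naive model is a standard consequence of the assumption, not stated in the text above):

```python
from math import prod

# Without the independence assumption: one parameter per joint
# configuration of (x^(1), ..., x^(n)) for each class c_k.
n, K = 10, 2               # hypothetical: 10 features, 2 classes
S = [2] * n                # each feature takes S_j = 2 values
full_params = K * prod(S)  # K * prod_j S_j

# With the naive assumption: one per-feature table per class,
# so only K * sum_j S_j parameters.
naive_params = K * sum(S)

print(full_params, naive_params)  # 2048 vs 40
```

The joint model grows exponentially in $n$ while the naive model grows linearly.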
Prediction criterion: maximize the posterior probability
(proved below)
Combining with conditional independence, the posterior probability is
$$
\begin{aligned} P\left(Y=c_{k} \mid X=x\right) &= \frac{P\left(Y=c_{k} ,X=x\right)}{P(X=x)} \\ &=\frac{P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)}{\sum_{k} P\left(X=x \mid Y=c_{k}\right) P\left(Y=c_{k}\right)} \\ &=\frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}, \quad k=1,2, \cdots, K \end{aligned}
$$
This is the basic formula of naive Bayes classification. The naive Bayes classifier can therefore be written as
$$
y=f(x)=\arg \max _{c_{k}} \frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}
$$
Naive Bayes assigns an instance to the class with the largest posterior probability. This is equivalent to minimizing the expected risk. Suppose the 0-1 loss function is chosen:
$$
L(Y, f(X))= \begin{cases}1, & Y \neq f(X) \\ 0, & Y=f(X)\end{cases}
$$
where $f(X)$ is the classification decision function. The expected risk is then
$$
R_{\exp }(f)=E[L(Y, f(X))]
$$
When this expected loss is minimized, the result is equivalent to maximizing the posterior probability:
$$
y =\arg\max_{c_k} P\left(Y=c_{k} \mid X=x\right)
$$
Minimizing the expected risk:
$$
\begin{aligned} R_{\exp }(f) & = E[L(Y, f(X))] \\ & = \sum_Y\sum_X L(Y, f(X))P(X,Y) \\ & = \sum_Y\sum_X L(Y, f(X))P(Y\mid X)P(X) \\ & = \sum_X \left\{\sum_Y L(Y, f(X))P(Y\mid X)\right\}P(X) \\ & = E_X \sum_{k=1}^{K} L\left(Y=c_k, f(X)\right)P\left(Y=c_k\mid X\right) \end{aligned}
$$
That is, the expectation is taken with respect to the joint distribution $P(X, Y)$. Taking the conditional expectation gives
$$
R_{\exp }(f)=E_{X} \sum_{k=1}^{K}\left[L\left(Y=c_{k}, f(X)\right)\right] P\left(Y=c_{k} \mid X\right)
$$
To minimize the expected risk, it suffices to minimize pointwise for each $X=x$, which gives:
Since the 0-1 loss satisfies $L\left(Y=c_{k}, f(X)\right)=0$ when $f(X)=Y=c_{k}$ and $1$ otherwise, we have $L\left(c_{k}, y\right)= I(y\neq c_k)$, so

$$
\begin{aligned} f(x) & = \arg\min_{y\in\mathcal{Y}} \sum_{k=1}^{K}L\left(c_{k}, y\right) P\left(Y=c_{k} \mid X=x\right) \\ & = \arg\min_{y\in\mathcal{Y}} \sum_{k=1}^{K}I(y\neq c_k)P\left(Y=c_{k} \mid X=x\right)\\ & = \arg\min_{y\in\mathcal{Y}} \sum_{k=1}^{K} \left[1-I(y= c_k)\right]P\left(Y=c_{k} \mid X=x\right)\\ & = \arg\min_{y\in\mathcal{Y}} \left\{\sum_{k=1}^{K}P\left(Y=c_{k} \mid X=x\right)-\sum_{k=1}^{K}I(y= c_k)P\left(Y=c_{k} \mid X=x\right)\right\}\\ & = \arg\min_{y\in\mathcal{Y}} \left\{1-\sum_{k=1}^{K}I(y= c_k)P\left(Y=c_{k} \mid X=x\right)\right\} \qquad \left(\because \sum_{k=1}^{K}P\left(Y=c_{k} \mid X=x\right) = 1\right)\\ & = \arg\max_{y\in\mathcal{Y}} \sum_{k=1}^{K}I(y= c_k)P\left(Y=c_{k} \mid X=x\right) \end{aligned}
$$
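The equivalence just derived can be checked numerically. A minimal sketch with a made-up posterior vector (class names and probabilities are illustrative only): the class minimizing the expected 0-1 loss is exactly the class with the largest posterior.

```python
# Toy posteriors P(Y=c_k | X=x) for three hypothetical classes.
posterior = {"c1": 0.2, "c2": 0.5, "c3": 0.3}

def expected_01_loss(y, post):
    # sum_k I(y != c_k) * P(Y=c_k | X=x), which equals 1 - P(Y=y | X=x)
    return sum(p for c, p in post.items() if c != y)

risk_minimizer = min(posterior, key=lambda y: expected_01_loss(y, posterior))
posterior_maximizer = max(posterior, key=posterior.get)
print(risk_minimizer, posterior_maximizer)  # both "c2"
```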
Naive Bayes assigns an instance to the class with the largest posterior probability; this is equivalent to minimizing the expected risk:
$$
\begin{aligned} f(x) &= \arg\min R_{\exp }(f) \\ & = \arg\max_{y\in\mathcal{Y}} \sum_{k=1}^{K}I(y= c_k)P\left(Y=c_{k} \mid X=x\right) \end{aligned}
$$
Since the indicator $I(y=c_k)$ selects a single term of the sum, the sum is maximized by choosing $y = c_k$ for the class $c_k$ with the largest posterior $P(Y=c_k \mid X=x)$. Thus the expected-risk-minimization criterion yields the posterior-maximization criterion:
$$
f(x) = \arg\max_{c_k}P(Y=c_k\mid X=x)
$$
By the derivation above,
the denominator $\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right) = P(X=x)$ is the same for every class $c_k$, so it can be dropped:
$$
\begin{aligned} y&=\arg \max _{c_{k}} \frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)}\\ &= \arg \max _{c_{k}}P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right) \end{aligned}
$$
In the naive Bayes method, learning means estimating $P\left(Y=c_{k}\right)$ and $P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)$.
Maximum likelihood estimation:
The maximum likelihood estimate of the prior probability $P\left(Y=c_{k}\right)$ is
$$
P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K
$$
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\left\{a_{j 1}, a_{j 2}, \cdots, a_{j S_{j}}\right\}$. The maximum likelihood estimate of the conditional probability $P\left(X^{(j)}=a_{jl} \mid Y=c_{k}\right)$ is
$$
\begin{aligned} &P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)} \\ &j=1,2, \cdots, n ; \quad l=1,2, \cdots, S_{j} ; \quad k=1,2, \cdots, K \end{aligned}
$$
where $x_{i}^{(j)}$ is the $j$-th feature of the $i$-th sample, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, and $I$ is the indicator function.
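The two estimates above are just counting. A minimal sketch on a toy dataset (the data values are made up for illustration):

```python
from collections import Counter

# Toy training set: categorical features, labels in {+1, -1}.
X = [("a", "s"), ("a", "m"), ("b", "s"), ("b", "m"), ("b", "m")]
y = [-1, -1, 1, 1, 1]
N = len(y)

# Prior MLE: P(Y=c_k) = count(y_i = c_k) / N
class_count = Counter(y)
prior = {c: class_count[c] / N for c in class_count}

# Conditional MLE:
# P(X^(j)=a | Y=c_k) = count(x_i^(j)=a, y_i=c_k) / count(y_i=c_k)
def cond_mle(j, a, c):
    num = sum(1 for xi, yi in zip(X, y) if yi == c and xi[j] == a)
    return num / class_count[c]

print(prior[1])             # 3/5 = 0.6
print(cond_mle(0, "b", 1))  # 3/3 = 1.0
```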
Bayesian estimation
Maximum likelihood estimation may produce probability estimates equal to 0, which distorts the posterior computation and biases the classification. The remedy is Bayesian estimation. Specifically:
The Bayesian estimate of the conditional probability is
$$
P_{\lambda}\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)+\lambda}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+S_{j} \lambda}
$$
where $\lambda \geqslant 0$. This is equivalent to adding a positive count $\lambda>0$ to the frequency of each value of the random variable. When $\lambda=0$ it reduces to maximum likelihood estimation. A common choice is $\lambda=1$, which is called Laplace smoothing (Laplacian smoothing).
Clearly, for any $l=1,2, \cdots, S_{j}$ and $k=1,2, \cdots, K$,
$$
\begin{aligned} &P_{\lambda}\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)>0 \\ &\sum_{l=1}^{S_{j}} P_{\lambda}\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=1 \end{aligned}
$$
Similarly, the Bayesian estimate of the prior probability is
$$
P_{\lambda}\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+\lambda}{N+K \lambda}
$$
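A minimal sketch of both smoothed estimates with $\lambda=1$ on a toy dataset (the data, and the assumption that feature 0 takes values in {"a", "b", "c"}, are made up for illustration). Unlike the MLE, the unseen value "c" gets a strictly positive probability:

```python
from collections import Counter

X = [("a", "s"), ("a", "m"), ("b", "s"), ("b", "m"), ("b", "m")]
y = [-1, -1, 1, 1, 1]
N, lam = len(y), 1.0
K = len(set(y))   # number of classes
S0 = 3            # assumed: feature 0 takes values in {"a", "b", "c"}

class_count = Counter(y)

def prior_smoothed(c):
    # P_lambda(Y=c) = (count(c) + lambda) / (N + K*lambda)
    return (class_count[c] + lam) / (N + K * lam)

def cond_smoothed(j, a, c, S_j):
    # P_lambda(X^(j)=a | Y=c) = (count(a,c) + lambda) / (count(c) + S_j*lambda)
    num = sum(1 for xi, yi in zip(X, y) if yi == c and xi[j] == a)
    return (num + lam) / (class_count[c] + S_j * lam)

print(prior_smoothed(1))             # (3+1)/(5+2) = 4/7
print(cond_smoothed(0, "c", 1, S0))  # (0+1)/(3+3) = 1/6, not 0
```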
Algorithm 4.1 (naïve Bayes algorithm)
Input:
training data $T=\left\{ (x_i,y_i)\right\}_{i=1}^N$, where $x_{i}=\left(x_{i}^{(1)}, x_{i}^{(2)}, \cdots, x_{i}^{(n)}\right)^{\mathrm{T}}$, $x_{i}^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_{i}^{(j)} \in\left\{a_{j 1}, a_{j 2}, \cdots, a_{j S_{j}}\right\}$, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, $j=1,2, \cdots, n$, $l=1,2, \cdots, S_{j}$, $y_{i} \in\left\{c_{1}, c_{2}, \cdots, c_{K}\right\}$; an instance $x$;
Output:
the class of instance $x$.
(1) Compute the prior and conditional probabilities:
$$
P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K
$$
$$
P\left(X^{(j)}=a_{j l} \mid Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}\\ j=1,2, \cdots, n ; \quad l=1,2, \cdots, S_{j} ; \quad k=1,2, \cdots, K
$$
(2) For the given instance $x=\left(x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right)^{\mathrm{T}}$, compute
$$
P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right), \quad k=1,2, \cdots, K
$$
(3) Determine the class of instance $x$:
$$
y=\arg \max _{c_{k}} P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} \mid Y=c_{k}\right)
$$
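The three steps above can be sketched end-to-end. A minimal implementation using the maximum likelihood estimates on a toy dataset (the data are made up for illustration; a real implementation would use Laplace smoothing and log-probabilities for numerical stability):

```python
from collections import Counter

# Toy training set: categorical features, labels in {+1, -1}.
X = [("a", "s"), ("a", "m"), ("b", "s"), ("b", "m"), ("b", "m")]
y = [-1, -1, 1, 1, 1]
N = len(y)
class_count = Counter(y)  # step (1): class counts give the priors

def classify(x):
    best_c, best_score = None, -1.0
    for c in class_count:
        # Step (2): P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k), with MLE tables
        score = class_count[c] / N
        for j, a in enumerate(x):
            num = sum(1 for xi, yi in zip(X, y) if yi == c and xi[j] == a)
            score *= num / class_count[c]
        # Step (3): keep the class with the largest score
        if score > best_score:
            best_c, best_score = c, score
    return best_c

print(classify(("b", "m")))  # 1
```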