(Appendix: a few additional examples)
Example 4.1:
My girlfriend and my mother have both fallen into a river. A passerby takes out 3 beans: two red and one green. If I draw a red bean, I save my girlfriend; if I draw the green bean, I save my mother. The passerby and I each draw one bean. The passerby finds that he has drawn the green bean and offers to trade me the remaining bean for the one I drew. Should I swap? Is the probability that my girlfriend survives the same either way?
Intuitively:
Whether or not I swap, the probability that I drew a red bean should be $1/3$. Now the passerby tells me his bean is green, which rules out one bean, so the probability that my bean is red becomes $1/2$. Swap or not, the probability is $1/2$.
Let's compute it:
If I swap, I am in effect re-drawing from the two remaining beans, so the probability is:
$$P(A \mid B)=\frac{P(B \mid A)\, P(A)}{P(B)}=\frac{1 \cdot \frac{1}{3}}{\frac{2}{3}}=\frac{1}{2}$$
If I do not swap, the probability is still the one from the original draw, where the two of us picked at the same time:
$$P(A \mid B)=\frac{P(B \mid A)\, P(A)}{P(B)}=\frac{1 \cdot \frac{1}{3}}{1}=\frac{1}{3}$$
Here $A$ denotes "I drew a red bean" and $B$ denotes "the passerby drew the green bean". The difference lies in the order of the draws: swapping means making a second draw on top of the first one. Conclusion: if you want to save your girlfriend, you had better swap with the passerby; if you want to save your mom, you had better not swap.
Conditional probability: $P(A \mid B)$ is the probability that $A$ occurs given that $B$ has occurred.
$$P(A \mid B)=\frac{P(AB)}{P(B)}=\frac{P(B \mid A)\, P(A)}{P(B)}$$
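As a quick check of this definition, here is a minimal Python sketch (not from the original notes) that recovers $P(A \mid B)=P(AB)/P(B)$ by enumerating a toy sample space; the two-coin scenario and the events $A$, $B$ are illustrative assumptions, not part of the examples above.

```python
from itertools import product

# Toy sample space: all equally likely outcomes of two fair coin tosses.
outcomes = list(product(["H", "T"], repeat=2))

# Illustrative events: A = "both tosses are heads", B = "the first toss is heads".
A = lambda o: o == ("H", "H")
B = lambda o: o[0] == "H"

p_B  = sum(B(o) for o in outcomes) / len(outcomes)            # P(B)  = 1/2
p_AB = sum(A(o) and B(o) for o in outcomes) / len(outcomes)   # P(AB) = 1/4

print(p_AB / p_B)  # P(A|B) = P(AB) / P(B) = 0.5
```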
Example 4.2:
Suppose we have a handwritten-digit dataset with 100 records: records 0-9 are the digit 0, each written by a different person; records 10-19 are the digit 1 written by 10 people; ...; records 90-99 are the digit 9 written by 10 people. Xiaohong writes a digit X. How do we decide which digit it is?
How naive Bayes works: compute
$$P(Y=0 \mid X)=?,\quad P(Y=1 \mid X)=?,\quad \cdots,\quad P(Y=9 \mid X)=?$$
The class with the highest probability is the predicted digit.
For the handwriting dataset above, let the digit classes be denoted $C_k$, where $C_0$ stands for the digit 0, and so on. The decision quantity above can then be written as $P(Y=C_k \mid X=x)$.
$$
\begin{aligned}
P(Y=C_k \mid X=x) &= \frac{P(X=x \mid Y=C_k)\, P(Y=C_k)}{P(X=x)} \\
&= \frac{P(X=x \mid Y=C_k)\, P(Y=C_k)}{\sum_{k} P(X=x,\, Y=C_k)} \\
&= \frac{P(X=x \mid Y=C_k)\, P(Y=C_k)}{\sum_{k} P(X=x \mid Y=C_k)\, P(Y=C_k)} \\
&= \frac{P(Y=C_k) \prod_{j} P(X^{(j)}=x^{(j)} \mid Y=C_k)}{\sum_{k} P(Y=C_k) \prod_{j} P(X^{(j)}=x^{(j)} \mid Y=C_k)}
\end{aligned}
$$
Here we have also used the factorization:
$$
\begin{aligned}
P(X=x \mid Y=C_k) &= P(X^{(1)}=x^{(1)} \mid Y=C_k)\, P(X^{(2)}=x^{(2)} \mid Y=C_k) \cdots P(X^{(n)}=x^{(n)} \mid Y=C_k) \\
&= \prod_{j} P(X^{(j)}=x^{(j)} \mid Y=C_k)
\end{aligned}
$$
The meaning of "naive": the features are assumed to be conditionally independent given the class.
$$
\begin{aligned}
f(x) = \underset{C_k}{\operatorname{argmax}}\, P(Y=C_k \mid X=x) &= \underset{C_k}{\operatorname{argmax}}\, \frac{P(Y=C_k) \prod_{j} P(X^{(j)}=x^{(j)} \mid Y=C_k)}{\sum_{k} P(Y=C_k) \prod_{j} P(X^{(j)}=x^{(j)} \mid Y=C_k)} \\
&= \underset{C_k}{\operatorname{argmax}}\, P(Y=C_k) \prod_{j} P(X^{(j)}=x^{(j)} \mid Y=C_k)
\end{aligned}
$$
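The last step drops the denominator because it is identical for every class $C_k$, so it cannot change which class attains the maximum. A minimal sketch illustrating this, with made-up score values purely for illustration:

```python
# Hypothetical unnormalized scores P(Y=C_k) * prod_j P(X^(j)=x^(j) | Y=C_k) for K=3 classes.
scores = [0.002, 0.010, 0.004]

# Full posteriors divide by the same denominator (the sum over all classes).
total = sum(scores)
posteriors = [s / total for s in scores]   # [0.125, 0.625, 0.25]

# The argmax is the same whether we use the posteriors or the raw scores.
assert max(range(3), key=lambda k: posteriors[k]) == max(range(3), key=lambda k: scores[k])
```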
Furthermore, the prior can be estimated as
$$P(Y=C_k)=\frac{\sum_{i=1}^{N} I(y_i=C_k)}{N}, \quad k=1,2,\ldots,K$$
where $I$ is the indicator function:
$$I(x)= \begin{cases}1, & \text{if condition } x \text{ is true} \\ 0, & \text{if condition } x \text{ is false}\end{cases}$$
Suppose the $j$-th feature $x^{(j)}$ takes values in the set $\{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$. Then the conditional probabilities are estimated as
$$
P(X^{(j)}=a_{jl} \mid Y=C_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\, y_i=C_k)}{\sum_{i=1}^{N} I(y_i=C_k)}, \qquad j=1,2,\ldots,n;\; l=1,2,\ldots,S_j;\; k=1,2,\ldots,K
$$
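Both estimates are just relative frequencies, so they can be computed by counting. A minimal Python sketch on an invented toy dataset (the feature values and labels are assumptions for illustration, not the handwriting data):

```python
from collections import Counter

# Toy dataset: each sample is (features, label); both features are categorical.
samples = [((1, "S"), 0), ((1, "M"), 0), ((2, "S"), 1), ((2, "M"), 1), ((2, "L"), 1)]
N = len(samples)

# Prior: P(Y=C_k) = (# samples with label C_k) / N
label_counts = Counter(y for _, y in samples)
prior = {c: n / N for c, n in label_counts.items()}          # {0: 0.4, 1: 0.6}

# Conditional: P(X^(j)=a | Y=C_k) = count(x_i^(j)=a and y_i=C_k) / count(y_i=C_k)
def conditional(j, a, c):
    return sum(1 for x, y in samples if x[j] == a and y == c) / label_counts[c]

print(prior, conditional(1, "S", 1))  # second value: 1/3
```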
Putting these estimates together, we obtain the following algorithm.
Algorithm 4.1: the naive Bayes algorithm
Input: training data $T=\{(x_1, y_1),(x_2, y_2), \ldots,(x_N, y_N)\}$, where $x_i=(x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)} \in \{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, $j=1,2,\ldots,n$, $l=1,2,\ldots,S_j$, $y_i \in \{c_1, c_2, \ldots, c_K\}$; and an instance $x$.
Output: the class of instance $x$.
(1) Compute the prior and the conditional probabilities:
$$
P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N}, \quad k=1,2,\ldots,K
$$
$$
P(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\, y_i=c_k)}{\sum_{i=1}^{N} I(y_i=c_k)}, \quad j=1,2,\ldots,n;\; l=1,2,\ldots,S_j;\; k=1,2,\ldots,K
$$
(2) For the given instance $x=(x^{(1)}, x^{(2)}, \ldots, x^{(n)})^T$, compute
$$P(Y=c_k) \prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k), \quad k=1,2,\ldots,K$$
(3) Determine the class of instance $x$:
$$y=\underset{c_k}{\operatorname{argmax}}\, P(Y=c_k) \prod_{j=1}^{n} P(X^{(j)}=x^{(j)} \mid Y=c_k)$$
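Below is a minimal Python sketch of Algorithm 4.1 for categorical features; the class name `NaiveBayes` and the toy data are assumptions for illustration, and unseen (feature value, class) pairs simply get probability 0 here, which is exactly the problem the smoothing discussed next addresses:

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical naive Bayes following Algorithm 4.1 (no smoothing)."""

    def fit(self, X, y):
        N = len(y)
        self.classes = sorted(set(y))
        label_counts = Counter(y)
        # Step (1a): prior P(Y=c_k) by counting labels.
        self.prior = {c: label_counts[c] / N for c in self.classes}
        # Step (1b): conditional P(X^(j)=a_jl | Y=c_k) by counting (feature index, value, label) triples.
        self.cond = defaultdict(float)
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.cond[(j, v, yi)] += 1
        for (j, v, c) in self.cond:
            self.cond[(j, v, c)] /= label_counts[c]
        return self

    def predict(self, x):
        # Steps (2)-(3): score each class and return the argmax.
        best_c, best_score = None, -1.0
        for c in self.classes:
            score = self.prior[c]
            for j, v in enumerate(x):
                score *= self.cond.get((j, v, c), 0.0)
            if score > best_score:
                best_c, best_score = c, score
        return best_c

# Toy usage with invented data: two categorical features, two classes.
X = [(1, "S"), (1, "M"), (2, "S"), (2, "M"), (2, "L")]
y = [0, 0, 1, 1, 1]
print(NaiveBayes().fit(X, y).predict((2, "S")))  # -> 1
```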
Actually, the estimate we obtained in the previous section,
$$
P(X^{(j)}=a_{jl} \mid Y=C_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\, y_i=C_k)}{\sum_{i=1}^{N} I(y_i=C_k)}, \qquad j=1,2,\ldots,n;\; l=1,2,\ldots,S_j;\; k=1,2,\ldots,K
$$
is problematic: the denominator can be 0!
Example:
Suppose the dataset has 100 samples: 10 of them belong to the digit 0, but 0 of them belong to the digit 1, and so on. When we try to compute $P(X^{(j)}=a_{jl} \mid Y=C_1)$ ...
the formula above cannot be used directly.
We therefore modify the formula slightly (with $\lambda=1$ this is Laplace smoothing):
$$P(X^{(j)}=a_{jl} \mid Y=C_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\, y_i=C_k)+\lambda}{\sum_{i=1}^{N} I(y_i=C_k)+S_j \lambda}$$
where $S_j$ is the number of values the $j$-th feature can take.
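A minimal sketch of the smoothed estimate; the function name and the example counts are assumptions for illustration:

```python
def smoothed_conditional(count_jl_k, count_k, S_j, lam=1.0):
    """Additive smoothing: (count(a_jl, C_k) + lambda) / (count(C_k) + S_j * lambda)."""
    return (count_jl_k + lam) / (count_k + S_j * lam)

# Even when a class has zero samples the estimate stays well defined and positive:
print(smoothed_conditional(count_jl_k=0, count_k=0, S_j=3))  # 1/3 instead of 0/0
```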
Maximizing the posterior probability is equivalent to minimizing the expected risk. Suppose naive Bayes uses the 0-1 loss function:
$$L(Y, f(x))= \begin{cases}1, & Y \neq f(x) \\ 0, & Y=f(x)\end{cases}$$
The expected risk is then:
$$
\begin{aligned}
R_{\exp}(f) &= E[L(Y, f(x))] \\
&= E_x \sum_{k=1}^{K}\left[L(C_k, f(x))\right] P(C_k \mid X=x)
\end{aligned}
$$
To minimize it, it suffices to minimize pointwise for each $X=x$:
$$
\begin{aligned}
f(x) &= \underset{y \in \mathcal{Y}}{\operatorname{argmin}} \sum_{k=1}^{K}\left[L(C_k, y)\right] P(C_k \mid X=x) \\
&= \underset{y \in \mathcal{Y}}{\operatorname{argmin}} \sum_{k=1}^{K} P(y \neq C_k \mid X=x) \\
&= \underset{y \in \mathcal{Y}}{\operatorname{argmin}}\left(1-P(y=C_k \mid X=x)\right) \\
&= \underset{y \in \mathcal{Y}}{\operatorname{argmax}}\, P(y=C_k \mid X=x)
\end{aligned}
$$
Thus the criterion of minimizing the expected risk becomes the criterion of maximizing the posterior probability, which is exactly the principle that naive Bayes adopts.
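As a small numerical check of this equivalence (the posterior values below are made up), minimizing the expected 0-1 loss selects the same class as maximizing the posterior:

```python
# Hypothetical posterior P(C_k | X=x) over K=3 classes.
posterior = [0.2, 0.5, 0.3]

# Expected 0-1 loss of predicting class y: sum_k 1[y != k] * P(C_k|x) = 1 - P(C_y|x).
expected_loss = [1.0 - p for p in posterior]

risk_minimizer      = min(range(3), key=lambda y: expected_loss[y])
posterior_maximizer = max(range(3), key=lambda k: posterior[k])
print(risk_minimizer, posterior_maximizer)  # both are 1
```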