Naive Bayes is a classification model built on Bayes' theorem together with the assumption of conditional independence among features. From the training data it learns the joint probability distribution P(x, y) of input x and output y, then applies Bayes' theorem to find the output y with the largest posterior probability, which it returns as the prediction.
Bayes' theorem relates the conditional probabilities of two random events A and B: P(B|A) is the probability that B occurs given that A has occurred. The simple form of the theorem is:
$$P(B|A) = \frac{P(A,B)}{P(A)} = \frac{P(A|B)P(B)}{P(A)}$$
If the sample space is partitioned into n mutually exclusive and exhaustive events $B_1, \dots, B_n$, the full form of Bayes' formula is:
$$P(B_i|A) = \frac{P(A,B_i)}{P(A)} = \frac{P(A|B_i)P(B_i)}{\sum_{j=1}^{n}P(A|B_j)P(B_j)}$$
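The full formula can be checked numerically. The sketch below uses a hypothetical two-event partition $B_1, B_2$ with made-up prior and likelihood values; the denominator is the total probability $P(A)$.

```python
# Hypothetical partition B1, B2 (illustrative numbers, not from the text)
P_B = [0.3, 0.7]          # priors P(B1), P(B2); B1 and B2 partition the space
P_A_given_B = [0.9, 0.2]  # likelihoods P(A|B1), P(A|B2)

# Denominator: total probability P(A) = sum_j P(A|Bj) P(Bj)
P_A = sum(l * p for l, p in zip(P_A_given_B, P_B))

# Posterior P(B1|A) = P(A|B1) P(B1) / P(A)
posterior_B1 = P_A_given_B[0] * P_B[0] / P_A
print(P_A, posterior_B1)
```

With these numbers $P(A) = 0.9 \cdot 0.3 + 0.2 \cdot 0.7 = 0.41$ and $P(B_1|A) = 0.27/0.41 \approx 0.659$.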
The input space X consists of n-dimensional vectors, and the output space is the set of class labels $Y = \{c_1, c_2, \dots, c_K\}$. Given a training set $T = \{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, the joint distribution P(X, Y) can be learned; concretely, this means learning the prior distribution and the conditional distribution.
Prior probability distribution: $P(Y=c_k), \quad k=1,2,\dots,K$
Conditional probability distribution, with the conditional independence assumption applied:
$$P(X=x|Y=c_k) = P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)}|Y=c_k) = \prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
To classify, for a given input x the learned model computes the posterior distribution $P(Y=c_k|X=x)$ and outputs the class with the largest posterior probability as the prediction.
$$P(Y=c_k|X=x) = \frac{P(X=x|Y=c_k)P(Y=c_k)}{\sum_{k=1}^{K}P(X=x|Y=c_k)P(Y=c_k)}$$
The Bayes classifier can therefore be written as a maximization over the K possible classes:
$$y = f(x) = \arg\max_{c_k}\frac{P(X=x|Y=c_k)P(Y=c_k)}{\sum_{k=1}^{K}P(X=x|Y=c_k)P(Y=c_k)}$$
By the independence assumption, the components of x in $P(X=x|Y=c_k)$ are mutually independent, so
$$P(X=x|Y=c_k) = \prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
When taking the maximum, the denominator is the same for every class, so the classifier finally simplifies to:
$$y = f(x) = \arg\max_{c_k} P(X=x|Y=c_k)P(Y=c_k) = \arg\max_{c_k} P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
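The simplified classifier is a few lines of code once the probability tables exist. The sketch below assumes the prior and the per-feature conditional tables have already been estimated (the table contents here are hypothetical); it sums log probabilities instead of multiplying, a standard trick to avoid floating-point underflow when n is large.

```python
import math

def predict(x, prior, cond):
    """argmax_k P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k), computed in log space.

    prior: {class: P(Y=c_k)}
    cond:  {class: [ {value: probability} for each feature j ]}
    """
    best_class, best_score = None, -math.inf
    for c, p in prior.items():
        score = math.log(p) + sum(math.log(cond[c][j][v]) for j, v in enumerate(x))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical tables: one binary feature, classes {1, -1}
prior = {1: 0.6, -1: 0.4}
cond = {1: [{'a': 0.5, 'b': 0.5}], -1: [{'a': 0.1, 'b': 0.9}]}
print(predict(('a',), prior, cond))  # class 1 wins: 0.6*0.5 > 0.4*0.1
```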
The maximum-likelihood estimate of the prior probability $P(Y=c_k)$ is:
$$P(Y=c_k) = \frac{\sum_{i=1}^{N}I(y_i=c_k)}{N}, \quad k=1,2,\dots,K$$
Let the j-th feature $x^{(j)}$ take values in the set $\{a_{j1}, a_{j2}, \dots, a_{jS_j}\}$. The maximum-likelihood estimate of the conditional probability is:
$$P(X^{(j)}=a_{jl}|Y=c_k) = \frac{\sum_{i=1}^{N}I(x_i^{(j)}=a_{jl},\, y_i=c_k)}{\sum_{i=1}^{N}I(y_i=c_k)}$$
Given an instance $x = (x^{(1)}, x^{(2)}, \dots, x^{(n)})$, compute:
$$P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
and predict the class of x as:
$$y = \arg\max_{c_k} P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
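Both maximum-likelihood estimates are just counting. A minimal sketch on a tiny synthetic dataset (the feature values and labels below are illustrative, not from the text):

```python
from collections import Counter, defaultdict

# Hypothetical training set: two discrete features, classes {1, -1}
X = [('a', 's'), ('a', 't'), ('b', 's'), ('b', 't'), ('b', 's')]
y = [1, 1, -1, -1, 1]

N = len(y)
class_count = Counter(y)                        # sum_i I(y_i = c_k)
prior = {c: n / N for c, n in class_count.items()}

# cond[c][(j, v)] = sum_i I(x_i^(j) = v, y_i = c) / sum_i I(y_i = c)
joint = defaultdict(Counter)
for xi, yi in zip(X, y):
    for j, v in enumerate(xi):
        joint[yi][(j, v)] += 1
cond = {c: {jv: n / class_count[c] for jv, n in cnts.items()}
        for c, cnts in joint.items()}

print(prior)  # class 1 appears 3 of 5 times, class -1 twice
```

Note that the conditional counts are normalized by the class count, not by N: the estimate is a probability conditioned on $Y=c_k$.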
Maximum-likelihood estimation can produce probability estimates of zero, which zeroes out the whole product and distorts the classification. The fix is Laplace smoothing: add a constant $\lambda$ to the count of every possible value of the random variable, typically with $\lambda = 1$.
Bayesian estimate of the prior distribution:
$$P(Y=c_k) = \frac{\sum_{i=1}^{N}I(y_i=c_k)+\lambda}{N+K\lambda}$$
Bayesian estimate of the conditional probability, where $S_j$ is the number of possible values of the j-th feature:
$$P(X^{(j)}=a_{jl}|Y=c_k) = \frac{\sum_{i=1}^{N}I(x_i^{(j)}=a_{jl},\, y_i=c_k)+\lambda}{\sum_{i=1}^{N}I(y_i=c_k)+S_j\lambda}$$
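The smoothed estimates can be sketched the same way as the MLE version, with $\lambda$ added to each count. In the hypothetical one-feature dataset below, the pair (value 'b', class 1) never occurs together, yet its smoothed probability stays positive:

```python
from collections import Counter

# Hypothetical one-feature dataset (illustrative, not from the text)
X = [('a',), ('a',), ('b',)]
y = [1, 1, -1]

lam = 1.0                                   # lambda = 1: Laplace smoothing
classes = sorted(set(y))
values = sorted({xi[0] for xi in X})        # value set of feature 0
N, K, S0 = len(y), len(classes), len(values)

class_count = Counter(y)
# P(Y=c_k) = (count(c_k) + lambda) / (N + K*lambda)
prior = {c: (class_count[c] + lam) / (N + K * lam) for c in classes}

# P(X^(0)=v | Y=c) = (count(v, c) + lambda) / (count(c) + S_0*lambda)
cond = {c: {v: (sum(1 for xi, yi in zip(X, y) if yi == c and xi[0] == v) + lam)
               / (class_count[c] + S0 * lam)
            for v in values}
        for c in classes}

print(cond[1]['b'])  # unseen pair, but nonzero after smoothing
```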