Suppose we have two events, A and B. Let $p(A)$ be the probability that A occurs, $p(B)$ the probability that B occurs, $p(B|A)$ the probability that B occurs given that A has occurred, $p(A|B)$ the probability that A occurs given that B has occurred, and $p(AB)$ the probability that A and B occur together. Then

$$p(AB) = p(A)\,p(B|A) = p(B)\,p(A|B)$$

and hence Bayes' theorem:

$$p(A|B) = \frac{p(A)\,p(B|A)}{p(B)}$$
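As a quick sanity check, here is a minimal numeric sketch; the event probabilities are made up purely for illustration:

```python
# Hypothetical example: p(A) = 0.3, p(B) = 0.4, p(B|A) = 0.8
p_A, p_B, p_B_given_A = 0.3, 0.4, 0.8

# Bayes' theorem: p(A|B) = p(A) * p(B|A) / p(B)
p_A_given_B = p_A * p_B_given_A / p_B
print(p_A_given_B)  # 0.6

# Both factorizations give the same joint probability p(AB)
assert abs(p_A * p_B_given_A - p_B * p_A_given_B) < 1e-12
```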
Given a training set $\{(X_1,y_1),(X_2,y_2),(X_3,y_3),\dots,(X_m,y_m)\}$, where $m$ is the number of samples and each sample has $n$ features, i.e. $X_i = (x_{i1}, x_{i2}, \dots, x_{in})$. The set of class labels is $\{y_1, y_2, \dots, y_k\}$. Let $p(y=y_k \mid X=x)$ denote the probability that the output is $y_k$ when the input sample $X$ is $x$.
Now suppose we are given a new sample $x$ and want to decide which class it belongs to. We can compute $p(y=y_1|x), p(y=y_2|x), p(y=y_3|x), \dots, p(y=y_k|x)$ and assign $x$ to the class whose value is largest; that is, we solve for the maximum posterior probability $\arg\max_y p(y|x)$.
How do we compute these posterior probabilities? By Bayes' theorem,

$$p(y|x) = \frac{p(y)\,p(x|y)}{p(x)}$$

The denominator $p(x)$ is the same for every class, and naive Bayes additionally assumes the features are conditionally independent given the class, i.e. $p(x|y) = \prod_{j=1}^{n} p(x^{(j)}|y)$, so it suffices to compare $p(y)\prod_{j=1}^{n} p(x^{(j)}|y)$ across classes.
Next, we look at how to estimate $p(y)$ and $p(x|y)$ from the training samples.
In the naive Bayes method, learning means estimating the prior probability $p(y)$ and the conditional probability $p(x|y)$; the posterior $p(y|x)$ of a new sample is then computed from these. There are many ways to estimate the prior and conditional probabilities, such as maximum likelihood estimation and the multinomial, Gaussian, and Bernoulli models.
In maximum likelihood estimation, the MLE of the prior probability $p(y)$ is

$$p(y=y_k) = \frac{\sum_{i=1}^{m} I(y_i = y_k)}{m}$$

and the MLE of the conditional probability $p(x^{(j)}=a_{jl} \mid y=y_k)$ is

$$p(x^{(j)}=a_{jl} \mid y=y_k) = \frac{\sum_{i=1}^{m} I(x_i^{(j)}=a_{jl},\ y_i=y_k)}{\sum_{i=1}^{m} I(y_i=y_k)}$$

where $I(\cdot)$ is the indicator function.
Example 1
This example comes from Li Hang's Statistical Learning Methods (《统计学习方法》).
In the table below, $X^{(1)}$ and $X^{(2)}$ are features whose value sets are $A_1=\{1,2,3\}$ and $A_2=\{S,M,L\}$ respectively, and $Y$ is the class label with $Y \in \{1,-1\}$. Find the class label of $x=(2,S)$. The data are shown below, with the values $\{S,M,L\}$ of feature $X^{(2)}$ encoded as $\{0,1,2\}$.
```python
import numpy as np
import pandas as pd

# Features X1, X2 and class label y from Li Hang's example
x1 = np.array([1,1,1,1,1,2,2,2,2,2,3,3,3,3,3])
x2 = np.array([0,1,1,0,0,0,1,1,2,2,2,1,1,2,2])
y = np.array([-1,-1,1,1,-1,-1,-1,1,1,1,1,1,1,1,-1])

# Stack the columns into a single (15, 3) dataset
dataSet = np.concatenate((x1[:,None], x2[:,None], y[:,None]), axis=1)
df = pd.DataFrame(dataSet, index=np.arange(1,16,1), columns=['X1','X2','y'])
df.T
```
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 |
| X2 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 2 | 2 |
| y | -1 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | -1 |
Solution

Step 1: compute the prior probabilities

$p(y=-1)=\frac{6}{15}$, $p(y=1)=\frac{9}{15}$
Step 2: compute the conditional probabilities

(2.1) Feature $X_1$

$p(X_1=1|y=-1)=\frac{3}{6}=\frac{1}{2}$, $p(X_1=2|y=-1)=\frac{2}{6}=\frac{1}{3}$, $p(X_1=3|y=-1)=\frac{1}{6}$

$p(X_1=1|y=1)=\frac{2}{9}$, $p(X_1=2|y=1)=\frac{3}{9}=\frac{1}{3}$, $p(X_1=3|y=1)=\frac{4}{9}$

(2.2) Feature $X_2$

$p(X_2=0|y=-1)=\frac{3}{6}=\frac{1}{2}$, $p(X_2=1|y=-1)=\frac{2}{6}=\frac{1}{3}$, $p(X_2=2|y=-1)=\frac{1}{6}$

$p(X_2=0|y=1)=\frac{1}{9}$, $p(X_2=1|y=1)=\frac{4}{9}$, $p(X_2=2|y=1)=\frac{4}{9}$
Step 3: compute the (unnormalized) posterior probabilities

$p(y=-1)\,p(X=(2,S)|y=-1) = p(y=-1)\,p(X_1=2|y=-1)\,p(X_2=S|y=-1) = \frac{6}{15}\cdot\frac{1}{3}\cdot\frac{1}{2} = \frac{1}{15}$

$p(y=1)\,p(X=(2,S)|y=1) = p(y=1)\,p(X_1=2|y=1)\,p(X_2=S|y=1) = \frac{9}{15}\cdot\frac{1}{3}\cdot\frac{1}{9} = \frac{1}{45}$

Since $\frac{1}{15} > \frac{1}{45}$, the class label of the sample is $-1$.
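As a quick check of the arithmetic, the two unnormalized posteriors can be computed with exact fractions:

```python
from fractions import Fraction

# Unnormalized posterior for y = -1: p(y=-1) * p(X1=2|y=-1) * p(X2=S|y=-1)
p_neg = Fraction(6, 15) * Fraction(1, 3) * Fraction(1, 2)
# Unnormalized posterior for y = +1: p(y=1) * p(X1=2|y=1) * p(X2=S|y=1)
p_pos = Fraction(9, 15) * Fraction(1, 3) * Fraction(1, 9)

print(p_neg, p_pos)  # 1/15 1/45
# Normalized, these are exactly the 0.75 / 0.25 printed by MLENB below
print(float(p_neg / (p_neg + p_pos)))  # 0.75
```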
Below is a Python implementation of naive Bayes with maximum likelihood estimation; its output matches the hand computation above.
```python
class MLENB:
    """
    Maximum likelihood estimation Naive Bayes

    Attributes
    ----------
    class_prior_ : array, shape (n_classes,)
        Empirical probability of each class.
    class_count_ : array, shape (n_classes,)
        Number of training samples observed in each class.
    MLE_ : array, shape (n_classes, n_features)
        Maximum likelihood estimate of each feature per class;
        each element is a dict mapping feature value -> probability.
    """
    def __init__(self):
        pass

    def fit(self, X, y):
        """Fit maximum likelihood estimation Naive Bayes according to X, y

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape (n_samples,)
            Target values.

        Returns
        -------
        self : object
            Returns self.
        """
        n_features = X.shape[1]
        n_classes = len(set(y))
        self.class_count_ = np.empty(n_classes)
        self.class_prior_ = np.empty(n_classes)
        self.MLE_ = np.empty((n_classes, n_features), dtype=dict)
        self.target_unique = np.unique(y)
        for i in range(n_classes):
            # Samples belonging to the i-th class
            dataX_tu = X[y == self.target_unique[i]]
            self.class_prior_[i] = dataX_tu.shape[0] / float(len(y))
            self.class_count_[i] = dataX_tu.shape[0]
            for j in range(n_features):
                # Empirical frequency of each value of feature j within this class
                feature = dataX_tu[:, j]
                feature_unique = np.unique(feature)
                fp = {}
                for f_item in feature_unique:
                    fp[f_item] = list(feature).count(f_item) / float(len(feature))
                self.MLE_[i, j] = fp
        return self

    def __predict_likelihood(self, x):
        if x.ndim == 1:
            x = np.array([x])
        n_features = x.shape[1]
        n_classes = len(self.class_count_)
        likelihood = []
        for x_item in x:
            class_p = []
            for i in range(n_classes):
                # Unnormalized posterior: prior times the product of
                # per-feature conditional probabilities
                p = self.class_prior_[i]
                for j in range(n_features):
                    if x_item[j] in self.MLE_[i, j]:
                        p *= self.MLE_[i, j][x_item[j]]
                    else:
                        # Feature value never seen in this class: the MLE is 0
                        p = 0.0
                class_p.append(p)
            likelihood.append(class_p)
        return np.array(likelihood)

    def predict(self, x):
        """Perform classification on an array of test vectors X.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]

        Returns
        -------
        C : array, shape = [n_samples]
            Predicted target values for X
        """
        likelihood = self.__predict_likelihood(x)
        max_index = np.argmax(likelihood, axis=1)
        return np.array([self.target_unique[i] for i in max_index])

    def predict_proba(self, x):
        """Return probability estimates for the test vector X.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]

        Returns
        -------
        C : array-like, shape = [n_samples, n_classes]
            Returns the probability of the samples for each class in
            the model, in the order of the classes in `target_unique`.
        """
        likelihood = self.__predict_likelihood(x)
        return np.array([lh / np.sum(lh) for lh in likelihood])
```
```python
# Test results: they match the hand computation above
X = dataSet[:, 0:-1]
y = dataSet[:, -1]
mlenb = MLENB()
mlenb.fit(X, y)
print(mlenb.predict(np.array([2, 0])))
print(mlenb.predict_proba(np.array([2, 0])))
```

```
[-1]
[[ 0.75 0.25]]
```
Maximum likelihood estimation can produce probability estimates that are exactly zero. A zero estimate distorts the computation of the posterior probabilities and biases the classification. In that case, we can use the multinomial model, which smooths the prior and conditional probability estimates. The formulas are as follows.
The prior probability $p(y)$ is estimated as:

$$p(y=y_k) = \frac{\sum_{i=1}^{m} I(y_i=y_k) + \alpha}{m + k\alpha}$$

where $\alpha \geq 0$ is the smoothing parameter ($\alpha=1$ gives Laplace smoothing) and $k$ is the number of classes.

Suppose the set of all possible values of the $j$-th feature of an input sample is $\{a_{j1}, a_{j2}, \dots, a_{js_j}\}$. Then the conditional probability $p(x^{(j)}=a_{jl} \mid y=y_k)$ is estimated as:

$$p(x^{(j)}=a_{jl} \mid y=y_k) = \frac{\sum_{i=1}^{m} I(x_i^{(j)}=a_{jl},\ y_i=y_k) + \alpha}{\sum_{i=1}^{m} I(y_i=y_k) + s_j\alpha}$$
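To see what the smoothing does concretely, here is Example 1 redone with $\alpha = 1$ using exact fractions; the normalized result matches the MultinomialNB output further below:

```python
from fractions import Fraction

# Smoothed priors: (count + 1) / (15 + 2*1), with k = 2 classes
p_neg_prior = Fraction(6 + 1, 15 + 2)   # 7/17
p_pos_prior = Fraction(9 + 1, 15 + 2)   # 10/17

# Smoothed conditionals for x = (2, S); each feature has s_j = 3 values
p_neg = p_neg_prior * Fraction(2 + 1, 6 + 3) * Fraction(3 + 1, 6 + 3)
p_pos = p_pos_prior * Fraction(3 + 1, 9 + 3) * Fraction(1 + 1, 9 + 3)

print(float(p_neg / (p_neg + p_pos)))  # 0.6511627906976745
```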
One open question: what is the difference between multinomial naive Bayes and the Bayesian estimation described in Li Hang's Statistical Learning Methods? The method in this article follows Li Hang's Bayesian estimation.
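As a partial answer, note that scikit-learn's MultinomialNB models features as event counts (e.g. word counts), which is not quite the per-value smoothing used here; the closer counterpart is CategoricalNB. A comparison sketch, assuming scikit-learn >= 0.22 is available; as far as I know, scikit-learn estimates the class prior empirically (without smoothing), so its probabilities will not exactly match the hand computation above:

```python
# Comparison sketch, assuming scikit-learn >= 0.22 is installed.
# CategoricalNB applies per-value Laplace smoothing to the conditionals,
# but (to my knowledge) does not smooth the class prior, so the
# probabilities differ slightly from the values computed above.
from sklearn.naive_bayes import CategoricalNB

cnb = CategoricalNB(alpha=1.0)
cnb.fit(dataSet[:, 0:-1], dataSet[:, -1])
print(cnb.predict(np.array([[2, 0]])))        # expected: [-1]
print(cnb.predict_proba(np.array([[2, 0]])))  # close to, but not exactly, the values below
```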
Reference Python code for the multinomial naive Bayes is as follows:
```python
class MultinomialNB:
    """Naive Bayes classifier for multinomial models

    Attributes
    ----------
    class_prior_ : array, shape (n_classes,)
        Smoothed empirical probability for each class.
    class_count_ : array, shape (n_classes,)
        Number of training samples observed in each class.
    bayes_estimation_ : array, shape (n_classes, n_features)
        Bayes estimate of each feature per class; each element is a
        dict mapping feature value -> smoothed probability.
    """
    def __init__(self, alpha=1.0):
        self.alpha_ = alpha  # smoothing parameter

    def fit(self, X, y):
        n_features = X.shape[1]
        n_classes = len(set(y))
        self.class_count_ = np.empty(n_classes)
        self.class_prior_ = np.empty(n_classes)
        self.bayes_estimation_ = np.empty((n_classes, n_features), dtype=dict)
        self.target_unique = np.unique(y)
        for i in range(n_classes):
            dataX_tu = X[y == self.target_unique[i]]
            # Smoothed prior: (class count + alpha) / (m + n_classes * alpha)
            self.class_prior_[i] = (dataX_tu.shape[0] + self.alpha_) / (float(len(y)) + n_classes * self.alpha_)
            self.class_count_[i] = dataX_tu.shape[0]
            for j in range(n_features):
                feature = dataX_tu[:, j]
                feature_unique = np.unique(feature)
                fp = {}
                for f_item in feature_unique:
                    # Smoothed conditional: (value count + alpha) / (class count + s_j * alpha)
                    fp[f_item] = (list(feature).count(f_item) + self.alpha_) / (float(len(feature)) + len(feature_unique) * self.alpha_)
                self.bayes_estimation_[i, j] = fp
        return self

    def __predict_likelihood(self, x):
        if x.ndim == 1:
            x = np.array([x])
        n_features = x.shape[1]
        n_classes = len(self.class_count_)
        likelihood = []
        for x_item in x:
            class_p = []
            for i in range(n_classes):
                p = self.class_prior_[i]
                for j in range(n_features):
                    if x_item[j] in self.bayes_estimation_[i, j]:
                        p *= self.bayes_estimation_[i, j][x_item[j]]
                    else:
                        # Value never seen in this class during training
                        p = 0.0
                class_p.append(p)
            likelihood.append(class_p)
        return np.array(likelihood)

    def predict(self, x):
        likelihood = self.__predict_likelihood(x)
        max_index = np.argmax(likelihood, axis=1)
        return np.array([self.target_unique[i] for i in max_index])

    def predict_proba(self, x):
        likelihood = self.__predict_likelihood(x)
        return np.array([lh / np.sum(lh) for lh in likelihood])
```
```python
# Test results: they match the smoothed hand computation
X = dataSet[:, 0:-1]
y = dataSet[:, -1]
mnb = MultinomialNB()
mnb.fit(X, y)
print(mnb.predict(np.array([2, 0])))
print(mnb.predict_proba(np.array([2, 0])))
```

```
[-1]
[[ 0.65116279 0.34883721]]
```
When the input features are continuous, the methods above cannot be used to estimate the prior and conditional probabilities. Instead, we can use the Gaussian model, which assumes each feature follows a Gaussian distribution within each class.

The likelihood of feature $x^{(j)}$ given class $y_k$ is estimated as:

$$p(x^{(j)} \mid y=y_k) = \frac{1}{\sqrt{2\pi\sigma_{kj}^2}} \exp\!\left(-\frac{(x^{(j)}-\mu_{kj})^2}{2\sigma_{kj}^2}\right)$$

where $\mu_{kj}$ and $\sigma_{kj}^2$ are the mean and variance of the $j$-th feature over the training samples of class $y_k$.
```python
class GaussianNB:
    """
    Attributes
    ----------
    class_prior_ : array, shape (n_classes,)
        Probability of each class.
    class_count_ : array, shape (n_classes,)
        Number of training samples observed in each class.
    theta_ : array, shape (n_classes, n_features)
        Mean of each feature per class.
    sigma_ : array, shape (n_classes, n_features)
        Variance of each feature per class.
    """
    def __init__(self):
        pass

    def fit(self, X, y):
        n_features = X.shape[1]
        n_classes = len(set(y))
        self.theta_ = np.zeros([n_classes, n_features])
        self.sigma_ = np.zeros([n_classes, n_features])
        self.class_prior_ = np.zeros(n_classes)
        self.class_count_ = np.zeros(n_classes)
        self.target_unique = np.unique(y)
        for i in range(n_classes):
            dataX_tu = X[y == self.target_unique[i]]
            self.class_prior_[i] = dataX_tu.shape[0] / float(len(y))
            self.class_count_[i] = dataX_tu.shape[0]
            # Per-class mean and variance of each feature
            # (caution: a zero variance, i.e. a feature constant within a
            # class, would cause a division by zero in the density below)
            self.theta_[i, :] = np.mean(dataX_tu, axis=0)
            self.sigma_[i, :] = np.var(dataX_tu, axis=0)
        return self

    def __predict_likelihood(self, x):
        if x.ndim == 1:
            x = np.array([x])
        likelihood = []
        for x_item in x:
            # Gaussian density of each feature under each class
            gaussian = np.exp(-(x_item - self.theta_)**2 / (2 * self.sigma_)) / np.sqrt(2 * np.pi * self.sigma_)
            # Multiply per-feature densities via a sum of logs for numerical stability
            p = np.exp(np.sum(np.log(gaussian), axis=1))
            likelihood.append(self.class_prior_ * p)
        return np.array(likelihood)

    def predict(self, x):
        likelihood = self.__predict_likelihood(x)
        max_index = np.argmax(likelihood, axis=1)
        return np.array([self.target_unique[i] for i in max_index])

    def predict_proba(self, x):
        likelihood = self.__predict_likelihood(x)
        return np.array([lh / np.sum(lh) for lh in likelihood])
```
```python
# Test results: the Gaussian model also predicts -1 for x = (2, 0)
X = dataSet[:, 0:-1]
y = dataSet[:, -1]
gnb = GaussianNB()
gnb.fit(X, y)
print(gnb.predict(np.array([2, 0])))
print(gnb.predict_proba(np.array([2, 0])))
```

```
[-1]
[[ 0.74566865 0.25433135]]
```
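For completeness, the same check can be run against scikit-learn's built-in Gaussian naive Bayes (assuming scikit-learn is installed); its var_smoothing parameter adds a tiny epsilon to the variances, so the probabilities may differ in the last decimal places:

```python
# Cross-check sketch against scikit-learn's GaussianNB (assumed installed).
# sklearn adds a small var_smoothing epsilon to the variances, so the
# probabilities may differ marginally from the implementation above.
from sklearn.naive_bayes import GaussianNB as SkGaussianNB

sk_gnb = SkGaussianNB()
sk_gnb.fit(dataSet[:, 0:-1], dataSet[:, -1])
print(sk_gnb.predict(np.array([[2, 0]])))        # expected: [-1]
print(sk_gnb.predict_proba(np.array([[2, 0]])))  # approximately [[0.7457, 0.2543]]
```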