The naive Bayes classification algorithm is a classification algorithm built on a probabilistic model. The underlying theory is covered in detail in 《小瓜讲机器学习——分类算法(三)朴素贝叶斯法(naive Bayes)算法原理及Python代码实现》. Here we introduce the naive Bayes classifiers provided by sklearn.
MultinomialNB handles features that are discrete counts, for example word-frequency features in text classification. The multinomial model is also described in detail in the article referenced above.
sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
Parameters:
1.alpha: smoothing parameter (1.0 gives Laplace smoothing);
2.fit_prior: whether to learn the class prior probabilities $P(y)$; if False, all classes share the same uniform prior;
3.class_prior: user-specified class priors $P(y)$; if not set, the priors are estimated from the samples by maximum likelihood.
Attributes:
1.class_log_prior_: log of the smoothed class prior probabilities;
2.intercept_: mirrors class_log_prior_;
3.feature_log_prob_: log probability of each feature given each class (the conditional probabilities $P(x_i|y)$);
4.coef_: mirrors feature_log_prob_;
5.class_count_: number of training samples in each class;
6.feature_count_: count of each feature in each class.
import numpy as np
x = np.random.randint(5, size=(6, 4))
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(x, y)
print('------- feature vector -------')
print(x)
print('------- prior of each class ------')
print(clf.class_log_prior_)
print('------- probability of features in each class ------')
print(clf.feature_log_prob_)
print('------- predict ------')
print(clf.predict(x[2:3]))
The output is as follows:
------- feature vector -------
[[0 3 0 0]
[4 1 4 4]
[1 4 0 1]
[2 3 0 3]
[1 0 2 0]
[0 0 4 1]]
------- prior of each class ------
[-1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79175947]
------- probability of features in each class ------
[[-1.94591015 -0.55961579 -1.94591015 -1.94591015]
[-1.22377543 -2.14006616 -1.22377543 -1.22377543]
[-1.60943791 -0.69314718 -2.30258509 -1.60943791]
[-1.38629436 -1.09861229 -2.48490665 -1.09861229]
[-1.25276297 -1.94591015 -0.84729786 -1.94591015]
[-2.19722458 -2.19722458 -0.58778666 -1.5040774 ]]
------- predict ------
[3]
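To see what alpha actually does, feature_log_prob_ can be reproduced by hand: with Laplace smoothing, the smoothed log probability of feature $i$ in class $y$ is $\log\frac{N_{yi}+\alpha}{N_y+\alpha n}$, where $N_{yi}$ is feature_count_ and $N_y$ is the total feature count in that class. A minimal sketch (the seed and variable names here are my own):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
x = rng.randint(5, size=(6, 4))
y = np.array([1, 2, 3, 4, 5, 6])

alpha = 1.0
clf = MultinomialNB(alpha=alpha)
clf.fit(x, y)

# Smoothed estimate: (N_yi + alpha) / (N_y + alpha * n_features)
n_features = x.shape[1]
manual = np.log(
    (clf.feature_count_ + alpha)
    / (clf.feature_count_.sum(axis=1, keepdims=True) + alpha * n_features)
)
print(np.allclose(manual, clf.feature_log_prob_))  # True
```

Setting a smaller alpha moves these estimates closer to the raw frequencies, at the cost of assigning zero probability to unseen features when alpha=0.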
BernoulliNB handles only binary-valued discrete features. It assumes each feature follows a Bernoulli distribution, so the class-conditional likelihood is
when $x_i=1$: $P(x_i|y)=P(x_i=1|y)$
when $x_i=0$: $P(x_i|y)=P(x_i=0|y)=1-P(x_i=1|y)$
sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)
Parameters:
1.alpha (default=1.0): smoothing parameter;
2.binarize (default=0.0): threshold for binarizing the input features (values greater than the threshold map to 1, others to 0); if None, the input is assumed to consist of binary vectors already;
3.fit_prior (default=True): whether to learn the class prior probabilities $P(y)$; if False, all classes share the same uniform prior;
4.class_prior (default=None): user-specified class priors $P(y)$; if not set, the priors are estimated from the samples by maximum likelihood.
Attributes:
1.class_log_prior_: log of the smoothed class prior probabilities;
2.feature_log_prob_: log probability of each feature given each class (the conditional probabilities $P(x_i|y)$);
3.class_count_: number of training samples in each class;
4.feature_count_: count of each feature in each class.
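The binarize parameter deserves a quick illustration: BernoulliNB thresholds every feature at binarize before estimating the Bernoulli parameters, so fitting continuous data with binarize=b is equivalent to thresholding the data yourself and fitting with binarize=None. A sketch (the 0.5 threshold is chosen arbitrarily):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
x = rng.rand(6, 4)          # continuous features in [0, 1)
y = np.array([1, 2, 3, 4, 5, 6])

# Let BernoulliNB binarize at 0.5 internally ...
clf_auto = BernoulliNB(binarize=0.5).fit(x, y)
# ... or binarize by hand and declare the data already binary
clf_manual = BernoulliNB(binarize=None).fit((x > 0.5).astype(int), y)

print(np.allclose(clf_auto.feature_log_prob_, clf_manual.feature_log_prob_))  # True
```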
import numpy as np
from sklearn.naive_bayes import BernoulliNB
x = np.random.randint(2, size=(6, 4))
y = np.array([1, 2, 3, 4, 5, 6])
clf = BernoulliNB()
clf.fit(x, y)
print('------- feature vector -------')
print(x)
print('------- prior of each class ------')
print(clf.class_log_prior_)
print('------- probability of features in each class ------')
print(clf.feature_log_prob_)
print('------- predict ------')
print(clf.predict(x[2:3]))
The output is as follows:
------- feature vector -------
[[0 1 1 0]
[0 0 0 0]
[0 1 0 0]
[0 1 1 0]
[0 1 1 1]
[1 1 0 1]]
------- prior of each class ------
[-1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79175947]
------- probability of features in each class ------
[[-1.09861229 -0.40546511 -0.40546511 -1.09861229]
[-1.09861229 -1.09861229 -1.09861229 -1.09861229]
[-1.09861229 -0.40546511 -1.09861229 -1.09861229]
[-1.09861229 -0.40546511 -0.40546511 -1.09861229]
[-1.09861229 -0.40546511 -0.40546511 -0.40546511]
[-0.40546511 -0.40546511 -1.09861229 -0.40546511]]
------- predict ------
[3]
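As with the multinomial model, feature_log_prob_ here can be reconstructed from the counts: since each Bernoulli feature has exactly two outcomes, the smoothed probability of feature $i$ being 1 in class $y$ is $\frac{N_{yi}+\alpha}{N_y+2\alpha}$, with $N_{yi}$ from feature_count_ and $N_y$ from class_count_. A small check (seed and variable names are my own):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
x = rng.randint(2, size=(6, 4))
y = np.array([1, 2, 3, 4, 5, 6])

alpha = 1.0
clf = BernoulliNB(alpha=alpha).fit(x, y)

# Smoothed P(x_i = 1 | y): (count of 1s + alpha) / (class size + 2 * alpha)
manual = np.log(
    (clf.feature_count_ + alpha) / (clf.class_count_.reshape(-1, 1) + 2 * alpha)
)
print(np.allclose(manual, clf.feature_log_prob_))  # True
```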
GaussianNB is the Gaussian naive Bayes classifier. It assumes the conditional probability $P(x_i|y)$ follows a Gaussian distribution, i.e.
$$P(x_i|y)=\frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\left(-\frac{(x_i-\mu_y)^2}{2\sigma_y^2}\right)$$
The parameters $\sigma_y$ and $\mu_y$ are estimated by maximum likelihood.
sklearn.naive_bayes.GaussianNB(priors=None, var_smoothing=1e-09)
Parameters:
1.priors: prior probability of each class; if not set, estimated from the samples by maximum likelihood;
2.var_smoothing: portion of the largest variance of all features that is added to all variances, for calculation stability.
Attributes:
1.class_prior_: probability of each class;
2.class_count_: number of samples in each class;
3.theta_: mean of each feature per class;
4.sigma_: variance of each feature per class;
5.epsilon_: absolute value added to the variances (var_smoothing times the largest feature variance).
The training samples are as follows (two features plus a class label per line):
4.45925637575900 8.22541838354701 0
0.0432761720122110 6.30740040001402 0
6.99716180262699 9.31339338579386 0
4.75483224215432 9.26037784240288 0
8.66190392439652 9.76797698918454 0
...
4.21408348419092 2.97014277918461 1
5.52248511695330 3.63263027130760 1
4.15244831176753 1.44597290703838 1
9.55986996363196 1.13832040773527 1
1.63276516895206 0.446783742774178 1
9.38532498107474 0.913169554364942 1
The code is as follows:
import numpy as np
from sklearn.naive_bayes import GaussianNB
data = []
label = []
with open(r'H:\python dataanalysis\sklearn\naive_bayes_data.txt') as f:
    for loopi in f.readlines():
        line = loopi.strip().split('\t')
        data.append([float(line[0]), float(line[1])])
        label.append(float(line[2]))
feature = np.array(data)
label = np.array(label)
clf = GaussianNB()
clf.fit(feature, label)
print('----priors probability of each class----')
print(clf.class_prior_)
print('----number of samples in each class----')
print(clf.class_count_)
print('-----mean of each feature per class------')
print(clf.theta_)
print('-----variance of each feature per class------')
print(clf.sigma_)
x = np.array([1.0,1.0]).reshape((1,2))
print('-----predict------')
print(clf.predict(x))
The output is as follows:
----priors probability of each class----
[0.5 0.5]
----number of samples in each class----
[100. 100.]
-----mean of each feature per class------
[[4.11770459 7.57552293]
[5.76083782 2.3511532 ]]
-----variance of each feature per class------
[[6.60834055 4.73694733]
[6.18880712 3.96585982]]
-----predict------
[1.]
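The prediction above can be reproduced from the fitted parameters: for each class, add $\log P(y)$ to the sum of the log Gaussian densities of every feature, then take the argmax. A self-contained sketch on synthetic data (the original data file is not reproduced here, so the class means are my own guesses from the table above; note that newer sklearn versions renamed sigma_ to var_, so the code reads whichever attribute exists):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data roughly resembling the example above
rng = np.random.RandomState(0)
x0 = rng.normal(loc=[4.0, 8.0], scale=2.0, size=(100, 2))
x1 = rng.normal(loc=[6.0, 2.0], scale=2.0, size=(100, 2))
feature = np.vstack([x0, x1])
label = np.array([0.0] * 100 + [1.0] * 100)

clf = GaussianNB().fit(feature, label)
var = getattr(clf, "var_", None)
if var is None:           # older sklearn exposes the variances as sigma_
    var = clf.sigma_

# Joint log likelihood: log P(y) + sum_i log N(x_i; mu_y, sigma_y^2)
x = np.array([[1.0, 1.0]])
jll = np.log(clf.class_prior_) + np.sum(
    -0.5 * np.log(2 * np.pi * var) - (x - clf.theta_) ** 2 / (2 * var),
    axis=1,
)
print(clf.classes_[np.argmax(jll)] == clf.predict(x)[0])  # True
```

The class with the largest joint log likelihood is the predicted label, which is exactly what clf.predict computes internally.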