[Data Mining with scikit-learn] sklearn.naive_bayes Explained with Examples

Contents

      • Overview
      • 2. sklearn.naive_bayes
        • 2.1 sklearn.naive_bayes.MultinomialNB
          • 2.1.1 MultinomialNB example
        • 2.2 sklearn.naive_bayes.BernoulliNB
          • 2.2.1 BernoulliNB example
        • 2.3 sklearn.naive_bayes.GaussianNB
          • 2.3.1 GaussianNB example

Overview

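Naive Bayes is a family of classification algorithms built on a probabilistic model. The underlying theory is covered in the companion article "Xiaogua Explains Machine Learning: Classification Algorithms (3), Naive Bayes: Principles and Python Implementation". This article introduces the naive Bayes classifiers available in sklearn.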

2. sklearn.naive_bayes

2.1 sklearn.naive_bayes.MultinomialNB

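MultinomialNB handles discrete features. A typical use is the multinomial model for text classification, where the features are word counts. The theory is described in detail in the companion article cited in the Overview.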
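To make the word-count use case concrete, here is a minimal sketch of text classification with MultinomialNB. The four-document corpus and the "spam"/"ham" labels are invented for illustration and do not come from the original article; CountVectorizer is used to turn each document into a vector of word counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus and labels, invented purely for illustration.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

# Turn each document into a vector of word counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

clf = MultinomialNB()
clf.fit(counts, labels)

# New documents must be encoded with the same vocabulary.
new_counts = vectorizer.transform(["buy cheap pills", "monday meeting notes"])
predicted = clf.predict(new_counts)
print(predicted)
```

Words that never appeared in the training corpus are simply ignored by `vectorizer.transform`, so the classifier only ever sees counts over the learned vocabulary.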

sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

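Parameters:
1.alpha: smoothing parameter;
2.fit_prior: whether to learn the class prior probabilities $P(y)$ from the data; if False, all classes are given the same (uniform) prior;
3.class_prior: explicitly specified class priors $P(y)$; if not given, the priors are obtained from the training samples by maximum likelihood estimation (i.e. from the class frequencies).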
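The effect of fit_prior and class_prior can be checked directly on a small imbalanced data set. The sketch below uses made-up counts (three samples of class 0, one of class 1) and compares the priors stored by three differently configured classifiers; the variable names are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Imbalanced toy data (invented): three samples of class 0, one of class 1.
X = np.array([[2, 1], [3, 0], [1, 1], [0, 4]])
y = np.array([0, 0, 0, 1])

learned = MultinomialNB(fit_prior=True).fit(X, y)        # priors from class frequencies
uniform = MultinomialNB(fit_prior=False).fit(X, y)       # equal priors for all classes
fixed = MultinomialNB(class_prior=[0.9, 0.1]).fit(X, y)  # priors supplied by the user

# class_log_prior_ stores log P(y), so exponentiate to read probabilities.
print(np.exp(learned.class_log_prior_))
print(np.exp(uniform.class_log_prior_))
print(np.exp(fixed.class_log_prior_))
```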

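Attributes:
1.class_log_prior_: log of the smoothed prior probability of each class;
2.intercept_: same values as class_log_prior_ (deprecated in scikit-learn 0.24 and removed in later releases);
3.feature_log_prob_: log of the conditional probability of each feature given the class, i.e. $\log P(x_i|y)$;
4.coef_: same values as feature_log_prob_ (deprecated and removed along with intercept_);
5.class_count_: number of training samples in each class;
6.feature_count_: number of times each feature occurs in each class.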
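The relationship between alpha, feature_count_ and feature_log_prob_ can be reproduced by hand: each count is increased by alpha and then normalized over the features of its class. The sketch below takes the first three rows of the feature matrix shown in the example output of section 2.1.1 as input; the names `smoothed` and `manual` are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# First three rows of the feature matrix printed in the article's example output.
X = np.array([[0, 3, 0, 0], [4, 1, 4, 4], [1, 4, 0, 1]])
y = np.array([1, 2, 3])
alpha = 1.0
clf = MultinomialNB(alpha=alpha).fit(X, y)

# Add alpha to every per-class feature count, then normalize each class row.
smoothed = clf.feature_count_ + alpha
manual = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

print(manual)
print(np.allclose(manual, clf.feature_log_prob_))
```

For the first class the smoothed counts are [1, 4, 1, 1], so the second feature gets log(4/7), which matches the value -0.55961579 in the printed output.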

2.1.1 MultinomialNB example
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# 6 samples with 4 count-valued features (integers in [0, 5)), one class per sample
x = np.random.randint(5, size=(6, 4))
y = np.array([1, 2, 3, 4, 5, 6])

clf = MultinomialNB()
clf.fit(x, y)
print('------- feature vector -------')
print(x)
print('------- prior of each class ------')
print(clf.class_log_prior_)
print('------- probability of features in each class ------')
print(clf.feature_log_prob_)
print('------- predict ------')
print(clf.predict(x[2:3]))

Output (the feature matrix is random, so the values differ from run to run):

------- feature vector -------
[[0 3 0 0]
 [4 1 4 4]
 [1 4 0 1]
 [2 3 0 3]
 [1 0 2 0]
 [0 0 4 1]]
------- prior of each class ------
[-1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79175947]
------- probability of features in each class ------
[[-1.94591015 -0.55961579 -1.94591015 -1.94591015]
 [-1.22377543 -2.14006616 -1.22377543 -1.22377543]
 [-1.60943791 -0.69314718 -2.30258509 -1.60943791]
 [-1.38629436 -1.09861229 -2.48490665 -1.09861229]
 [-1.25276297 -1.94591015 -0.84729786 -1.94591015]
 [-2.19722458 -2.19722458 -0.58778666 -1.5040774 ]]
------- predict ------
[3]

2.2 sklearn.naive_bayes.BernoulliNB

BernoulliNB is designed for discrete features that are binary. It assumes each feature follows a Bernoulli distribution, so the class-conditional probability of a feature is given by
when $x_i=1$: $P(x_i|y)=P(x_i=1|y)$
when $x_i=0$: $P(x_i|y)=P(x_i=0|y)$

sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)

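Parameters:
1.alpha(default=1.0): smoothing parameter;
2.binarize(default=0.0): threshold used to binarize the sample features (values above the threshold become 1, all others 0); if None, the input is assumed to consist of binary vectors already;
3.fit_prior(default=True): whether to learn the class prior probabilities $P(y)$ from the data; if False, all classes are given the same (uniform) prior;
4.class_prior(default=None): explicitly specified class priors $P(y)$; if not given, the priors are obtained from the training samples by maximum likelihood estimation.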
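The binarize threshold is easiest to understand on real-valued input. In the sketch below (toy values invented for illustration) a classifier with binarize=1.0 is compared against one that receives data thresholded by hand and has binarization switched off; both should end up with the same fitted model.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Real-valued toy features (invented); BernoulliNB thresholds them internally.
X = np.array([[0.2, 3.0], [1.5, 0.0], [0.0, 0.7], [2.0, 0.1]])
y = np.array([0, 1, 0, 1])

# Values strictly greater than 1.0 become 1, everything else becomes 0.
auto = BernoulliNB(binarize=1.0).fit(X, y)

# Equivalent: threshold by hand and tell the classifier not to binarize again.
X_binary = (X > 1.0).astype(int)
by_hand = BernoulliNB(binarize=None).fit(X_binary, y)

print(X_binary)
print(auto.feature_count_)
print(np.allclose(auto.feature_log_prob_, by_hand.feature_log_prob_))
```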

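Attributes:
1.class_log_prior_: log of the smoothed prior probability of each class;
2.feature_log_prob_: log of the conditional probability of each feature given the class, i.e. $\log P(x_i=1|y)$;
3.class_count_: number of training samples in each class;
4.feature_count_: number of samples in each class in which each feature occurs.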
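Unlike the multinomial model, the Bernoulli model also scores features that are absent (x_i = 0). The following sketch recomputes the class scores by hand from feature_count_ and class_count_ and compares the winner with predict. The data and the names `p1`, `log_lik`, `scores` are illustrative; the smoothing formula (count + alpha) / (class count + 2 * alpha) is written out under the assumption alpha = 1.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Binary toy data (invented): two samples per class.
X = np.array([[0, 1, 1, 0], [0, 0, 0, 0], [1, 1, 0, 1], [0, 1, 1, 1]])
y = np.array([0, 0, 1, 1])
clf = BernoulliNB(alpha=1.0).fit(X, y)

# Smoothed estimate of P(x_i = 1 | y); the 2.0 covers the two outcomes 0 and 1.
p1 = (clf.feature_count_ + 1.0) / (clf.class_count_[:, None] + 2.0)

# Log-likelihood of one sample: present features use p1, absent ones use 1 - p1.
x = np.array([0, 1, 0, 0])
log_lik = (x * np.log(p1) + (1 - x) * np.log(1 - p1)).sum(axis=1)
scores = clf.class_log_prior_ + log_lik

manual_class = clf.classes_[np.argmax(scores)]
sklearn_class = clf.predict(x.reshape(1, -1))[0]
print(manual_class, sklearn_class)
```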

2.2.1 BernoulliNB example
import numpy as np
from sklearn.naive_bayes import BernoulliNB

x = np.random.randint(2, size=(6, 4))
y = np.array([1, 2, 3, 4, 5, 6])

clf = BernoulliNB()
clf.fit(x, y)
print('------- feature vector -------')
print(x)
print('------- prior of each class ------')
print(clf.class_log_prior_)
print('------- probability of features in each class ------')
print(clf.feature_log_prob_)
print('------- predict ------')
print(clf.predict(x[2:3]))

Output:

------- feature vector -------
[[0 1 1 0]
 [0 0 0 0]
 [0 1 0 0]
 [0 1 1 0]
 [0 1 1 1]
 [1 1 0 1]]
------- prior of each class ------
[-1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79175947 -1.79175947]
------- probability of features in each class ------
[[-1.09861229 -0.40546511 -0.40546511 -1.09861229]
 [-1.09861229 -1.09861229 -1.09861229 -1.09861229]
 [-1.09861229 -0.40546511 -1.09861229 -1.09861229]
 [-1.09861229 -0.40546511 -0.40546511 -1.09861229]
 [-1.09861229 -0.40546511 -0.40546511 -0.40546511]
 [-0.40546511 -0.40546511 -1.09861229 -0.40546511]]
------- predict ------
[3]

2.3 sklearn.naive_bayes.GaussianNB

GaussianNB is the Gaussian naive Bayes classifier. It assumes that the conditional probability $P(x_i|y)$ follows a Gaussian distribution, i.e.
$$P(x_i|y)=\frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\left(-\frac{(x_i-\mu_y)^2}{2\sigma_y^2}\right)$$
The parameters $\sigma_y$ and $\mu_y$ are obtained by maximum likelihood estimation.
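The density above can be evaluated by hand with the fitted means and variances and compared with predict_proba. The sketch below uses synthetic data drawn around two assumed cluster centers (not the article's data file) and reads the variances from var_, falling back to sigma_ on older scikit-learn releases.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data; the cluster centers are assumptions for illustration.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([4.0, 8.0], 1.0, size=(50, 2)),
    rng.normal([6.0, 2.0], 1.0, size=(50, 2)),
])
y = np.repeat([0, 1], 50)
clf = GaussianNB().fit(X, y)

mu = clf.theta_
# The variance attribute is var_ in recent releases and sigma_ in older ones.
var = clf.var_ if hasattr(clf, "var_") else clf.sigma_

# log P(x_i | y) for every class and feature, following the Gaussian formula.
x = np.array([5.0, 5.0])
log_pdf = -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

# Add the log prior, sum over features, and normalize to get the posterior.
scores = np.log(clf.class_prior_) + log_pdf.sum(axis=1)
posterior = np.exp(scores - np.logaddexp.reduce(scores))

print(posterior)
print(clf.predict_proba(x.reshape(1, -1))[0])
```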

sklearn.naive_bayes.GaussianNB(priors=None, var_smoothing=1e-09)

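Parameters:
1.priors: prior probability of each class; if not given, the priors are obtained from the samples by maximum likelihood estimation;
2.var_smoothing: a scale factor; this fraction of the largest variance among all features is added to the variances for numerical stability (original wording: Portion of the largest variance of all features that is added to variances for calculation stability).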

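Attributes:
1.class_prior_: probability of each class;
2.class_count_: number of training samples in each class;
3.theta_: mean of each feature per class;
4.sigma_: variance of each feature per class (renamed to var_ in scikit-learn 1.0; sigma_ was removed in later releases);
5.epsilon_: absolute value added to the variances, derived from var_smoothing.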
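How var_smoothing turns into epsilon_ can be verified numerically: epsilon_ should equal var_smoothing times the largest per-feature variance of the training data. The sketch below uses four made-up samples and a deliberately large var_smoothing so the effect is visible.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data (invented); the second feature has by far the larger variance.
X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 60.0]])
y = np.array([0, 0, 1, 1])

var_smoothing = 1e-2
clf = GaussianNB(var_smoothing=var_smoothing).fit(X, y)

# Largest variance over all features (computed on the whole training set).
largest_variance = X.var(axis=0).max()
expected_eps = var_smoothing * largest_variance

print(largest_variance)
print(clf.epsilon_, expected_eps)
```

With the default var_smoothing=1e-09 the added value is negligible for well-scaled data; it mainly protects against features whose variance within a class is exactly zero.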

2.3.1 GaussianNB example

Training samples (two feature columns followed by the class label, tab-separated):

4.45925637575900	8.22541838354701	0
0.0432761720122110	6.30740040001402	0
6.99716180262699	9.31339338579386	0
4.75483224215432	9.26037784240288	0
8.66190392439652	9.76797698918454	0
...
4.21408348419092	2.97014277918461	1
5.52248511695330	3.63263027130760	1
4.15244831176753	1.44597290703838	1
9.55986996363196	1.13832040773527	1
1.63276516895206	0.446783742774178	1
9.38532498107474	0.913169554364942	1

Code:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Each line of the file holds: feature 1 <TAB> feature 2 <TAB> class label
data = []
label = []
with open(r'H:\python dataanalysis\sklearn\naive_bayes_data.txt') as f:
    for row in f:
        line = row.strip().split('\t')
        data.append([float(line[0]), float(line[1])])
        label.append(float(line[2]))

feature = np.array(data)
label = np.array(label)
clf = GaussianNB()
clf.fit(feature, label)

print('----priors probablity of each class----')
print(clf.class_prior_)
print('----number of samples in each class----')
print(clf.class_count_)
print('-----mean of each feature per class------')
print(clf.theta_)
print('-----variance of each feature per class------')
print(clf.var_)  # this attribute was named sigma_ before scikit-learn 1.0

# Classify a single new point; predict expects a 2-D array of shape (1, 2)
x = np.array([1.0, 1.0]).reshape((1, 2))
print('-----predict------')
print(clf.predict(x))

Output:

----priors probablity of each class----
[0.5 0.5]
----number of samples in each class----
[100. 100.]
-----mean of each feature per class------
[[4.11770459 7.57552293]
 [5.76083782 2.3511532 ]]
-----variance of each feature per class------
[[6.60834055 4.73694733]
 [6.18880712 3.96585982]]
-----predict------
[1.]
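Since the data file itself is not included, the sketch below shows a more compact way to load data in this layout with np.loadtxt, using only the eleven rows printed in the training-sample listing (the full file has 100 samples per class, so the fitted statistics differ from the output above). For a real tab-separated file, passing the file path to np.loadtxt should work the same way.

```python
import io

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Only the rows shown in the article's listing; the elided rows are not guessed.
text = """\
4.45925637575900 8.22541838354701 0
0.0432761720122110 6.30740040001402 0
6.99716180262699 9.31339338579386 0
4.75483224215432 9.26037784240288 0
8.66190392439652 9.76797698918454 0
4.21408348419092 2.97014277918461 1
5.52248511695330 3.63263027130760 1
4.15244831176753 1.44597290703838 1
9.55986996363196 1.13832040773527 1
1.63276516895206 0.446783742774178 1
9.38532498107474 0.913169554364942 1
"""

# loadtxt splits on any whitespace by default and returns a 2-D float array.
table = np.loadtxt(io.StringIO(text))
X, y = table[:, :2], table[:, 2].astype(int)

clf = GaussianNB().fit(X, y)
prediction = clf.predict(np.array([[1.0, 1.0]]))
print(clf.class_count_)
print(prediction)
```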
