Naive Bayes is a classification method based on Bayes' theorem together with the assumption that features are conditionally independent given the class.

The three Naive Bayes models:

(1) Gaussian model: $P(X_j=x_j \mid Y=C_k)=\frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\left(-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}\right)$

(2) Multinomial model: $P(X_j=x_{jl} \mid Y=C_k)=\frac{x_{jl}+\lambda}{m_k+n\lambda}$

(3) Bernoulli model: $P(X_j=x_{jl} \mid Y=C_k)=P(j \mid Y=C_k)\,x_{jl}+\bigl(1-P(j \mid Y=C_k)\bigr)(1-x_{jl})$
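As a quick illustration of the multinomial estimate above, the sketch below computes the smoothed likelihood from raw counts; the names `count_xjl`, `m_k`, and `n_values` are mine, chosen to mirror the symbols in the formula, not part of the original:

```python
# Hypothetical counts for illustration: count_xjl is how often feature
# value x_jl occurs within class C_k, m_k is the total count for class
# C_k, and n_values is the number of distinct values the feature takes.
def multinomial_likelihood(count_xjl, m_k, n_values, lam=1.0):
    # Laplace smoothing: (x_jl + lambda) / (m_k + n * lambda)
    return (count_xjl + lam) / (m_k + n_values * lam)

print(multinomial_likelihood(count_xjl=3, m_k=10, n_values=4))  # 4/14 ~= 0.2857
```

The rest of the section builds a Gaussian Naive Bayes from scratch on the iris data.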
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter
import math

# Load the dataset
def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = [
        'sepal length', 'sepal width', 'petal length', 'petal width', 'label'
    ]
    data = np.array(df.iloc[:100, :])  # first 100 rows: classes 0 and 1 only
    return data[:, :-1], data[:, -1]

# Build the training and test sets
X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Inspect the data
X.ndim, X.shape, y.ndim, y.shape
X[1:5]
y[1:5]
X_train[1:5]
y_train[1:5]
list(set(y_train))
```
Output:

```
(2, (100, 4), 1, (100,))
array([[4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])
array([0., 0., 0., 0.])
array([[5.1, 3.8, 1.6, 0.2],
       [5.8, 4. , 1.2, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [5.6, 3. , 4.1, 1.3]])
array([0., 0., 0., 1.])
[0.0, 1.0]
```
Each feature's likelihood is assumed to be Gaussian.

Probability density function: $P(x_i \mid y_k)=\frac{1}{\sqrt{2\pi\sigma_{yk}^2}}\exp\left(-\frac{(x_i-\mu_{yk})^2}{2\sigma_{yk}^2}\right)$

Mean: $\mu=\frac{\sum_{i=1}^{N}x_{i}}{N}$

Variance: $\sigma^2=\frac{\sum_{i=1}^{N}(x_{i}-\overline{x})^2}{N}$

Standard deviation: $\sigma=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\overline{x})^2}$
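To make the definitions concrete, here is a small check with made-up sample values of my own, computing the mean, variance, and standard deviation and plugging them into the density above:

```python
import math

sample = [4.9, 5.1, 5.0, 4.8, 5.2]  # made-up feature values for one class

mu = sum(sample) / len(sample)                            # mean: 5.0
var = sum((x - mu) ** 2 for x in sample) / len(sample)    # population variance: 0.02
sigma = math.sqrt(var)                                    # standard deviation: ~0.1414

# Gaussian density at x = 5.0 with these parameters
x = 5.0
pdf = math.exp(-(x - mu) ** 2 / (2 * var)) / (math.sqrt(2 * math.pi) * sigma)
print(mu, sigma, pdf)
```

These are exactly the per-class, per-feature quantities the `NaiveBayes` class below computes.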
```python
class NaiveBayes:
    # Runs automatically when an object is created; initializes the model
    def __init__(self):
        self.model = None

    # Mean
    @staticmethod
    def mean(X):
        return sum(X) / float(len(X))

    # Standard deviation
    def stdev(self, X):
        avg = self.mean(X)
        return math.sqrt(sum([pow(x - avg, 2) for x in X]) / float(len(X)))

    # Probability density function
    def gaussian_probability(self, x, mean, stdev):
        exponent = math.exp(-(math.pow(x - mean, 2) /
                              (2 * math.pow(stdev, 2))))
        return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

    # Process X_train: compute (mean, stdev) for each feature
    def summarize(self, train_data):
        # zip(*train_data) transposes the rows into per-feature columns, e.g.
        # zipped = [(1, 4), (2, 5), (3, 6)] -> zip(*zipped) = [(1, 2, 3), (4, 5, 6)]
        summaries = [(self.mean(i), self.stdev(i)) for i in zip(*train_data)]
        return summaries

    # Compute the mean and standard deviation separately for each class
    def fit(self, X, y):
        labels = list(set(y))  # set() builds an unordered collection of unique labels
        data = {label: [] for label in labels}  # one empty list per class
        for f, label in zip(X, y):
            # zip() pairs each sample with its label, e.g.
            # a = [[5.1, 3.8, 1.6, 0.2], [5.8, 4. , 1.2, 0.2]], b = [0.0, 0.0]
            # zip(a, b) = [([5.1, 3.8, 1.6, 0.2], 0.0), ([5.8, 4. , 1.2, 0.2], 0.0)]
            data[label].append(f)
        self.model = {
            label: self.summarize(value)
            for label, value in data.items()
            # data.items() iterates over the (label, samples) pairs
        }
        return 'gaussianNB train done!'

    # Compute the likelihood of a sample under each class
    def calculate_probabilities(self, input_data):
        # self.model: {0.0: [(5.0, 0.37), (3.42, 0.40)], 1.0: [(5.8, 0.449), (2.7, 0.27)]}
        # input_data: [1.1, 2.2]
        probabilities = {}
        for label, value in self.model.items():
            probabilities[label] = 1
            for i in range(len(value)):
                mean, stdev = value[i]
                probabilities[label] *= self.gaussian_probability(
                    input_data[i], mean, stdev)
        return probabilities

    # Predict the class of a single sample
    def predict(self, X_test):
        # e.g. {0.0: 2.9680340789325763e-27, 1.0: 3.5749783019849535e-26}
        label = sorted(
            self.calculate_probabilities(X_test).items(),
            key=lambda x: x[-1])[-1][0]
        return label

    # Accuracy on the test set
    def score(self, X_test, y_test):
        right = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right += 1
        return right / float(len(X_test))
```
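The per-class products shrink rapidly toward zero (the comment in `predict` shows values around 1e-27 with only four features), so a common refinement is to sum log densities instead of multiplying raw ones. A minimal sketch of that variant, not part of the original code:

```python
import math

def calculate_log_probabilities(model, input_data):
    # Sum log densities per class instead of multiplying raw densities;
    # this avoids floating-point underflow when there are many features.
    log_probs = {}
    for label, params in model.model.items():
        log_probs[label] = 0.0
        for x, (mean, stdev) in zip(input_data, params):
            # assumes each density is strictly positive (no exact zeros)
            log_probs[label] += math.log(
                model.gaussian_probability(x, mean, stdev))
    return log_probs
```

The class with the largest log sum is the same argmax as before. The original notebook then trains and scores the model: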
```python
model = NaiveBayes()
model.fit(X_train, y_train)
print(model.predict([4.4, 3.2, 1.3, 0.2]))
model.score(X_test, y_test)
```
Output:

```
'gaussianNB train done!'
0.0
1.0
```
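One detail worth noting: `predict` above compares only the likelihood products $\prod_j P(x_j \mid Y=C_k)$ and drops the class prior $P(Y=C_k)$ from Bayes' theorem. On this roughly balanced two-class subset the priors nearly cancel, but for imbalanced data they should be folded in. A minimal sketch of the adjustment; the helper `predict_with_prior` is mine, not part of the original:

```python
from collections import Counter

def predict_with_prior(model, train_labels, x):
    # Estimate P(Y=C_k) from the training labels and multiply it into
    # each class's likelihood product before taking the argmax.
    counts = Counter(train_labels)
    total = sum(counts.values())
    probs = model.calculate_probabilities(x)
    posterior = {label: p * counts[label] / total for label, p in probs.items()}
    return max(posterior, key=posterior.get)
```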
sklearn.naive_bayes
```python
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
# Gaussian / Bernoulli / Multinomial models

clf = GaussianNB()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
clf.predict([[4.4, 3.2, 1.3, 0.2]])
```
Output:

```
GaussianNB(priors=None)
1.0
array([0.])
```
scikit-learn provides three Naive Bayes classifiers:

(1) GaussianNB: for continuous features, modeled with the Gaussian likelihood above.
(2) MultinomialNB: for count-like discrete features, with smoothing parameter `alpha` playing the role of $\lambda$.
(3) BernoulliNB: for binary features; continuous inputs can be binarized.

A fuller comparison is left for a follow-up; a quick usage sketch of the other two models appears below.
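Only GaussianNB fits the continuous iris features directly. As a rough illustration, the other two can still be run on this data under caveats; the binarization threshold of 3.0 here is an arbitrary choice of mine:

```python
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# MultinomialNB expects non-negative, count-like features; the raw iris
# measurements happen to be non-negative, so they can be passed directly,
# though the model is really meant for word counts and similar data.
mnb = MultinomialNB(alpha=1.0)  # alpha is the lambda in the formula above
mnb.fit(X_train, y_train)

# BernoulliNB thresholds each feature at `binarize` and models it as a
# 0/1 variable; the threshold 3.0 is arbitrary, for illustration only.
bnb = BernoulliNB(binarize=3.0)
bnb.fit(X_train, y_train)

print(mnb.score(X_test, y_test), bnb.score(X_test, y_test))
```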
References:
Li Hang, *Statistical Learning Methods* (《统计学习方法》)
The code portion is adapted from the author "机器学习初学者" (Machine Learning for Beginners).