Statistical Learning Methods (《统计学习方法》), Chapter 4: Naive Bayes, Python implementation


Theoretical derivation: https://blog.csdn.net/ACM_hades/article/details/89677342

Dataset

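  • Dataset: MNIST. Each image is 28×28 pixels and belongs to one of 10 classes. The raw pixel values are used as features, so each sample has 28×28=784 features.
  • Naive Bayes is best suited to low-dimensional feature spaces, but MNIST has several hundred features, and the product of that many probabilities falls outside the range a Python float can represent.
  • Specifically, multiplying 784 probabilities in [0,1] as floats underflows to 0, and once the values are scaled up as described below, the float product overflows to Inf instead. A Python int, by contrast, has arbitrary precision and can be multiplied indefinitely, so the prior and conditional probabilities can be reworked to take advantage of this. From the decision function $y=f(x)=\arg\max_{c_k} P(Y=c_k)\prod_j P(X^{(j)}=x^{(j)}\mid Y=c_k)$, multiplying every prior $P(Y=c_k)$ by the same factor $N$ and every conditional probability $P(X^{(j)}=x^{(j)}\mid Y=c_k)$ by the same factor $M$ does not change which class attains the maximum.
    • Prior probability: every prior has the same denominator $N$ (the number of training samples), so the division by $N$ is skipped and the numerator (the class count) is used directly.
    • Conditional probability: the estimated probability (the fraction of class-$c_k$ samples in which pixel $j$ takes the given value) is multiplied by 10000, mapping it into [0,10000]. To avoid a value of 0, 1 is then added, giving the range [1,10001]. The added 1 slightly perturbs the ratios, so this step is an approximation rather than an exact rescaling.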
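The numerical issue described above can be reproduced in a few lines. This is a minimal sketch, not part of the original code: the per-pixel probability `p = 0.01` and the toy `scores` list are hypothetical values chosen only for illustration.

```python
import math

n_features = 784
p = 0.01  # hypothetical per-pixel probability, identical for every feature

# Raw probabilities as floats: the running product underflows to 0.0.
float_product = 1.0
for _ in range(n_features):
    float_product *= p

# Scaled into [1, 10001] but still floats: the running product overflows to inf.
scaled = round(p * 10000) + 1  # 101
scaled_float_product = 1.0
for _ in range(n_features):
    scaled_float_product *= float(scaled)

# Scaled values as Python ints: arbitrary precision, so the product is exact.
int_product = 1
for _ in range(n_features):
    int_product *= scaled

print(float_product)                      # 0.0
print(math.isinf(scaled_float_product))   # True
print(int_product > 10 ** 1500)           # True

# Multiplying every class score by the same positive factor keeps the argmax.
scores = [2, 5, 3]  # hypothetical unnormalized class scores
scaled_scores = [s * 10000 for s in scores]
best_raw = scores.index(max(scores))
best_scaled = scaled_scores.index(max(scaled_scores))
print(best_raw == best_scaled)            # True
```

The integer product is exact but grows to well over a thousand digits, which is one reason big-integer prediction is slow.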

Code

# encoding=utf-8
import pandas as pd
import numpy as np
import cv2
import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from collections import defaultdict

class Naive_bayes(object):
    def __init__(self):
        self.class_num = 10
        self.feature_len = 784
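    # Binarize every image in place
    def binaryzation(self,img):
        for i in range(len(img)):
            img_1 =img[i]  # one flattened image (784 pixel values)
            cv_img = img_1.astype(np.uint8)  # pixel values 0-255, to be mapped to {0, 1}
            # THRESH_BINARY_INV: pixels > 50 become 0, all others become 1
            cv2.threshold(cv_img,50,1,cv2.THRESH_BINARY_INV,cv_img)
            img[i]=cv_img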

    # Training
    def Train(self,trainset,train_labels):
        trainset=list(trainset)
        train_labels=list(train_labels)

        # Group the training images by label
        Data=defaultdict(list)
        for i in range(len(train_labels)):
            Data[train_labels[i]].append(trainset[i])

        self.prior_probability = np.zeros(self.class_num)                         # prior: per-class sample counts (not divided by N)
        self.conditional_probability = np.zeros((self.class_num, self.feature_len, 2))  # conditional: counts of pixel value 0/1 per class and pixel
        for label in Data:
            self.prior_probability[label]=len(Data[label])
            temp = list(np.sum(np.array(Data[label]),axis=0))  # per-pixel count of 1s in this class
            for j in range(self.feature_len):
                self.conditional_probability[label][j][1] += temp[j]
                self.conditional_probability[label][j][0]+=(self.prior_probability[label]-temp[j])
        # Map the probabilities into [1, 10001]
        for i in range(self.class_num):
            for j in range(self.feature_len):

                # After binarization each pixel takes only the values 0 and 1
                pix_0 = self.conditional_probability[i][j][0]
                pix_1 = self.conditional_probability[i][j][1]

                # Scaled conditional probabilities for pixel values 0 and 1
                probability_0 = (float(pix_0)/float(pix_0+pix_1))*10000 + 1
                probability_1 = (float(pix_1)/float(pix_0+pix_1))*10000 + 1

                self.conditional_probability[i][j][0] = probability_0
                self.conditional_probability[i][j][1] = probability_1

    def Predict(self,testset):
        predict = []
        for img in testset:
            temp=[]  # must be a list of Python ints, otherwise the product overflows
            for j in range(self.class_num):
                temp.append(int(self.prior_probability[j]))
            for i in range(len(img)):
                # scaled P(X(i) = img[i] | Y = c) for every class c
                temp_1=self.conditional_probability[:, i, img[i]]
                for j in range(self.class_num):
                    temp[j] *= int(temp_1[j])

            max_label=np.argmax(temp)
            predict.append(max_label)
        return np.array(predict)


if __name__ == '__main__':

    Model=Naive_bayes()
    print('Start read data')
    S = time.time()
    raw_data = pd.read_csv('../data/train.csv')  # read the CSV file
    data = raw_data.values  # get the underlying NumPy array
    print("data shape:", data.shape)
    imgs = data[0:, 1:]
    labels = data[:, 0]
    Model.binaryzation(imgs)

    print("imgs shape:", imgs.shape)
    print("labels shape:", labels.shape)

    # Use about 2/3 of the data for training and 1/3 for testing
    train_features, test_features, train_labels, test_labels = train_test_split(
        imgs, labels, test_size=0.33, random_state=23323)
    print("train data count :%d" % len(train_labels))
    print("test data count :%d" % len(test_labels))

    print('read data cost ', time.time() - S, ' second')

    print('Start training')
    S = time.time()
    Model.Train(train_features, train_labels)
    print('training cost ', time.time() - S, ' second')

    print('Start predicting')
    S = time.time()
    test_predict = Model.Predict(test_features)
    print('predicting cost ', time.time() - S, ' second')

    score = accuracy_score(test_labels, test_predict)
    print("The accuracy score is ", score)
Output:
	Start read data
	data shape: (42000, 785)
	imgs shape: (42000, 784)
	labels shape: (42000,)
	train data count :28140
	test data count :13860
	read data cost  4.07903265953064  second
	Start training
	training cost  0.21043634414672852  second
	Start predicting
	predicting cost  93.11626148223877  second
	The accuracy score is  0.8331168831168831
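
Prediction dominates the run time in the output above, since every test image needs 784 big-integer multiplications per class in pure Python. A common alternative, which is not the approach used in this post, is to sum log-probabilities instead of multiplying probabilities, which avoids both underflow and big integers and allows vectorization with NumPy. The sketch below rests on several assumptions: the function names `fit_log_probs` and `predict_log` are hypothetical, Laplace smoothing with `alpha=1.0` replaces the "+1 after scaling" trick, and the inputs are already binarized 0/1 arrays.

```python
import numpy as np

def fit_log_probs(X, y, n_classes, alpha=1.0):
    """Estimate log priors and log conditionals (hypothetical helper).

    X: (n_samples, n_features) array of 0/1 values; y: integer labels.
    alpha is an assumed Laplace smoothing constant.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    n_features = X.shape[1]
    log_prior = np.zeros(n_classes)
    log_cond = np.zeros((n_classes, n_features, 2))
    for c in range(n_classes):
        Xc = X[y == c]
        log_prior[c] = np.log((len(Xc) + alpha) / (len(X) + alpha * n_classes))
        ones = Xc.sum(axis=0)                        # per-pixel count of 1s
        p1 = (ones + alpha) / (len(Xc) + 2 * alpha)  # smoothed P(x_j = 1 | c)
        log_cond[c, :, 1] = np.log(p1)
        log_cond[c, :, 0] = np.log(1.0 - p1)
    return log_prior, log_cond

def predict_log(X, log_prior, log_cond):
    """Return the argmax class for each row of X using summed log-probabilities."""
    X = np.asarray(X)
    # scores[n, c] = log P(c) + sum_j log P(x_j | c)
    scores = (log_prior
              + X @ log_cond[:, :, 1].T
              + (1 - X) @ log_cond[:, :, 0].T)
    return np.argmax(scores, axis=1)

# Toy usage with hypothetical data: two classes, three binary features.
X_toy = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
y_toy = np.array([0, 0, 1, 1])
log_prior, log_cond = fit_log_probs(X_toy, y_toy, n_classes=2)
pred = predict_log(X_toy, log_prior, log_cond)
print(pred)
```

Because the logarithm is monotonic, taking the argmax over summed logs selects the same class as taking it over the product of the same probabilities. The accuracy may differ slightly from the figure above because the smoothing differs.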
