P(A|B) = P(B|A) * P(A) / P(B)
Derivation:
=> P(A,B) = P(A) * P(B|A)
=> P(B,A) = P(B) * P(A|B)
=> P(A,B) = P(B,A)
=> P(A) * P(B|A) = P(B) * P(A|B)
=> P(A|B) = P(B|A) * P(A) / P(B)
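As a quick sanity check of the derivation, the sketch below builds a small joint distribution over two binary events and confirms that Bayes' formula reproduces P(A|B) computed directly from its definition. The probabilities in `joint` are hypothetical values chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical joint distribution over two binary events A and B.
# Keys are (a, b) outcomes; the numbers are assumptions, not data from this post.
joint = {
    (True, True): Fraction(3, 10),
    (True, False): Fraction(1, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

p_a = sum(p for (a, _), p in joint.items() if a)  # P(A)
p_b = sum(p for (_, b), p in joint.items() if b)  # P(B)
p_ab = joint[(True, True)]                        # P(A,B)

p_b_given_a = p_ab / p_a  # P(B|A) = P(A,B) / P(A)
p_a_given_b = p_ab / p_b  # P(A|B) straight from the definition

# Bayes' formula: P(A|B) = P(B|A) * P(A) / P(B)
bayes = p_b_given_a * p_a / p_b
print(p_a_given_b, bayes)  # 3/5 3/5
```

Exact fractions are used so that the two sides can be compared with `==` without floating-point noise.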
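A simple application: suppose there are 10 watermelons, each described by several features [round/oval, smooth/rough]; we train on these features and predict the class label [good melon/bad melon].
P(label|features) = P(features|label) * P(label) / P(features)
Naive Bayes makes one important assumption, conditional independence: given the class label, the features are independent of one another. This is what makes it "naive"; the assumption simplifies the problem, although in real life most features are related to one another.
Prior probability: the probability of a label, e.g. the probability of the good-melon label in the watermelon example above.
Posterior probability: the probability of a label once the features are known, e.g. the probability that a round, smooth watermelon is a good melon.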
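The difference between the two quantities can be seen by counting on a toy data set. The ten melons below are invented for illustration (they are not the training data used later); the prior looks only at the labels, while the posterior looks at the labels among melons that share the observed features.

```python
# Ten hypothetical melons: (shape, surface, label). Invented for illustration.
melons = [
    ("round", "smooth", "good"),
    ("round", "smooth", "good"),
    ("round", "smooth", "good"),
    ("round", "rough", "good"),
    ("oval", "smooth", "good"),
    ("round", "smooth", "bad"),
    ("oval", "rough", "bad"),
    ("oval", "rough", "bad"),
    ("oval", "smooth", "bad"),
    ("round", "rough", "bad"),
]

# Prior P(good): fraction of all melons labelled "good".
prior_good = sum(1 for _, _, label in melons if label == "good") / len(melons)

# Posterior P(good | round, smooth): fraction labelled "good"
# among the melons that are both round and smooth.
round_smooth = [label for shape, surface, label in melons
                if shape == "round" and surface == "smooth"]
posterior_good = round_smooth.count("good") / len(round_smooth)

print(prior_good)      # 0.5
print(posterior_good)  # 0.75
```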
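Bayesian decision theory: assuming all the relevant probabilities are known, it uses those probabilities together with the misclassification loss to choose the optimal class label.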
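Suppose there are N possible class labels, Y = {c1, c2, c3, …, cN}. Which class does a sample x belong to? The computation proceeds as follows:
step1: compute the probability that sample x belongs to the i-th class, i.e. P(ci|x);
step2: compare all the P(ci|x) and pick the class with the largest value as the best class for x;
step3: substitute the class ci and the sample x into Bayes' formula to evaluate P(ci|x):
P(ci|x) = P(x|ci) * P(ci) / P(x)
Here P(ci) is the prior probability and P(x|ci) is the class-conditional probability. P(x) is the same for every class, so it does not affect the comparison; the quantity that remains to be estimated is P(x|ci).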
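The three steps can be sketched as follows for a single fixed sample x. The priors and class-conditional probabilities are hypothetical numbers, and the class names `c1`, `c2`, `c3` are placeholders; the point is only that the class with the largest P(x|ci) * P(ci) is also the class with the largest posterior.

```python
# Hypothetical values for one fixed sample x; none come from the watermelon data.
priors = {"c1": 0.5, "c2": 0.3, "c3": 0.2}          # P(ci)
likelihoods = {"c1": 0.10, "c2": 0.40, "c3": 0.25}  # P(x|ci)

# Numerator of Bayes' formula for each class: P(x|ci) * P(ci).
scores = {c: likelihoods[c] * priors[c] for c in priors}

# P(x) is the sum of the numerators over all classes (law of total probability).
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}  # P(ci|x)

# Dividing by the shared P(x) does not change which class is largest.
best_by_posterior = max(posteriors, key=posteriors.get)
best_by_score = max(scores, key=scores.get)
print(best_by_posterior, best_by_score)  # c2 c2
```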
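Suppose sample x has d attributes, i.e. x = {x1, x2, x3, …, xd}. Then:
P(x|ci) = P(x1, x2, x3, …, xd|ci)
This joint probability is hard to estimate directly from a limited number of training samples. Naive Bayes adopts the "attribute conditional independence assumption", i.e. it assumes that, given the class, all attributes are mutually independent, so:
P(x|ci) = P(x1, x2, x3, …, xd|ci) = P(x1|ci) * P(x2|ci) * … * P(xd|ci)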
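A minimal sketch of this factorization, assuming the per-attribute conditional probabilities are already known (the four numbers below are hypothetical). It also shows the equivalent sum of logarithms, a common way to avoid numerical underflow when d is large; that variant is an implementation choice, not something the formula requires.

```python
import math

# Hypothetical per-attribute conditional probabilities P(xj|ci) for one class ci.
cond_probs = [0.5, 0.8, 0.25, 0.4]

# Under the attribute conditional independence assumption,
# P(x|ci) is the product of the per-attribute terms.
likelihood = math.prod(cond_probs)

# Same quantity in log space: log P(x|ci) = sum_j log P(xj|ci).
log_likelihood = sum(math.log(p) for p in cond_probs)

print(likelihood, math.exp(log_likelihood))
```

Comparing classes by `log_likelihood` gives the same ranking as comparing them by `likelihood`, because the logarithm is monotonically increasing.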
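In the end we only need the conditional probability P(xj|ci) of each individual attribute. By the definition of conditional probability, it can be estimated by counting:
P(xj|ci) = P(xj, ci) / P(ci) = num(xj, ci) / num(ci)
where num(xj, ci) is the number of training samples in which xj and ci occur together, and num(ci) is the number of training samples of class ci.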
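The counting estimate can be sketched with `collections.Counter`. The eight (attribute value, label) pairs are hypothetical, and `cond_prob` is an illustrative helper name rather than part of the script that follows.

```python
from collections import Counter

# Hypothetical training set: (value of attribute xj, class label ci).
samples = [
    ("smooth", "good"), ("smooth", "good"), ("smooth", "good"), ("rough", "good"),
    ("smooth", "bad"), ("rough", "bad"), ("rough", "bad"), ("rough", "bad"),
]

pair_counts = Counter(samples)                          # num(xj, ci)
class_counts = Counter(label for _, label in samples)   # num(ci)

def cond_prob(value, label):
    """Estimate P(xj = value | ci = label) as num(xj, ci) / num(ci)."""
    return pair_counts[(value, label)] / class_counts[label]

print(cond_prob("smooth", "good"))  # 0.75
print(cond_prob("smooth", "bad"))   # 0.25
```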
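Watermelon training data: https://download.csdn.net/download/LWY_Xing/13209988
Classify the following test sample (the attribute values are the ones the code below looks up): 色泽 (color) = 青绿 (green), 根蒂 (stem) = 蜷缩 (curled), 敲声 (knock sound) = 浊响 (dull), 纹理 (texture) = 清晰 (clear), 脐部 (navel) = 凹陷 (sunken), 触感 (touch) = 硬滑 (hard and smooth), 密度 (density) = 0.697, 含糖率 (sugar content) = 0.460.
Computation: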
import math
import pandas as pd

# Load the training set; the file is read as space-separated.
watermelon_frame = pd.read_csv('./xigua.csv', sep=' ')
print(watermelon_frame.shape)

# Class priors P(好瓜 = 是) and P(好瓜 = 否), estimated by counting rows per label.
good_melon_num = watermelon_frame.loc[watermelon_frame['好瓜'] == '是'].shape[0]
bad_melon_num = watermelon_frame.loc[watermelon_frame['好瓜'] == '否'].shape[0]
total_num = watermelon_frame.shape[0]
prob_good_melon = round(good_melon_num / total_num, 3)
prob_bad_melon = round(bad_melon_num / total_num, 3)
print('P(好瓜 = 是) = %.3f' %(prob_good_melon))
print('P(好瓜 = 否) = %.3f' %(prob_bad_melon))
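
# Discrete attributes: P(xj|ci) = num(xj, ci) / num(ci), one pair of estimates per attribute value of the test sample.
green_yes_num = watermelon_frame.loc[(watermelon_frame['色泽'] == '青绿') & (watermelon_frame['好瓜'] == '是')].shape[0]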
prob_green_yes = round(green_yes_num / good_melon_num, 3)
print('P(青绿|是) = %.3f' %(prob_green_yes))
green_no_num = watermelon_frame.loc[(watermelon_frame['色泽'] == '青绿') & (watermelon_frame['好瓜'] == '否')].shape[0]
prob_green_no = round(green_no_num / bad_melon_num, 3)
print('P(青绿|否) = %.3f' %(prob_green_no))
rollup_yes_num = watermelon_frame.loc[(watermelon_frame['根蒂'] == '蜷缩') & (watermelon_frame['好瓜'] == '是')].shape[0]
prob_rollup_yes = round(rollup_yes_num / good_melon_num, 3)
print('P(蜷缩|是) = %.3f' %(prob_rollup_yes))
rollup_no_num = watermelon_frame.loc[(watermelon_frame['根蒂'] == '蜷缩') & (watermelon_frame['好瓜'] == '否')].shape[0]
prob_rollup_no = round(rollup_no_num / bad_melon_num, 3)
print('P(蜷缩|否) = %.3f' %(prob_rollup_no))
voicedsound_yes_num = watermelon_frame.loc[(watermelon_frame['敲声'] == '浊响') & (watermelon_frame['好瓜'] == '是')].shape[0]
prob_voicedsound_yes = round(voicedsound_yes_num / good_melon_num, 3)
print('P(浊响|是) = %.3f' %(prob_voicedsound_yes))
voicedsound_no_num = watermelon_frame.loc[(watermelon_frame['敲声'] == '浊响') & (watermelon_frame['好瓜'] == '否')].shape[0]
prob_voicedsound_no = round(voicedsound_no_num / bad_melon_num, 3)
print('P(浊响|否) = %.3f' %(prob_voicedsound_no))
clear_yes_num = watermelon_frame.loc[(watermelon_frame['纹理'] == '清晰') & (watermelon_frame['好瓜'] == '是')].shape[0]
prob_clear_yes = round(clear_yes_num / good_melon_num, 3)
print('P(清晰|是) = %.3f' %(prob_clear_yes))
clear_no_num = watermelon_frame.loc[(watermelon_frame['纹理'] == '清晰') & (watermelon_frame['好瓜'] == '否')].shape[0]
prob_clear_no = round(clear_no_num / bad_melon_num, 3)
print('P(清晰|否) = %.3f' %(prob_clear_no))
sunken_yes_num = watermelon_frame.loc[(watermelon_frame['脐部'] == '凹陷') & (watermelon_frame['好瓜'] == '是')].shape[0]
prob_sunken_yes = round(sunken_yes_num / good_melon_num, 3)
print('P(凹陷|是) = %.3f' %(prob_sunken_yes))
sunken_no_num = watermelon_frame.loc[(watermelon_frame['脐部'] == '凹陷') & (watermelon_frame['好瓜'] == '否')].shape[0]
prob_sunken_no = round(sunken_no_num / bad_melon_num, 3)
print('P(凹陷|否) = %.3f' %(prob_sunken_no))
hardslippery_yes_num = watermelon_frame.loc[(watermelon_frame['触感'] == '硬滑') & (watermelon_frame['好瓜'] == '是')].shape[0]
prob_hardslippery_yes = round(hardslippery_yes_num / good_melon_num, 3)
print('P(硬滑|是) = %.3f' %(prob_hardslippery_yes))
hardslippery_no_num = watermelon_frame.loc[(watermelon_frame['触感'] == '硬滑') & (watermelon_frame['好瓜'] == '否')].shape[0]
prob_hardslippery_no = round(hardslippery_no_num / bad_melon_num, 3)
print('P(硬滑|否) = %.3f' %(prob_hardslippery_no))
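
# Continuous attributes are modelled with a Gaussian; this returns the density at x, rounded to 3 decimals.
def prop_density_fun(x, mean, var):
    return round(math.e**(-(x - mean)**2 / (2 * var)) / math.sqrt(2 * math.pi * var), 3)
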
# 密度 given 好瓜 = 是: mean and sample variance of the good melons.
density_yes_frame = watermelon_frame.loc[watermelon_frame['好瓜'] == '是']
density_yes_frame = density_yes_frame.loc[:, '密度']
density_yes_mean = round(density_yes_frame.mean(), 3)
density_yes_var = round(density_yes_frame.var(), 3)
print('density and good melon mean = %0.3f' %(density_yes_mean))
print('density and good melon var = %0.3f' %(density_yes_var))
prop_density_yes = prop_density_fun(0.697, density_yes_mean, density_yes_var)
print('P(密度=0.697|是) = %0.3f' %(prop_density_yes))
# 密度 given 好瓜 = 否: mean and sample variance of the bad melons.
density_no_frame = watermelon_frame.loc[watermelon_frame['好瓜'] == '否']
density_no_frame = density_no_frame.loc[:, '密度']
density_no_mean = round(density_no_frame.mean(), 3)
density_no_var = round(density_no_frame.var(), 3)
print('density and bad melon mean = %0.3f' %(density_no_mean))
print('density and bad melon var = %0.3f' %(density_no_var))
prop_density_no = prop_density_fun(0.697, density_no_mean, density_no_var)
print('P(密度=0.697|否) = %0.3f' %(prop_density_no))
# 含糖率 given 好瓜 = 是: mean and sample variance of the good melons.
sugary_yes_frame = watermelon_frame.loc[watermelon_frame['好瓜'] == '是']
sugary_yes_frame = sugary_yes_frame.loc[:, '含糖率']
sugary_yes_mean = round(sugary_yes_frame.mean(), 3)
sugary_yes_var = round(sugary_yes_frame.var(), 3)
print('sugary and good melon mean = %0.3f' %(sugary_yes_mean))
print('sugary and good melon var = %0.3f' %(sugary_yes_var))
prop_sugary_yes = prop_density_fun(0.460, sugary_yes_mean, sugary_yes_var)
print('P(含糖率=0.460|是) = %0.3f' %(prop_sugary_yes))
# 含糖率 given 好瓜 = 否: mean and sample variance of the bad melons.
sugary_no_frame = watermelon_frame.loc[watermelon_frame['好瓜'] == '否']
sugary_no_frame = sugary_no_frame.loc[:, '含糖率']
sugary_no_mean = round(sugary_no_frame.mean(), 3)
sugary_no_var = round(sugary_no_frame.var(), 3)
print('sugary and bad melon mean = %0.3f' %(sugary_no_mean))
print('sugary and bad melon var = %0.3f' %(sugary_no_var))
prop_sugary_no = prop_density_fun(0.460, sugary_no_mean, sugary_no_var)
print('P(含糖率=0.460|否) = %0.3f' %(prop_sugary_no))

# Multiply the per-attribute terms for each class. Note that the class priors
# prob_good_melon / prob_bad_melon are not multiplied in here, so these scores are
# the likelihoods P(x|ci) rather than the full numerator P(x|ci) * P(ci).
prop_good_melon_test = round(prob_green_yes * prob_rollup_yes * prob_voicedsound_yes * prob_clear_yes * prob_sunken_yes * prob_hardslippery_yes * prop_density_yes * prop_sugary_yes, 6)
prop_bad_melon_test = round(prob_green_no * prob_rollup_no * prob_voicedsound_no * prob_clear_no * prob_sunken_no * prob_hardslippery_no * prop_density_no * prop_sugary_no, 6)
print('prop good melon test = %0.6f' %(prop_good_melon_test))
print('prop bad melon test = %0.6f' %(prop_bad_melon_test))

# Predict the class with the larger score.
if prop_good_melon_test > prop_bad_melon_test:
    print('test data is good melon!')
else:
    print('test data is bad melon!')

(base) k8s-master@k8s-master:~/Desktop/python/nlp_learning/class1$ python xigua_classification_by_Naive_Bayes.py
(17, 10)
P(好瓜 = 是) = 0.471
P(好瓜 = 否) = 0.529
P(青绿|是) = 0.375
P(青绿|否) = 0.333
P(蜷缩|是) = 0.625
P(蜷缩|否) = 0.333
P(浊响|是) = 0.750
P(浊响|否) = 0.444
P(清晰|是) = 0.875
P(清晰|否) = 0.222
P(凹陷|是) = 0.625
P(凹陷|否) = 0.222
P(硬滑|是) = 0.750
P(硬滑|否) = 0.667
density and good melon mean = 0.574
density and good melon var = 0.017
P(密度=0.697|是) = 1.961
density and bad melon mean = 0.496
density and bad melon var = 0.038
P(密度=0.697|否) = 1.203
sugary and good melon mean = 0.279
sugary and good melon var = 0.010
P(含糖率=0.460|是) = 0.775
sugary and bad melon mean = 0.154
sugary and bad melon var = 0.012
P(含糖率=0.460|否) = 0.074
prop good melon test = 0.109572
prop bad melon test = 0.000144
test data is good melon!
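
The script repeats the same counting pattern once per attribute, so the whole procedure can be folded into a loop. The following is a minimal sketch under stated assumptions, not a replacement for the script: it uses only the standard library, the function names `gaussian_pdf` and `naive_bayes_scores` are hypothetical, the six-row data set is invented (it is not xigua.csv), and, unlike the script, it multiplies the class prior P(ci) into each score. `statistics.variance` is the sample variance, which is also what the pandas `var()` calls in the script compute by default.

```python
import math
from statistics import mean, variance

def gaussian_pdf(x, mu, var):
    """Normal density at x with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def naive_bayes_scores(rows, labels, sample, numeric):
    """Return {label: P(ci) * prod_j P(xj|ci)} for one sample.

    rows    -- list of dicts mapping attribute name to value
    labels  -- list of class labels, aligned with rows
    sample  -- dict describing the sample to classify
    numeric -- set of attribute names treated as continuous
    """
    scores = {}
    for c in set(labels):
        in_class = [r for r, y in zip(rows, labels) if y == c]
        score = len(in_class) / len(rows)  # prior P(ci)
        for attr, value in sample.items():
            column = [r[attr] for r in in_class]
            if attr in numeric:
                # Continuous attribute: Gaussian density with sample variance.
                score *= gaussian_pdf(value, mean(column), variance(column))
            else:
                # Discrete attribute: num(xj, ci) / num(ci).
                score *= column.count(value) / len(column)
        scores[c] = score
    return scores

# Tiny hypothetical data set, invented for illustration.
rows = [
    {"texture": "clear", "density": 0.70},
    {"texture": "clear", "density": 0.77},
    {"texture": "blurry", "density": 0.63},
    {"texture": "blurry", "density": 0.36},
    {"texture": "clear", "density": 0.24},
    {"texture": "blurry", "density": 0.34},
]
labels = ["good", "good", "good", "bad", "bad", "bad"]

scores = naive_bayes_scores(rows, labels, {"texture": "clear", "density": 0.68}, {"density"})
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The continuous terms are density values rather than probabilities, which is why a term can exceed 1, as with the 1.961 printed for P(密度=0.697|是) in the output above.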