Reference: https://blog.csdn.net/jbb0523
Bayesian decision theory is the basic method for making decisions under a probabilistic framework.
$$R(c_i|\mathbf{x})=\sum_{j=1}^{N}{\lambda_{ij}P(c_j|\mathbf{x})}\tag{1}$$
$$R(h)=\mathbb{E}_{\mathbf{x}}[R(h(\mathbf{x})|\mathbf{x})]=\sum_{\mathbf{x}\in D}R(h(\mathbf{x})|\mathbf{x})P(\mathbf{x})\tag{2}$$
Equation (1) is defined for a single sample, while Equation (2) is the expectation over all samples in the dataset D.
Bayes decision rule: to minimize the overall risk, it suffices to choose, for each sample, the class label that minimizes the conditional risk.
$$h^*(\mathbf{x})=\arg\min\limits_{c\in{y}}{R(c|\mathbf{x})}\tag{3}$$
$$R(c|\mathbf{x})=1-P(c|\mathbf{x})\tag{4}$$
$$h^*(\mathbf{x})=\arg\max\limits_{c\in{y}}{P(c|\mathbf{x})}\tag{5}$$
The Bayes optimal classifier in Eq. (3) minimizes the conditional risk; under the 0/1 misclassification loss the risk reduces to Eq. (4), so minimizing it is equivalent to maximizing the posterior probability, as in Eq. (5).
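Concretely, substituting the 0/1 loss into Eq. (1):

$$\lambda_{ij}=\begin{cases}0, & \text{if } i=j;\\ 1, & \text{otherwise,}\end{cases}\qquad\Rightarrow\qquad R(c_i|\mathbf{x})=\sum_{j\neq i}P(c_j|\mathbf{x})=1-P(c_i|\mathbf{x}),$$

since the posteriors of all $N$ classes sum to 1.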
$$P(c|\mathbf{x})=\frac{P(\mathbf{x},c)}{P(\mathbf{x})}=\frac{P(c)P(\mathbf{x}|c)}{P(\mathbf{x})}$$
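A toy numeric illustration of Bayes' rule (all numbers are made up for the example):

```python
# Bayes' rule: P(c|x) = P(c) * P(x|c) / P(x); P(x) is shared by all classes,
# so it only normalizes the scores and never changes the argmax.
prior = {'是': 0.47, '否': 0.53}        # hypothetical class priors
likelihood = {'是': 0.05, '否': 0.01}   # hypothetical P(x|c)
evidence = sum(prior[c] * likelihood[c] for c in prior)  # P(x), total probability
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}
print(posterior)  # {'是': ~0.82, '否': ~0.18}; the posteriors sum to 1
```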
Naive Bayes classifier:
$$h_{nb}(\mathbf{x})=\arg\max\limits_{c\in{y}}{P(c)\prod_{i=1}^{d}P(x_i|c)}$$
**Background:** The naive Bayes classifier adopts the attribute conditional independence assumption, but in real-world tasks this assumption rarely holds. People therefore try to relax it to some degree. The basic idea of semi-naive Bayes is to take a moderate amount of dependence between attributes into account, so that neither a full joint-probability computation is required nor the stronger attribute dependencies are ignored outright.
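The most common relaxation is the one-dependent estimator (ODE): besides the class, each attribute is allowed to depend on at most one other attribute, its parent $pa_i$:

$$P(c|\mathbf{x})\propto P(c)\prod_{i=1}^{d}P(x_i\mid c,\,pa_i)$$

SPODE fixes a single super-parent shared by all attributes, TAN builds a maximum weighted spanning tree over the attributes, and AODE averages over all super-parents with sufficient support.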
Use the maximum likelihood method to estimate the class-conditional probabilities of the first 3 attributes in watermelon dataset 3.0.
Prove that when the conditional independence assumption does not hold, the naive Bayes classifier may still produce the optimal Bayes classifier.
If the conditionally dependent attributes always take identical values, or, relaxing this a little, take identical values within each class, the naive Bayes classifier can still be the optimal Bayes classifier; a minimal numeric check follows.
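A minimal sketch of this idea (hypothetical probabilities): duplicate one binary attribute so the independence assumption fails completely; with equal priors, squaring the per-class likelihood does not change which class attains the maximum, so naive Bayes still returns the Bayes optimal decision.

```python
# Toy check: attribute x2 is always equal to x1 (perfect dependence),
# so the true joint collapses to P(x1, x2 | c) = P(x1 | c).
p_x1 = {'+': 0.8, '-': 0.3}   # hypothetical P(x1=1 | c)
prior = {'+': 0.5, '-': 0.5}  # equal class priors

def bayes_optimal(x1):
    # the exact posterior uses the joint, i.e. a single factor of P(x1|c)
    score = {c: prior[c] * (p_x1[c] if x1 else 1 - p_x1[c]) for c in prior}
    return max(score, key=score.get)

def naive_bayes(x1):
    # NB multiplies the duplicated attribute in twice, squaring the likelihood
    score = {c: prior[c] * (p_x1[c] if x1 else 1 - p_x1[c]) ** 2 for c in prior}
    return max(score, key=score.get)

for x1 in (0, 1):
    assert bayes_optimal(x1) == naive_bayes(x1)
print("NB matches the Bayes optimal decision despite the dependence")
```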
Implement a naive Bayes classifier with Laplacian correction, use watermelon dataset 3.0 as the training set, and classify the "测1" sample on p.151.
Reference: https://blog.csdn.net/qdbszsj/article/details/79130431
import math

import numpy as np
import pandas as pd

dataset = pd.read_csv('C:/Users/Administrator/PycharmProjects/tes/WatermelonBook/watermelon3_0_Ch.csv', encoding="gbk", delimiter=",")
del dataset['编号']  # drop the ID column: it carries no predictive information
print(dataset)

X = dataset.values[:, :-1]
m, n = np.shape(X)
for i in range(m):
    # round the two continuous attributes (density, sugar ratio) to 3 decimals
    X[i, n - 1] = round(X[i, n - 1], 3)
    X[i, n - 2] = round(X[i, n - 2], 3)
y = dataset.values[:, -1]

columnName = dataset.columns
colIndex = {columnName[i]: i for i in range(len(columnName))}

Pmap = {}  # cache of computed probabilities, to avoid repeated work
# kindsOfAttribute[i] is the number of distinct values N_i of attribute i,
# used as the denominator term of the Laplacian correction;
# e.g. kindsOfAttribute[0] == 3 because '色泽' has 3 distinct values
kindsOfAttribute = {i: len(set(X[:, i])) for i in range(n)}
continuousPara = {}  # cache of (mean, std) for the continuous attributes

# indices of positive ('是') and negative ('否') training samples
goodList = [i for i in range(len(y)) if y[i] == '是']
badList = [i for i in range(len(y)) if y[i] == '否']


def P(colID, attribute, C):
    """Estimate P(attribute | C), e.g. P(色泽=青绿 | 是)."""
    if (colID, attribute, C) in Pmap:
        return Pmap[(colID, attribute, C)]
    curJudgeList = goodList if C == '是' else badList
    if colID >= 6:  # density and sugar ratio are continuous: use a Gaussian density
        if (colID, C) in continuousPara:
            mean, std = continuousPara[(colID, C)]
        else:
            # cast to float: .std() fails on the object-dtype slice
            curData = X[curJudgeList, colID].astype(float)
            mean, std = curData.mean(), curData.std()
            continuousPara[(colID, C)] = (mean, std)
        ans = 1 / (math.sqrt(2 * math.pi) * std) * math.exp(-(attribute - mean) ** 2 / (2 * std * std))
    else:  # discrete attribute: frequency estimate with Laplacian correction
        count = sum(1 for i in curJudgeList if X[i, colID] == attribute)
        ans = (count + 1) / (len(curJudgeList) + kindsOfAttribute[colID])
    Pmap[(colID, attribute, C)] = ans
    return ans


def predictOne(single):
    # work in the log domain to avoid underflow;
    # the class priors also receive the Laplacian correction (N = 2 classes)
    ansYes = math.log2((len(goodList) + 1) / (len(y) + 2))
    ansNo = math.log2((len(badList) + 1) / (len(y) + 2))
    for i in range(len(single)):
        ansYes += math.log2(P(i, single[i], '是'))
        ansNo += math.log2(P(i, single[i], '否'))
    return '是' if ansYes > ansNo else '否'


def predictAll(iX):
    # predict every sample in iX, whatever its length
    return [predictOne(sample) for sample in iX]


predictY = predictAll(X)
print(y)
print(np.array(predictY))

# confusion matrix on the training set:
# rows are the true class (否, 是), columns are the predicted class
confusionMatrix = np.zeros((2, 2))
for i in range(len(y)):
    if predictY[i] == y[i]:
        if y[i] == '否':
            confusionMatrix[0, 0] += 1
        else:
            confusionMatrix[1, 1] += 1
    else:
        if y[i] == '否':
            confusionMatrix[0, 1] += 1
        else:
            confusionMatrix[1, 0] += 1
print(confusionMatrix)
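The "测1" sample on p.151 has the same attribute values as the first training sample of watermelon 3.0 (青绿, 蜷缩, 浊响, 清晰, 凹陷, 硬滑, density 0.697, sugar ratio 0.460), so, assuming the CSV column order above, it can be classified directly:

```python
# '测1' from p.151; the attribute order must match the CSV columns used above
test1 = ['青绿', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', 0.697, 0.460]
print(predictOne(test1))  # the book's worked example reaches '是' (a good melon)
```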
The usual fix for numerical underflow is to take logarithms, turning the long product into a sum, as illustrated below.
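```python
import math

probs = [1e-8] * 64          # many small conditional probabilities
product = 1.0
for p in probs:
    product *= p             # naive product of 64 factors of 1e-8
print(product)               # 0.0: the product underflows double precision

log_sum = sum(math.log(p) for p in probs)
print(log_sum)               # ≈ -1178.9: finite, and still comparable across classes
```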
Prove that in a binary classification task, when the two classes follow Gaussian distributions with the same variance, linear discriminant analysis produces the Bayes optimal classifier.
To be completed: the derivation and underlying theory of LDA are still unfinished! T T
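A sketch of the key step, assuming both classes are Gaussian with a shared covariance $\Sigma$ (not the full derivation): the log posterior ratio of the two classes is

$$\ln\frac{P(c_1|\mathbf{x})}{P(c_0|\mathbf{x})}=\ln\frac{P(c_1)}{P(c_0)}-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)+\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_0)^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_0)$$

The quadratic terms $\mathbf{x}^{T}\Sigma^{-1}\mathbf{x}$ cancel, leaving a decision function linear in $\mathbf{x}$ with normal direction $\Sigma^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_0)$, exactly the projection direction LDA selects; thresholding the posterior ratio at 1 therefore yields a linear classifier of the same form as LDA.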
Implement the AODE classifier, use watermelon dataset 3.0 as the training set, and classify the "测1" sample on p.151.
Reference: https://github.com/han1057578619/MachineLearning_Zhouzhihua_ProblemSets/tree/master/ch7–%E8%B4%9D%E5%8F%B6%E6%96%AF%E5%88%86%E7%B1%BB/7.6
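For reference, a minimal sketch of AODE scoring on fully discrete data (the two continuous watermelon attributes would need discretization or a Gaussian treatment, which this sketch omits; `m_threshold` is the minimum support required of a super-parent, set low here because the dataset has only 17 samples):

```python
import numpy as np

def aode_scores(X, y, x_test, m_threshold=1):
    """Sketch of AODE: score(c) = sum over super-parents x_i with enough
    support of  P(c, x_i) * prod_j P(x_j | c, x_i), Laplacian-corrected."""
    X, y = np.asarray(X), np.asarray(y)
    m, d = X.shape
    classes = sorted(set(y))
    N = len(classes)                                  # number of classes
    n_vals = [len(set(X[:, j])) for j in range(d)]    # distinct values per attribute
    scores = {}
    for c in classes:
        total = 0.0
        for i in range(d):                            # attribute i as super-parent
            if np.count_nonzero(X[:, i] == x_test[i]) < m_threshold:
                continue                              # too little support, skip
            mask_i = (y == c) & (X[:, i] == x_test[i])
            d_cxi = np.count_nonzero(mask_i)          # |D_{c, x_i}|
            prod = (d_cxi + 1) / (m + N * n_vals[i])  # P(c, x_i), corrected
            for j in range(d):
                d_cxixj = np.count_nonzero(mask_i & (X[:, j] == x_test[j]))
                prod *= (d_cxixj + 1) / (d_cxi + n_vals[j])  # P(x_j | c, x_i)
            total += prod
        scores[c] = total
    return scores

# usage: scores = aode_scores(X_discrete, y, test_sample)
#        predicted = max(scores, key=scores.get)
```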
Best case:
If within each class every attribute takes a single fixed value, the same 30×2 = 60 samples (30 per class) can serve the estimates for all attributes.
Worst case:
Each of the d attributes needs its own 30×2 samples, giving 30×2×d = 60d samples.