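Naive Bayes with discrete-valued features
x1 and x2 are the two features (the components of the feature vector), and Y is the class label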
Sample | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
x1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 |
x2 | S | M | M | S | S | S | M | M | L | L | L | M | M | L | L |
Y | -1 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | -1 |
For the calculation worked out by hand, see the reference blog at the end of this post.
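1. Create the data set
2. Compute the probabilities
2.1 Compute the class priors p(y = -1) and p(y = 1)
2.2 For the input feature vector x = (2, 'S'), compute p(x | y) = ∏ p(xi | y), the product of the per-feature conditional probabilities
3. Predict: for a given feature vector, multiply the prior and the conditional probabilities from step 2 for each class; the class with the largest product wins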
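As a worked instance of the steps above, here is the calculation for x = (2, 'S') using counts read directly off the table. It assumes plain relative-frequency estimates with no smoothing, which is also what the code below computes.

```latex
\begin{aligned}
P(Y=1) &= \tfrac{9}{15}, & P(Y=-1) &= \tfrac{6}{15} \\
P(x_1=2 \mid Y=1) &= \tfrac{3}{9}, & P(x_2=S \mid Y=1) &= \tfrac{1}{9} \\
P(x_1=2 \mid Y=-1) &= \tfrac{2}{6}, & P(x_2=S \mid Y=-1) &= \tfrac{3}{6} \\[4pt]
P(Y=1)\,P(x_1=2 \mid Y=1)\,P(x_2=S \mid Y=1) &= \tfrac{9}{15}\cdot\tfrac{3}{9}\cdot\tfrac{1}{9} = \tfrac{1}{45} \approx 0.0222 \\
P(Y=-1)\,P(x_1=2 \mid Y=-1)\,P(x_2=S \mid Y=-1) &= \tfrac{6}{15}\cdot\tfrac{2}{6}\cdot\tfrac{3}{6} = \tfrac{1}{15} \approx 0.0667
\end{aligned}
```

Since 1/15 > 1/45, the predicted class for x = (2, 'S') is Y = -1.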
def createDataSet():
    # Each row is [x1, x2, y]; the last column is the class label
    dataSet = [[1, 'S', -1],
               [1, 'M', -1],
               [1, 'M', 1],
               [1, 'S', 1],
               [1, 'S', -1],
               [2, 'S', -1],
               [2, 'M', -1],
               [2, 'M', 1],
               [2, 'L', 1],
               [2, 'L', 1],
               [3, 'L', 1],
               [3, 'M', 1],
               [3, 'M', 1],
               [3, 'L', 1],
               [3, 'L', -1]]
    labels = ['x1', 'x2', 'y']
    return dataSet, labels
# Count how many samples have class label t
def typeCount(typeList, t):
    cnt = 0
    for tL in typeList:
        if tL == t:
            cnt += 1
    return cnt
# Count the samples with class label y (-1 or 1) whose i-th feature equals feat
def featCount(dataSet, i, feat, y):
    cnt = 0
    #print(i, feat, y)
    for row in dataSet:
        if row[i] == feat and row[-1] == y:
            cnt += 1
    return cnt
def calcBayes(dataSet):
    # Use x = (2, 'S') as the example input
    X = [2, 'S']
    lenDataSet = len(dataSet)
    typeList = [row[-1] for row in dataSet]
    typeSet = set(typeList)  # set of distinct class labels
    print(typeList, typeSet)
    # Loop over the classes: t = 1 and t = -1
    pList = []  # unnormalized score of each class for X
    for t in typeSet:
        yNum = typeCount(typeList, t)  # number of samples with label t
        print(f'{t} num =', yNum)
        py = yNum / lenDataSet  # prior P(Y = t)
        print(f'P(Y = {t}) =', py)
        pSum = py  # running product (despite the name), starting from the prior
        # Count matches for each feature component
        for i in range(len(X)):
            xiNum = featCount(dataSet, i, X[i], t)  # samples with label t whose i-th feature equals X[i]
            print(f'feature {X[i]} num =', xiNum)
            # Conditional probability P(X_i = x_i | Y = t)
            pxy = xiNum / yNum
            print('conditional probability =', pxy)
            pSum *= pxy
        pList.append(pSum)
    #print(pList)
    return pList, typeSet
# Find the largest score and report the class at that index
def predict(pList, typeList):
    pMax = max(pList)
    for i in range(len(pList)):
        if pList[i] == pMax:
            print('*' * 50)
            print(f'Predicted class = {typeList[i]}')
if __name__ == '__main__':
    dataSet, labels = createDataSet()
    pList, typeSet = calcBayes(dataSet)
    predict(pList, list(typeSet))
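For comparison, here is a more compact sketch of the same counting approach. It is not the listing above rewritten line for line: the function name `naive_bayes_scores` is made up for this illustration, and it assumes the standard-library `collections.Counter` for the counts and `fractions.Fraction` so the scores come out as exact fractions rather than floats. Like the original, it applies no smoothing.

```python
from collections import Counter
from fractions import Fraction


def naive_bayes_scores(data, x):
    """Return {label: P(Y=label) * prod_i P(X_i = x[i] | Y=label)} as exact fractions.

    Hypothetical helper for illustration; each row of `data` is
    (feature_1, ..., feature_n, label).
    """
    # Number of samples per class label
    class_counts = Counter(row[-1] for row in data)
    # (feature index, feature value, label) -> number of matching samples
    feat_counts = Counter(
        (i, v, row[-1]) for row in data for i, v in enumerate(row[:-1])
    )
    scores = {}
    for label, n_label in class_counts.items():
        score = Fraction(n_label, len(data))  # prior P(Y = label)
        for i, v in enumerate(x):
            # Relative-frequency estimate of P(X_i = v | Y = label), no smoothing
            score *= Fraction(feat_counts[(i, v, label)], n_label)
        scores[label] = score
    return scores


# Same 15 samples as the table, built column by column
x1 = [1] * 5 + [2] * 5 + [3] * 5
x2 = list("SMMSSSMMLLLMMLL")
y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
data = list(zip(x1, x2, y))

scores = naive_bayes_scores(data, (2, 'S'))
prediction = max(scores, key=scores.get)  # class with the largest score
print(scores, prediction)
```

For x = (2, 'S') this gives a score of 1/15 for class -1 and 1/45 for class 1, so the prediction is -1. Note that a feature value never seen together with a class yields a zero count, which makes that class's whole product zero.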
Reference blog
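I may write follow-up posts on other Bayes variants and on handling continuous-valued features; consider this a placeholder for now~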