Machine learning theory runs deep: you should know not only that an algorithm works, but why it works.
Implementing the code by hand is the best way to internalize the principles — it builds understanding at the theoretical level and at the level of coding technique.
A powerful system is an accumulation of small ones; the principles are the same, the workflow carries over, and the small reveals the large.
[Hand-Coding Machine Learning Series] The KNN Algorithm
Principle: KNN classifies a query point by a majority vote among its k nearest training points. (Plenty of resources cover the derivation, so it is not repeated here.)
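The whole idea fits in a few lines: compute the distance from the query to every training point, keep the k closest, and vote. A minimal sketch on made-up toy data:

```python
import numpy as np
from collections import Counter

train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = [0, 0, 1, 1]
query = np.array([4.0, 5.0])

# Euclidean distance from the query to every training point
dists = np.linalg.norm(train - query, axis=1)

# Labels of the k nearest neighbors, then a majority vote
k = 3
nearest = [labels[i] for i in np.argsort(dists)[:k]]
print(Counter(nearest).most_common(1)[0][0])  # -> 1
```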
The code, by hand:
import numpy as np
import math
from collections import Counter

'''
Author: kevinelstri
Date: 2021/04/25
Desc: Hand-coding machine learning series
'''
# Load the data
def load_data(filename):
    initMat = []
    initLab = []
    for line in open(filename).readlines():
        line = [float(i) for i in line.strip().split('\t')]
        initMat.append(line[:3])   # feature rows, as a list
        initLab.append(line[-1])   # class labels
    npInitMat = np.array(initMat)  # feature matrix as np.array, easier to work with
    return initMat, initLab, npInitMat
# Min-max normalization
# norm = (value - min) / (max - min)
def Norm(dataSet):
    minValues = dataSet.min(0)  # column-wise minimum
    maxValues = dataSet.max(0)  # column-wise maximum
    return (dataSet - minValues) / (maxValues - minValues)
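To sanity-check the normalization: each column is mapped independently onto [0, 1], so column minima become 0 and column maxima become 1. A quick check with a made-up 3×2 matrix:

```python
import numpy as np

def Norm(dataSet):
    minValues = dataSet.min(0)
    maxValues = dataSet.max(0)
    return (dataSet - minValues) / (maxValues - minValues)

a = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(Norm(a))
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```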
# Split into training and test sets (8:2)
def splitTrainTest(dataSet):
    totalLen = len(dataSet)
    trainSet = dataSet[:int(totalLen * 0.8)]  # first 80% for training
    testSet = dataSet[int(totalLen * 0.8):]   # remaining 20% for testing
    return trainSet, testSet
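Assuming the intended split is the first 80% for training and the remaining 20% for testing, the two slices must partition the data rather than overlap (slicing both from index 0 would make every test sample a training sample, which inflates accuracy). A quick check with 10 stand-in samples:

```python
import numpy as np

data = np.arange(10)         # stand-in for 10 samples
cut = int(len(data) * 0.8)   # index where the split happens
train, test = data[:cut], data[cut:]
print(len(train), len(test))   # -> 8 2
print(set(train) & set(test))  # -> set() (no overlap)
```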
# Euclidean distance
def dist(vec1, vec2):
    return math.sqrt(sum(pow(vec1 - vec2, 2)))
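The hand-rolled distance can be cross-checked against NumPy's built-in norm; for the classic 3-4-5 triangle both give 5.0:

```python
import math
import numpy as np

def dist(vec1, vec2):
    return math.sqrt(sum(pow(vec1 - vec2, 2)))

v1, v2 = np.array([3.0, 4.0]), np.array([0.0, 0.0])
print(dist(v1, v2))             # -> 5.0
print(np.linalg.norm(v1 - v2))  # -> 5.0
```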
# The KNN algorithm
def knn_classify(trainSet, testSet, trainLabel, k):
    testLabels = []
    for testVec in testSet:
        # Distance from this test vector to every training vector
        distList = [dist(testVec, trainVec) for trainVec in trainSet]
        # Indices of the k smallest distances (robust to duplicate distances,
        # unlike looking each sorted value up again with list.index)
        idx_k = np.argsort(distList)[:k]
        # Labels of the k nearest neighbors
        labels_k = [trainLabel[i] for i in idx_k]
        # The most frequent label among the k neighbors is the prediction
        maxValue = Counter(labels_k).most_common(1)[0][0]
        testLabels.append(maxValue)
    return testLabels
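One subtle point in the voting step: Counter.most_common sorts by count, whereas sorting the (label, count) pairs directly sorts by label value and can return the wrong answer. A small check:

```python
from collections import Counter

votes = Counter([2.0, 3.0, 2.0])  # label 2.0 appears twice, 3.0 once
print(votes.most_common(1)[0][0])                 # -> 2.0 (most frequent)
print(sorted(votes.items(), reverse=True)[0][0])  # -> 3.0 (largest label, not most frequent!)
```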
# Error rate
def errorRate(testLabels, actualLabels):
    errorCount = 0
    totalCount = len(testLabels)
    for i in range(totalCount):
        if testLabels[i] != actualLabels[i]:
            errorCount += 1
    return errorCount / totalCount  # Python 3 division already returns a float
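The error rate is simply the fraction of mismatched labels; with zip the same computation fits in one line:

```python
predicted = [1.0, 1.0, 2.0, 3.0]
actual    = [1.0, 2.0, 2.0, 3.0]
error = sum(p != a for p, a in zip(predicted, actual)) / len(actual)
print(error)  # -> 0.25 (1 mismatch out of 4)
```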
# Main
if __name__ == '__main__':
    # Load the data
    filename = 'datingTestSet2.txt'
    initMat, initLab, npInitMat = load_data(filename)
    print('Feature data (list):\n', initMat)
    print('Class labels:\n', initLab)
    print('Feature data (np.array):\n', npInitMat)
    # Normalize
    normMat = Norm(npInitMat)
    print('Normalized feature data (np.array):\n', normMat)
    # Split into training and test sets
    trainSet, testSet = splitTrainTest(normMat)
    print('Training set:\n', trainSet)
    print('Training set size:\n', len(trainSet))
    print('Test set:\n', testSet)
    print('Test set size:\n', len(testSet))
    trainLabel = initLab[:int(len(initLab) * 0.8)]   # labels for the first 80%
    actualLabel = initLab[int(len(initLab) * 0.8):]  # labels for the last 20%
    print('Training labels:\n', trainLabel)
    print('Number of training labels:\n', len(trainLabel))
    print('Actual test labels:\n', actualLabel)
    print('Number of actual test labels:\n', len(actualLabel))
    # Run KNN
    k = 3
    testLabel = knn_classify(trainSet, testSet, trainLabel, k)
    print('Predicted test labels:\n', testLabel)
    # Error rate
    error_Rate = errorRate(testLabel, actualLabel)
    print('Error rate:\n', error_Rate)
Program output:
Feature data (list):
[[40920.0, 8.326976, 0.953952], [14488.0, 7.153469, 1.673904], [26052.0, 1.441871, 0.805124], [75136.0, 13.147394, 0.428964], [38344.0, 1.669788, 0.134296],...,[15917.0, 0.0, 1.369726], [6131.0, 0.608457, 0.51222], [67432.0, 6.558239, 0.667579], [30354.0, 12.315116, 0.197068], [69696.0, 7.014973, 1.494616], [33481.0, 8.822304, 1.194177], [43075.0, 10.086796, 0.570455],[2299.0, 3.733617, 0.698269], [5262.0, 2.002589, 1.380184], [4659.0, 2.502627, 0.184223], [17582.0, 6.382129, 0.876581], [27750.0, 8.546741, 0.128706], [9868.0, 2.694977, 0.432818], [18333.0, 3.951256, 0.3333], [3780.0, 9.856183, 0.329181], [18190.0, 2.068962, 0.429927], [11145.0, 3.410627, 0.631838], [68846.0, 9.974715, 0.669787], [26575.0, 10.650102, 0.866627], [48111.0, 9.134528, 0.728045], [43757.0, 7.882601, 1.332446]]
Class labels:
[3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 1.0, 3.0, 1.0, ..., 1.0, 2.0, 3.0, 1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 2.0, 3.0, 3.0, 1.0, 2.0, 1.0, 2.0, 3.0, 1.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 3.0, 3.0, 3.0]
Feature data (np.array):
[[4.0920000e+04 8.3269760e+00 9.5395200e-01]
[1.4488000e+04 7.1534690e+00 1.6739040e+00]
...
[4.8111000e+04 9.1345280e+00 7.2804500e-01]
[4.3757000e+04 7.8826010e+00 1.3324460e+00]]
Normalized feature data (np.array):
[[0.44832535 0.39805139 0.56233353]
[0.15873259 0.34195467 0.98724416]
...
[0.52711097 0.43665451 0.4290048 ]
[0.47940793 0.3768091 0.78571804]]
Training set:
[[0.44832535 0.39805139 0.56233353]
[0.15873259 0.34195467 0.98724416]
...
[0.28542943 0.06892523 0.47449629]
[0.31903192 0.16448638 0.04554814]]
Training set size:
 800
Test set:
[[0.44832535 0.39805139 0.56233353]
[0.15873259 0.34195467 0.98724416]
...
[0.48854535 0.5507484 0.1768832 ]
[0.11112815 0.05446857 0.24446797]]
Test set size:
 200
Training labels:
[3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 1.0, 3.0, 1.0, 1.0, ..., 1.0, 2.0, 1.0, 3.0, 1.0, 2.0, 2.0, 1.0, 3.0, 2.0, 1.0, 3.0, 3.0, 2.0, 2.0, 2.0, 1.0, 2.0, 2.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 3.0, 1.0]
Number of training labels:
 800
Actual test labels:
[3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 1.0, 3.0, 1.0, 1.0, ...,1.0, 3.0, 2.0, 3.0, 1.0, 3.0, 2.0, 1.0, 3.0, 2.0, 2.0, 3.0, 2.0, 3.0, 2.0, 1.0, 1.0, 3.0, 1.0, 3.0, 2.0, 2.0, 2.0, 3.0, 2.0]
Number of actual test labels:
 200
Predicted test labels:
[3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 1.0, 3.0, 1.0, 1.0, ...,3.0, 3.0, 2.0, 3.0, 1.0, 3.0, 2.0, 3.0, 3.0, 2.0, 2.0, 3.0, 2.0, 3.0, 3.0, 1.0, 2.0, 3.0, 1.0, 3.0, 2.0, 2.0, 2.0, 3.0, 2.0]
Error rate:
0.06
Code: https://github.com/kevinelstri/Hander-Marchine-Learning-Series/tree/main/K-means