[Hand-Coded Machine Learning Series] The KNN Algorithm

Machine learning theory runs deep: you should know not only that an algorithm works, but why it works.
Writing the code by hand builds a deeper understanding of the principles, not only at the theory level but also in coding technique.
A powerful system is just an accumulation of small ones; the principles are the same, and the workflow carries over from small to large.


Principle: (plenty of resources already cover this, so it is not repeated here)
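In one sentence: a query point takes the majority label among its k nearest training points. A minimal sketch of that idea on toy data (all values below are made up for illustration):

```python
import numpy as np
from collections import Counter

# Toy training data: two features, two classes (illustrative values only)
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [4.8, 5.1]])
y_train = [1, 1, 2, 2]

query = np.array([1.1, 1.0])
k = 3

# Euclidean distance from the query to every training point
d = np.sqrt(((X_train - query) ** 2).sum(axis=1))
# Labels of the k closest points, then a majority vote
nearest = [y_train[i] for i in np.argsort(d)[:k]]
print(Counter(nearest).most_common(1)[0][0])  # -> 1
```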
Hand-written code:

import numpy as np
import math
from collections import Counter

'''
Author:kevinelstri
Date:2021/04/25
Desc: Hand-Coded Machine Learning Series
'''

# Load the data
def load_data(filename):
    initMat = []
    initLab = []
    for line in open(filename).readlines():
        line = [float(i) for i in line.strip().split('\t')]
        initMat.append(line[:3])  # feature data, list format
        initLab.append(line[-1])  # class label
    npInitMat = np.array(initMat)  # feature data as np.array, easier to work with
    return initMat, initLab, npInitMat


# Min-max normalization
# norm = (value - min) / (max - min)
def Norm(dataSet):
    minValues = dataSet.min(0)  # column-wise minimum
    maxValues = dataSet.max(0)  # column-wise maximum
    return (dataSet - minValues) / (maxValues - minValues)


# Split into training and test sets (8:2)
def splitTrainTest(dataSet):
    totalLen = len(dataSet)
    trainSet = dataSet[:int(totalLen * 0.8)]  # first 80% for training
    testSet = dataSet[int(totalLen * 0.8):]   # remaining 20% for testing
    return trainSet, testSet


# Euclidean distance
def dist(vec1, vec2):
    return math.sqrt(sum(pow(vec1 - vec2, 2)))


# The KNN algorithm
def knn_classify(trainSet, testSet, trainLabel, k):
    testLabels = []
    for testVec in testSet:
        distList = []
        for trainVec in trainSet:
            distList.append(dist(testVec, trainVec))  # distance between one test vector and one training vector
        idx_k = sorted(range(len(distList)), key=lambda i: distList[i])[:k]  # indices of the k smallest distances
        distList_k_lab = [trainLabel[i] for i in idx_k]  # labels of the k nearest neighbors
        result = Counter(distList_k_lab)  # count label occurrences
        maxValue = result.most_common(1)[0][0]  # the most frequent label
        testLabels.append(maxValue)  # the majority label becomes the prediction
    return testLabels


# Compute the error rate
def errorRate(testLabels, actualLabels):
    errorCount = 0
    totalCount = len(testLabels)
    for i in range(totalCount):
        if testLabels[i] != actualLabels[i]:
            errorCount += 1
    return float(errorCount / totalCount)


# Main entry point
if __name__ == '__main__':
    # Load the data
    filename = 'datingTestSet2.txt'
    initMat, initLab, npInitMat = load_data(filename)
    print('Feature data (list):\n', initMat)
    print('Labels:\n', initLab)
    print('Feature data (np.array):\n', npInitMat)

    # Normalization
    normMat = Norm(npInitMat)
    print('Normalized feature data (np.array):\n', normMat)

    # Split into training and test sets
    trainSet, testSet = splitTrainTest(normMat)
    print('Training set:\n', trainSet)
    print('Training set size:\n', len(trainSet))
    print('Test set:\n', testSet)
    print('Test set size:\n', len(testSet))

    trainLabel = initLab[:int(len(initLab) * 0.8)]   # labels for the first 80%
    actualLabel = initLab[int(len(initLab) * 0.8):]  # labels for the remaining 20%
    print('Training labels:\n', trainLabel)
    print('Number of training labels:\n', len(trainLabel))
    print('Actual test labels:\n', actualLabel)
    print('Number of actual test labels:\n', len(actualLabel))

    # Run KNN
    k = 3
    testLabel = knn_classify(trainSet, testSet, trainLabel, k)
    print('Predicted test labels: \n', testLabel)

    # Error rate
    error_Rate = errorRate(testLabel, actualLabel)
    print('Error rate:\n', error_Rate)
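The double loop over test and training vectors can also be written as one broadcasted NumPy expression, with np.argsort picking the nearest neighbors directly by index. This is an equivalent vectorized sketch, shown on made-up toy points rather than the dating data:

```python
import numpy as np
from collections import Counter

def knn_classify_vec(trainSet, testSet, trainLabel, k):
    # Pairwise Euclidean distances via broadcasting:
    # (n_test, 1, d) - (1, n_train, d) -> (n_test, n_train, d)
    diff = testSet[:, None, :] - trainSet[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    labels = []
    for row in dists:
        idx = np.argsort(row)[:k]               # indices of the k nearest neighbors
        votes = [trainLabel[i] for i in idx]
        labels.append(Counter(votes).most_common(1)[0][0])
    return labels

# Toy sanity check: two obvious clusters (illustrative values only)
train = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]])
lab = [1.0, 1.0, 2.0, 2.0]
test = np.array([[0.05, 0.0], [1.0, 0.95]])
print(knn_classify_vec(train, test, lab, k=3))  # -> [1.0, 2.0]
```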

Output:

Feature data (list):
 [[40920.0, 8.326976, 0.953952], [14488.0, 7.153469, 1.673904], [26052.0, 1.441871, 0.805124], [75136.0, 13.147394, 0.428964], [38344.0, 1.669788, 0.134296],...,[15917.0, 0.0, 1.369726], [6131.0, 0.608457, 0.51222], [67432.0, 6.558239, 0.667579], [30354.0, 12.315116, 0.197068], [69696.0, 7.014973, 1.494616], [33481.0, 8.822304, 1.194177], [43075.0, 10.086796, 0.570455],[2299.0, 3.733617, 0.698269], [5262.0, 2.002589, 1.380184], [4659.0, 2.502627, 0.184223], [17582.0, 6.382129, 0.876581], [27750.0, 8.546741, 0.128706], [9868.0, 2.694977, 0.432818], [18333.0, 3.951256, 0.3333], [3780.0, 9.856183, 0.329181], [18190.0, 2.068962, 0.429927], [11145.0, 3.410627, 0.631838], [68846.0, 9.974715, 0.669787], [26575.0, 10.650102, 0.866627], [48111.0, 9.134528, 0.728045], [43757.0, 7.882601, 1.332446]]
Labels:
 [3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 1.0, 3.0, 1.0, ..., 1.0, 2.0, 3.0, 1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 2.0, 3.0, 3.0, 1.0, 2.0, 1.0, 2.0, 3.0, 1.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 3.0, 3.0, 3.0]
Feature data (np.array):
 [[4.0920000e+04 8.3269760e+00 9.5395200e-01]
 [1.4488000e+04 7.1534690e+00 1.6739040e+00]
 ...
 [4.8111000e+04 9.1345280e+00 7.2804500e-01]
 [4.3757000e+04 7.8826010e+00 1.3324460e+00]]
Normalized feature data (np.array):
 [[0.44832535 0.39805139 0.56233353]
 [0.15873259 0.34195467 0.98724416]
 ...
 [0.52711097 0.43665451 0.4290048 ]
 [0.47940793 0.3768091  0.78571804]]
Training set:
 [[0.44832535 0.39805139 0.56233353]
 [0.15873259 0.34195467 0.98724416]
 ...
 [0.28542943 0.06892523 0.47449629]
 [0.31903192 0.16448638 0.04554814]]
Training set size:
 800
Test set:
 [[0.44832535 0.39805139 0.56233353]
 [0.15873259 0.34195467 0.98724416]
 ...
 [0.48854535 0.5507484  0.1768832 ]
 [0.11112815 0.05446857 0.24446797]]
Test set size:
 200
Training labels:
 [3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 1.0, 3.0, 1.0, 1.0, ..., 1.0, 2.0, 1.0, 3.0, 1.0, 2.0, 2.0, 1.0, 3.0, 2.0, 1.0, 3.0, 3.0, 2.0, 2.0, 2.0, 1.0, 2.0, 2.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 3.0, 1.0]
Number of training labels:
 800
Actual test labels:
 [3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 1.0, 3.0, 1.0, 1.0, ...,1.0, 3.0, 2.0, 3.0, 1.0, 3.0, 2.0, 1.0, 3.0, 2.0, 2.0, 3.0, 2.0, 3.0, 2.0, 1.0, 1.0, 3.0, 1.0, 3.0, 2.0, 2.0, 2.0, 3.0, 2.0]
Number of actual test labels:
 200
Predicted test labels: 
 [3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 1.0, 3.0, 1.0, 1.0, ...,3.0, 3.0, 2.0, 3.0, 1.0, 3.0, 2.0, 3.0, 3.0, 2.0, 2.0, 3.0, 2.0, 3.0, 3.0, 1.0, 2.0, 3.0, 1.0, 3.0, 2.0, 2.0, 2.0, 3.0, 2.0]
Error rate:
 0.06
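As a side note, the counting loop in errorRate has a compact NumPy equivalent: compare the two label sequences elementwise and take the mean of the mismatch mask. The labels below are made up for illustration:

```python
import numpy as np

pred = [3.0, 2.0, 1.0, 1.0]   # hypothetical predictions
truth = [3.0, 2.0, 2.0, 1.0]  # hypothetical ground truth

# Error rate = fraction of positions where prediction != truth
error_rate = float(np.mean(np.array(pred) != np.array(truth)))
print(error_rate)  # -> 0.25
```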

Code link: https://github.com/kevinelstri/Hander-Marchine-Learning-Series/tree/main/K-means
