KNN Algorithm: A Collection of Resources

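Mathematical background:
李航《统计学习方法》 (Li Hang, Statistical Learning Methods) describes the k-nearest neighbor algorithm, the k-nearest neighbor model and its three elements (the distance metric, the choice of k, and the classification decision rule). It then covers the data structure used to implement the algorithm, the kd-tree, and the nearest-neighbor search algorithm built on that tree.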
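
To make the three elements concrete, here is a minimal brute-force sketch in pure Python. It is not the book's code: the function names (`euclidean`, `knn_predict`) and the toy data are made up for illustration, and Euclidean distance with a simple majority vote is just one possible choice of metric and decision rule.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Classify `query` by majority vote among its k nearest training points."""
    # Element 1, distance metric: distance from the query to every training point.
    scored = [(distance(p, query), label)
              for p, label in zip(train_points, train_labels)]
    # Element 2, choice of k: keep only the k closest neighbours.
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    # Element 3, decision rule: majority vote over the neighbours' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
              (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
toy_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(toy_points, toy_labels, (1.1, 1.0), k=3))  # a
print(knn_predict(toy_points, toy_labels, (5.1, 5.0), k=3))  # b
```

This scans every training point for each query, which is exactly the cost the kd-tree is meant to reduce.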

Some additional mathematical details:
https://www.cnblogs.com/eyeszjwang/articles/2429382.html
Explains how the kd-tree works, with a worked example and pseudocode.
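
As a rough companion to that pseudocode, the sketch below builds a kd-tree by splitting on the median along alternating axes and then searches it for a single nearest neighbor. It is not taken from the linked article: the names `KdNode`, `build_kdtree` and `nearest_neighbor` and the sample points are assumptions chosen for illustration, and it requires Python 3.8+ for `math.dist`.

```python
import math


class KdNode:
    """One node of a kd-tree: a point plus the axis it splits on."""

    def __init__(self, point, axis, left=None, right=None):
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right


def build_kdtree(points, depth=0):
    """Recursively split the points at the median of one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2
    return KdNode(points[median], axis,
                  build_kdtree(points[:median], depth + 1),
                  build_kdtree(points[median + 1:], depth + 1))


def nearest_neighbor(node, query, best=None):
    """Return (distance, point) for the stored point closest to `query`."""
    if node is None:
        return best
    d = math.dist(node.point, query)
    if best is None or d < best[0]:
        best = (d, node.point)
    # Descend first into the side of the splitting plane that holds the query.
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest_neighbor(near, query, best)
    # Visit the other side only if the splitting plane is closer than the
    # best distance found so far; otherwise that subtree cannot do better.
    if abs(diff) < best[0]:
        best = nearest_neighbor(far, query, best)
    return best


# Hypothetical sample points in 2-D.
sample_points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(sample_points)
distance, point = nearest_neighbor(tree, (3, 4.5))
print(point)  # (2, 3)
```

The pruning test `abs(diff) < best[0]` is where the savings over a full scan come from: whole subtrees are skipped when they lie entirely beyond the current best distance.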

Implementation in Python:
https://zhuanlan.zhihu.com/p/23191325
Introduces the KNeighborsClassifier class provided by scikit-learn (sklearn), its parameters, and its main methods.
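
For a quick feel of that class, here is a small sketch, assuming scikit-learn is installed. The toy data and variable names are made up for illustration; the calls shown (`fit`, `predict`, `predict_proba`, `kneighbors`, `score`) are the methods used most often.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two small clusters with string labels.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
     [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y = ["low", "low", "low", "high", "high", "high"]

# n_neighbors is k; weights="uniform" gives every neighbour one vote;
# metric="minkowski" with p=2 is the Euclidean distance;
# algorithm="auto" lets the library pick brute force or a tree-based search.
clf = KNeighborsClassifier(n_neighbors=3, weights="uniform",
                           algorithm="auto", p=2, metric="minkowski")
clf.fit(X, y)

queries = [[0.05, 0.05], [1.0, 0.95]]
predictions = clf.predict(queries)              # predicted label per query
probabilities = clf.predict_proba(queries)      # vote share per class
distances, indices = clf.kneighbors(queries, n_neighbors=2)  # nearest training rows
train_accuracy = clf.score(X, y)                # mean accuracy on the given data

print(list(predictions))
print(train_accuracy)
```

Passing `weights="distance"` instead would weight each neighbour's vote by the inverse of its distance, which can help when the k neighbours are not equally close.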

A Python example:
The example was run in a Jupyter notebook and uses the dataset from the KNN chapter of 《机器学习实战》 (Machine Learning in Action).

KNN Algorithm

Convert a (black-and-white) image into a one-dimensional array

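import numpy as np

def re_shape(filename):
    """Read a 32x32 text image (one digit character per pixel) into a flat array of 1024 values."""
    return_vector = np.zeros(1024)
    with open(filename) as inf:
        for i in range(32):
            row = inf.readline()
            for n in range(32):
                # Pixel (row i, column n) goes to position 32*i + n of the flat array.
                return_vector[32 * i + n] = int(row[n])
    return return_vector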

Get the class labels (from the file names)

from sklearn.neighbors import KNeighborsClassifier
import os
labels = []
# Path on the author's machine; point this at your own copy of trainingDigits.
file_forder = "E:\\DataMining\\Project\\MLBook\\机器学习实战源代码\\machinelearninginaction\\Ch02\\digits\\trainingDigits"
trainingFileList = os.listdir(file_forder)
#print(trainingFileList)
for name in trainingFileList:
    # The class label is the part of the file name before the first underscore.
    labels.append(name.split("_")[0])
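
To see what the label extraction does, here is a tiny sketch with made-up file names. It assumes the files follow a `<digit>_<index>.txt` naming pattern; `label_from_filename` and the example names are hypothetical, not part of the original notebook.

```python
def label_from_filename(name):
    """Return the text before the first underscore, used as the class label."""
    return name.split("_")[0]


# Hypothetical file names following the assumed <digit>_<index>.txt pattern.
example_files = ["0_13.txt", "7_2.txt", "9_104.txt"]
example_labels = [label_from_filename(name) for name in example_files]
print(example_labels)  # ['0', '7', '9']
```

Note that the labels stay strings. The classifier accepts string labels, so no conversion to `int` is needed as long as the test labels are produced the same way.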

Get the training data

X_train = []
for name in trainingFileList:
    # Iterate in the same order as the labels so rows and labels stay aligned.
    filename = os.path.join(file_forder,name)
    row = re_shape(filename)
    X_train.append(list(row))
#print(X_train[:5])

Get the test data labels

testLabels = []
# Path on the author's machine; point this at your own copy of testDigits.
file_forder = "E:\\DataMining\\Project\\MLBook\\机器学习实战源代码\\machinelearninginaction\\Ch02\\digits\\testDigits"
testFileList = os.listdir(file_forder)
#print(testFileList)
for name in testFileList:
    testLabels.append(name.split("_")[0])

Get the test data, then train and predict

X_test = []
for name in testFileList:
    filename = os.path.join(file_forder,name)
    row = re_shape(filename)
    X_test.append(list(row))
# k = 1 here; metric='minkowski' with p=2 is the Euclidean distance.
clf = KNeighborsClassifier(n_neighbors=1, weights='uniform', algorithm='auto', p=2, metric='minkowski', metric_params=None)
clf.fit(X_train,labels)
# Predict a label for every test image.
Y_prected = clf.predict(X_test)

Evaluation

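from sklearn.metrics import accuracy_score

# accuracy_score takes (y_true, y_pred).
# First run: the classifier above rebuilt with n_neighbors=5.
score = accuracy_score(testLabels,Y_prected)
print("When k is 5,the accuracy score is {}".format(score))
# Output: When k is 5,the accuracy score is 0.9809725158562368

# Second run: n_neighbors=1, as in the classifier shown above.
score = accuracy_score(testLabels,Y_prected)
print("When k is 1,the accuracy score is {}".format(score))
# Output: When k is 1,the accuracy score is 0.9862579281183932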
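
The accuracy score is simply the fraction of test samples whose predicted label equals the true label. A minimal pure-Python sketch of that calculation follows; the `accuracy` function and the example labels are hypothetical and only illustrate the idea, they are not how scikit-learn implements it.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: three of the four predictions are right.
true_labels = ["0", "1", "2", "3"]
predicted_labels = ["0", "1", "2", "8"]
print(accuracy(true_labels, predicted_labels))  # 0.75
```

Because the comparison is symmetric, swapping the two arguments gives the same number for accuracy, though that is not true of metrics such as precision or recall.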
