KNN (K-Nearest Neighbor) Algorithm

Background

The K-Nearest Neighbor (KNN) algorithm is a well-known statistical method from pattern recognition and holds an important place among machine learning classification algorithms. It is theoretically mature: one of the simplest machine learning algorithms, the most basic of the instance-based learning methods, and one of the best-performing text classification algorithms.

How KNN Works

We start with a collection of samples, called the training set, in which every sample carries a label; that is, we know which class each sample belongs to. When unlabeled data arrives, each of its features is compared against the corresponding features of the samples in the training set, and the class labels of the most similar samples (the nearest neighbors) are extracted. In general, only the k most similar samples in the training set are considered; this is where the k in "k-nearest neighbor" comes from, and k is usually an integer no larger than 20. Finally, the class that appears most often among these k most similar samples is chosen as the class of the new data.

Note: KNN has no explicit training phase. It is the classic example of "lazy learning": during training it merely stores the data (so training cost is essentially zero) and defers all computation until a test sample arrives.
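
A minimal sketch of this decision rule, assuming a NumPy feature matrix X_train, a label array y_train, plain Euclidean distance, and unweighted majority voting (the helper name knn_classify is hypothetical):

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    distances = np.linalg.norm(X_train - x, axis=1)       # distance from x to every training sample
    nearest_labels = y_train[np.argsort(distances)[:k]]   # labels of the k nearest neighbors
    return Counter(nearest_labels).most_common(1)[0][0]   # the most frequent label wins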

Advantages and Disadvantages of KNN

Advantages of KNN:

1) Simple and effective.
2) The cost of retraining is low (changes to the class taxonomy and to the training set are common in web and e-commerce applications).
3) Computation time and space are linear in the size of the training set (acceptable in many settings).
4) Because KNN determines class membership from a limited number of nearby samples rather than from discriminant class boundaries, it is better suited than other methods to sample sets whose class regions intersect or overlap heavily.
5) The algorithm works well for automatic classification of classes with large sample sizes; classes with small sample sizes are more prone to misclassification under it.

Disadvantages of KNN:

1) KNN is a lazy learning method (essentially no learning happens up front), so eager learning algorithms can be much faster.
2) Class scores are not normalized (unlike probability scores).
3) The output is not very interpretable; a decision tree, for example, produces far more interpretable output.
4) A major weakness in classification arises when the samples are imbalanced: if one class has a very large sample size and the others are small, the k neighbors of a new sample may be dominated by the large class. Plain KNN only counts the nearest neighbors, so whether the large-class samples are genuinely close to the query or merely numerous, it is their count alone that decides the result. Distance weighting (giving neighbors closer to the sample larger weights) mitigates this; see the sketch after this list.
5) The computational cost is high. A common remedy is to prune the known samples in advance, removing those that contribute little to classification.
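
As a minimal sketch of the distance weighting mentioned in point 4, each of the k nearest neighbors contributes a weight inversely proportional to its distance rather than a flat vote of 1 (the helper name weighted_vote is hypothetical; the small constant guards against division by zero, as in the full implementation below):

def weighted_vote(topk_labels, topk_distances):
    scores = {}
    for label, dist in zip(topk_labels, topk_distances):
        scores[label] = scores.get(label, 0.0) + 1.0 / (dist + 1e-8)  # closer neighbors weigh more
    return max(scores, key=scores.get)  # label with the largest accumulated weight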

Python Implementation

The implementation below applies mean-variance normalization (each feature is standardized as z = (x - mean) / std, with the mean and standard deviation computed on the training set), then uses distance-weighted voting to classify the iris dataset.

Source Code


#!/usr/bin/env python 
# -*- coding: UTF-8 -*-

import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score

class KNN:

    def __init__(self, k, p):
        """Initialize the kNN classifier with k neighbors and Minkowski order p."""
        assert k >= 1, "k must be valid"
        assert p >= 1, "p must be valid for a Minkowski distance"
        self.k = k
        self.p = p
        self._X_train = None
        self._y_train = None
        self.y_possible = None
    
    def data_split(self, X, y, test_ratio=0.2, seed=None):
        """Split the data set into train and test parts according to test_ratio."""
        assert X.shape[0] == y.shape[0], \
            "the size of X must be equal to the size of y"
        assert 0.0 <= test_ratio <= 1.0, \
            "test_ratio must be valid"

        if seed is not None:  # "if seed:" would silently ignore seed=0
            np.random.seed(seed)

        # shuffle the original data
        shuffle_indexes = np.random.permutation(len(X))
        test_size = int(len(X) * test_ratio)
        test_indexes = shuffle_indexes[:test_size]
        train_indexes = shuffle_indexes[test_size:]

        X_train = X[train_indexes]
        y_train = y[train_indexes]

        X_test = X[test_indexes]
        y_test = y[test_indexes]

        # collect all possible values of y
        y_list = y_train.tolist()  # convert the training label array to a list
        y_possible_list = sorted(set(y_list), key=y_list.index)  # deduplicate while keeping the original order
        self.y_possible = np.array(y_possible_list)  # convert the list back to an array

        # normalize the data, using training-set statistics for both parts
        X_train, X_test = self._scaling(X_train, X_train), self._scaling(X_train, X_test)

        return X_train, y_train, X_test, y_test

    def _scaling(self, X_train, original_X):
#        """min-max normalization (alternative, kept for reference)"""
#        scaling_X = np.ones(shape=original_X.shape)
#        for i in range(original_X.shape[1]):
#            scaling_X[:, i] = (original_X[:, i] - np.min(X_train[:, i])) / (np.max(X_train[:, i]) - np.min(X_train[:, i]))

        """mean-variance (standard score) normalization"""
        scaling_X = np.zeros(shape=original_X.shape)
        for i in range(original_X.shape[1]):
            scaling_X[:, i] = (original_X[:, i] - np.mean(X_train[:, i])) / np.std(X_train[:, i])
        return scaling_X

    def fit(self, X_train, y_train):
        """根据训练数据集X_train和y_train训练kNN分类器"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k."

        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        """给定待预测数据集X_predict,返回表示X_predict的结果向量"""
        assert self._X_train is not None and self._y_train is not None, \
                "must fit before predict!"
        assert X_predict.shape[1] == self._X_train.shape[1], \
                "the feature number of X_predict must be equal to X_train"

        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """给定单个待预测数据x,返回x的预测结果值"""
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must be equal to X_train"

        # Minkowski distance of order p: (sum_i |xi - yi|^p)^(1/p); 1.0/p keeps the exponent a float
        distances = [pow(np.sum(abs(x_train - x) ** self.p), 1.0 / self.p)
                     for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        topK_distance = [distances[i] for i in nearest[:self.k]]
        votes_count = np.zeros(shape=(self.y_possible.shape))

        # distance-weighted voting: closer neighbors receive larger weights
        for j in range(self.k):
            vote_id = np.argwhere(self.y_possible == topK_y[j])
            votes_count[vote_id] += 1 / (topK_distance[j] + 1e-8)  # avoid a zero denominator
                
        predict_value_id = np.argmax(votes_count)
        predict_value = self.y_possible[predict_value_id]

        return predict_value

    def score(self, X_test, y_test):
        """根据测试数据集 X_test 和 y_test 确定当前模型的准确度"""

        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "KNN(k=%d)" % self.k


# Classify iris species with the KNN algorithm

iris = datasets.load_iris()
X = iris.data
y = iris.target

iris_knn = KNN(3, 7)  # k=3 neighbors, Minkowski distance of order p=7
X_train, y_train, X_test, y_test = iris_knn.data_split(X, y)
iris_knn.fit(X_train, y_train)
score = iris_knn.score(X_test, y_test)
print(score)

Using sklearn

#!/usr/bin/env python 
# -*- coding: UTF-8 -*-
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler


if __name__=='__main__':
    iris = datasets.load_iris()
    print(iris.DESCR)  # print the dataset description
    X = iris.data
    y = iris.target

    # split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=None)

    # normalize the data (fit the scaler on the training set only)
    scaler = StandardScaler()  # use a new name so the StandardScaler class itself is not shadowed
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    # run the KNN classifier
    knn_clf = KNeighborsClassifier(n_neighbors=3)
    knn_clf.fit(X_train, y_train)
    print(knn_clf.score(X_test, y_test))

    param_grid = [
        {
            'weights': ['uniform'],
            'n_neighbors': [i for i in range(1, 6)]
        },
        {
            'weights': ['distance'],
            'n_neighbors': [i for i in range(1, 6)],
            'p': [i for i in range(1, 4)]
        }
    ]

    # tune the hyperparameters with grid search
    grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
    grid_search.fit(X_train, y_train)

    print(grid_search.best_score_)
    print(grid_search.best_params_)

    knn_clf = grid_search.best_estimator_
    print(knn_clf.score(X_test, y_test))
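
As a side note, the distance-weighted voting implemented by hand in the KNN class above is built into sklearn as well: passing weights='distance' to KNeighborsClassifier (the same option searched over in param_grid) weights neighbors by the inverse of their distance, for example:

knn_clf = KNeighborsClassifier(n_neighbors=3, weights='distance')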

Python Tips

Shuffle the original data and split it into train and test indexes:
shuffle_indexes = np.random.permutation(len(X))
test_size = int(len(X)*test_ratio)
test_indexes = shuffle_indexes[:test_size]
train_indexes = shuffle_indexes[test_size:]

Converting between a list and a NumPy array:
y_list = y_train.tolist()  # convert the training label array to a list
self.y_possible = np.array(y_possible_list)  # convert the list to an array
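
Removing duplicates from a list while preserving first-occurrence order (used in data_split to collect the possible y values):
y_possible_list = sorted(set(y_list), key=y_list.index)  # set() deduplicates; key=y_list.index restores the original order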
