The kNN Algorithm
Basic Principle
The nearest neighbors method (the kNN algorithm, also known as the k-nearest neighbors algorithm) is a non-parametric statistical method used for classification and regression.
- In k-NN classification, the output is a class membership. An object is classified by a "majority vote" of its neighbors: the most common class among its k nearest neighbors (k is a positive integer, usually small) determines the class assigned to the object. If k = 1, the object is simply assigned the class of its single nearest neighbor.
Simply put, we place the sample to be predicted in the feature space (for convenience, the demonstrations below use a two-dimensional feature space), find the k points closest to it, and then let the labels of those points vote to determine the predicted label of the sample.
For ease of presentation we use the Euclidean distance, i.e.
$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$
For now we also ignore distance weighting to simplify the explanation, and we do not distinguish between training and test sets.
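As a quick illustration of this distance, here is a minimal sketch (the two feature vectors are made up for this example) computing it with NumPy:

import numpy as np

# two hypothetical 2-D feature vectors
a = np.array([3.0, 2.0])
b = np.array([7.0, 4.0])

# Euclidean distance: square root of the sum of squared coordinate differences
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)  # sqrt(4**2 + 2**2) ≈ 4.472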
A Simple Implementation
As the walkthrough below shows, kNN can be said to be an algorithm that requires no training process.
The k-nearest neighbors algorithm is quite special: it can be considered an algorithm with no model at all.
To stay consistent with other algorithms, the training data set itself can be regarded as the model.
First, create a toy data set and a sample x to be predicted:
import numpy as np
import matplotlib.pyplot as plt
raw_data_X = [[3.393, 2.331],
              [3.110, 1.782],
              [1.343, 3.368],
              [3.582, 4.679],
              [2.280, 2.967],
              [7.423, 4.697],
              [5.745, 3.534],
              [9.172, 2.511],
              [7.792, 3.424],
              [7.940, 0.792]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)
x = np.array([8.094, 3.366])
plt.scatter(X_train[y_train==0,0], X_train[y_train==0,1], color='g')
plt.scatter(X_train[y_train==1,0], X_train[y_train==1,1], color='r')
plt.scatter(x[0], x[1], color='b')
plt.show()
Plot the data set and the sample to be predicted on the axes.
Here we assume that label 0 (the green points) represents early-stage tumors and label 1 (the red points) represents late-stage tumors, while the blue point is the sample to be predicted. The points closest to it are all red, so unfortunately the patient is probably late-stage.
Writing the Prediction Code
kNN_simple_implement.py
import numpy as np
from math import sqrt
from collections import Counter
def kNN_classify(k, X_train, y_train, x):
    assert 1 <= k <= X_train.shape[0], "k must be valid"
    assert X_train.shape[0] == y_train.shape[0], \
        "the size of X_train must be equal to the size of y_train"
    assert X_train.shape[1] == x.shape[0], \
        "the feature number of x must be equal to X_train"
    # Euclidean distance from x to every training sample
    distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in X_train]
    # indices of the training samples sorted by distance, nearest first
    nearest = np.argsort(distances)
    # labels of the k nearest neighbors
    topK_y = [y_train[i] for i in nearest[:k]]
    # majority vote: the most common label wins
    votes = Counter(topK_y)
    return votes.most_common(1)[0][0]
As you can see, the key code is short: just five simple lines.
distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in X_train]
nearest = np.argsort(distances)
topK_y = [y_train[i] for i in nearest[:k]]
votes = Counter(topK_y)
votes.most_common(1)[0][0]
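To make those lines concrete, here is a minimal sketch (with made-up distances and labels) of how np.argsort and Counter.most_common work together:

import numpy as np
from collections import Counter

distances = [2.5, 0.7, 1.9, 0.3]         # hypothetical distances to 4 training samples
labels = [0, 1, 0, 1]                    # their hypothetical labels

nearest = np.argsort(distances)          # array([3, 1, 2, 0]): indices sorted by distance
topK = [labels[i] for i in nearest[:3]]  # labels of the 3 nearest: [1, 1, 0]
print(Counter(topK).most_common(1))      # [(1, 2)]: label 1 wins with 2 votes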
Load the code into Jupyter:
%run kNN_simple_implement.py
Make a prediction:
predict_y = kNN_classify(k=3, X_train=X_train, y_train=y_train, x=x)
predict_y
1
The predicted label matches our guess: 1, representing late stage. I used k = 3 here; in fact the 5 nearest points should all be late-stage samples, so the code's prediction is accurate.
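To verify that claim, a quick check (a sketch reusing the X_train, y_train, and x defined above) is to sort the training samples by distance and inspect the labels of the closest few:

import numpy as np

# distances from x to every training sample, computed row-wise
distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
nearest = np.argsort(distances)
print(y_train[nearest[:5]])  # [1 1 1 1 1]: the 5 nearest are all late-stage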
Using kNN from scikit-learn
Now let's see how to use the kNN algorithm in scikit-learn. The samples to be predicted are placed in a matrix, which allows batch prediction.
from sklearn.neighbors import KNeighborsClassifier
kNN_classifier = KNeighborsClassifier(n_neighbors=6)
kNN_classifier.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=6, p=2,
weights='uniform')
# put the data to be predicted into a matrix
X_predict = x.reshape(1, -1)
y_predict = kNN_classifier.predict(X_predict)
y_predict[0]
1
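Because predict takes a matrix, several samples can be scored in one call. A minimal sketch (the second sample point below is made up for illustration):

import numpy as np

# two samples stacked into one matrix, one row per sample
X_batch = np.array([[8.094, 3.366],
                    [2.500, 3.000]])
print(kNN_classifier.predict(X_batch))  # e.g. [1 0], one label per row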
Refactoring the kNN Code
As shown above, our prediction matches sklearn's, but our code is just a single function rather than an object-oriented design. Below we rewrite it following sklearn's pattern.
kNN.py
import numpy as np
from math import sqrt
from collections import Counter
class kNNClassifier():

    def __init__(self, k):
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        # mirroring sklearn's API: "fitting" just stores the training set
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k"
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        assert self._X_train is not None and self._y_train is not None, \
            "must fit before predict"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "the feature number of X_predict must be equal to X_train"
        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        # Euclidean distances from x to every training sample
        distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._X_train]
        nearest = np.argsort(distances)
        # majority vote among the k nearest neighbors
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)
        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "kNN(k=%d)" % self.k
%run kNN.py
kNN_classifier_re = kNNClassifier(6)
kNN_classifier_re.fit(X_train, y_train)
kNN(k=6)
y_predict_re = kNN_classifier_re.predict(X_predict)
y_predict_re[0]
1
As you can see, the result is the same. Often we can learn programming ideas by imitating and studying how other people's frameworks encapsulate their code.