Classification idea: given an input x, compute its distance to every sample in the known dataset, sort the distances in ascending order, take the k nearest samples, and output the class that occurs most often among them as the predicted class of x.
Regression idea: output the mean of the k nearest samples' target values as the regression prediction.
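The regression variant described above can be sketched in a few lines; the toy 1-D data here is hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical toy training data
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.0, 2.0, 3.0, 10.0])

def knn_regress(x, X_train, y_train, k):
    # Euclidean distance from x to every training sample
    d = np.sqrt(np.sum((x - X_train) ** 2, axis=1))
    # Average the targets of the k nearest neighbors
    return y_train[np.argsort(d)[:k]].mean()

print(knn_regress(np.array([2.5]), X_train, y_train, k=2))  # → 2.5
```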
Key points:
Choosing k: start with a fairly small value, then tune it with cross-validation.
Choosing a distance metric:
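One way to tune k with cross-validation is to score a few candidates and keep the best; this sketch uses scikit-learn's `cross_val_score`, and the candidate range [1, 3, 5, 7, 9] is an assumption:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy for several small odd k values
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in [1, 3, 5, 7, 9]}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```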
Euclidean distance: for two n-dimensional vectors x and y,

$$D(x, y)=\sqrt{\left(x_{1}-y_{1}\right)^{2}+\left(x_{2}-y_{2}\right)^{2}+\ldots+\left(x_{n}-y_{n}\right)^{2}}=\sqrt{\sum_{i=1}^{n}\left(x_{i}-y_{i}\right)^{2}}$$
Manhattan distance: for two n-dimensional vectors x and y,

$$D(x, y)=\left|x_{1}-y_{1}\right|+\left|x_{2}-y_{2}\right|+\ldots+\left|x_{n}-y_{n}\right|=\sum_{i=1}^{n}\left|x_{i}-y_{i}\right|$$
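Both metrics reduce to one NumPy expression each; for example, with two toy vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # sqrt(9 + 16 + 0) = 5.0
manhattan = np.sum(np.abs(x - y))          # 3 + 4 + 0 = 7.0
print(euclidean, manhattan)
```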
Decision rule for classification: majority voting. Prefer an odd k; with an even k, two classes can tie for the most votes.
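The tie problem with an even k is easy to demonstrate with `np.bincount`: when two classes receive the same number of votes, `np.argmax` silently breaks the tie toward the lower label:

```python
import numpy as np

# Labels of the k = 4 nearest neighbors: two votes each for class 0 and class 1
labels = np.array([0, 1, 0, 1])
counts = np.bincount(labels)       # [2, 2] — a tie
print(np.argmax(counts))           # → 0 (ties break toward the lower index)

# With an odd k = 3 the vote is decisive
labels_odd = np.array([0, 1, 1])
print(np.argmax(np.bincount(labels_odd)))  # → 1
```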
KNN advantages: simple to implement, requires no explicit training phase, and naturally handles multi-class problems.
KNN drawbacks: prediction is slow on large datasets (the distance to every training sample must be computed), it is sensitive to the choice of k and to feature scaling, and it degrades in high dimensions.
Steps
# Import libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Build the dataset
dataset = load_iris()
X = dataset.data
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=777)
def KNN_EUdist(X_test, X_train, y_train, k):
    # Initialization
    num_test = X_test.shape[0]
    y_pred = np.zeros(num_test)
    # Compute the distance between each test sample and every training sample
    for i in range(num_test):
        d = np.sum((X_test[i] - X_train) ** 2, axis=1)  # squared Euclidean distance, summed per row
        d = np.sqrt(d)  # take the square root
        # d = np.sum(np.abs(X_test[i] - X_train), axis=1)  # Manhattan distance, summed per row
        d_k_index = np.argsort(d)[:k]  # indices of the k smallest distances
        pred_k_label = y_train[d_k_index]
        y_pred[i] = np.argmax(np.bincount(pred_k_label))  # most frequent class among the k neighbors
    return y_pred
def acc(y_pred, y_test):
    '''
    Compute the accuracy of KNN_EUdist()
    '''
    num_test = y_test.shape[0]
    cnt = 0
    for i in range(num_test):
        if y_pred[i] == y_test[i]:
            cnt += 1
    accuracy = cnt / num_test
    return accuracy
k = 5  # the original notes do not record the value used; 5 is chosen here as an example
y_pred = KNN_EUdist(X_test, X_train, y_train, k)
acc(y_pred, y_test)
>>>0.97
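As a sanity check, the same split can be fed to scikit-learn's `KNeighborsClassifier`; this is a sketch, and k = 5 here is an assumption rather than the value used above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=777)

clf = KNeighborsClassifier(n_neighbors=5)  # Euclidean distance by default
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```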