k-nearest neighbors (k-NN) is a basic classification and regression method. The input is an instance's feature vector, corresponding to a point in feature space; the output is the instance's class, which may be one of multiple classes.
Given a training dataset and a new input instance, find the k training instances nearest to it; the class to which the majority of these k instances belong is the class assigned to the input instance.
The case k = 1 is a special case called the nearest-neighbor algorithm: for an input instance (feature vector) x, it assigns x the class of the training point nearest to x.
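The k = 1 case above can be sketched in a few lines of NumPy; the point set and query below are illustrative, not from the notes:

```python
# 1-NN (k = 1): assign x the label of its single nearest training point
import numpy as np

X_train = np.array([[1.0, 1.0], [5.0, 1.0], [4.0, 4.0]])  # illustrative points
y_train = np.array([0, 1, 1])

x = np.array([1.2, 0.9])                                  # query point
nearest = np.argmin(np.linalg.norm(X_train - x, axis=1))  # index of closest point
print(y_train[nearest])
```

Here the query is closest to (1, 1), so it inherits that point's label.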
3.2 The k-NN model
Basic elements of the model: the distance metric, the choice of k, and the classification decision rule.
Distance metric: typically the Euclidean distance; more generally the Lp (Minkowski) distance, implemented below.
Choice of k: usually a relatively small value; cross-validation is commonly used to select the optimal k.
Classification decision rule: usually majority voting.
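Putting the k-selection advice into code: a sketch using scikit-learn's cross_val_score (the dataset and the candidate range 1..10 are illustrative assumptions):

```python
# Choosing k by 5-fold cross-validation (sketch)
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()  # mean accuracy over 5 folds

best_k = max(scores, key=scores.get)  # k with the highest cross-validated accuracy
print(best_k)
```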
3.3 kd-trees
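The notes stop at the heading here; as a rough sketch, kd-tree construction and nearest-neighbor search can be tried with scipy.spatial.KDTree (the point set and query point below are illustrative):

```python
# kd-tree nearest-neighbor search via scipy (sketch)
import numpy as np
from scipy.spatial import KDTree

points = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
tree = KDTree(points)             # build the kd-tree over the points
dist, idx = tree.query([3, 4.5])  # distance to and index of the nearest neighbor
print(points[idx], dist)
```

The point of the kd-tree is that `query` prunes subtrees whose bounding regions cannot contain a closer point, so search is typically much faster than a linear scan over all training points.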
Example 3.1
# Lp distance metric; defaults to the Euclidean distance (p = 2)
import math

def L(x, y, p=2):
    if len(x) == len(y) and len(x) > 1:
        total = 0  # renamed from sum to avoid shadowing the built-in
        for i in range(len(x)):
            total += math.pow(abs(x[i] - y[i]), p)
        return math.pow(total, 1 / p)
    else:
        return 0
# Example 3.1: which of x2, x3 is nearer to x1 under the Lp distance?
x1 = [1, 1]
x2 = [5, 1]
x3 = [4, 4]
for i in range(1, 5):
    r = {'1-{}'.format(c): L(x1, c, p=i) for c in [x2, x3]}
    print(min(zip(r.values(), r.keys())))
The distance in the example can also be computed directly with a NumPy function:
import numpy as np
np.linalg.norm(np.array(x1) - np.array(x2), ord=2)
KNN code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter

class KNN:
    def __init__(self, X_train, y_train, n_neighbors=3, p=2):
        self.X_train = X_train
        self.y_train = y_train
        self.n = n_neighbors  # number of neighbors k
        self.p = p            # order of the Lp distance (p = 2: Euclidean)

    def predict(self, X):
        # Seed the candidate list with the first n training points
        knn_list = []
        for i in range(self.n):
            dist = np.linalg.norm(X - self.X_train[i], ord=self.p)
            knn_list.append((dist, self.y_train[i]))  # fixed: was y_train[i]
        # Scan the remaining points, replacing the current farthest
        # candidate whenever a closer point is found
        for i in range(self.n, len(self.X_train)):
            max_index = knn_list.index(max(knn_list, key=lambda x: x[0]))
            dist = np.linalg.norm(X - self.X_train[i], ord=self.p)
            if knn_list[max_index][0] > dist:
                knn_list[max_index] = (dist, self.y_train[i])
        # Majority vote among the k nearest labels
        knn = [k[-1] for k in knn_list]
        count_pairs = Counter(knn)
        max_count = sorted(count_pairs.items(), key=lambda x: x[1])[-1][0]
        return max_count

    def score(self, X_test, y_test):
        right_count = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right_count += 1
        return right_count / len(X_test)
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
data = np.array(df.iloc[:100, [0, 1, -1]])  # first 100 rows (two classes): two features + label
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # train/test split
clf = KNN(X_train, y_train)
print(clf.score(X_test, y_test))
test_point = [4.7, 3.2]
print(clf.predict(test_point))
plt.scatter(df.iloc[:50, 0], df.iloc[:50, 1], label='0')
plt.scatter(df.iloc[50:100, 0], df.iloc[50:100, 1], label='1')
plt.plot(test_point[0], test_point[1], 'ro', label='test_point')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.legend()
plt.show()
The test point is classified as 0; the scatter plot produced above shows it in red among the two classes.
scikit-learn:
from sklearn.neighbors import KNeighborsClassifier
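The hand-written class above can be swapped for scikit-learn's built-in classifier. A minimal sketch on the same two-feature, two-class subset of iris (the fixed random_state is an added assumption, for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:100, :2], iris.target[:100]  # sepal length/width, classes 0 and 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3, p=2)  # p=2: Euclidean distance
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.predict([[4.7, 3.2]]))  # the same test point as above
```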