The k-nearest neighbor method (k-NN) is a basic algorithm for classification and regression. Its three key elements are: the choice of k, the distance metric, and the classification decision rule.
The k-NN model:
$N_k(x)$ denotes the neighborhood formed by the k training points nearest to $x$. The predicted class is
$y = \arg \max_{c_{j}} \sum\limits_{x_i \in N_k(x)} I(y_i = c_j)$
1. Choice of k
A small k gives a more complex model with a smaller approximation error, but it overfits easily;
a large k gives a simpler model with a larger approximation error.
In practice, k is usually chosen to be relatively small, and the best value can be determined by cross-validation. In addition, k is usually taken to be odd, to avoid ties between classes.
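A minimal sketch of picking k by cross-validation with sklearn (the candidate range of odd k values and the 5-fold CV setting are illustrative assumptions, not from the original):
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)
candidate_ks = list(range(1, 20, 2))   # odd values only, to avoid ties in the vote
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
             for k in candidate_ks]
best_k = candidate_ks[int(np.argmax(cv_scores))]
print(best_k, max(cv_scores))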
2. Distance metric
The Euclidean distance is typically used, or more generally the $L_p$ distance:
$L_p(x_i, x_j) = \left( \sum\limits_{l} |x_i^{(l)} - x_j^{(l)}|^p \right)^{1/p}$
Here, $l$ indexes the feature dimensions; $p = 2$ gives the Euclidean distance.
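A quick numpy sketch of the $L_p$ distance for a couple of values of p (the two sample points are made up for illustration):
import numpy as np

def lp_distance(xi, xj, p=2):
    # L_p (Minkowski) distance between two feature vectors
    return (np.abs(xi - xj) ** p).sum() ** (1.0 / p)

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])
print(lp_distance(xi, xj, p=1))   # Manhattan distance: 5.0
print(lp_distance(xi, xj, p=2))   # Euclidean distance: about 3.606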
3. Classification decision rule
Majority voting: the predicted class is the class that appears most often among the k nearest points.
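The majority vote can be written directly with collections.Counter; a minimal sketch (the neighbor labels below are made up):
from collections import Counter

neighbor_labels = [1, 0, 1]   # labels y_i of the k nearest points
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)              # -> 1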
A kd-tree (k-dimensional tree) is a data structure that partitions a k-dimensional data space.
Building a kd-tree:
1. Take the sample whose value in the current dimension is the median as the parent node, and split the remaining data into the left and right subtrees;
2. Move to the next dimension and repeat step 1 on the left and right subtrees.
Nearest neighbor search:
1. Find the "best leaf node" of the kd-tree for the target point.
Concretely, start from the root and branch according to the target point's value in each node's splitting dimension, until a leaf node is reached.
2. Traverse upwards.
Concretely, start from the "best leaf node"; if the current node is closer to the target point than the "best leaf node", it becomes the current best point, and its brother node is also checked;
the search ends when the root is reached. (An exact library-based query is sketched below for reference.)
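For reference, an exact kd-tree query is available in sklearn.neighbors.KDTree; a small sketch with random data (the data and the query point are made up):
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                    # 100 random 3-D points
tree = KDTree(X)                        # build the kd-tree
dist, ind = tree.query([[0.5, 0.5, 0.5]], k=3)   # 3 nearest neighbors of the query point
print(ind[0], dist[0])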
With a brute-force implementation over $N$ samples of $D$-dimensional data, the total cost is $O(DN^2)$: for each query point, computing the distances to all samples costs $O(DN)$ and picking out the k nearest neighbors costs $O(N)$.
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
## Example 1: iris for classification (3 classes)
# X, y = datasets.load_iris(return_X_y=True)
# Example 2
X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# my k-NN
class KNN():
    def __init__(self, data, K=3):
        # data is a (X, y) tuple; K is the number of neighbors
        self.X = data[0]
        self.y = data[1]
        self.K = K
    def fit(self, X_test):
        # pairwise squared Euclidean distances between X_test and the training set
        diffX = np.repeat(X_test, len(self.X), axis=0) - np.tile(self.X, (len(X_test), 1))
        square_diffX = (diffX**2).sum(axis=1).reshape(len(X_test), len(self.X))
        sorted_index = square_diffX.argsort()
        predict = [0 for _ in range(len(X_test))]
        for i in range(len(X_test)):
            # majority vote among the K nearest training points
            class_count = {}
            for j in range(self.K):
                vote_label = self.y[sorted_index[i][j]]
                class_count[vote_label] = class_count.get(vote_label, 0) + 1
            sorted_count = sorted(class_count.items(), key=lambda x: x[1], reverse=True)
            predict[i] = sorted_count[0][0]
        return predict
    def predict(self, X_test):
        return self.fit(X_test)
    def score(self, X, y):
        y_pred = self.predict(X)
        return 1 - np.count_nonzero(y - y_pred) / len(y)
knn = KNN((X_train,y_train), K=3)
print(knn.score(X_test,y_test))
Output: 0.9415204678362573
Using the sklearn API:
# sklearn API
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
print(neigh.score(X_test,y_test))
Output: 0.9415204678362573
As long as the two k-NN implementations use the same choice of k and the same distance definition, the results are exactly the same.
Now define the kd-tree. Time complexity: $O[D N \log (N)]$.
# kd-tree
import math

class KDNode:
    '''
    value: [X, y]
    '''
    def __init__(self, value=None, parent=None, left=None, right=None):
        self.value = value
        self.parent = parent
        self.left = left
        self.right = right
    @property
    def brother(self):
        # the sibling of this node in the tree (None for the root)
        if not self.parent:
            bro = None
        else:
            if self.parent.left is self:
                bro = self.parent.right
            else:
                bro = self.parent.left
        return bro
class KDTree():
    def __init__(self, K=3):
        self.root = KDNode()
        self.K = K
    def _build(self, data, axis=0, parent=None):
        '''
        data: [X, y]
        '''
        # choose the median point along the current axis as the split node
        if len(data) == 0:
            root = KDNode()
            return root
        data = np.array(sorted(data, key=lambda x: x[axis]))
        median = int(len(data) / 2)
        loc = data[median]
        root = KDNode(loc, parent=parent)
        # cycle to the next feature dimension (the last column is the label)
        new_axis = (axis + 1) % (len(data[0]) - 1)
        if len(data[:median, :]) == 0:
            root.left = None
        else:
            root.left = self._build(data[:median, :], axis=new_axis, parent=root)
        if len(data[median+1:, :]) == 0:
            root.right = None
        else:
            root.right = self._build(data[median+1:, :], axis=new_axis, parent=root)
        self.root = root
        return root
    def fit(self, X, y):
        # concatenate X and y so every node stores [features..., label]
        data = np.concatenate([X, y.reshape(-1, 1)], axis=1)
        root = self._build(data)
        return root
    def _get_eu_distance(self, arr1: np.ndarray, arr2: np.ndarray) -> float:
        return ((arr1 - arr2) ** 2).sum() ** 0.5
    def _search_node(self, current, point, result={}, class_count={}):
        # Get the current farthest candidate (max_node, max_distance).
        if not result:
            max_node = None
            max_distance = float('inf')
        else:
            # find the farthest (node, distance) pair among the current candidates
            max_node, max_distance = sorted(result.items(), key=lambda n: n[1], reverse=True)[0]
        node_dist = self._get_eu_distance(current.value[:-1], point)
        if len(result) == self.K:
            # replace the farthest candidate if the current node is closer
            if node_dist < max_distance:
                result.pop(max_node)
                result[current] = node_dist
                class_count[current.value[-1]] = class_count.get(current.value[-1], 0) + 1
                class_count[max_node.value[-1]] = class_count.get(max_node.value[-1], 1) - 1
        elif len(result) < self.K:
            result[current] = node_dist
            class_count[current.value[-1]] = class_count.get(current.value[-1], 0) + 1
        return result, class_count
    def search(self, point):
        # 1. descend to the leaf region ("best leaf node") that contains the query point
        current = self.root
        axis = 0
        while current:
            if point[axis] < current.value[axis]:
                prev = current
                current = current.left
            else:
                prev = current
                current = current.right
            axis = (axis + 1) % len(point)
        current = prev
        # 2. backtrack towards the root, checking each node and its brother
        result = {}
        class_count = {}
        while current:
            result, class_count = self._search_node(current, point, result, class_count)
            if current.brother:
                result, class_count = self._search_node(current.brother, point, result, class_count)
            current = current.parent
        return result, class_count
    def predict(self, X_test):
        predict = [0 for _ in range(len(X_test))]
        for i in range(len(X_test)):
            _, class_count = self.search(X_test[i])
            # majority vote among the collected neighbors
            sorted_count = sorted(class_count.items(), key=lambda x: x[1], reverse=True)
            predict[i] = sorted_count[0][0]
        return predict
    def score(self, X, y):
        y_pred = self.predict(X)
        return 1 - np.count_nonzero(y - y_pred) / len(y)
    def print_tree(self, X_train, y_train):
        # breadth-first printing of the tree, roughly centered per level
        height = int(math.log(len(X_train)) / math.log(2))
        max_width = pow(2, height)
        node_width = 2
        in_level = 1
        root = self.fit(X_train, y_train)
        from collections import deque
        q = deque()
        q.append(root)
        while q:
            count = len(q)
            width = int(max_width * node_width / in_level)
            in_level += 1
            while count > 0:
                node = q.popleft()
                if node.left:
                    q.append(node.left)
                if node.right:
                    q.append(node.right)
                node_str = (str(node.value) if node else '').center(width)
                print(node_str, end=' ')
                count -= 1
            print("\n")
kd = KDTree()
kd.fit( X_train, y_train)
print(kd.score(X_test,y_test))
Output: 0.9590643274853801
Q1: Why does this result differ from the brute-force result above?
A1: It is probably the upward backtracking: here we only compute the distance to each node and its brother node, and if the brother node becomes the current best, the search does not continue downward from it. The correct approach should be: search the brother node as well, and if the brother becomes the current best, keep searching downward into that brother's subtree (my guess).
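One way to sanity-check this explanation (an added check, not from the original): sklearn's KNeighborsClassifier with algorithm='kd_tree' performs the full backtracking search, so with the same k and Euclidean distance it should reproduce the brute-force score exactly on the same split:
from sklearn.neighbors import KNeighborsClassifier

neigh_kd = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
neigh_kd.fit(X_train, y_train)
print(neigh_kd.score(X_test, y_test))   # expected to match the brute-force result above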
Comparing the running efficiency of the kd-tree and the brute-force method
Use a custom sample set:
# Example 3
X, y = datasets.make_blobs(n_samples=10000, centers=3,
                           n_features=3, random_state=0)
# re-split the new dataset (same split settings as above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
import time
# brute force
st = time.time()
knn = KNN((X_train,y_train), K=3)
print(knn.score(X_test,y_test))
et = time.time()
print("use", et-st)
# kd tree
st = time.time()
kd = KDTree()
kd.fit( X_train, y_train)
print(kd.score(X_test,y_test))
et = time.time()
print("use", et-st)
Output:
# brute force
0.9993333333333333
use 2.8174679279327393
# kd tree
0.9973333333333333
use 0.7757344245910645
With comparable accuracy, the kd-tree runs noticeably faster.
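For comparison, the same brute-force vs kd-tree trade-off can be timed with sklearn's built-in implementations; a minimal sketch reusing the Example 3 split (timings will differ by machine):
import time
from sklearn.neighbors import KNeighborsClassifier

for algo in ('brute', 'kd_tree'):
    st = time.time()
    clf = KNeighborsClassifier(n_neighbors=3, algorithm=algo)
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    print(algo, acc, "use", time.time() - st)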