K-Nearest Neighbors (KNN) Algorithm and Python Implementation

1. The k-Nearest Neighbor Model

The k-nearest neighbor method (k-NN) is a basic algorithm for classification and regression. Its three key elements are: the choice of k, the distance metric, and the classification decision rule.
The k-NN model: let $N_k(x)$ denote the neighborhood formed by the k points nearest to $x$. The predicted label is

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

where $I$ is the indicator function.

1. Choice of k
A small k gives a more complex model with a smaller approximation error, but it is prone to overfitting;
a large k gives a simpler model with a larger approximation error.
In practice, k is usually set to a fairly small value, and the best k can be chosen by cross-validation; a small sklearn sketch follows. In addition, k is usually taken to be odd, to avoid ties where two classes receive the same number of votes.
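As a rough illustration of picking k by cross-validation (the candidate values below are arbitrary, and GridSearchCV is just one convenient way to do it):

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)
# 5-fold cross-validation over a few odd candidate values of k
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [1, 3, 5, 7, 9]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)   # the k with the best cross-validated accuracy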

2. Distance metric
The Euclidean distance is used most often, or more generally the $L_p$ distance:

$$L_p(x_i, x_j) = \Big( \sum_{l} |x_i^{(l)} - x_j^{(l)}|^p \Big)^{1/p}$$

Here, $l$ indexes the feature dimensions; $p = 2$ gives the Euclidean distance and $p = 1$ the Manhattan distance. A small numeric check follows.
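A quick sketch of the $L_p$ distance on two hand-picked vectors (the helper name lp_distance is made up for illustration):

import numpy as np

def lp_distance(x_i, x_j, p=2):
    # p = 2 gives the Euclidean distance, p = 1 the Manhattan distance
    return (np.abs(x_i - x_j) ** p).sum() ** (1.0 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(lp_distance(a, b, p=2))   # 5.0
print(lp_distance(a, b, p=1))   # 7.0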

3. Classification decision rule
Majority vote: the predicted label is the class that occurs most often among the k nearest points; a one-line sketch of the vote follows.
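A minimal sketch of the vote itself with collections.Counter (the neighbor labels below are made up):

from collections import Counter

neighbor_labels = [1, 0, 1]   # labels of the k = 3 nearest points
print(Counter(neighbor_labels).most_common(1)[0][0])   # predicted class: 1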

2. kd-Tree

A kd-tree (k-dimensional tree) is a data structure that partitions a k-dimensional data space.

Building a kd-tree
1. Along the current splitting dimension, take the sample at the median as the parent node and divide the remaining data into the left and right subtrees;
2. Move on to the next dimension and repeat step 1 on the left and right subtrees (a small construction sketch follows these steps).
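A minimal sketch of the two steps above, using plain nested tuples instead of a full node class; the 2-D toy points are only for illustration:

import numpy as np

def build_kd(points, axis=0):
    # points: array of shape (n, k); returns nested (point, left, right) tuples
    if len(points) == 0:
        return None
    points = points[points[:, axis].argsort()]        # sort along the current axis
    median = len(points) // 2                          # step 1: the median sample becomes the parent
    next_axis = (axis + 1) % points.shape[1]           # step 2: cycle to the next dimension
    return (points[median],
            build_kd(points[:median], next_axis),      # left subtree: smaller values
            build_kd(points[median + 1:], next_axis))  # right subtree: larger values

pts = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
print(build_kd(pts)[0])   # root splits on x at the median sample [7 2]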

Nearest-neighbor search
1. Find the "best leaf node" for the query point in the kd-tree.
Starting from the root, compare the query point with each node along the current splitting dimension and descend accordingly until a leaf is reached.
2. Backtrack upward.
Starting from the "best leaf node", if the current node is closer to the query point than the best found so far, it becomes the current best, and its sibling node is checked as well;
the search ends once the root has been processed. (A library-based query example follows.)
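As a reference point, sklearn ships a KDTree whose query method performs this kind of search; a minimal usage sketch (the query point is arbitrary):

import numpy as np
from sklearn.neighbors import KDTree

pts = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
tree = KDTree(pts, leaf_size=2)
dist, ind = tree.query(np.array([[3, 4.5]]), k=3)   # 3 nearest neighbours of the query point
print(ind, dist)    # their indices and Euclidean distances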

3. Implementation

3.1 Brute force

With a brute-force implementation, for $N$ samples of dimension $D$ the overall cost is $O(DN^2)$, because computing the distances for a single query costs $O(DN)$ and selecting its k nearest neighbors costs $O(N)$.

from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
## Example 1: iris for classification( 3 classes)
# X, y = datasets.load_iris(return_X_y=True)
# Example 2
X, y = datasets.load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# my k-NN
class KNN():
    def __init__(self,data,K=3):
        self.X = data[0]
        self.y = data[1]
        self.K = K
    def fit(self, X_test):
        # all pairwise differences between test and training samples,
        # then the squared Euclidean distance of every test/train pair
        diffX = np.repeat(X_test, len(self.X), axis=0) - np.tile(self.X,(len(X_test),1))
        square_diffX = (diffX**2).sum(axis=1).reshape(len(X_test),len(self.X))
        sorted_index = square_diffX.argsort()
        predict = [0 for _ in range(len(X_test))]
        for i in range(len(X_test)):
            # majority vote among the K nearest training samples
            class_count={}
            for j in range(self.K):
                vote_label = self.y[sorted_index[i][j]]
                class_count[vote_label] = class_count.get(vote_label,0) + 1
            sorted_count = sorted(class_count.items(), key=lambda x: x[1],reverse=True)
            predict[i] = sorted_count[0][0]
        return predict
    def predict(self, X_test):
        return self.fit(X_test)
    def score(self,X,y):
        y_pred = self.predict(X)
        return 1 - np.count_nonzero(y-y_pred)/len(y)

knn = KNN((X_train,y_train), K=3)
print(knn.score(X_test,y_test))

Output: 0.9415204678362573.
Using the sklearn API:

# sklearn API
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
print(neigh.score(X_test,y_test))

Output: 0.9415204678362573
As long as the k-NN implementations use the same k and the same distance definition, the results are exactly the same; the quick check below compares the predicted labels directly.
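A small sketch of that check, reusing the knn and neigh objects defined above (in principle ties could still break differently between the two implementations):

import numpy as np

pred_mine = np.array(knn.predict(X_test))
pred_sklearn = neigh.predict(X_test)
print(np.array_equal(pred_mine, pred_sklearn))   # True when every label matches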

3.2 kd-Tree

Now define the kd-tree. Construction time complexity: $O(DN \log N)$.

# kd-tree
class KDNode:
    '''
    value: one sample stored as [features..., label]
    '''
    def __init__(self, value=None, parent=None, left=None, right=None, index=None):
        self.value = value
        self.parent = parent
        self.left = left
        self.right = right 
    @property
    def brother(self):
        if not self.parent:
            bro = None
        else:
            if self.parent.left is self:
                bro = self.parent.right
            else:
                bro = self.parent.left
        return bro

class KDTree():
    def __init__(self,K=3):
        self.root = KDNode()
        self.K = K
        
    def _build(self, data, axis=0,parent=None):
        '''
        data:[X,y]
        '''
        # choose median point 
        if len(data) == 0:
            root = KDNode()
            return root
        data = np.array(sorted(data, key=lambda x:x[axis]))
        median = int(len(data)/2)
        loc = data[median]
        root = KDNode(loc,parent=parent)
        new_axis = (axis+1)%(len(data[0])-1)  # cycle through the feature dims; the last column is the label
        if len(data[:median,:]) == 0:
            root.left = None
        else:
            root.left = self._build(data[:median,:],axis=new_axis,parent=root)
        if len(data[median+1:,:]) == 0:
            root.right = None
        else:
            root.right = self._build(data[median+1:,:],axis=new_axis,parent=root)
        self.root = root
        return root
    
    def fit(self, X, y):
        # concat X, y so that each node stores the features together with the label
        data = np.concatenate([X, y.reshape(-1,1)],axis=1)
        root = self._build(data)
        return root
        
    def _get_eu_distance(self,arr1:np.ndarray, arr2:np.ndarray) -> float:
        return ((arr1 - arr2) ** 2).sum() ** 0.5
        
    def _search_node(self,current,point,result={},class_count={}):
        # Get max_node, max_distance.
        if not result:
            max_node = None
            max_distance = float('inf')
        else:
            # find the farthest (node, distance) pair among the k kept so far
            max_node, max_distance = sorted(result.items(), key=lambda n:n[1],reverse=True)[0]
        node_dist = self._get_eu_distance(current.value[:-1],point)
        if len(result) == self.K:
            if node_dist < max_distance:
                result.pop(max_node)
                result[current] = node_dist
                class_count[current.value[-1]] = class_count.get(current.value[-1],0) + 1
                class_count[max_node.value[-1]] = class_count.get(max_node.value[-1],1) - 1
        elif len(result) < self.K:
            result[current] = node_dist
            class_count[current.value[-1]] = class_count.get(current.value[-1],0) + 1
        return result,class_count
        
    def search(self,point):
        # descend to the leaf region that the query point falls into (current)
        current = self.root
        axis = 0
        while current:
            if point[axis] < current.value[axis]:
                prev = current
                current = current.left
            else:
                prev = current
                current = current.right
            axis = (axis+1)%len(point)
        current = prev
        # search k nearest points
        result = {}
        class_count={}
        while current:
            result,class_count = self._search_node(current,point,result,class_count)
            if current.brother:
                result,class_count = self._search_node(current.brother,point,result,class_count)
            current = current.parent
        return result,class_count
        
    def predict(self,X_test):
        predict = [0 for _ in range(len(X_test))]
        for i in range(len(X_test)):
            _,class_count = self.search(X_test[i])
            sorted_count = sorted(class_count.items(), key=lambda x: x[1],reverse=True)
            predict[i] = sorted_count[0][0]
        return predict
        
    def score(self,X,y):
        y_pred = self.predict(X)
        return 1 - np.count_nonzero(y-y_pred)/len(y)
        
    def print_tree(self,X_train,y_train):
        import math
        height = int(math.log(len(X_train))/math.log(2))
        max_width = pow(2, height)
        node_width = 2
        in_level = 1
        root = self.fit(X_train,y_train)
        from collections import deque
        q = deque()
        q.append(root)
        while q:
            count = len(q)
            width = int(max_width * node_width / in_level)
            in_level += 1
            while count>0:
                node = q.popleft()
                if node.left:
                    q.append(node.left )
                if node.right:
                    q.append(node.right)
                node_str = (str(node.value) if node else '').center(width)
                print(node_str, end=' ')
                count -= 1 
            print("\n")
kd = KDTree()
kd.fit( X_train, y_train)
print(kd.score(X_test,y_test))

Output: 0.9590643274853801

Q1: Why does the result differ from the brute-force result above?
A1: The difference most likely comes from the upward backtracking step. Here the distance is computed for each node on the path and for its sibling, but when the sibling becomes the current best we never keep searching downward into its subtree. The correct procedure should be: when a node becomes the current best, also search its sibling, and if the sibling becomes the best, continue the search downward into the sibling's subtree as well (this is my guess). A sketch of the commonly described correct search is given below.
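The following is a sketch of that search as it is usually described (not a patch to the class above; the node layout with an explicit axis field and the helper names are my own): recurse into the nearer child first, and visit the farther sibling subtree only when the splitting plane lies closer than the current k-th best distance.

import heapq
import numpy as np

def build(points, axis=0):
    # nested (point, left, right, axis) tuples, split at the median of each axis in turn
    if len(points) == 0:
        return None
    points = points[points[:, axis].argsort()]
    m = len(points) // 2
    nxt = (axis + 1) % points.shape[1]
    return (points[m], build(points[:m], nxt), build(points[m + 1:], nxt), axis)

def knn_search(node, target, k, heap=None):
    # heap holds (-distance, id, point); its top is the worst of the k best so far
    if heap is None:
        heap = []
    if node is None:
        return heap
    point, left, right, axis = node
    d = np.sqrt(((point - target) ** 2).sum())
    if len(heap) < k:
        heapq.heappush(heap, (-d, id(point), point))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, id(point), point))
    near, far = (left, right) if target[axis] < point[axis] else (right, left)
    knn_search(near, target, k, heap)            # always search the nearer side
    if len(heap) < k or abs(target[axis] - point[axis]) < -heap[0][0]:
        knn_search(far, target, k, heap)         # only cross the plane when it may help
    return heap

pts = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]], dtype=float)
best = knn_search(build(pts), np.array([3.0, 4.5]), k=3)
print(sorted((-negd, p.tolist()) for negd, _, p in best))   # the 3 true nearest points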

Comparing the running efficiency of the kd-tree and the brute-force method on a custom sample set:

# Example 3
X, y = datasets.make_blobs(n_samples=10000, centers=3,
                           n_features=3, random_state=0)
# re-split the new data (assuming the same 0.3 test split as before)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

import time
# brute force
st = time.time()
knn = KNN((X_train,y_train), K=3)
print(knn.score(X_test,y_test))
et = time.time()
print("use", et-st)

# kd tree
st = time.time()
kd = KDTree()
kd.fit( X_train, y_train)
print(kd.score(X_test,y_test))
et = time.time()
print("use", et-st)

Output:

# brute force
0.9993333333333333
use 2.8174679279327393
# kd tree
0.9973333333333333
use 0.7757344245910645

With similar accuracy, the kd-tree's running time is noticeably shorter.
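For reference, the same comparison can be forced through sklearn by fixing its algorithm parameter (timings will naturally vary by machine):

import time
from sklearn.neighbors import KNeighborsClassifier

for algo in ('brute', 'kd_tree'):
    st = time.time()
    clf = KNeighborsClassifier(n_neighbors=3, algorithm=algo)
    clf.fit(X_train, y_train)
    print(algo, clf.score(X_test, y_test), "use", time.time() - st)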


