The following is quoted from Wikipedia:
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space.
- In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors: the class most common among its k nearest neighbors (k is a positive integer, typically small) is assigned to the object. If k = 1, the object is simply assigned to the class of its single nearest neighbor.
- In k-NN regression, the output is the property value for the object: the average of the values of its k nearest neighbors.
The nearest neighbor method classifies using a vector space model. The idea is that cases of the same class are highly similar to one another, so the likely class of an unknown case can be estimated by computing its similarity to cases whose classes are known.
See Wikipedia for further details.
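To make the two decision rules above concrete, here is a minimal sketch (the helper names knn_classify and knn_regress are ours, for illustration only), assuming the labels or values of the k nearest neighbors have already been collected:

from collections import Counter

def knn_classify(neighbor_labels):
    # classification: majority vote among the k neighbors' labels
    return Counter(neighbor_labels).most_common(1)[0][0]

def knn_regress(neighbor_values):
    # regression: the average of the k neighbors' values
    return sum(neighbor_values) / len(neighbor_values)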
When implementing k-NN, the main problem to consider is how to search the training data for the k nearest neighbors quickly. This matters especially when the feature space has many dimensions and the training set is large.
The simplest implementation of k-NN is a linear scan, which computes the distance from the input instance to every single training instance. When the training set is large this is very time-consuming, and the approach becomes impractical.
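A minimal sketch of that linear scan (the function name linear_scan_knn is ours for illustration):

import heapq

def linear_scan_knn(train_points, target, k):
    """Return the k training points closest to target, by brute force."""
    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, target)) ** 0.5
    # one distance computation per training point: O(n) work per query
    return heapq.nsmallest(k, train_points, key=dist)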
To make k-NN search more efficient, the training data can be stored in a special data structure that reduces the number of distance computations. Many such structures exist; below we introduce one of them, the kd tree (k-d tree).
A k-d tree is a binary tree in which every node is a k-dimensional point. Every non-leaf node can be thought of as using a hyperplane to divide the space into two half-spaces. Points to the left of this hyperplane are represented by the node's left subtree, and points to the right by its right subtree. The hyperplane is chosen as follows: every node is associated with the one of the k dimensions that is perpendicular to its hyperplane. So, for example, if a split is made along the x axis, all points with an x value smaller than the split value appear in the left subtree, and all points with a larger x value appear in the right subtree. The hyperplane is then determined by that x value, and its normal is the unit vector along the x axis.
There are many ways to choose axis-aligned splitting planes, so there are many ways to build a k-d tree. The most typical method is to cycle through the k dimensions as one descends the tree, splitting at each node on the median of the remaining points along the current dimension; this is the approach the code below takes.
This method produces a balanced k-d tree, in which every leaf is at roughly the same depth. A balanced tree, however, is not necessarily optimal for every application.
Below is a Python implementation:
import random
from copy import deepcopy

class Area(object):
    '''Axis-aligned hyperrectangle bounded by the corner vectors min and max
    (a minimal definition, assumed by build_tree and divide below).
    '''
    def __init__(self, min, max):
        self.min = deepcopy(min)
        self.max = deepcopy(max)

class Tree_node(object):
    '''kd tree node'''
    def __init__(self, points, area, dimCount=0, dim=0, parent=None, lchild=None, rchild=None):
        self.points = deepcopy(points)  # the points contained in this subtree
        self.dimCount = len(points[0]) if points else dimCount  # k, the number of dimensions
        self.dim = dim                  # the dimension this node splits on
        self.parent = parent
        self.lchild = lchild
        self.rchild = rchild
        self.point = None               # the splitting point stored at this node
        self.area = area                # the region of space covered by this subtree
def build_tree(points, min, max):
    '''Build a kd tree over points; min and max are the corners of the whole space.'''
    area = Area(min, max)
    root = Tree_node(points, area)
    divide(root, 0)  # recursively split, starting from dimension 0
    return root
def divide(node, dim):
    """Split node into two child nodes.

    Args:
        node: a kd tree node
        dim: the dimension along which the node is split
    """
    node.dim = dim  # record the splitting dimension; find_leaf_node relies on it
    if not node.points:
        return
    elif len(node.points) == 1:
        node.point = node.points[0]
        return
    array = [point[dim] for point in node.points]
    median = select_k_st(array, len(array) // 2 + 1)  # median coordinate along dim
    lPoints = []
    rPoints = []
    flag = True
    for point in node.points:
        if point[dim] <= median:
            if flag and point[dim] == median:
                node.point = point  # the first point on the median stays at this node
                flag = False
            else:
                lPoints.append(point)
        else:
            rPoints.append(point)
    if lPoints or rPoints:
        # the left child covers the part of the region below the median along dim
        lmax = deepcopy(node.area.max)
        lmax[dim] = median
        node.lchild = Tree_node(lPoints, Area(node.area.min, lmax), dimCount=node.dimCount, parent=node)
        divide(node.lchild, (dim + 1) % node.dimCount)
        # the right child covers the part above the median
        rmin = deepcopy(node.area.min)
        rmin[dim] = median
        node.rchild = Tree_node(rPoints, Area(rmin, node.area.max), dimCount=node.dimCount, parent=node)
        divide(node.rchild, (dim + 1) % node.dimCount)
def select_k_st(array, k):
    """Return the k-th smallest number in array (1-indexed), via quickselect."""
    pivot = array[random.randint(0, len(array) - 1)]
    smaller = [x for x in array if x < pivot]
    equal = [x for x in array if x == pivot]
    if k <= len(smaller):
        return select_k_st(smaller, k)
    elif k <= len(smaller) + len(equal):
        return pivot
    else:
        larger = [x for x in array if x > pivot]
        return select_k_st(larger, k - len(smaller) - len(equal))
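A quick check of the quickselect helper on the median position that divide uses (the sample list is ours):

array = [2, 5, 9, 4, 8, 7]
print(select_k_st(array, len(array) // 2 + 1))  # 4th smallest of 6 values: 7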
Nearest neighbor search finds the point in the tree that is closest to a given input point.
Nearest neighbor search on a k-d tree proceeds as follows:
- Starting from the root, descend recursively, going left or right at each node depending on the target's coordinate in that node's splitting dimension, until a leaf is reached.
- Take the leaf's point as the current best.
- Unwind the recursion back toward the root. At each node, update the current best if the node's point is closer to the target. If the hypersphere centered at the target with radius equal to the current best distance crosses into the region on the other side of the node's splitting plane, search that sibling subtree as well.
- When the root is reached, the search is complete.
def distance(point1, point2):
    """Calculate the Euclidean distance between point1 and point2."""
    total = 0.0
    for i in range(len(point1)):
        total += (point1[i] - point2[i]) ** 2
    return total ** 0.5
def search_with_kd_tree(kd_tree, target, nearest_node=None):
    """Return the point in kd_tree nearest to target."""
    return recall(kd_tree, find_leaf_node(kd_tree, target), target, nearest_node)

def find_leaf_node(root, target):
    """Descend from root to the leaf whose region contains target."""
    if len(root.points) <= 1:
        return root
    if target[root.dim] < root.point[root.dim]:
        return find_leaf_node(root.lchild, target)
    else:
        return find_leaf_node(root.rchild, target)

def recall(root, node, target, nearest_node):
    """Unwind from node up to root, keeping track of the nearest point found."""
    if node.point is not None and (nearest_node is None or distance(node.point, target) < distance(nearest_node, target)):
        nearest_node = node.point
    if node == root:
        return nearest_node
    # If the candidate hypersphere reaches into the sibling's region, search it too.
    sibling = node.parent.lchild if node.parent.lchild != node else node.parent.rchild
    if sibling.point is not None and (nearest_node is None or intersect(sibling.area, target, nearest_node)):
        nearest_node = search_with_kd_tree(sibling, target, nearest_node)
    return recall(root, node.parent, target, nearest_node)
def intersect(area, target, nearest_node):
    """Determine whether a hypersphere and a hyperrectangle intersect.

    The hypersphere is centered at target, with nearest_node lying on its
    surface; area is the hyperrectangle, defined by two corner vectors.
    """
    return minDistance(area, target) < distance(target, nearest_node)
def minDistance(area, target):
    """Calculate the minimum distance from a given point to a hyperrectangle."""
    total = 0.0
    for i in range(len(target)):
        if area.min[i] <= target[i] <= area.max[i]:
            continue  # target lies inside the slab along this dimension
        dif1 = abs(target[i] - area.min[i])
        dif2 = abs(target[i] - area.max[i])
        total += min(dif1, dif2) ** 2.0
    return total ** 0.5
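Putting the pieces together, a small end-to-end check (the sample points and target are ours); a brute-force scan over the same points finds the same answer:

points = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
tree = build_tree(points, [0, 0], [10, 10])
target = [6, 5]
nearest = search_with_kd_tree(tree, target)
print(nearest)  # [5, 4], matching min(points, key=lambda p: distance(p, target))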