In image search, the system is asked to return a set of images similar to a query image, which involves querying data in a high-dimensional space. Sorting and searching are foundational algorithms in computing; the most basic forms are sorting a one-dimensional array and finding an element in one. An image retrieval system, however, faces large-scale, high-dimensional data and places much higher demands on query efficiency. This post introduces a basic query type: nearest neighbor (NN) and K-nearest neighbor (KNN) search.
Nearest Neighbor Search: the nearest neighbor query is one of the most important spatial queries, and it is the type this post focuses on. Its definition: given a query point q and a distance metric (Euclidean distance, Manhattan distance, etc.), a nearest neighbor query finds the spatial data object closest to q. Its generalization is the k (k >= 1) nearest neighbor query: given a query point q, a positive integer k, and a distance metric, a k-nearest neighbor query finds the k spatial data objects closest to q. As shown in the figure above, for the query object q, 1NN(q) = {a} and 4NN(q) = {a, b, c, d}.
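For concreteness, the two distance metrics mentioned above differ as follows (an illustrative snippet with made-up values, not code from the original post):

p = [1 2]; q = [4 6];
d_euclid    = sqrt(sum((p - q).^2));  % Euclidean distance: 5
d_manhattan = sum(abs(p - q));        % Manhattan distance: 7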
The nearest neighbor rule also serves as a classification method. In the figure above, which class should the green circle be assigned to, the red triangles or the blue squares? If K = 3, red triangles make up 2/3 of the neighbors, so the green circle is assigned to the red triangle class; if K = 5, blue squares make up 3/5, so the green circle is assigned to the blue square class.
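A minimal sketch of that majority vote, assuming numeric class labels (knn_classify is a hypothetical helper name, not code from this post):

function label = knn_classify(X, labels, q, K)
% X      -- N x D training points
% labels -- N x 1 numeric class labels
% q      -- 1 x D query point
% K      -- number of neighbors to vote
dists = sqrt(sum((X - repmat(q, size(X,1), 1)).^2, 2)); % Euclidean distances
[sorted_d, order] = sort(dists);     % ascending by distance
label = mode(labels(order(1:K)));    % majority vote among the K nearest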
A brute-force (exhaustive) scan implements the nearest neighbor query directly; the result is shown in the figure above (K = 3). The Matlab code is as follows:
function knn_2d_demo()
% KNN 2D Demo
dim = 2;
num = 40;
X = randi(100, num, dim);
Q = randi(100, 1, dim);
K = 3;
[indexs, distances] = KNNS(X, Q, K);
figure
hold on
plot(X(:,1), X(:,2), 'b.')
plot(X(indexs,1), X(indexs,2), 'ro')
plot(Q(1), Q(2), 'g*')
plot(Q(1), Q(2), 'co', 'MarkerSize', ceil(distances(K))*6)
hold off

function [indexs, distances] = KNNS(X, Q, K)
% X         -- data point set, N x D
% Q         -- query point, 1 x D
% K         -- number of nearest neighbors to return
% indexs    -- indices of the KNN points
% distances -- distances from the query point to the KNN points
[n, dim] = size(X);
query_mat = repmat(Q, n, 1);
dist_list = sqrt(sum((X - query_mat).^2, 2));
[value, index] = sort(dist_list);
indexs = index(1:K);
distances = value(1:K);
As the number of data points grows, the cost of the brute-force scan grows linearly with the dataset size, which quickly becomes prohibitive at retrieval scale. The k-d tree fast search algorithm proposed by Friedman et al. in 1977 [Friedman et al. '77] can find the nearest neighbor quickly, but once feature descriptors exceed roughly 10 dimensions its efficiency degrades sharply and it is no better than the brute-force scan. In 1997, Beis and Lowe proposed an approximate algorithm called Best-Bin-First (BBF) [Beis & Lowe '97], which finds the nearest neighbor with high probability.
A k-d tree (k-dimensional tree) is an extension of the binary search tree: each level of the tree splits the space in two. The top-level node partitions along one dimension, the nodes on the next level partition along another, and so on, cycling through the dimensions. The split is chosen so that, at each node, roughly half the points stored in the subtree fall on one side and half on the other.
Referring to the figure above, here is how a k-d tree is built on two-dimensional (K = 2) data:
1. Choosing the root: sort the data along the x dimension and take the point at the median, (5,8), as the root; then perform the first split with this node as the pivot.
Note that two basic operations are involved here:
(1) Choosing the split dimension, i.e. the dimension along which the data is cut in two. There are two common rules (see the sketch after step 2 below):
One is to take the current depth n mod the data dimensionality K: at the first level n = 0, so n mod K = 0 and the split point is sought along the first dimension.
The other is to take the dimension with the largest variance: compute the variance of the data along each dimension; the dimension with the largest variance is the most spread out, so the split point is sought there.
(2) Choosing the split point on the split dimension: usually the median of the data along that dimension.
2. Splitting: the (parent) root node p has split dimension split_dim = 0, i.e. the split is on x. Points whose split_dim-th component is less than p[split_dim] go to the left side, the rest to the right. Then determine a split dimension and split point for each side in turn, recursing until the points are exhausted.
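As a concrete illustration of the two rules in (1), both split-dimension choices fit in a couple of lines (a sketch with assumed variable names, not part of the implementation below):

X = randi(100, 8, 2);                         % points at the current node
depth = 0;                                    % current tree depth (n in the text)
split_dim_cycle = mod(depth, size(X,2)) + 1;  % rule 1: cycle through the dimensions
[max_var, split_dim_var] = max(var(X));       % rule 2: dimension of largest variance
split_val = median(X(:, split_dim_cycle));    % split point: the median on the chosen dimension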
Following these steps, a recursive Matlab implementation is given below:
function tree_output = kdtree_create(X, parent_number, split_dimension)
global tree_cell;
global node_number;
if nargin == 1
    % add the index values to the last column
    % easy way to keep track of them
    [n,d] = size(X);
    X(:,d+1) = (1:n)';
    node_number = 1;
    split_dimension = 0;
    parent_number = 0;
    % initialize the node
    Node.type = 'node';
    Node.left = 0;
    Node.right = 0;
    Node.nodevector = zeros(1,d);
    Node.hyperrect = [zeros(1,d); zeros(1,d)];
    Node.numpoints = 0;
    Node.index = 0;
    Node.parent = 0;
    Node.splitval = 0;
    % initialize the tree
    hold_cell_data(1:n) = Node;
    tree_cell = hold_cell_data;
    clear hold_cell_data;
else
    [n,d] = size(X(:,1:end-1));
    node_number = node_number + 1;
    split_dimension = split_dimension + 1;
end

if n == 0
    fprintf('Error: 0 points in node, causing endless loop, press ctrl-C.\n');
end

assigned_nn = node_number; % assigned node number for this particular iteration

% set assignments to current node
tree_cell(assigned_nn).type = 'node';
tree_cell(assigned_nn).parent = parent_number;

% if there is 1 datapoint left, make a leaf out of it
if n == 1
    tree_cell(assigned_nn).left = [];
    tree_cell(assigned_nn).right = [];
    tree_cell(assigned_nn).nodevector = X(:,1:end-1);
    tree_cell(assigned_nn).hyperrect = [X(:,1:end-1); X(:,1:end-1)];
    tree_cell(assigned_nn).type = 'leaf';
    tree_cell(assigned_nn).numpoints = 1;
    tree_cell(assigned_nn).index = X(:,end);
    tree_output = assigned_nn;
    return;
end

% if there is more than 1 data point left, calculate some of the node values
tree_cell(assigned_nn).numpoints = n;
a = min(X(:,1:end-1)); b = max(X(:,1:end-1));
tree_cell(assigned_nn).hyperrect = [a; b];

% if the feature vectors happen to be the same, make a leaf again
if a == b
    tree_cell(assigned_nn).type = 'leaf';
end

% recursively build the rest of the tree
% if the node is a leaf, assign the index and node vector
if (strcmp(tree_cell(assigned_nn).type, 'leaf'))
    tree_cell(assigned_nn).nodevector = mean(X(:,1:end-1));
    tree_cell(assigned_nn).index = X(:,end);
else
    % if it is not a leaf,
    % figure out which dimension to split (going in order)
    tree_cell(assigned_nn).splitdim = mod(split_dimension, d) + 1;
    % figure out the median value to cut this dimension
    median_value = median(X(:, tree_cell(assigned_nn).splitdim));
    % find all the values that are lower than or equal to this median
    i = find(X(:, tree_cell(assigned_nn).splitdim) <= median_value);
    % if there are more than 2 points
    if (tree_cell(assigned_nn).numpoints > 2)
        [max_val, max_pos] = max(X(i, tree_cell(assigned_nn).splitdim));
        % recurse for everything to the left of the median
        tree_cell(assigned_nn).left = kdtree_create(X([i(1:max_pos-1); i(max_pos+1:end)],:), assigned_nn, split_dimension);
        % recurse for everything to the right of the median
        if (size(X,1) > size(i,1))
            tree_cell(assigned_nn).right = kdtree_create(X(find(~(X(:, tree_cell(assigned_nn).splitdim) <= median_value)),:), assigned_nn, split_dimension);
        else
            tree_cell(assigned_nn).right = [];
        end
    else
        % if there are only two data points left,
        % choose the left value as the median,
        % make the right value a leaf,
        % and leave the left leaf blank
        [max_val, max_pos] = max(X(i, tree_cell(assigned_nn).splitdim));
        if (i(max_pos) == 1)
            min_pos = 2;
        else
            min_pos = 1;
        end
        tree_cell(assigned_nn).left = [];
        tree_cell(assigned_nn).right = kdtree_create(X(min_pos,:), assigned_nn, split_dimension);
    end
    % assign the median vector to this node
    tree_cell(assigned_nn).nodevector = X(i(max_pos),1:end-1);
    tree_cell(assigned_nn).index = X(i(max_pos),end);
    tree_cell(assigned_nn).splitval = X(i(max_pos), tree_cell(assigned_nn).splitdim);
end

% final clean up
if nargin == 1
    % all computation is complete, so return the tree structure
    % and clear the other data
    tree_output = tree_cell;
    clear global tree_cell;
else
    % otherwise output assigned_nn for storage in the parent
    tree_output = assigned_nn;
end
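For reference, building a tree over a random point set looks like this (a usage sketch; when called with the data matrix alone, kdtree_create returns the full tree_cell struct array):

X = randi(100, 40, 2);     % 40 random 2-D points
tree = kdtree_create(X);   % build; returns the tree_cell struct array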
The resulting partition of the 2-D plane is shown below:
The k-d tree's structure is what cuts down the query cost, because many nodes are never compared against at all. The following example walks through a nearest neighbor (K = 1) query:
1. Take q(2,6) as the query point and compare top-down through the KD Tree.
2. At each node p, if q(split_dim) <= p(split_dim), q descends into p's left subtree for further comparison; otherwise it descends into the right subtree. The same comparison and descent rule applies at every node, repeating until a leaf is reached.
This yields (1,4) as the nearest neighbor, as shown in the figure below:
But the true nearest neighbor is (2,7); the query procedure above is missing one step:
3. Backtrack along the search path (a code sketch of the full search follows the examples below):
Add the candidate neighbor c found in step 2 to a queue, compute its distance to q as dist, and record best_dist = dist;
Check whether the circle centered at the query point q with radius best_dist crosses the splitting boundary of any other parent node on the path; if not, stop: c is the nearest neighbor;
Otherwise, record the intersecting parent nodes, descend into the corresponding regions to compare, compute dist there, and whenever dist < best_dist update best_dist = dist; repeat until no candidate region remains.
The corrected result is shown below:
Another example:
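To make step 3 concrete, here is a minimal sketch of the descend-then-backtrack search for K = 1. It is illustrative only: it assumes a simplified node struct with fields point, splitdim, splitval, left, and right (empty [] for a missing child), not the tree_cell representation built above.

function [best_pt, best_dist] = kdtree_nn(node, q, best_pt, best_dist)
% usage: [p, d] = kdtree_nn(root, q, [], inf)
if isempty(node), return; end
% update the best match with the current node's point
d = norm(node.point - q);
if d < best_dist
    best_dist = d; best_pt = node.point;
end
% descend into the side of the splitting plane that contains q
if q(node.splitdim) <= node.splitval
    near = node.left;  far = node.right;
else
    near = node.right; far = node.left;
end
[best_pt, best_dist] = kdtree_nn(near, q, best_pt, best_dist);
% backtrack: visit the far side only if the ball of radius best_dist
% around q crosses the splitting plane
if abs(q(node.splitdim) - node.splitval) < best_dist
    [best_pt, best_dist] = kdtree_nn(far, q, best_pt, best_dist);
end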
Curse of dimensionality:
Traditional KNN queries mostly rely on space-partitioning, tree-style algorithms such as the R-tree, k-d tree, and SR-tree. These algorithms return exact results, but their time efficiency on high-dimensional datasets is poor: experiments [Weber et al. '98] showed that above roughly 10 dimensions, the space-partitioning approaches actually have worse running time than a linear scan.
Extensions: Approximate Nearest Neighbor
k nearest neighbors
r nearest neighbors
(c,r) nearest neighbors
References:
http://www.cs.haifa.ac.il/~rita/ml_course/lectures/KNN.pdf
KNN Demo
SIFT feature matching and mismatch elimination: BBF, RANSAC
A study of the k-d tree algorithm
KD Tree Demo
Introduction to kd-trees (recommended)
[Lee & Wong '77] D. T. Lee and C. K. Wong. Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees. Acta Informatica, 9(1): 23-29, 1977. [PDF]
[Beis & Lowe '97] Jeffrey S. Beis and David G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. Conference on Computer Vision and Pattern Recognition, Puerto Rico, June 1997, pp. 1000-1006. [PDF]
[Weber et al. '98] Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. VLDB 1998: 194-205. [PDF]
[Friedman et al. '77] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3): 209-226, September 1977. [PDF]
[Kybic & Vnucko '10] Jan Kybic and Ivan Vnucko. Approximate best bin first k-d tree all nearest neighbor search with incremental updates. 2010. [PDF]
Resources:
Kdtree implementation in matlab (code reference for this post)
tutorial-KNN
http://www.cs.umd.edu/~mount/ANN/
http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
https://kdtree.googlecode.com/svn-history/r18/kdtree/