Finding the top-N most similar vectors in a massive set of high-dimensional vectors

Problem: given a massive set of high-dimensional vectors, how do you find the top-N vectors most similar to a given one?

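Principle:

If two points are infinitely close to each other, no hyperplane can separate them. We can therefore partition the points by cutting the space with hyperplanes: points that lie right next to each other will most likely end up on the same side of each cut, so a query only needs to examine the points that share its region instead of the whole dataset.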
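The splitting idea can be sketched in a few lines of plain Python. This is only an illustration, not Annoy's implementation: it loosely follows the description in the Annoy README (sample two points, take the hyperplane equidistant from them), and the helper names `make_split`, `side` and `separation_rate` are hypothetical.

```python
import random

def make_split(points, rng):
    """Sample two points and return (normal, offset) of the hyperplane
    that is equidistant from them."""
    a, b = rng.sample(points, 2)
    normal = [ai - bi for ai, bi in zip(a, b)]
    midpoint = [(ai + bi) / 2.0 for ai, bi in zip(a, b)]
    offset = sum(n * m for n, m in zip(normal, midpoint))
    return normal, offset

def side(point, normal, offset):
    """Return True if the point lies on the positive side of the hyperplane."""
    return sum(n * p for n, p in zip(normal, point)) - offset >= 0

def separation_rate(p, q, points, rng, trials=1000):
    """Fraction of random splits that put p and q on different sides."""
    separated = 0
    for _ in range(trials):
        normal, offset = make_split(points, rng)
        if side(p, normal, offset) != side(q, normal, offset):
            separated += 1
    return separated / trials

rng = random.Random(0)
dim = 8
points = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]

# One split partitions the dataset into two groups.
normal, offset = make_split(points, rng)
left = [p for p in points if not side(p, normal, offset)]
right = [p for p in points if side(p, normal, offset)]

# A point and a slightly shifted copy of it are rarely separated,
# while two unrelated points are separated much more often.
close_pair = (points[0], [c + 1e-6 for c in points[0]])
far_pair = (points[0], points[1])
print(separation_rate(close_pair[0], close_pair[1], points, rng))
print(separation_rate(far_pair[0], far_pair[1], points, rng))
```

Applying such splits recursively to each side yields a tree, and a query only has to be compared against the points in the leaf it falls into. Because a near neighbor can occasionally land on the other side of a cut, several trees built from different random splits are usually combined.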

Detailed explanation of the Annoy algorithm: https://www.cnblogs.com/futurehau/p/6524396.html

GitHub project: https://github.com/spotify/annoy

Python demo code:

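from annoy import AnnoyIndex
import random

f = 2  # dimensionality of each vector
# Name the metric explicitly: 'angular' was the implicit default in older
# Annoy releases, and newer releases may warn when it is omitted.
t = AnnoyIndex(f, 'angular')

tmp = []  # every vector, kept so the points can be plotted later
x = []    # first coordinate of every vector
y = []    # second coordinate of every vector
for i in range(500):
    v = [random.gauss(0, 1) for _ in range(f)]
    tmp.append(v)
    x.append(v[0])
    y.append(v[1])
    t.add_item(i, v)  # add vector v to the index under id i

t.build(100)  # build a forest of 100 trees
t.save('test.ann')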

# ...

u = AnnoyIndex(f, 'angular')  # dimension and metric must match the saved index
u.load('test.ann')  # super fast, will just mmap the file
nearest = u.get_nns_by_item(1, 40)  # ids of the 40 approximate nearest neighbors of item 1

target = tmp[1]

nearx = []
neary = []
nearest = [i for i in nearest if i != 1]  # drop the query item itself
for i in nearest:
    near = tmp[i]
    # print(u.get_distance(1, i))
    print(u.get_item_vector(i))
    nearx.append(near[0])
    neary.append(near[1])

import matplotlib.pyplot as plt

# All points in green, the target in red, its neighbors in blue.
plt.scatter(x, y, marker='x', color='g', label='all points', s=30)
plt.scatter(target[0], target[1], marker='*', color='r', label='target', s=30)
plt.scatter(nearx, neary, marker='+', color='b', label='nearest neighbors', s=30)

plt.title('Scatter')
plt.legend(loc='upper right')
plt.show()

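Result:

[Image 1: scatter plot produced by the demo code]

In the figure, the red point is the target, the blue points are the ones found to be similar to it, and the green points are the rest of the dataset.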
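Since Annoy returns approximate neighbors, it can be useful to compare its output against an exact brute-force search on a small dataset. The following is a minimal sketch using only the standard library. The function names `angular_distance` and `exact_top_n` are hypothetical, and the distance formula sqrt(2 * (1 - cos(u, v))) is the one the Annoy README gives for the angular metric.

```python
import heapq
import math
import random

def angular_distance(u, v):
    """Euclidean distance between the normalized vectors:
    sqrt(2 * (1 - cos(u, v))). Assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 - 2.0 * cos))

def exact_top_n(vectors, query_index, n):
    """Return the ids of the n vectors closest to vectors[query_index],
    nearest first, by scanning every other vector."""
    query = vectors[query_index]
    candidates = (i for i in range(len(vectors)) if i != query_index)
    return heapq.nsmallest(
        n, candidates, key=lambda i: angular_distance(query, vectors[i]))

rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(500)]
exact = exact_top_n(vectors, 1, 39)
print(exact[:5])
```

With the vectors from the demo in place of `vectors`, the share of `exact` that also appears in the list returned by `get_nns_by_item` gives a rough recall figure. The brute-force scan costs time proportional to the dataset size for every query, which is exactly what the tree index avoids on large datasets.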
