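Facebook's faiss for vector search and clustering: installation and usage examples

Installation

Download the required installation packages from here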

1. Install Fortran
    yum install libgfortran
    yum install gcc-gfortran

2. Install BLAS
    rpm -ivh blas-3.2.1-5.el6.x86_64.rpm
    rpm -ivh blas-devel-3.2.1-5.el6.x86_64.rpm 

3. Install LAPACK
    rpm -ivh lapack-3.2.1-5.el6.x86_64.rpm 
    rpm -ivh lapack-devel-3.2.1-5.el6.x86_64.rpm 

5.  Clone the code
    git clone git@github.com:facebookresearch/faiss.git

6. Build and install
    ./configure
    make
    make install
    
7. Test
    make test

If the run ends with output like the following, the build succeeded:

test_IndexIVFPQ (test_index.TestSearchAndReconstruct) ... WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
Reconstruction error = 0.455
ok
test_IndexTransform (test_index.TestSearchAndReconstruct) ... WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
Reconstruction error = 3.241
ok
test_MultiIndex (test_index.TestSearchAndReconstruct) ... WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1500 points to 256 centroids: please provide at least 9984 training points
Reconstruction error = 0.437
ok

----------------------------------------------------------------------
Ran 74 tests in 118.620s

OK

8. Install the Python wrapper

make py

    Note: once this finishes, change into the faiss/python directory and run:

python -c "import faiss" 

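If that works, remember to copy the faiss directory from the current directory into /usr/lib/python2.7/site-packages. The right target depends on which Python you use; with Anaconda, for example, it is the site-packages directory of that installation.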
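
If you are unsure which site-packages directory your interpreter uses, you can ask it. The snippet below is a minimal sketch using only the standard library; run it with the same python executable you intend to use with faiss. The variable names are only illustrative.

```python
import sysconfig

# Ask the running interpreter where third-party packages are installed.
# "purelib" is for pure-Python packages, "platlib" for packages that
# contain compiled extensions; on many systems both are the same path.
paths = sysconfig.get_paths()
purelib = paths["purelib"]
platlib = paths["platlib"]
print("purelib:", purelib)
print("platlib:", platlib)
```

Either directory is on the import path of that interpreter, so a package copied there can be imported from any working directory.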

Then check that the installation works:

[root@aws ~]# python
Python 2.7.5 (default, Jul 13 2018, 13:06:57)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import faiss

 

Usage examples

IndexFlatL2

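IndexFlatL2 performs exact search: it does not compress the data, it measures distance with the Euclidean (L2) metric, and it needs no training. It does not support adding vectors with custom IDs (add_with_ids). Vectors are simply numbered in insertion order, so removing one changes the numbering of the vectors after it.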

import numpy as np
import faiss

np.random.seed(100)
train_v = np.random.rand(1000, 128).astype('float32')
q_v = np.random.rand(10,128).astype('float32')

index = faiss.IndexFlatL2(128)  ## create the index
index.add(train_v) ## add the data
D, I = index.search(q_v, 10) ## search for the 10 nearest neighbours of each query

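The returned D holds the distances (squared L2 distances for IndexFlatL2, so smaller means more similar) and I holds the indices of the matching vectors, i.e. the order in which they were added to the index. If you need your own IDs, for example image IDs, you either have to maintain a mapping between your IDs and the index positions yourself at insertion time, or wrap the index with faiss.IndexIDMap.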

index = faiss.IndexFlatL2(128)
index = faiss.IndexIDMap(index)
index.add_with_ids(train_v,np.arange(1000))
D, I = index.search(q_v, 10)

The I returned now contains the IDs that were supplied with the vectors at insertion time.
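
To make the meaning of D and I concrete, here is a NumPy-only sketch of what an exact L2 search computes. It is not faiss code: brute_force_l2 and custom_ids are hypothetical names introduced for illustration, and the ID offset of 50000 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.RandomState(100)
database = rng.rand(1000, 128).astype('float32')   # vectors "added" to the index
queries = rng.rand(10, 128).astype('float32')
custom_ids = np.arange(1000) + 50000                # hypothetical external IDs (e.g. image IDs)

def brute_force_l2(database, queries, k):
    """Exact k-nearest-neighbour search using squared Euclidean distance."""
    # ||q - x||^2 for every (query, database) pair, shape (n_queries, n_database)
    diff = queries[:, None, :] - database[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # positions of the k smallest distances, ordered from nearest to farthest
    order = np.argsort(dist, axis=1)[:, :k]
    return np.take_along_axis(dist, order, axis=1), order

D, I = brute_force_l2(database, queries, 10)
ids = custom_ids[I]   # positions translated to external IDs, as an ID-mapping wrapper would do
print(D.shape, I.shape, ids[0])
```

Each row of D is sorted in ascending order, and I contains insertion positions until they are translated through the ID table.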

IndexIVFFlat

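To speed up search, you can use IndexIVFFlat. Creating this index requires another index, the quantizer, and optionally a distance metric; if none is given, it defaults to L2 distance. You also need to specify nlist, the number of partitions the index is divided into.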

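dims = 128  ## dimensionality of the vectors, matching train_v and q_v above
nlist = 16  ## number of partitions (inverted lists)
quantizer = faiss.IndexFlatL2(dims)
index = faiss.IndexIVFFlat(quantizer, dims, nlist, faiss.METRIC_L2)
index.train(train_v)
index_v = np.random.rand(5000, 128).astype('float32')
index.add_with_ids(index_v, np.arange(5000))
D, I = index.search(q_v, 10)
## remove a vector ID that appeared in the previous search result (3821 in the author's run), then search again
index.remove_ids(np.array([3821], dtype='int64'))
D, I = index.search(q_v, 10)
## the removed ID no longer appears in the result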

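In addition, IndexIVFFlat supports setting nprobe, the number of partitions visited per query, which trades speed against accuracy. Its default value is 1. The larger nprobe is, the higher the search accuracy and the slower the search. IndexIVF has two basic components: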

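  1. quantizer index: given a vector, the quantizer index returns the group (partition) the vector belongs to
  2. InvertedLists: maps a list id (one of the nlist partitions) to a sequence of (code, id) pairs, where code is the stored encoding of a vector and id is that vector's ID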
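
The two components and the effect of nprobe can be sketched in plain NumPy. This is a toy illustration, not how faiss is implemented: the centroids are random data points rather than the result of k-means training, the inverted lists store only IDs, and ivf_search and sq_dist are hypothetical helper names.

```python
import numpy as np

rng = np.random.RandomState(0)
dims, nlist = 16, 8
data = rng.rand(2000, dims).astype('float32')
queries = rng.rand(5, dims).astype('float32')

def sq_dist(a, b):
    """Squared L2 distance between every row of a and every row of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

# "Quantizer": nlist centroids. Real training would run k-means; here we
# simply pick random data points as centroids to keep the sketch short.
centroids = data[rng.choice(len(data), nlist, replace=False)]

# "Inverted lists": for each list, the ids of the vectors assigned to it.
assignment = sq_dist(data, centroids).argmin(axis=1)
inverted_lists = [np.where(assignment == c)[0] for c in range(nlist)]

def ivf_search(query, k, nprobe):
    # visit only the nprobe lists whose centroids are closest to the query
    nearest_lists = sq_dist(query[None, :], centroids)[0].argsort()[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in nearest_lists])
    dist = sq_dist(query[None, :], data[candidates])[0]
    order = dist.argsort()[:k]
    return candidates[order], len(candidates)

# exact top-5 for comparison
exact = sq_dist(queries, data).argsort(axis=1)[:, :5]
for nprobe in (1, 4, nlist):
    found = [ivf_search(q, 5, nprobe) for q in queries]
    recall = np.mean([len(set(ids) & set(ex)) / 5.0 for (ids, _), ex in zip(found, exact)])
    scanned = np.mean([n for _, n in found])
    print("nprobe=%d scanned=%.0f recall@5=%.2f" % (nprobe, scanned, recall))
```

With nprobe equal to nlist every list is scanned, so the result matches the exact search; smaller values scan fewer vectors at the risk of missing true neighbours.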

IndexIVFFlat does not compress the data. If memory usage matters a lot (especially on the GPU), consider using PQ.

How PQ works

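PQ stands for product quantization. In essence it is an approximate search algorithm that achieves efficient vector retrieval through divide-and-conquer and data compression; when efficiency matters in a large search space, exact search is usually not used. First, the concept of quantization: vector quantization defines a quantizer (a mapping function q) that maps a D-dimensional vector to one of k centroids (k is usually a power of 2), so each vector can be stored as the index of its centroid, which takes far less space than the original vector. Product quantization adds a divide-and-conquer step in front of quantization. For example, if the original vector has D=128 dimensions and we split it into m=4 groups, each sub-vector has 128/4=32 dimensions, and within each 32-dimensional group the mapping function q is learned with the k-means algorithm. The first step divides, the second compresses, and together they speed up search.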
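
The divide-then-compress idea can be sketched in NumPy with the same D=128 and m=4 as in the example. This is a simplified illustration, not faiss's implementation: k is set to 16 centroids per group (an assumption to keep it fast; faiss commonly uses 256), k-means is a plain Lloyd loop, and kmeans, encode and decode are hypothetical helper names.

```python
import numpy as np

rng = np.random.RandomState(0)
D, m, k = 128, 4, 16          # k=16 centroids per sub-space keeps the sketch fast
sub_dims = D // m             # 128 / 4 = 32 dimensions per sub-vector
train = rng.rand(2000, D).astype('float32')

def kmeans(x, k, n_iter=10):
    """Plain Lloyd's k-means; returns a (k, dim) array of centroids."""
    centroids = x[rng.choice(len(x), k, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for c in range(k):
            members = x[assign == c]
            if len(members):          # leave a centroid unchanged if its cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids

# Step 1 (divide): learn one codebook per group of 32 dimensions.
codebooks = [kmeans(train[:, j * sub_dims:(j + 1) * sub_dims], k) for j in range(m)]

def encode(x):
    """Step 2 (compress): replace each sub-vector by the index of its nearest centroid."""
    codes = np.empty((len(x), m), dtype='uint8')
    for j, cb in enumerate(codebooks):
        sub = x[:, j * sub_dims:(j + 1) * sub_dims]
        codes[:, j] = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return codes

def decode(codes):
    """Approximate reconstruction: concatenate the centroids the codes point to."""
    return np.hstack([cb[codes[:, j]] for j, cb in enumerate(codebooks)])

codes = encode(train)
approx = decode(codes)
error = float(((train - approx) ** 2).sum(axis=1).mean())
print(codes.shape, codes.dtype, approx.shape)
print("mean squared reconstruction error:", error)
```

Each 128-dimensional float32 vector (512 bytes) is stored here as 4 one-byte codes; the reconstruction error is the price paid for that compression.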

IndexIVFPQ

IndexIVFPQ compresses the original data, so it provides approximate rather than exact search. Uniformly distributed data is hard to compress.

quantizer = faiss.IndexFlatL2(dims)
index = faiss.IndexIVFPQ(quantizer, dims, 16, 8, 8)
index.train(train_v)

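The first, second and third parameters are the same as described above. The first 8 is the number of segments each vector is split into, i.e. the m from the PQ section above. The second 8 is the number of bits used for each segment's code, i.e. for the index of its cluster centroid; 8 bits means 256 centroids per segment. The program keeps printing warnings because train_v has only 1000 vectors, fewer than the 9984 training points the message asks for when clustering into 256 centroids. Retraining with 10000 vectors, as done after the warning output, avoids this.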

WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
WARNING clustering 1000 points to 256 centroids: please provide at least 9984 training points
train_v = np.random.rand(10000, 128).astype('float32')
index.train(train_v)
index.add_with_ids(index_v, np.arange(5000))
## index.nprobe = 10  (setting nprobe is supported here too)
index.search(q_v, 10)
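
The numbers in the IndexIVFPQ call and in the warning can be checked with a little arithmetic. The sketch below assumes float32 input vectors, and it assumes that the 9984 in the warning comes from a minimum of 39 training points per centroid, which to my knowledge is the default in faiss's k-means; treat that constant as an assumption.

```python
dims, m, nbits = 128, 8, 8

centroids_per_segment = 2 ** nbits           # 8 bits -> 256 centroids per segment
raw_bytes = dims * 4                         # float32 vector: 4 bytes per dimension
code_bytes = m * nbits // 8                  # PQ code: m segments of nbits each
print(centroids_per_segment, raw_bytes, code_bytes, raw_bytes // code_bytes)
# prints: 256 512 8 64

# Assumed default: at least 39 training points per centroid.
min_points_per_centroid = 39
print(min_points_per_centroid * centroids_per_segment)
# prints: 9984
```

So each 512-byte vector is stored as an 8-byte code, a 64x reduction, and training on 10000 vectors is just above the threshold quoted in the warning.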

Factory method

index = faiss.index_factory(16, "Flat", faiss.METRIC_L2)
index = faiss.index_factory(16, "Flat", faiss.METRIC_INNER_PRODUCT)
index = faiss.index_factory(16, "IVF100,Flat")
index = faiss.index_factory(128, "IVF100,PQ8")
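
The first two factory calls differ only in the metric. As a NumPy-only sketch (not faiss code) of what the two metrics rank by: L2 prefers the smallest squared distance, inner product prefers the largest dot product. For unit-length vectors the two rankings coincide, because ||x - q||^2 = 2 - 2<x, q>. The data here is random and the variable names are only illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
database = rng.rand(500, 16)
query = rng.rand(16)

# Normalise everything to unit length.
database /= np.linalg.norm(database, axis=1, keepdims=True)
query /= np.linalg.norm(query)

l2 = ((database - query) ** 2).sum(axis=1)   # smaller is better
ip = database.dot(query)                     # larger is better

top_l2 = np.argsort(l2)[:10]
top_ip = np.argsort(-ip)[:10]
print(top_l2)
print(top_ip)
```

Without normalisation the two rankings can differ, so the metric should match how the vectors were produced.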

 

Using the GPU

res = faiss.StandardGpuResources() ## acquire GPU resources
dims = 1024
quantizer = faiss.IndexFlatL2(dims)
#index = faiss.IndexIVFFlat(quantizer, dims, 10)
index = faiss.IndexIVFPQ(quantizer, dims, 128, 8, 8)
gpu_index = faiss.index_cpu_to_gpu(res, 0, index) ## move the index to the GPU, using GPU number 0

 
