faiss 使用

添加faiss到python路径

vim ~/anaconda2/lib/python2.7/site-packages/my.pth
添加路径 /home/bi_tag/faiss/faiss

参照官网： https://github.com/facebookresearch/faiss/wiki/Getting-started

import numpy as np
import faiss                   # make faiss available

# 构造数据
import time
d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

IndexFlatL2

# 为向量集构建IndexFlatL2索引，它是最简单的索引类型，只执行强力L2距离搜索
%time index = faiss.IndexFlatL2(d)   # build the index
print(index.is_trained)
index.add(xb)                  # add vectors to the index
print(index.ntotal)
k = 4   # we want to see 4 nearest neighbors
%time D, I = index.search(xb[:5], k) # sanity check
print(I)     # 向量索引位置
print(D)     # 相似度矩阵
%time D, I = index.search(xq, k)     # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:]) # neighbors of the 5 last queries

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 80.1 µs
True
100000
CPU times: user 197 ms, sys: 7 ms, total: 204 ms
Wall time: 4.86 ms
[[  0 393 363  78]
 [  1 555 277 364]
 [  2 304 101  13]
 [  3 173  18 182]
 [  4 288 370 531]]
[[ 0.          7.17517376  7.20763016  7.25116253]
 [ 0.          6.32356453  6.6845808   6.79994583]
 [ 0.          5.79640865  6.39173603  7.28151226]
 [ 0.          7.27790546  7.527987    7.66284657]
 [ 0.          6.76380348  7.29512024  7.36881447]]
CPU times: user 26.7 s, sys: 3.66 s, total: 30.3 s
Wall time: 683 ms
[[ 381  207  210  477]
 [ 526  911  142   72]
 [ 838  527 1290  425]
 [ 196  184  164  359]
 [ 526  377  120  425]]
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]

IndexIVFFlat 加聚类

# https://github.com/facebookresearch/faiss/wiki/Faster-search
# 加速计算

'''
To speed up the search, it is possible to segment the dataset into pieces. 
We define Voronoi cells in the d-dimensional space, and each database vector falls in one of the cells. 
At search time, 
only the database vectors y contained in the cell the query x 
falls in and a few neighboring ones are compared against the query vector.
'''

# 通过使用IndexIVFFlat索引，将数据集分割成多个，我们在d维空间中定义Voronoi cells，
# 只计算和x落在同一个cells中的y的距离
'''
This is done via the IndexIVFFlat index. This type of index requires a training stage, 
that can be performed on any collection of vectors 
that has the same distribution as the database vectors. 
In this case we just use the database vectors themselves.
# 需要预训练
# 搜索方法有两个参数：nlist（单元格数），nprobe（执行搜索访问的单元格数）
'''
'''
There are two parameters to the search method: nlist, the number of cells, 
and nprobe, the number of cells (out of nlist) that are visited to perform a search. 
The search time roughly increases linearly with the number of probes plus some constant due to the quantization.
'''

The nprobe parameter is always a way of adjusting the tradeoff between speed and accuracy of the result. 
Setting nprobe = nlist gives the same result as the brute-force search (but slower).'''
# nprobe， 准确度和时间的折中

nlist = 100 # 单元格数
k = 4
quantizer = faiss.IndexFlatL2(d)  # the other index  d是向量维度
%time index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
# here we specify METRIC_L2, by default it performs inner-product search
assert not index.is_trained
%time index.train(xb)
assert index.is_trained
%time index.add(xb)                  # add may be a bit slower as well
%timeit D, I = index.search(xq, k)     # actual search
print(I[-5:])                  # neighbors of the 5 last queries
index.nprobe = 10        # 执行搜索访问的单元格数（nlist以外）      # default nprobe is 1, try a few more
%time D, I = index.search(xq, k)
print(I[-5:]) # neighbors of the 5 last queries
index.search??

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 111 µs
CPU times: user 1.51 s, sys: 33 ms, total: 1.54 s
Wall time: 62.9 ms
CPU times: user 1.02 s, sys: 26 ms, total: 1.05 s
Wall time: 23.9 ms
100 loops, best of 3: 16.3 ms per loop
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]
CPU times: user 4.08 s, sys: 50 ms, total: 4.13 s
Wall time: 92.2 ms
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]

IndexIVFPQ 加聚类，加压缩, 压缩算法：https://hal.inria.fr/file/index/docid/514462/filename/paper_hal.pdf

# https://github.com/facebookresearch/faiss/wiki/Lower-memory-footprint
# 有损压缩存储
# 
'''
The indexes we have seen, IndexFlatL2 and IndexIVFFlat both store the full vectors. 
To scale up to very large datasets, 
Faiss offers variants that compress the stored vectors with a lossy compression based on product quantizers.

The vectors are still stored in Voronoi cells, 
but their size is reduced to a configurable number of bytes m (d must be a multiple of m)
'''
# 向量降维参数 m，  向量维度d必须是m的倍数
'''
The compression is based on a Product Quantizer, 
that can be seen as an additional level of quantization, 
that is applied on sub-vectors of the vectors to encode.

In this case, since the vectors are not stored exactly, 
the distances that are returned by the search method are also approximations.
'''
# 结果的近似计算， 向量压缩映射基于Product Quantizer  

'''
They can be compared with the IVFFlat results above. For this case, 
most results are wrong, but they are in the correct area of the space, 
as shown by the IDs around 10000. The situation is better for real data because:

   uniform data is very difficult to index because 
there is no regularity that can be exploited to cluster or reduce dimensionality

   for natural data, the semantic nearest neighbor is often significantly closer than irrelevant results.
'''
# 在真实数据分布， 近似结果会更好

nlist = 100
m = 8 
k = 4
quantizer = faiss.IndexFlatL2(d)  # this remains the same
# 为了扩展到非常大的数据集，Faiss提供了基于产品量化器的有损压缩来压缩存储的向量的变体。压缩的方法基于乘积量化。
# 损失了一定精度为代价， 自身距离也不为0， 这是由于有损压缩。
%time index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
faiss.IndexIVFPQ??
# 8 specifies that each sub-vector is encoded as 8 bits
%time index.train(xb)
%time index.add(xb)
%time D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
index.nprobe = 10              # make comparable with experiment above
%time D, I = index.search(xq, k)     # search
print(I[-5:])
# 简化写法， 传String参数
print "Simplifying index construction"
%time index = faiss.index_factory(d, "IVF100,PQ8")
%time index.train(xb)
%time index.add(xb)
index.nprobe = 10              # make comparable with experiment above
%time D, I = index.search(xq, k)     # search

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 144 µs
CPU times: user 1min 29s, sys: 2.12 s, total: 1min 31s
Wall time: 2.04 s
CPU times: user 6.87 s, sys: 113 ms, total: 6.98 s
Wall time: 152 ms
CPU times: user 11 ms, sys: 1e+03 µs, total: 12 ms
Wall time: 257 µs
[[   0  424  363  278]
 [   1  555 1063   24]
 [   2  304   46  346]
 [   3  773  182 1529]
 [   4  288  754  531]]
[[ 1.45568264  6.03136778  6.18729019  6.38852692]
 [ 1.4934082   5.74254704  6.19941282  6.21501732]
 [ 1.60279989  6.20174742  6.32792664  6.78541422]
 [ 1.69804895  6.2623148   6.26956797  6.56042767]
 [ 1.30235744  6.13624763  6.33899879  6.51442099]]
CPU times: user 3.82 s, sys: 69 ms, total: 3.89 s
Wall time: 84.8 ms
[[10664 10914  9922  9380]
 [10260  9014  9458 10310]
 [11291  9380 11103 10392]
 [10856 10284  9638 11276]
 [10304  9327 10152  9229]]
Simplifying index construction
CPU times: user 6 ms, sys: 1 ms, total: 7 ms
Wall time: 163 µs
CPU times: user 2min 27s, sys: 2.21 s, total: 2min 29s
Wall time: 8.72 s
CPU times: user 8.89 s, sys: 185 ms, total: 9.08 s
Wall time: 202 ms
CPU times: user 3.05 s, sys: 37 ms, total: 3.08 s
Wall time: 67.7 ms

GPU使用(未部署，服务器没有GPU)

# https://github.com/facebookresearch/faiss/wiki/Running-on-GPUs
# 使用GPU
# 单GPU
res = faiss.StandardGpuResources()  # use a single GPU
# build a flat (CPU) index
index_flat = faiss.IndexFlatL2(d)
# make it into a gpu index
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)
gpu_index_flat.add(xb)         # add vectors to the index
print(gpu_index_flat.ntotal)

k = 4                          # we want to see 4 nearest neighbors
D, I = gpu_index_flat.search(xq, k)  # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])

# 多GPU
ngpus = faiss.get_num_gpus()
print("number of GPUs:", ngpus)
cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_all_gpus(  # build the index
    cpu_index
)
gpu_index.add(xb)              # add vectors to the index
print(gpu_index.ntotal)
k = 4                          # we want to see 4 nearest neighbors
D, I = gpu_index.search(xq, k) # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries

k-means 聚类
https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization

# Faiss building blocks: clustering, PCA, quantization
# k-means 聚类， PCA, PQ量化

# k-means 聚类
ncentroids = 1024
niter = 20
verbose = True
n = 20000
d = 32
x = np.random.rand(n, d).astype('float32')
d = x.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter, verbose)
kmeans.train(x)
# kmeans.centroids 聚类中心
D, I = kmeans.index.search(x, 1)
index = faiss.IndexFlatL2 (d)
index.add (x)
%time D, I = index.search (kmeans.centroids, 15)  # 最近的15个聚类中心
print(D[:5])                   # neighbors of the 5 first queries
print(I[:5])                  # neighbors of the 5 last queries

CPU times: user 544 ms, sys: 11 ms, total: 555 ms
Wall time: 12.1 ms
[[ 0.75993538  1.01302719  1.23967171  1.24554443  1.35987091  1.36220741
   1.3649559   1.37965965  1.3970623   1.40263176  1.43087006  1.47748756
   1.49857903  1.50987244  1.51762772]
 [ 0.64031982  1.11455727  1.19411278  1.2266655   1.27806664  1.3130722
   1.32496071  1.37686729  1.41514015  1.41737747  1.46724701  1.48617554
   1.49199867  1.49544144  1.51000404]
 [ 0.91652107  1.06410217  1.15173531  1.39237595  1.40605354  1.41710854
   1.42310143  1.44347572  1.44360542  1.45162773  1.49108124  1.49484253
   1.53658867  1.5404129   1.55116844]
 [ 0.80885315  1.33774185  1.34846497  1.3493042   1.36525154  1.36763763
   1.46035767  1.46951294  1.47649002  1.48202133  1.49103546  1.51737404
   1.5261879   1.52928543  1.55130959]
 [ 0.86759949  1.15416336  1.17353058  1.19447136  1.21015167  1.33818817
   1.37376595  1.3926239   1.49480057  1.52285957  1.53648758  1.54061317
   1.55315018  1.57767677  1.58125877]]
[[ 7088 11759    99 19526 16154 12055 17769 12503  9853 13708 13931  6740
   4466 17428 19830]
 [11262  8439 11477  8793 19599  6928  3824  7343 14503  9918 18099  4363
  18700  9995 17213]
 [16546 15334  8798  3045  5780  4047 19566 19456  1025 11855 11011  4220
   6267 14521 15692]
 [ 1301 19438  1677  9848  5933  1423 12830  2033  4208 19255 16002 12055
   2408 17972  2365]
 [18764 17835  7337  3042  3290  6483  9176  3692 17901 19445  3422  2857
  19815  4279  9130]]

PCA 向量降维

# PCA降维
# random training data 
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix (40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
# print this to show that the magnitude of tr's columns is decreasing
print (tr ** 2).sum(0)

[ 116.68821716  116.36938477  107.59984589  107.00305939  105.75762939
  103.17525482  101.48827362  101.10425568   98.96426392   96.75302887]

PQ encoding / decoding 向量的有损压缩与解压

# PQ encoding / decoding
# 向量的有损压缩与解压
# The ProductQuantizer object can be used to encode or decode vectors to codes:
d = 32  # data dimension
cs = 4  # code size (bytes)
# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')
# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')
pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)
# encode 
codes = pq.compute_codes(x)
# decode
x2 = pq.decode(codes)
# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
print avg_relative_error

0.0663308

IDMAP

# IndexIDMap 加ID映射 
index = faiss.IndexFlatL2(xb.shape[1]) 
ids = np.arange(xb.shape[0])
# index.add_with_ids(xb, ids)  # 报错， 不支持 because IndexFlatL2 does not support add_with_ids
index2 = faiss.IndexIDMap(index)
index2.add_with_ids(xb, ids) # ok, works, the vectors are stored in the underlying index

https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
https://github.com/facebookresearch/faiss/wiki/Faiss-indexes

如何选择使用哪种索引方式，

faiss提供了很多索引方式的api供选择

所有的索引构建都可以通过 faiss.index_factory(d, "......") 统一接口来实现

'''
10000 * 64
1, IndexFlatL2， Exact Search for L2， brute-force， "Flat"
L2距离精确计算，无需train，速度慢， CPU time需要26s, real需要687ms
2, IndexIVFFlat， Take another index to assign vectors to inverted lists "IVFx,Flat"
切割到cells, 估算，需要train, nprobe 越大越精确(最大nlist), 教程中取了1/10，执行时间也大概缩短为1/7
3, IndexPQ, Product quantizer (PQ) in flat mode "PQx"
压缩，"PQx"
4, IndexIVFPQ， IVFADC (coarse quantizer+PQ on residuals) "IVF100,PQ8"
分桶加压缩，可以用index_factory来简化构造， "IVF100,PQ8" -> 100个桶，压缩到8bit
时间进一步降低，更重要的是，通过PQ的有损压缩，降低了内存使用
'''

索引类型选择

'''
1, HNSWx IndexHNSWFlat 方法无需train, 不支持removing 向量样本
Supported on GPU: no，准确
2， "...,Flat",
先聚类分桶，读入训练过程会比较慢，再计算相似度，快，无压缩，存储消耗等同于原数据集大小，通过nprobe来折中速度与精度
"..." means a clustering of the dataset has to be performed beforehand (read below).
After clustering, "Flat" just organizes the vectors into buckets,
so it does not compress them, the storage size is the same as that of the original dataset.
The tradeoff between speed and accuracy is set via the nprobe parameter.
Supported on GPU: yes (but see below, the clustering method must be supported as well)
3， "How big is the dataset?"
(1), 小于1M vectors "...,IVFx,..." 支持GPU
(2), 1M ~ 10M: "...,IMI2x10,..." 不支持GPU
(3), 10M - 100M "...,IMI2x12,..."
(4), 100M - 1B "...,IMI2x14,..."
'''

faiss 使用

添加faiss到python路径

参照官网： https://github.com/facebookresearch/faiss/wiki/Getting-started

如何选择使用哪种索引方式，

索引类型选择

你可能感兴趣的:(faiss 使用)