Add faiss to the Python path
vim ~/anaconda2/lib/python2.7/site-packages/my.pth
Add the path /home/bi_tag/faiss/faiss
See the official wiki: https://github.com/facebookresearch/faiss/wiki/Getting-started
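A .pth file in site-packages lists one directory per line, and Python appends each existing directory to sys.path at startup. The sketch below reproduces that mechanism in temporary directories with the standard site module. Both directory names are placeholders, not the real install paths.

```python
import os
import site
import sys
import tempfile

# Hypothetical stand-ins for site-packages and the faiss checkout.
fake_site_packages = tempfile.mkdtemp()
fake_faiss_dir = tempfile.mkdtemp()

# A .pth file holds one directory per line.
with open(os.path.join(fake_site_packages, "my.pth"), "w") as fh:
    fh.write(fake_faiss_dir + "\n")

# addsitedir processes every .pth file in the directory, which is what
# Python does for site-packages at interpreter startup.
site.addsitedir(fake_site_packages)

path_added = fake_faiss_dir in sys.path
print(path_added)
```

After editing the real my.pth, restart the interpreter (or notebook kernel) so the new entry is picked up.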
import numpy as np
import faiss # make faiss available
# build the data
import time
d = 64 # dimension
nb = 100000 # database size
nq = 10000 # nb of queries
np.random.seed(1234) # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
- IndexFlatL2
# Build an IndexFlatL2 index over the vectors; it is the simplest index type and performs only brute-force L2 distance search
%time index = faiss.IndexFlatL2(d) # build the index
print(index.is_trained)
index.add(xb) # add vectors to the index
print(index.ntotal)
k = 4 # we want to see 4 nearest neighbors
%time D, I = index.search(xb[:5], k) # sanity check
print(I) # ids (positions in xb) of the nearest neighbors
print(D) # squared L2 distances to those neighbors (smaller is closer)
%time D, I = index.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(I[-5:]) # neighbors of the 5 last queries
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 80.1 µs
True
100000
CPU times: user 197 ms, sys: 7 ms, total: 204 ms
Wall time: 4.86 ms
[[ 0 393 363 78]
[ 1 555 277 364]
[ 2 304 101 13]
[ 3 173 18 182]
[ 4 288 370 531]]
[[ 0. 7.17517376 7.20763016 7.25116253]
[ 0. 6.32356453 6.6845808 6.79994583]
[ 0. 5.79640865 6.39173603 7.28151226]
[ 0. 7.27790546 7.527987 7.66284657]
[ 0. 6.76380348 7.29512024 7.36881447]]
CPU times: user 26.7 s, sys: 3.66 s, total: 30.3 s
Wall time: 683 ms
[[ 381 207 210 477]
[ 526 911 142 72]
[ 838 527 1290 425]
[ 196 184 164 359]
[ 526 377 120 425]]
[[ 9900 10500 9309 9831]
[11055 10895 10812 11321]
[11353 11103 10164 9787]
[10571 10664 10632 9638]
[ 9628 9554 10036 9582]]
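As a rough illustration of what a flat L2 index computes, here is a pure-Python brute-force search on toy data. It is a sketch, not Faiss's implementation: the function name brute_force_search and the toy sizes are assumptions made for the example. Like IndexFlatL2, it reports squared L2 distances, which is why the sanity check above shows 0 in the first column of D.

```python
import heapq
import random


def brute_force_search(database, queries, k):
    """Return (D, I): squared L2 distances and ids of the k nearest vectors."""
    D, I = [], []
    for q in queries:
        # Squared L2 distance from q to every database vector.
        dists = [(sum((a - b) ** 2 for a, b in zip(q, v)), i)
                 for i, v in enumerate(database)]
        nearest = heapq.nsmallest(k, dists)  # ascending by distance
        D.append([dist for dist, _ in nearest])
        I.append([idx for _, idx in nearest])
    return D, I


random.seed(1234)
dim = 8  # toy sizes, much smaller than d = 64 and nb = 100000 above
database = [[random.random() for _ in range(dim)] for _ in range(200)]

# Same idea as the sanity check on xb[:5]: each vector's nearest
# neighbor is itself, at distance 0.
D, I = brute_force_search(database, database[:5], k=4)
print(I)
```

Every query is compared against every database vector, so the cost grows linearly with the database size. The IVF index in the next part avoids exactly that.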
- IndexIVFFlat: adds clustering
# https://github.com/facebookresearch/faiss/wiki/Faster-search
# Faster search
'''
To speed up the search, it is possible to segment the dataset into pieces.
We define Voronoi cells in the d-dimensional space, and each database vector falls in one of the cells.
At search time,
only the database vectors y contained in the cell the query x
falls in and a few neighboring ones are compared against the query vector.
'''
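# With the IndexIVFFlat index the dataset is split into pieces: Voronoi cells are defined in the d-dimensional space,
# and distances are computed only to the vectors y in the cell that x falls in (plus a few neighboring cells)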
'''
This is done via the IndexIVFFlat index. This type of index requires a training stage,
that can be performed on any collection of vectors
that has the same distribution as the database vectors.
In this case we just use the database vectors themselves.
# Requires a training stage
# The search method has two parameters: nlist (number of cells) and nprobe (number of cells visited during a search)
'''
'''
There are two parameters to the search method: nlist, the number of cells,
and nprobe, the number of cells (out of nlist) that are visited to perform a search.
The search time roughly increases linearly with the number of probes plus some constant due to the quantization.
The nprobe parameter is always a way of adjusting the tradeoff between speed and accuracy of the result.
Setting nprobe = nlist gives the same result as the brute-force search (but slower).
'''
# nprobe: tradeoff between accuracy and time
nlist = 100 # number of cells
k = 4
quantizer = faiss.IndexFlatL2(d) # the other index; d is the vector dimension
%time index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
# here we specify METRIC_L2, by default it performs inner-product search
assert not index.is_trained
%time index.train(xb)
assert index.is_trained
%time index.add(xb) # add may be a bit slower as well
%timeit D, I = index.search(xq, k) # actual search
print(I[-5:]) # neighbors of the 5 last queries
index.nprobe = 10 # number of cells visited per search (out of nlist); default nprobe is 1, try a few more
%time D, I = index.search(xq, k)
print(I[-5:]) # neighbors of the 5 last queries
index.search??
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 111 µs
CPU times: user 1.51 s, sys: 33 ms, total: 1.54 s
Wall time: 62.9 ms
CPU times: user 1.02 s, sys: 26 ms, total: 1.05 s
Wall time: 23.9 ms
100 loops, best of 3: 16.3 ms per loop
[[ 9900 10500 9309 9831]
[11055 10895 10812 11321]
[11353 11103 10164 9787]
[10571 10664 10632 9638]
[ 9628 9554 10036 9582]]
CPU times: user 4.08 s, sys: 50 ms, total: 4.13 s
Wall time: 92.2 ms
[[ 9900 10500 9309 9831]
[11055 10895 10812 11321]
[11353 11103 10164 9787]
[10571 10664 10632 9638]
[ 9628 9554 10036 9582]]
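The role of nlist and nprobe can be sketched in pure Python on toy data. This is only an illustration of the inverted-file idea, not Faiss's code. The helper names (build_ivf, ivf_search) are made up for the example, and sampled database points stand in for the k-means centroids that index.train would learn.

```python
import random


def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(database, centroids):
    """Assign every database vector to the cell of its nearest centroid."""
    cells = [[] for _ in centroids]
    for idx, v in enumerate(database):
        cell = min(range(len(centroids)), key=lambda c: sq_l2(v, centroids[c]))
        cells[cell].append(idx)
    return cells


def ivf_search(query, database, centroids, cells, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_l2(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in cells[c]]
    candidates.sort(key=lambda idx: (sq_l2(query, database[idx]), idx))
    return candidates[:k]


random.seed(0)
dim, nlist = 4, 10
database = [[random.random() for _ in range(dim)] for _ in range(500)]
# Simplification: sampled database points stand in for trained centroids.
centroids = random.sample(database, nlist)
cells = build_ivf(database, centroids)

query = [random.random() for _ in range(dim)]
exact = sorted(range(len(database)),
               key=lambda idx: (sq_l2(query, database[idx]), idx))[:4]
approx = ivf_search(query, database, centroids, cells, k=4, nprobe=1)
full = ivf_search(query, database, centroids, cells, k=4, nprobe=nlist)
print(approx, full, exact)
```

With nprobe = nlist every cell is scanned, so the result matches the exhaustive search. With nprobe = 1 only one cell is scanned, which is faster but can miss neighbors that lie just across a cell boundary.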
- IndexIVFPQ: adds clustering and compression; compression algorithm: https://hal.inria.fr/file/index/docid/514462/filename/paper_hal.pdf
# https://github.com/facebookresearch/faiss/wiki/Lower-memory-footprint
# Lossy compressed storage
#
'''
The indexes we have seen, IndexFlatL2 and IndexIVFFlat both store the full vectors.
To scale up to very large datasets,
Faiss offers variants that compress the stored vectors with a lossy compression based on product quantizers.
The vectors are still stored in Voronoi cells,
but their size is reduced to a configurable number of bytes m (d must be a multiple of m)
'''
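# m is the number of sub-quantizers (the code size in bytes when each code uses 8 bits); the vector dimension d must be a multiple of m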
'''
The compression is based on a Product Quantizer,
that can be seen as an additional level of quantization,
that is applied on sub-vectors of the vectors to encode.
In this case, since the vectors are not stored exactly,
the distances that are returned by the search method are also approximations.
'''
# Results are approximate; the vector compression is based on a Product Quantizer
'''
They can be compared with the IVFFlat results above. For this case,
most results are wrong, but they are in the correct area of the space,
as shown by the IDs around 10000. The situation is better for real data because:
  - uniform data is very difficult to index because
    there is no regularity that can be exploited to cluster or reduce dimensionality
  - for natural data, the semantic nearest neighbor is often significantly closer than irrelevant results.
'''
# On real data distributions the approximate results are better
nlist = 100
m = 8
k = 4
quantizer = faiss.IndexFlatL2(d) # this remains the same
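# To scale to very large datasets, Faiss offers variants that compress the stored vectors with lossy compression based on product quantization.
# This costs some precision: even a vector's distance to itself is no longer 0, because the compression is lossy.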
%time index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
faiss.IndexIVFPQ??
# 8 specifies that each sub-vector is encoded as 8 bits
%time index.train(xb)
%time index.add(xb)
%time D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
index.nprobe = 10 # make comparable with experiment above
%time D, I = index.search(xq, k) # search
print(I[-5:])
# Shorthand: pass a factory string
print("Simplifying index construction")
%time index = faiss.index_factory(d, "IVF100,PQ8")
%time index.train(xb)
%time index.add(xb)
index.nprobe = 10 # make comparable with experiment above
%time D, I = index.search(xq, k) # search
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 144 µs
CPU times: user 1min 29s, sys: 2.12 s, total: 1min 31s
Wall time: 2.04 s
CPU times: user 6.87 s, sys: 113 ms, total: 6.98 s
Wall time: 152 ms
CPU times: user 11 ms, sys: 1e+03 µs, total: 12 ms
Wall time: 257 µs
[[ 0 424 363 278]
[ 1 555 1063 24]
[ 2 304 46 346]
[ 3 773 182 1529]
[ 4 288 754 531]]
[[ 1.45568264 6.03136778 6.18729019 6.38852692]
[ 1.4934082 5.74254704 6.19941282 6.21501732]
[ 1.60279989 6.20174742 6.32792664 6.78541422]
[ 1.69804895 6.2623148 6.26956797 6.56042767]
[ 1.30235744 6.13624763 6.33899879 6.51442099]]
CPU times: user 3.82 s, sys: 69 ms, total: 3.89 s
Wall time: 84.8 ms
[[10664 10914 9922 9380]
[10260 9014 9458 10310]
[11291 9380 11103 10392]
[10856 10284 9638 11276]
[10304 9327 10152 9229]]
Simplifying index construction
CPU times: user 6 ms, sys: 1 ms, total: 7 ms
Wall time: 163 µs
CPU times: user 2min 27s, sys: 2.21 s, total: 2min 29s
Wall time: 8.72 s
CPU times: user 8.89 s, sys: 185 ms, total: 9.08 s
Wall time: 202 ms
CPU times: user 3.05 s, sys: 37 ms, total: 3.08 s
Wall time: 67.7 ms
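The memory saving from PQ can be checked with simple arithmetic, using the d, m, and nb values from the cells above. This is a back-of-the-envelope sketch: it counts only the vector payload and ignores the stored ids, the coarse centroids, and the PQ codebooks.

```python
d = 64          # vector dimension used above
m = 8           # number of sub-quantizers
nbits = 8       # bits per sub-quantizer code
nb = 100000     # database size used above

flat_bytes_per_vector = d * 4            # float32 components
pq_bytes_per_vector = m * nbits // 8     # one nbits-wide code per sub-vector
sub_vector_dim = d // m                  # dimensions covered by each code

flat_total_mb = nb * flat_bytes_per_vector / 1e6
pq_total_mb = nb * pq_bytes_per_vector / 1e6
print(flat_bytes_per_vector, pq_bytes_per_vector, sub_vector_dim)
print(flat_total_mb, pq_total_mb)
```

Each 256-byte vector shrinks to an 8-byte code, a 32x reduction in payload, at the price of the approximate distances seen in the output above.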
- Using GPUs (not deployed; the server has no GPU)
# https://github.com/facebookresearch/faiss/wiki/Running-on-GPUs
# Using GPUs
# Single GPU
res = faiss.StandardGpuResources() # use a single GPU
# build a flat (CPU) index
index_flat = faiss.IndexFlatL2(d)
# make it into a gpu index
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)
gpu_index_flat.add(xb) # add vectors to the index
print(gpu_index_flat.ntotal)
k = 4 # we want to see 4 nearest neighbors
D, I = gpu_index_flat.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(I[-5:])
# Multiple GPUs
ngpus = faiss.get_num_gpus()
print("number of GPUs:", ngpus)
cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_all_gpus( # build the index
cpu_index
)
gpu_index.add(xb) # add vectors to the index
print(gpu_index.ntotal)
k = 4 # we want to see 4 nearest neighbors
D, I = gpu_index.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(I[-5:]) # neighbors of the 5 last queries
0
- k-means clustering
https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization
# Faiss building blocks: clustering, PCA, quantization
# k-means clustering, PCA, PQ quantization
# k-means clustering
ncentroids = 1024
niter = 20
verbose = True
n = 20000
d = 32
x = np.random.rand(n, d).astype('float32')
d = x.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(x)
# kmeans.centroids holds the cluster centroids
D, I = kmeans.index.search(x, 1) # nearest centroid for each vector of x
index = faiss.IndexFlatL2(d)
index.add(x)
%time D, I = index.search(kmeans.centroids, 15) # the 15 vectors of x nearest to each centroid
print(D[:5]) # squared distances for the first 5 centroids
print(I[:5]) # ids of those neighbors for the first 5 centroids
CPU times: user 544 ms, sys: 11 ms, total: 555 ms
Wall time: 12.1 ms
[[ 0.75993538 1.01302719 1.23967171 1.24554443 1.35987091 1.36220741
1.3649559 1.37965965 1.3970623 1.40263176 1.43087006 1.47748756
1.49857903 1.50987244 1.51762772]
[ 0.64031982 1.11455727 1.19411278 1.2266655 1.27806664 1.3130722
1.32496071 1.37686729 1.41514015 1.41737747 1.46724701 1.48617554
1.49199867 1.49544144 1.51000404]
[ 0.91652107 1.06410217 1.15173531 1.39237595 1.40605354 1.41710854
1.42310143 1.44347572 1.44360542 1.45162773 1.49108124 1.49484253
1.53658867 1.5404129 1.55116844]
[ 0.80885315 1.33774185 1.34846497 1.3493042 1.36525154 1.36763763
1.46035767 1.46951294 1.47649002 1.48202133 1.49103546 1.51737404
1.5261879 1.52928543 1.55130959]
[ 0.86759949 1.15416336 1.17353058 1.19447136 1.21015167 1.33818817
1.37376595 1.3926239 1.49480057 1.52285957 1.53648758 1.54061317
1.55315018 1.57767677 1.58125877]]
[[ 7088 11759 99 19526 16154 12055 17769 12503 9853 13708 13931 6740
4466 17428 19830]
[11262 8439 11477 8793 19599 6928 3824 7343 14503 9918 18099 4363
18700 9995 17213]
[16546 15334 8798 3045 5780 4047 19566 19456 1025 11855 11011 4220
6267 14521 15692]
[ 1301 19438 1677 9848 5933 1423 12830 2033 4208 19255 16002 12055
2408 17972 2365]
[18764 17835 7337 3042 3290 6483 9176 3692 17901 19445 3422 2857
19815 4279 9130]]
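For intuition about what the training step does, here is a plain Lloyd-style k-means in pure Python on a toy 2-D dataset. It is a sketch under simplifying assumptions (random initial centroids, fixed iteration count). The function name kmeans is local to the example and is not the Faiss API.

```python
import random


def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, niter, seed=0):
    """Plain Lloyd iterations: assign to nearest centroid, then re-average."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(niter):
        # Assignment step.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: sq_l2(p, centroids[c]))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


rng = random.Random(42)
# Two well-separated toy blobs around (0, 0) and (10, 10).
points = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)] +
          [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(50)])
centroids, assign = kmeans(points, k=2, niter=10)
print(sorted(centroids))
```

The assign list plays the same role as the I returned by kmeans.index.search(x, 1) above: for each input vector, the id of its nearest centroid.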
- PCA: vector dimensionality reduction
# PCA dimensionality reduction
# random training data
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix (40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
# print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))
[ 116.68821716 116.36938477 107.59984589 107.00305939 105.75762939
103.17525482 101.48827362 101.10425568 98.96426392 96.75302887]
- PQ encoding / decoding: lossy compression and decompression of vectors
# PQ encoding / decoding
# Lossy compression and decompression of vectors
# The ProductQuantizer object can be used to encode or decode vectors to codes:
d = 32 # data dimension
cs = 4 # code size (bytes)
# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')
# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')
pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)
# encode
codes = pq.compute_codes(x)
# decode
x2 = pq.decode(codes)
# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
print(avg_relative_error)
0.0663308
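The encode/decode round trip can be sketched in pure Python to show where the reconstruction error comes from. This is a toy illustration with made-up helper names (split, pq_train, pq_encode, pq_decode). As a labeled simplification, codewords are sampled from the training sub-vectors, whereas a real product quantizer learns them with k-means in each sub-space.

```python
import random


def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(v, m):
    """Cut a vector into m equal-length sub-vectors."""
    step = len(v) // m
    return [v[j * step:(j + 1) * step] for j in range(m)]


def pq_train(train, m, ksub, rng):
    # Simplification: sample training sub-vectors as codewords instead of
    # running k-means in each sub-space.
    return [rng.sample([split(v, m)[j] for v in train], ksub) for j in range(m)]


def pq_encode(v, codebooks):
    """One small integer per sub-vector: the id of its nearest codeword."""
    m = len(codebooks)
    return [min(range(len(cb)), key=lambda c: sq_l2(sub, cb[c]))
            for sub, cb in zip(split(v, m), codebooks)]


def pq_decode(code, codebooks):
    """Concatenate the codewords that the code points to."""
    out = []
    for c, cb in zip(code, codebooks):
        out.extend(cb[c])
    return out


rng = random.Random(0)
d, m, ksub = 8, 4, 16  # toy sizes; the cell above uses d = 32 and 4 codes of 8 bits
train = [[rng.random() for _ in range(d)] for _ in range(200)]
x = [[rng.random() for _ in range(d)] for _ in range(100)]

codebooks = pq_train(train, m, ksub, rng)
codes = [pq_encode(v, codebooks) for v in x]
x2 = [pq_decode(c, codebooks) for c in codes]

# Same relative reconstruction error as computed above.
num = sum(sq_l2(a, b) for a, b in zip(x, x2))
den = sum(sum(t * t for t in v) for v in x)
rel_err = num / den
print(rel_err)
```

Decoding returns the nearest codewords, not the original values, so x2 only approximates x. That gap is the nonzero relative error reported above.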
- IndexIDMap
# IndexIDMap: adds an ID mapping
index = faiss.IndexFlatL2(xb.shape[1])
ids = np.arange(xb.shape[0])
# index.add_with_ids(xb, ids) # raises an error because IndexFlatL2 does not support add_with_ids
index2 = faiss.IndexIDMap(index)
index2.add_with_ids(xb, ids) # ok, works, the vectors are stored in the underlying index
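Conceptually, an ID map keeps a table from the inner index's sequential positions to the caller's ids and translates search results on the way out. The class below is a hypothetical toy (ToyIDMap is not a Faiss class) that shows only that translation step.

```python
class ToyIDMap(object):
    """Hypothetical wrapper: translates sequential positions to caller ids."""

    def __init__(self):
        self.vectors = []   # stored in insertion order, like the inner index
        self.id_map = []    # id_map[position] -> external id

    def add_with_ids(self, vectors, ids):
        if len(vectors) != len(ids):
            raise ValueError("need exactly one id per vector")
        self.vectors.extend(vectors)
        self.id_map.extend(ids)

    def search(self, query, k):
        # Rank stored positions by squared L2 distance, then map to ids.
        def dist(pos):
            return sum((a - b) ** 2 for a, b in zip(query, self.vectors[pos]))
        order = sorted(range(len(self.vectors)), key=dist)
        return [self.id_map[pos] for pos in order[:k]]


index = ToyIDMap()
index.add_with_ids([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]], [1001, 1002, 1003])
result = index.search([0.9, 0.9], k=2)
print(result)
```

The search returns the external ids 1002 and 1001, not the positions 1 and 0, which is the behavior IndexIDMap adds on top of an index such as IndexFlatL2.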
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
How to choose an index type:
Faiss provides APIs for many index types to choose from.
Index construction can go through one unified interface: faiss.index_factory(d, "......")
'''
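Query set: 10000 * 64
1, IndexFlatL2, Exact Search for L2, brute-force, "Flat"
Exact L2 distance computation, no training needed, slow: about 26.7 s of CPU time and 683 ms of wall time
2, IndexIVFFlat, Take another index to assign vectors to inverted lists "IVFx,Flat"
Partitions the data into cells; approximate; requires training; a larger nprobe is more accurate (up to nlist). The tutorial sets nprobe to 1/10 of nlist, and the run time drops to roughly 1/7
3, IndexPQ, Product quantizer (PQ) in flat mode "PQx"
Compression, "PQx"
4, IndexIVFPQ, IVFADC (coarse quantizer+PQ on residuals) "IVF100,PQ8"
Buckets plus compression; index_factory simplifies construction: "IVF100,PQ8" -> 100 buckets, each vector compressed to 8 bytes (8 sub-quantizers of 8 bits each)
Search time drops further; more importantly, the lossy PQ compression reduces memory use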
'''
Choosing an index type
'''
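1, "HNSWx" (IndexHNSWFlat): no training needed; does not support removing vectors
Supported on GPU: no; accurate
2, "...,Flat",
Clusters the data into buckets first (training and loading are slower), then computes distances; fast; no compression, so storage equals the original dataset size; nprobe trades speed against accuracy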
"..." means a clustering of the dataset has to be performed beforehand (read below).
After clustering, "Flat" just organizes the vectors into buckets,
so it does not compress them, the storage size is the same as that of the original dataset.
The tradeoff between speed and accuracy is set via the nprobe parameter.
Supported on GPU: yes (but see below, the clustering method must be supported as well)
3, "How big is the dataset?"
(1), below 1M vectors: "...,IVFx,..." (supported on GPU)
(2), 1M - 10M: "...,IMI2x10,..." (not supported on GPU)
(3), 10M - 100M: "...,IMI2x12,..."
(4), 100M - 1B: "...,IMI2x14,..."
'''
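The dataset-size rule of thumb above can be written as a small lookup. The helper below is hypothetical and only encodes the thresholds listed in these notes. It returns the middle component of a factory string and leaves the x in "IVFx" to be chosen by the reader.

```python
def coarse_quantizer_component(n_vectors):
    """Hypothetical helper encoding the dataset-size rule of thumb above."""
    if n_vectors < 1000000:
        return "IVFx"        # x is a cell count still to be chosen
    if n_vectors < 10000000:
        return "IMI2x10"
    if n_vectors < 100000000:
        return "IMI2x12"
    return "IMI2x14"         # the notes cover this up to about 1B vectors


for n in (100000, 5000000, 50000000, 500000000):
    print("%d -> %s" % (n, coarse_quantizer_component(n)))
```

For the 100000-vector database used throughout these notes the helper lands in the first bracket, which matches the "IVF100,..." strings used above.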