Using the faiss toolkit

GitHub: https://github.com/dreamfantacy/faiss/blob/master/faiss_use.md

faiss is a library for efficient similarity search and clustering of dense vectors, developed by Facebook AI Research.

Its main features:

1. Several index/search methods to choose from

2. Fast

3. Indexes can live in RAM or on disk

4. Implemented in C++, with Python bindings

5. GPU implementations for most algorithms

The sections below walk through several commonly used faiss index types.

import faiss
import numpy as np

d = 64        # vector dimension
nb = 1000000  # number of database vectors
k = 6         # top-k; the query itself is always the first hit here, leaving 5 true neighbors
np.random.seed(1000)
train_vectors = np.random.random((nb, d)).astype('float32')   # random nb x d matrix
print(train_vectors[0:1])
[[0.6535896  0.11500695 0.9502829  0.4821914  0.87247455 0.21233268
  0.04070963 0.39719447 0.2331322  0.8417407  0.20708235 0.74246955
  0.39215413 0.18225652 0.7435394  0.06958208 0.88533723 0.9526444
  0.93114346 0.41543096 0.02898166 0.9820275  0.3396377  0.7066872
  0.36187705 0.0351059  0.85505825 0.6572535  0.765683   0.5540872
  0.8850929  0.90419763 0.0104217  0.07455674 0.2446292  0.13330476
  0.6979251  0.3982049  0.8831222  0.1810075  0.43249917 0.0181432
  0.69143784 0.46969065 0.12822218 0.89133704 0.91820365 0.073121
  0.04544794 0.43857288 0.6017209  0.31022704 0.68190825 0.20901315
  0.5196043  0.56598884 0.44116738 0.13755615 0.21354319 0.13337189
  0.3222967  0.23388712 0.5274982  0.56597114]]

Exact inner-product search

%%time
# Exact inner-product search; on L2-normalized vectors this computes cosine similarity
faiss.normalize_L2(train_vectors)   # normalize in place
index = faiss.IndexFlatIP(d)        # flat index with inner-product metric
index.add(train_vectors)            # add the vectors
CPU times: user 528 ms, sys: 109 ms, total: 638 ms
Wall time: 201 ms
%time D, I = index.search(train_vectors[:100], k)  # k-NN query for the first 100 vectors
print(D[:5])  # similarity scores (inner products)
print(I[:5])  # IDs of the neighbors
CPU times: user 603 ms, sys: 14.6 ms, total: 617 ms
Wall time: 180 ms
[[1.         0.88921225 0.8749414  0.87424946 0.8681991  0.86813927]
 [1.0000002  0.91028035 0.91020644 0.9085115  0.90596604 0.9054309 ]
 [1.0000001  0.885697   0.8750487  0.87181693 0.8717227  0.87148   ]
 [1.         0.88810456 0.87671304 0.8715367  0.86975056 0.8680135 ]
 [0.9999998  0.9004812  0.88626826 0.8845979  0.8831274  0.88120776]]
[[     0  91017 177926 512918 299246 495820]
 [     1 927457 377725 163893  23204 352095]
 [     2 316873 797030 147008 906258 787238]
 [     3 498985 296306 972003 759146 704490]
 [     4 321229 754783 139900 340274 795477]]

Exact L2 search

%%time
index = faiss.IndexFlatL2(d)
print(index.is_trained)  # flat indexes need no training
index.add(train_vectors)
print(index.ntotal)  # number of indexed vectors
True
1000000
CPU times: user 285 ms, sys: 138 ms, total: 423 ms
Wall time: 210 ms
%time D, I = index.search(train_vectors[:100], k)  # k-NN query for the first 100 vectors
print(D[:5])  # squared L2 distances
print(I[:5])  # IDs of the neighbors
CPU times: user 618 ms, sys: 14.8 ms, total: 632 ms
Wall time: 185 ms
[[0.         0.22157544 0.25011715 0.25150135 0.26360166 0.26372138]
 [0.         0.17943956 0.1795873  0.18297717 0.18806806 0.18913835]
 [0.         0.22860612 0.24990267 0.25636625 0.25655466 0.25703993]
 [0.         0.22379076 0.246574   0.25692683 0.26049888 0.26397297]
 [0.         0.19903749 0.22746333 0.2308041  0.23374498 0.2375843 ]]
[[     0  91017 177926 512918 299246 495820]
 [     1 927457 377725 163893  23204 352095]
 [     2 316873 797030 147008 906258 787238]
 [     3 498985 296306 972003 759146 704490]
 [     4 321229 754783 139900 340274 795477]]

Inverted file index (IndexIVFFlat)

To speed up search, the dataset can be partitioned, using a cell-probe method in the spirit of multi-probing (a best-bin-first KD-tree variant):

  • the feature space is partitioned into ncells cells;
  • each database vector is assigned to one cell (via k-means, by nearest Euclidean centroid), and the assignments are stored in ncells inverted lists;
  • at search time, the nprobe cells closest to the query are selected;
  • all vectors in those nprobe cells are scanned via the inverted lists.

This is IndexIVFFlat; it requires another index (the coarse quantizer) to assign vectors to cells.

%%time
ncells = 1000  # number of clusters (inverted lists)
nprobe = 5     # number of nearest cells to visit at search time

quantizer = faiss.IndexFlatIP(d)   # coarse quantizer that assigns vectors to cells
index = faiss.IndexIVFFlat(quantizer, d, ncells, faiss.METRIC_INNER_PRODUCT)
index.train(train_vectors)  # IVF indexes must be trained before adding vectors
index.add(train_vectors)
CPU times: user 25.7 s, sys: 593 ms, total: 26.3 s
Wall time: 8.16 s
# search with the default nprobe = 1
%time D, I = index.search(train_vectors[:100], k)
print(D[:5])
print(I[:5])
CPU times: user 1.64 ms, sys: 2.45 ms, total: 4.09 ms
Wall time: 4.7 ms
[[1.         0.8649015  0.8589389  0.85849226 0.84375334 0.84226656]
 [1.0000002  0.89472115 0.8887934  0.8870386  0.8853303  0.88515115]
 [1.0000001  0.8685078  0.8675276  0.8618238  0.85603774 0.8477105 ]
 [1.         0.8715367  0.837631   0.8366878  0.8355187  0.83386433]
 [0.9999998  0.9004812  0.8719325  0.8713748  0.87032235 0.86367786]]
[[     0 768157 973530 331451 102250 787734]
 [     1 351245 493513  30966 603376 971320]
 [     2  17543 383568 591888 243810 820911]
 [     3 972003 652932 895411 383572  89314]
 [     4 321229 234587 577836 894080 258768]]
# set nprobe = 5
index.nprobe = nprobe
%time D, I = index.search(train_vectors[:100], k)
print(D[:5])
print(I[:5])
CPU times: user 2.24 ms, sys: 1.35 ms, total: 3.58 ms
Wall time: 2.44 ms
[[1.         0.86556    0.8649015  0.85989505 0.8589389  0.85849226]
 [1.0000002  0.89788103 0.89687335 0.89633375 0.89472115 0.89016867]
 [1.0000001  0.8685078  0.8675276  0.8618238  0.85603774 0.8477105 ]
 [1.         0.8715367  0.86341625 0.8632425  0.8576048  0.8548256 ]
 [0.9999998  0.9004812  0.87391615 0.87204486 0.8719325  0.8713748 ]]
[[     0 916688 768157  24856 973530 331451]
 [     1 923354 468023 142134 351245  58678]
 [     2  17543 383568 591888 243810 820911]
 [     3 972003 538319 433030 489597 822219]
 [     4 321229 921911  38244 234587 577836]]

IndexIVFPQ

IndexFlatL2, IndexFlatIP, and IndexIVFFlat all store the full vectors, which is impractical for very large datasets.

Faiss therefore provides the variant IndexIVFPQ, which compresses the stored vectors with product quantization (PQ), at some cost in accuracy.

%%time
m = 8          # number of PQ sub-quantizers (d must be divisible by m)
ncells = 1000
quantizer = faiss.IndexFlatIP(d)   # coarse quantizer that assigns vectors to cells
index = faiss.IndexIVFPQ(quantizer, d, ncells, m, 8)  # 8 bits per sub-code
index.train(train_vectors)

index.add(train_vectors)
CPU times: user 57.2 s, sys: 872 ms, total: 58.1 s
Wall time: 16.2 s
# search with the default nprobe = 1
%time D, I = index.search(train_vectors[:5], k)
print(D[:5])
print(I[:5])
CPU times: user 6.64 ms, sys: 4.05 ms, total: 10.7 ms
Wall time: 6.93 ms
[[0.0835091  0.2554305  0.25882936 0.26452243 0.26792726 0.269789  ]
 [0.05534138 0.18861991 0.20567755 0.20591447 0.20633136 0.21044768]
 [0.06245257 0.24012531 0.24424459 0.25692075 0.266371   0.27467135]
 [0.09201896 0.26103753 0.27423584 0.2745729  0.28080127 0.28357062]
 [0.08818588 0.20140383 0.2034731  0.20371425 0.21177894 0.21378195]]
[[     0 299246 973480 899095 210338 484988]
 [     1 863851 689669 361876 677159 252478]
 [     2 953962  13098 590473 808981 292139]
 [     3  34108 909142 230885 957363 894159]
 [     4 882792 990343 527845 648431 754209]]
