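Project website: Milvus · Open Source Vector Database built for scalable similarity search https://milvus.io/cn/
Project documentation: About Milvus - Milvus documentation https://milvus.io/cn/docs/v2.0.x/overview.md
Recently I found that MySQL alone can't cope with vector retrieval: every time I match a vector, I have to read all the vectors out of the database and compute the dot-product similarity one by one, which is painfully slow.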
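For context, the brute-force approach described above looks roughly like the sketch below. The `stored_vectors` dictionary is a hypothetical stand-in for rows read out of MySQL, not the actual table I used; the point is only that every query has to touch every stored vector.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(query, stored, k=3):
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    scored = [(vec_id, dot(query, vec)) for vec_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical rows, as they might come back from a SELECT over the whole table.
stored_vectors = {
    1: [1.0, 0.0],
    2: [0.0, 1.0],
    3: [0.6, 0.8],
}

top = brute_force_top_k([0.6, 0.8], stored_vectors, k=2)
print([vec_id for vec_id, _ in top])  # ids of the two best matches
```

The cost grows linearly with the number of stored vectors, which is exactly the problem a vector database with an index is meant to avoid.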
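Milvus is built to store, index, and manage the massive numbers of vectors generated by deep neural networks and other machine learning (ML) models.
After a quick look at Milvus, I was sold.
It is easy to install, easy to use, and has a complete Python interface.
It is more than a database: it returns vector similarity search results directly, reportedly handles an index of 200,000 vectors in just 70 ms, and even supports GPU acceleration.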
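There are three main concepts:
(1) Field: a field, which can hold structured data or a vector;
(2) Entity: a group of Fields, similar to one row of a table;
(3) Collection: a table.
Keep these three concepts in mind and we're ready to go.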
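To make the three concepts concrete, here is a toy model in plain Python. `ToyCollection` is purely illustrative and is not part of Milvus or pymilvus; it only mirrors the idea that a Collection is a table, an Entity is a row, and a Field is a column.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyCollection:
    """Illustrative stand-in for a Collection: a named table with fixed field names."""
    name: str
    field_names: List[str]
    entities: List[Dict[str, Any]] = field(default_factory=list)

    def add_entity(self, entity: Dict[str, Any]) -> None:
        # An Entity must supply a value for every Field, just like a table row.
        if set(entity) != set(self.field_names):
            raise ValueError(
                f"entity fields {sorted(entity)} do not match {sorted(self.field_names)}"
            )
        self.entities.append(entity)


toy = ToyCollection(name="Features", field_names=["id", "features"])
toy.add_entity({"id": 0, "features": [0.1, 0.2]})
toy.add_entity({"id": 1, "features": [0.3, 0.4]})
print(len(toy.entities))  # number of entities (rows) stored so far
```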
Installation is mainly done with docker-compose; I won't cover how to install docker and docker-compose here.
First, download the container configuration file:
wget https://github.com/milvus-io/milvus/releases/download/v2.0.2/milvus-standalone-docker-compose.yml -O docker-compose.yml
Then start everything with one command:
sudo docker-compose up -d
Docker Compose is now in the Docker CLI, try `docker compose up`
Creating milvus-etcd ... done
Creating milvus-minio ... done
Creating milvus-standalone ... done
The docker-compose ps command shows that three containers are running:
sudo docker-compose ps
      Name                     Command                  State                            Ports
--------------------------------------------------------------------------------------------------------------------
milvus-etcd         etcd -listen-peer-urls=htt ...   Up (healthy)   2379/tcp, 2380/tcp
milvus-minio        /usr/bin/docker-entrypoint ...   Up (healthy)   9000/tcp
milvus-standalone   /tini -- milvus run standalone   Up             0.0.0.0:19530->19530/tcp,:::19530->19530/tcp
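Before moving on to the Python client, it can be handy to confirm that the published port actually accepts connections. The helper below is a small sketch using only the standard library; `is_port_open` is a name I made up for illustration, and the host and port simply mirror the mapping shown in the output above.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 19530 is the port published by the milvus-standalone container.
reachable = is_port_open("127.0.0.1", 19530)
print("Milvus port reachable:", reachable)
```

This only checks that something is listening on the port; it does not verify that the Milvus service behind it is fully initialized.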
# Stop the containers
sudo docker-compose down
# Clean up the data
sudo rm -rf volume
Install the Python client
pip install pymilvus==2.0.0rc6
1. Create the collection (table)
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
# Connect to the Milvus server
connections.connect(host="127.0.0.1", port="19530")  # host address and port
id = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True)  # primary key
features = FieldSchema(name="features", dtype=DataType.FLOAT_VECTOR, dim=2)  # vector field; dim=2 means each vector has two elements, so set dim to the length of your own vectors
schema = CollectionSchema(fields=[id, features], description="Save features")  # description
collection_name = "Features"  # collection name
collection = Collection(name=collection_name, schema=schema, using='default', shards_num=2, consistency_level="Strong")  # create the collection
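Since dim has to match the length of every vector you later insert, a small pre-flight check can catch mismatches early. This is a plain-Python sketch; `check_vectors` is a hypothetical helper of my own, not a pymilvus function.

```python
def check_vectors(vectors, dim):
    """Raise ValueError if any vector's length differs from the schema's dim."""
    for position, vec in enumerate(vectors):
        if len(vec) != dim:
            raise ValueError(
                f"vector at position {position} has {len(vec)} elements, expected {dim}"
            )
    return True


# Matches the dim=2 used for the "features" field above.
ok = check_vectors([[0.1, 0.2], [0.3, 0.4]], dim=2)
print(ok)
```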
2. Check that the collection was created
from pymilvus import utility
utility.has_collection("Features")
True
3. Inspect the collection
collection = Collection("Features") # Get an existing collection.
print(collection.schema) # Return the schema.CollectionSchema of the collection.
print(collection.description) # Return the description of the collection.
print(collection.name ) # Return the name of the collection.
print(collection.is_empty ) # Return the boolean value that indicates if the collection is empty.
print(collection.num_entities ) # Return the number of entities in the collection.
print(collection.primary_field ) # Return the schema.FieldSchema of the primary key field.
print(collection.partitions) # Return the list[Partition] object.
print(collection.indexes ) # Return the list[Index] object.
{"auto_id": false, "description": "Save features", "fields": [{"name": "id", "description": "", "type": 5, "is_primary": true, "auto_id": false}, {"name": "features", "description": "", "type": 101, "params": {"dim": 2}}]}
Save features
Features
True
0
[{"name": "_default", "collection_name": "Features", "description": ""}]
[]
4. Prepare the data
import random
data = [[i for i in range(2000)],
[[random.random() for _ in range(2)] for _ in range(2000)],]
The data is one column of integer ids and one column of vectors, each given as a list.
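If your own data is organized row by row (one record per entity), it has to be turned into this column layout first. A minimal sketch, where `rows_to_columns` and the sample `rows` are made up for illustration:

```python
def rows_to_columns(rows, field_order):
    """Turn a list of per-entity dicts into one list per field, in field_order."""
    return [[row[name] for row in rows] for name in field_order]


rows = [
    {"id": 0, "features": [0.1, 0.2]},
    {"id": 1, "features": [0.3, 0.4]},
]

columns = rows_to_columns(rows, ["id", "features"])
print(columns)  # [[0, 1], [[0.1, 0.2], [0.3, 0.4]]]
```

The field order should follow the order of the fields in the schema, here id first and features second.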
5. Insert the data
collection = Collection("Features") # Get an existing collection.
mr = collection.insert(data)
print(mr)
(insert count: 2000, delete count: 0, upsert count: 0, timestamp: 432676330227367941)
6. Build the index
Building an index makes searches faster.
index_params = {"metric_type":"L2","index_type":"IVF_FLAT","params":{"nlist":1024}}
collection = Collection("Features") # Get an existing collection.
collection.create_index(
field_name="features",
index_params=index_params
)
Status(code=0, message='')
7. Load the collection
Before searching, the collection has to be loaded into memory.
collection = Collection("Features") # Get an existing collection.
collection.load()
8. Configure the search parameters
search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
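To give a feel for what nlist (at index time) and nprobe (at search time) control, here is a toy inverted-file search in plain Python. It is a sketch of the general IVF idea only, not Milvus's implementation: real IVF indexes typically train their centroids with k-means, whereas this version just samples nlist vectors as centroids, and all function names are my own.

```python
import random


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_toy_ivf(vectors, nlist, seed=0):
    """Sample nlist vectors as centroids and bucket every vector id under its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, nlist)
    buckets = [[] for _ in range(nlist)]
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: squared_l2(vec, centroids[c]))
        buckets[nearest].append(vec_id)
    return centroids, buckets


def toy_ivf_search(query, vectors, centroids, buckets, nprobe, limit):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: squared_l2(query, centroids[c]))
    candidates = [vec_id for c in order[:nprobe] for vec_id in buckets[c]]
    candidates.sort(key=lambda vec_id: squared_l2(query, vectors[vec_id]))
    return candidates[:limit]


data_rng = random.Random(42)
toy_vectors = [[data_rng.random(), data_rng.random()] for _ in range(2000)]
toy_centroids, toy_buckets = build_toy_ivf(toy_vectors, nlist=16)

# Small nprobe: fewer buckets scanned, faster but possibly missing true neighbors.
approx = toy_ivf_search([0.1, 0.2], toy_vectors, toy_centroids, toy_buckets, nprobe=4, limit=10)
# nprobe == nlist: every bucket is scanned, so the result equals an exhaustive search.
exact = toy_ivf_search([0.1, 0.2], toy_vectors, toy_centroids, toy_buckets, nprobe=16, limit=10)
print(approx)
print(exact)
```

Raising nprobe trades speed for recall; with nprobe equal to nlist the search degenerates into a full scan.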
9. Search for similar vectors
results = collection.search(
data=[[0.1, 0.2]],
anns_field="features",
param=search_params,
limit=10,
expr=None,
consistency_level="Strong"
)
10. Inspect the search results
results[0].ids
[204, 422, 1813, 810, 950, 229, 1367, 317, 1845, 982]
results[0].distances
[5.291282286634669e-05, 0.00022546770924236625, 0.0003814520314335823, 0.0005705003277398646, 0.00064045749604702, 0.0006797321257181466, 0.0009549811366014183, 0.0011909721652045846, 0.0014195223338901997, 0.0014343145303428173]
The actual value of the vector with id 204:
data[1][204]
[0.09788455702517407, 0.19304028354464775]
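As a sanity check, the top distance can be recomputed by hand from the values printed above. In this run the number reported for the L2 metric appears to be the squared Euclidean distance (no square root taken); the sketch below only uses the query vector and the stored vector shown in this post.

```python
query = [0.1, 0.2]
hit = [0.09788455702517407, 0.19304028354464775]  # data[1][204] from the run above

# Squared Euclidean distance between the query and the best match.
squared_distance = sum((q - h) ** 2 for q, h in zip(query, hit))
print(squared_distance)
```

The result agrees with results[0].distances[0] up to small rounding differences, which is what you would expect if the vectors are stored as 32-bit floats.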
11. Release the memory
collection.release()
Pretty smooth.