Vector Search over Movie Documents with PGVector

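In this article, we demonstrate how to run vector similarity search with PGVector, the LangChain vector store built on the pgvector extension for Postgres. Specifically, we create a PGVector vector store and combine it with a self-query retriever (SelfQueryRetriever) to search a collection of movie documents.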

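Technical Background

PGVector is a vector similarity search extension for Postgres. It lets us store vectors in the database and run fast similarity lookups, which suits semantic search scenarios such as finding similar movies.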

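Core Principles

PGVector stores vectorized documents in Postgres and retrieves them by cosine similarity or another similarity measure. With OpenAI's embedding service (OpenAIEmbeddings), we convert text into vectors, which is what makes semantic search possible.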
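To make the ranking step concrete, here is a minimal sketch of cosine-similarity search in plain Python. It is an illustration only, not how pgvector computes distances internally: the three-dimensional vectors, the labels, and the helper names `cosine_similarity` and `rank_by_similarity` are all made up for this example, and real OpenAI embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, vectors):
    """Return the labels in `vectors`, most similar to `query` first."""
    return sorted(
        vectors,
        key=lambda label: cosine_similarity(query, vectors[label]),
        reverse=True,
    )


# Hypothetical toy "embeddings" (assumption: invented values, not real model output).
toy_vectors = {
    "dinosaur movie": [0.9, 0.1, 0.0],
    "dream movie": [0.1, 0.9, 0.2],
    "toy movie": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for an embedded question about dinosaurs

ranking = rank_by_similarity(query_vector, toy_vectors)
print(ranking)
```

A vector store does the same thing at scale: it embeds the query, compares it with the stored document vectors, and returns the closest matches.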

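Code Walkthrough

The complete code for creating a PGVector vector store of movie documents, with a self-query retriever on top of it, follows:

# Install the required libraries (%pip is a Jupyter/IPython magic; use plain `pip install` in a shell)
%pip install --upgrade --quiet langchain langchain-community langchain-openai lark pgvector psycopg2-binary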

# Import the libraries used below
import os
import getpass
from langchain_community.vectorstores import PGVector
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import OpenAI

# Set the OpenAI API key (prompted interactively so it never appears in the code)
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

# Collection name and embedding model for the PGVector store
collection = "movie_collection"
embeddings = OpenAIEmbeddings()

# Build the collection of movie documents
docs = [
    Document(page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose", metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"}),
    Document(page_content="Leo DiCaprio gets lost in a dream within a dream...", metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2}),
    Document(page_content="A psychologist / detective gets lost in a series of dreams...", metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6}),
    Document(page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them", metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3}),
    Document(page_content="Toys come alive and have a blast doing so", metadata={"year": 1995, "genre": "animated"}),
    Document(page_content="Three men walk into the Zone, three men walk out of the Zone", metadata={"year": 1979, "director": "Andrei Tarkovsky", "genre": "science fiction", "rating": 9.9}),
]

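# Create the vector store. PGVector needs a connection string for the database; here it is
# read from the PGVECTOR_CONNECTION_STRING environment variable, for example
# postgresql+psycopg2://user:password@localhost:5432/dbname (placeholder values).
connection_string = os.environ["PGVECTOR_CONNECTION_STRING"]
vectorstore = PGVector.from_documents(docs, embeddings, collection_name=collection, connection_string=connection_string)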

# Describe the metadata fields the retriever is allowed to filter on
metadata_field_info = [
    AttributeInfo(name="genre", description="The genre of the movie", type="string or list[string]"),
    AttributeInfo(name="year", description="The year the movie was released", type="integer"),
    AttributeInfo(name="director", description="The name of the movie director", type="string"),
    AttributeInfo(name="rating", description="A 1-10 rating for the movie", type="float"),
]

# Describe the document content
document_content_description = "Brief summary of a movie"

# Create the LLM and the self-query retriever
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True)

# Try out the retriever
print(retriever.invoke("What are some movies about dinosaurs"))
print(retriever.invoke("I want to watch a movie rated higher than 8.5"))
print(retriever.invoke("Has Greta Gerwig directed any movies about women"))
print(retriever.invoke("What's a highly rated (above 8.5) science fiction film?"))
print(retriever.invoke("What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"))
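
Conceptually, the self-query retriever turns a question into a semantic query plus a structured filter over the metadata fields described above. The sketch below mimics only the filtering step in plain Python so the idea is easy to see. It is not LangChain's implementation: the `COMPARATORS` table and the helpers `matches` and `filter_movies` are hypothetical names invented for this illustration, and the conditions are written by hand rather than produced by an LLM.

```python
import operator

# Metadata copied from four of the movie documents defined earlier.
movies = [
    {"year": 1993, "rating": 7.7, "genre": "science fiction"},
    {"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    {"year": 1979, "director": "Andrei Tarkovsky", "genre": "science fiction", "rating": 9.9},
    {"year": 1995, "genre": "animated"},
]

# Hypothetical comparator names for this sketch.
COMPARATORS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}


def matches(metadata, conditions):
    """True if every (field, comparator, value) condition holds.

    A document that lacks a field never satisfies a condition on that field.
    """
    for field, comparator, value in conditions:
        if field not in metadata:
            return False
        if not COMPARATORS[comparator](metadata[field], value):
            return False
    return True


def filter_movies(items, conditions):
    """Keep only the metadata dicts that satisfy all conditions."""
    return [item for item in items if matches(item, conditions)]


# Hand-written filter for: "What's a highly rated (above 8.5) science fiction film?"
highly_rated_scifi = filter_movies(
    movies,
    [("rating", "gt", 8.5), ("genre", "eq", "science fiction")],
)
print(highly_rated_scifi)
```

In this sketch the Satoshi Kon film is excluded even though its rating is above 8.5, because its metadata has no genre field. Gaps in metadata narrow the results of a filtered search, so it pays to keep fields consistent across documents.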

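Use Cases

This setup suits scenarios that need semantic search over large text collections, such as movie recommendation, document retrieval, and knowledge-base queries. Combining PGVector with OpenAI's embedding service lets us match documents by meaning rather than by exact keywords.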

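Practical Tips

  1. Make sure your Postgres database is configured correctly and has the pgvector extension installed.
  2. When using OpenAIEmbeddings, keep your API key secure and never expose it in public code.
  3. The self-query retriever provides powerful metadata filtering that you can adapt to your own requirements.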
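
For the first two tips, one option is to keep database credentials out of the source code and assemble the connection string from environment variables. The following is a minimal sketch under stated assumptions: the `PGVECTOR_*` variable names, the default values, and the helper `build_connection_string` are choices made for this example, not names required by any library.

```python
import os
from urllib.parse import quote


def build_connection_string(env=os.environ):
    """Assemble a SQLAlchemy-style Postgres URL from environment-style settings.

    The PGVECTOR_* names and the defaults are assumptions for this sketch.
    """
    user = quote(env.get("PGVECTOR_USER", "postgres"), safe="")
    # Percent-encode the password so characters such as "@" or "/" do not break the URL.
    password = quote(env.get("PGVECTOR_PASSWORD", ""), safe="")
    host = env.get("PGVECTOR_HOST", "localhost")
    port = int(env.get("PGVECTOR_PORT", "5432"))
    database = env.get("PGVECTOR_DATABASE", "postgres")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"


# Example with made-up values passed in as a dict instead of real environment variables.
example_url = build_connection_string(
    {
        "PGVECTOR_USER": "movie_app",
        "PGVECTOR_PASSWORD": "p@ss/word",
        "PGVECTOR_DATABASE": "movies",
    }
)
print(example_url)

# Note: the pgvector extension must also be enabled once per database,
# typically with the SQL statement: CREATE EXTENSION IF NOT EXISTS vector;
```

The resulting string can be passed as the `connection_string` argument when creating the vector store.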

If you run into problems, feel free to discuss them in the comments.