LangChain Series Articles
Beyond logging runs, LangSmith also lets you test and evaluate your LLM applications.
In this section, you will use LangSmith to create a benchmark dataset and run AI-assisted evaluators on an agent. You will do so in a few steps:
- Create a dataset of example inputs and reference outputs.
- Initialize a new agent to benchmark.
- Configure evaluators to grade the agent's outputs.
- Run the agent over the dataset and evaluate the results.
Below, we use the LangSmith client to create a dataset from the list of input questions and labels above. You will use these later to measure performance for a new agent. A dataset is a collection of examples, which are nothing more than input-output pairs you can use as test cases for your application.
For more information on datasets, including how to create them from CSVs or other files, or how to create them in the platform, please refer to the LangSmith documentation.
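The snippet below relies on a LangSmith Client, a unique_id suffix, and an inputs list of example questions that were defined earlier in this walkthrough. A minimal sketch of what that setup might look like (the questions are illustrative; each one pairs with the corresponding entry in the outputs list below):

# Minimal sketch of the setup assumed by the snippet below (defined earlier in the walkthrough).
import uuid

from langsmith import Client

unique_id = uuid.uuid4().hex[0:8]  # short suffix to keep dataset and project names unique
client = Client()  # reads the LANGCHAIN_API_KEY environment variable

# Illustrative questions; each pairs with an answer in the outputs list below.
inputs = [
    "What is LangChain?",
    "What's LangSmith?",
    "When was Llama-v2 released?",
    "What is the langsmith cookbook?",
    "When did langchain first announce the hub?",
]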
outputs = [
    "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith.",
    "LangSmith is a unified platform for debugging, testing, and monitoring language model applications and agents powered by LangChain",
    "July 18, 2023",
    "The langsmith cookbook is a github repository containing detailed examples of how to use LangSmith to debug, evaluate, and monitor large language model-powered applications.",
    "September 5, 2023",
]

dataset_name = f"agent-qa-{unique_id}"

dataset = client.create_dataset(
    dataset_name,
    description="An example dataset of questions over the LangSmith documentation.",
)

client.create_examples(
    inputs=[{"input": query} for query in inputs],
    outputs=[{"output": answer} for answer in outputs],
    dataset_id=dataset.id,
)
LangSmith lets you evaluate any LLM, chain, agent, or even a custom function. Conversational agents are stateful (they have memory); to make sure that state is not shared between dataset runs, we will pass in a chain_factory (also known as a constructor) function that initializes a fresh instance for each call.

In this case, we will test an agent that uses OpenAI's function-calling endpoints.
from langchain import hub
from langchain.agents import AgentExecutor, AgentType, initialize_agent, load_tools
from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain_openai import ChatOpenAI

# Since chains can be stateful (e.g. they can have memory), we provide
# a way to initialize a new chain for each row in the dataset. This is done
# by passing in a factory function that returns a new chain for each row.
def create_agent(prompt, llm_with_tools):
    runnable_agent = (
        {
            "input": lambda x: x["input"],
            "agent_scratchpad": lambda x: format_to_openai_function_messages(
                x["intermediate_steps"]
            ),
        }
        | prompt
        | llm_with_tools
        | OpenAIFunctionsAgentOutputParser()
    )
    return AgentExecutor(agent=runnable_agent, tools=tools, handle_parsing_errors=True)
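Note that create_agent closes over a tools list, and its llm_with_tools argument is expected to be a chat model with those tools bound as OpenAI function definitions. Both are assumed to have been set up earlier in the walkthrough; a rough, hypothetical sketch of that setup, using a DuckDuckGo search tool as a stand-in, might look like this:

# Hypothetical sketch of the assumed tools / llm_with_tools setup (not the exact original).
from langchain.tools import DuckDuckGoSearchResults  # requires the duckduckgo-search package
from langchain.tools.render import format_tool_to_openai_function

tools = [
    DuckDuckGoSearchResults(name="duck_duck_go"),  # general web search tool
]

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
# Bind the tools to the model as OpenAI function definitions.
llm_with_tools = llm.bind(functions=[format_tool_to_openai_function(t) for t in tools])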
Manually comparing the chains' results in the UI works, but it can be time-consuming. It helps to use automated metrics and AI-assisted feedback to evaluate your component's performance.
Next, we will create a custom run evaluator that logs a heuristic evaluation.
Heuristic evaluators
from langsmith.evaluation import EvaluationResult, run_evaluator
from langsmith.schemas import Example, Run

@run_evaluator
def check_not_idk(run: Run, example: Example):
    """Illustration of a custom evaluator."""
    agent_response = run.outputs["output"]
    if "don't know" in agent_response or "not sure" in agent_response:
        score = 0
    else:
        score = 1
    # You can access the dataset labels in example.outputs[key]
    # You can also access the model inputs in run.inputs[key]
    return EvaluationResult(
        key="not_uncertain",
        score=score,
    )
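The evaluator above is reference-free. As the comments note, example.outputs gives you access to the dataset labels, so a reference-aware heuristic is just as easy to write. A hypothetical sketch (not part of the original walkthrough; to use it, you would add it to custom_evaluators alongside check_not_idk below):

# Hypothetical reference-aware heuristic: checks whether the reference answer
# appears verbatim (case-insensitively) in the agent's response.
@run_evaluator
def contains_reference(run: Run, example: Example):
    prediction = run.outputs["output"]
    reference = example.outputs["output"]
    score = int(reference.lower() in prediction.lower())
    return EvaluationResult(key="contains_reference", score=score)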
Below, we will configure the evaluation with the custom evaluator above, along with some pre-implemented run evaluators that do the following:
- Compare the results against ground-truth labels.
- Measure semantic (dis)similarity using embedding distance.
- Evaluate 'aspects' of the agent's response in a reference-free manner using custom criteria.
For a longer discussion of how to select the appropriate evaluators for your use case, as well as how to create your own custom evaluators, please refer to the LangSmith documentation.
from langchain.evaluation import EvaluatorType
from langchain.smith import RunEvalConfig

evaluation_config = RunEvalConfig(
    # Evaluators can either be an evaluator type (e.g., "qa", "criteria", "embedding_distance", etc.) or a configuration for that evaluator
    evaluators=[
        # Measures whether a QA response is "Correct", based on a reference answer
        # You can also select via the raw string "qa"
        EvaluatorType.QA,
        # Measure the embedding distance between the output and the reference answer
        # Equivalent to: EvalConfig.EmbeddingDistance(embeddings=OpenAIEmbeddings())
        EvaluatorType.EMBEDDING_DISTANCE,
        # Grade whether the output satisfies the stated criteria.
        # You can select a default one such as "helpfulness" or provide your own.
        RunEvalConfig.LabeledCriteria("helpfulness"),
        # The LabeledScoreString evaluator outputs a score on a scale from 1-10.
        # You can use default criteria or write your own rubric.
        RunEvalConfig.LabeledScoreString(
            {
                "accuracy": """
Score 1: The answer is completely unrelated to the reference.
Score 3: The answer has minor relevance but does not align with the reference.
Score 5: The answer has moderate relevance but contains inaccuracies.
Score 7: The answer aligns with the reference but has minor errors or omissions.
Score 10: The answer is completely accurate and aligns perfectly with the reference."""
            },
            normalize_by=10,
        ),
    ],
    # You can add custom StringEvaluator or RunEvaluator objects here as well, which will automatically be
    # applied to each prediction. Check out the docs for examples.
    custom_evaluators=[check_not_idk],
)
Use the run_on_dataset (or asynchronous arun_on_dataset) function to evaluate your model. This will:
- Fetch example rows from the specified dataset.
- Run your agent (or any custom function) on each example.
- Apply evaluators to the resulting run traces and corresponding reference examples to generate automated feedback.
The results will be visible in the LangSmith app.
from langchain import hub
# We will test this version of the prompt
prompt = hub.pull("wfh/langsmith-agent-prompt:798e7324")
import functools
from langchain.smith import arun_on_dataset, run_on_dataset
chain_results = run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=functools.partial(
        create_agent, prompt=prompt, llm_with_tools=llm_with_tools
    ),
    evaluation=evaluation_config,
    verbose=True,
    client=client,
    project_name=f"runnable-agent-test-5d466cbc-{unique_id}",
    # Project metadata communicates the experiment parameters,
    # Useful for reviewing the test results
    project_metadata={
        "env": "testing-notebook",
        "model": "gpt-3.5-turbo",
        "prompt": "5d466cbc",
    },
)
# Sometimes, the agent will error due to parsing issues, incompatible tool inputs, etc.
# These are logged as warnings here and captured as errors in the tracing UI.
View the evaluation results for project 'runnable-agent-test-5d466cbc-97e1' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/14d8a382-3c0f-48e7-b212-33489ee8a13e/compare?selectedSessions=62f0a0c0-73bf-420c-a907-2c6b2f4625c4
View all tests for Dataset agent-qa-e2d24144 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/14d8a382-3c0f-48e7-b212-33489ee8a13e
Server error caused failure to patch https://api.smith.langchain.com/runs/e9d26fe6-bf4a-4f88-81c5-f5d0f70977f0 in LangSmith API. HTTPError('500 Server Error: Internal Server Error for url: https://api.smith.langchain.com/runs/e9d26fe6-bf4a-4f88-81c5-f5d0f70977f0', '{"detail":"Internal server error"}')
Experiment results:

Review the test results
You can see the test result traces in the UI by clicking the URL in the output above, or by navigating to the "Testing & Datasets" page for the "agent-qa-{unique_id}" dataset in LangSmith.
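Besides the UI, the object returned by run_on_dataset can also be inspected in code. A minimal sketch, assuming a langchain version in which the returned result exposes a to_dataframe() helper (if yours does not, the raw per-example results are available under chain_results["results"]):

# Minimal sketch: inspect the evaluation results programmatically.
# Assumes the result object exposes to_dataframe(); adjust for your langchain version.
df = chain_results.to_dataframe()
print(df.head())  # one row per dataset example, with outputs and evaluator feedback scores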
https://github.com/zgpeace/pets-name-langchain/tree/develop
https://python.langchain.com/docs/langsmith/walkthrough