LangChain series articles
Some criteria, such as correctness, require reference labels to work properly. To use them, initialize the labeled_criteria evaluator and call it with a reference string.
In most cases you will want to define your own custom criteria (see below), but LangChain also ships a set of common criteria that you can load with a single string. The list of pre-implemented criteria is shown below. Note that without a reference label, the LLM only predicts what it believes the best answer to be; it is not grounded in actual facts or context.
from langchain.evaluation import Criteria
# For a list of other default supported criteria, try calling `supported_default_criteria`
list(Criteria)
Output
[<Criteria.CONCISENESS: 'conciseness'>,
<Criteria.RELEVANCE: 'relevance'>,
<Criteria.CORRECTNESS: 'correctness'>,
<Criteria.COHERENCE: 'coherence'>,
<Criteria.HARMFULNESS: 'harmfulness'>,
<Criteria.MALICIOUSNESS: 'maliciousness'>,
<Criteria.HELPFULNESS: 'helpfulness'>,
<Criteria.CONTROVERSIALITY: 'controversiality'>,
<Criteria.MISOGYNY: 'misogyny'>,
<Criteria.CRIMINALITY: 'criminality'>,
<Criteria.INSENSITIVITY: 'insensitivity'>]
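These enum members are plain strings, so passing `criteria=Criteria.CORRECTNESS` is interchangeable with passing `criteria="correctness"`. A minimal stdlib sketch of this string-valued-enum pattern (a stand-in with only two members, not the real LangChain class):

```python
from enum import Enum

# Stand-in mirroring the string-valued Criteria enum listed above;
# only two members are reproduced for brevity.
class Criteria(str, Enum):
    CONCISENESS = "conciseness"
    CORRECTNESS = "correctness"

# Each member compares equal to its string value, so either form
# can be passed wherever a criteria name is expected.
print(Criteria.CORRECTNESS == "correctness")  # True
print(Criteria("conciseness").name)           # CONCISENESS
```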
The example below uses a ground-truth reference that changes the capital of the US from Washington, D.C. to Topeka, KS.
from langchain.evaluation import load_evaluator
from langchain_core.runnables import RunnablePassthrough
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from dotenv import load_dotenv  # loads environment variables from a .env file

load_dotenv()  # actually load the environment variables (e.g. OPENAI_API_KEY)

# from langchain.globals import set_debug  # enables LangChain debug mode
# set_debug(True)

# Loading an evaluator by string:
# evaluator = load_evaluator("criteria", criteria="conciseness")
# This is equivalent to loading using the enum:
# evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="conciseness")
from langchain.evaluation import EvaluatorType

question = "What is the capital of the US?"
evaluator = load_evaluator("labeled_criteria", criteria="correctness")

# We can even override the model's learned knowledge using ground truth labels
eval_result = evaluator.evaluate_strings(
    input="What is the capital of the US?",
    prediction="Topeka, KS",
    reference="The capital of the US is Topeka, KS, where it permanently moved from Washington D.C. on May 16, 2023",
)
print(f'With ground truth: {eval_result["score"]}')
print('eval_result >> ', eval_result)

from langchain.evaluation import Criteria

# For a list of other default supported criteria, try calling `supported_default_criteria`
list_criteria = list(Criteria)
print('list_criteria >> ', list_criteria)

prompt = ChatPromptTemplate.from_template("{topic}")
output_parser = StrOutputParser()
model = ChatOpenAI(model="gpt-3.5-turbo")

chain = (
    {"topic": RunnablePassthrough()}
    | prompt
    | model
    | output_parser
)

response = chain.invoke(question)
print('response >> ', response)
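The `|` composition in the chain above is LangChain Expression Language (LCEL): each step's output feeds the next, and the leading dict wraps the raw input under the `topic` key. A pure-Python sketch of that data flow, with plain functions standing in for the prompt, model, and parser (all names here are illustrative, not LangChain APIs):

```python
# Stand-ins for the real pipeline stages; names are illustrative only.
def wrap_topic(question):
    return {"topic": question}      # mirrors {"topic": RunnablePassthrough()}

def fill_prompt(inputs):
    return inputs["topic"]          # mirrors ChatPromptTemplate "{topic}"

def fake_model(prompt_text):
    return f"Echo: {prompt_text}"   # a real chain would call the LLM here

def pipe(value, *steps):
    for step in steps:              # same left-to-right flow as `|`
        value = step(value)
    return value

print(pipe("What is the capital of the US?", wrap_topic, fill_prompt, fake_model))
# Echo: What is the capital of the US?
```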
Output
(.venv) ~/Workspace/LLM/langchain-llm-app/ [develop*] python Evaluate/criteria_correct.py
With ground truth: 1
eval_result >> {'reasoning': 'The criterion for this task is the correctness of the submitted answer. The submission states that the capital of the US is Topeka, KS. \n\nThe reference provided confirms that the capital of the US is indeed Topeka, KS, having moved there from Washington D.C. on May 16, 2023. \n\nTherefore, the submission is correct, accurate, and factual according to the reference provided. \n\nThe submission meets the criterion.\n\nY', 'value': 'Y', 'score': 1}
list_criteria >> [<Criteria.CONCISENESS: 'conciseness'>, <Criteria.RELEVANCE: 'relevance'>, <Criteria.CORRECTNESS: 'correctness'>, <Criteria.COHERENCE: 'coherence'>, <Criteria.HARMFULNESS: 'harmfulness'>, <Criteria.MALICIOUSNESS: 'maliciousness'>, <Criteria.HELPFULNESS: 'helpfulness'>, <Criteria.CONTROVERSIALITY: 'controversiality'>, <Criteria.MISOGYNY: 'misogyny'>, <Criteria.CRIMINALITY: 'criminality'>, <Criteria.INSENSITIVITY: 'insensitivity'>, <Criteria.DEPTH: 'depth'>, <Criteria.CREATIVITY: 'creativity'>, <Criteria.DETAIL: 'detail'>]
response >> The capital of the United States is Washington, D.C.
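As the output above shows, the evaluator's result dict carries three keys: `reasoning` (the model's chain-of-thought text), `value` (`'Y'` or `'N'`), and `score` (1 or 0). A small sketch of turning such a dict into a pass/fail decision (the helper name is made up for illustration):

```python
# Hypothetical helper for reading a criteria-evaluator result dict;
# the keys match the eval_result output shown above.
def passed(eval_result):
    return eval_result.get("value") == "Y" and eval_result.get("score") == 1

result = {
    "reasoning": "The submission is correct according to the reference.\nY",
    "value": "Y",
    "score": 1,
}
print(passed(result))  # True
```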
https://github.com/zgpeace/pets-name-langchain/tree/develop
https://python.langchain.com/docs/guides/evaluation/string/criteria_eval_chain