LangChain Series Articles

The scoring evaluator instructs a language model to grade your model's predictions on a specified scale (1-10 by default) according to your own criteria or rubric. Unlike a simple binary score, this provides a nuanced evaluation, which is useful for judging a model against a tailored rubric and for comparing model performance on specific tasks.

Before we begin, note that any specific grade from a large language model should be taken with a grain of salt: a prediction scored an "8" is not necessarily meaningfully better than one scored a "7".

For the full picture, see the LabeledScoreStringEvalChain documentation.

Here is an example of using LabeledScoreStringEvalChain with the default prompt:
from dotenv import load_dotenv  # function that loads environment variables from a .env file
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

load_dotenv()  # actually load the environment variables (e.g. OPENAI_API_KEY)

# from langchain.globals import set_debug  # import the function that toggles LangChain's debug mode
# set_debug(True)  # enable LangChain's debug mode

evaluator = load_evaluator("labeled_score_string", llm=ChatOpenAI(model="gpt-3.5-turbo"))
# Correct
eval_result = evaluator.evaluate_strings(
    prediction="You can find them in the dresser's third drawer.",
    reference="The socks are in the third drawer in the dresser",
    input="Where are my socks?",
)
print("Correct: ", eval_result)
Output:
(.venv) ~/Workspace/LLM/langchain-llm-app/ [develop+*] python Evaluate/score.py
Correct: {'reasoning': "Explanation:\nThe assistant's response is helpful and relevant to the user's question. It provides a concise and accurate answer, directing the user to find their socks in the third drawer of the dresser. The response is correct and factual, as it accurately refers to the location of the socks. While the response does not demonstrate depth of thought, it effectively addresses the user's query.\n\nRating: [[8]]", 'score': 8}
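In practice you will usually want to score more than a single case. The following is a minimal sketch of batch evaluation; the examples list is hypothetical test data, and the plain average at the end is just one possible way to aggregate:
examples = [  # hypothetical test data, not from the original article
    {
        "input": "Where are my socks?",
        "prediction": "You can find them in the dresser's third drawer.",
        "reference": "The socks are in the third drawer in the dresser",
    },
    {
        "input": "Where are my keys?",
        "prediction": "They are on the kitchen counter.",
        "reference": "The keys are hanging by the front door",
    },
]

results = [
    evaluator.evaluate_strings(
        prediction=ex["prediction"],
        reference=ex["reference"],
        input=ex["input"],
    )
    for ex in examples
]
print("Average score:", sum(r["score"] for r in results) / len(results))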
The evaluator works most effectively when you give it a full rubric for the specific context of your application. Below is an example using accuracy as the criterion.
accuracy_criteria = {
    "accuracy": """
Score 1: The answer is completely unrelated to the reference.
Score 3: The answer has minor relevance but does not align with the reference.
Score 5: The answer has moderate relevance but contains inaccuracies.
Score 7: The answer aligns with the reference but has minor errors or omissions.
Score 10: The answer is completely accurate and aligns perfectly with the reference."""
}

evaluator = load_evaluator(
    "labeled_score_string",
    criteria=accuracy_criteria,
    llm=ChatOpenAI(model="gpt-4"),
)
Run the evaluation:
# Correct
eval_result = evaluator.evaluate_strings(
    prediction="You can find them in the dresser's third drawer.",
    reference="The socks are in the third drawer in the dresser",
    input="Where are my socks?",
)
print(eval_result)
With the rubric in place, the score rises to 10:
(.venv) ~/Workspace/LLM/langchain-llm-app/ [develop+*] python Evaluate/score.py
Correct: {'reasoning': 'Explanation: The assistant accurately states that the socks can be found in the third drawer of the dresser, which aligns perfectly with the reference. There are no errors or omissions in the response.\n\nRating: [[10]]', 'score': 10}
# Correct but lacking information
eval_result = evaluator.evaluate_strings(
    prediction="You can find them in the dresser.",
    reference="The socks are in the third drawer in the dresser",
    input="Where are my socks?",
)
print(eval_result)
Output:
(.venv) ~/Workspace/LLM/langchain-llm-app/ [develop+*] python Evaluate/score.py
Correct but lacking information >>> {'reasoning': "Explanation: The assistant's response is partially accurate as it correctly mentions that the socks can be found in the dresser. However, it does not provide specific information about the location of the socks in the dresser. \n\nRating: [[7]]", 'score': 7}
# Incorrect
eval_result = evaluator.evaluate_strings(
    prediction="You can find them in the dog's bed.",
    reference="The socks are in the third drawer in the dresser",
    input="Where are my socks?",
)
print(eval_result)
Output:
(.venv) ~/Workspace/LLM/langchain-llm-app/ [develop+*] python Evaluate/score.py
Incorrect >>> {'reasoning': "The AI assistant's response is completely unrelated to the reference. The reference states that the socks are in the third drawer in the dresser, while the assistant suggests they can be found in the dog's bed. \n\nRating: [[1]]", 'score': 1}
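As an aside, if you have no reference answer for an input, langchain.evaluation also exposes a reference-free counterpart registered as "score_string", which grades the prediction against the input alone. A minimal sketch reusing the setup above:
# Reference-free variant: "score_string" judges the prediction against the
# input alone, with no ground-truth reference required.
freeform_evaluator = load_evaluator("score_string", llm=ChatOpenAI(model="gpt-4"))
eval_result = freeform_evaluator.evaluate_strings(
    prediction="You can find them in the dresser's third drawer.",
    input="Where are my socks?",
)
print(eval_result)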
You can also have the evaluator normalize the score for you, if you want to use these values on a scale comparable with other evaluators.
evaluator = load_evaluator(
    "labeled_score_string",
    criteria=accuracy_criteria,
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    normalize_by=10,
)
# Correct but lacking information
eval_result = evaluator.evaluate_strings(
    prediction="You can find them in the dresser.",
    reference="The socks are in the third drawer in the dresser",
    input="Where are my socks?",
)
print(eval_result)
Output:
(.venv) ~/Workspace/LLM/langchain-llm-app/ [develop+*] python Evaluate/score.py
Correct but lacking information >>> {'reasoning': "Explanation: \nThe AI assistant's response is partially relevant to the user's question. It mentions the location of the socks as being in the dresser, which aligns with the ground truth. However, it lacks the specific information that the socks are in the third drawer. Overall, the response provides some relevant information but has a minor error in omitting the specific drawer location. \n\nRating: [[7]]", 'score': 0.7}
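Note that 'score' is now 0.7 rather than 7: with normalize_by=10, the raw grade is divided by 10 into the [0, 1] range. Normalized scores are convenient to gate on; the following is a sketch of a simple pass/fail check, where the 0.8 threshold is an arbitrary assumption for illustration, not anything LangChain prescribes:
# Sketch: turn a normalized score into a pass/fail signal.
THRESHOLD = 0.8  # arbitrary cutoff chosen for illustration

score = eval_result["score"]  # e.g. 0.7 from the run above
if score >= THRESHOLD:
    print(f"PASS ({score:.2f} >= {THRESHOLD})")
else:
    print(f"FAIL ({score:.2f} < {THRESHOLD})")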
Code: https://github.com/zgpeace/pets-name-langchain/tree/develop
Reference: https://python.langchain.com/docs/guides/evaluation/string/scoring_eval_chain