Evaluation sets
Important
This feature is in Public Preview.
To measure the quality of an agentic application, you need to be able to define a representative set of requests, along with criteria that describe high-quality responses. You do this by providing an evaluation set. This article covers the different options for evaluation sets and some best practices for creating them.
Databricks recommends creating a human-labeled evaluation set that consists of representative questions and ground-truth answers. If your application includes a retrieval step, you can optionally provide the supporting documents on which the expected response is based. Although a human-labeled evaluation set is recommended, Agent Evaluation works equally well with synthetically generated evaluation sets.
A good evaluation set has the following characteristics:
- Representative: It should accurately reflect the range of requests the application will encounter in production.
- Challenging: It should include difficult and diverse cases to effectively test the full range of the application's capabilities.
- Continually updated: It should be updated regularly to reflect how the application is used and the changing patterns of production traffic.
For the required schema for an evaluation set, see Agent Evaluation input schema.
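To run an evaluation, the set is typically passed to Agent Evaluation as a pandas DataFrame. The following is a minimal sketch, assuming the `mlflow.evaluate` entry point with the `databricks-agent` model type; the model URI "models:/my_agent/1" is a hypothetical placeholder for your own logged agent.

import mlflow
import pandas as pd

eval_set = [
    {"request": "What is the difference between reduceByKey and groupByKey in Spark?"},
]

# Rows without a `response` require the agent itself so that the harness can
# generate responses to judge. "models:/my_agent/1" is a placeholder model URI.
results = mlflow.evaluate(
    data=pd.DataFrame(eval_set),
    model="models:/my_agent/1",
    model_type="databricks-agent",
)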
Sample evaluation sets
This section includes simple examples of evaluation sets.
Sample evaluation set with only request
eval_set = [
{
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
}
]
Sample evaluation set with request and expected_response
eval_set = [
{
"request_id": "request-id",
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"expected_response": "There's no significant difference.",
}
]
Sample evaluation set with request, expected_response, and expected_retrieved_context
eval_set = [
{
"request_id": "request-id",
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"expected_retrieved_context": [
{
"doc_uri": "doc_uri_1",
},
{
"doc_uri": "doc_uri_2",
},
],
"expected_response": "There's no significant difference.",
}
]
Sample evaluation set with only request and response
eval_set = [
{
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
}
]
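Because each row above already carries a `response`, evaluation can run without invoking your application; the LLM judges score the provided responses directly. A minimal sketch, reusing the `eval_set` defined above and the same `mlflow.evaluate` assumption as earlier:

import mlflow
import pandas as pd

# No `model` argument is needed: the judges score the `response` values
# already present in the evaluation set.
results = mlflow.evaluate(
    data=pd.DataFrame(eval_set),
    model_type="databricks-agent",
)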
Sample evaluation set with request, response, and retrieved_context
eval_set = [
{
"request_id": "request-id", # optional, but useful for tracking
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
"retrieved_context": [
{
# In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
"content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
"doc_uri": "doc_uri_2_1",
},
{
"content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
"doc_uri": "doc_uri_6_extra",
},
],
}
]
Sample evaluation set with request, response, retrieved_context, and expected_response
eval_set = [
{
"request_id": "request-id",
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"expected_response": "There's no significant difference.",
"response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
"retrieved_context": [
{
# In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
"content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
"doc_uri": "doc_uri_2_1",
},
{
"content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
"doc_uri": "doc_uri_6_extra",
},
],
}
]
Sample evaluation set with request, response, retrieved_context, expected_response, and expected_retrieved_context
eval_set = [
{
"request_id": "request-id",
"request": "What is the difference between reduceByKey and groupByKey in Spark?",
"expected_retrieved_context": [
{
"doc_uri": "doc_uri_2_1",
},
{
"doc_uri": "doc_uri_2_2",
},
],
"expected_response": "There's no significant difference.",
"response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
"retrieved_context": [
{
# In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
"content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
"doc_uri": "doc_uri_2_1",
},
{
"content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
"doc_uri": "doc_uri_6_extra",
},
],
}
]
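After evaluating a fully specified set like this one, you can inspect the per-request judge outputs. A short sketch, reusing the `eval_set` above and assuming the evaluation result exposes its per-row scores under an "eval_results" table key, as described in the Agent Evaluation output docs:

import mlflow
import pandas as pd

results = mlflow.evaluate(
    data=pd.DataFrame(eval_set),
    model_type="databricks-agent",
)

# Per-request scores and judge rationales as a pandas DataFrame; the
# "eval_results" table key is an assumption based on the output schema docs.
per_request = results.tables["eval_results"]
print(per_request.columns.tolist())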
Best practices for developing an evaluation set
- Consider each sample, or group of samples, in the evaluation set as a unit test. That is, each sample should correspond to a specific scenario with an explicit expected outcome. For example, consider testing longer contexts, multi-hop reasoning, and the ability to infer answers from indirect evidence.
- Consider testing adversarial scenarios from malicious users.
- There is no specific guideline on the number of questions to include in an evaluation set, but clear signals from high-quality data typically perform better than noisy signals from weak data.
- Consider including some examples that are very challenging, even for humans to answer.
- Whether you are building a general-purpose application or targeting a specific domain, your application is likely to encounter a wide variety of questions, and the evaluation set should reflect that. For example, if you are creating an application to field specific HR questions, you should still consider testing other domains (such as operations) to ensure that the application does not hallucinate or provide harmful responses.
- High-quality, consistent human-generated labels are the best way to ensure that the ground-truth values you provide to the application accurately reflect the desired behavior. Some steps to ensure high-quality human labels include the following:
  - Aggregate answers (labels) from multiple human labelers for the same question; a simple aggregation sketch follows this list.
  - Ensure that labeling instructions are clear and that the labelers apply them consistently.
  - Ensure that the conditions of the human-labeling process are identical to the format of the requests submitted to the RAG application.
- Human labelers are inherently noisy and inconsistent, for example because they interpret questions differently. This is an important part of the process. Using human labels can reveal interpretations of questions that you had not considered, and it might provide insight into behavior that you observe in your application.
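One way to aggregate labels from multiple labelers, referenced in the list above, is a simple majority vote per question. The sketch below is a hypothetical illustration using pandas; the labeler names and answers are placeholders, not part of any Agent Evaluation API.

import pandas as pd

# Hypothetical raw labels: several labelers answered each question.
raw_labels = pd.DataFrame([
    {"request_id": "q1", "labeler": "a", "expected_response": "No significant difference."},
    {"request_id": "q1", "labeler": "b", "expected_response": "No significant difference."},
    {"request_id": "q1", "labeler": "c", "expected_response": "reduceByKey is more efficient."},
])

# Majority vote per question; ties resolve to the first most frequent answer.
consensus = (
    raw_labels.groupby("request_id")["expected_response"]
    .agg(lambda s: s.mode().iloc[0])
    .reset_index()
)
eval_set = consensus.to_dict("records")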