Mosaic AI Agent Evaluation LLM judges reference

10/25/2024

This article covers the details of each LLM judge built into Mosaic AI Agent Evaluation, including required inputs and output metrics. It also covers the output produced by custom judges.

For an introduction to LLM judges, see How quality, cost, and latency are assessed by Agent Evaluation.

Response judges

Response quality metrics assess how well the application responds to a user’s request. These metrics evaluate factors such as the accuracy of the response compared to ground truth, whether the response is well-grounded given the retrieved context (or if the LLM is hallucinating), and whether the response is safe and free of toxic language.

Overall, did the LLM give an accurate answer?

The correctness LLM judge gives a binary evaluation and written rationale on whether the agent’s generated response is factually accurate and semantically similar to the provided ground-truth response.

Input required for `correctness`

The ground truth expected_response is required.

The input evaluation set must have the following columns:

request
expected_response

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either response or trace.
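As a minimal sketch of the input rules above, the following builds a single evaluation row as a plain Python dictionary and checks it for the columns that correctness needs. The helper name missing_correctness_columns and the sample text are hypothetical and are not part of Agent Evaluation; in practice the rows would typically be loaded into a DataFrame and passed to mlflow.evaluate().

```python
# Hypothetical helper: it only illustrates the column rules described above.
REQUIRED = ("request", "expected_response")
ONE_OF = ("response", "trace")  # needed only when no model argument is passed


def missing_correctness_columns(row, uses_model_argument=False):
    """Return the names of required columns that are absent from one row."""
    missing = [name for name in REQUIRED if row.get(name) is None]
    if not uses_model_argument and all(row.get(name) is None for name in ONE_OF):
        missing.append("response or trace")
    return missing


eval_row = {
    "request": "What is the capital of France?",
    # Ground truth holds only the minimal facts needed for a correct answer.
    "expected_response": "Paris",
    "response": "The capital of France is Paris.",
}

print(missing_correctness_columns(eval_row))  # prints []
```

A row that passes this check has everything the judge needs; a row without expected_response would be reported as incomplete.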

Important

The ground truth expected_response should include only the minimal set of facts that is required for a correct response. If you copy a response from another source, edit the response to remove any text that is not required for an answer to be considered correct.

Including only the required information, and leaving out information that is not strictly required in the answer, enables Agent Evaluation to provide a more robust signal on output quality.

Output for `correctness`

The following metrics are calculated for each question:

Data field	Type	Description
`response/llm_judged/correctness/rating`	`string`	`yes` or `no`. `yes` indicates that the generated response is highly accurate and semantically similar to the ground truth. Minor omissions or inaccuracies that still capture the intent of the ground truth are acceptable. `no` indicates that the response does not meet the criteria.
`response/llm_judged/correctness/rationale`	`string`	LLM’s written reasoning for `yes` or `no`.
`response/llm_judged/correctness/error_message`	`string`	If there was an error computing this metric, details of the error are here. If no error, this is NULL.

The following metric is calculated for the entire evaluation set:

Metric name	Type	Description
`response/llm_judged/correctness/rating/percentage`	`float, [0, 1]`	Across all questions, percentage where correctness is judged as `yes`.
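The aggregate metric can be illustrated with a few lines of Python. This is a sketch under the assumption that every question received a `yes` or `no` rating; how rows with an error are counted is not stated here, so the hypothetical helper rating_percentage does not model that case.

```python
def rating_percentage(ratings):
    """Fraction of per-question ratings equal to "yes", in the range [0, 1]."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(1 for rating in ratings if rating == "yes") / len(ratings)


# Example per-question values of response/llm_judged/correctness/rating.
per_question = ["yes", "no", "yes", "yes"]
print(rating_percentage(per_question))  # prints 0.75
```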

Is the response relevant to the request?

The relevance_to_query LLM judge determines whether the response is relevant to the input request.

Input required for `relevance_to_query`

Ground truth is not required.

The input evaluation set must have the following column:

request

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either response or trace.

Output for `relevance_to_query`

The following metrics are calculated for each question:

Data field	Type	Description
`response/llm_judged/relevance_to_query/rating`	`string`	`yes` if the response is judged to be relevant to the request, `no` otherwise.
`response/llm_judged/relevance_to_query/rationale`	`string`	LLM’s written reasoning for `yes` or `no`.
`response/llm_judged/relevance_to_query/error_message`	`string`	If there was an error computing this metric, details of the error are here. If no error, this is NULL.

The following metric is calculated for the entire evaluation set:

Metric name	Type	Description
`response/llm_judged/relevance_to_query/rating/percentage`	`float, [0, 1]`	Across all questions, percentage where `relevance_to_query/rating` is judged to be `yes`.

Is the response a hallucination, or is it grounded in the retrieved context?

The groundedness LLM judge returns a binary evaluation and written rationale on whether the generated response is factually consistent with the retrieved context.

Input required for `groundedness`

Ground truth is not required.

The input evaluation set must have the following column:

request

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either trace or both of response and retrieved_context[].content.
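The nested retrieved_context[].content field can be pictured as a list of chunks, each with a content entry. The sketch below assumes that shape and uses a hypothetical helper, has_groundedness_inputs, to check one row against the rules above; treating an empty chunk list as insufficient is also an assumption.

```python
def has_groundedness_inputs(row, uses_model_argument=False):
    """Check one evaluation row against the groundedness input rules."""
    if not row.get("request"):
        return False
    # A model argument or a trace supplies the response and the context.
    if uses_model_argument or row.get("trace") is not None:
        return True
    chunks = row.get("retrieved_context") or []
    # Assumption: at least one chunk is needed and each must carry content.
    has_content = bool(chunks) and all(chunk.get("content") for chunk in chunks)
    return bool(row.get("response")) and has_content


eval_row = {
    "request": "How long is the warranty?",
    "response": "The warranty lasts two years.",
    "retrieved_context": [
        {"content": "All products include a two-year warranty."},
        {"content": "Warranty claims require proof of purchase."},
    ],
}

print(has_groundedness_inputs(eval_row))  # prints True
```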

Output for `groundedness`

The following metrics are calculated for each question:

Data field	Type	Description
`response/llm_judged/groundedness/rating`	`string`	`yes` if the retrieved context supports all or almost all of the generated response, `no` otherwise.
`response/llm_judged/groundedness/rationale`	`string`	LLM’s written reasoning for `yes` or `no`.
`response/llm_judged/groundedness/error_message`	`string`	If there was an error computing this metric, details of the error are here. If no error, this is NULL.

The following metric is calculated for the entire evaluation set:

Metric name	Type	Description
`response/llm_judged/groundedness/rating/percentage`	`float, [0, 1]`	Across all questions, percentage where `groundedness/rating` is judged as `yes`.

Is there harmful content in the agent response?

The safety LLM judge returns a binary rating and a written rationale on whether the generated response has harmful or toxic content.

Input required for `safety`

Ground truth is not required.

The input evaluation set must have the following column:

request

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either response or trace.

Output for `safety`

The following metrics are calculated for each question:

Data field	Type	Description
`response/llm_judged/safety/rating`	`string`	`yes` if the response does not have harmful or toxic content, `no` otherwise.
`response/llm_judged/safety/rationale`	`string`	LLM’s written reasoning for `yes` or `no`.
`response/llm_judged/safety/error_message`	`string`	If there was an error computing this metric, details of the error are here. If no error, this is NULL.

The following metric is calculated for the entire evaluation set:

Metric name	Type	Description
`response/llm_judged/safety/rating/average`	`float, [0, 1]`	Percentage of all questions that were judged to be `yes`.

Retrieval judges

Retrieval quality metrics assess the performance of the retriever in finding the documents that are relevant to the input request. These metrics evaluate factors such as: Did the retriever find the relevant chunks? How many of the known relevant documents did it find? Were the documents it found sufficient to produce the expected response?

Did the retriever find relevant chunks?

The chunk-relevance-precision LLM judge determines whether the chunks returned by the retriever are relevant to the input request. Precision is calculated as the number of relevant chunks returned divided by the total number of chunks returned. For example, if the retriever returns four chunks, and the LLM judge determines that three of the four returned chunks are relevant to the request, then llm_judged/chunk_relevance/precision is 0.75.
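The precision arithmetic can be reproduced with a short sketch. The function name chunk_precision is hypothetical; it takes the per-chunk `yes`/`no` ratings for one question and divides the number of relevant chunks by the number of chunks returned.

```python
def chunk_precision(ratings):
    """Relevant chunks divided by all retrieved chunks for one question."""
    if not ratings:
        raise ValueError("precision is undefined when no chunks were retrieved")
    return ratings.count("yes") / len(ratings)


# Four chunks returned, three judged relevant.
chunk_ratings = ["yes", "yes", "no", "yes"]
print(chunk_precision(chunk_ratings))  # prints 0.75
```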

Input required for `llm_judged/chunk_relevance`

Ground truth is not required.

The input evaluation set must have the following column:

request

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either retrieved_context[].content or trace.

Output for `llm_judged/chunk_relevance`

The following metrics are calculated for each question:

Data field	Type	Description
`retrieval/llm_judged/chunk_relevance/ratings`	`array[string]`	For each chunk, `yes` or `no`, indicating if the retrieved chunk is relevant to the input request.
`retrieval/llm_judged/chunk_relevance/rationales`	`array[string]`	For each chunk, LLM’s reasoning for the corresponding rating.
`retrieval/llm_judged/chunk_relevance/error_messages`	`array[string]`	For each chunk, if there was an error computing the rating, details of the error are here, and other output values will be NULL. If no error, this is NULL.
`retrieval/llm_judged/chunk_relevance/precision`	`float, [0, 1]`	Calculates the percentage of relevant chunks among all retrieved chunks.

The following metric is reported for the entire evaluation set:

Metric name	Type	Description
`retrieval/llm_judged/chunk_relevance/precision/average`	`float, [0, 1]`	Average value of `chunk_relevance/precision` across all questions.

How many of the known relevant documents did the retriever find?

document_recall is calculated as the number of relevant documents returned divided by the total number of relevant documents based on ground truth. For example, suppose that two documents are relevant based on ground truth. If the retriever returns one of those documents, document_recall is 0.5. This metric is not affected by the total number of documents returned.

This metric is deterministic and does not use an LLM judge.
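Because the metric is deterministic, it can be sketched directly. The helper below is a hypothetical illustration that compares document URIs as sets; the file names are made up for the example.

```python
def document_recall(expected_doc_uris, retrieved_doc_uris):
    """Fraction of ground-truth documents that appear among the retrieved ones."""
    expected = set(expected_doc_uris)
    if not expected:
        raise ValueError("recall needs at least one ground-truth document")
    found = expected & set(retrieved_doc_uris)
    return len(found) / len(expected)


expected = ["doc_a.md", "doc_b.md"]
retrieved = ["doc_a.md", "doc_x.md", "doc_y.md"]
print(document_recall(expected, retrieved))  # prints 0.5
```

Retrieving additional irrelevant documents leaves the result unchanged, since only the ground-truth documents appear in the denominator.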

Input required for `document_recall`

Ground truth is required.

The input evaluation set must have the following column:

expected_retrieved_context[].doc_uri

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either retrieved_context[].doc_uri or trace.

Output for `document_recall`

The following metric is calculated for each question:

Data field	Type	Description
`retrieval/ground_truth/document_recall`	`float, [0, 1]`	The percentage of ground truth `doc_uris` present in the retrieved chunks.

The following metric is calculated for the entire evaluation set:

Metric name	Type	Description
`retrieval/ground_truth/document_recall/average`	`float, [0, 1]`	Average value of `document_recall` across all questions.

Did the retriever find documents sufficient to produce the expected response?

The context_sufficiency LLM judge determines whether the retriever has retrieved documents that are sufficient to produce the expected response.

Input required for `context_sufficiency`

Ground truth expected_response is required.

The input evaluation set must have the following columns:

request
expected_response

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either retrieved_context[].content or trace.

Output for `context_sufficiency`

The following metrics are calculated for each question:

Data field	Type	Description
`retrieval/llm_judged/context_sufficiency/rating`	`string`	`yes` or `no`. `yes` indicates that the retrieved context is sufficient to produce the expected response. `no` indicates that the retrieval needs to be tuned for this question so that it brings back the missing information. The output rationale should mention what information is missing.
`retrieval/llm_judged/context_sufficiency/rationale`	`string`	LLM’s written reasoning for `yes` or `no`.
`retrieval/llm_judged/context_sufficiency/error_message`	`string`	If there was an error computing this metric, details of the error are here. If no error, this is NULL.

The following metric is calculated for the entire evaluation set:

Metric name	Type	Description
`retrieval/llm_judged/context_sufficiency/rating/percentage`	`float, [0, 1]`	Percentage where context sufficiency is judged as `yes`.

Custom judge metrics

You can create a custom judge to perform assessments specific to your use case. For details, see Create custom LLM judges.

The output produced by a custom judge depends on its assessment_type, ANSWER or RETRIEVAL.

Custom LLM judge for ANSWER assessment

A custom LLM judge for ANSWER assessment evaluates the response for each question.

Outputs provided for each assessment:

Data field	Type	Description
`response/llm_judged/{assessment_name}/rating`	`string`	`yes` or `no`.
`response/llm_judged/{assessment_name}/rationale`	`string`	LLM’s written reasoning for `yes` or `no`.
`response/llm_judged/{assessment_name}/error_message`	`string`	If there was an error computing this metric, details of the error are here. If no error, this is NULL.

The following metric is calculated for the entire evaluation set:

Metric name	Type	Description
`response/llm_judged/{assessment_name}/rating/percentage`	`float, [0, 1]`	Across all questions, percentage where {assessment_name} is judged as `yes`.

Custom LLM judge for RETRIEVAL assessment

A custom LLM judge for RETRIEVAL assessment evaluates each retrieved chunk across all questions.

Outputs provided for each assessment:

Data field	Type	Description
`retrieval/llm_judged/{assessment_name}/ratings`	`array[string]`	Evaluation of the custom judge for each chunk, `yes` or `no`.
`retrieval/llm_judged/{assessment_name}/rationales`	`array[string]`	For each chunk, LLM’s written reasoning for `yes` or `no`.
`retrieval/llm_judged/{assessment_name}/error_messages`	`array[string]`	For each chunk, if there was an error computing this metric, details of the error are here, and other values are NULL. If no error, this is NULL.
`retrieval/llm_judged/{assessment_name}/precision`	`float, [0, 1]`	Percentage of all retrieved chunks that the custom judge evaluated as `yes`.

Metrics reported for the entire evaluation set:

Metric name	Type	Description
`retrieval/llm_judged/{assessment_name}/precision/average`	`float, [0, 1]`	Average value of `{assessment_name}/precision` across all questions.
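To show how the per-chunk ratings relate to the per-question precision and the evaluation-set average, here is a minimal sketch. The function retrieval_judge_outputs and the judge name no_pii are hypothetical; only the field-name pattern comes from the tables above.

```python
def retrieval_judge_outputs(assessment_name, ratings_per_question):
    """Build per-question fields and the set-level average for a RETRIEVAL judge."""
    prefix = f"retrieval/llm_judged/{assessment_name}"
    rows = []
    for ratings in ratings_per_question:
        rows.append({
            f"{prefix}/ratings": ratings,
            # Share of this question's chunks that the judge rated "yes".
            f"{prefix}/precision": ratings.count("yes") / len(ratings),
        })
    average = sum(row[f"{prefix}/precision"] for row in rows) / len(rows)
    return rows, {f"{prefix}/precision/average": average}


rows, summary = retrieval_judge_outputs("no_pii", [["yes", "no"], ["yes", "yes"]])
print(summary)  # prints {'retrieval/llm_judged/no_pii/precision/average': 0.75}
```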
