Agent Evaluation input schema

Important

This feature is in Public Preview.

This article explains the input schema required by Agent Evaluation to assess your application’s quality, cost, and latency.

  • During development, evaluation takes place offline, and an evaluation set is a required input to Agent Evaluation.
  • When an application is in production, all inputs to Agent Evaluation come from your inference tables or production logs.

The input schema is identical for both online and offline evaluations.

For general information about evaluation sets, see Evaluation sets.

Evaluation input schema

The following table shows Agent Evaluation’s input schema. The last two columns of the table refer to how input is provided to the mlflow.evaluate() call. See How to provide input to an evaluation run for details.

Column | Data type | Description | Application passed as input argument | Previously generated outputs provided
request_id | string | Unique identifier of request. | Optional | Optional
request | See Schema for request. | Input to the application to evaluate: the user’s question or query. For example, {"messages": [{"role": "user", "content": "What is RAG"}]} or "What is RAG?". When request is provided as a string, it is transformed to messages before it is passed to your agent. | Required | Required
response | string | Response generated by the application being evaluated. | Generated by Agent Evaluation | Optional. If not provided, it is derived from the trace. Either response or trace is required.
expected_facts | array of string | A list of facts that are expected in the model output. See expected_facts guidelines. | Optional | Optional
expected_response | string | Ground-truth (correct) answer for the input request. See expected_response guidelines. | Optional | Optional
expected_retrieved_context | array | Array of objects containing the expected retrieved context for the request (if the application includes a retrieval step). See Array schema. | Optional | Optional
retrieved_context | array | Retrieval results generated by the retriever in the application being evaluated. If the application has multiple retrieval steps, these are the retrieval results from the last step (chronologically in the trace). See Array schema. | Generated by Agent Evaluation | Optional. If not provided, it is derived from the provided trace.
trace | JSON string of MLflow Trace | MLflow Trace of the application’s execution on the corresponding request. | Generated by Agent Evaluation | Optional. Either response or trace is required.
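The required and optional rules in the table above can be checked before an evaluation run. The following is a minimal sketch, not part of Agent Evaluation: the helper name `missing_required_inputs` and the `app_passed` flag are hypothetical, and rows are plain dictionaries keyed by the column names in the table.

```python
from typing import Any, List, Mapping


def missing_required_inputs(row: Mapping[str, Any], app_passed: bool) -> List[str]:
    """Return the problems found in one evaluation row (empty list if none).

    Hypothetical helper: it only mirrors the Required/Optional rules in the
    schema table and is not an Agent Evaluation API.
    """
    problems = []
    # `request` is required in both modes.
    if row.get("request") is None:
        problems.append("request is required")
    # When previously generated outputs are provided instead of the
    # application, either `response` or `trace` must be present.
    if not app_passed and row.get("response") is None and row.get("trace") is None:
        problems.append("either response or trace is required")
    return problems


# A row that is sufficient when the application itself is passed in.
row_for_app = {"request_id": "req-1", "request": "What is RAG?"}

# A row carrying a previously generated response.
row_with_output = {
    "request": "What is RAG?",
    "response": "RAG stands for retrieval-augmented generation.",
}

problems_app = missing_required_inputs(row_for_app, app_passed=True)
problems_output = missing_required_inputs(row_with_output, app_passed=False)
problems_incomplete = missing_required_inputs(row_for_app, app_passed=False)
```

Here `row_for_app` is complete only when the application is passed as an input argument; the same row is incomplete when outputs are expected to be provided, because it has neither `response` nor `trace`.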

expected_facts guidelines

The expected_facts field specifies the list of facts that is expected to appear in any correct model response for the specific input request. That is, a model response is deemed correct if it contains these facts, regardless of how the response is phrased.

Including only the required facts, and leaving out facts that are not strictly required in the answer, enables Agent Evaluation to provide a more robust signal on output quality.

You can specify at most one of expected_facts and expected_response; specifying both results in an error. Databricks recommends using expected_facts, as it is a more specific guideline that helps Agent Evaluation judge the quality of generated responses more effectively.
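To make the two guidelines above concrete, here is a small sketch of an evaluation row that lists only the required facts, together with a hypothetical pre-check for the "at most one" rule. The function `check_ground_truth` and the use of `ValueError` are assumptions for illustration; the error Agent Evaluation itself reports is not specified here.

```python
def check_ground_truth(row):
    """Raise if a row sets both expected_facts and expected_response.

    Hypothetical pre-check; ValueError is a stand-in for whatever error
    Agent Evaluation reports.
    """
    has_facts = row.get("expected_facts") is not None
    has_response = row.get("expected_response") is not None
    if has_facts and has_response:
        raise ValueError(
            "Specify at most one of expected_facts and expected_response"
        )


# Only the facts a correct answer must contain, independent of phrasing.
good_row = {
    "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    "expected_facts": [
        "reduceByKey combines values for each key before shuffling data",
        "groupByKey shuffles all key-value pairs",
    ],
}
check_ground_truth(good_row)  # passes silently

# Setting both fields on one row is rejected by the pre-check.
bad_row = dict(good_row, expected_response="A fully formed reference answer.")
try:
    check_ground_truth(bad_row)
    rejected = False
except ValueError:
    rejected = True
```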

expected_response guidelines

The expected_response field contains a fully formed response that represents a reference for correct model responses. That is, a model response is deemed correct if it matches the information content in expected_response. In contrast, expected_facts lists only the facts that are required to appear in a correct response and is not a fully formed reference response.

Similar to expected_facts, expected_response should contain only the minimal set of facts that is required for a correct response. Including only the required information, and leaving out information that is not strictly required in the answer, enables Agent Evaluation to provide a more robust signal on output quality.

You can specify at most one of expected_facts and expected_response; specifying both results in an error. Databricks recommends using expected_facts, as it is a more specific guideline that helps Agent Evaluation judge the quality of generated responses more effectively.

Schema for request

The request schema can be one of the following:

  • A messages field that follows the OpenAI chat completion schema and can encode the full conversation.
  • A plain string. This format supports single-turn conversations only. Plain strings are converted to the messages format before being passed to your agent.
  • A query string field for the most recent request and an optional history field that encodes previous turns of the conversation.

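For multi-turn chat applications, use the first or third option above (the messages field, or the query and history fields). Plain strings support single-turn conversations only.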

The following example shows all three options in the same request column of the evaluation dataset:

import pandas as pd

data = {
  "request": [

      # Plain string. Plain strings are transformed to the `messages` format before being passed to your agent.
      "What is the difference between reduceByKey and groupByKey in Spark?",

      # Using the `messages` field for a single- or multi-turn chat
      {
          "messages": [
              {
                  "role": "user",
                  "content": "How can you minimize data shuffling in Spark?"
              }
          ]
      },

      # Using the query and history fields for a single- or multi-turn chat
      {
          "query": "Explain broadcast variables in Spark. How do they enhance performance?",
          "history": [
              {
                  "role": "user",
                  "content": "What are broadcast variables?"
              },
              {
                  "role": "assistant",
                  "content": "Broadcast variables allow the programmer to keep a read-only variable cached on each machine."
              }
          ]
      }
  ],

  "expected_response": [
    "expected response for first question",
    "expected response for second question",
    "expected response for third question"
  ]
}

eval_dataset = pd.DataFrame(data)
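To illustrate how the three request formats relate to one another, the following sketch normalizes each of them to the messages format. It is an assumption-laden illustration, not the conversion Agent Evaluation performs: the function name `to_messages` is hypothetical, and appending the query as a final user turn after the history is one plausible reading of the query and history fields.

```python
def to_messages(request):
    """Normalize a request to {"messages": [...]} (illustrative only)."""
    if isinstance(request, str):
        # Plain string: a single user turn.
        return {"messages": [{"role": "user", "content": request}]}
    if "messages" in request:
        # Already in the messages format; copy the list of turns.
        return {"messages": list(request["messages"])}
    if "query" in request:
        # Assumption: earlier turns come first, then the query as a user turn.
        history = list(request.get("history", []))
        return {"messages": history + [{"role": "user", "content": request["query"]}]}
    raise ValueError("Unrecognized request format")


plain = to_messages("What is RAG?")

multi_turn = to_messages(
    {
        "query": "How do they enhance performance?",
        "history": [
            {"role": "user", "content": "What are broadcast variables?"},
            {"role": "assistant", "content": "They are read-only cached variables."},
        ],
    }
)
```

With this normalization, the query and history form and the messages form describe the same conversation, which is why either can be used for multi-turn applications.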

Schema for arrays in evaluation input

The schema of the arrays expected_retrieved_context and retrieved_context is shown in the following table:

Column | Data type | Description | Application passed as input argument | Previously generated outputs provided
content | string | Contents of the retrieved context. String in any format, such as HTML, plain text, or Markdown. | Optional | Optional
doc_uri | string | Unique identifier (URI) of the parent document where the chunk came from. | Required | Required
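As a sketch of what entries in these arrays might look like, the example below builds a retrieved_context value and checks it against the table above. The helper `validate_context` and the document URIs are hypothetical and exist only for illustration.

```python
def validate_context(chunks):
    """Return the problems found in a context array (empty list if none).

    Hypothetical helper that mirrors the array schema: doc_uri is a
    required string, content is an optional string.
    """
    problems = []
    for i, chunk in enumerate(chunks):
        doc_uri = chunk.get("doc_uri")
        if not isinstance(doc_uri, str) or not doc_uri:
            problems.append(f"entry {i}: doc_uri is required")
        content = chunk.get("content")
        if content is not None and not isinstance(content, str):
            problems.append(f"entry {i}: content must be a string")
    return problems


# Hypothetical URIs; content may be plain text, HTML, or Markdown.
retrieved_context = [
    {"doc_uri": "docs/spark/shuffle.md", "content": "Shuffling moves data between partitions."},
    {"doc_uri": "docs/spark/broadcast.md"},  # content is optional
]

context_problems = validate_context(retrieved_context)
missing_uri_problems = validate_context([{"content": "No source document."}])
```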

Computed metrics

The columns in the following table indicate the data included in the input, and ✓ indicates that the metric is supported when that data is provided.

For details about what these metrics measure, see How quality, cost, and latency are assessed by Agent Evaluation.

Calculated metrics | request | request and expected_response | request, expected_response, and expected_retrieved_context | request and expected_retrieved_context
response/llm_judged/relevance_to_query/rating
response/llm_judged/safety/rating
response/llm_judged/groundedness/rating
retrieval/llm_judged/chunk_relevance_precision
agent/total_token_count
agent/input_token_count
agent/output_token_count
response/llm_judged/correctness/rating
retrieval/llm_judged/context_sufficiency/rating
retrieval/ground_truth/document_recall
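This article does not define how each metric is computed. As a rough illustration of why retrieval/ground_truth/document_recall needs expected_retrieved_context, the sketch below assumes recall is the fraction of expected doc_uri values that also appear in retrieved_context. That definition, the function name `document_recall`, and the URIs are assumptions; see the linked article on how metrics are assessed for the authoritative description.

```python
def document_recall(expected_retrieved_context, retrieved_context):
    """Fraction of expected doc_uris found among the retrieved doc_uris.

    Assumed definition for illustration; returns None when there is no
    ground truth to compare against.
    """
    expected_uris = {chunk["doc_uri"] for chunk in expected_retrieved_context}
    if not expected_uris:
        return None
    retrieved_uris = {chunk["doc_uri"] for chunk in retrieved_context}
    return len(expected_uris & retrieved_uris) / len(expected_uris)


expected = [{"doc_uri": "docs/a.md"}, {"doc_uri": "docs/b.md"}]
retrieved = [{"doc_uri": "docs/a.md"}, {"doc_uri": "docs/c.md"}]

recall = document_recall(expected, retrieved)  # one of two expected documents found
```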