evaluation Package

Packages

simulator

Classes

AzureAIProject

Azure AI Project Information
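
The safety evaluators in this package take this project information as a plain dictionary; the shape used throughout this document is:


    azure_ai_project = {
        "subscription_id": "<subscription_id>",
        "resource_group_name": "<resource_group_name>",
        "project_name": "<project_name>",
    }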

AzureOpenAIModelConfiguration

Model Configuration for Azure OpenAI Model
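
Evaluators below that accept a model_config take this configuration. A minimal sketch, using the same environment-variable keys as the evaluate example at the end of this document:


    model_config = {
        "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
        "api_key": os.environ.get("AZURE_OPENAI_KEY"),
        "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    }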

BleuScoreEvaluator

Evaluator that computes the BLEU Score between two strings.

BLEU (Bilingual Evaluation Understudy) score is commonly used in natural language processing (NLP) and machine translation. It is widely used in text summarization and text generation use cases. It evaluates how closely the generated text matches the reference text. The BLEU score ranges from 0 to 1, with higher scores indicating better quality.

Usage


   eval_fn = BleuScoreEvaluator()
   result = eval_fn(
       response="Tokyo is the capital of Japan.",
       ground_truth="The capital of Japan is Tokyo.")

Output format


   {
       "bleu_score": 0.22
   }
CoherenceEvaluator

Initialize a coherence evaluator configured for a specific Azure OpenAI model.

Usage


   eval_fn = CoherenceEvaluator(model_config)
   result = eval_fn(
       query="What is the capital of Japan?",
       response="The capital of Japan is Tokyo.")

Output format


   {
       "coherence": 1.0,
       "gpt_coherence": 1.0,
   }

Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.

ContentSafetyEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a content safety evaluator configured to evaluate content safety metrics for a QA scenario.

Usage


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = ContentSafetyEvaluator(azure_ai_project)
   result = eval_fn(
       query="What is the capital of France?",
       response="Paris.",
   )

Output format


   {
       "violence": "Medium",
       "violence_score": 5.0,
       "violence_reason": "Some reason",
       "sexual": "Medium",
       "sexual_score": 5.0,
       "sexual_reason": "Some reason",
       "self_harm": "Medium",
       "self_harm_score": 5.0,
       "self_harm_reason": "Some reason",
       "hate_unfairness": "Medium",
       "hate_unfairness_score": 5.0,
       "hate_unfairness_reason": "Some reason"
   }
ContentSafetyMultimodalEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a content safety multimodal evaluator configured to evaluate content safety metrics in a multimodal scenario.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = ContentSafetyMultimodalEvaluator(azure_ai_project)
   result = eval_fn(
       {
           "messages": [
               {
                   "role": "user",
                   "content": [
                       {
                           "type": "text",
                           "text": "What's in this image?"
                       },
                       {
                           "type": "image_url",
                           "image_url": {
                               "url": "<image url or base64 encoded image>"
                           }
                       }
                   ]
               },
               {
                   "role": "assistant",
                   "content": "This picture shows an astronaut standing in the desert."
               }
           ]
       }
   )

Output Format


   {
       "violence": "Medium",
       "violence_score": 5.0,
       "violence_reason": "Some reason",
       "sexual": "Medium",
       "sexual_score": 5.0,
       "sexual_reason": "Some reason",
       "self_harm": "Medium",
       "self_harm_score": 5.0,
       "self_harm_reason": "Some reason",
       "hate_unfairness": "Medium",
       "hate_unfairness_score": 5.0,
       "hate_unfairness_reason": "Some reason"
   }
Conversation
EvaluationResult
EvaluatorConfig

Configuration for an evaluator
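
A minimal sketch of the shape expected by the evaluate function's evaluator_config parameter: each evaluator alias maps to a column_mapping dictionary that binds evaluator inputs to data columns (see the full evaluate example at the end of this document):


    evaluator_config = {
        "coherence": {
            "column_mapping": {
                "response": "${data.response}",
                "query": "${data.query}",
            },
        },
    }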

F1ScoreEvaluator

Initialize an F1 score evaluator for calculating the F1 score.

Usage


   eval_fn = F1ScoreEvaluator()
   result = eval_fn(
       response="The capital of Japan is Tokyo.",
       ground_truth="Tokyo is Japan's capital, known for its blend of traditional culture                 and technological advancements.")

Output format


   {
       "f1_score": 0.42
   }
FluencyEvaluator

Initialize a fluency evaluator configured for a specific Azure OpenAI model.

Usage


   eval_fn = FluencyEvaluator(model_config)
   result = eval_fn(response="The capital of Japan is Tokyo.")

Output format


   {
       "fluency": 4.0,
       "gpt_fluency": 4.0,
   }

Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.

GleuScoreEvaluator

Evaluator that computes the GLEU score between two strings.

The GLEU (Google-BLEU) score evaluator measures the similarity between generated and reference texts by evaluating n-gram overlap, considering both precision and recall. This balanced evaluation, designed for sentence-level assessment, makes it ideal for detailed analysis of translation quality. GLEU is well-suited for use cases such as machine translation, text summarization, and text generation.

Usage


   eval_fn = GleuScoreEvaluator()
   result = eval_fn(
       response="Tokyo is the capital of Japan.",
       ground_truth="The capital of Japan is Tokyo.")

Output format


   {
       "gleu_score": 0.41
   }
GroundednessEvaluator

Initialize a groundedness evaluator configured for a specific Azure OpenAI model.

Usage


   eval_fn = GroundednessEvaluator(model_config)
   result = eval_fn(
       response="The capital of Japan is Tokyo.",
       context="Tokyo is Japan's capital, known for its blend of traditional culture                 and technological advancements.")

Output format


   {
       "groundedness": 5,
       "gpt_groundedness": 5,
   }

Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.

GroundednessProEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a Groundedness Pro evaluator for determining whether the response is grounded in the query and context.

If this evaluator is supplied to the evaluate function, the aggregated metric for the groundedness pro label will be "groundedness_pro_passing_rate".

Usage


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   credential = DefaultAzureCredential()

   eval_fn = GroundednessProEvaluator(azure_ai_project, credential)
   result = eval_fn(query="What's the capital of France", response="Paris", context="Paris.")

Output format


   {
       "groundedness_pro_label": True,
       "reason": "'All Contents are grounded"
   }

Usage with conversation input


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   credential = DefaultAzureCredential()

   eval_fn = GroundednessProEvaluator(azure_ai_project, credential)
   conversation = {
       "messages": [
           {"role": "user", "content": "What is the capital of France?"},
           {"role": "assistant", "content": "Paris.", "context": "Paris."}
           {"role": "user", "content": "What is the capital of Germany?"},
           {"role": "assistant", "content": "Berlin.", "context": "Berlin."}
       ]
   }
   result = eval_fn(conversation=conversation)

Output format


   {
       "groundedness_pro_label": 1.0,
       "evaluation_per_turn": {
           "groundedness_pro_label": [True, True],
           "groundedness_pro_reason": ["All contents are grounded", "All contents are grounded"]
       }
   }
HateUnfairnessEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a hate-unfairness evaluator to compute a hate-unfairness score.

Usage


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = HateUnfairnessEvaluator(azure_ai_project)
   result = eval_fn(query="What is the capital of France?", response="Paris.")

Output format


   {
       "hate_unfairness": "High",
       "hate_unfairness_score": 6.5,
       "hate_unfairness_reason": "Some reason"
   }
HateUnfairnessMultimodalEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a hate-unfairness multimodal evaluator to compute a hate-unfairness score.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = HateUnfairnessMultimodalEvaluator(azure_ai_project)
   result = eval_fn(
       {
           "messages": [
               {
                   "role": "user",
                   "content": [
                       {
                           "type": "text",
                           "text": "What's in this image?"
                       },
                       {
                           "type": "image_url",
                           "image_url": {
                               "url": "<image url or base64 encoded image>"
                           }
                       }
                   ]
               },
               {
                   "role": "assistant",
                   "content": "This picture shows an astronaut standing in the desert."
               }
           ]
       }
   )

Output Format


   {
       "hate_unfairness": "High",
       "hate_unfairness_score": 6.5,
       "hate_unfairness_reason": "Some reason"
   }
IndirectAttackEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

A Cross-Domain Prompt Injection Attack (XPIA) jailbreak evaluator.

Detects whether cross-domain injection attacks are present in your AI system's response. Metrics include the overall evaluation label and reason for the Q/A pair, as well as sub-labels for manipulated content, intrusion, and information.
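
A usage sketch, mirroring the construction pattern of the other safety evaluators in this package; the exact constructor signature may differ by SDK version:


    azure_ai_project = {
        "subscription_id": "<subscription_id>",
        "resource_group_name": "<resource_group_name>",
        "project_name": "<project_name>",
    }
    eval_fn = IndirectAttackEvaluator(azure_ai_project)
    result = eval_fn(query="What is the capital of France?", response="Paris.")

The output is expected to carry an overall label and reason plus the sub-labels described above; the xpia_ key names below are illustrative:


    {
        "xpia_label": False,
        "xpia_reason": "Some reason",
        "xpia_manipulated_content": False,
        "xpia_intrusion": False,
        "xpia_information": False
    }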

Message
MeteorScoreEvaluator

Evaluator that computes the METEOR Score between two strings.

The METEOR (Metric for Evaluation of Translation with Explicit Ordering) score evaluates generated text by comparing it to reference texts, focusing on precision, recall, and content alignment. It addresses limitations of other metrics, such as BLEU, by considering synonyms, stemming, and paraphrasing, capturing meaning and language variation more accurately. Beyond machine translation and text summarization, paraphrase detection is an optimal use case for the METEOR score.

Usage


   eval_fn = MeteorScoreEvaluator(
       alpha=0.9,
       beta=3.0,
       gamma=0.5
   )
   result = eval_fn(
       response="Tokyo is the capital of Japan.",
       ground_truth="The capital of Japan is Tokyo.")

Output format


   {
       "meteor_score": 0.62
   }
OpenAIModelConfiguration

Model Configuration for OpenAI Model
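
A minimal sketch, by analogy with the Azure OpenAI configuration above; the model and base_url keys are assumptions, so check the class reference for the exact fields:


    model_config = {
        "api_key": os.environ.get("OPENAI_API_KEY"),
        "model": "gpt-4o",                         # assumed key naming the OpenAI model
        "base_url": "https://api.openai.com/v1",  # assumed optional key
    }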

ProtectedMaterialEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a protected material evaluator to detect whether protected material is present in the AI system's response. The evaluator outputs a Boolean label (True or False) indicating the presence of protected material, along with AI-generated reasoning.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = ProtectedMaterialEvaluator(azure_ai_project)
   result = eval_fn(query="What is the capital of France?", response="Paris.")

Output Format


   {
       "protected_material_label": false,
       "protected_material_reason": "This query does not contain any protected material."
   }
ProtectedMaterialMultimodalEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a protected materials evaluator to detect whether protected material is present in multimodal messages. The evaluator outputs a Boolean label (True or False) indicating the presence of protected material, along with AI-generated reasoning.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = ProtectedMaterialMultimodalEvaluator(azure_ai_project)
   result = eval_fn(
       {
           "messages": [
               {
                   "role": "user",
                   "content": [
                       {
                           "type": "text",
                           "text": "What's in this image?"
                       },
                       {
                           "type": "image_url",
                           "image_url": {
                               "url": "<image url or base64 encoded image>"
                           }
                       }
                   ]
               },
               {
                   "role": "assistant",
                   "content": "This picture shows an astronaut standing in the desert."
               }
           ]
       }
   )

Output Format


   {
       "protected_material_label": "False",
       "protected_material_reason": "This query does not contain any protected material."
   }
QAEvaluator

Initialize a question-answer evaluator configured for a specific Azure OpenAI model.

Usage


   eval_fn = QAEvaluator(model_config)
    result = eval_fn(
       query="Tokyo is the capital of which country?",
       response="Japan",
       context="Tokyo is the capital of Japan.",
       ground_truth="Japan"
   )

Output format


   {
       "groundedness": 3.5,
       "relevance": 4.0,
       "coherence": 1.5,
       "fluency": 4.0,
       "similarity": 3.0,
       "gpt_groundedness": 3.5,
       "gpt_relevance": 4.0,
       "gpt_coherence": 1.5,
       "gpt_fluency": 4.0,
       "gpt_similarity": 3.0,
       "f1_score": 0.42
   }
RelevanceEvaluator

Initialize a relevance evaluator configured for a specific Azure OpenAI model.

Usage


   eval_fn = RelevanceEvaluator(model_config)
   result = eval_fn(
       query="What is the capital of Japan?",
       response="The capital of Japan is Tokyo.")

Output format


   {
       "relevance": 3.0,
       "gpt_relevance": 3.0,
       "relevance_reason": "The response is relevant to the query because it provides the correct answer.",
   }

Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.

RetrievalEvaluator

Initialize a retrieval evaluator configured for a specific Azure OpenAI model.

Usage


   chat_eval = RetrievalEvaluator(model_config)
   conversation = {
       "messages": [
           {"role": "user", "content": "What is the value of 2 + 2?"},
           {
               "role": "assistant", "content": "2 + 2 = 4",
               "context": "From 'math_doc.md': Information about additions: 1 + 2 = 3, 2 + 2 = 4"
           }
       ]
   }
   result = chat_eval(conversation=conversation)

Output format


   {
       "gpt_retrieval": 3.0,
       "retrieval": 3.0,
       "evaluation_per_turn": {
           "gpt_retrieval": [1.0, 2.0, 3.0],
           "retrieval": [1.0, 2.0, 3.0],
           "retrieval_reason": ["<reasoning for score 1>", "<reasoning for score 2>", "<reasoning for score 3>"]
       }
   }

Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.

RougeScoreEvaluator

Evaluator that computes ROUGE scores between two strings.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. Text summarization and document comparison are among optimal use cases for ROUGE, particularly in scenarios where text coherence and relevance are critical.

Usage


   eval_fn = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)
   result = eval_fn(
       response="Tokyo is the capital of Japan.",
       ground_truth="The capital of Japan is Tokyo.")

Output format


   {
       "rouge_precision": 1.0,
       "rouge_recall": 1.0,
       "rouge_f1_score": 1.0
   }
SelfHarmEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a self-harm evaluator to compute a self-harm score.

Usage


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = SelfHarmEvaluator(azure_ai_project)
   result = eval_fn(query="What is the capital of France?", response="Paris.")

Output format


   {
       "self_harm": "High",
       "self_harm_score": 6.5,
       "self_harm_reason": "Some reason"
   }
SelfHarmMultimodalEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a self-harm multimodal evaluator to compute a self-harm score.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = SelfHarmMultimodalEvaluator(azure_ai_project)
   result = eval_fn(
       {
           "messages": [
               {
                   "role": "user",
                   "content": [
                       {
                           "type": "text",
                           "text": "What's in this image?"
                       },
                       {
                           "type": "image_url",
                           "image_url": {
                               "url": "<image url or base64 encoded image>"
                           }
                       }
                   ]
               },
               {
                   "role": "assistant",
                   "content": "This picture shows an astronaut standing in the desert."
               }
           ]
       }
   )

Output Format


   {
       "self_harm": "High",
       "self_harm_score": 6.5,
       "self_harm_reason": "Some reason"
   }
SexualEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a sexual evaluator to compute a sexual score.

Usage


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = SexualEvaluator(azure_ai_project)
   result = eval_fn(query="What is the capital of France?", response="Paris.")

Output format


   {
       "sexual": "High",
       "sexual_score": 6.5,
       "sexual_reason": "Some reason"
   }
SexualMultimodalEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a sexual multimodal evaluator to compute a sexual score.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = SexualMultimodalEvaluator(azure_ai_project)
   result = eval_fn(
       {
           "messages": [
               {
                   "role": "user",
                   "content": [
                       {
                           "type": "text",
                           "text": "What's in this image?"
                       },
                       {
                           "type": "image_url",
                           "image_url": {
                               "url": "<image url or base64 encoded image>"
                           }
                       }
                   ]
               },
               {
                   "role": "assistant",
                   "content": "This picture shows an astronaut standing in the desert."
               }
           ]
       }
   )

Output Format


   {
       "sexual": "High",
       "sexual_score": 6.5,
       "sexual_reason": "Some reason"
   }
SimilarityEvaluator

Initialize a similarity evaluator configured for a specific Azure OpenAI model.

Usage


   eval_fn = SimilarityEvaluator(model_config)
   result = eval_fn(
       query="What is the capital of Japan?",
       response="The capital of Japan is Tokyo.",
       ground_truth="Tokyo is Japan's capital.")

Output format


   {
       "similarity": 3.0,
       "gpt_similarity": 3.0,
   }

Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.

ViolenceEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a violence evaluator to compute a violence score.

Usage


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = ViolenceEvaluator(azure_ai_project)
   result = eval_fn(query="What is the capital of France?", response="Paris.")

Output format


   {
       "violence": "High",
       "violence_score": 6.5,
       "violence_reason": "Some reason"
   }
ViolenceMultimodalEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a violence multimodal evaluator to compute a violence score.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = ViolenceMultimodalEvaluator(azure_ai_project)
   result = eval_fn(
       {
           "messages": [
               {
                   "role": "user",
                   "content": [
                       {
                           "type": "text",
                           "text": "What's in this image?"
                       },
                       {
                           "type": "image_url",
                           "image_url": {
                               "url": "<image url or base64 encoded image>"
                           }
                       }
                   ]
               },
               {
                   "role": "assistant",
                   "content": "This picture shows an astronaut standing in the desert."
               }
           ]
       }
   )

Output Format


   {
       "violence": "High",
       "violence_score": 6.5,
       "violence_reason": "Some reason"
   }

Enums

RougeType

Enumeration of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) types.
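
The RougeScoreEvaluator example above selects the unigram variant with RougeType.ROUGE_1; a sketch using the longest-common-subsequence variant, assuming the standard ROUGE variants (e.g., ROUGE_2, ROUGE_L) are exposed as members:


    from azure.ai.evaluation import RougeScoreEvaluator, RougeType

    eval_fn = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L)
    result = eval_fn(
        response="Tokyo is the capital of Japan.",
        ground_truth="The capital of Japan is Tokyo.")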

Functions

evaluate

Evaluates a target or data with built-in or custom evaluators. If both target and data are provided, the data will be run through the target function and the results will then be evaluated.

The evaluate API can be used as follows:


   from azure.ai.evaluation import evaluate, RelevanceEvaluator, CoherenceEvaluator


   model_config = {
       "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
       "api_key": os.environ.get("AZURE_OPENAI_KEY"),
       "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
   }

   coherence_eval = CoherenceEvaluator(model_config=model_config)
   relevance_eval = RelevanceEvaluator(model_config=model_config)

   path = "evaluate_test_data.jsonl"
   result = evaluate(
       data=path,
       evaluators={
           "coherence": coherence_eval,
           "relevance": relevance_eval,
       },
       evaluator_config={
           "coherence": {
               "column_mapping": {
                   "response": "${data.response}",
                   "query": "${data.query}",
               },
           },
           "relevance": {
               "column_mapping": {
                   "response": "${data.response}",
                   "context": "${data.context}",
                   "query": "${data.query}",
               },
           },
       },
   )
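
When a target is also supplied, each row of data is first run through it, and the evaluators then score its output. A minimal sketch; target_fn and the column it returns are hypothetical:


    # Hypothetical application under test: returns a "response" column for evaluators to score.
    def target_fn(query: str) -> dict:
        return {"response": f"You asked: {query}"}

    result = evaluate(
        data=path,
        target=target_fn,
        evaluators={"coherence": coherence_eval},
    )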
evaluate(*, data: str | PathLike, evaluators: Dict[str, Callable], evaluation_name: str | None = None, target: Callable | None = None, evaluator_config: Dict[str, EvaluatorConfig] | None = None, azure_ai_project: AzureAIProject | None = None, output_path: str | PathLike | None = None, **kwargs) -> EvaluationResult

Keyword-Only Parameters

data (str | PathLike)

Path to the data to be evaluated, or passed to target if target is set. Only .jsonl files are supported. target and data cannot both be None. Required.

evaluators (Dict[str, Callable])

Evaluators to be used for evaluation. A dictionary with an alias for each evaluator as the key and the evaluator function as the value. Required.

evaluation_name (str)

Display name of the evaluation.

target (Callable)

Target to be evaluated. target and data cannot both be None.

evaluator_config (Dict[str, EvaluatorConfig])

Configuration for evaluators. A dictionary with evaluator names as keys and, as values, dictionaries containing the column mappings. The column mappings map the column names in the evaluator input to the column names in the input data or in the data generated by target.

output_path (str | PathLike)

The local folder or file path to save evaluation results to, if set. If a folder path is provided, the results are saved to a file named evaluation_results.json in that folder.

azure_ai_project (AzureAIProject)

Logs evaluation results to Azure AI Studio if set.

Returns

EvaluationResult: Evaluation results.