evaluation Package

Packages

simulator

Classes

AzureAIProject

Azure AI Project Information
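
The safety evaluators in this package take this project information as a plain dictionary; the shape used throughout this document is:


    azure_ai_project = {
        "subscription_id": "<subscription_id>",
        "resource_group_name": "<resource_group_name>",
        "project_name": "<project_name>",
    }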

AzureOpenAIModelConfiguration

Model Configuration for Azure OpenAI Model
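
Evaluators below that accept a model_config take this configuration. A minimal sketch, using the same environment-variable keys as the evaluate example at the end of this document:


    model_config = {
        "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
        "api_key": os.environ.get("AZURE_OPENAI_KEY"),
        "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    }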

BleuScoreEvaluator

Evaluator that computes the BLEU Score between two strings.

BLEU (Bilingual Evaluation Understudy) score is commonly used in natural language processing (NLP) and machine translation. It is widely used in text summarization and text generation use cases. It evaluates how closely the generated text matches the reference text. The BLEU score ranges from 0 to 1, with higher scores indicating better quality.

Usage


   eval_fn = BleuScoreEvaluator()
   result = eval_fn(
       response="Tokyo is the capital of Japan.",
       ground_truth="The capital of Japan is Tokyo.")

Output format


   {
       "bleu_score": 0.22
   }
CoherenceEvaluator

Initialize a coherence evaluator configured for a specific Azure OpenAI model.

Usage


   eval_fn = CoherenceEvaluator(model_config)
   result = eval_fn(
       query="What is the capital of Japan?",
       response="The capital of Japan is Tokyo.")

Output format


   {
       "coherence": 1.0,
       "gpt_coherence": 1.0,
   }

Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.

ContentSafetyEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a content safety evaluator configured to evaluate content safety metrics for a QA scenario.

Usage


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = ContentSafetyEvaluator(azure_ai_project)
   result = eval_fn(
       query="What is the capital of France?",
       response="Paris.",
   )

Output format


   {
       "violence": "Medium",
       "violence_score": 5.0,
       "violence_reason": "Some reason",
       "sexual": "Medium",
       "sexual_score": 5.0,
       "sexual_reason": "Some reason",
       "self_harm": "Medium",
       "self_harm_score": 5.0,
       "self_harm_reason": "Some reason",
       "hate_unfairness": "Medium",
       "hate_unfairness_score": 5.0,
       "hate_unfairness_reason": "Some reason"
   }
ContentSafetyMultimodalEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a content safety multimodal evaluator configured to evaluate content safety metrics in a multimodal scenario.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = ContentSafetyMultimodalEvaluator(azure_ai_project)
   result = eval_fn(
       {
           "messages": [
               {
                   "role": "user",
                   "content": [
                       {
                           "type": "text",
                           "text": "What's in this image?"
                       },
                       {
                           "type": "image_url",
                           "image_url": {
                               "url": "<image url or base64 encoded image>"
                           }
                       }
                   ]
               },
               {
                   "role": "assistant",
                   "content": "This picture shows an astronaut standing in the desert."
               }
           ]
       }
   )

Output Format


   {
       "violence": "Medium",
       "violence_score": 5.0,
       "violence_reason": "Some reason",
       "sexual": "Medium",
       "sexual_score": 5.0,
       "sexual_reason": "Some reason",
       "self_harm": "Medium",
       "self_harm_score": 5.0,
       "self_harm_reason": "Some reason",
       "hate_unfairness": "Medium",
       "hate_unfairness_score": 5.0,
       "hate_unfairness_reason": "Some reason"
   }
Conversation
EvaluationResult
EvaluatorConfig

Configuration for an evaluator
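
A minimal sketch of the shape expected by the evaluate function's evaluator_config parameter: each evaluator alias maps to a column_mapping dictionary that binds evaluator inputs to data columns (see the full evaluate example at the end of this document):


    evaluator_config = {
        "coherence": {
            "column_mapping": {
                "response": "${data.response}",
                "query": "${data.query}",
            },
        },
    }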

F1ScoreEvaluator

Initialize an F1 score evaluator for calculating the F1 score.

Usage


   eval_fn = F1ScoreEvaluator()
   result = eval_fn(
       response="The capital of Japan is Tokyo.",
       ground_truth="Tokyo is Japan's capital, known for its blend of traditional culture                 and technological advancements.")

Output format


   {
       "f1_score": 0.42
   }
FluencyEvaluator

Initialize a fluency evaluator configured for a specific Azure OpenAI model.

Usage


   eval_fn = FluencyEvaluator(model_config)
   result = eval_fn(response="The capital of Japan is Tokyo.")

Output format


   {
       "fluency": 4.0,
       "gpt_fluency": 4.0,
   }

Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.

GleuScoreEvaluator

Evaluator that computes the GLEU score between two strings.

The GLEU (Google-BLEU) score evaluator measures the similarity between generated and reference texts by evaluating n-gram overlap, considering both precision and recall. This balanced evaluation, designed for sentence-level assessment, makes it ideal for detailed analysis of translation quality. GLEU is well-suited for use cases such as machine translation, text summarization, and text generation.

Usage


   eval_fn = GleuScoreEvaluator()
   result = eval_fn(
       response="Tokyo is the capital of Japan.",
       ground_truth="The capital of Japan is Tokyo.")

Output format


   {
       "gleu_score": 0.41
   }
GroundednessEvaluator

Initialize a groundedness evaluator configured for a specific Azure OpenAI model.

Usage


   eval_fn = GroundednessEvaluator(model_config)
   result = eval_fn(
       response="The capital of Japan is Tokyo.",
       context="Tokyo is Japan's capital, known for its blend of traditional culture                 and technological advancements.")

Output format


   {
       "groundedness": 5,
       "gpt_groundedness": 5,
   }

Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.

GroundednessProEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a Groundedness Pro evaluator for determining whether the response is grounded in the query and context.

If this evaluator is supplied to the evaluate function, the aggregated metric for the groundedness pro label will be "groundedness_pro_passing_rate".

Usage


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   credential = DefaultAzureCredential()

   eval_fn = GroundednessProEvaluator(azure_ai_project, credential)
   result = eval_fn(query="What's the capital of France", response="Paris", context="Paris.")

Output format


   {
       "groundedness_pro_label": True,
       "reason": "'All Contents are grounded"
   }

Usage with conversation input


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   credential = DefaultAzureCredential()

   eval_fn = GroundednessProEvaluator(azure_ai_project, credential)
   conversation = {
       "messages": [
           {"role": "user", "content": "What is the capital of France?"},
           {"role": "assistant", "content": "Paris.", "context": "Paris."}
           {"role": "user", "content": "What is the capital of Germany?"},
           {"role": "assistant", "content": "Berlin.", "context": "Berlin."}
       ]
   }
   result = eval_fn(conversation=conversation)

Output format


   {
       "groundedness_pro_label": 1.0,
       "evaluation_per_turn": {
           "groundedness_pro_label": [True, True],
           "groundedness_pro_reason": ["All contents are grounded", "All contents are grounded"]
       }
   }
HateUnfairnessEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a hate-unfairness evaluator to compute a hate-unfairness score.

Usage


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = HateUnfairnessEvaluator(azure_ai_project)
   result = eval_fn(query="What is the capital of France?", response="Paris.")

Output format


   {
       "hate_unfairness": "High",
       "hate_unfairness_score": 6.5,
       "hate_unfairness_reason": "Some reason"
   }
HateUnfairnessMultimodalEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a hate-unfairness multimodal evaluator to compute a hate-unfairness score.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = HateUnfairnessMultimodalEvaluator(azure_ai_project)
   result = eval_fn(
       {
           "messages": [
               {
                   "role": "user",
                   "content": [
                       {
                           "type": "text",
                           "text": "What's in this image?"
                       },
                       {
                           "type": "image_url",
                           "image_url": {
                               "url": "<image url or base64 encoded image>"
                           }
                       }
                   ]
               },
               {
                   "role": "assistant",
                   "content": "This picture shows an astronaut standing in the desert."
               }
           ]
       }
   )

Output Format


   {
       "hate_unfairness": "High",
       "hate_unfairness_score": 6.5,
       "hate_unfairness_reason": "Some reason"
   }
IndirectAttackEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

A Cross-Domain Prompt Injection Attack (XPIA) jailbreak evaluator.

Detects whether cross-domain injection attacks are present in your AI system's response. Metrics include the overall evaluation label and reason for the Q/A pair, as well as sub-labels for manipulated content, intrusion, and information.
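
A usage sketch, mirroring the construction pattern of the other safety evaluators in this package; the exact constructor signature may differ by SDK version:


    azure_ai_project = {
        "subscription_id": "<subscription_id>",
        "resource_group_name": "<resource_group_name>",
        "project_name": "<project_name>",
    }
    eval_fn = IndirectAttackEvaluator(azure_ai_project)
    result = eval_fn(query="What is the capital of France?", response="Paris.")

The output is expected to carry an overall label and reason plus the sub-labels described above; the xpia_ key names below are illustrative:


    {
        "xpia_label": False,
        "xpia_reason": "Some reason",
        "xpia_manipulated_content": False,
        "xpia_intrusion": False,
        "xpia_information": False
    }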

Message
MeteorScoreEvaluator

Evaluator that computes the METEOR Score between two strings.

The METEOR (Metric for Evaluation of Translation with Explicit Ordering) score evaluates generated text by comparing it to reference texts, focusing on precision, recall, and content alignment. It addresses limitations of other metrics, such as BLEU, by considering synonyms, stemming, and paraphrasing, capturing meaning and language variation more accurately. Beyond machine translation and text summarization, paraphrase detection is an optimal use case for the METEOR score.

Usage


   eval_fn = MeteorScoreEvaluator(
       alpha=0.9,
       beta=3.0,
       gamma=0.5
   )
   result = eval_fn(
       response="Tokyo is the capital of Japan.",
       ground_truth="The capital of Japan is Tokyo.")

Output format


   {
       "meteor_score": 0.62
   }
OpenAIModelConfiguration

Model Configuration for OpenAI Model
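
A minimal sketch, by analogy with the Azure OpenAI configuration above; the model and base_url keys are assumptions, so check the class reference for the exact fields:


    model_config = {
        "api_key": os.environ.get("OPENAI_API_KEY"),
        "model": "gpt-4o",                         # assumed key naming the OpenAI model
        "base_url": "https://api.openai.com/v1",  # assumed optional key
    }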

ProtectedMaterialEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a protected material evaluator to detect whether protected material is present in the AI system's response. The evaluator outputs a Boolean label (True or False) indicating the presence of protected material, along with AI-generated reasoning.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = ProtectedMaterialEvaluator(azure_ai_project)
   result = eval_fn(query="What is the capital of France?", response="Paris.")

Output Format


   {
       "protected_material_label": false,
       "protected_material_reason": "This query does not contain any protected material."
   }
ProtectedMaterialMultimodalEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a protected materials evaluator to detect whether protected material is present in multimodal messages. The evaluator outputs a Boolean label (True or False) indicating the presence of protected material, along with AI-generated reasoning.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = ProtectedMaterialMultimodalEvaluator(azure_ai_project)
   result = eval_fn(
       {
           "messages": [
               {
                   "role": "user",
                   "content": [
                       {
                           "type": "text",
                           "text": "What's in this image?"
                       },
                       {
                           "type": "image_url",
                           "image_url": {
                               "url": "<image url or base64 encoded image>"
                           }
                       }
                   ]
               },
               {
                   "role": "assistant",
                   "content": "This picture shows an astronaut standing in the desert."
               }
           ]
       }
   )

Output Format


   {
       "protected_material_label": "False",
       "protected_material_reason": "This query does not contain any protected material."
   }
QAEvaluator

Initialize a question-answer evaluator configured for a specific Azure OpenAI model.

Usage


   eval_fn = QAEvaluator(model_config)
    result = eval_fn(
       query="Tokyo is the capital of which country?",
       response="Japan",
       context="Tokyo is the capital of Japan.",
       ground_truth="Japan"
   )

Output format


   {
       "groundedness": 3.5,
       "relevance": 4.0,
       "coherence": 1.5,
       "fluency": 4.0,
       "similarity": 3.0,
       "gpt_groundedness": 3.5,
       "gpt_relevance": 4.0,
       "gpt_coherence": 1.5,
       "gpt_fluency": 4.0,
       "gpt_similarity": 3.0,
       "f1_score": 0.42
   }
RelevanceEvaluator

Initialize a relevance evaluator configured for a specific Azure OpenAI model.

Usage


   eval_fn = RelevanceEvaluator(model_config)
   result = eval_fn(
       query="What is the capital of Japan?",
       response="The capital of Japan is Tokyo.")

Output format


   {
       "relevance": 3.0,
       "gpt_relevance": 3.0,
       "relevance_reason": "The response is relevant to the query because it provides the correct answer.",
   }

Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.

RetrievalEvaluator

Initialize a retrieval evaluator configured for a specific Azure OpenAI model.

Usage


   chat_eval = RetrievalEvaluator(model_config)
   conversation = {
       "messages": [
           {"role": "user", "content": "What is the value of 2 + 2?"},
           {
               "role": "assistant", "content": "2 + 2 = 4",
               "context": "From 'math_doc.md': Information about additions: 1 + 2 = 3, 2 + 2 = 4"
           }
       ]
   }
   result = chat_eval(conversation=conversation)

Output format


   {
       "gpt_retrieval": 3.0,
       "retrieval": 3.0,
       "evaluation_per_turn": {
           "gpt_retrieval": [1.0, 2.0, 3.0],
           "retrieval": [1.0, 2.0, 3.0],
           "retrieval_reason": ["<reasoning for score 1>", "<reasoning for score 2>", "<reasoning for score 3>"]
       }
   }

Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.

RougeScoreEvaluator

Evaluator that computes ROUGE scores between two strings.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. Text summarization and document comparison are among optimal use cases for ROUGE, particularly in scenarios where text coherence and relevance are critical.

Usage


   eval_fn = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)
   result = eval_fn(
       response="Tokyo is the capital of Japan.",
       ground_truth="The capital of Japan is Tokyo.")

Output format


   {
       "rouge_precision": 1.0,
       "rouge_recall": 1.0,
       "rouge_f1_score": 1.0
   }
SelfHarmEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a self-harm evaluator to compute a self-harm score.

Usage


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = SelfHarmEvaluator(azure_ai_project)
   result = eval_fn(query="What is the capital of France?", response="Paris.")

Output format


   {
       "self_harm": "High",
       "self_harm_score": 6.5,
       "self_harm_reason": "Some reason"
   }
SelfHarmMultimodalEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a self-harm multimodal evaluator to compute a self-harm score.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = SelfHarmMultimodalEvaluator(azure_ai_project)
   result = eval_fn(
       {
           "messages": [
               {
                   "role": "user",
                   "content": [
                       {
                           "type": "text",
                           "text": "What's in this image?"
                       },
                       {
                           "type": "image_url",
                           "image_url": {
                               "url": "<image url or base64 encoded image>"
                           }
                       }
                   ]
               },
               {
                   "role": "assistant",
                   "content": "This picture shows an astronaut standing in the desert."
               }
           ]
       }
   )

Output Format


   {
       "self_harm": "High",
       "self_harm_score": 6.5,
       "self_harm_reason": "Some reason"
   }
SexualEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a sexual evaluator to compute a sexual score.

Usage


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = SexualEvaluator(azure_ai_project)
   result = eval_fn(query="What is the capital of France?", response="Paris.")

Output format


   {
       "sexual": "High",
       "sexual_score": 6.5,
       "sexual_reason": "Some reason"
   }
SexualMultimodalEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a sexual multimodal evaluator to compute a sexual score.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = SexualMultimodalEvaluator(azure_ai_project)
   result = eval_fn(
       {
           "messages": [
               {
                   "role": "user",
                   "content": [
                       {
                           "type": "text",
                           "text": "What's in this image?"
                       },
                       {
                           "type": "image_url",
                           "image_url": {
                               "url": "<image url or base64 encoded image>"
                           }
                       }
                   ]
               },
               {
                   "role": "assistant",
                   "content": "This picture shows an astronaut standing in the desert."
               }
           ]
       }
   )

Output Format


   {
       "sexual": "High",
       "sexual_score": 6.5,
       "sexual_reason": "Some reason"
   }
SimilarityEvaluator

Initialize a similarity evaluator configured for a specific Azure OpenAI model.

Usage


   eval_fn = SimilarityEvaluator(model_config)
   result = eval_fn(
       query="What is the capital of Japan?",
       response="The capital of Japan is Tokyo.",
       ground_truth="Tokyo is Japan's capital.")

Output format


   {
       "similarity": 3.0,
       "gpt_similarity": 3.0,
   }

Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix will still be present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future.

ViolenceEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a violence evaluator to compute a violence score.

Usage


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = ViolenceEvaluator(azure_ai_project)
   result = eval_fn(query="What is the capital of France?", response="Paris.")

Output format


   {
       "violence": "High",
       "violence_score": 6.5,
       "violence_reason": "Some reason"
   }
ViolenceMultimodalEvaluator

Note

This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.

Initialize a violence multimodal evaluator to compute a violence score.

Usage Example


   azure_ai_project = {
       "subscription_id": "<subscription_id>",
       "resource_group_name": "<resource_group_name>",
       "project_name": "<project_name>",
   }
   eval_fn = ViolenceMultimodalEvaluator(azure_ai_project)
   result = eval_fn(
       {
           "messages": [
               {
                   "role": "user",
                   "content": [
                       {
                           "type": "text",
                           "text": "What's in this image?"
                       },
                       {
                           "type": "image_url",
                           "image_url": {
                               "url": "<image url or base64 encoded image>"
                           }
                       }
                   ]
               },
               {
                   "role": "assistant",
                   "content": "This picture shows an astronaut standing in the desert."
               }
           ]
       }
   )

Output Format


   {
       "violence": "High",
       "violence_score": 6.5,
       "violence_reason": "Some reason"
   }

Enums

RougeType

Enumeration of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) types.
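
The RougeScoreEvaluator example above selects the unigram variant with RougeType.ROUGE_1; a sketch using the longest-common-subsequence variant, assuming the standard ROUGE variants (e.g., ROUGE_2, ROUGE_L) are exposed as members:


    from azure.ai.evaluation import RougeScoreEvaluator, RougeType

    eval_fn = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L)
    result = eval_fn(
        response="Tokyo is the capital of Japan.",
        ground_truth="The capital of Japan is Tokyo.")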

Functions

evaluate

Evaluates a target or data with built-in or custom evaluators. If both target and data are provided, the data will be run through the target function and the results will then be evaluated.

The evaluate API can be used as follows:


   from azure.ai.evaluation import evaluate, RelevanceEvaluator, CoherenceEvaluator


   model_config = {
       "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
       "api_key": os.environ.get("AZURE_OPENAI_KEY"),
       "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
   }

   coherence_eval = CoherenceEvaluator(model_config=model_config)
   relevance_eval = RelevanceEvaluator(model_config=model_config)

   path = "evaluate_test_data.jsonl"
   result = evaluate(
       data=path,
       evaluators={
           "coherence": coherence_eval,
           "relevance": relevance_eval,
       },
       evaluator_config={
           "coherence": {
               "column_mapping": {
                   "response": "${data.response}",
                   "query": "${data.query}",
               },
           },
           "relevance": {
               "column_mapping": {
                   "response": "${data.response}",
                   "context": "${data.context}",
                   "query": "${data.query}",
               },
           },
       },
   )
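
When a target is also supplied, each row of data is first run through it, and the evaluators then score its output. A minimal sketch; target_fn and the column it returns are hypothetical:


    # Hypothetical application under test: returns a "response" column for evaluators to score.
    def target_fn(query: str) -> dict:
        return {"response": f"You asked: {query}"}

    result = evaluate(
        data=path,
        target=target_fn,
        evaluators={"coherence": coherence_eval},
    )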
evaluate(*, data: str | PathLike, evaluators: Dict[str, Callable], evaluation_name: str | None = None, target: Callable | None = None, evaluator_config: Dict[str, EvaluatorConfig] | None = None, azure_ai_project: AzureAIProject | None = None, output_path: str | PathLike | None = None, **kwargs) -> EvaluationResult

Keyword-Only Parameters

data (str | PathLike)

Path to the data to be evaluated, or passed to target if target is set. Only .jsonl files are supported. target and data cannot both be None. Required.

evaluators (Dict[str, Callable])

Evaluators to be used for evaluation. A dictionary with an alias for each evaluator as the key and the evaluator function as the value. Required.

evaluation_name (str)

Display name of the evaluation.

target (Callable)

Target to be evaluated. target and data cannot both be None.

evaluator_config (Dict[str, EvaluatorConfig])

Configuration for evaluators. A dictionary with evaluator names as keys and, as values, dictionaries containing the column mappings. The column mappings map the column names in the evaluator input to the column names in the input data or in the data generated by target.

output_path (str | PathLike)

The local folder or file path to save evaluation results to, if set. If a folder path is provided, the results are saved to a file named evaluation_results.json in that folder.

azure_ai_project (AzureAIProject)

Logs evaluation results to Azure AI Studio if set.

Returns

EvaluationResult: Evaluation results.