evaluation Package
Packages
simulator |
Classes
AzureAIProject |
Azure AI Project Information |
AzureOpenAIModelConfiguration |
Model Configuration for Azure OpenAI Model |
BleuScoreEvaluator |
Evaluator that computes the BLEU Score between two strings. BLEU (Bilingual Evaluation Understudy) score is commonly used in natural language processing (NLP) and machine translation. It is widely used in text summarization and text generation use cases. It evaluates how closely the generated text matches the reference text. The BLEU score ranges from 0 to 1, with higher scores indicating better quality. Usage
Output format
|
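The Usage and Output format samples are collapsed in this listing. As a rough illustration of the metric itself (a simplified sketch, not necessarily the SDK's exact implementation, which computes standard 4-gram BLEU), clipped n-gram precision combined with a brevity penalty looks like this:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. Real BLEU uses up to 4-grams and smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each n-gram's count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

Against the SDK, the call shape is expected to be `BleuScoreEvaluator()(response=..., ground_truth=...)`, returning a dictionary containing a `bleu_score` key.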
CoherenceEvaluator |
Initialize a coherence evaluator configured for a specific Azure OpenAI model. Usage
Output format
Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix is still present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future. |
ContentSafetyEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a content safety evaluator configured to evaluate content safety metrics for a QA scenario. Usage
Output format
|
ContentSafetyMultimodalEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a content safety multimodal evaluator configured to evaluate content safety metrics in a multimodal scenario. Usage Example
Output Format
|
Conversation | |
EvaluationResult | |
EvaluatorConfig |
Configuration for an evaluator |
F1ScoreEvaluator |
Initialize an F1 score evaluator for calculating the F1 score. Usage
Output format
|
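The collapsed sample is not reproduced here; a minimal sketch of the token-overlap F1 computation (illustrative only, not necessarily the SDK's exact tokenization):

```python
from collections import Counter

def f1_score(response, ground_truth):
    """Token-overlap F1: precision and recall over shared tokens,
    combined as their harmonic mean."""
    resp, truth = response.lower().split(), ground_truth.lower().split()
    # Multiset intersection counts tokens shared between the two texts.
    common = sum((Counter(resp) & Counter(truth)).values())
    if common == 0:
        return 0.0
    precision = common / len(resp)
    recall = common / len(truth)
    return 2 * precision * recall / (precision + recall)
```

The SDK evaluator is expected to follow the same shape of computation, returning a dictionary with an `f1_score` key.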
FluencyEvaluator |
Initialize a fluency evaluator configured for a specific Azure OpenAI model. Usage
Output format
Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix is still present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future. |
GleuScoreEvaluator |
Evaluator that computes the GLEU score between two strings. The GLEU (Google-BLEU) score evaluator measures the similarity between generated and reference texts by evaluating n-gram overlap, considering both precision and recall. This balanced evaluation, designed for sentence-level assessment, makes it ideal for detailed analysis of translation quality. GLEU is well-suited for use cases such as machine translation, text summarization, and text generation. Usage
Output format
|
GroundednessEvaluator |
Initialize a groundedness evaluator configured for a specific Azure OpenAI model. Usage
Output format
Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix is still present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future. |
GroundednessProEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a Groundedness Pro evaluator to determine whether the response is grounded in the query and context. If this evaluator is supplied to the evaluate function, the aggregated metric for the Groundedness Pro label will be "groundedness_pro_passing_rate". Usage
Output format
Usage with conversation input
Output format
|
HateUnfairnessEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a hate-unfairness evaluator to compute a hate-unfairness score. Usage
Output format
|
HateUnfairnessMultimodalEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a hate-unfairness multimodal evaluator to compute a hate-unfairness score. Usage Example
Output Format
|
IndirectAttackEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. A Cross-Domain Prompt Injection Attack (XPIA) jailbreak evaluator. Detects whether cross-domain injection attacks are present in your AI system's response. Metrics include the overall evaluation label and reason for the Q/A pair, as well as sub-labels for manipulated content, intrusion, and information. |
Message | |
MeteorScoreEvaluator |
Evaluator that computes the METEOR Score between two strings. The METEOR (Metric for Evaluation of Translation with Explicit Ordering) score grader evaluates generated text by comparing it to reference texts, focusing on precision, recall, and content alignment. It addresses limitations of other metrics like BLEU by considering synonyms, stemming, and paraphrasing. METEOR score considers synonyms and word stems to more accurately capture meaning and language variations. In addition to machine translation and text summarization, paraphrase detection is an optimal use case for the METEOR score. Usage
Output format
|
OpenAIModelConfiguration |
Model Configuration for OpenAI Model |
ProtectedMaterialEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a protected material evaluator to detect whether protected material is present in the AI system's response. The evaluator outputs a Boolean label (True or False) indicating the presence of protected material, along with AI-generated reasoning. Usage Example
Output Format
|
ProtectedMaterialMultimodalEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a protected materials evaluator to detect whether protected material is present in multimodal messages. The evaluator outputs a Boolean label (True or False) indicating the presence of protected material, along with AI-generated reasoning. Usage Example
Output Format
|
QAEvaluator |
Initialize a question-answer evaluator configured for a specific Azure OpenAI model. Usage
Output format
|
RelevanceEvaluator |
Initialize a relevance evaluator configured for a specific Azure OpenAI model. Usage
Output format
Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix is still present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future. |
RetrievalEvaluator |
Initialize a retrieval evaluator configured for a specific Azure OpenAI model. Usage
Output format
Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix is still present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future. |
RougeScoreEvaluator |
Evaluator that computes ROUGE scores between two strings. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. Text summarization and document comparison are among the optimal use cases for ROUGE, particularly in scenarios where text coherence and relevance are critical. Usage
Output format
|
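A simplified sketch of ROUGE-L, the longest-common-subsequence variant (illustrative only; the SDK's RougeScoreEvaluator also supports n-gram variants selected via the RougeType enum listed under Enums):

```python
def rouge_l(candidate, reference):
    """ROUGE-L F-measure via longest common subsequence (LCS) over words."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    # Classic dynamic-programming LCS table.
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, cw in enumerate(cand):
        for j, rw in enumerate(ref):
            if cw == rw:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because LCS rewards in-order coverage of the reference, a short but accurate candidate scores high on precision and low on recall, which the F-measure balances.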
SelfHarmEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a self-harm evaluator to compute a self-harm score. Usage
Output format
|
SelfHarmMultimodalEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a self-harm multimodal evaluator to compute a self-harm score. Usage Example
Output Format
|
SexualEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a sexual evaluator to compute a sexual-content score. Usage
Output format
|
SexualMultimodalEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a sexual multimodal evaluator to compute a sexual-content score. Usage Example
Output Format
|
SimilarityEvaluator |
Initialize a similarity evaluator configured for a specific Azure OpenAI model. Usage
Output format
Note: To align with our support of a diverse set of models, a key without the gpt_ prefix has been added. To maintain backwards compatibility, the old key with the gpt_ prefix is still present in the output; however, it is recommended to use the new key moving forward, as the old key will be deprecated in the future. |
ViolenceEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a violence evaluator to compute a violence score. Usage
Output format
|
ViolenceMultimodalEvaluator |
Note This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information. Initialize a violence multimodal evaluator to compute a violence score. Usage Example
Output Format
|
Enums
RougeType |
Enumeration of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) types. |
Functions
evaluate
Evaluates a target or data with built-in or custom evaluators. If both target and data are provided, the data will be run through the target function and the results will then be evaluated.
The evaluate API can be used as follows:
import os

from azure.ai.evaluation import evaluate, RelevanceEvaluator, CoherenceEvaluator
model_config = {
"azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
"api_key": os.environ.get("AZURE_OPENAI_KEY"),
"azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}
coherence_eval = CoherenceEvaluator(model_config=model_config)
relevance_eval = RelevanceEvaluator(model_config=model_config)
path = "evaluate_test_data.jsonl"
result = evaluate(
data=path,
evaluators={
"coherence": coherence_eval,
"relevance": relevance_eval,
},
evaluator_config={
"coherence": {
"column_mapping": {
"response": "${data.response}",
"query": "${data.query}",
},
},
"relevance": {
"column_mapping": {
"response": "${data.response}",
"context": "${data.context}",
"query": "${data.query}",
},
},
},
)
evaluate(*, data: str | PathLike, evaluators: Dict[str, Callable], evaluation_name: str | None = None, target: Callable | None = None, evaluator_config: Dict[str, EvaluatorConfig] | None = None, azure_ai_project: AzureAIProject | None = None, output_path: str | PathLike | None = None, **kwargs) -> EvaluationResult
Keyword-Only Parameters
Name | Description |
---|---|
data | Path to the data to be evaluated or passed to target if target is set. Only .jsonl format files are supported. target and data cannot both be None. Required. |
evaluators | Evaluators to be used for evaluation. It should be a dictionary with keys as aliases for the evaluators and values as the evaluator callables. Required. |
evaluation_name | Display name of the evaluation. |
target | Target to be evaluated. target and data cannot both be None. |
evaluator_config | Configuration for evaluators. The configuration should be a dictionary with evaluator names as keys and values that are dictionaries containing the column mappings. The column mappings should be a dictionary with keys as the column names expected by the evaluator and values as the column names in the input data or in the data generated by target. |
output_path | The local folder or file path to save evaluation results to, if set. If a folder path is provided, the results will be saved to a file named evaluation_results.json in that folder. |
azure_ai_project | Logs evaluation results to AI Studio if set. |
Returns
Type | Description |
---|---|
EvaluationResult | Evaluation results. |
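When a target callable is supplied together with data, column mappings can reference the target's outputs as well as the input columns. A hedged sketch (the ${target.<key>} prefix is assumed here by analogy with the ${data.<column>} prefix used in the example above):

```python
# Hypothetical column mapping: "query" comes from the input .jsonl,
# while "response" is a key returned by the target callable.
evaluator_config = {
    "relevance": {
        "column_mapping": {
            "query": "${data.query}",          # column in the input data
            "response": "${target.response}",  # key produced by the target
        },
    },
}
```

Each evaluator alias gets its own mapping, so different evaluators can draw different columns from the same run.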
Azure SDK for Python