@@ -1,6 +1,6 @@
---
-title: Evaluate your Generative AI application with the Azure AI Evaluation SDK
-titleSuffix: Azure AI project
+title: Local Evaluation with Azure AI Evaluation SDK
+titleSuffix: Azure AI Foundry
description: This article provides instructions on how to evaluate a Generative AI application with the Azure AI Evaluation SDK.
manager: scottpolly
ms.service: azure-ai-foundry
@@ -9,17 +9,17 @@ ms.custom:
- references_regions
- ignite-2024
ms.topic: how-to
-ms.date: 12/18/2024
+ms.date: 02/21/2025
ms.reviewer: minthigpen
ms.author: lagayhar
author: lgayhardt
---
-# Evaluate your Generative AI application with the Azure AI Evaluation SDK
+# Evaluate your Generative AI application locally with the Azure AI Evaluation SDK
[!INCLUDE [feature-preview](../../includes/feature-preview.md)]
> [!NOTE]
-> Evaluation with the prompt flow SDK has been retired and replaced with Azure AI Evaluation SDK.
+> Evaluation with the prompt flow SDK has been retired and replaced with the Azure AI Evaluation SDK client library for Python. For more information about input data requirements, see the [API Reference Documentation](https://aka.ms/azureaieval-python-ref).
To thoroughly assess the performance of your generative AI application when applied to a substantial dataset, you can evaluate it in your development environment with the Azure AI evaluation SDK. Given either a test dataset or a target, your generative AI application's generations are quantitatively measured with both mathematical-based metrics and AI-assisted quality and safety evaluators. Built-in or custom evaluators can provide you with comprehensive insights into the application's capabilities and limitations.
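+For instance, a mathematical-based metric such as F1 score can be computed locally without any model deployment. The following is a minimal sketch; the strings are illustrative only:
+```python
+from azure.ai.evaluation import F1ScoreEvaluator
+
+# Mathematical-based metric: measures token overlap between the response and the ground truth.
+f1_evaluator = F1ScoreEvaluator()
+result = f1_evaluator(
+    response="The capital of Japan is Tokyo.",
+    ground_truth="Tokyo is the capital of Japan.",
+)
+print(result)  # for example, {'f1_score': ...}
+```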
@@ -51,9 +51,6 @@ For more in-depth information on each evaluator definition and how it's calculat
Built-in quality and safety metrics take in query and response pairs, along with additional information for specific evaluators.
-> [!TIP]
-> For more information about inputs and outputs, see the [Azure Python reference documentation](https://aka.ms/azureaieval-python-ref).
-
### Data requirements for built-in evaluators
Built-in evaluators can accept *either* query and response pairs or a list of conversations:
@@ -64,7 +61,7 @@ Built-in evaluators can accept *either* query and response pairs or a list of co
| Evaluator | `query` | `response` | `context` | `ground_truth` | `conversation` |
|----------------|---------------|---------------|---------------|---------------|-----------|
|`GroundednessEvaluator` | Optional: String | Required: String | Required: String | N/A | Supported for text |
-| `GroundednessProEvaluator` | Required: String | Required: String | Required: String | N/A | Supported for text |
+| `GroundednessProEvaluator` | Required: String | Required: String | Required: String | N/A | Supported for text |
| `RetrievalEvaluator` | Required: String | N/A | Required: String | N/A | Supported for text |
| `RelevanceEvaluator` | Required: String | Required: String | N/A | N/A | Supported for text |
| `CoherenceEvaluator` | Required: String | Required: String | N/A | N/A |Supported for text |
@@ -82,7 +79,7 @@ Built-in evaluators can accept *either* query and response pairs or a list of co
| `IndirectAttackEvaluator` | Required: String | Required: String | Required: String | N/A |Supported for text |
| `ProtectedMaterialEvaluator` | Required: String | Required: String | N/A | N/A |Supported for text and image |
| `QAEvaluator` | Required: String | Required: String | Required: String | Required: String | Not supported |
-| `ContentSafetyEvaluator` | Required: String | Required: String | N/A | N/A | Supported for text and image |
+| `ContentSafetyEvaluator` | Required: String | Required: String | N/A | N/A | Supported for text and image |
- Query: the query sent to the generative AI application
- Response: the response to the query generated by the generative AI application
@@ -91,7 +88,7 @@ Built-in evaluators can accept *either* query and response pairs or a list of co
- Conversation: a list of messages of user and assistant turns. See more in the next section.
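+For illustration, a single-turn record that covers all four fields might look like the following sketch; the values are made up, and you pass only the fields that your chosen evaluator requires:
+```python
+# Illustrative single-turn record in query-response format.
+query_response = dict(
+    query="Which tent is the most waterproof?",
+    context="The Alpine Explorer Tent is the most waterproof of all tents available.",
+    response="The Alpine Explorer Tent is the most waterproof.",
+    ground_truth="The Alpine Explorer Tent is the most waterproof tent.",
+)
+```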
> [!NOTE]
-> AI-assisted quality evaluators except for `SimilarityEvaluator` come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score. Therefore they will consume more token usage in generation as a result of improved evaluation quality. Specifically, `max_token` for evaluator generation has been set to 800 for all AI-assisted evaluators (and 1600 for `RetrievalEvaluator` to accommodate for longer inputs.)
+> AI-assisted quality evaluators, except for `SimilarityEvaluator`, come with a reason field. They employ techniques including chain-of-thought reasoning to generate an explanation for the score, so they consume more tokens in generation in exchange for improved evaluation quality. Specifically, `max_token` for evaluator generation is set to 800 for all AI-assisted evaluators (and 1600 for `RetrievalEvaluator` to accommodate longer inputs).
#### Conversation support for text
@@ -126,7 +123,7 @@ For evaluators that support conversations for text, you can provide `conversatio
Our evaluators understand that the first turn of the conversation provides valid `query` from `user`, `context` from `assistant`, and `response` from `assistant` in the query-response format. Conversations are then evaluated per turn and results are aggregated over all turns for a conversation score.
> [!NOTE]
-> Note that in the second turn, even if `context` is `null` or a missing key, it will be interpreted as an empty string instead of erroring out, which might lead to misleading results. We strongly recommend that you validate your evaluation data to comply with the data requirements.
+> In the second turn, even if `context` is `null` or a missing key, it will be interpreted as an empty string instead of erroring out, which might lead to misleading results. We strongly recommend that you validate your evaluation data to comply with the data requirements.
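+To make that caveat concrete, the following sketch shows a two-turn conversation whose second assistant message omits `context`; the missing key is evaluated as an empty string rather than raising an error (the message contents are illustrative):
+```python
+# Sketch of a two-turn conversation. The last assistant message has no "context" key,
+# so that turn is evaluated as if its context were an empty string.
+conversation = {
+    "messages": [
+        {"role": "user", "content": "Which tent is the most waterproof?"},
+        {
+            "role": "assistant",
+            "content": "The Alpine Explorer Tent is the most waterproof.",
+            "context": "The Alpine Explorer Tent is the most waterproof of all tents available.",
+        },
+        {"role": "user", "content": "How much does it cost?"},
+        {"role": "assistant", "content": "The Alpine Explorer Tent costs $120."},
+    ]
+}
+```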
#### Conversation support for images and multi-modal text and image
@@ -201,9 +198,9 @@ safety_score = safety_evaluator(conversation=conversation_image_url)
Currently the image and multi-modal evaluators support:
-- Single turn only (a conversation can have only 1 user message and 1 assistant message)
-- Conversation can have only 1 system message
-- Conversation payload should be less than 10MB size (including images)
+- Single turn only (a conversation can have only one user message and one assistant message)
+- Conversation can have only one system message
+- Conversation payload should be less than 10 MB in size (including images)
- Absolute URLs and Base64 encoded images
- Multiple images in a single turn
- JPG/JPEG, PNG, GIF file formats
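+Because Base64-encoded images are accepted alongside absolute URLs, one option is to inline a local image as a data URL. A minimal sketch, assuming a hypothetical local file:
+```python
+import base64
+
+# Hypothetical local image; encode it as a data URL so it can be used wherever
+# the conversation payload expects an image URL.
+with open("camping_gear.jpg", "rb") as image_file:
+    encoded_image = base64.b64encode(image_file.read()).decode("utf-8")
+
+image_data_url = f"data:image/jpeg;base64,{encoded_image}"
+```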
@@ -214,12 +211,14 @@ You can use our built-in AI-assisted and NLP quality evaluators to assess the pe
#### Set up
-1. For AI-assisted quality evaluators except for `GroundednessProEvaluator`, you must specify a GPT model to act as a judge to score the evaluation data. Choose a deployment with either GPT-3.5, GPT-4, GPT-4o or GPT-4-mini model for your calculations and set it as your `model_config`. We support both Azure OpenAI or OpenAI model configuration schema. We recommend using GPT models that don't have the `(preview)` suffix for the best performance and parseable responses with our evaluators.
+1. For AI-assisted quality evaluators except for `GroundednessProEvaluator` (preview), you must specify a GPT model (`gpt-35-turbo`, `gpt-4`, `gpt-4-turbo`, `gpt-4o`, or `gpt-4o-mini`) in your `model_config` to act as a judge to score the evaluation data. We support both the Azure OpenAI and OpenAI model configuration schemas. We recommend using GPT models that aren't in preview for the best performance and parseable responses with our evaluators.
> [!NOTE]
-> Make sure the you have at least `Cognitive Services OpenAI User` role for the Azure OpenAI resource to make inference calls with API key. For more permissions, learn more about [permissioning for Azure OpenAI resource](../../../ai-services/openai/how-to/role-based-access-control.md#summary).
+> We strongly recommend replacing `gpt-3.5-turbo` with `gpt-4o-mini` as your evaluator model, because the latter is cheaper, more capable, and just as fast according to [OpenAI](https://platform.openai.com/docs/models/gpt-4#gpt-3-5-turbo).
+>
+> Make sure you have at least the `Cognitive Services OpenAI User` role for the Azure OpenAI resource to make inference calls with the API key. To learn more about permissions, see [permissions for Azure OpenAI resource](../../../ai-services/openai/how-to/role-based-access-control.md#summary).
-2. For `GroundednessProEvaluator`, instead of a GPT deployment in `model_config`, you must provide your `azure_ai_project` information. This accesses the backend evaluation service of your Azure AI project.
+2. For `GroundednessProEvaluator` (preview), instead of a GPT deployment in `model_config`, you must provide your `azure_ai_project` information. This accesses the backend evaluation service of your Azure AI project.
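+Put together, the two configurations might look like the following sketch. The environment variable names are assumptions; substitute wherever you store your own values:
+```python
+import os
+
+# Judge model for AI-assisted quality evaluators (Azure OpenAI configuration schema).
+model_config = {
+    "azure_endpoint": os.environ.get("AZURE_ENDPOINT"),
+    "api_key": os.environ.get("AZURE_API_KEY"),
+    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT_NAME"),
+    "api_version": os.environ.get("AZURE_API_VERSION"),
+}
+
+# Azure AI project details used by GroundednessProEvaluator (preview) instead of a GPT deployment.
+azure_ai_project = {
+    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
+    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
+    "project_name": os.environ.get("AZURE_PROJECT_NAME"),
+}
+```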
#### Performance and quality evaluator usage
@@ -253,7 +252,7 @@ groundedness_pro_eval = GroundednessProEvaluator(azure_ai_project=azure_ai_proje
query_response = dict(
query="Which tent is the most waterproof?",
- context="The Alpine Explorer Tent is the most water-proof of all tents available.",
+ context="The Alpine Explorer Tent is the second most water-proof of all tents available.",
response="The Alpine Explorer Tent is the most waterproof."
)
@@ -296,13 +295,15 @@ The result of the AI-assisted quality evaluators for a query and response pair i
- `{metric_name}_label` provides a binary label.
- `{metric_name}_reason` explains why a certain score or label was given for each data point.
+#### Comparing quality and custom evaluators
+
For NLP evaluators, only a score is given in the `{metric_name}` key.
-Like 6 other AI-assisted evaluators, `GroundednessEvaluator` is a prompt-based evaluator that outputs a score on a 5-point scale (the higher the score, the more grounded the result is). On the other hand, `GroundednessProEvaluator` invokes our backend evaluation service powered by Azure AI Content Safety and outputs `True` if all content is grounded, or `False` if any ungrounded content is detected.
+Like six other AI-assisted evaluators, `GroundednessEvaluator` is a prompt-based evaluator that outputs a score on a 5-point scale (the higher the score, the more grounded the result is). On the other hand, `GroundednessProEvaluator` (preview) invokes our backend evaluation service powered by Azure AI Content Safety and outputs `True` if all content is grounded, or `False` if any ungrounded content is detected.
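+For instance, evaluating the same query-response pair with both evaluators could produce results shaped like the following sketch; the values and reasons are illustrative, and the key names follow the `{metric_name}` pattern described above:
+```python
+# Illustrative outputs only; actual values depend on your data and judge model.
+# GroundednessEvaluator: prompt-based, 5-point scale plus a reason.
+groundedness_score = {
+    "groundedness": 3.0,
+    "groundedness_reason": "The response claims the tent is the most waterproof, but the context says it is the second most waterproof.",
+}
+
+# GroundednessProEvaluator (preview): service-based, binary label plus a reason.
+groundedness_pro_score = {
+    "groundedness_pro_label": False,
+    "groundedness_pro_reason": "The response is ungrounded because the context states the tent is only the second most waterproof.",
+}
+```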
-We open-source the prompts of our quality evaluators except for `GroundednessProEvaluator` (powered by Azure AI Content Safety) for transparency. These prompts serve as instructions for a language model to perform their evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics (what the 5 levels of quality mean for the metric). We highly recommend that users customize the definitions and grading rubrics to their scenario specifics. See details in [Custom Evaluators](#custom-evaluators).
+We open-source the prompts of our quality evaluators except for `GroundednessProEvaluator` (powered by Azure AI Content Safety) for transparency. These prompts serve as instructions for a language model to perform its evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics (what the five levels of quality mean for the metric). We highly recommend that users customize the definitions and grading rubrics to their scenario specifics. See details in [Custom Evaluators](#custom-evaluators).
-For conversation mode, here is an example for `GroundednessEvaluator`:
+For conversation mode, here's an example for `GroundednessEvaluator`:
```python
# Conversation mode
@@ -391,7 +392,7 @@ print(violence_conv_score)
The result of the content safety evaluators for a query and response pair is a dictionary containing:
-- `{metric_name}` provides a severity label for that content risk ranging from Very low, Low, Medium, and High. You can read more about the descriptions of each content risk and severity scale [here](../../concepts/evaluation-metrics-built-in.md).
+- `{metric_name}` provides a severity label for that content risk ranging from Very low, Low, Medium, and High. To learn more about the descriptions of each content risk and severity scale, see [Evaluation and monitoring metrics for generative AI](../../concepts/evaluation-metrics-built-in.md).
- `{metric_name}_score` is a severity level between 0 and 7 that maps to the severity label given in `{metric_name}`.
- `{metric_name}_reason` explains why a certain severity score was given for each data point.
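+As an illustration, a violence evaluation of a harmless query-response pair could return something shaped like the following sketch; the values are made up:
+```python
+# Illustrative output only; actual values depend on your data.
+violence_score = {
+    "violence": "Very low",  # severity label
+    "violence_score": 0,     # 0-7 severity level that backs the label
+    "violence_reason": "The response contains no violent content.",
+}
+```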
@@ -738,284 +739,12 @@ result = evaluate(
```
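+After the run completes, you can inspect the returned dictionary locally. The following is a minimal sketch; the `metrics` and `rows` keys are assumptions about the result shape, so adjust them to what your run actually returns:
+```python
+import json
+
+# Aggregate metrics across all rows (assumed key).
+print(json.dumps(result.get("metrics", {}), indent=4, default=str))
+
+# First few per-row results (assumed key).
+for row in result.get("rows", [])[:3]:
+    print(json.dumps(row, indent=4, default=str))
+```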
-## Cloud evaluation (preview) on test datasets
-
-After local evaluations of your generative AI applications, you might want to run evaluations in the cloud for pre-deployment testing, and [continuously evaluate](https://aka.ms/GenAIMonitoringDoc) your applications for post-deployment monitoring. Azure AI Projects SDK offers such capabilities via a Python API and supports almost all of the features available in local evaluations. Follow the steps below to submit your evaluation to the cloud on your data using built-in or custom evaluators.
-
-### Prerequisites
-
-- Azure AI project in the same [regions](#region-support) as risk and safety evaluators (preview). If you don't have an existing project, follow the guide [How to create Azure AI project](../create-projects.md?tabs=ai-studio) to create one.
-
-> [!NOTE]
-> Cloud evaluations do not support `ContentSafetyEvaluator`, and `QAEvaluator`.
-
-- Azure OpenAI Deployment with GPT model supporting `chat completion`, for example `gpt-4`.
-- `Connection String` for Azure AI project to easily create `AIProjectClient` object. You can get the **Project connection string** under **Project details** from the project's **Overview** page.
-- Make sure you're first logged into your Azure subscription by running `az login`.
-
-### Installation Instructions
-
-1. Create a **virtual Python environment of you choice**. To create one using conda, run the following command:
-
- ```bash
- conda create -n cloud-evaluation
- conda activate cloud-evaluation
- ```
-
-2. Install the required packages by running the following command:
-
- ```bash
- pip install azure-identity azure-ai-projects azure-ai-ml
- ```
-
- Optionally you can `pip install azure-ai-evaluation` if you want a code-first experience to fetch evaluator ID for built-in evaluators in code.
-
-Now you can define a client and a deployment which will be used to run your evaluations in the cloud:
-
-```python
-
-import os, time
-from azure.ai.projects import AIProjectClient
-from azure.identity import DefaultAzureCredential
-from azure.ai.projects.models import Evaluation, Dataset, EvaluatorConfiguration, ConnectionType
-from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator
-
-# Load your Azure OpenAI config
-deployment_name = os.environ.get("AZURE_OPENAI_DEPLOYMENT")
-api_version = os.environ.get("AZURE_OPENAI_API_VERSION")
-
-# Create an Azure AI Client from a connection string. Avaiable on Azure AI project Overview page.
-project_client = AIProjectClient.from_connection_string(
- credential=DefaultAzureCredential(),
- conn_str="<connection_string>"
-)
-```
-
-### Uploading evaluation data
-
-We provide two ways to register your data in Azure AI project required for evaluations in the cloud:
-
-1. **From SDK**: Upload new data from your local directory to your Azure AI project in the SDK, and fetch the dataset ID as a result:
-
-```python
-data_id, _ = project_client.upload_file("./evaluate_test_data.jsonl")
-```
-
-**From UI**: Alternatively, you can upload new data or update existing data versions by following the UI walkthrough under the **Data** tab of your Azure AI project.
-
-2. Given existing datasets uploaded to your Project:
-
-- **From SDK**: if you already know the dataset name you created, construct the dataset ID in this format: `/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<project-name>/data/<dataset-name>/versions/<version-number>`
-
-- **From UI**: If you don't know the dataset name, locate it under the **Data** tab of your Azure AI project and construct the dataset ID as in the format above.
-
-### Specifying evaluators from Evaluator library
-
-We provide a list of built-in evaluators registered in the [Evaluator library](../evaluate-generative-ai-app.md#view-and-manage-the-evaluators-in-the-evaluator-library) under **Evaluation** tab of your Azure AI project. You can also register custom evaluators and use them for Cloud evaluation. We provide two ways to specify registered evaluators:
-
-#### Specifying built-in evaluators
-
-- **From SDK**: Use built-in evaluator `id` property supported by `azure-ai-evaluation` SDK:
-
-```python
-from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator
-print("F1 Score evaluator id:", F1ScoreEvaluator.id)
-```
-
-- **From UI**: Follows these steps to fetch evaluator ids after they're registered to your project:
- - Select **Evaluation** tab in your Azure AI project;
- - Select Evaluator library;
- - Select your evaluators of choice by comparing the descriptions;
- - Copy its "Asset ID" which will be your evaluator id, for example, `azureml://registries/azureml/models/Groundedness-Evaluator/versions/1`.
-
-#### Specifying custom evaluators
-
-- For code-based custom evaluators, register them to your Azure AI project and fetch the evaluator ids with the following:
-
-```python
-from azure.ai.ml import MLClient
-from azure.ai.ml.entities import Model
-from promptflow.client import PFClient
-
-
-# Define ml_client to register custom evaluator
-ml_client = MLClient(
- subscription_id=os.environ["AZURE_SUBSCRIPTION_ID"],
- resource_group_name=os.environ["AZURE_RESOURCE_GROUP"],
- workspace_name=os.environ["AZURE_PROJECT_NAME"],
- credential=DefaultAzureCredential()
-)
-
-
-# Load evaluator from module
-from answer_len.answer_length import AnswerLengthEvaluator
-
-# Then we convert it to evaluation flow and save it locally
-pf_client = PFClient()
-local_path = "answer_len_local"
-pf_client.flows.save(entry=AnswerLengthEvaluator, path=local_path)
-
-# Specify evaluator name to appear in the Evaluator library
-evaluator_name = "AnswerLenEvaluator"
-
-# Finally register the evaluator to the Evaluator library
-custom_evaluator = Model(
- path=local_path,
- name=evaluator_name,
- description="Evaluator calculating answer length.",
-)
-registered_evaluator = ml_client.evaluators.create_or_update(custom_evaluator)
-print("Registered evaluator id:", registered_evaluator.id)
-# Registered evaluators have versioning. You can always reference any version available.
-versioned_evaluator = ml_client.evaluators.get(evaluator_name, version=1)
-print("Versioned evaluator id:", registered_evaluator.id)
-```
-
-After registering your custom evaluator to your Azure AI project, you can view it in your [Evaluator library](../evaluate-generative-ai-app.md#view-and-manage-the-evaluators-in-the-evaluator-library) under **Evaluation** tab in your Azure AI project.
-
-- For prompt-based custom evaluators, use this snippet to register them. For example, let's register our `FriendlinessEvaluator` built as described in [Prompt-based evaluators](#prompt-based-evaluators):
-
-```python
-# Import your prompt-based custom evaluator
-from friendliness.friend import FriendlinessEvaluator
-
-# Define your deployment
-model_config = dict(
- azure_endpoint=os.environ.get("AZURE_ENDPOINT"),
- azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
- api_version=os.environ.get("AZURE_API_VERSION"),
- api_key=os.environ.get("AZURE_API_KEY"),
- type="azure_openai"
-)
-
-# Define ml_client to register custom evaluator
-ml_client = MLClient(
- subscription_id=os.environ["AZURE_SUBSCRIPTION_ID"],
- resource_group_name=os.environ["AZURE_RESOURCE_GROUP"],
- workspace_name=os.environ["AZURE_PROJECT_NAME"],
- credential=DefaultAzureCredential()
-)
-
-# # Convert evaluator to evaluation flow and save it locally
-local_path = "friendliness_local"
-pf_client = PFClient()
-pf_client.flows.save(entry=FriendlinessEvaluator, path=local_path)
-
-# Specify evaluator name to appear in the Evaluator library
-evaluator_name = "FriendlinessEvaluator"
-
-# Register the evaluator to the Evaluator library
-custom_evaluator = Model(
- path=local_path,
- name=evaluator_name,
- description="prompt-based evaluator measuring response friendliness.",
-)
-registered_evaluator = ml_client.evaluators.create_or_update(custom_evaluator)
-print("Registered evaluator id:", registered_evaluator.id)
-# Registered evaluators have versioning. You can always reference any version available.
-versioned_evaluator = ml_client.evaluators.get(evaluator_name, version=1)
-print("Versioned evaluator id:", registered_evaluator.id)
-```
-
-After logging your custom evaluator to your Azure AI project, you can view it in your [Evaluator library](../evaluate-generative-ai-app.md#view-and-manage-the-evaluators-in-the-evaluator-library) under **Evaluation** tab of your Azure AI project.
-
-### Cloud evaluation (preview) with Azure AI Projects SDK
-
-You can submit a cloud evaluation with Azure AI Projects SDK via a Python API. See the following example to submit a cloud evaluation of your dataset using an NLP evaluator (F1 score), an AI-assisted quality evaluator (Relevance), a safety evaluator (Violence) and a custom evaluator. Putting it altogether:
-
-```python
-import os, time
-from azure.ai.projects import AIProjectClient
-from azure.identity import DefaultAzureCredential
-from azure.ai.projects.models import Evaluation, Dataset, EvaluatorConfiguration, ConnectionType
-from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator
-
-# Load your Azure OpenAI config
-deployment_name = os.environ.get("AZURE_OPENAI_DEPLOYMENT")
-api_version = os.environ.get("AZURE_OPENAI_API_VERSION")
-
-# Create an Azure AI Client from a connection string. Available on project overview page on Azure AI project UI.
-project_client = AIProjectClient.from_connection_string(
- credential=DefaultAzureCredential(),
- conn_str="<connection_string>"
-)
-
-# Construct dataset ID per the instruction
-data_id = "<dataset-id>"
-
-default_connection = project_client.connections.get_default(connection_type=ConnectionType.AZURE_OPEN_AI)
-
-# Use the same model_config for your evaluator (or use different ones if needed)
-model_config = default_connection.to_evaluator_model_config(deployment_name=deployment_name, api_version=api_version)
-
-# Create an evaluation
-evaluation = Evaluation(
- display_name="Cloud evaluation",
- description="Evaluation of dataset",
- data=Dataset(id=data_id),
- evaluators={
- # Note the evaluator configuration key must follow a naming convention
- # the string must start with a letter with only alphanumeric characters
- # and underscores. Take "f1_score" as example: "f1score" or "f1_evaluator"
- # will also be acceptable, but "f1-score-eval" or "1score" will result in errors.
- "f1_score": EvaluatorConfiguration(
- id=F1ScoreEvaluator.id,
- ),
- "relevance": EvaluatorConfiguration(
- id=RelevanceEvaluator.id,
- init_params={
- "model_config": model_config
- },
- ),
- "violence": EvaluatorConfiguration(
- id=ViolenceEvaluator.id,
- init_params={
- "azure_ai_project": project_client.scope
- },
- ),
- "friendliness": EvaluatorConfiguration(
- id="<custom_evaluator_id>",
- init_params={
- "model_config": model_config
- }
- )
- },
-)
-
-# Create evaluation
-evaluation_response = project_client.evaluations.create(
- evaluation=evaluation,
-)
-
-# Get evaluation
-get_evaluation_response = project_client.evaluations.get(evaluation_response.id)
-
-print("----------------------------------------------------------------")
-print("Created evaluation, evaluation ID: ", get_evaluation_response.id)
-print("Evaluation status: ", get_evaluation_response.status)
-print("AI project URI: ", get_evaluation_response.properties["AiStudioEvaluationUri"])
-print("----------------------------------------------------------------")
-```
-
-Now we can run the cloud evaluation we just instantiated above.
-
-```python
-evaluation = client.evaluations.create(
- evaluation=evaluation,
- subscription_id=subscription_id,
- resource_group_name=resource_group_name,
- workspace_name=workspace_name,
- headers={
- "x-azureml-token": DefaultAzureCredential().get_token("https://ml.azure.com/.default").token,
- }
-)
-```
-
## Related content
-- [Azure Python reference documentation](https://aka.ms/azureaieval-python-ref)
-- [Azure AI Evaluation SDK Troubleshooting guide](https://aka.ms/azureaieval-tsg)
+- [Azure AI Evaluation Python SDK client reference documentation](https://aka.ms/azureaieval-python-ref)
+- [Azure AI Evaluation SDK client Troubleshooting guide](https://aka.ms/azureaieval-tsg)
- [Learn more about the evaluation metrics](../../concepts/evaluation-metrics-built-in.md)
+- [Evaluate your Generative AI applications remotely in the cloud](./cloud-evaluation.md)
- [Learn more about simulating test datasets for evaluation](./simulator-interaction-data.md)
- [View your evaluation results in Azure AI project](../../how-to/evaluate-results.md)
- [Get started building a chat app using the Azure AI Foundry SDK](../../quickstarts/get-started-code.md)