
[AITour 26] Build trustworthy AI with systematic evaluations in Azure AI Foundry

Building generative AI apps starts with model selection—but earning user trust requires continuous evaluation. In this talk, learn how Azure AI Evaluations SDK helps assess models pre- and post-production, analyze results, and improve quality through Observability.

Location: Toronto
Date: Oct 1, 2025
Session: https://aitour.microsoft.com/flow/microsoft/toronto26/sessioncatalog/page/sessioncatalog/session/1755310350275001j1Bu

Visit the Repo:
https://github.com/microsoft/aitour26-LTG151-build-trustworthy-ai-with-systematic-evaluations-in-azure-ai-foundry

Join the Discord:
https://aka.ms/model-mondays/discord

Nitya Narasimhan, PhD

October 08, 2025

Transcript

  1. Build Trustworthy AI with Systematic Evaluations in Azure AI Foundry

    Microsoft AI Tour | LTG151
    Nitya Narasimhan, PhD | Senior AI Advocate, Microsoft
  2. Build Trustworthy AI with Systematic Evaluations in Azure AI Foundry

    Building generative AI apps starts with model selection—but earning user trust requires continuous evaluation. In this talk, learn how Azure AI Evaluations SDK helps assess models pre- and post-production, analyze results, and improve quality through Observability.
  3. Managing AI quality, performance and safety is paramount

    Azure AI Foundry: Monitor Performance, Quality, and Safety Metrics • Diagnose Issues Quickly • Intervene When Needed
    ai.azure.com
  4. Aligned with your end-to-end workflow: Plan • Develop • Operate

    Azure AI Foundry powers visibility, monitoring and optimization across the entire AI development lifecycle: Govern • Monitor • Mitigate • Optimize • Evaluate
  5. Observability as your guide to get started

    Azure AI Foundry: Select the best model • Evaluate it with guidance • Transition from prototype to development
    ai.azure.com
  6. Demo 1: Leaderboards

    Select the right base model for the task, using leaderboards and benchmarks to assess and compare quality, safety & costs.
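Leaderboard-driven selection ultimately comes down to a weighted trade-off across quality, safety, and cost. A minimal sketch of that trade-off, using made-up candidate models and scores (none of these names or numbers are real benchmark results):

```python
# Hypothetical leaderboard rows; model names and scores are illustrative only.
candidates = [
    {"model": "model-a", "quality": 0.86, "safety": 0.95, "cost_per_1k_tokens": 0.010},
    {"model": "model-b", "quality": 0.82, "safety": 0.97, "cost_per_1k_tokens": 0.002},
    {"model": "model-c", "quality": 0.90, "safety": 0.90, "cost_per_1k_tokens": 0.030},
]

def score(row, w_quality=0.5, w_safety=0.3, w_cost=0.2):
    """Weighted trade-off: higher quality/safety is better, higher cost is worse."""
    # Normalize cost into a 0-1 "cheapness" score against the priciest candidate.
    max_cost = max(r["cost_per_1k_tokens"] for r in candidates)
    cheapness = 1 - row["cost_per_1k_tokens"] / max_cost
    return w_quality * row["quality"] + w_safety * row["safety"] + w_cost * cheapness

# Rank candidates by the combined score, best first.
ranked = sorted(candidates, key=score, reverse=True)
best = ranked[0]["model"]
```

With these weights the cheaper, safer model edges out the highest-quality one; shifting the weights changes the winner, which is exactly the judgment call the leaderboards surface.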
  7. Demo 2: Generate Eval Dataset

    Generate QA pairs from the Zava search index directly. This gives us a starting point (question, response, ground truth) for evaluations.
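Generated QA pairs are typically stored one JSON object per line (JSONL), a shape batch-evaluation tooling commonly consumes. A sketch with hand-written placeholder rows standing in for the Zava-generated pairs (the field names are illustrative; match them to what your evaluators expect):

```python
import json

# Placeholder QA pairs; in the demo these come from the Zava search index.
rows = [
    {
        "query": "What is the return policy?",
        "response": "Items can be returned within 30 days with a receipt.",
        "ground_truth": "Returns are accepted within 30 days of purchase with proof of purchase.",
    },
    {
        "query": "Do you ship internationally?",
        "response": "Yes, we ship to most countries.",
        "ground_truth": "International shipping is available to most countries.",
    },
]

# Write one JSON object per line (JSONL).
path = "eval_dataset.jsonl"
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read it back: each line parses independently.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```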
  8. Demo 3: First Evaluation Flow

    Run an AI-assisted evaluation using the evaluation dataset, with the right list of evaluators. Specify an Azure AI Project for portal results, or an output file for local viewing.
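Conceptually, an evaluation flow applies each evaluator to every dataset row, aggregates the scores, and optionally writes the results to a file for local viewing. A local, rule-based sketch of that loop (the stand-in evaluators here are trivial placeholders, not the SDK's AI-assisted evaluators):

```python
import json
from statistics import mean

# Stand-in evaluators; in Azure AI Foundry these would be AI-assisted
# evaluators such as groundedness or relevance.
def exact_match(row):
    return 1.0 if row["response"].strip() == row["ground_truth"].strip() else 0.0

def length_ratio(row):
    # Crude proxy metric: response length relative to ground truth, capped at 1.
    return min(len(row["response"]) / max(len(row["ground_truth"]), 1), 1.0)

evaluators = {"exact_match": exact_match, "length_ratio": length_ratio}

def run_evaluation(rows, evaluators, output_path=None):
    """Apply every evaluator to every row and aggregate mean scores."""
    per_row = [{name: fn(row) for name, fn in evaluators.items()} for row in rows]
    summary = {name: mean(r[name] for r in per_row) for name in evaluators}
    results = {"rows": per_row, "summary": summary}
    if output_path:  # mirror the option to keep results in a local file
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=2)
    return results

rows = [
    {"response": "30 days", "ground_truth": "30 days"},
    {"response": "No returns", "ground_truth": "Returns are accepted within 30 days"},
]
results = run_evaluation(rows, evaluators)
```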
  9. Quality and Safety Evaluators (Azure AI Foundry)

    Quality: Document Retrieval (NEW) • Groundedness • Relevance • Coherence • Fluency • Similarity • NLP Metrics (e.g., F1 Score) • AOAI Graders (NEW)
    Risk & Safety: Indirect Attack Jailbreaks • Direct Attack Jailbreaks • Hate and Unfairness • Sexual • Violence • Self-Harm • Protected Material • Ungrounded Attributes (NEW) • Code Vulnerability (NEW)
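Of the metrics listed, the NLP ones are easy to compute locally. A token-overlap F1 score, the classic formulation behind the slide's example:

```python
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Token-overlap F1: harmonic mean of precision and recall over tokens."""
    pred = response.lower().split()
    gold = ground_truth.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```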
  10. Evaluations for agents

    User query: "Weather now."
    User Proxy Agent (user wants to know the local weather at the current time) → Tool Agent (call location and time API, call weather API) → Response Agent ("The temperature is 30 degrees.")

    Intent resolution (Preview): correct intent classification • clarification for ambiguity • scope adherence
    Tool calling evaluation (Preview): single-step call accuracy • parameter extraction accuracy • multi-step trajectory efficiency
    Task adherence (Preview): final response satisfaction • response completeness
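The three agent evaluators can be pictured as checks over an agent trace: the resolved intent, the tool-call trajectory, and the final response. A rule-based sketch using the slide's "Weather now." example (in Foundry these are AI-assisted preview evaluators; the trace fields, tool names, and checks here are all illustrative):

```python
# Hypothetical trace for the "Weather now." query: the tool agent should
# resolve location and time first, then call the weather API.
trace = {
    "query": "Weather now.",
    "resolved_intent": "get_current_local_weather",
    "tool_calls": [
        {"name": "get_location_and_time", "args": {}},
        {"name": "get_weather", "args": {"city": "Toronto"}},
    ],
    "response": "The temperature is 30 degrees.",
}

expected_intent = "get_current_local_weather"
expected_call_order = ["get_location_and_time", "get_weather"]

def evaluate_agent_trace(trace, expected_intent, expected_call_order):
    """Score the three agent dimensions from the slide with simple rules."""
    intent_resolution = 1.0 if trace["resolved_intent"] == expected_intent else 0.0
    actual_order = [c["name"] for c in trace["tool_calls"]]
    tool_call_accuracy = 1.0 if actual_order == expected_call_order else 0.0
    # Task-adherence proxy: the final answer must actually report a temperature.
    task_adherence = 1.0 if "temperature" in trace["response"].lower() else 0.0
    return {
        "intent_resolution": intent_resolution,
        "tool_call_accuracy": tool_call_accuracy,
        "task_adherence": task_adherence,
    }

scores = evaluate_agent_trace(trace, expected_intent, expected_call_order)
```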
  11. New Evaluators (Azure AI Foundry)

    Quality: Document Retrieval (NEW) • Groundedness • Relevance • Coherence • Fluency • Similarity • NLP Metrics (e.g., F1 Score) • AOAI Graders (NEW)
    Risk & Safety: Indirect Attack Jailbreaks • Direct Attack Jailbreaks • Hate and Unfairness • Sexual • Violence • Self-Harm • Protected Material • Ungrounded Attributes (NEW) • Code Vulnerability (NEW)
    Agents (NEW): Intent Resolution • Tool Call Accuracy • Task Adherence • Response Completeness
    + Custom Evaluators
  12. AI Red Teaming Agent (NEW)

    Automated scans that empower security professionals and ML engineers to proactively find risks in their generative AI systems faster, with integrations of PyRIT into Azure AI Foundry.
  13. Azure AI Foundry Observability is the foundation for reliable AI agents

    [Architecture diagram: Azure AI Foundry (Security, Identity, Management) spans Foundry Models, Foundry Agent Service, Azure AI Search, Foundry Observability, Azure AI Services, Azure Machine Learning, and Azure AI Content Safety; integrates with Copilot Studio, Visual Studio, GitHub, and the Foundry SDK; runs on serverless control or Azure Kubernetes Service, Azure Container Apps, Azure App Service, and Azure Functions; deploys across Azure cloud, Azure Arc, and Foundry Local at the edge]
  14. Agent Best Practices for Reliable AI

    Pick the right model – using Leaderboards
    Evaluate agents continuously – with the SDK
    Automate evaluations – in CI/CD pipeline
    Scan for vulnerabilities – with AI red teaming
    Monitor in production – with tracing & alerts
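Automating evaluations in a CI/CD pipeline usually ends with a quality gate: fail the build when a metric regresses below a floor. A hypothetical gate script (the metric names and thresholds are illustrative; many Foundry quality evaluators score on a 1-5 scale):

```python
import json
import sys

# Hypothetical thresholds, assuming 1-5 evaluator scales; tune per project.
THRESHOLDS = {"groundedness": 4.0, "relevance": 4.0}

def gate(summary: dict, thresholds: dict) -> list:
    """Return the list of metrics that fall below their threshold."""
    return [m for m, floor in thresholds.items() if summary.get(m, 0) < floor]

def main(path: str) -> int:
    # Read the evaluation summary produced by an earlier pipeline step.
    with open(path, encoding="utf-8") as f:
        summary = json.load(f)
    failures = gate(summary, THRESHOLDS)
    for metric in failures:
        print(f"FAIL: {metric} = {summary.get(metric)} below {THRESHOLDS[metric]}")
    return 1 if failures else 0  # nonzero exit fails the pipeline step

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

Wiring this after the evaluation step makes quality regressions block a merge or deployment instead of surfacing later in production.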
  15. Feedback

    Your feedback is valuable. Please submit your thoughts about today's experiences at aka.ms/MicrosoftAITour/Survey, or scan the QR code to respond.