
[AITour 26] Build trustworthy AI with systematic evaluations in Azure AI Foundry

Building generative AI apps starts with model selection—but earning user trust requires continuous evaluation. In this talk, learn how Azure AI Evaluations SDK helps assess models pre- and post-production, analyze results, and improve quality through Observability.

Location: Toronto
Date: Oct 1, 2025
Session: https://aitour.microsoft.com/flow/microsoft/toronto26/sessioncatalog/page/sessioncatalog/session/1755310350275001j1Bu

Visit the Repo:
https://github.com/microsoft/aitour26-LTG151-build-trustworthy-ai-with-systematic-evaluations-in-azure-ai-foundry

Join the Discord:
https://aka.ms/model-mondays/discord

Nitya Narasimhan, PhD

October 08, 2025

Transcript

  1. Build Trustworthy AI with Systematic Evaluations in Azure AI Foundry

    Microsoft AI Tour | LTG151
    Nitya Narasimhan, PhD | Senior AI Advocate, Microsoft
  2. Build Trustworthy AI with Systematic Evaluations in Azure AI Foundry

    Building generative AI apps starts with model selection—but earning user trust requires continuous evaluation. In this talk, learn how Azure AI Evaluations SDK helps assess models pre- and post-production, analyze results, and improve quality through Observability.
  3. Managing AI quality, performance and safety is paramount

    Azure AI Foundry: Monitor Performance, Quality, and Safety Metrics • Diagnose Issues Quickly • Intervene When Needed
    ai.azure.com
  4. Aligned with your end-to-end workflow: Plan • Develop • Operate

    Azure AI Foundry powers visibility, monitoring and optimization across the entire AI development lifecycle: Govern • Monitor • Mitigate • Optimize • Evaluate
  5. Observability as your guide to get started

    Azure AI Foundry: Select the best model • Evaluate it with guidance • Transition from prototype to development
    ai.azure.com
  6. Demo 1: Leaderboards

    Select the right base model for the task, using leaderboards and benchmarks to assess and compare quality, safety & costs.
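Leaderboard-driven selection ultimately comes down to a weighted trade-off across quality, safety, and cost. A minimal sketch of that trade-off, using made-up candidate models and scores (none of these names or numbers are real benchmark results):

```python
# Hypothetical leaderboard rows; model names and scores are illustrative only.
candidates = [
    {"model": "model-a", "quality": 0.86, "safety": 0.95, "cost_per_1k_tokens": 0.010},
    {"model": "model-b", "quality": 0.82, "safety": 0.97, "cost_per_1k_tokens": 0.002},
    {"model": "model-c", "quality": 0.90, "safety": 0.90, "cost_per_1k_tokens": 0.030},
]

def score(row, w_quality=0.5, w_safety=0.3, w_cost=0.2):
    """Weighted trade-off: higher quality/safety is better, higher cost is worse."""
    # Normalize cost into a 0-1 "cheapness" score against the priciest candidate.
    max_cost = max(r["cost_per_1k_tokens"] for r in candidates)
    cheapness = 1 - row["cost_per_1k_tokens"] / max_cost
    return w_quality * row["quality"] + w_safety * row["safety"] + w_cost * cheapness

# Rank candidates by the combined score, best first.
ranked = sorted(candidates, key=score, reverse=True)
best = ranked[0]["model"]
```

With these weights the cheaper, safer model edges out the highest-quality one; shifting the weights changes the winner, which is exactly the judgment call the leaderboards surface.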
  7. Demo 2: Generate Eval Dataset

    Generate QA pairs from the Zava search index directly. This gives us a starting point (question, response, ground truth) for evaluations.
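Generated QA pairs are typically stored one JSON object per line (JSONL), a shape batch-evaluation tooling commonly consumes. A sketch with hand-written placeholder rows standing in for the Zava-generated pairs (the field names are illustrative; match them to what your evaluators expect):

```python
import json

# Placeholder QA pairs; in the demo these come from the Zava search index.
rows = [
    {
        "query": "What is the return policy?",
        "response": "Items can be returned within 30 days with a receipt.",
        "ground_truth": "Returns are accepted within 30 days of purchase with proof of purchase.",
    },
    {
        "query": "Do you ship internationally?",
        "response": "Yes, we ship to most countries.",
        "ground_truth": "International shipping is available to most countries.",
    },
]

# Write one JSON object per line (JSONL).
path = "eval_dataset.jsonl"
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read it back: each line parses independently.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```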
  8. Demo 3: First Evaluation Flow

    Run an AI-assisted evaluation using the evaluation dataset, with the right list of evaluators. Specify an Azure AI Project for portal results, or an output file for local viewing.
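Conceptually, an evaluation flow applies each evaluator to every dataset row, aggregates the scores, and optionally writes the results to a file for local viewing. A local, rule-based sketch of that loop (the stand-in evaluators here are trivial placeholders, not the SDK's AI-assisted evaluators):

```python
import json
from statistics import mean

# Stand-in evaluators; in Azure AI Foundry these would be AI-assisted
# evaluators such as groundedness or relevance.
def exact_match(row):
    return 1.0 if row["response"].strip() == row["ground_truth"].strip() else 0.0

def length_ratio(row):
    # Crude proxy metric: response length relative to ground truth, capped at 1.
    return min(len(row["response"]) / max(len(row["ground_truth"]), 1), 1.0)

evaluators = {"exact_match": exact_match, "length_ratio": length_ratio}

def run_evaluation(rows, evaluators, output_path=None):
    """Apply every evaluator to every row and aggregate mean scores."""
    per_row = [{name: fn(row) for name, fn in evaluators.items()} for row in rows]
    summary = {name: mean(r[name] for r in per_row) for name in evaluators}
    results = {"rows": per_row, "summary": summary}
    if output_path:  # mirror the option to keep results in a local file
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=2)
    return results

rows = [
    {"response": "30 days", "ground_truth": "30 days"},
    {"response": "No returns", "ground_truth": "Returns are accepted within 30 days"},
]
results = run_evaluation(rows, evaluators)
```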
  9. Quality and Safety Evaluators (Azure AI Foundry)

    Quality: Document Retrieval (NEW) • Groundedness • Relevance • Coherence • Fluency • Similarity • NLP Metrics (e.g., F1 Score) • AOAI Graders (NEW)
    Risk & Safety: Indirect Attack Jailbreaks • Direct Attack Jailbreaks • Hate and Unfairness • Sexual • Violence • Self-Harm • Protected Material • Ungrounded Attributes (NEW) • Code Vulnerability (NEW)
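Of the metrics listed, the NLP ones are easy to compute locally. A token-overlap F1 score, the classic formulation behind the slide's example:

```python
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Token-overlap F1: harmonic mean of precision and recall over tokens."""
    pred = response.lower().split()
    gold = ground_truth.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```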
  10. Evaluations for agents

    User query: "Weather now."
    User Proxy Agent (user wants to know the local weather at the current time) → Tool Agent (call location and time API, call weather API) → Response Agent ("The temperature is 30 degrees.")

    Intent resolution (Preview): correct intent classification • clarification for ambiguity • scope adherence
    Tool calling evaluation (Preview): single-step call accuracy • parameter extraction accuracy • multi-step trajectory efficiency
    Task adherence (Preview): final response satisfaction • response completeness
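The three agent evaluators can be pictured as checks over an agent trace: the resolved intent, the tool-call trajectory, and the final response. A rule-based sketch using the slide's "Weather now." example (in Foundry these are AI-assisted preview evaluators; the trace fields, tool names, and checks here are all illustrative):

```python
# Hypothetical trace for the "Weather now." query: the tool agent should
# resolve location and time first, then call the weather API.
trace = {
    "query": "Weather now.",
    "resolved_intent": "get_current_local_weather",
    "tool_calls": [
        {"name": "get_location_and_time", "args": {}},
        {"name": "get_weather", "args": {"city": "Toronto"}},
    ],
    "response": "The temperature is 30 degrees.",
}

expected_intent = "get_current_local_weather"
expected_call_order = ["get_location_and_time", "get_weather"]

def evaluate_agent_trace(trace, expected_intent, expected_call_order):
    """Score the three agent dimensions from the slide with simple rules."""
    intent_resolution = 1.0 if trace["resolved_intent"] == expected_intent else 0.0
    actual_order = [c["name"] for c in trace["tool_calls"]]
    tool_call_accuracy = 1.0 if actual_order == expected_call_order else 0.0
    # Task-adherence proxy: the final answer must actually report a temperature.
    task_adherence = 1.0 if "temperature" in trace["response"].lower() else 0.0
    return {
        "intent_resolution": intent_resolution,
        "tool_call_accuracy": tool_call_accuracy,
        "task_adherence": task_adherence,
    }

scores = evaluate_agent_trace(trace, expected_intent, expected_call_order)
```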
  11. New Evaluators (Azure AI Foundry)

    Quality: Document Retrieval (NEW) • Groundedness • Relevance • Coherence • Fluency • Similarity • NLP Metrics (e.g., F1 Score) • AOAI Graders (NEW)
    Risk & Safety: Indirect Attack Jailbreaks • Direct Attack Jailbreaks • Hate and Unfairness • Sexual • Violence • Self-Harm • Protected Material • Ungrounded Attributes (NEW) • Code Vulnerability (NEW)
    Agents (NEW): Intent Resolution • Tool Call Accuracy • Task Adherence • Response Completeness
    + Custom Evaluators
  12. AI Red Teaming Agent (NEW)

    Automated scans that empower security professionals and ML engineers to proactively find risks in their generative AI systems faster, with integrations of PyRIT into Azure AI Foundry.
  13. Azure AI Foundry Observability is the foundation for reliable AI agents

    [Architecture diagram: Azure AI Foundry (Security, Identity, Management) spans Foundry Models, Foundry Agent Service, Azure AI Search, Foundry Observability, Azure AI Services, Azure Machine Learning, and Azure AI Content Safety; integrates with Copilot Studio, Visual Studio, GitHub, and the Foundry SDK; runs on serverless control or Azure Kubernetes Service, Azure Container Apps, Azure App Service, and Azure Functions; deploys across Azure cloud, Azure Arc, and Foundry Local at the edge]
  14. Agent Best Practices for Reliable AI

    Pick the right model – using Leaderboards
    Evaluate agents continuously – with the SDK
    Automate evaluations – in CI/CD pipeline
    Scan for vulnerabilities – with AI red teaming
    Monitor in production – with tracing & alerts
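Automating evaluations in a CI/CD pipeline usually ends with a quality gate: fail the build when a metric regresses below a floor. A hypothetical gate script (the metric names and thresholds are illustrative; many Foundry quality evaluators score on a 1-5 scale):

```python
import json
import sys

# Hypothetical thresholds, assuming 1-5 evaluator scales; tune per project.
THRESHOLDS = {"groundedness": 4.0, "relevance": 4.0}

def gate(summary: dict, thresholds: dict) -> list:
    """Return the list of metrics that fall below their threshold."""
    return [m for m, floor in thresholds.items() if summary.get(m, 0) < floor]

def main(path: str) -> int:
    # Read the evaluation summary produced by an earlier pipeline step.
    with open(path, encoding="utf-8") as f:
        summary = json.load(f)
    failures = gate(summary, THRESHOLDS)
    for metric in failures:
        print(f"FAIL: {metric} = {summary.get(metric)} below {THRESHOLDS[metric]}")
    return 1 if failures else 0  # nonzero exit fails the pipeline step

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

Wiring this after the evaluation step makes quality regressions block a merge or deployment instead of surfacing later in production.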
  15. Feedback

    Your feedback is valuable. Please submit your thoughts about today's experiences at aka.ms/MicrosoftAITour/Survey, or scan the QR code to respond.