Are AI SRE Agents Ready for Prime Time? Building Trust in Autonomous Operations

Komodor

August 26, 2025

Transcript

  1. Agenda • Introduction • The AI SRE Landscape • The Evaluation

    Challenge • Battle-Tested Tips & Tricks • Benchmarking & Practical Takeaways • DEMO • Q&A
  2. Hi, My Name is Asaf Savich 👋 • CTO Office

    Team Lead at Komodor • Co-Founding CTO of Genie AI • Prev. Director of Engineering at Kubiya.ai • Prev. R&D Director at Mend • Ice Bath Enthusiast
  3. Here’s the sequence of events: • It was my first

    day at one of the companies I worked for. • I was about to leave the office when sh*t hit the fan - production went down just before a crucial POC • I had no context • No deployment had taken place that day • No single person could investigate on their own • R&D blamed DevOps, DevOps blamed R&D. Ring a bell? My (Human) Experience with Incident Response
  4. Key differentiators: • OSS vs. Proprietary • Legacy vs. New

    Players • Context-Rich vs. AI-Wrapper • Chatbot vs. UI • Opinionated vs. Deterministic • Agent vs. Multi-Agent • Point Solution vs. Platform The AI SRE Landscape
  5. Why is it so hard? • Real life is (more)

    complicated • Lots of disparate data across the Cloud-Native stack • Hard to determine quality in a lab setting • Each infrastructure is uniquely intricate • Missing context & data quality • Many different LLMs and frameworks The Evaluation Challenge
  6. Failure Playground We chose the failure scenarios based on the

    following three parameters: • How common are they? • How difficult is it to find the root cause? • How severe are the failures?
  7. Failure Playground A testing playground with dozens of failure scenarios and

    golden standards for instant model feedback against expert benchmarks. • Comprehensive test scenarios: Maintains dozens of real Kubernetes failure cases with detailed golden standard investigations defining what excellent root cause analysis should look like • Instant feedback loop: Enables immediate evaluation of new models and algorithm changes against established benchmarks, accelerating development cycles • Quality benchmarking: Provides objective measurement of investigation quality by comparing agent outputs against expert-crafted golden standards for each failure scenario (a minimal sketch of such a scenario definition follows)
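A minimal sketch of how such a scenario catalog could look, assuming a simple Python data model (the class names, fields, and example values below are illustrative assumptions, not Komodor's actual implementation):

```python
from dataclasses import dataclass, field


@dataclass
class FailureScenario:
    """One reproducible Kubernetes failure plus the expert 'golden standard' answer."""
    name: str                        # e.g. "configmap-invalid-value"
    manifests_dir: str               # path to the manifests that reproduce the failure
    how_common: int                  # 1-5: how often this failure shows up in the wild
    rca_difficulty: int              # 1-5: how hard the root cause is to find
    severity: int                    # 1-5: blast radius when it happens
    golden_root_cause: str           # what an excellent investigation should conclude
    golden_evidence: list[str] = field(default_factory=list)     # signals a good agent should cite
    golden_remediation: list[str] = field(default_factory=list)  # concrete fix steps


# Illustrative entry, loosely based on the ConfigMap example shown later in the deck.
SCENARIOS = [
    FailureScenario(
        name="configmap-invalid-value",
        manifests_dir="scenarios/configmap-invalid-value/",
        how_common=4,
        rca_difficulty=3,
        severity=4,
        golden_root_cause=(
            "Deployment pods crash because an environment variable sourced from a "
            "ConfigMap contains a value the application cannot parse."
        ),
        golden_evidence=[
            "Pods stuck in CrashLoopBackOff",
            "Container logs show a config parsing error on startup",
            "The ConfigMap key referenced by the Deployment holds the invalid value",
        ],
        golden_remediation=[
            "Patch the ConfigMap with a valid value",
            "Restart the Deployment rollout so pods pick up the corrected config",
        ],
    ),
]
```

The three numeric fields mirror the selection parameters from the previous slide (how common, how hard to root-cause, how severe); the golden_* fields are what agent outputs get compared against.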
  8. LLM as a Judge • Don’t rely on LLM common

    sense • Create a set of prewritten rules • Let LLMs rate how other LLMs are doing • Ensure consistency and accuracy
  9. LLM as a Judge • Multi-dimensional scoring: Evaluates agents across

    technical accuracy, investigation depth, reasoning quality, and actionability using standardized criteria • Objective comparison: Systematically compares agents to identify superior approaches, like detecting complete causal chains versus symptom-focused analysis • Scalable assessment: Enables rapid evaluation of multiple agent variants simultaneously, accelerating our development cycle without human bottlenecks (a rubric-based judging sketch follows)
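A minimal sketch of rubric-based LLM-as-a-judge scoring, reusing the FailureScenario model above; call_llm is a placeholder for whatever model client is actually used, and the rubric wording is an assumption that simply mirrors the four dimensions named on this slide:

```python
import json

# Prewritten rules: the judge scores against an explicit rubric rather than "LLM common sense".
JUDGE_RUBRIC = """
Score the agent's investigation from 1-5 on each dimension:
- technical_accuracy: does the stated root cause match the golden root cause?
- investigation_depth: does it cite the key evidence, not just surface symptoms?
- reasoning_quality: is the causal chain complete and coherent?
- actionability: are the remediation steps concrete and correct?
Return JSON only: {"technical_accuracy": n, "investigation_depth": n,
                   "reasoning_quality": n, "actionability": n, "justification": "..."}
"""


def call_llm(prompt: str) -> str:
    """Placeholder: swap in the real model client (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError


def judge_investigation(agent_output: str, scenario: "FailureScenario") -> dict:
    """Have one LLM rate another LLM's investigation against the golden standard."""
    prompt = (
        f"{JUDGE_RUBRIC}\n\n"
        f"Golden root cause:\n{scenario.golden_root_cause}\n\n"
        f"Golden evidence:\n{scenario.golden_evidence}\n\n"
        f"Golden remediation:\n{scenario.golden_remediation}\n\n"
        f"Agent investigation to score:\n{agent_output}\n"
    )
    return json.loads(call_llm(prompt))
```

Pinning the judge to a written rubric and a fixed output schema is what gives the consistency the previous slide asks for: the same investigation should get roughly the same scores on every run.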
  10. Comparison Tool • You need a way to compare different

    models fast • To stay ahead of the curve you need to quickly assess new models and new use cases • Is Claude Sonnet 3.7 better than Sonnet 3.5? Is DeepSeek better than Claude? (a comparison-harness sketch follows)
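A sketch of such a comparison harness, assuming the SCENARIOS catalog and judge_investigation from the sketches above; run_agent is a placeholder for however each candidate model or agent is invoked, and averaging per dimension is just one reasonable way to aggregate:

```python
from statistics import mean

DIMENSIONS = ["technical_accuracy", "investigation_depth", "reasoning_quality", "actionability"]


def run_agent(model_name: str, scenario) -> str:
    """Placeholder: run the AI SRE agent backed by `model_name` against the scenario's cluster."""
    raise NotImplementedError


def compare_models(model_names: list[str], scenarios) -> dict[str, dict[str, float]]:
    """Score every candidate model on every failure scenario and average per dimension."""
    raw_scores: dict[str, list[dict]] = {m: [] for m in model_names}
    for scenario in scenarios:
        for model in model_names:
            investigation = run_agent(model, scenario)
            raw_scores[model].append(judge_investigation(investigation, scenario))

    return {
        model: {dim: mean(score[dim] for score in scores) for dim in DIMENSIONS}
        for model, scores in raw_scores.items()
    }


# e.g. compare_models(["claude-3-5-sonnet", "claude-3-7-sonnet", "deepseek-r1"], SCENARIOS)
```

Because the scenarios and the judge stay fixed, evaluating a new model becomes a single run of compare_models rather than a fresh round of manual review.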
  11. How Do We Compare? We tested the leading K8s AI

    SRE agents on the market across 30 scenarios. In this example, a Deployment relies on a ConfigMap with an invalid value (a plausible reconstruction is sketched below). Here are the results 👉
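The deck doesn't show the exact manifests, so this is only a plausible reconstruction of that scenario, sketched with the official kubernetes Python client; the namespace, names, image, and the specific invalid value (a non-numeric port) are all assumptions:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

NAMESPACE = "ai-sre-playground"  # assumed namespace for the test scenario

# ConfigMap with an invalid value: PORT should be numeric, but isn't.
configmap = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "app-config"},
    "data": {"PORT": "eighty"},  # the injected fault
}

# Deployment whose pods read PORT from the ConfigMap; an app that fails to parse
# the value will crash on startup, leaving the pods in CrashLoopBackOff.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "demo-app"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "demo-app"}},
        "template": {
            "metadata": {"labels": {"app": "demo-app"}},
            "spec": {
                "containers": [{
                    "name": "demo-app",
                    "image": "example.com/demo-app:latest",  # placeholder image
                    "envFrom": [{"configMapRef": {"name": "app-config"}}],
                }],
            },
        },
    },
}

client.CoreV1Api().create_namespaced_config_map(NAMESPACE, configmap)
client.AppsV1Api().create_namespaced_deployment(NAMESPACE, deployment)
```

An agent investigating this has to connect the crash-looping pods back through the Deployment's envFrom reference to the bad ConfigMap value - the kind of complete causal chain the judge rewards over symptom-only answers.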
  12. Only Komodor’s Klaudia correctly detected the root cause, in 28/30 scenarios.

    Klaudia also suggested exact instructions for remediation, completing the troubleshooting cycle end-to-end 👇
  13. Q&A