Are AI SRE Agents Ready for Prime Time? Building Trust in Autonomous Operations


Agenda
• Introduction
• The AI SRE Landscape
• The Evaluation Challenge
• Battle-Tested Tips & Tricks
• Benchmarking & Practical Takeaways
• DEMO
• Q&A

Hi, My Name is Asaf Savich 👋
• CTO Office Team Lead at Komodor
• Co-Founding CTO of Genie AI
• Prev. Director of Engineering at Kubiya.ai
• Prev. R&D Director at Mend
• Ice Bath Enthusiast

My (Human) Experience with Incident Response
Here’s the sequence of events:
• It was my first day at one of the companies I worked for.
• I was about to leave the office when sh*t hit the fan: production went down just before a crucial POC.
• I had no context.
• No deployment had taken place that day.
• No single person could investigate on their own.
• R&D blamed DevOps, DevOps blamed R&D.
Rings a bell?

The AI SRE Landscape
Key differentiators:
• OSS vs. Proprietary
• Legacy vs. New Players
• Context-Rich vs. AI-Wrapper
• Chatbot vs. UI
• Opinionated vs. Deterministic
• Agent vs. Multi-Agent
• Point Solution vs. Platform

The Evaluation Challenge
Why is it so hard?
• Real life is (more) complicated
• Lots of disparate data across the Cloud-Native stack
• Hard to determine quality in a lab setting
• Each infrastructure is uniquely intricate
• Missing context & data quality
• Many different LLMs and frameworks

Battle-Tested Tips & Tricks: How We Built Klaudia

Failure Playground
We chose the failure scenarios based on the following three parameters:
• How common are they?
• How difficult is it to find the root cause (RC)?
• How severe are the failures?

Failure Playground
Testing playground with dozens of failure scenarios and golden standards for instant model feedback against expert benchmarks.
• Comprehensive test scenarios: Maintains dozens of real Kubernetes failure cases with detailed golden standard investigations defining what excellent root cause analysis should look like
• Instant feedback loop: Enables immediate evaluation of new models and algorithm changes against established benchmarks, accelerating development cycles
• Quality benchmarking: Provides objective measurement of investigation quality by comparing agent outputs against expert-crafted golden standards for each failure scenario
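A playground like this can be sketched as a list of scenarios, each paired with a golden standard, plus a loop that scores any agent against them. The sketch below is only an illustration and not Komodor's implementation: the `Scenario` schema, the substring-based `coverage` check, the `fake_agent` stand-in, and the OOM scenario are all assumptions (only the invalid-ConfigMap case appears in the deck).

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """One failure case plus its expert-written golden standard (hypothetical schema)."""
    name: str
    golden_root_cause: str
    # Facts an excellent investigation is expected to mention.
    required_findings: list[str] = field(default_factory=list)


def coverage(scenario: Scenario, agent_report: str) -> float:
    """Fraction of required findings that appear in the agent's report.

    Substring matching is a deliberately crude placeholder for a real comparison.
    """
    if not scenario.required_findings:
        return 0.0
    report = agent_report.lower()
    hits = sum(1 for f in scenario.required_findings if f.lower() in report)
    return hits / len(scenario.required_findings)


def run_playground(scenarios, investigate):
    """Run an agent callable over every scenario and collect coverage scores."""
    return {s.name: coverage(s, investigate(s.name)) for s in scenarios}


SCENARIOS = [
    Scenario(
        name="invalid-configmap-value",
        golden_root_cause="Deployment reads an invalid value from a ConfigMap",
        required_findings=["configmap", "invalid value", "deployment"],
    ),
    Scenario(  # hypothetical scenario, added for illustration only
        name="oom-killed-pod",
        golden_root_cause="Container exceeds its memory limit",
        required_findings=["oomkilled", "memory limit"],
    ),
]


def fake_agent(scenario_name: str) -> str:
    """Stand-in for a real agent; returns canned text for illustration only."""
    canned = {
        "invalid-configmap-value": (
            "The Deployment crashes because the ConfigMap holds an invalid value."
        ),
        "oom-killed-pod": "The pod restarts frequently.",
    }
    return canned[scenario_name]


results = run_playground(SCENARIOS, fake_agent)
print(results)
```

Swapping `fake_agent` for a call into a new model or algorithm variant gives the instant feedback loop described above: the same scenarios, the same golden standards, a fresh set of scores.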

LLM as a Judge
• Don’t rely on LLM common sense
• Create a set of prewritten rules
• Let LLMs rate how other LLMs are doing
• Ensure consistency and accuracy

LLM as a Judge
• Multi-dimensional scoring: Evaluates agents across technical accuracy, investigation depth, reasoning quality, and actionability using standardized criteria
• Objective comparison: Systematically compares agents to identify superior approaches, like detecting complete causal chains versus symptom-focused analysis
• Scalable assessment: Enables rapid evaluation of multiple agent variants simultaneously, accelerating our development cycle without human bottlenecks
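One way to combine prewritten rules with multi-dimensional scoring is to put the rubric into the judge prompt and validate the judge's reply before trusting it. This is a minimal sketch under stated assumptions: the four dimension names come from the slide, but the rule wording, the 1 to 5 scale, the JSON reply format, and the `call_llm` callable are hypothetical, and `stub_llm` replaces a real model call so the example runs offline.

```python
import json

# Dimensions named on the slide; the rule wording and 1-5 scale are assumptions.
RUBRIC = {
    "technical_accuracy": "Does the stated root cause match the golden standard?",
    "investigation_depth": "Does the report trace the full causal chain, not just symptoms?",
    "reasoning_quality": "Is each conclusion supported by cited evidence?",
    "actionability": "Are the remediation steps concrete enough to execute?",
}


def build_judge_prompt(golden: str, report: str) -> str:
    """Embed the prewritten rules so the judge does not fall back on 'common sense'."""
    rules = "\n".join(f"- {name}: {rule}" for name, rule in RUBRIC.items())
    return (
        "Score the report from 1 to 5 on each rule. "
        "Reply with a JSON object keyed by rule name.\n"
        f"Rules:\n{rules}\n\nGolden standard:\n{golden}\n\nReport:\n{report}\n"
    )


def parse_scores(raw: str) -> dict[str, int]:
    """Reject replies that skip a dimension or use an out-of-range score."""
    scores = json.loads(raw)
    for name in RUBRIC:
        value = scores.get(name)
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"missing or out-of-range score for {name!r}")
    return {name: scores[name] for name in RUBRIC}


def judge(golden: str, report: str, call_llm):
    """Return per-dimension scores and their unweighted mean."""
    scores = parse_scores(call_llm(build_judge_prompt(golden, report)))
    return scores, sum(scores.values()) / len(scores)


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always returns the same scores."""
    return json.dumps(
        {
            "technical_accuracy": 5,
            "investigation_depth": 4,
            "reasoning_quality": 4,
            "actionability": 3,
        }
    )


scores, overall = judge(
    golden="Deployment reads an invalid value from a ConfigMap",
    report="The ConfigMap value is invalid, so the Deployment fails to start.",
    call_llm=stub_llm,
)
print(scores, overall)
```

Strict parsing is the design choice worth noting: a judge reply that omits a dimension raises an error and never silently scores zero, which keeps runs comparable across agent variants.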

Comparison Tool
• You need a way to compare different models fast
• To stay ahead of the curve you need to quickly assess new models and new use-cases
• Is Sonnet 3.7 better than 3.5? Is DeepSeek better than Claude?
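At its core, a comparison tool of this kind ranks models by their scores on the same scenarios. The sketch below assumes per-scenario scores between 0 and 1 are already available; the model names, scenario names, and numbers are invented placeholders and not benchmark results.

```python
from statistics import mean


def compare(results: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Rank models by mean score over the scenarios every model has run.

    Restricting to shared scenarios keeps the comparison like-for-like.
    """
    if not results:
        return []
    shared = set.intersection(*(set(scores) for scores in results.values()))
    if not shared:
        raise ValueError("models have no scenarios in common")
    ranking = [
        (model, mean(scores[s] for s in shared)) for model, scores in results.items()
    ]
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Placeholder data for illustration only.
RESULTS = {
    "model-a": {"scenario-1": 1.0, "scenario-2": 0.5, "scenario-3": 0.0},
    "model-b": {"scenario-1": 1.0, "scenario-2": 1.0},
}

ranking = compare(RESULTS)
for model, score in ranking:
    print(f"{model}: {score:.2f}")
```

Here `scenario-3` is ignored because only one model ran it, so a newly added model can be ranked as soon as it has covered part of the suite.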

Comparison Tool

Benchmarking: Who Is the Leading AI SRE?

How Do We Compare?
We tested the leading K8s AI SRE agents on the market with 30 scenarios.
In this example: a deployment relying on a ConfigMap with an invalid value.
Here are the results 👉

Only Komodor’s Klaudia correctly detected the root cause in 28 of the 30 scenarios.
Klaudia also suggested exact remediation instructions, completing the troubleshooting cycle end to end 👇

Remember Why You’re Doing This
• Velocity
• Operational Efficiency
• Cost Savings

Demo Time
Let’s recap & see some AI SREs in action
