The_Evolution_of_Bits_AI_SRE.pdf

Nulab Inc. (株式会社ヌーラボ)

March 13, 2026

Transcript

  1. The Evolution of Bits AI SRE: Me, the Organization, and the Future

    2026/03/13 @ Frontiers of AIOps with Datadog
    Nulab Inc. Hisatomo Futahashi (@futahashi)
  2. Hisatomo Futahashi (@futahashi)

    • Principal Engineer / AWS Alliance Lead @Nulab Inc. (8y)
    • Technical Advisor @Horizon Technology Co. (2y)
    • SRE @Ignission G.K. (3mo)
    • SRE @XXXXXX (Starting in 3d)
    • 2022 & 2023 APN AWS Top Engineers (Software)
    • 2023 Japan AWS All Certifications Engineers
  3. Datadog & Me

    • Promoting Datadog adoption at Nulab and my side jobs
    • Founding member and organizer of JDDUG Fukuoka
    • Speaker at Datadog Live Tokyo 2025 (June)
    • Speaker at Datadog Live Tokyo 2025 (December)
    • Exhibited at the JDDUG booth at Datadog Summit (October)
    • Visited the Datadog Japan office for a total of 6 days
  4. 4

  5. Agenda

    • Background
    • Falling in Love with Bits
    • Bits Use Cases
    • Key Takeaways & Impact
    • Tips for Onboarding Bits to Your Organization
    • Summary
  6. Accelerating Complexity

    • Engineering Teams: 20+
    • History: 20+ years
    • Cacoo Users: 400M+
    • Backlog Paid Subscriptions: 15k+
    • Hosts: 300+
    • Containers: 1500+
    • Services: 50+
    • Repositories: 300+
    • AWS Accounts: 35+
    • Headcount 2x in 6y
    • Diverse Tech Stack (Scala / Go / Kotlin etc.)
    • Cross-Product Integration
    • IPO

    Ensuring Reliability Amidst Unpredictable Complexity
  7. 100% Reliability Is Rarely the Right Answer

    You might expect Google to try to build 100% reliable services, ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer. Further, users typically don't notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability! With this in mind, rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users' overall happiness (with features, service, and performance) is optimized.

    Ref: https://sre.google/sre-book/embracing-risk/
  8. 100% Reliability Is Rarely the Right Answer: My Interpretation

    ① Over-Engineering Reliability = Higher Costs + Slower Velocity + Diminishing Returns for Users = No Value Added
    ② SRE = Reliability × Velocity × Cost Efficiency = Maximize Overall Value through Balance

    Ref: https://sre.google/sre-book/embracing-risk/
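The smartphone claim in the quoted passage can be checked with a little arithmetic. The following is a minimal Python sketch, assuming the device and the service fail independently so that their availabilities multiply; the function names are illustrative and do not come from the SRE book or the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes per year at a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def end_to_end(*availabilities: float) -> float:
    """Availability of components in series, assuming independent failures."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


if __name__ == "__main__":
    device = 0.99  # the "99% reliable smartphone" from the quote
    for service in (0.9999, 0.99999):
        combined = end_to_end(device, service)
        print(
            f"service {service:.5f}: "
            f"{downtime_minutes_per_year(service):.1f} min/year on its own, "
            f"{combined:.4%} end to end"
        )
```

On its own, the service goes from about 52.6 to about 5.3 minutes of downtime per year. Seen through the 99% device, the end-to-end figure only moves from roughly 98.990% to 98.999%, which is the diminishing return the slide describes.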
  9. The Path to Sustainable Reliability

    Current (Scaling Chaos)
    • Increasing System Complexity
    • Multi-Product Ecosystem
    • Shifting Landscapes

    Challenges (The Wall of Load & Silos)
    • Slower Triage
    • Investigation Silos
    • Reaching Cognitive Load Limits

    Ideal (Sustainable Reliability)
    • Rapid Recovery
    • Standardized Investigations
    • Leveraging Knowledge as Assets

    Key: Embracing Acceptable Failure & Leveraging AI for Recovery
  10. Desperate for a Helping Paw 😇

    Critical Systems
    • Core infrastructure supporting all Nulab services
    • Handles sensitive org/user data
    • Auth, Billing, and Security-hardening features

    The Burden of Return
    • Sunset of a primary product
    • New tech stack & lack of domain knowledge
    • Juggling childcare and housework

    Solo SRE
    • Moved to a team of 1 after 3 departures
    • Getting by with help from kind neighboring teams (Platform, other SREs, Developers, etc.)
  11. Discovering Bits AI SRE at DASH 2025 🗽

    Wow, Bits is impressive! (How long until it's real?)
  12. The Bits AI SRE Dependency Cycle (by a certain Datadog CSM) 🔄

    • Trial is available!
    • Maybe just a little...
    • Wow, Bits is impressive!
    • Too busy to touch it
    • I need more SREs!
    • Bits AI SRE!

    Experiencing AIOps through a 30-day trial and a strong nudge
    Rising AIOps
  13. 1. Solo Midnight DDoS Response (1/3)

    Concluded the investigation in 4 minutes during a 3:50 AM emergency solo response
  14. 1. Solo Midnight DDoS Response (3/3)

    Mirrored SRE workflows by leveraging logs and dashboards exactly as a human would
  15. 2. Low-Priority Alert Triage (1/1)

    • 10+ alerts per month that aren't urgent but can't be ignored
      ‒ App not deploying
      ‒ A few 5xx errors
      ‒ Help!
      ‒ Too many slow queries
      ‒ Internal tools inaccessible
      ‒ Pod restarting repeatedly
      ‒ It just broke suddenly
      ‒ High CPU usage, huh?
      ‒ Feels slow, doesn't it?

    Often impossible to start immediately and ties up multiple people
  16. Praised by a certain Datadog TEM who happened to see it

    • This Notebook is incredibly well-structured
    • Futahashi-san is impressive!
    • Heh...
  17. Side Story: Chat with a Datadog SE

    • Investigations often finish with just Bits now
    • I don't even open AWS Console or Terminal anymore
    • Futahashi-san is impressive!
    • A world I never would have believed a year ago
    • That's insane
  18. But is Bits in its current state enough?

    • Monitors still require manual judgment on which ones should trigger an investigation.
    • APM latency alerts lack the ability to provide additional context.
    • Invisible issues remain uninvestigable if they aren't caught by monitors or APM latency.
    • Unsupported data sources cannot be utilized for investigations.

    We must lower the barrier to entry and expand coverage for Bits
  19. A Request to a Certain Datadog SSE for the Improvement and Adoption of Bits

    Higher Expectations for "Bits AI SRE"

    While "Bits AI SRE" is highly effective even without tuning, I expect it to achieve even greater results with less effort in the future. Specifically, it currently requires pre-configured monitors as investigation triggers, and context must be added to those monitors to gain deeper insights. In the future, I hope it will be able to autonomously recognize anomalies and leverage context based on all available information within Datadog.

    Ref: https://www.datadoghq.com/ja/blog/datadog-live-tokyo-2025-recap/
  20. A Request to a Certain Datadog SSE for the Improvement and Adoption of Bits: My Interpretation

    I WANT TO investigate without monitors and master all the data and just make my life easy! lol

    Futahashi-san is impressive!

    Ref: https://www.datadoghq.com/ja/blog/datadog-live-tokyo-2025-recap/
  21. Expanded Support Coverage 🔥

    • Metrics
    • APM traces
    • Logs
    • Dashboards
    • Events
    • Change Tracking
    • Source code (GitHub only)
    • Watchdog
    • Real User Monitoring
    • Network Path
    • Database Monitoring
    • Continuous Profiler

    Leveraging rich context to boost investigation depth and scope

    Ref: https://docs.datadoghq.com/bits_ai/bits_ai_sre/investigate_issues/
  22. Hot New Preview Features 🔥

    • Generate code fixes: Bits creates PRs to fix the root cause
    • Investigate Synthetics API: Support for Synthetics Monitors
    • Recommended Actions: Triage steps based on investigation results
    • Bits.md: Provide shared context to Bits for investigations
    • Start investigations from APM latency graphs & APM Watchdog stories: Trigger investigations without monitors
    • Prompt-based investigations: Trigger investigations without monitors

    Ref: https://www.datadoghq.com/product-preview/bits-ai-sre-pilot-features/
  23. Log Trend Analysis via Natural Language (1/2)

    Compare log trends between yesterday and today from XX:XX to XX:XX
  24. Log Trend Analysis via Natural Language (2/2)

    Comprehensive investigations for any query, not just incidents.

    Bits is impressive! Woof! (50s)
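What the prompt on the previous slide asks for amounts to counting logs in the same time window on two consecutive days and diffing the counts. Below is a rough Python sketch of that comparison, assuming the logs are already available as hypothetical `(timestamp, status)` pairs. It does not call Datadog, and it makes no claim about how Bits works internally.

```python
from collections import Counter
from datetime import date, datetime, time, timedelta


def window(day, start, end):
    """Build the [start, end) datetime window for one calendar day."""
    return datetime.combine(day, start), datetime.combine(day, end)


def count_by_status(records, start, end):
    """Count (timestamp, status) records whose timestamp is in [start, end)."""
    return Counter(status for ts, status in records if start <= ts < end)


def compare_days(records, today, start, end):
    """Per-status change in log count: today minus yesterday, same window."""
    yesterday = today - timedelta(days=1)
    y_counts = count_by_status(records, *window(yesterday, start, end))
    t_counts = count_by_status(records, *window(today, start, end))
    statuses = sorted(set(y_counts) | set(t_counts))
    return {status: t_counts[status] - y_counts[status] for status in statuses}


if __name__ == "__main__":
    # Hypothetical sample data, invented purely for illustration.
    sample = [
        (datetime(2026, 3, 12, 10, 5), "error"),
        (datetime(2026, 3, 12, 11, 30), "error"),  # outside the window
        (datetime(2026, 3, 13, 10, 5), "error"),
        (datetime(2026, 3, 13, 10, 20), "error"),
        (datetime(2026, 3, 13, 10, 30), "info"),
    ]
    print(compare_days(sample, date(2026, 3, 13), time(10, 0), time(11, 0)))
```

With the sample data, the result is `{'error': 1, 'info': 1}`: one more error and one more info log than in yesterday's window.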
  25. Cost Anomaly Investigation via Natural Language (2/2)

    The assistant everyone in the world has been waiting for!!!!

    Bits is impressive! Woof! (2 mins)

    That's insane
  26. Bits Assistant Use Cases

    • Answers based on official documentation
    • Insights from dashboards and service pages
    • Insights from cloud costs
    • Insights from SLOs
    • Capacity planning
    • Investigating issues undetected by monitors

    The possibilities are endless!!!!
  27. Comparing AI Features & Use Cases

    • Bits AI SRE
      ‒ Primary Role: Autonomous SRE
      ‒ Primary Use: Incident Response
      ‒ UI: Specific Pages / Slack / (Mobile)
      ‒ Target Users: SRE / Ops
    • Bits Assistant
      ‒ Primary Role: Conversational Assistant
      ‒ Primary Use: Data Search, Insights, General Q&A
      ‒ UI: All Pages / Slack / (Mobile)
      ‒ Target Users: Every human being
    • Datadog MCP
      ‒ Primary Role: Gateway to External AI
      ‒ Primary Use: Using Datadog via external LLMs
      ‒ UI: External LLM UI (Kiro, Claude, etc.)
      ‒ Target Users: Dev / SRE / Ops
    • Pup CLI
      ‒ Primary Role: AI-powered CLI
      ‒ Primary Use: Human/AI ops & Scripting
      ‒ UI: Terminal / Scripts
      ‒ Target Users: AI / SRE / Ops

    Each one is unique, and all are impressive!!
  28. What Bits Has Brought to Us 🤲

    • Rethinking Incident Response
      ‒ Accelerated / Autonomous / Asset-driven ➡➡➡ Fundamental Change
    • Focusing on the Essence of SRE
      ‒ Bits protects the “NOW”, I protect the “FUTURE”
      ‒ Data Consolidation / Context Enrichment / Feature Utilization ➡➡➡ Enhanced Observability
    • Strengthening Organization-wide Incident Response
      ‒ And when I shape Bits, Bits also shapes me
      ‒ Human-AI Collaboration / Visibility / Memory ➡➡➡ Improvement Loop

    Bits is not just automation; it is a fundamental transformation of our world
  29. Treat Bits as a valued member of the team 🫶

    • Evangelize Bits: Share features and celebrate successes
    • Prepare the "Dog Park": Deploy and utilize based on Datadog Best Practices
    • Train Bits: Foster growth through known alerts and chaos injection
    • Secure the "Dog Food" budget: Guardrails and cost monitoring

    Lead as a "Capable Senior" through Servant Leadership

    I'll keep pushing forward too! 🏋

    Futahashi-san is impressive!
  30. Tips for Empowering Bits 💡

    • Configure Unified Service Tagging
    • Propagate context to correlate telemetry
    • Enable key features and ingest essential data
    • Create effective Dashboards for investigation
    • Include telemetry links in Monitor Messages
    • Train through Feedback and Memory
    • Set up Slack Integration

    A better environment for humans is a better environment for our Good Boy

    Ref: https://docs.datadoghq.com/bits_ai/bits_ai_sre/knowledge_sources
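The first tip, Unified Service Tagging, comes down to applying the same three reserved tags (`env`, `service`, `version`) to every piece of telemetry so that metrics, traces, and logs can be correlated. Here is a minimal Python sketch, assuming a hypothetical service named `billing-api`. The helper functions are illustrative and not part of any Datadog library, while `DD_ENV`, `DD_SERVICE`, and `DD_VERSION` are the environment variables Datadog documents for this purpose.

```python
def unified_service_tags(env: str, service: str, version: str) -> dict:
    """Return the three reserved tags that Unified Service Tagging relies on."""
    tags = {"env": env, "service": service, "version": version}
    missing = [name for name, value in tags.items() if not value]
    if missing:
        raise ValueError(f"missing unified service tag(s): {', '.join(missing)}")
    return tags


def as_env_vars(tags: dict) -> dict:
    """Render the tags as the DD_ENV / DD_SERVICE / DD_VERSION variables."""
    return {f"DD_{name.upper()}": value for name, value in tags.items()}


def as_tag_list(tags: dict) -> list:
    """Render the tags in key:value form, as they appear on telemetry."""
    return [f"{name}:{value}" for name, value in tags.items()]


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    tags = unified_service_tags("prod", "billing-api", "1.4.2")
    print(as_env_vars(tags))
    print(as_tag_list(tags))
```

Generating both forms from one place keeps the container environment and any manually tagged telemetry from drifting apart, which is the kind of consistency the tips above aim for.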
  31. To a New World with Datadog AIOps

    • Bits Transforms the World of Incident Response
      ‒ A Partner with "Super-Canine" Insight and Expression, Growing alongside us like a Teammate
    • Our Mission as SREs: Build the Ultimate Dog Park for Bits
      ‒ Provide the Best Datadog Environment for the Best Partner
    • Confidence in the Rapid Evolution of Datadog and Bits
      ‒ Reach a Higher Realm simply by continuing your journey with Datadog

    Let's Evolve and Transform Our Organizations Together with Bits!

    Datadog is impressive!