Unknown Unknowns: How to Fix Ka-Booms in Complex Systems

In this talk we will build together a toolbox to tame complicated (known unknowns) and complex (unknown unknowns) systems in production: monitoring and traceability, reproducibility and access, team dynamics and incident management, and more.

The talk will start with a brief overview of the Cynefin framework (https://en.m.wikipedia.org/wiki/Cynefin_framework) and translate its lessons into a set of practices that we can use to operate production systems at scale, plus a set of tools that everyone can use to improve their systems' resilience and their teams' well-being.

We will use concrete examples of systems that have strict uptime requirements, mostly from the banking industry.

You will leave this talk with a better understanding of how to scale your system operations sustainably, and of what lets you deploy to production with confidence and then go to bed.

Riccardo Capraro

November 29, 2025

Transcript

  1. Our objectives 1. Talk about complexity 2. Learn how to tame it 3. The subtle art of fixing things (and not letting them break again)
  3. Me • aka Riccardo Capraro • DevOps / SRE / Platform / Cloud • Located in Vienna • Located in Trento
  4. Example Payment system running in Production • RTO (Recovery Time Objective) • RPO (Recovery Point Objective)
  5. Example Payment system running in Production • RTO (Recovery Time Objective) • RPO (Recovery Point Objective) • Releases
  6. Example Payment system running in Production • RTO (Recovery Time Objective) • RPO (Recovery Point Objective) • Releases • Stakeholders management
  7. Innocence • We find out immediately because an alert is fired • A runbook tells us how to fix it or whom to contact
  8. Innocence • We find out immediately because an alert is fired • A runbook tells us how to fix it or whom to contact • Bug is fixed and hotfix is released
  9. Innocence • We find out immediately because an alert is fired • A runbook tells us how to fix it or whom to contact • Bug is fixed and hotfix is released • Customers happy :)
  10. Experience • It takes 1 week to find out it is broken • It takes 1 hour to find out who should fix it
  11. Experience • It takes 1 week to find out it is broken • It takes 1 hour to find out who should fix it • It takes 1 day to fix it
  12. Experience • It takes 1 week to find out it is broken • It takes 1 hour to find out who should fix it • It takes 1 day to fix it • It takes 1 more day to fix the fix
  13. Innocence • Reproducibility and Access • Monitoring and Traceability • Understand team dynamics • Incident management
  14. Experience • Assess impact and communicate it clearly to stakeholders • Narrow down the problem scope ASAP (team/microservice/network)
  15. Experience • Assess impact and communicate it clearly to stakeholders • Narrow down the problem scope ASAP (team/microservice/network) • Turn (all) debug logs on as soon as possible
  16. Experience • Assess impact and communicate it clearly to stakeholders • Narrow down the problem scope ASAP (team/microservice/network) • Turn (all) debug logs on as soon as possible • Reproduce it locally even if it is hard
  17. Experience • Assess impact and communicate it clearly to stakeholders • Narrow down the problem scope ASAP (team/microservice/network) • Turn (all) debug logs on as soon as possible • Reproduce it locally even if it is hard • Do not discard hypotheses: make sure to test them thoroughly
  18. Innocence • Monitor everything • Document everything • People have the right access to what they need • We know who is responsible for what
  19. Experience • Communication is more important than the actual fix • Assess impact thoroughly; urgency is usually unnecessary
  20. Experience • Communication is more important than the actual fix • Assess impact thoroughly; urgency is usually unnecessary • Hone the process
  21. Conclusions • Fixing problems is 10% about the fix and 90% about the process • Fixing problems requires different approaches based on the complexity of the domain
  22. Conclusions • Fixing problems is 10% about the fix and 90% about the process • Fixing problems requires different approaches based on the complexity of the domain ◦ try to add constraints to the problem
  23. References • SQUER and Vienna DevOps Meetup • Listen to the Cynefin framework explained by Dave Snowden • Look at this interesting article about operations and SRE (site reliability engineering) culture • Look at some of the videos from our CodeCrafts conference on YouTube ◦ Sociotechnical systems and design • Learn about residuality theory with Barry O’Reilly
  24. Cynefin framework (notes) • Sense-making framework (a way of looking at reality), not a model (a representation of reality) • 5 domains: starts from ordered, complex, and chaotic; adds disorder and splits ordered into 2 • Disorder ◦ not knowing what system you are in ◦ In chaotic there cannot be order, in disorder there might be order • Ordered (linear relationship between cause and effect): obvious (relationship is evident and indisputable, driving example) and complicated ◦ Over-constrained obvious goes into chaos ◦ Complicated (not obvious except maybe for experts); the mistake is to impose best practice on a good-practice domain; experts can only create hypotheses in the complex domain, and can only be trusted in the complicated domain
  25. Cynefin framework (notes) • Complex: I don’t know what the right solution is until I act ◦ Heuristics, parallel experiments ◦ Everything I do changes the situation ◦ Exaptation: radical repurposing (beer bottle example) • Chaotic: no pattern, no constraint ◦ First action: create constraints or get rid of constraints
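
The "Experience" timeline in slides 10 to 12 can be added up and compared against a Recovery Time Objective like the one named in slides 4 to 6. The snippet below is a minimal sketch of that arithmetic, not something shown in the talk. The phase durations come from the slides; the 4-hour RTO and the names `phases`, `ASSUMED_RTO`, `time_to_restore`, and `share_of_total` are assumptions made for illustration.

```python
from datetime import timedelta

# Phase durations as stated on the "Experience" slides (10 to 12).
phases = {
    "find out it is broken": timedelta(weeks=1),
    "find out who should fix it": timedelta(hours=1),
    "fix it": timedelta(days=1),
    "fix the fix": timedelta(days=1),
}

# Hypothetical RTO: the talk does not state a concrete value.
ASSUMED_RTO = timedelta(hours=4)


def time_to_restore(phases):
    """Total elapsed time from breakage to a working fix."""
    return sum(phases.values(), timedelta())


def share_of_total(phases):
    """Fraction of the total time spent in each phase."""
    total = time_to_restore(phases)
    return {name: duration / total for name, duration in phases.items()}


total = time_to_restore(phases)
meets_rto = total <= ASSUMED_RTO
shares = share_of_total(phases)
slowest_phase = max(shares, key=shares.get)

print(f"Time to restore: {total}")  # 9 days, 1:00:00
print(f"Within assumed RTO of {ASSUMED_RTO}: {meets_rto}")
for name, share in shares.items():
    print(f"  {name}: {share:.1%}")
print(f"Largest contributor: {slowest_phase}")
```

Under these assumptions, detection accounts for roughly three quarters of the total time to restore, which is consistent with the conclusion that fixing problems is mostly about the process rather than the fix itself.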