Unknown Unknowns: How to Fix Ka-Booms in Complex Systems

Unknown Unknowns: How to Fix Ka-Booms in Complex Systems v1.0

Our objectives
1. Talk about complexity
2. Learn how to tame it
3. The subtle art of fixing things (and not letting them break again)

Agenda
• Complexity theory and Cynefin framework
• Songs of Innocence
• Songs of Experience

Me
• aka Riccardo Capraro
• DevOps / SRE / Platform / Cloud
• Located in Vienna
• Located in Trento

Complexity theory

Cynefin

Example: payment system running in production
• RTO (Recovery Time Objective)
• RPO (Recovery Point Objective)
• Releases
• Stakeholder management
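As a rough illustration of the two objectives on this slide, the sketch below checks a single incident against an RTO and an RPO. It assumes RTO bounds the downtime and RPO bounds the window of data that can be lost (time since the last restorable backup). The `Incident` class, the `meets_objectives` function, the timestamps, and the thresholds are all hypothetical, not taken from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of one outage of the payment system."""
    last_good_backup: datetime   # latest point the data can be restored to
    outage_start: datetime       # when the service went down
    service_restored: datetime   # when the service was back up


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> dict:
    """Compare actual downtime and data-loss window against RTO and RPO."""
    downtime = incident.service_restored - incident.outage_start
    data_loss_window = incident.outage_start - incident.last_good_backup
    return {
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Illustrative values only: 90 minutes of downtime, 10 minutes of lost data.
incident = Incident(
    last_good_backup=datetime(2024, 1, 1, 11, 50),
    outage_start=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 13, 30),
)
result = meets_objectives(
    incident, rto=timedelta(hours=1), rpo=timedelta(minutes=15)
)
print(result)
```

With these made-up numbers the RPO is met but the RTO is not, which is the kind of result worth communicating to stakeholders early.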

Innocence & Experience

What do we do when something breaks?

Innocence
• We find out immediately because an alert is fired
• A runbook tells us how to fix it or whom to contact
• The bug is fixed and a hotfix is released
• Customers happy :)

Experience
• It takes 1 week to find out it is broken
• It takes 1 hour to find out who should fix it
• It takes 1 day to fix it
• It takes 1 more day to fix the fix
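Adding up the durations on this slide shows where the time goes. The sketch below is only arithmetic on the four numbers above; the phase names and the `phases` dictionary are labels chosen for illustration.

```python
from datetime import timedelta

# Durations taken from the slide; the keys are illustrative labels.
phases = {
    "detect": timedelta(weeks=1),
    "find owner": timedelta(hours=1),
    "fix": timedelta(days=1),
    "fix the fix": timedelta(days=1),
}

# Total time from breakage to a working fix.
total = sum(phases.values(), timedelta())

# Fraction of the total spent in each phase.
shares = {name: duration / total for name, duration in phases.items()}

for name, share in shares.items():
    print(f"{name:12s} {share:.0%}")
```

Detection accounts for roughly three quarters of the total, so in this scenario faster detection would help far more than a faster fix.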

What do we need to fix problems?

Innocence
• Reproducibility and Access
• Monitoring and Traceability
• Understand team dynamics
• Incident management

Experience
• Assess impact and communicate it clearly to stakeholders
• Narrow down the problem scope ASAP (team/microservice/network)
• Turn (all) debug logs on as soon as possible
• Reproduce it locally even if it is hard
• Do not discard hypotheses: make sure to test them thoroughly
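Turning debug logs on quickly is easier when the log level can be changed at runtime. Below is a minimal sketch using Python's standard `logging` module. The logger name `payments` and the `set_debug` helper are hypothetical, and how the switch is triggered in a real service (config reload, admin endpoint, signal) is left open.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("payments")  # hypothetical service logger


def set_debug(enabled: bool, logger_names=("payments",)) -> None:
    """Switch the named loggers between DEBUG and INFO without a restart."""
    level = logging.DEBUG if enabled else logging.INFO
    for name in logger_names:
        logging.getLogger(name).setLevel(level)


set_debug(True)
log.debug("debug logging is now on")   # emitted
set_debug(False)
log.debug("this line is suppressed")   # not emitted
```

Passing more logger names to `set_debug` widens the scope, which fits the advice to narrow the problem down to a team, microservice, or network segment first.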

What do we do to make it better?

Innocence
• Monitor everything
• Document everything
• People have the right access to what they need
• We know who is responsible for what

Experience
• Communication is more important than the actual fix
• Assess impact thoroughly; urgency is usually unnecessary
• Hone the process

Conclusions
• Fixing problems is 10% about the fix and 90% about the process
• Fixing problems requires different approaches based on the complexity of the domain
  ◦ Try to add constraints to the problem

Thank you!

References
• SQUER and Vienna DevOps Meetup
• Listen to the Cynefin framework explained by Dave Snowden
• Look at this interesting article about operations and SRE (site reliability engineering) culture
• Look at some of the videos from our CodeCrafts conference on YouTube
  ◦ Sociotechnical systems and design
• Learn about residuality theory with Barry O’Reilly

Questions?

Bonus slides

Cynefin framework (notes)
• Sense-making framework (a way of looking at reality), not a model (a representation of reality)
• 5 domains: starts from ordered, complex, and chaotic; adds disorder and splits ordered into 2
• Disorder
  ◦ Not knowing what system you are in
  ◦ In chaotic there cannot be order; in disorder there might be order
• Ordered (linear relationship between cause and effect): obvious (the relationship is evident and indisputable; driving example) and complicated
  ◦ Over-constrained obvious goes into chaos
  ◦ Complicated (not obvious except maybe for experts); a mistake is to impose best practice on a good-practice domain; experts can only create hypotheses in the complex domain and can only be trusted in the complicated domain

Cynefin framework (notes)
• Complex: I don’t know what the right solution is until I act
  ◦ Heuristics, parallel experiments
  ◦ Everything I do changes the situation
  ◦ Exaptation: radical repurposing (beer bottle example)
• Chaotic: no pattern, no constraint
  ◦ First action: create constraints or get rid of constraints
