Subbu Allamaraju
@sallamar
Expedia Group
See https://www.subbu.org for slides.
Slide 3
Slide 3 text
“Chaos Engineering is the discipline of
experimenting on a system in order to
build confidence in the system’s capability
to withstand turbulent conditions in
production.”
Photo by Jilbert Ebrahimi on Unsplash
From Principles of Chaos Engineering
(https://principlesofchaos.org/)
Slide 4
Slide 4 text
3
4
1
Stable
Ok. Back
to Stable
Your assumed
fault boundary
Actual fault
boundary
Slide 5
Slide 5 text
Q
What is the system?
How do you form a hypothesis?
How do you ensure system safety?
Why should anyone listen to you?
Slide 6
Slide 6 text
Cloud
adoption
Slide 7
Slide 7 text
Cloud
adoption
How to build
resilience?
Slide 8
Slide 8 text
Chaos
engineering
Cloud
adoption
How to build
resilience?
Slide 9
Slide 9 text
Chaos
engineering
Cloud
adoption
How to build
resilience?
Not everyone likes you to attack their apps
Slide 10
Slide 10 text
Chaos
engineering
Cloud
adoption
How to build
resilience?
Randomly killing servers
uncovers trivial issues only
Not everyone likes you to attack their apps
Slide 11
Slide 11 text
Chaos
engineering
Randomly killing servers
uncovers trivial issues only
Not everyone likes you to attack their apps
You can’t/won’t test
more serious failures
Slide 12
Slide 12 text
Chaos
engineering
Randomly killing servers
uncovers trivial issues only
Not everyone likes you to attack their apps
You can’t/won’t test
more serious failures
Self-doubt
Slide 13
Slide 13 text
Chaos
engineering
Randomly killing servers
uncovers trivial issues only
Not everyone likes you to attack their apps
You can’t/won’t test
more serious failures
Self-doubt
Slide 14
Slide 14 text
Null hypothesis
Chaos engineering has nothing to
do with the system’s capability to withstand
turbulent conditions in production.
Slide 15
Slide 15 text
How is the system behaving
in production today?
How do we make the system
withstand turbulent conditions?
Photo by Hush Naidoo on Unsplash
Slide 16
Slide 16 text
“as designed” “as it is”
Biased by your expectation of how
the system is supposed to work
The real
world
Metrics
Alerts
Logs
Docs
Diagrams
Code
Incidents
Slide 17
Slide 17 text
Let’s observe the
real world
Slide 18
Slide 18 text
1. Changes are contributing to the majority
of impact
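One way to check this observation against your own incident history is to total impact by what triggered each incident. The sketch below is only an illustration: the incident records, the `trigger` labels, and the `impact_minutes` field are hypothetical and not taken from the talk.

```python
from collections import Counter

# Hypothetical incident records. The ids, trigger labels, and impact
# minutes are invented for illustration only.
incidents = [
    {"id": "INC-1", "trigger": "change", "impact_minutes": 120},
    {"id": "INC-2", "trigger": "change", "impact_minutes": 45},
    {"id": "INC-3", "trigger": "capacity", "impact_minutes": 30},
    {"id": "INC-4", "trigger": "dependency", "impact_minutes": 60},
    {"id": "INC-5", "trigger": "change", "impact_minutes": 90},
]


def impact_share_by_trigger(records):
    """Return each trigger's fraction of the total impact minutes."""
    totals = Counter()
    for record in records:
        totals[record["trigger"]] += record["impact_minutes"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {trigger: minutes / grand_total for trigger, minutes in totals.items()}


shares = impact_share_by_trigger(incidents)
# List triggers from largest to smallest share of impact.
for trigger, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{trigger}: {share:.0%}")
```

How you define "impact" (minutes, affected customers, lost revenue) is a choice you would make for your own organization.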
Slide 19
Slide 19 text
2. Second/higher order effects are hard
to troubleshoot
Retired
App
Big
App
Tech
Debt
Another
Big App
Slide 20
Slide 20 text
3. We don’t understand where a failure
stops
Contributing to cascading
failures and long recovery
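A rough way to reason about where a failure could stop is to walk the dependency graph upstream from the failed component. The sketch below assumes a hypothetical map of which service calls which (the service names are invented), and it treats every caller as affected. Real systems with timeouts, fallbacks, or bulkheads may contain the failure sooner, so this is an upper bound and not a prediction.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
calls = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}


def blast_radius(calls, failed):
    """Return every service that directly or transitively calls `failed`."""
    # Invert the map so we can look up the callers of each service.
    callers = {}
    for service, dependencies in calls.items():
        for dependency in dependencies:
            callers.setdefault(dependency, set()).add(service)

    # Breadth-first walk upstream from the failed service.
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen - {failed}


print(sorted(blast_radius(calls, "inventory")))
```

Comparing this "as designed" upper bound with what actually happened in past incidents is one way to find where the assumed and actual fault boundaries differ.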
Slide 21
Slide 21 text
1. Improve release safety through progressive
delivery
2. Ensure tighter fault domain boundaries in
the “as designed” state
3. Implement safety in the “as designed”
state
4. Only then pick what to test
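As an illustration of the first item, progressive delivery can be sketched as a staged rollout that widens traffic only while a health signal stays within bounds. Everything named here is an assumption made for the example: the stage percentages, the `max_error_rate` threshold, and the `observe_error_rate` callback stand in for whatever your deployment tooling and monitoring actually provide.

```python
def progressive_rollout(stages, observe_error_rate, max_error_rate=0.01):
    """Advance through traffic stages, stopping at the first unhealthy one.

    stages: increasing traffic percentages, for example [1, 5, 25, 100].
    observe_error_rate: hypothetical callback that returns the error rate
        observed while the new version serves the given percentage.
    """
    completed = []
    for percent in stages:
        error_rate = observe_error_rate(percent)
        if error_rate > max_error_rate:
            # Stop widening exposure; the caller would roll back here.
            return {"status": "rolled_back", "failed_at": percent, "completed": completed}
        completed.append(percent)
    return {"status": "released", "failed_at": None, "completed": completed}


# Hypothetical observer: the new version degrades once it gets 25% of traffic.
def fake_observer(percent):
    return 0.002 if percent < 25 else 0.05


result = progressive_rollout([1, 5, 25, 100], fake_observer)
print(result)
```

The point of the staging is that a bad change is caught while it affects a small share of traffic, which limits the impact of changes.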
Slide 22
Slide 22 text
These observations are relevant
but not as much as the act of
learning from incidents.
Because the “as it is” state might
tell you what to do.
Slide 23
Slide 23 text
Pick the most critical areas
But how to prioritize
such work?
Articulate value
Slide 24
Slide 24 text
Randomly killing servers
uncovers trivial issues only
Not everyone likes you to attack their apps
You can’t/won’t test
more serious failures
Self-doubt
Learn from
incidents
Slide 25
Slide 25 text
Randomly killing servers
uncovers trivial issues only
Not everyone likes you to attack their apps
You can’t/won’t test
more serious failures
Self-doubt
Learn from
incidents
Make value
based decisions
Slide 26
Slide 26 text
How to learn
from incidents?
Slide 27
Slide 27 text
How does it feel when
you learn from incidents?
Slide 28
Slide 28 text
1. You develop mental models of how
the system works when it does, and why
it doesn’t when it doesn’t.
2. You’re not chasing symptoms but are
beginning to understand the system
as a whole.
Slide 29
Slide 29 text
3. You start to understand the role of
people, processes and tools for
success as well as failure.
4. You are able to articulate the value
of hygiene investments.
Slide 30
Slide 30 text
Lessons learned
Slide 31
Slide 31 text
1. Learn from incidents.
2. There is no lesson 2.
Slide 32
Slide 32 text
@sallamar – https://www.subbu.org
Source: https://www.trover.com/li/vCu0/nZk3
Thank you