Who we are
Julie Gunderson
@julie_gund
▪ Sr. Reliability Advocate at Gremlin
▪ devopsdays Boise organizer
▪ avid mushroom-hunter
▪ came to Spain by plane
Slide 3
Slide 3 text
Who we are
Kerim Satirli
@ksatirli
▪ Sr. Developer Advocate at
▪ recovering conference organizer
▪ aerial photography aficionado
▪ came to Spain by plane
Slide 6
Slide 6 text
complex
systems
fail
Slide 7
Slide 7 text
OOPS WE DIDN’T TEST THAT
GRAPHS AND TRACES
AUTOMATED VISIBILITY
SURVEILLANCE STATE
29
23
11
1
?
WHAT DOES "OBSERVABILITY" MEAN TO YOU?
known-knowns
known-unknowns
uhm.
marketing
Slide 8
Slide 8 text
gather requirements
build the thing
release the thing
experience incident
Slide 9
Slide 9 text
experience incident
detect incident
restore service
Slide 10
Slide 10 text
OOPS WE DIDN’T TEST THAT
GRAPHS AND TRACES
WHAT DOES "OBSERVABILITY" MEAN TO YOU?
known-knowns
known-unknowns
INFORMATION YOU DIDN’T THINK YOU NEEDED
BUT COULD ACTUALLY SOLVE YOUR PROBLEM
unknown-unknowns
Slide 11
Slide 11 text
pillars of observability
Slide 12
Slide 12 text
If you can't log it,
you can't investigate it.
Slide 13
Slide 13 text
If you can't trace it,
you can't debug it.
Slide 14
Slide 14 text
If you can't measure it,
you can't understand it.
Slide 15
Slide 15 text
If you can't measure it,
you can't understand it.
If you can’t interpret it,
you can’t harden it.
Slide 16
Slide 16 text
trace, log, measure
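The three verbs on this slide can be shown side by side in one request handler. What follows is a minimal sketch using only the Python standard library; `handle_request`, the in-memory `metrics` counter, and the `durations_ms` list are hypothetical names chosen for illustration, not part of the talk or of any particular observability product.

```python
import logging
import time
import uuid
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("demo")

metrics = Counter()   # hypothetical in-memory metric store
durations_ms = []     # hypothetical list of latency samples


def handle_request(path: str) -> str:
    # trace: one identifier follows the request through every log line
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()

    # log: record discrete events with enough context to investigate later
    log.info("trace=%s event=start path=%s", trace_id, path)

    result = f"ok:{path}"  # stand-in for the real work

    # measure: keep numeric samples that can be aggregated over time
    elapsed_ms = (time.perf_counter() - start) * 1000
    metrics["requests_total"] += 1
    durations_ms.append(elapsed_ms)

    log.info("trace=%s event=end path=%s ms=%.3f", trace_id, path, elapsed_ms)
    return result
```

In a real system the trace identifier would usually be propagated to downstream services and the metrics exported to a monitoring backend; both are left out here to keep the sketch self-contained.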
Slide 17
Slide 17 text
experience incident
time to detect
time to restore
restore service
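The two intervals on this timeline can be computed from three timestamps. This is a small sketch, assuming that time to restore is counted from the moment of detection; some teams count it from the start of the incident instead. The function name `incident_durations` is hypothetical.

```python
from datetime import datetime, timedelta


def incident_durations(
    started: datetime, detected: datetime, restored: datetime
) -> dict[str, timedelta]:
    """Split one incident into time to detect and time to restore."""
    if not (started <= detected <= restored):
        raise ValueError("expected started <= detected <= restored")
    return {
        "time_to_detect": detected - started,
        # Assumption: the restore clock starts at detection.
        "time_to_restore": restored - detected,
    }
```

For example, an incident that starts at 10:00, is detected at 10:07, and is resolved at 10:40 has a time to detect of 7 minutes and a time to restore of 33 minutes.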
Slide 18
Slide 18 text
experimentation
Slide 19
Slide 19 text
Science and Chaos
▪ observe baseline metrics: understand and document your system’s nominal state
Slide 20
Slide 20 text
Science and Chaos
▪ observe baseline metrics
▪ formulate hypothesis: given a nominal state, does this function the way we expect it to?
Slide 21
Slide 21 text
Science and Chaos
▪ observe baseline metrics
▪ formulate hypothesis
▪ understand blast radius: what parts of the system will be impacted by an experiment?
Slide 22
Slide 22 text
Science and Chaos
▪ observe baseline metrics
▪ formulate hypothesis
▪ understand blast radius
▪ set abort conditions: when do we stop experimenting and revert to a nominal state?
Slide 23
Slide 23 text
Science and Chaos
▪ observe baseline metrics
▪ formulate hypothesis
▪ understand blast radius
▪ set abort conditions
▪ analyze the results: what learnings can be derived from the experiment?
Slide 24
Slide 24 text
Science and Chaos
▪ observe baseline metrics
▪ formulate hypothesis
▪ understand blast radius
▪ set abort conditions
▪ analyze the results
▪ learn and improve: share the results and derive actions from the data
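The steps on this slide can be sketched as a small experiment harness. This is an illustration under stated assumptions, not a definitive implementation: `probe`, `inject`, and `revert` are hypothetical callables supplied by the caller, the hypothesis is reduced to "the median of the probed metric stays within `tolerance` times the baseline", and the abort condition is a single threshold at `abort_factor` times the baseline.

```python
import statistics
from typing import Callable


def run_experiment(
    probe: Callable[[], float],    # returns one metric sample, e.g. latency in ms
    inject: Callable[[], None],    # starts the fault; its scope is the blast radius
    revert: Callable[[], None],    # returns the system to its nominal state
    baseline_samples: int = 20,
    experiment_samples: int = 20,
    tolerance: float = 1.5,
    abort_factor: float = 5.0,
) -> dict:
    # observe baseline metrics
    baseline_median = statistics.median(probe() for _ in range(baseline_samples))

    # formulate hypothesis: the median stays below this limit during the fault
    limit = baseline_median * tolerance

    # set abort conditions: stop as soon as one sample crosses this line
    abort_at = baseline_median * abort_factor

    observed = []
    aborted = False
    inject()
    try:
        for _ in range(experiment_samples):
            value = probe()
            observed.append(value)
            if value > abort_at:
                aborted = True
                break
    finally:
        revert()  # always revert, even if the probe raises

    # analyze the results; sharing them and deriving actions is up to the team
    observed_median = statistics.median(observed)
    return {
        "baseline_median": baseline_median,
        "observed_median": observed_median,
        "hypothesis_held": (not aborted) and observed_median <= limit,
        "aborted": aborted,
    }
```

The `try`/`finally` block is the important design choice: the revert step runs whether the experiment finishes, aborts, or fails with an exception.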
Slide 25
Slide 25 text
four golden signals
Slide 26
Slide 26 text
understand your recovery plan
verify backups are known-good
⚠
Slide 27
Slide 27 text
Signals and Simulations
Latency
the time it takes to service a request
how to simulate
▪ programmatically inject delays
▪ change DNS and network settings
▪ switch to different geo zones
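The first bullet, programmatically injecting delays, can be sketched as a decorator that sleeps before calling the wrapped function. The names `inject_latency` and `lookup` are hypothetical, and the delay range is an arbitrary choice for the example.

```python
import functools
import random
import time


def inject_latency(min_s: float, max_s: float, probability: float = 1.0, rng=None):
    """Delay calls to the wrapped function by a random amount of time."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Only a share of calls is delayed when probability < 1.0.
            if rng.random() < probability:
                time.sleep(rng.uniform(min_s, max_s))
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_latency(0.05, 0.1)
def lookup(key: str) -> str:
    # Stand-in for a real dependency call.
    return key.upper()
```

Wrapping a single call site like this keeps the blast radius small: only callers of `lookup` see the added delay.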
Slide 28
Slide 28 text
Signals and Simulations
Errors
the rate of requests that fail to complete correctly
how to simulate
▪ terminate services
▪ revoke access credentials
▪ change system clock and timezones
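The methods on this slide act on real infrastructure. As a smaller stand-in, the sketch below injects failures in code and measures the resulting error rate, matching the definition of the signal above. `InjectedFault`, `flaky`, and `error_rate` are hypothetical names.

```python
import random
from typing import Callable


class InjectedFault(RuntimeError):
    """Raised in place of a real failure during an experiment."""


def flaky(func: Callable, failure_rate: float, rng: random.Random) -> Callable:
    """Wrap func so that roughly failure_rate of its calls raise InjectedFault."""

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)

    return wrapper


def error_rate(call: Callable[[], object], attempts: int) -> float:
    """Return the share of calls that fail to complete."""
    failures = 0
    for _ in range(attempts):
        try:
            call()
        except Exception:
            failures += 1
    return failures / attempts
```

Passing a seeded `random.Random` makes a run repeatable, which helps when comparing results before and after a fix.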
Slide 29
Slide 29 text
Signals and Simulations
Traffic
the demand that is placed on a system at any point
how to simulate
▪ create traffic spikes with tooling
▪ change load-balancing to create hot spots
▪ re-deploy on over-subscribed compute
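Dedicated load-testing tools are the usual way to create traffic spikes. To show the idea without naming one, here is a minimal sketch that fires a burst of concurrent calls at any Python callable using the standard library's thread pool. The function `spike` and the `target` callable are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def spike(target: Callable[[int], object], requests: int, concurrency: int) -> dict:
    """Send `requests` calls to `target`, `concurrency` at a time."""

    def timed_call(i: int):
        start = time.perf_counter()
        try:
            target(i)
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(requests)))

    return {
        "requests": requests,
        "errors": sum(1 for ok, _ in results if not ok),
        "max_latency_s": max(duration for _, duration in results),
    }
```

In practice `target` would wrap an HTTP call to the system under test; that part is left out so the sketch runs without a network.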
Slide 30
Slide 30 text
Signals and Simulations
Saturation
the measure of system utilization and constraints
how to simulate
▪ alter scaling logic to delay triggering
▪ fill up empty disk space and memory
▪ run stress or consume.exe
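Filling up disk space can be sketched with a ballast file of a fixed size. The function `fill_disk` and the file name `ballast.bin` are hypothetical, and the sketch assumes a filesystem that stores zero bytes as written; on a compressing or deduplicating filesystem the file may consume far less space than its size suggests.

```python
import os


def fill_disk(directory: str, total_bytes: int, chunk_bytes: int = 1024 * 1024) -> str:
    """Write a ballast file of exactly total_bytes and return its path."""
    path = os.path.join(directory, "ballast.bin")
    chunk = b"\0" * chunk_bytes
    written = 0
    with open(path, "wb") as fh:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            fh.write(chunk[:n])
            written += n
        fh.flush()
        os.fsync(fh.fileno())  # make sure the data reaches the disk
    return path
```

Writing a bounded amount into a dedicated directory keeps the experiment easy to revert: deleting the ballast file restores the nominal state.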
Slide 31
Slide 31 text
Take-aways
▪ this was never about any one tool
▪ codify resources and processes
▪ method over madness
▪ culture breeds reliability
(words matter)
Slide 32
Slide 32 text
Normal Accidents
Living with High-Risk Technologies
Charles Perrow
1984
Fatal Defect
Chasing Killer Computer Bugs
Ivars Peterson
1995
Accelerate
Building and Scaling High Performing Technology Organizations
Nicole Forsgren et al.
2018