Better Reliability through Observability (and Experimentation)

Julie Gunderson, Gremlin & Kerim Satirli, HashiCorp

Who we are: Julie Gunderson (@julie_gund)
▪ Sr. Reliability Advocate at Gremlin
▪ devopsdays Boise organizer
▪ avid mushroom-hunter
▪ came to Spain by plane

Who we are: Kerim Satirli (@ksatirli)
▪ Sr. Developer Advocate at HashiCorp
▪ recovering conference organizer
▪ aerial photography aficionado
▪ came to Spain by plane

complex systems fail

What does "observability" mean to you?
▪ oops we didn’t test that
▪ graphs and traces
▪ automated visibility
▪ surveillance state
Numbers shown on the slide: 29, 23, 11, 1
Slide annotations: known-knowns, known-unknowns, uhm., marketing

gather requirements ▪ build the thing ▪ release the thing ▪ experience incident

experience incident ▪ detect incident ▪ restore service

What does "observability" mean to you?
▪ oops we didn’t test that
▪ graphs and traces
▪ information you didn’t think you needed but could actually solve your problem
Slide annotations: known-knowns, known-unknowns, unknown-unknowns

pillars of observability

If you can't log it, you can't investigate it.

If you can't trace it, you can't debug it.

If you can't measure it, you can't understand it.

If you can't measure it, you can't understand it.
If you can't interpret it, you can't harden it.

trace, log, measure

experience incident ▪ time to detect ▪ time to restore ▪ restore service
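
The two intervals on this timeline can be computed from three timestamps per incident. The sketch below is a minimal illustration, not something shown in the talk: the `Incident` record, its field names, and the sample timestamps are all made up for the example, and it assumes time to restore is counted from detection (some teams count from the start of the incident instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; the field names are illustrative."""
    started_at: datetime   # when users first experienced the incident
    detected_at: datetime  # when monitoring or a human noticed it
    restored_at: datetime  # when service was back to its nominal state

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_restore(self) -> timedelta:
        # Assumption: measured from detection, not from the incident start.
        return self.restored_at - self.detected_at


def mean(durations: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(durations, timedelta()) / len(durations)


# Made-up sample data, only to exercise the arithmetic.
incidents = [
    Incident(datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 12), datetime(2023, 1, 1, 10, 42)),
    Incident(datetime(2023, 2, 1, 9, 0), datetime(2023, 2, 1, 9, 4), datetime(2023, 2, 1, 9, 24)),
]
mttd = mean([i.time_to_detect for i in incidents])
mttr = mean([i.time_to_restore for i in incidents])
print(f"mean time to detect: {mttd}, mean time to restore: {mttr}")
```

Tracking both numbers separately makes it visible whether better observability (shorter detection) or better recovery procedures (shorter restoration) would help more.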

experimentation

Science and Chaos
▪ observe baseline metrics: understand and document your system’s nominal state

Science and Chaos
▪ observe baseline metrics
▪ formulate hypothesis: given a nominal state, does this function the way we expect it to?

Science and Chaos
▪ observe baseline metrics
▪ formulate hypothesis
▪ understand blast radius: what parts of the system will be impacted by an experiment?

Science and Chaos
▪ observe baseline metrics
▪ formulate hypothesis
▪ understand blast radius
▪ set abort conditions: when to stop experimenting and revert back to a nominal state?

Science and Chaos
▪ observe baseline metrics
▪ formulate hypothesis
▪ understand blast radius
▪ set abort conditions
▪ analyze the results: what learnings can be derived from the experiment?

Science and Chaos
▪ observe baseline metrics
▪ formulate hypothesis
▪ understand blast radius
▪ set abort conditions
▪ analyze the results
▪ learn and improve: share the results and derive actions from the data
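
The steps on this slide can be sketched as a small experiment harness. This is a hedged illustration under stated assumptions, not the speakers' tooling: the `Experiment` class, its fields, the `run` function, and the simulated metric are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical experiment definition; every name here is illustrative."""
    hypothesis: str                  # what we expect to stay true
    blast_radius: tuple[str, ...]    # components the experiment may touch
    measure: Callable[[], float]     # reads one metric, e.g. error rate
    inject: Callable[[], None]       # starts the fault
    revert: Callable[[], None]       # restores the nominal state
    abort_threshold: float           # abort when the metric exceeds this
    tolerance: float                 # allowed drift from the baseline
    samples: int = 5


def run(exp: Experiment) -> dict:
    baseline = exp.measure()                 # observe baseline metrics
    observed: list[float] = []
    aborted = False
    exp.inject()
    try:
        for _ in range(exp.samples):
            value = exp.measure()
            observed.append(value)
            if value > exp.abort_threshold:  # abort condition reached
                aborted = True
                break
    finally:
        exp.revert()                         # always return to nominal state
    worst = max(observed, default=baseline)
    return {                                 # results to analyze and share
        "hypothesis": exp.hypothesis,
        "blast_radius": exp.blast_radius,
        "baseline": baseline,
        "worst": worst,
        "aborted": aborted,
        "hypothesis_held": not aborted and worst - baseline <= exp.tolerance,
    }


# Simulated system: the metric rises slightly while the fault is active.
state = {"fault": False}
report = run(Experiment(
    hypothesis="error rate stays within 0.05 of baseline with one replica down",
    blast_radius=("checkout-service",),
    measure=lambda: 0.04 if state["fault"] else 0.01,
    inject=lambda: state.update(fault=True),
    revert=lambda: state.update(fault=False),
    abort_threshold=0.20,
    tolerance=0.05,
))
print(report)
```

Putting the revert call in a `finally` block means the system returns to its nominal state even when a measurement raises or an abort condition trips.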

four golden signals

⚠ understand your recovery plan ▪ verify backups are known-good

Signals and Simulations: Latency
the time it takes to service a request
how to simulate
▪ programmatically inject delays
▪ change DNS and network settings
▪ switch to different geo zones
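
Programmatic delay injection can be as small as a decorator around a request handler. The following is a minimal sketch under assumptions: `inject_latency` and `handle_request` are hypothetical names, and a real experiment would more likely inject delay at the network or proxy layer than inside application code.

```python
import functools
import random
import time


def inject_latency(min_s: float, max_s: float, probability: float = 1.0):
    """Decorator that sleeps for a random delay before calling the function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Delay only a fraction of calls when probability is below 1.0.
            if random.random() < probability:
                time.sleep(random.uniform(min_s, max_s))
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(min_s=0.05, max_s=0.10)
def handle_request(payload: str) -> str:
    # Stand-in for a real request handler.
    return payload.upper()


start = time.perf_counter()
result = handle_request("ping")
elapsed = time.perf_counter() - start
print(f"{result} served in {elapsed:.3f}s")
```

The `probability` parameter keeps the blast radius adjustable: a low value delays only a small share of requests.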

Signals and Simulations: Errors
the rate of requests that fail to complete correctly
how to simulate
▪ terminate services
▪ revoke access credentials
▪ change system clock and timezones
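
The slide's techniques act on infrastructure (services, credentials, clocks). As a simpler stand-in, the sketch below injects failures in-process and measures the resulting error rate. It is an illustration only: `flaky`, `InjectedFault`, and `lookup_user` are hypothetical names, and the 30% failure rate is an arbitrary choice for the example.

```python
import random


class InjectedFault(RuntimeError):
    """Raised in place of a real failure during an experiment."""


def flaky(func, failure_rate: float, rng: random.Random):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def lookup_user(user_id: int) -> dict:
    # Stand-in for a call to a real dependency.
    return {"id": user_id}


rng = random.Random(42)  # seeded so the run is repeatable
call = flaky(lookup_user, failure_rate=0.3, rng=rng)

failures = 0
total = 1000
for i in range(total):
    try:
        call(i)
    except InjectedFault:
        failures += 1

# Error rate: the share of requests that failed to complete correctly.
error_rate = failures / total
print(f"observed error rate: {error_rate:.1%}")
```

Comparing the observed rate with what dashboards and alerts report during the run is one way to check that the error signal is actually being captured.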

Signals and Simulations: Traffic
the demand that is placed on a system at any point
how to simulate
▪ create traffic spikes with tooling
▪ change load-balancing to create hot spots
▪ re-deploy on over-subscribed compute
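
A traffic spike can be approximated by raising request count and concurrency against a handler while recording in-flight requests. This is a rough sketch under assumptions: `Counter`, `handle_request`, and `send` are hypothetical, the handler is a local function rather than a network service, and a real experiment would normally use a dedicated load-testing tool.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Counter:
    """Tracks in-flight requests so the spike is visible as a metric."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0
        self.served = 0

    def __enter__(self) -> "Counter":
        with self._lock:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
        return self

    def __exit__(self, *exc) -> bool:
        with self._lock:
            self.in_flight -= 1
            self.served += 1
        return False


counter = Counter()


def handle_request(_: int) -> None:
    with counter:
        time.sleep(0.02)  # stand-in for real work


def send(requests: int, concurrency: int) -> None:
    """Issue the given number of requests with bounded concurrency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(handle_request, range(requests)))


send(requests=20, concurrency=2)     # baseline traffic
baseline_peak = counter.peak
send(requests=200, concurrency=20)   # spike: ten times the load
print(f"baseline peak={baseline_peak}, spike peak={counter.peak}, served={counter.served}")
```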

Signals and Simulations: Saturation
the measure of system utilization and constraints
how to simulate
▪ alter scaling logic to delay triggering
▪ fill up empty disk space and memory
▪ run stress or consume.exe
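
Filling disk space can be done with a bounded ballast file that is removed afterwards. The sketch below is a cautious illustration, not a recommended tool: `fill_disk` is a hypothetical helper, the 8 MiB cap is deliberately tiny, and it writes into a throwaway temporary directory rather than a volume that matters.

```python
import os
import shutil
import tempfile


def fill_disk(directory: str, total_bytes: int, chunk_bytes: int = 1024 * 1024) -> str:
    """Write total_bytes of zeros to a new file in directory; return its path."""
    fd, path = tempfile.mkstemp(dir=directory, prefix="saturation-")
    chunk = b"\0" * chunk_bytes
    with os.fdopen(fd, "wb") as handle:
        remaining = total_bytes
        while remaining > 0:
            step = min(chunk_bytes, remaining)
            handle.write(chunk[:step])
            remaining -= step
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes really hit the disk
    return path


with tempfile.TemporaryDirectory() as workdir:
    before = shutil.disk_usage(workdir).free
    ballast = fill_disk(workdir, total_bytes=8 * 1024 * 1024)  # small, fixed cap
    written = os.path.getsize(ballast)
    after = shutil.disk_usage(workdir).free
    print(f"wrote {written} bytes; free space went from {before} to {after}")
    os.remove(ballast)  # revert to the nominal state
```

The fixed byte cap acts as a built-in abort condition: the experiment cannot consume more space than was agreed up front.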

Take-aways
▪ this was never about any one tool
▪ codify resources and processes
▪ method over madness
▪ culture breeds reliability (words matter)

▪ Normal Accidents: Living with High-Risk Technologies, Charles Perrow, 1984
▪ Fatal Defect: Chasing Killer Computer Bugs, Ivars Peterson, 1995
▪ Accelerate: Building and Scaling High-Perf Orgs, Nicole Forsgren et al., 2018

Gracias!
