Better Reliability through Observability (and Experimentation)

In this presentation, Julie Gunderson (Sr. Reliability Advocate at Gremlin) and I look at how to improve your service's reliability through experimentation.

This version of the talk was given at KubeCon Europe (Valencia) in May 2022.

Kerim Satirli

May 19, 2022

Transcript

  1. Who we are: Julie Gunderson (@julie_gund)
     ▪ Sr. Reliability Advocate at Gremlin
     ▪ devopsdays Boise organizer
     ▪ avid mushroom-hunter
     ▪ came to Spain by plane
  2. Who we are: Kerim Satirli (@ksatirli)
     ▪ Sr. Developer Advocate at
     ▪ recovering conference organizer
     ▪ aerial photography aficionado
     ▪ came to Spain by plane
  3. WHAT DOES "OBSERVABILITY" MEAN TO YOU?
     ▪ OOPS WE DIDN’T TEST THAT
     ▪ GRAPHS AND TRACES
     ▪ AUTOMATED VISIBILITY
     ▪ SURVEILLANCE STATE
     Counts: 29, 23, 11, 1
     Annotations: known-knowns, known-unknowns, uhm., marketing
  4. WHAT DOES "OBSERVABILITY" MEAN TO YOU?
     ▪ OOPS WE DIDN’T TEST THAT
     ▪ GRAPHS AND TRACES
     ▪ INFORMATION YOU DIDN’T THINK YOU NEEDED BUT COULD ACTUALLY SOLVE YOUR PROBLEM
     Annotations: known-knowns, known-unknowns, unknown-unknowns
  5. If you can't measure it, you can't understand it. If you can't interpret it, you can't harden it.
  6. Science and Chaos
     ▪ observe baseline metrics
     ▪ formulate hypothesis: given a nominal state, does this function the way we expect it to?
  7. Science and Chaos
     ▪ observe baseline metrics
     ▪ formulate hypothesis
     ▪ understand blast radius: what parts of the system will be impacted by an experiment?
  8. Science and Chaos
     ▪ observe baseline metrics
     ▪ formulate hypothesis
     ▪ understand blast radius
     ▪ set abort conditions: when to stop experimenting and revert back to a nominal state?
  9. Science and Chaos
     ▪ observe baseline metrics
     ▪ formulate hypothesis
     ▪ understand blast radius
     ▪ set abort conditions
     ▪ analyze the results: what learnings can be derived from the experiment?
  10. Science and Chaos
      ▪ observe baseline metrics
      ▪ formulate hypothesis
      ▪ understand blast radius
      ▪ set abort conditions
      ▪ analyze the results
      ▪ learn and improve: share the results and derive actions from the data
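
The experiment steps listed on the "Science and Chaos" slides can be sketched as a small harness. This is a minimal illustration and not something shown in the talk: `run_experiment`, `ExperimentResult`, and every parameter name are hypothetical, and the "system" is a dictionary standing in for a real service and its metrics backend.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class ExperimentResult:
    baseline: float          # metric average in the nominal state
    observed: List[float]    # metric samples taken while the fault was active
    aborted: bool            # True if the abort condition fired
    hypothesis_held: bool    # True if every sample stayed within tolerance


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    revert: Callable[[], None],
    abort_threshold: float,
    tolerance: float,
    baseline_samples: int = 5,
    steps: int = 10,
) -> ExperimentResult:
    # Observe baseline metrics while the system is in its nominal state.
    baseline = mean(measure() for _ in range(baseline_samples))
    observed: List[float] = []
    aborted = False
    # Hypothesis: the metric stays within `tolerance` of the baseline.
    inject()
    try:
        for _ in range(steps):
            value = measure()
            observed.append(value)
            # Abort condition: stop as soon as the metric crosses the limit.
            if value > abort_threshold:
                aborted = True
                break
    finally:
        # Always revert to the nominal state, even if measuring raised.
        revert()
    # Analyze the results against the hypothesis.
    held = not aborted and all(abs(v - baseline) <= tolerance for v in observed)
    return ExperimentResult(baseline, observed, aborted, held)


# Hypothetical system: latency is 100 ms, and the fault adds 400 ms.
system = {"extra_ms": 0.0}
result = run_experiment(
    measure=lambda: 100.0 + system["extra_ms"],
    inject=lambda: system.update(extra_ms=400.0),
    revert=lambda: system.update(extra_ms=0.0),
    abort_threshold=300.0,
    tolerance=50.0,
)
print(result.aborted, result.hypothesis_held)
```

In this sketch the injected fault pushes the metric past the abort threshold on the first sample, so the harness stops early and reverts. The blast radius is kept small simply because only the `system` dictionary is touched; a real experiment would need to reason about that explicitly.
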
  11. Signals and Simulations: Latency
      the time it takes to service a request
      how to simulate
      ▪ programmatically inject delays
      ▪ change DNS and network settings
      ▪ switch to different geo zones
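
One way to "programmatically inject delays" is to wrap a request handler so that it sleeps before doing its work. The decorator below is a hedged sketch: `inject_latency` and `handle_request` are hypothetical names, and the handler is a stand-in for real application code.

```python
import functools
import random
import time


def inject_latency(delay_s: float, probability: float = 1.0, rng=random.random):
    """Delay the wrapped call by `delay_s` seconds for a fraction of calls."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # rng() returns a float in [0.0, 1.0), so probability=1.0 always delays.
            if rng() < probability:
                time.sleep(delay_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_s=0.05)
def handle_request(payload: str) -> str:
    # Hypothetical handler standing in for real request processing.
    return payload.upper()


start = time.perf_counter()
response = handle_request("ping")
elapsed = time.perf_counter() - start
print(f"{response} took {elapsed * 1000:.0f} ms")
```

Lowering `probability` delays only some requests, which is closer to how tail latency tends to show up in practice.
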
  12. Signals and Simulations: Errors
      the rate of requests that fail to complete correctly
      how to simulate
      ▪ terminate services
      ▪ revoke access credentials
      ▪ change system clock and timezones
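
The error signal can also be exercised in-process, without terminating anything, by making a dependency fail for a share of calls and measuring the resulting error rate. This is an illustrative sketch only: `InjectedFault`, `with_failures`, and `error_rate` are hypothetical names, and the lambda stands in for a call to a real dependency.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to stand in for a failed dependency."""


def with_failures(func, failure_rate: float, rng: random.Random):
    """Return a version of `func` that fails for roughly `failure_rate` of calls."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def error_rate(call, attempts: int) -> float:
    """Fraction of `attempts` calls that raised InjectedFault."""
    failures = 0
    for _ in range(attempts):
        try:
            call()
        except InjectedFault:
            failures += 1
    return failures / attempts


# Seeded generator so repeated runs of the experiment are comparable.
flaky_lookup = with_failures(lambda: "ok", failure_rate=0.2, rng=random.Random(42))
measured = error_rate(flaky_lookup, attempts=1000)
print(f"measured error rate: {measured:.1%}")
```
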
  13. Signals and Simulations: Traffic
      the demand that is placed on a system at any point
      how to simulate
      ▪ create traffic spikes with tooling
      ▪ change load-balancing to create hot spots
      ▪ re-deploy on over-subscribed compute
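
A traffic spike can be approximated with a thread pool that fires many requests at once. The sketch below assumes a hypothetical `fake_service` function in place of a real endpoint; dedicated load-testing tooling would normally do this job, and `traffic_spike` is only meant to show the shape of the idea.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_service(request_id: int) -> int:
    # Hypothetical service call: 10 ms of simulated work per request.
    time.sleep(0.01)
    return request_id


def traffic_spike(target, requests: int, concurrency: int):
    """Send `requests` calls to `target` using `concurrency` worker threads.

    Returns the responses in request order and the achieved requests per second.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        responses = list(pool.map(target, range(requests)))
    elapsed = time.perf_counter() - start
    return responses, requests / elapsed


responses, throughput = traffic_spike(fake_service, requests=50, concurrency=10)
print(f"{len(responses)} responses at {throughput:.0f} requests/s")
```

Raising `concurrency` while watching latency and error metrics is one way to find the point at which the target starts to degrade.
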
  14. Signals and Simulations: Saturation
      the measure of system utilization and constraints
      how to simulate
      ▪ alter scaling logic to delay triggering
      ▪ fill up empty disk space and memory
      ▪ run stress or consume.exe
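
Filling up disk space can be tried safely by writing a bounded "ballast" file into a scratch directory and removing it afterwards. This is a sketch under stated assumptions, not the approach from the talk: `fill_disk` and `disk_used_fraction` are hypothetical helpers, and the 8 MiB size is deliberately tiny so the example is harmless to run.

```python
import os
import shutil
import tempfile


def fill_disk(directory: str, megabytes: int) -> str:
    """Write a ballast file of `megabytes` MiB into `directory` and return its path."""
    path = os.path.join(directory, "ballast.bin")
    chunk = b"\0" * (1024 * 1024)
    with open(path, "wb") as handle:
        for _ in range(megabytes):
            handle.write(chunk)
        # Push the data to the device so the usage numbers reflect it.
        handle.flush()
        os.fsync(handle.fileno())
    return path


def disk_used_fraction(directory: str) -> float:
    """Used share (0.0 to 1.0) of the filesystem that holds `directory`."""
    usage = shutil.disk_usage(directory)
    return usage.used / usage.total


# The temporary directory is deleted on exit, which acts as the revert step.
with tempfile.TemporaryDirectory() as scratch:
    ballast = fill_disk(scratch, megabytes=8)
    ballast_size = os.path.getsize(ballast)
    used_fraction = disk_used_fraction(scratch)
    print(f"wrote {ballast_size} bytes, disk is {used_fraction:.1%} full")
ballast_removed = not os.path.exists(ballast)
```
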
  15. Take-aways
      ▪ this was never about any one tool
      ▪ codify resources and processes
      ▪ method over madness
      ▪ culture breeds reliability (words matter)
  16. ▪ Normal Accidents: Living with High-Risk Technologies, Charles Perrow, 1984
      ▪ Fatal Defect: Chasing Killer Computer Bugs, Ivars Peterson, 1995
      ▪ Accelerate: Building and Scaling High-Perf Orgs, Nicole Forsgren et al., 2018