Better Reliability through Observability (and Experimentation)

In this presentation, Julie Gunderson (Sr. Reliability Advocate at Gremlin) and I look at how to improve your service's reliability through experimentation.

This version of the talk was given at KubeCon Europe (Valencia) in May 2022.

---

Companion Code: github.com/ksatirli/better-reliability-through-observability-and-experimentation

Kerim Satirli

May 19, 2022

Transcript

  1. Better Reliability through Observability (and Experimentation) ▪ Julie Gunderson, Gremlin & Kerim Satirli, HashiCorp
  2. Who we are: Julie Gunderson (@julie_gund) ▪ Sr. Reliability Advocate at Gremlin ▪ devopsdays Boise organizer ▪ avid mushroom-hunter ▪ came to Spain by plane
  3. Who we are: Kerim Satirli (@ksatirli) ▪ Sr. Developer Advocate at HashiCorp ▪ recovering conference organizer ▪ aerial photography aficionado ▪ came to Spain by plane
  4. None
  5. None
  6. complex systems fail

  7. WHAT DOES "OBSERVABILITY" MEAN TO YOU? ▪ OOPS WE DIDN’T TEST THAT ▪ GRAPHS AND TRACES ▪ AUTOMATED VISIBILITY ▪ SURVEILLANCE STATE ▪ 29 23 11 1 ▪ known-knowns ▪ known-unknowns ▪ uhm. ▪ marketing
  8. gather requirements ▪ build the thing ▪ release the thing ▪ experience incident
  9. experience incident ▪ detect incident ▪ restore service

  10. WHAT DOES "OBSERVABILITY" MEAN TO YOU? ▪ OOPS WE DIDN’T TEST THAT ▪ GRAPHS AND TRACES ▪ INFORMATION YOU DIDN’T THINK YOU NEEDED BUT COULD ACTUALLY SOLVE YOUR PROBLEM ▪ known-knowns ▪ known-unknowns ▪ unknown-unknowns
  11. pillars of observability

  12. If you can't log it, you can't investigate it.

  13. If you can't trace it, you can't debug it.

  14. If you can't measure it, you can't understand it.

  15. If you can't measure it, you can't understand it. If you can't interpret it, you can't harden it.
  16. trace, log, measure

  17. experience incident ▪ time to detect ▪ time to restore ▪ restore service

  18. experimentation

  19. Science and Chaos ▪ observe baseline metrics: understand and document your system’s nominal state
  20. Science and Chaos ▪ observe baseline metrics ▪ formulate hypothesis: given a nominal state, does this function the way we expect it to?
  21. Science and Chaos ▪ observe baseline metrics ▪ formulate hypothesis ▪ understand blast radius: what parts of the system will be impacted by an experiment?
  22. Science and Chaos ▪ observe baseline metrics ▪ formulate hypothesis ▪ understand blast radius ▪ set abort conditions: when to stop experimenting and revert back to a nominal state?
  23. Science and Chaos ▪ observe baseline metrics ▪ formulate hypothesis ▪ understand blast radius ▪ set abort conditions ▪ analyze the results: what learnings can be derived from the experiment?
  24. Science and Chaos ▪ observe baseline metrics ▪ formulate hypothesis ▪ understand blast radius ▪ set abort conditions ▪ analyze the results ▪ learn and improve: share the results and derive actions from the data
  25. four golden signals

  26. understand your recovery plan verify backups are known-good ⚠

  27. Signals and Simulations ▪ Latency: the time it takes to service a request ▪ how to simulate: programmatically inject delays; change DNS and network settings; switch to different geo zones
  28. Signals and Simulations ▪ Errors: the rate of requests that fail to complete correctly ▪ how to simulate: terminate services; revoke access credentials; change system clock and timezones
  29. Signals and Simulations ▪ Traffic: the demand that is placed on a system at any point ▪ how to simulate: create traffic spikes with tooling; change load-balancing to create hot spots; re-deploy on over-subscribed compute
  30. Signals and Simulations ▪ Saturation: the measure of system utilization and constraints ▪ how to simulate: alter scaling logic to delay triggering; fill up empty disk space and memory; run stress or consume.exe
  31. Take-aways ▪ this was never about any one tool ▪ codify resources and processes ▪ method over madness ▪ culture breeds reliability (words matter)
  32. Normal Accidents: Living with High-Risk Technologies, Charles Perrow, 1984 ▪ Fatal Defect: Chasing Killer Computer Bugs, Ivars Peterson, 1995 ▪ Accelerate: Building and Scaling High-Perf Orgs, Nicole Forsgren et al., 2018
  33. Gracias!
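
The "Science and Chaos" steps (slides 19 to 24) and the "programmatically inject delays" idea from the Latency slide can be sketched together as a small experiment loop. This is only an illustration under stated assumptions, not the code from the talk or its companion repository: `handle_request`, `run_experiment`, `ExperimentResult`, and all thresholds are hypothetical names and values, and the "service" is a stand-in that simply sleeps.

```python
import statistics
import time
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    baseline_p95_ms: float
    experiment_p95_ms: float
    aborted: bool
    hypothesis_held: bool


def handle_request(injected_delay_s: float = 0.0) -> None:
    """Hypothetical stand-in for a real service call (assumption)."""
    time.sleep(0.001 + injected_delay_s)


def timed_call_ms(injected_delay_s: float = 0.0) -> float:
    """Measure the latency of one request in milliseconds."""
    start = time.perf_counter()
    handle_request(injected_delay_s)
    return (time.perf_counter() - start) * 1000.0


def p95(samples: list[float]) -> float:
    """95th percentile; falls back to the single value for one sample."""
    if len(samples) < 2:
        return samples[0]
    return statistics.quantiles(samples, n=20)[-1]


def run_experiment(
    injected_delay_s: float,
    hypothesis_p95_ms: float,
    abort_latency_ms: float,
    requests: int = 20,
) -> ExperimentResult:
    # 1. Observe baseline metrics: document the nominal state first.
    baseline = p95([timed_call_ms() for _ in range(requests)])

    # 2. Hypothesis: with the delay injected, p95 stays <= hypothesis_p95_ms.
    # 3. Blast radius: only handle_request() is affected in this sketch.
    samples: list[float] = []
    aborted = False
    for _ in range(requests):
        latency = timed_call_ms(injected_delay_s)
        samples.append(latency)
        # 4. Abort condition: stop injecting as soon as one request is too slow.
        if latency > abort_latency_ms:
            aborted = True
            break

    # 5. Analyze the results against the hypothesis.
    experiment = p95(samples)
    held = (not aborted) and experiment <= hypothesis_p95_ms
    return ExperimentResult(baseline, experiment, aborted, held)


if __name__ == "__main__":
    # 6. Learn and improve: share the result, derive actions from the data.
    print(run_experiment(injected_delay_s=0.005,
                         hypothesis_p95_ms=50.0,
                         abort_latency_ms=200.0))
```

In a real system the delay would be injected by tooling at the network or service layer and the abort condition would be tied to your monitoring, but the ordering of the steps stays the same: baseline first, then a bounded experiment with a clear stop rule.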