Better Reliability through Observability (and Experimentation)

Slide 1

Slide 1 text

Better Reliability through Observability (and Experimentation) Julie Gunderson, Gremlin & Kerim Satirli, HashiCorp

Slide 2

Slide 2 text

Who we are: Julie Gunderson (@julie_gund) ▪ Sr. Reliability Advocate at Gremlin ▪ devopsdays Boise organizer ▪ avid mushroom-hunter ▪ came to Spain by plane

Slide 3

Slide 3 text

Who we are: Kerim Satirli (@ksatirli) ▪ Sr. Developer Advocate at HashiCorp ▪ recovering conference organizer ▪ aerial photography aficionado ▪ came to Spain by plane

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

complex systems fail

Slide 7

Slide 7 text

WHAT DOES "OBSERVABILITY" MEAN TO YOU? ▪ OOPS WE DIDN’T TEST THAT ▪ GRAPHS AND TRACES ▪ AUTOMATED VISIBILITY ▪ SURVEILLANCE STATE (29 ▪ 23 ▪ 11 ▪ 1) known-knowns ▪ known-unknowns ▪ uhm. ▪ marketing

Slide 8

Slide 8 text

gather requirements ▪ build the thing ▪ release the thing ▪ experience incident

Slide 9

Slide 9 text

experience incident ▪ detect incident ▪ restore service

Slide 10

Slide 10 text

WHAT DOES "OBSERVABILITY" MEAN TO YOU? ▪ OOPS WE DIDN’T TEST THAT ▪ GRAPHS AND TRACES ▪ INFORMATION YOU DIDN’T THINK YOU NEEDED BUT COULD ACTUALLY SOLVE YOUR PROBLEM (known-knowns ▪ known-unknowns ▪ unknown-unknowns)

Slide 11

Slide 11 text

pillars of observability

Slide 12

Slide 12 text

If you can't log it, you can't investigate it.

Slide 13

Slide 13 text

If you can't trace it, you can't debug it.

Slide 14

Slide 14 text

If you can't measure it, you can't understand it.

Slide 15

Slide 15 text

If you can't measure it, you can't understand it. If you can’t interpret it, you can’t harden it.

Slide 16

Slide 16 text

trace, log, measure
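
To make "trace, log, measure" concrete, here is a minimal sketch using only the Python standard library. The `observed` helper, the `durations` store, and the operation names are hypothetical and not from the talk; a real setup would more likely use a dedicated tracing and metrics library.

```python
import logging
import time
import uuid
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("observed")

# Measure: operation name -> list of durations in seconds (hypothetical store).
durations = {}


@contextmanager
def observed(operation, trace_id=None):
    """Log start/end of an operation, tag it with a trace id, and time it."""
    trace_id = trace_id or uuid.uuid4().hex  # Trace: one id per request
    start = time.perf_counter()
    log.info("trace=%s op=%s status=start", trace_id, operation)
    status = "ok"
    try:
        yield trace_id
    except Exception:
        status = "error"
        raise
    finally:
        elapsed = time.perf_counter() - start
        durations.setdefault(operation, []).append(elapsed)
        log.info("trace=%s op=%s status=%s duration=%.4fs",
                 trace_id, operation, status, elapsed)


# Nested operations share the same trace id, so their log lines can be joined.
with observed("checkout") as trace_id:
    with observed("charge-card", trace_id):
        time.sleep(0.01)
```

Passing the trace id into the inner call is what lets the two log lines be correlated later; the timing list is the simplest possible stand-in for a metric.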

Slide 17

Slide 17 text

experience incident ▪ time to detect ▪ time to restore ▪ restore service
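
The two intervals on this slide can be computed from three timestamps per incident. The sketch below assumes that "time to restore" runs from detection to restoration (the slide does not say where it starts), and the `Incident` class and the sample timestamps are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # users first experience the incident
    detected: datetime  # monitoring or a human notices it
    restored: datetime  # service is back to its nominal state

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_restore(self) -> timedelta:
        # Assumption: measured from detection, not from the start.
        return self.restored - self.detected


def mean(deltas):
    """Average a sequence of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


# Made-up incidents, purely illustrative.
incidents = [
    Incident(datetime(2022, 1, 1, 10, 0), datetime(2022, 1, 1, 10, 12),
             datetime(2022, 1, 1, 10, 42)),
    Incident(datetime(2022, 2, 1, 9, 0), datetime(2022, 2, 1, 9, 4),
             datetime(2022, 2, 1, 9, 24)),
]

mttd = mean(i.time_to_detect for i in incidents)
mttr = mean(i.time_to_restore for i in incidents)
print(f"mean time to detect: {mttd}, mean time to restore: {mttr}")
```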

Slide 18

Slide 18 text

experimentation

Slide 19

Slide 19 text

Science and Chaos: ▪ observe baseline metrics: understand and document your system’s nominal state

Slide 20

Slide 20 text

Science and Chaos: ▪ observe baseline metrics ▪ formulate hypothesis: given a nominal state, does this function the way we expect it to?

Slide 21

Slide 21 text

Science and Chaos: ▪ observe baseline metrics ▪ formulate hypothesis ▪ understand blast radius: what parts of the system will be impacted by an experiment?

Slide 22

Slide 22 text

Science and Chaos: ▪ observe baseline metrics ▪ formulate hypothesis ▪ understand blast radius ▪ set abort conditions: when to stop experimenting and revert back to a nominal state?

Slide 23

Slide 23 text

Science and Chaos: ▪ observe baseline metrics ▪ formulate hypothesis ▪ understand blast radius ▪ set abort conditions ▪ analyze the results: what learnings can be derived from the experiment?

Slide 24

Slide 24 text

Science and Chaos: ▪ observe baseline metrics ▪ formulate hypothesis ▪ understand blast radius ▪ set abort conditions ▪ analyze the results ▪ learn and improve: share the results and derive actions from the data
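
The steps on this slide map fairly directly onto a small experiment loop. The sketch below is a simulation only: `run_experiment`, its parameters, and the fake latency probe (about 100 ms nominal plus noise) are assumptions made for illustration, not a tool shown in the talk.

```python
import random
import statistics


def run_experiment(delays_ms, hypothesis_limit_ms, abort_limit_ms,
                   blast_radius="one canary instance", samples=50, seed=1):
    rng = random.Random(seed)

    def probe(injected_ms):
        # Stand-in for a real measurement: ~100 ms nominal latency plus noise.
        return statistics.mean(
            rng.gauss(100.0, 5.0) + injected_ms for _ in range(samples)
        )

    baseline = probe(0.0)              # observe baseline metrics
    results, aborted = [], False
    for delay in delays_ms:            # escalate the fault step by step
        observed = probe(delay)
        results.append((delay, observed))
        if observed > abort_limit_ms:  # abort condition: stop and revert
            aborted = True
            break

    return {                           # analyze the results, then share them
        "blast_radius": blast_radius,
        "baseline_ms": baseline,
        "results": results,
        "aborted": aborted,
        # Hypothesis: mean latency stays at or below the limit at every step.
        "hypothesis_held": all(o <= hypothesis_limit_ms for _, o in results),
    }


report = run_experiment([25, 50, 100, 200],
                        hypothesis_limit_ms=160, abort_limit_ms=180)
print(report)
```

With these made-up numbers the run stops at the 100 ms step, before the 200 ms step is ever attempted, which is the point of setting abort conditions up front.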

Slide 25

Slide 25 text

four golden signals

Slide 26

Slide 26 text

⚠ understand your recovery plan ▪ verify backups are known-good

Slide 27

Slide 27 text

Signals and Simulations: Latency, the time it takes to service a request. How to simulate: ▪ programmatically inject delays ▪ change DNS and network settings ▪ switch to different geo zones
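
"Programmatically inject delays" can be as small as a wrapper around a call. This is a minimal in-process sketch; `inject_latency`, `lookup_user`, and the 50 ms delay are hypothetical names and values chosen for the example.

```python
import functools
import time


def inject_latency(delay_s):
    """Decorator factory: sleep for delay_s seconds before each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(delay_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def lookup_user(user_id):
    # Stand-in for a real dependency call.
    return {"id": user_id, "name": "example"}


def timed(func, *args):
    """Return the call's result and how long it took in seconds."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


slow_lookup = inject_latency(0.05)(lookup_user)

_, normal = timed(lookup_user, 42)
result, delayed = timed(slow_lookup, 42)
print(f"normal={normal:.4f}s delayed={delayed:.4f}s")
```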

Slide 28

Slide 28 text

Signals and Simulations: Errors, the rate of requests that fail to complete correctly. How to simulate: ▪ terminate services ▪ revoke access credentials ▪ change system clock and timezones
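
Terminating services or revoking credentials is environment-specific, so the sketch below only imitates the effect: a wrapper makes a fraction of calls fail, and the error rate is measured as failed requests over total requests. `flaky`, `InjectedFailure`, and the 20% failure rate are assumptions for illustration.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real dependency failure."""


def flaky(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFailure."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def error_rate(func, calls):
    """Errors signal: fraction of requests that fail to complete."""
    failures = 0
    for i in range(calls):
        try:
            func(i)
        except InjectedFailure:
            failures += 1
    return failures / calls


rng = random.Random(7)
handler = flaky(lambda n: n * 2, failure_rate=0.2, rng=rng)
observed_rate = error_rate(handler, 1000)
print(f"observed error rate: {observed_rate:.1%}")
```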

Slide 29

Slide 29 text

Signals and Simulations: Traffic, the demand that is placed on a system at any point. How to simulate: ▪ create traffic spikes with tooling ▪ change load-balancing to create hot spots ▪ re-deploy on over-subscribed compute
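
A traffic spike can be rehearsed in miniature by driving a handler at two concurrency levels and comparing throughput. This sketch runs entirely in-process; the `Service` class, the 10 ms of simulated work, and the request counts are invented, and real load tests would use dedicated tooling against a real endpoint.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Service:
    """Toy handler that tracks total and peak concurrent requests."""

    def __init__(self):
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0
        self.total = 0

    def handle(self, _request):
        with self._lock:
            self.in_flight += 1
            self.total += 1
            self.peak = max(self.peak, self.in_flight)
        time.sleep(0.01)  # simulated work per request
        with self._lock:
            self.in_flight -= 1


def drive(handler, requests, concurrency):
    """Send `requests` calls with `concurrency` workers; return requests/second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(handler, range(requests)))
    return requests / (time.perf_counter() - start)


service = Service()
steady_rps = drive(service.handle, requests=20, concurrency=1)
spike_rps = drive(service.handle, requests=200, concurrency=20)
print(f"steady={steady_rps:.0f} rps, spike={spike_rps:.0f} rps, peak in flight={service.peak}")
```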

Slide 30

Slide 30 text

Signals and Simulations: Saturation, the measure of system utilization and constraints. How to simulate: ▪ alter scaling logic to delay triggering ▪ fill up empty disk space and memory ▪ run stress or consume.exe
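
Filling up disk space is easy to get wrong, so the sketch below does it inside a temporary directory, against a small pretend quota, and with a built-in stop threshold that acts as the abort condition. `fill_disk`, the 1 MiB quota, and the 90% threshold are hypothetical values for illustration.

```python
import os
import tempfile


def fill_disk(directory, quota_bytes, stop_at=0.9, chunk_bytes=64 * 1024):
    """Write ballast files until one more chunk would exceed stop_at of the quota.

    Returns the resulting utilization as a fraction of quota_bytes.
    """
    used = 0
    index = 0
    while (used + chunk_bytes) / quota_bytes <= stop_at:
        path = os.path.join(directory, f"ballast-{index}.bin")
        with open(path, "wb") as handle:
            handle.write(b"\0" * chunk_bytes)
        used += chunk_bytes
        index += 1
    return used / quota_bytes


# The temporary directory and its ballast files are removed on exit.
with tempfile.TemporaryDirectory() as scratch:
    utilization = fill_disk(scratch, quota_bytes=1024 * 1024)
    files_written = len(os.listdir(scratch))

print(f"utilization={utilization:.1%} using {files_written} ballast files")
```

Keeping the ballast inside a scratch directory bounds the blast radius, and the threshold check before each write means the experiment stops itself instead of relying on someone watching a dashboard.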

Slide 31

Slide 31 text

Take-aways ▪ this was never about any one tool ▪ codify resources and processes ▪ method over madness ▪ culture breeds reliability (words matter)

Slide 32

Slide 32 text

Normal Accidents: Living with High-Risk Technologies, Charles Perrow, 1984 ▪ Fatal Defect: Chasing Killer Computer Bugs, Ivars Peterson, 1995 ▪ Accelerate: Building and Scaling High-Perf Orgs, Nicole Forsgren et al., 2018

Slide 33

Slide 33 text

Gracias!