Site Reliability Tips

Jason Hand
December 04, 2018

Five tips for getting started with Site Reliability Engineering.

Transcript

  1. Tip #1 • Observability = Reality • Set Service Level Expectations • Measure & report against specific expectations @jasonhand | #Slush18
  2. Latency: the time it takes to service a request
     Traffic: a measure of how much demand is placed on the system
     Errors: the number of failed requests
     Duration: the amount of time to process a request
     Saturation: the amount of work the resource must perform
     Utilization: the percentage of time a resource is in use
     Rate: the number of requests per second
  3. Service Level Indicator (SLI): …% (represented as a ratio or proportion)
     • # of successful HTTP calls / # of HTTP calls
     • # of operations that completed in < 10 ms / # of operations
     • # of “full quality responses” / # of responses
     • # of records processed / # of records
     ratio × 100 = % proportion
  4. Service Level Objective (SLO) (example) • SLI: HTTP requests (as reported by the load balancer) • 95% • 30-day
  5.          / year         / quarter      / month        / week         / day          / hour
     99%      3.65 days      21.6 hours     7.2 hours      1.68 hours     14.4 minutes   36 seconds
     99.9%    8.76 hours     2.16 hours     43.2 minutes   10.1 minutes   1.44 minutes   3.6 seconds
     99.99%   52.6 minutes   12.96 minutes  4.32 minutes   60.5 seconds   8.64 seconds   0.36 seconds
     99.999%  5.26 minutes   1.30 minutes   25.9 seconds   6.05 seconds   0.87 seconds   0.04 seconds
     9’s appropriate?
  6. Game Days: routinely using knowledge and structured plans to rehearse incident response, expanding and improving the current baseline. Prepare - Rehearse - Respond
  7. Tip #2 • Improve Observability • Increase Mental Models • Test hypotheses in Production
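
The SLI, SLO, and downtime arithmetic on slides 3 to 5 can be sketched in a few lines of Python. This is only an illustration, not code from the deck: the function names, the sample request counts, and the 30-day month and 90-day quarter are all assumptions (the period lengths are consistent with the figures in the downtime table).

```python
# Illustrative sketch only. All names and sample numbers here are hypothetical
# and do not come from the deck.

# Period lengths in seconds. The 30-day month and 90-day quarter are assumptions.
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "quarter": 90 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
    "day": 24 * 3600,
    "hour": 3600,
}


def sli_percent(good_events: int, total_events: int) -> float:
    """SLI as a percentage: (good events / total events) x 100."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events * 100


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target (both in %)."""
    return sli >= slo_target


def allowed_downtime_seconds(availability_percent: float, period: str) -> float:
    """Seconds of downtime a given availability level permits in one period."""
    return PERIOD_SECONDS[period] * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Made-up counts standing in for "successful HTTP calls / HTTP calls"
    # over a 30-day window.
    sli = sli_percent(good_events=9_620, total_events=10_000)
    print(f"SLI: {sli:.1f}%")
    print("Meets 95% SLO:", meets_slo(sli, 95.0))

    minutes = allowed_downtime_seconds(99.9, "month") / 60
    print(f"99.9% over a month allows about {minutes:.1f} minutes of downtime")
```

Keeping the SLI calculation separate from the SLO comparison mirrors the slides: the indicator is just a ratio, and the objective is a target applied to that ratio over a chosen window.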