
Reliability Engineering for Humans

Hannah Foxwell
November 02, 2018

The concepts and practices of Site Reliability Engineering are changing the way we build and operate our platforms and enabling us to have more meaningful conversations about availability, service-level objectives, and cost. But what are the benefits for the engineer holding the pager?

Join Hannah Foxwell to look at Site Reliability Engineering practices through a human lens. Hannah combines SRE with HumanOps and explains how to use SRE practices to improve the health and well-being of your team.

This talk was first given at Velocity Conf London 2018.

Transcript

  1. @HannahFoxwell #VelocityConf
     Reliability Engineering for Humans: Is Site Reliability Engineering Good for You?
     @HannahFoxwell, Delivery Manager @ Pivotal, DevOpsDays London | HumanOps London
  2. CloudOps at Pivotal
     [Three charts, x-axis 1/17 to 12/17: Improved Health, MTTA Down, MTTR Down]
     What did they do?
  3. “SRE is what happens when a software engineer is tasked with what used to be called operations”
     Ben Treynor, Founder of Google’s SRE Team
  4. “There is no trade off between improving performance and achieving higher levels of quality and stability. High performers do better at all these measures”
     Accelerate: The Science Behind DevOps, Nicole Forsgren, Jez Humble & Gene Kim
  5. “In 2017 we saw low performers lose some ground in stability” (Increasing MTTR and CFR from 2016-17)
     Accelerate: The Science Behind DevOps, Nicole Forsgren, Jez Humble & Gene Kim
  6. 100% Availability is not your target. So what is?

     SLO        Error Budget (per 30 Days)
     99%        432 mins
     99.5%      216 mins
     99.9%      43.2 mins
     99.95%     21.6 mins
     99.99%     4.32 mins
     99.999%    0.43 mins

     Agree your SLIs and SLOs with everyone. Yes, everyone.
     • Everyone understands the importance of reliability
     • Everyone understands the error budget and how it works
     • Everyone understands the new rules!

     Oops! We broke something. What now?
     • On-call / Playbooks / Fire drills
     • Blameless Incident Review / Retrospective
     • Review error budget
     • Reduce risk and invest in reliability
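     The error budget figures on slide 6 follow from simple arithmetic: a 30-day window has 30 × 24 × 60 = 43,200 minutes, and the budget is the fraction of that window the SLO allows to be unavailable. The following is a minimal sketch of that calculation, not something taken from the talk; the function name `error_budget_minutes` and its parameters are assumptions made for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    Hypothetical helper: the name and signature are illustrative only.
    """
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    allowed_fraction = (100 - slo_percent) / 100
    return window_minutes * allowed_fraction


if __name__ == "__main__":
    # The SLO targets listed on the slide.
    for slo in (99, 99.5, 99.9, 99.95, 99.99, 99.999):
        print(f"{slo}% -> {error_budget_minutes(slo):.2f} mins")
```

     Running it should reproduce the table on the slide, with 99.999% coming out at roughly 0.43 minutes. Changing `window_days` shows how the same SLO yields a different budget over a different review period.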
  7. “The only normal way to begin speaking a new language is to begin speaking it badly”
     Greg Thomson
  8. “Psychological safety is a shared belief that the team is safe for interpersonal risk taking”
     Amy Edmondson, Harvard Business School Professor
  9. “Psychological safety is a belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes.”
     Amy Edmondson, Harvard Business School Professor
  10. Research
      Psychological safety was studied in a medical environment
      • Teams were measured on Psychological Safety, Error Rates and Team Performance
      • Higher Psychological Safety correlated with higher Error Rates
      • However, higher Error Rates correlated with higher Team Performance
      • Better practices. More lives saved.
      Amy Edmondson, Learning from mistakes is easier said than done
  11. Hannah’s Hypothesis
      SRE Practices can transform teams by improving their Psychological Safety
      [Diagram labels: SRE Practices; Failure is Normal; Psychological Safety; Increase learning from mistakes; Boosts employee engagement; Improved innovation]
  12. What is Toil?
      “I’m too good for this toily BS” “Yes, yes you are”
      • Manual
      • Repetitive
      • Automatable
      • Tactical
      • No enduring value
      • O(n) with service growth
  13. Toil vs. Engineering Work
      “I didn’t have time to automate myself out of a job. I didn’t even have time to eat!”
  14. “If we have to staff humans to do the work, we are feeding the machines with the blood, sweat and tears of human beings”
      Joseph Bironas, Google SRE, Chapter 7: The Evolution of Automation at Google
  15. Toxic toil
      The work that hurts
      • Wakes you up at night
      • Ruins your evenings and weekends
      • Interrupts your work
      • Distracts you
      • Stresses you out
  16. Maslow’s Hierarchy of Needs
      • I’m getting enough sleep
      • I’m not afraid to fail
      • I spend enough time with my family
      • I’m good at my job
      • I’m always learning
  17. Hannah’s Hypothesis
      SRE Practices can transform teams by meeting employees’ needs
      [Diagram labels: SRE Practices; Physiological; Safety / Security; Social / Belonging; Self Esteem; Self Actualization]
  18. Tell me your story… @HannahFoxwell
      SRE: Blameless culture, Error Budget policy, SLOs