Reliability Engineering for Humans

Hannah Foxwell
November 02, 2018

The concepts and practices of Site Reliability Engineering are changing the way we build and operate our platforms and enabling us to have more meaningful conversations about availability, service-level objectives, and cost. But what are the benefits for the engineer holding the pager?

Join Hannah Foxwell to look at Site Reliability Engineering practices through a human lens. Hannah combines SRE with HumanOps and explains how to use SRE practices to improve the health and well-being of your team.

This talk was first given at Velocity Conf London 2018.

Transcript

  1. @HannahFoxwell #VelocityConf
    Reliability Engineering for Humans: Is Site Reliability Engineering Good for You?
    @HannahFoxwell, Delivery Manager @ Pivotal; DevOpsDays London | HumanOps London
  2. “You don’t need SRE unless you’re the size of Google” Anonymous CEO
  3. The Evolution of Ops: OPS, DEVOPS, SRE
  4. Naming things is hard
  5. “DevOps is not a job title” DevOps Community (2009 - Present)
  6. CloudOps at Pivotal
  7. CloudOps at Pivotal: What did they do?
    [Three charts, x-axis 1/17 to 12/17: Improved Health (scale 0 to 100); MTTA Down (minutes); MTTR Down (hours)]
  8. OK, you have my attention
  9. #HUMANOPS
  10. The wellbeing of human operators impacts the reliability of systems.
  11. Free: https://landing.google.com/sre/book
  12. “SRE is what happens when a software engineer is tasked with what used to be called operations” Ben Treynor – Founder of Google’s SRE Team
  13. Failure is Normal
  14. Reliability is Fundamental
  15. “There is no trade off between improving performance and achieving higher levels of quality and stability. High performers do better at all these measures” Accelerate: The Science Behind DevOps, Nicole Forsgren, Jez Humble & Gene Kim
  16. “In 2017 we saw low performers lose some ground in stability” (Increasing MTTR and CFR from 2016-17) Accelerate: The Science Behind DevOps, Nicole Forsgren, Jez Humble & Gene Kim
  17. SLIs, SLOs and Error Budgets
  18. SLO: Service Level Objective
  19. SLI: Service Level Indicator
  20. Error Budget
  21. 100% Availability is not your target. So what is?
    SLO        Error Budget (per 30 Days)
    99%        432 mins
    99.5%      216 mins
    99.9%      43.2 mins
    99.95%     21.6 mins
    99.99%     4.32 mins
    99.999%    0.43 mins
    Agree your SLIs and SLOs with everyone. Yes, everyone.
    • Everyone understands the importance of reliability
    • Everyone understands the error budget and how it works
    • Everyone understands the new rules!
    Oops! We broke something. What now?
    • On-call / Playbooks / Fire drills
    • Blameless Incident Review / Retrospective
    • Review error budget
    • Reduce risk and invest in reliability
  22. Set Your Service Level Objectives
    Measure Your Service Level Indicators
    Enforce Your Error Budgets
  23. “The only normal way to begin speaking a new language is to begin speaking it badly” Greg Thomson
  24. Aspirational SLOs are OK
  25. Overachieving on your SLO is less OK
  26. “You get me” - Your CFO
  27. Psychological Safety
  28. “Psychological safety is a shared belief that the team is safe for interpersonal risk taking” Amy Edmondson – Harvard Business School Professor
  29. “Psychological safety is a belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes.” Amy Edmondson – Harvard Business School Professor
  30. Research: Psychological safety was studied in a medical environment
    • Teams were measured on Psychological Safety, Error Rates and Team Performance
    • Higher Psychological Safety correlated to higher Error Rates
    • However, higher Error Rates correlated to higher Team Performance
    • Better practices. More lives saved.
    Amy Edmondson – Learning from mistakes is easier said than done
  31. Google’s Project Aristotle
  32. Hannah’s Hypothesis: SRE Practices can transform teams by improving their Psychological Safety
    [Diagram labels: SRE Practices; Failure is Normal; Psychological Safety; Increase learning from mistakes; Boosts employee engagement; Improved innovation]
  33. Let’s talk about toil
  34. What is Toil?
    “I’m too good for this toily BS” “Yes, yes you are”
    • Manual
    • Repetitive
    • Automatable
    • Tactical
    • No enduring value
    • O(n) with service growth
  35. Toil vs. Engineering Work
    “I didn’t have time to automate myself out of a job. I didn’t even have time to eat!”
  36. You should spend below 50% of your time on toil.
  37. Not all toil is equal
  38. “If we have to staff humans to do the work, we are feeding the machines with the blood, sweat and tears of human beings” Joseph Bironas – Google SRE Chapter 7: The Evolution of Automation at Google
  39. Toxic toil: The work that hurts
    • Wakes you up at night
    • Ruins your evenings and weekends
    • Interrupts your work
    • Distracts you
    • Stresses you out
  40. Maslow’s Hierarchy of Needs
    • I’m getting enough sleep
    • I’m not afraid to fail
    • I spend enough time with my family
    • I’m good at my job
    • I’m always learning
  41. Hannah’s Hypothesis: SRE Practices can transform teams by meeting employees’ needs
    [Diagram labels: SRE Practices; Physiological; Safety / Security; Social / Belonging; Self Esteem; Self Actualization]
  42. Is SRE good for you?
  43. Tell me your story… @HannahFoxwell
    [Diagram labels: SRE (repeated five times); Blameless culture; Error Budget policy; SLOs]
  44. Thank You x
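
The error budget figures on slide 21 follow from simple arithmetic: the budget is the share of the measurement window that the SLO allows the service to be unavailable. Below is a minimal Python sketch of that calculation, assuming an availability SLO expressed as a percentage and the 30-day window used on the slide. The function names `error_budget_minutes` and `remaining_budget_minutes` are illustrative and do not come from the talk.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over the window.

    Assumption: the SLO is a simple time-based availability target, as in the
    slide's table; request-based SLOs would need a different denominator.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * MINUTES_PER_DAY
    return (100 - slo_percent) / 100 * window_minutes


def remaining_budget_minutes(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Budget left after the downtime observed so far (negative if overspent)."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    # The SLO values listed on slide 21.
    for slo in (99, 99.5, 99.9, 99.95, 99.99, 99.999):
        print(f"{slo}% -> {error_budget_minutes(slo):.2f} mins per 30 days")
```

Rounded to two decimal places, the loop prints the same budgets as the slide's table. A negative result from `remaining_budget_minutes` would mean the budget for the window is already spent, which is roughly the point at which the slide suggests reviewing the error budget and investing in reliability.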