Slide 1

Slide 1 text

@HannahFoxwell #VelocityConf
Reliability Engineering for Humans
Is Site Reliability Engineering Good for You?
@HannahFoxwell, Delivery Manager @ Pivotal
DevOpsDays London | HumanOps London

Slide 2

Slide 2 text

“You don’t need SRE unless you’re the size of Google” – Anonymous CEO

Slide 3

Slide 3 text

The Evolution of Ops: OPS, DEVOPS, SRE

Slide 4

Slide 4 text

Naming things is hard

Slide 5

Slide 5 text

“DevOps is not a job title” – DevOps Community (2009 - Present)

Slide 6

Slide 6 text

CloudOps at Pivotal

Slide 7

Slide 7 text

CloudOps at Pivotal: What did they do?
[Charts, 1/17 to 12/17: Improved Health, MTTA Down, MTTR Down]

Slide 8

Slide 8 text

OK, you have my attention

Slide 9

Slide 9 text

#HUMANOPS

Slide 10

Slide 10 text

The wellbeing of human operators impacts the reliability of systems.

Slide 11

Slide 11 text

Free: https://landing.google.com/sre/book

Slide 12

Slide 12 text

“SRE is what happens when a software engineer is tasked with what used to be called operations” Ben Treynor – Founder of Google’s SRE Team

Slide 13

Slide 13 text

Failure is Normal

Slide 14

Slide 14 text

Reliability is Fundamental

Slide 15

Slide 15 text

“There is no trade off between improving performance and achieving higher levels of quality and stability. High performers do better at all these measures” – Accelerate: The Science Behind DevOps, Nicole Forsgren, Jez Humble & Gene Kim

Slide 16

Slide 16 text

“In 2017 we saw low performers lose some ground in stability” (Increasing MTTR and CFR from 2016-17) – Accelerate: The Science Behind DevOps, Nicole Forsgren, Jez Humble & Gene Kim

Slide 17

Slide 17 text

SLIs, SLOs and Error Budgets

Slide 18

Slide 18 text

SLO: Service Level Objective

Slide 19

Slide 19 text

SLI: Service Level Indicator

Slide 20

Slide 20 text

Error Budget

Slide 21

Slide 21 text

100% Availability is not your target. So what is?

SLO        Error Budget (per 30 Days)
99%        432 mins
99.5%      216 mins
99.9%      43.2 mins
99.95%     21.6 mins
99.99%     4.32 mins
99.999%    0.43 mins

Agree your SLIs and SLOs with everyone. Yes, everyone.
• Everyone understands the importance of reliability
• Everyone understands the error budget and how it works
• Everyone understands the new rules!

Oops! We broke something. What now?
• On-call / Playbooks / Fire drills
• Blameless Incident Review / Retrospective
• Review error budget
• Reduce risk and invest in reliability
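
The error budget figures on this slide follow from simple arithmetic: a 30-day window has 43,200 minutes, and the budget is the share of that window the SLO allows to be unavailable. A minimal Python sketch, assuming a purely time-based availability SLO (the function name and its parameters are illustrative and not from the talk):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO allows per window.

    Assumption: availability is measured as uptime over wall-clock time.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return window_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    # Print the budget for each SLO listed on the slide.
    for slo in (99, 99.5, 99.9, 99.95, 99.99, 99.999):
        print(f"{slo}%  {error_budget_minutes(slo):.2f} mins")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the target is worth agreeing with everyone before it is written down.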

Slide 22

Slide 22 text

• Set Your Service Level Objectives
• Measure Your Service Level Indicators
• Enforce Your Error Budgets
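
The three steps on this slide can be sketched in a few lines of Python. This is one possible reading, assuming a request-based SLI (good events divided by total events); the names `BudgetReport` and `evaluate_error_budget` are hypothetical, and what a team actually does once the budget is exhausted is left to its own error budget policy.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class BudgetReport:
    sli_percent: float  # measured share of good events, in percent
    allowed_bad: float  # bad events the SLO tolerates in this window
    actual_bad: int     # bad events observed in this window
    exhausted: bool     # True once more bad events occurred than allowed


def evaluate_error_budget(good_events: int, total_events: int,
                          slo_percent: float) -> BudgetReport:
    """Compare a measured SLI against an SLO and report error budget usage."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")

    actual_bad = total_events - good_events
    # Fraction(str(...)) keeps values such as 99.9 exact, so the comparison
    # below does not flip because of float rounding at the boundary.
    allowed_bad = total_events * (100 - Fraction(str(slo_percent))) / 100
    return BudgetReport(
        sli_percent=100 * good_events / total_events,
        allowed_bad=float(allowed_bad),
        actual_bad=actual_bad,
        exhausted=actual_bad > allowed_bad,
    )
```

For example, with an SLO of 99.9% and 1,000,000 requests in the window, 800 failed requests leave budget to spare, while 1,500 failed requests overspend it.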

Slide 23

Slide 23 text

“The only normal way to begin speaking a new language is to begin speaking it badly” – Greg Thomson

Slide 24

Slide 24 text

Aspirational SLOs are OK

Slide 25

Slide 25 text

Overachieving on your SLO is less OK

Slide 26

Slide 26 text

“You get me” - Your CFO

Slide 27

Slide 27 text

Psychological Safety

Slide 28

Slide 28 text

“Psychological safety is a shared belief that the team is safe for interpersonal risk taking” Amy Edmondson – Harvard Business School Professor

Slide 29

Slide 29 text

“Psychological safety is a belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes.” Amy Edmondson – Harvard Business School Professor

Slide 30

Slide 30 text

Research
Psychological safety was studied in a medical environment
• Teams were measured on Psychological Safety, Error Rates and Team Performance
• Higher Psychological Safety correlated with higher Error Rates
• However, higher Error Rates correlated with higher Team Performance
• Better practices. More lives saved.
Amy Edmondson – Learning from mistakes is easier said than done

Slide 31

Slide 31 text

Google’s Project Aristotle

Slide 32

Slide 32 text

Hannah’s Hypothesis
SRE Practices can transform teams by improving their Psychological Safety
[Diagram labels: SRE Practices; Failure is Normal; Psychological Safety; Increase learning from mistakes; Boosts employee engagement; Improved innovation]

Slide 33

Slide 33 text

Let’s talk about toil

Slide 34

Slide 34 text

What is Toil?
• Manual
• Repetitive
• Automatable
• Tactical
• No enduring value
• O(n) with service growth
“I’m too good for this toily BS” “Yes, yes you are”

Slide 35

Slide 35 text

Toil vs. Engineering Work
“I didn’t have time to automate myself out of a job. I didn’t even have time to eat!”

Slide 36

Slide 36 text

You should spend less than 50% of your time on toil.

Slide 37

Slide 37 text

Not all toil is equal

Slide 38

Slide 38 text

“If we have to staff humans to do the work, we are feeding the machines with the blood, sweat and tears of human beings” Joseph Bironas – Google SRE Chapter 7: The Evolution of Automation at Google

Slide 39

Slide 39 text

Toxic toil: The work that hurts
• Wakes you up at night
• Ruins your evenings and weekends
• Interrupts your work
• Distracts you
• Stresses you out

Slide 40

Slide 40 text

Maslow’s Hierarchy of Needs
• I’m getting enough sleep
• I’m not afraid to fail
• I spend enough time with my family
• I’m good at my job
• I’m always learning

Slide 41

Slide 41 text

Hannah’s Hypothesis
SRE Practices can transform teams by meeting employees’ needs
[Diagram labels: SRE Practices; Physiological; Safety / Security; Social / Belonging; Self Esteem; Self Actualization]

Slide 42

Slide 42 text

Is SRE good for you?

Slide 43

Slide 43 text

Tell me your story… @HannahFoxwell
• Blameless culture
• Error Budget policy
• SLOs

Slide 44

Slide 44 text

Thank You x