@HannahFoxwell
#VelocityConf
@HannahFoxwell
Delivery Manager @ Pivotal
DevOpsDays London | HumanOps London
Reliability Engineering for Humans
Is Site Reliability Engineering Good for You?
“You don’t need SRE unless you’re
the size of Google”
Anonymous CEO
The Evolution of Ops
Ops → DevOps → SRE
Naming things is hard
“DevOps is not a job title”
DevOps Community (2009 - Present)
CloudOps at Pivotal
[Figure: CloudOps at Pivotal, three charts over 1/17 to 12/17: Health (0 to 100) and two time-scale panels (0m to 4m; 0h to 1h)]
Improved Health, MTTA Down, MTTR Down
What did they do?
OK, you have my attention
#HUMANOPS
The wellbeing of human
operators impacts the
reliability of systems.
“SRE is what happens when a software
engineer is tasked with what used to
be called operations”
Ben Treynor – Founder of Google’s SRE Team
Failure is Normal
Reliability is Fundamental
“There is no trade off between improving
performance and achieving higher levels of
quality and stability. High performers do
better at all these measures”
Accelerate: The Science Behind DevOps
Nicole Forsgren, Jez Humble & Gene Kim
“In 2017 we saw low performers lose some
ground in stability”
(Increasing MTTR and CFR from 2016-17)
Accelerate: The Science Behind DevOps
Nicole Forsgren, Jez Humble & Gene Kim
SLIs, SLOs and Error Budgets
SLO
Service Level Objective
SLI
Service Level Indicator
Error Budget
100% Availability is not your target. So what is?
SLO        Error Budget (per 30 days)
99%        432 mins
99.5%      216 mins
99.9%      43.2 mins
99.95%     21.6 mins
99.99%     4.32 mins
99.999%    0.43 mins

Agree your SLIs and SLOs with everyone. Yes, everyone.
• Everyone understands the importance of reliability
• Everyone understands the error budget and how it works
• Everyone understands the new rules!

Oops! We broke something. What now?
• On-call / Playbooks / Fire drills
• Blameless Incident Review / Retrospective
• Review error budget
• Reduce risk and invest in reliability
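The error budget figures listed above for each SLO can be reproduced with simple arithmetic: a 30-day window contains 43,200 minutes, and the budget is the share of that window the SLO allows to be unavailable. The following is a minimal Python sketch of that calculation; the function name `error_budget_minutes` and its parameters are illustrative assumptions, not something taken from the talk.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    slo_percent is the target as a percentage (for example 99.9).
    window_days is the length of the measurement window (assumed 30 days).
    """
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    window_minutes = window_days * 24 * 60  # 30 days is 43,200 minutes
    # The budget is whatever share of the window the SLO does not cover.
    return window_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    # Print the budget for each SLO level listed on the slide.
    for slo in (99, 99.5, 99.9, 99.95, 99.99, 99.999):
        print(f"{slo}% -> {error_budget_minutes(slo):.2f} mins")
```

Because the SLO is a parameter, the same function can be used to discuss what a stricter or looser target would cost in allowed downtime before agreeing on it.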
Set Your Service Level Objectives
Measure Your Service Level Indicators
Enforce Your Error Budgets
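One way to read these three steps together, as a rough sketch: the SLO is a target ratio, the SLI is the measured ratio of good events to total events, and the error budget is the gap between them. The names below (`availability_sli`, `budget_remaining`, `can_ship`) and the request-count inputs are hypothetical, and the "stop shipping when the budget is spent" rule is only one possible enforcement policy.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Measured SLI: the fraction of events that were good."""
    if total_events <= 0:
        # Assumption: a window with no traffic counts as meeting the objective.
        return 1.0
    return good_events / total_events


def budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent).

    Both sli and slo are ratios between 0 and 1, for example 0.999.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be a ratio strictly between 0 and 1")
    allowed_failure = 1 - slo   # the whole error budget
    actual_failure = 1 - sli    # how much of it has been used
    return 1 - actual_failure / allowed_failure


def can_ship(sli: float, slo: float) -> bool:
    """Hypothetical policy: risky changes are allowed only while budget remains."""
    return budget_remaining(sli, slo) > 0


if __name__ == "__main__":
    sli = availability_sli(good_events=99_950, total_events=100_000)
    print(f"SLI: {sli:.4f}")
    print(f"Budget remaining: {budget_remaining(sli, slo=0.999):.0%}")
    print(f"Can ship: {can_ship(sli, slo=0.999)}")
```

Returning the remaining budget as a fraction rather than a yes/no answer lets a team agree on intermediate thresholds, such as slowing releases when little budget is left.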
“The only normal way to begin
speaking a new language is
to begin speaking it badly”
Greg Thomson
Aspirational SLOs are OK
Overachieving on your SLO
is less OK
“You get me”
- Your CFO
Psychological Safety
“Psychological safety is a shared belief that
the team is safe for interpersonal risk
taking”
Amy Edmondson – Harvard Business School Professor
“Psychological safety is a belief that one
will not be punished or humiliated for
speaking up with ideas, questions,
concerns, or mistakes.”
Amy Edmondson – Harvard Business School Professor
Research
Psychological safety was studied in a medical environment
• Teams were measured on Psychological Safety, Error Rates and Team Performance
• Higher Psychological Safety correlated with higher Error Rates
• However, higher Error Rates correlated with higher Team Performance
• Better practices. More lives saved.
Amy Edmondson – Learning from mistakes is easier said than done
Hannah’s Hypothesis
SRE Practices can transform teams by improving their Psychological Safety
Diagram labels: SRE Practices, Failure is Normal, Psychological Safety
• Increased learning from mistakes
• Boosted employee engagement
• Improved innovation
Let’s talk about toil
What is Toil?
“I’m too good for this
toily BS”
“Yes, yes you are”
• Manual
• Repetitive
• Automatable
• Tactical
• No enduring value
• O(n) with service growth
“I didn’t have time to automate myself out of a
job. I didn’t even have time to eat!”
Toil vs. Engineering Work
You should spend below
50% of your time on toil.
Not all toil is equal
“If we have to staff humans to do the work,
we are feeding the machines with the blood,
sweat and tears of human beings”
Joseph Bironas – Google SRE
Chapter 7: The Evolution of Automation at Google
Toxic toil
The work that hurts
• Wakes you up at night
• Ruins your evenings and weekends
• Interrupts your work
• Distracts you
• Stresses you out
Maslow’s Hierarchy of Needs
• I’m getting enough sleep
• I’m not afraid to fail
• I spend enough time with my family
• I’m good at my job
• I’m always learning
Hannah’s Hypothesis
SRE Practices can transform teams by meeting employees’ needs
• Physiological
• Safety / Security
• Social / Belonging
• Self Esteem
• Self Actualization
Is SRE good for you?
Tell me your story…
@HannahFoxwell
• SLOs
• Error Budget policy
• Blameless culture