@holly_cummins
PREVAIL Technical Conference 2021
#WTFisSRE
confession:
i am not an SRE
but some of my good friends are:
Robert Barron, Cansu Kavılı Örnek
Robert Barron, IBM Garage
Cansu Kavılı Örnek, Red Hat Open Innovation Labs
thanks for the stories, Robert and Cansu
poll: who are you?
sli.do
#886041
what is SRE?
no really, WTF is SRE?
SRE: what ops would be like if it was done by software engineers
what was wrong with the old way of doing ops?
old ops:
• manual
• repetitive
• siloed
• not aligned to business goals
• unable to handle complexity of cloud native
I am not designed for this.
two war rooms
team mainframe: “we’re responsible for stability of the mainframe”
team mobile: “we’re responsible for stability of the front end”
the ambassador
“we’re responsible for stability of the mainframe … as long as it’s used correctly”
IBM Garage
true story
“we have a ticket per team, not per incident”
dots aren’t connected
“we want to do SRE but we don’t have enough permissions on our systems”
“it takes us 15 minutes just to get permission to run a standard set of SQL diagnostic statements”
advanced metrics:
how many people were in the post-mortem?
does it include more than the people directly involved?
did we invite more than our own team?
true story
“no one says anything in our blameless post-mortems”
if involvement in an incident is punished,
people will avoid engaging with systems
“great idea, go build that!”
if ideas are punished with extra work,
people will try not to have ideas
true story
“we have success metrics”
the perverse incentive
metrics are good
SREs are data-driven
but …
as senior leaders, be careful what you incentivise
be careful what behaviours you discourage
true story
“we count how many incidents we have; if the number goes down, it means we are working better”
the perverse incentive
during the holidays,
quality is outstanding!
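One reading of the holiday joke is that a raw incident count falls whenever fewer changes ship, whether or not the team is working better. A minimal sketch of that arithmetic follows; the function name and every number in it are hypothetical and chosen only for illustration, not taken from the talk.

```python
from typing import Optional


def incidents_per_deploy(incidents: int, deploys: int) -> Optional[float]:
    """Incidents normalised by change activity; None when nothing was deployed."""
    if deploys == 0:
        return None
    return incidents / deploys


# Hypothetical months, made up for illustration only.
months = {
    "november": {"incidents": 12, "deploys": 60},
    "december": {"incidents": 3, "deploys": 5},
}

for name, m in months.items():
    rate = incidents_per_deploy(m["incidents"], m["deploys"])
    # The raw count drops in December while the per-deploy rate rises.
    print(name, m["incidents"], rate)
```

With these made-up figures the count falls from 12 to 3, yet incidents per deploy go up, so the "number goes down" metric rewards shipping less rather than operating better.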
mean time to failure?
mean time to detect problems?
what is failure in a complex system?
if a system goes down but user experience is fine, does that count?
measure “what have I learned”
measure “have I made sure it won’t happen again”
true client story
“we can’t actually release this.”
value on the shelf
true client story
“we can’t release this microservice… we deploy all our microservices at the same time… because otherwise nothing works.”
the monolithic microservices
let’s talk about microservices
true client story
“every time we change code, something breaks”
the peril of microservices
just because a system runs across 6 containers doesn’t mean it’s decoupled
when SRE is right, it is great
bank
team mainframe team mobile
the ambassador
“we’re responsible for stability of the front end”
“we’re responsible for stability of the mainframe … as long as it’s used correctly”
remember this bank?
another department …
one team: web front-end, mobile front-end, back-end
one team, range of techniques
canary deploys
CI/CD pipelines
one team
CI/CD pipelines
big-bang deploys onto AIX
by the way …
canary deploys: 10% failure rate
big bang deploys: 50% failure rate
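The slides contrast canary and big-bang deploys without describing the mechanics. As a rough sketch of the canary idea only: send a small share of traffic to the new version, compare its error rate with the current baseline, then promote or roll back. The function name, the tolerance value and the decision rule below are assumptions for illustration, not the process used by the team in the talk.

```python
def canary_verdict(canary_errors: int, canary_requests: int,
                   baseline_error_rate: float, tolerance: float = 0.01) -> str:
    """Return 'promote' or 'rollback' for a canary release (hypothetical rule)."""
    if canary_requests == 0:
        # No traffic reached the canary, so there is no evidence it is safe.
        return "rollback"
    canary_rate = canary_errors / canary_requests
    # Promote only if the canary is no worse than baseline plus the tolerance.
    if canary_rate <= baseline_error_rate + tolerance:
        return "promote"
    return "rollback"


# Illustrative calls with made-up numbers.
print(canary_verdict(2, 1000, baseline_error_rate=0.002))
print(canary_verdict(50, 1000, baseline_error_rate=0.002))
```

The point of the design is that a bad release is caught while only the canary's share of traffic is exposed, instead of being discovered after everything has switched over at once.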
industrial
remember the suspicious DBAs?
two root problems:
• automation
• trust and transparency
trigger automation via slack
because it was transparent, DBAs were happy and automated more things
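A sketch of what chat-triggered automation with a visible audit trail might look like, assuming an allow-list of pre-approved, read-only diagnostics. Everything here is hypothetical: the command name, the SQL text, the `sessions` table and the `handle_chat_command` function are invented for illustration, and no real Slack API is called. The chat integration is reduced to a plain function that receives the user and the command.

```python
from datetime import datetime, timezone

# Hypothetical allow-list: command name -> pre-approved, read-only SQL.
APPROVED_DIAGNOSTICS = {
    "active-sessions": "SELECT COUNT(*) FROM sessions",  # illustrative query
}

# Stand-in for a shared channel that everyone, including the DBAs, can read.
audit_log = []


def handle_chat_command(user, command, run_sql):
    """Run a pre-approved diagnostic requested in chat and record who asked."""
    entry = {
        "user": user,
        "command": command,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    sql = APPROVED_DIAGNOSTICS.get(command)
    if sql is None:
        # Anything outside the allow-list is refused, and the refusal is logged.
        entry["outcome"] = "rejected"
        audit_log.append(entry)
        return f"'{command}' is not an approved diagnostic"
    result = run_sql(sql)
    entry["outcome"] = "ran"
    audit_log.append(entry)
    return f"{command}: {result}"


# Usage with a fake database call standing in for the real one.
print(handle_chat_command("alice", "active-sessions", lambda sql: 42))
print(handle_chat_command("bob", "drop-everything", lambda sql: None))
```

Because every request and its outcome lands in the shared log, nobody has to ask for permission each time, and the database owners can still see exactly what was run and by whom.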
what happens when things go wrong?