Arup Chakrabarti
Director of Engineering, PagerDuty
Building a Culture of Reliability
Slide 2
Slide 3
I work with smart people
Slide 4
You are not PagerDuty
Slide 5
We get this wrong too
Slide 6
Slide 7
Slide 8
Probability that your
software works*
Slide 9
What every CTO claims they
want because numbers
Slide 10
Slide 11
Social behavior and norms
for a group of people
Slide 12
A way to get your colleagues to
behave the way you want them to
without staring at them all the time
Slide 13
Slide 14
“Show me the business impact”
-Your Pointy Haired Manager
Slide 15
“Here is a graph of open File Descriptors
going through the roof”
-Frustrated Engineer
Slide 16
“What the $%#! is a File Descriptor?”
-Your Pointy Haired Manager
Slide 17
Business Metrics
Managers Care About
Slide 18
Metrics Your Customers
Care About
Slide 19
Two Types of Online Businesses
• Individual Transaction Businesses
• Subscription Businesses
Slide 20
Individual Transaction Business
Chart: $$$ per Minute, Monday through Friday
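One way to turn a "$$$ per Minute" chart into a business-impact number is to compare revenue during an incident against a baseline for the same minutes. The following is a minimal sketch under stated assumptions, not something from the talk: the function name, the baseline series, and all dollar figures are hypothetical.

```python
def estimate_lost_revenue(baseline_per_minute, observed_per_minute):
    """Sum the minute-by-minute shortfall of observed revenue against a baseline.

    Both arguments are lists of dollars per minute covering the same minutes.
    Minutes where observed revenue beats the baseline count as zero loss.
    """
    if len(baseline_per_minute) != len(observed_per_minute):
        raise ValueError("both series must cover the same minutes")
    return sum(
        max(expected - actual, 0.0)
        for expected, actual in zip(baseline_per_minute, observed_per_minute)
    )


# Hypothetical data: revenue normally runs at about $100 per minute,
# and an outage hits minutes 2 through 4 of a five-minute window.
baseline = [100.0, 100.0, 100.0, 100.0, 100.0]
observed = [100.0, 20.0, 0.0, 10.0, 105.0]

lost = estimate_lost_revenue(baseline, observed)
print(f"Estimated revenue lost: ${lost:.2f}")
```

In practice the baseline might come from the same minutes on the same weekday in earlier weeks, since the chart suggests revenue follows a daily pattern.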
Slide 26
Subscription Businesses
• Cannot solely measure when you make money
• Poor Reliability erodes trust and will cause you to lose revenue
• Need to find something between how money is made and what customers
care about
Slide 27
Subscription Businesses
Incidents Resolved per Hour - July 2017
Distributed Operations Org
• Sets expectations around availability of people
• More small incidents over a single major incident
• Builds empathy for why Reliability is hard
Slide 40
Tooling and Processes
Slide 41
“If we just install Nagios, everything will be
fine and all of our problems will be solved”
-Arup in 2002
Slide 42
“We humans co-evolve with our tools. We
change the tools, and the tools change us,
and that cycle repeats.”
-Jeff Bezos
Slide 43
Failure Friday (Process)
Slide 44
Started Small
Slide 45
Slide 46
Got Bigger and Smarter
Slide 47
Slide 48
Slide 49
Slide 50
Slide 51
Reboot Roulette (Tool)
Slide 52
Slide 53
Major Incident Response
(Process and Tooling)
Slide 54
Started Really Poorly
Slide 55
Got A Little Better Each Time
Slide 56
Slide 57
Still Not Perfect
Slide 58
Slide 59
Internal Liaison Role (Process)
Slide 60
Over-communicate during
Major Incidents
Slide 61
Slide 62
Improving Reliability means
constantly failing, constantly
recovering, and constantly learning
Slide 63
Yes, it can be exhausting,
but it is worth it
Slide 64
Improving Culture means
constantly failing, constantly
recovering, and constantly learning
Slide 65
Yes, it can be even more
exhausting, but it is really
really really worth it
Slide 66
Arup Chakrabarti
Director of Engineering, PagerDuty
Thank You