Arup Chakrabarti
Director of Engineering, PagerDuty
Building a Culture of Reliability
SRECON EMEA 2017
@arupchak
Disclaimers
I work with smrt smart people
You are not PagerDuty
We get this wrong too
Definitions
Reliability
Probability that your
software works*
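One way to make the "probability" on this slide concrete is to estimate it as the fraction of requests that worked over some window. This is only a minimal sketch: the function name `estimate_reliability` and the sample counts are assumptions for illustration, not something taken from the talk.

```python
def estimate_reliability(successful_requests: int, total_requests: int) -> float:
    """Estimate reliability as the observed fraction of requests that worked."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= successful_requests <= total_requests:
        raise ValueError("successful_requests must be between 0 and total_requests")
    return successful_requests / total_requests


# Hypothetical sample: 999,500 of 1,000,000 requests succeeded.
reliability = estimate_reliability(999_500, 1_000_000)
print(f"Estimated reliability: {reliability:.4%}")
```

The asterisk on the slide still applies: what counts as "works" (status code, latency, correctness) is a choice you have to make before a number like this means anything.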
What every CTO claims they
want because numbers
Culture
Social behavior and norms
for a group of people
A way to get your colleagues to
behave the way you want them to
without staring at them all the time
Metrics
“Show me the business impact”
-Your Pointy Haired Manager
“Here is a graph of open File Descriptors
going through the roof”
-Frustrated Engineer
“What the $%#! is a File Descriptor?”
-Your Pointy Haired Manager
Business Metrics
Managers Care About
Metrics Your Customers
Care About
Two Types of Online Businesses
• Individual Transaction Businesses
• Subscription Businesses
Individual Transaction Business
[Chart: $$$ per Minute, Monday through Friday; y-axis from $0 to $90; later builds of the slide add $ and € markers]
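For a business that earns money per transaction, the dollars-per-minute chart suggests a simple way to put a price on an outage: compare what you observed with what you would normally expect, minute by minute. The sketch below assumes you already have both series; the function name `estimate_lost_revenue` and every number in the sample are hypothetical.

```python
from typing import Sequence


def estimate_lost_revenue(
    expected_per_minute: Sequence[float],
    observed_per_minute: Sequence[float],
) -> float:
    """Sum the per-minute shortfall between expected and observed revenue."""
    if len(expected_per_minute) != len(observed_per_minute):
        raise ValueError("both series must cover the same minutes")
    # Minutes that beat the baseline do not offset the loss.
    return sum(
        max(expected - observed, 0.0)
        for expected, observed in zip(expected_per_minute, observed_per_minute)
    )


# Hypothetical five-minute window: baseline of $45 per minute,
# a full outage for two minutes, then a partial recovery.
expected = [45.0, 45.0, 45.0, 45.0, 45.0]
observed = [45.0, 0.0, 0.0, 23.0, 45.0]
print(f"Estimated lost revenue: ${estimate_lost_revenue(expected, observed):.2f}")
```

The baseline is the hard part in practice; a same-weekday average is one common choice, but that is a design decision rather than anything the slides prescribe.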
Subscription Businesses
• Cannot solely measure when you make money
• Poor Reliability erodes trust and will cause you to lose revenue
• Need to find something between how money is made and what customers care about
Subscription Businesses
Incidents Resolved per Hour - July 2017
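The slide gives only the chart title, so how the metric was actually computed is not stated. As a rough sketch of one way to produce an "incidents resolved per hour" series, the code below buckets resolution timestamps by clock hour; the function name `incidents_resolved_per_hour` and the sample timestamps are assumptions.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable


def incidents_resolved_per_hour(resolved_at: Iterable[datetime]) -> Counter:
    """Count resolved incidents in each clock hour."""
    # Truncate each timestamp to the start of its hour and count per bucket.
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in resolved_at
    )


# Hypothetical resolution timestamps.
sample = [
    datetime(2017, 7, 3, 9, 5),
    datetime(2017, 7, 3, 9, 40),
    datetime(2017, 7, 3, 10, 15),
]
counts = incidents_resolved_per_hour(sample)
for hour, count in sorted(counts.items()):
    print(hour.isoformat(), count)
```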
Distributed Operations Org
• Sets expectations around availability of people
• More small incidents over a single major incident
• Builds empathy and an understanding of why Reliability is hard
Tooling and Processes
“If we just install Nagios, everything will be
fine and all of our problems will be solved”
-Arup in 2002
“We humans co-evolve with our tools. We
change the tools, and the tools change us,
and that cycle repeats.”
-Jeff Bezos
Failure Friday (Process)
Started Small
Got Bigger and Smarter
?
Reboot Roulette (Tool)
Major Incident Response
(Process and Tooling)
Started Really Poorly
Got A Little Better Each Time
Still Not Perfect
Internal Liaison Role (Process)
Over-communicate during
Major Incidents
Improving Reliability means
constantly failing, constantly
recovering, and constantly learning
Yes, it can be exhausting,
but it is worth it
Improving Culture means
constantly failing, constantly
recovering, and constantly learning
Yes, it can be even more
exhausting, but it is really
really really worth it
Arup Chakrabarti
Director of Engineering, PagerDuty
Thank You
WE ARE HIRING PAGERDUTY.COM/CAREERS
ARUP@PAGERDUTY.COM
@arupchak