Data-driven
Postmortems
Matt Williams
Evangelist @ Datadog
@technovangelist
@datadoghq @technovangelist #nginx #nginxconf
• SaaS based infrastructure and app monitoring
• Open Source Agent
• Time series data (metrics and events)
• Processing about a trillion data points per day
• Intelligent Dashboards & Alerting
• We’re hiring! (www.datadoghq.com/careers/)
“The problems we work on at Datadog
are hard and often don't have obvious,
clean-cut solutions, so it's useful to
cultivate your troubleshooting skills,
no matter what role you work in.”
Internal Datadog Dev Guide
“The only real
mistake is the one
from which we learn
nothing.”
Henry Ford
Postmortem
“An analysis or discussion of an event
held soon after it has occurred,
especially in order to determine why
it was a failure.”
– Oxford English Dictionary
What is DevOps?
• Culture
• Automation
• Metrics
• Sharing
Blameless
Postmortems
Instead of naming, blaming, and
shaming, our goal should always be to
maximize opportunities for
organizational learning
- DevOps Handbook, Kim, Humble, Debois, Willis 2016
Blameless Postmortem Resources
• Blameless Postmortems by John Allspaw
http://bit.ly/etsy-blameless
• The Human Side of Postmortems by Dave Zwieback
http://bit.ly/human-postmortem
Culture and Sharing are
great, but what about
Metrics?
Follow
@honest_update
on Twitter
Collecting data is cheap;
not having it when you need it can
be expensive
So Instrument All The Things!
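One way to make instrumentation cheap is to wrap functions so that every call emits data points automatically. The sketch below is a minimal illustration, not Datadog's agent or API: the in-memory `METRICS` store, the `record` helper, the `instrumented` decorator, and the metric name `checkout.handle_request` are all hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: metric name -> list of (timestamp, value).
# A real setup would ship these points to a monitoring backend instead.
METRICS = defaultdict(list)


def record(name, value):
    """Append one data point to the in-memory store."""
    METRICS[name].append((time.time(), value))


def instrumented(name):
    """Decorator that counts calls and errors and times each call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                record(f"{name}.errors", 1)
                raise
            finally:
                # Runs on success and on failure, so every call is counted.
                record(f"{name}.calls", 1)
                record(f"{name}.duration_seconds", time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("checkout.handle_request")
def handle_request(payload):
    # Stand-in for real work; the function and metric names are illustrative.
    return {"ok": True, "size": len(payload)}
```

Because the decorator records calls, errors, and durations in one place, the data is already there when an incident needs explaining.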
4 Qualities of Good Metrics
• Well-understood
• Granular
• Tagged by scope
• Long-lived
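A small sketch can make "tagged by scope" concrete. It assumes a hypothetical `Point` record and a hypothetical metric name `nginx.requests.5xx`; the tag strings and values are invented. Retention ("long-lived") is a storage concern and is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Point:
    metric: str      # a well-understood name (hypothetical example below)
    timestamp: int   # Unix seconds; fine resolution keeps the data granular
    value: float
    tags: frozenset = field(default_factory=frozenset)  # scope, e.g. "env:prod"


def sum_by_scope(points, metric, required_tags):
    """Sum values of `metric` over points that carry every tag in `required_tags`."""
    wanted = frozenset(required_tags)
    return sum(p.value for p in points if p.metric == metric and wanted <= p.tags)


# Invented sample data for illustration only.
points = [
    Point("nginx.requests.5xx", 1457395200, 3, frozenset({"env:prod", "host:web-1"})),
    Point("nginx.requests.5xx", 1457395200, 5, frozenset({"env:prod", "host:web-2"})),
    Point("nginx.requests.5xx", 1457395200, 9, frozenset({"env:staging", "host:web-3"})),
]

prod_total = sum_by_scope(points, "nginx.requests.5xx", {"env:prod"})
one_host = sum_by_scope(points, "nginx.requests.5xx", {"env:prod", "host:web-2"})
```

With tags in place, the same metric can be sliced by environment or narrowed to a single host without defining a new metric for each scope.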
Recurse until you find the root
cause
Who, What, When
of postmortems
If you’re still
responding to
the incident,
it’s not time for a
postmortem
Data Collection: What?
• Their perspective
• What they did
• What they thought
• Why they thought/did it
Behind every seemingly technical problem is actually a
human problem waiting to be found.
– John Allspaw
Human error is not our cause of troubles; instead
human error is a consequence of the design of the tools
that we gave them.
– Dr Sidney Dekker
Joyent US-EAST-1 Postmortem 2014
http://bit.ly/joyent-post
• … we will be dramatically improving the tooling
that humans (and systems) interact with such that
input validation is much more strict and will not
allow for all servers, and control plane servers to
be rebooted simultaneously …
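The kind of stricter input validation described in the quote can be sketched as a guard that rejects overly broad requests before anything is rebooted. This is an illustrative assumption, not Joyent's actual tooling: the `validate_reboot` function, the `UnsafeRebootError` class, the specific rules, and the hostnames in the usage note are all hypothetical.

```python
class UnsafeRebootError(ValueError):
    """Raised when a reboot request is too broad to run in one step."""


def validate_reboot(targets, all_servers, control_plane):
    """Return the validated target set, or raise UnsafeRebootError.

    All three arguments are collections of hostnames. The rules are
    illustrative assumptions only.
    """
    targets = set(targets)
    all_servers = set(all_servers)
    control_plane = set(control_plane)

    if not targets:
        raise UnsafeRebootError("no targets given")
    unknown = targets - all_servers
    if unknown:
        raise UnsafeRebootError(f"unknown servers: {sorted(unknown)}")
    if targets == all_servers:
        raise UnsafeRebootError("refusing to reboot every server at once")
    if control_plane and control_plane <= targets:
        raise UnsafeRebootError("refusing to reboot the whole control plane at once")
    return targets
```

For example, with a fleet of `cn-1`, `cn-2`, `cp-1`, `cp-2` where the `cp-*` hosts form the control plane, a request for `cn-1` alone passes, while a request naming all four hosts, or both `cp-*` hosts together, is rejected. The point is that the tool, not the operator's attention, carries the safety check.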
“Writing is nature’s way of letting you know how sloppy
your thinking is.”
– Richard Guindon
“One picture is worth ten thousand words”
– Chinese Proverb
Data Collection: When?
• As soon as possible
• Memory drops sharply within 20 minutes
• Susceptibility to “false memory” increases
How we do
Postmortems
At Datadog
A Few Notes
• Postmortems are emailed company-wide
• Recurring postmortem meetings are scheduled
About The Example
• A real event: March 8, 2016
• Public-facing web of Datadog
• Higher 5xx % on outbound requests (pull data)
• No change on inbound requests or alerts (push data)
• First, navigation gets slow...
• Then we get frequent “Down” pages
• Alerts on 5xx %
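The "5xx %" in this example is the share of responses that carry a 5xx status code. A minimal sketch of that arithmetic follows. The per-status counts are invented for illustration and are not from the incident, and the 1% alert threshold and both function names are assumptions.

```python
def error_rate_5xx(status_counts):
    """Return the percentage of responses with a 5xx status code.

    `status_counts` maps an HTTP status code (int) to a response count.
    """
    total = sum(status_counts.values())
    if total == 0:
        return 0.0  # no traffic, so no error rate to report
    errors = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    return 100.0 * errors / total


def should_alert(rate_percent, threshold_percent=1.0):
    """True when the 5xx rate exceeds the (assumed) alert threshold."""
    return rate_percent > threshold_percent


# Invented counts for one time window.
counts = {200: 940, 304: 20, 404: 10, 502: 20, 503: 10}
rate = error_rate_5xx(counts)
```

Here 30 of 1000 responses are 5xx, so `rate` is 3.0 and `should_alert(rate)` is true under the assumed threshold. Alerting on a percentage rather than a raw count keeps the monitor meaningful as traffic volume changes.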
Summary: What Happened?
• Describe what happened here at a high level; think of it as the abstract of a scientific paper.
• What was the impact on customers?
• What was the severity of the outage?
• What components were affected?
• What ultimately resolved the outage?
How was the Outage Detected?
• We want to make sure we detected the issue early and would catch the same issue if it were to repeat.
• Did we have a metric that showed the outage?
• Was there a monitor on that metric?
• How long did it take for us to declare an outage?
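The last question, how long it took to declare an outage, is simple timestamp arithmetic once the timeline is written down. A minimal sketch, assuming three hypothetical timeline entries (the timestamps are invented and are not from a real incident):

```python
from datetime import datetime


def detection_delays(started, first_alert, declared):
    """Return (time to first alert, time to declare), both measured from `started`."""
    if first_alert < started or declared < started:
        raise ValueError("alert and declaration must not precede the start")
    return first_alert - started, declared - started


# Invented timestamps for illustration only.
started = datetime(2020, 1, 1, 14, 0)
first_alert = datetime(2020, 1, 1, 14, 4)
declared = datetime(2020, 1, 1, 14, 11)

to_alert, to_declare = detection_delays(started, first_alert, declared)
print(f"first alert after {to_alert}, outage declared after {to_declare}")
```

Tracking both delays separately shows whether the gap lies in monitoring (no metric or monitor fired) or in the human decision to declare.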
How Did We Respond?
• Who was the incident owner & who else was involved?
• Slack archive links and timeline of events!
• What went well?
• What didn’t go so well?
Track Learnings As You Go
*Names changed
Why Did It Happen?
• Deep dive into the cause
• Examples from this incident:
• http://bit.ly/dd-statuspage
• http://bit.ly/alq-postmortem
How Do We Prevent It In The Future?
• Link to GitHub issues and Trello cards
• Now?
• Next?
• Later?
• Follow-up notes
Recap:
• What happened (summary)?
• How did we detect it?
• How did we respond?
• Why did it happen (deep dive)?
• Actionable next steps!
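The five recap sections can be treated as a fill-in template. The sketch below is one hypothetical way to render a draft and list what is still missing; the function names, the plain-text layout, and the sample answer are assumptions, not Datadog's actual postmortem tooling.

```python
SECTIONS = [
    "What happened (summary)?",
    "How did we detect it?",
    "How did we respond?",
    "Why did it happen (deep dive)?",
    "Actionable next steps",
]


def missing_sections(answers):
    """List the sections that have no non-empty answer yet."""
    return [s for s in SECTIONS if not answers.get(s, "").strip()]


def render_postmortem(title, answers):
    """Render a plain-text postmortem; unanswered sections are marked TODO."""
    lines = [title, "=" * len(title), ""]
    for section in SECTIONS:
        lines.append(section)
        lines.append(answers.get(section, "").strip() or "TODO")
        lines.append("")
    return "\n".join(lines)


# Invented draft with only the summary filled in.
draft = {"What happened (summary)?": "Elevated 5xx rate on the public web."}
text = render_postmortem("Example incident", draft)
todo = missing_sections(draft)
```

Keeping the section list in one place means every postmortem answers the same questions, and `missing_sections` gives a quick check before the document is sent out.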
More Resources
• The Infinite Hows - John Allspaw
http://bit.ly/infinite-hows
• “Blameless” Postmortems don’t work - J. Paul Reed
http://bit.ly/blameless-dont-work
• Monitoring 101 - Alexis Lê-Quôc
http://dtdg.co/monitoring-101-data
Thank You
linkedin.com/in/technovangelist
@technovangelist