DATA-DRIVEN POSTMORTEMS
ILAN RABINOVITCH, DATADOG @IRABINOVITCH
Slide 2
Slide 2 text
$ finger ilan@datadog
[datadoghq.com]
Name: Ilan Rabinovitch
Role: Director, Technical Community
Interests:
* Monitoring and Metrics
* Large scale web operations
* FL/OSS Community Events
Slide 3
Slide 3 text
Datadog Overview
• SaaS-based infrastructure and app monitoring
• Open Source Agent
• Time series data (metrics and events)
• Processing nearly a trillion data points per day
• Intelligent Alerting
• We’re hiring! (www.datadoghq.com/careers/)
Slide 4
Slide 4 text
“THE PROBLEMS WE WORK ON AT DATADOG ARE
HARD AND OFTEN DON'T HAVE OBVIOUS, CLEAN-
CUT SOLUTIONS, SO IT'S USEFUL TO CULTIVATE
YOUR TROUBLESHOOTING SKILLS, NO MATTER
WHAT ROLE YOU WORK IN.”
Internal Datadog Developer Guide
Slide 5
Slide 5 text
“THE ONLY REAL
MISTAKE IS THE
ONE FROM WHICH
WE LEARN
NOTHING.”
- Henry Ford
Slide 6
Slide 6 text
“AN ANALYSIS OR DISCUSSION OF AN EVENT
HELD SOON AFTER IT HAS OCCURRED,
ESPECIALLY IN ORDER TO DETERMINE WHY IT
WAS A FAILURE.”
OXFORD ENGLISH DICTIONARY
POSTMORTEM
Slide 7
Slide 7 text
DAMON EDWARDS & JOHN WILLIS - DEVOPSDAYS LOS ANGELES
WHAT IS DEVOPS?
▸ Culture
▸ Automation
▸ Metrics
▸ Sharing
Slide 8
Slide 8 text
No content
Slide 9
Slide 9 text
No content
Slide 10
Slide 10 text
DAMON EDWARDS & JOHN WILLIS - DEVOPSDAYS LOS ANGELES
OUR FOCUS AREA
▸ Culture
▸ Sharing
Slide 11
Slide 11 text
BLAMELESS
POSTMORTEMS
Slide 12
Slide 12 text
No content
Slide 13
Slide 13 text
CULTURE & SHARING RESOURCES
BLAMELESS POSTMORTEMS
▸ Blameless Postmortems by John Allspaw
http://bit.ly/etsy-blameless
▸ The Human Side of Postmortems by Dave Zwieback
http://bit.ly/human-postmortem
Slide 14
Slide 14 text
CULTURE & SHARING ARE GREAT, BUT WHAT ABOUT
METRICS
Slide 15
Slide 15 text
No content
Slide 16
Slide 16 text
Follow
@honest_update
on Twitter
Slide 17
Slide 17 text
COLLECTING DATA IS CHEAP;
NOT HAVING IT WHEN YOU
NEED IT CAN BE EXPENSIVE
SO INSTRUMENT ALL THE THINGS!
Slide 18
Slide 18 text
No content
Slide 19
Slide 19 text
No content
Slide 20
Slide 20 text
No content
Slide 21
Slide 21 text
No content
Slide 22
Slide 22 text
METRICS
4 QUALITIES OF GOOD METRICS
▸ Well-understood
▸ Granular
▸ Tagged by scope
▸ Long-lived
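The "granular" and "tagged by scope" qualities can be illustrated with a minimal sketch. It assumes a hypothetical `MetricPoint` record and `key:value` tag strings; the metric name, tags, and values are invented for illustration and are not taken from the talk or from any Datadog API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPoint:
    """One granular data point; all fields are illustrative assumptions."""
    name: str
    timestamp: int                 # Unix seconds
    value: float
    tags: frozenset = frozenset()  # scope tags written as "key:value"


def sum_by_tag(points, tag_key):
    """Sum point values per value of one tag key (for example 'region')."""
    totals = defaultdict(float)
    for point in points:
        for tag in point.tags:
            key, _, tag_value = tag.partition(":")
            if key == tag_key:
                totals[tag_value] += point.value
    return dict(totals)


# Hypothetical points: the same metric reported from two scopes.
points = [
    MetricPoint("web.requests.errors", 1700000000, 3.0,
                frozenset({"region:us-east", "role:web"})),
    MetricPoint("web.requests.errors", 1700000000, 1.0,
                frozenset({"region:us-west", "role:web"})),
    MetricPoint("web.requests.errors", 1700000010, 2.0,
                frozenset({"region:us-east", "role:web"})),
]

errors_by_region = sum_by_tag(points, "region")
```

Because each point keeps its scope tags, the same raw data can be sliced by region, role, or any other tag during a postmortem, without deciding on the slicing in advance.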
Slide 23
Slide 23 text
RECURSE UNTIL YOU FIND THE TECHNICAL CAUSE
Slide 24
Slide 24 text
IF YOU’RE STILL
RESPONDING TO
THE INCIDENT,
IT’S NOT TIME FOR
A POSTMORTEM
Slide 25
Slide 25 text
HUMAN DATA
DATA COLLECTION: WHO?
▸ Everyone!
▸ Responders
▸ Identifiers
▸ Affected Users
Slide 26
Slide 26 text
HUMAN DATA
DATA COLLECTION: WHAT?
▸ Their perspective
▸ What they did
▸ What they thought
▸ Why they thought/did it
Slide 27
Slide 27 text
HUMAN ELEMENT
TECHNICAL ISSUES HAVE NON-TECHNICAL CAUSES
Slide 28
Slide 28 text
No content
Slide 29
Slide 29 text
… we will be dramatically improving the tooling that
humans (and systems) interact with such that input
validation is much more strict and will not allow for all
servers, and control plane servers to be rebooted
simultaneously …
Joyent Postmortem
http://bit.ly/joyent-post
JOYENT US-EAST-1 POST-MORTEM 2014
Slide 30
Slide 30 text
“WRITING IS NATURE’S WAY OF
LETTING YOU KNOW HOW SLOPPY
YOUR THINKING IS.”
RICHARD GUINDON
Slide 31
Slide 31 text
“ONE PICTURE IS WORTH TEN
THOUSAND WORDS”
CHINESE PROVERB
Slide 32
Slide 32 text
HUMAN DATA
DATA COLLECTION: WHEN?
▸ As soon as possible.
▸ Memory drops sharply within 20 minutes
▸ Susceptibility to “false memory” increases
Slide 33
Slide 33 text
HUMAN DATA
DATA SKEW/CORRUPTION
▸ Stress
▸ Sleep deprivation
▸ Burnout
Slide 34
Slide 34 text
HUMAN DATA
DATA SKEW/CORRUPTION
▸ Blame/Fear of punitive action
▸ Bias
▸ Anchoring
▸ Hindsight
▸ Outcome
▸ Availability
▸ Recency
Slide 35
Slide 35 text
HOW WE
DO POSTMORTEMS
AT DATADOG
Slide 36
Slide 36 text
DATADOG POSTMORTEMS
A FEW NOTES
▸ Postmortems emailed company-wide
▸ Scheduled recurring postmortem meetings
Slide 37
Slide 37 text
DATADOG’S POSTMORTEM TEMPLATE (1/5)
SUMMARY: WHAT HAPPENED?
▸ Describe what happened here at a high level.
Think of it as the abstract of a scientific paper.
▸ What was the impact on customers?
▸ What was the severity of the outage?
▸ What components were affected?
▸ What ultimately resolved the outage?
Slide 38
Slide 38 text
No content
Slide 39
Slide 39 text
No content
Slide 40
Slide 40 text
DATADOG’S POSTMORTEM TEMPLATE (2/5)
HOW WAS THE OUTAGE DETECTED?
▸ We want to make sure we detected the issue
early and would catch the same issue if it were to
repeat.
▸ Did we have a metric that showed the outage?
▸ Was there a monitor on that metric?
▸ How long did it take for us to declare an outage?
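The "how long did it take" question can be answered directly from a recorded timeline. The following is a rough sketch only: the event names and every timestamp are hypothetical, invented for illustration, and do not describe a real incident.

```python
from datetime import datetime, timezone


def minutes_between(start, end):
    """Elapsed minutes between two timezone-aware datetimes."""
    return (end - start).total_seconds() / 60


# Hypothetical timeline; every timestamp here is invented for illustration.
timeline = {
    "impact_started": datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
    "monitor_alerted": datetime(2024, 1, 1, 10, 7, tzinfo=timezone.utc),
    "outage_declared": datetime(2024, 1, 1, 10, 19, tzinfo=timezone.utc),
}

time_to_detect = minutes_between(
    timeline["impact_started"], timeline["monitor_alerted"]
)
time_to_declare = minutes_between(
    timeline["impact_started"], timeline["outage_declared"]
)
```

Recording these timestamps during the response, rather than reconstructing them afterwards, keeps the numbers independent of anyone's memory of the incident.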
Slide 41
Slide 41 text
No content
Slide 42
Slide 42 text
No content
Slide 43
Slide 43 text
DATADOG’S POSTMORTEM TEMPLATE (3/5)
HOW DID WE RESPOND?
▸ Who was the incident owner & who else was
involved?
▸ Slack archive links and timeline of events!
▸ What went well?
▸ What didn’t go so well?
Slide 44
Slide 44 text
*Names changed
Slide 45
Slide 45 text
CHATOPS
ARCHIVES
FTW!
*Names changed
Slide 46
Slide 46 text
*Names changed
TRACK LEARNINGS AS YOU GO
Slide 47
Slide 47 text
DATADOG’S POSTMORTEM TEMPLATE (4/5)
WHY DID IT HAPPEN?
▸ Deep dive into the cause
▸ Examples from this incident:
▸ http://bit.ly/dd-statuspage
▸ http://bit.ly/alq-postmortem
Slide 48
Slide 48 text
DATADOG’S POSTMORTEM TEMPLATE (5/5)
HOW DO WE PREVENT IT IN THE FUTURE?
▸ Link to GitHub issues and Trello cards
▸ Now?
▸ Next?
▸ Later?
▸ Follow up notes
Slide 49
Slide 49 text
*Names changed
Slide 50
Slide 50 text
DATADOG’S POSTMORTEM TEMPLATE
RECAP:
▸ What happened (summary)?
▸ How did we detect it?
▸ How did we respond?
▸ Why did it happen (deep dive)?
▸ Actionable next steps!
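The five-part template recapped above can be captured as data so that every postmortem starts from the same skeleton. This is a sketch under assumptions: the `TEMPLATE` structure, the `render_skeleton` helper, and the output layout are hypothetical; only the section headings and prompts come from the slides.

```python
# Section headings and prompts are taken from the template slides; the
# data structure and the rendering format are illustrative assumptions.
TEMPLATE = [
    ("What happened (summary)?", [
        "What was the impact on customers?",
        "What was the severity of the outage?",
        "What components were affected?",
        "What ultimately resolved the outage?",
    ]),
    ("How did we detect it?", [
        "Did we have a metric that showed the outage?",
        "Was there a monitor on that metric?",
        "How long did it take for us to declare an outage?",
    ]),
    ("How did we respond?", [
        "Who was the incident owner & who else was involved?",
        "What went well?",
        "What didn't go so well?",
    ]),
    ("Why did it happen (deep dive)?", []),
    ("Actionable next steps", ["Now?", "Next?", "Later?"]),
]


def render_skeleton(title):
    """Return a plain-text postmortem skeleton with numbered sections."""
    lines = [f"POSTMORTEM: {title}", ""]
    for number, (heading, prompts) in enumerate(TEMPLATE, start=1):
        lines.append(f"{number}. {heading}")
        lines.extend(f"   - {prompt}" for prompt in prompts)
        lines.append("")
    return "\n".join(lines)


skeleton = render_skeleton("Example incident (hypothetical)")
```

Printing `skeleton` gives a document with one numbered section per template slide, ready to be filled in by the incident owner.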
Slide 51
Slide 51 text
KEEP LEARNING
MORE RESOURCES
▸ The Infinite Hows - John Allspaw
http://bit.ly/infinite-hows
▸ “Blameless” Postmortems don’t work - J. Paul Reed
http://bit.ly/blameless-dont-work
▸ Monitoring 101 - Alexis Lê-Quôc
http://dtdg.co/monitoring-101-data