Slide 1

Slide 1 text

DATA-DRIVEN POSTMORTEMS ILAN RABINOVITCH, DATADOG @IRABINOVITCH

Slide 2

Slide 2 text

$ finger ilan@datadog
[datadoghq.com]
Name: Ilan Rabinovitch
Role: Director, Technical Community
Interests:
 * Monitoring and Metrics
 * Large scale web operations
 * FL/OSS Community Events

Slide 3

Slide 3 text

Datadog Overview • SaaS-based infrastructure and app monitoring • Open Source Agent • Time series data (metrics and events) • Processing nearly a trillion data points per day • Intelligent Alerting • We’re hiring! (www.datadoghq.com/careers/)

Slide 4

Slide 4 text

“THE PROBLEMS WE WORK ON AT DATADOG ARE HARD AND OFTEN DON'T HAVE OBVIOUS, CLEAN-CUT SOLUTIONS, SO IT'S USEFUL TO CULTIVATE YOUR TROUBLESHOOTING SKILLS, NO MATTER WHAT ROLE YOU WORK IN.” Internal Datadog Developer Guide

Slide 5

Slide 5 text

“THE ONLY REAL MISTAKE IS THE ONE FROM WHICH WE LEARN NOTHING.” - Henry Ford

Slide 6

Slide 6 text

POSTMORTEM “AN ANALYSIS OR DISCUSSION OF AN EVENT HELD SOON AFTER IT HAS OCCURRED, ESPECIALLY IN ORDER TO DETERMINE WHY IT WAS A FAILURE.” OXFORD ENGLISH DICTIONARY

Slide 7

Slide 7 text

DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES WHAT IS DEVOPS? ▸ Culture ▸ Automation ▸ Metrics ▸ Sharing

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES OUR FOCUS AREA ▸ Culture ▸ Sharing

Slide 11

Slide 11 text

BLAMELESS POSTMORTEMS

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

CULTURE & SHARING RESOURCES BLAMELESS POSTMORTEMS ▸Blameless Postmortems by John Allspaw http://bit.ly/etsy-blameless ▸The Human Side of Postmortems by Dave Zwieback http://bit.ly/human-postmortem

Slide 14

Slide 14 text

CULTURE & SHARING ARE GREAT, BUT WHAT ABOUT METRICS?

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

Follow @honest_update on Twitter

Slide 17

Slide 17 text

COLLECTING DATA IS CHEAP; NOT HAVING IT WHEN YOU NEED IT CAN BE EXPENSIVE. SO INSTRUMENT ALL THE THINGS!

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

METRICS 4 QUALITIES OF GOOD METRICS ▸ Well-understood ▸ Granular ▸ Tagged by scope ▸ Long-lived
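The four qualities on this slide can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `MetricPoint` type, the metric name, and the tag values are hypothetical and are not Datadog's data model or API. The sketch only shows why scope tags matter: the same metric can be sliced per zone after the fact. "Long-lived" is a retention property of the storage backend and is not modeled here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class MetricPoint:
    """Hypothetical data point; not a real Datadog type."""
    name: str                    # well-understood: a descriptive, documented name
    value: float
    timestamp: int               # granular: seconds since epoch, one point per short interval
    tags: Tuple[str, ...] = ()   # tagged by scope, e.g. "role:web", "zone:a"


def with_tag(points: Iterable[MetricPoint], tag: str) -> List[MetricPoint]:
    """Keep only the points that carry the given scope tag."""
    return [p for p in points if tag in p.tags]


def average(points: Iterable[MetricPoint]) -> Optional[float]:
    """Mean value of the points, or None when there are no points."""
    values = [p.value for p in points]
    return sum(values) / len(values) if values else None


# Illustrative data only.
points = [
    MetricPoint("web.request.latency_ms", 120.0, 1000, ("role:web", "zone:a")),
    MetricPoint("web.request.latency_ms", 480.0, 1000, ("role:web", "zone:b")),
    MetricPoint("web.request.latency_ms", 130.0, 1010, ("role:web", "zone:a")),
]

zone_a_latency = average(with_tag(points, "zone:a"))
zone_b_latency = average(with_tag(points, "zone:b"))
```

With tags in place, a postmortem can ask whether a latency spike was global or confined to one zone without having planned that question in advance.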

Slide 23

Slide 23 text

RECURSE UNTIL YOU FIND THE TECHNICAL CAUSE

Slide 24

Slide 24 text

IF YOU’RE STILL RESPONDING TO THE INCIDENT, IT’S NOT TIME FOR A POSTMORTEM

Slide 25

Slide 25 text

HUMAN DATA DATA COLLECTION: WHO? ▸ Everyone! ▸ Responders ▸ Identifiers ▸ Affected Users

Slide 26

Slide 26 text

HUMAN DATA DATA COLLECTION: WHAT? ▸ Their perspective ▸ What they did ▸ What they thought ▸ Why they thought/did it

Slide 27

Slide 27 text

HUMAN ELEMENT TECHNICAL ISSUES HAVE NON-TECHNICAL CAUSES

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

JOYENT US-EAST-1 POST-MORTEM 2014
… we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously … Joyent Postmortem
 http://bit.ly/joyent-post

Slide 30

Slide 30 text

“WRITING IS NATURE’S WAY OF LETTING YOU KNOW HOW SLOPPY YOUR THINKING IS.” RICHARD GUINDON

Slide 31

Slide 31 text

“ONE PICTURE IS WORTH TEN THOUSAND WORDS” CHINESE PROVERB

Slide 32

Slide 32 text

HUMAN DATA DATA COLLECTION: WHEN? ▸ As soon as possible. ▸ Memory drops sharply within 20 minutes ▸ Susceptibility to “false memory” increases

Slide 33

Slide 33 text

HUMAN DATA DATA SKEW/CORRUPTION ▸ Stress ▸ Sleep deprivation ▸ Burnout

Slide 34

Slide 34 text

HUMAN DATA DATA SKEW/CORRUPTION ▸ Blame/Fear of punitive action ▸ Bias ▸ Anchoring ▸ Hindsight ▸ Outcome ▸ Availability ▸ Recency

Slide 35

Slide 35 text

HOW WE DO POSTMORTEMS AT DATADOG

Slide 36

Slide 36 text

DATADOG POSTMORTEMS A FEW NOTES ▸ Postmortems emailed company-wide ▸ Scheduled recurring postmortem meetings

Slide 37

Slide 37 text

DATADOG’S POSTMORTEM TEMPLATE (1/5) SUMMARY: WHAT HAPPENED? ▸ Describe what happened here at a high-level -- think of it as an abstract in a scientific paper. ▸ What was the impact on customers? ▸ What was the severity of the outage? ▸ What components were affected? ▸ What ultimately resolved the outage?

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

DATADOG’S POSTMORTEM TEMPLATE (2/5) HOW WAS THE OUTAGE DETECTED? ▸ We want to make sure we detected the issue early and would catch the same issue if it were to repeat. ▸ Did we have a metric that showed the outage? ▸ Was there a monitor on that metric? ▸ How long did it take for us to declare an outage?
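The detection questions above can be answered from the metric itself. The following is a minimal sketch, assuming a simple threshold rule over a list of `(timestamp, value)` pairs; the function names, the threshold, and the sample series are hypothetical and do not describe how Datadog monitors are evaluated.

```python
from typing import Iterable, Optional, Tuple


def first_sustained_breach(
    series: Iterable[Tuple[int, float]],
    threshold: float,
    min_points: int = 3,
) -> Optional[int]:
    """Return the timestamp at which the first run of `min_points`
    consecutive values above `threshold` began, or None if there is none."""
    run_start = None
    run_length = 0
    for timestamp, value in series:
        if value > threshold:
            if run_length == 0:
                run_start = timestamp
            run_length += 1
            if run_length >= min_points:
                return run_start
        else:
            run_length = 0
    return None


def seconds_to_declare(breach_started: int, declared_at: int) -> int:
    """How long the outage was visible in the metric before it was declared."""
    return declared_at - breach_started


# Illustrative error-rate series: (seconds since start of window, errors per second).
error_rate = [(0, 1.0), (60, 5.0), (120, 1.0), (180, 6.0), (240, 7.0), (300, 8.0)]

breach = first_sustained_breach(error_rate, threshold=4.0, min_points=3)
delay = seconds_to_declare(breach, declared_at=900)
```

Requiring several consecutive points is one way to avoid counting a single noisy sample (the value at 60 seconds here) as the start of the outage.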

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

DATADOG’S POSTMORTEM TEMPLATE (3/5) HOW DID WE RESPOND? ▸ Who was the incident owner & who else was involved? ▸ Slack archive links and timeline of events! ▸ What went well? ▸ What didn’t go so well?
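A timeline of events can be assembled by merging chat messages with alert events and sorting by time. This is a sketch under stated assumptions: the entry shape `(when, who, what)`, the names, and the sample messages are all hypothetical, and no real chat or monitoring API is called.

```python
from datetime import datetime
from typing import Iterable, List, Tuple

# Hypothetical entry shape: (when, who, what).
Entry = Tuple[datetime, str, str]


def build_timeline(*sources: Iterable[Entry]) -> List[Entry]:
    """Merge entries from several sources into one list ordered by time."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda entry: entry[0])


def format_timeline(entries: Iterable[Entry]) -> str:
    """Render one line per entry as 'HH:MM [who] what'."""
    return "\n".join(f"{when:%H:%M} [{who}] {what}" for when, who, what in entries)


# Illustrative data only; names are made up.
chat_messages = [
    (datetime(2016, 1, 1, 10, 5), "alice", "seeing 500s on the API"),
    (datetime(2016, 1, 1, 10, 20), "bob", "rolled back the deploy"),
]
alert_events = [
    (datetime(2016, 1, 1, 10, 2), "monitor", "error rate above threshold"),
]

timeline = build_timeline(chat_messages, alert_events)
timeline_text = format_timeline(timeline)
```

Interleaving both sources shows how long it took between the first alert and the first human response.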

Slide 44

Slide 44 text

*Names changed

Slide 45

Slide 45 text

CHATOPS ARCHIVES FTW! *Names changed

Slide 46

Slide 46 text

TRACK LEARNINGS AS YOU GO *Names changed

Slide 47

Slide 47 text

DATADOG’S POSTMORTEM TEMPLATE (4/5) WHY DID IT HAPPEN? ▸ Deep dive into the cause ▸ Examples from this incident: ▸ http://bit.ly/dd-statuspage ▸ http://bit.ly/alq-postmortem

Slide 48

Slide 48 text

DATADOG’S POSTMORTEM TEMPLATE (5/5) HOW DO WE PREVENT IT IN THE FUTURE? ▸ Link to GitHub issues and Trello cards ▸ Now? ▸ Next? ▸ Later? ▸ Follow up notes
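The Now / Next / Later split can be kept as structured data so follow-ups are easy to list per horizon. A minimal sketch, assuming a hypothetical `ActionItem` record; the titles are invented and the `link` field is a placeholder for an issue or card URL, left empty here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

HORIZONS = ("now", "next", "later")


@dataclass
class ActionItem:
    """Hypothetical follow-up record; not part of any real tracker API."""
    title: str
    horizon: str     # one of HORIZONS
    link: str = ""   # placeholder for an issue or card URL


def group_by_horizon(items: Iterable[ActionItem]) -> Dict[str, List[ActionItem]]:
    """Bucket action items by horizon, rejecting unknown horizons."""
    grouped: Dict[str, List[ActionItem]] = {horizon: [] for horizon in HORIZONS}
    for item in items:
        if item.horizon not in grouped:
            raise ValueError(f"unknown horizon: {item.horizon!r}")
        grouped[item.horizon].append(item)
    return grouped


# Illustrative items only.
items = [
    ActionItem("Add a monitor on queue depth", "now"),
    ActionItem("Validate config before deploy", "next"),
    ActionItem("Rework the deploy pipeline", "later"),
    ActionItem("Document the rollback procedure", "now"),
]

grouped = group_by_horizon(items)
```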

Slide 49

Slide 49 text

*Names changed

Slide 50

Slide 50 text

DATADOG’S POSTMORTEM TEMPLATE RECAP: ▸ What happened (summary)? ▸ How did we detect it? ▸ How did we respond? ▸ Why did it happen (deep dive)? ▸ Actionable next steps!
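The five sections can be turned into a reusable skeleton. The sketch below renders a plain-text outline from the section headings and prompts shown on the template slides; the function name, the output layout, and the example title are assumptions for illustration, not Datadog's actual tooling.

```python
# Section headings and prompts paraphrased from the template slides.
TEMPLATE_SECTIONS = [
    ("Summary: what happened?", [
        "What was the impact on customers?",
        "What was the severity of the outage?",
        "What components were affected?",
        "What ultimately resolved the outage?",
    ]),
    ("How was the outage detected?", [
        "Did we have a metric that showed the outage?",
        "Was there a monitor on that metric?",
        "How long did it take for us to declare an outage?",
    ]),
    ("How did we respond?", [
        "Who was the incident owner and who else was involved?",
        "Chat archive links and timeline of events",
        "What went well?",
        "What didn't go so well?",
    ]),
    ("Why did it happen?", [
        "Deep dive into the cause",
    ]),
    ("How do we prevent it in the future?", [
        "Now?",
        "Next?",
        "Later?",
        "Follow up notes",
    ]),
]


def render_template(title: str) -> str:
    """Render an empty postmortem outline as plain text."""
    lines = [f"Postmortem: {title}", ""]
    for number, (heading, prompts) in enumerate(TEMPLATE_SECTIONS, start=1):
        lines.append(f"{number}. {heading}")
        lines.extend(f"   - {prompt}" for prompt in prompts)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


outline = render_template("API outage")
```

Starting every postmortem from the same outline makes it harder to skip the detection and prevention sections.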

Slide 51

Slide 51 text

KEEP LEARNING MORE RESOURCES
 ▸ The Infinite Hows - John Allspaw
 http://bit.ly/infinite-hows
 ▸ “Blameless” Postmortems don’t work - J Paul Reed
 http://bit.ly/blameless-dont-work
 ▸ Monitoring 101 - Alexis Lê-Quôc
 http://dtdg.co/monitoring-101-data

Slide 52

Slide 52 text

QUESTIONS? LET’S TALK! @IRABINOVITCH @DATADOGHQ