Slide 1

Slide 1 text

Data-driven Postmortems
Matt Williams
Evangelist @ Datadog
@technovangelist [email protected]

Slide 2

Slide 2 text

@datadoghq @technovangelist #nginx #nginxconf
• SaaS-based infrastructure and app monitoring
• Open Source Agent
• Time series data (metrics and events)
• Processing about a trillion data points per day
• Intelligent Dashboards & Alerting
• We’re hiring! (www.datadoghq.com/careers/)

Slide 3

Slide 3 text

“The problems we work on at Datadog are hard and often don't have obvious, clean-cut solutions, so it's useful to cultivate your troubleshooting skills, no matter what role you work in.”
– Internal Datadog Dev Guide

Slide 4

Slide 4 text

“The only real mistake is the one from which we learn nothing.”
– Henry Ford

Slide 5

Slide 5 text

Postmortem
“An analysis or discussion of an event held soon after it has occurred, especially in order to determine why it was a failure.”
– Oxford English Dictionary

Slide 6

Slide 6 text

What is DevOps?
• Culture
• Automation
• Metrics
• Sharing

Slide 7

Slide 7 text


Slide 8

Slide 8 text


Slide 9

Slide 9 text

Our Focus Area
• Culture
• Sharing

Slide 10

Slide 10 text

Blameless Postmortems
Instead of naming, blaming, and shaming, our goal should always be to maximize opportunities for organizational learning
– DevOps Handbook, Kim, Humble, Debois, Willis 2016

Slide 11

Slide 11 text


Slide 12

Slide 12 text

Blameless Postmortem Resources
• Blameless Postmortems by John Allspaw http://bit.ly/etsy-blameless
• The Human Side of Postmortems by Dave Zwieback http://bit.ly/human-postmortem

Slide 13

Slide 13 text

Culture and Sharing are great, but what about Metrics?

Slide 14

Slide 14 text


Slide 15

Slide 15 text

Follow @honest_update on Twitter

Slide 16

Slide 16 text

Collecting data is cheap; not having it when you need it can be expensive.
So Instrument All The Things!

Slide 17

Slide 17 text


Slide 18

Slide 18 text


Slide 19

Slide 19 text


Slide 20

Slide 20 text


Slide 21

Slide 21 text

4 Qualities of Good Metrics
• Well-understood
• Granular
• Tagged by scope
• Long-lived
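
A minimal sketch of what "granular" and "tagged by scope" can buy you during a postmortem: if every data point carries tags, you can slice the same metric by host, role, or zone after the fact. The MetricPoint type, the metric name, the tag keys, and the sample values below are all hypothetical and chosen only for illustration; they are not Datadog's data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MetricPoint:
    """One granular observation of a metric (hypothetical structure)."""
    name: str
    timestamp: int          # Unix seconds; one point per second is "granular"
    value: float
    tags: Dict[str, str] = field(default_factory=dict)  # the point's scope


def filter_by_scope(points: List[MetricPoint], **scope: str) -> List[MetricPoint]:
    """Keep only points whose tags match every key/value in `scope`."""
    return [
        p for p in points
        if all(p.tags.get(key) == value for key, value in scope.items())
    ]


def average(points: List[MetricPoint]) -> float:
    """Mean value of the points, or 0.0 when there are none."""
    return sum(p.value for p in points) / len(points) if points else 0.0


# Made-up sample data: two web hosts, one of them slow.
POINTS = [
    MetricPoint("web.latency_ms", 1000, 120.0, {"host": "web-1", "role": "web", "az": "a"}),
    MetricPoint("web.latency_ms", 1000, 480.0, {"host": "web-2", "role": "web", "az": "b"}),
    MetricPoint("web.latency_ms", 1001, 130.0, {"host": "web-1", "role": "web", "az": "a"}),
]

if __name__ == "__main__":
    # The role-wide average hides the outlier; the host scope exposes it.
    print(average(filter_by_scope(POINTS, role="web")))
    print(average(filter_by_scope(POINTS, host="web-2")))
```

The design choice worth noting is that scope lives in the tags rather than in the metric name, so the same query works for one host, one zone, or the whole fleet.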

Slide 22

Slide 22 text

Recurse until you find the root cause

Slide 23

Slide 23 text

Who, What, When of Postmortems

Slide 24

Slide 24 text

If you’re still responding to the incident, it’s not time for a postmortem

Slide 25

Slide 25 text

Data Collection: Who?
• Everyone!
• Responders
• Identifiers
• Affected Users

Slide 26

Slide 26 text

Data Collection: What?
• Their perspective
• What they did
• What they thought
• Why they thought/did it

Slide 27

Slide 27 text

Behind every seemingly technical problem is actually a human problem waiting to be found.
– John Allspaw

Slide 28

Slide 28 text

Human error is not our cause of troubles; instead human error is a consequence of the design of the tools that we gave them.
– Dr Sidney Dekker

Slide 29

Slide 29 text


Slide 30

Slide 30 text

Joyent US-EAST-1 Postmortem 2014 http://bit.ly/joyent-post
• … we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously …

Slide 31

Slide 31 text

“Writing is nature’s way of letting you know how sloppy your thinking is.”
– Richard Guindon

Slide 32

Slide 32 text

“One Picture is worth Ten thousand Words”
– Chinese Proverb

Slide 33

Slide 33 text

Data Collection: When?
• As soon as possible.
• Memory drops sharply within 20 minutes
• Susceptibility to “false memory” increases

Slide 34

Slide 34 text

Data Skew/Corruption
• Stress
• Sleep deprivation
• Burnout

Slide 35

Slide 35 text

Data Skew/Corruption
• Blame/Fear of punitive action
• Bias
  • Anchoring
  • Hindsight
  • Outcome
  • Availability
  • Recency

Slide 36

Slide 36 text

How We Do Postmortems at Datadog

Slide 37

Slide 37 text

A Few Notes
• Postmortems emailed company-wide
• Scheduled recurring postmortem meetings

Slide 38

Slide 38 text

About The Example
• A real event - March 8, 2016
• Public-facing web of Datadog
• Higher 5xx % on outbound requests (pull data)
• No change on inbound requests or alerts (push data)
• First, navigation gets slow...
• Then we get frequent “Down” pages
• Alerts on 5xx %
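
The "Alerts on 5xx %" bullet can be illustrated with a small sketch: compute the share of server-error responses in a window of status codes and compare it to a threshold. The function names and the 5% default threshold are assumptions for illustration only; the slides do not say how the actual monitor was configured.

```python
from collections import Counter
from typing import Iterable


def five_xx_percent(status_codes: Iterable[int]) -> float:
    """Percentage of responses in the window whose status is 5xx."""
    # Bucket each code by its hundreds digit: 200 -> 2, 503 -> 5, ...
    counts = Counter(code // 100 for code in status_codes)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no traffic in the window: report 0% rather than divide by zero
    return 100.0 * counts[5] / total


def should_alert(status_codes: Iterable[int], threshold_percent: float = 5.0) -> bool:
    """True when the 5xx share exceeds the (assumed) threshold."""
    return five_xx_percent(status_codes) > threshold_percent


if __name__ == "__main__":
    window = [200, 200, 502, 503]   # made-up sample window
    print(five_xx_percent(window))  # share of 5xx responses, in percent
    print(should_alert(window))
```

Alerting on a percentage rather than a raw error count keeps the monitor meaningful as traffic volume changes.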

Slide 39

Slide 39 text

Summary: What Happened?
• Describe what happened here at a high-level -- think of it as an abstract in a scientific paper.
• What was the impact on customers?
• What was the severity of the outage?
• What components were affected?
• What ultimately resolved the outage?

Slide 40

Slide 40 text


Slide 41

Slide 41 text


Slide 42

Slide 42 text

How was the Outage Detected?
• We want to make sure we detected the issue early and would catch the same issue if it were to repeat.
• Did we have a metric that showed the outage?
• Was there a monitor on that metric?
• How long did it take for us to declare an outage?
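
One way to answer "How long did it take for us to declare an outage?" is to record a few timestamps during the incident and subtract. The sketch below assumes three hypothetical timeline keys (impact_started, monitor_alerted, outage_declared); the timestamps are placeholders, not data from the March 8, 2016 incident.

```python
from datetime import datetime, timedelta
from typing import Dict


def detection_delays(timeline: Dict[str, datetime]) -> Dict[str, timedelta]:
    """Delays from the start of customer impact to alert and to declaration."""
    start = timeline["impact_started"]
    return {
        "time_to_alert": timeline["monitor_alerted"] - start,
        "time_to_declare": timeline["outage_declared"] - start,
    }


if __name__ == "__main__":
    # Placeholder timestamps, invented for illustration.
    timeline = {
        "impact_started": datetime(2000, 1, 1, 10, 0),
        "monitor_alerted": datetime(2000, 1, 1, 10, 4),
        "outage_declared": datetime(2000, 1, 1, 10, 15),
    }
    for name, delay in detection_delays(timeline).items():
        print(f"{name}: {delay}")
```

A large gap between the two delays suggests the monitor fired but people were slow to act on it, which is a different follow-up than a missing monitor.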

Slide 43

Slide 43 text


Slide 44

Slide 44 text


Slide 45

Slide 45 text

How Did We Respond?
• Who was the incident owner & who else was involved?
• Slack archive links and timeline of events!
• What went well?
• What didn’t go so well?

Slide 46

Slide 46 text

*Names changed

Slide 47

Slide 47 text

ChatOps Archives FTW!
*Names changed

Slide 48

Slide 48 text

Track Learnings As You Go
*Names changed

Slide 49

Slide 49 text

Why Did It Happen?
• Deep dive into the cause
• Examples from this incident:
  • http://bit.ly/dd-statuspage
  • http://bit.ly/alq-postmortem

Slide 50

Slide 50 text

How Do We Prevent It In The Future?
• Link to Github issues and Trello cards
• Now?
• Next?
• Later?
• Follow up notes

Slide 51

Slide 51 text


Slide 52

Slide 52 text

Recap:
• What happened (summary)?
• How did we detect it?
• How did we respond?
• Why did it happen (deep dive)?
• Actionable next steps!
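
The five recap questions can double as a document skeleton. Below is a rough sketch of a template that renders the sections as plain text and reports which ones are still empty. The Postmortem class, its field names, and the "TODO" marker are hypothetical; this is not the template Datadog uses internally.

```python
from dataclasses import dataclass, field
from typing import List

# (attribute name, section heading) pairs, in the order of the recap slide.
SECTIONS = [
    ("summary", "What happened (summary)?"),
    ("detection", "How did we detect it?"),
    ("response", "How did we respond?"),
    ("root_cause", "Why did it happen (deep dive)?"),
    ("next_steps", "Actionable next steps"),
]


@dataclass
class Postmortem:
    """Hypothetical postmortem skeleton with one field per recap question."""
    title: str
    summary: str = ""
    detection: str = ""
    response: str = ""
    root_cause: str = ""
    next_steps: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Headings of the sections that have not been filled in yet."""
        return [heading for attr, heading in SECTIONS if not getattr(self, attr)]

    def render(self) -> str:
        """Plain-text rendering; empty sections are marked TODO."""
        lines = [self.title]
        for attr, heading in SECTIONS:
            value = getattr(self, attr)
            lines += ["", heading]
            if isinstance(value, list):
                lines += [f"- {item}" for item in value] or ["TODO"]
            else:
                lines.append(value or "TODO")
        return "\n".join(lines)


if __name__ == "__main__":
    draft = Postmortem(title="Example incident", summary="Elevated 5xx rate on the web tier.")
    print(draft.render())
    print("Still missing:", draft.missing_sections())
```

Checking missing_sections() before the document goes out is a cheap way to make sure no recap question is skipped.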

Slide 53

Slide 53 text

More Resources
• The Infinite Hows - John Allspaw http://bit.ly/infinite-hows
• “Blameless” Postmortems don’t work - J Paul Reed http://bit.ly/blameless-dont-work
• Monitoring 101 - Alexis Lê-Quôc http://dtdg.co/monitoring-101-data

Slide 54

Slide 54 text

Thank You
linkedin.com/in/technovangelist
@technovangelist