Data Driven Postmortems

Matt Williams
September 08, 2016

Those who fail to learn from history are doomed to repeat it. When things go wrong and our services are impacted, we need to tell the story of our failures so that we can grow and learn as a team. Postmortems offer us an opportunity to share this knowledge, so that we can build on our successes and avoid repeating our mistakes. In this session we will discuss how Datadog runs its internal postmortems, from data collection to building timelines to the blameless review. Attendees will leave with a framework they can apply right away to make postmortems more impactful in their own organizations.

For more information about Datadog, visit: https://www.datadoghq.com/

Transcript

  1. @datadoghq @technovangelist #nginx #nginxconf
    • SaaS-based infrastructure and app monitoring
    • Open Source Agent
    • Time series data (metrics and events)
    • Processing about a trillion data points per day
    • Intelligent Dashboards & Alerting
    • We’re hiring! (www.datadoghq.com/careers/)
  2. “The problems we work on at Datadog are hard and often don't have obvious, clean-cut solutions, so it's useful to cultivate your troubleshooting skills, no matter what role you work in.” (Internal Datadog Dev Guide)
  3. “The only real mistake is the one from which we learn nothing.” (Henry Ford)
  4. Postmortem: “An analysis or discussion of an event held soon after it has occurred, especially in order to determine why it was a failure.” (Oxford English Dictionary)
  5. Blameless Postmortems: Instead of naming, blaming, and shaming, our goal should always be to maximize opportunities for organizational learning. (DevOps Handbook, Kim, Humble, Debois, Willis, 2016)
  6. Blameless Postmortem Resources
    • Blameless Postmortems by John Allspaw: http://bit.ly/etsy-blameless
    • The Human Side of Postmortems by Dave Zwieback: http://bit.ly/human-postmortem
  7. Collecting data is cheap; not having it when you need it can be expensive. So Instrument All The Things!
  8. Data Collection: What?
    • Their perspective
    • What they did
    • What they thought
    • Why they thought/did it
  9. “Behind every seemingly technical problem is actually a human problem waiting to be found.” (John Allspaw)
  10. “Human error is not our cause of troubles; instead human error is a consequence of the design of the tools that we gave them.” (Dr Sidney Dekker)
  11. Joyent US-EAST-1 Postmortem 2014 (http://bit.ly/joyent-post): “… we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously …”
  12. “Writing is nature’s way of letting you know how sloppy your thinking is.” (Richard Guindon)
  13. Data Collection: When?
    • As soon as possible.
    • Memory drops sharply within 20 minutes
    • Susceptibility to “false memory” increases
  14. Data Skew/Corruption
    • Blame/Fear of punitive action
    • Bias
    • Anchoring
    • Hindsight
    • Outcome
    • Availability
    • Recency
  15. A Few Notes
    • Postmortems are emailed company-wide
    • Scheduled recurring postmortem meetings
  16. About The Example
    • A real event: March 8, 2016
    • Public-facing web of Datadog
    • Higher 5xx % on outbound requests (pull data)
    • No change on inbound requests or alerts (push data)
    • First, navigation gets slow...
    • Then we get frequent “Down” pages
    • Alerts on 5xx %
  17. Summary: What Happened?
    • Describe what happened at a high level; think of it as the abstract of a scientific paper.
    • What was the impact on customers?
    • What was the severity of the outage?
    • What components were affected?
    • What ultimately resolved the outage?
  18. How Was the Outage Detected?
    • We want to make sure we detected the issue early and would catch the same issue if it were to repeat.
    • Did we have a metric that showed the outage?
    • Was there a monitor on that metric?
    • How long did it take for us to declare an outage?
  19. How Did We Respond?
    • Who was the incident owner & who else was involved?
    • Slack archive links and timeline of events!
    • What went well?
    • What didn’t go so well?
  20. Why Did It Happen?
    • Deep dive into the cause
    • Examples from this incident:
    • http://bit.ly/dd-statuspage
    • http://bit.ly/alq-postmortem
  21. How Do We Prevent It In The Future?
    • Link to GitHub issues and Trello cards
    • Now?
    • Next?
    • Later?
    • Follow up notes
  22. Recap:
    • What happened (summary)?
    • How did we detect it?
    • How did we respond?
    • Why did it happen (deep dive)?
    • Actionable next steps!
  23. More Resources
    • The Infinite Hows by John Allspaw: http://bit.ly/infinite-hows
    • “Blameless” Postmortems don’t work by J Paul Reed: http://bit.ly/blameless-dont-work
    • Monitoring 101 by Alexis Lê-Quôc: http://dtdg.co/monitoring-101-data
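
The five questions in the recap slide can be treated as a checklist for a postmortem document. The sketch below is one possible way to model that in Python; it is not Datadog's internal tooling. The `Postmortem` class, its field names, and the example timestamps and section text are all hypothetical, chosen only to illustrate the "how long did it take for us to declare an outage?" question and the idea of flagging sections that are still empty.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

# The five sections from the recap slide; the attribute names are hypothetical.
SECTIONS = ("summary", "detection", "response", "cause", "next_steps")


@dataclass
class Postmortem:
    started_at: datetime    # when the incident began
    detected_at: datetime   # when a monitor or a person first noticed it
    declared_at: datetime   # when an outage was formally declared
    summary: str = ""       # What happened?
    detection: str = ""     # How did we detect it?
    response: str = ""      # How did we respond?
    cause: str = ""         # Why did it happen?
    next_steps: List[str] = field(default_factory=list)  # Actionable next steps

    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    def time_to_declare(self) -> timedelta:
        return self.declared_at - self.started_at

    def missing_sections(self) -> List[str]:
        # A section counts as missing while it is an empty string or list.
        return [name for name in SECTIONS if not getattr(self, name)]


# Hypothetical draft: the times and text below are invented for illustration.
pm = Postmortem(
    started_at=datetime(2016, 3, 8, 14, 0),
    detected_at=datetime(2016, 3, 8, 14, 4),
    declared_at=datetime(2016, 3, 8, 14, 15),
    summary="Elevated 5xx % on the public-facing web app.",
    detection="Alert on 5xx %.",
    response="Incident owner assigned; timeline kept in chat.",
)

print(pm.time_to_detect())
print(pm.time_to_declare())
print(pm.missing_sections())
```

Recording the three timestamps as soon as possible, rather than reconstructing them later, fits the talk's point that memory drops sharply after an incident. The `missing_sections` check is a cheap reminder that a draft is not ready to circulate until the cause and next steps are written down.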