
Chaos management during a major incident

Aish
April 24, 2017


This is a short talk I gave about incident response and how PagerDuty handles its own incidents.
I gave two versions of this talk: one at dotScale 2017 in Paris and the other at Full Stack Fest in Barcelona.





  1. Aish Raj Dahal, Software Engineer, PagerDuty. Chaos management during a major incident
  2. sudo rm -rf /

  3. Failure free operations require experience with failure
 - Richard Cook

  4. A story about failure

  5. Chapter I: This is fine

  6. Almost every engineer I knew was on the call

  7. Someone dialed in from a bar

  8. Most people were doing the same task

  9. We had no clue where to start

  10. Chapter II: Dark and Stormy night

  11. “It was a dark and stormy night; the rain fell in torrents — except at occasional intervals, when it was checked by a violent gust of wind…” - Almost every clueless person on the call
  12. Nobody to coordinate

  13. Chapter III: The exec-swoop

  14. “Can you send me a spreadsheet with a list of affected customers?”
  15. Chapter IV: The morning after

  16. What’s wrong with the picture?

  17. Define, Prepare, Measure

  18. Failures should be unique; if they are not, automate the response.

  19. None
  20. Single responsibility principle

  21. Role: The Subject Matter Expert

  22. Role: The Incident Commander

  23. Notify that this is a major incident

  24. Verify that all SMEs are present

  25. Divide and conquer

  26. Communicate effectively

  27. Avoid the bystander effect

  28. “Please say yes, if you think it is a good idea to do so.”
  29. “Is there any strong objection to that?”

  30. Role: The Deputy

  31. Assist the Incident Commander

  32. Get all Subject Matter Experts up to speed about what’s

  33. Liaise with stakeholders

  34. Role: The Scribe

  35. Documents the timeline of an incident as it progresses

  36. Role: Customer Liaison

  37. Notify customers about the incident

  38. Keep the Incident Commander apprised of any relevant customer information

  39. Transfer of command if necessary

  40. Blameless post-mortem

  41. You can’t fire your way to reliability

  42. Review

  43. happens, prepare for it

  44. Develop on-call empathy

  45. People are your most valuable asset. Don’t burn them out doing something that can be automated.
  46. aishrajdahal

  47. Appendix • The Big Red Button from Flickr by włodi (CC SA) • Upside down from Flickr by Akimasa Harada (CC SA) • Geography from Flickr (CC SA) • Gene Kranz’s image, in the public domain • “That’s all folks” image, in the public domain
  48. References • Cook, Richard I. "How complex systems fail." Cognitive Technologies Laboratory, University of Chicago. Chicago IL (1998). • Incident Response, PagerDuty Incident Response Docs, https://response.pagerduty.com/ • Allspaw, John. "Fault injection in production." Communications of the ACM 55.10 (2012): 48-52. • Krishnan, Kripa. "Weathering the Unexpected." Communications of the ACM 55.11 (2012): 48-52. • Limoncelli, Tom, et al. "Resilience engineering: learning to embrace failure." Communications of the ACM 55.11 (2012): 40-47.
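
The roles named in slides 20 to 36 can be sketched as a small roster check that an Incident Commander might run at the start of a call. This is only an illustrative sketch, not PagerDuty's tooling: the names `Role`, `missing_roles` and `overloaded_people` are hypothetical, and reading the "single responsibility principle" slide as "one role per person" is an assumption.

```python
from enum import Enum


class Role(Enum):
    # Role names taken from the slides; the enum itself is hypothetical.
    INCIDENT_COMMANDER = "Incident Commander"
    DEPUTY = "Deputy"
    SCRIBE = "Scribe"
    CUSTOMER_LIAISON = "Customer Liaison"
    SUBJECT_MATTER_EXPERT = "Subject Matter Expert"


def missing_roles(assignments):
    """Return the roles that nobody has been assigned to, in declaration order."""
    return [role for role in Role if not assignments.get(role)]


def overloaded_people(assignments):
    """Return people holding more than one role, mapped to the roles they hold.

    Assumption: single responsibility is read here as one role per person.
    """
    roles_by_person = {}
    for role, people in assignments.items():
        for person in people:
            roles_by_person.setdefault(person, []).append(role)
    return {p: roles for p, roles in roles_by_person.items() if len(roles) > 1}


if __name__ == "__main__":
    # Example roster with made-up names.
    roster = {
        Role.INCIDENT_COMMANDER: ["alice"],
        Role.SCRIBE: ["alice"],
        Role.SUBJECT_MATTER_EXPERT: ["bob"],
    }
    print("Unfilled roles:", [r.value for r in missing_roles(roster)])
    print("Double-booked:", list(overloaded_people(roster)))
```

A check like this only flags gaps; deciding who fills them, and when command is transferred, remains a human call.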