Chaos management during a major incident

Aish
April 24, 2017

This is a short talk that I gave about incident response and how PagerDuty does its incident response.
I gave two versions of this talk: one at dotScale 2017 in Paris and the other at Full Stack Fest in Barcelona.

Transcript

  1. Aish Raj Dahal, Software Engineer, PagerDuty. Chaos management during a major incident
  2. sudo rm -rf /

  3. “Failure free operations require experience with failure.” - Richard Cook

  4. A story about failure

  5. Chapter I: This is fine

  6. Almost every engineer I knew was on the call

  7. Someone dialed in from a Bar

  8. Most people were doing the same task

  9. We had no clue where to start

  10. Chapter II: Dark and Stormy night

  11. “It was a dark and stormy night; the rain fell in torrents — except at occasional intervals, when it was checked by a violent gust of wind…” - Almost every clueless person on the call
  12. Nobody to coordinate

  13. Chapter III: The exec-swoop

  14. “Can you send me a spreadsheet with a list of affected customers?”
  15. Chapter IV: The morning after

  16. What’s wrong with the picture?

  17. Define, Prepare, Measure

  18. Failures should be unique; if they are not, automate the response.

  19. None
  20. Single responsibility principle

  21. Role: The Subject Matter Expert

  22. Role: The Incident Commander

  23. Notify that this is a major incident

  24. Verify that all SMEs are present

  25. Divide and conquer

  26. Communicate effectively

  27. Avoid the bystander effect

  28. “Please say yes, if you think it is a good idea to do so.”
  29. “Is there any strong objection to that?”

  30. Role: The Deputy

  31. Assist the Incident Commander

  32. Get all Subject Matter Experts up to speed about what’s happening
  33. Liaise with stakeholders

  34. Role: The Scribe

  35. Documents the timeline of an incident as it progresses

  36. Role: Customer Liaison

  37. Notify customers about the incident

  38. Keep the Incident Commander apprised of any relevant customer information

  39. Transfer of command if necessary

  40. Blameless post-mortem

  41. You can’t fire your way to reliability

  42. Review

  43. happens, prepare for it

  44. Develop on-call empathy

  45. People are your most valuable asset. Don’t burn them out doing something that can be automated.
  46. aishrajdahal

  47. Appendix
    • The Big Red Button from Flickr by włodi (CC SA)
    • Upside down from Flickr by Akimasa Harada (CC SA)
    • Geography from Flickr (CC SA)
    • Gene Kranz’s image, public domain
    • That’s all folks image, public domain
  48. References
    • Cook, Richard I. "How complex systems fail." Cognitive Technologies Laboratory, University of Chicago. Chicago IL (1998).
    • Incident Response, PagerDuty Incident Response Docs, https://response.pagerduty.com/
    • Allspaw, John. "Fault injection in production." Communications of the ACM 55.10 (2012): 48-52.
    • Krishnan, Kripa. "Weathering the Unexpected." Communications of the ACM 55.11 (2012): 48-52.
    • Limoncelli, Tom, et al. "Resilience engineering: learning to embrace failure." Communications of the ACM 55.11 (2012): 40-47.