Incident Response Done Right: From First Page to Postmortem

A rambling sort of thing presented at DevOps ATL April 2014.

Will Farrington

April 17, 2014

Transcript

  1. INCIDENT RESPONSE DONE RIGHT: FROM FIRST PAGE TO POSTMORTEM
  2. WILL FARRINGTON

    @wfarr on the Internet. Ops @ GitHub, 2012-now. Ops @ Rails Machine, 2009-2011.
  3. LET’S TALK ABOUT INCIDENT RESPONSE
  4. INCIDENT

    EVENT, NOTIFICATION, IDENTIFICATION, RESOLUTION, POSTMORTEM
  5. INCIDENT

    EVENT, NOTIFICATION, IDENTIFICATION, RESOLUTION, POSTMORTEM
  6. – ME

    The recipe to terrible software is simple: just add software.
  7. All software is terrible.

  8. All software breaks.

  9. INCIDENT

    EVENT, NOTIFICATION, IDENTIFICATION, RESOLUTION, POSTMORTEM
  10. – ME

    Writing your own calendaring and alerting application is a terrible idea.
  11. Use PagerDuty, OpsGenie, smoke signals, or a carrier pigeon.

  12. INCIDENT

    EVENT, NOTIFICATION, IDENTIFICATION, RESOLUTION, POSTMORTEM
  13. WHAT YOU REALLY WANT TO KNOW IS: WHAT IS THE PROBLEM?
  14. APP

    INPUT, OUTPUT
  15. INPUT, OUTPUT

    LB, APP, AUTH, DB, CACHE, APIS
  16. INPUT, OUTPUT
  17. INPUT, OUTPUT

    ARCHITECTURE
  18. HOW WOULD YOU DESCRIBE YOUR PROCESS?
  19. “Methodical” and “organized” aren’t often the first thought.

  20. Unfortunately, resolving the problem without identifying what the problem is

    usually results in more harm than good.
  21. A process more refined than guesswork is required.

  22. I RECOMMEND THIS BOOK ABOUT CHECKLISTS
  23. – ATUL GAWANDE

    “It is common to misconceive how checklists function in complex lines of work. They are not comprehensive how-to guides, whether for building a skyscraper or getting a plane out of trouble. They are quick and simple tools aimed to buttress the skills of expert professionals.”
  24. A GOOD CHECKLIST

    PRECISE, EFFICIENT, CONCISE, PRACTICAL, EASY TO USE
  25. ENGINE FAILURE DURING FLIGHT: FLY THE AIRPLANE!

    • Airspeed: 68 KIAS
    • Fuel Shutoff Valve: ON (IN)
    • Fuel Selector: BOTH
    • Auxiliary Fuel Pump: ON
    • Mixture: RICH
    • Ignition Switch: BOTH
  26. Checklists help you eliminate the “obvious” from your mind so

    you can focus on the hard stuff.
  27. Checklists transform the process of identifying problems in a rapidly

    degrading situation from being haphazard and error-prone to methodical and organized.
  28. Occam’s Razor as a Service

  29. INCIDENT

    EVENT, NOTIFICATION, IDENTIFICATION, RESOLUTION, POSTMORTEM
  30. – ME

    “The first step, in anything, is giving a shit.”
  31. – ATUL GAWANDE

    “‘That’s not my problem’ is possibly the worst thing people can think.”
  32. It’s actually the single worst thing anyone can think.

  33. IT’S TIME TO FIX THE PROBLEM
  34. Do you have a checklist for that?

  35. I RECOMMEND THIS BOOK ABOUT CHECKLISTS (AGAIN)
  36. ENGINE FAILURE DURING FLIGHT: FLY THE AIRPLANE!

    • Airspeed: 68 KIAS
    • Fuel Shutoff Valve: ON (IN)
    • Fuel Selector: BOTH
    • Auxiliary Fuel Pump: ON
    • Mixture: RICH
    • Ignition Switch: BOTH
  37. Let’s say Elasticsearch is split-brained.

  38. You should immediately reach for the checklist.

  39. ELASTICSEARCH: SPLIT BRAIN

    • circuit break search: OFF
    • disable allocation
    • get cluster state
    • shutdown all nodes w/ API
    • start the cluster
    • wait for all members
    • enable allocation
    UPDATE THE STATUS!
  40. Communicate synchronously.

  41. MTTR is the name of the game.

    Reduce it safely, by whatever means.
  42. Delegate

  43. On-call engineer, Incident Commander, Communicator

  44. Take 30s at the start of the hangout to make sure everyone knows who’s doing what.

    Make sure you say what your role is.
  45. Atul Gawande found that the simple act of a surgical team introducing themselves to one another before an operation increased the feeling of teamwork and efficacy across the team.

    It also enabled people to speak up when they saw something.
  46. Communicate to the customer.

  47. Do it often!

    Every 15-20 minutes should be the upper bound.
  48. Terrible things happen and if you don’t communicate to your

    customers, they’ll assume the worst.
  49. INCIDENT

    EVENT, NOTIFICATION, IDENTIFICATION, RESOLUTION, POSTMORTEM
  50. WE’VE FIXED THE PROBLEM. NOW LET’S TALK ABOUT IT.
  51. – JESSE ROBBINS

    “Regular postmortems are the closest thing you have to employing a scientific method to the complicated problem of web operations. By gathering real evidence, you can focus your limited resources on solving the issues that are actually causing you problems.”
  52. A GOOD POSTMORTEM

    • DESCRIPTION OF THE INCIDENT
    • DESCRIPTION OF THE ROOT CAUSE
    • DESCRIPTION OF THE RESOLUTION PROCESS
    • TIMELINE OF THE INCIDENT
    • HOW THE INCIDENT AFFECTED CUSTOMERS
    • REMEDIATIONS OR CORRECTIVE ACTIONS
  53. A GOOD POSTMORTEM

    • DESCRIPTION OF THE INCIDENT
    • DESCRIPTION OF THE ROOT CAUSE
    • DESCRIPTION OF THE RESOLUTION PROCESS
    • TIMELINE OF THE INCIDENT
    • HOW THE INCIDENT AFFECTED CUSTOMERS
    • REMEDIATIONS OR CORRECTIVE ACTIONS
  54. A GOOD POSTMORTEM REQUIRES TRUST AND HONESTY
  56. Blame and punitive measures cannot enter the realm of possibility.

    Otherwise, you create a conflict of interest about honesty.
  57. I RECOMMEND THIS BOOK ABOUT HUMAN ERROR
  58. – SIDNEY DEKKER

    “Different perspectives on a sequence of events: Looking from the outside and hindsight you have knowledge of the outcome and dangers involved. From the inside, you may have neither.”
  59. Let’s entertain the thought that we don’t hire mindless automatons.

    We hire people who can and do think, and who care.
  60. Faced with a complex problem in a high-pressure scenario, with

    a process ill-equipped to effectively help them navigate the situation, their actions were entirely logical and yet doomed to fail.
  61. The most important thing is having all the facts.

  62. If facts are altered or missing, you cannot effectively remediate.

  63. A GOOD POSTMORTEM

    • DESCRIPTION OF THE INCIDENT
    • DESCRIPTION OF THE ROOT CAUSE
    • DESCRIPTION OF THE RESOLUTION PROCESS
    • TIMELINE OF THE INCIDENT
    • HOW THE INCIDENT AFFECTED CUSTOMERS
    • REMEDIATIONS OR CORRECTIVE ACTIONS
  64. THE MOST COMMON PROBLEM IS SETTING YOURSELF UP FOR FAILURE
  65. Your corrective actions should be aimed at figuring out how

    your process made the failure possible, and fixing the process.
  66. More training and trying harder are never the right answer.

  67. PUBLIC POSTMORTEMS
  68. Apologize first. Mean it.

  69. Own your availability.

  70. Own your security.

  71. Own your mistakes.

  72. Own your ignorance.

  73. Know your audience.

  74. Don’t bullshit. Ever.

  75. I RECOMMEND THIS BLOG POST ABOUT BULLSHIT
  76. – DAVID HEINEMEIER HANSSON

    “The most important part of saying you’re sorry is to project some real empathy. If you can’t put yourself in your users’ shoes, then it’s going to come out wrong.”
  77. Post it relatively soon.

  78. THE BIG SECRET
  79. Nobody does this perfectly. Definitely not us.

  80. The point is to get better at it.

  81. THANKS
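
Slide 39's split-brain checklist can be kept as data and walked in order, so nobody has to remember the sequence under pressure. The following is a minimal sketch, not anything shown in the talk: `Step`, `run_checklist`, and `placeholder` are hypothetical names, and every action is a stand-in that a real team would replace with calls to its own tooling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One checklist item: a short label and an action that reports success."""
    label: str
    action: Callable[[], bool]


def run_checklist(name: str, steps: List[Step]) -> List[str]:
    """Run steps in order, stopping at the first failure.

    Returns a log that can be pasted into the incident timeline.
    """
    log = [f"BEGIN {name}"]
    for number, step in enumerate(steps, start=1):
        ok = step.action()
        log.append(f"{number}. {step.label}: {'done' if ok else 'FAILED'}")
        if not ok:
            log.append("STOP: escalate before continuing")
            return log
    log.append("UPDATE THE STATUS!")
    return log


def placeholder() -> bool:
    # Hypothetical stand-in: a real action would call your own tooling.
    return True


# Labels copied from slide 39; the actions are placeholders.
SPLIT_BRAIN = [
    Step("circuit break search: OFF", placeholder),
    Step("disable allocation", placeholder),
    Step("get cluster state", placeholder),
    Step("shutdown all nodes w/ API", placeholder),
    Step("start the cluster", placeholder),
    Step("wait for all members", placeholder),
    Step("enable allocation", placeholder),
]

if __name__ == "__main__":
    print("\n".join(run_checklist("ELASTICSEARCH: SPLIT BRAIN", SPLIT_BRAIN)))
```

Stopping at the first failed step is a design choice in this sketch: a checklist that keeps going after a step fails is back to guesswork.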
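
Slide 41 says to reduce MTTR, which only works if you measure it. Here is a small sketch that reads MTTR as mean time to resolution (an assumption; the slide does not expand the acronym) and averages the gap between when each incident opened and when it was resolved. The function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_resolve(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - opened) across incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incidents: one took 30 minutes, one took 90 minutes.
    sample = [
        (datetime(2014, 4, 1, 9, 0), datetime(2014, 4, 1, 9, 30)),
        (datetime(2014, 4, 9, 22, 0), datetime(2014, 4, 9, 23, 30)),
    ]
    print(mean_time_to_resolve(sample))  # prints 1:00:00
```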
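
Slide 47's rule (a customer update every 15-20 minutes at most) is easy to forget mid-incident, so it is worth letting something nag the Communicator. This sketch assumes you record the time of the last public update somewhere; `update_overdue` and `MAX_SILENCE` are hypothetical names, and 20 minutes is taken from the top of the range on the slide.

```python
from datetime import datetime, timedelta

# Upper bound from slide 47; tighten it if your customers expect more.
MAX_SILENCE = timedelta(minutes=20)


def update_overdue(last_update: datetime, now: datetime,
                   max_silence: timedelta = MAX_SILENCE) -> bool:
    """True once the gap since the last customer update reaches the limit."""
    return now - last_update >= max_silence


if __name__ == "__main__":
    last = datetime(2014, 4, 17, 12, 0)
    print(update_overdue(last, datetime(2014, 4, 17, 12, 10)))  # False
    print(update_overdue(last, datetime(2014, 4, 17, 12, 20)))  # True
```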
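
The six parts of a good postmortem from slide 52 can double as a template that tells you what is still missing before you publish. A rough sketch, assuming free-text sections; the class and field names are invented here and only mirror the slide's list.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Postmortem:
    """One field per section listed on slide 52, in the same order."""
    incident_description: str = ""
    root_cause: str = ""
    resolution_process: str = ""
    timeline: str = ""
    customer_impact: str = ""
    corrective_actions: str = ""


def missing_sections(pm: Postmortem) -> List[str]:
    """Names of sections that are still empty or whitespace-only."""
    return [f.name for f in fields(pm) if not getattr(pm, f.name).strip()]


if __name__ == "__main__":
    draft = Postmortem(incident_description="Search was unavailable.",
                       timeline="12:00 paged; 12:40 resolved.")
    print(missing_sections(draft))
```

An empty result from `missing_sections` only means every section has text in it. Whether that text is honest and blame-free is still on the humans.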