Incident Response Done Right: From First Page to Postmortem

A rambling sort of thing presented at DevOps ATL April 2014.

Will Farrington

April 17, 2014

Transcript

  1. INCIDENT RESPONSE DONE RIGHT
    FROM FIRST PAGE TO POSTMORTEM

  2. WILL FARRINGTON
    @wfarr on the Internet
    Ops @ GitHub, 2012-now
    Ops @ Rails Machine, 2009-2011

  3. LET’S TALK ABOUT
    INCIDENT RESPONSE

  4. INCIDENT
    EVENT
    NOTIFICATION
    IDENTIFICATION
    RESOLUTION
    POSTMORTEM

  5. INCIDENT
    EVENT
    NOTIFICATION
    IDENTIFICATION
    RESOLUTION
    POSTMORTEM

  6. The recipe to terrible software is simple: just add software.
    – ME

  7. All software is terrible.

  8. All software breaks.

  9. INCIDENT
    EVENT
    NOTIFICATION
    IDENTIFICATION
    RESOLUTION
    POSTMORTEM

  10. Writing your own calendaring and alerting application is a terrible idea.
    – ME

  11. Use PagerDuty, OpsGenie, smoke signals, or a carrier pigeon.

  12. INCIDENT
    EVENT
    NOTIFICATION
    IDENTIFICATION
    RESOLUTION
    POSTMORTEM

  13. WHAT YOU REALLY WANT TO KNOW IS
    WHAT IS THE PROBLEM?

  14. [Diagram: INPUT → APP → OUTPUT]

  15. [Diagram: INPUT → LB, APP, AUTH, DB, CACHE, APIS → OUTPUT]

  16. [Diagram: INPUT → OUTPUT]

  17. [Diagram: INPUT → ARCHITECTURE → OUTPUT]

  18. HOW WOULD YOU DESCRIBE YOUR
    PROCESS

  19. “Methodical” and “organized” aren’t often the first thought.

  20. Unfortunately, resolving the problem without identifying what the
    problem is usually results in more harm than good.

  21. A process more refined than guesswork is required.

  22. I RECOMMEND THIS BOOK ABOUT
    CHECKLISTS

  23. “It is common to misconceive how checklists function in complex
    lines of work. They are not comprehensive how-to guides, whether
    for building a skyscraper or getting a plane out of trouble. They are
    quick and simple tools aimed to buttress the skills of expert
    professionals.”
    – ATUL GAWANDE

  24. A GOOD CHECKLIST
    PRECISE
    EFFICIENT
    CONCISE
    PRACTICAL
    EASY TO USE

  25. ENGINE FAILURE DURING FLIGHT
    • Airspeed: 68 KIAS
    • Fuel Shutoff Valve: ON (IN)
    • Fuel Selector: BOTH
    • Auxiliary Fuel Pump: ON
    • Mixture: RICH
    • Ignition Switch: BOTH
    FLY THE AIRPLANE!

  26. Checklists help you eliminate the “obvious” from your mind
    so you can focus on the hard stuff.

  27. Checklists transform the process of identifying problems in a rapidly
    degrading situation from being haphazard and error-prone to
    methodical and organized.

  28. Occam’s Razor as a Service

  29. INCIDENT
    EVENT
    NOTIFICATION
    IDENTIFICATION
    RESOLUTION
    POSTMORTEM

  30. “The first step, in anything, is giving a shit.”
    – ME

  31. “‘That’s not my problem’ is possibly the worst thing people can
    think.”
    – ATUL GAWANDE

  32. It’s actually the single worst thing anyone can think.

  33. IT’S TIME TO
    FIX THE PROBLEM

  34. Do you have a checklist for that?

  35. I RECOMMEND THIS BOOK ABOUT
    CHECKLISTS
    (AGAIN)

  36. ENGINE FAILURE DURING FLIGHT
    • Airspeed: 68 KIAS
    • Fuel Shutoff Valve: ON (IN)
    • Fuel Selector: BOTH
    • Auxiliary Fuel Pump: ON
    • Mixture: RICH
    • Ignition Switch: BOTH
    FLY THE AIRPLANE!

  37. Let’s say Elasticsearch is split-brained.

  38. You should immediately reach for the checklist.

  39. ELASTICSEARCH: SPLIT BRAIN
    • circuit break search OFF
    • disable allocation
    • get cluster state
    • shutdown all nodes w/ API
    • start the cluster
    • wait for all members
    • enable allocation
    UPDATE THE STATUS!

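A checklist like the split-brain one can also live in code, so the steps are walked in order and nothing gets skipped under pressure. The sketch below is only an illustration: the `Checklist` class, its `run` method, and the `confirm` callback are hypothetical names, not anything from the talk, and no Elasticsearch API is called. The step text is copied from the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Checklist:
    """An ordered list of short steps, walked strictly in order (hypothetical helper)."""
    title: str
    steps: List[str]
    done: List[str] = field(default_factory=list)

    def run(self, confirm: Callable[[str], bool]) -> bool:
        """Ask `confirm` about each step in order; stop at the first unconfirmed one."""
        for step in self.steps:
            if not confirm(step):
                return False  # stop here; later steps are not attempted
            self.done.append(step)
        return True


# Step wording taken from the slide. "UPDATE THE STATUS!" is the slide's
# reminder banner, so it is kept out of the ordered steps.
SPLIT_BRAIN = Checklist(
    title="ELASTICSEARCH: SPLIT BRAIN",
    steps=[
        "circuit break search OFF",
        "disable allocation",
        "get cluster state",
        "shutdown all nodes w/ API",
        "start the cluster",
        "wait for all members",
        "enable allocation",
    ],
)
```

In practice `confirm` might prompt the on-call engineer in a terminal or in chat; here it is any function that returns `True` once a step is done.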

  40. Communicate synchronously.

  41. MTTR is the name of the game.
    Reduce it safely, by whatever means.

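The slide does not spell out how MTTR is measured. As a rough sketch, assuming it means mean time to resolution and that each incident is recorded as a (first page, resolved) pair of timestamps, it could be computed like this. The function name `mttr` and the record shape are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# One incident = (time of first page, time of resolution). This shape is an
# assumption; the talk does not define how incidents are recorded.
Incident = Tuple[datetime, datetime]


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from first page to resolution across the given incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - paged for paged, resolved in incidents), timedelta())
    return total / len(incidents)
```

Tracking this number over time is one way to tell whether checklists and clearer roles are actually shortening incidents.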

  42. On-call engineer
    Incident Commander
    Communicator

  43. Take 30s at the start of the hangout to make sure everyone knows
    who’s doing what.
    Make sure you say what your role is.

  44. Atul Gawande found that the simple act of a surgical team
    introducing themselves to one another before an operation
    increased the feeling of teamwork and efficacy across the team.
    It also enabled people to speak up when they saw something.

  45. Communicate to the customer.

  46. Do it often!
    Every 15-20 minutes should be the upper bound.

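The 15-20 minute upper bound is easy to lose track of mid-incident, so it may help to have something nag the Communicator. A minimal sketch, assuming the time of the last customer-facing update is tracked somewhere; `update_overdue` and `MAX_GAP` are hypothetical names, and the 20-minute default is simply the top of the range on the slide.

```python
from datetime import datetime, timedelta

# Top of the 15-20 minute range from the slide; adjust to taste.
MAX_GAP = timedelta(minutes=20)


def update_overdue(last_update: datetime, now: datetime,
                   max_gap: timedelta = MAX_GAP) -> bool:
    """Return True once `max_gap` or more has passed since the last status update."""
    return now - last_update >= max_gap
```

A chat bot or a simple timer could call this once a minute and ping the Communicator when it returns `True`.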

  47. Terrible things happen and if you don’t communicate to your
    customers, they’ll assume the worst.

  48. INCIDENT
    EVENT
    NOTIFICATION
    IDENTIFICATION
    RESOLUTION
    POSTMORTEM

  49. WE’VE FIXED THE PROBLEM
    NOW LET’S TALK ABOUT IT

  50. “Regular postmortems are the closest thing you have to employing a
    scientific method to the complicated problem of web operations. By
    gathering real evidence, you can focus your limited resources on
    solving the issues that are actually causing you problems.”
    – JESSE ROBBINS

  51. A GOOD POSTMORTEM
    DESCRIPTION OF THE INCIDENT
    DESCRIPTION OF THE ROOT CAUSE
    DESCRIPTION OF THE RESOLUTION PROCESS
    TIMELINE OF THE INCIDENT
    HOW THE INCIDENT AFFECTED CUSTOMERS
    REMEDIATIONS OR CORRECTIVE ACTIONS

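The six parts of a good postmortem can double as a template that refuses to be called finished while a section is blank. The sketch below assumes plain-text sections; the `Postmortem` class, its field names, and `missing_sections` are hypothetical and only mirror the headings on the slide.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Postmortem:
    """One field per section on the slide; the names are illustrative."""
    incident_description: str = ""
    root_cause: str = ""
    resolution_process: str = ""
    timeline: str = ""
    customer_impact: str = ""
    corrective_actions: str = ""


def missing_sections(pm: Postmortem) -> List[str]:
    """Names of sections that are still empty or whitespace-only, in slide order."""
    return [f.name for f in fields(pm) if not getattr(pm, f.name).strip()]
```

An empty result from `missing_sections` says only that every section has text, not that the text is honest or complete.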

  52. A GOOD POSTMORTEM
    DESCRIPTION OF THE INCIDENT
    DESCRIPTION OF THE ROOT CAUSE
    DESCRIPTION OF THE RESOLUTION PROCESS
    TIMELINE OF THE INCIDENT
    HOW THE INCIDENT AFFECTED CUSTOMERS
    REMEDIATIONS OR CORRECTIVE ACTIONS

  53. A GOOD POSTMORTEM REQUIRES
    TRUST AND HONESTY

  54. Blame and punitive measures cannot enter the realm of possibility.
    Otherwise, you create a conflict of interest about honesty.

  55. I RECOMMEND THIS BOOK ABOUT
    HUMAN ERROR

  56. “Different perspectives on a sequence of events: Looking from the
    outside and hindsight you have knowledge of the outcome and
    dangers involved. From the inside, you may have neither.”
    – SIDNEY DEKKER

  57. Let’s entertain the thought that we don’t hire mindless automatons.
    We hire people who can and do think, and who care.

  58. Faced with a complex problem in a high-pressure scenario, with a
    process ill-equipped to effectively help them navigate the situation,
    their actions were entirely logical and yet doomed to fail.

  59. The most important thing is having all the facts.

  60. If facts are altered or missing, you cannot effectively remediate.

  61. A GOOD POSTMORTEM
    DESCRIPTION OF THE INCIDENT
    DESCRIPTION OF THE ROOT CAUSE
    DESCRIPTION OF THE RESOLUTION PROCESS
    TIMELINE OF THE INCIDENT
    HOW THE INCIDENT AFFECTED CUSTOMERS
    REMEDIATIONS OR CORRECTIVE ACTIONS

  62. THE MOST COMMON PROBLEM IS
    SETTING YOURSELF UP FOR FAILURE

  63. Your corrective actions should be aimed at figuring out how your
    process made the failure possible, and fixing the process.

  64. More training and trying harder are never the right answer.

  65. PUBLIC POSTMORTEMS

  66. Apologize first. Mean it.

  67. Own your availability.

  68. Own your security.

  69. Own your mistakes.

  70. Own your ignorance.

  71. Know your audience.

  72. Don’t bullshit. Ever.

  73. I RECOMMEND THIS BLOG POST ABOUT
    BULLSHIT

  74. “The most important part of saying you’re sorry is to project some
    real empathy. If you can’t put yourself in your users’ shoes, then it’s
    going to come out wrong.”
    – DAVID HEINEMEIER HANSSON

  75. Post it relatively soon.

  76. THE BIG SECRET

  77. Nobody does this perfectly.
    Definitely not us.

  78. The point is to get better at it.
