Incident Response Done Right: From First Page to Postmortem

A rambling sort of thing presented at DevOps ATL April 2014.

Will Farrington

April 17, 2014

Transcript

  1. INCIDENT RESPONSE DONE RIGHT
    FROM FIRST PAGE TO POSTMORTEM

  2. WILL FARRINGTON
    @wfarr on the Internet
    Ops @ GitHub, 2012-now
    Ops @ Rails Machine, 2009-2011

  3. LET’S TALK ABOUT
    INCIDENT RESPONSE

  4. INCIDENT
    EVENT
    NOTIFICATION
    IDENTIFICATION
    RESOLUTION
    POSTMORTEM

  5. INCIDENT
    EVENT
    NOTIFICATION
    IDENTIFICATION
    RESOLUTION
    POSTMORTEM

  6. The recipe to terrible software is simple: just add software.
    – ME

  7. All software is terrible.

  8. All software breaks.

  9. INCIDENT
    EVENT
    NOTIFICATION
    IDENTIFICATION
    RESOLUTION
    POSTMORTEM

  10. Writing your own calendaring and alerting application is a terrible idea.
    – ME

  11. Use PagerDuty, OpsGenie, smoke signals, or a carrier pigeon.

  12. INCIDENT
    EVENT
    NOTIFICATION
    IDENTIFICATION
    RESOLUTION
    POSTMORTEM

  13. WHAT YOU REALLY WANT TO KNOW IS
    WHAT IS THE PROBLEM?

  14. APP
    INPUT OUTPUT

  15. INPUT OUTPUT
    LB
    APP AUTH
    DB
    CACHE
    APIS

  16. INPUT OUTPUT

  17. INPUT OUTPUT
    ARCHITECTURE

  18. HOW WOULD YOU DESCRIBE YOUR
    PROCESS

  19. “Methodical” and “organized” aren’t often the first thought.

  20. Unfortunately, resolving the problem without identifying what the
    problem is usually results in more harm than good.

  21. A process more refined than guesswork is required.

  22. I RECOMMEND THIS BOOK ABOUT
    CHECKLISTS

  23. “It is common to misconceive how checklists function in complex
    lines of work. They are not comprehensive how-to guides, whether
    for building a skyscraper or getting a plane out of trouble. They are
    quick and simple tools aimed to buttress the skills of expert
    professionals.”
    – ATUL GAWANDE

  24. A GOOD CHECKLIST
    PRECISE
    EFFICIENT
    CONCISE
    PRACTICAL
    EASY TO USE

  25. ENGINE FAILURE DURING FLIGHT
    • Airspeed: 68 KIAS
    • Fuel Shutoff Valve: ON (IN)
    • Fuel Selector: BOTH
    • Auxiliary Fuel Pump: ON
    • Mixture: RICH
    • Ignition Switch: BOTH
    FLY THE AIRPLANE!

  26. Checklists help you eliminate the “obvious” from your mind
    so you can focus on the hard stuff.

  27. Checklists transform the process of identifying problems in a rapidly
    degrading situation from being haphazard and error-prone to
    methodical and organized.

  28. Occam’s Razor as a Service

  29. INCIDENT
    EVENT
    NOTIFICATION
    IDENTIFICATION
    RESOLUTION
    POSTMORTEM

  30. “The first step, in anything, is giving a shit.”
    – ME

  31. “‘That’s not my problem’ is possibly the worst thing people can
    think.”
    – ATUL GAWANDE

  32. It’s actually the single worst thing anyone can think.

  33. IT’S TIME TO
    FIX THE PROBLEM

  34. Do you have a checklist for that?

  35. I RECOMMEND THIS BOOK ABOUT
    CHECKLISTS
    (AGAIN)

  36. ENGINE FAILURE DURING FLIGHT
    • Airspeed: 68 KIAS
    • Fuel Shutoff Valve: ON (IN)
    • Fuel Selector: BOTH
    • Auxiliary Fuel Pump: ON
    • Mixture: RICH
    • Ignition Switch: BOTH
    FLY THE AIRPLANE!

  37. Let’s say Elasticsearch is split-brained.

  38. You should immediately reach for the checklist.

  39. ELASTICSEARCH: SPLIT BRAIN
    • circuit break search OFF
    • disable allocation
    • get cluster state
    • shutdown all nodes w/ API
    • start the cluster
    • wait for all members
    • enable allocation
    UPDATE THE STATUS!
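
A minimal sketch of how a checklist like this one could be walked through in order while timestamping each step, so the record can later feed a timeline. The step names are copied from the slide; `ChecklistRun` and the `confirm` callback are hypothetical names, not from the talk, and the code makes no Elasticsearch calls.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Tuple

# Step names copied from the slide; everything else here is illustrative.
SPLIT_BRAIN_STEPS = [
    "circuit break search OFF",
    "disable allocation",
    "get cluster state",
    "shutdown all nodes w/ API",
    "start the cluster",
    "wait for all members",
    "enable allocation",
]


@dataclass
class ChecklistRun:
    """Hypothetical helper: runs steps in order and logs when each one finished."""
    name: str
    steps: List[str]
    log: List[Tuple[datetime, str]] = field(default_factory=list)

    def run(self, confirm: Callable[[str], bool]) -> bool:
        """Walk the steps in order; stop at the first step that is not confirmed."""
        for step in self.steps:
            now = datetime.now(timezone.utc)
            if not confirm(step):
                self.log.append((now, f"STOPPED at: {step}"))
                return False
            self.log.append((now, f"done: {step}"))
        return True
```

In practice `confirm` might prompt the on-call engineer; here it is just any function that returns True once the step has been carried out.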

  40. Communicate synchronously.

  41. MTTR is the name of the game.
    Reduce it safely, by whatever means.
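
The slide does not define MTTR. Assuming the common reading (mean time to recovery, averaged over incidents), a rough sketch of the arithmetic follows; `mean_time_to_recovery` and the `(started, resolved)` pair layout are illustrative assumptions, not from the talk.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def mean_time_to_recovery(
    incidents: Iterable[Tuple[datetime, datetime]],
) -> timedelta:
    """Average of (resolved - started) across incidents."""
    durations = [resolved - started for started, resolved in incidents]
    if not durations:
        raise ValueError("need at least one incident")
    # sum() needs a timedelta start value; dividing a timedelta by an int is supported.
    return sum(durations, timedelta()) / len(durations)
```

For example, one 30-minute incident and one 90-minute incident average out to an MTTR of 60 minutes.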

  42. Delegate

  43. On-call engineer
    Incident Commander
    Communicator

  44. Take 30s at the start of the hangout to make sure everyone knows
    who’s doing what.
    Make sure you say what your role is.
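
As a rough illustration of that roll call, the three roles from the previous slide can be checked against whoever has claimed them. `unassigned_roles` and the dict layout are hypothetical, not something the talk prescribes.

```python
from typing import Dict, List

# Role names taken from the slide above.
ROLES = ("On-call engineer", "Incident Commander", "Communicator")


def unassigned_roles(assignments: Dict[str, str]) -> List[str]:
    """Return the roles nobody has claimed yet, in slide order."""
    return [role for role in ROLES if not assignments.get(role)]
```

An empty result means everyone on the call knows who is doing what.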

  45. Atul Gawande found that the simple act of a surgical team
    introducing themselves to one another before an operation
    increased the feeling of teamwork and efficacy across the team.
    It also enabled people to speak up when they see something.

  46. Communicate to the customer.

  47. Do it often!
    Every 15-20 minutes should be the upper bound.
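
One way to hold yourself to that cadence is a simple overdue check. This is a sketch only: `update_overdue` is a hypothetical helper, and the 20-minute default is an assumption that takes the top of the 15-20 minute range as the hard limit.

```python
from datetime import datetime, timedelta

# Assumption: treat the top of the slide's 15-20 minute range as the limit.
MAX_GAP = timedelta(minutes=20)


def update_overdue(
    last_update: datetime, now: datetime, max_gap: timedelta = MAX_GAP
) -> bool:
    """True once the gap since the last customer-facing update reaches max_gap."""
    return now - last_update >= max_gap
```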

  48. Terrible things happen and if you don’t communicate to your
    customers, they’ll assume the worst.

  49. INCIDENT
    EVENT
    NOTIFICATION
    IDENTIFICATION
    RESOLUTION
    POSTMORTEM

  50. WE’VE FIXED THE PROBLEM
    NOW LET’S TALK ABOUT IT

  51. “Regular postmortems are the closest thing you have to employing a
    scientific method to the complicated problem of web operations. By
    gathering real evidence, you can focus your limited resources on
    solving the issues that are actually causing you problems.”
    – JESSE ROBBINS

  52. A GOOD POSTMORTEM
    DESCRIPTION OF THE INCIDENT
    DESCRIPTION OF THE ROOT CAUSE
    DESCRIPTION OF THE RESOLUTION PROCESS
    TIMELINE OF THE INCIDENT
    HOW THE INCIDENT AFFECTED CUSTOMERS
    REMEDIATIONS OR CORRECTIVE ACTIONS
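
The six sections listed above could be captured as a template that flags whatever is still blank before the postmortem goes out. This is a sketch under assumptions: the `Postmortem` class and its field names are hypothetical stand-ins for the slide's headings, not a format the talk defines.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Postmortem:
    """One free-text field per section on the slide (field names are illustrative)."""
    incident_description: str = ""
    root_cause: str = ""
    resolution_process: str = ""
    timeline: str = ""
    customer_impact: str = ""
    corrective_actions: str = ""

    def missing_sections(self) -> List[str]:
        """Names of the sections that are empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```

A draft is ready for review only when `missing_sections()` comes back empty.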

  53. A GOOD POSTMORTEM
    DESCRIPTION OF THE INCIDENT
    DESCRIPTION OF THE ROOT CAUSE
    DESCRIPTION OF THE RESOLUTION PROCESS
    TIMELINE OF THE INCIDENT
    HOW THE INCIDENT AFFECTED CUSTOMERS
    REMEDIATIONS OR CORRECTIVE ACTIONS

  54. A GOOD POSTMORTEM REQUIRES
    TRUST AND HONESTY

  55. [No text on this slide]

  56. Blame and punitive measures cannot enter the realm of possibility.
    Otherwise, you create a conflict of interest about honesty.

  57. I RECOMMEND THIS BOOK ABOUT
    HUMAN ERROR

  58. “Different perspectives on a sequence of events: Looking from the
    outside and hindsight you have knowledge of the outcome and
    dangers involved. From the inside, you may have neither.”
    – SIDNEY DEKKER

  59. Let’s entertain the thought that we don’t hire mindless automatons.
    We hire people who can and do think, and who care.

  60. Faced with a complex problem in a high-pressure scenario, with a
    process ill-equipped to effectively help them navigate the situation,
    their actions were entirely logical and yet doomed to fail.

  61. The most important thing is having all the facts.

  62. If facts are altered or missing, you cannot effectively remediate.

  63. A GOOD POSTMORTEM
    DESCRIPTION OF THE INCIDENT
    DESCRIPTION OF THE ROOT CAUSE
    DESCRIPTION OF THE RESOLUTION PROCESS
    TIMELINE OF THE INCIDENT
    HOW THE INCIDENT AFFECTED CUSTOMERS
    REMEDIATIONS OR CORRECTIVE ACTIONS

  64. THE MOST COMMON PROBLEM IS
    SETTING YOURSELF UP FOR FAILURE

  65. Your corrective actions should be aimed at figuring out how your
    process made the failure possible, and fixing the process.

  66. More training and trying harder are never the right answer.

  67. PUBLIC POSTMORTEMS

  68. Apologize first. Mean it.

  69. Own your availability.

  70. Own your security.

  71. Own your mistakes.

  72. Own your ignorance.

  73. Know your audience.

  74. Don’t bullshit. Ever.

  75. I RECOMMEND THIS BLOG POST ABOUT
    BULLSHIT

  76. “The most important part of saying you’re sorry is to project some
    real empathy. If you can’t put yourself in your users’ shoes, then it’s
    going to come out wrong.”
    – DAVID HEINEMEIER HANSSON

  77. Post it relatively soon.

  78. THE BIG SECRET

  79. Nobody does this perfectly.
    Definitely not us.

  80. The point is to get better at it.

  81. THANKS
