Get Your Story Straight

j.hand
November 03, 2015

Gartner has predicted that, in 2015, 80% of outages will be caused by people and process issues. Are you considering the human element when revisiting incidents and outages in your infrastructure? If so, are you approaching it with a blameless mindset, with an agenda focused on removing bias in its many forms and searching for the truth? Do you believe that there is always a single root cause to a problem, or is it more accurate to seek out additional factors that may have contributed to the incident, especially with regard to people and processes?

Regardless of your approach, the point of a post-mortem is to accurately describe the "story" of what took place in as much detail as possible: the good, the bad, those involved, conversations had, actions taken, related timestamps, who was on-call, and so on. You want to know everything that took place and was related in some degree, so that you can review the data and learn from it. How do we ensure that we are asking the right questions and seeking out the relevant information that will help us understand what took place and, ultimately, how to become a better team, company, and product as a result?

I'll introduce best practices for conducting effective post-mortems and illustrate their importance with statistical data to back up the claims, demonstrating that there are measurable benefits from adopting post-mortems, especially those of a "blameless" nature.

Transcript

  1. How Do We Get Better? Learn from incidents, outages, and events. (@jasonhand | victorops.com)
  2. What? A process intended to inform improvements by determining aspects that were successful or unsuccessful. When? As soon as feasible after the incident is resolved.
  3. Who? Everyone involved and related stakeholders. Why? To communicate with your team. To understand what happened, for learning and improving.
  4. We are here to Learn, NOT Blame. Learn & Identify Improvements.
  5. This is an opportunity to learn ... in a safe environment.