Flaming Poo & The Human Response

j.hand
November 07, 2015

Even the best-designed systems can and will have outages. No matter how well you’ve hardened your infrastructure and put failover or self-healing automation in place, something you didn’t see coming will wreak havoc in your special snowflake of a system. In many cases a human is likely to be a contributing factor. In fact, Gartner has predicted that in 2015, 80% of outages will be caused by people and process issues.

Are you considering the human element when revisiting incidents and outages with your infrastructure? If so, are you approaching it with a blameless mindset focused on removing the many forms of bias and searching for absolute truth? Do you believe that there is always a root cause to outages, or is it more accurate to seek out additional aspects that may have contributed to the incident, especially with regard to the people and processes?

Regardless of your approach, the point of a postmortem is to accurately describe the "story" of what took place in as much detail as possible: the good, the bad, those involved, conversations had, actions taken, related timestamps, who was on-call, etc. You want to know everything that took place that was related to some degree so that you can review the data and learn from it.

How do we ensure that we are asking the right questions and seeking out relevant and important information that will help us understand what took place and ultimately how to become a better team, company, and product as a result?

Blameless culture (specifically blameless postmortems) is a topic of interest to many in the middle of a DevOps transformation within their organization. I'll outline important best practices for conducting effective postmortems and demonstrate methods to measure the benefits of adopting postmortems, especially those of a "blameless" nature.
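The "story" a postmortem tries to capture (those involved, conversations had, actions taken, related timestamps, who was on-call) can be kept as a simple ordered timeline. The following is a minimal sketch, not anything taken from the talk: the class names `TimelineEntry` and `Postmortem`, their fields, and the sample incident data are all hypothetical and only illustrate one way to record the details so they can be reviewed later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TimelineEntry:
    """One event in the incident story (hypothetical structure)."""
    timestamp: datetime
    actor: str        # who or what acted; recorded for context, not blame
    kind: str         # e.g. "alert", "conversation", "action"
    description: str


@dataclass
class Postmortem:
    """Collects everything related to an incident for later review."""
    title: str
    on_call: str
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, entry: TimelineEntry) -> None:
        # Entries may arrive out of order as people recall details.
        self.entries.append(entry)

    def timeline(self) -> List[TimelineEntry]:
        # Return the story in chronological order.
        return sorted(self.entries, key=lambda e: e.timestamp)

    def duration(self) -> Optional[timedelta]:
        # Time between the first and last recorded events, if any exist.
        if not self.entries:
            return None
        ordered = self.timeline()
        return ordered[-1].timestamp - ordered[0].timestamp


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    pm = Postmortem(title="Example outage", on_call="alice")
    pm.record(TimelineEntry(datetime(2015, 11, 7, 10, 30), "bob",
                            "action", "Restarted the API service"))
    pm.record(TimelineEntry(datetime(2015, 11, 7, 10, 0), "monitoring",
                            "alert", "Latency alert fired"))
    for entry in pm.timeline():
        print(entry.timestamp.isoformat(), entry.actor, entry.kind,
              entry.description)
```

Keeping the actor as plain context alongside each timestamp supports asking "how?" rather than "who?" when the timeline is reviewed.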

Transcript

  1. Obvious: The relationship between cause & effect is obvious. Sense - categorize - respond. "Best Practice" @jasonhand
  2. Complicated: The relationship between cause & effect requires analysis, investigation, triage, and/or "expert" knowledge. Sense - analyze - respond. "Good Practice"
  3. Complex: The relationship between cause & effect can only be perceived through retrospect. Probe - sense - respond. "Emergent Practice"
  4. Chaotic: No relationship between cause & effect at systems level. Act - sense - respond. "Novel Practice"
  5. "We know that engineers build better systems when they support those systems" [5]. [5] Pete Cheslock (ThreatStack) - Velocity New York (10/14/15)
  6. 80% of outages will be caused by people and process issues [1]. [1] Gartner (https://www.gartner.com/doc/334197/nsm-weakest-link-business-availability)
  7. Humans may not (always) be a contributing factor, but ... they are (likely) part of the resolution or improvement process
  8. Cognitive bias: Deviation in judgement due to choosing timeliness over accuracy (ETTO: Efficiency-Thoroughness Trade-Off)
  9. We are WIRED ... to blame. "Blame is a way to discharge pain and discomfort" [6]. [6] Brené Brown
  10. How? Instead of Who? Focused on removing blame & the many forms of bias that prevent us from identifying areas of improvement?
  11. The point of a postmortem is to accurately describe the “story” of what took place so that we can ... learn & improve
  12. (S)pecific (M)easurable (A)ctionable (R)ealistic (T)imely ... Action Items aimed at (small) incremental improvements
  13. "It's not about the outcome! It's about the response" [7]. [7] J. Paul Reed (@jpaulreed) & Kevina Finn-Braun (@kfinnbraun)
  14. Abstract: Even the best-designed systems can and will have outages. No matter how well you’ve hardened your infrastructure and put failover or self-healing automation in place, something you didn’t see coming will wreak havoc in your special snowflake of a system. In many cases a human is likely to be a contributing factor. In fact, Gartner has predicted that in 2015, 80% of outages will be caused by people and process issues. Are you considering the human element when revisiting incidents and outages with your infrastructure? If so, are you approaching it with a blameless mindset focused on removing the many forms of bias and searching for absolute truth? Do you believe that there is always a root cause to outages, or is it more accurate to seek out additional aspects that may have contributed to the incident, especially with regard to the people and processes? Regardless of your approach, the point of a postmortem is to accurately describe the "story" of what took place in as much detail as possible: the good, the bad, those involved, conversations had, actions taken, related timestamps, who was on-call, etc. You want to know everything that took place that was related to some degree so that you can review the data and learn from it. How do we ensure that we are asking the right questions and seeking out relevant and important information that will help us understand what took place and ultimately how to become a better team, company, and product as a result? Blameless culture (specifically blameless postmortems) is a topic of interest to many in the middle of a DevOps transformation within their organization. I'll outline important best practices for conducting effective postmortems and demonstrate methods to measure the benefits of adopting postmortems, especially those of a "blameless" nature.
  15. Images:
    • https://thinkbeyondthelogo.files.wordpress.com/2015/06/machine.jpg
    • http://4.bp.blogspot.com/-TTAqwl4SFSM/UGObgB-qbSI/AAAAAAAAA5Y/jp216LHBb7A/s1600/slide2375301201312free.jpg
    • http://www.reactiongifs.com/r/brule-omg.gif
    • http://www.chadecerebro.com.br/wp-content/uploads/2015/05/diferen%C3%A7a-entre-cliques-e-sess%C3%B5es.png
    • http://s3-ec.buzzfed.com/static/2014-03/enhanced/webdr02/8/22/anigifenhanced-buzz-25148-1394334423-21.gif
    • http://helixpc.com/wp-content/uploads/2014/02/80421-blue-circuit-board1.jpg
    • http://www.designvertise.com/wp-content/uploads/2014/05/Mountain-Graph-by-Seth-Eckert.gif
    • https://giphy.com/gifs/feels-adventure-time-fangirling-oxLgK1Rrubpba
    • http://orangesv.com/wp-content/uploads/2015/01/neural-network-aficionados-ersatz-event-brain-graphic-1140x440-1140x440.jpg
    • https://thenypost.files.wordpress.com/2014/10/hoverboard2.jpg
    • http://images.goranhoracek.com.s3.amazonaws.com/wp-content/uploads/2011/01/knife3.jpeg
    • http://www.kaizen-news.com/wp-content/uploads/2014/10/kaizen-small-improvements.png
    • http://www.clarkgaither.com/wp-content/uploads/2015/03/Man-Pointing-Finger.jpg
    • http://www.hdwallpaper.nu/wp-content/uploads/2015/02/os_x_lynx-2560x1600.jpg
    • http://www.grassrootsfitness.ie/wordpress/wp-content/uploads/2014/07/starting-line.jpeg
    • http://blog.hace-online.nl/wp-content/uploads/2011/06/Stamina-concept.png
    • http://www.gregoryclassics.com/wp-content/uploads/2014/09/sail.jpg
    • http://www.photos-public-domain.com/wp-content/uploads/2011/09/smart.jpg
    • https://c1.staticflickr.com/9/8303/7785828546a1fda0801bb.jpg
    • https://nesncom.files.wordpress.com/2013/12/peyton-manning2.jpg?w=599&h=492
  16. Resources:
    • http://werve.net/articles/running-effective-retrospectives/
    • http://blog.hut8labs.com/dan-talks-about-post-mortems.html
    • http://product.hubspot.com/blog/bid/64771/Post-Mortems-at-HubSpot-What-I-Learned-From-250-Whys
    • https://medium.com/towards-a-remarkable-career/how-to-run-a-simple-postmortem-9c3eff094b5f
    • http://www.kitchensoap.com/2014/11/14/the-infinite-hows-or-the-dangers-of-the-five-whys/
    • https://www.gartner.com/doc/334197/nsm-weakest-link-business-availability
    • https://www.victorops.com/blog
    • https://www.jasonhand.com