Flaming Poo & The Human Response

j.hand
November 07, 2015

Even the best-designed systems can and will have outages. No matter how well you’ve hardened your infrastructure and put in place failover or self-healing automation, something you didn’t see coming will wreak havoc in your special snowflake of a system. In many cases a human is likely to be a contributing factor. In fact, Gartner has predicted that in 2015, 80% of outages will be caused by people and process issues.

Are you considering the human element when revisiting incidents and outages with your infrastructure? If so, are you approaching it with a blameless mindset focused on removing the many forms of bias and searching for absolute truth? Do you believe that there is always a root cause to outages, or is it more accurate to seek out additional aspects that may have contributed to the incident, especially with regard to the people and processes?

Regardless of your approach, the point of a postmortem is to accurately describe the "story" of what took place in as much detail as possible: the good, the bad, those involved, conversations had, actions taken, related timestamps, who was on call, and so on. You want to know everything that took place that was related in some degree so that you can review the data and learn from it.

How do we ensure that we are asking the right questions and seeking out relevant and important information that will help us understand what took place and ultimately how to become a better team, company, and product as a result?

Blameless culture (specifically blameless postmortems) is a topic of interest to many in the middle of a DevOps transformation within their organization. I'll outline important best practices for conducting effective postmortems and demonstrate methods to measure the benefits of adopting postmortems, especially those of a "blameless" nature.
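
One way to capture the "story" described above (timestamps, those involved, conversations, and actions) is as a list of timestamped entries sorted into a timeline. The following is a minimal sketch, not a format taken from the talk: the `TimelineEntry` class, its field names, the `kind` labels, and the sample people and events are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEntry:
    """One observed event in an incident (hypothetical structure)."""
    at: datetime        # when it happened
    actor: str          # who noticed, said, or did it
    kind: str           # e.g. "observation", "conversation", "action"
    description: str    # describe what happened; do not explain or judge


def build_timeline(entries):
    """Return the entries ordered by timestamp, earliest first."""
    return sorted(entries, key=lambda e: e.at)


def render(entries):
    """Format the ordered timeline as plain text, one event per line."""
    return "\n".join(
        f"{e.at:%H:%M:%S} [{e.kind}] {e.actor}: {e.description}"
        for e in build_timeline(entries)
    )


if __name__ == "__main__":
    # Invented sample data, recorded out of order on purpose.
    sample = [
        TimelineEntry(datetime(2015, 11, 7, 10, 5, tzinfo=timezone.utc),
                      "bob", "action", "Restarted the API service"),
        TimelineEntry(datetime(2015, 11, 7, 10, 0, tzinfo=timezone.utc),
                      "alice", "observation", "Latency alert fired"),
    ]
    print(render(sample))
```

Keeping the description field descriptive rather than explanatory leaves interpretation for the review itself.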

Transcript

  1. Flaming 💩 & the Human Response @jasonhand

  2. Jason Hand DevOps Evangelist VictorOps @jasonhand

  3. Systems WILL have outages @jasonhand

  4. @jasonhand

  5. SIKE! U MAD? @jasonhand

  6. @jasonhand

  7. Have You Tried ... turning it off and on again? @jasonhand

  8. It's the FUTURE @jasonhand

  9. 💩 WILL BREAK @jasonhand

  10. ... Halp ... @jasonhand

  11. I don't think I liked that Nope! @jasonhand

  12. Postmortem @jasonhand

  13. @jasonhand

  14. Complicated (Knowable) "known unknowns" @jasonhand

  15. - Indianapolis Raceway Park (1997) @jasonhand

  16. Complex (Unknown) "unknown unknowns" @jasonhand

  17. @jasonhand

  18. Cynefin Framework @jasonhand

  19. @jasonhand

  20. Obvious The relationship between cause & effect is obvious sense - categorize - respond "Best Practice" @jasonhand

  21. Complicated The relationship between cause & effect requires analysis, investigation, triage, and/or "expert" knowledge sense - analyze - respond "Good Practice" @jasonhand

  22. Complex The relationship between cause & effect can only be perceived through retrospect probe - sense - respond "Emergent Practice" @jasonhand

  23. Chaotic No relationship between cause & effect at systems level act - sense - respond "Novel Practice" @jasonhand

  24. Say Root Cause One more time .. @jasonhand

  25. Remember when? @jasonhand

  26. Ops didn't like Devs messing with infrastructure @jasonhand

  27. Make Ops Great Again! @jasonhand

  28. From No No No To Go Go Go @jasonhand

  29. Full Stack @jasonhand

  30. "We know that engineers build better systems when they support those systems" [5] Pete Cheslock (ThreatStack) - Velocity New York (10/14/15) @jasonhand

  31. Devs on-call @jasonhand

  32. Ops @jasonhand

  33. Devs @jasonhand

  34. @jasonhand

  35. 80% of outages will be caused by people and process issues [1] Gartner (https://www.gartner.com/doc/334197/nsm-weakest-link-business-availability) @jasonhand

  36. Humans May not (always) be a contributing factor But ... They are (likely) part of the resolution or improvement process @jasonhand

  37. And as Fallible Humans we are susceptible to Bias @jasonhand

  38. Cognitive bias.. Deviation in judgement due to choosing timeliness over accuracy (ETTO) - Efficiency-Thoroughness Trade-Off @jasonhand

  39. Normalcy bias We believe it won't happen to us, because it hasn't previously @jasonhand

  40. Hindsight bias We believe it was predictable despite all evidence to the contrary @jasonhand

  41. Confirmation bias We seek information to back up our position @jasonhand

  42. Our minds look for short cuts @jasonhand

  43. @jasonhand

  44. We are WIRED ..to blame "Blame is a way to discharge pain and discomfort" [6] Brené Brown @jasonhand

  45. How? Instead of Who? Focused on removing blame & the many forms of bias that prevent us from identifying areas of improvement? @jasonhand

  46. The point of a postmortem is to accurately describe the “story” of what took place so that we can.. learn & improve @jasonhand

  47. Are you a Paid Pro? @jasonhand

  48. Maxim: We are here to Learn AND Improve @jasonhand

  49. In the beginning @jasonhand

  50. Establish the Timeline What did we notice first and when? @jasonhand

  51. Describe rather than explain Give an accurate account of what took place @jasonhand

  52. context @jasonhand

  53. Conversations & Actions @jasonhand

  54. Contributing factor Definition: Something that is partly responsible for a development or anomaly @jasonhand

  55. everybody gets a voice @jasonhand

  56. (S)pecific (M)easurable (A)ctionable (R)ealistic (T)imely ... Action Items aimed at (small) incremental improvements @jasonhand

  57. Is it working? @jasonhand

  58. MTTA & MTTR improvements over time @jasonhand

  59. Improvements in... Volume of Actionable Alerts @jasonhand

  60. "It's not about the outcome! It's about the response" [7] J. Paul Reed (@jpaulreed) & Kevina Finn-Braun (@kfinnbraun) @jasonhand

  61. Continuous Incremental Improvements (i.e. baby steps) @jasonhand

  62. Teeny, Tiny Action Items @jasonhand

  63. Ongoing @jasonhand

  64. Barriers & Friction: Knock'em Down (walls, silos, bottlenecks, bad process) @jasonhand

  65. Never Finished with Continuous Improvements @jasonhand

  66. Never Finished with Transforming the way we deliver software @jasonhand

  67. Continuous @jasonhand

  68. Thank You @jasonhand

  69. Abstract (identical to the talk description at the top of this page) @jasonhand

  70. Images:
    https://thinkbeyondthelogo.files.wordpress.com/2015/06/machine.jpg
    http://4.bp.blogspot.com/-TTAqwl4SFSM/UGObgB-qbSI/AAAAAAAAA5Y/jp216LHBb7A/s1600/slide2375301201312free.jpg
    http://www.reactiongifs.com/r/brule-omg.gif
    http://www.chadecerebro.com.br/wp-content/uploads/2015/05/diferen%C3%A7a-entre-cliques-e-sess%C3%B5es.png
    http://s3-ec.buzzfed.com/static/2014-03/enhanced/webdr02/8/22/anigifenhanced-buzz-25148-1394334423-21.gif
    http://helixpc.com/wp-content/uploads/2014/02/80421-blue-circuit-board1.jpg
    http://www.designvertise.com/wp-content/uploads/2014/05/Mountain-Graph-by-Seth-Eckert.gif
    https://giphy.com/gifs/feels-adventure-time-fangirling-oxLgK1Rrubpba
    http://orangesv.com/wp-content/uploads/2015/01/neural-network-aficionados-ersatz-event-brain-graphic-1140x440-1140x440.jpg
    https://thenypost.files.wordpress.com/2014/10/hoverboard2.jpg
    http://images.goranhoracek.com.s3.amazonaws.com/wp-content/uploads/2011/01/knife3.jpeg
    http://www.kaizen-news.com/wp-content/uploads/2014/10/kaizen-small-improvements.png
    http://www.clarkgaither.com/wp-content/uploads/2015/03/Man-Pointing-Finger.jpg
    http://www.hdwallpaper.nu/wp-content/uploads/2015/02/os_x_lynx-2560x1600.jpg
    http://www.grassrootsfitness.ie/wordpress/wp-content/uploads/2014/07/starting-line.jpeg
    http://blog.hace-online.nl/wp-content/uploads/2011/06/Stamina-concept.png
    http://www.gregoryclassics.com/wp-content/uploads/2014/09/sail.jpg
    http://www.photos-public-domain.com/wp-content/uploads/2011/09/smart.jpg
    https://c1.staticflickr.com/9/8303/7785828546a1fda0801bb.jpg
    https://nesncom.files.wordpress.com/2013/12/peyton-manning2.jpg?w=599&h=492
    @jasonhand

  71. Resources:
    • http://werve.net/articles/running-effective-retrospectives/
    • http://blog.hut8labs.com/dan-talks-about-post-mortems.html
    • http://product.hubspot.com/blog/bid/64771/Post-Mortems-at-HubSpot-What-I-Learned-From-250-Whys
    • https://medium.com/towards-a-remarkable-career/how-to-run-a-simple-postmortem-9c3eff094b5f
    • http://www.kitchensoap.com/2014/11/14/the-infinite-hows-or-the-dangers-of-the-five-whys/
    • https://www.gartner.com/doc/334197/nsm-weakest-link-business-availability
    • https://www.victorops.com/blog
    • https://www.jasonhand.com
    @jasonhand
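
Slide 58 names MTTA and MTTR improvements over time as a way to check whether postmortems are working. The slides do not say how either figure is computed, so the following is only a sketch under stated assumptions: MTTA is taken as the mean time from an alert triggering to its acknowledgement, MTTR as the mean time from trigger to resolution, and the `Incident` class with its three timestamp fields is hypothetical. Teams define MTTR in several ways, so the start and end points may need adjusting.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Three timestamps per incident (hypothetical structure)."""
    triggered: datetime
    acknowledged: datetime
    resolved: datetime


def mtta_minutes(incidents):
    """Mean time to acknowledge, in minutes (assumed: trigger -> ack)."""
    seconds = [(i.acknowledged - i.triggered).total_seconds() for i in incidents]
    return mean(seconds) / 60


def mttr_minutes(incidents):
    """Mean time to resolve, in minutes (assumed: trigger -> resolved)."""
    seconds = [(i.resolved - i.triggered).total_seconds() for i in incidents]
    return mean(seconds) / 60


if __name__ == "__main__":
    # Invented sample data for one reporting period.
    period = [
        Incident(datetime(2015, 11, 1, 9, 0),
                 datetime(2015, 11, 1, 9, 5),
                 datetime(2015, 11, 1, 9, 30)),
        Incident(datetime(2015, 11, 3, 14, 0),
                 datetime(2015, 11, 3, 14, 15),
                 datetime(2015, 11, 3, 15, 30)),
    ]
    print(f"MTTA: {mtta_minutes(period):.1f} min")
    print(f"MTTR: {mttr_minutes(period):.1f} min")
```

Computing both figures per period (for example per month) and comparing periods gives the "over time" view the slide refers to. Both functions raise `statistics.StatisticsError` when given an empty list, which avoids reporting a misleading zero for a period with no incidents.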