Finding the Second Story - Learning from Failure

Behind every classic failure, there is someone to blame, right?

What if this isn’t the case? What happens when we make it explicitly NOT the case?

Join Pat as we try to understand human error, change the focus of our post-mortems, and discover what we can do to make our teams a safer place to deliver better software, faster, while learning from our (inevitable) mistakes.

Pat Hermens

May 16, 2019

Transcript

  1. Learning from Failure
    Finding the ‘second story’

  2. Pat Hermens
    Development Manager
    Coding for ~20 years
    Father & husband
    Rotterdam, Netherlands
    @phermens
    hermens.com.au

  5. Failure?
    Show of hands please.

  7. Failure?

  8. “99.88% uptime”

  9. Failure?

  10. “Second Story”

  12. The Field Guide to Understanding ‘Human Error’
    Sidney Dekker
    Underneath every simple, obvious story about ‘human error’, there is a deeper, more complex story about the organisation.

  21. So, what is a ‘Second Story’?
    First story: Human error is seen as the cause of failure.
    Second story: Human error is seen as the effect of systemic vulnerabilities deeper inside the organisation or system.
    First story: Saying what people SHOULD have done is a satisfying way to describe THEIR mistake.
    Second story: Saying what people SHOULD have done doesn’t explain WHY it made sense for them to do what they did.
    First story: Telling people to be more careful will make the problem go away.
    Second story: Only by constantly seeking out their vulnerabilities can organisations enhance safety.

  22. So, what is a ‘Second Story’?
    It is the real story of the complexity in which people work.

  27. CC BY 2.0
    https://www.flickr.com/photos/nrcgov/28751374767

  29. Failure?

  33. James Thomas - April 15, 2019
    Death by PowerPoint: the slide that killed seven people

  37. Failure?

  38. “99.88% uptime”
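
    As a rough sketch of what a figure like “99.88% uptime” implies, the snippet below converts an uptime percentage into the downtime it leaves room for. The slide does not state the measurement period, so the 365-day year and 30-day month used here are assumptions, and `downtime_hours` is a hypothetical helper name chosen for illustration.

```python
# Assumed measurement periods; the slide itself does not specify one.
HOURS_PER_YEAR = 365 * 24
HOURS_PER_30_DAY_MONTH = 30 * 24


def downtime_hours(uptime_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime implied by an uptime percentage over a period."""
    return (100.0 - uptime_percent) / 100.0 * period_hours


if __name__ == "__main__":
    yearly = downtime_hours(99.88)
    monthly_minutes = downtime_hours(99.88, HOURS_PER_30_DAY_MONTH) * 60
    print(f"{yearly:.1f} hours per year")                       # 10.5 hours per year
    print(f"{monthly_minutes:.1f} minutes per 30-day month")    # 51.8 minutes per 30-day month
```

    Read this way, 99.88% still leaves roughly ten and a half hours a year in which the service is unavailable, which is why the same number can look like a success or a failure depending on who is asking.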

  40. https://www.ideal.nl/en/latest-news/keyfigures/ideal-availability/

  41. Failure?

  43. “Just Culture”

  44. https://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2010:201:0001:0022:EN:PDF

  50. Sure, but what is a ‘Just Culture’ in Tech?
    It is a method of investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism, as well as the decision-making process of people proximate to the failure.
    - John Allspaw: https://codeascraft.com/2012/05/22/blameless-postmortems/

  51. “Blameless Postmortem”

  53. https://landing.google.com/sre/sre-book/chapters/postmortem-culture/

  54. https://www.atlassian.com/software/jira/ops/handbook/incident-postmortems

  55. https://www.etsy.com/progress-report/2015/blamess-post-mortems

  56. https://medium.com/hootsuite-engineering/5-whys-how-we-conduct-blameless-post-mortems-after-something-goes-wrong

  57. https://www.pagerduty.com/blog/postmortem-guide-documentation/

  59. John Allspaw, May 2012 -
    https://codeascraft.com/2012/05/22/blameless-postmortems

  65. Failing Forward
    John C. Maxwell, 2010
    Fail early, fail often, but always fail forward.

  67. Psychological Conditions of Personal Engagement and Disengagement at Work, Kahn, 1990 (JSTOR)
    Psychological safety is being able to show and employ one's self without fear of negative consequences to self-image, status or career.

  72. The 7 Habits of Highly Effective People
    Stephen R. Covey, 1989
    Our behavior is a function of our decisions, not our conditions.

  74. Finding the ‘second story’
    3 questions.

  77. Finding the ‘second story’
    1. WHAT happened that led to this moment?
    2. WHY did this make sense to the operators?
    3. HOW did the operators manage to do this?

  80. WHAT happened?
    WHY do this?
    HOW is it possible?

  82. James Thomas - April 15, 2019
    Death by PowerPoint: the slide that killed seven people
    WHAT happened?
    WHY do this?
    HOW is it possible?

  84. https://www.ideal.nl/en/latest-news/keyfigures/ideal-availability/
    WHAT happened?
    WHY do this?
    HOW is it possible?

  85. 3 actions

  86. Find the incentivisation
    Ask what is responsible, not who.

  87. Enable the ‘right’ outcome
    Seek forward accountability, not backward.

  88. Assume positive intent
    No one comes to work aiming to do a bad job.

  91. Cited references (in order)
    ● “Who Destroyed 3 Mile Island”, a presentation by Nickolas Means at Lead Developer Conference, London 2018
    ● 3 Mile Island & 3 Mile Island Accident articles on Wikipedia, plus the related article at the Smithsonian
    ● Space Shuttle Columbia & Space Shuttle Columbia Disaster articles on Wikipedia
    ● iDEAL article on Wikipedia, and the Currence ‘Facts & Figures’ site
    ● “The Field Guide to Understanding ‘Human Error’” by Dr. Sidney Dekker (ISBN: 1472439058)
    ● Commission Regulation (EU) No 691/2010 of 29 July 2010
    ● Google’s “SRE Handbook”, Chapter 15: “Postmortem Culture”
    ● Atlassian’s “JIRA Ops Incident Handbook”: Incident Postmortems section
    ● Etsy’s Progress Report from 2015: Blameless Postmortems section
    ● Hootsuite Engineering’s Medium page: An article on using the 5-Why’s exercise in Postmortems
    ● PagerDuty’s Blog: An article titled “Introducing the PagerDuty Postmortem Guide”
    ● John Allspaw’s article on “Blameless Postmortems and a Just Culture” (at Etsy)
    ● “Failing Forward” by John C. Maxwell (ISBN: 0785288570)
    ● “Psychological Conditions of Personal Engagement and Disengagement at Work” by William A. Kahn (JSTOR)
    ● “The 7 Habits of Highly Effective People” by Stephen R. Covey (ISBN: 9781451639612)
    ● “Death by PowerPoint: the slide that killed seven people”, a blog post by James Thomas
    ● Fundamental Attribution Error article on Wikipedia

  92. Credits/disclaimers
    ● BIG THANKS to all those who have come before me and enabled me to share THEIR knowledge, achievements, and experiences.
    ---
    ● All icons & shapes are from Wikimedia Commons: CC BY-SA 3.0.
    ● All book covers are copyright their respective owners, utilised under “fair use”.
    ● All photos are either “public domain”, or rights have been granted.
    ● Any tweets have been obtained publicly, referenced & hyperlinked.
    ● References and “sources of inspiration” have been linked on the previous slide.

  93. Thanks
    Questions? Just ask!