PyOhio 2019 - The Blameless Post Mortem: How Embracing Failure Makes Us Better

Christopher Wilcox

July 27, 2019

Transcript

  1. crwilcox @chriswilcox47
    The blameless post
    mortem
    How embracing failure makes us better.

  2. crwilcox @chriswilcox47
    Software is
    pretty
    amazing

  3. crwilcox @chriswilcox47
    And so are its
    bugs...

  4. crwilcox @chriswilcox47
    What if we
    could learn
    from the
    bugs?

  5. crwilcox @chriswilcox47
    crwilcox
    @chriswilcox47
    https://crwilcox.com
    https://chriswilcox.racing
    https://speakerdeck.com/crwilcox
    About Me.

  6. crwilcox @chriswilcox47
    https://www.history.navy.mil

  7. crwilcox @chriswilcox47
    A look at other industries.

  8. crwilcox @chriswilcox47
    Sometimes decisions are, without
    hyperbole, a matter of life and death.

  9. crwilcox @chriswilcox47
    Every “Mistake” is an opportunity to
    better the system.

  10. crwilcox @chriswilcox47
    The NTSB.
    National Transportation
    Safety Board

  11. crwilcox @chriswilcox47
    The NTSB has no
    formal prosecutorial
    authority.

  12. crwilcox @chriswilcox47
    And has provided
    13,000+ safety
    recommendations.

  13. crwilcox @chriswilcox47

  14. crwilcox @chriswilcox47

  15. crwilcox @chriswilcox47
    https://www.ntsb.gov/investigations/AccidentReports/Reports/MAB1917.pdf

  16. crwilcox @chriswilcox47
    Industry Process: M&M Conferences
    https://www.marsfoodservices.com

  17. crwilcox @chriswilcox47
    Morbidity & Mortality conferences
    - Provide medical education.
    - Improve quality of care.
    - Highlight cases with
    diagnostic uncertainty,
    complex management, etc.
    https://www.imperial.ac.uk

  18. crwilcox @chriswilcox47
    What can we, as
    technologists, learn
    from these other
    industries?

  19. crwilcox @chriswilcox47
    How might we
    conduct a post
    mortem?

  20. crwilcox @chriswilcox47
    The structure of a post mortem.
    - Why did this happen?
    - Could it have been worse? Better?
    - How can we make sure it doesn’t happen
    again?

  21. crwilcox @chriswilcox47
    An example incident.
    - Summary.
    - Root Cause/Trigger.
    - User Impact.
    - Detection.
    - Resolution.
    - Duration/Timeline.
    - What Went Well?
    - What Went Poorly?
    - Where We Got Lucky?
    - Action Items.
    Summary: The documentation for the Google
    Cloud Python libraries was
    unavailable for users.
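
The outline on this slide can be read as a template with one slot per section. As a minimal sketch, assuming a team wants to keep post mortems as structured records, the Python below mirrors those ten sections. The class name `PostMortem`, the field names, and the `missing_sections` helper are illustrative choices, not something the talk prescribes.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class PostMortem:
    """One field per section of the outline; an empty value means 'not written yet'."""

    summary: str = ""
    root_cause_trigger: str = ""
    user_impact: str = ""
    detection: str = ""
    resolution: str = ""
    duration_timeline: str = ""
    what_went_well: str = ""
    what_went_poorly: str = ""
    where_we_got_lucky: str = ""
    action_items: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of the sections that are still empty, in outline order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Only the summary from this slide is filled in; the rest is still to be written.
draft = PostMortem(
    summary="The documentation for the Google Cloud Python libraries "
            "was unavailable for users.",
)
print(draft.missing_sections())
```

A record like this makes it easy to see which sections a draft still lacks before it is shared.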

  22. crwilcox @chriswilcox47
    An example incident: Root Cause/Trigger.
    As part of repository cleanup, an
    engineer with write access to the
    development repository deleted the
    gh-pages branch.

  23. crwilcox @chriswilcox47
    An example incident: User Impact.
    During a two-hour period, hosted
    reference documentation for our
    libraries was unavailable.

  24. crwilcox @chriswilcox47
    An example incident: Detection.
    This was first detected by an
    external customer. Shortly afterward, two
    internal teams also noticed and
    notified us.
    The initial report came
    30 minutes after the start of
    the outage.

  25. crwilcox @chriswilcox47
    An example incident: Resolution.
    The documentation was available
    after republishing to gh-pages.

  26. crwilcox @chriswilcox47
    An example incident: Duration/Timeline.
    2 hours.
    2019-12-05 (all times PDT)
    09:45 Branch gh-pages deleted during cleanup of repository branches.
    10:21 GitHub issue filed stating that docs are not available.
    10:45 Team responds to issue.
    10:55 Branch has been republished.
    11:45 GitHub is serving docs again.
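
The intervals in a timeline like this one are easier to compare once they are computed rather than eyeballed. The sketch below takes the five timestamps from the slide and reports the minutes elapsed since the branch was deleted. The function name `minutes_since_start` and the shortened event labels are illustrative.

```python
from datetime import datetime

# Timestamps copied from the timeline above; labels are shortened for display.
timeline = [
    ("09:45", "Branch gh-pages deleted"),
    ("10:21", "GitHub issue filed"),
    ("10:45", "Team responds to issue"),
    ("10:55", "Branch republished"),
    ("11:45", "GitHub is serving docs again"),
]


def minutes_since_start(events):
    """Map each event label to the whole minutes elapsed since the first event."""
    parsed = [(datetime.strptime(clock, "%H:%M"), label) for clock, label in events]
    start = parsed[0][0]
    return {
        label: int((when - start).total_seconds() // 60)
        for when, label in parsed
    }


elapsed = minutes_since_start(timeline)
for label, minutes in elapsed.items():
    print(f"{minutes:>4} min  {label}")
```

With these timestamps the first report arrives 36 minutes after the deletion, and service returns after 120 minutes, which matches the stated two-hour duration.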

  27. crwilcox @chriswilcox47
    An example incident: What Went Well?
    We were notified fairly quickly, and
    the team monitors GitHub issues well
    enough that we knew about it pretty
    fast.

  28. crwilcox @chriswilcox47
    An example incident: What Went Poorly?
    The branch should have been
    protected from deletion.
    GitHub Pages publishing is a bit
    opaque. While the branch was pushed
    within about 5 minutes of finding
    out, it took an additional hour or so
    to get it actually serving the
    files. The lack of debugging
    information made this more
    difficult.

  29. crwilcox @chriswilcox47
    An example incident: Where We Got Lucky?
    A member of the team had a local
    copy of the branch handy and was
    able to republish.

  30. crwilcox @chriswilcox47
    An example incident: Action Items.
    Protect gh-pages branch. Don’t allow
    deletion.
    Investigate if debugging gh-pages
    could be improved.
    Investigate if there are other
    technologies we ought to be using
    instead of gh-pages.
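
The first action item, protecting the branch from deletion, could be automated. The sketch below builds, but does not send, a request to GitHub's REST endpoint for updating branch protection. It assumes that endpoint accepts the fields shown, so check the current API documentation before relying on it. The owner, repository, and token values are placeholders, and `build_protection_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_protection_request(owner: str, repo: str, branch: str, token: str) -> urllib.request.Request:
    """Build a PUT request asking GitHub to protect a branch from deletion."""
    url = f"https://api.github.com/repos/{owner}/{repo}/branches/{branch}/protection"
    payload = {
        # These four keys are sent explicitly; None (JSON null) leaves a rule unset.
        "required_status_checks": None,
        "enforce_admins": True,
        "required_pull_request_reviews": None,
        "restrictions": None,
        # The settings this action item is about.
        "allow_deletions": False,
        "allow_force_pushes": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values: substitute a real organization, repository, and token.
request = build_protection_request("example-org", "example-repo", "gh-pages", "TOKEN")
print(request.get_method(), request.full_url)
# Sending is deliberately left out; urllib.request.urlopen(request) would perform the call.
```

Keeping the request construction separate from the network call lets the payload be reviewed or tested without touching a real repository.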

  31. crwilcox @chriswilcox47
    We fail so we
    can learn.

  32. crwilcox @chriswilcox47
    “If you've never
    missed a flight,
    you're probably
    spending too much
    time in airports”
    -George Stigler

  33. crwilcox @chriswilcox47
    Blame isn’t a
    deterrent. It
    isn’t anything
    useful.

  34. crwilcox @chriswilcox47
    Excusing failure
    as human error
    stops the
    conversation.

  35. crwilcox @chriswilcox47
    Why we don’t blame; an example.

  36. crwilcox @chriswilcox47

  37. crwilcox @chriswilcox47
    Building a blameless post mortem culture.

  38. crwilcox @chriswilcox47
    Create
    psychological
    safety.

  39. crwilcox @chriswilcox47
    Find positive
    examples.
    https://landing.google.com/sre/books

  40. crwilcox @chriswilcox47
    Outages
    happen and
    no one wanted
    them to.

  41. crwilcox @chriswilcox47
    A Test for Blamelessness
    1. A script was executed that deleted the production
    database.
    2. An engineer executed a script that deleted the production
    database without confirmation.
    3. Chris executed a script that deleted the production
    database without confirmation.
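
One narrow reading of this test is that a write-up should not name an individual. As a rough sketch under that assumption, the check below flags statements containing a name from a team roster. The roster, the `names_mentioned` helper, and the wording of the verdicts are all hypothetical, and a word match like this is no substitute for reading the statement.

```python
import re
from typing import List, Set


def names_mentioned(statement: str, roster: Set[str]) -> List[str]:
    """Return the roster names that appear as whole words in the statement."""
    words = set(re.findall(r"[A-Za-z]+", statement))
    return sorted(words & roster)


# The three statements from the slide above.
statements = [
    "A script was executed that deleted the production database.",
    "An engineer executed a script that deleted the production database without confirmation.",
    "Chris executed a script that deleted the production database without confirmation.",
]

# Hypothetical roster; a real one would hold the team's names.
roster = {"Chris"}

for number, statement in enumerate(statements, start=1):
    flagged = names_mentioned(statement, roster)
    verdict = f"names {', '.join(flagged)}" if flagged else "names no individual"
    print(f"Statement {number} {verdict}.")
```

Only the third statement is flagged. The first two differ in how much they say about what happened, which a name check cannot judge.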

  42. crwilcox @chriswilcox47
    “But Chris, we
    never have outages
    at my work, we
    are perfect, in
    every way”
    -Someone, Probably.

  43. crwilcox @chriswilcox47
    Raise your own
    bar. Break stuff.

  44. crwilcox @chriswilcox47
    Add some 9’s to
    that SLA!
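
Each added nine cuts the permitted downtime by a factor of ten, which a short calculation makes concrete. The sketch below assumes a 365-day year and measures availability over the whole year. The helper name `allowed_downtime_minutes` is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget in minutes for an availability with this many nines."""
    unavailability = 10 ** -nines  # for example, 3 nines (99.9%) gives 0.001
    return SECONDS_PER_YEAR * unavailability / 60


for nines in range(2, 6):
    availability = 100 * (1 - 10 ** -nines)
    decimals = max(nines - 2, 0)
    budget = allowed_downtime_minutes(nines)
    print(f"{availability:.{decimals}f}% -> {budget:.2f} minutes per year")
```

By this arithmetic, a single two-hour outage fits inside a yearly budget of three nines but exceeds a budget of four.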

  45. crwilcox @chriswilcox47
    Thanks! References.
    https://crwilcox.com
    https://www.ntsb.gov/investigations/AccidentReports/Reports/MAB1917.pdf
    https://acphospitalist.org/archives/2018/07/the-new-improved-m-and-m.htm
    https://insights.ovid.com/pubmed?pmid=26983075&_ga=2.103365704.971425802.1563899461-1002013713.1563899459
    https://www.ama-assn.org/residents-students/residency/presenting-your-first-mm-conference-5-things-you-need-know
    https://landing.google.com/sre/books
    https://cloud.google.com/blog/products/gcp/fearless-shared-postmortems-cre-life-lessons
    https://rework.withgoogle.com/blog/five-keys-to-a-successful-google-team/
