The Blameless Post Mortem - How Embracing Failure Makes Us Better

North Bay Python 2019

Talk Recording at https://www.youtube.com/watch?v=C_nywn1aR44

Christopher Wilcox

November 02, 2019

Transcript

  1. @chriswilcox47 https://chriswilcox.dev
    The blameless post mortem
    How embracing failure makes us better.

  2. @chriswilcox47 https://chriswilcox.dev
    Software is pretty amazing

  3. @chriswilcox47 https://chriswilcox.dev
    And so are its bugs...

  4. @chriswilcox47 https://chriswilcox.dev
    What if we could learn from the bugs?

  5. @chriswilcox47 https://chriswilcox.dev
    crwilcox
    @chriswilcox47
    https://crwilcox.com
    https://chriswilcox.racing
    https://speakerdeck.com/crwilcox
    Hi! I’m Chris.

  6. @chriswilcox47 https://chriswilcox.dev
    https://www.history.navy.mil

  7. @chriswilcox47 https://chriswilcox.dev
    A look at other industries.

  8. @chriswilcox47 https://chriswilcox.dev
    Sometimes decisions are, without hyperbole, a matter of life and death.

  9. @chriswilcox47 https://chriswilcox.dev
    Every “mistake” is an opportunity to better the system.

  10. @chriswilcox47 https://chriswilcox.dev
    The NTSB.
    National Transportation Safety Board

  11. @chriswilcox47 https://chriswilcox.dev

  12. @chriswilcox47 https://chriswilcox.dev
    And has provided 13,000+ safety recommendations.

  13. @chriswilcox47 https://chriswilcox.dev

  14. @chriswilcox47 https://chriswilcox.dev

  15. @chriswilcox47 https://chriswilcox.dev
    https://www.ntsb.gov/investigations/AccidentReports/Reports/ASR1901.pdf

  16. @chriswilcox47 https://chriswilcox.dev
    The NTSB has no formal prosecutorial authority.

  17. @chriswilcox47 https://chriswilcox.dev
    Industry Process: M&M Conferences
    https://www.marsfoodservices.com

  18. @chriswilcox47 https://chriswilcox.dev
    Morbidity & Mortality conferences
    - Provide medical education.
    - Improve quality of care.
    - Highlight cases with diagnostic uncertainty, complex management, etc.
    https://www.imperial.ac.uk

  19. @chriswilcox47 https://chriswilcox.dev
    What can we, as technologists, learn from these other industries?

  20. @chriswilcox47 https://chriswilcox.dev
    [Diagram: an adverse event with contributing factors labeled Procedure, Patient, Environment, Medication]

  21. @chriswilcox47 https://chriswilcox.dev
    Does it matter who is present?

  22. @chriswilcox47 https://chriswilcox.dev
    How might we conduct a post mortem?

  23. @chriswilcox47 https://chriswilcox.dev
    The structure of a post mortem.
    - Why did this happen?
    - Could it have been worse? Better?
    - How can we make sure it doesn’t happen again?

  24. @chriswilcox47 https://chriswilcox.dev
    An example incident.
    - Summary.
    - Root Cause/Trigger.
    - User Impact.
    - Detection.
    - Resolution.
    - Duration/Timeline.
    - What Went Well?
    - What Went Poorly?
    - Where We Got Lucky?
    - Action Items.
    The documentation for the Google Cloud Python libraries was unavailable to users.
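
The section list above can be sketched as a small record type, so that a draft post mortem can be checked for sections nobody has filled in yet. This is only an illustration: the class name `PostMortem`, its field names, and the `missing_sections` helper are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class PostMortem:
    """Hypothetical record mirroring the ten sections on the slide."""
    summary: str = ""
    root_cause: str = ""
    user_impact: str = ""
    detection: str = ""
    resolution: str = ""
    timeline: List[str] = field(default_factory=list)
    went_well: List[str] = field(default_factory=list)
    went_poorly: List[str] = field(default_factory=list)
    got_lucky: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    draft = PostMortem(summary="Hosted documentation was unavailable.")
    print(draft.missing_sections())
```

Note that nothing in the record names a person; the sections describe what happened and what changes, which fits the blameless framing.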

  25. @chriswilcox47 https://chriswilcox.dev
    An example incident.
    - Root Cause/Trigger.
    As part of repository cleanup, an engineer with write access to the development repository deleted the gh-pages branch.

  26. @chriswilcox47 https://chriswilcox.dev
    An example incident.
    - User Impact.
    During a two-hour period, hosted reference documentation for our libraries was unavailable.

  27. @chriswilcox47 https://chriswilcox.dev
    An example incident.
    - Detection.
    This was first detected by an external customer. Shortly afterward, two internal teams also noticed and notified us.
    The initial report came roughly 30 minutes after the start of the outage.

  28. @chriswilcox47 https://chriswilcox.dev
    An example incident.
    - Resolution.
    The documentation was available again after republishing to gh-pages.

  29. @chriswilcox47 https://chriswilcox.dev
    An example incident.
    - Duration/Timeline.
    2 hours.
    2019-12-05 (all times PDT)
    09:45 Branch gh-pages deleted during cleanup of repository branches.
    10:21 GitHub issue filed stating that docs are not available.
    10:45 Team responds to issue.
    10:55 Branch has been republished.
    11:45 GitHub is serving docs again.
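
The gaps between timeline entries are often the most useful part of a post mortem, since they show where the time actually went. A minimal sketch of that arithmetic, using the times from the slide, might look like the following. The shortened labels and the helper names `minutes_between` and `intervals` are illustrative only.

```python
from datetime import datetime

# Times copied from the timeline slide; the labels are shortened here.
TIMELINE = [
    ("09:45", "branch deleted"),
    ("10:21", "issue filed"),
    ("10:45", "team responds"),
    ("10:55", "branch republished"),
    ("11:45", "docs served again"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def intervals(timeline):
    """Return (from_label, to_label, minutes) for each consecutive pair."""
    return [
        (a_label, b_label, minutes_between(a_time, b_time))
        for (a_time, a_label), (b_time, b_label) in zip(timeline, timeline[1:])
    ]


if __name__ == "__main__":
    for start, end, mins in intervals(TIMELINE):
        print(f"{start} -> {end}: {mins} min")
    total = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])
    print(f"total: {total} min")
```

With these inputs the longest single gap is the 50 minutes between republishing the branch and the docs being served again, and the total is the 2 hours stated on the slide.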

  30. @chriswilcox47 https://chriswilcox.dev
    An example incident.
    - What Went Well?
    We were notified fairly quickly, and the team monitors GitHub issues well enough that we knew about it pretty fast.

  31. @chriswilcox47 https://chriswilcox.dev
    An example incident.
    - What Went Poorly?
    The branch should have been protected from deletion.
    GitHub Pages publishing is a bit opaque. While the branch was pushed within about 5 minutes of finding out, it took an additional hour or so to get it actually serving the files. The lack of debugging information made this more difficult.

  32. @chriswilcox47 https://chriswilcox.dev
    An example incident.
    - Where We Got Lucky?
    A member of the team had a local copy of the branch handy and was able to republish.

  33. @chriswilcox47 https://chriswilcox.dev
    An example incident.
    - Action Items.
    Protect the gh-pages branch. Don’t allow deletion.
    Investigate whether debugging gh-pages could be improved.
    Investigate whether there are other technologies we ought to be using instead of gh-pages.
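
The first action item could be carried out through GitHub's branch protection settings. The sketch below only builds the request and does not send it. It assumes the REST endpoint `PUT /repos/{owner}/{repo}/branches/{branch}/protection` and its `allow_deletions` field as I understand them from GitHub's documentation; verify against the current API reference before relying on it. The owner and repository names are placeholders, and the function name is hypothetical.

```python
import json

API_ROOT = "https://api.github.com"


def branch_protection_request(owner: str, repo: str, branch: str):
    """Build (method, url, body) for a request that forbids deleting a branch.

    Assumption: field names follow GitHub's "update branch protection"
    endpoint. The four nullable fields are set to None (JSON null) so that
    only deletion and force-push rules are tightened.
    """
    url = f"{API_ROOT}/repos/{owner}/{repo}/branches/{branch}/protection"
    payload = {
        "required_status_checks": None,
        "enforce_admins": True,
        "required_pull_request_reviews": None,
        "restrictions": None,
        "allow_deletions": False,
        "allow_force_pushes": False,
    }
    return "PUT", url, json.dumps(payload)


if __name__ == "__main__":
    # "example-org" and "example-repo" are placeholders, not real names.
    method, url, body = branch_protection_request(
        "example-org", "example-repo", "gh-pages"
    )
    print(method, url)
    print(body)
```

Sending the request would additionally need an authenticated HTTP client and a token with admin rights on the repository, which is left out here on purpose.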

  34. @chriswilcox47 https://chriswilcox.dev
    We fail so we can learn.

  35. @chriswilcox47 https://chriswilcox.dev
    Blame isn’t a deterrent.

  36. @chriswilcox47 https://chriswilcox.dev
    Excusing failure as human error stops the conversation.

  37. @chriswilcox47 https://chriswilcox.dev
    Why we don’t blame: an example.

  38. @chriswilcox47 https://chriswilcox.dev

  39. @chriswilcox47 https://chriswilcox.dev
    A Test for Blamelessness
    1. A script was executed that deleted the production database.
    2. An engineer executed a script that deleted the production database without confirmation.
    3. Chris executed a script that deleted the production database without confirmation.

  40. @chriswilcox47 https://chriswilcox.dev
    Building a blameless post mortem culture.

  41. @chriswilcox47 https://chriswilcox.dev
    Create psychological safety.

  42. @chriswilcox47 https://chriswilcox.dev

  43. @chriswilcox47 https://chriswilcox.dev
    Find positive examples.
    https://landing.google.com/sre/books

  44. @chriswilcox47 https://chriswilcox.dev
    “But Chris, we never have outages at my work. We are perfect in every way.”
    -Someone, probably.

  45. @chriswilcox47 https://chriswilcox.dev
    Add some 9s to that SLA!

  46. @chriswilcox47 https://chriswilcox.dev
    “If you've never missed a flight, you're probably spending too much time in airports”
    -George Stigler

  47. @chriswilcox47 https://chriswilcox.dev
    Raise your own bar. Break stuff.

  48. @chriswilcox47 https://chriswilcox.dev
    Outages happen, and no one wanted them to.

  49. @chriswilcox47 https://chriswilcox.dev
    Thanks!
    @chriswilcox47 https://chriswilcox.dev

  50. @chriswilcox47 https://chriswilcox.dev
    References.
    https://www.ntsb.gov/investigations/AccidentReports/Reports/MAB1917.pdf
    https://acphospitalist.org/archives/2018/07/the-new-improved-m-and-m.htm
    https://insights.ovid.com/pubmed?pmid=26983075&_ga=2.103365704.971425802.1563899461-1002013713.1563899459
    https://www.ama-assn.org/residents-students/residency/presenting-your-first-mm-conference-5-things-you-need-know
    https://landing.google.com/sre/books
    https://cloud.google.com/blog/products/gcp/fearless-shared-postmortems-cre-life-lessons
    https://rework.withgoogle.com/blog/five-keys-to-a-successful-google-team/
