
Trust, Just Culture, and Blameless Post-Mortem


Slides originally given at Tulsa Tech Fest. Will be presented at KCDC and possibly other conferences.

Aaron Blythe

July 21, 2017


Transcript

  1. @ablythe
    GALLUP POLL
    1/3 of workers are engaged at work
    In the US, only 1 in 3 workers is engaged at work, a trend that has continued for many years.


  2. @ablythe
    1/3 ARE ENGAGED AT WORK (GALLUP)
    • 80,844 adults working for an employer
    • Key Indicators
      • opportunity to do what they do best each day
      • someone at work who encourages their development
      • believing their opinions count at work
    Engaged: 32%
    Not Engaged: 51%
    Actively Disengaged: 17%
    This is pretty serious. If you question the validity of this, I want to run through some of the numbers:

    * 80,000+ people are tracked

    * Opportunity to do what they do best?

    * Someone encouraging their development

    * Believing their opinions count

    Oklahoma is actually tied with many states for #5 in the most recent numbers I can find, with 35% engaged.


  3. @ablythe
    GALLUP DEFINITIONS
    • Engaged: Employees are highly involved in and enthusiastic about their work and
    workplace. They are psychological "owners," drive performance and innovation, and move the
    organization forward.
    • Not engaged: Employees are psychologically unattached to their work and company.
    Because their engagement needs are not being fully met, they're putting time -- but not
    energy or passion -- into their work.
    • Actively disengaged: Employees aren't just unhappy at work -- they are resentful
    that their needs aren't being met and are acting out their unhappiness. Every day, these
    workers potentially undermine what their engaged coworkers accomplish.


  4. @ablythe
    40 HOUR WORK WEEK? (GALLUP - 2014)
    http://www.gallup.com/poll/175286/hour-workweek-actually-longer-seven-hours.aspx

    59% of salaried employees in the US work over 40 hours per week.

    Overall average is 47 hours per week.


  5. @ablythe
    Aaron Blythe (@ablythe)
    • Lead Organizer
    @devopskc
    @devopsdayskc


  6. @ablythe
    http://aaronblythe.org/
    I am known for doing crazy things like standing on my head in presentations to make a point.

    Possibly you were in my last session where I walked through a full incident workflow using just my Echo and not typing.

    This presentation will be different. I will be walking through a story. Although this presentation is the culmination of a couple of decades of experience in software, I am continuously learning and look forward to your feedback.


  7. @ablythe
    TRUST, JUST CULTURE AND BLAMELESS
    POST MORTEMS
    Aaron Blythe


  8. @ablythe
    Trust
    Trust is built over time by repeatedly being there.


  9. @ablythe
    Justice
    Today we are going to talk about a "Just Culture"

    Justice - the administering of deserved punishment or reward.

    Justice when a mistake or violation occurs


  10. @ablythe
    BLAMELESS
    POST-MORTEM


  11. @ablythe
    Jennifer - Executive
    Let me introduce you to the team

    Jennifer

    - Solid executive, co-founder

    - Built a majority of the business

    - Does not dictate what the Engineering team does


  12. @ablythe
    Jennifer - Executive
    Rob - Manager
    Rob

    - used to code, now a manager

    - Rob’s schedule looks like this - full of standing one-hour meetings

    - Rob doesn’t like his job any more

    - Rob optimises for “not getting in trouble” and for showing higher-ups that he cares

    - Rob has lost touch with his team, and his fuse has grown shorter and shorter

    - Rob has been saying more and more that his team “doesn’t have time”


  13. @ablythe


  14. @ablythe

  15. @ablythe
    Jennifer - Executive
    Rob - Manager
    Ethan - Sr. Engineer
    Rob refers to him as his “Rock Star” and goes to him to get everything done


  16. @ablythe
    Jennifer - Executive
    Rob - Manager
    Ethan - Sr. Engineer
    Shelly - Sr. Engineer
    Has been at the company just as long as Ethan. She joined the team two years ago by her own choice because it was fresh, and is now looking for other teams to move to.


  17. @ablythe
    Jennifer - Executive
    Rob - Manager
    Ethan - Sr. Engineer
    Shelly - Sr. Engineer
    Tabitha - Engineer
    New engineer who joined the team earlier this year, fresh out of college. She does not have much work to do.


  18. @ablythe
    Jennifer - Executive
    Rob - Manager
    Ethan - Sr. Engineer
    Shelly - Sr. Engineer
    Tabitha - Engineer
    Michael - Engineering Intern
    Just joined the team, slightly quirky, sort of left alone by the rest of the team


  19. @ablythe
    New product site goes down for the first time. The site only has beta clients that are fairly understanding.


  20. @ablythe
    Alerts are going off.


  21. @ablythe
    Shelly - Sr. Engineer
    Rob - Manager

    Shelly: I think the site is down
    Rob: The whole site is down?
    Shelly: I think so
    Rob: Who changed something?
    Shelly: I don’t know
    Rob: Why not? How the hell can we not know? Get it back up.
    Shelly: We don’t know how, and Ethan is not at his desk
    Rob: Well I am going to come and find out

    After “I don’t know” - Shelly did know…

    Ethan was heads down coding with his headphones blasting all morning. Shelly doesn’t feel comfortable sharing things like this that make Ethan look bad, since he is the “Rock Star”.


  22. @ablythe
    Shelly - Sr. Engineer
    Tabitha - Engineer
    Michael - Engineering Intern
    Shelly, Tabby, and Michael are discussing what to do next in the developer’s bay.


  23. @ablythe
    Rob - Manager
    Shelly - Sr. Engineer
    Tabitha - Engineer
    Michael - Engineering Intern
    Rob shows up

    Rob: We have to get this back up. Have you even tried anything?

    A series of things are tried. They all make it worse and do not bring the site back up.


  24. @ablythe
    Rob - Manager
    Ethan - Sr. Engineer
    Shelly - Sr. Engineer
    Tabitha - Engineer
    Michael - Engineering Intern
    15 minutes later Ethan shows up.

    Rob: Where the hell have you been?

    Ethan: I was taking a walk.

    Ethan: Wait until you see the new features I added to the Beta site.

    Rob: The site is down, so it was you.

    Ethan: What? I told Shelly to review it, she always finds everything that is wrong

    This is true, Shelly does find a lot of bugs, but Shelly was shut in a conference room interviewing for a new position on the phone and didn’t look at the code very closely that morning.

    Shelly: Don’t put this on me, the code should have been built in a Continuous Integration pipeline, but as if we will ever have time for that...

    Finger pointing continues and the site is still down. It remains down for a couple of hours as they manually back out the changes.

    Rob: You two need to figure this out, this cannot happen again. I already missed two meetings, I HAVE to go to this next one.


  25. @ablythe
    Ethan - Sr. Engineer
    Shelly - Sr. Engineer
    Tabitha - Engineer
    Michael - Engineering Intern
    The Team is left defeated and dejected.


  26. @ablythe

    Sound painful? Hopefully I did not trigger anyone with that story, because unfortunately that is a brutally honest depiction of scenarios that I have seen over and over in Software and IT.

    But what about this is wrong?


  27. @ablythe
    http://sidneydekker.com/just-culture/


  28. @ablythe
    RETRIBUTIVE JUST CULTURE
    • Which rule is broken?
    • Who did it?
    • How bad was the breach, and what
    should the consequences be?
    • Who gets to decide this?


  29. @ablythe
    • Grenade Person
    • Know-it-alls
    • Maybe Person
    • No Person
    • Nothing Person
    • Snipers
    • Tanks
    • Think-they-know-it-alls
    • Whiners
    • Yes Person
    Before I go on to explain a “restorative just culture”, I do want to bring up an important book: Dealing with People You Can’t Stand.

    I highly recommend this book. I actually listened to the audio book read by the two authors.

    I have listened to this twice at two different points in my life…

    There are 10 types of people… and the thing you will realize through this book is that, at many points, you are one of these types of people.


  30. @ablythe
    RETRIBUTIVE JUST CULTURE
    • Which rule is broken?
    • Who did it?
    • How bad was the breach, and what should the consequences be?
    • Who gets to decide this?
    RESTORATIVE JUST CULTURE
    • Who is hurt?
    • What do they need?
    • Whose obligation is it to meet that need?
    • How do you involve the community in this conversation?
    Anecdotally, I am a father of three children. We have instituted these questions in our home whenever someone is hurt:

    • Are you hurt?

    • Where does it hurt, can you point to it?

    • What can I do to help you? Can I get a band-aid? or Boo-boo bag?


  31. @ablythe
    ALTERNATE UNIVERSE
    Rather than provide you with an epiphany, we are going to go back in time and fork reality. We are going to go back a little over a year in this story and change one thing.


  32. @ablythe
    ONE CHANGE:
    REGULAR RETROSPECTIVE MEETINGS


  33. @ablythe
    Rob - Manager
    Rob

    - still reviews the code and comments to stay in touch with the team, but rarely commits

    - Rob’s schedule looks like this - time blocked out for the team; he has a rule that he only attends meetings with agendas, and standing meetings cannot be longer than 30 minutes

    - Rob likes his job

    - Rob optimises on “how can we make that better?”

    - Rob connects with his team regularly and asks if he can pair to talk through what they are solving


  34. @ablythe


  35. @ablythe

  36. @ablythe
    Ethan
    Sr. Engineer
    Rob refers to him as “Mentor”. Ethan is responsible for getting Michael’s questions answered.


  37. @ablythe
    Shelly
    Sr. Engineer
    Rob refers to her as “Quality Guru”


  38. @ablythe
    Tabitha
    Engineer
    New engineer joined the team earlier this year fresh out of college, regularly has challenging work


  39. @ablythe
    Michael
    Engineering Intern
    Just joined the team, slightly quirky


  40. @ablythe
    DIFFERENCES THIS TIME AROUND
    • Group Chat (Slack) (last time: one-on-one IM)
    • Alert assigned to rotation (last time: alert just to Shelly’s email)
    • Alert posted to Group Chat (last time: alert just to Shelly’s email)
    • Acknowledgement visible to team (last time: Shelly forwarding email/IM’ing people)
    • Code built in CI/CD (last time: code manually deployed by Ethan)
    • Rollback switches DNS back, at the 10-min mark (last time: rollback by manually removing code)
    • Blameless Post-Mortem scheduled (last time: team left defeated/dejected)


  41. @ablythe
    BLAMELESS POST-MORTEM
    • Test Environment just like Prod (found differences between two)
    • Use dns-a and dns-b (As do today)
    • However test before making the switch
    Action items from this meeting:

    • To have a Test environment that is just like prod (found differences in the current test environment)

    • Use a dns-a, dns-b, and dns-prod. Testing must be done after “dns-a” or “dns-b” is brought up, but before “dns-prod” goes live.
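
    The second action item can be sketched roughly as follows. This is only an illustration, not the team's actual tooling: the `records` dictionary stands in for a real DNS provider, the addresses are made up, and `promote`, `rollback`, and `smoke_test` are hypothetical names. The point is the ordering: the test runs against the candidate stack before dns-prod is switched, and rollback is just a switch back.

```python
from typing import Callable, Dict

# Hypothetical in-memory stand-in for DNS records; a real setup would call a DNS provider.
records: Dict[str, str] = {
    "dns-a": "10.0.0.1",   # currently live stack (made-up address)
    "dns-b": "10.0.0.2",   # freshly deployed stack (made-up address)
    "dns-prod": "dns-a",   # dns-prod names whichever stack is live
}


def promote(candidate: str, smoke_test: Callable[[str], bool]) -> bool:
    """Point dns-prod at the candidate only if the smoke test passes against it."""
    if candidate not in ("dns-a", "dns-b"):
        raise ValueError(f"unknown stack: {candidate}")
    if not smoke_test(records[candidate]):
        return False  # test failed, dns-prod is left untouched
    records["dns-prod"] = candidate
    return True


def rollback() -> str:
    """Switch dns-prod back to the other stack and return that stack's name."""
    previous = "dns-b" if records["dns-prod"] == "dns-a" else "dns-a"
    records["dns-prod"] = previous
    return previous
```

    A failing smoke test leaves dns-prod where it was, which is the behavior the action item asks for.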


  42. @ablythe
    Trust
    Remember that just a bit ago we said this was about trust.

    Think to yourself for a minute

    - if you are in a leadership position are you creating trust?

    - if you work for a leader, do you trust that leader?

    All things can improve… how will you go back to work on Monday and work to improve? Write that down.


  43. @ablythe
    Justice
    I feel that both the scenarios I described to you were ones where Justice was served.

    However, the difference was that the first was Retributive and the second was Restorative.


  44. @ablythe
    • Retributive justice achieves accountability by
    looking back on the harm done.
    • Restorative justice achieves accountability
    by looking ahead to meet the needs and repair
    the trust and relationships that were harmed.
    - Sidney Dekker, Just Culture


  45. @ablythe


  46. @ablythe
    Fremont Assembly Plant
    http://en.wikipedia.org/wiki/Fremont_Assembly
    I want to talk to you about the Fremont Assembly Plant, a 411-acre manufacturing plant in California.
    At the time of its closure, the Fremont employees were "considered the worst workforce in the automobile industry in the United States", according to the United Auto Workers. Employees drank alcohol on the job, were frequently absent (enough so that the production line couldn't be started), and even committed petty acts of sabotage such as putting "Coke bottles inside the door panels, so they'd rattle and annoy the customer."


  47. @ablythe
    NUMMI plant
    http://en.wikipedia.org/wiki/NUMMI
    In spite of the history and reputation, when NUMMI reopened the factory for production in 1984, most of the troublesome GM workforce was rehired, with some sent to Japan to learn the Toyota Production System. Workers who made the transition identified the emphasis on quality and teamwork by Toyota management as what motivated a change in work ethic. Almost right away, the NUMMI factory was producing cars with as few defects per 100 vehicles as those produced in Japan.


  48. @ablythe
    Tesla Factory
    http://en.wikipedia.org/wiki/Tesla_Factory
    Coincidentally, this historic NUMMI plant is now the home of Tesla, the electric car manufacturer that recently open sourced, to a degree, its designs and patents, which may change the world of car manufacturing. Tesla, as you may know, discarded the idea of model years long ago and moves faster than Toyota, using a continuous model where consumers can download updates to get new features.


  49. @ablythe
    Netflix Culture Deck


  50. @ablythe
    Adrian Cockcroft - Formerly Netflix
    I had the opportunity to see Adrian Cockcroft speak last week at DevOps Enterprise Summit about the Culture Deck.


  51. @ablythe
    Directly from the Lean Enterprise text: there are basically 3 high-level types of culture, at least in the non-scientific model attributed to Westrum.
    Culture is the basis of building a successful system so I am going to spend some time on it.
    Blameless Post Mortems should be part of every team for every incident, aimed at identifying causes, learning


  52. @ablythe
    ME IN 2007/2008 - “5 WHY’S”
    I was working at a company in 2007/2008. I wasn’t even 30 years old, but a new position was created for me, named “Quality Architect”. We had what many would consider to be a severe defect problem. A large percentage of the patch packages we sent out (like 50%) would fix the original problem but create a new one. This was desktop software. Our clients did not trust us. We originally called these “used-to-works” internally in engineering. However, when support started saying this to clients, it was quickly changed to a more palatable term, “previously functional”.

    Apathy

    Example: Late to work

    * Why? Car wouldn’t start

    * Why? Battery was dead

    * Why? Left car door open all night

    * Why? Got home late

    * Why? Out drinking all night

    3 months - every day for me - once a month for them

    people were angry at me

    presented back to them the summary of the things they said - Applied Pareto

    3 months - still did the meeting but at the end of that time they presented what they had changed.

    Apathy started to shrink over this time as people believed again they were in control.
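
    The "Applied Pareto" step in the notes above can be sketched in a few lines. This is a minimal illustration, assuming each 5 Whys session ends in a short cause label; the `pareto` function and the category names are invented for the example and are not data from the talk.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def pareto(causes: Iterable[str], cutoff: float = 0.8) -> List[Tuple[str, int, float]]:
    """Return the most frequent causes, in order, until their cumulative share reaches cutoff."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows: List[Tuple[str, int, float]] = []
    running = 0
    for cause, n in counts.most_common():
        running += n
        rows.append((cause, n, running / total))
        if running / total >= cutoff:
            break
    return rows


# Invented categories, for illustration only.
observed = (
    ["no regression test"] * 5
    + ["rushed review"] * 3
    + ["config drift"]
    + ["unclear spec"]
)
vital_few = pareto(observed)
for cause, count, cumulative in vital_few:
    print(f"{cause}: {count} ({cumulative:.0%} cumulative)")
```

    Presenting the few causes that account for most of the defects gives a team a short list it can actually act on.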


  53. @ablythe
    From the Introduction:
    "it should in no way be associated with
    that great body of factual information
    relating to orthodox Zen Buddhist
    practice. It's not very factual on
    motorcycles, either.”
    Romantic - a friend of the narrator decides not to learn how to maintain his expensive new motorcycle. When something on the bike breaks he is frustrated and needs
    to rely on professional mechanics to repair it.

    Classical - the narrator has an older bike that he is usually able to diagnose and repair through rational problem solving.

    Ultimately the author comes to the conclusion that a Zen-like mix of being in the moment and combining this rationality and romanticism can create a better overall life.


  54. @ablythe
    – Kurt Vonnegut, Hocus Pocus
    “Another flaw in the human character is that everybody wants to build and
    nobody wants to do maintenance.”


  55. @ablythe
    5 WHY’S HAVE FALLEN OUT OF FAVOR
    • https://www.kitchensoap.com/2014/11/14/the-infinite-hows-or-the-dangers-of-the-five-whys/
    • Really asking “How?” and doing this in a group is important
    • Even though this is easy to grasp, it is tunnel-visioned
    The 5 why’s have fallen out of favor.

    John Allspaw has an excellent blog post on the dangers of using the 5 why’s.


  56. @ablythe
    –Adam Gale, President, KLAS
    “As a result of these and other changes, Cerner’s KLAS ranking has
    skyrocketed, moving from seventh to second in a four-year period
    (December 2007 to December 2011).”
    http://healthsystemcio.com/2012/04/09/how-cerner-was-able-to-turn-the-corner/
    Over the next couple of years, the quality of the software at the company I worked for improved dramatically. There were many factors in this, with excellent things going on throughout the company and a focus on quality. However, putting a conscience back into the development life cycle, by learning from past mistakes, is one of the small contributions that I tried to promote.


  57. @ablythe
    POST MORTEM MEETING
    • Before meeting:
    • Time line of incident - facts,
    assumptions, expectations
    • During meeting
    • Level set expectations
    • Discuss without Blame
    • Only take action items that can be
    assigned and completed in next week
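
    One way to keep a meeting honest about these rules is to write the timeline and action items down in a structured form. The sketch below is an assumption about how that might look, not a tool from the talk: `TimelineEntry`, `ActionItem`, and `accept` are hypothetical names, and "completed in next week" is read here as due within seven days of the meeting.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class TimelineEntry:
    when: str  # free-form timestamp, e.g. "10:02"
    kind: str  # "fact", "assumption", or "expectation"
    note: str

    def __post_init__(self) -> None:
        if self.kind not in ("fact", "assumption", "expectation"):
            raise ValueError(f"unknown kind: {self.kind}")


@dataclass
class ActionItem:
    summary: str
    owner: Optional[str]  # None means nobody has taken it yet
    due: date


def accept(item: ActionItem, meeting_day: date) -> bool:
    """Keep only items that have an owner and are due within a week of the meeting."""
    in_window = meeting_day <= item.due <= meeting_day + timedelta(days=7)
    return bool(item.owner) and in_window
```

    Anything `accept` rejects is either unowned or too large for a week, and is a candidate for splitting into smaller items.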


  58. @ablythe
    Machine setup
    Machine admin
    Application
    Zabbix
    Infra Team
    Operations Team
    Development
    Team **
    SAN (Storage Area Network)


  59. @ablythe

  60. @ablythe

  61. @ablythe
    Machine setup
    /etc/multipath.conf
    Machine admin
    Application
    Zabbix
    Infra Team
    Operations Team
    Development
    Team **
    SAN (Storage Area Network)


  62. @ablythe
    COGNITIVE BIASES
    • Hindsight Bias
    • Outcome Bias
    • Availability Bias (AKA Recency Bias)
    • Sunk Cost Bias
    • Confirmation Bias
    There is an excellent short book from O’Reilly by David Zwieback on the Human Side of Post Mortems, that I wish was there when I was younger and contemplating . I
    believe the PDF is a total of 32 pages.

    * Hindsight Bias - knowing what we know now

    * Outcome Bias - Attaching how bad the issue was to the decisions

    * Availability Bias - overestimating things that are easily recalled, underestimating the forgotten (e.g. believing that tornadoes cause more deaths than asthma)

    * Sunk Cost Bias - Well we already have this so we are going to eat the whole bowl

    * Confirmation Bias - Finding data that just confirms what we want to hear (For the MSNBC and Fox viewers out there, this is what they play on)


  63. @ablythe
    http://cdn.oreillystatic.com/oreilly/radarreport/0636920029731/9781449365851.pdf
    Another great point that Zwieback makes is that it is not about reducing stress to zero.


  64. @ablythe
    RETROSPECTIVE MEETINGS
    • 10 minutes to quietly review the past two weeks
    • Write down 3 biggest accomplishments (team or individual)
    • Discussion and classification

    • Thank you (chance to formally thank someone on the team in front of everyone)
    • Action Items (to be followed up on next meeting)
    • Post publicly
    Coming from the agile development world, retrospective meetings are my favorite meetings. I managed a team where we did these without fail every two weeks for over two years. We called this our feels meeting. We were lucky enough to have a bean bag room close to where we sat in our cube farm. We started with fact finding on sticky notes, then put them on the wall and categorized them as a group. This would unearth a lot before resentment had the chance to fester and grow.


  65. @ablythe
    – Kurt Vonnegut, Sirens of Titan
    “Now, you can say your Daddy is right and the other
    little child's Daddy is wrong, but the universe is an
    awfully big place. There is room enough for an awful
    lot of people to be right about things and still not
    agree.”


  66. @ablythe
    TYPES OF MEETINGS
    • Root Cause Analysis Meeting (Monthly)
    • Post Mortem Meeting (per Incident)
    • Retrospective Meeting (Fortnightly)


  67. @ablythe
    Just last week John Allspaw tweeted


  68. @ablythe
    Reed Hastings - Culture Deck
    Paul Graham - Maker’s Schedule vs. Manager’s Schedule
    John Allspaw - Blameless PostMortems and a Just Culture
    Dr. Rick Brinkman, Dr. Rick Kirschner - Dealing with People You Can’t Stand
    David Zwieback - Human Side of Postmortems
    Sidney Dekker - Just Culture
