
Trust, Just Culture, and Blameless Post-Mortem

Slides originally given at Tulsa Tech Fest. Will be presented at KCDC and possibly other conferences.

Aaron Blythe

July 21, 2017

Transcript

  1. @ablythe GALLUP POLL 1/3 of workers are engaged at work

    In the US, only 1 in 3 workers is engaged at work, a trend that has continued for many years.
  2. @ablythe 1/3 ARE ENGAGED AT WORK (GALLUP) • 80,844 adults

    working for an employer • Key Indicators • opportunity to do what they do best each day • someone at work who encourages their development • believing their opinions count at work • Engaged: 32% • Not Engaged: 51% • Actively Disengaged: 17% This is pretty serious. If you question the validity of this, I want to run through some of the numbers: * 80,000+ people are tracked * Do they have the opportunity to do what they do best? * Is someone encouraging their development? * Do they believe their opinions count? Oklahoma is actually tied with many states for #5 in the most recent numbers I can find, with 35% engaged.
  3. @ablythe GALLUP DEFINITIONS • Engaged: Employees are highly involved in

    and enthusiastic about their work and workplace. They are psychological "owners," drive performance and innovation, and move the organization forward. • Not engaged: Employees are psychologically unattached to their work and company. Because their engagement needs are not being fully met, they're putting time -- but not energy or passion -- into their work. • Actively disengaged: Employees aren't just unhappy at work -- they are resentful that their needs aren't being met and are acting out their unhappiness. Every day, these workers potentially undermine what their engaged coworkers accomplish.
  4. @ablythe 40 HOUR WORK WEEK? (GALLUP - 2014) http://www.gallup.com/poll/175286/hour-workweek-actually-longer-seven-hours.aspx 59%

    of salaried employees in the US work over 40 hours per week. Overall average is 47 hours per week.
  5. @ablythe http://aaronblythe.org/ I am known for doing crazy things like

    standing on my head in presentations to make a point. Possibly you were in my last session, where I walked through a full incident workflow using just my Echo and no typing. This presentation will be different. I will be walking through a story. Although this presentation is the culmination of a couple of decades of experience in software, I am continuously learning and look forward to your feedback.
  6. @ablythe Justice Today we are going to talk about a

    "Just Culture" Justice - the administering of deserved punishment or reward. Justice when a mistake or violation occurs
  7. @ablythe ! Jennifer - Executive Let me introduce you to

    the team Jennifer - Solid executive, co-founder - Built a majority of the business - Does not dictate what the Engineering team does
  8. @ablythe ! "Jennifer - Executive Rob - Manager Rob -

    used to code, now a manager - Rob’s schedule looks like this - full of standing one hour meetings - Rob doesn’t like his job any more - Rob optimises on “not getting in trouble” and showing that he cares to higher ups - Rob has lost touch with his team, and his fuse has grown shorter and shorter - Rob has been saying more and more that his team “doesn’t have time”
  9. @ablythe ! "Jennifer - Executive Rob - Manager Rob -

    used to code, now a manager - Rob’s schedule looks like this - full of standing one hour meetings - Rob doesn’t like his job any more - Rob optimises on “not getting in trouble” and showing that he cares to higher ups - Rob has lost touch with his team, and his fuse has grown shorter and shorter - Rob has been saying more and more that his team “doesn’t have time”
  10. @ablythe ! " # Jennifer - Executive Rob - Manager

    Ethan Sr. Engineer Rob refers to him as his “Rock Star” and goes to him to get everything done
  11. @ablythe ! " $ # Jennifer - Executive Rob -

    Manager Ethan Sr. Engineer Shelly Sr. Engineer Has been at the company just as long as Ethan, joined the team two years ago by her own choice because it was fresh, now looking for other teams to move to
  12. @ablythe ! " % $ # Jennifer - Executive Rob

    - Manager Ethan Sr. Engineer Shelly Sr. Engineer Tabitha Engineer New engineer joined the team earlier this year fresh out of college, does not have much work to do
  13. @ablythe ! " % $ # & Jennifer - Executive

    Rob - Manager Ethan Sr. Engineer Shelly Sr. Engineer Tabitha Engineer Michael Engineering Intern Just joined the team, slightly quirky, sort of left alone by the rest of the team
  14. @ablythe New product site goes down for the first time.

    The site only has beta clients that are fairly understanding.
  15. @ablythe $ Shelly Shelly Shelly: I think the site is

    down
    Rob: The whole site is down?
    Shelly: I think so
    Rob: Who changed something?
    Shelly: I don’t know
    Rob: Why not? How the hell can we not know? Get it back up.
    Shelly: We don’t know how, and Ethan is not at his desk
    Rob: Well I am going to come and find out
    After “I don’t know” - Shelly did know… Ethan was heads down coding with his headphones blasting all morning - Shelly doesn’t feel comfortable sharing things like this that make Ethan look bad, since he is the “Rock Star”
  16. @ablythe % $ & Shelly Sr. Engineer Tabitha Engineer Michael

    Engineering Intern Shelly, Tabby, and Michael are discussing what to do next in the developer’s bay.
  17. @ablythe " % $ & Rob - Manager Shelly Sr.

    Engineer Tabitha Engineer Michael Engineering Intern Rob shows up. Rob: We have to get this back up. Have you even tried anything? A series of things is tried. They all make it worse and do not bring the site back up.
  18. @ablythe " % $ # & Rob - Manager Ethan

    Sr. Engineer Shelly Sr. Engineer Tabitha Engineer Michael Engineering Intern 15 minutes later Ethan shows up.
    Rob: Where the hell have you been?
    Ethan: I was taking a walk. Wait until you see the new features I added to the Beta site.
    Rob: The site is down, so it was you.
    Ethan: What? I told Shelly to review it, she always finds everything that is wrong.
    This is true, Shelly does find a lot of bugs, but Shelly was shut in a conference room interviewing for a new position on the phone and didn’t look at the code very closely that morning.
    Shelly: Don’t put this on me, the code should have been built in a Continuous Integration pipeline, but as if we will ever have time for that...
    Finger pointing continues and the site is still down. It remains down for a couple hours as they manually back out the changes.
    Rob: You two need to figure this out, this cannot happen again. I already missed two meetings, I HAVE to go to this next one.
  19. @ablythe % $ # & Ethan Sr. Engineer Shelly Sr.

    Engineer Tabitha Engineer Michael Engineering Intern The Team is left defeated and dejected.
  20. @ablythe Sound painful? Hopefully I did not trigger anyone with

    that story because unfortunately that is a brutally honest depiction of scenarios that I have seen over and over in Software and IT. But what about this is wrong?
  21. @ablythe RETRIBUTIVE JUST CULTURE • Which rule is broken? •

    Who did it? • How bad was the breach, and what should the consequences be? • Who gets to decide this?
  22. @ablythe • Grenade Person • Know-it-alls • Maybe Person •

    No Person • Nothing Person • Snipers • Tanks • Think-they-know-it-alls • Whiners • Yes Person Before I go on to explain a “restorative just culture” I do want to bring up an important book: Dealing with People You Can’t Stand. I highly recommend this book. I actually listened to the audio book read by the two authors. I have listened to this twice at two different points in my life… There are 10 types of people… and the thing you will realize through this book is that at many points you are one of these types of people.
  23. @ablythe RETRIBUTIVE RESTORATIVE JUST CULTURE • Which rule is broken?

    • Who did it? • How bad was the breach, and what should the consequences be? • Who gets to decide this? • Who is hurt? • What do they need? • Whose obligation is it to meet that need? • How do you involve the community in this conversation? Anecdotally, I am a father of three children. We have instituted these questions in our home whenever someone is hurt: • Are you hurt? • Where does it hurt, can you point to it? • What can I do to help you? Can I get a band-aid? or Boo-boo bag?
  24. @ablythe ALTERNATE UNIVERSE Rather than provide you with an epiphany.

    We are going to go back in time and fork reality. We are going to go back a little over a year in this story and change one thing.
  25. @ablythe " Rob - Manager Rob - still reviews

    the code and comments to stay in touch with the team, but rarely commits - Rob’s schedule looks like this - Time blocked out for team, has a rule that he only attends meetings with agendas, standing meetings cannot be longer than 30 minutes - Rob likes his job - Rob optimises on “how can we make that better?” - Rob connects with his team regularly and asks if he can pair to talk through what they are solving
  26. @ablythe " Rob - Manager Rob - still reviews

    the code and comments to stay in touch with the team, but rarely commits - Rob’s schedule looks like this - Time blocked out for team, has a rule that he only attends meetings with agendas, standing meetings cannot be longer than 30 minutes - Rob likes his job - Rob optimises on “how can we make that better?” - Rob connects with his team regularly and asks if he can pair to talk through what they are solving
  27. @ablythe # Ethan Sr. Engineer Rob refers to him as

    “Mentor”. Ethan is responsible for making sure Michael’s questions get answered.
  28. @ablythe % Tabitha Engineer New engineer joined the team earlier

    this year fresh out of college, regularly has challenging work
  29. @ablythe DIFFERENCES THIS TIME AROUND

    This time: • Group Chat (Slack) • Alert assigned to rotation • Alert posted to Group Chat • Acknowledgement visible to team • Code built in CI/CD • Rollback: switch DNS back (10-min mark) • Blameless Post-Mortem scheduled
    Last time: • One-on-one IM • Alert just to Shelly’s email • Shelly forwarding email/IM’ing people • Code manually deployed by Ethan • Rollback: manually removing code • Team left defeated/dejected
  30. @ablythe BLAMELESS POST-MORTEM • Test Environment just like Prod (found

    differences between two) • Use dns-a and dns-b (as we do today) • However, test before making the switch Action items from this meeting: • Have a Test environment that is just like Prod (differences were found in the current test environment) • Use dns-a, dns-b, and dns-prod; testing must be done after “dns-a” or “dns-b” is brought up, but before “dns-prod” goes live.
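
The second action item can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `DnsState`, `promote`, `rollback`, and the `smoke_test` callback are hypothetical names, and the dictionary stands in for real DNS records, which the slides do not describe in detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DnsState:
    """Hypothetical in-memory stand-in for DNS records: name -> target stack."""
    records: Dict[str, str]


def promote(state: DnsState, candidate: str, smoke_test: Callable[[str], bool]) -> bool:
    """Point dns-prod at the candidate only if the smoke test passes.

    Returns True when the switch happened, False when prod was left untouched.
    """
    if candidate not in ("dns-a", "dns-b"):
        raise ValueError(f"unknown candidate: {candidate}")
    target = state.records[candidate]
    if not smoke_test(target):
        return False  # dns-prod keeps serving the previous stack
    state.records["dns-prod"] = target
    return True


def rollback(state: DnsState, previous: str) -> None:
    """Switch dns-prod back to a previously live name (the rollback switch)."""
    state.records["dns-prod"] = state.records[previous]
```

The point of the sketch is the ordering: the candidate is tested while it is reachable only under its own name, and dns-prod moves only after that test passes.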
  31. @ablythe Trust Remember that just a bit ago we said

    this was about trust. Think to yourself for a minute - if you are in a leadership position, are you creating trust? - if you work for a leader, do you trust that leader? All things can improve… how will you go back to work on Monday and work to improve? Write that down.
  32. @ablythe Justice I feel that both the scenarios I described

    to you were ones where Justice was served. However the difference was the first was Retributive and the second was Restorative.
  33. @ablythe • Retributive justice achieves accountability by looking back on

    the harm done. • Restorative justice achieves accountability by looking ahead to meet the needs and repair the trust and relationships that were harmed. - Sidney Dekker, Just Culture
  34. @ablythe Fremont Assembly Plant http://en.wikipedia.org/wiki/Fremont_Assembly 46 I want to talk

    to you about the Fremont Assembly Plant. A 411-acre manufacturing plant in California. At the time of its closure, the Fremont employees were "considered the worst workforce in the automobile industry in the United States", according to the United Auto Workers. [6][7] Employees drank alcohol on the job, were frequently absent (enough so that the production line couldn't be started), and even committed petty acts of sabotage such as putting "Coke bottles inside the door panels, so they'd rattle and annoy the customer."
  35. @ablythe NUMMI plant http://en.wikipedia.org/wiki/NUMMI 47 In spite of the history

    and reputation, when NUMMI reopened the factory for production in 1984, most of the troublesome GM workforce was rehired, with some sent to Japan to learn the Toyota Production System. [6][7] Workers who made the transition identified the emphasis on quality and teamwork by Toyota management as what motivated a change in work ethic. Almost right away, the NUMMI factory was producing cars with as few defects per 100 vehicles as those produced in Japan.
  36. @ablythe Tesla Factory http://en.wikipedia.org/wiki/Tesla_Factory 48 Coincidentally, this historic NUMMI plant

    is now the home of Tesla, the electric car manufacturer that recently open sourced, to a degree, its designs and patents, a move that may change the world of car manufacturing. Tesla, as you may know, discarded the idea of model years long ago and moves faster than Toyota, using a continuous model where consumers can download updates to get new features.
  37. @ablythe Adrian Cockcroft - Formerly Netflix 50 I had the

    opportunity to see Adrian Cockcroft speak last week at DevOps Enterprise Summit about the Culture Deck.
  38. @ablythe 51 Directly from the Lean Enterprise text there are

    basically 3 high level types of cultures - At least in the non-Scientific model attributed to Westrum. Culture is the basis of building a successful system so I am going to spend some time on it. Blameless Post Mortems should be part of every team for every incident, aimed at identifying causes, learning
  39. @ablythe ME IN 2007/2008 - “5 WHY’S” I was working

    at a company in 2007/2008. I wasn’t even 30 years old, but a new position was created for me that was named “Quality Architect”. We had what many would consider to be a severe defect problem. A large percentage (like 50%) of the patch packages that we sent out would fix the original problem but create a new one. This was desktop software. Our clients did not trust us. We originally called these “used-to-works” internally in engineering. However, when support started saying this to clients, it was quickly changed to a more palatable term, “previously functional”.
    Apathy
    Example: Late to work * Why? Car wouldn’t start * Why? Battery was dead * Why? Left car door open all night * Why? Got home late * Why? Out drinking all night
    3 months - every day for me, once a month for them - people were angry at me - I presented back to them the summary of the things they said - Applied Pareto
    3 months - still did the meeting, but at the end of that time they presented what they had changed.
    Apathy started to shrink over this time as people believed again they were in control.
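
As a rough illustration of the "Applied Pareto" step, the summarized answers from those meetings could be tallied like this. The function name, the 80% default threshold, and the cause labels in the usage example are assumptions made for the sketch, not data from the talk.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def pareto(causes: Iterable[str], threshold: float = 0.8) -> List[Tuple[str, int, float]]:
    """Return the most frequent causes that together reach `threshold` of all reports.

    Each row is (cause, count, cumulative share of all reports).
    """
    counts = Counter(causes)
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, running / total))
        if running / total >= threshold:
            break  # the remaining causes are the long tail
    return rows


# Hypothetical cause labels collected from the meetings
reported = ["no regression test"] * 5 + ["rushed patch"] * 3 + ["unclear spec"] * 2
for cause, count, cumulative in pareto(reported, threshold=0.75):
    print(f"{cause}: {count} ({cumulative:.0%} cumulative)")
```

Presenting the short list back to the group keeps the discussion on the few causes that account for most of the defects.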
  40. @ablythe From the Introduction: "it should in no way be

    associated with that great body of factual information relating to orthodox Zen Buddhist practice. It's not very factual on motorcycles, either.” Romantic - a friend of the narrator decides not to learn how to maintain his expensive new motorcycle. When something on the bike breaks he is frustrated and needs to rely on professional mechanics to repair it. Classical - the narrator has an older bike that he is usually able to diagnose and repair through rational problem solving. Ultimately the author comes to the conclusion that a Zen-like mix of being in the moment and combining this rationality and romanticism can create a better overall life.
  41. @ablythe – Kurt Vonnegut, Hocus Pocus “Another flaw in the

    human character is that everybody wants to build and nobody wants to do maintenance.”
  42. @ablythe 5 WHY’S HAVE FALLEN OUT OF FAVOR • https://www.kitchensoap.com/2014/11/14/the-infinite-hows-or-the-dangers-of-the-five-whys/

    • Really asking “How?” and doing this in a group is important • Even though this is easy to grasp, it is tunnel-visioned The 5 Why’s have fallen out of favor. John Allspaw has an excellent blog post on the dangers of using the 5 Why’s.
  43. @ablythe –Adam Gale, President, KLAS “As a result of these

    and other changes, Cerner’s KLAS ranking has skyrocketed, moving from seventh to second in a four-year period (December 2007 to December 2011).” http://healthsystemcio.com/2012/04/09/how-cerner-was-able-to-turn-the-corner/ Over the next couple of years at the company I was at, the quality of the software improved dramatically. There were many factors in this, with excellent things going on throughout the company and a focus on quality. However, injecting a conscience back into the development life cycle by learning from past mistakes is one of the small contributions that I tried to promote.
  44. @ablythe POST MORTEM MEETING • Before meeting: • Time line

    of incident - facts, assumptions, expectations • During meeting • Level set expectations • Discuss without Blame • Only take action items that can be assigned and completed in next week
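
The last rule on this slide can be made concrete with a small check. This is a minimal sketch, assuming "next week" means within seven days of the meeting; `ActionItem`, `acceptable`, and `filter_action_items` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str]  # None or "" means nobody has been assigned
    due: date


def acceptable(item: ActionItem, meeting_day: date) -> bool:
    """Accept only items that have an owner and are due within 7 days of the meeting."""
    has_owner = bool(item.owner)
    within_week = meeting_day <= item.due <= meeting_day + timedelta(days=7)
    return has_owner and within_week


def filter_action_items(items: List[ActionItem], meeting_day: date) -> List[ActionItem]:
    """Drop items that cannot be assigned and completed in the next week."""
    return [item for item in items if acceptable(item, meeting_day)]
```

Items that fail the check are not necessarily bad ideas; they are simply too large or too vague to leave the meeting as action items.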
  45. @ablythe Machine setup Machine admin Application Zabbix Infra Team Operations

    Team Development Team ** SAN (Storage Area Network)
  46. @ablythe Machine setup Machine admin Application Zabbix Infra Team Operations

    Team Development Team ** SAN (Storage Area Network)
  47. @ablythe POST MORTEM MEETING • Before meeting: • Time line

    of incident - facts, assumptions, expectations • During meeting • Level set expectations • Discuss without Blame • Only take action items that can be assigned and completed in next week
  48. @ablythe Machine setup /etc/multipath.conf Machine admin Application Zabbix Infra Team

    Operations Team Development Team ** SAN (Storage Area Network)
  49. @ablythe COGNITIVE BIASES • Hindsight Bias • Outcome Bias •

    Availability Bias (AKA Recency Bias) • Sunk Cost Bias • Confirmation Bias There is an excellent short book from O’Reilly by David Zwieback, The Human Side of Postmortems, that I wish had been there when I was younger and contemplating . I believe the PDF is a total of 32 pages. * Hindsight Bias - judging past decisions by what we know now * Outcome Bias - attaching how bad the issue was to the quality of the decisions * Availability Bias - overestimating things that are easily recalled, underestimating the forgotten (ex. believing tornadoes cause more deaths than asthma) * Sunk Cost Bias - Well, we already have this so we are going to eat the whole bowl * Confirmation Bias - Finding data that just confirms what we want to hear (For the MSNBC and Fox viewers out there, this is what they play on)
  50. @ablythe RETROSPECTIVE MEETINGS • 10 minutes to quietly review the

    past two weeks • Write down 3 biggest accomplishments (team or individual) • Discussion and classification • Thank you (chance to formally thank someone on the team in front of everyone) • Action Items (to be followed up on next meeting) • Post publicly Coming from the agile development world, retrospective meetings are my favorite meetings. I managed a team where we did these without fail every two weeks for over two years. We called this our feels meeting. We were lucky enough to have a bean bag room close to where we sat in our cube farm. We started with fact finding on sticky notes, then put them on the wall and categorized them as a group. This would unearth a lot before resentment had the chance to fester and grow.
  51. @ablythe – Kurt Vonnegut, Sirens of Titan “Now, you can

    say your Daddy is right and the other little child's Daddy is wrong, but the universe is an awfully big place. There is room enough for an awful lot of people to be right about things and still not agree.”
  52. @ablythe TYPES OF MEETINGS • Root Cause Analysis Meeting (Monthly)

    • Post Mortem Meeting (per Incident) • Retrospective Meeting (Fortnightly)
  53. @ablythe • Reed Hastings: Culture Deck • Paul Graham: Maker’s Schedule vs.

    Manager’s Schedule • John Allspaw: Blameless PostMortems and a Just Culture • Dr. Rick Brinkman and Dr. Rick Kirschner: Dealing with People You Can’t Stand • David Zwieback: Human Side of Postmortems • Sidney Dekker: Just Culture