Unfucking Your Oncall Culture

On-call is where you learn more lessons about your company or your product than anywhere else, so you should make sure it doesn't suck.

We'll talk about the kinds of context that affect your unique oncall needs and results, and tools for persuasion if you need to change hearts and minds.


Charity Majors

July 01, 2016


  1. Charity Majors, CTO honeycomb.io @mipsytipsy Unfucking Your OnCall Culture

  2. Unfucking Your OnCall Culture (* not just for ops) (** not just for engineers) (*** not just for people who page or get paged)
  3. Things fall apart.

  4. Know your power, own your power. Culture is not designed by managers, it grows out of shared experience.
  5. On-Call is: An emergent property of your cultural values.

  6. How to fix it, once and for all: EXPERT

  7. How to fix it, once and for all: Hahahahahahahahahahaahaha

  8. On-Call is: Kind of like diabetes. If you’re a manager or team lead or senior engineer, you are personally responsible for tending the on-call experience. It is one of the most high-impact things you can do.

  10. The overwhelming power of context
    • Company
    • Team
    • Pain
    • **YOU** (who are you? what formal and informal power do you possess? how much buy-in do you have?)
  11. Context: company
    • Company size, maturity, expectations
    • Product expectations/commitments
    • Time horizon for your roadmap
    • What are your customers’ (reasonable) expectations?
    • Your tech stack. How many homegrown elements?
    • The size and structure of your eng teams
    • Distribution of skills, seniority of ICs
  12. Context: teams
    • How are your teams structured?
    • How much mutual respect and trust do you have?
    • How much specialization vs generalization? (Do you think you should have more or less of each?)
    • How quickly are you growing?
    • Why are you growing? (growth can be a pathology)
  13. “I’ll just hire more people”: You probably can’t hire your way out of this problem, and you definitely shouldn’t depend on it. Resource constraints are the best teacher. Embrace them.
  14. Context is everything.

  15. Toolbox.

  16. Tools #1: Everyone plays
    • Participation in a rotation is table stakes
    • Drop them in the deep end on the second week
    • Buddy up. (I like trailing primary)
    • The most effective rotation size is 5-8 engineers
    • Everyone is a generalist at a young startup
  17. Tools #2: Tools for Persuasion
    • If you believe you have no power, so will everyone else.
    • Don’t give a shit about getting credit.
    • Use your 1x1s to build consensus from behind.
    • Use “yes, and …” to gently shift perspective.
    • Every change is an experiment.
  18. Tools #3: Focus on impact
    • Find the pain and work backwards.
    • You need data to demonstrate impact.
    • Wrong docs are < no docs.
    • Put docs in the alerts, LoC in query comments
  19. Tools #4: Managers
    • Drive consensus about the Glorious Future
    • Monitor how often your team gets paged / woken up
    • Take yourself out of the main rotation (probably)
    • Pinch-hit when someone is tired or needs relief.
    • Call an all-hands strike team to knock out problems
    • Decide not to do things.
  20. Tools #5: Digging out of a hole
    • Have two lanes: one for wakeup alerts, and “other”
    • Nothing burns people out faster than flappy alerts
    • Lingering, non-actionable alerts are a CRISIS.
    • Known alerts that linger for weeks are an EMERGENCY
    • Look for ways to auto-remediate common alerts.
  21. On-Call is: Where failures surface; your most powerful tool for improvement. Watch it intently. Center your roadmapping and planning and training around what you learn.
  22. Final thought: Operations engineering has a history of being devalued compared to other software engineering subgenres. Why??
  23. It is time for that shit to stop. No more heroes. No more martyrs.
  24. On-call is your friend if you unfuck it <3 @mipsytipsy
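
Slides 19 and 20 ask you to monitor how often people get woken up, to keep two lanes for alerts, and to hunt down flappy alerts. A minimal sketch of what that bookkeeping could look like follows. Everything in it is an assumption made for illustration and not something from the deck: the `Page` record, the 22:00 to 07:00 definition of "night", and treating any alert that fires three or more times in the sample as flappy.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Page:
    """Hypothetical record of one alert firing; field names are illustrative."""
    engineer: str       # who was on call when it fired
    alert: str          # alert name
    fired_at: datetime  # local time for the on-call engineer
    urgent: bool        # must be handled right now
    actionable: bool    # a human can actually do something about it


def lane(page: Page) -> str:
    """Two lanes: only urgent AND actionable alerts may wake someone up."""
    return "wakeup" if page.urgent and page.actionable else "other"


def is_night(ts: datetime, start_hour: int = 22, end_hour: int = 7) -> bool:
    """Assumed night window (22:00-07:00); adjust to your team's reality."""
    return ts.hour >= start_hour or ts.hour < end_hour


def wakeups_per_engineer(pages: list[Page]) -> Counter:
    """Count wakeup-lane pages that fired at night, per engineer."""
    return Counter(
        p.engineer for p in pages if lane(p) == "wakeup" and is_night(p.fired_at)
    )


def flappy_alerts(pages: list[Page], threshold: int = 3) -> list[str]:
    """Crude flap check: alerts that fired at least `threshold` times."""
    counts = Counter(p.alert for p in pages)
    return sorted(name for name, n in counts.items() if n >= threshold)


# Made-up sample data, only to show the shape of the report.
pages = [
    Page("ana", "disk_full", datetime(2016, 7, 1, 3, 0), True, True),
    Page("ana", "cpu_high", datetime(2016, 7, 1, 14, 0), False, False),
    Page("bo", "disk_full", datetime(2016, 7, 1, 23, 30), True, True),
    Page("ana", "disk_full", datetime(2016, 7, 2, 2, 0), True, True),
    Page("bo", "cpu_high", datetime(2016, 7, 2, 4, 0), True, False),
]

print(wakeups_per_engineer(pages))
print(flappy_alerts(pages))
```

Numbers like these are the kind of data slide 18 says you need to demonstrate impact: a per-engineer wakeup count and a short list of repeat offenders are easier to argue from than "on-call feels bad".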