Designing Sustainable Ops Cultures

Presented at OpsMatters London 21 August 2019


Ryn Daniels

August 21, 2019


  1. Ops Matters - 21 Aug 2019 Ryn Daniels they/them @rynchantress

    Designing Sustainable Ops Cultures
  2. @rynchantress

    • Formerly an Ops
    • Very good at Apache upgrades
    • Now a Dev
    • Has Opinions about Devops
  3. Burnout

  4. Cynicism Exhaustion Inefficacy

  5. 6 Factors of Burnout

    Understanding the burnout experience: recent research and its implications for psychiatry. Christina Maslach, Michael P. Leiter. World Psychiatry. 2016 Jun; 15(2): 103–111. Published online 2016 Jun 5. doi: 10.1002/wps.20311
  6. 6 Factors of Burnout

    • Workload
    • Control
    • Reward
    • Community
    • Fairness
    • Values
  7. Computers were a mistake That #opslife Burnout

  8. Workload

    "We have too many fires to put out; we'll never have time to get to everything."
    "Production incidents happen faster than we can keep up with the remediation items from them."
    "We have too much interrupt-driven work to be able to make progress on any planned work."
    "There's too much tedious manual work and people leave because of it."
  9. Control

    "We aren't allowed to spend time on work that would make on-call easier."
    "Management is giving us mandatory checklists to follow without asking if they actually work for us."
    "Product teams make decisions that affect us without ever talking to us first."
  10. Reward

    "We get yelled at when the site goes down, but nobody ever thanks us when things are going well."
    "Our work feels invisible and thankless."
    "We aren't seen as valuable in the organization so we don't get the same bonus structure as other engineering teams."
  11. Community

    "Our work is different from that of the rest of engineering so our team tends to feel isolated."
    "We have the only on-call responsibilities in the organization but nobody else views that as a problem."
    "I'm a one-person ops team."
  12. Fairness

    "We get blamed for production incidents that were outside of our control."
    "It feels like we're always having to clean up other teams' messes."
    "We're always assumed to be less important and less capable than devs are."
  13. Values

    "My organization doesn't see our work as valuable."
    "Operations is seen as nothing more than a cost center."
    "Management would rather us put on bandaid solutions than invest any time in fixing things for real."
  14. #opslife

  15. Take burnout seriously.

  16. What is the opposite of burnout?

  17. Purpose & Engagement!

  18. Can we design sustainable ops cultures?

  19. What is culture?

  20. What is culture?

  21. What is culture?

    Culture is the set of values, norms, and behaviors that emerge from a particular group.
  22. Ops culture is...

  23. What would a sustainable, engaging ops culture look like?

  24. Purrpose & Engagement!

  25. Culture design process

  26. Culture design process

    • Identify current and desired cultural outcomes
    • Note contributory behaviors
    • Describe potential new behaviors
    • Plan designable surfaces for changes
    • Document, execute, and iterate
  27. Designable surfaces

  28. Designable surfaces

    Designable surfaces are specific, concrete things that can be changed with the goal of impacting a cultural outcome.
  29. Example: Designable surfaces for on-call

    • schedules
    • handoffs
    • compensation
    • alerting practices
    • escalation policies
    • services in scope
    • post-mortems
    • work prioritization
    • etc.
  30. Example time!

  31. Current cultural outcomes

    • The ops team keeps failing to make progress on their proactive project work
    • Team members spend significant amounts of time answering questions in multiple channels
    • Lots of time is spent doing work manually just to get things done on time
    • On-call work adds even more interruptions and stress
  32. Oh no!

  33. Desired cultural outcomes

    • The ops team is able to complete projects as well as respond to interrupts
    • Responding to incoming questions feels manageable rather than overwhelming
    • The team is able to work in an effective manner and automate when necessary
  34. @rynchantress Ops Matters

  35. On-call: Contributory behaviors

    • The rotation only has 5 people in it, with week-long shifts
      • Tradeoffs: Shorter shifts would mean being on-call more often
    • Checks and alerts never get deleted because of a fear of missing something
      • Tradeoffs: False negatives versus false positives (alert fatigue)
    • Interrupt work given priority over remediation work
      • Tradeoffs: Short-term response time against long-term stability
  36. On-call: Designable Surfaces and Plans

    • Plan: Make non-essential services alert only during business hours
      • Designable surfaces: Service definitions, severities in monitoring system, on-call schedules in alerting system
    • Plan: Reduce the number of unactionable alerts
      • Designable surfaces: Tool to tag alerts (e.g. OpsWeekly), deleting alerts in monitoring system
    • Plan: Complete all remediation items within 30 days of an incident
      • Designable surfaces: Prioritization in issue tracker
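
The first plan on this slide (non-essential services alert only during business hours) could be expressed as a small routing rule. The following is a minimal sketch, not something shown in the talk: the `Alert` class, the `essential` flag, and the 09:00 to 17:00 Monday-to-Friday window are all assumptions made for illustration, and a real setup would normally live in the monitoring or alerting system's own configuration.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Assumed business hours; adjust to the team's actual working window.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; `essential` would come from the service definition."""
    service: str
    essential: bool


def is_business_hours(now: datetime) -> bool:
    """True on Monday to Friday between BUSINESS_START (inclusive) and BUSINESS_END (exclusive)."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def should_page(alert: Alert, now: datetime) -> bool:
    """Essential services page at any time; non-essential ones only during business hours."""
    return alert.essential or is_business_hours(now)
```

With this rule, an alert from a non-essential service at 03:00 on a Saturday would be held back rather than paging anyone, while an alert from an essential service would still page.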
  37. Getting better...

  38. Interruptions: Contributory behaviors

    • Work ticket backlog means tickets have much slower response times
      • Tradeoffs: Balancing timely responses with other types of work
    • People will often ask ops team members they know questions directly
      • Tradeoffs: Building and maintaining relationships
    • The ops team Slack channel has grown to contain both internal chatter and external questions
      • Tradeoffs: Number and specificity of communication channels
  39. Interruptions: Designable Surfaces and Plans

    • Plan: Create a rotation for interrupt-driven work
      • Designable surfaces: Schedule management tool (PagerDuty, calendar), request tracking (Jira, helpdesk software)
    • Plan: Create a dedicated channel for questions and a private one for internal team discussions
      • Designable surfaces: Slack, directory for finding where to ask questions
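
A rotation for interrupt-driven work can be as simple as a weekly round-robin. The sketch below is an illustration only, assuming week-long turns and a fixed member list; the function name and the example names are hypothetical, and in practice the schedule would more likely be kept in a scheduling tool such as the ones named on the slide.

```python
from datetime import date


def who_handles_interrupts(members: list[str], rotation_start: date, day: date) -> str:
    """Return the person on interrupt duty for `day`.

    Assumes week-long turns that begin on `rotation_start` and cycle
    through `members` in order (round-robin).
    """
    if not members:
        raise ValueError("rotation needs at least one member")
    weeks_elapsed = (day - rotation_start).days // 7
    return members[weeks_elapsed % len(members)]


# Example: a three-person rotation starting on Monday 26 August 2019.
team = ["ana", "bo", "cy"]
start = date(2019, 8, 26)
on_duty = who_handles_interrupts(team, start, date(2019, 9, 4))
```

Everyone not on duty that week can route incoming questions to the person on duty and stay on planned work.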
  40. The fuzzy side...

  41. What's next?

    • Iterate, iterate, iterate
    • Expect the unexpected
    • Understand how change happens
  42. Empowering organizational change

  43. For managers

  44. For managers

    • Empower your team to make changes
    • Encourage an atmosphere of learning, not blame
    • Help facilitate conversations with other teams
  45. Empowerment, not gatekeeping

  46. For ICs

  47. For ICs

    • Look beyond "that's the way we've always done things"
    • Pay attention to and prioritize the tangible
    • Remember, you can always add alerts back
  48. What is a sustainable #opslife for you?

  49. THANK YOU!