Designing Sustainable Ops Cultures

Ryn Daniels
August 21, 2019

Presented at OpsMatters London 21 August 2019

Transcript

  1. @rynchantress Ops Matters
    • Formerly an Ops
    • Very good at Apache upgrades
    • Now a Dev
    • Has Opinions about Devops
  2. 6 Factors of Burnout
    Understanding the burnout experience: recent research and its implications for psychiatry. Christina Maslach, Michael P. Leiter. World Psychiatry. 2016 Jun; 15(2): 103–111. Published online 2016 Jun 5. doi: 10.1002/wps.20311
  3. 6 Factors of Burnout
    • Workload
    • Control
    • Reward
    • Community
    • Fairness
    • Values
  4. Workload
    "We have too many fires to put out; we'll never have time to get to everything."
    "Production incidents happen faster than we can keep up with the remediation items from them."
    "We have too much interrupt-driven work to be able to make progress on any planned work."
    "There's too much tedious manual work and people leave because of it."
  5. Control
    "We aren't allowed to spend time on work that would make on-call easier."
    "Management is giving us mandatory checklists to follow without asking if they actually work for us."
    "Product teams make decisions that affect us without ever talking to us first."
  6. Reward
    "We get yelled at when the site goes down, but nobody ever thanks us when things are going well."
    "Our work feels invisible and thankless."
    "We aren't seen as valuable in the organization so we don't get the same bonus structure as other engineering teams."
  7. Community
    "Our work is different from that of the rest of engineering so our team tends to feel isolated."
    "We have the only on-call responsibilities in the organization but nobody else views that as a problem."
    "I'm a one-person ops team."
  8. Fairness
    "We get blamed for production incidents that were outside of our control."
    "It feels like we're always having to clean up other teams' messes."
    "We're always assumed to be less important and less capable than devs are."
  9. Values
    "My organization doesn't see our work as valuable."
    "Operations is seen as nothing more than a cost center."
    "Management would rather us put on bandaid solutions than invest any time in fixing things for real."
  10. What is culture?
    Culture is the set of values, norms, and behaviors that emerge from a particular group.
  11. Culture design process
    • Identify current and desired cultural outcomes
    • Note contributory behaviors
    • Describe potential new behaviors
    • Plan designable surfaces for changes
    • Document, execute, and iterate
  12. Designable surfaces
    Designable surfaces are specific, concrete things that can be changed with the goal of impacting a cultural outcome.
  13. Example: Designable surfaces for on-call
    • schedules
    • handoffs
    • compensation
    • alerting practices
    • escalation policies
    • services in scope
    • post-mortems
    • work prioritization
    • etc.
  14. Current cultural outcomes
    • The ops team keeps failing to make progress on their proactive project work
    • Team members spend significant amounts of time answering questions in multiple channels
    • Lots of time is spent doing work manually just to get things done on time
    • On-call work adds even more interruptions and stress
  15. Desired cultural outcomes
    • The ops team is able to complete projects as well as respond to interrupts
    • Responding to incoming questions feels manageable rather than overwhelming
    • The team is able to work in an effective manner and automate when necessary
  16. On-call: Contributory behaviors
    • The rotation only has 5 people in it, with week-long shifts
      • Tradeoffs: Shorter shifts would mean being on-call more often
    • Checks and alerts never get deleted because of a fear of missing something
      • Tradeoffs: False negatives versus false positives (alert fatigue)
    • Interrupt work given priority over remediation work
      • Tradeoffs: Short-term response time against long-term stability
  17. On-call: Designable Surfaces and Plans
    • Plan: Make non-essential services alert only during business hours
      • Designable surfaces: Service definitions, severities in monitoring system, on-call schedules in alerting system
    • Plan: Reduce the number of unactionable alerts
      • Designable surfaces: Tool to tag alerts (e.g. OpsWeekly), deleting alerts in monitoring system
    • Plan: Complete all remediation items within 30 days of an incident
      • Designable surfaces: Prioritization in issue tracker
  18. Interruptions: Contributory behaviors
    • Work ticket backlog means tickets have much slower response times
      • Tradeoffs: Balancing timely responses with other types of work
    • People will often ask ops team members they know questions directly
      • Tradeoffs: Building and maintaining relationships
    • The ops team Slack channel has grown to contain both internal chatter and external questions
      • Tradeoffs: Number and specificity of communication channels
  19. Interruptions: Designable Surfaces and Plans
    • Plan: Create a rotation for interrupt-driven work
      • Designable surfaces: Schedule management tool (PagerDuty, calendar), request tracking (Jira, helpdesk software)
    • Plan: Create a dedicated channel for questions and a private one for internal team discussions
      • Designable surfaces: Slack, directory for finding where to ask questions
  20. What's next?
    • Iterate, iterate, iterate
    • Expect the unexpected
    • Understand how change happens
  21. For managers
    • Empower your team to make changes
    • Encourage an atmosphere of learning, not blame
    • Help facilitate conversations with other teams
  22. For ICs
    • Look beyond "that's the way we've always done things"
    • Pay attention to and prioritize the tangible
    • Remember, you can always add alerts back
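
Two of the on-call plans from slide 17 can be illustrated with a short sketch. The Python below is not from the talk: the `Service` and `RemediationItem` types, the `essential` flag, and the definition of business hours as weekdays from 09:00 to 17:00 are all assumptions made for the example. It shows one way the rule "non-essential services alert only during business hours" and the 30-day remediation deadline might be encoded; a real setup would more likely express these in the monitoring and issue-tracking tools themselves.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta

# Assumed business hours (not specified in the talk): Mon-Fri, 09:00-17:00.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)

# "Complete all remediation items within 30 days of an incident."
REMEDIATION_WINDOW = timedelta(days=30)


@dataclass
class Service:
    """Hypothetical service definition with an 'essential' flag."""
    name: str
    essential: bool


@dataclass
class RemediationItem:
    """Hypothetical follow-up item created after an incident."""
    title: str
    incident_date: date
    done: bool = False


def in_business_hours(now: datetime) -> bool:
    """True on weekdays between BUSINESS_START (inclusive) and BUSINESS_END (exclusive)."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def should_page(service: Service, now: datetime) -> bool:
    """Essential services always page; non-essential ones only in business hours."""
    return service.essential or in_business_hours(now)


def overdue_items(items: list[RemediationItem], today: date) -> list[RemediationItem]:
    """Return open items whose incident is more than 30 days old."""
    return [
        item
        for item in items
        if not item.done and today - item.incident_date > REMEDIATION_WINDOW
    ]
```

For instance, `should_page(Service("wiki", essential=False), datetime(2019, 8, 24, 3, 0))` is `False` because that timestamp falls on a Saturday night, while the same call for a service with `essential=True` is `True`. Running `overdue_items` regularly against the issue tracker's export is one way to keep the 30-day goal visible.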