
Being On Call Does Not Have to Suck.

We have collectively internalized the idea that being on call is a distasteful burden, a necessary evil at best. What if I told you that:

* on-call duty could be 100% voluntary?
* being on call could be a sign of prestige?
* you could look forward to your turn on call -- the time when you get to take a break from the roadmap and fix all the shit that has been personally bugging you?
* being on call can be a great time to unleash your curiosity, learn new skills, and develop/refine your technical judgment?
* an on-call rotation, properly run, can heighten the stakes, build your empathy towards your users, and strengthen the bonds between you and your teammates?

On-call duty doesn't have to be something you are forced to endure... we can do so much better than that. It might even become your favorite week of the month. 🙃

Charity Majors

November 17, 2022

Transcript

  1. @mipsytipsy
    Why on call doesn’t have to suck.


  2. @mipsytipsy


    engineer/cofounder/CTO


    https://charity.wtf


  3. “I dread being on call.”
    “I became a software engineer with
    the expectation that other people
    would be getting paged, not me.”
    “I didn’t sign up for this.”


  4. “If you make everybody be on call, we’ll
    have even fewer mothers (and other
    marginalized folks) in engineering.”
    “On call duty is what burned me out of tech.”
    “My time is too valuable to be on call. You want
    me writing features and delivering user value.”


  5. “Sometimes you just have to buy the
    happiness of your users with the lifeblood
    of your engineers.”
    “I sacrificed MY health and sleep for 10 years
    of on call duty; now it’s YOUR turn.”
    “You aren’t a REAL engineer until you’ve
    debugged this live at 3 am.”


  6. There are loads of toxic patterns around on call
    (posturing, sunk cost fallacies, disrespect for sleep and personal lives, surface fixes, evading
    responsibility, flappy alerts, over-alerting, lack of training or support, snobbery…)
    But it doesn’t have to be this way.
    We can do so much better. 🥰


  7. I am here to convince you that on call can be:
    • an aspirational role… an achievement
    • worthy of pride and respect
    • fully compatible with adult lives & responsibilities
    • rarely sleep-disturbing or life-impacting
    • something that engineers enjoy and look forward to
    • maybe even … ✨volunteer-only✨


  8. What on call work means to me:
    A social contract between engineers and managers
    An essential part of software ownership
    A proxy metric for how well your team is performing, how
    functional your system is, and how happy your users are
    A set of expert practices and techniques in its own right


  9. On call rotations are a social contract
    between engineers and managers.
    Managers, your job is to make sure owning it doesn’t suck.
    ⭐ By ensuring you get enough time as a team to fix that which is broken
    ⭐ Making sure that people are taking time off to compensate for time on call
    ⭐ By tracking the frequency of out-of-hours alerting
    ⭐ By creating a training process
    Engineers, your job is to own your code.
    ⭐ You write it, you run it
    ⭐ If you work on a highly available service, you should expect to support it
    ⭐ It’s important for you to close the loop by watching your code in prod
    ⭐ Systems are getting increasingly complex, to the point that no one *can* be on call for software they don’t write & own
    ⭐ This is the way to cleaner systems
    ⭐ SOMEONE’s gotta do it


  10. Who should be on call for their code?
    Engineers with code in production.
    Is this payback?? 🤔
    No! OMG. I freely acknowledge that ops has always had a streak
    of masochism. But this isn’t about making software engineers as
    miserable as we used to be; this is about the fact that software
    ownership is the way to make things better. For everyone.


    On call is how we close the feedback loop.
    Someone needs to be responsible for your
    services in the off-hours.


    This cannot be an afterthought; it should
    play a prominent role in your hiring, team
    structure, and compensation decisions
    from the very start.


    These are decisions that define who you
    are and what you value as a team.


  12. If owning your code sucks, managers need to:
    Audit the alert strategy
    Begin evaluating team performance
    Graph and track every alert outside of business hours
    Make sure the team has enough cycles to actually fix shit
    Get reliability projects on the roadmap
    Get curious about your on call culture
    And ideally add themselves to the rotation.


  13. Objections:
    “My time is too valuable” → Whose time isn’t?
    “I don’t know how to navigate production” → Learn.
    “I have a new baby” → You win. Nobody should have two alarms.
    “We have a follow-the-sun rotation” → Lucky you!
    “I need my sleep / it’s stressful.” → Let’s make it better, not pawn it off on someone else.

  14. • Lack of alert discipline (alerts are flappy, noisy, or not actionable)



  15. Let’s make on call fun again!
    1. Adopt alerting best practices
    2. Chase the gremlins from your system
    3. Become a higher-performing team


  16. Adopt alerting best practices
    Conduct blameless retros after reliability events
    Stop alerting on symptoms (e.g. “high CPU”, “disk full”)
    Adopt Service Level Objectives (SLOs) and alert on those instead
    Delete any predictive alerts; alert only when customers are in pain
    Kill flappy alerts with fire
    Treat any alert outside business hours like a reliability event;
    triage and retro it, and fix it so it won’t happen again.
    Create a second lane for alerts that are important but not urgent.


  17. SLOs
    Align on call pain with user pain: alert
    only on SLO violations and end-to-end
    checks which correlate directly to real
    user pain.
    Never, EVER alert on symptoms, like
    “high CPU” or “disk full.”
    Fail partially and gracefully if at all
    possible. Better to spend down an SLO
    budget than have an outage.
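To make “alert only on SLO violations” concrete, here is a minimal sketch of a burn-rate check under assumed numbers: a hypothetical 99.9%-over-30-days availability SLO, a one-hour window, and a commonly used page threshold of 14.4 (fast enough to burn roughly 2% of the monthly budget in an hour). None of these values come from the deck; tune them to your own SLOs.

```python
# Minimal sketch of an SLO burn-rate check (hypothetical numbers and data source).
# The idea: page only when the error budget is burning fast enough to threaten
# the SLO, never on raw symptoms like CPU or disk.

SLO_TARGET = 0.999          # 99.9% of requests succeed over the 30-day window
ERROR_BUDGET = 1 - SLO_TARGET

def burn_rate(bad_requests: int, total_requests: int) -> float:
    """How fast the error budget is burning, relative to a steady burn of 1.0."""
    if total_requests == 0:
        return 0.0
    observed_error_rate = bad_requests / total_requests
    return observed_error_rate / ERROR_BUDGET

def should_page(bad_last_hour: int, total_last_hour: int) -> bool:
    # Rule of thumb: a burn rate this high sustained for an hour torches a
    # meaningful chunk of the monthly budget -- that is worth waking someone.
    return burn_rate(bad_last_hour, total_last_hour) >= 14.4

# Example: 120 failed requests out of 50,000 in the last hour.
# error rate = 0.0024, burn rate ≈ 2.4 -> no page; open a lane-two ticket instead.
print(should_page(120, 50_000))
```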


  18. Out of hours alerts
    Set a high bar for what you are going to
    wake people up for. Actively curate a
    small list of these alerts. Tweak them until
    they are rock solid. No flaps.
    Take those alerts as seriously as a heart
    attack. Track them, graph them, FIX
    THEM. Managers, consider adding
    yourselves to the rotation.
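A sketch of what “track them, graph them” might look like, assuming you can export page timestamps from whatever paging tool you use; the weekday 09:00–18:00 business-hours window and the data shape are assumptions.

```python
# Minimal sketch of tracking out-of-hours pages (field names and business hours
# are assumptions; adapt to whatever your paging tool exports).
from collections import Counter
from datetime import datetime

BUSINESS_HOURS = range(9, 18)   # 09:00-17:59, Monday-Friday

def is_out_of_hours(ts: datetime) -> bool:
    return ts.weekday() >= 5 or ts.hour not in BUSINESS_HOURS

def out_of_hours_by_week(page_timestamps: list[datetime]) -> Counter:
    """Count out-of-hours pages per ISO week -- the number you graph and drive toward zero."""
    counts = Counter()
    for ts in page_timestamps:
        if is_out_of_hours(ts):
            counts[ts.strftime("%G-W%V")] += 1
    return counts

pages = [datetime(2022, 11, 12, 3, 14), datetime(2022, 11, 15, 14, 2)]
print(out_of_hours_by_week(pages))   # Counter({'2022-W45': 1}) -- only the Saturday 3am page counts
```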


  19. Retro and Resolve
    If something alerted out of hours, hold a
    retro. Can this be auto-remediated? Can
    it be diverted to the second lane? What
    needs to be done to fix this for good?
    Have a second lane of “stuff to deal with,
    but not urgently” where most alarms go.
    Send as much as possible to lane two.
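One possible shape for the two lanes in code, as a sketch only; `page_oncall` and `open_ticket` are hypothetical stand-ins for real paging and ticketing integrations, and the urgency test is deliberately simple.

```python
# Sketch of the two-lane idea: only customer-impacting, SLO-threatening alerts
# page a human; everything else lands in a queue to be handled during work hours.

def page_oncall(alert: dict) -> None:
    print(f"PAGE: {alert['name']}")        # stand-in for a real paging integration

def open_ticket(alert: dict) -> None:
    print(f"ticket: {alert['name']}")      # stand-in for a real ticketing integration

def route_alert(alert: dict) -> str:
    urgent = alert.get("slo_violation", False) or alert.get("customer_impact", False)
    if urgent:
        page_oncall(alert)                 # lane one: wake a human
        return "lane-one"
    open_ticket(alert)                     # lane two: important, not urgent
    return "lane-two"

route_alert({"name": "checkout error budget burning fast", "slo_violation": True})
route_alert({"name": "cert expires in 20 days", "slo_violation": False})
```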


  20. Reduce the gremlins in your system
    Embrace observability
    Make a habit of inspecting your instrumentation after each deploy
    Require everyone to answer, “how will you know when this breaks?”
    and “what are you going to look for?” before each merge.
    Proactively explore and inspect outliers every day
    Make sure that engineers have enough time to finish retro action items
    Get reliability projects on the schedule
    Spend 20% time steady state on tech debt and small improvements
    Run chaos experiments
    Decouple deploys from releases using feature flags.


  21. Investments
    Decouple your deploys from releases
    using feature flags. Consider progressive
    deployments.
    You will need to spend at least 20% time
    on system upkeep, steady state. If you’re
    in the red, you need to start with more.
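A minimal sketch of decoupling deploy from release with a percentage-rollout flag; the in-memory flag table and hash-bucketing are illustrative, not any particular feature-flag vendor's API.

```python
# Minimal sketch of decoupling deploy from release: the code ships dark, and the
# flag controls who actually sees it.
import hashlib

FLAGS = {"new-checkout": {"enabled": True, "rollout_percent": 5}}   # illustrative flag store

def is_enabled(flag_name: str, user_id: str) -> bool:
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    # Hash the user id so each user lands in a stable bucket from 0 to 99.
    bucket = int(hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_percent"]

def new_checkout(user_id: str) -> str: return f"new path for {user_id}"
def old_checkout(user_id: str) -> str: return f"old path for {user_id}"

def checkout(user_id: str) -> str:
    if is_enabled("new-checkout", user_id):
        return new_checkout(user_id)   # deployed weeks ago, released to 5% of users today
    return old_checkout(user_id)

print(checkout("user-42"))
```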


  22. Principles
    It is easier to keep yourself from falling
    into an operational pit of doom than it is
    to dig your way out of one.
    Value good, clean high-level abstractions
    that let you delegate large swaths of
    operational burden and software surface
    area to vendors. Money is cheap;
    engineering cycles are not.


  23. Roadmapping
    Reliability work and technical debt are
    not secondary to product features. Use
    SLOs (and human SLOs!) to assert the
    time you need to build a better system.
    You need to spend at least 20% time on
    system upkeep, steady state. If you’re in
    the red, you need to start with more.


  24. Test in Production
    If you can’t reliably ascertain what’s
    happening within a few minutes of
    investigating, you need better
    observability. This is not normal or
    acceptable.
    Run chaos experiments (at 3pm, not
    3am) to make sure you’ve fixed it, and
    perhaps run them continuously, forever.
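A sketch of what a small, daytime chaos experiment could look like: inject a little latency into a tiny slice of requests, then confirm your instrumentation and SLO alerts actually notice. The flag, fraction, and delay are assumptions; scope and supervise the real thing carefully.

```python
# Sketch of a small daytime chaos experiment: delay a slice of requests so you
# can verify your observability and alerting catch it. Run it at 3pm with
# someone watching, never as a surprise at 3am.
import random
import time

CHAOS_ENABLED = True          # flip off to end the experiment immediately
AFFECTED_FRACTION = 0.01      # 1% of requests
INJECTED_LATENCY_S = 0.5

def maybe_inject_latency() -> bool:
    if CHAOS_ENABLED and random.random() < AFFECTED_FRACTION:
        time.sleep(INJECTED_LATENCY_S)
        return True
    return False

def handle_request(request_id: str) -> str:
    delayed = maybe_inject_latency()
    # ...normal request handling here; record `delayed` on the request's event
    # so the experiment is visible in your observability tooling.
    return f"{request_id} handled (chaos_delay={delayed})"

print(handle_request("req-1"))
```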


  25. Improve your performance as a team
    Instrument your code for observability with wide events and spans
    Replace dashboards with ad hoc querying
    Practice ODD (Observability-Driven Development)
    (instrument as you go, ship quickly, then verify your changes in prod)
    After each merge, automatically run tests, build artifact, and deploy.
    Deploy one merge by one engineer at a time
    Run tests and auto-deploy in 15 minutes or less
    Consider whether you have sufficient operational expertise on hand
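To illustrate “wide events and spans,” here is a sketch using the OpenTelemetry Python API (one possible way to do this, not the only one); SDK/exporter setup is elided, and the attribute names and handler are invented for the example.

```python
# Sketch of a "wide event": one span per unit of work, annotated with every
# dimension you might want to query on later. Exporter/SDK configuration is
# elided; without it the API calls are harmless no-ops.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def charge_and_fulfill(cart: dict) -> dict:
    return {"ok": True}   # placeholder for the real work

def handle_checkout(user_id: str, cart: dict) -> dict:
    with tracer.start_as_current_span("handle_checkout") as span:
        # Pack the span with high-cardinality context up front...
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("app.cart_item_count", len(cart["items"]))
        span.set_attribute("app.payment_provider", cart["payment_provider"])

        result = charge_and_fulfill(cart)

        # ...and record the outcome, so "what changed after this deploy?" is an
        # ad hoc query instead of an archaeology project.
        span.set_attribute("app.charge_succeeded", result["ok"])
        return result

handle_checkout("user-42", {"items": [1, 2], "payment_provider": "hypothetical-pay"})
```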


  26. How well does your team perform?
    != “how good are you at engineering”


  27. High-performing teams
    spend the majority of their time solving interesting, novel
    problems that move the business materially forward.
    Lower-performing teams
    spend their time firefighting, waiting on code review, waiting
    on each other, resolving merge conflicts, reproducing tricky
    bugs, solving problems they thought were fixed, responding
    to customer complaints, fixing flaky tests, running deploys
    by hand, fighting with their infrastructure, fighting with their
    tools, fighting with each other…endless yak shaving and toil.


  28. 🔥1 — How frequently do you deploy?
    🔥2 — How long does it take for code to go live?
    🔥3 — How many of your deploys fail?
    🔥4 — How long does it take to recover from an outage?
    🔥5 — How often are you paged outside work hours?
    How high-performing is YOUR team?
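A sketch of computing the first four of those numbers from a deploy log; the record format and the data are invented for illustration, and question 5 comes straight from your paging history (see the out-of-hours tracking sketch above).

```python
# Sketch of answering questions 1-4 from a deploy log. The record format is an
# assumption; pull the real data from your CI/CD and incident tooling.
from datetime import datetime, timedelta
from statistics import median

deploys = [   # illustrative data: one record per production deploy
    {"merged_at": datetime(2022, 11, 14, 10, 0), "deployed_at": datetime(2022, 11, 14, 10, 12), "failed": False},
    {"merged_at": datetime(2022, 11, 14, 15, 30), "deployed_at": datetime(2022, 11, 14, 15, 41), "failed": True,
     "recovered_at": datetime(2022, 11, 14, 16, 5)},
    {"merged_at": datetime(2022, 11, 15, 9, 2), "deployed_at": datetime(2022, 11, 15, 9, 13), "failed": False},
]

days_observed = 2
deploy_frequency = len(deploys) / days_observed                                   # 🔥1
lead_time = median(d["deployed_at"] - d["merged_at"] for d in deploys)            # 🔥2
failure_rate = sum(d["failed"] for d in deploys) / len(deploys)                   # 🔥3
recovery_times = [d["recovered_at"] - d["deployed_at"] for d in deploys if d["failed"]]
mean_time_to_recover = sum(recovery_times, timedelta()) / len(recovery_times)     # 🔥4

print(deploy_frequency, lead_time, failure_rate, mean_time_to_recover)
# 1.5 deploys/day, 11 min median lead time, ~33% failure rate, 24 min to recover
```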


  29. There is a wide gap between “elite” teams and the other 75%.
    2021 numbers


  30. Work on these things. Track these things. They matter.
    Deploy frequency
    Time to deploy
    Deploy failures
    Time to recovery
    Out-of-hours alerts


  31. Great teams make great engineers. ❤


  32. Your ability to ship code swiftly and safely has less to do with
    your knowledge of algorithms and data structures,
    and much more to do with the sociotechnical system you participate in.
    sociotechnical (n)
    “Technology is the sum of ways in which social groups construct the
    material objects of their civilizations. The things made are socially
    constructed just as much as technically constructed. The merging of
    these two things, construction and insight, is sociotechnology” —
    Wikipedia
    Technical leadership should focus intensely on constructing
    and tightening the feedback loops at the heart of their system.


  33. Now that we have tamed our alerts, and switched to SLOs…
    Now that we have dramatically fewer unknown-unknowns…
    Now that we have the instrumentation to swiftly pinpoint any cause…
    Now that we auto-deploy our changes to production within minutes…
    Now that out-of-hours alerts are vanishingly rare…
    So, now that we’ve done all that…
    “I’m still not happy. You said I’d be HAPPY to be on call.”


  34. If you are on call, you are not to
    work on product or the roadmap
    that week.
    You work on the system. Whatever has been
    bothering you, whatever you think is broken …
    you work on that. Use your judgment. Have fun.
    If you were on call this week, you get next Friday off.


  35. I’m of mixed mind about paying people for
    being on call. With one big exception.
    If you are struggling to get your engineers the time they
    need for systems work instead of just cranking features,
    you should start paying people a premium every time they
    are alerted out of hours. Pay them a lot. Pay enough for finance to
    complain. Pay them until management is begging you to work on
    reliability to save them money.
    If they don’t care about your people’s time and lives, convert it into
    something they do care about. ✨Money✨


  36. The End ☺


  37. Charity Majors


    @mipsytipsy
