Slide 1

Slide 1 text

@mipsytipsy Why on call doesn’t have to suck.

Slide 2

Slide 2 text

@mipsytipsy engineer/cofounder/CTO https://charity.wtf new!

Slide 3

Slide 3 text

“I dread being on call.” “I became a software engineer with the expectation that other people would be getting paged, not me.” “I didn’t sign up for this.”

Slide 4

Slide 4 text

“If you make everybody be on call, we’ll have even fewer mothers (and other marginalized folks) in engineering.” “On call duty is what burned me out of tech.” “My time is too valuable to be on call. You want me writing features and delivering user value.”

Slide 5

Slide 5 text

“Sometimes you just have to buy the happiness of your users with the lifeblood of your engineers.” “I sacrificed MY health and sleep for 10 years of on call duty; now it’s YOUR turn.” “You aren’t a REAL engineer until you’ve debugged this live at 3 am.”

Slide 6

Slide 6 text

There are loads of toxic patterns around on call (posturing, sunk cost fallacies, disrespect for sleep and personal lives, surface fixes, evading responsibility, flappy alerts, over-alerting, lack of training or support, snobbery…). But it doesn’t have to be this way. We can do so much better. 🥰

Slide 7

Slide 7 text

I am here to convince you that on call can be:
• an aspirational role… an achievement
• worthy of pride and respect
• fully compatible with adult lives & responsibilities
• rarely sleep-disturbing or life-impacting
• something that engineers enjoy and look forward to
• maybe even … ✨volunteer-only✨

Slide 8

Slide 8 text

What on call work means to me:
• A social contract between engineers and managers
• An essential part of software ownership
• A proxy metric for how well your team is performing, how functional your system is, and how happy your users are
• A set of expert practices and techniques in its own right

Slide 9

Slide 9 text

On call rotations are a social contract between engineers and managers.
Managers, your job is to make sure owning it doesn’t suck:
⭐ By ensuring you get enough time as a team to fix that which is broken
⭐ By making sure people take time off to compensate for time on call
⭐ By tracking the frequency of out-of-hours alerting
⭐ By creating a training process
Engineers, your job is to own your code:
⭐ You write it, you run it
⭐ If you work on a highly available service, you should expect to support it
⭐ It’s important for you to close the loop by watching your code in prod
⭐ Systems are getting increasingly complex, to the point that no one *can* be on call for software they don’t write & own
⭐ This is the way to cleaner systems
⭐ SOMEONE’s gotta do it

Slide 10

Slide 10 text

Who should be on call for their code? Engineers with code in production. Is this payback?? 🤔 No! OMG. I freely acknowledge that ops has always had a streak of masochism. But this isn’t about making software engineers as miserable as we used to be; this is about the fact that software ownership is the way to make things better. For everyone.

Slide 11

Slide 11 text

sociotechnical (n) On call is how we close the feedback loop. Someone needs to be responsible for your services in the off-hours. This cannot be an afterthought; it should play a prominent role in your hiring, team structure, and compensation decisions from the very start. These are decisions that define who you are and what you value as a team.

Slide 12

Slide 12 text

If owning your code sucks, managers need to:
• Audit the alert strategy
• Begin evaluating team performance
• Graph and track every alert outside of business hours
• Make sure the team has enough cycles to actually fix shit
• Get reliability projects on the roadmap
• Get curious about your on call culture
• And ideally, add themselves to the rotation.

Slide 13

Slide 13 text

Objections:
“My time is too valuable.” → Whose time isn’t?
“I don’t know how to navigate production.” → Learn.
“I have a new baby.” → You win. Nobody should have two alarms.
“We have a follow-the-sun rotation.” → Lucky you!
“I need my sleep / it’s stressful.” → Let’s make it better, not pawn it off on someone else.

Slide 14

Slide 14 text

• Lack of alert discipline (alerts are flappy, noisy, or not actionable)

Slide 15

Slide 15 text

Let’s make on call fun again!
1. Adopt alerting best practices
2. Chase the gremlins from your system
3. Become a higher-performing team

Slide 16

Slide 16 text

Adopt alerting best practices
• Conduct blameless retros after reliability events
• Stop alerting on symptoms (e.g. “high CPU”, “disk full”)
• Adopt Service Level Objectives (SLOs) and alert on those instead
• Delete any predictive alerts; alert only when customers are in pain
• Kill flappy alerts with fire
• Treat any alert outside business hours like a reliability event: triage it, retro it, and fix it so it won’t happen again
• Create a second lane for alerts that are important but not urgent (a sketch of the routing idea follows below)
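
A minimal sketch of that "second lane" routing in Python: alerts are classified by how fast they are burning the SLO's error budget, not by raw symptoms. The `route_alert` helper and the threshold values are illustrative assumptions, not a prescription.

    # Sketch: route alerts by SLO error-budget burn rate instead of raw symptoms.
    # burn_rate = (observed error rate) / (error rate the SLO allows).
    # Thresholds are illustrative, not prescriptive.

    URGENT_BURN = 14.4      # exhausts a 30-day budget in ~2 days: wake someone up
    SECOND_LANE_BURN = 1.0  # budget is eroding, but it can wait for work hours

    def route_alert(burn_rate: float) -> str:
        """Decide where an SLO alert should go."""
        if burn_rate >= URGENT_BURN:
            return "page"         # real user pain, right now
        if burn_rate >= SECOND_LANE_BURN:
            return "second-lane"  # important, not urgent: goes to the ticket queue
        return "ignore"           # noise; never wake a human for this

    assert route_alert(20.0) == "page"
    assert route_alert(2.0) == "second-lane"
    assert route_alert(0.3) == "ignore"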

Slide 17

Slide 17 text

SLOs
Align on call pain with user pain: alert only on SLO violations and end-to-end checks that correlate directly with real user pain. Never, EVER alert on symptoms like “high CPU” or “disk full.” Fail partially and gracefully if at all possible. Better to spend down an SLO budget than to have an outage.
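
As an illustration of an end-to-end check that correlates with real user pain, here is a hedged Python sketch of a synthetic probe; the URL and the success criterion are placeholders for one of your own user flows, and the result dict is whatever your SLO pipeline expects.

    # Sketch: an end-to-end probe that exercises a real user path and records
    # an SLO-relevant result. URL and success criterion are placeholders.
    import time
    import urllib.error
    import urllib.request

    def probe_user_flow(url: str = "https://example.com/health/checkout") -> dict:
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                ok = resp.status == 200
        except (urllib.error.URLError, TimeoutError):
            ok = False
        return {
            "check": "checkout_flow",
            "success": ok,  # feeds the SLO's good-event / bad-event counters
            "duration_ms": round((time.monotonic() - start) * 1000, 1),
            "timestamp": time.time(),
        }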

Slide 18

Slide 18 text

Out-of-hours alerts
Set a high bar for what you are going to wake people up for. Actively curate a small list of these alerts. Tweak them until they are rock solid. No flaps. Take those alerts as seriously as a heart attack: track them, graph them, FIX THEM. Managers, consider adding yourselves to the rotation.
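
A small sketch of the tracking half, assuming pages arrive as timestamps and that "business hours" means 09:00-18:00 on weekdays (adjust to your own definition): it buckets out-of-hours pages by ISO week so the trend can be graphed and reviewed.

    # Sketch: count out-of-hours pages per week so the trend can be graphed.
    # "Business hours" here is an assumption: 09:00-18:00, Monday-Friday.
    from collections import Counter
    from datetime import datetime

    def is_out_of_hours(ts: datetime) -> bool:
        weekend = ts.weekday() >= 5
        return weekend or ts.hour < 9 or ts.hour >= 18

    def out_of_hours_per_week(pages: list[datetime]) -> Counter:
        """Map ISO week ('2024-W07') -> number of out-of-hours pages."""
        counts = Counter()
        for ts in pages:
            if is_out_of_hours(ts):
                year, week, _ = ts.isocalendar()
                counts[f"{year}-W{week:02d}"] += 1
        return counts

    pages = [datetime(2024, 2, 13, 3, 12), datetime(2024, 2, 13, 14, 5)]
    print(out_of_hours_per_week(pages))  # Counter({'2024-W07': 1})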

Slide 19

Slide 19 text

Retro and Resolve
If something alerted out of hours, hold a retro. Can this be auto-remediated? Can it be diverted to the second lane? What needs to be done to fix this for good? Have a second lane of “stuff to deal with, but not urgently” where most alarms go. Send as much as possible to lane two.

Slide 20

Slide 20 text

Reduce the gremlins in your system
• Embrace observability
• Make a habit of inspecting your instrumentation after each deploy
• Require everyone to answer “how will you know when this breaks?” and “what are you going to look for?” before each merge
• Proactively explore and inspect outliers every day (a sketch follows below)
• Make sure that engineers have enough time to finish retro action items
• Get reliability projects on the schedule
• Spend 20% time, steady state, on tech debt and small improvements
• Run chaos experiments
• Decouple deploys from releases using feature flags
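
For the "inspect outliers every day" habit, a hedged sketch: `recent_requests` stands in for whatever your observability tooling returns, and the 99th-percentile cutoff is just one reasonable choice.

    # Sketch: surface latency outliers worth a daily look.
    # `recent_requests` is a stand-in for your observability tool's output:
    # here, a list of (endpoint, duration_ms) pairs.
    import statistics

    def latency_outliers(recent_requests, pct=99):
        """Return (endpoint, duration_ms) pairs slower than the pct-th percentile."""
        durations = [d for _, d in recent_requests]
        cutoff = statistics.quantiles(durations, n=100)[pct - 1]
        return [(ep, d) for ep, d in recent_requests if d > cutoff]

    reqs = [("GET /cart", 40), ("GET /cart", 42), ("POST /pay", 45), ("GET /cart", 900)]
    print(latency_outliers(reqs))  # [('GET /cart', 900)]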

Slide 21

Slide 21 text

Investments
Decouple your deploys from releases using feature flags. Consider progressive deployments. You will need to spend at least 20% time on system upkeep, steady state. If you’re in the red, you need to start with more.
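
A sketch of what decoupling deploys from releases can look like in code, with a hard-coded rollout table purely for illustration (a real system would consult a flag service): both code paths ship in the deploy, and the flag decides who sees the new one.

    # Sketch: deploy dark, release gradually behind a flag.
    # Flag names and percentages are illustrative.
    import hashlib

    ROLLOUT = {"new-checkout": 5}   # percent of users who get the new path

    def flag_enabled(flag: str, user_id: str) -> bool:
        """Stable per-user bucketing: the same user always gets the same answer."""
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
        bucket = (digest[0] * 256 + digest[1]) % 100   # 0..99
        return bucket < ROLLOUT.get(flag, 0)

    def checkout(user_id: str) -> str:
        # The deploy shipped both code paths; the flag decides which one runs.
        if flag_enabled("new-checkout", user_id):
            return "new checkout flow"
        return "old checkout flow"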

Slide 22

Slide 22 text

Principles
It is easier to keep yourself from falling into an operational pit of doom than it is to dig your way out of one. Value good, clean, high-level abstractions that let you delegate large swaths of operational burden and software surface area to vendors. Money is cheap; engineering cycles are not.

Slide 23

Slide 23 text

Roadmapping
Reliability work and technical debt are not secondary to product features. Use SLOs (and human SLOs!) to assert the time you need to build a better system. You need to spend at least 20% time on system upkeep, steady state. If you’re in the red, you need to start with more.

Slide 24

Slide 24 text

Test in Production
If you can’t reliably ascertain what’s happening within a few minutes of investigating, you need better observability. This is not normal or acceptable. Run chaos experiments (at 3pm, not 3am) to make sure you’ve fixed it, and perhaps run them continuously, forever.
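
A tiny sketch of the "3pm, not 3am" rule; the fault injection itself (`inject_fault`) is a placeholder, and the only point here is the business-hours guard.

    # Sketch: chaos experiments at 3pm, not 3am. The fault injection itself
    # is a placeholder; the point is the business-hours guard.
    from datetime import datetime

    def run_chaos_experiment(inject_fault, now=None):
        now = now or datetime.now()
        business_hours = now.weekday() < 5 and 10 <= now.hour < 16
        if not business_hours:
            return "skipped: run chaos while the team is awake and watching"
        return inject_fault()

    print(run_chaos_experiment(lambda: "killed one instance of the service under test"))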

Slide 25

Slide 25 text

Improve your performance as a team
• Instrument your code for observability with wide events and spans (a sketch follows below)
• Replace dashboards with ad hoc querying
• Practice ODD (Observability-Driven Development): instrument as you go, ship quickly, then verify your changes in prod
• After each merge, automatically run tests, build the artifact, and deploy
• Deploy one merge by one engineer at a time
• Run tests and auto-deploy in 15 minutes or less
• Consider whether you have sufficient operational expertise on hand
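
A minimal sketch of the wide-event idea from the first bullet: one structured event per request, emitted once at the end, carrying every field you might later want to query by. The field names and the `print` sink are stand-ins for a real instrumentation pipeline.

    # Sketch: one wide, structured event per request, emitted once at the end.
    # Field names and the print() sink are stand-ins for a real event pipeline.
    import json
    import time
    import uuid

    def handle_request(user_id: str, endpoint: str):
        event = {
            "trace_id": uuid.uuid4().hex,
            "endpoint": endpoint,
            "user_id": user_id,
            "build_id": "2024-02-13.4",               # ties behavior back to a deploy
            "feature_flags": {"new-checkout": True},  # and to what was released
        }
        start = time.monotonic()
        try:
            event["status"] = 200   # ... real request handling would happen here ...
        except Exception as exc:
            event["status"] = 500
            event["error"] = repr(exc)
            raise
        finally:
            event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
            print(json.dumps(event))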

Slide 26

Slide 26 text

How well does your team perform? != “how good are you at engineering”

Slide 27

Slide 27 text

High-performing teams spend the majority of their time solving interesting, novel problems that move the business materially forward. Lower-performing teams spend their time firefighting, waiting on code review, waiting on each other, resolving merge conflicts, reproducing tricky bugs, solving problems they thought were fixed, responding to customer complaints, fixing flaky tests, running deploys by hand, fighting with their infrastructure, fighting with their tools, fighting with each other…endless yak shaving and toil.

Slide 28

Slide 28 text

How high-performing is YOUR team?
🔥1 — How frequently do you deploy?
🔥2 — How long does it take for code to go live?
🔥3 — How many of your deploys fail?
🔥4 — How long does it take to recover from an outage?
🔥5 — How often are you paged outside work hours?
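
The first four questions can be answered from very simple records (the fifth is the same counting exercise as the out-of-hours sketch earlier). A hedged Python sketch follows; the data shapes are assumptions, and the real numbers would come from your CI/CD and paging tools.

    # Sketch: answer the questions above from simple records. The data shapes
    # are assumptions; the real numbers come from your CI/CD and paging tools.
    from datetime import datetime
    from statistics import mean

    deploys = [
        {"merged": datetime(2024, 2, 12, 10, 0), "live": datetime(2024, 2, 12, 10, 9), "failed": False},
        {"merged": datetime(2024, 2, 12, 15, 0), "live": datetime(2024, 2, 12, 15, 12), "failed": True},
    ]
    incidents = [{"start": datetime(2024, 2, 12, 15, 20), "resolved": datetime(2024, 2, 12, 15, 50)}]

    window_days = 1                                  # the period the records cover
    deploy_frequency = len(deploys) / window_days
    lead_time_min = mean((d["live"] - d["merged"]).total_seconds() / 60 for d in deploys)
    failure_rate = sum(d["failed"] for d in deploys) / len(deploys)
    mttr_min = mean((i["resolved"] - i["start"]).total_seconds() / 60 for i in incidents)

    print(f"{deploy_frequency:.0f} deploys/day, {lead_time_min:.1f} min lead time, "
          f"{failure_rate:.0%} of deploys failed, {mttr_min:.0f} min to recover")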

Slide 29

Slide 29 text

There is a wide gap between “elite” teams and the other 75%. (2021 numbers)

Slide 30

Slide 30 text

Work on these things. Track these things. They matter.
• Deploy frequency
• Time to deploy
• Deploy failures
• Time to recovery
• Out-of-hours alerts

Slide 31

Slide 31 text

Great teams make great engineers. ❤

Slide 32

Slide 32 text

Your ability to ship code swiftly and safely has less to do with your knowledge of algorithms and data structures, and much more to do with the sociotechnical system you participate in. Technical leadership should focus intensely on constructing and tightening the feedback loops at the heart of their system.
sociotechnical (n): “Technology is the sum of ways in which social groups construct the material objects of their civilizations. The things made are socially constructed just as much as technically constructed. The merging of these two things, construction and insight, is sociotechnology” — wikipedia

Slide 33

Slide 33 text

Now that we have tamed our alerts and switched to SLOs…
Now that we have dramatically fewer unknown-unknowns…
Now that we have the instrumentation to swiftly pinpoint any cause…
Now that we auto-deploy our changes to production within minutes…
Now that out-of-hours alerts are vanishingly rare…
So, now that we’ve done all that…
“I’m still not happy. You said I’d be HAPPY to be on call.”

Slide 34

Slide 34 text

If you are on call, you are not to work on product or the roadmap that week. You work on the system. Whatever has been bothering you, whatever you think is broken … you work on that. Use your judgment. Have fun. If you were on call this week, you get next Friday off.

Slide 35

Slide 35 text

I’m of mixed mind about paying people for being on call. With one big exception. If you are struggling to get your engineers the time they need for systems work instead of just cranking features, you should start paying people a premium every time they are alerted out of hours. Pay them a lot. Pay enough for finance to complain. Pay them until management is begging you to work on reliability to save them money. If they don’t care about your people’s time and lives, convert it into something they do care about. ✨Money✨

Slide 36

Slide 36 text

The End ☺

Slide 37

Slide 37 text

Charity Majors @mipsytipsy