We have collectively internalized the idea that being on call is a distasteful burden, a necessary evil at best. What if I told you that:
* on-call duty could be 100% voluntary?
* being on call could be a sign of prestige?
* you could look forward to your turn on call -- the time when you get to take a break from the roadmap and fix all the shit that has been personally bugging you?
* being on call can be a great time to unleash your curiosity, learn new skills, and develop/refine your technical judgment?
* an on-call rotation, properly run, can heighten the stakes, build your empathy towards your users, and strengthen the bonds between you and your teammates?
On-call duty doesn't have to be something you are forced to endure... we can do so much better than that. It might even become your favorite week of the month. 🙃
Why on call doesn’t have to suck.
“I dread being on call.”
“I became a software engineer with
the expectation that other people
would be getting paged, not me.”
“I didn’t sign up for this.”
“If you make everybody be on call, we’ll
have even fewer mothers (and other
marginalized folks) in engineering.”
“On call duty is what burned me out of tech.”
“My time is too valuable to be on call. You want
me writing features and delivering user value.”
“Sometimes you just have to buy the
happiness of your users with the lifeblood
of your engineers.”
“I sacrificed MY health and sleep for 10 years
of on call duty; now it’s YOUR turn.”
“You aren’t a REAL engineer until you’ve
debugged this live at 3 am.”
There are loads of toxic patterns around on call
(posturing, sunk cost fallacies, disrespect for sleep and personal lives, surface fixes, evading responsibility, flappy alerts, over-alerting, lack of training or support, snobbery…)
But it doesn’t have to be this way. We can do so much better. 🥰
I am here to convince you that on call can be:
• an aspirational role, an achievement
• worthy of pride and respect
• fully compatible with adult lives & responsibilities
• rarely sleep-disturbing or life-impacting
• something that engineers enjoy and look forward to
• maybe even … ✨volunteer-only✨
What on call work means to me:
• A social contract between engineers and managers
• An essential part of software ownership
• A proxy metric for how well your team is performing, how functional your system is, and how happy your users are
• A set of expert practices and techniques in its own right

On call rotations are a social contract between engineers and managers.
Managers, your job is to make sure owning it doesn’t suck.
⭐ By ensuring you get enough time as a team to fix that which is broken
⭐ By making sure that people are taking time off to compensate for time on call
⭐ By tracking the frequency of out-of-hours alerting
⭐ By creating a training process
Engineers, your job is to own your code.
⭐ You write it, you run it
⭐ If you work on a highly available service, you should expect to support it
⭐ It’s important for you to close the loop by watching your code in prod
⭐ Systems are getting increasingly complex, to the point that no one *can* be on call for software they don’t write & own
⭐ This is the way to cleaner systems
⭐ SOMEONE’s gotta do it
Who should be on call for their code?
Engineers with code in production.
Is this payback?? 🤔
No! OMG. I freely acknowledge that ops has always had a streak
of masochism. But this isn’t about making software engineers as
miserable as we used to be; this is about the fact that software
ownership is the way to make things better. For everyone.
On call is how we close the feedback loop.
Someone needs to be responsible for your
services in the off-hours.
This cannot be an afterthought; it should
play a prominent role in your hiring, team
structure, and compensation decisions
from the very start.
These are decisions that define who you
are and what you value as a team.
If owning your code sucks, managers need to:
Audit the alert strategy
Begin evaluating team performance
Graph and track every alert outside of business hours
Make sure the team has enough cycles to actually fix shit
Get reliability projects on the roadmap
Get curious about your on call culture
And ideally add themselves to the rotation.
“My time is too valuable” → Whose time isn’t?
“I don’t know how to navigate production”
“I have a new baby” → You win. Nobody should have two alarms.
“We have a follow-the-sun rotation”
“I need my sleep / it’s stressful.” → Let’s make it better, not pawn it off on someone else.
Much of this pain traces back to a lack of alert discipline (alerts that are flappy, noisy, or not actionable).
Let’s make on call fun again!
1. Adopt alerting best practices
2. Chase the gremlins from your system
3. Become a higher-performing team
Adopt alerting best practices
Conduct blameless retros after reliability events
Stop alerting on symptoms (e.g. “high CPU”, “disk full”)
Adopt Service Level Objectives (SLOs) and alert on those instead
Delete any predictive alerts;
Alert only when customers are in pain
Kill flappy alerts with fire
Treat any alert outside business hours like a reliability event;
triage and retro it, and fix it so it won’t happen again.
Create a second lane for alerts that are important but not urgent.
Align on call pain with user pain: alert only on SLO violations and end-to-end checks which correlate directly to real user pain.
Never, EVER alert on symptoms, like “high CPU” or “disk full.”
Fail partially and gracefully if at all
possible. Better to spend down an SLO
budget than have an outage.
Out of hours alerts
Set a high bar for what you are going to
wake people up for. Actively curate a
small list of these alerts. Tweak them until
they are rock solid. No flaps.
Take those alerts as seriously as a heart
attack. Track them, graph them, FIX
THEM. Managers, consider adding
yourselves to the rotation.
Retro and Resolve
If something alerted out of hours, hold a
retro. Can this be auto remediated? Can
it be diverted to the second lane? What
needs to be done to fix this for good?
Have a second lane of “stuff to deal with,
but not urgently” where most alarms go.
Send as much as possible to lane two.
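To make this concrete, here is a minimal sketch (in Python, with made-up numbers) of "page only on error-budget burn, send everything else to the second lane." The 99.9% target, the windows, and the burn-rate thresholds are illustrative assumptions, not prescriptions; tune them to your own SLOs.

```python
# Minimal sketch: page only on SLO burn; send everything else to a second lane.
# The 99.9% target, windows, and burn-rate thresholds are illustrative only.
from dataclasses import dataclass

SLO_TARGET = 0.999              # 99.9% of requests succeed over the SLO window
ERROR_BUDGET = 1 - SLO_TARGET   # fraction of requests allowed to fail

@dataclass
class Window:
    total_requests: int
    failed_requests: int

def burn_rate(w: Window) -> float:
    """How fast the error budget is being spent: 1.0 means exactly on budget."""
    if w.total_requests == 0:
        return 0.0
    return (w.failed_requests / w.total_requests) / ERROR_BUDGET

def route_alert(short: Window, long: Window) -> str:
    """Multi-window burn check: wake a human only when users are really in pain."""
    if burn_rate(short) >= 14 and burn_rate(long) >= 14:
        return "page"          # budget will be gone in hours at this rate
    if burn_rate(long) >= 2:
        return "second-lane"   # important, not urgent: a ticket for business hours
    return "ignore"            # noise; do not alert at all

if __name__ == "__main__":
    short = Window(total_requests=10_000, failed_requests=180)    # last 5 minutes
    long = Window(total_requests=200_000, failed_requests=3_000)  # last hour
    print(route_alert(short, long))   # -> page
```

The point of the second lane is that "ignore" and "page" are not the only options; most alerts belong in the middle bucket.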
Reduce the gremlins in your system
Make a habit of inspecting your instrumentation after each deploy
Require everyone to answer, “how will you know when this breaks?”
and “what are you going to look for?” before each merge.
Proactively explore and inspect outliers every day
Make sure that engineers have enough time to finish retro action items
Get reliability projects on the schedule
Spend 20% time steady state on tech debt and small improvements
Run chaos experiments
Decouple deploys from releases using feature flags.
Decouple your deploys from releases using feature flags. Consider progressive delivery (rolling changes out to a small slice of users first).
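A hedged sketch of what that can look like in code: the deploy ships the new path dark, and the "release" is just opening up the flag. The flag store and the checkout functions below are hypothetical stand-ins, not any particular flag library.

```python
# Sketch only: "deploy" ships the code path dark, "release" is flipping the flag.
# `Flags` is a toy stand-in for whatever feature-flag system you actually use.
import hashlib

class Flags:
    """Toy in-memory flag store with percentage rollouts keyed on user id."""
    def __init__(self):
        self._rollout = {}          # flag name -> % of users who get it

    def set_rollout(self, name: str, percent: int) -> None:
        self._rollout[name] = percent

    def is_enabled(self, name: str, user_id: str) -> bool:
        percent = self._rollout.get(name, 0)
        bucket = int(hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest(), 16) % 100
        return bucket < percent

flags = Flags()

def render_checkout(user_id: str) -> str:
    # New code is already deployed and sitting in prod, but dark until the flag opens.
    if flags.is_enabled("new-checkout-flow", user_id):
        return new_checkout(user_id)
    return old_checkout(user_id)

def old_checkout(user_id: str) -> str:
    return f"old checkout for {user_id}"

def new_checkout(user_id: str) -> str:
    return f"new checkout for {user_id}"

# Progressive delivery: release to 5% of users, watch your SLOs, then widen.
flags.set_rollout("new-checkout-flow", 5)
print(render_checkout("user-1234"))
```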
You will need to spend at least 20% time
on system upkeep, steady state. If you’re
in the red, you need to start with more.
It is easier to keep yourself from falling
into an operational pit of doom than it is
to dig your way out of one.
Value good, clean high-level abstractions
that let you delegate large swaths of
operational burden and software surface
area to vendors. Money is cheap;
engineering cycles are not.
Reliability work and technical debt are
not secondary to product features. Use
SLOs (and human SLOs!) to assert the
time you need to build a better system.
Test in Production
If you can’t reliably ascertain what’s
happening within a few minutes of
investigating, you need better
observability. This is not normal or acceptable.
Run chaos experiments (at 3pm, not
3am) to make sure you’ve fixed it, and
perhaps run them continuously, forever.
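Here is one small, hedged example of a daylight chaos experiment: wrap a single dependency so a sliver of calls fail on purpose, then confirm the fallback path actually degrades gracefully. The recommendations service and its fallback are invented for illustration.

```python
# Sketch of a small, daylight-hours chaos experiment: inject failures into one
# dependency for a sliver of traffic and confirm the fallback path really works.
# `fetch_recommendations` and its fallback are hypothetical examples.
import random

INJECT_FAILURE_RATE = 0.05   # 5% of calls fail on purpose during the experiment

class InjectedFault(Exception):
    pass

def flaky(dependency_call):
    """Wrap a dependency so a fraction of calls fail, as if the backend were down."""
    def wrapper(*args, **kwargs):
        if random.random() < INJECT_FAILURE_RATE:
            raise InjectedFault("chaos experiment: simulated dependency failure")
        return dependency_call(*args, **kwargs)
    return wrapper

@flaky
def fetch_recommendations(user_id: str) -> list[str]:
    return ["item-1", "item-2", "item-3"]   # pretend this calls a real service

def recommendations_with_fallback(user_id: str) -> list[str]:
    # Partial, graceful failure: an empty shelf, not a 500 for the whole page.
    try:
        return fetch_recommendations(user_id)
    except InjectedFault:
        return []

if __name__ == "__main__":
    results = [recommendations_with_fallback(f"user-{i}") for i in range(1000)]
    degraded = sum(1 for r in results if r == [])
    print(f"{degraded} of 1000 requests served the degraded page; zero hard failures")
```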
Instrument your code for observability with wide events and spans (see the sketch after this list)
Replace dashboards with ad hoc querying
Practice ODD (Observability-Driven Development)
(instrument as you go, ship quickly, then verify your changes in prod)
After each merge, automatically run tests, build artifact, and deploy.
Deploy one merge by one engineer at a time
Run tests and auto-deploy in 15 minutes or less
Consider whether you have sufficient operational expertise on hand
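As promised above, a rough sketch of wide-event instrumentation: one structured event per request, enriched with high-cardinality fields as the request progresses, emitted once at the end. The field names and the print-to-stdout "emit" are placeholders for whatever observability tooling you actually use.

```python
# Sketch of "wide event" instrumentation: one structured event per request,
# enriched as the request progresses and emitted once at the end.
# Field names and the emit target (stdout as JSON) are illustrative only.
import json
import time
import uuid

class RequestEvent:
    """Accumulates arbitrarily many high-cardinality fields for a single request."""
    def __init__(self, route: str, user_id: str):
        self.start = time.monotonic()
        self.fields = {
            "trace_id": str(uuid.uuid4()),
            "route": route,
            "user_id": user_id,        # high-cardinality fields are the whole point
        }

    def add(self, **kwargs) -> None:
        self.fields.update(kwargs)

    def emit(self) -> None:
        self.fields["duration_ms"] = round((time.monotonic() - self.start) * 1000, 2)
        print(json.dumps(self.fields))   # in real life: send to your observability tool

def handle_checkout(user_id: str, cart_size: int) -> None:
    ev = RequestEvent(route="/checkout", user_id=user_id)
    try:
        ev.add(cart_size=cart_size, payment_provider="stripe", retries=0)
        # ... the actual work happens here ...
        ev.add(status_code=200)
    except Exception as exc:
        ev.add(status_code=500, error=str(exc))
        raise
    finally:
        ev.emit()   # exactly one wide event per request, whatever happened

handle_checkout("user-1234", cart_size=3)
```

This is what makes "replace dashboards with ad hoc querying" possible: if everything interesting about a request lives in one event, you can slice by any field after the fact.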
Improve your performance as a team
How well does your team perform?
!= “how good are you at engineering”
Engineers on high-performing teams spend the majority of their time solving interesting, novel problems that move the business materially forward.
Engineers on lower-performing teams spend their time firefighting, waiting on code review, waiting on each other, resolving merge conflicts, reproducing tricky bugs, solving problems they thought were fixed, responding to customer complaints, fixing flaky tests, running deploys by hand, fighting with their infrastructure, fighting with their tools, fighting with each other… endless yak shaving and toil.
🔥1 — How frequently do you deploy?
🔥2 — How long does it take for code to go live?
🔥3 — How many of your deploys fail?
🔥4 — How long does it take to recover from an outage?
🔥5 — How often are you paged outside work hours?
How high-performing is YOUR team?
There is a wide gap between “elite” teams and the other 75%.
Work on these things. Track these things: deploy frequency, time to deploy, failed deploys, time to recovery, and out-of-hours pages. (One way to track them is sketched below.)
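A minimal sketch of what tracking these can look like, assuming you keep simple records of merges, deploys, incidents, and pages. The record shapes and numbers here are invented for illustration.

```python
# Minimal sketch of tracking the five questions above from your own records.
# The record shapes and numbers here are invented for illustration.
from datetime import datetime
from statistics import median

# (merged_at, deployed_at, deploy_failed)
deploys = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 12), False),
    (datetime(2024, 5, 1, 14, 3), datetime(2024, 5, 1, 14, 18), True),
    (datetime(2024, 5, 2, 9, 30), datetime(2024, 5, 2, 9, 41), False),
]
# (incident_started_at, incident_resolved_at)
incidents = [(datetime(2024, 5, 1, 14, 20), datetime(2024, 5, 1, 14, 55))]
# timestamps of alerts that actually reached a human
pages = [datetime(2024, 5, 1, 14, 21), datetime(2024, 5, 2, 3, 2)]

def out_of_hours(ts: datetime) -> bool:
    """Rough business-hours check; adjust to your own team's working hours."""
    return ts.weekday() >= 5 or not (9 <= ts.hour < 18)

days_observed = 2
deploy_frequency = len(deploys) / days_observed
lead_time = median(deployed - merged for merged, deployed, _ in deploys)
failure_rate = sum(failed for _, _, failed in deploys) / len(deploys)
time_to_recover = median(end - start for start, end in incidents)
night_pages = sum(out_of_hours(p) for p in pages)

print(f"deploys per day: {deploy_frequency:.1f}")
print(f"median time from merge to deploy: {lead_time}")
print(f"failed deploys: {failure_rate:.0%}")
print(f"median time to recover: {time_to_recover}")
print(f"out-of-hours pages: {night_pages}")   # this is the number to drive to zero
```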
Great teams make great engineers. ❤
Your ability to ship code swiftly and safely has less to do with your knowledge of algorithms and data structures, and much more to do with the sociotechnical system you participate in.
“Technology is the sum of ways in which social groups construct the material objects of their civilizations. The things made are socially constructed just as much as technically constructed. The merging of these two things, construction and insight, is sociotechnology.”
Technical leadership should focus intensely on constructing
and tightening the feedback loops at the heart of their system.
Now that we have tamed our alerts, and switched to SLOs…
Now that we have dramatically fewer unknown-unknowns…
Now that we have the instrumentation to swiftly pinpoint any cause…
Now that we auto-deploy our changes to production within minutes…
Now that out-of-hours alerts are vanishingly rare…
So, now that we’ve done all that…
“I’m still not happy. You said
I’d be HAPPY to be on call.”
If you are on call, you are not to
work on product or the roadmap
You work on the system. Whatever has been
bothering you, whatever you think is broken …
you work on that. Use your judgment. Have fun.
If you were on call this week, you get next Friday off.
I’m of mixed mind about paying people for
being on call. With one big exception.
If you are struggling to get your engineers the time they
need for systems work instead of just cranking features,
you should start paying people a premium every time they
are alerted out of hours. Pay them a lot. Pay enough for finance to
complain. Pay them until management is begging you to work on
reliability to save them money.
If management doesn’t care about your people’s time and lives, convert it into something they do care about. ✨Money✨
The End ☺