
Better On-Call the SRE Way - Ramón Medrano Llamas


SRE at Google uses an incident management protocol to handle outages of the complex distributed systems we run. Outages are stressful times with a fundamental human factor: many people may be involved, and the regular organisational structure might not unlock the full potential of the people responding.
From psychological safety to training and protocols, this talk discusses everything that goes on when services are not available.


DevOpsDays Zurich

May 14, 2019

Transcript

  1. Ramón Medrano Llamas, Google Better On-Call the SRE Way

  2. What does “better on-call” mean to you? Some possibilities to think about:
     • Fewer alerts/pages
     • High-signal pages
     • Better work/life balance for people who are on-call
     • All of these are aspects of reduced stress
  3. Lessons learned by Google SRE
     • I’ll talk about general principles for reducing on-call stress and making on-call work more effective
     • Based on the “Being On-Call” and “Managing Incidents” chapters of the SRE book and the “On-Call” chapter of the Site Reliability Workbook
  4. On-Call System Design

  5. SRE Owns Production
     “SRE on-call is different than traditional ops roles. Rather than focusing solely on day-to-day operations, SRE fully owns the production environment, and seeks to better it through defining appropriate reliability thresholds, developing automation, and undertaking strategic engineering projects.” - The Site Reliability Workbook, Ch. 8, p. 173
  6. Design for Reduced Stress
     • Set limits on pages/shift, hours/shift, hours/person/{month, quarter}, etc.
     • Build in work-life balance.
       ◦ Split shifts (across time zones, where possible) so they are 12 hrs on / 12 hrs off
       ◦ Allow compensation and/or time off in lieu for being on-call (subject to local law)
       ◦ Consider split-week shifts (M-Th/F-Su) or other adjustments
     • Set up a standard handoff process (email, etc.).
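Limits like these are easiest to keep honest when they are checked mechanically. Below is a minimal sketch of such a check in Python; the limit values and the `Shift` shape are illustrative assumptions, not anything prescribed by the talk, so tune them to your team's agreement.

```python
from dataclasses import dataclass

# Hypothetical limits; set these from your team's on-call agreement.
MAX_PAGES_PER_SHIFT = 2
MAX_ONCALL_HOURS_PER_QUARTER = 300


@dataclass
class Shift:
    hours: float
    pages: int


def check_pager_load(shifts: list[Shift]) -> list[str]:
    """Return a warning for every agreed-upon limit the rotation exceeds."""
    warnings = []
    for i, shift in enumerate(shifts):
        if shift.pages > MAX_PAGES_PER_SHIFT:
            warnings.append(f"shift {i}: {shift.pages} pages exceeds limit")
    total_hours = sum(s.hours for s in shifts)
    if total_hours > MAX_ONCALL_HOURS_PER_QUARTER:
        warnings.append(f"quarterly total {total_hours}h exceeds limit")
    return warnings
```

Running this over a quarter's shift history in a cron job or dashboard makes "has the pager load grown too high?" (slide 7) a question with a concrete answer.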
  7. Ongoing System Maintenance
     Revisit your system regularly to see how well it’s working for everyone.
     • Has the pager load grown too high over time?
     • Are shifts stressful?
     • Are problems fixed at a root-cause level, or are there “frequent flyer” pages that everyone recognizes and dreads?
     This may require you to delay or stop new releases until things are fixed.
  8. When It Gets Really Bad
     Hand responsibility for the service back to the developers. An SRE team’s ability to say “we are no longer willing to be on-call for this service” is critically important, even if it’s never used. With a healthy system, this option’s existence makes it less likely to be needed, because the developers have an incentive to make sure it isn’t used. With an unhealthy system, it’s needed to keep the SRE team from being dragged under.
  9. Joining an On-Call Rotation

  10. Reducing Stress Before Joining On-Call
     • Have documentation (how to roll back, drain, etc.) and dashboards!
     • Practice, practice, practice!
       ◦ Wheel of Misfortune exercises can help new team members get up to speed
     • Arrange shadow and reverse-shadow on-call shifts.
  11. Psychological Safety
     Google’s “Project Aristotle” found that the most important factor in team effectiveness was psychological safety. “In a team with high psychological safety, teammates feel safe to take risks around their team members. They feel confident that no one on the team will embarrass or punish anyone else for admitting a mistake, asking a question, or offering a new idea.” https://rework.withgoogle.com/print/guides/5721312655835136/
  12. The Importance of Psychological Safety
     When people feel unsafe, they work less effectively. On-call stress will amplify all of these effects. Imagine an outage taking 6 hours to fix, solely because the on-caller couldn’t safely ask for help that would have fixed it in 30 minutes.
  13. Shifts and Incidents

  14. Alerts and Silences
     Alerts should be actionable and urgent. It is helpful to silence specific known-noisy alerts for a defined period: muting them keeps the pages that do fire both actionable and urgent.
  15. Starting Your Shift
     • Make sure you have everything you need.
       ◦ Phone and laptop are charged
       ◦ Phone isn’t silenced
       ◦ Access hasn’t expired (if applicable)
     • Go over the handoff email.
     • Look over any alert silences.
     Your goal: no surprises.
  16. Handling an Incident (part 1)
     • Step 0: ACK the page if you have an automatic escalation system
       ◦ This way the secondary won’t also get paged
     • Step 1: Assess impact
       ◦ Lets you prioritize work
       ◦ If you aren’t sure of the impact, it could be high; escalate
     • Step 2: Mitigate and verify
       ◦ Partial service drain, temporary quota increase, etc.
       ◦ Buys time for better fixes
  17. Handling an Incident (part 2)
     • Step 3: Debug and gather data
       ◦ Be mindful of the time mitigation bought you
       ◦ Take notes in a shareable format (shared doc, etc.)
         ▪ But be wary of logs containing PII or other sensitive information
     • Step 4: Apply short-term fix and verify
       ◦ Example: roll back to the previous release (if practical)
  18. Handling an Incident (part 3)
     • Step 5: Clean up temporary fixes; follow up on the long-term fix
     • Step 6: Identify root cause(s), file bug(s)
       ◦ Goal: fix the class of problem, not the specific trigger
     • Step 7: Write a blameless postmortem, if the incident is postmortem-worthy
       ◦ Goal: share knowledge within the team and with other teams
       ◦ Goal: learn from the causes to avoid related issues
     • Step 8: Make sure the bugs are fixed
       ◦ You are not responsible for fixing them yourself, but rather for making sure someone does
       ◦ That someone can be you, though
  19. Asking For Help
     Heroically fixing things single-handedly whenever they break doesn’t scale and isn’t sustainable. Ask for help from:
     • Teammates
     • Developers
     • Other SRE teams
  20. Handoffs and Responsibility
     If you don’t know who owns something, you do. Google SRE uses an incident command model based on emergency response practices; the incident commander can delegate parts of the job (communications, operations, etc.) to other people, and/or can hand off the IC role (especially at end of shift). Handoffs must be acknowledged (“I’m naming you as IC for the next shift” / “I accept IC responsibility”) and announced, as must delegation.
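The acknowledgment rule can be captured in a few lines: the IC role moves only when the named successor explicitly accepts, so responsibility is never silently dropped between shifts. A minimal sketch with hypothetical names, not any tool Google SRE actually uses:

```python
class IncidentCommand:
    """Holds the IC role; handoff requires an explicit, matching acceptance."""

    def __init__(self, commander: str):
        self.commander = commander
        self._pending: str | None = None

    def propose_handoff(self, successor: str) -> None:
        """'I'm naming you as IC for the next shift.'"""
        self._pending = successor

    def accept(self, who: str) -> None:
        """'I accept IC responsibility.' Only the named successor may accept."""
        if who != self._pending:
            raise ValueError(f"{who} was not named as the successor")
        self.commander = who
        self._pending = None
```

Until `accept` runs, `commander` is unchanged, which mirrors the protocol: the outgoing IC remains responsible until the handoff is acknowledged.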