Slide 1

Better On-Call the SRE Way
Ramón Medrano Llamas, Google

Slide 2

What does “better on-call” mean to you? Some possibilities to think about:
● Fewer alerts/pages
● High-signal pages
● Better work/life balance for people who are on-call
● All of these are aspects of reduced stress

Slide 3

Lessons learned by Google SRE
● I’ll talk about general principles for reducing on-call stress and making on-call work more effective
● Based on the “Being On-Call” and “Managing Incidents” chapters of the SRE book and the “On-Call” chapter of the Site Reliability Workbook

Slide 4

On-Call System Design

Slide 5

SRE Owns Production
“SRE on-call is different than traditional ops roles. Rather than focusing solely on day-to-day operations, SRE fully owns the production environment, and seeks to better it through defining appropriate reliability thresholds, developing automation, and undertaking strategic engineering projects.” - The Site Reliability Workbook, Ch. 8, p. 173

Slide 6

Design for Reduced Stress
● Set limits on pages/shift, hours/shift, hours/person/{month, quarter}, etc.
● Build in work-life balance.
  ○ Split shifts (across time zones, where possible) so they are 12 hrs on/12 hrs off
  ○ Allow compensation and/or time off in lieu for being on-call (subject to local law)
  ○ Consider split-week shifts (M-Th/F-Su) or other adjustments
● Set up a standard handoff process (email, etc.).
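Limits like these are easiest to keep if they are checked mechanically rather than by eyeballing the calendar. A minimal sketch, assuming shifts are recorded per person with hours and page counts (the limit values and the `Shift` shape are illustrative, not from the talk):

```python
from dataclasses import dataclass

# Illustrative limits; real values come from your team's policy.
MAX_PAGES_PER_SHIFT = 2
MAX_HOURS_PER_SHIFT = 12
MAX_ONCALL_HOURS_PER_MONTH = 120

@dataclass
class Shift:
    person: str
    hours: int
    pages: int

def check_limits(shifts):
    """Return human-readable limit violations for a month of shifts."""
    violations = []
    hours_by_person = {}
    for s in shifts:
        if s.pages > MAX_PAGES_PER_SHIFT:
            violations.append(
                f"{s.person}: {s.pages} pages in one shift (limit {MAX_PAGES_PER_SHIFT})")
        if s.hours > MAX_HOURS_PER_SHIFT:
            violations.append(
                f"{s.person}: {s.hours}h shift (limit {MAX_HOURS_PER_SHIFT}h)")
        hours_by_person[s.person] = hours_by_person.get(s.person, 0) + s.hours
    # Cumulative limits catch overload that no single shift reveals.
    for person, hours in hours_by_person.items():
        if hours > MAX_ONCALL_HOURS_PER_MONTH:
            violations.append(
                f"{person}: {hours}h on-call this month (limit {MAX_ONCALL_HOURS_PER_MONTH}h)")
    return violations
```

Running such a check at the end of every rotation turns “are shifts too heavy?” from a feeling into a report.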

Slide 7

Ongoing System Maintenance
Revisit your system regularly to see how well it’s working for everyone.
● Has the pager load grown too high over time?
● Are shifts stressful?
● Are problems fixed at a root-cause level, or are there “frequent flyer” pages that everyone recognizes and dreads?
This may require you to delay or stop new releases until things are fixed.
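The “frequent flyer” question above is answerable from alert history. A minimal sketch, assuming you can export the names of alerts that fired over a review period (the `frequent_flyers` helper and its threshold are hypothetical):

```python
from collections import Counter

def frequent_flyers(alert_names, threshold=3):
    """Alert names that fired at least `threshold` times in the period.

    Each hit is a candidate for a root-cause fix rather than
    another round of on-call firefighting.
    """
    return [name for name, n in Counter(alert_names).most_common()
            if n >= threshold]
```

Reviewing this list in a regular team meeting makes the recurring pain visible and gives the root-cause bugs a place to come from.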

Slide 8

When It Gets Really Bad
Hand responsibility for the service back to the developers. An SRE team’s ability to say “we are no longer willing to be on-call for this service” is critically important, even if it’s never used. With a healthy system, this option’s existence makes it less likely to be needed, because the developers have an incentive to make sure it isn’t used. With an unhealthy system, it’s needed to keep the SRE team from being dragged under.

Slide 9

Joining an On-Call Rotation

Slide 10

Reducing Stress Before Joining On-Call
● Have documentation (how to roll back, drain, etc.) and dashboards!
● Practice, practice, practice!
  ○ Wheel of Misfortune exercises can help new team members get up to speed
● Arrange shadow and reverse-shadow on-call shifts.

Slide 11

Psychological Safety
Google’s “Project Aristotle” found that the most important factor in team effectiveness was psychological safety.
“In a team with high psychological safety, teammates feel safe to take risks around their team members. They feel confident that no one on the team will embarrass or punish anyone else for admitting a mistake, asking a question, or offering a new idea.”
https://rework.withgoogle.com/print/guides/5721312655835136/

Slide 12

The Importance of Psychological Safety
When people feel unsafe, they work less effectively, and on-call stress amplifies all of these effects. Imagine an outage taking 6 hours to fix, solely because the on-caller couldn’t safely ask for help that would have fixed it in 30 minutes.

Slide 13

Shifts and Incidents

Slide 14

Alerts and Silences
Alerts should be actionable and urgent. Silencing specific known-noisy alerts for a defined period, rather than indefinitely, helps keep the alerts that do page both actionable and urgent.
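The key property of a defined-period silence is that it expires on its own: a silenced alert starts paging again without anyone having to remember to re-enable it. A minimal sketch of that routing decision, assuming silences are stored as alert-name-to-expiry entries (the `should_page` helper is illustrative, not a real alerting API):

```python
from datetime import datetime, timedelta

def should_page(alert_name, now, silences):
    """Decide whether a firing alert should page the on-caller.

    silences: dict mapping alert name -> expiry datetime.
    An expired silence is ignored, so a known-noisy alert
    resumes paging automatically once its silence lapses.
    """
    expiry = silences.get(alert_name)
    return expiry is None or now >= expiry
```

Indefinite silences fail the opposite way: they quietly hide an alert forever, which is why the slide stresses a defined period.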

Slide 15

Starting Your Shift
● Make sure you have everything you need.
  ○ Phone and laptop are charged
  ○ Phone isn’t silenced
  ○ Access hasn’t expired (if applicable)
● Go over the handoff email.
● Look over any alert silences.
Your goal: no surprises.

Slide 16

Handling an Incident (part 1)
● Step 0: ACK the page if you have an automatic escalation system
  ○ This way the secondary won’t also get paged
● Step 1: Assess Impact
  ○ Lets you prioritize work
  ○ If you aren’t sure of the impact, it could be high; escalate
● Step 2: Mitigate and Verify
  ○ Partial service drain, temporary quota increase, etc.
  ○ Buys time for better fixes

Slide 17

Handling an Incident (part 2)
● Step 3: Debug and gather data
  ○ Be mindful of the time mitigation bought you
  ○ Take notes in a shareable format (shared doc, etc.)
    ■ but be wary of logs containing PII or other sensitive information
● Step 4: Apply short-term fix and verify
  ○ Example: Roll back to previous release (if practical)

Slide 18

Handling an Incident (part 3)
● Step 5: Clean up temporary fixes, follow up on long-term fix
● Step 6: Identify root cause(s), file bug(s)
  ○ Goal: fix the class of problem, not the specific trigger
● Step 7: Write a blameless postmortem, if incident is postmortem-worthy
  ○ Goal: share knowledge within team & to other teams
  ○ Goal: learn from the causes to avoid related issues
● Step 8: Make sure the bugs are fixed
  ○ You are not responsible for fixing them yourself, but rather for making sure someone does
  ○ That someone can be you, though

Slide 19

Asking For Help
Heroically fixing things single-handedly whenever they break doesn’t scale and isn’t sustainable. Ask for help from:
● Teammates
● Developers
● Other SRE teams

Slide 20

Handoffs and Responsibility
If you don’t know who owns something, you do. Google SRE uses an incident command model based on emergency response practices; the incident commander can delegate parts of the job (communications, operations, etc.) to other people, and/or can hand off the IC role (especially at end of shift). Handoffs must be acknowledged (“I’m naming you as IC for the next shift” / “I accept IC responsibility”) and announced, as must delegation.
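The acknowledgment rule above is essentially a two-phase protocol: naming a successor does not transfer the role; only an explicit acceptance does, so responsibility is never silently dropped. A minimal sketch of that invariant (the `IncidentCommand` class is a hypothetical illustration, not tooling described in the talk):

```python
class IncidentCommand:
    """The IC role transfers only once the named successor accepts."""

    def __init__(self, commander):
        self.commander = commander
        self.pending = None

    def propose_handoff(self, successor):
        # "I'm naming you as IC for the next shift"
        self.pending = successor

    def accept_handoff(self, person):
        # "I accept IC responsibility" -- only the named successor may accept,
        # so the role cannot land on someone who never agreed to take it.
        if person != self.pending:
            raise ValueError("only the named successor can accept the IC role")
        self.commander = person
        self.pending = None
```

Until `accept_handoff` runs, the outgoing commander still owns the incident, which is exactly the “if you don’t know who owns something, you do” default.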