
Better On-Call the SRE Way - Ramón Medrano Llamas


SRE at Google uses an incident management protocol to handle outages of the complex distributed systems we run. Outages are stressful times with a fundamental human factor: many people may be involved, and the regular organisational structure might not unlock the full potential of the people responding.
From psychological safety to training and protocols, this talk discusses everything that goes on when services are not available.


DevOpsDays Zurich

May 14, 2019

Transcript

  1. Ramón Medrano Llamas, Google Better On-Call the SRE Way

  2. What does “better on-call” mean to you? Some possibilities to think about:
     • Fewer alerts/pages
     • High-signal pages
     • Better work/life balance for people who are on-call
     • All of these are aspects of reduced stress
  3. Lessons learned by Google SRE
     • I’ll talk about general principles for reducing on-call stress and making on-call work more effective
     • Based on the “Being On-Call” and “Managing Incidents” chapters of the SRE book and the “On-Call” chapter of the Site Reliability Workbook
  4. On-Call System Design

  5. SRE Owns Production
     “SRE on-call is different than traditional ops roles. Rather than focusing solely on day-to-day operations, SRE fully owns the production environment, and seeks to better it through defining appropriate reliability thresholds, developing automation, and undertaking strategic engineering projects.” - The Site Reliability Workbook, Ch. 8, p. 173
  6. Design for Reduced Stress
     • Set limits on pages/shift, hours/shift, hours/person/{month, quarter}, etc.
     • Build in work-life balance.
       ◦ Split shifts (across time zones, where possible) so they are 12 hrs on / 12 hrs off
       ◦ Allow compensation and/or time off in lieu for being on-call (subject to local law)
       ◦ Consider split-week shifts (M-Th/F-Su) or other adjustments
     • Set up a standard handoff process (email, etc.).
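Limits like these are easiest to keep honest when they are checked mechanically. Below is a minimal sketch of such a check in Python; the limit values and the `Shift` shape are illustrative assumptions, not anything prescribed by the talk, so tune them to your team's agreement.

```python
from dataclasses import dataclass

# Hypothetical limits; set these from your team's on-call agreement.
MAX_PAGES_PER_SHIFT = 2
MAX_ONCALL_HOURS_PER_QUARTER = 300


@dataclass
class Shift:
    hours: float
    pages: int


def check_pager_load(shifts: list[Shift]) -> list[str]:
    """Return a warning for every agreed-upon limit the rotation exceeds."""
    warnings = []
    for i, shift in enumerate(shifts):
        if shift.pages > MAX_PAGES_PER_SHIFT:
            warnings.append(f"shift {i}: {shift.pages} pages exceeds limit")
    total_hours = sum(s.hours for s in shifts)
    if total_hours > MAX_ONCALL_HOURS_PER_QUARTER:
        warnings.append(f"quarterly total {total_hours}h exceeds limit")
    return warnings
```

Running this over a quarter's shift history in a cron job or dashboard makes "has the pager load grown too high?" (slide 7) a question with a concrete answer.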
  7. Ongoing System Maintenance
     Revisit your system regularly to see how well it’s working for everyone.
     • Has the pager load grown too high over time?
     • Are shifts stressful?
     • Are problems fixed at a root-cause level, or are there “frequent flyer” pages that everyone recognizes and dreads?
     This may require you to delay or stop new releases until things are fixed.
  8. When It Gets Really Bad
     Hand responsibility for the service back to the developers. An SRE team’s ability to say “we are no longer willing to be on-call for this service” is critically important, even if it’s never used. With a healthy system, this option’s existence makes it less likely to be needed, because the developers have an incentive to make sure it isn’t used. With an unhealthy system, it’s needed to keep the SRE team from being dragged under.
  9. Joining an On-Call Rotation

  10. Reducing Stress Before Joining On-Call
     • Have documentation (how to roll back, drain, etc.) and dashboards!
     • Practice, practice, practice!
       ◦ Wheel of Misfortune exercises can help new team members get up to speed
     • Arrange shadow and reverse-shadow on-call shifts.
  11. Psychological Safety
     Google’s “Project Aristotle” found that the most important factor in team effectiveness was psychological safety. “In a team with high psychological safety, teammates feel safe to take risks around their team members. They feel confident that no one on the team will embarrass or punish anyone else for admitting a mistake, asking a question, or offering a new idea.” https://rework.withgoogle.com/print/guides/5721312655835136/
  12. The Importance of Psychological Safety
     When people feel unsafe, they work less effectively. On-call stress will amplify all of these effects. Imagine an outage taking 6 hours to fix, solely because the on-caller couldn’t safely ask for help that would have fixed it in 30 minutes.
  13. Shifts and Incidents

  14. Alerts and Silences
     Alerts should be actionable and urgent. It is helpful to silence specific known-noisy alerts for a defined period: muting them keeps the pages that do fire both actionable and urgent.
  15. Starting Your Shift
     • Make sure you have everything you need.
       ◦ Phone and laptop are charged
       ◦ Phone isn’t silenced
       ◦ Access hasn’t expired (if applicable)
     • Go over the handoff email.
     • Look over any alert silences.
     Your goal: no surprises.
  16. Handling an Incident (part 1)
     • Step 0: ACK the page if you have an automatic escalation system
       ◦ This way the secondary won’t also get paged
     • Step 1: Assess impact
       ◦ Lets you prioritize work
       ◦ If you aren’t sure of the impact, it could be high; escalate
     • Step 2: Mitigate and verify
       ◦ Partial service drain, temporary quota increase, etc.
       ◦ Buys time for better fixes
  17. Handling an Incident (part 2)
     • Step 3: Debug and gather data
       ◦ Be mindful of the time mitigation bought you
       ◦ Take notes in a shareable format (shared doc, etc.)
         ▪ But be wary of logs containing PII or other sensitive information
     • Step 4: Apply short-term fix and verify
       ◦ Example: roll back to the previous release (if practical)
  18. Handling an Incident (part 3)
     • Step 5: Clean up temporary fixes; follow up on the long-term fix
     • Step 6: Identify root cause(s), file bug(s)
       ◦ Goal: fix the class of problem, not the specific trigger
     • Step 7: Write a blameless postmortem, if the incident is postmortem-worthy
       ◦ Goal: share knowledge within the team and with other teams
       ◦ Goal: learn from the causes to avoid related issues
     • Step 8: Make sure the bugs are fixed
       ◦ You are not responsible for fixing them yourself, but rather for making sure someone does
       ◦ That someone can be you, though
  19. Asking For Help
     Heroically fixing things single-handedly whenever they break doesn’t scale and isn’t sustainable. Ask for help from:
     • Teammates
     • Developers
     • Other SRE teams
  20. Handoffs and Responsibility
     If you don’t know who owns something, you do. Google SRE uses an incident command model based on emergency response practices; the incident commander can delegate parts of the job (communications, operations, etc.) to other people, and/or can hand off the IC role (especially at end of shift). Handoffs must be acknowledged (“I’m naming you as IC for the next shift” / “I accept IC responsibility”) and announced, as must delegation.
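The acknowledgment rule can be captured in a few lines: the IC role moves only when the named successor explicitly accepts, so responsibility is never silently dropped between shifts. A minimal sketch with hypothetical names, not any tool Google SRE actually uses:

```python
class IncidentCommand:
    """Holds the IC role; handoff requires an explicit, matching acceptance."""

    def __init__(self, commander: str):
        self.commander = commander
        self._pending: str | None = None

    def propose_handoff(self, successor: str) -> None:
        """'I'm naming you as IC for the next shift.'"""
        self._pending = successor

    def accept(self, who: str) -> None:
        """'I accept IC responsibility.' Only the named successor may accept."""
        if who != self._pending:
            raise ValueError(f"{who} was not named as the successor")
        self.commander = who
        self._pending = None
```

Until `accept` runs, `commander` is unchanged, which mirrors the protocol: the outgoing IC remains responsible until the handoff is acknowledged.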