Empowered SRE: Driving the Operational Burden to Zero

Does it seem like your SRE team is starting to look way too much like the familiar Ops team that you have always known? It’s easy to fall back on the well-known patterns of production support. While it is critical to have strong Operations expertise, it is equally critical that your SRE team adopt a new mindset.

We strive to drive our operational burden to zero. We look to automate last year’s work to make room for new challenges. We make the time to eliminate TOIL in our daily tasks.

In this talk, we will take a close look at how process and automation can be the driving force behind a truly empowered SRE team.

Michael Scott Winslow

October 22, 2020

Transcript

  1. @michaelswinslow michaelswinslow • Michael Winslow • I DevOps. I DevOps

    a lot. • Spaces > Tabs Empowered SREs: Driving the Operational Burden to Zero
  2. Culture Automation Measurement Sharing C A M S Credit: Damon

    Edwards, John Willis https://itrevolution.com/devops-culture-part-1 @michaelswinslow | #doe20
  3. DEVOPS S R E Our Software Engineers (SWE) will develop

    software Our Site Reliability Engineers (SRE) will operate software
  4. “We must automate away the 100s of routine tasks which

    create the fog that impedes our vision.” – Dana Wilson, SVP of Reliability Engineering, Comcast
  5. The “Brent” Effect Brent is your go-to person. Reliance on

    one person impacts people in multiple ways. PMs or Managers want or ask for that resource and are uncomfortable with other resources. This impacts confidence and morale of others in the team. Brent is so heavily used, he has no time to articulate, share and upskill others in the team. There is a clear dearth of documentation to help others as well. Letting go of Brent is not an option, which not only hinders Brent’s growth but also stops others in the team from growing into Brent’s role. https://medium.com/@rajipillay/the-brent-effect-df10c4c5d3bc https://www.blackillustrations.com/
  6. INSTRUCTIONS 1. Verify Release Notes 2. Verify release file 3.

    Verify with Release Mgr 4. Inform people before start 5. Disable VIP https://www.blackillustrations.com/
  7. Trust the Process 1. The engineer executing the runbook (hands

    on the keyboard) ideally should be one of the junior team members. https://www.blackillustrations.com/
  8. Trust the Process 1. The engineer executing the runbook (hands

    on the keyboard) ideally should be one of the junior team members. 2. The “Brent” should be available, but should only help when the junior engineer is stuck. https://www.blackillustrations.com/
  9. Trust the Process 1. The engineer executing the runbook (hands

    on the keyboard) ideally should be one of the junior team members. 2. The “Brent” should be available, but should only help when the junior engineer is stuck. 3. Each time the junior engineer abandons the runbook and seeks guidance from Brent, the third engineer should update the runbook accordingly. https://www.blackillustrations.com/
  10. Trust the Process 1. The engineer executing the runbook (hands

    on the keyboard) ideally should be one of the junior team members. 2. The “Brent” should be available, but should only help when the junior engineer is stuck. 3. Each time the junior engineer abandons the runbook and seeks guidance from Brent, the third engineer should update the runbook accordingly. 4. Repeat this each time the runbook needs to be executed until the error occurrences are rare. https://www.blackillustrations.com/
  11. Process Transfer 1. The engineer executing the runbook (hands on

    the keyboard) ideally should be one of the junior team members. 2. The “Brent” should be available, but should only help when the junior engineer is stuck. 3. Each time the junior engineer abandons the runbook and seeks guidance from Brent, the third engineer should update the runbook accordingly. 4. Repeat this each time the runbook needs to be executed until the error occurrences are rare. 5. Once the process is repeatable, automate. https://www.blackillustrations.com/
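The last step on the Process Transfer slide (automate once the runbook is repeatable) could start as something like the sketch below. It is only an illustration: the step names come from the INSTRUCTIONS slide, while `Step`, `RunbookResult`, `run_runbook` and the placeholder actions are hypothetical names that do not appear in the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One runbook step: a name plus a callable that returns True on success."""
    name: str
    action: Callable[[], bool]


@dataclass
class RunbookResult:
    """Which steps completed, and where the run stopped (if it did)."""
    completed: List[str] = field(default_factory=list)
    failed_at: Optional[str] = None


def run_runbook(steps: List[Step]) -> RunbookResult:
    """Execute steps in order and stop at the first one that fails."""
    result = RunbookResult()
    for step in steps:
        if not step.action():
            # A failure here is the automated equivalent of "ask Brent":
            # record where it happened so the runbook can be updated.
            result.failed_at = step.name
            break
        result.completed.append(step.name)
    return result


# Placeholder actions (assumptions): each one would wrap a real check.
release_runbook = [
    Step("Verify release notes", lambda: True),
    Step("Verify release file", lambda: True),
    Step("Verify with release manager", lambda: True),
    Step("Inform people before start", lambda: True),
    Step("Disable VIP", lambda: True),
]

if __name__ == "__main__":
    outcome = run_runbook(release_runbook)
    print(outcome.completed, outcome.failed_at)
```

Recording the failing step mirrors step 3 of the process: every stop is a prompt to improve the runbook before the next run.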
  12. “Toil is the kind of work tied to running a

    production service that tends to be manual, repetitive, and automatable.” – Vivek Rau, Google
  13. Automation Library TOIL Key Rotation FINDING TIME TO REDUCE TOIL

    Engineering Work Deployment
  14. Automation Library TOIL FINDING TIME TO REDUCE TOIL Engineering Work

    Deployment Key Rotation Self Healing Certificates Patches Release Notes Testing Firewall Requests
  15. TOIL FINDING TIME TO REDUCE TOIL Engineering Work Dashboards Monitoring

    Alerting On-Call Rotations Inventory IaC Incident Response Tool Evaluation
  16. TOIL FINDING TIME TO REDUCE TOIL Engineering Work Dashboards Monitoring

    Alerting On-Call Rotations Inventory IaC Incident Response Tool Evaluation 50/50
  17. “If an SRE team cannot regulate its own workload, it

    becomes the aggrieved party.” – Damon Edwards, PagerDuty
  18. • Treat SRE as a premium service • Encourage standards

    / economy of scale • Fortify your team’s strengths If Handing Off to SRE Was Not a Right
  19. 1. Build your team’s AUTOMATION skills 2. Track down and

    eliminate TOIL 3. Be SELECTIVE and regulate your workload Empowering SREs:
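Tracking toil and regulating workload both need a number to look at. The sketch below is one possible way to compute it, assuming the team logs hours under two categories, "toil" and "engineering". Reading the 50/50 on the slides as an upper bound on toil is an assumption, and `toil_ratio`, `over_budget` and `TOIL_CAP` are hypothetical names.

```python
from collections import defaultdict
from typing import Iterable, Tuple

# Assumption: the 50/50 split on the slides is treated as a cap on toil.
TOIL_CAP = 0.5


def toil_ratio(entries: Iterable[Tuple[str, float]]) -> float:
    """Return the share of logged hours spent on toil.

    Each entry is (category, hours), where category is "toil" or
    "engineering". Returns 0.0 when no hours are logged.
    """
    totals = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    total = totals["toil"] + totals["engineering"]
    return totals["toil"] / total if total else 0.0


def over_budget(entries: Iterable[Tuple[str, float]], cap: float = TOIL_CAP) -> bool:
    """True when toil takes a larger share of the time than the cap allows."""
    return toil_ratio(entries) > cap


if __name__ == "__main__":
    # Made-up hours for illustration only.
    week = [("toil", 12.0), ("engineering", 20.0), ("toil", 10.0)]
    print(f"toil share: {toil_ratio(week):.0%}, over budget: {over_budget(week)}")
```

A team that is over budget has a concrete argument for pushing work back or for spending the next sprint on automation.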