Empowered SRE: Driving the Operational Burden to Zero

Does it seem like your SRE team is starting to look way too much like the familiar Ops team that you have always known? It’s easy to fall back on the well-known patterns of production support. While it is critical to have strong Operations expertise, it is equally critical that your SRE team adopt a new mindset.

We strive to drive our operational burden to zero. We look to automate last year’s work to make room for new challenges. We make the time to eliminate TOIL in our daily tasks.

In this talk, we will take a close look at how process and automation can be the driving force behind a truly empowered SRE team.

Michael Scott Winslow

October 22, 2020

Transcript

  1. @michaelswinslow michaelswinslow • Michael Winslow • I DevOps. I DevOps

    a lot. • Spaces > Tabs Empowered SREs: Driving the Operational Burden to Zero
  2. DEVOPS @michaelswinslow | #doe20

  3. Culture Automation Measurement Sharing C A M S Credit: Damon

    Edwards, John Willis https://itrevolution.com/devops-culture-part-1 @michaelswinslow | #doe20
  4. DEVOPS S R E Our Software Engineers (SWE) will develop

    software Our Site Reliability Engineers (SRE) will operate software @michaelswinslow | #doe20
  5. "Myriam Gourfink" by dancetechtv is licensed under CC BY-SA 2.0 How is this

    NOT operations? @michaelswinslow | #doe20
  6. @michaelswinslow | #doe20

  7. @michaelswinslow | #doe20

  8. The Tweet | Andrew Clay Shafer

  9. https://blog.newrelic.com/engineering/devops-name/ @michaelswinslow | #doe20

  10. 1. AUTOMATION @michaelswinslow | #doe20

  11. “We must automate away the 100s of routine tasks which

    create the fog that impedes our vision.” – Dana Wilson, SVP of Reliability Engineering, Comcast @michaelswinslow | #doe20
  12. The “Brent” Effect Brent is your go-to person. Reliance on

    one person impacts people in multiple ways. PMs or managers want or ask for that resource and are uncomfortable with other resources. This impacts the confidence and morale of others in the team. Brent is so heavily used, he has no time to articulate, share and upskill others in the team. There is a clear dearth of documentation to help others as well. Letting go of Brent is not an option, which not only hinders Brent’s growth but also stops others in the team from growing into Brent’s role. https://medium.com/@rajipillay/the-brent-effect-df10c4c5d3bc https://www.blackillustrations.com/ @michaelswinslow | #doe20
  13. INSTRUCTIONS 1. Verify Release Notes 2. Verify release file 3.

    Verify with Release Mgr 4. Inform people before start 5. Disable VIP https://www.blackillustrations.com/ @michaelswinslow | #doe20
  14. @michaelswinslow | #doe20

  15. Trust the Process 1. The engineer executing the runbook (hands

    on the keyboard) ideally should be one of the junior team members. https://www.blackillustrations.com/ @michaelswinslow | #doe20
  16. Trust the Process 1. The engineer executing the runbook (hands

    on the keyboard) ideally should be one of the junior team members. 2. The “Brent” should be available, but should only help when the junior engineer is stuck. https://www.blackillustrations.com/ @michaelswinslow | #doe20
  17. Trust the Process 1. The engineer executing the runbook (hands

    on the keyboard) ideally should be one of the junior team members. 2. The “Brent” should be available, but should only help when the junior engineer is stuck. 3. Each time the junior engineer abandons the runbook and seeks guidance from Brent, the third engineer should update the runbook accordingly. https://www.blackillustrations.com/ @michaelswinslow | #doe20
  18. Trust the Process 1. The engineer executing the runbook (hands

    on the keyboard) ideally should be one of the junior team members. 2. The “Brent” should be available, but should only help when the junior engineer is stuck. 3. Each time the junior engineer abandons the runbook and seeks guidance from Brent, the third engineer should update the runbook accordingly. 4. Repeat this each time the runbook needs to be executed until error occurrences are rare. https://www.blackillustrations.com/ @michaelswinslow | #doe20
  19. Process Transfer 1. The engineer executing the runbook (hands on

    the keyboard) ideally should be one of the junior team members. 2. The “Brent” should be available, but should only help when the junior engineer is stuck. 3. Each time the junior engineer abandons the runbook and seeks guidance from Brent, the third engineer should update the runbook accordingly. 4. Repeat this each time the runbook needs to be executed until error occurrences are rare. 5. Once the process is repeatable, automate. https://www.blackillustrations.com/ @michaelswinslow | #doe20
  20. @michaelswinslow | #doe20

  21. @michaelswinslow | #doe20

  22. @michaelswinslow | #doe20

  23. @michaelswinslow | #doe20

  24. @michaelswinslow | #doe20

  25. @michaelswinslow | #doe20

  26. 2. REDUCE TOIL @michaelswinslow | #doe20

  27. “Toil is the kind of work tied to running a

    production service that tends to be manual, repetitive, and automatable.” – Vivek Rau, Google @michaelswinslow | #doe20
  28. FINDING TIME TO REDUCE TOIL TOIL Automation Library @michaelswinslow |

    #doe20
  29. Automation Library TOIL Deployment FINDING TIME TO REDUCE TOIL @michaelswinslow

    | #doe20
  30. Automation Library TOIL FINDING TIME TO REDUCE TOIL Engineering Work

    Deployment @michaelswinslow | #doe20
  31. Automation Library TOIL Key Rotation FINDING TIME TO REDUCE TOIL

    Engineering Work Deployment @michaelswinslow | #doe20
  32. Automation Library TOIL FINDING TIME TO REDUCE TOIL Engineering Work

    Deployment Key Rotation Self Healing Certificates Patches Release Notes Testing Firewall Requests @michaelswinslow | #doe20
  33. TOIL FINDING TIME TO REDUCE TOIL Engineering Work Dashboards Monitoring

    Alerting On-Call Rotations Inventory IaC Incident Response Tool Evaluation @michaelswinslow | #doe20
  34. TOIL FINDING TIME TO REDUCE TOIL Engineering Work Dashboards Monitoring

    Alerting On-Call Rotations Inventory IaC Incident Response Tool Evaluation 50/50 @michaelswinslow | #doe20
  35. 3. BE SELECTIVE @michaelswinslow | #doe20

  36. “If an SRE team cannot regulate its own workload, it

    becomes the aggrieved party.” – Damon Edwards, PagerDuty @michaelswinslow | #doe20
  37. If Handing Off to SRE Was Not a Right • Treat SRE as a premium service

    • Encourage standards / economy of scale • Fortify your team’s strengths @michaelswinslow | #doe20
  38. Empowering SREs: 1. Build your team’s AUTOMATION skills 2. Track down and

    eliminate TOIL 3. Be SELECTIVE and regulate your workload @michaelswinslow | #doe20
  39. @michaelswinslow michaelswinslow Empowered SREs: Driving the Operational Burden to Zero

    •Thank You!
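
The "Process Transfer" steps on slides 15 to 19 can be sketched in code. This is only an illustration, not something shown in the talk: the `Runbook` class, its method names, and the rule "three consecutive runs with no escalation to Brent" are all assumptions chosen to make "until error occurrences are rare" concrete. The step names come from slide 13, except the added clarification, which is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """A runbook plus a record of how each execution went (hypothetical model)."""

    name: str
    steps: list
    # One entry per execution: how many times the junior engineer
    # left the runbook and asked "Brent" for help.
    escalations_per_run: list = field(default_factory=list)

    def record_run(self, escalations, clarifications=()):
        """Log one execution and fold in the updates written by the third engineer."""
        self.escalations_per_run.append(escalations)
        self.steps.extend(clarifications)

    def ready_to_automate(self, clean_runs_required=3, max_escalations=0):
        """Treat the process as repeatable once the most recent runs were clean.

        Both thresholds are assumed values; a team would pick its own.
        """
        recent = self.escalations_per_run[-clean_runs_required:]
        return (
            len(recent) == clean_runs_required
            and all(count <= max_escalations for count in recent)
        )


release = Runbook(
    name="release",
    steps=[
        "Verify release notes",
        "Verify release file",
        "Verify with release manager",
        "Inform people before start",
        "Disable VIP",
    ],
)

# First run: the junior engineer got stuck twice, so the runbook gains a step.
release.record_run(2, ["Confirm traffic has drained after disabling the VIP"])
# Later runs need no help from Brent.
for _ in range(3):
    release.record_run(0)

print(release.ready_to_automate())
```

Counting escalations per run keeps the signal simple: each escalation marks a gap in the runbook, and a streak of clean runs suggests the written process, rather than Brent's memory, now carries the knowledge that automation would need to encode.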