Self-Contained
“Encourage development teams to be self-contained so that each team can make products more comprehensively, proactively, and efficiently.”
SRE Mission for 2020 / Self-Contained
• Service teams can develop on their own
  • No need to ask SREs
• We SREs provide the process
  • Design Doc
  • Production Readiness Checklist
  • Delegated infrastructure management (Terraform)
  • SLI/SLO
  • Alerting
Service teams define their own SLIs/SLOs and review them weekly
https://sre-next.dev/schedule#c4
Reviewed ALL 166 Alerts
• Changed alert channels to follow the policy
• Removed 40 alerts
• An alert should…
  • Be actionable
  • Detect something that only that alert can detect
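The two review criteria above can be sketched as a small filter. This is only an illustration, assuming each alert has been tagged by hand with whether it is actionable and which failure condition it detects; the `Alert` class, its fields, and the `review` function are hypothetical and are not part of the actual review tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    actionable: bool  # hypothetical tag: can the receiver do something about it?
    detects: str      # hypothetical tag: the failure condition this alert covers


def review(alerts):
    """Split alerts into (keep, remove) using the two criteria.

    An alert is removed if it is not actionable, or if an earlier alert
    in the list already detects the same condition.
    """
    keep, remove = [], []
    covered = set()
    for alert in alerts:
        if not alert.actionable or alert.detects in covered:
            remove.append(alert)
        else:
            covered.add(alert.detects)
            keep.append(alert)
    return keep, remove
```

The order of the input decides which of two overlapping alerts survives, so in practice a human would still choose which duplicate to keep.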
All alerts are now tied to SLOs
• Any other alert condition will eventually show up as an SLO violation
  • CPU usage is high
  • OOM Killer is triggered
  • Pods are unavailable
  • Unicorn backlog is increasing
• Service teams check only SLO alerts
• It is better to have insights available when you receive an alert
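One common way to turn an SLO into an alert condition is to compare the observed error rate with the error budget (a "burn rate"). The slides do not say how the SLO alerts are actually evaluated, so the following is a minimal sketch under that assumption; the `SLO` class, the function names, and the default threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target: float  # e.g. 0.99 means 99% of requests should be good


def burn_rate(good: int, total: int, slo: SLO) -> float:
    """Return observed error rate divided by the allowed error rate.

    1.0 means the error budget is being consumed exactly at the allowed
    pace; values above 1.0 mean the SLO will be violated if this continues.
    """
    if total == 0:
        return 0.0  # no traffic in the window, nothing to burn
    error_rate = (total - good) / total
    error_budget = 1.0 - slo.target
    return error_rate / error_budget


def should_alert(good: int, total: int, slo: SLO, threshold: float = 1.0) -> bool:
    """Fire only when the budget burns faster than the (assumed) threshold."""
    return burn_rate(good, total, slo) > threshold
```

For example, with a 99% availability target, 980 good requests out of 1000 burns the budget at roughly twice the allowed pace and would fire, while 995 out of 1000 would not. Cause-level signals such as high CPU or OOM kills are not alert conditions here; they remain useful as context once the SLO alert fires.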
Special Thanks
• @motobrew
  • Suggested the new policy for alerts
  • Checked all of my alert reviews
• @d-kuro
  • Suggested creating a mention group for each deployment
  • Led the alerting work for Kubernetes