Alerting Strategy for Self-Contained Team

Takeshi Kondo

May 22, 2020

Transcript

  1. Self-Contained: “Encourage development teams to be self-contained so that each team can make products more comprehensively, proactively, and efficiently.”
  2. SRE Mission for 2020 / Self-Contained • Service teams can develop by themselves • No need to ask SREs • SREs provide the process • Design Doc • Production Readiness Checklist • Delegate Infrastructure Management (Terraform) • SLI/SLO • Alerting
  3. SRE Mission for 2020 / Self-Contained • Service teams can develop by themselves • No need to ask SREs • SREs provide the process • Design Doc • Production Readiness Check • Delegate Infrastructure Management (Terraform) • SLI/SLO • Alerting • Service teams define their SLI/SLO and review them weekly
  4. SRE Mission for 2020 / Self-Contained • Service teams can develop by themselves • No need to ask SREs • SREs provide the process • Design Doc • Production Readiness Check • Delegate Infrastructure Management (Terraform) • SLI/SLO • Alerting • Service teams define their SLI/SLO and review them weekly (https://sre-next.dev/schedule#c4)
  5. Alerting in Quipper

                  SRE   Japan Web Developer   Global Web Developer
      2018 Jun.     4                    10                     15
      2019 Jun.     4                    25                     20
      2020 May      6                    42                     24
  6. Alerting for SRE Team • Many problems • Escalation alerts flood conversation channels • Many alerts with unclear intentions and actions • No policy
  7. Alerting for SRE Team • Defined a policy • Reviewed ALL 166 alerts • Review alerts daily and weekly
  8. Reviewed ALL 166 Alerts • Changed alert channels to follow the policy • Removed 40 alerts • An alert should… • Be actionable • Detect what only that alert can detect
  9. All alerts are for SLOs • The conditions behind any other alert will cause SLO violations • CPU usage is high • OOM Killer happens • Unavailable pods • Unicorn backlog is increasing • Service teams check only SLO alerts • Better to have insights when you receive alerts
  10. All alerts are for SLOs • Started alerting on “Event Based” SLOs as an experiment • Channel convention: #slo-<subdomain>-<product>
  11. All alerts are for SLOs • Started alerting on “Event Based” SLOs as an experiment • Channel convention: #slo-<subdomain>-<product> • Ongoing… Wait for the blog post
  12. Alerting in Quipper [diagram; labels: SRE, Web Developer, Alerting for Platform, Alerting for SLO, Escalation, Review alerts daily and weekly, Feedback]
  13. Alerting in Quipper [diagram; labels: SRE, Web Developer, Alerting for Platform, Alerting for SLO, Escalation, ❤Borderless, Review alerts daily and weekly, Feedback]
  14. Special Thanks • @motobrew • Suggested the new policy for alerts • Confirmed all my alert reviews • @d-kuro • Suggested creating a mention group for each deployment • Led alerting for Kubernetes
  15. Special Thanks • Thanks for giving your opinions in #topic-monitoring • @egmc • @katzchang • @yuuki • @y_kawasaki • and all SRE Lounge staff
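
As a rough illustration of the “Event Based” SLO alerting and the #slo-<subdomain>-<product> channel convention mentioned in the transcript, here is a minimal Python sketch. It assumes an event-based SLO is evaluated as the ratio of good events to total events. The class name EventSlo, its fields, the 0.999 target, and the example subdomain and product values are all hypothetical and do not come from the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventSlo:
    """Hypothetical event-based SLO: good events / total events >= target."""

    subdomain: str
    product: str
    target: float  # e.g. 0.999 means 99.9% of events must be good

    def channel(self) -> str:
        # Follows the convention from the slides: #slo-<subdomain>-<product>
        return f"#slo-{self.subdomain}-{self.product}"

    def compliance(self, good_events: int, total_events: int) -> float:
        # With no traffic there is nothing to violate (an assumption of this sketch).
        if total_events == 0:
            return 1.0
        return good_events / total_events

    def is_violated(self, good_events: int, total_events: int) -> bool:
        return self.compliance(good_events, total_events) < self.target


# Example values are made up for illustration.
slo = EventSlo(subdomain="learn", product="web", target=0.999)
if slo.is_violated(good_events=990, total_events=1000):
    print(f"notify {slo.channel()}: compliance below target")
```

Routing by a name derived from the SLO itself keeps each service team's channel limited to the SLO alerts it owns, which matches the idea that service teams check only SLO alerts.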