Alerting Strategy for Self-Contained Team

Takeshi Kondo

May 22, 2020

Transcript

  1. Alerting Strategy for Self-contained Team • Takeshi Kondo / @chaspy • 2020/05/22 • SRE Lounge#12
  2. Alerting ❗

  3. Self-Contained • “Encourage development teams to be self-contained so that each team can make products more comprehensively, proactively, and efficiently.”
  4. SRE Mission for 2020 / Self-Contained • Service Teams can develop by themselves • No need to ask SREs • We SREs provide the process • Design Doc • Production Readiness Checklist • Delegate Infrastructure Management (Terraform) • SLI/SLO • Alerting
  5. SRE Mission for 2020 / Self-Contained • Service Teams can develop by themselves • No need to ask SREs • We SREs provide the process • Design Doc • Production Readiness Check • Delegate Infrastructure Management (Terraform) • SLI/SLO • Alerting • Service Teams define their SLI/SLO and review them weekly
  6. SRE Mission for 2020 / Self-Contained • Service Teams can develop by themselves • No need to ask SREs • We SREs provide the process • Design Doc • Production Readiness Check • Delegate Infrastructure Management (Terraform) • SLI/SLO • Alerting • Service Teams define their SLI/SLO and review them weekly • https://sre-next.dev/schedule#c4
  7. Alerting in Quipper • SRE • Japan Web Developer • Global Web Developer • Alerting❗ • Escalation
  8. Alerting in Quipper
       Date        SRE   Japan Web Developer   Global Web Developer
       2018 Jun.   4     10                    15
       2019 Jun.   4     25                    20
       2020 May.   6     42                    24
  9. Alerting Strategy for Self-Contained Team

  10. Agenda • Shared Responsibility • Alerting for SRE Team • Alerting for Service Team
  11. tl;dr • Review alerts frequently • All alerts become for SLOs • Work together
  12. Agenda • Shared Responsibility • Alerting for SRE Team • Alerting for Service Team
  13. Shared Responsibility • Cluster Management • Deployment • AutoScaling • Application • Web Developer • SRE • Cloud Infrastructure

  14. Shared Responsibility • Cluster Management • Deployment • AutoScaling • Application • Web Developer • SRE • Cloud Infrastructure • ❤Borderless
  15. Agenda • Shared Responsibility • Alerting for SRE Team • Alerting for Service Team
  16. Alerting for SRE Team • Many problems • Escalation alerts flood conversation channels • Many alerts with unclear intentions and actions • No policy
  17. Alerting for SRE Team • Defined Policy • Reviewed ALL 166 Alerts • Review alerts Daily and Weekly
  18. Defined Policy

  19. Reviewed ALL 166 Alerts • Changed alert channels to follow the policy • Removed 40 alerts • An alert should… • Be Actionable • Detect what only that alert can detect
  20. Review alerts Daily

  21. Review alerts Weekly

  22. Agenda • Shared Responsibility • Alerting for SRE Team • Alerting for Service Team
  23. All alerts become for SLOs

  24. Alerting in Quipper • SRE • Web Developer • Alerting for Platform • Alerting for SLO
  25. All alerts become for SLOs • Any other alerts will cause SLO violations • CPU usage is high • OOM Killer happens • Unavailable pods • Unicorn backlog is increasing • Service Teams check only SLO alerts • Better to have insights when you receive alerts
  26. All alerts become for SLOs • Started alerting on “Event Based” SLOs as an experiment • Channel Convention: #slo-<subdomain>-<product>
  27. All alerts become for SLOs • Started alerting on “Event Based” SLOs as an experiment • Channel Convention: #slo-<subdomain>-<product> • Ongoing… Wait for the blog post
  28. Alerting in Quipper • SRE • Web Developer • Alerting for Platform • Alerting for SLO • Escalation • Review alerts daily and weekly • Feedback
  29. Alerting in Quipper • SRE • Web Developer • Alerting for Platform • Alerting for SLO • Escalation • ❤Borderless • Review alerts daily and weekly • Feedback
  30. Summary • Review alerts frequently • All alerts become for SLOs • Work together
  31. Special Thanks • @motobrew • Suggested the new policy for alerts • Confirmed all my alert reviews • @d-kuro • Suggested creating a mention group for each deployment • Led alerting for Kubernetes
  32. Special Thanks • Thanks for giving your opinions at #topic-monitoring • @egmc • @katzchang • @yuuki • @y_kawasaki • and all SRE Lounge staff
  33. SRE.fm#1 w/@ryok6t and @_inductor_ https://sre-fm.connpass.com/event/175198/

  34. Thank You! • Takeshi Kondo • Lead Software Engineer at Quipper • chaspy • chaspy_ • Terraform-jp
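
Slides 26 and 27 mention alerting on "Event Based" SLOs and the channel convention #slo-<subdomain>-<product>. The slides do not show how this is implemented, so the following is only a minimal sketch under stated assumptions: the SLI is taken to be the fraction of good events in a window, an alert is raised when it falls below the target, and the names `EventWindow`, `slo_channel`, `sli`, `violates_slo`, the subdomain `learn`, the product `web`, and the 99.9% target are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventWindow:
    """Counts of good and total events in one evaluation window (hypothetical type)."""
    good: int
    total: int


def slo_channel(subdomain: str, product: str) -> str:
    """Build a channel name following the #slo-<subdomain>-<product> convention."""
    return f"#slo-{subdomain}-{product}"


def sli(window: EventWindow) -> float:
    """Event-based SLI: good events divided by total events.

    Assumption: a window with no events is treated as fully compliant (1.0).
    """
    if window.total == 0:
        return 1.0
    return window.good / window.total


def violates_slo(window: EventWindow, target: float) -> bool:
    """Return True when the SLI for this window is below the SLO target."""
    return sli(window) < target


if __name__ == "__main__":
    # Hypothetical numbers: 9,950 good requests out of 10,000, target 99.9%.
    window = EventWindow(good=9_950, total=10_000)
    if violates_slo(window, target=0.999):
        channel = slo_channel("learn", "web")
        print(f"notify {channel}: SLI {sli(window):.4f} is below 0.999")
```

Routing by subdomain and product keeps each Service Team's SLO alerts in a channel of its own, separate from the platform alerts reviewed by the SRE team.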