Alerting Strategy for Self-Contained Team

Takeshi Kondo

May 22, 2020

Transcript

  1. Alerting Strategy
    for Self-contained Team
    Takeshi Kondo / @chaspy
    2020/05/22
    SRE Lounge#12

  2. Alerting ❗

  3. Self-Contained
    “Encourage development teams to be self-contained so that each team can make products
    more comprehensively, proactively, and efficiently.”

  4. SRE Mission for 2020 / Self-Contained
    • Service teams can develop by themselves
    • No need to ask SREs
    • We SREs provide the process
    • Design Doc
    • Production Readiness Checklist
    • Delegate Infrastructure Management (Terraform)
    • SLI/SLO
    • Alerting

  5. SRE Mission for 2020 / Self-Contained
    • Service teams can develop by themselves
    • No need to ask SREs
    • We SREs provide the process
    • Design Doc
    • Production Readiness Check
    • Delegate Infrastructure Management (Terraform)
    • SLI/SLO
    • Alerting
    Service teams define their SLIs/SLOs
    and review them weekly

  6. SRE Mission for 2020 / Self-Contained
    • Service teams can develop by themselves
    • No need to ask SREs
    • We SREs provide the process
    • Design Doc
    • Production Readiness Check
    • Delegate Infrastructure Management (Terraform)
    • SLI/SLO
    • Alerting
    Service teams define their SLIs/SLOs
    and review them weekly
    https://sre-next.dev/schedule#c4

  7. Alerting in Quipper
    SRE
    Japan Web Developer
    Global Web Developer
    Alerting❗ Escalation

  8. Alerting in Quipper
                 SRE   Japan Web Developer   Global Web Developer
    2018 Jun.      4                    10                     15
    2019 Jun.      4                    25                     20
    2020 May       6                    42                     24

  9. Alerting Strategy for Self-Contained Team

  10. Agenda
    • Shared Responsibility
    • Alerting for SRE Team
    • Alerting for Service Team

  11. tl;dr
    • Review alerts frequently
    • All alerts become for SLOs
    • Work together

  12. Agenda
    • Shared Responsibility
    • Alerting for SRE Team
    • Alerting for Service Team

  13. Shared Responsibility
    Cluster Management
    Deployment
    Auto Scaling
    Application
    Web Developer
    SRE
    Cloud Infrastructure

  14. Shared Responsibility
    Cluster Management
    Deployment
    Auto Scaling
    Application
    Web Developer
    SRE
    Cloud Infrastructure
    ❤Borderless

  15. Agenda
    • Shared Responsibility
    • Alerting for SRE Team
    • Alerting for Service Team

  16. Alerting for SRE Team
    • Many problems
    • Escalation alerts flood conversation channels
    • Many alerts with unclear intentions and actions
    • No policy

  17. Alerting for SRE Team
    • Defined Policy
    • Reviewed ALL 166 Alerts
    • Review alerts Daily and Weekly

  18. Defined Policy

  19. Reviewed ALL 166 Alerts
    • Changed alert channel to follow the policy
    • Removed 40 alerts
    • Alert should…
    • Be Actionable
    • Detect what only that alert can detect
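
The two criteria above lend themselves to a mechanical first pass over an alert inventory. The following is a minimal sketch, not the process used in the talk: the `Alert` fields (`symptom`, `action`) and the `review` helper are hypothetical names introduced for illustration, and a real review would still need human judgment.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    symptom: str  # hypothetical field: what the alert detects
    action: str   # hypothetical field: what the responder should do ("" if unknown)


def review(alerts):
    """Split alert names into (keep, remove) using the two criteria on the slide."""
    keep, remove = [], []
    seen_symptoms = set()
    for alert in alerts:
        actionable = bool(alert.action.strip())
        # An alert is redundant if an alert we already keep detects the same symptom.
        redundant = alert.symptom in seen_symptoms
        if actionable and not redundant:
            keep.append(alert.name)
            seen_symptoms.add(alert.symptom)
        else:
            remove.append(alert.name)
    return keep, remove


# Illustrative inventory; these alert names are made up.
inventory = [
    Alert("HighErrorRate", "5xx ratio", "Roll back the latest deploy"),
    Alert("HighErrorRateCopy", "5xx ratio", "Roll back the latest deploy"),
    Alert("DiskNoise", "disk io spike", ""),
]
keep, remove = review(inventory)
```

Alerts that land in `remove` are only candidates; the point is to make the review repeatable enough to run daily or weekly.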

  20. Review alerts Daily

  21. Review alerts Weekly

  22. Agenda
    • Shared Responsibility
    • Alerting for SRE Team
    • Alerting for Service Team

  23. All alerts become for SLOs

  24. Alerting in Quipper
    SRE
    Web Developer
    Alerting for Platform
    Alerting for SLO

  25. All alerts become for SLOs
    • Any other alerts will cause SLO violations
    • CPU usage is high
    • OOM Killer happens
    • Unavailable pods
    • Unicorn backlog is increasing
    • Service teams check only SLO alerts
    • Better to have insights when you receive alerts
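
One way to read the split described in this deck (platform alerts for SREs, SLO alerts for service teams) is as a routing rule. The sketch below is an assumption-heavy illustration: the alert kind strings, the `PLATFORM_CAUSES` set, and the destination names are all hypothetical and do not come from Quipper's configuration.

```python
# Hypothetical identifiers for the cause-based alerts listed on the slide.
PLATFORM_CAUSES = {
    "cpu_usage_high",
    "oom_killer",
    "unavailable_pods",
    "unicorn_backlog_increasing",
}


def route(alert_kind):
    """Return which team should see an alert of the given kind."""
    if alert_kind == "slo_violation":
        # Service teams look only at SLO alerts.
        return "service-team"
    if alert_kind in PLATFORM_CAUSES:
        # Cause-based alerts stay with the platform owners.
        return "sre"
    # Anything else has no owner yet and should be raised in review.
    return "unrouted"


destinations = [route(k) for k in ("slo_violation", "oom_killer", "disk_full")]
```

Keeping an explicit "unrouted" bucket makes alerts without a clear owner visible instead of silently defaulting them to one team.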

  26. All alerts become for SLOs
    • Started alerting on “Event Based” SLOs as an experiment
    • Channel Convention: #slo--
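
The slide does not define “Event Based”. A common reading, assumed here, is that the SLI is the ratio of good events (for example, successful requests) to all events in a window, as opposed to counting good time slices. The sketch below shows that calculation; the function names and the 99.9% target are illustrative assumptions, not values from the talk.

```python
def event_based_sli(good_events, total_events):
    """Fraction of events that were good; an empty window counts as fully good."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def should_alert(good_events, total_events, slo_target):
    """Alert when the measured SLI falls below the SLO target."""
    return event_based_sli(good_events, total_events) < slo_target


# Illustrative numbers: 10,000 requests in the window, target of 99.9%.
breach = should_alert(good_events=9_985, total_events=10_000, slo_target=0.999)
healthy = should_alert(good_events=9_995, total_events=10_000, slo_target=0.999)
```

Here `breach` is true because 99.85% is below the target, while `healthy` is false because 99.95% meets it.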

  27. All alerts become for SLOs
    • Started alerting on “Event Based” SLOs as an experiment
    • Channel Convention: #slo--
    Ongoing…
    Stay tuned for the blog post

  28. Alerting in Quipper
    SRE
    Web Developer
    Alerting for Platform
    Alerting for SLO
    Escalation
    Review alerts daily and weekly
    Feedback

  29. Alerting in Quipper
    SRE
    Web Developer
    Alerting for Platform
    Alerting for SLO
    Escalation
    ❤Borderless
    Review alerts daily and weekly
    Feedback

  30. Summary
    • Review alerts frequently
    • All alerts become for SLOs
    • Work together

  31. Special Thanks
    • @motobrew
    • Suggested the new policy for alerts
    • Checked all of my alert reviews
    • @d-kuro
    • Suggested creating a mention group for each deployment
    • Led alerting for Kubernetes

  32. Special Thanks
    • Thanks for sharing your opinions in #topic-monitoring
    • @egmc
    • @katzchang
    • @yuuki
    • @y_kawasaki
    • and all SRE Lounge staff

  33. SRE.fm#1 w/@ryok6t and @_inductor_
    https://sre-fm.connpass.com/event/175198/

  34. Thank You!
    Takeshi Kondo
    Lead Software Engineer at Quipper
    chaspy
    chaspy_
    Terraform-jp
