Alerting Strategy for Self-Contained Team

Takeshi Kondo

May 22, 2020

Transcript

  1. Alerting Strategy
    for Self-contained Team
    Takeshi Kondo / @chaspy
    2020/05/22
    SRE Lounge#12

  2. Alerting ❗

  3. Self-Contained
    “Encourage development teams to be self-contained so that each team can make products
    more comprehensively, proactively, and efficiently.”

  4. SRE Mission for 2020 / Self-Contained
    • Service teams can develop by themselves
    • No need to ask SREs
    • We SREs provide the process
      • Design Doc
      • Production Readiness Checklist
      • Delegate Infrastructure Management (Terraform)
      • SLI/SLO
      • Alerting

  5. SRE Mission for 2020 / Self-Contained
    • Service teams can develop by themselves
    • No need to ask SREs
    • We SREs provide the process
      • Design Doc
      • Production Readiness Check
      • Delegate Infrastructure Management (Terraform)
      • SLI/SLO
      • Alerting
    Service teams define their SLI/SLO
    and review them weekly

  6. SRE Mission for 2020 / Self-Contained
    • Service teams can develop by themselves
    • No need to ask SREs
    • We SREs provide the process
      • Design Doc
      • Production Readiness Check
      • Delegate Infrastructure Management (Terraform)
      • SLI/SLO
      • Alerting
    Service teams define their SLI/SLO
    and review them weekly
    https://sre-next.dev/schedule#c4

  7. Alerting in Quipper
    SRE
    Japan Web Developer
    Global Web Developer
    Alerting❗ Escalation

  8. Alerting in Quipper
    Date        SRE   Japan Web Developer   Global Web Developer
    2018 Jun.   4     10                    15
    2019 Jun.   4     25                    20
    2020 May.   6     42                    24

  9. Alerting Strategy for Self-Contained Team

  10. Agenda
    • Shared Responsibility
    • Alerting for SRE Team
    • Alerting for Service Team

  11. tl;dr
    • Review alerts frequently
    • All alerts become for SLOs
    • Work together

  12. Agenda
    • Shared Responsibility
    • Alerting for SRE Team
    • Alerting for Service Team

  13. Shared Responsibility
    Cluster Management
    Deployment
    AutoScaling
    Application
    Web Developer
    SRE
    Cloud Infrastructure

  14. Shared Responsibility
    Cluster Management
    Deployment
    AutoScaling
    Application
    Web Developer
    SRE
    Cloud Infrastructure
    ❤Borderless

  15. Agenda
    • Shared Responsibility
    • Alerting for SRE Team
    • Alerting for Service Team

  16. Alerting for SRE Team
    • Many problems
    • Escalation alerts flood conversation channels
    • Many alerts with unclear intentions and actions
    • No policy

  17. Alerting for SRE Team
    • Defined Policy
    • Reviewed ALL 166 Alerts
    • Review alerts Daily and Weekly

  18. Defined Policy

  19. Reviewed ALL 166 Alerts
    • Changed alert channel to follow the policy
    • Removed 40 alerts
    • Alerts should…
      • Be actionable
      • Detect what only that alert can detect
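
As a rough illustration of the two review criteria on this slide, the sketch below flags alerts that have no documented action and alerts that overlap with another alert on the same symptom. The `Alert` fields and the `review_alerts` helper are hypothetical names chosen for this example; the talk does not describe how the 166 alerts were actually reviewed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record used only for this sketch."""
    name: str
    symptom: str  # what the alert detects (assumed field)
    action: str   # what the responder should do; empty if nobody knows


def review_alerts(alerts):
    """Return (not_actionable, overlapping) for a list of alerts.

    not_actionable: names of alerts with no documented action.
    overlapping: symptom -> alert names, for symptoms covered by
    more than one alert (candidates for removal).
    """
    not_actionable = [a.name for a in alerts if not a.action.strip()]

    by_symptom = defaultdict(list)
    for a in alerts:
        by_symptom[a.symptom].append(a.name)
    overlapping = {s: names for s, names in by_symptom.items() if len(names) > 1}

    return not_actionable, overlapping


if __name__ == "__main__":
    sample = [
        Alert("high-5xx-rate", "elevated error rate", "check recent deploys"),
        Alert("error-spike", "elevated error rate", ""),
        Alert("unavailable-pods", "pods not ready", "inspect pod events"),
    ]
    print(review_alerts(sample))
```

A check like this only surfaces candidates; deciding which alert to keep or remove still needs a human review.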

  20. Review alerts Daily

  21. Review alerts Weekly

  22. Agenda
    • Shared Responsibility
    • Alerting for SRE Team
    • Alerting for Service Team

  23. All alerts become for SLOs

  24. Alerting in Quipper
    SRE
    Web Developer
    Alerting
    for Platform
    Alerting
    for SLO

  25. All alerts become for SLOs
    • Any other alert condition will cause SLO violations
      • CPU usage is high
      • OOM Killer happens
      • Unavailable pods
      • Unicorn backlog is increasing
    • Service teams check only SLO alerts
    • Better to have insights when you receive alerts

  26. All alerts become for SLOs
    • Started alerting on “Event Based” SLOs as an experiment
    • Channel Convention: #slo--
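
The slide does not say how the event-based SLO alert is computed. One common reading of "event based" is an SLI defined as good events divided by total events, with an alert once the bad events exceed what the SLO target allows. The sketch below follows that assumption; `EventCounts`, `error_budget_consumed`, `should_alert`, and the default threshold are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventCounts:
    """Good and total events in the SLO window (assumed data shape)."""
    good: int
    total: int


def sli(counts):
    """Event-based SLI: fraction of events that were good."""
    if counts.total == 0:
        return 1.0  # no traffic, so nothing violated the SLO
    return counts.good / counts.total


def error_budget_consumed(counts, slo_target):
    """Fraction of the error budget used (1.0 means fully spent)."""
    bad = counts.total - counts.good
    allowed_bad = counts.total * (1.0 - slo_target)
    if allowed_bad == 0:
        return 0.0 if bad == 0 else float("inf")
    return bad / allowed_bad


def should_alert(counts, slo_target, threshold=1.0):
    """Alert once the consumed budget reaches the (assumed) threshold."""
    return error_budget_consumed(counts, slo_target) >= threshold


if __name__ == "__main__":
    window = EventCounts(good=9980, total=10000)
    print(sli(window), should_alert(window, slo_target=0.999))
```

A lower threshold would notify the service team before the budget is fully spent, at the cost of more frequent alerts.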

  27. All alerts become for SLOs
    • Started alerting on “Event Based” SLOs as an experiment
    • Channel Convention: #slo--
    Ongoing…
    Wait for the blog post

  28. Alerting in Quipper
    SRE
    Web Developer
    Alerting
    for Platform
    Alerting
    for SLO
    Escalation
    Review alerts daily and weekly
    Feedback

  29. Alerting in Quipper
    SRE
    Web Developer
    Alerting
    for Platform
    Alerting
    for SLO
    Escalation
    ❤Borderless
    Review alerts daily and weekly
    Feedback

  30. Summary
    • Review alerts frequently
    • All alerts become for SLOs
    • Work together

  31. Special Thanks
    • @motobrew
      • Suggested the new policy for alerts
      • Checked all of my alert reviews
    • @d-kuro
      • Suggested creating a mention group for each deployment
      • Led alerting for Kubernetes

  32. Special Thanks
    • Thanks for sharing your opinions in #topic-monitoring
      • @egmc
      • @katzchang
      • @yuuki
      • @y_kawasaki
    • and all SRE Lounge staff

  33. SRE.fm#1 w/@ryok6t and @_inductor_
    https://sre-fm.connpass.com/event/175198/

  34. Thank You!
    chaspy
    chaspy_
    Lead Software Engineer at Quipper
    Takeshi Kondo
    Terraform-jp