Alerting Strategy
for Self-contained Team
Takeshi Kondo / @chaspy
2020/05/22
SRE Lounge#12
Slide 2
Slide 2 text
Alerting ❗
Slide 3
Slide 3 text
Self-Contained
“Encourage development teams to be self-contained so that each team can make products
more comprehensively, proactively, and efficiently.”
Slide 4
Slide 4 text
SRE Mission for 2020 / Self-Contained
• Service Teams can develop by themselves
• No need to ask SREs
• We SREs provide the process
• Design Doc
• Production Readiness Checklist
• Delegate Infrastructure Management (Terraform)
• SLI/SLO
• Alerting
Slide 5
Slide 5 text
SRE Mission for 2020 / Self-Contained
• Service Teams can develop by themselves
• No need to ask SREs
• We SREs provide the process
• Design Doc
• Production Readiness Check
• Delegate Infrastructure Management (Terraform)
• SLI/SLO
• Alerting
Service Teams define their SLI/SLO
and review them weekly
Slide 6
Slide 6 text
SRE Mission for 2020 / Self-Contained
• Service Teams can develop by themselves
• No need to ask SREs
• We SREs provide the process
• Design Doc
• Production Readiness Check
• Delegate Infrastructure Management (Terraform)
• SLI/SLO
• Alerting
Service Teams define their SLI/SLO
and review them weekly
https://sre-next.dev/schedule#c4
Slide 7
Slide 7 text
Alerting in Quipper
SRE
Japan Web Developer
Global Web Developer
Alerting❗ Escalation
Slide 8
Slide 8 text
Alerting in Quipper
            SRE   Japan Web Developer   Global Web Developer
2018 Jun.   4     10                    15
2019 Jun.   4     25                    20
2020 May    6     42                    24
Slide 9
Slide 9 text
Alerting Strategy for Self-Contained Team
Slide 10
Slide 10 text
Agenda
• Shared Responsibility
• Alerting for SRE Team
• Alerting for Service Team
Slide 11
Slide 11 text
tl;dr
• Review alerts frequently
• All alerts become for SLOs
• Work together
Slide 12
Slide 12 text
Agenda
• Shared Responsibility
• Alerting for SRE Team
• Alerting for Service Team
Slide 13
Slide 13 text
Shared Responsibility
Cluster Management
Deployment
Auto Scaling
Application
Web Developer
SRE
Cloud Infrastructure
Agenda
• Shared Responsibility
• Alerting for SRE Team
• Alerting for Service Team
Slide 16
Slide 16 text
Alerting for SRE Team
• Many problems
• Escalation alerts flood conversation channels
• Many alerts with unclear intentions and actions
• No policy
Slide 17
Slide 17 text
Alerting for SRE Team
• Defined Policy
• Reviewed ALL 166 Alerts
• Review alerts Daily and Weekly
Slide 18
Slide 18 text
Defined Policy
Slide 19
Slide 19 text
Reviewed ALL 166 Alerts
• Changed alert channel to follow the policy
• Removed 40 alerts
• Alert should…
• Be Actionable
• Detect what only that alert can detect
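The two criteria above can be sketched as a small review pass over an alert inventory. This is only an illustration: the slides do not describe how the review was actually carried out, and the `Alert` fields, the alert names, and the `review` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # All fields are hypothetical; the slides do not define an alert schema.
    name: str
    channel: str
    action: str  # what the responder should do; empty if nobody knows
    signal: str  # the condition this alert detects


def review(alerts):
    """Split alerts into (keep, remove) using the two criteria on the slide."""
    keep, remove = [], []
    seen_signals = set()
    for alert in alerts:
        actionable = bool(alert.action.strip())
        # "Detect what only that alert can detect": drop alerts whose
        # signal is already covered by an alert we decided to keep.
        unique = alert.signal not in seen_signals
        if actionable and unique:
            keep.append(alert)
            seen_signals.add(alert.signal)
        else:
            remove.append(alert)
    return keep, remove


alerts = [
    Alert("HighErrorRate", "#alerts", "Roll back the latest deploy", "http_5xx_ratio"),
    Alert("ErrorSpike", "#alerts", "Roll back the latest deploy", "http_5xx_ratio"),
    Alert("DiskNoise", "#alerts", "", "disk_io_wait"),
]
keep, remove = review(alerts)
```

In this made-up inventory, `ErrorSpike` is removed as a duplicate of `HighErrorRate`, and `DiskNoise` is removed because it has no action attached.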
Slide 20
Slide 20 text
Review alerts Daily
Slide 21
Slide 21 text
Review alerts Weekly
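One possible input to a weekly review is a tally of which alerts fired most often. The slides do not say how the review is run, so the sketch below is an assumption: the `fired` records, the alert names, and the `weekly_summary` helper are hypothetical.

```python
from collections import Counter


def weekly_summary(fired, top=3):
    """Return the `top` most frequently fired alerts as (name, count) pairs.

    `fired` is a list of (alert_name, day) tuples; this record shape is a
    hypothetical stand-in for whatever the alerting tool exports.
    """
    counts = Counter(name for name, _day in fired)
    return counts.most_common(top)


fired = [
    ("PodUnavailable", "Mon"),
    ("HighCPU", "Mon"),
    ("PodUnavailable", "Tue"),
    ("PodUnavailable", "Thu"),
    ("HighCPU", "Fri"),
    ("OOMKilled", "Fri"),
]
summary = weekly_summary(fired, top=2)
```

Alerts at the top of such a list are natural candidates for tuning, rerouting, or removal during the review.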
Slide 22
Slide 22 text
Agenda
• Shared Responsibility
• Alerting for SRE Team
• Alerting for Service Team
Slide 23
Slide 23 text
All alerts become for SLOs
Slide 24
Slide 24 text
Alerting in Quipper
SRE
Web Developer
Alerting
for Platform
Alerting
for SLO
Slide 25
Slide 25 text
All alerts become for SLOs
• The conditions behind any other alert will eventually cause SLO violations
• CPU usage is high
• OOM Killer happens
• Unavailable pods
• Unicorn backlog is increasing
• Service Teams check only SLO alerts
• Better to have insights when you receive alerts
Slide 26
Slide 26 text
All alerts become for SLOs
• Started alerting on “Event Based” SLOs as an experiment
• Channel Convention: #slo--
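As a rough illustration of what an "Event Based" SLO alert could evaluate, the sketch below computes an SLI as good events over total events and compares the remaining error budget against a threshold. The function names, the 99.9% target, and the threshold are assumptions for the example, not the configuration used at Quipper.

```python
def event_based_sli(good_events, total_events):
    """SLI as the ratio of good events to all events in the window."""
    if total_events == 0:
        return 1.0  # no traffic: treat the window as fully healthy
    return good_events / total_events


def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget still unspent (negative once the SLO is violated)."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    spent = 1.0 - sli
    return 1.0 - spent / allowed


def should_alert(good_events, total_events, slo_target, threshold=0.0):
    """Alert when the remaining budget falls to the (hypothetical) threshold or below."""
    sli = event_based_sli(good_events, total_events)
    return error_budget_remaining(sli, slo_target) <= threshold


# Hypothetical window: 10,000 requests against a 99.9% availability SLO.
violating = should_alert(9980, 10000, 0.999)  # 20 errors, budget allows about 10
healthy = should_alert(9995, 10000, 0.999)    # 5 errors, about half the budget left
```

A real setup would typically evaluate this over a rolling window inside the monitoring system rather than in application code.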
Slide 27
Slide 27 text
All alerts become for SLOs
• Started alerting on “Event Based” SLOs as an experiment
• Channel Convention: #slo--
Ongoing…
Stay tuned for the blog post
Slide 28
Slide 28 text
Alerting in Quipper
SRE
Web Developer
Alerting
for Platform
Alerting
for SLO
Escalation
Review alerts daily and weekly
Feedback
Slide 29
Slide 29 text
Alerting in Quipper
SRE
Web Developer
Alerting
for Platform
Alerting
for SLO
Escalation
❤Borderless
Review alerts daily and weekly
Feedback
Slide 30
Slide 30 text
Summary
• Review alerts frequently
• All alerts become for SLOs
• Work together
Slide 31
Slide 31 text
Special Thanks
• @motobrew
• Suggested the new policy for alerts
• Checked all of my alert reviews
• @d-kuro
• Suggested creating a mention group for each deployment
• Led alerting for Kubernetes
Slide 32
Slide 32 text
Special Thanks
• Thanks for giving your opinion at #topic-monitoring
• @egmc
• @katzchang
• @yuuki
• @y_kawasaki
• and all SRE Lounge staff
Slide 33
Slide 33 text
SRE.fm#1 w/@ryok6t and @_inductor_
https://sre-fm.connpass.com/event/175198/
Slide 34
Slide 34 text
Thank You!
chaspy
chaspy_
Lead Software Engineer
at Quipper
Takeshi Kondo
Terraform-jp