Slide 1

Slide 1 text

Alerting Strategy for Self-contained Team Takeshi Kondo / @chaspy 2020/05/22 SRE Lounge#12

Slide 2

Slide 2 text

Alerting ❗

Slide 3

Slide 3 text

Self-Contained “Encourage development teams to be self-contained so that each team can make products more comprehensively, proactively, and efficiently.”

Slide 4

Slide 4 text

SRE Mission for 2020 / Self-Contained • Service Teams can develop by themselves • No need to ask SREs • We SREs provide the process • Design Doc • Production Readiness Checklist • Delegate Infrastructure Management (Terraform) • SLI/SLO • Alerting

Slide 5

Slide 5 text

SRE Mission for 2020 / Self-Contained • Service Teams can develop by themselves • No need to ask SREs • We SREs provide the process • Design Doc • Production Readiness Check • Delegate Infrastructure Management (Terraform) • SLI/SLO • Alerting
Service Teams define their SLI/SLO and review them weekly

Slide 6

Slide 6 text

SRE Mission for 2020 / Self-Contained • Service Teams can develop by themselves • No need to ask SREs • We SREs provide the process • Design Doc • Production Readiness Check • Delegate Infrastructure Management (Terraform) • SLI/SLO • Alerting
Service Teams define their SLI/SLO and review them weekly
https://sre-next.dev/schedule#c4

Slide 7

Slide 7 text

Alerting in Quipper (diagram labels: SRE, Japan Web Developer, Global Web Developer, Alerting❗, Escalation)

Slide 8

Slide 8 text

Alerting in Quipper

            SRE   Japan Web Developer   Global Web Developer
2018 Jun.     4                    10                     15
2019 Jun.     4                    25                     20
2020 May.     6                    42                     24

Slide 9

Slide 9 text

Alerting Strategy for Self-Contained Team

Slide 10

Slide 10 text

Agenda • Shared Responsibility • Alerting for SRE Team • Alerting for Service Team

Slide 11

Slide 11 text

tl;dr • Review alerts frequently • All alerts become for SLOs • Work together

Slide 12

Slide 12 text

Agenda • Shared Responsibility • Alerting for SRE Team • Alerting for Service Team

Slide 13

Slide 13 text

Shared Responsibility (diagram labels: Cluster Management, Deployment, Auto Scaling, Application, Web Developer, SRE, Cloud Infrastructure)

Slide 14

Slide 14 text

Shared Responsibility (diagram labels: Cluster Management, Deployment, Auto Scaling, Application, Web Developer, SRE, Cloud Infrastructure, ❤Borderless)

Slide 15

Slide 15 text

Agenda • Shared Responsibility • Alerting for SRE Team • Alerting for Service Team

Slide 16

Slide 16 text

Alerting for SRE Team • Many problems • Escalation alerts flood conversation channels • Many alerts with unclear intentions and actions • No policy

Slide 17

Slide 17 text

Alerting for SRE Team • Defined Policy • Reviewed ALL 166 Alerts • Review alerts Daily and Weekly

Slide 18

Slide 18 text

Defined Policy

Slide 19

Slide 19 text

Reviewed ALL 166 Alerts • Changed alert channels to follow the policy • Removed 40 alerts • An alert should… • Be actionable • Detect what only that alert can detect
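
The two review criteria on this slide (an alert is actionable, and it detects something no other alert detects) could be checked mechanically during a review pass. The following is a minimal sketch in Python, not the process used in the talk. The `Alert` record, its fields, and the sample alert names are all hypothetical and are not tied to any monitoring tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical record for illustration; not a real monitoring API.
    name: str
    action: str   # what the responder should do; empty means "unclear"
    detects: str  # the failure condition this alert is meant to catch


def review(alerts):
    """Return (alerts with no action, groups of alerts covering the same condition)."""
    by_condition = defaultdict(list)
    for alert in alerts:
        by_condition[alert.detects].append(alert.name)

    # Criterion 1: an alert without a stated action is not actionable.
    not_actionable = [a.name for a in alerts if not a.action.strip()]
    # Criterion 2: alerts sharing one condition overlap with each other.
    redundant = [names for names in by_condition.values() if len(names) > 1]
    return not_actionable, redundant


# Made-up sample data.
alerts = [
    Alert("cpu-high", "", "cpu saturation"),
    Alert("pods-unavailable", "check rollout status", "unavailable pods"),
    Alert("deployment-degraded", "check rollout status", "unavailable pods"),
]
not_actionable, redundant = review(alerts)
print(not_actionable)  # ['cpu-high']
print(redundant)       # [['pods-unavailable', 'deployment-degraded']]
```

Alerts flagged this way are candidates for removal or rewording; the final decision still belongs to a human reviewer.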

Slide 20

Slide 20 text

Review alerts Daily

Slide 21

Slide 21 text

Review alerts Weekly

Slide 22

Slide 22 text

Agenda • Shared Responsibility • Alerting for SRE Team • Alerting for Service Team

Slide 23

Slide 23 text

All alerts become for SLOs

Slide 24

Slide 24 text

Alerting in Quipper (diagram labels: SRE, Web Developer, Alerting for Platform, Alerting for SLO)

Slide 25

Slide 25 text

All alerts become for SLOs • The conditions behind any other alert eventually cause SLO violations • CPU usage is high • OOM Killer happens • Unavailable pods • Unicorn backlog is increasing • Service Teams check only SLO alerts • Better to have insights when you receive alerts

Slide 26

Slide 26 text

All alerts become for SLOs • Started alerting on “Event Based” SLOs as an experiment • Channel Convention: #slo--
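
The slides do not show how the “Event Based” SLO is computed. A common reading is that the SLI is the ratio of good events to total events over a window, and an alert fires when that ratio drops below the SLO target. The sketch below assumes that reading; the function names, the 99.9% target, and the event counts are made up for illustration.

```python
def event_based_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were good; treated as 1.0 when there was no traffic."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def should_alert(good_events: int, total_events: int, slo_target: float) -> bool:
    """True when the measured SLI is below the SLO target."""
    return event_based_sli(good_events, total_events) < slo_target


# Assumed example: 10,000 requests in the window, SLO target of 99.9%.
print(should_alert(9980, 10000, 0.999))  # True: SLI is 0.998
print(should_alert(9995, 10000, 0.999))  # False: SLI is 0.9995
```

Counting events rather than time slices means low-traffic periods weigh less than busy ones, which is one reason to pick this style of SLO.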

Slide 27

Slide 27 text

All alerts become for SLOs • Started alerting on “Event Based” SLOs as an experiment • Channel Convention: #slo-- Ongoing… Wait for the blog post

Slide 28

Slide 28 text

Alerting in Quipper (diagram labels: SRE, Web Developer, Alerting for Platform, Alerting for SLO, Escalation, Review alerts daily and weekly, Feedback)

Slide 29

Slide 29 text

Alerting in Quipper (diagram labels: SRE, Web Developer, Alerting for Platform, Alerting for SLO, Escalation, ❤Borderless, Review alerts daily and weekly, Feedback)

Slide 30

Slide 30 text

Summary • Review alerts frequently • All alerts become for SLOs • Work together

Slide 31

Slide 31 text

Special Thanks • @motobrew • Suggested the new policy for alerts • Confirmed all my alert reviews • @d-kuro • Suggested creating a mention group for each deployment • Led alerting for Kubernetes

Slide 32

Slide 32 text

Special Thanks • Thanks for giving your opinion at #topic-monitoring • @egmc • @katzchang • @yuuki • @y_kawasaki • and all SRE Lounge staff

Slide 33

Slide 33 text

SRE.fm#1 w/@ryok6t and @_inductor_ https://sre-fm.connpass.com/event/175198/

Slide 34

Slide 34 text

Thank You! Takeshi Kondo, Lead Software Engineer at Quipper • chaspy • chaspy_ • Terraform-jp