DevOpsDays Taipei 2022 - How we send alert to the right people

Slide 1

Slide 1 text

How We Send Alert To the right People DevOpsDays Taipei 2022

Slide 2

Slide 2 text

About Me Tsung-Ting, Wu ken.wu SRE Software engineer

Slide 3

Slide 3 text

Agenda ● Issues: issues we encountered when sending alerts ● Alert Orchestration: concept and solution ● Conclusion: summary and takeaways

Slide 4

Slide 4 text

Company A Organization Structure ● Startup(3+ years) ● Service Team: 4 ● RD: 30(Ops: 4, SRE: 1) ● You build it, You run it

Slide 5

Slide 5 text

Data Pipeline

Slide 6

Slide 6 text

Team in pipeline

Slide 7

Slide 7 text

Monitoring: Prometheus + Alertmanager

Slide 8

Slide 8 text

Service B worker Lost No consumer on Queue

Slide 9

Slide 9 text

Issue 1: Notifying Team B that their worker is not consuming jobs ● Team B members have to be mentioned manually in the channel ● No alert activity ● Who is the responder? Channels: ops_channel, pipeline_team_channel

Slide 10

Slide 10 text

Issue 2: No one knows who has already read or received the alert ● Team B members have to be mentioned manually in the channel ● No alert activity ● Who is the responder?

Slide 11

Slide 11 text

Issue 3: Alert routing issues with other teams ● Different teams may collect metrics and receive alerts in different ways ● No standard way to deliver alerts to other teams in the same company

Slide 12

Slide 12 text

Issue 4: Alert routing with multiple regions/DCs (tw-1, eu-1)

Slide 13

Slide 13 text

Issues ● Issue 1: Multiple region/DC management ● Issue 2: Lack of alert observability ● Issue 3: No standard way to deliver alerts across teams ● Alert management should be decoupled from the monitoring tool

Slide 14

Slide 14 text

Alert Orchestration ● Alert Centralization ● Alert Routing ● On-call Management & Alert Escalation ● Incident Management

Slide 15

Slide 15 text

SRE Conference 2022 - How to Build a Healthy On-Call Culture

Slide 16

Slide 16 text

PagerDuty

Slide 17

Slide 17 text

Opsgenie ref: https://serhatcan.medium.com/avoiding-alerts-overload-from-microservices-with-opsgenie-4f671f38e8af

Slide 18

Slide 18 text

Opsgenie ref: https://serhatcan.medium.com/avoiding-alerts-overload-from-microservices-with-opsgenie-4f671f38e8af Detect Response

Slide 19

Slide 19 text

Grafana OnCall ref:https://grafana.com/docs/oncall/latest/getting-started/

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

This is not a sponsored post

Slide 22

Slide 22 text

Alert Centralization Team based integration

Slide 23

Slide 23 text

Alert Centralization ● All Alert in one place ● Ack !!!! ● Alert Activity

Slide 24

Slide 24 text

Alert Centralization ● All Alert in one place ● Ack !!!! ● Alert Activity

Slide 25

Slide 25 text

Alert Centralization: Alert Activity Log. Alert fired → notify on-call user in Team A → (no ack) → notify next user in Team A

Slide 26

Slide 26 text

Alert Escalation: notify users in a desired order. Alert fired → notify on-call user in Team A → (no ack) → notify next user in Team A → (no ack) → notify Admin in Team A
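The escalation order on this slide can be sketched roughly as below. This is only an illustrative model, not how Opsgenie or any other tool implements it: the `Alert` class, the `escalate` function, and the three step names are hypothetical, and "no ack" is simulated with a set of responders who acknowledge.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical escalation order, following the slide:
# on-call user first, then the next user, then the team admin.
ESCALATION_CHAIN = ["on-call user", "next user", "admin"]


@dataclass
class Alert:
    message: str
    acked_by: Optional[str] = None
    activity: List[str] = field(default_factory=list)  # alert activity log


def escalate(alert: Alert, responders_who_ack: Set[str]) -> Alert:
    """Notify each step in order and stop at the first acknowledgement."""
    alert.activity.append("alert fired")
    for target in ESCALATION_CHAIN:
        alert.activity.append(f"notified {target}")
        if target in responders_who_ack:
            alert.acked_by = target
            alert.activity.append(f"acked by {target}")
            break
        alert.activity.append(f"no ack from {target}")
    return alert


if __name__ == "__main__":
    result = escalate(Alert("Service B worker lost"), {"next user"})
    for entry in result.activity:
        print(entry)
```

Keeping the activity log on the alert itself is what answers "who has already received this alert?" without asking in a chat channel.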

Slide 27

Slide 27 text

Alert Escalation P1 Alert

Slide 28

Slide 28 text

Alert Escalation: On-call management ● Daily/weekly rotation ● Work-hour and off-hour rotation
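A weekly rotation combined with a work-hour/off-hour split might be computed along these lines. Everything specific here is an assumption made for illustration: the roster names, the rotation start date, and the 09:00 to 18:00 weekday definition of work hours. Managed on-call tools provide this as configuration rather than code.

```python
from datetime import date, datetime, time

# Hypothetical rosters and start date (2022-01-03 is a Monday).
WORK_HOUR_ROSTER = ["alice", "bob"]
OFF_HOUR_ROSTER = ["carol", "dave"]
ROTATION_START = date(2022, 1, 3)


def is_work_hour(moment: datetime) -> bool:
    """Assumed work hours: Monday to Friday, 09:00 to 18:00."""
    return moment.weekday() < 5 and time(9, 0) <= moment.time() < time(18, 0)


def on_call(moment: datetime) -> str:
    """Pick the roster by time of day, then rotate weekly within it."""
    roster = WORK_HOUR_ROSTER if is_work_hour(moment) else OFF_HOUR_ROSTER
    weeks_elapsed = (moment.date() - ROTATION_START).days // 7
    return roster[weeks_elapsed % len(roster)]


if __name__ == "__main__":
    print(on_call(datetime(2022, 1, 3, 10, 0)))   # work hours, first week
    print(on_call(datetime(2022, 1, 3, 22, 0)))   # off hours, first week
    print(on_call(datetime(2022, 1, 10, 10, 0)))  # work hours, second week
```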

Slide 29

Slide 29 text

Alert Escalation: Notification method ● Use case 1 ○ P1: phone call ○ P2: app ○ P3~P5: email ● Use case 2 ○ Off-hour low-priority alerts: email
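The two use cases can be written down as small lookup rules. This is a sketch only: the slide does not say which priorities count as "low priority" or what happens to other alerts in use case 2, so `LOW_PRIORITY` (P3 to P5) and the `default` method are assumptions.

```python
# Use case 1: notification method chosen by alert priority (from the slide).
METHOD_BY_PRIORITY = {
    "P1": "phone call",
    "P2": "app",
    "P3": "email",
    "P4": "email",
    "P5": "email",
}

# Assumption: "low priority" means P3 to P5; the slide does not define it.
LOW_PRIORITY = {"P3", "P4", "P5"}


def use_case_1(priority: str) -> str:
    """Return the notification method for a priority; KeyError if unknown."""
    return METHOD_BY_PRIORITY[priority]


def use_case_2(priority: str, off_hours: bool, default: str = "app") -> str:
    """Off-hour low-priority alerts go to email; `default` is hypothetical."""
    if off_hours and priority in LOW_PRIORITY:
        return "email"
    return default


if __name__ == "__main__":
    print(use_case_1("P1"))
    print(use_case_2("P4", off_hours=True))
```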

Slide 30

Slide 30 text

Alert Routing: routing and enrichment ● Alert policy in Opsgenie ● Alert enrichment/silence ● Global policy and team-based policy ● Processed by ○ alert message title ○ tag ○ key-value

Slide 31

Slide 31 text

Alert Routing: routing and enrichment ● Enrich alerts ● Global policy and team-based policy ● Processed by ○ alert message title ○ tag ○ key-value ○ etc.
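As a rough mental model of global and team-based policies, the sketch below runs global policies first and then the policies of whichever team the alert was routed to. It does not use the Opsgenie API; the `Alert` fields, the policy functions, and the names `service-b` and `team-b` are all hypothetical, chosen to show one match on a key-value pair, one on the message title, and one on a tag.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Alert:
    title: str
    tags: List[str] = field(default_factory=list)
    details: Dict[str, str] = field(default_factory=dict)  # key-value pairs
    team: Optional[str] = None
    silenced: bool = False


def route_by_service(alert: Alert) -> None:
    """Global policy, key-value match: route Service B alerts to Team B."""
    if alert.details.get("service") == "service-b":
        alert.team = "team-b"


def tag_queue_alerts(alert: Alert) -> None:
    """Global policy, title match: enrich the alert with a 'queue' tag."""
    if "no consumer" in alert.title.lower():
        alert.tags.append("queue")


def silence_test_alerts(alert: Alert) -> None:
    """Team policy, tag match: silence alerts tagged 'test'."""
    if "test" in alert.tags:
        alert.silenced = True


GLOBAL_POLICIES = [route_by_service, tag_queue_alerts]
TEAM_POLICIES = {"team-b": [silence_test_alerts]}


def process(alert: Alert) -> Alert:
    """Apply global policies, then the policies of the routed team."""
    for policy in GLOBAL_POLICIES:
        policy(alert)
    for policy in TEAM_POLICIES.get(alert.team, []):
        policy(alert)
    return alert


if __name__ == "__main__":
    print(process(Alert("No consumer on queue", details={"service": "service-b"})))
```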

Slide 32

Slide 32 text

Opsgenie ref: https://serhatcan.medium.com/avoiding-alerts-overload-from-microservices-with-opsgenie-4f671f38e8af

Slide 33

Slide 33 text

Incident Management Alert associated with Incident Timeline

Slide 34

Slide 34 text

Opsgenie ref: https://serhatcan.medium.com/avoiding-alerts-overload-from-microservices-with-opsgenie-4f671f38e8af

Slide 35

Slide 35 text

Incident Management Post incident analysis

Slide 36

Slide 36 text

Recap: Prometheus + Alertmanager

Slide 37

Slide 37 text

Recap (cont.) ● Issue 1: Multiple region/DC management ● Issue 2: Lack of alert observability ● Issue 3: No standard way to deliver alerts across teams ● Alert management should be decoupled from the monitoring tool

Slide 38

Slide 38 text

Alert Lifecycle ref:https://grafana.com/docs/oncall/latest/getting-started/

Slide 39

Slide 39 text

Takeaway ● Define alert labels first (e.g. region, service) ● Actionable alerts / alert severity / alert priority ● Use a managed service ○ PagerDuty, Opsgenie, Grafana OnCall (cloud), etc. ● Opsgenie requires the Standard plan to use advanced alert routing features
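One way to enforce the "define labels first" takeaway is a small check that every alert carries the agreed labels before it is routed. The required set below is an assumption: `region` and `service` come from the slide, while `severity` is added only because the slide also mentions alert severity.

```python
from typing import Dict, List

# Assumed label set; adjust to whatever the teams agree on.
REQUIRED_LABELS = ("region", "service", "severity")


def missing_labels(labels: Dict[str, str]) -> List[str]:
    """Return the required labels that are absent or empty."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]


if __name__ == "__main__":
    print(missing_labels({"region": "tw-1", "service": "service-b"}))
```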

Slide 40

Slide 40 text

Thanks! tsungtwu @tsungtwu CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik.