Tsung-Ting, Wu
September 15, 2022

DevOpsDays Taipei 2022 - How we send alert to the right people

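A talk on our recent experience adopting SRE practices, focusing mainly on alerting management. Using real cases, it explains which problems we found in our organization and how we improved them. So alerting takes more than just Prometheus + Alertmanager?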


Transcript

  1. Agenda • Issues: issues we encountered when sending alerts • Alert Orchestration: concept and solution • Conclusion: summary and takeaway
  2. Company A Organization Structure • Startup (3+ years) • Service teams: 4 • RD: 30 (Ops: 4, SRE: 1) • You build it, you run it
  3. Issue 1: Notifying Team B that their worker is not consuming jobs • Team B members are mentioned manually in the channel • No alert activity • Who is the responder? • Channels: ops_channel, pipeline_team_channel
  4. Issue 2: No one knows who has already read or received the alert • Team B members are mentioned manually in the channel • No alert activity • Who is the responder?
  5. Issue 3: Alert routing issues with other teams • Different teams may collect metrics and receive alerts in different ways • No standard way to deliver alerts to other teams in the same company
  6. Issues • Issue 1: Multiple region/DC management • Issue 2: Lack of alert observability • Issue 3: No standard way to deliver alerts across teams • Alert management should be decoupled from the monitoring tool
  7. Alert Centralization • Alert activity log • Alert fired → notify on-call user in Team A → (no ack) → notify next user in Team A
  8. Alert Escalation: notify users in a desired order • Alert fired → notify on-call user in Team A → (no ack) → notify next user in Team A → (no ack) → notify Admin in Team A
  9. Alert Escalation: notification method • Use case 1 ◦ P1: phone call ◦ P2: app ◦ P3~P5: email • Use case 2 ◦ Off-hours low-priority alerts: email
  10. Alert Routing: routing and enrichment • Alert policy in Opsgenie • Alert enrichment/silence • Global policy and team-based policy • Processed by ◦ alert message title ◦ tag ◦ key-value
  11. Alert Routing: routing and enrichment • Enrich alert • Global policy and team-based policy • Processed by ◦ alert message title ◦ tag ◦ key-value ◦ etc.
  12. Recap (cont.) • Issue 1: Multiple region/DC management • Issue 2: Lack of alert observability • Issue 3: No standard way to deliver alerts across teams • Alert management should be decoupled from the monitoring tool
  13. Takeaway • Define alert labels first (e.g. region, service) • Actionable alerts / alert severity / alert priority • Use a managed service ◦ PagerDuty, Opsgenie, Grafana OnCall (cloud), etc. • Opsgenie requires the Standard plan to use advanced alert routing features
  14. Thanks! tsungtwu @tsungtwu • CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik.
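
The routing, notification-method, and escalation ideas from slides 8 to 11 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Opsgenie or any other managed service implements them: the names `Alert`, `route_team`, `channel_for`, and `escalate`, the role labels, and the rule format are all hypothetical. The priority-to-channel mapping follows use case 1 on slide 9, and the escalation order follows slide 8.

```python
from dataclasses import dataclass, field

# Mapping taken from slide 9 (use case 1); the channel names are illustrative.
CHANNEL_BY_PRIORITY = {"P1": "phone", "P2": "app", "P3": "email", "P4": "email", "P5": "email"}

# Escalation order from slide 8; the role labels are hypothetical.
ESCALATION_ORDER = ["on-call", "next-user", "admin"]


@dataclass
class Alert:
    title: str
    priority: str
    tags: dict = field(default_factory=dict)          # key-value labels, e.g. region, service
    activity_log: list = field(default_factory=list)  # who was notified and who acked


def route_team(alert, rules, default_team="ops"):
    """Return the team of the first rule whose key-value pairs all match the alert's tags."""
    for match, team in rules:
        if all(alert.tags.get(key) == value for key, value in match.items()):
            return team
    return default_team


def channel_for(alert):
    """Pick the notification method from the alert priority (unknown priorities get email)."""
    return CHANNEL_BY_PRIORITY.get(alert.priority, "email")


def escalate(alert, team, acked_by=None):
    """Notify each role in order until one acknowledges; return that role, or None."""
    channel = channel_for(alert)
    for role in ESCALATION_ORDER:
        alert.activity_log.append(f"notified {team}/{role} via {channel}")
        if role == acked_by:
            alert.activity_log.append(f"acked by {team}/{role}")
            return role
        alert.activity_log.append(f"no ack from {team}/{role}")
    return None


# Usage: a pipeline alert is routed to a hypothetical "team-b" and acked by the second responder.
rules = [({"service": "pipeline"}, "team-b")]
alert = Alert("worker not consuming jobs", "P1", {"service": "pipeline", "region": "tw"})
team = route_team(alert, rules)
acker = escalate(alert, team, acked_by="next-user")
```

Keeping the activity log on the alert itself mirrors the point of slide 7: every notification and acknowledgement is recorded in one place, so anyone can see who has received the alert and who responded.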