Service alerting and monitoring

In this talk, we’ll dive into the wonderful world of observability. We’ll first get to know what those famous SLIs and SLOs stand for and how we can define and utilize them for our application. Once we understand those concepts, we’ll look at how we can tie our alerting and monitoring together in a declarative manner.

Links:
Service Level Objectives: https://sre.google/sre-book/service-level-objectives/
Implementing SLOs: https://sre.google/workbook/implementing-slos/
Practical alerting: https://sre.google/sre-book/practical-alerting/
Availability table: https://sre.google/sre-book/availability-table/
Motivation for Error Budgets: https://sre.google/sre-book/embracing-risk/#xref_risk-management_unreliability-budgets
Experimental refactoring: https://www.youtube.com/watch?v=9MW4H6kFb7M
Alerting on SLOs: https://sre.google/workbook/alerting-on-slos/

Jelmer Snoeck

March 30, 2021

Transcript

  1. Today’s agenda
    ◦ Service Level what?
    ◦ Error budgets
    ◦ Monitoring and alerting
    ◦ Demos
    ◦ Questions
    ◦ Further reading
  2. Service Level What?
    ◦ Indicator: A strictly defined measurement of some part of the system.
    ◦ Objective: What target do we want to reach for this defined measurement?
    ◦ Agreement: What are the consequences if we (don’t) meet the target?
  3. Service Level Objectives
    ◦ 99.9% availability
    ◦ 90% requests <100ms
    ◦ 95% requests <250ms
    ◦ 99.9% requests <500ms
  4. Service Level Agreements
    ◦ 99.5-99.9% availability => 10% credited
    ◦ 95-99.5% availability => 15% credited
    ◦ <95% availability => 25% credited
  5. Practical Error Budgets
    ◦ SLO:
      ▫ 99.9% success for the past 7 days
      ▫ 99.9% success for the past 30 days
    ◦ 1,000,000 requests past 7 days
      ▫ 995,000 successful requests (99.5%)
    ◦ 10,000,000 requests past 30 days
      ▫ 9,991,000 successful requests (99.91%)
  6. Practical Error Budgets
    ◦ Negative budget for the past 7 days
      ▫ 99.5% success rate (-0.4%: budget overspent by 0.4 percentage points)
    ◦ Positive budget for the past 30 days
      ▫ 99.91% success rate (+0.01%: 0.01 percentage points of budget left)
  7. Monitoring and alerting
    ◦ Monitoring: Collecting, processing, aggregating and displaying data about a system (latencies, error rates, number of servers, …)
    ◦ Alerting: A notification destined for a human, pushed through another system (JIRA tickets, GH Issues, PagerDuty pages, …)
  8. Monitoring
    ◦ Analyze long term trends
    ◦ Historical or experimental comparison
    ◦ Debugging
    ◦ Dashboards
    ◦ Alerting
  9. Further reading
    ◦ Service Level Objectives
    ◦ Implementing SLOs
    ◦ Practical alerting
    ◦ Availability Table
    ◦ Motivation for Error Budgets
    ◦ Experimental Refactoring
    ◦ Alerting on SLOs
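
The arithmetic behind the error budget and SLA slides (4 to 6) can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the function names `error_budget` and `sla_credit_percent` are hypothetical, and the handling of the tier boundaries (each lower bound treated as inclusive, no credit at 99.9% or above) is an assumption, since the slide does not say which tier an exact 99.5% falls into. Exact fractions are used so that values like 99.9% do not pick up floating-point noise.

```python
from fractions import Fraction


def error_budget(total_requests, successful_requests, slo_percent):
    """Compare observed failures with the failures an SLO allows.

    slo_percent should be a string or int (e.g. "99.9") so that the
    conversion to Fraction is exact.
    """
    slo = Fraction(slo_percent) / 100
    allowed_failures = total_requests * (1 - slo)
    actual_failures = total_requests - successful_requests
    remaining = allowed_failures - actual_failures
    return {
        "success_percent": Fraction(successful_requests * 100, total_requests),
        "allowed_failures": allowed_failures,
        "actual_failures": actual_failures,
        # Budget left, in percentage points of all requests.
        # Negative means the budget is overspent.
        "remaining_points": remaining * 100 / total_requests,
    }


def sla_credit_percent(availability_percent):
    """Map an availability percentage to the credit tiers from slide 4.

    Assumption: lower bounds are inclusive and 99.9% or better earns
    no credit. The slide leaves the exact boundaries open.
    """
    availability = Fraction(availability_percent)
    if availability >= Fraction("99.9"):
        return 0
    if availability >= Fraction("99.5"):
        return 10
    if availability >= 95:
        return 15
    return 25


# Numbers from slide 5, both measured against a 99.9% SLO.
week = error_budget(1_000_000, 995_000, "99.9")
month = error_budget(10_000_000, 9_991_000, "99.9")

print(float(week["remaining_points"]))   # -0.4
print(float(month["remaining_points"]))  # 0.01
```

For the 7-day window the SLO allows 1,000 failed requests but 5,000 occurred, so the budget is overspent. For the 30-day window 10,000 failures are allowed and 9,000 occurred, leaving a small positive budget.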