Service alerting and monitoring

In this talk, we’ll dive into the wonderful world of observability. We’ll first get to know what those famous SLIs and SLOs stand for and how we can define and utilize them for our application. Once we understand those concepts, we’ll look at how we can tie our alerting and monitoring together in a declarative manner.

Links:
Service Level Objectives: https://sre.google/sre-book/service-level-objectives/
Implementing SLOs: https://sre.google/workbook/implementing-slos/
Practical alerting: https://sre.google/sre-book/practical-alerting/
Availability table: https://sre.google/sre-book/availability-table/
Motivation for Error Budgets: https://sre.google/sre-book/embracing-risk/#xref_risk-management_unreliability-budgets
Experimental refactoring: https://www.youtube.com/watch?v=9MW4H6kFb7M
Alerting on SLOs: https://sre.google/workbook/alerting-on-slos/

Jelmer Snoeck

March 30, 2021

Transcript

  1. Service monitoring and alerting Halihax - March 2021

  2. Hello! Jelmer Snoeck, Sr. Site Reliability Engineer

  3. Today’s agenda
    ◦ Service Level what?
    ◦ Error budgets
    ◦ Monitoring and alerting
    ◦ Demos
    ◦ Questions
    ◦ Further reading
  4. Section 1: Service Level What? Let’s get on the same page
  5. Service Level What?
    Indicator: A strictly defined measurement of some part of the system.
    Objective: What target do we want to reach for this defined measurement?
    Agreement: What are the consequences if we (don’t) meet the target?
  6. Service Level Indicators
    ◦ Request latency
    ◦ Success rate
    ◦ Durability
    ◦ Throughput
  7. Service Level Objectives
    ◦ 99.9% availability
    ◦ 90% requests <100ms
    ◦ 95% requests <250ms
    ◦ 99.9% requests <500ms
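The latency objectives on this slide can be checked against a sample of measured request latencies. The following is a minimal sketch, assuming latencies are available as a plain list of milliseconds; the names `LATENCY_SLOS`, `fraction_within` and `evaluate_latency_slos` are hypothetical and not taken from the talk.

```python
from typing import Dict, List, Sequence, Tuple

# (threshold in ms, required fraction of requests faster than the threshold),
# copied from the latency objectives listed on the slide.
LATENCY_SLOS: List[Tuple[int, float]] = [(100, 0.90), (250, 0.95), (500, 0.999)]


def fraction_within(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Return the fraction of requests strictly faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency measurement")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def evaluate_latency_slos(
    latencies_ms: Sequence[float],
    slos: Sequence[Tuple[int, float]] = LATENCY_SLOS,
) -> Dict[int, bool]:
    """Map each threshold to whether its objective is met for this sample."""
    return {
        threshold: fraction_within(latencies_ms, threshold) >= target
        for threshold, target in slos
    }


# Hypothetical sample: 92 fast, 4 medium and 4 slow requests.
sample = [50] * 92 + [200] * 4 + [400] * 4
results = evaluate_latency_slos(sample)
```

In a real setup the fractions would more likely come from histogram buckets in the monitoring system than from raw per-request samples.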
  8. Service Level Agreements
    ◦ 99.5-99.9% availability => 10% credited
    ◦ 95-99.5% availability => 15% credited
    ◦ <95% availability => 25% credited
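The credit tiers on this slide amount to a simple lookup. The sketch below is illustrative only: the slide does not say how boundary values are treated, so it assumes each tier's lower bound is inclusive and that availability of 99.9% or more earns no credit. The function name `sla_credit_percent` is hypothetical.

```python
def sla_credit_percent(availability_pct: float) -> int:
    """Return the percentage credited for a measured availability.

    Assumption (not stated on the slide): lower bounds are inclusive,
    and meeting 99.9% or better results in no credit.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    if availability_pct >= 99.9:
        return 0
    if availability_pct >= 99.5:
        return 10
    if availability_pct >= 95.0:
        return 15
    return 25
```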
  9. Section 2: Error Budgets

  10. Practical Error Budgets
    ◦ SLO:
      ▫ 99.9% success for the past 7 days
      ▫ 99.9% success for the past 30 days
    ◦ 1,000,000 requests past 7 days
      ▫ 995,000 successful requests (99.5%)
    ◦ 10,000,000 requests past 30 days
      ▫ 9,991,000 successful requests (99.91%)
  11. Practical Error Budgets
    ◦ Negative budget for the past 7 days
      ▫ 99.5% success rate (budget overspent by 0.4 percentage points)
    ◦ Positive budget for the past 30 days
      ▫ 99.91% success rate (0.01 percentage points of budget left)
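The arithmetic behind these two slides can be reproduced in a few lines. This is a minimal sketch using the request counts and the 99.9% target from the slides; the `error_budget` function and `BudgetReport` type are hypothetical names, and the target is passed as a string so that `Fraction` keeps the numbers exact.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class BudgetReport:
    success_rate: Fraction        # successful / total
    allowed_failures: Fraction    # failures the SLO tolerates in this window
    actual_failures: int          # failures observed in this window
    remaining_failures: Fraction  # negative when the budget is overspent
    consumed: Fraction            # 1 means exactly the whole budget is used


def error_budget(total: int, successful: int, slo_target: str) -> BudgetReport:
    """Compute error budget figures for one window, e.g. slo_target="0.999"."""
    if total <= 0 or not 0 <= successful <= total:
        raise ValueError("need total > 0 and 0 <= successful <= total")
    target = Fraction(slo_target)
    if not 0 < target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = total * (1 - target)
    failures = total - successful
    return BudgetReport(
        success_rate=Fraction(successful, total),
        allowed_failures=allowed,
        actual_failures=failures,
        remaining_failures=allowed - failures,
        consumed=failures / allowed,
    )


# Numbers from the slides: both windows share the 99.9% target.
seven_days = error_budget(1_000_000, 995_000, "0.999")
thirty_days = error_budget(10_000_000, 9_991_000, "0.999")
print(seven_days.remaining_failures)   # -4000
print(thirty_days.remaining_failures)  # 1000
```

Over 7 days the budget allows 1,000 failed requests but 5,000 occurred, so it is overspent by 4,000 requests (0.4% of traffic). Over 30 days 9,000 of the 10,000 allowed failures are used, leaving 1,000 requests (0.01% of traffic).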
  12. Section 3: Monitoring and alerting

  13. Monitoring and alerting
    Monitoring: Collecting, processing, aggregating and displaying data about a system (latencies, error rates, number of servers, …)
    Alerting: A notification destined for a human, pushed through another system (JIRA tickets, GH Issues, PagerDuty pages, …)
  14. Monitoring
    ◦ Analyze long term trends
    ◦ Historical or experimental comparison
    ◦ Debugging
    ◦ Dashboards
    ◦ Alerting
  15. Alerting
    ◦ Error rate >= SLO Threshold
    ◦ Alert durations
    ◦ Burn rate(s)
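The slide only names burn rates, so the following is a sketch of the usual definition rather than the approach shown in the demos: burn rate is the observed error rate divided by the error rate the SLO allows, and an alert threshold can be derived from how much of the budget one is willing to lose within a given alert window. All function names here are hypothetical, and the 2%-in-one-hour figure is just an example input.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 spends it exactly over the SLO period."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_rate / (1.0 - slo_target)


def burn_rate_threshold(
    budget_fraction: float, slo_period_hours: float, window_hours: float
) -> float:
    """Burn rate at which budget_fraction of the budget is spent within window_hours."""
    if window_hours <= 0 or slo_period_hours <= 0:
        raise ValueError("periods must be positive")
    return budget_fraction * slo_period_hours / window_hours


def should_alert(error_rate: float, slo_target: float, threshold: float) -> bool:
    """True when the current burn rate reaches the chosen threshold."""
    return burn_rate(error_rate, slo_target) >= threshold


# Example inputs (assumed): alert when 2% of a 30-day budget burns within 1 hour.
threshold = burn_rate_threshold(0.02, slo_period_hours=30 * 24, window_hours=1)
# A 0.5% error rate against a 99.9% SLO burns budget roughly 5x too fast,
# which is still below the one-hour threshold of roughly 14.4.
page = should_alert(error_rate=0.005, slo_target=0.999, threshold=threshold)
```

Combining several windows, each with its own threshold, is one way to trade off detection speed against alert noise.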
  16. Section 4: Demos

  17. Further reading
    ◦ Service Level Objectives
    ◦ Implementing SLOs
    ◦ Practical alerting
    ◦ Availability Table
    ◦ Motivation for Error Budgets
    ◦ Experimental Refactoring
    ◦ Alerting on SLOs
  18. Thanks! QUESTIONS? @jelmersnoeck github.com/jelmersnoeck