Service alerting and monitoring

In this talk, we’ll dive into the wonderful world of observability. We’ll first get to know what those famous SLIs and SLOs stand for and how we can define and utilize them for our application. Once we understand those concepts, we’ll look at how we can tie our alerting and monitoring together in a declarative manner.

Links:
Service Level Objectives: https://sre.google/sre-book/service-level-objectives/
Implementing SLOs: https://sre.google/workbook/implementing-slos/
Practical alerting: https://sre.google/sre-book/practical-alerting/
Availability table: https://sre.google/sre-book/availability-table/
Motivation for Error Budgets: https://sre.google/sre-book/embracing-risk/#xref_risk-management_unreliability-budgets
Experimental refactoring: https://www.youtube.com/watch?v=9MW4H6kFb7M
Alerting on SLOs: https://sre.google/workbook/alerting-on-slos/

Jelmer Snoeck

March 30, 2021

Transcript

  1. Service monitoring and alerting
    Halihax - March 2021

  2. Hello!
    Jelmer Snoeck
    Sr. Site Reliability Engineer

  3. Today’s agenda
    ◦ Service Level what?
    ◦ Error budgets
    ◦ Monitoring and alerting
    ◦ Demos
    ◦ Questions
    ◦ Further reading

  4. Service Level What?
    Let's get on the same page
    1

  5. Service Level What?
    Indicator: A strictly defined measurement of some part of the system.
    Objective: What target do we want to reach for this defined measurement?
    Agreement: What are the consequences if we (don’t) meet the target?

  6. Service Level Indicators
    ◦ Request latency
    ◦ Success rate
    ◦ Durability
    ◦ Throughput

  7. Service Level Objectives
    ◦ 99.9% availability
    ◦ 90% requests <100ms
    ◦ 95% requests <250ms
    ◦ 99.9% requests <500ms
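
To make the latency objectives above concrete, here is a minimal sketch that checks a batch of request latencies against them. The names `LatencyObjective`, `fraction_under` and `evaluate` are hypothetical and not from the talk, and the sketch assumes latencies are already collected in milliseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyObjective:
    """One latency SLO (hypothetical helper type, not from the talk)."""
    target_fraction: float  # e.g. 0.90 means "90% of requests"
    threshold_ms: float     # e.g. 100 means "faster than 100ms"


# The three latency objectives listed on the slide.
OBJECTIVES = [
    LatencyObjective(0.90, 100),
    LatencyObjective(0.95, 250),
    LatencyObjective(0.999, 500),
]


def fraction_under(latencies_ms, threshold_ms):
    """Return the fraction of requests strictly faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def evaluate(latencies_ms, objectives=OBJECTIVES):
    """Map each objective to True (met) or False (missed)."""
    return {
        objective: fraction_under(latencies_ms, objective.threshold_ms)
        >= objective.target_fraction
        for objective in objectives
    }
```

Calling `evaluate(samples)` returns one boolean per objective, so a single slow tail can miss the 99.9% objective while the 90% objective is still met.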

  8. Service Level Agreements
    ◦ 99.5-99.9% availability => 10% credited
    ◦ 95-99.5% availability => 15% credited
    ◦ <95% availability => 25% credited
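
The credit tiers above can be sketched as a small lookup. The function name `sla_credit_percent` is hypothetical, and the boundary handling is an assumption: the slide does not say which tier applies at exactly 99.5% or 95%, so this sketch treats each lower bound as inclusive and gives no credit at 99.9% or higher.

```python
def sla_credit_percent(availability_percent):
    """Return the credited percentage for a measured availability.

    Assumption: lower bounds are inclusive, and availability of 99.9%
    or more earns no credit. The slide leaves both points open.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    if availability_percent >= 99.9:
        return 0
    if availability_percent >= 99.5:
        return 10
    if availability_percent >= 95:
        return 15
    return 25
```

Ordering the checks from the highest tier down keeps each condition to a single comparison.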

  9. Error Budgets
    2

  10. Practical Error Budgets
    ◦ SLO:
    ▫ 99.9% success for the past 7 days
    ▫ 99.9% success for the past 30 days
    ◦ 1,000,000 requests past 7 days
    ▫ 995,000 successful requests (99.5%)
    ◦ 10,000,000 requests past 30 days
    ▫ 9,991,000 successful requests (99.91%)

  11. Practical Error Budgets
    ◦ Negative budget for the past 7 days
    ▫ 99.5% success rate (budget overspent by 0.4 percentage points)
    ◦ Positive budget for the past 30 days
    ▫ 99.91% success rate (0.01 percentage points of budget left)
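
The arithmetic behind these two budgets can be sketched as follows, using the request counts from the previous slide. `BudgetReport` and `error_budget` are hypothetical names, and the SLO target is passed as a string so that `Fraction` keeps the numbers exact; this is one possible way to compute it, not necessarily how the speaker did.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class BudgetReport:
    allowed_failures: Fraction   # failures the SLO tolerates in the window
    actual_failures: int         # failures that actually happened
    budget_consumed: Fraction    # 1 means the budget is exactly used up
    remaining_points: Fraction   # percentage points above (+) or below (-) the SLO


def error_budget(total_requests, successful_requests, slo_target):
    """Compute the error budget for one window; slo_target is e.g. "0.999"."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    slo = Fraction(slo_target)
    if not 0 < slo < 1:
        raise ValueError("slo_target must be between 0 and 1")
    allowed = total_requests * (1 - slo)
    actual = total_requests - successful_requests
    success_rate = Fraction(successful_requests, total_requests)
    return BudgetReport(
        allowed_failures=allowed,
        actual_failures=actual,
        budget_consumed=actual / allowed,
        remaining_points=(success_rate - slo) * 100,
    )


week = error_budget(1_000_000, 995_000, "0.999")
month = error_budget(10_000_000, 9_991_000, "0.999")
```

For the 7-day window the SLO allows 1,000 failed requests and 5,000 occurred, so the budget is spent five times over. For the 30-day window 10,000 failures are allowed and 9,000 occurred, which leaves a tenth of the budget.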

  12. Monitoring and alerting
    3

  13. Monitoring and alerting
    Monitoring: Collecting, processing, aggregating and displaying data about a system (latencies, error rates, number of servers, …)
    Alerting: A notification destined for a human, pushed through another system (JIRA tickets, GH Issues, PagerDuty pages, …)

  14. Monitoring
    ◦ Analyze long term trends
    ◦ Historical or experimental comparison
    ◦ Debugging
    ◦ Dashboards
    ◦ Alerting

  15. Alerting
    ◦ Error rate >= SLO Threshold
    ◦ Alert durations
    ◦ Burn rate(s)
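
As a rough illustration of burn-rate alerting, the sketch below follows the approach described in the linked "Alerting on SLOs" chapter rather than anything shown on the slide. The function names are hypothetical. The default threshold of 14.4 is an assumption taken from that chapter: for a 30-day SLO window it corresponds to spending 2% of the error budget within one hour. Requiring both a long and a short window to exceed the threshold is also that chapter's suggestion, so an alert clears soon after the errors stop.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'exactly on budget' errors are arriving."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    return error_rate / (1 - slo_target)


def burn_rate_threshold(budget_fraction, slo_window_hours, alert_window_hours):
    """Burn rate that spends budget_fraction of the budget in alert_window_hours."""
    return budget_fraction * slo_window_hours / alert_window_hours


def should_page(long_window_error_rate, short_window_error_rate,
                slo_target, threshold=14.4):
    """Page only if both windows burn faster than the threshold.

    Assumption: the caller measures error rates over a long window
    (for example 1 hour) and a short window (for example 5 minutes).
    """
    long_burn = burn_rate(long_window_error_rate, slo_target)
    short_burn = burn_rate(short_window_error_rate, slo_target)
    return long_burn > threshold and short_burn > threshold
```

With a 99.9% SLO, a sustained 2% error rate is a burn rate of roughly 20, so `should_page(0.02, 0.03, 0.999)` pages, while a long-window spike that has already recovered in the short window does not.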

  16. Further reading
    ◦ Service Level Objectives
    ◦ Implementing SLOs
    ◦ Practical alerting
    ◦ Availability Table
    ◦ Motivation for Error Budgets
    ◦ Experimental Refactoring
    ◦ Alerting on SLOs

  17. Thanks!
    QUESTIONS?
    @jelmersnoeck
    github.com/jelmersnoeck