Service alerting and monitoring

In this talk, we’ll dive into the wonderful world of observability. We’ll first get to know what those famous SLIs and SLOs stand for and how we can define and utilize them for our application. Once we understand those concepts, we’ll look at how we can tie our alerting and monitoring together in a declarative manner.

Links:
Service Level Objectives: https://sre.google/sre-book/service-level-objectives/
Implementing SLOs: https://sre.google/workbook/implementing-slos/
Practical alerting: https://sre.google/sre-book/practical-alerting/
Availability table: https://sre.google/sre-book/availability-table/
Motivation for Error Budgets: https://sre.google/sre-book/embracing-risk/#xref_risk-management_unreliability-budgets
Experimental refactoring: https://www.youtube.com/watch?v=9MW4H6kFb7M
Alerting on SLOs: https://sre.google/workbook/alerting-on-slos/

Jelmer Snoeck

March 30, 2021

Transcript

  1. Service monitoring and alerting
    Halihax - March 2021


  2. Hello!
    Jelmer Snoeck
    Sr. Site Reliability Engineer

  3. Today’s agenda
    ◦ Service Level what?
    ◦ Error budgets
    ◦ Monitoring and alerting
    ◦ Demos
    ◦ Questions
    ◦ Further reading

  4. Service Level What?
    Let's get on the same page
    1

  5. Service Level What?
    Indicator
    A strictly defined measurement of some part of the system.
    Objective
    What target do we want to reach for this defined measurement?
    Agreement
    What are the consequences if we (don’t) meet the target?

  6. Service Level Indicators
    ◦ Request latency
    ◦ Success rate
    ◦ Durability
    ◦ Throughput

  7. Service Level Objectives
    ◦ 99.9% availability
    ◦ 90% requests <100ms
    ◦ 95% requests <250ms
    ◦ 99.9% requests <500ms
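
As a minimal sketch of how objectives like these could be checked against measured indicators, the snippet below computes a success-rate SLI and three latency SLIs from a list of request records and compares each with its target. The `Request` type, the `evaluate` helper and the sample data are hypothetical and only illustrate the idea; a real setup would read these numbers from a metrics system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    latency_ms: float
    ok: bool


def success_rate(requests):
    """SLI: share of requests that succeeded."""
    return sum(1 for r in requests if r.ok) / len(requests)


def fraction_under(requests, threshold_ms):
    """SLI: share of requests that completed faster than threshold_ms."""
    return sum(1 for r in requests if r.latency_ms < threshold_ms) / len(requests)


# Targets taken from the slide: availability plus (threshold in ms, required share).
AVAILABILITY_SLO = 0.999
LATENCY_SLOS = [(100, 0.90), (250, 0.95), (500, 0.999)]


def evaluate(requests):
    """Return a dict mapping each objective to whether it is currently met."""
    results = {"availability": success_rate(requests) >= AVAILABILITY_SLO}
    for threshold_ms, target in LATENCY_SLOS:
        met = fraction_under(requests, threshold_ms) >= target
        results[f"<{threshold_ms}ms"] = met
    return results


# Hypothetical sample of 1000 requests (made-up data for illustration).
sample = (
    [Request(50, True)] * 920
    + [Request(200, True)] * 40
    + [Request(400, True)] * 38
    + [Request(900, True)] * 1
    + [Request(900, False)] * 1
)
report = evaluate(sample)
print(report)
```

In this made-up sample only the 500ms objective is missed, because 2 out of 1000 requests took longer than 500ms.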

  8. Service Level Agreements
    ◦ 99.5-99.9% availability => 10% credited
    ◦ 95-99.5% availability => 15% credited
    ◦ <95% availability => 25% credited
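
The credit tiers above can be sketched as a small lookup. The function name `sla_credit_percent` is hypothetical, and the boundary handling is an assumption: the slide does not say which tier an availability of exactly 99.5% or 95% falls into, so this sketch treats each lower bound as inclusive and gives no credit at 99.9% or higher.

```python
def sla_credit_percent(availability_percent):
    """Return the percentage credited for a measured availability.

    Assumption: lower bounds are inclusive; 99.9% and above earns no credit.
    """
    if availability_percent >= 99.9:
        return 0
    if availability_percent >= 99.5:
        return 10
    if availability_percent >= 95.0:
        return 15
    return 25


for measured in (99.95, 99.7, 97.0, 90.0):
    print(f"{measured}% availability: {sla_credit_percent(measured)}% credited")
```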

  9. Error Budgets
    2

  10. Practical Error Budgets
    ◦ SLO:
    ▫ 99.9% success for the past 7 days
    ▫ 99.9% success for the past 30 days
    ◦ 1,000,000 requests past 7 days
    ▫ 995,000 successful requests (99.5%)
    ◦ 10,000,000 requests past 30 days
    ▫ 9,991,000 successful requests (99.91%)


  11. Practical Error Budgets
    ◦ Negative budget for the past 7 days
    ▫ 99.5% success rate (0.4 percentage points below the 99.9% SLO, budget overspent)
    ◦ Positive budget for the past 30 days
    ▫ 99.91% success rate (0.01 percentage points above the 99.9% SLO, budget left)
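
The arithmetic behind these two budgets can be reproduced with a short sketch. It uses the request counts from the previous slide; the `error_budget` helper is a hypothetical name, and `Fraction` is used only to keep the percentages exact instead of relying on floating point.

```python
from fractions import Fraction


def error_budget(slo, total_requests, successful_requests):
    """Return (remaining budget in requests, remaining budget as % of all requests).

    A negative result means the budget for the window is overspent.
    """
    allowed_failures = (1 - slo) * total_requests
    actual_failures = total_requests - successful_requests
    remaining = allowed_failures - actual_failures
    return remaining, remaining / total_requests * 100


SLO = Fraction("0.999")  # 99.9% success

week = error_budget(SLO, 1_000_000, 995_000)
month = error_budget(SLO, 10_000_000, 9_991_000)

print(f"7 days:  {float(week[0]):+.0f} requests, {float(week[1]):+.2f}%")
print(f"30 days: {float(month[0]):+.0f} requests, {float(month[1]):+.2f}%")
```

Over 7 days the SLO allows 1,000 failed requests but 5,000 failed, so the budget is 4,000 requests (0.4% of traffic) in the red. Over 30 days it allows 10,000 failures and 9,000 occurred, leaving 1,000 requests (0.01% of traffic).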


  12. Monitoring and alerting
    3

  13. Monitoring and alerting
    Monitoring
    Collecting, processing, aggregating and displaying data about a system
    (latencies, error rates, number of servers, …)
    Alerting
    A notification destined for a human, pushed through another system
    (JIRA tickets, GH Issues, PagerDuty pages, …)

  14. Monitoring
    ◦ Analyze long term trends
    ◦ Historical or experimental comparison
    ◦ Debugging
    ◦ Dashboards
    ◦ Alerting

  15. Alerting
    ◦ Error rate >= SLO Threshold
    ◦ Alert durations
    ◦ Burn rate(s)
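
To illustrate the burn-rate idea, here is a minimal sketch. A burn rate of 1 means errors arrive at exactly the pace that would use up the whole error budget over the SLO window; higher values use it up faster. The function names are hypothetical. The 14.4 threshold, and the idea of requiring both a long and a short window to exceed it, are assumptions borrowed from the example in the linked "Alerting on SLOs" chapter (which pairs a 1-hour window with a 5-minute window); they are not stated on the slide.

```python
def burn_rate(error_rate, slo):
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    return error_rate / (1 - slo)


def should_page(long_window_error_rate, short_window_error_rate, slo, threshold=14.4):
    """Page only if both windows burn budget faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long_window_error_rate, slo) > threshold
        and burn_rate(short_window_error_rate, slo) > threshold
    )


SLO = 0.999

# 0.5% errors against a 99.9% SLO burns the budget about 5x too fast.
print(round(burn_rate(0.005, SLO), 2))

# 2% errors in both windows is roughly a 20x burn: page.
print(should_page(0.02, 0.02, SLO))

# The long window is still bad, but the short window has recovered: no page.
print(should_page(0.02, 0.0005, SLO))
```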

  16. Demos
    4

  17. Further reading
    ◦ Service Level Objectives
    ◦ Implementing SLOs
    ◦ Practical alerting
    ◦ Availability Table
    ◦ Motivation for Error Budgets
    ◦ Experimental Refactoring
    ◦ Alerting on SLOs

  18. Thanks!
    QUESTIONS?
    @jelmersnoeck
    github.com/jelmersnoeck