Service alerting and monitoring

Service monitoring and alerting
Halihax - March 2021

Hello!
Jelmer Snoeck
Sr. Site Reliability Engineer

Today’s agenda
◦ Service Level what?
◦ Error budgets
◦ Monitoring and alerting
◦ Demos
◦ Questions
◦ Further reading

Service Level What?
Let’s get on the same page

Service Level What?
Indicator: A strictly defined measurement of some part of the system.
Objective: What target do we want to reach for this defined measurement?
Agreement: What are the consequences if we (don’t) meet the target?

Service Level Indicators
◦ Request latency
◦ Success rate
◦ Durability
◦ Throughput

Service Level Objectives
◦ 99.9% availability
◦ 90% of requests <100ms
◦ 95% of requests <250ms
◦ 99.9% of requests <500ms
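The latency objectives above can be checked against a sample of observed request latencies. The following is a minimal sketch, not the speaker's implementation: the function names, the `LATENCY_SLOS` table, and the sample data are all hypothetical, and only the thresholds and percentages come from the slide.

```python
# Hypothetical helper names; only the thresholds/targets are from the slide.

# (threshold in ms, required fraction of requests faster than it)
LATENCY_SLOS = [(100, 0.90), (250, 0.95), (500, 0.999)]


def fraction_under(latencies_ms, threshold_ms):
    """Fraction of requests that completed strictly faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def check_latency_slos(latencies_ms, slos=LATENCY_SLOS):
    """Return {threshold_ms: (observed_fraction, objective_met)}."""
    results = {}
    for threshold_ms, target in slos:
        observed = fraction_under(latencies_ms, threshold_ms)
        results[threshold_ms] = (observed, observed >= target)
    return results


if __name__ == "__main__":
    # Made-up sample: 90 fast, 5 medium and 5 slow requests.
    sample = [50] * 90 + [200] * 5 + [600] * 5
    for threshold_ms, (observed, met) in check_latency_slos(sample).items():
        print(f"<{threshold_ms}ms: {observed:.1%} met={met}")
```

In this made-up sample the first two objectives are met, while the <500ms objective is missed because 5% of requests took 600ms.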

Service Level Agreements
◦ 99.5-99.9% availability => 10% credited
◦ 95-99.5% availability => 15% credited
◦ <95% availability => 25% credited
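The credit tiers above map a measured availability to a credit percentage. A small sketch of that mapping follows. Two points are assumptions rather than something the slide states: availability at or above 99.9% earns no credit, and each boundary value (99.5%, 95%) belongs to the higher-availability tier. The function name is hypothetical.

```python
def sla_credit_percent(availability_percent):
    """Map measured availability (in percent) to the credited percentage.

    Assumptions (not on the slide): >= 99.9% means no credit, and the
    boundary values 99.5 and 95 fall into the higher-availability tier.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    if availability_percent >= 99.9:
        return 0
    if availability_percent >= 99.5:
        return 10
    if availability_percent >= 95:
        return 15
    return 25


if __name__ == "__main__":
    for availability in (99.95, 99.7, 97.0, 90.0):
        print(availability, "->", sla_credit_percent(availability), "% credited")
```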

Error Budgets

Practical Error Budgets
◦ SLO:
  ▫ 99.9% success for the past 7 days
  ▫ 99.9% success for the past 30 days
◦ 1,000,000 requests in the past 7 days
  ▫ 995,000 successful requests (99.5%)
◦ 10,000,000 requests in the past 30 days
  ▫ 9,991,000 successful requests (99.91%)

Practical Error Budgets
◦ Negative budget for the past 7 days
  ▫ 99.5% success rate (0.4 percentage points below the 99.9% target: budget overspent)
◦ Positive budget for the past 30 days
  ▫ 99.91% success rate (0.01 percentage points above the 99.9% target: budget left)
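The arithmetic behind these two windows can be reproduced in a few lines. This is a sketch using the request counts from the slides; the function name and the returned field names are hypothetical. `fractions.Fraction` is used so that values such as 99.9% are handled exactly instead of with floating-point rounding.

```python
from fractions import Fraction


def error_budget(total_requests, successful_requests, slo_percent):
    """Compute error-budget figures for one window.

    slo_percent is passed as a string (e.g. "99.9") so it converts exactly.
    """
    slo = Fraction(slo_percent)
    success_rate = Fraction(successful_requests, total_requests) * 100
    return {
        "success_rate": success_rate,                       # in percent
        "budget_left": success_rate - slo,                  # in percentage points
        "allowed_failures": total_requests * (100 - slo) / 100,
        "actual_failures": total_requests - successful_requests,
    }


if __name__ == "__main__":
    windows = {
        "7 days": (1_000_000, 995_000),
        "30 days": (10_000_000, 9_991_000),
    }
    for name, (total, ok) in windows.items():
        b = error_budget(total, ok, "99.9")
        print(
            f"{name}: success {float(b['success_rate'])}%, "
            f"budget {float(b['budget_left']):+}pp, "
            f"failures {b['actual_failures']} of {int(b['allowed_failures'])} allowed"
        )
```

For the 7-day window the budget allows 1,000 failed requests and 5,000 occurred; for the 30-day window it allows 10,000 and 9,000 occurred.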

Monitoring and alerting

Monitoring and alerting
Monitoring: Collecting, processing, aggregating and displaying data about a system (latencies, error rates, number of servers, …)
Alerting: A notification destined for a human, pushed through another system (JIRA tickets, GitHub Issues, PagerDuty pages, …)

Monitoring
◦ Analyze long-term trends
◦ Historical or experimental comparison
◦ Debugging
◦ Dashboards
◦ Alerting

Alerting
◦ Error rate >= SLO Threshold
◦ Alert durations
◦ Burn rate(s)
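The slide lists burn rates without defining them. One common definition, assumed here, is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below applies that definition to the request counts from the error budget slides. The function names, the two-window paging check, and the threshold value of 2 are illustrative assumptions, not values from the talk.

```python
from fractions import Fraction


def burn_rate(failed_requests, total_requests, slo_percent):
    """Observed error rate divided by the error rate the SLO allows.

    slo_percent is a string such as "99.9" so it converts exactly.
    """
    allowed_error_rate = 1 - Fraction(slo_percent) / 100
    if allowed_error_rate == 0:
        raise ValueError("a 100% SLO leaves no error budget to burn")
    observed_error_rate = Fraction(failed_requests, total_requests)
    return observed_error_rate / allowed_error_rate


def should_page(long_window_rate, short_window_rate, threshold):
    """Page only if both windows burn faster than the threshold.

    Requiring the short window too avoids paging for an incident that has
    already ended. The threshold is an illustrative parameter.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold


if __name__ == "__main__":
    week = burn_rate(5_000, 1_000_000, "99.9")     # 7-day window from the slides
    month = burn_rate(9_000, 10_000_000, "99.9")   # 30-day window from the slides
    print("7-day burn rate:", float(week))
    print("30-day burn rate:", float(month))
    print("page?", should_page(month, week, threshold=2))
```

With these numbers the 7-day window burns budget five times faster than allowed while the 30-day window is still just under 1, which is why the check over both windows does not page.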

Demos

Further reading
◦ Service Level Objectives
◦ Implementing SLOs
◦ Practical alerting
◦ Availability Table
◦ Motivation for Error Budgets
◦ Experimental Refactoring
◦ Alerting on SLOs

Thanks!
QUESTIONS?
@jelmersnoeck
github.com/jelmersnoeck
