Observability

2 SLI, SLO, SLA
➔ Service Level Indicator (SLI): A quantitative measure of some aspect of the system (e.g., request duration, error rate, system throughput...)
➔ Service Level Objective (SLO): A target value or range of values for an SLI (e.g., request duration under 100ms, 99.95% availability...)
➔ Service Level Agreement (SLA): An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs (e.g., charge only 50% of the monthly bill if the availability SLO is not met)
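
To make the relationship between an SLI and an SLO concrete, here is a minimal Go sketch that computes an availability SLI from request counts and compares it with a 99.95% target. The function names, the request numbers, and the "error budget" helper are assumptions made for illustration; they are not from the deck.

```go
package main

import "fmt"

// availability returns the availability SLI: the fraction of requests
// that succeeded. With no traffic it reports 1 (an assumption).
func availability(total, failed uint64) float64 {
	if total == 0 {
		return 1
	}
	if failed > total {
		failed = total
	}
	return float64(total-failed) / float64(total)
}

// errorBudgetRemaining returns the fraction of the error budget that is
// still unspent for the given SLO target (e.g., 0.9995).
// A negative result means the SLO has been missed.
func errorBudgetRemaining(total, failed uint64, target float64) float64 {
	allowed := (1 - target) * float64(total) // failures the SLO tolerates
	if allowed == 0 {
		return 0
	}
	return 1 - float64(failed)/allowed
}

func main() {
	const target = 0.9995 // the SLO: 99.95% availability
	total, failed := uint64(1_000_000), uint64(200)

	sli := availability(total, failed)
	fmt.Printf("SLI: %.4f%%\n", sli*100)
	fmt.Printf("SLO met: %t\n", sli >= target)
	fmt.Printf("Error budget left: %.0f%%\n", errorBudgetRemaining(total, failed, target)*100)
}
```

An SLA would then attach a consequence to `sli >= target` being false over the billing period.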

3 Recommended Books

4 Monitoring Framework
[Diagram: Business Logic / Application / Operating System → Events, Logs, Metrics → Monitoring Framework (Event Router) → Destinations (Store, Graph, Alert)]
Source: “The Art of Monitoring”

Logging 5

6 Logging
➔ One of the simplest forms of tracking temporal information
➔ Very easy to write but not that easy to read (or parse)
➔ Great for knowing what happened and when it happened
➔ In modern environments, always log to stdout or stderr

7 Logging Levels
➔ DEBUG: Development only
➔ INFO: Useful actions (e.g., “beginning transaction”)
➔ WARNING: Conditions that could become an error (e.g., “disk is at 80% of its capacity”)
➔ ERROR: Error conditions (e.g., API call failures, internal errors)
➔ FATAL: Unrecoverable errors (e.g., “could not connect to database”)

8 Practical Logging (Go)
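
The code shown on this slide is not part of the transcript. As a stand-in, here is a minimal sketch of leveled, structured logging to stdout using the standard library's `log/slog` package (Go 1.21+). The `LOG_LEVEL` environment variable, the `levelFor` helper, and the custom `LevelFatal` value are assumptions made for this example; `slog` itself only defines Debug, Info, Warn, and Error.

```go
package main

import (
	"context"
	"log/slog"
	"os"
	"strings"
)

// LevelFatal is a custom level defined for this sketch; slog has no
// built-in FATAL level.
const LevelFatal = slog.Level(12)

// levelFor maps a level name to a slog level.
// Unknown or empty names fall back to INFO.
func levelFor(name string) slog.Level {
	switch strings.ToUpper(name) {
	case "DEBUG":
		return slog.LevelDebug
	case "WARNING", "WARN":
		return slog.LevelWarn
	case "ERROR":
		return slog.LevelError
	case "FATAL":
		return LevelFatal
	default:
		return slog.LevelInfo
	}
}

func main() {
	// LOG_LEVEL is a hypothetical variable name chosen for this example.
	opts := &slog.HandlerOptions{Level: levelFor(os.Getenv("LOG_LEVEL"))}
	// Log to stdout and let the platform collect and ship the stream.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, opts))

	logger.Debug("cache miss", "key", "user:42") // dropped unless LOG_LEVEL=DEBUG
	logger.Info("beginning transaction", "tx_id", 1)
	logger.Warn("disk is at 80% of its capacity", "mount", "/var")
	logger.Error("API call failed", "status", 502)
	// A real service would typically exit after logging a fatal condition.
	logger.Log(context.Background(), LevelFatal, "could not connect to database")
}
```

Emitting one JSON object per line keeps the logs easy to write and also easy to parse downstream.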

9 Fluentd

Telemetry 10

11 Telemetry
“Telemetry is an automated communications process by which measurements and other data are collected at remote or inaccessible points and transmitted to receiving equipment for monitoring.”
(Stolen from Wikipedia)

12 Why use telemetry?
➔ Safer deployments
➔ Shift to a hypothesis-driven mindset
➔ Track SLOs

13 Telemetry
“Every time NASA launches a rocket, it has millions of automated sensors reporting the status of every component of this valuable asset. And yet, we often don’t take the same care with software - we found that creating application and infrastructure telemetry to be one of the highest-return investments we’ve made. (...)”
Scott Prugh, Chief Architect at CSG

14 Metrics Levels
➔ Business level: Number of sales, revenue of transactions, user signups...
➔ Application level: Transaction durations, user response times, application faults...
➔ Infrastructure level: CPU usage, memory saturation, disk pressure...
➔ Client software level: Errors and crashes on the browser...
➔ CI/CD pipeline level: Deployment frequencies, environment status...

15 Prometheus
➔ A systems monitoring and alerting toolkit
➔ Stores time-series data
➔ Prometheus is not an event-based system!

16 Prometheus - Architecture

17 Prometheus - Metric Types
➔ Counter: A cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.
➔ Gauge: A metric that represents a single numerical value that can arbitrarily go up and down.
➔ Histogram: Sampled observations counted in configurable buckets. It also provides a sum of all observed values.
➔ Summary: Sampled observations. It provides a total count of observations and a sum of all observed values, and it calculates configurable quantiles over a sliding time window.

18 Prometheus - Counter vs Gauge
➔ Counter: Use counters when the value will never decrease (e.g., total errors, requests received, events consumed...)
➔ Gauge: Use gauges when the value can decrease (e.g., current temperature, currently running processes...)
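
Because a counter only goes up or resets to zero on restart, any drop between two samples can be read as a reset. The sketch below shows that idea with a hypothetical `increase` helper written for this example. It is a simplification: Prometheus's own query functions also account for the edges of the time window, which is not modeled here.

```go
package main

import "fmt"

// increase returns how much a counter grew across consecutive samples.
// A drop between two samples is treated as a reset to zero, so the new
// value counts in full.
func increase(samples []float64) float64 {
	var total float64
	for i := 1; i < len(samples); i++ {
		delta := samples[i] - samples[i-1]
		if delta < 0 {
			// The process restarted and the counter began again at zero.
			delta = samples[i]
		}
		total += delta
	}
	return total
}

func main() {
	// Scraped values of a request counter; the process restarts
	// between the third and fourth sample.
	requests := []float64{100, 130, 170, 5, 25}
	fmt.Println("requests served:", increase(requests))
}
```

The same reasoning does not work for a gauge: a drop in a gauge is a legitimate value, not a reset.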

19 Quantiles

20 Prometheus - Histogram vs Summary
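
A histogram keeps only cumulative bucket counts, so quantiles have to be estimated from them at query time. The sketch below uses only the Go standard library and hypothetical names (`Histogram`, `Observe`, `Quantile`); it is not the Prometheus client API. The estimate interpolates linearly inside the bucket that contains the requested rank, which is similar in spirit to PromQL's `histogram_quantile`, so its accuracy depends on how the bucket bounds are chosen.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Histogram counts observations in cumulative buckets.
type Histogram struct {
	bounds []float64 // finite upper bounds, ascending
	counts []uint64  // counts[i] = observations <= bounds[i]; last entry is the +Inf bucket
	sum    float64   // sum of all observed values
}

// NewHistogram creates a histogram with the given finite upper bounds.
func NewHistogram(bounds ...float64) *Histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &Histogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// Observe records v in every bucket whose upper bound is >= v.
func (h *Histogram) Observe(v float64) {
	h.sum += v
	for i := sort.SearchFloat64s(h.bounds, v); i < len(h.counts); i++ {
		h.counts[i]++
	}
}

// Quantile estimates the q-quantile (0 <= q <= 1) by linear interpolation
// inside the bucket that contains the target rank.
func (h *Histogram) Quantile(q float64) float64 {
	total := h.counts[len(h.counts)-1]
	if total == 0 || len(h.bounds) == 0 {
		return math.NaN()
	}
	rank := q * float64(total)
	i := sort.Search(len(h.counts), func(i int) bool {
		return float64(h.counts[i]) >= rank
	})
	if i == len(h.bounds) {
		// Rank falls in the +Inf bucket: report the highest finite bound.
		return h.bounds[len(h.bounds)-1]
	}
	lower, below := 0.0, 0.0 // the lowest bucket is assumed to start at 0
	if i > 0 {
		lower, below = h.bounds[i-1], float64(h.counts[i-1])
	}
	inBucket := float64(h.counts[i]) - below
	if inBucket == 0 {
		return h.bounds[i]
	}
	return lower + (h.bounds[i]-lower)*(rank-below)/inBucket
}

func main() {
	h := NewHistogram(0.1, 0.5, 1) // request durations in seconds
	for _, d := range []float64{0.05, 0.2, 0.3, 0.7} {
		h.Observe(d)
	}
	fmt.Printf("estimated median: %.2fs\n", h.Quantile(0.5))
}
```

A summary, by contrast, computes its quantiles on the client, so they cannot be meaningfully aggregated across instances afterwards.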

21 Prometheus - Best Practices
➔ Namespace prefix (e.g., proteus_active_sheets)
➔ Unit suffix (e.g., proteus_http_request_duration_seconds, http_requests_total)
➔ Never mix units (seconds and milliseconds are not the same thing!)
➔ Use labels to differentiate characteristics of measurements
  ◆ Beware of label values with high cardinality!

22 Prometheus - Practical Metrics (Go)

Graphing 23

24 Graphing

25 Graphing - Best Practices
➔ Only plot what you need to observe (what is useful when you need to take action?)
➔ Have few graphs per dashboard
➔ Focus on what shows the user impact
➔ Beware of meaningless or unhelpful aggregations
➔ Graphs are for humans, so focus on ease of visualization

26 Grafana - Demo

Alerting 27

28 Alerting
➔ Don’t just alert on failure, also alert to prevent failures
➔ Be careful when alerting based on standard deviation
  ◆ (You cannot assume your time series has a Gaussian distribution)
➔ Beware of false alerts
➔ Only alert if someone must take action; if not, the alert becomes spam
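
One way to alert before a failure happens is to extrapolate a trend instead of waiting for a static threshold to be crossed. The sketch below fits a least-squares line through recent disk-usage samples and asks whether that line reaches a limit within a given horizon. The `Sample` type, the function names, the 90% limit, and the sample values are all assumptions made for this example. PromQL's `predict_linear` offers a comparable linear prediction on the Prometheus side.

```go
package main

import "fmt"

// Sample is one measurement of a gauge at a point in time.
type Sample struct {
	T float64 // seconds since the start of the window
	V float64 // e.g., fraction of disk used (0..1)
}

// linearFit returns the slope and intercept of the least-squares line
// through the samples. ok is false when a line cannot be fitted.
func linearFit(s []Sample) (slope, intercept float64, ok bool) {
	if len(s) < 2 {
		return 0, 0, false
	}
	n := float64(len(s))
	var sumT, sumV, sumTT, sumTV float64
	for _, p := range s {
		sumT += p.T
		sumV += p.V
		sumTT += p.T * p.T
		sumTV += p.T * p.V
	}
	den := n*sumTT - sumT*sumT
	if den == 0 {
		return 0, 0, false // all samples share the same timestamp
	}
	slope = (n*sumTV - sumT*sumV) / den
	intercept = (sumV - slope*sumT) / n
	return slope, intercept, true
}

// shouldAlert reports whether the fitted trend reaches limit within
// horizon seconds after the last sample.
func shouldAlert(s []Sample, limit, horizon float64) bool {
	slope, intercept, ok := linearFit(s)
	if !ok {
		return false
	}
	future := s[len(s)-1].T + horizon
	return slope*future+intercept >= limit
}

func main() {
	// Hourly disk usage; still below 90%, but growing 2 points per hour.
	usage := []Sample{{0, 0.70}, {3600, 0.72}, {7200, 0.74}, {10800, 0.76}}
	fmt.Println("full within 4h: ", shouldAlert(usage, 0.90, 4*3600))
	fmt.Println("full within 12h:", shouldAlert(usage, 0.90, 12*3600))
}
```

The horizon should match how long someone needs to act; a prediction nobody can act on is just another spam alert.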

29 Useful Links
https://prometheus.io/docs/prometheus/latest/ (RTFM!)
http://latencytipoftheday.blogspot.com/
https://bravenewgeek.com/everything-you-know-about-latency-is-wrong/
https://www.robustperception.io
https://github.com/roaldnefs/awesome-prometheus
https://landing.google.com/sre/books/
https://www.reddit.com/r/PrometheusMonitoring/

30 Thank you! [email protected]


felipe
