Observability

felipe
November 20, 2019

Transcript

  1. SLI, SLO, SLA
    ➔ Service Level Indicator (SLI): A quantitative measure of some aspect of the system (e.g., request duration, error rate, system throughput).
    ➔ Service Level Objective (SLO): A target value or range of values for an SLI (e.g., request duration under 100ms, 99.95% availability).
    ➔ Service Level Agreement (SLA): An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs (e.g., charge only 50% of the monthly bill if the availability SLO is not met).
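The relationship between the three terms can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the names `availability_sli`, `slo_met`, and `monthly_charge` are hypothetical, and treating "no traffic" as fully available is an assumption made only for the example.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySLO:
    # Target fraction of successful requests, e.g. 0.9995 for 99.95%.
    target: float


def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return successful_requests / total_requests


def slo_met(sli: float, slo: AvailabilitySLO) -> bool:
    """SLO: the measured SLI must reach the target."""
    return sli >= slo.target


def monthly_charge(full_bill: float, sli: float, slo: AvailabilitySLO) -> float:
    """SLA consequence from the slide: charge only 50% if the SLO is missed."""
    return full_bill if slo_met(sli, slo) else full_bill * 0.5
```

For example, 9,990 successful requests out of 10,000 gives an SLI of 99.9%, which misses a 99.95% SLO, so `monthly_charge` returns half of the bill.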
  2. Monitoring Framework
    [Diagram labels: Business Logic, Application, Operating System; Events, Logs, Metrics; Monitoring Framework; Event Router; Destinations (Store, Graph, Alert)]
    Source: “The Art of Monitoring”
  3. Logging
    ➔ One of the simplest forms of tracking temporal information
    ➔ Very easy to write, but not that easy to read (or parse)
    ➔ Great for knowing what happened and when it happened
    ➔ In modern environments, always log to stdout or stderr
  4. Logging Levels
    ➔ DEBUG: Development only
    ➔ INFO: Useful actions (e.g., “beginning transaction”)
    ➔ WARNING: Conditions that could become an error (e.g., “disk is at 80% of its capacity”)
    ➔ ERROR: Error conditions (e.g., API call failures, internal errors)
    ➔ FATAL: Unrecoverable errors (e.g., “could not connect to database”)
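As a rough illustration of the two logging slides, here is a small sketch using Python's standard `logging` module to write leveled messages to stdout. The logger name `app`, the helper `build_logger`, and the format string are assumptions chosen for the example. Note that Python calls the highest level `CRITICAL`; `logging.FATAL` is an alias for it.

```python
import logging
import sys


def build_logger(name: str = "app", level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes to stdout (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if not logger.handlers:  # do not attach a second handler on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.debug("cache lookup details")  # suppressed: below the INFO threshold
logger.info("beginning transaction")
logger.warning("disk is at 80% of its capacity")
logger.error("API call failed")
logger.critical("could not connect to database")  # Python's name for FATAL
```

In production the threshold would typically stay at INFO or above, matching the slide's note that DEBUG is for development only.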
  5. Telemetry
    “Telemetry is an automated communications process by which measurements and other data are collected at remote or inaccessible points and transmitted to receiving equipment for monitoring.” (Taken from Wikipedia)
  6. Telemetry
    “Every time NASA launches a rocket, it has millions of automated sensors reporting the status of every component of this valuable asset. And yet, we often don’t take the same care with software - we found that creating application and infrastructure telemetry to be one of the highest return over investment we’ve made. (...)”
    Scott Prugh, Chief Architect at CSG
  7. Metrics Levels
    ➔ Business level: Number of sales, revenue of transactions, user signups...
    ➔ Application level: Transaction durations, user response times, application faults...
    ➔ Infrastructure level: CPU usage, memory saturation, disk pressure...
    ➔ Client software level: Errors and crashes in the browser...
    ➔ CI/CD pipeline level: Deployment frequencies, environment status...
  8. Prometheus
    ➔ A systems monitoring and alerting toolkit
    ➔ Stores time-series data
    ➔ Prometheus is not an event-based system!
  9. Prometheus - Metric Types
    ➔ Counter: A cumulative metric that represents a monotonically increasing counter whose value can only increase or be reset to zero on restart.
    ➔ Gauge: A metric that represents a single numerical value that can arbitrarily go up and down.
    ➔ Histogram: Sampled observations counted in configurable buckets. It also provides a sum of all observed values.
    ➔ Summary: Sampled observations. It provides a total count of observations and a sum of all observed values, and calculates configurable quantiles over a sliding time window.
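To make the histogram description more concrete, here is a plain-Python sketch of the bookkeeping it implies: observations counted in configurable buckets, plus a running sum and count. This is not the Prometheus client library; `HistogramSketch` and its methods are hypothetical names. The sketch assumes each bucket's upper bound is inclusive and reports buckets cumulatively, which is how Prometheus exposes them.

```python
import bisect
import math


class HistogramSketch:
    """Illustrative histogram: bucket counts, sum, and count of observations."""

    def __init__(self, upper_bounds):
        # A final +Inf bucket catches everything above the largest bound.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.total_sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # First bucket whose upper bound is >= value (bounds are inclusive).
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.total_sum += value
        self.count += 1

    def cumulative_buckets(self):
        """Return (upper_bound, observations <= upper_bound) pairs."""
        running = 0
        result = []
        for bound, bucket_count in zip(self.upper_bounds, self.bucket_counts):
            running += bucket_count
            result.append((bound, running))
        return result


histogram = HistogramSketch([0.1, 0.5, 1.0])
for duration_seconds in (0.05, 0.1, 0.3, 2.0):
    histogram.observe(duration_seconds)
```

Because only bucket counts are stored, quantiles derived from a histogram are estimates whose accuracy depends on how well the bucket bounds match the data.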
  10. Prometheus - Counter vs Gauge
    ➔ Counter: Use counters when the value will never decrease (e.g., total errors, requests received, events consumed).
    ➔ Gauge: Use gauges when the value may decrease (e.g., current temperature, currently running processes).
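The distinction can be sketched with two tiny classes. These are illustrative stand-ins, not the Prometheus client API; `CounterSketch` and `GaugeSketch` are hypothetical names that only model the "never decreases" versus "may go up and down" rule.

```python
class CounterSketch:
    """Value can only go up (a real counter also resets to zero on restart)."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class GaugeSketch:
    """Value can be set, raised, or lowered arbitrarily."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


requests_received = CounterSketch()
requests_received.inc()
requests_received.inc()

running_processes = GaugeSketch()
running_processes.set(5)
running_processes.dec(2)
```

A counter's absolute value is rarely interesting on its own; what is usually graphed is its rate of change over time.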
  11. Prometheus - Best Practices
    ➔ Namespace prefix (e.g., proteus_active_sheets)
    ➔ Unit suffix (e.g., proteus_http_request_duration_seconds, http_requests_total)
    ➔ Never mix units (seconds and milliseconds are not the same thing!)
    ➔ Use labels to differentiate characteristics of measurements
      ◆ Beware of label values with high cardinality!
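A hedged sketch of how these conventions might be enforced in code follows. The helper names are hypothetical, and the regular expression reflects the character set traditionally recommended for Prometheus metric names. `series_count` gives an upper bound on the number of time series a metric can produce, which is why high-cardinality labels (user IDs, request IDs) are risky.

```python
import re
from typing import Dict, Set

# Character set traditionally recommended for Prometheus metric names.
_VALID_METRIC_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def build_metric_name(namespace: str, subject: str, unit_suffix: str) -> str:
    """Compose <namespace>_<subject>_<unit>, e.g. with a 'seconds' or 'total' suffix."""
    name = f"{namespace}_{subject}_{unit_suffix}"
    if not _VALID_METRIC_NAME.match(name):
        raise ValueError(f"invalid metric name: {name!r}")
    return name


def milliseconds_to_seconds(milliseconds: float) -> float:
    """Convert to seconds before recording, so units are never mixed."""
    return milliseconds / 1000.0


def series_count(label_values: Dict[str, Set[str]]) -> int:
    """Upper bound on time series: product of distinct values per label."""
    total = 1
    for values in label_values.values():
        total *= len(values)
    return total
```

With two HTTP methods and three status codes the bound is 6 series; adding a label with 100,000 distinct user IDs would raise it to 600,000.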
  12. Graphing - Best Practices
    ➔ Only plot what you need to observe (what is useful when you need to take action?)
    ➔ Keep the number of graphs per dashboard small
    ➔ Focus on what shows the user impact
    ➔ Beware of meaningless and unhelpful aggregations
    ➔ Graphs are for humans; focus on ease of visualization
  13. Alerting
    ➔ Don’t just alert on failure; also alert to prevent failures
    ➔ Be careful when alerting based on standard deviation
      ◆ (You cannot assume your time series has a Gaussian distribution)
    ➔ Beware of false alerts
    ➔ Only alert if someone must take action; otherwise, the alert becomes spam
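One way to "alert to prevent failures" is to extrapolate a trend and alert before a limit is reached. The sketch below fits a least-squares line to disk-usage samples and estimates the hours left until the disk is full. The function names, the 24-hour horizon, and the sample format `(hour, percent_used)` are all assumptions for illustration; a linear trend is itself a simplification.

```python
from typing import List, Optional, Tuple


def hours_until_full(
    samples: List[Tuple[float, float]], capacity: float = 100.0
) -> Optional[float]:
    """Estimate hours from the last sample until usage reaches capacity.

    samples holds (hour, percent_used) pairs. Returns None when there is
    no upward trend to extrapolate.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / denominator
    if slope <= 0:
        return None  # usage is flat or shrinking
    intercept = mean_v - slope * mean_t
    hour_full = (capacity - intercept) / slope
    return max(0.0, hour_full - samples[-1][0])


def should_alert(samples: List[Tuple[float, float]], horizon_hours: float = 24.0) -> bool:
    """Alert only if someone must act: the disk would fill within the horizon."""
    remaining = hours_until_full(samples)
    return remaining is not None and remaining <= horizon_hours
```

For samples growing from 70% to 76% over three hours, the fitted slope is 2% per hour and the estimate is 12 hours until full, so `should_alert` fires with a 24-hour horizon. A flat series never triggers the alert.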