Monitor What Matters

Connor Mendenhall

March 08, 2021

Transcript

  1. 8th Light, Inc. | Software is our craft.™ Why monitor?

    Automated tests: “The login endpoint should return a 403 when I enter the wrong password”
    • Describe expectations
    • Run continuously
    • Provide design feedback
    • Enable change
    • Communicate mental models

    Monitoring: “The login endpoint should return a response in under 300 milliseconds”
    • Describe expectations
    • Run continuously
    • Provide design feedback
    • Enable change
    • Communicate mental models
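
The monitoring expectation on this slide can be written down much like a test. The following is a minimal sketch, not the deck's own code: `fake_login_request` is a hypothetical stand-in for a real call to the login endpoint, and the 300 millisecond budget is taken from the slide.

```python
import time

# Latency budget from the slide: responses should arrive in under 300 ms.
LATENCY_BUDGET_SECONDS = 0.300


def measure_latency(operation):
    """Run operation once and return how long it took, in seconds."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def check_latency(operation, budget=LATENCY_BUDGET_SECONDS):
    """Describe the expectation: the operation finishes within the budget."""
    elapsed = measure_latency(operation)
    return {"elapsed_seconds": elapsed, "within_budget": elapsed < budget}


def fake_login_request():
    """Hypothetical stand-in for an HTTP request to the login endpoint."""
    time.sleep(0.01)


result = check_latency(fake_login_request)
```

Run on a schedule against production, a check like this behaves like a continuously executed test whose failures feed alerts rather than a build status.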
  2. “Testing in Production”, Cindy Sridharan

    Testing and monitoring are tools for quality.
  3. Monitoring vs observability

    Monitoring is a practice. Observability is a characteristic.
    Monitoring is about known unknowns. Observability is about unknown unknowns.
    Monitoring is like testing. Observability is like debugging.
  4. Logs

    Just about everything emits logs. Logs are an immutable record of discrete events over time. Logs are useful as an authoritative record of what happened when.
  5. Logs

    Unstructured (Terraform):
        2020/10/02 09:20:28 [INFO] Terraform version: 0.13.3
        2020/10/02 09:20:28 [INFO] CLI args: []string{"/usr/local/bin/terraform", "apply"}
        2020/10/02 09:20:28 [DEBUG] Trying to get account information via sts:GetCallerIdentity
        2020/10/02 09:20:28 [TRACE] Meta.Backend: instantiated backend of type *s3.Backend

    Semi-structured (nginx):
        10.0.2.95 - - [13/Aug/2017:14:09:37 +0000] "GET / HTTP/1.1" 200 13627 "-" "ELB-HealthChecker/2.0"
        10.0.1.82 - - [13/Aug/2017:14:09:48 +0000] "GET / HTTP/1.1" 200 13627 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64)"
        10.0.1.82 - - [13/Aug/2017:14:14:19 +0000] "GET / HTTP/1.1" 200 3111 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

    Structured (fluentd):
        2015-09-01 15:10:40 -0600 docker.3fd487e: {"source":"stdout","log":"Hello Fluentd!","container_id":"3fd8678d487e540c7a303e1613101e746c5012f3317434eda93","container_name":"/angry_kalam"}
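
To make the difference between unstructured and structured logs concrete, here is a small sketch that emits one JSON object per log line using only the Python standard library. The `JsonFormatter` class, the `context` key, and the field names are assumptions chosen for illustration; they are not taken from the deck or from fluentd.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (structured logging)."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge optional application context passed via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name="example_app", stream=None):
    """Return a logger that writes JSON lines to stream (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


# Demo: an in-memory buffer stands in for stdout or a log shipper.
buffer = io.StringIO()
logger = build_logger(stream=buffer)
logger.info("login failed", extra={"context": {"username": "alice", "status": 403}})
log_line = buffer.getvalue().strip()
```

Because every line is valid JSON, a log pipeline can filter on fields such as `status` without parsing free-form text.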
  6. Logs

    Tools: Splunk, journald, CloudWatch, Elasticsearch/Logstash/Kibana (“ELK”)
  7. Errors

    Capture exceptions and stack traces and enrich them with application context. Logs are an immutable record of discrete events over time. Don’t forget the frontend!
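
As one way to picture "enrich them with application context", the sketch below turns a caught exception into a report that carries the stack trace plus request details. The functions `capture_exception`, `charge_card`, and `checkout` are hypothetical; a real error tracker would ship the report to a service instead of appending it to a list.

```python
import traceback


def capture_exception(exc, context):
    """Build an error report: exception details, stack trace, and app context."""
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "context": dict(context),
    }


def charge_card(amount_cents):
    """Hypothetical payment step that rejects invalid amounts."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    return "ok"


def checkout(user_id, amount_cents, reports):
    """Run the payment step; on failure, record an enriched error report."""
    try:
        return charge_card(amount_cents)
    except ValueError as exc:
        reports.append(
            capture_exception(
                exc, {"user_id": user_id, "amount_cents": amount_cents}
            )
        )
        return None
```

The context dictionary is what separates an actionable error report from a bare stack trace: it records which user and which input triggered the failure.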
  8. Metrics

    Logs are a narrative, metrics are quantitative indicators. Numbers are great because we can do math with them! Information dense, both for humans and computers.
  9. Metrics

    Tools: Prometheus/Grafana, statsd/collectd, CloudWatch, Datadog
  10. Traces

    Like logs + metrics. Capture data at each step of a request.
    Tools: Zipkin, OpenTracing, X-Ray, Honeycomb, APM tools
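
The idea of capturing data at each step of a request can be sketched without any tracing library. The `Trace` class and `handle_request` function below are illustrative assumptions and do not reflect the API of Zipkin, OpenTracing, X-Ray, or Honeycomb; the `time.sleep` calls stand in for real work.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collect timed spans that all share one trace ID for a single request."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        """Time the enclosed block and record it as a span when it finishes."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans.append(
                {
                    "trace_id": self.trace_id,
                    "name": name,
                    "duration_seconds": time.perf_counter() - start,
                    "attributes": attributes,
                }
            )


def handle_request(trace):
    """Hypothetical request handler with two traced steps."""
    with trace.span("handle_request", route="/login"):
        with trace.span("query_database", table="users"):
            time.sleep(0.01)  # stand-in for a database query
        with trace.span("render_template"):
            time.sleep(0.005)  # stand-in for template rendering
```

Each span carries a name and attributes (like a log line) and a duration (like a metric), which is why the slide describes traces as "like logs + metrics".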
  11. Choosing tools

    • Embrace specialization
    • Reduce complexity
    • Empower humans
  12. “Monitoring in the time of Cloud Native”, Cindy Sridharan

    Alerting, diagnostics, debugging
  13. The Golden Signals

    • Latency
    • Errors
    • Traffic
    • Saturation

    “If you can only measure four metrics of your user-facing system, focus on these four.” (Site Reliability Engineering)
  14. Errors

    • HTTP 500s at the load balancer
    • Exceptions from the app server
    • Timeouts from the database

    Traffic

    • HTTP requests per second
    • Queries/transactions per second
    • Network I/O

    Latency

    • Latency is tricky!
    • p90-99 response time

    Saturation

    • Memory and disk usage
    • HTTP 503s
    • p99 latency

    Figure: MySQL command latency, from Brendan Gregg, “Frequency Trails”
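
One reason latency is tricky is that averages and medians hide the slow tail that high percentiles expose. The sketch below uses the nearest-rank definition of a percentile and a made-up sample in which 5 of 100 requests are very slow; the numbers are illustrative, not measurements from the deck.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative data: 95 fast requests (20 ms) and 5 slow ones (2000 ms).
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 119.0
p50_ms = percentile(latencies_ms, 50)            # 20
p90_ms = percentile(latencies_ms, 90)            # 20
p99_ms = percentile(latencies_ms, 99)            # 2000
```

In this sample the mean (119 ms) describes no real request, and p50 and p90 both look healthy; only p99 shows that one request in twenty takes two seconds.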
  15. Symptoms vs Causes

    “Your monitoring system should address two questions: what’s broken, and why? The "what’s broken" indicates the symptom; the "why" indicates a (possibly intermediate) cause. ‘What’ versus ‘why’ is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.” (Site Reliability Engineering)

    Symptom: I’m serving HTTP 500s or 404s
    Cause: Database servers are refusing connections

    Symptom: My responses are slow
    Cause: CPUs are overloaded by a bogosort, or an Ethernet cable is crimped under a rack

    Symptom: Users in Antarctica aren’t receiving animated cat GIFs
    Cause: Your Content Distribution Network blacklisted some client IPs

    Symptom: Private content is world-readable
    Cause: A new software push caused ACLs to be forgotten and allowed all requests
  16. Custom Metrics

    Consider collecting metrics that have meaning in your application and domain:
    • Active users
    • Successful/failed logins
    • New signups
    • Comments posted
    • Checkouts completed

        import statsd

        # Client for a statsd daemon listening on localhost:8125
        statsd_client = statsd.StatsClient('localhost', 8125)

        # password_valid, render_welcome_page and render_error are
        # application helpers defined elsewhere.
        def login(username, password):
            if password_valid(username, password):
                # Count each successful login
                statsd_client.incr('login.success')
                render_welcome_page()
            else:
                # Count each failed login
                statsd_client.incr('login.failed')
                render_error(403)
  17. Effective alerts

    • Urgent
    • Actionable
    • Unique
    • Focused
    • Real

    “I think if you maintain a force in the world that comes into people’s sleep, you are exercising a meaningful power.” (Don DeLillo, Underworld)

    “My Philosophy on Alerting”, Rob Ewaschuk