Observability 101

Observability - Metrics, Monitoring, Alerting
Piyush Goel, March 2021, @pigol1

Release to Production is just the beginning!
“40% to 90% of the total costs of software are incurred after launch.”
• Facts and Fallacies of Software Engineering, Glass R. (2002), Addison-Wesley, p. 115
• Which Factors Affect Software Projects Maintenance Costs More? Acta Informatica Medica

Systems Will Fail!

The number of failure points in a distributed system increases with each new component.

What to monitor?

What to monitor? Let’s take an example: HTTP REST APIs

• API Latency (95th percentile, Avg, 99th percentile)
• CPU, Load Avg
• Memory
• Swap
• JMX Heap Size (Assuming Java implementation)
• HTTP Status Codes (200, 300, 400)
• Exceptions
• External API call Latencies
• …
• And more…

× Servers × Clusters
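To illustrate why the list above asks for percentiles and not only the average, here is a minimal Python sketch. The latency samples are made up, and the nearest-rank method is just one common way to define a percentile; it is not necessarily what any particular monitoring tool uses.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds for one API endpoint.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]

summary = {
    "avg": statistics.mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

With these samples the median stays low while the average and the tail percentiles are pulled up by two slow requests, which is why the tail is usually tracked separately.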

Let’s step back and understand monitoring.

Monitoring
• Capturing the state of the system to determine its health.
  ◦ HealthChecks
    ▪ Is the service running?
    ▪ Can I send work?
  ◦ Metrics
    ▪ System, Application, Functional
• Alerts
  ◦ Anomalous behaviors - How do you define an anomaly?

Monitoring
• Alerts - Known Failures
  ◦ Knowledge Based
  ◦ Reactive (post-outages)
What about the unknown failures?

Observability

Observability The bygone era!!

Observability https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c

Observability • https://www.vividcortex.com/blog/monitoring-isnt-observability

Observability
Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. Analogous to medical diagnosis.
• https://theagileadmin.com/2018/02/16/monitoring-and-observability/

Observability - Internal States
• Context Specific
  ◦ Web Servers
    ▪ Availability
    ▪ Incoming Request Rate
    ▪ Latency
    ▪ HTTP Failures
  ◦ Application Services
    ▪ Success Rate
    ▪ Functionalities
  ◦ Message Queue
    ▪ Queue Length
    ▪ Consumer/Producer Count
Needs instrumentation, added while writing the code!
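As a rough sketch of what instrumenting while writing code can look like, the Python decorator below records request count, failure count, and cumulative latency for a function. The `METRICS` store, the `instrumented` decorator, and `place_order` are hypothetical names for illustration; a real service would export these values to a metrics backend instead of keeping them in a dictionary.

```python
import time
from collections import defaultdict
from functools import wraps

# In-process metric store (an assumption for this sketch).
METRICS = defaultdict(float)


def instrumented(name):
    """Wrap a function so every call updates request, failure, and latency metrics."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            METRICS[f"{name}.requests"] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[f"{name}.failures"] += 1
                raise
            finally:
                # Runs on success and on failure, so latency is always recorded.
                METRICS[f"{name}.latency_seconds_total"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("place_order")
def place_order(items):
    if not items:
        raise ValueError("empty cart")
    return {"status": "accepted", "count": len(items)}


place_order(["book"])
try:
    place_order([])
except ValueError:
    pass

success_rate = 1 - METRICS["place_order.failures"] / METRICS["place_order.requests"]
print(dict(METRICS), success_rate)
```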

Observability - External Outputs (3 Pillars)
• Metrics
  ◦ Health Checks
• Logging
• Tracing

Observability - Metrics
• External State at a broad scope (Time dimension)
  ◦ System
  ◦ Application
    ▪ Success Rate/Failure Rate
    ▪ Latency (internal/external)
    ▪ Error Codes
    ▪ Exceptions
  ◦ Business/Functional
    ▪ Order Management System - Order Rate (Regular, Cancel, Return)
    ▪ Payments - Forward payments, reverse payments, recon requests
    ▪ Coupons - Issued, Redeemed
    ▪ Promotion Engine - Success Evaluations, Failures

Observability - Metrics
• Meaningful Metrics* - Generous
• Alerts - Judicious
• Low Cardinality
  ◦ Keep a Watch!
  ◦ Don’t emit for users/orders. We use Logs for that!
• Provide System Summary
• Questions:
  ▪ How many transactions failed?
  ▪ How many logins succeeded?
* Metrics That Matter - https://queue.acm.org/detail.cfm?id=3309571 - Must Read
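The low-cardinality advice above can be sketched as a counter that only accepts a small, fixed set of label names. `LabeledCounter` and `ALLOWED_LABELS` are hypothetical and exist only for this illustration; real metrics libraries have their own label APIs.

```python
from collections import Counter

# Bounded dimensions only. Per-user or per-order identifiers are deliberately
# excluded, since each distinct value would create a new time series.
ALLOWED_LABELS = {"endpoint", "status"}


class LabeledCounter:
    def __init__(self, name):
        self.name = name
        self.series = Counter()  # one entry per distinct label combination

    def inc(self, **labels):
        unknown = set(labels) - ALLOWED_LABELS
        if unknown:
            raise ValueError(f"labels not allowed on metrics: {sorted(unknown)}")
        self.series[tuple(sorted(labels.items()))] += 1

    def total(self, **match):
        """Sum every series whose labels include all the given key/value pairs."""
        return sum(
            count
            for key, count in self.series.items()
            if all(dict(key).get(k) == v for k, v in match.items())
        )


logins = LabeledCounter("logins_total")
logins.inc(endpoint="/login", status="success")
logins.inc(endpoint="/login", status="success")
logins.inc(endpoint="/login", status="failure")

# "How many logins succeeded?" is answered from the summary, not per user.
print(logins.total(status="success"), len(logins.series))
```

Questions about a specific user or order belong in logs, where high-cardinality keys are expected.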

Observability - Metrics
• Tools
  ◦ Prometheus
  ◦ InfluxDB
  ◦ TimescaleDB
  ◦ Graphite
  ◦ OpenTSDB
  ◦ Scuba (Facebook)
  ◦ Apache Druid
  ◦ Grafana

Observability - Health Checks
• HealthChecks
  ◦ Is the service running? (Liveness)
  ◦ Can I send work? (Readiness)
• Methods
  ◦ Broadcast - Gossip Protocols (Cassandra, Riak)
  ◦ Register - Service Discovery
  ◦ Health endpoints - ELB, HAProxy, Nginx
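A minimal sketch of the health-endpoint method, using only the Python standard library. The paths `/livez` and `/readyz`, the `STATE` flag, and the port are assumptions chosen for illustration; a load balancer such as ELB, HAProxy, or Nginx would be configured to poll whatever paths the service actually exposes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Flipped to True once the service has connected to its dependencies.
STATE = {"ready": False}


def health_response(path, state):
    """Map a request path to an (HTTP status, body) pair."""
    if path == "/livez":
        # Liveness: the process is up and able to answer at all.
        return 200, "alive"
    if path == "/readyz":
        # Readiness: the service can accept work right now.
        return (200, "ready") if state["ready"] else (503, "not ready")
    return 404, "not found"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_response(self.path, STATE)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking call; run it in the service's startup code or a background thread."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Keeping liveness and readiness separate lets a load balancer stop sending traffic to an instance that is still starting up without treating it as dead.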

Observability - Logging
• Understanding at a smaller scope
  ◦ Request, customer, transaction
• Ask Questions
  ◦ Why couldn’t the customer place an order?
  ◦ Why did the transaction fail?
• Centralised - Log Collection, Aggregation
• Searchable - Indexing
• Correlatable - Common Key (Request Id)
• Tools - ELK, EFK, Splunk, SumoLogic, Loki
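The "correlatable" point above can be sketched with Python's standard `logging` module: every log line is emitted as JSON and carries the same request id, so all events for one request can be found with a single search. The field names and the `orders` logger are assumptions for this sketch, and the in-memory stream stands in for stdout or a log shipper.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, easy to index and search."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        })


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("orders")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # keep the sketch's output out of the root logger

# One id per incoming request, attached to every log line for that request.
request_id = str(uuid.uuid4())
context = {"request_id": request_id}
log.info("order received", extra=context)
log.error("payment declined", extra=context)

# What a log search by request id would return.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
related = [e["message"] for e in events if e["request_id"] == request_id]
print(related)
```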

Observability - Logging (Anatomy)

Observability - Tracing
• Dissect a request into sub-paths (spans).
• Profile system usage at a span level.
• Extract Insights
Tools:
• Google Dapper (https://ai.google/research/pubs/pub36356)
• Twitter Zipkin (https://zipkin.io/)
• Open Jaeger (https://www.jaegertracing.io/)
• New Relic
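To make the idea of spans concrete, here is a toy, single-threaded Python sketch that splits one request into nested spans and records how long each took. It is not how Dapper, Zipkin, or Jaeger are implemented; the `span` helper, the span names, and the sleep durations are all illustrative assumptions.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []   # finished spans, in completion order
_open = []   # stack of spans currently in progress (single-threaded sketch)


@contextmanager
def span(name, trace_id):
    """Time a block of work and link it to the enclosing span, if any."""
    record = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:8],
        "parent_id": _open[-1]["span_id"] if _open else None,
        "name": name,
    }
    _open.append(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - start
        _open.pop()
        SPANS.append(record)


trace_id = uuid.uuid4().hex
with span("POST /orders", trace_id) as root:
    with span("check_inventory", trace_id):
        time.sleep(0.01)  # stands in for an internal call
    with span("charge_payment", trace_id):
        time.sleep(0.02)  # stands in for an external API call

for s in SPANS:
    print(s["name"], s["parent_id"], round(s["duration_s"], 3))
```

Because each span carries its parent id and duration, the slowest sub-path of a request can be read off directly.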

Observability Spectrum

We have Metrics. What to alert on?

Service Level Objectives (SLO)

Service Level Objectives (SLO)
• Defines a Quantifiable Goal for a service.
• Measure the goal - Represents the User Experience/Delight Factor.
• First step before writing a new service. Work backwards.
• Have as few SLOs as possible.
  ◦ Represents the system behaviour.
* https://landing.google.com/sre/sre-book/chapters/service-level-objectives/
* https://www.youtube.com/watch?v=tEylFyxbDLE
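One way to turn an SLO into something measurable is an error budget, a concept from the Google SRE book linked above and not spelled out on this slide. The Python sketch below assumes a hypothetical availability SLO of 99.9% and made-up request counts.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compare observed availability with the SLO and report the remaining budget."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    availability = 1 - failed_requests / total_requests
    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_remaining": allowed_failures - failed_requests,
        "slo_met": availability >= slo_target,
    }


# Hypothetical month: 1,000,000 requests, 400 failures, SLO of 99.9% success.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

Alerting on how quickly the remaining budget is being consumed, instead of on every individual failure, is one way to keep alerts judicious.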

Service Level Objectives (SLO) - Exercise
• Cart Service
• Authentication & Authorization Service
• Communication Engine (SMS, Email, Push Notifications)

Why Observe?
• Stable systems in the long run.
  ▪ Better Capacity & Load Planning
• Lower MTTR.
• Promotes a data-driven culture within the team.
• Makes outcomes measurable and removes subjectivity.
• Brings accountability across teams - Tech & Product.
  ▪ Transparency
  ▪ Less friction.

Observability - Standards
• OpenMetrics
• OpenTelemetry
  ◦ OpenCensus & OpenTracing
  ◦ APIs, SDKs - Metrics, Tracing, Context.
  ◦ Open Source, Vendor Neutral
  ◦ Avoid lock-in

Observability - Recent Trends
• Data Observability
  ◦ Freshness
  ◦ Volume
  ◦ Schema
  ◦ Distribution
  ◦ Lineage
• AccelData, MonteCarlo, Soda
* https://www.montecarlodata.com/what-is-data-observability/
* https://cloudedjudgement.substack.com/p/data-observability-the-next-monitoring
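Two of the data observability dimensions above, freshness and volume, can be sketched as simple checks over a batch of rows. The `check_table` function, the `updated_at` field, and the thresholds are hypothetical; the vendors listed above each define and implement these checks in their own way.

```python
from datetime import datetime, timedelta, timezone


def check_table(rows, now, max_lag, min_rows):
    """Return freshness and volume verdicts for one batch of rows."""
    newest = max((row["updated_at"] for row in rows), default=None)
    return {
        # Freshness: the most recent update is no older than max_lag.
        "fresh": newest is not None and now - newest <= max_lag,
        # Volume: at least the expected number of rows arrived.
        "volume_ok": len(rows) >= min_rows,
    }


# Made-up batch: two order rows, the newest updated three hours ago.
now = datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "updated_at": now - timedelta(hours=5)},
    {"order_id": 2, "updated_at": now - timedelta(hours=3)},
]

result = check_table(rows, now, max_lag=timedelta(hours=1), min_rows=100)
print(result)
```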

Happy Observing!

References
• Debugging Production Systems
• Pierre Vincent - How to build observable Distributed systems?
• Charity Majors - Observability for Emerging Infra: What Got You Here Won't Get You There
• Caitee McAfree - Of the Order of Billions: Building Observability at Twitter
• https://eng.uber.com/observability-at-scale/
• OpenTelemetry
• OpenMetrics
• Monitoring and Observability

Q & A Reach out to me @pigol1
