Observability

Observability: Metrics, Monitoring, Alerting
Piyush Goel, Feb 2019

Disclaimer: Content is not original. Check References for the source.

When is a developer’s job done?
1. After dev-complete?
2. After staging push?
3. After QA sign-off?
4. After prod release?
Answer: None of the above!

Release to Production is just the beginning!
“40% to 90% of the total costs of software are incurred after launch.”
• Facts and Fallacies of Software Engineering, Glass R. (2002), Addison-Wesley, p. 115
• Which Factors Affect Software Projects Maintenance Costs More? Acta Informatica Medica

Systems Will Fail … Be Prepared for it!

The number of failure points in a distributed system increases with each new component.

What to monitor? Let’s take an example: Platform APIs
• API latency (average, 95th percentile, 99th percentile)
• CPU, load average
• Memory
• Swap
• JMX heap size
• HTTP status codes (200, 300, 400)
• Exceptions
• External API call latencies
• And more...
All of this, multiplied by the number of servers and the number of clusters.
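Why track percentiles next to the average? A minimal sketch, assuming latency samples are already collected in milliseconds. The function names and the sample data are hypothetical, and the nearest-rank method used here is only one of several percentile definitions.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it.

    pct is expected to be an integer between 1 and 100.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling division avoids floating-point surprises in the rank.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def summarize_latency(samples_ms):
    """Return the three latency figures named on the slide."""
    return {
        "avg": statistics.mean(samples_ms),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }


# Made-up data: 95 fast requests, 4 slower ones, and a single 1-second outlier.
latencies_ms = [10] * 95 + [20] * 4 + [1000]
summary = summarize_latency(latencies_ms)
print(summary)
```

With this data the single outlier pulls the average to about twice the typical latency, while the 95th percentile stays at the typical value. That is why the slide lists percentiles alongside the average.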

Let’s step back and understand monitoring.

Monitoring
• Capturing the state of the system to determine its health.
  ◦ HealthChecks
    ▪ Is the service running?
    ▪ Can I do more work?
  ◦ Metrics
    ▪ System
    ▪ Application
    ▪ Functional
• Alerts
  ◦ Anomalous behaviors - How do you define an anomaly?

Monitoring
• Alerts - Known Failures
  ◦ Knowledge Based
  ◦ Reactive (post-outages)
What about the unknown failures?

Observability

Observability https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c

Observability The bygone era!!

Observability • https://www.vividcortex.com/blog/monitoring-isnt-observability

Observability
Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.
• https://theagileadmin.com/2018/02/16/monitoring-and-observability/

Observability - Internal States
• Context specific
  ◦ Web Servers
    ▪ Availability
    ▪ Incoming Request Rate
    ▪ Latency
    ▪ HTTP Failures
  ◦ Micro-Services
    ▪ Success Rate
    ▪ Functionalities
  ◦ Message Queue
    ▪ Queue Length
    ▪ Consumer/Producer Count
Needs instrumentation, added while writing the code!
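What does instrumentation written alongside the code look like? A minimal sketch, assuming an in-memory dictionary stands in for a real metrics client. METRICS, instrumented and place_order are hypothetical names, not part of any library.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store; a real service would send these to a metrics backend.
METRICS = defaultdict(float)


def instrumented(name):
    """Decorator that records request count, failure count and total latency."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            METRICS[f"{name}.requests"] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[f"{name}.failures"] += 1
                raise
            finally:
                # Runs on both success and failure, so latency is always recorded.
                METRICS[f"{name}.latency_seconds_total"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("place_order")
def place_order(amount):
    if amount <= 0:
        raise ValueError("amount must be positive")
    return "ok"


place_order(10)
try:
    place_order(-1)
except ValueError:
    pass

print(dict(METRICS))
```

Success rate then falls out as (requests - failures) / requests, without touching the business logic again.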

Observability - External Outputs
• Health Checks
• Metrics
• Logging
• Tracing

Observability - Health Checks
• HealthChecks
  ◦ Is the service running?
  ◦ Can I do more work?
• Methods
  ◦ Broadcast - Gossip Protocols (Cassandra)
  ◦ Register - Service Discovery
  ◦ Health endpoints - ELB, HAProxy, Nginx
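The two health-check questions can be sketched as the body of a health endpoint. This is an illustration only: health_check, its parameters and the JSON field names are hypothetical, and the status-code convention (200 healthy, 503 unhealthy) is an assumption about how the load balancer is configured.

```python
import json


def health_check(is_running, in_flight, max_in_flight):
    """Answer 'Is the service running?' and 'Can I do more work?'.

    Returns an (HTTP status, JSON body) pair for a health endpoint to send back.
    """
    can_take_more_work = is_running and in_flight < max_in_flight
    body = {
        "running": is_running,
        "can_take_more_work": can_take_more_work,
    }
    # Assumption: the load balancer treats any non-2xx response as unhealthy.
    status = 200 if can_take_more_work else 503
    return status, json.dumps(body)


print(health_check(True, 3, 10))   # healthy, has spare capacity
print(health_check(True, 10, 10))  # running but saturated
```

Reporting saturation as unhealthy lets the load balancer route new requests elsewhere until the instance catches up.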

Observability - Metrics
• External state at a broad scope (time dimension)
  ◦ System
  ◦ Application
    ▪ Success Rate/Failure Rate
    ▪ Latency (internal/external)
    ▪ Error Codes
    ▪ Exceptions
  ◦ Business/Functional
    ▪ Order Rate (Regular, Cancel, Return)
    ▪ Payments
    ▪ Conversion Rates
    ▪ Coupon Issuance/Redemption
    ▪ Points Issued/Redeemed

Observability - Metrics
• Meaningful metrics** - be generous
• Alerts - be judicious
• Low cardinality
  ◦ Keep a watch!
  ◦ Don’t emit metrics per user or per order. We use logs for that!
• Provide a system summary
• Questions:
  ▪ How many transactions failed?
  ▪ How many logins succeeded?
** https://queue.acm.org/detail.cfm?id=3309571 - Must Read
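A minimal sketch of what low cardinality means in practice, assuming a plain in-memory counter keyed by metric name and labels. emit, count and the sample events are hypothetical. Each distinct label combination becomes its own time series, so labels should take only a few values.

```python
from collections import Counter

# One entry per (metric, labels) combination, i.e. one time series each.
counters = Counter()


def _key(metric, labels):
    return (metric, tuple(sorted(labels.items())))


def emit(metric, **labels):
    """Increment the counter for this metric and label combination."""
    counters[_key(metric, labels)] += 1


def count(metric, **labels):
    """Read a counter back; unseen combinations read as 0."""
    return counters[_key(metric, labels)]


# Made-up login events: (user id, outcome).
events = [("u1", "success"), ("u2", "failure"), ("u3", "success"), ("u4", "success")]
for user_id, outcome in events:
    # The user id is deliberately NOT a label; it belongs in the logs.
    emit("login", outcome=outcome)

logins_succeeded = count("login", outcome="success")
print(logins_succeeded, len(counters))
```

Four users produce only two series here. Had user_id been a label, the number of series would grow with the number of users.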

Observability - Metrics
• Tools
  ◦ Graphite
  ◦ InfluxDB
  ◦ Prometheus
  ◦ OpenTSDB
  ◦ Scuba (Facebook)
  ◦ Apache Druid

Observability - Logging
• Understanding at a smaller scope
  ◦ Request, customer, transaction
• Ask questions:
  ◦ Why couldn’t the customer place an order?
  ◦ Why did the transaction fail?
• Centralised - ElasticSearch, Splunk
• Searchable - Indexed
• Correlatable - Common Key (Request Id)
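A minimal sketch of correlatable logs, using Python's standard logging module with a JSON formatter. The logger name, the request ids and the find_by_request helper are hypothetical. An in-memory stream stands in for a centralised, indexed store such as the ones listed above.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so the fields are searchable."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # The common key that ties together all lines of one request.
            "request_id": getattr(record, "request_id", None),
        })


stream = io.StringIO()  # stand-in for a centralised log store
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders_example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def log(level, message, request_id):
    logger.log(level, message, extra={"request_id": request_id})


log(logging.INFO, "order received", "req-1")
log(logging.INFO, "order received", "req-2")
log(logging.ERROR, "payment declined", "req-1")


def find_by_request(request_id):
    """Return every log entry that carries the given request id, in order."""
    entries = [json.loads(line) for line in stream.getvalue().splitlines()]
    return [entry for entry in entries if entry["request_id"] == request_id]


trail = find_by_request("req-1")
print(trail)
```

Filtering on the request id answers "Why couldn’t the customer place an order?" for one specific request, which a metric cannot do.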

Observability - Logging

Observability - Logging (Anatomy)

Observability - Logging
• Tools
  ◦ Splunk
  ◦ ELK
  ◦ SumoLogic
  ◦ RLA

Observability - Tracing
• Dissect a request into sub-paths (spans).
• Profile system usage at a span level.
• Extract insights.
Tools:
• Google Dapper (https://ai.google/research/pubs/pub36356)
• Twitter Zipkin (https://zipkin.io/)
• Open Jaeger (https://www.jaegertracing.io/)
• New Relic

Observability Spectrum

We have Metrics. What to alert on?

Service Level Objectives (SLO)

Service Level Objectives (SLO)
• Defines a quantifiable goal for a service.
• Measure the goal - it represents the user experience/delight factor.
• First step before writing a new service. Work backwards.
• Have as few SLOs as possible.
  ◦ Represents the system behaviour.
* https://landing.google.com/sre/sre-book/chapters/service-level-objectives/
* https://www.youtube.com/watch?v=tEylFyxbDLE
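A minimal sketch of checking a quantifiable goal, assuming the goal is expressed as a ratio of good events to total events. evaluate_slo, the 99.9% target and the request counts are made up for illustration.

```python
def evaluate_slo(good_events, total_events, target):
    """Compare the measured ratio of good events against the SLO target."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    measured = good_events / total_events
    return {"measured": measured, "target": target, "met": measured >= target}


# Hypothetical goal: 99.9% of cart requests succeed.
print(evaluate_slo(99_950, 100_000, 0.999))  # 99.95% measured, goal met
print(evaluate_slo(99_800, 100_000, 0.999))  # 99.80% measured, goal missed
```

An alert can then fire when the goal is missed, which ties alerting to user experience rather than to individual system metrics.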

Service Level Objectives (SLO) - Exercise
• Cart Service
• Payments Service
• Card Generation
• Order Management Service
• Communication Engine

Happy Observing!

References
• Debugging Production Systems: https://www.youtube.com/watch?v=YlrAakN90D0
• Pierre Vincent - How to build observable Distributed systems? https://www.youtube.com/watch?v=ACL_YVPD3gw
• Charity Majors - "Observability for Emerging Infra: What Got You Here Won't Get You There" https://www.youtube.com/watch?v=1wjovFSCGhE
• Caitie McCaffrey - Of the Order of Billions: Building Observability at Twitter https://www.youtube.com/watch?v=SC6XuD1tgcQ
• https://eng.uber.com/observability-at-scale/
