Observability at scale

Devi A S L
PowerToFly Diversity Reboot, June 2020

About me
• 14 years in the software industry
• Staff engineer @ Razorpay
• Ex Architect @ PowerToFly
• @asldevi

Over the last few years...
Software systems have become more complex
• Microservice architecture and design patterns have emerged
• Development practices have evolved
• Infrastructure and deployment practices have gotten better
Monitoring solutions and the culture around them?

Glimpse of microservice architecture

The impact - Developer productivity
• Systems are harder to understand, and it is harder to pinpoint where errors originate
• Logs and metrics are humongous in volume, and making sense of them is tough
• Increased MTTD (Mean Time To Detect) and MTTR (Mean Time To Resolve)

The impact - Cost of monitoring
Logging cost
Number of transactions * number of microservices * days of retention period * cost of (network + storage)
Metrics cost
Number of metrics * days of retention period * cost of (network + storage)
(the number of metrics to monitor increases with the number of microservices too)
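The two formulas above can be turned into a rough calculator. This is a minimal sketch, not a pricing model: the function names are hypothetical, and it assumes one log record per transaction per microservice and a single combined unit cost for network plus storage per record (or per metric) per day.

```python
def logging_cost(transactions, microservices, retention_days, unit_cost):
    """transactions * microservices * retention days * cost of (network + storage).

    Assumption: every transaction produces one log record in every
    microservice, and unit_cost is the combined network + storage cost
    of keeping one record for one day.
    """
    return transactions * microservices * retention_days * unit_cost


def metrics_cost(metrics, retention_days, unit_cost):
    """metrics * retention days * cost of (network + storage)."""
    return metrics * retention_days * unit_cost


if __name__ == "__main__":
    # Hypothetical figures, only to show how the terms multiply.
    small = logging_cost(1_000_000, 10, 30, unit_cost=1)
    large = logging_cost(1_000_000, 20, 30, unit_cost=1)
    # Doubling the number of microservices doubles the logging cost.
    print(large // small)
```

Because every term is a multiplier, growth in the number of services, the traffic, or the retention period each scales the bill linearly, and they compound when they grow together.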

What is Observability
A system is observable if and only if you can determine the behavior of the system based on its outputs.
“Observability is about being able to ask arbitrary questions about your environment without---and this is the key part---having to know ahead of time what you wanted to ask”
- Charity Majors, Co-founder @ Honeycomb

Observability vs Monitoring
“You think to yourself, 'I had an incident, so I should make an alarm for that.' And you had another incident, and you should make a tool for that. At each step you’re making a rational choice, but you don’t realize that the cumulative effect is something that’s hard to maintain, and kind of unbearable.”
- Greg Poirier (in his famous talk “Monitoring is dead”)

Our march towards better Observability
Current Toolset
• Log management solution
• Application performance management (APM)
• Infra metrics and dashboards
• Custom data pipelines and analytics dashboards
Limitations
• Inability to do high cardinality queries across services
• Missing distributed tracing
• Root cause analysis starts with assumptions
• No first pane of analysis to go to
• No scope for unknown-unknowns

Three Pillars of observability
• Distributed Tracing
• Logs
• Metrics
These are only pillars, but not all of it!
OpenTelemetry (OpenTracing + OpenCensus)

Glimpse of Distributed tracing

Terminology - Distributed tracing
• Trace: an end-to-end request involving one or more services
• Span: work done by a single service, with its start and end times
• Tags: metadata to help contextualize a span
While traces and spans do the stitching of the requests, tags enable context
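The terminology above can be pictured as a small data structure. This is a minimal sketch with hypothetical field and service names, not the data model of any particular tracing system: a trace is simply the set of spans that share a trace id, and each span points at its parent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str                 # shared by every span of one request
    span_id: str
    service: str                  # the single service that did this work
    name: str
    start_ms: int
    end_ms: int
    parent_id: Optional[str] = None            # None marks the root span
    tags: Dict[str, str] = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def root_span(trace: List[Span]) -> Span:
    """Return the span with no parent, i.e. the user-facing entry point."""
    return next(s for s in trace if s.parent_id is None)


# One hypothetical request that touches two services.
trace = [
    Span("t1", "a", "api", "POST /payments", 0, 120,
         tags={"merchant_id": "m123456"}),
    Span("t1", "b", "payments", "charge", 10, 90, parent_id="a",
         tags={"payment_mode": "netbanking"}),
]
```

The trace id and parent id do the stitching across services; the tags carry the business and technical context that makes the spans searchable.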

Best practices: Instrumenting traces
• Start at the user-facing service and work down through all others
• Make the traces rich with context in tags
  ◦ add business values
  ◦ add technical details
• Enrich traces with tags, and sample to keep costs sane
• Wrap every network call
• Wrap every data fetch call
Embrace a culture where instrumentation is part of building software

Best practices: logging
• Include trace id in logs
• Use log levels appropriately
• Embrace structured logging
Plain Text Log
"2020-06-15T11:01:29.571" 500 "ERROR" "payments_module" 45453 "payment request failed for merchant id m123456 while using hdfc netbanking"
vs
JSON Log
{
  "timestamp": "2020-06-15T11:01:29.571",
  "status_code": 500,
  "log_level": "ERROR",
  "module": "payments",
  "merchant_id": "m123456",
  "payment_mode": "netbanking",
  ...
}
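A JSON log like the one above can be produced with Python's standard `logging` module and a custom formatter. This is a minimal sketch under assumptions: the `JsonFormatter` class, the `fields` key passed through `extra`, and the `trace_id` value are all illustrative, and the timestamp is formatted to whole seconds for simplicity.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "log_level": record.levelname,
            "module": record.name,
            "message": record.getMessage(),
        }
        # Structured context (trace id, business values) travels in
        # `extra={"fields": {...}}`; "fields" is a name chosen for this sketch.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "payment request failed",
    extra={"fields": {
        "trace_id": "abc123",          # hypothetical id from the active trace
        "status_code": 500,
        "merchant_id": "m123456",
        "payment_mode": "netbanking",
    }},
)
```

With the trace id present as its own field, a log line can be joined back to the trace it belongs to, and fields such as `merchant_id` become queryable without parsing free text.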

Best practices: metrics
Health of your systems is irrelevant to the customers. Health of each individual request is of supreme consequence.
RED Method
• Request Rate (the number of requests per second)
• Error Rate (the number of those requests that are failing)
• Duration (the amount of time those requests take)
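The three RED numbers can be derived from per-request records. This is a minimal sketch with hypothetical names: it assumes a request counts as failing when its status code is 500 or above, and it reports duration as a nearest-rank 95th percentile, which is one common choice rather than something the RED method prescribes.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: int
    status_code: int


def red_metrics(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Compute Rate, Errors, and Duration for requests seen in one window."""
    total = len(requests)
    # Assumption: 5xx responses are the failing requests.
    errors = sum(1 for r in requests if r.status_code >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: the value at position ceil(0.95 * n).
    p95 = durations[math.ceil(0.95 * total) - 1] if total else 0
    return {
        "rate_per_s": total / window_s,
        "errors_per_s": errors / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p95_duration_ms": p95,
    }


# Four hypothetical requests observed over a 2-second window.
window = [
    Request(40, 200),
    Request(55, 200),
    Request(900, 500),
    Request(60, 200),
]
metrics = red_metrics(window, window_s=2)
```

In practice these numbers usually come from a metrics library's counters and histograms per service and endpoint; the point of the sketch is only that all three are measured per request, not per host.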

Conclusion
• Observability is important in the era of microservice architecture
• Embrace instrumentation as part of building software
• Write -> Test -> Commit -> Release and Observe

Questions?

Thank You @asldevi

Image Credits • Image on Slide 11 - https://www.jaegertracing.io/docs/1.16/frontend-ui/
