Observability at scale

Devi
June 16, 2020

Transcript

  1. About me
     • 14 years in the software industry
     • Staff engineer @ Razorpay
     • Ex-Architect @ powertofly
     • @asldevi
  2. Over the last few years...
     Software systems have become more complex
     • Micro-service architecture and design patterns have emerged
     • Development practices have evolved
     • Infrastructure and deployment practices have gotten better
     Monitoring solutions and the culture around it?
  3. The impact - Developer productivity
     • Systems are harder to understand, and it is harder to pinpoint where errors originate
     • Logs and metrics are humongous in volume, and making sense of them is tough
     • Increased MTTD (Mean Time To Detect) and MTTR (Mean Time To Resolve)
  4. The impact - Cost of monitoring
     Logging cost = number of transactions * number of microservices * days of retention period * cost of (network + storage)
     Metrics cost = number of metrics * days of retention period * cost of (network + storage)
     (The number of metrics to monitor increases with the number of microservices too.)
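
The two formulas on this slide can be sketched as plain functions. This is only an illustration: the function names, the per-unit network and storage costs, and the sample volumes are assumptions made up for the example, not figures from the talk.

```python
def logging_cost(transactions, microservices, retention_days,
                 network_cost, storage_cost):
    """Logging cost as written on the slide.

    network_cost and storage_cost are hypothetical per-record, per-day unit costs.
    """
    return transactions * microservices * retention_days * (network_cost + storage_cost)


def metrics_cost(metric_count, retention_days, network_cost, storage_cost):
    """Metrics cost as written on the slide, with the same hypothetical unit costs."""
    return metric_count * retention_days * (network_cost + storage_cost)


# Hypothetical inputs: 1M transactions, 30 days of retention.
for services in (10, 20, 40):
    cost = logging_cost(1_000_000, services, 30, 0.000001, 0.000002)
    print(f"{services} microservices -> logging cost {cost:,.2f}")
```

Because every term is multiplied, doubling the number of microservices doubles the logging bill even when traffic stays flat.
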
  5. What is Observability
     A system is observable if and only if you can determine the behavior of the system based on its outputs.
     “Observability is about being able to ask arbitrary questions about your environment without - and this is the key part - having to know ahead of time what you wanted to ask” - Charity Majors, Co-founder @ Honeycomb
  6. Observability vs Monitoring
     “You think to yourself, 'I had an incident, so I should make an alarm for that.' And you had another incident, and you should make a tool for that. At each step you’re making a rational choice, but you don’t realize that the cumulative effect is something that’s hard to maintain, and kind of unbearable.” - Greg Poirier (in his famous talk “Monitoring is dead”)
  7. Our march towards better Observability
     Current Toolset
     • Log management solution
     • Application performance management (APM)
     • Infra metrics and dashboards
     • Custom data pipelines and analytics dashboards
     Limitations
     • Inability to do high cardinality queries across services
     • Missing distributed tracing
     • Root cause analysis starts with assumptions
     • No first pane of analysis to go to
     • No scope for unknown-unknowns
  8. Three Pillars of observability
     • Distributed Tracing
     • Logs
     • Metrics
     These are only the pillars, not all of it!
     OpenTelemetry (OpenTracing + OpenCensus)
  9. Terminology - Distributed tracing
     • Trace: an end-to-end request involving one or more services
     • Span: work done by a single service over a time interval
     • Tags: metadata to help contextualize a span
     While traces and spans do the stitching of the requests, tags enable context
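
One way to picture these three terms is as a small data model. The sketch below is a simplified illustration, not the OpenTelemetry API: the `Span` fields, the helper functions, and the service names are all assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Work done by a single service over a time interval (hypothetical model)."""
    trace_id: str                    # shared by every span of one request
    span_id: str
    service: str
    name: str
    start_ms: int
    end_ms: int
    parent_id: Optional[str] = None  # stitches spans into a tree
    tags: dict = field(default_factory=dict)  # context for the span


def group_by_trace(spans):
    """A trace is simply all spans that share a trace id."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return dict(traces)


def trace_duration_ms(spans):
    """End-to-end duration: earliest start to latest end."""
    return max(s.end_ms for s in spans) - min(s.start_ms for s in spans)


spans = [
    Span("t1", "a", "api-gateway", "POST /payments", 0, 120),
    Span("t1", "b", "payments", "create_payment", 10, 100, parent_id="a",
         tags={"merchant_id": "m123456", "payment_mode": "netbanking"}),
    Span("t1", "c", "bank-connector", "authorize", 20, 90, parent_id="b"),
]
traces = group_by_trace(spans)
print(len(traces["t1"]), "spans,", trace_duration_ms(traces["t1"]), "ms end to end")
```

The trace id and parent id do the stitching; the tags are what let you later filter by merchant or payment mode.
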
  10. Best practices: Instrumenting traces
     • Start at the user-facing service and work down through all others
     • Make the traces rich with context in tags
       ◦ add business values
       ◦ add technical details
     • Enrich traces with tags, and sample them to keep costs sane
     • Wrap every network call
     • Wrap every data fetch call
     Embrace a culture where instrumentation is part of building software
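
As a rough sketch of "wrap every call, tag it, and sample", the snippet below wraps a data fetch in a context manager that records a tagged span. Everything here is hypothetical: `traced`, `should_sample`, `FINISHED_SPANS` and `fetch_merchant` are illustrative names, and a real system would more likely use a tracing library than hand-rolled code like this.

```python
import time
import uuid
from contextlib import contextmanager

FINISHED_SPANS = []  # stand-in for a real span exporter


def should_sample(trace_id, rate):
    """Deterministic sampling: the same trace id gets the same decision everywhere."""
    bucket = int(trace_id[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < rate


@contextmanager
def traced(trace_id, name, sample_rate=1.0, **tags):
    """Record a span around the wrapped block if the trace is sampled."""
    if not should_sample(trace_id, sample_rate):
        yield None
        return
    span = {"trace_id": trace_id, "span_id": uuid.uuid4().hex[:16],
            "name": name, "tags": dict(tags)}
    start = time.monotonic()
    try:
        yield span
    except Exception as exc:
        span["tags"]["error"] = type(exc).__name__  # technical detail
        raise
    finally:
        span["duration_ms"] = (time.monotonic() - start) * 1000
        FINISHED_SPANS.append(span)


def fetch_merchant(merchant_id):
    """Hypothetical data fetch standing in for a database or network call."""
    return {"id": merchant_id, "name": "demo"}


trace_id = uuid.uuid4().hex
with traced(trace_id, "db.fetch_merchant",
            merchant_id="m123456", payment_mode="netbanking"):
    fetch_merchant("m123456")
print(FINISHED_SPANS[0]["name"], FINISHED_SPANS[0]["tags"])
```

Deriving the sampling decision from the trace id, rather than from a random draw per service, means a trace is either kept in full or dropped in full.
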
  11. Best practices: logging
     • Include trace id in logs
     • Use log levels appropriately
     • Embrace structured logging
     Plain Text Log
       "2020-06-15T11:01:29.571" 500 "ERROR" "payments_module" 45453 "payment request failed for merchant id m123456 while using hdfc netbanking"
     vs JSON Log
       { "timestamp": "2020-06-15T11:01:29.571", "status_code": 500, "log_level": "ERROR", "module": "payments", "merchant_id": "m123456", "payment_mode": "netbanking", ... }
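
A minimal sketch of structured logging with a trace id, using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `context` key passed through `extra`, and the sample trace id are assumptions for this example; the field names follow the JSON log on the slide.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "log_level": record.levelname,
            "module": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

logger.error(
    "payment request failed",
    extra={"context": {
        "trace_id": "4bf92f3577b34da6",  # hypothetical id taken from the active trace
        "status_code": 500,
        "merchant_id": "m123456",
        "payment_mode": "netbanking",
    }},
)
```

With the trace id as its own field, a log search for one id returns every line that request produced across services, and `merchant_id` or `payment_mode` can be queried without parsing free text.
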
  12. Best practices: metrics
     Health of your systems is irrelevant to the customers. Health of each individual request is of supreme consequence.
     RED Method
     • Request Rate (the number of requests per second)
     • Error Rate (the number of those requests that are failing)
     • Duration (the amount of time those requests take)
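
The three RED numbers can be computed from a window of request records, as sketched below. The `Request` type, the `red_summary` function, and the choice of the 95th percentile (nearest-rank) as the duration statistic are assumptions for illustration; in practice a metrics system would typically compute these from counters and histograms.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    failed: bool


def red_summary(requests, window_seconds):
    """Summarize Rate, Errors and Duration for requests seen in one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.failed)
    durations = sorted(r.duration_ms for r in requests)
    if total:
        rank = (95 * total + 99) // 100  # nearest-rank p95, i.e. ceil(0.95 * total)
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    return {
        "rate_per_s": total / window_seconds,
        "errors_per_s": errors / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,
    }


# Hypothetical sample: four requests observed in a two-second window.
sample = [Request(10, False), Request(20, False), Request(30, True), Request(400, False)]
print(red_summary(sample, window_seconds=2))
```

Reporting a high percentile rather than the average keeps the slow requests visible, which fits the slide's point that each individual request matters.
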
  13. Conclusion
     • Observability is important in the era of microservice architecture
     • Embrace instrumentation as part of building software
     • Write -> Test -> Commit -> Release and Observe