
Observability

Tech talk on observability and its pillars, given to multiple groups.

pigol

May 14, 2020

Transcript

  1. When is a developer’s job done?
    1. After dev-complete?
    2. After staging push?
    3. After QA sign-off?
    4. After prod release?
  2. When is a developer’s job done?
    1. After dev-complete?
    2. After staging push?
    3. After QA sign-off?
    4. After prod release?
    Answer: None of the above!
  3. Release to Production is just the beginning!
    “40% to 90% of the total costs of software are incurred after launch.”
    • Facts and Fallacies of Software Engineering, Glass R (2002), Addison-Wesley, p. 115
    • Which Factors affect Software Projects maintenance costs more? Acta Informatica Medica
  4. What to monitor? Let’s take an example: Platform APIs
    • API Latency (Avg, 95th percentile, 99th percentile)
    • CPU, Load Avg
    • Memory
    • Swap
    • JVM Heap Size (via JMX)
    • HTTP Status Codes (200, 300, 400)
    • Exceptions
    • External API call Latencies
    • …
    • And more…
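The latency bullet lists both an average and tail percentiles because they can tell very different stories. A minimal sketch, assuming the nearest-rank definition of a percentile and a made-up sample of request latencies (the function names and the data are illustrative, not from the talk):

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms):
    """Return the three latency views named on the slide."""
    return {
        "avg": statistics.fmean(latencies_ms),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Hypothetical sample: 98 fast requests and two slow outliers.
latencies_ms = [20] * 98 + [900, 1500]
summary = summarize(latencies_ms)
print(summary)  # {'avg': 43.6, 'p95': 20, 'p99': 900}
```

In this sample the average and the 95th percentile both look healthy, while the 99th percentile exposes the slow requests, which is why the tail is worth tracking separately.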
  5. What to monitor? Let’s take an example: Platform APIs
    • API Latency (Avg, 95th percentile, 99th percentile)
    • CPU, Load Avg
    • Memory
    • Swap
    • JVM Heap Size (via JMX)
    • HTTP Status Codes (200, 300, 400)
    • Exceptions
    • External API call Latencies
    • And more…
    × Servers × Clusters
  7. Monitoring
    • Capturing the state of the system to determine its health.
      ◦ Health Checks
        ▪ Is the service running?
        ▪ Can I do more work?
      ◦ Metrics
        ▪ System
        ▪ Application
        ▪ Functional
    • Alerts
      ◦ Anomalous behaviors - How do you define an anomaly?
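The slide leaves "How do you define an anomaly?" open. One common answer, shown here only as a sketch, is to compare a new reading against recent history and flag it when it deviates by more than a chosen number of standard deviations. The threshold of 3 and the sample readings are assumptions for illustration, not something the talk prescribes.

```python
import statistics


def is_anomalous(history, value, z_threshold=3.0):
    """Flag value if it lies more than z_threshold standard deviations from the mean of history."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean  # perfectly flat history: any change is unusual
    return abs(value - mean) / stdev > z_threshold


# Hypothetical requests-per-second readings from a quiet period.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
print(is_anomalous(baseline, 101))  # False
print(is_anomalous(baseline, 140))  # True
```

A rule like this only catches deviations in a metric someone already decided to watch, which is the limitation the next slide raises.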
  8. Monitoring
    • Alerts - Known Failures
      ◦ Knowledge Based
      ◦ Reactive (post-outages)
    What about the unknown failures?
  9. Observability
    Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.
    • https://theagileadmin.com/2018/02/16/monitoring-and-observability/
  10. Observability - Internal States
    • Context Specific
      ◦ Web Servers
        ▪ Availability
        ▪ Incoming Request Rate
        ▪ Latency
        ▪ HTTP Failures
      ◦ Micro-Services
        ▪ Success Rate
        ▪ Functionalities
      ◦ Message Queue
        ▪ Queue Length
        ▪ Consumer/Producer Count
    Needs instrumentation, added while writing the code!
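Instrumentation written alongside the code can be as small as a decorator that records success, failure, and latency for each call. The sketch below uses in-memory dictionaries as a stand-in for a real metrics client; the `instrumented` decorator, the metric names, and the `add_item` function are all hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-ins for a metrics client (assumption for this sketch).
counters = defaultdict(int)
timings_ms = defaultdict(list)


def instrumented(name):
    """Record success/failure counts and latency for every call to the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception:
                counters[f"{name}.failure"] += 1
                raise
            else:
                counters[f"{name}.success"] += 1
                return result
            finally:
                timings_ms[name].append((time.perf_counter() - start) * 1000)
        return wrapper
    return decorator


def success_rate(name):
    """Fraction of recorded calls that succeeded."""
    ok, failed = counters[f"{name}.success"], counters[f"{name}.failure"]
    return ok / (ok + failed) if ok + failed else None


@instrumented("cart.add_item")
def add_item(cart, item):
    if not item:
        raise ValueError("empty item")
    cart.append(item)
    return cart


cart = []
add_item(cart, "book")
try:
    add_item(cart, "")  # fails and is counted as a failure
except ValueError:
    pass

print(success_rate("cart.add_item"))  # 0.5
```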
  11. Observability - Health Checks
    • Health Checks
      ◦ Is the service running?
      ◦ Can I do more work?
    • Methods
      ◦ Broadcast - Gossip Protocols (Cassandra)
      ◦ Register - Service Discovery
      ◦ Health endpoints - ELB, HAProxy, Nginx
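A health endpoint of the kind a load balancer polls can answer both questions from the slide: a response at all means the service is running, and the status code can say whether it can take more work. This is a minimal sketch using only the Python standard library; the `/health` path, the `MAX_IN_FLIGHT` limit, and the JSON field names are assumptions for illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

MAX_IN_FLIGHT = 100  # hypothetical capacity limit
in_flight = 0        # a real service would update this as requests start and finish


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        has_capacity = in_flight < MAX_IN_FLIGHT
        body = json.dumps({"running": True, "can_accept_work": has_capacity}).encode()
        # 503 tells the load balancer to stop routing traffic here for now.
        self.send_response(200 if has_capacity else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def check_health(port):
    """Poll the endpoint the way a load balancer or monitor would."""
    with urlopen(f"http://127.0.0.1:{port}/health", timeout=5) as resp:
        return resp.status, json.loads(resp.read())


# Start the server on a free local port, poll it once, then shut it down.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()
status, payload = check_health(port)
print(status, payload)
server.shutdown()
server.server_close()
```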
  12. Observability - Metrics
    • External State at a broad scope (Time dimension)
      ◦ System
      ◦ Application
        ▪ Success Rate/Failure Rate
        ▪ Latency (internal/external)
        ▪ Error Codes
        ▪ Exceptions
      ◦ Business/Functional
        ▪ Order Rate (Regular, Cancel, Return)
        ▪ Payments
        ▪ Conversion Rates
        ▪ Coupon Issuance/Redemption
        ▪ Points Issued/Redeemed
  13. Observability - Metrics
    • Meaningful Metrics** - Generous
    • Alerts - Judicious
    • Low Cardinality
      ◦ Keep a Watch!
      ◦ Don’t emit for users/orders. We use Logs for that!
    • Provide system summary
    • Questions:
      ◦ How many transactions failed?
      ◦ How many logins succeeded?
    ** https://queue.acm.org/detail.cfm?id=3309571 - Must Read
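The low-cardinality advice can be enforced in code by declaring the allowed label values up front and rejecting anything else, such as a user or order ID. The sketch below is a toy in-memory counter, not the API of any real metrics library; the `MetricCounter` class and the label names are hypothetical.

```python
from collections import Counter


class MetricCounter:
    """Counter keyed by a small, fixed set of label values."""

    def __init__(self, name, allowed_labels):
        self.name = name
        self.allowed = {key: set(values) for key, values in allowed_labels.items()}
        self.values = Counter()

    def inc(self, **labels):
        # Reject unknown labels (for example user_id) so cardinality cannot grow unbounded.
        if set(labels) != set(self.allowed):
            raise ValueError(f"labels must be exactly {sorted(self.allowed)}")
        for key, value in labels.items():
            if value not in self.allowed[key]:
                raise ValueError(f"unexpected value {value!r} for label {key!r}")
        self.values[tuple(sorted(labels.items()))] += 1

    def total(self, **labels):
        """Sum all series matching the given labels; no labels means everything."""
        wanted = set(labels.items())
        return sum(count for key, count in self.values.items() if wanted <= set(key))


transactions = MetricCounter(
    "transactions_total",
    {"status": ["success", "failed"], "method": ["card", "wallet"]},
)
transactions.inc(status="success", method="card")
transactions.inc(status="failed", method="card")
transactions.inc(status="failed", method="wallet")

# "How many transactions failed?" is answered from the summary alone.
print(transactions.total(status="failed"))  # 2

try:
    transactions.inc(status="success", method="card", user_id="u-1")
except ValueError as exc:
    print("rejected:", exc)  # per-user detail belongs in logs, not metrics
```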
  14. Observability - Metrics
    • Tools
      ◦ Graphite
      ◦ InfluxDB
      ◦ Prometheus
      ◦ OpenTSDB
      ◦ Scuba (Facebook)
      ◦ Apache Druid
  15. Observability - Logging
    • Understanding at a smaller scope
      ◦ Request, customer, transaction
    • Ask Questions:
      ◦ Why couldn’t the customer place an order?
      ◦ Why did the transaction fail?
    • Centralised - ElasticSearch, Splunk
    • Searchable - Indexed
    • Correlatable - Common Key (Request Id)
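Correlating logs by a common key usually means every service writes structured log lines that carry the same request ID. A minimal sketch with the standard `logging` module follows; the JSON field names, the service names, and the in-memory stream standing in for a central store such as ElasticSearch are assumptions.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so a central store can index the fields."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        })


stream = io.StringIO()  # stand-in for shipping logs to a central store
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())


def get_logger(service):
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger


orders = get_logger("orders")
payments = get_logger("payments")

# The same request ID travels with the request across services.
request_id = str(uuid.uuid4())
orders.info("order received", extra={"request_id": request_id})
payments.warning("card declined", extra={"request_id": request_id})
orders.info("order received", extra={"request_id": str(uuid.uuid4())})  # unrelated request


def search(wanted_request_id):
    """Return every log entry, from any service, that belongs to one request."""
    entries = [json.loads(line) for line in stream.getvalue().splitlines()]
    return [entry for entry in entries if entry["request_id"] == wanted_request_id]


trail = search(request_id)
for entry in trail:
    print(entry["service"], entry["level"], entry["message"])
```

Searching on the request ID returns the order and payment entries for that one request, which is what answers "Why did the transaction fail?".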
  16. Observability - Tracing
    • Dissect a request into sub-paths (Spans).
    • Profile system usage at a span level.
    • Extract Insights
    Tools:
    • Google Dapper (https://ai.google/research/pubs/pub36356)
    • Twitter Zipkin (https://zipkin.io/)
    • Open Jaeger (https://www.jaegertracing.io/)
    • New Relic
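The idea of dissecting a request into spans can be sketched without any tracing library: each span records its name, its parent, and how long it took, all tied together by one trace ID. The `span` helper and the operation names below are hypothetical, and the module-level stack makes this single-threaded only; real tracers such as Zipkin or Jaeger also propagate context across processes.

```python
import time
import uuid
from contextlib import contextmanager

finished_spans = []  # stand-in for exporting spans to a tracing backend
_active = []         # stack of open spans (single-threaded sketch only)


@contextmanager
def span(name, trace_id):
    """Time a unit of work and link it to the span that encloses it."""
    record = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:8],
        "parent_id": _active[-1]["span_id"] if _active else None,
        "name": name,
    }
    _active.append(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _active.pop()
        finished_spans.append(record)


trace_id = uuid.uuid4().hex
with span("POST /orders", trace_id):
    with span("validate-cart", trace_id):
        time.sleep(0.01)  # simulated work
    with span("charge-payment", trace_id):
        time.sleep(0.02)  # simulated external call

for s in finished_spans:
    print(f'{s["name"]:<15} parent={s["parent_id"]} {s["duration_ms"]:.1f} ms')
```

Comparing the child durations against the root span shows where the request spent its time.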
  17. Service Level Objectives (SLO)
    • Defines a Quantifiable Goal for a service.
    • Measure the goal - Represents the User Experience/Delight Factor.
    • First step before writing a new service: work backwards.
    • Have as few SLOs as possible.
      ◦ Represents the system behaviour.
    * https://landing.google.com/sre/sre-book/chapters/service-level-objectives/
    * https://www.youtube.com/watch?v=tEylFyxbDLE
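A quantifiable goal becomes something you can measure once it is written as a target ratio over a window. The sketch below assumes a hypothetical availability SLO of 99.9% successful requests and made-up request counts; the `error_budget` function and its field names are illustrative. The target is passed as a string so that `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction


def error_budget(slo_target, total_requests, failed_requests):
    """Compare the measured success ratio against an SLO target over one window."""
    if total_requests <= 0:
        raise ValueError("need at least one request in the window")
    target = Fraction(slo_target)  # e.g. "0.999", parsed exactly
    measured = Fraction(total_requests - failed_requests, total_requests)
    allowed_failures = (1 - target) * total_requests
    if allowed_failures:
        remaining = 1 - failed_requests / allowed_failures
    else:
        remaining = Fraction(0)  # a 100% target leaves no budget at all
    return {
        "measured": float(measured),
        "allowed_failures": float(allowed_failures),
        "budget_remaining": float(remaining),  # negative means the budget is overspent
        "slo_met": measured >= target,
    }


# Hypothetical window: one million requests, 400 of which failed.
report = error_budget("0.999", 1_000_000, 400)
print(report)
```

With these numbers the service is allowed 1,000 failures, has used 400 of them, and so still holds 60% of its error budget.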
  18. Service Level Objectives (SLO) - Exercise
    • Cart Service
    • Payments Service
    • Card Generation
    • Order Management Service
    • Communication Engine
  19. References
    • Debugging Production Systems: https://www.youtube.com/watch?v=YlrAakN90D0
    • Pierre Vincent - How to build observable Distributed systems? https://www.youtube.com/watch?v=ACL_YVPD3gw
    • Charity Majors - Observability for Emerging Infra: What Got You Here Won't Get You There https://www.youtube.com/watch?v=1wjovFSCGhE
    • Caitie McCaffrey - Of the Order of Billions: Building Observability at Twitter https://www.youtube.com/watch?v=SC6XuD1tgcQ
    • https://eng.uber.com/observability-at-scale/