How to Know You’re Doing Good: Measuring Performance in Cloud-Native Systems

How do you know where the bottlenecks are in your service, and how do you measure your improvements? As systems become more distributed and heterogeneous, it becomes increasingly difficult to answer questions about performance, reliability, and ultimately, end-user experience. This talk is aimed at developers who are getting started with instrumenting and monitoring their services within a microservices architecture running on common public cloud platforms. It will cover the essentials such as collecting basic performance metrics, how to visualise these metrics, and how to incorporate them in SLAs. We’ll also talk about distributed tracing, events, and logging and how they can help developers fix bugs and diagnose outages quickly. These tools shouldn’t remain the realm of SREs or “DevOps engineers”: in an organisation with strong DevOps practices, developers should be responsible for measuring and setting performance targets of the services they build. By collecting this performance data, teams can justify how they spend their time and effort and prove when they’ve done good.


Jamie Hewland

April 02, 2020


  1. How to Know You’re Doing Good: Measuring Performance in Cloud-Native

    Systems Jamie Hewland DevConf 2020
  2. About me • ~4 years as a Site Reliability Engineer

    (SRE) working on container stuff • Now: 1 year as a Backend Engineer at Aruba working mostly on Python services • Given talks at PyConZA, ScaleConf, KubeCon • Based in Cape Town @jayhewland jamie.hewland@hpe.com #DevConf, #ZATech /in/jhewland @JayH5
  3. Aruba User Experience Insight https://capenetworks.com/

  4. Aruba User Experience Insight My team https://capenetworks.com/

  5. About this talk • For developers, focus on application metrics

    • An intro/crash-course + “what I’ve learnt” • Essential to software engineering, not taught in computer science • Focus on concepts over specific tech (but features DataDog graphs)
  6. “Cloud-native systems” • It’s a buzzword, sorry • But some

    basic properties: • “Microservices”: multiple distributed services that collaborate via HTTP APIs • Cloud Infrastructure as a Service: scale-out, pay people to do stuff not core to the product • DevOps: developers handle operations of what they build • Monitoring as a way to understand the complexity that arises from this
  7. Let’s add metrics • Metrics are just measurements • First

    step is to instrument your application with metrics • Generally just use a library for the monitoring system you have • Often a library or tool is available that can instrument some basics automatically
  8. Anatomy of a metric

    Prometheus: flask_response_total{method="post", code="200"} 42
    StatsD: flask.response.count:42|c|#method:post,code:200
    InfluxDB: flask_responses,method=post,code=200 count=42i
  9. Anatomy of a metric

    Prometheus: flask_response_total{method="post", code="200"} 42 (parts: metric name, labels, value)
    StatsD: flask.response.count:42|c|#method:post,code:200 (parts: metric name, tags (DogStatsD), value, type)
    InfluxDB: flask_responses,method=post,code=200 count=42i (parts: measurement, tags, fields)
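To make the three layouts concrete, here is a minimal sketch that renders the same measurement in each wire format shown on the slide. The function names are hypothetical helpers written for this illustration, not part of any client library, and the sketch skips escaping and validation that real clients perform. It also assumes all InfluxDB fields are integers.

```python
def prometheus_line(name, labels, value):
    """Prometheus text format: name{key="value",...} value"""
    label_str = ",".join(f'{key}="{val}"' for key, val in labels.items())
    return f"{name}{{{label_str}}} {value}"


def dogstatsd_line(name, tags, value, metric_type="c"):
    """DogStatsD datagram: name:value|type|#key:value,... ("c" is a counter)"""
    tag_str = ",".join(f"{key}:{val}" for key, val in tags.items())
    return f"{name}:{value}|{metric_type}|#{tag_str}"


def influx_line(measurement, tags, fields):
    """InfluxDB line protocol: measurement,tag=value,... field=value,...

    Assumption for this sketch: every field is an integer, hence the "i" suffix.
    """
    tag_str = ",".join(f"{key}={val}" for key, val in tags.items())
    field_str = ",".join(f"{key}={val}i" for key, val in fields.items())
    return f"{measurement},{tag_str} {field_str}"


tags = {"method": "post", "code": "200"}
print(prometheus_line("flask_response_total", tags, 42))
print(dogstatsd_line("flask.response.count", tags, 42))
print(influx_line("flask_responses", tags, {"count": 42}))
```

In practice you would use the client library for your monitoring system rather than formatting these strings by hand.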
  10. Names vs tags vs values Metric name corresponds to the

    one thing being measured • Keep standardised across services so that comparisons can be made—generally a graph will have measurements with the same name • But measurements specific to a service should have specific names: e.g. queue_length would be bad but maybe myservice_email_queue_len flask_response_total{method="post", view="customers"} 42 Metric name
  11. Names vs tags vs values Tags are the dimensions to

    query across • e.g. Want to graph request rate per API, don’t want to graph the number of APIs • Almost always string key/value pairs • Usually limited cardinality. e.g. can’t graph request rate per user for a million different users flask_response_total{method="post", view="customers"} 42 Tags
  12. Names vs tags vs values Values are the instantaneous measurement

    • Sometimes different types: e.g. counter, gauge, timing • Usually 64-bit number—but sometimes typed: e.g. boolean, string, geolocation • Sometimes multiple values per measurement flask_response_total{method="post", view="customers"} 42 Value
  13. Basic metrics every app should have • Throughput Rate: requests

    per second • Latency Distribution: response time (ms) • Availability Error rate (%) over time period
  14. • Throughput Rate: requests per second • Latency Distribution: response

    time (ms) • Availability Error rate (%) over time period Basic metrics every app should have Golden Signals: Latency, Traffic, Errors, Saturation RED: Rate, Errors, Duration USE: Utilisation, Saturation, Errors a.k.a. Observability Buzzwords You Should Know By Now!, Harry Martland https://link.medium.com/kihGD3GHk5
  15. • Throughput Rate: requests per second • Latency Distribution: response

    time (ms) • Availability Error rate (%) over time period How much? How fast? How reliably? Basic metrics every app should have
  16. Throughput • Additive across replicas • Requests/transactions/messages per second

    (please) • Often periodic/seasonal
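A small sketch of what "additive across replicas" means in practice: each replica exposes a monotonically increasing request counter, a per-second rate is derived from two samples, and the per-replica rates are summed. The sample data and the `rate_per_second` helper are made up for illustration, and the sketch ignores counter resets (for example after a restart), which real monitoring systems have to handle.

```python
def rate_per_second(samples):
    """Average rate between the first and last (timestamp_s, counter) samples."""
    (t_start, count_start), (t_end, count_end) = samples[0], samples[-1]
    if t_end <= t_start:
        raise ValueError("need two samples with increasing timestamps")
    return (count_end - count_start) / (t_end - t_start)


# Hypothetical counter samples scraped 60 seconds apart from two replicas.
replica_a = [(0, 100), (60, 700)]  # 600 requests in 60 s
replica_b = [(0, 50), (60, 350)]   # 300 requests in 60 s

total_rate = rate_per_second(replica_a) + rate_per_second(replica_b)
print(total_rate)  # 15.0 requests per second for the whole service
```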
  17. Throughput • Throughput per API • Help to diagnose opportunities

    for rearchitecture
  18. Load-balancer Service A Service B Load-balancer Service A Service B

    To Service B
  19. Historical latency • Always measure percentiles: e.g. median (p50), p95,

    p99, p99.9 • Not normally distributed! Mean is virtually useless • Usually estimates
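The following sketch illustrates why the mean says little about a long-tailed latency distribution. It uses a made-up sample in which 95% of requests take 20 ms and 5% take 2 seconds, and a simple nearest-rank percentile computed from the raw values. Monitoring systems usually estimate percentiles from histograms or sketches instead, so treat this as an illustration of the idea rather than how any particular tool computes them.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 95 fast requests and 5 very slow ones.
latencies_ms = [20] * 95 + [2000] * 5

print(statistics.mean(latencies_ms))  # 119: matches no request anyone actually experienced
print(percentile(latencies_ms, 50))   # 20
print(percentile(latencies_ms, 95))   # 20
print(percentile(latencies_ms, 99))   # 2000
```

The mean of 119 ms describes neither the typical request nor the slow tail, while the p50 and p99 together show both.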
  20. None
  21. None
  22. Latency distribution Multi-modal Long tail Lower bound

  23. Latency distribution over time-span including periods before & after optimisation

  24. Throughput x latency • Throughput x latency = total time

    • Very useful metric for finding APIs to optimise: want the popular & slow APIs
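As a rough sketch of the throughput x latency idea, the snippet below ranks a few APIs by the total time spent serving them per second of wall-clock time. The endpoint names and numbers are invented for illustration, and the sketch assumes you have a mean latency per API available, since rate multiplied by mean duration gives the total time spent.

```python
# Hypothetical per-API statistics: requests per second and mean latency in ms.
apis = {
    "GET /customers": {"requests_per_s": 200, "mean_latency_ms": 50},
    "POST /reports": {"requests_per_s": 2, "mean_latency_ms": 3000},
    "GET /health": {"requests_per_s": 50, "mean_latency_ms": 2},
}


def total_time_ms_per_s(stats):
    """Milliseconds of request handling accumulated per second of wall-clock time."""
    return stats["requests_per_s"] * stats["mean_latency_ms"]


ranked = sorted(apis, key=lambda name: total_time_ms_per_s(apis[name]), reverse=True)
for name in ranked:
    print(name, total_time_ms_per_s(apis[name]))
```

Here the popular but fairly fast `GET /customers` comes out ahead of the slow but rare `POST /reports`, which is the kind of result that is hard to see from latency graphs alone.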
  25. Availability availability = good service / total demanded service • Generally

    specify availability in “number of 9s”: 99%, 99.9%… • In practice often calculated: 100% - “error rate” • 100% (real) availability is impossible • User metrics or external probe (synthetic testing)?
  26. Availability • Meaningful: capture the end-user experience • Proportional: change

    in metric proportional to perceived availability • Actionable: insight into why availability was low Time-based: availability = uptime / (uptime + downtime) Count-based: availability = successful requests / total requests Meaningful Availability, Hauer et al., NSDI’20
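The two definitions can give quite different numbers for the same incident. The sketch below computes both for a made-up day in which the service was down for one quiet hour; the figures are hypothetical and only meant to show how the formulas diverge.

```python
def time_based_availability(uptime_s, downtime_s):
    """availability = uptime / (uptime + downtime)"""
    total = uptime_s + downtime_s
    if total == 0:
        raise ValueError("no time observed")
    return uptime_s / total


def count_based_availability(successful_requests, total_requests):
    """availability = successful requests / total requests"""
    if total_requests == 0:
        raise ValueError("no requests observed")
    return successful_requests / total_requests


# Hypothetical day: one hour of downtime during a quiet period,
# in which 500 of the day's 100,000 requests failed.
hour = 3600
print(time_based_availability(23 * hour, 1 * hour))  # about 0.958
print(count_based_availability(99_500, 100_000))     # 0.995
```

Which number is more meaningful depends on what your users experienced, which is the question the slide's three criteria are meant to probe.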
  27. Availability examples Amazon Web Services S3 Error Rate “[…] total

    number of internal server errors returned […] divided by the total number of requests for the applicable request type during that 5-minute interval. […] for each Amazon S3 Service account” Azure Cosmos DB Error Rate “[…] total number of Failed Requests divided by Total Requests, across all Resources in a given Azure subscription, during a given one-hour interval.” https://aws.amazon.com/s3/sla/ https://azure.microsoft.com/en-us/support/legal/sla/cosmos-db/v1_3/
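Both definitions above compute an error rate per fixed time interval. A minimal sketch of that windowing, assuming request records of the form `(timestamp_s, status_code)` and assuming that any 5xx status counts as an error (the providers' own definitions of a failed request are more specific than this):

```python
from collections import defaultdict


def error_rate_by_window(requests, window_s=300):
    """Return {window_start_s: error_rate} for fixed windows (default 5 minutes)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for timestamp_s, status_code in requests:
        window_start = int(timestamp_s // window_s) * window_s
        totals[window_start] += 1
        if status_code >= 500:  # assumption: every 5xx response is an error
            errors[window_start] += 1
    return {start: errors[start] / totals[start] for start in sorted(totals)}


# Hypothetical request log: three requests in the first window, two in the second.
log = [(0, 200), (10, 500), (299, 200), (300, 200), (450, 200)]
rates = error_rate_by_window(log)
print(rates)
```

Windows with no requests do not appear in the result at all, so a real calculation has to decide how to treat them.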
  28. Service level indicator (SLI), service level objective (SLO), service level agreement (SLA)

    • Service level indicator (SLI): a carefully defined quantitative measure of some aspect of the level of service that is provided
    • Service level objective (SLO): a target value or range of values for a service level that is measured by an SLI
    • Service level agreement (SLA): an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain
    Site Reliability Engineering: How Google Runs Production Systems
    SLIs, SLOs, SLAs, oh my! https://youtu.be/tEylFyxbDLE
  29. Dashboards • Useful for debugging incidents or checking on deployments

    • Limitations: • Easily go out-of-date • “Single Pane of Glass” generally impossible • Who looks at it? • Aim: make it easy to check for basic anomalies at a glance
  30. Distributed Tracing • Measure the entire user request across multiple

    distributed services • Share context across request boundaries (usually using headers) • User request generates a trace which has spans for each sub-request Opentracing in Java for microservices observability, Saurav Sharma https://blog.imaginea.com/opentracing-in-java-for-microservices-observability/
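The sketch below shows the basic mechanics of sharing context across request boundaries: the first service starts a trace, passes identifiers to the next service in headers, and each service records a span linked to its parent. The header names, the `Span` class and the helper functions are all hypothetical and chosen for this illustration; real tracing libraries define their own propagation formats and also record timings, which are omitted here.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical header names, used only in this sketch.
TRACE_HEADER = "X-Trace-Id"
PARENT_HEADER = "X-Parent-Span-Id"


@dataclass
class Span:
    trace_id: str             # shared by every span in the same user request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None for the root span
    name: str


def start_span(name, incoming_headers):
    """Continue the trace found in the incoming headers, or start a new one."""
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    parent_id = incoming_headers.get(PARENT_HEADER)
    return Span(trace_id, uuid.uuid4().hex[:16], parent_id, name)


def outgoing_headers(span):
    """Headers to attach to sub-requests made while handling this span."""
    return {TRACE_HEADER: span.trace_id, PARENT_HEADER: span.span_id}


# Service A receives the user request (no trace headers yet) and calls Service B.
root = start_span("service-a: GET /orders", {})
child = start_span("service-b: GET /customers", outgoing_headers(root))

print(child.trace_id == root.trace_id)  # True: both spans belong to one trace
print(child.parent_id == root.span_id)  # True: the child points at its parent
```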
  31. Distributed Tracing challenges • Challenges: • Instrumentation • Visualisation •

    Cost • Our use cases: • Which services involved in request? • Where was most time spent? • Where was the error? Distributed Tracing — we’ve been doing it wrong, Cindy Sridharan https://link.medium.com/THfnK6Xrk5
  32. Error reporting • Report stack traces to external service (e.g. Sentry)

    • Team works through backlog of reported errors • Significantly reduces the need to look at logs • Often some integration with tracing, logs
  33. In conclusion… …this was an introduction • Understand what a

    metric is • Start recording basic metrics now: throughput, latency, error rate • Prove your work using data • Think about how you define availability • Set objectives & contracts for service levels • Explore new techniques like distributed tracing
  34. Thanks! Any questions? @jayhewland jamie.hewland@hpe.com #DevConf, #ZATech /in/jhewland @JayH5

  35. Further reading: - Nines are Not Enough: Meaningful Metrics for

    Clouds, Mogul & Wilkes - HotOS’19 - Distributed Systems Observability, Cindy Sridharan - Observability for Developers (honeycomb.io guide) - Framework for an Observability Maturity Model (honeycomb.io guide)
  36. Observability (o11y), telemetry, monitoring

    • Observability (o11y): a measure of how well internal states of a system can be inferred from knowledge of its external outputs (Wikipedia)
    • Telemetry: the collection of measurements or other data at remote or inaccessible points and their automatic transmission to receiving equipment for monitoring (Wikipedia)
    • Monitoring: observe and check the progress or quality of (something) over a period of time; keep under systematic review (Google Dictionary)