Slide 1

How to Know You’re Doing Good: Measuring Performance in Cloud-Native Systems
Jamie Hewland, DevConf 2020

Slide 2

About me
• ~4 years as a Site Reliability Engineer (SRE) working on container stuff
• Now: 1 year as a Backend Engineer at Aruba, working mostly on Python services
• Given talks at PyConZA, ScaleConf, KubeCon
• Based in Cape Town
@jayhewland · jamie.hewland@hpe.com · /in/jhewland · @JayH5 · #DevConf, #ZATech

Slide 3

Aruba User Experience Insight https://capenetworks.com/

Slide 4

Aruba User Experience Insight My team https://capenetworks.com/

Slide 5

About this talk
• For developers, with a focus on application metrics
• An intro/crash-course plus “what I’ve learnt”
• Essential to software engineering, yet not taught in computer science
• Focus on concepts over specific tech (but features DataDog graphs)

Slide 6

“Cloud-native systems”
• It’s a buzzword, sorry
• But some basic properties:
  • “Microservices”: multiple distributed services that collaborate via HTTP APIs
  • Cloud Infrastructure as a Service: scale out, and pay other people to do the things that aren’t core to the product
  • DevOps: developers handle operations of what they build
• Monitoring as a way to understand the complexity that arises from all of this

Slide 7

Let’s add metrics
• Metrics are just measurements
• The first step is to instrument your application with metrics
• Generally, just use a library for the monitoring system you have (see the sketch below)
• Often a library or tool is available that can instrument some basics automatically
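A minimal sketch (not from the slides; it assumes Flask and the prometheus_client library, and the metric/label names are illustrative) of instrumenting a view by hand:

    # Illustrative sketch: count responses by HTTP method and status code.
    from flask import Flask, jsonify
    from prometheus_client import Counter, make_wsgi_app
    from werkzeug.middleware.dispatcher import DispatcherMiddleware

    app = Flask(__name__)

    # A counter with two labels; the client appends the _total suffix on exposition,
    # so this is scraped as flask_response_total.
    RESPONSES = Counter("flask_response", "HTTP responses", ["method", "code"])

    @app.route("/customers", methods=["POST"])
    def create_customer():
        RESPONSES.labels(method="post", code="200").inc()
        return jsonify({"ok": True})

    # Serve the collected metrics on /metrics for the monitoring system to scrape.
    app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {"/metrics": make_wsgi_app()})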

Slide 8

Anatomy of a metric
Prometheus: flask_response_total{method="post", code="200"} 42
StatsD:     flask.response.count:42|c|#method:post,code:200
InfluxDB:   flask_responses,method=post,code=200 count=42i

Slide 9

Anatomy of a metric (annotated)
Prometheus: flask_response_total{method="post", code="200"} 42 (metric name, labels, value)
StatsD:     flask.response.count:42|c|#method:post,code:200 (metric name, value, type, tags; tags are a DogStatsD extension)
InfluxDB:   flask_responses,method=post,code=200 count=42i (measurement, tags, fields)
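For comparison, a sketch (an assumption, not from the slides) of emitting the StatsD/DogStatsD form above with the datadog Python client:

    # Illustrative sketch: emit the DogStatsD counter shown above.
    from datadog import initialize, statsd

    initialize(statsd_host="127.0.0.1", statsd_port=8125)

    # Produces a wire line like: flask.response.count:42|c|#method:post,code:200
    statsd.increment("flask.response.count", value=42, tags=["method:post", "code:200"])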

Slide 10

Names vs tags vs values
flask_response_total{method="post", view="customers"} 42 ← metric name
• The metric name corresponds to the one thing being measured
• Keep it standardised across services so that comparisons can be made; generally a graph will show measurements with the same name
• But measurements specific to a service should have specific names: e.g. queue_length would be bad, but myservice_email_queue_len might work

Slide 11

Names vs tags vs values
flask_response_total{method="post", view="customers"} 42 ← tags
• Tags are the dimensions you query across: e.g. you want to graph request rate per API, not the number of APIs
• Almost always string key/value pairs
• Usually limited in cardinality: e.g. you can’t graph request rate per user for a million different users (see the sketch below)
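A small illustration of the cardinality point (assumed, using prometheus_client; the names are hypothetical):

    from prometheus_client import Counter

    # Fine: "method" and "view" each have a small, bounded set of values.
    RESPONSES = Counter("flask_response", "HTTP responses", ["method", "view"])
    RESPONSES.labels(method="post", view="customers").inc()

    # Avoid: one label value per user means one time series per user,
    # which is unbounded cardinality for a large user base.
    # PER_USER = Counter("flask_response_by_user", "HTTP responses", ["user_id"])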

Slide 12

Names vs tags vs values
flask_response_total{method="post", view="customers"} 42 ← value
• Values are the instantaneous measurement
• Sometimes different types: e.g. counter, gauge, timing
• Usually a 64-bit number, but sometimes typed: e.g. boolean, string, geolocation
• Sometimes multiple values per measurement

Slide 13

Basic metrics every app should have
• Throughput: rate, in requests per second
• Latency: distribution of response times (ms)
• Availability: error rate (%) over a time period

Slide 14

Basic metrics every app should have
• Throughput: rate, in requests per second
• Latency: distribution of response times (ms)
• Availability: error rate (%) over a time period
These map onto the usual monitoring acronyms:
• Golden Signals: Latency, Traffic, Errors, Saturation
• RED: Rate, Errors, Duration
• USE: Utilisation, Saturation, Errors
a.k.a. Observability Buzzwords You Should Know By Now!, Harry Martland, https://link.medium.com/kihGD3GHk5

Slide 15

Basic metrics every app should have
• Throughput: how much? (rate, in requests per second)
• Latency: how fast? (distribution of response times, in ms)
• Availability: how reliably? (error rate, %, over a time period)
A combined instrumentation sketch follows below.
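A minimal sketch (assuming Flask and prometheus_client; the metric names are illustrative) that records all three basics around every request:

    import time
    from flask import Flask, g, request
    from prometheus_client import Counter, Histogram

    app = Flask(__name__)

    REQUESTS = Counter("app_requests", "Requests handled", ["view", "code"])
    LATENCY = Histogram("app_request_latency_seconds", "Request latency", ["view"])
    ERRORS = Counter("app_request_errors", "Requests that failed with 5xx", ["view"])

    @app.before_request
    def start_timer():
        g.start = time.monotonic()

    @app.after_request
    def record_metrics(response):
        view = request.endpoint or "unknown"
        elapsed = time.monotonic() - g.start
        REQUESTS.labels(view=view, code=str(response.status_code)).inc()  # throughput
        LATENCY.labels(view=view).observe(elapsed)                        # latency distribution
        if response.status_code >= 500:
            ERRORS.labels(view=view).inc()                                # feeds the error rate
        return response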

Slide 16

Throughput
• Additive across replicas
• Measure in requests/transactions/messages per second (please)
• Often periodic/seasonal

Slide 17

Throughput
• Break throughput down per API
• Helps to diagnose opportunities for rearchitecture

Slide 18

[Diagram: a load-balancer in front of Service A and Service B, shown twice, with the traffic “To Service B” highlighted]

Slide 19

Historical latency
• Always measure percentiles: e.g. median (p50), p95, p99, p99.9
• Latency is not normally distributed! The mean is virtually useless (see the illustration below)
• Percentile values are usually estimates
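A small illustration (synthetic data, not from the talk) of why the mean misleads for long-tailed latencies:

    import numpy as np

    # Synthetic latencies in ms: most requests around 50 ms, 2% of them around 2000 ms.
    rng = np.random.default_rng(0)
    latencies = np.concatenate([rng.normal(50, 5, 980), rng.normal(2000, 200, 20)])

    print("mean:", latencies.mean())              # pulled far above the typical request
    print("p50: ", np.percentile(latencies, 50))  # what a typical user sees
    print("p99: ", np.percentile(latencies, 99))  # the slow tail users actually notice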

Slide 20

No content

Slide 21

No content

Slide 22

Latency distribution [histogram annotated with: lower bound, multi-modal peaks, long tail]

Slide 23

Latency distribution over a time-span including periods before & after optimisation

Slide 24

Throughput × latency
• Throughput × latency = total time spent serving an API
• A very useful metric for finding APIs to optimise: you want the ones that are both popular and slow (toy example below)
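A toy sketch (entirely hypothetical numbers) of ranking APIs by throughput × average latency:

    # (requests per second, average latency in ms) per API; made-up figures.
    apis = {
        "GET /customers": (400, 30),
        "POST /reports": (5, 2500),
        "GET /health": (1000, 2),
    }

    # Total serving time generated per second of wall clock, in ms.
    ranked = sorted(apis.items(), key=lambda kv: kv[1][0] * kv[1][1], reverse=True)
    for name, (rate, latency) in ranked:
        print(f"{name}: {rate * latency:.0f} ms of work per second")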

Slide 25

Availability
availability = good service / total demanded service
• Generally specify availability in a “number of nines”: 99%, 99.9%, …
• In practice, often calculated as 100% minus the error rate
• 100% (real) availability is impossible
• Measure it from user metrics or from an external probe (synthetic testing)?

Slide 26

Availability
Time-based:  availability = uptime / (uptime + downtime)
Count-based: availability = successful requests / total requests
(worked example below)
A good availability metric is:
• Meaningful: captures the end-user experience
• Proportional: a change in the metric is proportional to the change in perceived availability
• Actionable: gives insight into why availability was low
Meaningful Availability, Hauer et al., NSDI ’20
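A tiny worked example (illustrative numbers only) of both definitions:

    # Count-based: successful requests / total requests.
    successful, total = 999_532, 1_000_000
    print(f"count-based: {successful / total:.4%}")  # 99.9532%

    # Time-based: uptime / (uptime + downtime), e.g. 43 minutes down in a 30-day month.
    downtime_min, month_min = 43, 30 * 24 * 60
    print(f"time-based: {(month_min - downtime_min) / month_min:.4%}")  # about 99.90%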

Slide 27

Availability examples
Amazon Web Services S3, Error Rate: “[…] total number of internal server errors returned […] divided by the total number of requests for the applicable request type during that 5-minute interval. […] for each Amazon S3 Service account” (https://aws.amazon.com/s3/sla/)
Azure Cosmos DB, Error Rate: “[…] total number of Failed Requests divided by Total Requests, across all Resources in a given Azure subscription, during a given one-hour interval.” (https://azure.microsoft.com/en-us/support/legal/sla/cosmos-db/v1_3/)

Slide 28

• Service level indicator (SLI): a carefully defined quantitative measure of some aspect of the level of service that is provided
• Service level objective (SLO): a target value or range of values for a service level that is measured by an SLI
• Service level agreement (SLA): an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs it contains
Site Reliability Engineering: How Google Runs Production Systems; SLIs, SLOs, SLAs, oh my! https://youtu.be/tEylFyxbDLE
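To make the three concrete, an illustrative example (not from the talk): the SLI could be the fraction of HTTP requests answered successfully within 300 ms; the SLO a target that this fraction stays above 99.9% over a rolling 30-day window; and the SLA a customer contract promising service credits if it falls below, say, 99.5%.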

Slide 29

Dashboards
• Useful for debugging incidents or checking on deployments
• Limitations:
  • Easily go out-of-date
  • A “Single Pane of Glass” is generally impossible
  • Who actually looks at it?
• Aim: make it easy to check for basic anomalies at a glance

Slide 30

Distributed Tracing
• Measure the entire user request across multiple distributed services
• Share context across request boundaries (usually using HTTP headers)
• A user request generates a trace, which has spans for each sub-request (sketch below)
Opentracing in Java for microservices observability, Saurav Sharma, https://blog.imaginea.com/opentracing-in-java-for-microservices-observability/
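A minimal sketch (assuming the OpenTelemetry Python API, which the slides don’t name) of starting a span and propagating its context to a downstream service via HTTP headers:

    import requests
    from opentelemetry import trace
    from opentelemetry.propagate import inject

    tracer = trace.get_tracer("service-a")  # hypothetical instrumentation name

    def call_service_b():
        # Each unit of work becomes a span inside the overall trace.
        with tracer.start_as_current_span("call-service-b"):
            headers = {}
            inject(headers)  # writes the trace context (e.g. traceparent) into the headers
            return requests.get("http://service-b/api", headers=headers)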

Slide 31

Distributed Tracing challenges
• Challenges: instrumentation, visualisation, cost
• Our use cases:
  • Which services were involved in the request?
  • Where was most of the time spent?
  • Where was the error?
Distributed Tracing — we’ve been doing it wrong, Cindy Sridharan, https://link.medium.com/THfnK6Xrk5

Slide 32

Error reporting
• Report stack traces to an external service (e.g. Sentry); see the sketch below
• The team works through the backlog of reported errors
• Significantly reduces the need to look at logs
• Often some integration with tracing and logs
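A minimal sketch (assuming the sentry-sdk Python client; the DSN and the function that fails are placeholders):

    import sentry_sdk

    sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")  # placeholder DSN

    try:
        process_job()  # hypothetical work that may raise
    except Exception as exc:
        # Sends the exception and its stack trace to Sentry for the team to triage.
        sentry_sdk.capture_exception(exc)
        raise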

Slide 33

In conclusion… this was an introduction
• Understand what a metric is
• Start recording the basic metrics now: throughput, latency, error rate
• Prove your work using data
• Think about how you define availability
• Set objectives & contracts for service levels
• Explore new techniques like distributed tracing

Slide 34

Thanks! Any questions?
@jayhewland · jamie.hewland@hpe.com · /in/jhewland · @JayH5 · #DevConf, #ZATech

Slide 35

Further reading: - Nines are Not Enough: Meaningful Metrics for Clouds, Mogul & Wilkes - HotOS’19 - Distributed Systems Observability, Cindy Sridharan - Observability for Developers (honeycomb.io guide) - Framework for an Observability Maturity Model (honeycomb.io guide)

Slide 36

• Observability (o11y): a measure of how well internal states of a system can be inferred from knowledge of its external outputs (Wikipedia)
• Telemetry: the collection of measurements or other data at remote or inaccessible points and their automatic transmission to receiving equipment for monitoring (Wikipedia)
• Monitoring: to observe and check the progress or quality of (something) over a period of time; keep under systematic review (Google Dictionary)