How to Know You’re Doing Good: Measuring Performance in Cloud-Native Systems

How to Know You’re Doing Good: Measuring Performance in Cloud-Native Systems
Jamie Hewland
DevConf 2020

About me
• ~4 years as a Site Reliability Engineer (SRE) working on container stuff
• Now: 1 year as a Backend Engineer at Aruba working mostly on Python services
• Given talks at PyConZA, ScaleConf, KubeCon
• Based in Cape Town
@jayhewland [email protected] #DevConf, #ZATech /in/jhewland @JayH5

Aruba User Experience Insight My team https://capenetworks.com/

About this talk
• For developers, focus on application metrics
• An intro/crash-course + “what I’ve learnt”
• Essential to software engineering, not taught in computer science
• Focus on concepts over specific tech (but features DataDog graphs)

“Cloud-native systems”
• It’s a buzzword, sorry
• But some basic properties:
  • “Microservices”: multiple distributed services that collaborate via HTTP APIs
  • Cloud Infrastructure as a Service: scale-out, pay people to do stuff not core to the product
  • DevOps: developers handle operations of what they build
  • Monitoring as a way to understand the complexity that arises from this

Let’s add metrics
• Metrics are just measurements
• First step is to instrument your application with metrics
• Generally just use a library for the monitoring system you have
• Often a library or tool is available that can instrument some basics automatically
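To make "instrument your application" concrete, here is a minimal sketch in plain Python. The in-memory registry and the `instrumented` decorator are hypothetical names invented for illustration. In practice you would use the client library of your monitoring system instead.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory registry (an assumption for this sketch); a real
# setup would hand these measurements to a monitoring client library.
REQUEST_COUNT = defaultdict(int)      # (view, outcome) -> number of calls
REQUEST_SECONDS = defaultdict(list)   # view -> durations in seconds


def instrumented(view):
    """Count calls to the wrapped function and record how long each took."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "error"
            try:
                result = func(*args, **kwargs)
                outcome = "ok"
                return result
            finally:
                # Runs on success and on exception, so errors are counted too.
                REQUEST_SECONDS[view].append(time.perf_counter() - start)
                REQUEST_COUNT[(view, outcome)] += 1
        return wrapper
    return decorator


@instrumented("customers")
def list_customers():
    return ["alice", "bob"]


@instrumented("broken")
def always_fails():
    raise RuntimeError("boom")
```

Calling `list_customers()` increments the `("customers", "ok")` counter and appends one duration, which already covers the raw data for throughput, latency and error rate.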

Anatomy of a metric
• Prometheus: flask_response_total{method="post", code="200"} 42
  Parts: metric name, labels, value
• StatsD: flask.response.count:42|c|#method:post,code:200
  Parts: metric name, value, type, tags (DogStatsD)
• InfluxDB: flask_responses,method=post,code=200 count=42i
  Parts: measurement, tags, fields
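The three wire formats above carry the same information in a different order. A rough sketch that renders one measurement in each format follows. The function names are made up for this example, and escaping of special characters is deliberately ignored.

```python
def to_prometheus(name, labels, value):
    """Render name{key="value",...} value, as in the Prometheus text format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
    return f"{name}{{{label_str}}} {value}"


def to_dogstatsd_counter(name, tags, value):
    """Render name:value|c|#key:value,... (a counter with DogStatsD tags)."""
    tag_str = ",".join(f"{k}:{v}" for k, v in tags.items())
    return f"{name}:{value}|c|#{tag_str}"


def to_influx_line(measurement, tags, field, value):
    """Render measurement,key=value,... field=<int>i (InfluxDB line protocol)."""
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    return f"{measurement},{tag_str} {field}={value}i"


tags = {"method": "post", "code": "200"}
prometheus_line = to_prometheus("flask_response_total", tags, 42)
statsd_line = to_dogstatsd_counter("flask.response.count", tags, 42)
influx_line = to_influx_line("flask_responses", tags, "count", 42)
```

A real client library also handles escaping, timestamps and batching, so treat this only as a way to see which part of each line is the name, the tags and the value.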

Names vs tags vs values
Metric name corresponds to the one thing being measured
• Keep standardised across services so that comparisons can be made; generally a graph will have measurements with the same name
• But measurements specific to a service should have specific names: e.g. queue_length is too generic, while something like myservice_email_queue_len says which queue is meant
Example: flask_response_total{method="post", view="customers"} 42 (metric name: flask_response_total)

Names vs tags vs values
Tags are the dimensions to query across
• e.g. you want to graph the request rate per API, not the number of APIs
• Almost always string key/value pairs
• Usually limited cardinality: e.g. can’t graph request rate per user for a million different users
Example: flask_response_total{method="post", view="customers"} 42 (tags: method, view)
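The cardinality limit is easier to see with numbers. As a back-of-the-envelope sketch, the worst-case number of distinct time series for one metric name is the product of the number of values each tag can take. The tag values below are invented for illustration.

```python
from math import prod


def max_series(tag_values):
    """Upper bound on distinct series: product of per-tag value counts."""
    return prod(len(values) for values in tag_values.values())


# Hypothetical tag sets for a single metric name.
bounded_tags = {
    "method": ["get", "post"],
    "view": ["customers", "orders", "health"],
    "code": ["200", "404", "500"],
}
bounded = max_series(bounded_tags)

# Adding a per-user tag multiplies the bound by the number of users.
with_user_tag = max_series({**bounded_tags, "user_id": range(1_000_000)})
```

Here the bounded tags give at most 18 series, while adding a `user_id` tag for a million users raises the bound to 18 million, which is why per-user breakdowns usually belong in logs or traces rather than metric tags.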

Names vs tags vs values
Values are the instantaneous measurement
• Sometimes different types: e.g. counter, gauge, timing
• Usually 64-bit number, but sometimes typed: e.g. boolean, string, geolocation
• Sometimes multiple values per measurement
Example: flask_response_total{method="post", view="customers"} 42 (value: 42)

Basic metrics every app should have
• Throughput: rate (requests per second)
• Latency: distribution of response time (ms)
• Availability: error rate (%) over time period
a.k.a.
• Golden Signals: Latency, Traffic, Errors, Saturation
• RED: Rate, Errors, Duration
• USE: Utilisation, Saturation, Errors
Observability Buzzwords You Should Know By Now!, Harry Martland https://link.medium.com/kihGD3GHk5

Basic metrics every app should have
• Throughput (how much?): rate, requests per second
• Latency (how fast?): distribution of response time (ms)
• Availability (how reliably?): error rate (%) over time period

Throughput
• Additive across replicas
• Requests/transactions/messages per second (please)
• Often periodic/seasonal
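A small sketch of what "additive across replicas" means in practice: each replica exposes a monotonically increasing request counter, a per-second rate is derived from two samples, and the per-replica rates are summed. The sample numbers and replica names are made up, and counter resets (e.g. after a restart) are not handled here.

```python
def rate_per_second(count_start, count_end, interval_seconds):
    """Per-second rate from two samples of an ever-increasing counter."""
    return (count_end - count_start) / interval_seconds


# Hypothetical counter samples taken 60 seconds apart on each replica.
INTERVAL_SECONDS = 60
replica_samples = {
    "replica-1": (1000, 1600),
    "replica-2": (400, 700),
}

per_replica = {
    name: rate_per_second(start, end, INTERVAL_SECONDS)
    for name, (start, end) in replica_samples.items()
}

# Throughput is additive: the service-wide rate is the sum over replicas.
service_rate = sum(per_replica.values())
```

With these numbers the replicas serve 10 and 5 requests per second, so the service as a whole handles 15 requests per second.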

Throughput
• Throughput per API
• Helps to identify opportunities for rearchitecture

[Diagram: two versions of an architecture, each with a load-balancer, Service A and Service B; one path is labelled “To Service B”]

Historical latency
• Always measure percentiles: e.g. median (p50), p95, p99, p99.9
• Not normally distributed! Mean is virtually useless
• Reported percentiles are usually estimates
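To illustrate why the mean misleads, here is a minimal sketch using the nearest-rank percentile method on a made-up sample where 95 requests take 20 ms and 5 take 2000 ms. Monitoring systems typically estimate percentiles from histograms or sketches rather than sorting raw samples like this.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies: a fast majority plus a slow tail.
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
```

The mean comes out at 119 ms, a latency that no request actually experienced, while p50 and p95 are 20 ms and p99 exposes the 2000 ms tail.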

[Figure: latency distribution, annotated with: multi-modal, long tail, lower bound]

Latency distribution over time-span including periods before & after optimisation

Throughput x latency
• Throughput x latency = total time
• Very useful metric for finding APIs to optimise: want the popular & slow APIs
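As a sketch of how this product can be used, the snippet below ranks a few hypothetical APIs by throughput multiplied by average latency, i.e. seconds of request time spent per second of wall-clock time. The endpoint names and numbers are invented. Average latency is used here only because count times average equals the total time.

```python
def total_time(throughput_rps, avg_latency_s):
    """Seconds of request handling spent per second of wall-clock time."""
    return throughput_rps * avg_latency_s


# Hypothetical (requests per second, average latency in seconds) per API.
apis = {
    "GET /health": (100.0, 0.002),
    "GET /customers": (40.0, 0.050),
    "POST /reports": (0.5, 3.0),
}

totals = {name: total_time(rps, latency) for name, (rps, latency) in apis.items()}

# Highest total time first: these are the best candidates for optimisation.
ranked = sorted(totals, key=totals.get, reverse=True)
```

In this example the busiest endpoint (`GET /health`) and the slowest one (`POST /reports`) both rank below `GET /customers`, which is both reasonably popular and reasonably slow.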

Availability
availability = good service / total demanded service
• Generally specify availability in “number of 9s”: 99%, 99.9%…
• In practice often calculated as 100% - “error rate”
• 100% (real) availability is impossible
• User metrics or external probe (synthetic testing)?
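A quick worked example of what the "number of 9s" means in practice. This sketch converts an availability target into the downtime it permits over a period, and shows the 100% minus error rate shortcut. The 30-day period is an assumption chosen for illustration.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Minutes of downtime permitted by an availability target over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - availability_percent) / 100


def availability_from_error_rate(error_rate_percent):
    """The common shortcut: availability = 100% - error rate."""
    return 100 - error_rate_percent


two_nines = allowed_downtime_minutes(99.0)
three_nines = allowed_downtime_minutes(99.9)
```

Over 30 days, 99% allows 432 minutes (about 7 hours) of downtime, while 99.9% allows roughly 43 minutes, so each extra 9 cuts the allowance by a factor of ten.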

Availability
• Meaningful: capture the end-user experience
• Proportional: change in metric proportional to perceived availability
• Actionable: insight into why availability was low
Time-based: availability = uptime / (uptime + downtime)
Count-based: availability = successful requests / total requests
Meaningful Availability, Hauer et al., NSDI’20
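The two definitions can give quite different answers for the same incident. The sketch below uses an invented hour in which a 6-minute outage happens during a quiet period: 54 healthy minutes at 1000 requests per minute, and 6 failing minutes at 10 requests per minute.

```python
def time_based_availability(uptime_s, downtime_s):
    """availability = uptime / (uptime + downtime)"""
    return uptime_s / (uptime_s + downtime_s)


def count_based_availability(successful_requests, total_requests):
    """availability = successful requests / total requests"""
    return successful_requests / total_requests


# Hypothetical hour: 54 good minutes, then a 6-minute outage at low traffic.
good_minutes, bad_minutes = 54, 6
successful = good_minutes * 1000          # every request succeeds while up
failed = bad_minutes * 10                 # every request fails while down

by_time = time_based_availability(good_minutes * 60, bad_minutes * 60)
by_count = count_based_availability(successful, successful + failed)
```

By time the hour was 90% available, but by request count it was about 99.9% available, because few users were affected. Which number is more meaningful depends on what the end user actually experienced.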

Availability examples
• Amazon Web Services S3 Error Rate: “[…] total number of internal server errors returned […] divided by the total number of requests for the applicable request type during that 5-minute interval. […] for each Amazon S3 Service account”
  https://aws.amazon.com/s3/sla/
• Azure Cosmos DB Error Rate: “[…] total number of Failed Requests divided by Total Requests, across all Resources in a given Azure subscription, during a given one-hour interval.”
  https://azure.microsoft.com/en-us/support/legal/sla/cosmos-db/v1_3/

• service level indicator (SLI): a carefully defined quantitative measure of some aspect of the level of service that is provided
• service level objective (SLO): a target value or range of values for a service level that is measured by an SLI
• service level agreement (SLA): an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain
Definitions from: Site Reliability Engineering: How Google Runs Production Systems
See also: SLIs, SLOs, SLAs, oh my! https://youtu.be/tEylFyxbDLE
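One way to picture the relationship between an SLI and an SLO: the SLI is a number you measure, and the SLO is the target you compare it against. A minimal sketch follows, with a hypothetical `Slo` class and made-up request counts. An SLA would add contractual consequences on top and is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A target for one SLI, e.g. 'availability of at least 99.9%'."""
    name: str
    target: float  # as a fraction, e.g. 0.999


def availability_sli(successful_requests, total_requests):
    """The indicator: measured fraction of successful requests."""
    return successful_requests / total_requests


def meets_slo(sli_value, slo):
    """The objective: is the measured indicator at or above the target?"""
    return sli_value >= slo.target


# Hypothetical measurement window: 100000 requests, 50 of them failed.
sli = availability_sli(99_950, 100_000)
three_nines_ok = meets_slo(sli, Slo("availability", 0.999))
four_nines_ok = meets_slo(sli, Slo("availability", 0.9999))
```

The same measured SLI of 99.95% satisfies a 99.9% objective but misses a 99.99% one, which is why the objective has to be chosen deliberately rather than read off a dashboard.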

Dashboards
• Useful for debugging incidents or checking on deployments
• Limitations:
  • Easily go out-of-date
  • “Single Pane of Glass” generally impossible
  • Who looks at it?
• Aim: make it easy to check for basic anomalies at a glance

Distributed Tracing
• Measure the entire user request across multiple distributed services
• Share context across request boundaries (usually using headers)
• User request generates a trace which has spans for each sub-request
Opentracing in Java for microservices observability, Saurav Sharma https://blog.imaginea.com/opentracing-in-java-for-microservices-observability/
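The mechanics of sharing context via headers can be sketched without any tracing library. In the toy example below, two in-process functions stand in for Service A and Service B. The header names, the `Span` class and the `COLLECTED` list are all hypothetical. Real systems use a standard propagation format and a tracing client.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical header names, chosen only for this sketch.
TRACE_HEADER = "X-Trace-Id"
PARENT_HEADER = "X-Parent-Span-Id"


@dataclass
class Span:
    trace_id: str             # shared by every span in one user request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # span that caused this one, if any
    name: str


COLLECTED = []  # stand-in for a tracing backend


def start_span(name, headers):
    """Join the trace named in the incoming headers, or start a new one."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    span = Span(trace_id, uuid.uuid4().hex[:16], headers.get(PARENT_HEADER), name)
    COLLECTED.append(span)
    return span


def outgoing_headers(span):
    """Context to attach to any sub-request made while handling this span."""
    return {TRACE_HEADER: span.trace_id, PARENT_HEADER: span.span_id}


def service_b(headers):
    start_span("service-b: lookup", headers)


def service_a(headers):
    span = start_span("service-a: handle request", headers)
    service_b(outgoing_headers(span))  # the "HTTP call" carries the context


service_a({})  # a user request arrives with no trace context
```

After the call, `COLLECTED` holds two spans with the same trace ID, and the Service B span points at the Service A span as its parent, which is what lets a tracing tool reassemble the request.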

Distributed Tracing challenges
• Challenges:
  • Instrumentation
  • Visualisation
  • Cost
• Our use cases:
  • Which services were involved in the request?
  • Where was most time spent?
  • Where was the error?
Distributed Tracing - we’ve been doing it wrong, Cindy Sridharan https://link.medium.com/THfnK6Xrk5

Error reporting
• Report stack traces to external service (e.g. Sentry)
• Team works through backlog of reported errors
• Significantly reduces the need to look at logs
• Often some integration with tracing, logs

In conclusion…
…this was an introduction
• Understand what a metric is
• Start recording basic metrics now: throughput, latency, error rate
• Prove your work using data
• Think about how you define availability
• Set objectives & contracts for service levels
• Explore new techniques like distributed tracing

Thanks! Any questions? @jayhewland [email protected] #DevConf, #ZATech /in/jhewland @JayH5

Further reading:
- Nines are Not Enough: Meaningful Metrics for Clouds, Mogul & Wilkes - HotOS’19
- Distributed Systems Observability, Cindy Sridharan
- Observability for Developers (honeycomb.io guide)
- Framework for an Observability Maturity Model (honeycomb.io guide)

• observability (o11y): a measure of how well internal states of a system can be inferred from knowledge of its external outputs (Wikipedia)
• telemetry: the collection of measurements or other data at remote or inaccessible points and their automatic transmission to receiving equipment for monitoring (Wikipedia)
• monitoring: observe and check the progress or quality of (something) over a period of time; keep under systematic review (Google Dictionary)
