Operational visibility in distributed systems through instrumentation

© 2022, Amazon Web Services, Inc. or its affiliates.

Operational visibility in distributed systems
Mohammed Fazalullah Qudrath
Senior Developer Advocate, MENA
Amazon Web Services

What we will cover today
• Why instrumentation is important
• What to measure
• Log best practices
• High throughput services logging
• Resources to learn more from

What is observability
• Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs
• Monitoring tells you whether a system is working; observability lets you understand why it isn't working
• Good observability allows you to answer the questions you didn’t know that you needed to ask

Observability is monitoring more than failures
• What is the business impact?
• What is the usage?
• Is it behaving as expected?

Monitoring compared to observability
Monitoring
• User saw 503 error
Observability
• Metrics – What browser agents connected to my page? Which saw 503 errors?
• Logs – /purchase/dogtoy failed to connect with credit card company
• Traces – How long did I spend trying to connect with the credit card company? Who else did I try sending traffic to?

Monitoring layers – Traditional deployment
• Server Hardware
• Network/Storage
• (Virtualized) Hardware
• Operating System
• Runtime / Middleware
• Application + Data

Monitoring layers – Container deployment
• Server Hardware
• Network/Storage
• (Virtualized) Hardware
• Operating System
• Container Runtime
• Application Container
• Kubernetes

"A system is observable if the behavior of the entire system can be determined by only looking at its inputs and outputs"
Rudolf E. Kalman, in his 1960 paper on the general theory of control systems

Observability signals
• Metrics
• Traces
• Logs

Signal lifecycle
Stages: Instrument, Capture, Transport, Analyze, Act
• Language-specific SDK OR runtime-specific agent
• Collection agent to transport data
• Dashboarding tool to monitor signals
• Alerting mechanism to send notifications

Instrument

Why instrumentation is important – Amazon.com
• Instrumentation is a lens to learn how a system works
• Amazon.com is built on a service-oriented architecture: many services collaborate with each other to get something done
• An increase in latency deep in the call chain has a ripple effect on overall latency
• Instrumentation enables us to detect and respond to operational events tactically
• Feed data into operational dashboards through instrumentation

What to measure
• As a service owner, I need to:
§ Measure system behavior
§ Make sense of a request flow through a distributed system
§ Have one place to pull together all measurements
• Instrument each service with a trace ID
§ Propagate the trace ID to every other service collaborating on the task
§ Collect instrumentation across systems for a given trace ID
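A minimal sketch of trace ID propagation, assuming the ID travels between services in an HTTP-style header. The header name `X-Trace-Id` and all function names are hypothetical and chosen only for illustration; the talk does not name a specific mechanism here.

```python
import uuid
from typing import Callable, Optional

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, not from the talk


def ensure_trace_id(incoming_headers: dict) -> str:
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    return incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex


def outgoing_headers(trace_id: str, extra: Optional[dict] = None) -> dict:
    """Build headers for a downstream call so the same trace ID travels on."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers


def handle_request(incoming_headers: dict, call_downstream: Callable[[dict], object]) -> dict:
    """Handle one unit of work and pass the trace ID to a collaborating service."""
    trace_id = ensure_trace_id(incoming_headers)
    downstream_result = call_downstream(outgoing_headers(trace_id))
    # Returning the trace ID lets the caller attach it to its own log entry.
    return {"trace_id": trace_id, "downstream": downstream_result}
```

Because every service both reads and forwards the same ID, measurements from all services can later be collected by filtering on that one value.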

Drill down
• Instrumentation is aggregated into metrics that can trigger alarms and display on dashboards
• Why is an anomaly happening? Answering that relies on metrics backed by more instrumentation
• If metrics aren't enough, look at the raw, detailed log data
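The aggregation step can be sketched as follows, assuming each request log entry is a dictionary with `operation`, `latency_ms` and `status` fields. Those field names and the alarm thresholds are hypothetical, not taken from the talk.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(entries):
    """Group request log entries by operation and compute summary metrics."""
    by_operation = defaultdict(list)
    for entry in entries:
        by_operation[entry["operation"]].append(entry)
    metrics = {}
    for operation, group in by_operation.items():
        latencies = [e["latency_ms"] for e in group]
        errors = sum(1 for e in group if e["status"] >= 500)
        metrics[operation] = {
            "count": len(group),
            "error_rate": errors / len(group),
            "p99_latency_ms": percentile(latencies, 99),
        }
    return metrics


def alarms(metrics, max_error_rate=0.01, max_p99_ms=500):
    """Return operations whose metrics breach the (hypothetical) thresholds."""
    return sorted(
        op for op, m in metrics.items()
        if m["error_rate"] > max_error_rate or m["p99_latency_ms"] > max_p99_ms
    )
```

When an alarm fires, the same raw entries that fed the metric are still available for the drill-down described above.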

How to instrument
• Instrumentation requires coding
• Look at adopting common patterns:
§ Standardization for common instrumentation libraries
§ Standardization for structured log-based metric reporting
• An instrumented application writes its telemetry data to a structured log file
– Emitted as one log entry per “unit of work”
• At Amazon we log first, and produce aggregate metrics later
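One way to get "one log entry per unit of work" is a small helper that collects measurements while the work runs and writes a single JSON line at the end. This is a sketch under assumptions: the helper name `unit_of_work` and the field names are hypothetical, and it is not the instrumentation library used at Amazon.

```python
import json
import sys
import time
from contextlib import contextmanager


@contextmanager
def unit_of_work(operation, trace_id, stream=sys.stdout):
    """Collect measurements for one unit of work and emit them as one JSON line."""
    entry = {"operation": operation, "traceId": trace_id, "status": "ok"}
    start = time.perf_counter()
    try:
        yield entry  # the handler adds fields to this dict as it works
    except Exception as exc:
        entry["status"] = "error"
        entry["errorType"] = type(exc).__name__
        raise
    finally:
        # Runs on success and on failure, so exactly one entry is written.
        entry["latencyMs"] = round((time.perf_counter() - start) * 1000, 3)
        entry["timestamp"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        stream.write(json.dumps(entry) + "\n")
```

A handler would wrap its body in `with unit_of_work("CreateOrder", trace_id) as entry:` and set fields such as `entry["orderId"]` as it goes; metrics are then derived from these lines afterwards rather than computed in the request path.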

Different types of logging
Unstructured
[INFO] 2021-04-21T12:34:56Z Order 'fdbf7245' created with 5 products
Semi-structured
[INFO] 2021-04-21T12:34:56Z {"message": "Order 'fdbf7245' created with 5 products"}
Structured
{
  "loglevel": "INFO",
  "timestamp": "2021-04-21T12:34:56Z",
  "message": "Order 'fdbf7245' created with 5 products",
  "orderId": "fdbf7245",
  "productCount": 5
}
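To illustrate why the structured form is easier to work with, the sketch below pulls the order ID and product count out of each of the three formats. The function names are hypothetical; the log lines are the ones from the slide.

```python
import json
import re

UNSTRUCTURED = "[INFO] 2021-04-21T12:34:56Z Order 'fdbf7245' created with 5 products"
SEMI_STRUCTURED = (
    '[INFO] 2021-04-21T12:34:56Z '
    '{"message": "Order \'fdbf7245\' created with 5 products"}'
)
STRUCTURED = (
    '{"loglevel": "INFO", "timestamp": "2021-04-21T12:34:56Z", '
    '"message": "Order \'fdbf7245\' created with 5 products", '
    '"orderId": "fdbf7245", "productCount": 5}'
)

# Tied to the exact wording of the message; it breaks if the text changes.
ORDER_PATTERN = re.compile(r"Order '(?P<order_id>\w+)' created with (?P<count>\d+) products")


def from_unstructured(line):
    """Recover the values by pattern-matching free text."""
    match = ORDER_PATTERN.search(line)
    return match["order_id"], int(match["count"])


def from_semi_structured(line):
    """The prefix is free text; only the payload after it is JSON."""
    payload = json.loads(line[line.index("{"):])
    return from_unstructured(payload["message"])  # still needs the regex


def from_structured(line):
    """Every value is a named, typed field; no text parsing is needed."""
    record = json.loads(line)
    return record["orderId"], record["productCount"]
```

Only the structured form survives a rewording of the human-readable message without any change to the code that reads it.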

Instrumentation through logging
• Instrument services to emit two types of log data:
§ Request data
§ Debugging data
• Request log data is represented as a single structured log entry for each “unit of work”
• Debugging data includes the unstructured debugging lines the application emits
• Consider detailed logs, especially when you have to investigate availability blips, latency spikes and customer-related problems
• Detailed logs allow you to provide customers with answers and improve services

Log best practices

Request log best practices – How we log at Amazon
• Emit one request log entry for every unit of work.
• Emit no more than one request log entry for a given request.
• Break long-running tasks into multiple log entries.
• Record details about the request before doing stuff like validation.
• Plan for a way to log at increased verbosity.
• Ensure log volumes are big enough to handle logging at max throughput.
• Consider the behavior of the system when disks fill up.

Request log best practices – What we log at Amazon
• Log a trace ID and propagate it in backend calls.
• Log the availability and latency of all dependencies.
• Add an additional counter for every error reason.
• Organize errors by category of cause.
• Log important metadata about the unit of work.
• Protect logs with access control and encryption.
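A sketch of how dependency availability, dependency latency and per-reason error counters could be recorded for one unit of work. The class `RequestMetrics`, its methods and its field names are hypothetical helpers written for this example.

```python
import time
from collections import Counter


class RequestMetrics:
    """Measurements gathered during one unit of work (hypothetical helper)."""

    def __init__(self, trace_id):
        self.trace_id = trace_id
        self.dependencies = []          # one record per backend call
        self.error_counts = Counter()   # one counter per error reason

    def call_dependency(self, name, func, *args, **kwargs):
        """Call a dependency, recording its latency and whether it succeeded."""
        start = time.perf_counter()
        ok = False
        try:
            result = func(*args, **kwargs)
            ok = True
            return result
        except Exception as exc:
            # The reason combines dependency name and exception type.
            self.error_counts[f"{name}:{type(exc).__name__}"] += 1
            raise
        finally:
            self.dependencies.append({
                "name": name,
                "available": ok,
                "latencyMs": round((time.perf_counter() - start) * 1000, 3),
            })

    def to_log_entry(self):
        """Fields to merge into the single request log entry."""
        return {
            "traceId": self.trace_id,
            "dependencies": self.dependencies,
            "errors": dict(self.error_counts),
        }
```

Keeping a separate counter per reason means a spike in one failure mode stays visible instead of disappearing into a single generic error count.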

Application log best practices
• Include the corresponding request ID
• Rate-limit an application log’s error spam
• Log request IDs from failed service calls
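Rate-limiting error spam can be sketched with a filter for Python's standard `logging` module. The class name `RateLimitFilter`, its parameters and the fixed-window policy are assumptions made for this example; other policies such as token buckets would work as well.

```python
import logging
import time


class RateLimitFilter(logging.Filter):
    """Allow at most `max_per_window` records per message template per window."""

    def __init__(self, max_per_window=5, window_seconds=60.0, clock=time.monotonic):
        super().__init__()
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.clock = clock
        self._windows = {}  # message template -> (window start, count)

    def filter(self, record):
        now = self.clock()
        # Key on the unformatted template so varying arguments share one budget.
        key = record.msg
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # a new window begins
        count += 1
        self._windows[key] = (start, count)
        return count <= self.max_per_window
```

Attaching it with `logger.addFilter(RateLimitFilter())` drops repeats beyond the budget, so a failing dependency cannot flood the application log and crowd out other messages.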

AWS observability tools
• Container Insights – Fully managed container monitoring platform
• Metrics Explorer – Dynamic dashboards based on resource tags
• Synthetics – Perform real-user monitoring on websites and endpoints
• ServiceLens – Easily correlate logs, metrics and traces to quickly identify service bottlenecks
• X-Ray Insights – Automatically identifies issues using Anomaly Detection and notifies you
• Lambda Insights – Deeper insights into Lambda performance and health metric data
• Contributor Insights – Create time-series data from CloudWatch logs on top contributors

Services for observability
Monitoring areas:
• INFRASTRUCTURE – VMs, Containers, OS
• AWS SERVICES – Vended Monitoring
• APPLICATION PERFORMANCE – Tracing and Profiling
• END-USER – Synthetic Monitoring
Offering types: AWS NATIVE, OPEN SOURCE, PARTNER
Services shown: Amazon CloudWatch, AWS X-Ray, Amazon CodeGuru, Amazon Managed Service for Prometheus, Amazon Managed Service for Grafana, AWS Distro for OpenTelemetry, APN

Services for observability
• AWS X-Ray: Traces, Analytics, Service map
• Amazon CloudWatch: Dashboards, Logs, Metrics, Alarms, Events

Customer use cases for observability
Meet your observability needs with enterprise tools built for the cloud
• Build highly resilient applications
• Find trends and correlate issues
• Monitor infrastructure and applications
• Observe modern applications
• Enable developers to gain insights
• Visualise data
• Quantify real-world user experiences
• Integrate with your existing third-party tools

Querying logs
Amazon CloudWatch Logs Insights
• Interactively search and analyze your log data in CloudWatch Logs
• Processes structured log data
• Flexible purpose-built query language
• Query up to 20 log groups
• Save queries

fields Timestamp, LogLevel, Message
| filter LogLevel == "ERR"
| sort @timestamp desc
| limit 10

https://s12d.com/loginsights-examples
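The same query can also be run programmatically. The sketch below assumes `logs_client` is a boto3 CloudWatch Logs client (for example `boto3.client("logs")`) and uses its `start_query` and `get_query_results` calls; the function name `run_insights_query` and the polling defaults are hypothetical.

```python
import time

QUERY = (
    'fields Timestamp, LogLevel, Message '
    '| filter LogLevel == "ERR" '
    '| sort @timestamp desc '
    '| limit 10'
)


def run_insights_query(logs_client, log_group_names, query, start_time, end_time,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it leaves the running states.

    The client is passed in, so this sketch itself does not import boto3.
    `max_polls` must be at least 1.
    """
    query_id = logs_client.start_query(
        logGroupNames=log_group_names,
        startTime=start_time,  # seconds since the epoch
        endTime=end_time,
        queryString=query,
    )["queryId"]
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    # Each result row is a list of {"field": ..., "value": ...} pairs.
    return [{cell["field"]: cell["value"] for cell in row} for row in response["results"]]
```

Each returned row is a plain dictionary keyed by field name, which makes the results easy to feed into further analysis.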

Enable tracing in distributed applications
AWS X-Ray
• Scales for microservice and serverless architectures
• Identify the root cause of performance issues and errors
• X-Ray provides a cross-service view of requests made to an application
• Aggregates data across services into a trace (via a passed trace identifier)
• Support for Lambda, API Gateway, SNS, Step Functions, DynamoDB, S3, etc.
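The passed trace identifier travels in the `X-Amzn-Trace-Id` HTTP header, with a value such as `Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1`. The sketch below shows that format only; the helper names are hypothetical, and in practice the AWS X-Ray SDK or an OpenTelemetry instrumentation usually manages the header for you.

```python
import os
import time


def new_trace_id(now=None):
    """Build a trace ID shaped like 1-<8 hex epoch seconds>-<24 hex random>."""
    epoch = int(time.time() if now is None else now)
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def parse_trace_header(value):
    """Split 'Root=...;Parent=...;Sampled=...' into a dict of its fields."""
    fields = {}
    for part in value.split(";"):
        if "=" in part:
            key, _, val = part.strip().partition("=")
            fields[key] = val
    return fields


def format_trace_header(root, parent=None, sampled=None):
    """Build the header value to send on a downstream call."""
    parts = [f"Root={root}"]
    if parent is not None:
        parts.append(f"Parent={parent}")
    if sampled is not None:
        parts.append(f"Sampled={int(bool(sampled))}")
    return ";".join(parts)
```

A service that receives the header keeps the `Root` value and forwards it downstream, which is what lets segments from different services be stitched into one trace.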

Data drives decisions
Real-time data drives real-time decisions
• Metrics
• Traces
• Logs
AWS monitoring and observability services help you maintain SLAs by detecting, investigating, and remediating problems such as:
• Reliability
• Availability
• Latency

Additional resources
• Amazon Builders’ Library: Instrumenting distributed systems for operational visibility
• Look at your data, by former Amazonian John Rauser
• Investigating anomalies, by former Amazonian John Rauser
• One Observability Workshop: workshop on monitoring and observability using AWS
• Monitoring production services at Amazon, by David Yanacek, Senior Principal Engineer, AWS

Serverless observability learning path
s12d.com/mastering-observability
