ObservabilityTalk.pdf

Observability in DevOps: Tailor-made for AWS Serverless Workloads
Dipro Chatterjee

Who, why and what(s)?
• Who am I?
• Why am I doing this talk?
• Who may listen to this talk?
• Why would you listen to this talk?

DevOps
• One team responsible for both development and operational excellence
• Planning operational responsibilities starts with planning the development design
• Iterate over operational specifications and implementations, just as development designs are improved upon in an agile way

SRE
• Keep uptime as per SLA
• Defend the SLO
• Monitor SLIs
• Get paged based on error budget policy
• Troubleshoot incidents
• Correlate SLO dashboards and incidents
• Bring the system back within its error budget
• Invest time as per Product owner to improve for mid term or fix

What is ‘Observability’?
• A property of software systems: the ability to understand system behaviour (its state) at any given point in time
• 3 pillars: Logging, Metrics and Tracing
• Who should care about this? DevOps, IT Management, Product, or someone else?
• When should I start caring about this property?
  a. post production
  b. during shipping
  c. requirements gathering and design
  d. development
  e. never?

Why is Observability important?
• To know how the system behaves
• To keep customers happy
• The more observable a system is, the easier it is to monitor, defend and improve.
• Accidents and risk should be embraced.
• Monitoring helps us understand the ‘what’ and ‘why’ of the accident to safeguard
• Monitoring helps us to navigate the ‘how’ while defending against deficit
• Observability helps us to be iterative and agile.

References
• The SRE book from Google - https://landing.google.com/sre/sre-book/toc/index.html
• The SRE workbook from Google - https://landing.google.com/sre/workbook/toc/
• The Operational Excellence white paper from AWS - https://wa.aws.amazon.com/wat.pillar.operationalExcellence.en.html

Golden signals
• Latency
• Traffic
• Errors
• Saturation
• More signals, less noise

Sample Product
• Simple HTTP REST API using API Gateway, Lambda and DynamoDB
• Create an inventory of groceries with their expiry date and quantity
• Measure customer happiness with SLIs on availability and latency
• Measure system state with SLIs on error rate
• Create an SLO document based on these SLIs and intuitive thresholds
• Create an error budget and alerting policy
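
To make the sample product concrete, here is a minimal sketch of what the Lambda side of such an API could look like. It is not the code from the talk: the table name `GroceryInventory`, the field names `name`, `expiry_date` and `quantity`, and the `make_handler` helper are all assumptions made for illustration. The table is injected so the handler can be exercised without AWS access.

```python
import json


def make_handler(table):
    """Build a Lambda handler that stores grocery items in `table`.

    `table` only needs a put_item(Item=...) method, which is what a
    boto3 DynamoDB Table resource provides.
    """
    def handler(event, context):
        try:
            # API Gateway proxy integration passes the request body as a string.
            item = json.loads(event.get("body") or "{}")
            record = {
                "name": str(item["name"]),                # hypothetical field
                "expiry_date": str(item["expiry_date"]),  # hypothetical field
                "quantity": int(item["quantity"]),        # hypothetical field
            }
        except (KeyError, ValueError, TypeError):
            # Client errors are reported as 4xx so they can be counted
            # separately from server-side failures.
            return {"statusCode": 400,
                    "body": json.dumps({"error": "invalid item"})}
        table.put_item(Item=record)
        return {"statusCode": 201, "body": json.dumps(record)}
    return handler


def dynamodb_table(name="GroceryInventory"):
    """Return a DynamoDB table resource (table name is an assumption)."""
    import boto3  # imported lazily; only needed when running on AWS
    return boto3.resource("dynamodb").Table(name)
```

When deployed, a module-level `handler = make_handler(dynamodb_table())` would be the Lambda entry point. Keeping 4xx and 5xx responses distinct matters later, because the availability and error-rate SLIs depend on which responses count as failures.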

Sample Product AWS Sequence Flow

Sample Product
• Error rate
• P99 latency
• Tail latency
• Latency histogram
• Availability rate
• Logs
• Traces

What is SLI?
• Some SLIs
• Availability - ratio of successful requests to total requests
• Latency - latency graph or histogram
• Latency - ratio of requests within threshold to total requests
• Latency - p99 - the latency within which 99% of requests complete
• Error rate - ratio of errors to total requests
• Saturation - CPU utilisation or memory management graphs
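
The ratio-style SLIs listed above can be sketched in a few lines. This is an illustrative example only: the `Request` record, the choice to treat HTTP 5xx responses as failures, and the nearest-rank percentile method are assumptions, not definitions taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # end-to-end latency in milliseconds


def availability(requests):
    """Ratio of successful requests (assumed: anything below 500) to total."""
    return sum(1 for r in requests if r.status < 500) / len(requests)


def error_rate(requests):
    """Ratio of failed requests (assumed: HTTP 5xx) to total requests."""
    return sum(1 for r in requests if r.status >= 500) / len(requests)


def latency_within(requests, threshold_ms):
    """Ratio of requests that completed within the latency threshold."""
    return sum(1 for r in requests if r.latency_ms <= threshold_ms) / len(requests)


def percentile_latency(requests, pct):
    """Nearest-rank percentile: smallest latency covering pct% of requests."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# Made-up sample: 95 fast successes, 3 slow successes, 2 slow server errors.
sample = ([Request(200, 120.0)] * 95
          + [Request(200, 800.0)] * 3
          + [Request(500, 2000.0)] * 2)
```

With this made-up sample, availability is 98 out of 100, the share of requests within 300 ms is 95 out of 100, and the p99 latency is 2000 ms, which shows how a small number of slow requests dominates the tail.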

Measuring SLIs
• Mean, sum, histogram, ratio, percentiles
• Mean (average) latency can hide the latency of tail requests
• Success ratio (availability) does not say much about the failed requests
• Ratio of errors does not point to why those errors happened
• Granularity of buckets matters in histograms, for example latency over 100-300 ms, 300-900 ms, and 900 ms and beyond
• Granularity of time matters in time series graphs, for example CPU utilisation over 30 s vs 1 min
• Hierarchical metrics for distributed systems
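
A small worked example of the first two measurement pitfalls above: the mean looks healthy while the histogram exposes the tail. The bucket edges follow the 100-300 ms, 300-900 ms and 900 ms-and-beyond example from the slide; the extra "under 100 ms" bucket and the sample latencies are assumptions added for illustration.

```python
from bisect import bisect_right
from statistics import mean

# Bucket boundaries in milliseconds; lower bounds are inclusive.
BUCKET_EDGES_MS = [100, 300, 900]
BUCKET_LABELS = ["<100ms", "100-300ms", "300-900ms", ">=900ms"]


def latency_histogram(latencies_ms):
    """Count how many latencies fall into each bucket."""
    counts = [0] * len(BUCKET_LABELS)
    for latency in latencies_ms:
        counts[bisect_right(BUCKET_EDGES_MS, latency)] += 1
    return dict(zip(BUCKET_LABELS, counts))


# Made-up data: 98 requests at 120 ms and 2 requests at 5 seconds.
latencies = [120] * 98 + [5000] * 2

average = mean(latencies)                 # 217.6 ms, looks acceptable
histogram = latency_histogram(latencies)  # shows 2 requests at or beyond 900 ms
```

The average of 217.6 ms gives no hint that two customers waited five seconds, whereas the last histogram bucket makes those requests visible. Choosing coarser buckets (for example a single "100 ms and beyond" bucket) would hide them again, which is the granularity point made on the slide.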

AWS CloudWatch
• Metrics come for free
• Lambda metrics
• DynamoDB metrics
• API Gateway metrics
• Custom metrics using CloudWatch
• CloudWatch alarms
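
One way to publish a custom metric from a Lambda function is the CloudWatch Embedded Metric Format (EMF): a structured JSON log line written to standard output, from which CloudWatch extracts the metric. The talk does not say which mechanism it used, so treat this as one possible sketch. The namespace `GroceryInventory`, the `Service` dimension, the service name `inventory-api` and the metric name `RequestLatency` are all hypothetical.

```python
import json
import time


def emf_record(namespace, service, metric_name, value, unit="Milliseconds"):
    """Build a log record in CloudWatch Embedded Metric Format."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["Service"]],  # refers to the "Service" key below
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        "Service": service,    # dimension value
        metric_name: value,    # metric value
    }


def emit(record):
    """Write the record as one JSON line; Lambda forwards stdout to CloudWatch Logs."""
    print(json.dumps(record))


# Hypothetical usage inside a handler, after timing a request:
emit(emf_record("GroceryInventory", "inventory-api", "RequestLatency", 42.0))
```

A CloudWatch alarm can then be defined on the resulting metric in the same way as on the built-in Lambda, DynamoDB and API Gateway metrics.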

What is SLO?
• Uses both black box and white box monitoring
• Black box helps in measuring customer satisfaction and paging SREs
• White box helps in understanding system state and correlating incidents, logs and traces

What is Error budget policy?
• The error budget is decided by the DevOps team (wearing both SRE and development hats) together with the Product owner
• The stakeholders must agree on the alerts and their frequency
• Reasonable actions are needed to defend the error budget
• Trade-offs must be considered, and the error budget policy revisited after every incident
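
As a rough illustration of how an error budget and an alerting policy fit together, the sketch below derives a request-based budget from an SLO target and maps the remaining budget to an action. The thresholds (open a ticket below 25% remaining, page once the budget is exhausted) are invented for this example; in practice they are exactly what the team and the Product owner have to agree on.

```python
def error_budget(slo_target, total_requests):
    """Number of requests that may fail in the window without breaking the SLO."""
    return (1 - slo_target) * total_requests


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative when overspent)."""
    return 1 - failed_requests / error_budget(slo_target, total_requests)


def policy_action(remaining):
    """Hypothetical policy: thresholds are assumptions, not from the talk."""
    if remaining <= 0:
        return "page"    # budget exhausted: wake someone up
    if remaining < 0.25:
        return "ticket"  # budget running low: plan reliability work
    return "ok"


# A 99.9% SLO over 1,000,000 requests allows roughly 1,000 failures.
action = policy_action(budget_remaining(0.999, 1_000_000, 400))
```

With 400 failures roughly 60% of the budget is left and no action is needed; at 900 failures the sketch opens a ticket, and at 1,200 failures it pages.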

What is SLA?
• Availability agreement with the customer
• Financial compensation, if broken
• Not the same as SLO
• https://uptime.is/
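
The arithmetic behind an availability SLA is simple enough to sketch: an availability percentage over a period translates directly into an allowed amount of downtime. The function name and the 30-day default period are assumptions chosen for illustration.

```python
from datetime import timedelta


def allowed_downtime(availability_pct, period=timedelta(days=30)):
    """Downtime permitted over `period` at the given availability percentage."""
    return period * (1 - availability_pct / 100)


three_nines = allowed_downtime(99.9)   # about 43 minutes per 30 days
four_nines = allowed_downtime(99.99)   # about 4 minutes 19 seconds per 30 days
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why the SLA offered to customers is usually looser than the internal SLO the team defends.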

SRE Takeaways
• Protect your health
• Prevent spam alerts
• Re-evaluate the SLO document periodically
• Negotiate the on-call agreement

Product Owner Takeaways
• Think of opaque metrics to measure both internal and external customer satisfaction
• Decision matrix for incidents
• Protect DevOps team and product health
• Re-evaluate the error budget periodically

Thank you!
• Twitter - https://twitter.com/chatterjeedipro
• LinkedIn - https://www.linkedin.com/in/diprochatterjee/
