AWS Observability Made Simple

Slide 1

Slide 1 text

AWS Observability made simple Eóin Shanaghy - Luciano Mammino AWS Community Day - November 11th 2021 fth.link/o11y-simple

Slide 2

Slide 2 text

Hi! I’m Eoin 🙂 CTO aiasaservicebook.com @eoins eoins ✉ Get in touch

Slide 3

Slide 3 text

👋 Hello, I am Luciano Senior architect nodejsdesignpatterns.com Let’s connect: 🌎 loige.co 🐦 @loige 🎥 loige 🧳 lucianomammino

Slide 4

Slide 4 text

We are business focused technologists that deliver. Accelerated Serverless | AI as a Service | Platform Modernisation We are hiring! Let’s have a chat 🙂

Slide 5

Slide 5 text

Check out our new Podcast! awsbites.com

Slide 6

Slide 6 text

fth.link/o11y-simple

Slide 7

Slide 7 text

Observability in the cloud

“A measure of how well internal states of a system can be inferred from knowledge of its external outputs”

🪵 Structured Logs
🔍 Tracing
📈 Metrics
🚨 Alarms

Slide 8

Slide 8 text

A typical case study
⚡ Serverless app
● Distributed system (100s of components)
🔌 HTTP APIs using
● Lambda
● DynamoDB
● API Gateway
● Cognito
🧱 Multiple services / stacks
🏁 Using SLIC Starter (fth.link/slic)
173 resources!

Slide 9

Slide 9 text

A typical case study
⚽ The goal: know about problems before users do
How?
📝 Structured Logs
📐 Metrics
🔔 Alarms
📊 Dashboards
🗺 Traces (X-Ray)
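As a rough illustration of what "structured logs" means in practice, the sketch below emits each log entry as a single JSON object so that fields can be filtered and aggregated later. The function name and the field names (`requestId`, `table`) are hypothetical and not taken from the talk.

```python
import json
import time


def structured_log(level, message, **fields):
    """Build one JSON log line; extra keyword arguments become searchable fields."""
    record = {
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
        "level": level,
        "message": message,
    }
    record.update(fields)
    return json.dumps(record)


# Hypothetical example entry: the field names are illustrative only.
line = structured_log("ERROR", "update failed", requestId="abc-123", table="items")
print(line)
```

Because every entry is machine-readable, a log query tool can select on a field such as `requestId` instead of relying on free-text search.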

Slide 10

Slide 10 text

Can we test our observability?
● We run a stress test
○ Simulate traffic using the integration test
○ Run the test a number of times in parallel (in a loop)
○ Exercises all the APIs with typical use cases (login, CRUD operations, etc.)
🚨 After 10-15 minutes, we started to get alarms...
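The stress test described above could be sketched roughly as follows. This is not the speakers' tooling: `run_integration_test` is a hypothetical stand-in for one pass of the real integration test, and the parallelism and iteration counts are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def run_integration_test(run_id):
    """Hypothetical stand-in for one pass of the integration test suite."""
    # A real version would log in and exercise the CRUD APIs over HTTP.
    return {"run": run_id, "ok": True}


def stress_test(parallel_runs, iterations):
    """Run the integration test `parallel_runs` at a time, `iterations` times over."""
    results = []
    with ThreadPoolExecutor(max_workers=parallel_runs) as pool:
        for i in range(iterations):
            batch = range(i * parallel_runs, (i + 1) * parallel_runs)
            # pool.map returns results in the same order as the inputs
            results.extend(pool.map(run_integration_test, batch))
    return results


results = stress_test(parallel_runs=4, iterations=3)
failures = sum(not r["ok"] for r in results)
print(len(results), "runs,", failures, "failures")
```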

Slide 11

Slide 11 text

🚨 Alerts flow!

Slide 12

Slide 12 text

Making sense of alerts

Slide 13

Slide 13 text

Initial Hypothesis
🛑 We got throttled (DynamoDB write throttle events)
↪ 🔁 causing AWS SDK retries (in the Lambda function)
↪ ⏱ causing Lambda timeouts
↪ 👎 causing API Gateway 502
🧪 How do we validate this?
1. Check the timeout cause ➡ Lambda metrics/logs
2. Check the Lambda error cause ➡ Lambda logs
3. Identify the source of 5xx errors in API Gateway ➡ X-Ray
4. Check the DynamoDB metrics ➡ Dashboards

Slide 14

Slide 14 text

Gathering evidence

Slide 15

Slide 15 text

Checking timeouts
● Check Lambda timeouts
○ Duration metrics (aggregated data)
○ Logs (individual requests)
● Logs Insights gives us the duration of each individual request. We can use this to isolate the logs for just that request.
● We use stats to see how many executions are affected.
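To make the idea concrete outside of Logs Insights, here is a small sketch that pulls the duration out of Lambda `REPORT` log lines and counts how many invocations reached a timeout threshold. The sample lines and the 6000 ms threshold are made up for illustration; the talk does not state the configured timeout.

```python
import re

# Lambda writes one REPORT line per invocation, including its duration.
REPORT_RE = re.compile(r"REPORT RequestId: (\S+)\s+Duration: ([\d.]+) ms")


def slow_requests(log_lines, threshold_ms):
    """Return (request_id, duration_ms) for invocations at or above the threshold."""
    slow = []
    for line in log_lines:
        match = REPORT_RE.search(line)
        if match and float(match.group(2)) >= threshold_ms:
            slow.append((match.group(1), float(match.group(2))))
    return slow


# Made-up sample log lines, not real output from the case study.
sample = [
    "REPORT RequestId: aaa-111\tDuration: 120.45 ms\tBilled Duration: 121 ms",
    "REPORT RequestId: bbb-222\tDuration: 6000.00 ms\tBilled Duration: 6000 ms",
    "START RequestId: ccc-333 Version: $LATEST",
]
affected = slow_requests(sample, threshold_ms=6000)
print(len(affected), "of", len(sample), "lines show a timeout-length duration")
```

The request IDs returned this way can then be used to isolate the full logs for just the affected requests.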

Slide 16

Slide 16 text

Inspecting DynamoDB Capacity

Slide 17

Slide 17 text

Tracing errors

Slide 18

Slide 18 text

HTTP 502

Slide 19

Slide 19 text

HTTP 500 UNEXPECTED! 😱

Slide 20

Slide 20 text

Lambda CloudWatch Logs

Slide 21

Slide 21 text

Conclusions

1. 🌡 Symptom: DynamoDB throttles
   🐞 Problem: Table with low provisioned WCUs (write capacity)
   Resolution:
   ● Switch table to PAY_PER_REQUEST
   ● Add throttling in API Gateway to limit potential cost impact

2. 🌡 Symptom: API 502 errors
   🐞 Problem: Lambda timeouts. Throttles caused DynamoDB retries with exponential backoff, up to 50 seconds of retry
   Resolution:
   ● Change maxRetries to 3 (350ms max retry)

3. 🌡 Symptom: API 500 errors
   🐞 Problem: Attempt to update a missing record (a problem with the integration test!)
   Resolution:
   ● Fix the integration test to ensure deletion occurs after other actions complete
   ● Also improved the API design
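The retry figures above can be sanity-checked with a little arithmetic. The sketch below assumes a base delay of 50 ms that doubles on each retry and a default of 10 retries; those two values are assumptions chosen because they reproduce the numbers quoted in the talk, and any jitter applied by the SDK would only lower the totals.

```python
def max_total_backoff_ms(max_retries, base_ms=50):
    """Upper bound on cumulative retry delay when the delay doubles each attempt."""
    # Delays are base, 2*base, 4*base, ... for each successive retry.
    return sum(base_ms * 2 ** attempt for attempt in range(max_retries))


default_total = max_total_backoff_ms(10)  # assumed default retry count
reduced_total = max_total_backoff_ms(3)   # after changing maxRetries to 3

print(default_total, "ms with 10 retries")  # about 50 seconds
print(reduced_total, "ms with 3 retries")
```

With the assumed defaults, the worst case is far longer than a typical Lambda timeout, which is consistent with throttling surfacing as timeouts and 502 responses.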

Slide 22

Slide 22 text

Before and after

Slide 23

Slide 23 text

What we have learned so far
● We were able to identify, understand and fix these errors quite quickly
● We didn’t have to change the code to do that
● Nor did we run it locally with a debugger
● All of this was possible because we configured observability tools in AWS in advance

Slide 24

Slide 24 text

AWS native o11y = CloudWatch
CloudWatch gives you:
➔ Logs with Insights
➔ Metrics
➔ Dashboards
➔ Alarms
➔ Canaries
➔ Distributed tracing (with X-Ray)

Slide 25

Slide 25 text

Alternatives outside AWS Established New entrants Roll your own (only for the brave)

Slide 26

Slide 26 text

CloudWatch out of the box
😍 A toolkit you can use to build observability
🤩 Metrics are automatically generated for all services!
😟 Lots of dashboards, but by service and not by application!
😢 Zero alarms out of the box!

Slide 27

Slide 27 text

Getting the best out of CloudWatch
CloudWatch can be your friend if you...
📚 Research and understand available metrics
📐 Decide thresholds
📊 Write IaC for application dashboards
⏰ Write IaC for service metric alarms
⏪ Update every time your application changes
📋 Copy and paste for each stack in your application
(a.k.a. A LOT OF WORK!)

Slide 28

Slide 28 text

Best practices
😇 AWS Well-Architected Framework
🏛 5 Pillars
⚙ Operational excellence pillar covers observability
🧐 Serverless lens applies these pillars
👍 Good guidance on metrics to observe
👎 More reading and research + you still have to pick thresholds

Slide 29

Slide 29 text

CloudFormation for CloudWatch Alarms 😬

"Type": "AWS::CloudWatch::Alarm",
"Properties": {
  "ActionsEnabled": true,
  "AlarmActions": [
    "arn:aws:sns:eu-west-1:665863320777:FTSLICAlarms"
  ],
  "AlarmName": "LambdaThrottles_serverless-test-project-dev-hello",
  "AlarmDescription": "Throttles % for serverless-test-project-dev-hello ..",
  "EvaluationPeriods": 1,
  "ComparisonOperator": "GreaterThanThreshold",
  "Threshold": 0,
  "TreatMissingData": "notBreaching",
  "Metrics": [
    {
      "Id": "throttles_pc",
      "Expression": "(throttles / (throttles + invocations)) * 100",
      "Label": "% Throttles",
      "ReturnData": true
    },
    {
      "Id": "throttles",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "Throttles",
          "Dimensions": [
            {
              "Name": "FunctionName",
              "Value": "serverless-test-project-dev-hello"
            }
          ]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    {
      "Id": "invocations",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "Invocations",
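The "% Throttles" metric math in this alarm is meant to express throttles as a share of all attempted invocations, which only works if the sum in the denominator is parenthesised. The sketch below reproduces the calculation in plain Python; returning 0 when there is no traffic is an assumption made here for convenience, not a statement about how CloudWatch treats empty periods.

```python
def throttles_pc(throttles, invocations):
    """Throttled requests as a percentage of throttled plus successful invocations."""
    total = throttles + invocations
    if total == 0:
        return 0.0  # assumption: treat "no traffic" as 0% throttled
    return 100 * throttles / total


pc = throttles_pc(throttles=1, invocations=3)
print(pc)

# Without parentheses around the sum, operator precedence gives
# (1 / 1 + 3) * 100 = 400, which is not a percentage at all.
unparenthesised = (1 / 1 + 3) * 100
print(unparenthesised)
```

With a threshold of 0 and `GreaterThanThreshold`, any non-zero value of this percentage in an evaluation period would put the alarm into the alarm state.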

Slide 30

Slide 30 text

Can we automate this? Magically generated alarms and dashboards for each application!

Slide 31

Slide 31 text

fth.link/slic-watch Introducing SLIC Watch

Slide 32

Slide 32 text

How SLIC Watch works
🛠 Your app (serverless.yml)
➡ sls deploy
➡ CloudFormation stack (very-big.json)
➡ SLIC Watch 👀 🛠
➡ CloudFormation stack ++ (even-bigger.json)
➡ Deploy ☁ 📊📈

Slide 33

Slide 33 text

Before SLIC Watch

Slide 34

Slide 34 text

After SLIC Watch

Slide 35

Slide 35 text

After SLIC Watch

Slide 36

Slide 36 text

After SLIC Watch

Slide 37

Slide 37 text

After SLIC Watch

Slide 38

Slide 38 text

After SLIC Watch Check out SLIC Slack

Slide 39

Slide 39 text

Configuration
🎀 SLIC Watch comes with sane defaults
📝 You can configure what you don’t like
🔌 Or disable specific dashboards or alarms

Slide 40

Slide 40 text

How to get started
📣 Create an SNS Topic as the alarm destination (optional)
📦 ❯ npm install serverless-slic-watch-plugin --save-dev
✍ Update serverless.yml
    plugins:
      - serverless-slic-watch-plugin
⚙ Configure (optional)
🚢 ❯ sls deploy
💡 Check out the complete example project in the repo!

Slide 41

Slide 41 text

Wrapping up 🎁
★ If your services are failing, you definitely want to know about it!
★ Observability can save you from hundreds of hours of blind debugging!
★ CloudWatch is the go-to tool in AWS, but you have to configure it!
★ Automation can take most of the configuration pain away
★ SLIC Watch can give you this automation
★ You still have control and flexibility
🔬 Try it out! 🗣 Give feedback! 🌈 Let’s make it better!
fth.link/slic-watch

Slide 42

Slide 42 text

Thank you! fth.link/o11y-simple Cover picture by Markus Spiske on Unsplash