Failure modes in AWS serverless services - ServerlessDays Cardiff 2020

Sara Gerion
February 13, 2020


Public talk presented at the ServerlessDays Cardiff 2020 conference.

Abstract:

At DAZN, we often rely on AWS serverless services to build a global platform that needs to be highly scalable and highly available. Serverless computing allows developers to focus on implementing business logic rather than on cluster provisioning, operating-system maintenance and container orchestration. It also comes with great fault tolerance and reliability. How many times have you heard the phrase: 'You don't need to worry about it, AWS takes care of that for you'? While this is certainly true in some scenarios, it's not always the case.
I will go through some examples of not-so-obvious failure modes of serverless services in AWS, and how to mitigate them, from a backend engineer's perspective.


Transcript

  1. @Sarutule 1 FAILURE MODES In AWS serverless services

  2. @Sarutule 2 FAILURE

  3. @Sarutule 3 Simplifying with serverless computing

  4. @Sarutule 4

  5. @Sarutule 5

  6. @Sarutule 6 It seems like the solution to all my problems, right?
  7. @Sarutule 7 Lambda asynchronous invocations

  8. @Sarutule 8 Lambda asynchronous invocations Errors retried twice, consecutively (three attempts in total) + DLQ support
  9. @Sarutule 9 Lambda asynchronous invocations 1. Abnormal number of timeouts on stage and production
  10. @Sarutule 10 Lambda asynchronous invocations 1. Abnormal number of timeouts on stage and production 2. Unable to customize number of retries
  11. @Sarutule 11 Lambda asynchronous invocations 1. Abnormal number of timeouts on stage and production 2. Unable to customize number of retries 3. Hard time limit of 15 minutes
  12. @Sarutule 12 Our solution

  13. @Sarutule 13 Event source mapping

  14. @Sarutule 14 Event source mapping

  15. @Sarutule 15

  16. @Sarutule 16 Event source mapping In case of a failure, the batch is retried with exponential backoff until one of the following: success, records expire (24 h by default), or the max retry limit is reached
  17. @Sarutule 17 Event source mapping In case of a failure, the batch is retried with exponential backoff until one of the following: success, records expire (24 h by default), or the max retry limit is reached. With default settings, a “bad” record can block the processing of the stream for an entire day!
  18. @Sarutule 18 Batch operations to DynamoDB

  19. @Sarutule 19 Batch operations to DynamoDB

  20. @Sarutule 20

  21. @Sarutule 21 !!!

  22. @Sarutule 22 !!!

  23. @Sarutule 23 Batch operations to DynamoDB The operation is done via HTTP. Retry policy in the AWS SDK (Node.js)
  24. @Sarutule 24 5xx errors reported by API Gateway

  25. @Sarutule 25 5xx errors reported by API Gateway 5xx

  26. @Sarutule 26 5xx errors reported by API Gateway 5xx

  27. @Sarutule 27

  28. @Sarutule 28 API Gateway Service Commitment AWS will use commercially reasonable efforts to make API Gateway available with a Monthly Uptime Percentage of at least 99.95% for each AWS region, during any monthly billing cycle (the “Service Commitment”). In the event API Gateway does not meet the Service Commitment, you will be eligible to receive a Service Credit as described below. https://aws.amazon.com/api-gateway/sla/
  29. @Sarutule 29

  30. @Sarutule 30 Observability is a must. Learnings

  31. @Sarutule 31 By shifting responsibilities to AWS and relying on serverless services, you lose control over custom resiliency strategies you may want to implement. Learnings
  32. @Sarutule 32 Testing code meant to run in distributed systems and/or microservices is hard, and failure scenarios caused by downstream dependencies must be included. Learnings
  33. @Sarutule 33 Default settings are not always optimal. Learnings

  34. @Sarutule 34 In case of HTTP calls, the client must have a retry strategy with backoff in place, and must be able to handle different failure scenarios. That is taken care of for you if you use the AWS SDK. Learnings
  35. @Sarutule 35 No system is 100% reliable, AWS services included. This may have an impact on the SLAs of your own service. Learnings
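The impact on your own SLA is multiplicative: a request path that crosses several dependencies in series can be no more available than the product of their availabilities. For example, three serial dependencies at 99.95% each:

```javascript
// Composite availability of three serial 99.95% dependencies:
// the whole path is only as available as the product of its parts.
const composite = 0.9995 ** 3; // ≈ 0.9985, i.e. roughly 99.85%
```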
  36. @Sarutule 36 Failures can't all be avoided, but you can make sure your system is prepared to handle them. Learnings
  37. @Sarutule 37 Thank you