Failure modes in AWS serverless services - ServerlessDays Cardiff 2020

Sara Gerion
February 13, 2020

Public talk presented at the ServerlessDays Cardiff 2020 conference.

Abstract:

At DAZN, we often rely on AWS serverless services to build a global platform that needs to be highly scalable and highly available. Serverless computing allows developers to focus on implementing business logic rather than on cluster provisioning, operating system maintenance and container orchestration. It also comes with great fault tolerance and reliability. How many times have you heard the phrase 'You don't need to worry about it, AWS takes care of that for you'? While this is certainly true in some scenarios, it's not always the case.
I will go through some examples of not-so-obvious failure modes of serverless services in AWS, and how to mitigate them, from a backend engineer's perspective.

Transcript

  1. @Sarutule 10 Lambda asynchronous invocations

     1. Abnormal number of timeouts on stage and production
     2. Unable to customize number of retries
  2. @Sarutule 11 Lambda asynchronous invocations

     1. Abnormal number of timeouts on stage and production
     2. Unable to customize number of retries
     3. Hard time limit of 15 minutes
  3. @Sarutule 16 Event source mapping

     In case of a failure, the batch is retried with exponential backoff until one of the following:
     - success
     - records expire (24 h by default)
     - max retry limit is reached
  4. @Sarutule 17 Event source mapping

     In case of a failure, the batch is retried with exponential backoff until one of the following:
     - success
     - records expire (24 h by default)
     - max retry limit is reached

     With default settings, a “bad” record can block the processing of the stream for an entire day!
  5. @Sarutule 23 Batch operations to DynamoDB

     - The operation is done via HTTP
     - Retry policy in the AWS SDK (Node.js)
  6. @Sarutule 28 API Gateway

     Service Commitment

     AWS will use commercially reasonable efforts to make API Gateway available with a Monthly Uptime Percentage of at least 99.95% for each AWS region, during any monthly billing cycle (the “Service Commitment”). In the event that a API Gateway does not meet the Service Commitment, you will be eligible to receive a Service Credit as described below.

     https://aws.amazon.com/api-gateway/sla/
  7. @Sarutule 31 Learnings

     By shifting responsibilities to AWS and relying on serverless services, you lose control over custom resiliency strategies you may want to implement.
  8. @Sarutule 32 Learnings

     Testing code meant to run in distributed systems and/or microservices is hard, and failure scenarios caused by downstream dependencies must be included.
  9. @Sarutule 34 Learnings

     In case of HTTP calls, the client must have a retry strategy with backoff in place, and must be able to handle different failure scenarios. That is taken care of for you if you use the AWS SDK.
  10. @Sarutule 35 Learnings

      No system is 100% reliable, AWS services included. This may have an impact on the SLAs of your own service.
  11. @Sarutule 36 Learnings

      Failures can't all be avoided, but you can make sure your system is prepared to handle them.
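Slide 17 warns that, with default settings, one bad record can block a stream for a day. The sketch below shows one way the retry behaviour of an event source mapping might be bounded. It assumes the field names of the Lambda `CreateEventSourceMapping` request (`MaximumRetryAttempts`, `MaximumRecordAgeInSeconds`, `BisectBatchOnFunctionError`, `DestinationConfig`). The ARNs, the function name and the chosen limits are hypothetical placeholders, and the result is plain data rather than an SDK call.

```typescript
// Subset of event source mapping settings that influence retries.
// Field names are assumed to mirror the CreateEventSourceMapping request.
interface StreamMappingSettings {
  EventSourceArn: string;
  FunctionName: string;
  StartingPosition: "TRIM_HORIZON" | "LATEST";
  BatchSize: number;
  MaximumRetryAttempts: number;        // give up on a failing batch after this many retries
  MaximumRecordAgeInSeconds: number;   // stop retrying records older than this
  BisectBatchOnFunctionError: boolean; // split a failing batch to isolate the bad record
  DestinationConfig: { OnFailure: { Destination: string } };
}

// Builds settings with explicit retry bounds (illustrative values only).
function boundedRetrySettings(
  streamArn: string,
  functionName: string,
  failureQueueArn: string
): StreamMappingSettings {
  return {
    EventSourceArn: streamArn,
    FunctionName: functionName,
    StartingPosition: "LATEST",
    BatchSize: 100,
    MaximumRetryAttempts: 3,
    MaximumRecordAgeInSeconds: 3600,
    BisectBatchOnFunctionError: true,
    DestinationConfig: { OnFailure: { Destination: failureQueueArn } },
  };
}

// Hypothetical placeholder identifiers, not real resources.
const exampleSettings = boundedRetrySettings(
  "arn:aws:kinesis:eu-west-1:123456789012:stream/example-stream",
  "example-consumer",
  "arn:aws:sqs:eu-west-1:123456789012:example-failed-batches"
);
```

With bounds like these, a batch that keeps failing is eventually handed to the failure destination instead of holding up the shard until the records expire.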
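Slide 23 points at the retry policy that the SDK applies to the HTTP calls behind a DynamoDB batch operation. One reason such a policy deserves attention inside a Lambda function is that waiting between retries consumes the invocation's time budget. The sketch below does that arithmetic for a plain doubling backoff without jitter. The base delays, retry counts and budget are illustrative values, not the SDK's documented defaults.

```typescript
// Worst-case total wait of a doubling backoff:
// base * (2^0 + 2^1 + ... + 2^(maxRetries - 1)).
function worstCaseBackoffMs(baseDelayMs: number, maxRetries: number): number {
  let total = 0;
  for (let retry = 0; retry < maxRetries; retry++) {
    total += baseDelayMs * 2 ** retry;
  }
  return total;
}

// True when the waits alone (ignoring the time spent in each request)
// already exceed the time budget of the caller.
function backoffExceedsBudget(
  baseDelayMs: number,
  maxRetries: number,
  budgetMs: number
): boolean {
  return worstCaseBackoffMs(baseDelayMs, maxRetries) > budgetMs;
}

// Illustrative numbers: 5 retries from 100 ms, and 10 retries from 50 ms.
const shortPolicyMs = worstCaseBackoffMs(100, 5);
const longPolicyMs = worstCaseBackoffMs(50, 10);
const longPolicyTooSlow = backoffExceedsBudget(50, 10, 30000);
```

A check like this can help when choosing a retry count and a function timeout together, so that a struggling downstream call fails fast rather than timing out the whole invocation.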
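Slide 34 asks for a retry strategy with backoff on HTTP clients that do not go through the AWS SDK. A minimal sketch of such a wrapper follows, assuming that HTTP 429 and 5xx responses are transient and that failed calls reject with an object carrying a numeric `status`. All names (`callWithRetry`, `backoffDelayMs`, `isRetryableStatus`, `statusOf`) and the default limits are hypothetical.

```typescript
// Treat throttling and server-side errors as transient (an assumption of this sketch).
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Exponential backoff with jitter: a random delay below min(capMs, baseMs * 2^attempt).
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return random() * ceiling;
}

// Extracts a numeric `status` from an unknown error value, if present.
function statusOf(error: unknown): number | undefined {
  if (typeof error === "object" && error !== null && "status" in error) {
    const status = (error as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries `call` on transient failures, up to `maxAttempts` attempts in total.
async function callWithRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 100,
  capMs = 2000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (error) {
      const status = statusOf(error);
      const retryable = status !== undefined && isRetryableStatus(status);
      const attemptsLeft = attempt + 1 < maxAttempts;
      if (!retryable || !attemptsLeft) throw error; // surface permanent failures
      await sleep(backoffDelayMs(attempt, baseMs, capMs));
    }
  }
}
```

Non-retryable errors such as a 404 are rethrown immediately, so the caller can handle the different failure scenarios separately instead of waiting through pointless retries.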
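Slide 35 notes that the reliability of AWS services affects the SLAs of your own service, and slide 28 quotes a Monthly Uptime Percentage of 99.95% for API Gateway. The sketch below turns such figures into numbers, assuming a 30-day month and independent failures across dependencies. Both are simplifications, and the function names are hypothetical.

```typescript
// Minutes of downtime allowed in a period at a given availability (0..1).
function allowedDowntimeMinutes(availability: number, periodDays: number): number {
  const minutesInPeriod = periodDays * 24 * 60;
  return minutesInPeriod * (1 - availability);
}

// Availability of a request path that needs every dependency to be up,
// assuming the dependencies fail independently of each other.
function serialAvailability(availabilities: number[]): number {
  return availabilities.reduce((product, a) => product * a, 1);
}

// 99.95% over a 30-day month leaves roughly 21.6 minutes of downtime.
const monthlyBudgetMinutes = allowedDowntimeMinutes(0.9995, 30);

// Three dependencies at 99.95% each give a path availability of about 99.85%.
const pathAvailability = serialAvailability([0.9995, 0.9995, 0.9995]);
```

Under these assumptions, a service that chains several managed services cannot promise more than the product of their availabilities unless it adds its own fallbacks.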