Failure modes in AWS serverless services - ServerlessDays Cardiff 2020

Sara Gerion
February 13, 2020


Public talk presented at the ServerlessDays Cardiff 2020 conference.

Abstract:

At DAZN, we often rely on AWS serverless services to build a global platform that needs to be highly scalable and highly available. Serverless computing allows developers to focus on implementing business logic rather than on cluster provisioning, operating-system maintenance and container orchestration. It also comes with great fault tolerance and reliability. How many times have you heard the phrase: 'You don't need to worry about it, AWS takes care of that for you'? While this is certainly true in some scenarios, it's not always the case.
I will go through some examples of not-so-obvious failure modes of serverless services in AWS, and how to mitigate them, from a backend engineer's perspective.


Transcript

  1. @Sarutule 1 FAILURE MODES In AWS serverless services

  2. @Sarutule 2 FAILURE

  3. @Sarutule 3 Simplifying with serverless computing

  4. @Sarutule 4

  5. @Sarutule 5

  6. @Sarutule 6 It seems like the solution to all my problems, right?
  7. @Sarutule 7 Lambda asynchronous invocations

  8. @Sarutule 8 Lambda asynchronous invocations Errors retried twice, consecutively (three attempts in total) + DLQ support
  9. @Sarutule 9 Lambda asynchronous invocations 1. Abnormal number of timeouts on stage and production
  10. @Sarutule 10 Lambda asynchronous invocations 1. Abnormal number of timeouts on stage and production 2. Unable to customize number of retries
  11. @Sarutule 11 Lambda asynchronous invocations 1. Abnormal number of timeouts on stage and production 2. Unable to customize number of retries 3. Hard time limit of 15 minutes
  12. @Sarutule 12 Our solution

  13. @Sarutule 13 Event source mapping

  14. @Sarutule 14 Event source mapping

  15. @Sarutule 15

  16. @Sarutule 16 Event source mapping In case of a failure, the batch is retried with exponential backoff until one of the following: success, records expire (24 h by default), or the max retry limit is reached
  17. @Sarutule 17 Event source mapping In case of a failure, the batch is retried with exponential backoff until one of the following: success, records expire (24 h by default), or the max retry limit is reached. With default settings, a “bad” record can block the processing of the stream for an entire day!
  18. @Sarutule 18 Batch operations to DynamoDB

  19. @Sarutule 19 Batch operations to DynamoDB

  20. @Sarutule 20

  21. @Sarutule 21 !!!

  22. @Sarutule 22 !!!

  23. @Sarutule 23 Batch operations to DynamoDB The operation is done via HTTP. Retry policy in the AWS SDK (Node.js)
  24. @Sarutule 24 5xx errors reported by API Gateway

  25. @Sarutule 25 5xx errors reported by API Gateway 5xx

  26. @Sarutule 26 5xx errors reported by API Gateway 5xx

  27. @Sarutule 27

  28. @Sarutule 28 API Gateway Service Commitment AWS will use commercially reasonable efforts to make API Gateway available with a Monthly Uptime Percentage of at least 99.95% for each AWS region, during any monthly billing cycle (the “Service Commitment”). In the event API Gateway does not meet the Service Commitment, you will be eligible to receive a Service Credit as described below. https://aws.amazon.com/api-gateway/sla/
  29. @Sarutule 29

  30. @Sarutule 30 Observability is a must. Learnings

  31. @Sarutule 31 By shifting responsibilities to AWS and relying on serverless services, you lose control over custom resiliency strategies you may want to implement. Learnings
  32. @Sarutule 32 Testing code meant to run in distributed systems and/or microservices is hard, and failure scenarios caused by downstream dependencies must be included. Learnings
  33. @Sarutule 33 Default settings are not always optimal. Learnings

  34. @Sarutule 34 In case of HTTP calls, the client must have a retry strategy with backoff in place, and must be able to handle different failure scenarios. That is taken care of for you if you use the AWS SDK. Learnings
  35. @Sarutule 35 No system is 100% reliable, AWS services included. This may have an impact on the SLAs of your own service. Learnings
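The impact on your own SLA is multiplicative: a request path that crosses several dependencies in series can be no more available than the product of their availabilities. For example, three serial dependencies at 99.95% each:

```javascript
// Composite availability of three serial 99.95% dependencies:
// the whole path is only as available as the product of its parts.
const composite = 0.9995 ** 3; // ≈ 0.9985, i.e. roughly 99.85%
```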
  36. @Sarutule 36 Failures can't all be avoided, but you can make sure your system is prepared to handle them. Learnings
  37. @Sarutule 37 Thank you