
You shall not fail! (In the face of turbulent conditions)

Public talk presented at the AWS Community Summit Online 2020 conference.
Speakers:
Sara Gerion, @sarutule
Yan Cui, @theburningmonk

Lambda gives you a lot of scalability and multi-AZ out-of-the-box, but still, things can go wrong in production.
There are region-wide outages, and performance degradation in services your function depends on can cause it to time out or error. And what if you're dealing with downstream systems that just aren't as scalable and can't handle the load you put on them?
The bottom line is that many things can go wrong, and they often do at the worst times. The goal of building resilient systems is not to prevent failures, but to build systems that can withstand them. In this talk, we will look at a number of practices and architectural patterns that can help you build more resilient serverless applications, such as multi-region active-active architectures, DLQs and surge queues, and chaos experiments that identify failure modes before they manifest in production.

Recording available here:
https://www.youtube.com/watch?v=elVeOYYtLM0

Sara Gerion

May 14, 2020

Transcript

  1. You Shall Not Fail! in the face of turbulent conditions™

    what is RESILIENCE, chaos ENGINEERING, multi-region STRATEGIES, retries & TIMEOUTS, lambda SCALING, decoupled INVOCATION
  2. PRODUCERS: Yan Cui (@theburningmonk), Sara Gerion (@sarutule)

    SPEAKERS: Yan Cui (@theburningmonk), Sara Gerion (@sarutule)
    SPEAKING AT: AWS Community Summit Online
    SPECIAL THANKS: Phil Horn, Joe Park
  3. SARA GERION

    Italian living in Amsterdam, The Netherlands
    Passionate about cloud, scalability, resilience
    Twitter: @Sarutule
    Backend engineer at DAZN (@dazneng)
    Director of Tech at SheSharp (@SheSharpNL)
  4. Serverless - multiple AZs out of the box

    Total resources created: 1 API Gateway, 1 Lambda
  5. REST API - Lambda autoscaling

    Concurrency limits:
    3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland)
    1000 – Asia Pacific (Tokyo), Europe (Frankfurt)
    500 – Other Regions
    Later bursts: 500 new containers each minute

  6. REST API - Lambda autoscaling

    X number of execution environments pre-initialized (ready to respond to invocations)
    Note: standard burst concurrency limits when over the provisioned capacity
    Concurrency limits:
    3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland)
    1000 – Asia Pacific (Tokyo), Europe (Frankfurt)
    500 – Other Regions
    Later bursts: 500 new containers each minute

  7. REST API - Lambda autoscaling

    Adjustable provisioned capacity based on CloudWatch metrics
    X number of execution environments pre-initialized (ready to respond to invocations)
    Note: standard burst concurrency limits when over the provisioned capacity
    Concurrency limits:
    3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland)
    1000 – Asia Pacific (Tokyo), Europe (Frankfurt)
    500 – Other Regions
    Later bursts: 500 new containers each minute
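
As a rough illustration of the limits on these slides, the sketch below estimates how many minutes Lambda would need to reach a target concurrency, assuming an initial burst followed by 500 new containers each minute. The function name `minutesToReachConcurrency` and its parameters are hypothetical, the model assumes the account concurrency limit is already above the target, and the real figures may have changed since this talk.

```javascript
// Sketch only: models the scaling rule quoted on the slides
// (an initial burst, then 500 new containers each minute).
// Names and parameters are illustrative, not an AWS API.
const LATER_BURST_PER_MINUTE = 500;

function minutesToReachConcurrency(target, initialBurst) {
  if (target <= initialBurst) {
    return 0; // the initial burst already covers the target
  }
  // Remaining containers are added at 500 per minute, rounded up.
  return Math.ceil((target - initialBurst) / LATER_BURST_PER_MINUTE);
}

// A region with an initial burst of 500: (5000 - 500) / 500 = 9 minutes.
console.log(minutesToReachConcurrency(5000, 500)); // 9
// A region with an initial burst of 3000: (5000 - 3000) / 500 = 4 minutes.
console.log(minutesToReachConcurrency(5000, 3000)); // 4
```

For a predictable spike, a number like this gives a first idea of whether standard scaling is fast enough or whether provisioned capacity is worth considering.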

  8. Scenario: predictable spikes

    Holidays, weekends, celebrations (Black Friday)
    Planned launch of resources (new series available)
    Sport events
  9. Possible mitigations for REST APIs

    Use 1 Lambda for each endpoint
    Raise limits with an AWS support ticket
  10. Possible mitigations for REST APIs

    Use 1 Lambda for each endpoint
    Optimise performance
    Raise limits with an AWS support ticket
  11. Possible mitigations for REST APIs

    Use 1 Lambda for each endpoint
    Optimise performance
    Offload computing operations to an async flow (SQS, SNS, …)
    Raise limits with an AWS support ticket
  12. Possible mitigations for REST APIs

    Use 1 Lambda for each endpoint
    Optimise performance
    Offload computing operations to an async flow (SQS, SNS, …)
    Use provisioned capacity (plus autoscaling)
    Raise limits with an AWS support ticket
  13. Reminder: beware of long timeouts

    API Gateway integration timeout: default 29s
    Lambda timeout: max 15 minutes
    SQS visibility timeout: default 30s, min 0s, max 12 hours
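
The limits on this slide interact, and a small check can make that visible. The sketch below is one possible sanity check over a hypothetical configuration object. The field names `lambdaTimeoutS`, `behindApiGateway` and `sqsVisibilityTimeoutS` are made up for illustration, and the rule that the visibility timeout should not be shorter than the function timeout is an assumption added here rather than something stated on the slide.

```javascript
// Limits quoted on the slide, converted to seconds.
const API_GATEWAY_INTEGRATION_TIMEOUT_S = 29;
const LAMBDA_MAX_TIMEOUT_S = 15 * 60;
const SQS_MAX_VISIBILITY_TIMEOUT_S = 12 * 60 * 60;

// Returns a list of warnings for a hypothetical config object:
// { lambdaTimeoutS, behindApiGateway, sqsVisibilityTimeoutS }
function checkTimeouts(config) {
  const warnings = [];
  const { lambdaTimeoutS, behindApiGateway, sqsVisibilityTimeoutS } = config;

  if (lambdaTimeoutS > LAMBDA_MAX_TIMEOUT_S) {
    warnings.push("Lambda timeout exceeds the 15 minute maximum");
  }
  if (behindApiGateway && lambdaTimeoutS > API_GATEWAY_INTEGRATION_TIMEOUT_S) {
    warnings.push("API Gateway gives up after 29s, before the Lambda timeout");
  }
  if (sqsVisibilityTimeoutS !== undefined) {
    if (sqsVisibilityTimeoutS < 0 ||
        sqsVisibilityTimeoutS > SQS_MAX_VISIBILITY_TIMEOUT_S) {
      warnings.push("SQS visibility timeout is outside the 0s to 12 hours range");
    }
    // Assumption: a message that reappears mid-processing is unwanted.
    if (sqsVisibilityTimeoutS < lambdaTimeoutS) {
      warnings.push("SQS message can reappear while it is still being processed");
    }
  }
  return warnings;
}

// A 60s function behind API Gateway, fed by a queue with the 30s default.
console.log(checkTimeouts({
  lambdaTimeoutS: 60,
  behindApiGateway: true,
  sqsVisibilityTimeoutS: 30,
}));
```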
  14. Multi-region architecture - benefits & tradeoffs

    Protection against regional failures
    Higher complexity
    Very hard to test
  15. “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production”

    principlesofchaos.org
  16. “You don't choose the moment, the moment chooses you! You only choose how prepared you are when it does.”

    Fire Chief Mike Burtch
  17. learn about the system’s behavior by observing it during a controlled experiment

    HOW: game days, failure injection
  18. STEP 2. hypothesis that steady state continues in control and experimental group

    e.g. “the system stays up if a server dies”
  19. STEP 4. try to disprove hypothesis

    i.e. “look for difference between control and experimental group”
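
One way to picture "look for difference between control and experimental group" is to compare a steady-state metric across the two groups. The sketch below uses error rate as that metric. The metric choice, the tolerance value and the data shape are all assumptions made for illustration, not something the slides prescribe.

```javascript
// Sketch: compare a steady-state metric (here, error rate) between the
// control group and the experimental group of a chaos experiment.
// The metric, the tolerance and the data shape are assumptions.
function errorRate(group) {
  return group.requests === 0 ? 0 : group.errors / group.requests;
}

function isHypothesisDisproved(control, experimental, tolerance) {
  const difference = errorRate(experimental) - errorRate(control);
  // "Steady state continues" is disproved when the experimental group
  // is worse than the control group by more than the tolerance.
  return difference > tolerance;
}

const control = { requests: 1000, errors: 5 };       // 0.5% errors
const experimental = { requests: 1000, errors: 80 }; // 8% errors
console.log(isHypothesisDisproved(control, experimental, 0.01)); // true
```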
  20. “Corporation X lost millions due to a chaos experiment that went wrong and destroyed key infrastructure, resulting in hours of downtime and unrecoverable data loss.”
  21. CONTAINMENT

    run experiments during office hours
    let others know what you’re doing, no surprises
    avoid important dates
  22. CONTAINMENT

    run experiments during office hours
    let others know what you’re doing, no surprises
    avoid important dates
    make the smallest change possible
  23. CONTAINMENT

    run experiments during office hours
    let others know what you’re doing, no surprises
    avoid important dates
    make the smallest change possible
    have a rollback plan before you start
  24. chaos monkey kills an EC2 instance

    latency monkey induces artificial delay in APIs
    chaos gorilla kills an AWS Availability Zone
    chaos kong kills an entire AWS region
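
For a Lambda-style handler, the "latency monkey" idea could be approximated by wrapping the handler so that some calls wait before doing any work. This is a minimal sketch, not the actual Netflix tooling. `withLatencyInjection`, `shouldInject` and the config fields `enabled`, `rate` and `delayMs` are hypothetical names.

```javascript
// Sketch of a "latency monkey" for a handler: with some probability,
// wait for an artificial delay before calling the real handler.
// All names and config fields here are hypothetical.
function shouldInject(config, randomValue) {
  return config.enabled && randomValue < config.rate;
}

function withLatencyInjection(handler, config, random = Math.random) {
  return async (...args) => {
    if (shouldInject(config, random())) {
      // Artificial delay, to observe how callers cope with a slow API.
      await new Promise((resolve) => setTimeout(resolve, config.delayMs));
    }
    return handler(...args);
  };
}

// Example: delay every call to this handler by 200ms.
const handler = async (event) => ({ statusCode: 200, body: event.name });
const slowHandler = withLatencyInjection(handler, {
  enabled: true,
  rate: 1,
  delayMs: 200,
});
```

Keeping an `enabled` flag and a `rate` in the config fits the containment advice above: the change can be small, limited to a fraction of calls, and switched off quickly.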
  25. TIL: the js DynamoDB client defaults to 10 retries with base delay of 50ms

    delay = Math.random() * (Math.pow(2, retryCount) * base)
    this is Marc Brooker’s fav formula!
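
The formula on this slide can be written as a small function to see what it implies. The sketch below assumes that `retryCount` starts at 0 for the first retry. The `random` parameter is an addition made here so the function can be exercised deterministically, and `backoffDelayMs` is a hypothetical name rather than the client's own.

```javascript
// Sketch of the formula on the slide: a random delay between 0 and
// base * 2^retryCount. The `random` parameter is an assumption added
// so the function can be tested with a fixed value.
function backoffDelayMs(retryCount, baseMs = 50, random = Math.random) {
  return random() * (Math.pow(2, retryCount) * baseMs);
}

// Worst case: each of the 10 retries waits for its maximum delay.
let totalMaxMs = 0;
for (let retryCount = 0; retryCount < 10; retryCount++) {
  totalMaxMs += Math.pow(2, retryCount) * 50;
}
console.log(totalMaxMs); // 51150, i.e. roughly 51 seconds
```

Under these assumptions the worst case is about 51 seconds of waiting, which is longer than the 29s API Gateway integration timeout mentioned elsewhere in the deck.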
  26. TIL: most HTTP client libraries have default timeout of 60s.

    API Gateway has an integration timeout of 29s.
    Most Lambda functions default to timeout of 3-6s.
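
One possible response to this mismatch is to derive the timeout for a downstream call from the time the invocation has left, rather than relying on a client default. In the sketch below, `context.getRemainingTimeInMillis()` is the method exposed on the Lambda Node.js context object. The helper name `downstreamTimeoutMs`, the 500ms buffer and the 3000ms cap are assumptions chosen for illustration.

```javascript
// Sketch: pick a timeout for a downstream call based on the time the
// invocation has left, instead of a 60s client default.
// The 500ms buffer and the 3000ms cap are illustrative assumptions.
function downstreamTimeoutMs(context, bufferMs = 500, capMs = 3000) {
  const budgetMs = context.getRemainingTimeInMillis() - bufferMs;
  // Never return a negative timeout, and never exceed the cap.
  return Math.max(0, Math.min(budgetMs, capMs));
}

// Example with a stand-in context object that has 6000ms left.
const fakeContext = { getRemainingTimeInMillis: () => 6000 };
console.log(downstreamTimeoutMs(fakeContext)); // 3000
```

The buffer leaves the function some time to handle the failure (log it, return a fallback) before the invocation itself times out.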
  27. Serverless - multiple AZs out of the box

    Total resources created: 1 API Gateway, 1 Lambda
  28. Beware of timeouts

    API Gateway integration timeout: default 29s
    Lambda timeout: max 15 minutes
    SQS visibility timeout: default 30s, min 0s, max 12 hours
  29. “You don't choose the moment, the moment chooses you! You only choose how prepared you are when it does.”

    Fire Chief Mike Burtch