You shall not fail! (In the face of turbulent conditions)

@theburningmonk @sarutule "I COME BACK TO YOU NOW AT THE TURN OF THE TIDE"

WHAT IS RESILIENCE?

Failures in distributed systems

Failures on load: exhaustion of resources

You Shall Not Fail! In the face of turbulent conditions™
what is RESILIENCE
chaos ENGINEERING
multi-region STRATEGIES
retries & TIMEOUTS
lambda SCALING
decoupled INVOCATION

PRODUCERS: Yan Cui, @theburningmonk; Sara Gerion, @sarutule
SPEAKERS: Yan Cui, @theburningmonk; Sara Gerion, @sarutule
SPEAKING AT: AWS Community Summit Online
SPECIAL THANKS: Phil Horn, Joe Park

Yan Cui http://theburningmonk.com @theburningmonk
AWS user for 10 years
Developer Advocate @
Independent Consultant: advise, training, delivery

SARA GERION
Italian living in Amsterdam, The Netherlands
Passionate about cloud, scalability, resilience
Twitter: @Sarutule
Backend engineer at DAZN @dazneng
Director of Tech at SheSharp @SheSharpNL

Lambda execution environment

Serverless - multiple AZs out of the box
Total resources created: 1 API Gateway, 1 Lambda

Load balancing

Data replication

REST API - Lambda autoscaling
Concurrency limits (initial burst):
3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland)
1000 – Asia Pacific (Tokyo), Europe (Frankfurt)
500 – Other Regions
Later bursts: 500 new containers each minute

REST API - Lambda autoscaling (provisioned capacity)
X number of execution environments pre-initialized (ready to respond to invocations)
Adjustable provisioned capacity based on CloudWatch metrics
Note: standard burst concurrency limits apply when over the provisioned capacity
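
To make the scaling figures concrete, here is a minimal sketch of the arithmetic. It assumes the initial burst is available immediately and that capacity then grows by 500 per minute, as quoted on the slide. The function name and the example targets are hypothetical, and account-level concurrency limits are ignored.

```typescript
// Hypothetical helper (not an AWS API): estimates how many minutes pass before
// Lambda can serve `target` concurrent executions, using the slide's figures.
function minutesToReachConcurrency(
  target: number,
  initialBurst: number,
  perMinute: number = 500,
): number {
  if (target <= initialBurst) return 0; // covered by the initial burst
  return Math.ceil((target - initialBurst) / perMinute);
}

// Region with an initial burst of 3000, e.g. Europe (Ireland):
console.log(minutesToReachConcurrency(5000, 3000)); // 4
// Region with an initial burst of 500 ("Other Regions"):
console.log(minutesToReachConcurrency(5000, 500)); // 9
```

The same spike takes more than twice as long to absorb in a low-burst region, which is one reason the later slides look at provisioned capacity and queues.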

REST API - Lambda limitations & throttling

HOW TO SOLVE IT? IT DEPENDS

The importance of retry policies

Scenario: client only needs an acknowledgement

If fast acknowledgement is not possible…

Scenario: predictable spikes
Holidays, weekends, celebrations (Black Friday)
Planned launch of resources (new series available)
Sport events

Scenario: unpredictable spikes
Traffic generated by user actions
Jennifer Aniston’s first post

Possible mitigations for REST APIs
Use 1 Lambda for each endpoint
Optimise performance
Offload computing operations to an async flow (SQS, SNS, …)
Use provisioned capacity (plus autoscaling)
Raise limits with an AWS support ticket

One Lambda function for each endpoint

Offload computing operations to queues
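
The offload-to-queue mitigation can be sketched as follows: the endpoint only enqueues and acknowledges, while a separate worker does the heavy computation later. This is a minimal in-memory sketch, not the speakers' implementation. The `Job` shape, `acceptRequest`, and `drainQueue` are hypothetical names, and the array stands in for a real queue such as SQS.

```typescript
interface Job { id: number; payload: string }
interface Ack { statusCode: number; jobId: number }

const queue: Job[] = []; // stand-in for SQS/SNS (an assumption for this sketch)
let nextId = 1;

// Endpoint side: no heavy work here, just enqueue and acknowledge.
function acceptRequest(payload: string): Ack {
  const job: Job = { id: nextId++, payload };
  queue.push(job);
  return { statusCode: 202, jobId: job.id }; // 202 Accepted: work is pending
}

// Worker side: takes up to `batchSize` jobs off the queue and handles them.
function drainQueue(batchSize: number, handle: (job: Job) => void): number {
  const batch = queue.splice(0, batchSize);
  batch.forEach((job) => handle(job));
  return batch.length;
}

const ack = acceptRequest("resize-image-42");
acceptRequest("resize-image-43");
const processed: string[] = [];
const handled = drainQueue(10, (job) => {
  processed.push(job.payload.toUpperCase()); // placeholder for the expensive step
});
```

Because the endpoint returns as soon as the job is queued, a traffic spike lengthens the queue instead of exhausting the concurrency available to the REST API.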

Reminder: beware of long timeouts
API Gateway integration timeout: default 29s
Lambda timeout: max 15 minutes
SQS visibility timeout: default 30s, min 0s, max 12 hours
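
One way to act on this reminder is to check the timeouts against each other before deploying. The sketch below is an illustration under stated assumptions: `TimeoutConfig` and `timeoutWarnings` are hypothetical names, and the only number used is the 29s integration timeout listed above.

```typescript
interface TimeoutConfig {
  lambdaTimeoutSec: number;
  behindApiGateway: boolean;
  sqsVisibilityTimeoutSec?: number; // set only when the function consumes from SQS
}

const API_GATEWAY_INTEGRATION_TIMEOUT_SEC = 29;

// Returns human-readable warnings for timeout combinations that tend to cause trouble.
function timeoutWarnings(config: TimeoutConfig): string[] {
  const warnings: string[] = [];
  if (
    config.behindApiGateway &&
    config.lambdaTimeoutSec > API_GATEWAY_INTEGRATION_TIMEOUT_SEC
  ) {
    warnings.push("API Gateway gives up at 29s, before the function times out");
  }
  if (
    config.sqsVisibilityTimeoutSec !== undefined &&
    config.sqsVisibilityTimeoutSec < config.lambdaTimeoutSec
  ) {
    warnings.push("SQS message can reappear while the function is still processing it");
  }
  return warnings;
}

console.log(timeoutWarnings({ lambdaTimeoutSec: 60, behindApiGateway: true }));
```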

Single-region architectures

Multi-region: active-passive

Multi-region: active-active

Active-active & data replication

Multi-region architecture - benefits & tradeoffs
Protection against regional failures
Higher complexity
Very hard to test

CHAOS ENGINEERING

MUST KILL SERVERS! RAWR!! RAWR!!

“the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” principlesofchaos.org

“You don't choose the moment, the moment chooses you! You only choose how prepared you are when it does.” Fire Chief Mike Burtch

GOAL: identify weaknesses before they manifest in system-wide, aberrant behaviors

HOW: learn about the system’s behavior by observing it during controlled experiments (game days, failure injection)

MUST KILL SERVERS! RAWR!! RAWR!! ahhhhhhh!!!! HELP!!! OMG!!! F***!!!

phew!

STEP 1. define steady state, i.e. “what does normal look like”

STEP 2. hypothesize that steady state continues in both the control and the experimental group, e.g. “the system stays up if a server dies”

STEP 3. inject realistic failures, e.g. “slow response from 3rd-party service”

STEP 4. try to disprove the hypothesis, i.e. “look for difference between control and experimental group”
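
The four steps can be sketched as a tiny experiment harness. This is a minimal illustration, not a chaos tool: `handler`, `injectFailure`, `successRate`, and `hypothesisHolds` are hypothetical names, the dependency is a plain function, and steady state is reduced to "fraction of requests that complete without throwing".

```typescript
type Dependency = () => string;

// System under test (hypothetical): falls back to a default when the dependency throws.
function handler(dep: Dependency): string {
  try {
    return dep();
  } catch {
    return "fallback";
  }
}

// Step 3: wrap a dependency so that it fails for a given fraction of calls.
function injectFailure(dep: Dependency, failureRate: number): Dependency {
  return () => {
    if (Math.random() < failureRate) throw new Error("injected failure");
    return dep();
  };
}

// Step 1: steady state, measured as the fraction of requests that do not throw.
function successRate(run: () => void, requests: number): number {
  let ok = 0;
  for (let i = 0; i < requests; i++) {
    try {
      run();
      ok++;
    } catch {
      // counted as a failed request
    }
  }
  return ok / requests;
}

// Steps 2 and 4: the hypothesis holds if the experimental group stays within
// `tolerance` of the control group.
function hypothesisHolds(control: number, experimental: number, tolerance: number): boolean {
  return control - experimental <= tolerance;
}

const healthy: Dependency = () => "ok";
const control = successRate(() => handler(healthy), 1000);
const experimental = successRate(() => handler(injectFailure(healthy, 0.3)), 1000);
const unprotected = successRate(() => injectFailure(healthy, 1)(), 100);

console.log(hypothesisHolds(control, experimental, 0.01)); // true: the fallback absorbs the failures
console.log(hypothesisHolds(control, unprotected, 0.01)); // false: no error handling, hypothesis disproved
```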

DON’T START EXPERIMENTS IN PRODUCTION

GOAL: identify weaknesses before they manifest in system-wide, aberrant behaviors

“Corporation X lost millions due to a chaos experiment that went wrong and destroyed key infrastructure, resulting in hours of downtime and unrecoverable data loss.”

Chaos Engineering doesn't cause problems. It reveals them. Nora Jones

CONTAINMENT
run experiments during office hours
let others know what you’re doing, no surprises
avoid important dates
make the smallest change possible
have a rollback plan before you start

DON’T START EXPERIMENTS IN PRODUCTION

by Russ Miles @russmiles
source: https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361

chaos monkey kills an EC2 instance
latency monkey induces artificial delay in APIs
chaos gorilla kills an AWS Availability Zone
chaos kong kills an entire AWS region

SERVERLESS: there are no servers to kill!

improperly tuned timeouts

missing error handling

missing fallbacks

“what if DynamoDB has an elevated error rate?”

hypothesis: the AWS SDK retries would handle it

runs experiment…

TIL: the js DynamoDB client defaults to 10 retries with base delay of 50ms
delay = Math.random() * (Math.pow(2, retryCount) * base)
this is Marc Brooker’s fav formula!
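
The formula above can be worked through to see how much waiting 10 retries can add up to. This is a minimal sketch under one assumption that the slide does not state: `retryCount` starts at 0 for the first retry. The function names are hypothetical, and `rand` is injectable only so the result can be checked deterministically.

```typescript
// Jittered exponential backoff, exactly as written on the slide.
function retryDelayMs(
  retryCount: number,
  baseMs: number,
  rand: () => number = Math.random,
): number {
  return rand() * (Math.pow(2, retryCount) * baseMs);
}

// Upper bound on the total time spent waiting between attempts
// (every Math.random() call returning a value close to 1).
function worstCaseTotalDelayMs(maxRetries: number, baseMs: number): number {
  let total = 0;
  for (let retryCount = 0; retryCount < maxRetries; retryCount++) {
    total += Math.pow(2, retryCount) * baseMs;
  }
  return total;
}

console.log(retryDelayMs(3, 50, () => 0.5)); // 200
console.log(worstCaseTotalDelayMs(10, 50)); // 51150
```

With the defaults, the waits alone can add up to about 51 seconds in the worst case and about half of that on average, before counting the time each request takes. That is far longer than a function timeout of a few seconds.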

result: function times out after 6s (hypothesis is disproved)

action: set max retry count + fallback

outcome: a more resilient system
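
The action above (cap the retry count, add a fallback) might look like the following generic wrapper. It is a sketch, not the AWS SDK's retry mechanism: `withRetriesAndFallback` and its parameters are hypothetical, and the delay reuses the jittered formula from the earlier slide.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Tries `operation` once plus at most `maxRetries` retries, then gives up
// and returns the fallback value instead of letting the caller time out.
async function withRetriesAndFallback<T>(
  operation: () => Promise<T>,
  fallback: () => T,
  maxRetries: number = 2,
  baseMs: number = 50,
): Promise<T> {
  for (let retryCount = 0; retryCount <= maxRetries; retryCount++) {
    try {
      return await operation();
    } catch {
      if (retryCount === maxRetries) break; // retry budget exhausted
      await sleep(Math.random() * Math.pow(2, retryCount) * baseMs);
    }
  }
  return fallback();
}

// Usage: a dependency that always fails is answered from a (hypothetical) cached value.
withRetriesAndFallback<string>(
  async () => { throw new Error("elevated error rate"); },
  () => "cached value",
).then((value) => console.log(value)); // cached value
```

With 2 retries and a 50ms base, the waits add up to at most 150ms (50 + 100), so the fallback is returned well inside a 6s function timeout.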

“what if service X has elevated latency?”

hypothesis: our try-catch would handle it

runs experiment…

result: function times out after 6s (hypothesis is disproved)

TIL: most HTTP client libraries have a default timeout of 60s. API Gateway has an integration timeout of 29s. Most Lambda functions default to a timeout of 3-6s.
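
Given the mismatch between these defaults, one option is to derive each outgoing request's timeout from the time left in the invocation rather than relying on the client library's default. The sketch below assumes a Node.js Lambda handler, where the remaining time is available from `context.getRemainingTimeInMillis()`. The helper name, the 500ms reserve, and the 3s cap are hypothetical choices.

```typescript
// Hypothetical helper: picks a request timeout that leaves `reserveMs` for
// error handling and the fallback, and never exceeds `capMs`.
function requestTimeoutMs(
  remainingMs: number,
  reserveMs: number = 500,
  capMs: number = 3000,
): number {
  return Math.max(0, Math.min(capMs, remainingMs - reserveMs));
}

// In a handler, remainingMs would come from context.getRemainingTimeInMillis().
console.log(requestTimeoutMs(6000)); // 3000
console.log(requestTimeoutMs(1200)); // 700
console.log(requestTimeoutMs(300)); // 0
```

The result would be passed to whatever timeout option the HTTP client exposes, so that a slow dependency produces a catchable error while the function still has time to respond.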

https://bit.ly/2Wvfort

outcome: a more resilient system

recap

Failures in distributed systems

Serverless - multiple AZs out of the box
Total resources created: 1 API Gateway, 1 Lambda

Beware of timeouts
API Gateway integration timeout: default 29s
Lambda timeout: max 15 minutes
SQS visibility timeout: default 30s, min 0s, max 12 hours

Active-active

“You don't choose the moment, the moment chooses you! You only choose how prepared you are when it does.” Fire Chief Mike Burtch
