Slide 1

Slide 1 text

© 2021, Amazon Web Services, Inc. or its Affiliates.
Resiliency and Availability Design Patterns for the Cloud
Sébastien Stormacq, Principal Developer Advocate, @sebsto

Slide 2

Slide 2 text

Can you guess what will happen?

Slide 3

Slide 3 text

Distributed Systems are hard

Slide 4

Slide 4 text

“Failures are a given and everything will eventually fail over time.”
Werner Vogels, CTO, Amazon.com

Slide 5

Slide 5 text

Resiliency: the ability of a system to handle, and eventually recover from, unexpected conditions

Slide 6

Slide 6 text

Partial failure mode

Slide 7

Slide 7 text

People Application Network & Data Infrastructure

Slide 8

Slide 8 text

Let’s talk about Geo Availability

Slide 9

Slide 9 text


Slide 10

Slide 10 text

Fully-scaled Availability Zone

Slide 11

Slide 11 text

Highly redundant regional network

Slide 12

Slide 12 text

AWS Region and availability zones
Region: Availability zone a, Availability zone b, Availability zone c (each containing data centers)
• 1 or more data centers per AZ
• 2 or more AZs per region (minimum of 3 for new regions)

Slide 13

Slide 13 text

Availability in parallel

Component             Availability         Downtime (per year)
X                     99% (2-nines)        3 days 15 hours
Two X in parallel     99.99% (4-nines)     52 minutes
Three X in parallel   99.9999% (6-nines)   31 seconds
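
The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes the redundant components fail independently and that downtime is measured over a 365-day year; the function names are illustrative and do not come from the talk.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures."""
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def yearly_downtime_seconds(availability: float) -> float:
    """Expected downtime per 365-day year, in seconds."""
    return (1.0 - availability) * 365 * 24 * 3600


for copies in (1, 2, 3):
    a = parallel_availability(0.99, copies)
    print(f"{copies} x 99%: availability={a:.4%}, "
          f"downtime={yearly_downtime_seconds(a):,.0f} s/year")
```

In practice failures are rarely fully independent (shared dependencies, correlated deployments), so these numbers are best read as an upper bound.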

Slide 14

Slide 14 text

Multi-AZ architecture Region Availability zone a Availability zone b Availability zone c Instances Instances Instances DB Instance DB instance standby Elastic Load Balancing (ELB)

Slide 15

Slide 15 text

Multi-AZ architecture X Region Availability zone a Availability zone b Availability zone c Instances Instances Instances DB Instance DB instance standby Elastic Load Balancing (ELB)

Slide 16

Slide 16 text

Multi-AZ architecture Region Availability zone a Availability zone b Availability zone c Instances Instances Instances DB Instance DB instance standby Elastic Load Balancing (ELB) X

Slide 17

Slide 17 text

Multi-AZ architecture Region Availability zone a Availability zone b Availability zone c Instances Instances Instances DB Instance DB instance new master Elastic Load Balancing (ELB) X

Slide 18

Slide 18 text

Multi-AZ architecture
• Enables fault-tolerant applications
• AWS regional services designed to withstand AZ failures
• Leveraged by AWS regional services such as Amazon S3, Amazon DynamoDB, Amazon Aurora, Amazon ELBs, etc.
Region Availability zone a Availability zone b Availability zone c Instances Instances Instances DB Instance DB instance standby Elastic Load Balancing (ELB)

Slide 19

Slide 19 text

Let’s talk about auto scaling

Slide 20

Slide 20 text

Auto-Scaling Fixed Variable

Slide 21

Slide 21 text

Auto-scaling for self-healing
AWS Region Availability zone 1 Availability zone 2 Auto Scaling group Elastic Load Balancing (ELB) X

Slide 22

Slide 22 text

Let’s talk about decoupling and async

Slide 23

Slide 23 text

Decoupling with async pattern
Synchronous: Process A calls Process B and is waiting while Process B is working, then gets the result.
Asynchronous: Process A calls Process B and continues while Process B is working, then gets or fetches the result.

Slide 24

Slide 24 text

API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
{JobID: 0001, Result: bar}
Components: API Instances, Queue/Streaming, Worker Instances, Cache node
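
The job flow on this slide can be sketched in a single process. This is a minimal illustration only: an in-memory `queue.Queue` stands in for the queue/streaming service, a dictionary stands in for the cache node, and the names `submit`, `worker` and `fetch` are hypothetical.

```python
import queue
import threading
import uuid

jobs = queue.Queue()   # stands in for the queue/streaming service
results = {}           # stands in for the cache node


def submit(task):
    """API side: enqueue the job and return a job ID immediately."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, task))
    return job_id


def worker():
    """Worker side: take jobs off the queue and store their results."""
    while True:
        job_id, task = jobs.get()
        results[job_id] = f"done: {task}"
        jobs.task_done()


def fetch(job_id):
    """API side: return the result if it is ready, else None."""
    return results.get(job_id)


threading.Thread(target=worker, daemon=True).start()
job_id = submit("DO foo")   # the caller is free to continue working here
jobs.join()                 # only for the demo: wait until the job is processed
print(fetch(job_id))
```

The caller never blocks on the worker: it holds a job ID and fetches (or is notified of) the result later, so a slow or failed worker does not take the API down with it.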

Slide 25

Slide 25 text

Push Notification User Worker Instance Worker Instance API Instance API Instance Cache node Fetch results API Instance Queue/Streaming

Slide 26

Slide 26 text

Degrade & prioritize traffic with queues
API Instance API Instance API Instance High Priority Queue Low Priority Queue Worker Instance Worker Instance
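
One way to read this slide: workers always drain the high-priority queue first, and low-priority work is the first to be rejected when the backlog grows. The sketch below assumes that policy; the class name and the `low_limit` parameter are hypothetical.

```python
from collections import deque


class PriorityDispatcher:
    """Two queues: high priority is always served first; low priority is bounded."""

    def __init__(self, low_limit):
        self.high = deque()
        self.low = deque()
        self.low_limit = low_limit  # hypothetical bound on the low-priority backlog

    def enqueue(self, request, high_priority=False):
        """Return True if the request was accepted, False if it was dropped."""
        if high_priority:
            self.high.append(request)
            return True
        if len(self.low) >= self.low_limit:
            return False  # degrade: drop low-priority work under pressure
        self.low.append(request)
        return True

    def next_request(self):
        """Workers call this; returns None when both queues are empty."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```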

Slide 27

Slide 27 text

Let’s talk about databases.

Slide 28

Slide 28 text

Read / Write separation
Supports degradation through Read-Only mode
Instance Instance Instance DB Instance DB instance read replica DB instance read replica DB instance read replica

Slide 29

Slide 29 text

Database Federation Users DB Products DB Instance Instance Instance DB Instance DB instance read replica DB Instance DB instance read replica

Slide 30

Slide 30 text

Database Sharding

User     ShardID
002345   A
002346   B
002347   C
002348   B
002349   A

Diagram: three Instances; shards A, B and C, each a DB Instance with a DB instance read replica.
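
The lookup table on this slide maps each user to a shard. A minimal sketch of that routing step follows; the user IDs and shard letters are taken from the slide, while the endpoint host names and the function name are placeholders invented for illustration.

```python
# User -> shard mapping, as shown in the slide's table.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Hypothetical endpoints; real ones depend on your database setup.
SHARD_ENDPOINTS = {
    "A": "shard-a.db.example.internal",
    "B": "shard-b.db.example.internal",
    "C": "shard-c.db.example.internal",
}


def endpoint_for(user_id):
    """Return the database endpoint that holds this user's data."""
    shard_id = SHARD_DIRECTORY[user_id]   # raises KeyError for unknown users
    return SHARD_ENDPOINTS[shard_id]


print(endpoint_for("002348"))
```

With this layout, losing shard B affects only the users mapped to B; users on A and C are untouched.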

Slide 31

Slide 31 text

Let’s talk about timeouts, backoff & retries!

Slide 32

Slide 32 text

Users App DB Conn Pool INSERT INSERT INSERT INSERT What happens if the DB “slows down”? Timeout client side Timeout backend side ? ?

Slide 33

Slide 33 text

User 1 App DB Conn Pool INSERT Timeout client side = 10s Timeout backend side = default = Infinite Retry INSERT Retry INSERT ERROR: Failed to get connection from pool Retry

Slide 34

Slide 34 text

https://docs.microsoft.com/en-us/dotnet/api/system.net.httpwebrequest.timeout

Slide 35

Slide 35 text


Slide 36

Slide 36 text

https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html

Slide 37

Slide 37 text

https://pypi.org/project/timeout-decorator/

    import requests
    import timeout_decorator

    # Abort the call if it takes longer than 5 seconds.
    @timeout_decorator.timeout(5, timeout_exception=StopIteration)
    def timed_get(url):
        return requests.get(url)

Slide 38

Slide 38 text

Set the timeouts!

Slide 39

Slide 39 text

How else could we have prevented the error? User 1 DB Conn Pool INSERT Retry INSERT Retry INSERT Retry ERROR: Failed to get connection from pool

Slide 40

Slide 40 text

Backoff
Timeout client side = 10s, Timeout backend side = 10s
User 1, DB Conn Pool: INSERT, Wait 2s before Retry, Wait 4s before Retry, Wait 8s before Retry, Wait 16s before Retry
Backing off between retries
Releasing connections

Slide 41

Slide 41 text

Simple Exponential Backoff is not enough: Add Jitter
No jitter / With jitter
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

Slide 42

Slide 42 text

Example: add jitter 0-1000ms

    import random
    import time

    import requests

    MAX_TRIES = 12

    # Method of a client class: retries the GET with exponential backoff.
    def get_item(self, url, n=1):
        try:
            res = requests.get(url)
        except requests.RequestException:
            if n > MAX_TRIES:
                return None
            n += 1
            # Exponential backoff plus 0-1000 ms of random jitter.
            time.sleep((2 ** n) + (random.randint(0, 1000) / 1000.0))
            return self.get_item(url, n)
        else:
            return res
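
The example on this slide adds up to one second of jitter on top of an uncapped exponential delay. A common variant, described as "full jitter" in the AWS Architecture Blog post linked on the previous slide, picks the whole delay at random between zero and a capped exponential value. The sketch below assumes that variant; the `base`, `cap` and `max_tries` defaults are illustrative, and the `sleep` parameter exists only so the function can be exercised without real waiting.

```python
import random
import time


def full_jitter_delay(attempt, base=1.0, cap=20.0):
    """Random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def retry(operation, max_tries=5, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_tries` attempts have failed."""
    for attempt in range(max_tries):
        try:
            return operation()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the last error
            sleep(full_jitter_delay(attempt))
```

Capping the delay keeps late retries from waiting for hours, and randomizing the whole interval spreads clients out so they do not retry in lockstep.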

Slide 43

Slide 43 text


Slide 44

Slide 44 text

Idempotent operation: an operation that has no additional effect if it is called more than once with the same input parameters.
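
Idempotency is what makes retries safe. One common way to get it, sketched below, is to have the caller attach a unique request ID and have the service remember which IDs it has already processed. The `AccountService` class and its in-memory store are hypothetical; a real service would persist the processed IDs.

```python
class AccountService:
    """Credits an account at most once per request ID."""

    def __init__(self):
        self.balance = 0
        self._processed = {}  # request_id -> balance returned for that request

    def credit(self, request_id, amount):
        # A retried request carries the same ID: return the stored answer
        # instead of applying the credit a second time.
        if request_id in self._processed:
            return self._processed[request_id]
        self.balance += amount
        self._processed[request_id] = self.balance
        return self.balance


service = AccountService()
service.credit("req-1", 10)
service.credit("req-1", 10)   # retry of the same request: no additional effect
print(service.balance)
```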

Slide 45

Slide 45 text

Circuit Breaker
• Wrap a protected function call in a circuit breaker object, which monitors for failures.
• If failures reach a certain threshold, the circuit breaker trips.
Diagram labels: Producer, Circuit Breaker, Consumer; Connection Monitoring, Timeouts, Breaking Circuit
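
The two bullets above can be sketched as a small wrapper class. This is an illustration under stated assumptions, not the Hystrix or Spring implementation: the threshold counts consecutive failures, the breaker allows a trial call after `reset_timeout` seconds, and the class and parameter names are made up for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # trip (or re-trip) the breaker
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the protected function is not called at all, which gives a struggling dependency room to recover instead of being hit with more traffic.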

Slide 46

Slide 46 text

https://github.com/Netflix/Hystrix

Slide 47

Slide 47 text

https://spring.io/guides/gs/circuit-breaker/

Slide 48

Slide 48 text

Let’s talk about health checking!

Slide 49

Slide 49 text

Probing for health
AWS Region Availability zone 1 Availability zone 2 Auto Scaling group Service A Service A Auto Scaling group Service B Service B database Email Cluster

Slide 50

Slide 50 text

Shallow health check Instance Cache node Email database Cluster Are you healthy? yes

Slide 52

Slide 52 text

Deep health check Instance Cache node Email database Cluster Are you healthy? yes Are you healthy? yes yes yes yes

Slide 53

Slide 53 text

Deep health check Instance Cache node Email database Cluster Are you healthy? no Are you healthy? no yes yes yes
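
The difference between the two kinds of check can be sketched as two functions. The shallow check answers only for the instance itself; the deep check also probes each dependency and reports unhealthy if any probe fails. The function names and the shape of the returned dictionary are assumptions made for this example.

```python
def shallow_health():
    """The process is up and able to answer: nothing else is checked."""
    return {"status": "ok"}


def deep_health(dependencies):
    """Probe every dependency; `dependencies` maps a name to a zero-argument probe."""
    results = {}
    for name, probe in dependencies.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False   # a probe that raises counts as unhealthy
    healthy = all(results.values())
    return {"status": "ok" if healthy else "unhealthy", "dependencies": results}


# Stub probes standing in for real cache, database and email checks.
report = deep_health({
    "cache": lambda: True,
    "database": lambda: False,
    "email": lambda: True,
})
print(report["status"])
```

A deep check gives a fuller picture but also means that one failing shared dependency can mark every instance unhealthy at once, which is why the two are usually used for different purposes.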

Slide 54

Slide 54 text

Prioritize shallow health checks during hard times. Cache and be careful with logging.

Slide 55

Slide 55 text

Let’s talk about load shedding.

Slide 56

Slide 56 text


Slide 57

Slide 57 text


Slide 58

Slide 58 text


Slide 59

Slide 59 text

Load Shedding
In an overload situation your resources are precious; do not let any of them go to waste.
• Don’t be overly optimistic and take on more than you can handle.
• Find an operational metric to reject what you cannot take in.
• Favor cached and static content
• Prioritize ELB health check (shallow) pings
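
A minimal sketch of the list above, assuming the operational metric is the number of requests currently in flight: requests beyond a fixed limit are rejected right away, and health check pings are always answered. The class, the `handle` function and the status codes are illustrative choices, not something prescribed by the talk.

```python
import threading


class LoadShedder:
    """Admit at most `max_in_flight` concurrent requests."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(shedder, request, is_health_check=False):
    """Return (status_code, body) for a request."""
    if is_health_check:
        return 200, "healthy"          # shallow pings are never shed
    if not shedder.try_acquire():
        return 503, "overloaded"       # reject early instead of queueing
    try:
        return 200, f"processed {request}"
    finally:
        shedder.release()
```

Rejecting early is cheap; accepting a request that will time out anyway spends capacity on work nobody will see.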

Slide 60

Slide 60 text

Service Degradation & Fallbacks

Slide 61

Slide 61 text

https://twitter.com/redditstatus/status/1116204502703493120

Slide 62

Slide 62 text

Let’s talk about sharding.

Slide 63

Slide 63 text


Slide 64

Slide 64 text

Assign Customers to Cells
Cell 1, Cell 2, Cell 3, Cell 4, Cell 5, Cell 6

Slide 71

Slide 71 text

Measure for this: blast radius

Slide 72

Slide 72 text

Blast radius
• How many customers?
• What functionality?
• How many locations?

Slide 73

Slide 73 text

Shuffle sharding

Slide 74

Slide 74 text

Assign Customers to Random Cells
Cell 1, Cell 2, Cell 3, Cell 4, Cell 5, Cell 6
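
One way to assign each customer to a random but stable set of cells is to seed a random generator from a hash of the customer ID. The sketch below assumes that approach with a shard size of 2 over the six cells on the slide; the function name and the customer IDs are hypothetical.

```python
import hashlib
import random

CELLS = ["Cell 1", "Cell 2", "Cell 3", "Cell 4", "Cell 5", "Cell 6"]


def shuffle_shard(customer_id, cells=CELLS, shard_size=2):
    """Pick `shard_size` cells for a customer, the same ones on every call."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    # A private generator keeps the choice deterministic per customer.
    return sorted(random.Random(seed).sample(cells, shard_size))


for customer in ("customer-1", "customer-2", "customer-3"):
    print(customer, shuffle_shard(customer))
```

Because each customer gets its own combination of cells, a customer that takes down its cells fully overlaps only with the few customers that share exactly the same combination.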

Slide 79

Slide 79 text

Shuffle sharding
Nodes = 8, Shard size = 2, Combinations = 28

Overlap   % customers
0         53.6%
1         42.8%
2         3.6%

Slide 80

Slide 80 text

Shuffle sharding
Nodes = 100, Shard size = 5, Combinations = 75 million!

Overlap   % customers
0         77%
1         21%
2         1.8%
3         0.06%
4         0.0006%
5         0.0000013%
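
The combination counts and overlap percentages on this slide and the previous one follow from counting subsets. The sketch below assumes every shard is a uniformly random subset of the nodes, so the overlap between two shards follows a hypergeometric distribution; the function name is illustrative.

```python
from math import comb


def overlap_distribution(nodes, shard_size):
    """Probability that another random shard shares exactly k nodes with yours."""
    total = comb(nodes, shard_size)
    return {
        k: comb(shard_size, k) * comb(nodes - shard_size, shard_size - k) / total
        for k in range(shard_size + 1)
    }


for nodes, shard_size in ((8, 2), (100, 5)):
    print(f"nodes={nodes}, shard size={shard_size}, "
          f"combinations={comb(nodes, shard_size):,}")
    for k, p in overlap_distribution(nodes, shard_size).items():
        print(f"  overlap {k}: {p:.7%}")
```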

Slide 81

Slide 81 text

Shuffle sharding

Slide 82

Slide 82 text

Let’s talk about chaos!

Slide 83

Slide 83 text

GameDay at Amazon: Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s

Slide 84

Slide 84 text

Chaos engineering https://github.com/Netflix/SimianArmy

Slide 85

Slide 85 text

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org

Slide 86

Slide 86 text

Failure injection
Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks
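
At the application level, the smallest possible experiment is a wrapper that occasionally delays or fails a call. The decorator below is a hypothetical sketch of that idea, not part of any chaos engineering tool; `failure_rate` and `max_latency` are made-up parameters, and `sleep` is injectable so the example can run without waiting.

```python
import functools
import random
import time


def inject_faults(failure_rate=0.0, max_latency=0.0, sleep=time.sleep):
    """Wrap a function so that calls are randomly delayed and/or failed."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if max_latency > 0:
                sleep(random.uniform(0, max_latency))    # injected latency
            if random.random() < failure_rate:
                raise RuntimeError("injected failure")   # injected error
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.1, max_latency=0.05)
def lookup_price(item):
    return {"item": item, "price": 42}
```

Starting with a low failure rate on a single call path keeps the blast radius of the experiment small while confidence is being built.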

Slide 87

Slide 87 text

AWS Fault Injection Simulator
Fully managed chaos engineering service on AWS
Coming soon

Slide 88

Slide 88 text

Demo

Slide 89

Slide 89 text

Learn more.

Slide 90

Slide 90 text

https://aws.amazon.com/wellarchitected

Slide 91

Slide 91 text

https://medium.com/@adhorn

Slide 92

Slide 92 text


Slide 93

Slide 93 text

Plan for the worst, prepare for the unexpected.

Slide 94

Slide 94 text

Thank you!
Sébastien Stormacq
@sebsto /sebsto /sebsto /sebAWS