Slide 1

Slide 1 text

Resiliency and Availability Design Patterns
Adrian Hornsby, Sr. Technical Evangelist, Amazon Web Services
@adhorn
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark

Slide 2

Slide 2 text

Looks familiar?

Slide 3

Slide 3 text

Distributed systems are hard
Amazon, Twitter, Netflix

Slide 4

Slide 4 text

Resiliency: the ability of a system to handle, and eventually recover from, unexpected conditions

Slide 5

Slide 5 text

Partial failure mode

Slide 6

Slide 6 text

How do we build resilient software systems?

Slide 7

Slide 7 text

People, Application, Network & Data, Infrastructure

Slide 8

Slide 8 text

Let’s talk about isolation and containment

Slide 9

Slide 9 text


Slide 10

Slide 10 text

Typical service application
[Diagram labels: Compute, Cell, Storage]

Slide 11

Slide 11 text

Cell-based architecture
[Diagram labels: Regional Service; Cell 0, Cell 1, Cell n, each with Compute and Storage]

Slide 12

Slide 12 text

[Diagram: a regional service across Zone A, Zone B, and Zone C, shown WITHOUT CELLS (zonal services) and WITH CELLS (service cells)]

Slide 13

Slide 13 text

System properties
• Workload isolation
• Failure containment
• Scale-out vs. scale-up
• Testability
• Manageability
[Diagram labels: Service, Cell 0, Cell 1, Cell n, Cell n+1]
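A minimal sketch of how a thin routing layer might pin each customer to one cell, which is what gives the workload isolation and failure containment listed on this slide. The function names and the hash-modulo rule are assumptions made for illustration; the deck does not say how cells are chosen.

```python
import hashlib


def cell_for(customer_id, cell_count):
    """Map a customer to a cell index with a stable hash (hypothetical routing rule)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


def affected_customers(customers, failed_cell, cell_count):
    """Return only the customers whose cell failed; every other cell is untouched."""
    return [c for c in customers if cell_for(c, cell_count) == failed_cell]


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(20)]
    impacted = affected_customers(customers, failed_cell=0, cell_count=4)
    print(f"{len(impacted)} of {len(customers)} customers live in the failed cell")
```

A plain modulo reshuffles customers whenever the number of cells changes, so a real router would more likely keep an explicit customer-to-cell mapping.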

Slide 14

Slide 14 text

Let’s talk about blast radius

Slide 15

Slide 15 text


Slide 16

Slide 16 text

[Diagram: eight nodes, all marked X; customer symbols ♤ ♡ ♢ ♧]

Slide 17

Slide 17 text

Cell-based architecture
[Diagram: eight nodes, two marked X; customer symbols ♤ ♡ ♢ ♧]

Slide 18

Slide 18 text

Shuffle sharding
[Diagram: eight nodes, two marked X; customer symbols ♤ ♡ ♢ ♧ spread across overlapping pairs of nodes]
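A rough sketch of shuffle sharding as pictured here: each customer gets its own small, pseudo-random subset of nodes, so two customers rarely share the whole subset. The function name and the use of a hash as the random seed are assumptions for illustration only.

```python
import hashlib
import random


def shuffle_shard(customer_id, nodes, shard_size):
    """Pick a stable pseudo-random subset of nodes for one customer (hypothetical helper)."""
    seed = int.from_bytes(hashlib.sha256(customer_id.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)  # same customer id always yields the same shard
    return sorted(rng.sample(list(nodes), shard_size))


if __name__ == "__main__":
    nodes = [f"node-{i}" for i in range(8)]
    for customer in ("spades", "hearts", "diamonds", "clubs"):
        print(customer, shuffle_shard(customer, nodes, shard_size=2))
```

If one customer's traffic takes down both of its nodes, only customers assigned exactly the same pair lose all of their capacity.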

Slide 19

Slide 19 text

Shuffle sharding
Nodes = 8, Shard size = 2, Combinations = 28

Overlap    % customers
0          53.6%
1          42.8%
2          3.6%

Slide 20

Slide 20 text

Shuffle sharding
Nodes = 100, Shard size = 5, Combinations = 75 million!

Overlap    % customers
0          77%
1          21%
2          1.8%
3          0.06%
4          0.0006%
5          0.0000013%
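The combination counts and overlap percentages on the two shuffle sharding slides can be reproduced with basic combinatorics. The sketch below assumes shards are drawn uniformly at random, so the overlap between two shards follows a hypergeometric distribution; the function names are illustrative.

```python
from math import comb


def shard_combinations(nodes, shard_size):
    """Number of distinct shards of `shard_size` nodes out of `nodes`."""
    return comb(nodes, shard_size)


def overlap_distribution(nodes, shard_size):
    """Probability that another random shard shares exactly k nodes with yours, for k = 0..shard_size."""
    total = comb(nodes, shard_size)
    return [
        comb(shard_size, k) * comb(nodes - shard_size, shard_size - k) / total
        for k in range(shard_size + 1)
    ]


if __name__ == "__main__":
    for nodes, shard_size in ((8, 2), (100, 5)):
        print(f"nodes={nodes} shard_size={shard_size} "
              f"combinations={shard_combinations(nodes, shard_size)}")
        for overlap, p in enumerate(overlap_distribution(nodes, shard_size)):
            print(f"  overlap {overlap}: {p * 100:.7f}%")
```

For 8 nodes and a shard size of 2, the exact value for an overlap of 1 is 12/28, about 42.86%, which the slide shows as 42.8%.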

Slide 21

Slide 21 text

Shuffle sharding

Slide 22

Slide 22 text

Better to react without reacting

Slide 23

Slide 23 text

Let’s take an application …
[Diagram: Application in a Region spanning Availability zone a, Availability zone b, Availability zone c]

Slide 24

Slide 24 text

Requires 8 instances or containers
[Diagram: Application in a Region spanning Availability zone a, Availability zone b, Availability zone c]

Slide 25

Slide 25 text

Overload Failures

Slide 26

Slide 26 text

[Diagram: Application in a Region spanning Availability zone a, Availability zone b, Availability zone c]

Slide 27

Slide 27 text

Requires 6 instances or containers
[Diagram: Application in a Region spanning Availability zone a, Availability zone b, Availability zone c]
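One reading of "react without reacting" is to provision enough capacity up front that losing a zone needs no scaling action at all. The sketch below works through that arithmetic; the function name and the assumption that exactly one zone is lost are illustrative, and the diagrams themselves are not recoverable from the transcript.

```python
import math


def instances_per_zone(required, zones, zones_lost=1):
    """Instances to run in each zone so `required` remain after `zones_lost` zones fail."""
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    return math.ceil(required / surviving)


if __name__ == "__main__":
    for required in (6, 8):
        per_zone = instances_per_zone(required, zones=3)
        print(f"requires {required}: run {per_zone} per zone, {per_zone * 3} in total")
```

With three zones, an application that requires 6 instances would run 3 per zone (9 in total), and one that requires 8 would run 4 per zone (12 in total).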

Slide 28

Slide 28 text

Let’s talk about timeouts, backoff & retries.

Slide 29

Slide 29 text


Slide 30

Slide 30 text

https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html

Slide 31

Slide 31 text

[Sequence diagram labels: User 1, App, DB Conn Pool]
Timeout client side = 10s
Timeout backend side = default
INSERT, Retry INSERT, Retry INSERT, Retry
ERROR: Failed to get connection from pool

Slide 32

Slide 32 text

Set the timeouts through inheritance
Timeout backend = Timeout client - time elapsed
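A small sketch of the inheritance rule on this slide: the caller's timeout becomes a deadline, and every downstream call is given only what is left of it. The `Deadline` class is a hypothetical helper, not part of any library.

```python
import time


class Deadline:
    """Tracks how much of the client's timeout is left (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def remaining(self):
        # Timeout backend = Timeout client - time elapsed, never below zero.
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        return self.remaining() <= 0.0


if __name__ == "__main__":
    deadline = Deadline(timeout_s=10.0)
    time.sleep(0.1)  # stand-in for work done before calling the backend
    print(f"backend call may take at most {deadline.remaining():.1f}s")
```

Passing `deadline.remaining()` as the backend timeout keeps the database from working on a request the client has already given up on.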

Slide 33

Slide 33 text

Backoff between retries
Timeout client side = 10s
Timeout backend side = 10s - time elapsed
• Wait 2s before retry
• Wait 4s before retry
• Wait 8s before retry
• Wait 16s before retry
[Diagram labels: User 1, DB Conn Pool, INSERT, Backoff]
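The doubling waits on this slide (2s, 4s, 8s, 16s) correspond to exponential backoff. A minimal sketch, assuming a hypothetical `retry_with_backoff` helper and an operation that signals failure by raising an exception:

```python
import time


def retry_with_backoff(operation, max_attempts=5, base_delay_s=2.0, sleep=time.sleep):
    """Call `operation`, doubling the wait after each failure: 2s, 4s, 8s, 16s."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay_s * (2 ** attempt))


if __name__ == "__main__":
    attempts = []

    def flaky_insert():
        attempts.append(1)
        if len(attempts) < 3:
            raise ConnectionError("Failed to get connection from pool")
        return "inserted"

    # Pass a no-op sleep so the demo finishes immediately.
    print(retry_with_backoff(flaky_insert, sleep=lambda s: None))
```

In practice the retry loop would also stop once the client's overall timeout has been used up.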

Slide 34

Slide 34 text

Simple exponential backoff is not enough: add jitter
[Charts: No jitter vs. With jitter]
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

Slide 35

Slide 35 text

Adding Jitter
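A sketch of one jitter strategy, the "full jitter" variant in which each wait is drawn uniformly between zero and the capped exponential delay. The function name, the 2s base, and the 60s cap are assumptions chosen for illustration.

```python
import random


def full_jitter_delay(attempt, base_delay_s=2.0, cap_s=60.0, rng=random):
    """Random wait in [0, min(cap, base * 2**attempt)] so clients do not retry in lockstep."""
    ceiling = min(cap_s, base_delay_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(5):
        print(f"attempt {attempt}: wait {full_jitter_delay(attempt):.2f}s")
```

Without jitter, every client that failed at the same moment retries at the same moment, which recreates the spike that caused the failure.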

Slide 36

Slide 36 text

Circuit Breaker
• Wrap a protected function call in a circuit breaker object, which monitors for failures.
• If failures reach a certain threshold, the circuit breaker trips.
[Diagram labels: Producer, Circuit Breaker, Consumer; Connection, Monitoring, Timeouts, Breaking Circuit]
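The two bullets above can be sketched as a small wrapper object. The class name, the threshold of 3 consecutive failures, and the 30s reset timeout are assumptions; production libraries add more states and metrics than this.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip, or re-trip after a failed trial
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

While the circuit is open, callers fail immediately instead of waiting on timeouts, which gives the struggling dependency room to recover.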

Slide 37

Slide 37 text

Let’s talk about load shedding.

Slide 38

Slide 38 text


Slide 39

Slide 39 text


Slide 40

Slide 40 text

Cheaply reject excess work
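One inexpensive way to reject excess work is to cap the number of requests in flight and refuse anything above the cap before doing any real processing. The `LoadShedder` class and the 503 response tuple are hypothetical and stand in for whatever the serving framework provides.

```python
import threading


class LoadShedder:
    """Caps concurrent requests; anything above the cap is refused immediately."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_handle(self, handler, request):
        # Non-blocking acquire: rejecting costs almost nothing compared with queueing.
        if not self._slots.acquire(blocking=False):
            return 503, "overloaded, retry later"
        try:
            return 200, handler(request)
        finally:
            self._slots.release()


if __name__ == "__main__":
    shedder = LoadShedder(max_in_flight=2)
    print(shedder.try_handle(lambda req: f"handled {req}", "GET /"))
```

The cap would normally be derived from load tests rather than guessed.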

Slide 41

Slide 41 text


Slide 42

Slide 42 text

Degrade & prioritize traffic with queues
[Diagram labels: API Instances, High Priority Queue, Low Priority Queue, Worker Instances]
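A sketch of the two-queue arrangement on this slide: workers always drain the high priority queue first, and low priority work is dropped once its queue is full. The in-memory deques and the limit of 100 are assumptions; the slide does not name a queueing service.

```python
from collections import deque


class PriorityDispatcher:
    def __init__(self, low_priority_limit=100):
        self.high = deque()
        self.low = deque()
        self.low_priority_limit = low_priority_limit

    def submit(self, job, high_priority=False):
        """Enqueue a job; returns False when low priority work is shed."""
        if high_priority:
            self.high.append(job)
            return True
        if len(self.low) >= self.low_priority_limit:
            return False  # degrade: drop low priority work under pressure
        self.low.append(job)
        return True

    def next_job(self):
        """Workers take high priority jobs first, then low priority, else None."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```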

Slide 43

Slide 43 text

Let’s talk about databases.

Slide 44

Slide 44 text

Read / Write separation
Supports degradation through Read-Only mode
[Diagram labels: Instances, DB Instance, DB instance read replicas]
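A sketch of how an application might route reads to replicas and writes to the primary, and fall back to read-only mode when the primary is unavailable. The `DatabaseRouter` class and its health flag are hypothetical; connections are represented by plain strings to keep the example self-contained.

```python
import itertools


class ReadOnlyModeError(Exception):
    """Raised when a write is attempted while the primary is unavailable."""


class DatabaseRouter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin over replicas
        self.primary_healthy = True  # would be set by a health check in practice

    def for_read(self):
        return next(self._replicas)

    def for_write(self):
        if not self.primary_healthy:
            raise ReadOnlyModeError("primary unavailable; serving reads only")
        return self.primary


if __name__ == "__main__":
    router = DatabaseRouter("primary", ["replica-1", "replica-2", "replica-3"])
    router.primary_healthy = False
    print("reads still served from", router.for_read())
```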

Slide 45

Slide 45 text

Service Degradation & Fallbacks

Slide 46

Slide 46 text

https://twitter.com/redditstatus/status/1116204502703493120

Slide 47

Slide 47 text

Transient state does not belong in the database.

Slide 48

Slide 48 text

Let’s talk about chaos engineering.

Slide 49

Slide 49 text

Fire Drills

Slide 50

Slide 50 text

GameDay at Amazon
Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s

Slide 51

Slide 51 text

Chaos engineering
https://github.com/Netflix/SimianArmy

Slide 52

Slide 52 text

Failure injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks
• People attacks
https://www.gremlin.com
https://github.com/Netflix/SimianArmy
https://chaostoolkit.org
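At the application level, "start small" could be as modest as a decorator that adds latency or raises an error on a fraction of calls. This sketch is a hypothetical helper and does not use the API of any of the tools linked above.

```python
import functools
import random
import time


def inject_failure(failure_rate=0.1, latency_s=0.0, rng=random, sleep=time.sleep):
    """Wrap a function so that some calls are delayed or fail on purpose."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulated slow dependency
            if rng.random() < failure_rate:
                raise RuntimeError("injected failure")  # simulated dependency error
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failure(failure_rate=0.0, latency_s=0.05)
def fetch_profile(user_id):
    return {"id": user_id}


if __name__ == "__main__":
    print(fetch_profile(42))
```

Keeping the failure rate low and limiting the decorator to a test environment at first is one way to contain the blast radius of the experiment.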

Slide 53

Slide 53 text

“Chaos engineering is NOT about breaking things randomly without a purpose, chaos engineering is about breaking things in a controlled environment and through well-planned experiments in order to build confidence in your application to withstand turbulent conditions.”

Slide 54

Slide 54 text

And before we go.

Slide 55

Slide 55 text

DON’T blame people for failure…

Slide 56

Slide 56 text

Thank you!
@adhorn
https://medium.com/@adhorn
https://speakerdeck.com/adhorn/patterns-for-building-resilient-software-systems-2019