Slide 1

Slide 1 text

A blast-radius reduction approach to embracing failure at scale
Adrian Hornsby, Principal Evangelist, Amazon Web Services
@adhorn
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark

Slide 2

Slide 2 text

“You need to know the past to understand the present.” Carl Sagan

Slide 3

Slide 3 text

Chapter 1: AWS endpoints

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

And it all started with … s3.amazonaws.com/bucket/

Slide 6

Slide 6 text

Then more AWS Regions arrived: s3.us-east-1.amazonaws.com/bucket/

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Today: SERVICE | REGION | AMAZONAWS | COM
autoscaling.us-east-1.amazonaws.com
athena.us-east-1.amazonaws.com
rds.us-east-1.amazonaws.com
monitoring.us-east-1.amazonaws.com
ec2.us-east-1.amazonaws.com
dynamodb.us-east-1.amazonaws.com
elasticloadbalancing.us-east-1.amazonaws.com

Slide 9

Slide 9 text

Today: RESOURCE | SERVICE | REGION | AMAZONAWS | COM
bucket.s3.us-east-1.amazonaws.com
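The naming pattern on these slides can be sketched in a few lines of Python. This is only an illustration of the RESOURCE | SERVICE | REGION | AMAZONAWS | COM layout; the function names `aws_endpoint` and `split_endpoint` are hypothetical, and real endpoints do not all follow this layout (for example, some place the region before the service, or live under a different domain).

```python
from typing import Optional


def aws_endpoint(service: str, region: str, resource: Optional[str] = None) -> str:
    """Compose a hostname as [RESOURCE.]SERVICE.REGION.amazonaws.com."""
    labels = [service, region, "amazonaws", "com"]
    if resource:
        # A resource-level name (such as a bucket) becomes the leftmost label.
        labels.insert(0, resource)
    return ".".join(labels)


def split_endpoint(hostname: str) -> dict:
    """Split a hostname that follows the layout above into its parts."""
    labels = hostname.split(".")
    if len(labels) < 4 or labels[-2:] != ["amazonaws", "com"]:
        raise ValueError(f"not a SERVICE.REGION.amazonaws.com name: {hostname}")
    return {
        "resource": ".".join(labels[:-4]) or None,
        "service": labels[-4],
        "region": labels[-3],
    }


if __name__ == "__main__":
    print(aws_endpoint("dynamodb", "us-east-1"))
    print(split_endpoint("bucket.s3.us-east-1.amazonaws.com"))
```

Because the region and, where present, the resource are part of the DNS name, a failure can be scoped to one name rather than to a single shared endpoint.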

Slide 10

Slide 10 text

RESOURCE | SERVICE | REGION | AMAZONAWS | COM
ALB-761504102.us-west-1.elb.amazonaws.com
adhorn-war.s3.us-west-1.amazonaws.com
dl1bss4jo007o.cloudfront.net

Slide 11

Slide 11 text

Chapter 2: Health Dashboard

Slide 12

Slide 12 text

Service health dashboard

Slide 13

Slide 13 text

… gets personal

Slide 14

Slide 14 text

Chapter 3: AWS Regions

Slide 15

Slide 15 text

https://www.infrastructure.aws

Slide 16

Slide 16 text

SERVICE | REGION | AMAZONAWS | COM
autoscaling.us-east-1.amazonaws.com
athena.us-east-1.amazonaws.com
rds.us-east-1.amazonaws.com
monitoring.us-east-1.amazonaws.com
ec2.us-east-1.amazonaws.com
dynamodb.us-east-1.amazonaws.com
elasticloadbalancing.us-east-1.amazonaws.com

Slide 17

Slide 17 text

By default: shared-nothing architecture
[Figure: Regions us-west-1 and us-east-2, with X marks]

Slide 18

Slide 18 text

Chapter 4: Availability Zones

Slide 19

Slide 19 text

Fully-scaled Availability Zone

Slide 20

Slide 20 text

Highly redundant regional network

Slide 21

Slide 21 text

Availability Zones
[Figure: a Region containing Availability Zones a, b, and c, each made up of multiple data centers]

Slide 22

Slide 22 text

Multi-AZ architecture
[Figure: a Region with Availability Zones a, b, and c. Elastic Load Balancing (ELB) sits in front of instances in each zone, alongside a DB instance and a standby DB instance.]

Slide 23

Slide 23 text

Multi-AZ architecture
[Figure: the same Region with Availability Zones a, b, and c, ELB, instances, DB instance, and standby DB instance; one component is marked X]

Slide 24

Slide 24 text

Multi-AZ architecture
[Figure: the same Region with Availability Zones a, b, and c, ELB, instances, DB instance, and standby DB instance; one component is marked X]

Slide 25

Slide 25 text

Multi-AZ architecture
• Enables fault-tolerant applications
• AWS regional services designed to withstand AZ failures
• Leveraged by AWS regional services such as Amazon S3, Amazon DynamoDB, Amazon Aurora, Amazon ELBs, etc.
[Figure: a Region with Availability Zones a, b, and c, ELB, instances, DB instance, and standby DB instance]

Slide 26

Slide 26 text

Database distributed architecture: DynamoDB, Aurora

Slide 27

Slide 27 text

Availability Zone independence
[Figure: a regional control plane above Zones a, b, and c; each zone has its own control plane and data plane]

Slide 28

Slide 28 text

Abstracted architecture
[Figure labels: Aggregation layer, Routing, Compartmentalized resources, Failure isolation, Entry point]

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

Chapter 5: Cells

Slide 31

Slide 31 text

[Figure: the multi-AZ stack (ELB, instances in Availability Zones a, b, and c, DB instance, and standby DB instance) grouped into compute and storage and labeled as a cell]

Slide 32

Slide 32 text

Cell-based architecture
[Figure: a regional service made up of Cell 1, Cell 2, … Cell n, each with its own compute and storage]

Slide 33

Slide 33 text

[Figure: regional and zonal services across Zones A, B, and C, shown without cells and with cells, where each service is divided into multiple service cells]

Slide 34

Slide 34 text

System properties
• Workload isolation
• Failure containment
• Scale-out vs. scale-up
• Testability
• Manageability
[Figure: a service whose cell router sits in front of Cell 0, Cell 1, … Cell n, Cell n+1]
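The slide does not say how the cell router picks a cell. As a minimal sketch, assuming a hash of a customer identifier decides the cell, with an optional explicit override table, the idea might look like this in Python. The `CellRouter` class, its methods, and the cell names are all hypothetical.

```python
import hashlib


class CellRouter:
    """Thin routing layer: maps a customer to exactly one cell."""

    def __init__(self, cells):
        self._cells = list(cells)
        self._pinned = {}  # explicit customer -> cell overrides

    def pin(self, customer_id: str, cell: str) -> None:
        """Force a customer onto a specific cell (for example, during a migration)."""
        if cell not in self._cells:
            raise ValueError(f"unknown cell: {cell}")
        self._pinned[customer_id] = cell

    def cell_for(self, customer_id: str) -> str:
        """Return the cell that owns this customer's traffic."""
        if customer_id in self._pinned:
            return self._pinned[customer_id]
        # Assumption: a stable hash spreads customers evenly across cells.
        digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._cells)
        return self._cells[index]


if __name__ == "__main__":
    router = CellRouter(["cell-0", "cell-1", "cell-2"])
    print(router.cell_for("customer-42"))
```

Plain modulo hashing remaps many customers whenever a cell is added, so a production router would more likely keep an explicit mapping table or use consistent hashing. The point of the sketch is only that the router stays small and simple while each cell remains an isolated unit of failure.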

Slide 35

Slide 35 text

Cell size tradeoffs
Smaller cells:
• Reduced blast radius
• Easier to test
• Cells easier to operate
Larger cells:
• Cost efficiency
• Reduced splits
• System easier to operate

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

[Figure: diagram with card-suit symbols (♤ ♡ ♢ ♧) and dice symbols (⚀ ⚁ ⚂ ⚃); eight elements marked X]

Slide 38

Slide 38 text

Cell-based architecture
[Figure: diagram with card-suit symbols (♤ ♡ ♢ ♧) and dice symbols (⚀ ⚁ ⚂ ⚃); two elements marked X]

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

Can we do better?

Slide 41

Slide 41 text

Chapter 6: Shuffle Sharding

Slide 42

Slide 42 text

Shuffle sharding
[Figure: card-suit symbols (♤ ♡ ♢ ♧) spread across overlapping pairs of dice symbols (⚀ ⚁ ⚂ ⚃); two elements marked X]
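A minimal Python sketch of the idea, assuming each customer is assigned a small pseudo-random subset of nodes derived from a hash of the customer identifier. The function name `shuffle_shard` and the node and customer names are hypothetical; the deck does not prescribe a selection method.

```python
import hashlib
import random


def shuffle_shard(customer_id: str, nodes: list, shard_size: int) -> list:
    """Pick a stable pseudo-random subset of nodes for one customer."""
    # Seed a private generator from the customer id so the same customer
    # always gets the same shard for a given node list.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return sorted(rng.sample(nodes, shard_size))


if __name__ == "__main__":
    nodes = [f"node-{i}" for i in range(8)]
    for customer in ["spades", "hearts", "diamonds", "clubs"]:
        print(customer, shuffle_shard(customer, nodes, shard_size=2))
```

Two customers may share one node, but they rarely share their whole shard, so a problem tied to one customer's shard leaves most other customers with at least one healthy node.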

Slide 43

Slide 43 text

Shuffle sharding
Nodes = 8, shard size = 2, combinations = 28

Overlap   % customers
0         53.6%
1         42.8%
2         3.6%

Slide 44

Slide 44 text

Shuffle sharding
Nodes = 100, shard size = 5, combinations = 75 million!

Overlap   % customers
0         77%
1         21%
2         1.8%
3         0.06%
4         0.0006%
5         0.0000013%
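The figures in these two tables can be reproduced with a short calculation. The sketch below assumes shards are chosen uniformly at random, so the overlap between one customer's shard and another's follows a hypergeometric distribution; the function name `overlap_distribution` is illustrative.

```python
from math import comb


def overlap_distribution(nodes: int, shard_size: int) -> dict:
    """Probability that a random shard shares k nodes with a given shard."""
    total = comb(nodes, shard_size)  # number of possible shards
    return {
        # Choose k nodes from the given shard and the rest from outside it.
        k: comb(shard_size, k) * comb(nodes - shard_size, shard_size - k) / total
        for k in range(shard_size + 1)
    }


if __name__ == "__main__":
    for n, s in [(8, 2), (100, 5)]:
        print(f"nodes={n} shard size={s} combinations={comb(n, s)}")
        for k, p in overlap_distribution(n, s).items():
            print(f"  overlap {k}: {p * 100:.7f}%")
```

With 8 nodes and shards of 2 there are 28 combinations; with 100 nodes and shards of 5 there are 75,287,520, which is where the "75 million" on the slide comes from.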

Slide 45

Slide 45 text

Shuffle sharding

Slide 46

Slide 46 text

Customer-based shards

Slide 47

Slide 47 text

Chapter 7: Backoff, throttles and Little’s law

Slide 48

Slide 48 text

Exponential backoff is not enough
[Figure: no jitter vs. with jitter]
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
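A minimal Python sketch of capped exponential backoff with "full jitter", the variant in which each delay is drawn uniformly between zero and the capped exponential value. The function names, default values, and the choice to retry on any exception are assumptions made for illustration.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(operation, max_attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call operation(), retrying failures with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Assumption: every exception is retryable. Real code should
            # retry only errors known to be transient.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap))


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(backoff_delay(attempt), 3))
```

Without jitter, clients that failed together retry together and arrive in synchronized waves; randomizing the delay spreads those retries over time.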

Slide 49

Slide 49 text

https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

Slide 50

Slide 50 text

Throttles
https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html

Slide 51

Slide 51 text

Token bucket
[Figure: tokens are generated at a fixed rate (tokens/sec) into a bucket of finite capacity. An incoming request tries to get a token: if yes, it becomes an outgoing request; if no, it is rejected (429 Too Many Requests). Two plots on a 0 to 10 axis show burst and steady traffic, labeled 12 Mbps, 3 Mbps, and 3 Mbps.]
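The token bucket in the figure can be sketched in Python as follows. The class name, the injectable `clock` parameter, and the `handle` helper that maps the decision to an HTTP status code are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` and a steady `rate` of requests per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self._clock = clock
        self._tokens = capacity     # start full so an initial burst is allowed
        self._last = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take tokens if available; return False when the caller should be throttled."""
        now = self._clock()
        # Refill for the elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False


def handle(bucket: TokenBucket) -> int:
    """Return 200 when a token was available, otherwise 429 Too Many Requests."""
    return 200 if bucket.try_acquire() else 429


if __name__ == "__main__":
    bucket = TokenBucket(rate=3, capacity=12)
    print([handle(bucket) for _ in range(15)])
```

The capacity bounds how large a burst can be, while the rate bounds sustained throughput. Rejecting early with a 429 keeps excess load from piling up inside the service.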

Slide 52

Slide 52 text

Dining with philosophers
Image source: https://en.wikipedia.org/wiki/Dining_philosophers_problem

Slide 53

Slide 53 text

Dining with philosophers
• Concurrency is a useful measure of capacity in real systems.
• Concurrency measures consumption of resources like threads, memory, connections, file handles, etc.
Image source: https://en.wikipedia.org/wiki/Dining_philosophers_problem

Slide 54

Slide 54 text

Little’s law and concurrency
L = λ · W
L: average number of customers
λ: long-term average effective arrival rate
W: average time a customer spends in the system

Slide 55

Slide 55 text

Little’s law and concurrency
L = λ · W
L: mean number of concurrent requests
λ: requests per second
W: average time for each request to complete

Slide 56

Slide 56 text

Little’s law and concurrency
L = λ · W
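A short worked example of L = λ · W in Python. The request rates, latencies, and the concurrency limit of 100 are made-up numbers, and the function names are illustrative.

```python
def concurrency(requests_per_second: float, avg_latency_seconds: float) -> float:
    """Little's law: mean concurrent requests L = lambda * W."""
    return requests_per_second * avg_latency_seconds


def max_throughput(max_concurrency: float, avg_latency_seconds: float) -> float:
    """Rearranged as lambda = L / W: the rate a fixed concurrency limit can sustain."""
    return max_concurrency / avg_latency_seconds


if __name__ == "__main__":
    # 1,000 requests/s at 50 ms each keeps about 50 requests in flight.
    print(concurrency(1000, 0.050))
    # If a slow dependency pushes latency to 500 ms, the same traffic
    # needs about 500 concurrent slots (threads, connections, ...).
    print(concurrency(1000, 0.500))
    # With only 100 slots available, sustainable throughput drops to 200 requests/s.
    print(max_throughput(100, 0.500))
```

The arrival rate does not have to change for a system to run out of capacity: a tenfold rise in latency alone multiplies the concurrency it must hold by ten.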

Slide 57

Slide 57 text

Chapter 8: Conclusion

Slide 58

Slide 58 text

“Anything that can go wrong will go wrong.” Murphy’s Law

Slide 59

Slide 59 text

Blast-radius reduction mindset
• Start with the customers!
• DNS names are your friends!
• Bulkheads and isolation.
• Shuffle Sharding.
• Backoff and throttles.
• Little’s law and concurrency.
• Go eat with philosophers.

Slide 60

Slide 60 text

Operational Excellence: Tools, Processes, Culture, Technology

Slide 61

Slide 61 text

The masterminds behind all this cool technology …

Slide 62

Slide 62 text

Resources
https://aws.amazon.com/blogs/architecture/shuffle-sharding-massive-and-magical-fault-isolation/
https://aws.amazon.com/blogs/architecture/a-case-study-in-global-fault-isolation/
https://twitter.com/colmmacc/status/1034492056968736768
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html
https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html
https://en.wikipedia.org/wiki/Token_bucket
https://en.wikipedia.org/wiki/Little%27s_law
http://brooker.co.za/blog/2018/06/20/littles-law.html
http://brooker.co.za/blog/2017/12/28/mean.html
https://en.wikipedia.org/wiki/Dining_philosophers_problem
https://en.wikipedia.org/wiki/Amdahl%27s_law
https://medium.com/@adhorn
https://www.infrastructure.aws
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter

Slide 63

Slide 63 text

Thank you!
@adhorn
https://medium.com/@adhorn
https://github.com/adhorn