A blast-radius reduction approach to embracing failure at scale.

As presented at re-deploy 2019.
https://re-deploy.io/2019/speakers/#adrian-hornsby

Abstract:
Mistakes. Bad judgment. Errors. Failures. They are all part of our engineering lives. While many think of them as undesirable aspects of engineering, failures are very important, and even beneficial. One thing is sure: failures will happen, and they will come in many forms, some expected and some unexpected. It’s therefore important to embrace failure. The question is how to limit its blast radius. In this talk, I will discuss a range of blast-radius reduction design techniques used at AWS and by our customers, including isolation, bulkheads, cells, and sharding. I will also discuss how embracing failure impacts our operational practices.

Adrian Hornsby

October 16, 2019

Transcript

  1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark. A blast-radius reduction approach to embracing failure at scale. Adrian Hornsby, Principal Evangelist, Amazon Web Services, @adhorn
  2. “You need to know the past to understand the present.” Carl Sagan
  3. Chapter 1: AWS endpoints
  4. And it all started with … s3.amazonaws.com/bucket/
  5. Then more AWS Regions arrived: s3.us-east-1.amazonaws.com/bucket/
  6. [no text on slide]
  7. Today: SERVICE | REGION | AMAZONAWS | COM
    autoscaling.us-east-1.amazonaws.com
    athena.us-east-1.amazonaws.com
    rds.us-east-1.amazonaws.com
    monitoring.us-east-1.amazonaws.com
    ec2.us-east-1.amazonaws.com
    dynamodb.us-east-1.amazonaws.com
    elasticloadbalancing.us-east-1.amazonaws.com
  8. Today: RESOURCE | SERVICE | REGION | AMAZONAWS | COM
    bucket.s3.us-east-1.amazonaws.com
  9. RESOURCE | SERVICE | REGION | AMAZONAWS | COM
    ALB-761504102.us-west-1.elb.amazonaws.com
    adhorn-war.s3.us-west-1.amazonaws.com
    dl1bss4jo007o.cloudfront.net
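
The naming patterns on slides 7 and 8 can be illustrated with a small parser. This is only a sketch of the two patterns as the slides write them (SERVICE | REGION and RESOURCE | SERVICE | REGION); the `Endpoint` type and `parse_endpoint` function are hypothetical names, and real AWS hostnames have more variants than this handles (for example, the ELB name on slide 9 puts the region before the service, and bucket names may themselves contain dots).

```python
from typing import NamedTuple, Optional


class Endpoint(NamedTuple):
    """Parts of a hostname, following the patterns shown on the slides."""
    resource: Optional[str]  # None for service-level endpoints
    service: str
    region: str


def parse_endpoint(hostname: str) -> Endpoint:
    """Split SERVICE.REGION.amazonaws.com or RESOURCE.SERVICE.REGION.amazonaws.com.

    Assumption: only the two patterns from the slides are supported.
    """
    suffix = ".amazonaws.com"
    if not hostname.endswith(suffix):
        raise ValueError(f"not an amazonaws.com hostname: {hostname}")
    labels = hostname[: -len(suffix)].split(".")
    if len(labels) == 2:
        service, region = labels
        return Endpoint(None, service, region)
    if len(labels) == 3:
        resource, service, region = labels
        return Endpoint(resource, service, region)
    raise ValueError(f"unsupported hostname pattern: {hostname}")


if __name__ == "__main__":
    print(parse_endpoint("dynamodb.us-east-1.amazonaws.com"))
    print(parse_endpoint("bucket.s3.us-east-1.amazonaws.com"))
```

Because the resource is part of the DNS name, traffic for one resource can in principle be steered or isolated without touching the names of other resources.
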
  10. Chapter 2: Health Dashboard
  11. Service health dashboard
  12. … gets personal
  13. Chapter 3: AWS Regions
  14. SERVICE | REGION | AMAZONAWS | COM
    autoscaling.us-east-1.amazonaws.com
    athena.us-east-1.amazonaws.com
    rds.us-east-1.amazonaws.com
    monitoring.us-east-1.amazonaws.com
    ec2.us-east-1.amazonaws.com
    dynamodb.us-east-1.amazonaws.com
    elasticloadbalancing.us-east-1.amazonaws.com
  15. By default: shared-nothing architecture [Diagram labels: us-west-1, us-east-2; three X marks]
  16. Chapter 4: Availability Zones
  17. Fully-scaled Availability Zone
  18. Highly redundant regional network
  19. Availability Zones [Diagram: a Region with Availability Zones a, b, and c; nine data centers]
  20. Multi-AZ architecture [Diagram: a Region with Availability Zones a, b, and c; Instances (×3); DB instance and DB instance standby; Elastic Load Balancing (ELB)]
  21. Multi-AZ architecture [Diagram as on slide 20, with one X mark]
  22. Multi-AZ architecture [Diagram as on slide 20, with one X mark]
  23. Multi-AZ architecture
    • Enables fault-tolerant applications
    • AWS regional services designed to withstand AZ failures
    • Leveraged by AWS regional services such as Amazon S3, Amazon DynamoDB, Amazon Aurora, Amazon ELBs, etc.
    [Diagram as on slide 20]
  24. Database distributed architecture: DynamoDB, Aurora
  25. Availability Zone independence [Diagram: a regional control plane; Zones a, b, and c, each with a zone control plane and a zone data plane]
  26. Abstracted architecture [Diagram labels: Aggregation layer, Routing, Compartmentalized resources, Failure isolation, Entry point]
  27. [no text on slide]
  28. Chapter 5: Cells
  29. Cell [Diagram: a Region with Availability Zones a, b, and c; Instances (×3); DB instance and DB instance standby; Elastic Load Balancing (ELB); Compute; Storage]
  30. Cell-based architecture [Diagram: a Regional Service made of Cell 1, Cell 2, [ … ], Cell n, each with Compute and Storage]
  31. [Diagram: a regional service and zonal services across Zone A, Zone B, and Zone C, shown without cells and with cells (service cells)]
  32. System properties
    • Workload isolation
    • Failure containment
    • Scale-out vs. scale-up
    • Testability
    • Manageability
    [Diagram labels: Service, Cell router, Cell 0, Cell 1, Cell n, Cell n+1]
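
The cell router on slide 32 can be sketched as a function that maps each customer to exactly one cell. The slides do not say how the mapping is done; the stable-hash-modulo scheme below, and the name `route_to_cell`, are assumptions made purely for illustration.

```python
import hashlib


def route_to_cell(customer_id: str, num_cells: int) -> int:
    """Return the index of the cell that serves this customer.

    Assumption: a stable hash of the customer ID, modulo the cell count.
    The same customer always lands on the same cell, so a failure in one
    cell only affects the customers mapped to it.
    """
    if num_cells <= 0:
        raise ValueError("num_cells must be positive")
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


if __name__ == "__main__":
    for customer in ("alice", "bob", "carol"):
        print(customer, "-> cell", route_to_cell(customer, num_cells=4))
```

A modulo mapping moves many customers when the cell count changes, so a router of this kind would more plausibly keep an explicit customer-to-cell table; the hash is used here only to keep the sketch short.
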
  33. Cell size tradeoffs
    • Reduced blast radius
    • Easier to test
    • Cells easier to operate
    • Cost efficiency
    • Reduced splits
    • System easier to operate
  34. [no text on slide]
  35. [Diagram: X X X X X X X X; ♤ ♡ ♢ ⚀ ⚁ ⚂ ⚃ ♧ ♢]
  36. Cell-based architecture [Diagram: X X; ♤ ♡ ♢ ⚀ ⚁ ⚂ ⚃ ♧ ♢]
  37. Chapter 6: Shuffle Sharding
  38. Shuffle sharding [Diagram: X X; ♤ ♡ ♢ ⚀ ⚁⚂ ⚃ ♡ ♤ ♧ ♢ ⚀⚂ ♧ ⚁⚃ ♢ ♢ ♡ ♧ ♢]
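
Shuffle sharding gives each customer its own small combination of nodes instead of a fixed shard. A minimal sketch follows, assuming the combination is chosen by a random generator seeded from the customer ID; the function name `shuffle_shard` and the seeding scheme are hypothetical and not taken from the talk.

```python
import hashlib
import random
from typing import List, Sequence


def shuffle_shard(customer_id: str, nodes: Sequence[str], shard_size: int) -> List[str]:
    """Pick `shard_size` distinct nodes for a customer, deterministically.

    Assumption: the choice is driven by a PRNG seeded with a hash of the
    customer ID, so the same customer always gets the same combination.
    """
    if not 0 < shard_size <= len(nodes):
        raise ValueError("shard_size must be between 1 and len(nodes)")
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(list(nodes), shard_size))


if __name__ == "__main__":
    nodes = [f"node-{i}" for i in range(8)]
    for customer in ("alice", "bob", "carol"):
        print(customer, shuffle_shard(customer, nodes, shard_size=2))
```

Two customers are fully affected by the same failure only if they were assigned the same combination of nodes, which becomes rare as the number of combinations grows.
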
  39. Shuffle sharding: Nodes = 8, Shard size = 2, Combinations = 28
    Overlap | % customers
    0 | 53.6%
    1 | 42.8%
    2 | 3.6%
  40. Shuffle sharding: Nodes = 100, Shard size = 5, Combinations = 75 million!
    Overlap | % customers
    0 | 77%
    1 | 21%
    2 | 1.8%
    3 | 0.06%
    4 | 0.0006%
    5 | 0.0000013%
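
The figures on slides 39 and 40 are consistent with a simple counting argument: if shards are chosen uniformly at random, the share of customers whose shard overlaps a given shard in exactly k nodes is C(s, k) · C(n − s, s − k) / C(n, s), for n nodes and shard size s. The sketch below assumes that uniform model; `overlap_distribution` is a hypothetical helper name.

```python
from math import comb
from typing import List


def overlap_distribution(nodes: int, shard_size: int) -> List[float]:
    """Fraction of shards sharing exactly k nodes with a given shard, for k = 0..shard_size.

    Assumption: every combination of `shard_size` nodes is equally likely.
    """
    total = comb(nodes, shard_size)
    return [
        comb(shard_size, k) * comb(nodes - shard_size, shard_size - k) / total
        for k in range(shard_size + 1)
    ]


if __name__ == "__main__":
    for n, s in ((8, 2), (100, 5)):
        print(f"nodes={n} shard_size={s} combinations={comb(n, s)}")
        for k, share in enumerate(overlap_distribution(n, s)):
            print(f"  overlap {k}: {share:.7%}")
```

With 100 nodes and shards of 5, only one combination in roughly 75 million overlaps a given shard completely, which is why a problem confined to one customer's shard fully affects almost no other customer.
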
  41. Customer-based shards
  42. Chapter 7: Backoff, throttles and Little’s law
  43. Exponential backoff is not enough [Charts: No jitter; With jitter] https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
  44. https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
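
The blog post linked on slides 43 and 44 describes adding jitter to exponential backoff so that retrying clients do not all come back at the same moment. Below is a minimal sketch of the "full jitter" variant from that post, where each delay is drawn uniformly between 0 and min(cap, base · 2^attempt). The function names, default values, and the choice to retry on any exception are assumptions made for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Delay in seconds: uniform between 0 and the capped exponential bound."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying failures with full-jitter backoff.

    Assumption: every exception is treated as retryable; a real client
    would retry only on throttling or transient errors.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(full_jitter_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the delay can be replaced in tests; callers would normally leave it at `time.sleep`.
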
  45. Throttles
    https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html
    https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html
  46. Token bucket [Diagram: tokens are generated at a rate (tokens/sec) into a bucket of finite capacity; an incoming request tries to get a token: yes leads to an outgoing request, no leads to rejection (429 Too Many Requests). Chart labels: Burst, Steady; 12 Mbps, 3 Mbps, 3 Mbps]
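
The token bucket on slide 46 can be sketched in a few lines: tokens refill at a fixed rate up to a finite capacity, each request takes a token, and a request that finds the bucket empty is rejected. The class below is an illustrative sketch, not the implementation behind any AWS service; the `TokenBucket` name, its parameters, and the injectable clock are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: steady refill rate, finite burst capacity."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate            # tokens added per second (steady rate)
        self.capacity = capacity    # maximum tokens held (burst size)
        self.tokens = capacity      # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False to reject the request."""
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A front end using such a bucket would answer with 429 Too Many Requests whenever `allow()` returns False, as in the slide's diagram.
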
  47. Dining with philosophers (image source: https://en.wikipedia.org/wiki/Dining_philosophers_problem)
  48. Dining with philosophers (image source: https://en.wikipedia.org/wiki/Dining_philosophers_problem)
    • Concurrency is a useful measure of capacity in real systems.
    • Concurrency measures consumption of resources like threads, memory, connections, file handles, etc.
  49. Little’s law and concurrency: L = λ · W
    L: average number of customers
    λ: long-term average effective arrival rate
    W: average time a customer spends in the system
  50. Little’s law and concurrency: L = λ · W
    L: mean number of concurrent requests
    λ: requests per second
    W: average time for each request to complete
  51. Little’s law and concurrency: L = λ · W
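
Little's law, L = λ · W, ties concurrency to request rate and latency, which is why a latency increase alone can exhaust a concurrency limit. The sketch below works through the arithmetic; the function names and all numbers are hypothetical.

```python
def concurrency(requests_per_second: float, mean_latency_seconds: float) -> float:
    """L = lambda * W: mean number of requests in flight."""
    return requests_per_second * mean_latency_seconds


def max_throughput(max_concurrency: float, mean_latency_seconds: float) -> float:
    """lambda = L / W: request rate sustainable under a concurrency limit."""
    return max_concurrency / mean_latency_seconds


if __name__ == "__main__":
    # Hypothetical service: 100 requests/s, each taking 0.25 s on average.
    print(concurrency(100, 0.25))
    # Same request rate after a dependency slows each request down to 2 s.
    print(concurrency(100, 2.0))
    # Rate that a hypothetical limit of 50 concurrent requests allows at 2 s each.
    print(max_throughput(50, 2.0))
```

Under these assumed numbers, an eightfold rise in latency means eight times as many requests in flight at the same arrival rate, so a fixed pool of threads or connections fills up unless load is shed or throttled.
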
  52. Chapter 8: Conclusion
  53. “Anything that can go wrong will go wrong.” Murphy’s Law
  54. Blast-radius reduction mindset
    • Start with the customers!
    • DNS names are your friends!
    • Bulkheads and isolation.
    • Shuffle Sharding.
    • Backoff and throttles.
    • Little’s law and concurrency.
    • Go eat with philosophers.
  55. Operational Excellence: Tools, Processes, Culture, Technology
  56. The masterminds behind all this cool technology …
  57. Resources
    https://aws.amazon.com/blogs/architecture/shuffle-sharding-massive-and-magical-fault-isolation/
    https://aws.amazon.com/blogs/architecture/a-case-study-in-global-fault-isolation/
    https://twitter.com/colmmacc/status/1034492056968736768
    https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html
    https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html
    https://en.wikipedia.org/wiki/Token_bucket
    https://en.wikipedia.org/wiki/Little%27s_law
    http://brooker.co.za/blog/2018/06/20/littles-law.html
    http://brooker.co.za/blog/2017/12/28/mean.html
    https://en.wikipedia.org/wiki/Dining_philosophers_problem
    https://en.wikipedia.org/wiki/Amdahl%27s_law
    https://medium.com/@adhorn
    https://www.infrastructure.aws
    https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter
  58. Thank you! @adhorn https://medium.com/@adhorn https://github.com/adhorn