A blast-radius reduction approach to embracing failure at scale.

As presented at re-deploy 2019.
https://re-deploy.io/2019/speakers/#adrian-hornsby

Abstract:
Mistakes. Bad judgment. Errors. Failures. They are all part of our engineering lives. While many think of them as undesirable aspects of engineering, failures are very important, and even beneficial. One thing is certain: failures will happen, and they will come in many forms, some expected and some unexpected. It’s therefore important to embrace failure. The question is how to limit its blast radius. In this talk, I will discuss a range of blast-radius reduction design techniques used at AWS and by our customers, including isolation, bulkheads, cells, and sharding. I will also discuss how embracing failure shapes our operational practices.

Adrian Hornsby

October 16, 2019

Transcript

  1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
    A blast-radius reduction approach to
    embracing failure at scale.
    Adrian Hornsby
    Principal Evangelist
    Amazon Web Services
    @adhorn

  2.
    “You need to know the past to understand
    the present.”
    Carl Sagan

  3.
    Chapter 1: AWS endpoints

  4.

  5.
    And it all started with …
    s3.amazonaws.com/bucket/

  6.
    Then more AWS Regions arrived
    s3.us-east-1.amazonaws.com/bucket/

  7.

  8.
    autoscaling.us-east-1.amazonaws.com
    athena.us-east-1.amazonaws.com
    rds.us-east-1.amazonaws.com
    monitoring.us-east-1.amazonaws.com
    ec2.us-east-1.amazonaws.com
    dynamodb.us-east-1.amazonaws.com
    elasticloadbalancing.us-east-1.amazonaws.com
    SERVICE | REGION | AMAZONAWS | COM
    Today

  9.
    bucket.s3.us-east-1.amazonaws.com
    Today
    RESOURCE | SERVICE | REGION | AMAZONAWS | COM
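The naming scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the helper names `regional_endpoint`, `resource_endpoint`, and `split_endpoint` are hypothetical, the resource name is assumed to contain no dots, and real services have exceptions to the pattern, so an SDK remains the source of truth for endpoint resolution.

```python
def regional_endpoint(service: str, region: str) -> str:
    """Build a SERVICE.REGION.amazonaws.com hostname (illustrative only)."""
    return f"{service}.{region}.amazonaws.com"


def resource_endpoint(resource: str, service: str, region: str) -> str:
    """Build a RESOURCE.SERVICE.REGION.amazonaws.com hostname (illustrative only)."""
    return f"{resource}.{regional_endpoint(service, region)}"


def split_endpoint(hostname: str) -> dict:
    """Split a hostname that follows the slide's pattern into labelled parts.

    Assumes the resource label, if present, contains no dots.
    """
    labels = hostname.split(".")
    if labels[-2:] != ["amazonaws", "com"] or len(labels) not in (4, 5):
        raise ValueError(f"not in [RESOURCE.]SERVICE.REGION.amazonaws.com form: {hostname}")
    return {
        "resource": labels[0] if len(labels) == 5 else None,
        "service": labels[-4],
        "region": labels[-3],
    }
```

Because the Region is part of the name, a client that talks to `dynamodb.us-east-1.amazonaws.com` depends only on that one Region's endpoint.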

  10.
    ALB-761504102.us-west-1.elb.amazonaws.com
    adhorn-war.s3.us-west-1.amazonaws.com
    dl1bss4jo007o.cloudfront.net
    RESOURCE | SERVICE | REGION | AMAZONAWS | COM

  11.
    Chapter 2: Health Dashboard

  12.
    Service health dashboard

  13.
    … gets personal

  14.
    Chapter 3: AWS Regions

  15. https://www.infrastructure.aws

  16.
    autoscaling.us-east-1.amazonaws.com
    athena.us-east-1.amazonaws.com
    rds.us-east-1.amazonaws.com
    monitoring.us-east-1.amazonaws.com
    ec2.us-east-1.amazonaws.com
    dynamodb.us-east-1.amazonaws.com
    elasticloadbalancing.us-east-1.amazonaws.com
    SERVICE | REGION | AMAZONAWS | COM

  17.
    By default: shared-nothing architecture
    [Diagram: Regions us-west-1 and us-east-2]

  18.
    Chapter 4: Availability Zones

  19.
    Fully-scaled Availability Zone

  20.
    Highly redundant regional network

  21.
    Availability Zones
    [Diagram: a Region with Availability Zones a, b, and c, made up of multiple data centers.]

  22.
    Multi-AZ architecture
    [Diagram: a Region with Availability Zones a, b, and c. Elastic Load Balancing (ELB) fronts instances in each zone, alongside a DB instance and a standby DB instance.]

  23.
    Multi-AZ architecture
    [Diagram: the same multi-AZ layout, with a failure marked by an X.]

  24.
    Multi-AZ architecture
    [Diagram: the same multi-AZ layout, with a failure marked by an X.]

  25.
    Multi-AZ architecture
    • Enables fault-tolerant applications
    • AWS regional services designed to withstand AZ failures
    • Leveraged by AWS regional services such as Amazon S3, Amazon DynamoDB, Amazon Aurora, Elastic Load Balancing (ELB), etc.
    [Diagram: the multi-AZ layout with ELB, instances, a DB instance, and a standby DB instance across Availability Zones a, b, and c.]

  26.
    Database distributed architecture
    DynamoDB
    Aurora

  27.
    Availability Zone independence
    [Diagram: a regional control plane above Zones a, b, and c, each with its own zonal control plane and data plane.]

  28.
    Abstracted Architecture
    Aggregation layer
    Routing
    Compartmentalized resources
    Failure isolation
    Entry point

  29.

  30.
    Chapter 5: Cells

  31.
    [Diagram: the multi-AZ layout (ELB, instances, DB instance and standby across Availability Zones a, b, and c), with its compute and storage grouped together as a cell.]

  32.
    Cell-based architecture
    Regional Service
    Cell 1: Compute, Storage
    Cell 2: Compute, Storage
    [ … ]
    Cell n: Compute, Storage

  33.
    [Diagram: regional and zonal services across Zones A, B, and C, shown without cells and with cells. With cells, each service is split into multiple service cells.]

  34.
    System properties
    • Workload isolation
    • Failure containment
    • Scale-out vs. scale-up
    • Testability
    • Manageability
    [Diagram: a service whose cell router sits in front of Cell 0, Cell 1, …, Cell n, and Cell n+1.]
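A cell router can be sketched as a thin layer that does nothing except pick a cell. The sketch below assumes customers are mapped to cells by a stable hash of a customer ID; `cell_for` and `CellRouter` are hypothetical names, and the slides do not say which mapping the real services use.

```python
import hashlib


def cell_for(customer_id: str, num_cells: int) -> int:
    """Map a customer to one cell with a stable hash (hypothetical routing rule)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


class CellRouter:
    """Thin routing layer: it only decides which cell serves a request."""

    def __init__(self, num_cells: int) -> None:
        self.num_cells = num_cells
        self.healthy = [True] * num_cells

    def route(self, customer_id: str) -> int:
        cell = cell_for(customer_id, self.num_cells)
        if not self.healthy[cell]:
            # Failure containment: only customers mapped to this cell are affected.
            raise RuntimeError(f"cell {cell} is unavailable")
        return cell
```

With modulo hashing, adding a cell remaps many customers, so a production router would more likely keep an explicit customer-to-cell mapping; the modulo is used here only to keep the sketch short.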

  35.
    Cell size tradeoffs
    Smaller cells:
    • Reduced blast radius
    • Easier to test
    • Cells easier to operate
    Larger cells:
    • Cost efficiency
    • Reduced splits
    • System easier to operate

  36.

  37.

  38.
    Cell-based architecture

  39.

  40.
    Can we do better?

  41.
    Chapter 6: Shuffle Sharding

  42.
    Shuffle sharding

  43.
    Shuffle sharding
    Nodes = 8
    Shard size = 2
    Combinations = 28
    Overlap    % customers
    0          53.6%
    1          42.8%
    2          3.6%

  44.
    Shuffle sharding
    Nodes = 100
    Shard size = 5
    Combinations = 75 million!
    Overlap    % customers
    0          77%
    1          21%
    2          1.8%
    3          0.06%
    4          0.0006%
    5          0.0000013%
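The numbers on the two shuffle sharding slides can be reproduced with basic combinatorics. The sketch below assumes every shard is an equally likely choice of `shard_size` nodes out of `nodes`, and counts how many of those choices share exactly k nodes with one given shard; `overlap_distribution` is a name made up for this example.

```python
from math import comb


def overlap_distribution(nodes: int, shard_size: int) -> dict:
    """Fraction of possible shards sharing exactly k nodes with one given shard."""
    total = comb(nodes, shard_size)
    return {
        # Choose k nodes from the given shard and the rest from the other nodes.
        k: comb(shard_size, k) * comb(nodes - shard_size, shard_size - k) / total
        for k in range(shard_size + 1)
    }


for nodes, shard_size in [(8, 2), (100, 5)]:
    print(f"nodes={nodes}, shard size={shard_size}, combinations={comb(nodes, shard_size)}")
    for k, fraction in overlap_distribution(nodes, shard_size).items():
        print(f"  overlap {k}: {fraction:.7%}")
```

The last row of each table is the share of customers that land on exactly the same shard, which is the group fully exposed if one customer's shard fails.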

  45.
    Shuffle sharding

  46.
    Customer-based shards
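One way to give each customer its own shard is to derive a stable pseudo-random subset of nodes from the customer's identifier. This is a sketch under that assumption, not a description of how any AWS service assigns shards; `shuffle_shard` and the node names are hypothetical.

```python
import hashlib
import random


def shuffle_shard(customer_id: str, nodes: list, shard_size: int) -> list:
    """Pick a stable pseudo-random subset of nodes for one customer."""
    # Seed a private generator from the customer ID so the choice is repeatable.
    seed = int.from_bytes(hashlib.sha256(customer_id.encode("utf-8")).digest(), "big")
    return sorted(random.Random(seed).sample(nodes, shard_size))


nodes = [f"node-{i}" for i in range(8)]
for customer in ["customer-a", "customer-b", "customer-c"]:
    print(customer, shuffle_shard(customer, nodes, shard_size=2))
```

A customer whose traffic takes down both of its nodes fully affects only the customers assigned the same pair; customers sharing one node still have another node to retry against.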

  47.
    Chapter 7: Backoff, throttles and Little’s law

  48.
    Exponential backoff is not enough
    [Charts: no jitter vs. with jitter]
    https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
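The linked blog post describes several jitter strategies; the sketch below shows the "full jitter" variant, where each retry waits a random time between zero and the capped exponential delay. The `base`, `cap`, and `max_attempts` defaults are arbitrary values chosen for illustration, and `call_with_retries` is a hypothetical helper.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0, rng=random) -> float:
    """Full jitter: a random delay between 0 and the capped exponential backoff."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(operation, max_attempts: int = 5, sleep=time.sleep):
    """Call operation(), retrying on any exception with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))
```

Without the random component, clients that failed together retry together; spreading the delays keeps retries from arriving in synchronized waves.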

  49.
    https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

  50.
    Throttles
    https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html
    https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html

  51. Token bucket
    [Diagram: tokens are generated at a fixed rate (tokens/sec) into a bucket of finite capacity. Each incoming request tries to get a token: if yes, it becomes an outgoing request; if no, it is rejected (429 Too Many Requests).]
    [Charts: burst vs. steady traffic, with 12 Mbps and 3 Mbps rates labeled.]
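The token bucket on this slide can be sketched as a small class. The sketch assumes one token per request and an injectable clock so the behavior is easy to test; `TokenBucket` and `handle` are hypothetical names, and the rate and capacity values are up to the caller.

```python
import time


class TokenBucket:
    """Token bucket: refill at `rate` tokens/sec up to a finite `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full, so an initial burst is allowed
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Add the tokens generated since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def handle(bucket: TokenBucket) -> int:
    """Hypothetical handler: map the bucket decision to an HTTP status code."""
    return 200 if bucket.allow() else 429
```

The capacity bounds how large a burst is accepted, while the rate bounds the steady throughput once the bucket has drained.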

  52.
    Dining with philosophers
    Image source https://en.wikipedia.org/wiki/Dining_philosophers_problem

  53.
    Dining with philosophers
    • Concurrency is a useful measure of capacity in real systems.
    • Concurrency measures consumption of resources like threads, memory, connections, file handles, etc.
    Image source: https://en.wikipedia.org/wiki/Dining_philosophers_problem
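If concurrency is the capacity measure, one simple protection is to cap the number of requests in flight and reject the rest. The sketch below assumes that shedding load is preferable to queueing; `ConcurrencyLimiter` is a hypothetical name and the limit is a value the operator would choose.

```python
import threading


class ConcurrencyLimiter:
    """Reject work once `limit` requests are already in flight."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self) -> bool:
        # Non-blocking: shed load instead of queueing when at capacity.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # Call once for every successful try_acquire(), when the request finishes.
        self._slots.release()
```

A caller would wrap each request in `try_acquire()` and `release()` (in a `finally` block) and return an overload error when `try_acquire()` is false.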

  54.
    Little’s law and concurrency
    L = λ · W
    L: average number of customers
    λ: long-term average effective arrival rate
    W: average time a customer spends in the system

  55.
    Little’s law and concurrency
    L = λ · W
    L: mean number of concurrent requests
    λ: requests per second
    W: average time for each request to complete
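A short worked example of the formula, using made-up numbers: at a fixed arrival rate, concurrency grows in proportion to latency, which is why a slow dependency can exhaust threads or connections even when traffic has not changed. The function names are hypothetical.

```python
def concurrent_requests(arrival_rate_per_s: float, avg_latency_s: float) -> float:
    """Little's law: L = lambda * W."""
    return arrival_rate_per_s * avg_latency_s


def max_throughput(concurrency_limit: float, avg_latency_s: float) -> float:
    """Little's law rearranged: lambda = L / W."""
    return concurrency_limit / avg_latency_s


# Illustrative numbers only: 100 requests/s at 0.25 s each, then at 2 s each.
print(concurrent_requests(100, 0.25))
print(concurrent_requests(100, 2.0))
# With at most 50 requests in flight and 0.25 s per request:
print(max_throughput(50, 0.25))
```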

  56.
    Little’s law and concurrency
    L = λ · W

  57.
    Chapter 8: Conclusion

  58.
    “Anything that can go wrong will go wrong.”
    Murphy’s Law

  59.
    Blast-radius reduction mindset
    • Start with the customers!
    • DNS names are your friends!
    • Bulkheads and isolation.
    • Shuffle sharding.
    • Backoff and throttles.
    • Little’s law and concurrency.
    • Go eat with philosophers.

  60.
    Operational Excellence
    Tools Processes
    Culture
    Technology

  61.
    The master minds behind all this cool technology …

  62.
    Resources
    https://aws.amazon.com/blogs/architecture/shuffle-sharding-massive-and-magical-fault-isolation/
    https://aws.amazon.com/blogs/architecture/a-case-study-in-global-fault-isolation/
    https://twitter.com/colmmacc/status/1034492056968736768
    https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html
    https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html
    https://en.wikipedia.org/wiki/Token_bucket
    https://en.wikipedia.org/wiki/Little%27s_law
    http://brooker.co.za/blog/2018/06/20/littles-law.html
    http://brooker.co.za/blog/2017/12/28/mean.html
    https://en.wikipedia.org/wiki/Dining_philosophers_problem
    https://en.wikipedia.org/wiki/Amdahl%27s_law
    https://medium.com/@adhorn
    https://www.infrastructure.aws
    https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter

  63.
    Thank you!
    @adhorn
    https://medium.com/@adhorn
    https://github.com/adhorn
