Resiliency and Availability Design Patterns for the Cloud

We have traditionally built robust architectures by trying to avoid mistakes or failures in production, or by testing parts of the system in isolation. However, modern techniques take a very different approach: embracing failure instead of trying to avoid it. Resilient architectures enhance observability and leverage well-known patterns such as graceful degradation, timeouts and circuit breakers, as well as newer patterns like cell-based architecture and shuffle sharding. In this session, we will review the most useful patterns for building resilient software systems and show the audience how they can benefit from them.

Transcript

  1. © 2021, Amazon Web Services, Inc. or its Affiliates.
    Resiliency and Availability
    Design Patterns for the Cloud
    Sébastien Stormacq
    Principal Developer Advocate
    @sebsto

  2.
    Can you guess what will happen?

  3.
    Distributed Systems are hard

  4.
    “Failures are a given and everything will eventually fail over time.”
    Werner Vogels, CTO, Amazon.com

  5.
    Resiliency: Ability for a system to handle and
    eventually recover from unexpected conditions

  6.
    Partial failure mode

  7.
    People
    Application
    Network & Data
    Infrastructure

  8.
    Let’s talk about Geo Availability

  9.

  10.
    Fully-scaled Availability Zone

  11.
    Highly redundant regional network

  12.
    AWS Region and availability zones
    Region: Availability zone a, Availability zone b, Availability zone c (each containing several data centers)
    1 or more data centers per AZ
    2 or more AZs per region (new regions min 3)

  13.
    Availability in parallel
    Component             Availability          Downtime per year
    X                     99% (2-nines)         3 days 15 hours
    Two X in parallel     99.99% (4-nines)      52 minutes
    Three X in parallel   99.9999% (6-nines)    31 seconds
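
The figures in this table follow from assuming that each copy of X fails independently, so the combined system is down only when every copy is down at the same time. A minimal sketch of that arithmetic, assuming independent failures and a 365-day year (the function names are illustrative and not from the slides):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent components placed in parallel."""
    # The system is unavailable only if every copy is unavailable at once.
    return 1.0 - (1.0 - single) ** copies


def yearly_downtime_seconds(availability: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for copies in (1, 2, 3):
        a = parallel_availability(0.99, copies)
        print(f"{copies} x 99%: {a:.6%} available, "
              f"{yearly_downtime_seconds(a):,.0f} s of downtime per year")
```

Real components rarely fail fully independently (shared power, network or deployment tooling), so these numbers are best read as an upper bound.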

  14.
    Multi-AZ architecture
    Region
    Availability zone a Availability zone b Availability zone c
    Instances Instances Instances
    DB Instance DB instance
    standby
    Elastic Load
    Balancing
    (ELB)

  15.
    Multi-AZ architecture
    X
    Region
    Availability zone a Availability zone b Availability zone c
    Instances Instances Instances
    DB Instance DB instance
    standby
    Elastic Load
    Balancing
    (ELB)

  16.
    Multi-AZ architecture
    Region
    Availability zone a Availability zone b Availability zone c
    Instances Instances Instances
    DB Instance DB instance
    standby
    Elastic Load
    Balancing
    (ELB)
    X

  17.
    Multi-AZ architecture
    Region
    Availability zone a Availability zone b Availability zone c
    Instances Instances Instances
    DB Instance DB instance
    new master
    Elastic Load
    Balancing
    (ELB)
    X

  18.
    Multi-AZ architecture
    • Enables fault-tolerant applications
    • AWS regional services designed to
    withstand AZ failures
    • Leveraged by AWS regional
    services such as Amazon S3,
    Amazon DynamoDB, Amazon
    Aurora, Amazon ELBs, etc.
    Region
    Availability zone a Availability zone b Availability zone c
    Instances Instances Instances
    DB Instance DB instance
    standby
    Elastic Load
    Balancing (ELB)

  19.
    Let’s talk about auto scaling

  20.
    Auto-Scaling
    Fixed
    Variable

  21.
    Availability zone 1
    Auto Scaling group
    AWS Region
    Availability zone 2
    Auto-scaling for self-healing
    Elastic Load
    Balancing
    (ELB)
    X

  22.
    Let’s talk about decoupling and async

  23.
    Process A Process B Process A Process B
    Synchronous Asynchronous
    Waiting
    Working
    Continues
    get or fetch result
    Get result
    Decoupling with async pattern

  24.
    API: {DO foo}
    PUT JOB: {JobID: 0001, Task: DO foo}
    API: {JobID: 0001}
    GET JOB: {JobID: 0001, Task: DO foo}
    {JobID: 0001, Result: bar}
    Cache node
    Worker
    Instance
    Worker
    Instance
    Queue/Streaming
    API
    Instance
    API
    Instance
    API
    Instance
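
The job flow on this slide (the API returns a job ID at once, a worker picks the job up, and the result is fetched later) can be sketched in a single process. This is only an illustration: `queue.Queue` stands in for the queue/streaming service, a dictionary stands in for the cache node, and `submit`, `fetch_result` and `worker` are hypothetical names.

```python
import queue
import threading
import uuid

jobs = queue.Queue()            # stand-in for the queue/streaming service
results = {}                    # stand-in for the cache node
results_lock = threading.Lock()


def submit(task: str) -> str:
    """API side: enqueue the task and return a job ID immediately."""
    job_id = uuid.uuid4().hex
    jobs.put({"JobID": job_id, "Task": task})
    return job_id


def fetch_result(job_id: str):
    """API side: return the result if it is ready, otherwise None."""
    with results_lock:
        return results.get(job_id)


def worker() -> None:
    """Worker side: take jobs off the queue and store results in the cache."""
    while True:
        job = jobs.get()
        if job is None:         # sentinel used to stop the worker
            jobs.task_done()
            break
        outcome = job["Task"].upper()   # placeholder for the real work
        with results_lock:
            results[job["JobID"]] = outcome
        jobs.task_done()


if __name__ == "__main__":
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    job_id = submit("do foo")
    jobs.join()                 # a real client would poll or be notified instead
    print(job_id, fetch_result(job_id))
    jobs.put(None)
    thread.join()
```

Because the API never waits for the worker, a slow or failed worker delays results but does not block new requests.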

  25.
    Push Notification
    User
    Worker
    Instance
    Worker
    Instance
    API
    Instance
    API
    Instance
    Cache node
    Fetch results
    API
    Instance
    Queue/Streaming

  26.
    Degrade & prioritize traffic
    with queues
    Worker
    Instance
    Worker
    Instance
    API
    Instance
    API
    Instance
    API
    Instance
    High Priority Queue
    Low Priority Queue
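
One way to read this slide: workers always drain the high-priority queue first, and low-priority work is the part that can be dropped under pressure. A small in-process sketch under that assumption (the bound of 100 and the function names are hypothetical):

```python
import queue

high_priority = queue.Queue()
low_priority = queue.Queue(maxsize=100)   # bounded, so low-priority work can be shed


def enqueue(job, important: bool) -> bool:
    """Queue a job; return False if a low-priority job had to be dropped."""
    if important:
        high_priority.put(job)
        return True
    try:
        low_priority.put_nowait(job)
        return True
    except queue.Full:
        return False


def next_job():
    """Worker side: serve the high-priority queue before the low-priority one."""
    for q in (high_priority, low_priority):
        try:
            return q.get_nowait()
        except queue.Empty:
            continue
    return None
```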

  27.
    Let’s talk about databases.

  28.
    Read / Write separation
    DB Instance DB instance
    read replica
    DB instance
    read replica
    DB instance
    read replica
    Instance Instance
    Instance
    Supports degradation through Read-Only mode

  29.
    Database Federation
    Users
    DB
    Products
    DB
    Instance Instance
    Instance
    DB Instance
    DB instance
    read replica
    DB Instance
    DB instance
    read replica

  30.
    Database Sharding
    User ShardID
    002345 A
    002346 B
    002347 C
    002348 B
    002349 A
    C
    B
    A
    Instance Instance
    Instance
    DB Instance
    DB instance
    read replica
    DB Instance
    DB instance
    read replica
    DB Instance
    DB instance
    read replica
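
The User-to-ShardID table on this slide is a directory lookup: the application finds the shard for a user, then talks to that shard's database. A minimal sketch using the slide's sample rows; the endpoint names are made up for illustration.

```python
# Directory copied from the slide's sample table: user ID -> shard ID.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Hypothetical database endpoints, one per shard.
SHARD_ENDPOINTS = {
    "A": "db-shard-a.example.internal",
    "B": "db-shard-b.example.internal",
    "C": "db-shard-c.example.internal",
}


def endpoint_for_user(user_id: str) -> str:
    """Return the database endpoint that holds this user's data."""
    shard_id = SHARD_DIRECTORY[user_id]   # raises KeyError for unknown users
    return SHARD_ENDPOINTS[shard_id]
```

With this layout, losing shard B affects only the users mapped to B; users on A and C keep working.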

  31.
    Let’s talk about timeouts, backoff &
    retries!

  32.
    Users
    App
    DB
    Conn
    Pool
    INSERT
    INSERT
    INSERT
    INSERT
    What happens if the DB “slows down”?
    Timeout client side Timeout backend side
    ?
    ?

  33.
    User 1
    App DB
    Conn
    Pool
    INSERT
    Timeout client side = 10s Timeout backend side = default = Infinite
    Retry INSERT
    Retry INSERT
    ERROR: Failed to get connection from pool
    Retry

  34.
    https://docs.microsoft.com/en-us/dotnet/api/system.net.httpwebrequest.timeout

  35.

  36. https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html

  37.
    import requests
    import timeout_decorator

    @timeout_decorator.timeout(5, timeout_exception=StopIteration)
    def timed_get(url):
        return requests.get(url)
    https://pypi.org/project/timeout-decorator/

  38.
    Set the timeouts!
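
As a concrete illustration of setting a timeout explicitly, here is a sketch that uses only the Python standard library. The `fetch` helper and its 5-second default are assumptions for the example, not values from the talk.

```python
import socket
import urllib.error
import urllib.request


def fetch(url: str, timeout_s: float = 5.0):
    """Return the response body, or None if the call fails or times out."""
    try:
        # `timeout` bounds connection setup and each blocking socket read.
        # Without it the global default applies, which is "no timeout"
        # unless the program has changed it.
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.read()
    except (socket.timeout, urllib.error.URLError):
        return None
```

Returning `None` keeps the sketch short; a real caller would usually distinguish a timeout from other failures before deciding whether to retry.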

  39.
    How else could we have prevented the error?
    User 1
    DB
    Conn
    Pool
    INSERT
    Retry INSERT
    Retry INSERT
    Retry
    ERROR: Failed to get connection from pool

  40.
    User 1
    DB
    Conn
    Pool
    INSERT
    Timeout client side = 10s Timeout backend side = 10s
    Wait 2s before Retry
    INSERT
    INSERT
    Wait 4s before Retry
    Wait 8s before Retry
    Wait 16s before Retry
    Backing off between retries
    Releasing connections
    Backoff

  41.
    No jitter With jitter
    https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
    Simple Exponential Backoff is not enough: Add Jitter

  42.
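    Example: add jitter 0-1000ms
    import random
    import time

    import requests

    MAX_TRIES = 12

    # Method of a client class (hence `self`); retries recursively on failure.
    def get_item(self, url, n=1):
        try:
            res = requests.get(url)
        except requests.RequestException:
            if n > MAX_TRIES:
                return None
            n += 1
            # Exponential backoff (2 ** n seconds) plus 0-1000 ms of random jitter.
            time.sleep((2 ** n) + (random.randint(0, 1000) / 1000.0))
            return self.get_item(url, n)
        else:
            return res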
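
A variant discussed in the exponential backoff and jitter blog post linked on the previous slide is "full jitter": sleep for a random time between zero and a capped exponential ceiling. The sketch below is one possible shape for it; the function name, defaults, and the injectable `sleep` and `rng` parameters are assumptions made so the behaviour is easy to test.

```python
import random
import time


def call_with_backoff(operation, max_tries=5, base_s=1.0, cap_s=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or `max_tries` attempts are used."""
    for attempt in range(max_tries):
        try:
            return operation()
        except Exception:
            if attempt == max_tries - 1:
                raise                   # out of attempts: surface the error
            # Exponential ceiling, capped, then a uniformly random share of it.
            ceiling = min(cap_s, base_s * (2 ** attempt))
            sleep(rng() * ceiling)
```

Catching every `Exception` is deliberate only for brevity; in practice retries should be limited to errors known to be transient, and to operations that are safe to repeat.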

  43.

  44.
    Idempotent operation
    No additional effect if it is called more
    than once with the same input parameters.

  45.
    Circuit Breaker
    • Wrap a protected function
    call in a circuit breaker
    object, which monitors for
    failures.
    • If failures reach a certain
    threshold, the circuit
    breaker trips.
    Producer Circuit Breaker Consumer
    Connection
    Monitoring
    Timeouts
    Breaking Circuit
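
The two bullets on this slide can be sketched as a small wrapper class. This is a simplified illustration and not how Hystrix or any particular library implements it; the class name, thresholds and the injectable `clock` are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open, call rejected")
            # Reset period elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # trip, or re-trip after a failed trial
            raise
        self.failures = 0           # success closes the circuit again
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of tying up connections and threads on a dependency that is already struggling.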

  46.
    https://github.com/Netflix/Hystrix

  47.
    https://spring.io/guides/gs/circuit-breaker/

  48.
    Let’s talk about health checking!

  49.
    Auto Scaling group
    Service A
    Availability zone 1
    Auto Scaling group
    AWS Region
    Service A
    Availability zone 2
    Service B
    Service B
    database Email
    Probing for health
    Cluster

  50.
    Shallow health check
    Instance
    Cache
    node
    Email
    database
    Cluster
    Are you healthy?
    yes

  51.
    Shallow health check
    Instance
    Cache
    node
    Email
    database
    Cluster
    Are you healthy?
    yes

  52.
    Deep health check
    Instance
    Cache
    node
    Email
    database
    Cluster
    Are you healthy?
    yes
    Are you healthy?
    yes
    yes
    yes
    yes

  53.
    Deep health check
    Instance
    Cache
    node
    Email
    database
    Cluster
    Are you healthy?
    no
    Are you healthy?
    no
    yes
    yes
    yes

  54.
    Prioritize shallow health checks during
    hard times.
    Cache and be careful with logging.
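
To make the shallow/deep distinction and the caching advice concrete, here is a sketch in which the deep check probes each dependency but caches its verdict for a few seconds, so frequent health pings do not add load to dependencies. The class name, the TTL and the probe callables are hypothetical.

```python
import time


def shallow_health_check() -> bool:
    """Answer from the process itself; touches no dependency."""
    return True


class DeepHealthCheck:
    """Probe dependencies, caching the verdict for `ttl_s` seconds."""

    def __init__(self, probes, ttl_s=10.0, clock=time.monotonic):
        self.probes = probes        # name -> zero-argument callable returning bool
        self.ttl_s = ttl_s
        self.clock = clock
        self._cached = None
        self._checked_at = None

    def check(self) -> bool:
        now = self.clock()
        if self._checked_at is not None and now - self._checked_at < self.ttl_s:
            return self._cached     # reuse the cached verdict, no new probes
        healthy = True
        for probe in self.probes.values():
            try:
                healthy = bool(probe()) and healthy
            except Exception:
                healthy = False     # a probe that raises counts as unhealthy
        self._cached, self._checked_at = healthy, now
        return healthy
```

A deep check that fails because a shared dependency is down will fail on every instance at once, which is one reason to fall back to the shallow check under stress.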

  55.
    Let’s talk about load shedding.

  56.

  57.

  58.

  59.
    Don’t be overly optimistic and take on more than you can.
    Find an operational metric to reject what you cannot take in.
    Favor cached and static content
    Prioritize ELB health check (shallow) pings
    In an overload situation you have precious resources, do not
    let any of it go to waste.
    Load Shedding
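
One simple operational metric for load shedding is the number of requests currently in flight. The sketch below rejects new work above a limit while still answering health pings; the limit, the `/health` path and the status codes are illustrative assumptions.

```python
import threading


class LoadShedder:
    """Reject work once `max_in_flight` requests are already being served."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1


def handle(shedder: LoadShedder, request_path: str, work):
    """Return (status, body); shallow health pings bypass the limit."""
    if request_path == "/health":   # hypothetical health-check path
        return 200, "ok"
    if not shedder.try_acquire():
        return 503, "overloaded, retry later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

Rejecting early is cheap; accepting a request that will time out anyway wastes exactly the resources the slide warns about.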

  60.
    Service Degradation & Fallbacks

  61.
    https://twitter.com/redditstatus/status/1116204502703493120

  62.
    Let’s talk about sharding.

  63.

  64.
    Assign Customers to Cells
    Cell 1 Cell 2 Cell 3 Cell 5 Cell 6
    Cell 4

  65.
    Assign Customers to Cells
    Cell 1 Cell 2 Cell 3 Cell 5 Cell 6
    Cell 4

  66.
    Assign Customers to Cells
    Cell 1 Cell 2 Cell 3 Cell 5 Cell 6
    Cell 4

  67.
    Assign Customers to Cells
    Cell 1 Cell 2 Cell 3 Cell 5 Cell 6
    Cell 4

  68.
    Assign Customers to Cells
    Cell 1 Cell 2 Cell 3 Cell 5 Cell 6
    Cell 4

  69.
    Assign Customers to Cells
    Cell 1 Cell 2 Cell 3 Cell 5 Cell 6
    Cell 4

  70.
    Assign Customers to Cells
    Cell 1 Cell 2 Cell 3 Cell 5 Cell 6
    Cell 4

  71.
    Measure for this: blast radius

  72.
    Blast radius
    • How many customers?
    • What functionality?
    • How many locations?

  73.
    Shuffle sharding

  74.
    Cell 1 Cell 2 Cell 3 Cell 5 Cell 6
    Cell 4
    Assign Customers to Random Cells

  75.
    Cell 1 Cell 2 Cell 3 Cell 5 Cell 6
    Cell 4
    Assign Customers to Random Cells

  76.
    Cell 1 Cell 2 Cell 3 Cell 5 Cell 6
    Cell 4
    Assign Customers to Random Cells

  77.
    Cell 1 Cell 2 Cell 3 Cell 5 Cell 6
    Cell 4
    Assign Customers to Cells

  78.
    Cell 1 Cell 2 Cell 3 Cell 5 Cell 6
    Cell 4
    Assign Customers to Random Cells

  79.
    Shuffle sharding
    Nodes = 8
    Shard size = 2
    Combinations = 28
    Overlap   % customers
    0         53.6%
    1         42.8%
    2         3.6%

  80.
    Shuffle sharding
    Nodes = 100
    Shard size = 5
    Combinations = 75 million!
    Overlap   % customers
    0         77%
    1         21%
    2         1.8%
    3         0.06%
    4         0.0006%
    5         0.0000013%
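
The combination counts and overlap percentages on these two slides follow from counting: with N nodes and shards of size k there are C(N, k) possible shards, and the chance that another customer's shard shares exactly j nodes with yours is C(k, j) · C(N − k, k − j) / C(N, k). A sketch that reproduces the numbers, assuming shards are chosen uniformly at random (the function names and the seeding scheme are illustrative):

```python
import math
import random


def overlap_distribution(nodes: int, shard_size: int) -> dict:
    """Probability that another customer's shard shares j nodes with yours."""
    total = math.comb(nodes, shard_size)
    return {
        j: math.comb(shard_size, j)
        * math.comb(nodes - shard_size, shard_size - j) / total
        for j in range(shard_size + 1)
    }


def assign_shard(customer_id: str, nodes: int, shard_size: int) -> list:
    """Pick a stable pseudo-random shard by seeding the RNG with the customer ID."""
    rng = random.Random(customer_id)
    return sorted(rng.sample(range(nodes), shard_size))


if __name__ == "__main__":
    for nodes, size in ((8, 2), (100, 5)):
        print(f"nodes={nodes} shard size={size} "
              f"combinations={math.comb(nodes, size):,}")
        for j, p in overlap_distribution(nodes, size).items():
            print(f"  overlap {j}: {p:.7%}")
```

Seeding from the customer ID keeps the assignment stable between calls without storing it; a production system would more likely persist the mapping.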

  81.
    Shuffle sharding

  82.
    Let’s talk about chaos!

  83.
    GameDay at Amazon
    Creating Resiliency Through Destruction
    https://www.youtube.com/watch?v=zoz0ZjfrQ9s

  84.
    Chaos engineering
    https://github.com/Netflix/SimianArmy

  85.
    “Chaos Engineering is the discipline of
    experimenting on a distributed system
    in order to build confidence in the system’s
    capability to withstand turbulent conditions in
    production.”
    http://principlesofchaos.org

  86.
    Failure injection
    Start small & build confidence
    •Application level
    •Host failure
    •Resource attacks (CPU, memory, …)
    •Network attacks (dependencies, latency, …)
    •Region attacks

  87.
    AWS Fault Injection Simulator
    Fully managed chaos engineering service on AWS
    Coming soon

  88.
    Demo

  89.
    Learn more.

  90.
    https://aws.amazon.com/wellarchitected

  91.
    https://medium.com/@adhorn

  92.

  93.
    Plan for the worst, prepare for the
    unexpected.

  94.
    Thank you!
    @sebsto
    /sebsto
    /sebsto
    /sebAWS
    Sébastien Stormacq
