Resiliency and Availability Design Patterns

As presented at https://devone.at/speakers/#adrianhornsby

---
We have traditionally built robust software systems by trying to avoid mistakes, by dodging failures when they occur in production, or by testing parts of the system in isolation from one another. Modern methods and techniques take a very different approach based on resiliency, which promotes embracing failure instead of trying to avoid it. Resilient architectures enhance observability and leverage well-known patterns such as graceful degradation, timeouts, and circuit breakers. In this session, we will review the most useful patterns for building resilient software systems and show the audience how they can benefit from them.

Adrian Hornsby

April 11, 2019

Transcript

  1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
    Resiliency and Availability Design Patterns
    Adrian Hornsby
    Sr. Technical Evangelist
    Amazon Web Services
    @adhorn

  2. Looks familiar?

  3. Distributed Systems are hard
    Amazon Twitter Netflix

  4. Resiliency: the ability of a system to handle and eventually recover from unexpected conditions

  5. Partial failure mode

  6. How do we build resilient software systems?

  7. People
    Application
    Network & Data
    Infrastructure

  8. Let’s talk about isolation and containment

  10. Typical service application
    Compute
    Cell
    Storage

  11. Cell-based architecture
    Regional Service
    Cell 0: Compute, Storage
    Cell 1: Compute, Storage
    Cell n: Compute, Storage

  12. [Diagram: a regional service and zonal services across Zone A, Zone B, and Zone C, shown without cells and with cells (several service cells each)]

  13. System properties
    Service: Cell 0, Cell 1, Cell n, Cell n+1
    • Workload isolation
    • Failure containment
    • Scale-out vs. scale-up
    • Testability
    • Manageability
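
To make workload isolation and failure containment concrete, here is a minimal sketch of how requests might be routed to cells. The hashing rule and the names `cell_for` and `customers_in_cell` are illustrative assumptions, not something shown in the deck.

```python
import hashlib


def cell_for(customer_id: str, cell_count: int) -> int:
    """Map a customer to exactly one cell using a stable hash (assumed routing rule)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


def customers_in_cell(customers, cell_count: int, failed_cell: int):
    """Return only the customers pinned to the failed cell; all others are unaffected."""
    return [c for c in customers if cell_for(c, cell_count) == failed_cell]


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(1000)]
    affected = customers_in_cell(customers, cell_count=10, failed_cell=3)
    print(f"{len(affected)} of {len(customers)} customers share the failed cell")
```

A plain modulo reassigns many customers whenever the number of cells changes, so a real router would more likely keep an explicit customer-to-cell mapping.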

  14. Let’s talk about blast radius

  16. [Diagram: eight X marks and the symbols ♢ ⚀ ⚁ ⚂ ⚃]

  17. Cell-based architecture
    [Diagram: two X marks and the symbols ♢ ⚀ ⚁ ⚂ ⚃]

  18. Shuffle sharding
    [Diagram: two X marks; the symbols ♢ ♡ ♤ ♧ and ⚀ ⚁ ⚂ ⚃ arranged in pairs]

  19. Shuffle sharding
    Nodes = 8
    Shard size = 2
    Combinations = 28
    Overlap    % customers
    0          53.6%
    1          42.8%
    2          3.6%

  20. Shuffle sharding
    Nodes = 100
    Shard size = 5
    Combinations = 75 million!
    Overlap    % customers
    0          77%
    1          21%
    2          1.8%
    3          0.06%
    4          0.0006%
    5          0.0000013%
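
The combination counts and overlap percentages on the two shuffle-sharding slides can be checked with a short calculation. This sketch assumes each customer's shard is chosen uniformly at random among all possible shards, which makes the overlap with any one other customer hypergeometric. The function names are illustrative.

```python
from math import comb


def shard_combinations(nodes: int, shard_size: int) -> int:
    """Number of distinct shards: choose shard_size nodes out of nodes."""
    return comb(nodes, shard_size)


def overlap_distribution(nodes: int, shard_size: int) -> list[float]:
    """P(another customer's shard shares exactly k nodes with yours), for k = 0..shard_size."""
    total = comb(nodes, shard_size)
    return [
        comb(shard_size, k) * comb(nodes - shard_size, shard_size - k) / total
        for k in range(shard_size + 1)
    ]


if __name__ == "__main__":
    for nodes, size in [(8, 2), (100, 5)]:
        print(f"nodes={nodes} shard size={size} combinations={shard_combinations(nodes, size):,}")
        for k, p in enumerate(overlap_distribution(nodes, size)):
            print(f"  overlap {k}: {p * 100:.7f}%")
```

The figure that matters for blast radius is the last row: only customers whose shard overlaps fully with the affected shard lose all of their nodes.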

  21. Shuffle sharding

  22. Better to react without reacting

  23. Let’s take an application …
    Region
    Availability zone a, Availability zone b, Availability zone c
    Application

  24. Region
    Availability zone a, Availability zone b, Availability zone c
    Application
    Requires 8 instances or containers

  25. Overload Failures

  26. Region
    Availability zone a, Availability zone b, Availability zone c
    Application

  27. Region
    Availability zone a, Availability zone b, Availability zone c
    Application
    Requires 6 instances or containers

  28. Let’s talk about timeouts, backoff & retries.

  30. https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html

  31. User 1, App, Conn Pool, DB
    Timeout client side = 10s
    Timeout backend side = default
    INSERT
    Retry INSERT
    Retry INSERT
    ERROR: Failed to get connection from pool
    Retry

  32. Set the timeouts through inheritance
    Timeout backend = Timeout client – time elapsed
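
One way to apply "Timeout backend = Timeout client – time elapsed" is to carry an absolute deadline through the call chain. The `Deadline` class and `backend_timeout` helper below are hypothetical names for a minimal sketch. The injectable `clock` is only there to make the behaviour easy to test.

```python
import time


class Deadline:
    """Absolute deadline created once at the client edge and passed downstream."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def remaining(self) -> float:
        """Seconds left before the client gives up (never negative)."""
        return max(0.0, self._expires_at - self._clock())


def backend_timeout(deadline: Deadline) -> float:
    """Timeout to hand to the backend call: client timeout minus time already elapsed."""
    remaining = deadline.remaining()
    if remaining <= 0:
        # The client has already stopped waiting, so doing the work would be wasted.
        raise TimeoutError("client deadline already exceeded")
    return remaining


if __name__ == "__main__":
    deadline = Deadline(timeout_s=10.0)
    time.sleep(0.1)  # simulated work before the database call
    print(f"backend timeout: {backend_timeout(deadline):.2f}s")
```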

  33. Backoff between retries
    User 1, Conn Pool, DB
    Timeout client side = 10s
    Timeout backend side = 10s – time elapsed
    INSERT
    Wait 2s before Retry
    Wait 4s before Retry
    Wait 8s before Retry
    Wait 16s before Retry

  34. Simple Exponential Backoff is not enough: Add Jitter
    [Charts: No jitter / With jitter]
    https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

  35. Adding Jitter
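
A minimal sketch of exponential backoff with jitter, using the "full jitter" variant (a uniformly random wait between zero and the exponential ceiling). The 2s base and 16s cap mirror the waits on the backoff slide. The function names and the injectable `rng` and `sleep` parameters are assumptions made for illustration.

```python
import random
import time


def backoff_delays(base_s=2.0, cap_s=16.0, attempts=4, rng=random.random):
    """Yield full-jitter delays: a random fraction of min(cap, base * 2**attempt)."""
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        yield rng() * ceiling


def call_with_retries(operation, delays, sleep=time.sleep):
    """Try once, then retry after each delay; re-raise the last error if all attempts fail."""
    try:
        return operation()
    except Exception as error:
        last_error = error
    for delay in delays:
        sleep(delay)
        try:
            return operation()
        except Exception as error:
            last_error = error
    raise last_error


if __name__ == "__main__":
    print([round(d, 2) for d in backoff_delays()])
```

Without the random factor, clients that failed together retry together. The jitter spreads those retries over the whole backoff window.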

  36. Circuit Breaker
    • Wrap a protected function call in a circuit breaker object, which monitors for failures.
    • If failures reach a certain threshold, the circuit breaker trips.
    Producer, Circuit Breaker, Consumer
    Connection, Monitoring, Timeouts, Breaking Circuit
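
The two bullets above can be sketched as a small wrapper object. The reset timeout and the single trial call after it elapses (often called half-open) are common additions that the slide does not mention. They are assumptions here, as are all class and parameter names.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open the consumer is not called at all, which gives it room to recover and gives the producer a fast, cheap failure.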

  37. Let’s talk about load shedding.

  40. Cheaply reject excess work
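
A minimal sketch of cheap rejection: cap the number of requests in flight and refuse the rest before doing any real work. The `LoadShedder` name, the limit, and the 503-style return value are illustrative assumptions.

```python
import threading


class LoadShedder:
    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work):
        # Non-blocking acquire: a rejected request costs only this check.
        if not self._slots.acquire(blocking=False):
            return 503, "overloaded, retry later"
        try:
            return 200, work()
        finally:
            self._slots.release()


if __name__ == "__main__":
    shedder = LoadShedder(max_in_flight=1)
    # The inner request arrives while the outer one still holds the only slot.
    print(shedder.handle(lambda: shedder.handle(lambda: "inner")))
```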

  42. Degrade & prioritize traffic with queues
    API Instances (3), High Priority Queue, Low Priority Queue, Worker Instances (2)
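
A sketch of the two-queue idea under assumed rules: workers always drain the high-priority queue first, and low-priority work is dropped once its queue is full. The class name and the drop policy are assumptions. The slide only shows the two queues between API and worker instances.

```python
from collections import deque


class TwoLevelQueue:
    def __init__(self, low_priority_limit: int):
        self._high = deque()
        self._low = deque()
        self._low_limit = low_priority_limit

    def submit(self, job, high_priority: bool) -> bool:
        """Enqueue a job; return False when low-priority work is shed."""
        if high_priority:
            self._high.append(job)
            return True
        if len(self._low) >= self._low_limit:
            return False  # degrade: drop low-priority work under pressure
        self._low.append(job)
        return True

    def next_job(self):
        """Workers take high-priority jobs first, then low-priority ones."""
        if self._high:
            return self._high.popleft()
        if self._low:
            return self._low.popleft()
        return None
```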

  43. Let’s talk about databases.

  44. Read / Write separation
    Instances (3), DB instance, DB instance read replicas (3)
    Supports degradation through Read-Only mode
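
A minimal sketch of read/write separation with a read-only fallback. The primary and the replicas are modelled as plain callables, statement classification is a naive `SELECT` prefix check, and all names are hypothetical. A real application would rely on its database driver or proxy for this.

```python
import itertools


class ReadOnlyModeError(RuntimeError):
    """Raised when a write is attempted while the primary is unavailable."""


class ReadWriteRouter:
    def __init__(self, primary, replicas):
        self._primary = primary
        self._replicas = itertools.cycle(replicas)  # round-robin over read replicas
        self.primary_available = True

    def execute(self, sql: str):
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read:
            return next(self._replicas)(sql)
        if not self.primary_available:
            # Degraded mode: reads keep working, writes are refused explicitly.
            raise ReadOnlyModeError("writes are disabled; service is read-only")
        return self._primary(sql)
```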

  45. Service Degradation & Fallbacks

  46. https://twitter.com/redditstatus/status/1116204502703493120

  47. Transient state does not belong in the database.

  48. Let’s talk about chaos engineering.

  49. Fire Drills

  50. GameDay at Amazon
    Creating Resiliency Through Destruction
    https://www.youtube.com/watch?v=zoz0ZjfrQ9s

  51. Chaos engineering
    https://github.com/Netflix/SimianArmy

  52. Failure injection
    • Start small & build confidence
    • Application level
    • Host failure
    • Resource attacks (CPU, memory, …)
    • Network attacks (dependencies, latency, …)
    • Region attacks
    • People attacks
    https://www.gremlin.com
    https://github.com/Netflix/SimianArmy
    https://chaostoolkit.org
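
For the application-level item in the list above, a small experiment can start with a decorator that adds latency or raises errors at a configurable rate. This is a sketch under assumed names and parameters and is not taken from any of the linked tools. The `rng` and `sleep` hooks exist so the behaviour can be made deterministic in tests.

```python
import functools
import random
import time


def inject_failure(error_rate=0.0, latency_s=0.0, rng=random.random, sleep=time.sleep):
    """Wrap a function so calls are delayed and fail with the given probability."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulated slow dependency
            if rng() < error_rate:
                raise RuntimeError("injected failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failure(error_rate=0.1, latency_s=0.05)
def fetch_profile(user_id: str) -> dict:
    return {"user_id": user_id}
```

Starting with a low error rate on a single call path keeps the experiment small and in line with the first bullet.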

  53. “Chaos engineering is NOT about breaking things randomly without a purpose, chaos engineering is about breaking things in a controlled environment and through well-planned experiments in order to build confidence in your application to withstand turbulent conditions.”

  54. And before we go.

  55. DON’T blame people for failure…

  56. Thank you!
    @adhorn
    https://medium.com/@adhorn
    https://speakerdeck.com/adhorn/patterns-for-building-resilient-software-systems-2019