Resiliency and Availability Design Patterns

As presented at https://devone.at/speakers/#adrianhornsby

---
We have traditionally built robust software systems by trying to avoid mistakes, by dodging failures when they occur in production, or by testing parts of the system in isolation from one another. Modern methods and techniques take a very different approach based on resiliency, which promotes embracing failure instead of trying to avoid it. Resilient architectures enhance observability and leverage well-known patterns such as graceful degradation, timeouts and circuit breakers. In this session, we will review the most useful patterns for building resilient software systems and show the audience how they can benefit from them.

Adrian Hornsby

April 11, 2019

Transcript

  1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark. Resiliency and Availability Design Patterns. Adrian Hornsby, Sr. Technical Evangelist, Amazon Web Services. @adhorn
  2. Distributed systems are hard: Amazon, Twitter, Netflix
  3. Resiliency: the ability of a system to handle and eventually recover from unexpected conditions
  4. Partial failure mode
  5. How do we build resilient software systems?
  6. People, Application, Network & Data, Infrastructure
  7. Let’s talk about isolation and containment
  8. Typical service application (diagram labels: Compute, Cell, Storage)
  9. Cell-based architecture (diagram labels: Regional Service; Cell 0, Cell 1, … Cell n, each with Compute and Storage)
  10. REGIONAL SERVICE. Without cells: zonal services in Zone A, Zone B and Zone C. With cells: multiple service cells in each of Zone A, Zone B and Zone C.
  11. System properties • Workload isolation • Failure containment • Scale-out vs. scale-up • Testability • Manageability (diagram labels: Service; Cell 0, Cell 1, … Cell n, Cell n+1)
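One way to picture the thin routing layer in front of the cells is a stable mapping from customer to cell. The sketch below is only an illustration under assumptions that are not in the talk: `cell_for` and `NUM_CELLS` are hypothetical names, and a simple hash-modulo mapping is assumed.

```python
import hashlib

NUM_CELLS = 4  # assumed number of cells, for illustration only


def cell_for(customer_id: str, num_cells: int = NUM_CELLS) -> int:
    """Map a customer to one cell using a stable hash of the customer ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


if __name__ == "__main__":
    for customer in ("alice", "bob", "carol"):
        print(customer, "-> cell", cell_for(customer))
```

Because every request from a given customer lands in the same cell, a failure in one cell is contained to the customers mapped to it. Hash-modulo remaps customers whenever the cell count changes, so a real router would probably keep an explicit mapping instead.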
  12. Let’s talk about blast radius
  13. X X X X X X X X ♤ ♡ ♢ ⚀ ⚁ ⚂ ⚃ ♧ ♢
  14. Cell-based architecture X X ♤ ♡ ♢ ⚀ ⚁ ⚂ ⚃ ♧ ♢
  15. Shuffle sharding X X ♤ ♡ ♢ ⚀ ⚁⚂ ⚃ ♡ ♤ ♧ ♢ ⚀⚂ ♧ ⚁⚃ ♢ ♢ ♡ ♧ ♢
  16. Shuffle sharding: Nodes = 8, Shard size = 2, Combinations = 28. Overlap (% of customers): 0 (53.6%), 1 (42.8%), 2 (3.6%)
  17. Shuffle sharding: Nodes = 100, Shard size = 5, Combinations = 75 million! Overlap (% of customers): 0 (77%), 1 (21%), 2 (1.8%), 3 (0.06%), 4 (0.0006%), 5 (0.0000013%)
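The figures on the two shuffle sharding slides can be reproduced with basic combinatorics, assuming each customer is assigned a shard chosen uniformly at random from all possible node combinations. The function name `overlap_distribution` is ours, not from the talk.

```python
from math import comb


def overlap_distribution(nodes: int, shard_size: int) -> dict[int, float]:
    """Probability that a random shard shares exactly k nodes with a given shard."""
    total = comb(nodes, shard_size)
    return {
        k: comb(shard_size, k) * comb(nodes - shard_size, shard_size - k) / total
        for k in range(shard_size + 1)
    }


if __name__ == "__main__":
    for nodes, shard_size in ((8, 2), (100, 5)):
        print(f"nodes={nodes} shard_size={shard_size} combinations={comb(nodes, shard_size)}")
        for k, p in overlap_distribution(nodes, shard_size).items():
            print(f"  overlap {k}: {p:.7%}")
```

With 100 nodes and shards of 5, the number of combinations is 75,287,520, which is the "75 million" on the slide. The computed percentages can differ from the slide in the last digit, since the slide appears to truncate rather than round.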
  18. Better to react without reacting
  19. Let’s take an application … (diagram labels: Region; Availability zone a, Availability zone b, Availability zone c; Application)
  20. Region: Availability zone a, Availability zone b, Availability zone c. Application: requires 8 instances or containers
  21. Region: Availability zone a, Availability zone b, Availability zone c. Application
  22. Region: Availability zone a, Availability zone b, Availability zone c. Application: requires 6 instances or containers
  23. Let’s talk about timeouts, backoff & retries.
  24. Diagram labels: User 1, App, DB, Conn Pool. INSERT. Timeout client side = 10s; timeout backend side = default. Retry INSERT; Retry INSERT; ERROR: Failed to get connection from pool; Retry
  25. Set the timeouts through inheritance: Timeout backend = Timeout client – time elapsed
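The inheritance rule on this slide can be sketched as a deadline object that is created when the client request arrives and consulted before each downstream call. This is only an illustration: `Deadline` and `call_backend` are hypothetical names, and the injectable `clock` exists purely to make the sketch testable.

```python
import time


class Deadline:
    """Tracks a client-side timeout budget across downstream calls."""

    def __init__(self, client_timeout_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + client_timeout_s

    def remaining(self) -> float:
        """Timeout backend = timeout client - time elapsed (never negative)."""
        return max(0.0, self._expires_at - self._clock())


def call_backend(deadline: Deadline) -> str:
    budget = deadline.remaining()
    if budget <= 0:
        raise TimeoutError("client deadline already exceeded; skip the backend call")
    # A real client would pass `budget` as the timeout of its DB or HTTP call.
    return f"calling backend with timeout={budget:.1f}s"
```

The point of the design is that the backend never keeps working (or holding a pooled connection) for a client that has already given up.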
  26. Backoff between retries. Diagram labels: User 1, DB, Conn Pool. INSERT. Timeout client side = 10s; timeout backend side = 10s – time elapsed. Wait 2s before retry, 4s before retry, 8s before retry, 16s before retry
  27. Simple exponential backoff is not enough: add jitter (charts: no jitter vs. with jitter) https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
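A minimal sketch of capped exponential backoff with "full jitter", the variant in which each wait is drawn uniformly between zero and the exponential ceiling. The 2s base and 16s cap mirror the waits on the previous slide; the function names, the attempt limit and the injectable `sleep` are assumptions for illustration.

```python
import random
import time


def backoff_with_full_jitter(attempt: int, base_s: float = 2.0, cap_s: float = 16.0) -> float:
    """Wait before retry number `attempt` (0-based): uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * 2 ** attempt))


def retry(operation, max_attempts: int = 5, sleep=time.sleep):
    """Call `operation` until it succeeds or attempts run out, backing off between tries."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_with_full_jitter(attempt))
```

Without jitter, clients that failed together retry together; randomizing the wait spreads those retries out over time.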
  28. Circuit Breaker • Wrap a protected function call in a circuit breaker object, which monitors for failures. • If failures reach a certain threshold, the circuit breaker trips. (Diagram labels: Producer, Circuit Breaker, Consumer; Connection Monitoring, Timeouts, Breaking Circuit)
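The two bullets above can be sketched as a small wrapper object. This is an illustrative implementation under assumptions of our own (a consecutive-failure threshold, a fixed reset timeout, and a single trial call once the timeout has elapsed), not the one used in the talk.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip (or re-trip) the breaker
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

While the circuit is open, callers fail immediately instead of waiting on timeouts, which gives the struggling dependency room to recover.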
  29. Let’s talk about load shedding.
  30. Cheaply reject excess work
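One cheap way to reject excess work is a bounded queue whose producer never blocks. The sketch below assumes that model; `LoadShedder` and `max_pending` are hypothetical names.

```python
import queue


class LoadShedder:
    """Accepts work into a bounded queue and rejects the excess immediately."""

    def __init__(self, max_pending: int):
        self._pending = queue.Queue(maxsize=max_pending)

    def submit(self, request) -> bool:
        try:
            self._pending.put_nowait(request)  # never blocks
            return True
        except queue.Full:
            return False  # caller can answer e.g. HTTP 503 without doing any work

    def next_request(self):
        """Hand the oldest accepted request to a worker."""
        return self._pending.get_nowait()
```

The rejection path does no parsing, no I/O and no waiting, so saying "no" costs far less than serving the request.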
  31. Degrade & prioritize traffic with queues (diagram labels: API Instance ×3, High Priority Queue, Low Priority Queue, Worker Instance ×2)
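A sketch of the two-queue idea, assuming workers always drain the high-priority queue first and that low-priority work is dropped once its queue is full. The class name and the limit are illustrative, and in-process deques stand in for whatever queue service sits between the API and worker instances.

```python
from collections import deque


class PriorityDispatcher:
    """Two queues; workers always take high-priority work first."""

    def __init__(self, low_priority_limit: int = 100):
        self.high = deque()
        self.low = deque()
        self.low_priority_limit = low_priority_limit

    def enqueue(self, job, high_priority: bool = False) -> bool:
        if high_priority:
            self.high.append(job)
            return True
        if len(self.low) >= self.low_priority_limit:
            return False  # degrade: drop low-priority work under pressure
        self.low.append(job)
        return True

    def next_job(self):
        """Return the next job for a worker, or None when both queues are empty."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```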
  32. Let’s talk about databases.
  33. Read / Write separation: supports degradation through Read-Only mode (diagram labels: DB instance; DB instance read replica ×3; Instance ×3)
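Read/write separation with a read-only fallback might look like the router below. It is a sketch under assumptions: endpoints are plain strings, `primary_available` is a flag that some health check (not shown) would flip, and all names are hypothetical.

```python
import itertools


class ReadOnlyModeError(RuntimeError):
    """Raised for writes while the primary is unavailable."""


class ReadWriteRouter:
    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)
        self.primary_available = True  # assumed to be flipped by a health check

    def endpoint_for(self, is_write: bool) -> str:
        if is_write:
            if not self.primary_available:
                raise ReadOnlyModeError("primary down; serving reads only")
            return self.primary
        return next(self._replicas)  # round-robin across read replicas
```

When the primary fails, reads keep flowing from the replicas and only writes are refused, so the application degrades instead of going down.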
  34. Service Degradation & Fallbacks
  35. Transient state does not belong in the database.
  36. Let’s talk about chaos engineering.
  37. GameDay at Amazon: Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  38. Chaos engineering https://github.com/Netflix/SimianArmy
  39. Failure injection • Start small & build confidence • Application level • Host failure • Resource attacks (CPU, memory, …) • Network attacks (dependencies, latency, …) • Region attacks • People attack https://www.gremlin.com https://github.com/Netflix/SimianArmy https://chaostoolkit.org
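Application-level failure injection can start as small as a decorator that adds latency and occasionally raises. The sketch below is a hand-rolled illustration with hypothetical names; it does not reflect how Gremlin, SimianArmy or Chaos Toolkit work.

```python
import functools
import random
import time


def inject_failure(failure_rate: float = 0.0, latency_s: float = 0.0,
                   rng=random.random, sleep=time.sleep):
    """Decorator that adds latency and randomly raises, to mimic a flaky dependency."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulated network delay
            if rng() < failure_rate:
                raise ConnectionError("injected failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Wrapping a single dependency call with a low `failure_rate` in a test environment is one way to "start small" before moving on to host, network or region experiments.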
  40. “Chaos engineering is NOT about breaking things randomly without a purpose; chaos engineering is about breaking things in a controlled environment and through well-planned experiments in order to build confidence in your application to withstand turbulent conditions.”
  41. DON’T blame people for failure…
  42. Thank you! @adhorn https://medium.com/@adhorn https://speakerdeck.com/adhorn/patterns-for-building-resilient-software-systems-2019