Resiliency and Availability Design Patterns for the Cloud

We have traditionally built robust architectures by trying to avoid mistakes or failures in production, or by testing parts of the system in isolation. Modern techniques take a very different approach: they embrace failure instead of trying to avoid it. Resilient architectures enhance observability and leverage well-known patterns such as graceful degradation, timeouts and circuit breakers, as well as newer patterns such as cell-based architecture and shuffle sharding. In this session, we will review the most useful patterns for building resilient software systems and show how the audience can benefit from them.

Transcript

  1. Resiliency and Availability Design Patterns for the Cloud. Sébastien Stormacq, Principal Developer Advocate, @sebsto. © 2021, Amazon Web Services, Inc. or its affiliates.
  2. Can you guess what will happen?
  3. Distributed systems are hard
  4. “Failures are a given and everything will eventually fail over time.” Werner Vogels, CTO, Amazon.com
  5. Resiliency: the ability of a system to handle and eventually recover from unexpected conditions
  6. Partial failure mode
  7. People, Application, Network & Data, Infrastructure
  8. Let’s talk about geo availability
  10. Fully-scaled Availability Zone
  11. Highly redundant regional network
  12. AWS Region and Availability Zones: 1 or more data centers per AZ; 2 or more AZs per Region (new Regions: minimum 3). [Diagram: a Region containing Availability Zones a, b and c, each with several data centers.]
  13. Availability in parallel

    Component              Availability          Downtime (per year)
    X                      99% (2 nines)         3 days 15 hours
    Two X in parallel      99.99% (4 nines)      52 minutes
    Three X in parallel    99.9999% (6 nines)    31 seconds
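The figures on slide 13 follow from the usual formula for redundant components, assuming their failures are independent: n copies in parallel, each with availability a, give 1 - (1 - a)^n. The sketch below reproduces the table; the function names are illustrative and the downtime is computed over a 365-day year.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures."""
    return 1.0 - (1.0 - single) ** copies


def yearly_downtime_seconds(availability: float) -> float:
    """Expected downtime per 365-day year, in seconds."""
    return (1.0 - availability) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for copies in (1, 2, 3):
        a = parallel_availability(0.99, copies)
        down = yearly_downtime_seconds(a)
        print(f"{copies} x 99%: availability {a:.6%}, about {down:,.0f} s of downtime per year")
```

In practice failures are rarely fully independent (shared power, network or deployment pipelines), so these numbers are best read as an upper bound.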
  14. Multi-AZ architecture. [Diagram: Elastic Load Balancing (ELB) in front of instances in Availability Zones a, b and c of a Region, with a DB instance and a standby DB instance.]
  15-16. Multi-AZ architecture. [Same diagram; each slide marks a failure with an X.]
  17. Multi-AZ architecture. [Same diagram with a failure marked X; the standby DB instance is now labelled “new master”.]
  18. Multi-AZ architecture
    • Enables fault-tolerant applications
    • AWS regional services are designed to withstand AZ failures
    • Leveraged by AWS regional services such as Amazon S3, Amazon DynamoDB, Amazon Aurora, Elastic Load Balancing (ELB), etc.
  19. Let’s talk about auto scaling
  20. Auto scaling. [Chart labels: Fixed, Variable.]
  21. Auto scaling for self-healing. [Diagram: Elastic Load Balancing (ELB) in front of an Auto Scaling group spanning Availability Zones 1 and 2 of an AWS Region, with one failure marked X.]
  22. Let’s talk about decoupling and async
  23. Decoupling with the async pattern
    • Synchronous: Process A calls Process B and is waiting while B is working, then gets the result.
    • Asynchronous: Process A calls Process B and continues, then gets or fetches the result later.
  24. [Diagram: API instances, a queue/stream, worker instances and a cache node.]
    • API: {DO foo}
    • PUT JOB: {JobID: 0001, Task: DO foo}
    • API: {JobID: 0001}
    • GET JOB: {JobID: 0001, Task: DO foo}
    • {JobID: 0001, Result: bar}
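The flow on slide 24 can be sketched in a single process, assuming an in-memory queue stands in for the queue/stream and a dictionary stands in for the cache node. The names `submit`, `fetch`, `worker` and the placeholder "work" are illustrative, not part of the talk.

```python
import queue
import threading
import uuid

jobs = queue.Queue()            # stands in for the queue/stream
results = {}                    # stands in for the cache node
results_lock = threading.Lock()


def submit(task: str) -> str:
    """API side: enqueue the task and return a job ID immediately."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, task))
    return job_id


def fetch(job_id: str):
    """API side: return the result if it is ready, otherwise None."""
    with results_lock:
        return results.get(job_id)


def worker() -> None:
    """Worker side: take jobs off the queue and store their results."""
    while True:
        item = jobs.get()
        if item is None:        # sentinel value: stop the worker
            jobs.task_done()
            return
        job_id, task = item
        outcome = f"done: {task}"   # placeholder for the real work
        with results_lock:
            results[job_id] = outcome
        jobs.task_done()
```

A caller would start `worker` in a thread, call `submit("DO foo")` to get a job ID back without waiting, and later call `fetch(job_id)` until it returns something other than `None`.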
  25. [Diagram: the same components as slide 24, plus a user. Labels: Push notification, Fetch results.]
  26. Degrade & prioritize traffic with queues. [Diagram: API instances, a high-priority queue, a low-priority queue and worker instances.]
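One way to read slide 26 is that workers always drain the high-priority queue first, and low-priority work is refused once its backlog grows too large. The class below is a minimal in-memory sketch under that assumption; `PriorityDispatcher` and `low_priority_limit` are hypothetical names.

```python
from collections import deque


class PriorityDispatcher:
    """Two queues: high-priority jobs are always served before low-priority ones."""

    def __init__(self, low_priority_limit: int = 100):
        self.high = deque()
        self.low = deque()
        self.low_priority_limit = low_priority_limit

    def submit(self, job, high_priority: bool = False) -> bool:
        """Enqueue a job; return False when low-priority work is being shed."""
        if high_priority:
            self.high.append(job)
            return True
        if len(self.low) >= self.low_priority_limit:
            return False        # degrade: refuse low-priority work under pressure
        self.low.append(job)
        return True

    def next_job(self):
        """Return the next job for a worker, or None when both queues are empty."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```

With separate managed queues, the same idea usually means dedicating or biasing worker capacity toward the high-priority queue.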
  27. Let’s talk about databases.
  28. Read / write separation. Supports degradation through read-only mode. [Diagram: instances, a DB instance and three DB read replicas.]
  29. Database federation. [Diagram: instances, a Users DB and a Products DB, each a DB instance with a read replica.]
  30. Database sharding. [Diagram: instances and shards A, B and C, each a DB instance with a read replica.]

    User      ShardID
    002345    A
    002346    B
    002347    C
    002348    B
    002349    A
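The lookup table on slide 30 maps each user to a shard. A minimal sketch of such a directory is shown below; the hash-based placement for users that are not yet in the table is an assumption added for illustration, not something the slide specifies.

```python
import hashlib

SHARDS = ["A", "B", "C"]

# Directory taken from the slide's table.
directory = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}


def shard_for(user_id: str) -> str:
    """Return the shard that holds this user's data."""
    if user_id in directory:
        return directory[user_id]
    # Assumed placement rule for new users: a stable hash of the user ID.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    shard = SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]
    directory[user_id] = shard  # remember the choice so it never changes
    return shard
```

Keeping an explicit directory, rather than hashing on every request, makes it possible to move a single user to another shard later.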
  31. Let’s talk about timeouts, backoff & retries!
  32. What happens if the DB “slows down”? [Diagram: Users, App, Conn Pool, DB, with several INSERTs in flight. Timeout client side: ? Timeout backend side: ?]
  33. [Diagram: User 1, App, Conn Pool, DB.] Timeout client side = 10s. Timeout backend side = default = infinite. The INSERT is retried again and again until: ERROR: Failed to get connection from pool.
  34. https://docs.microsoft.com/en-us/dotnet/api/system.net.httpwebrequest.timeout
  36. https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html
  37. https://pypi.org/project/timeout-decorator/

        import requests
        import timeout_decorator

        # Abort the call if it has not returned within 5 seconds.
        @timeout_decorator.timeout(5, timeout_exception=StopIteration)
        def timed_get(url):
            return requests.get(url)
  38. Set the timeouts!
  39. How else could we have prevented the error? [Diagram: User 1, Conn Pool, DB; the INSERT is retried repeatedly, then: ERROR: Failed to get connection from pool.]
  40. Backoff: backing off between retries, releasing connections. Timeout client side = 10s. Timeout backend side = 10s. Wait 2s before the first retry, then 4s, 8s and 16s before the following ones.
  41. Simple exponential backoff is not enough: add jitter. [Charts: No jitter, With jitter.] https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
  42. Example: add jitter 0-1000ms

        import random
        import time
        import requests

        MAX_TRIES = 12

        def get_item(self, url, n=1):
            try:
                # Assumed value: 10 s mirrors the client-side timeout on slide 40.
                res = requests.get(url, timeout=10)
            except requests.RequestException:
                if n > MAX_TRIES:
                    return None
                n += 1
                # Exponential backoff plus 0-1000 ms of random jitter.
                time.sleep((2 ** n) + (random.randint(0, 1000) / 1000.0))
                return self.get_item(url, n)
            else:
                return res
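The article linked on slide 41 also describes a "full jitter" variant, where the whole delay is drawn at random between zero and a capped exponential bound. The sketch below shows that variant as an iterative helper; `call_with_retries`, its parameters and the choice to retry on any exception are assumptions for illustration, and real code would usually retry only on errors known to be transient.

```python
import random
import time


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, max_tries: int = 5, base: float = 1.0,
                      cap: float = 30.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_tries` attempts have failed."""
    for attempt in range(max_tries):
        try:
            return operation()
        except Exception:
            if attempt == max_tries - 1:
                raise           # out of attempts: surface the last error
            sleep(full_jitter_delay(attempt, base, cap))
```

The cap keeps the wait bounded after many failures, and the `sleep` parameter exists only so the helper can be exercised without real waiting.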
  44. Idempotent operation: no additional effect if it is called more than once with the same input parameters.
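Retries are only safe when the retried operation is idempotent. One common way to get there, sketched below as an assumption rather than as the talk's own method, is to have the caller send a unique request ID and to remember the outcome per ID. `Account`, `deposit` and `request_id` are hypothetical names.

```python
class Account:
    """Deposits are deduplicated by a caller-supplied request ID."""

    def __init__(self):
        self.balance = 0
        self._processed = {}    # request_id -> balance returned the first time

    def deposit(self, request_id: str, amount: int) -> int:
        """Apply the deposit once; a repeated request_id has no additional effect."""
        if request_id in self._processed:
            return self._processed[request_id]
        self.balance += amount
        self._processed[request_id] = self.balance
        return self.balance
```

A client that times out and retries `deposit("req-1", 10)` gets the same answer back, and the balance changes only once.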
  45. Circuit breaker
    • Wrap a protected function call in a circuit breaker object, which monitors for failures.
    • If failures reach a certain threshold, the circuit breaker trips.
    [Diagram: Producer, Circuit Breaker, Consumer. Labels: Connection, Monitoring, Timeouts, Breaking Circuit.]
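The two bullets on slide 45 can be sketched as a small wrapper class. This is a minimal illustration, not the Hystrix or Spring implementation; the threshold, the reset timeout and the single trial call after the timeout (often called the half-open state) are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # trip, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None   # success closes the circuit
        return result
```

While the circuit is open, callers fail fast instead of holding connections and threads against a dependency that is already struggling.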
  46. https://github.com/Netflix/Hystrix
  47. https://spring.io/guides/gs/circuit-breaker/
  48. Let’s talk about health checking!
  49. Probing for health. [Diagram: an AWS Region with Availability Zones 1 and 2, Auto Scaling groups running Service A and Service B, and dependencies: database, email, cluster.]
  50. Shallow health check. “Are you healthy?” “Yes.” [Diagram: an instance and its dependencies: cache node, email, database, cluster. Only the instance is asked.]
  52. Deep health check. “Are you healthy?” “Yes.” [The question is also passed on to the four dependencies, and all answer yes.]
  53. Deep health check. “Are you healthy?” “No.” [One dependency answers no and the other three answer yes.]
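The difference between the two kinds of check can be sketched as two functions: the shallow one only proves the process can answer, while the deep one also probes every dependency and reports unhealthy if any probe fails. The function names and the shape of the returned dictionary are illustrative assumptions.

```python
from typing import Callable, Dict


def shallow_health() -> dict:
    """Only proves that this process is up and able to answer."""
    return {"healthy": True}


def deep_health(dependencies: Dict[str, Callable[[], bool]]) -> dict:
    """Probe each dependency; unhealthy if any probe fails or raises."""
    checks = {}
    for name, probe in dependencies.items():
        try:
            checks[name] = bool(probe())
        except Exception:
            checks[name] = False
    return {"healthy": all(checks.values()), "checks": checks}
```

A deep check gives a fuller picture but costs more and can take healthy instances out of rotation when a shared dependency is slow, which is one reason to prefer the shallow check under load.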
  54. Prioritize shallow health checks during hard times. Cache and be careful with logging.
  55. Let’s talk about load shedding.
  59. Load shedding
    • Don’t be overly optimistic and take on more than you can handle.
    • Find an operational metric to reject what you cannot take in.
    • Favor cached and static content.
    • Prioritize ELB health check (shallow) pings.
    • In an overload situation your resources are precious; do not let any of them go to waste.
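A minimal load-shedding sketch, assuming the operational metric is the number of requests currently in flight: above the limit, ordinary requests are rejected straight away, while health-check pings are still served. `LoadShedder`, `handle` and the 503 response text are hypothetical.

```python
import threading


class LoadShedder:
    """Rejects new work once too many requests are already in flight."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self, is_health_check: bool = False) -> bool:
        with self._lock:
            if not is_health_check and self._in_flight >= self.max_in_flight:
                return False    # shed: cheaper to refuse now than to time out later
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1


def handle(shedder: LoadShedder, work, is_health_check: bool = False):
    """Run `work` if there is capacity, otherwise answer 503 immediately."""
    if not shedder.try_acquire(is_health_check):
        return 503, "overloaded, try again later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

Letting shallow health checks through keeps the load balancer from marking an overloaded but otherwise working instance as dead.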
  60. Service degradation & fallbacks
  61. https://twitter.com/redditstatus/status/1116204502703493120
  62. Let’s talk about sharding.
  64. Assign customers to cells. [Diagram: Cell 1 to Cell 6; slides 64 to 70 build up the same diagram.]
  71. Measure for this: blast radius
  72. Blast radius
    • How many customers?
    • What functionality?
    • How many locations?
  73. Shuffle sharding
  74. Assign customers to random cells. [Diagram: Cell 1 to Cell 6; slides 74 to 78 build up the same diagram.]
  79. Shuffle sharding: nodes = 8, shard size = 2, combinations = 28

    Overlap    % customers
    0          53.6%
    1          42.8%
    2          3.6%
  80. Shuffle sharding: nodes = 100, shard size = 5, combinations = 75 million!

    Overlap    % customers
    0          77%
    1          21%
    2          1.8%
    3          0.06%
    4          0.0006%
    5          0.0000013%
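The numbers on slides 79 and 80 can be recomputed with binomial coefficients, assuming every customer is assigned a uniformly random shard of the given size. The chance that another customer shares exactly k nodes with you is C(s, k) · C(n − s, s − k) / C(n, s), for n nodes and shard size s. The function names below are illustrative.

```python
from math import comb


def shard_combinations(nodes: int, shard_size: int) -> int:
    """Number of distinct shards of `shard_size` nodes out of `nodes`."""
    return comb(nodes, shard_size)


def overlap_probability(nodes: int, shard_size: int, overlap: int) -> float:
    """Probability that a random shard shares exactly `overlap` nodes with yours."""
    shared = comb(shard_size, overlap)
    others = comb(nodes - shard_size, shard_size - overlap)
    return shared * others / comb(nodes, shard_size)


if __name__ == "__main__":
    for nodes, size in ((8, 2), (100, 5)):
        print(f"nodes={nodes}, shard size={size}, combinations={shard_combinations(nodes, size):,}")
        for k in range(size + 1):
            print(f"  overlap {k}: {overlap_probability(nodes, size, k):.7%}")
```

Full overlap, the only case where one customer's problem takes out every node another customer depends on, is the last row of each table; the 42.8% on slide 79 appears to be 42.857% truncated rather than rounded.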
  81. Shuffle sharding
  82. Let’s talk about chaos!
  83. GameDay at Amazon: Creating Resiliency Through Destruction. https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  84. Chaos engineering. https://github.com/Netflix/SimianArmy
  85. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org
  86. Failure injection: start small & build confidence
    • Application level
    • Host failure
    • Resource attacks (CPU, memory, …)
    • Network attacks (dependencies, latency, …)
    • Region attacks
  87. AWS Fault Injection Simulator: fully managed chaos engineering service on AWS. Coming soon.
  88. Demo
  89. Learn more.
  90. https://aws.amazon.com/wellarchitected
  91. https://medium.com/@adhorn
  93. Plan for the worst, prepare for the unexpected.
  94. Thank you! Sébastien Stormacq. @sebsto, /sebsto, /sebsto, /sebAWS