Resiliency and Availability Design Patterns for the Cloud

We have traditionally built robust software systems by trying to avoid mistakes, by dodging failures when they occur in production, or by testing parts of the system in isolation from one another. Modern methods and techniques take a very different approach based on resiliency, which promotes embracing failure instead of trying to avoid it. Resilient architectures enhance observability and leverage well-known patterns. In this session, we will review the most useful patterns for building resilient software systems, such as graceful degradation, timeouts and circuit breakers, as well as newer patterns like cell-based architecture and shuffle sharding.

Transcript

  1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential. Resiliency and Availability Design Patterns for the Cloud. Sébastien Stormacq, Technical Evangelist, AWS EMEA, @sebsto
  2. Distributed systems are hard
  3. Complex systems: Amazon, Twitter, Netflix
  4. Partial failure mode
  5. Resiliency: the ability of a system to handle and eventually recover from unexpected conditions
  6. …
  7. https://aws.amazon.com/wellarchitected
  8. “Quality is not an act, it is a habit.” Aristotle, some time around 350 BC
  9. How do we build resilient software systems?
  10. Availability in parallel: $A = 1 - (1 - A_x)^2$ (Diagram: two copies of Part X in parallel.)
  11. Availability in parallel

      Component             Availability          Downtime per year
      X                     99% (2-nines)         3 days 15 hours
      Two X in parallel     99.99% (4-nines)      52 minutes
      Three X in parallel   99.9999% (6-nines)    31 seconds
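The figures in the table follow from the parallel-availability formula generalized to n copies, $A = 1 - (1 - A_x)^n$. A minimal sketch that reproduces them, assuming the copies fail independently and a 365-day year (the function names are illustrative, not from the deck):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components that fail independently."""
    return 1 - (1 - component_availability) ** copies


def yearly_downtime_seconds(availability: float) -> float:
    """Expected downtime per year, in seconds, for a given availability."""
    return (1 - availability) * SECONDS_PER_YEAR


for n in (1, 2, 3):
    a = parallel_availability(0.99, n)
    print(f"{n} x in parallel: {a:.6%} available, "
          f"{yearly_downtime_seconds(a):,.0f} s of downtime per year")
```

Real components rarely fail fully independently (shared power, network, or software bugs), so these numbers are an upper bound on what redundancy alone delivers.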
  12. AWS Region and Availability Zones. (Diagram: a Region with Availability Zones a, b and c, each containing several data centers.) n data centers per AZ (1 or more); n AZs per Region (typically 3+); high-speed private fiber links.
  13. Let’s talk about Multi-AZ
  14. Multi-AZ architecture. (Diagram: a Region with Availability Zones a, b and c; instances in each zone; a DB instance and a DB instance standby.)
  15. Multi-AZ architecture. (Same diagram, with a failure marked X.)
  16. Multi-AZ architecture. (Same diagram, with a failure marked X.)
  17. Multi-AZ architecture. (Same diagram, with a failure marked X; only one DB instance remains.)
  18. Multi-AZ architecture
      • Enables fault-tolerant applications
      • AWS regional services designed to withstand AZ failures
      • Leveraged by most AWS regional services (S3, DynamoDB, ELBs, …)
      (Diagram: a Region with Availability Zones a, b and c; instances in each zone; a DB instance and a DB instance standby.)
  19. Let’s take an application … (Diagram: an application in a Region with Availability Zones a, b and c.)
  20. Let’s take an application … (Diagram: an application in a Region with Availability Zones a, b and c.)
  21. Requires 8 instances. (Diagram: an application in a Region with Availability Zones a, b and c.)
  22. Overload failures
  23. (Diagram: an application in a Region with Availability Zones a, b and c.)
  24. Requires 6 instances. (Diagram: an application in a Region with Availability Zones a, b and c.)
  25. Better to react without reacting
  26. But why?
  27. 3 AZs are better than 2
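One reading of the "8 instances" versus "6 instances" slides: if the application must keep serving its full load after losing any single AZ without launching anything new, the remaining AZs must already hold enough capacity. The sketch below works through that arithmetic. The baseline of 4 required instances is an assumption chosen because it reproduces the slide figures; the transcript does not state it.

```python
import math


def instances_to_provision(required: int, az_count: int) -> int:
    """Total instances so that `required` are still running after any one AZ is lost."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive the loss of one")
    # The surviving (az_count - 1) zones must carry the whole load.
    per_az = math.ceil(required / (az_count - 1))
    return per_az * az_count


REQUIRED = 4  # hypothetical baseline, not stated in the deck
for azs in (2, 3):
    total = instances_to_provision(REQUIRED, azs)
    print(f"{azs} AZs: provision {total} instances ({total // azs} per AZ)")
```

Under this assumption, spreading over three AZs needs 50% over-provisioning instead of 100%, which is one way to understand why three AZs are better than two.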
  28. Let’s talk about auto scaling
  29. Auto scaling: fixed vs. variable
  30. Auto scaling for self-healing. (Diagram: an Auto Scaling group spanning Availability Zones 1 and 2 in an AWS Region, with a failure marked X.)
      • Set min > 0
  32. Let’s talk about decoupling and async
  33. Decoupling with the async pattern. Synchronous: Process A calls Process B and waits while B is working, then gets the result. Asynchronous: Process A calls Process B, continues, and later gets or fetches the result.
  34. (Diagram labels: API instances, queue/streaming, worker instances, cache node.)

      API: {DO foo}
      PUT JOB: {JobID: 0001, Task: DO foo}
      API: {JobID: 0001}
      GET JOB: {JobID: 0001, Task: DO foo}
      {JobID: 0001, Result: bar}
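The job flow on this slide can be sketched in-process with the standard library. This is only an illustration under stated assumptions: `queue.Queue` stands in for the queue or stream, a dictionary stands in for the cache node, and `submit`, `fetch_result` and `worker` are hypothetical names. A real system would use managed services between separate processes.

```python
import queue
import threading
import uuid

jobs = queue.Queue()             # stands in for the queue/stream
results = {}                     # stands in for the cache node
results_lock = threading.Lock()


def submit(task: str) -> str:
    """API side: enqueue the task and hand back a job ID immediately."""
    job_id = uuid.uuid4().hex
    jobs.put({"JobID": job_id, "Task": task})
    return job_id


def fetch_result(job_id: str):
    """API side: return the stored result, or None if the job is not done yet."""
    with results_lock:
        return results.get(job_id)


def worker() -> None:
    """Worker side: take jobs off the queue and store each result in the cache."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel: stop the worker
            jobs.task_done()
            break
        outcome = job["Task"].upper()  # placeholder for the real work
        with results_lock:
            results[job["JobID"]] = {"JobID": job["JobID"], "Result": outcome}
        jobs.task_done()


worker_thread = threading.Thread(target=worker, daemon=True)
worker_thread.start()
job_id = submit("do foo")        # returns at once; the caller is not blocked
jobs.join()                      # demo only: wait until the worker has finished
print(fetch_result(job_id))
jobs.put(None)
worker_thread.join()
```

The caller only ever waits for the enqueue, so a slow or failed worker delays the result rather than the API response.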
  35. (Diagram labels: user, API instances, queue/streaming, worker instances, cache node; push notification; fetch results.)
  36. Degrade & prioritize traffic with queues. (Diagram labels: API instances, high-priority queue, low-priority queue, worker instances.)
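A minimal sketch of the two-queue idea, assuming in-process queues and a hypothetical `next_job` helper: workers always drain the high-priority queue first, so under load the low-priority traffic is what gets delayed.

```python
import queue

high_priority = queue.Queue()
low_priority = queue.Queue()


def next_job():
    """Return the next job, serving the high-priority queue before the low-priority one."""
    for q in (high_priority, low_priority):
        try:
            return q.get_nowait()
        except queue.Empty:
            continue
    return None  # nothing to do


low_priority.put("rebuild recommendations")
high_priority.put("checkout order")
print(next_job())  # the high-priority job, although it was queued second
print(next_job())
print(next_job())
```

Strict priority like this can starve the low-priority queue entirely; whether that is acceptable depends on what the low-priority work is.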
  37. Let’s talk about timeouts, backoff & retries!
  38. What happens if the DB “slows down”? (Diagram labels: users, app, connection pool, DB; several INSERTs in flight.) Timeout client side: ? Timeout backend side: ?
  39. (Diagram labels: User 1, app, connection pool, DB.) Timeout client side = 10s. Timeout backend side = default = infinite. INSERT, retry INSERT, retry INSERT, retry. ERROR: Failed to get connection from pool
  41. https://pypi.org/project/timeout-decorator/

      import requests
      import timeout_decorator

      # Abort the call after 5 seconds, raising StopIteration instead of the default error.
      @timeout_decorator.timeout(5, timeout_exception=StopIteration)
      def timed_get(url):
          return requests.get(url)
  42. Set the timeouts!
  43. How else could we have prevented the error? (Diagram labels: User 1, connection pool, DB.) INSERT, retry INSERT, retry INSERT, retry. ERROR: Failed to get connection from pool
  44. Backoff: backing off between retries, releasing connections. (Diagram labels: User 1, connection pool, DB.) Timeout client side = 10s. Timeout backend side = 10s. INSERT; wait 2s before retry; wait 4s before retry; wait 8s before retry; wait 16s before retry.
  45. Simple exponential backoff is not enough: add jitter. (Charts: no jitter vs. with jitter.) https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
  46. Adding jitter
  48. Example: add jitter 0-1000ms

      import random
      import time

      import requests

      def get_item(self, url, n=1):
          MAX_TRIES = 12
          try:
              res = requests.get(url)
          except requests.RequestException:
              # Give up once the retry budget is spent.
              if n > MAX_TRIES:
                  return None
              n += 1
              # Exponential backoff (2**n seconds) plus 0-1000 ms of random jitter.
              time.sleep((2 ** n) + (random.randint(0, 1000) / 1000.0))
              return self.get_item(url, n)
          else:
              return res
  49. https://pypi.org/project/backoff/

      import backoff

      # Exponential backoff with full jitter, retrying for at most 60 seconds.
      # `Exception` is a placeholder: narrow it to the errors your queue client raises.
      @backoff.on_exception(backoff.expo, Exception, max_time=60, jitter=backoff.full_jitter)
      def poll_for_message(queue):
          return queue.get()

      As of version 1.2, the default jitter function backoff.full_jitter implements the ‘Full Jitter’ algorithm as defined in the AWS Architecture Blog’s Exponential Backoff And Jitter post.
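For reference, the "Full Jitter" rule from that blog post picks each delay uniformly between zero and the capped exponential value. A minimal sketch follows; the function name and the default `base` and `cap` values are assumptions for illustration, not taken from the deck or the library.

```python
import random


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Drawn uniformly from [0, min(cap, base * 2**attempt)].
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))


for attempt in range(5):
    print(f"attempt {attempt}: sleep {full_jitter_delay(attempt):.2f}s")
```

Compared with adding a small random offset to a fixed exponential delay, drawing from the whole interval spreads retries from many clients more evenly over time.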
  50. Idempotent operation: no additional effect if it is called more than once with the same input parameters.
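Idempotency is what makes retries safe. One common way to get it, sketched below, is to have the caller send an idempotency key and to store the response under that key. `PaymentService` and its fields are hypothetical names used only for this example, and a real service would keep the keys in durable storage.

```python
class PaymentService:
    """Toy service whose `charge` operation is idempotent per idempotency key."""

    def __init__(self):
        self._processed = {}   # idempotency key -> response already returned
        self.charges = []      # side effects actually performed

    def charge(self, idempotency_key: str, amount: int) -> dict:
        # A repeated call with the same key returns the stored response
        # and performs no additional side effect.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        self.charges.append(amount)
        response = {"status": "charged", "amount": amount}
        self._processed[idempotency_key] = response
        return response


service = PaymentService()
first = service.charge("order-42", 10)
retry = service.charge("order-42", 10)   # e.g. the client timed out and retried
print(first == retry, service.charges)
```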
  51. Circuit breaker: wrap a protected function call in a circuit breaker object, which monitors for failures. If failures reach a certain threshold, the circuit breaker trips. (Diagram labels: producer, circuit breaker, consumer; connection, monitoring, timeouts, breaking circuit.)
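The description above can be sketched as a small wrapper object. This is a simplified illustration, not how Hystrix or any particular library implements it: the class name, the consecutive-failure threshold, the reset timeout and the single trial call after the timeout are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Trip on reaching the threshold, or re-open after a failed trial call.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)
print(breaker.call(lambda: "ok"))
```

Failing fast while the circuit is open keeps callers from piling up on a dependency that is already struggling, which gives it room to recover.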
  52. https://github.com/Netflix/Hystrix
  53. https://spring.io/guides/gs/circuit-breaker/
  54. Let’s talk about health checking!
  55. Probing for health. (Diagram: an AWS Region with Availability Zones 1 and 2; Auto Scaling groups running Service A and Service B; dependencies: database, email, cluster.)
  56. Shallow health check. (Diagram labels: users, instance, cache node, email, database, cluster.) Are you healthy? Yes.
  57. Shallow health check. (Diagram labels: users, instance, cache node, email, database, cluster.) Are you healthy? Yes.
  58. Deep health check. (Diagram labels: users, instance, cache node, email, database, cluster.) Are you healthy? Yes. Dependencies asked “Are you healthy?”: yes, yes, yes, yes.
  59. Deep health check. (Diagram labels: users, instance, cache node, email, database, cluster.) Are you healthy? No. Dependencies asked “Are you healthy?”: no, yes, yes, yes.
  60. Prioritize shallow health checks during hard times. Cache.
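The difference between the two kinds of check can be sketched as follows. The names, the dependency callables and the time-to-live are hypothetical, and reading "Cache." as "cache the deep check result for a short time" is an assumption about the slide.

```python
import time


def shallow_health() -> bool:
    """Shallow check: the process is up and can answer. No dependency is contacted."""
    return True


class DeepHealthCheck:
    """Deep check: ask every dependency, and cache the answer for `ttl` seconds."""

    def __init__(self, dependency_checks, ttl=5.0, clock=time.monotonic):
        self._checks = dependency_checks   # name -> zero-argument callable returning bool
        self._ttl = ttl
        self._clock = clock
        self._cached = None                # (timestamp, result) of the last probe

    def status(self) -> dict:
        now = self._clock()
        if self._cached and now - self._cached[0] < self._ttl:
            return self._cached[1]         # reuse the recent result
        details = {}
        for name, check in self._checks.items():
            try:
                details[name] = bool(check())
            except Exception:
                details[name] = False      # an erroring dependency counts as unhealthy
        result = {"healthy": all(details.values()), "dependencies": details}
        self._cached = (now, result)
        return result


deep = DeepHealthCheck({"database": lambda: True, "cache node": lambda: True})
print(shallow_health(), deep.status())
```

Caching keeps frequent health probes from adding load to dependencies that may already be under pressure.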
  61. Service degradation & fallbacks
  62. Let’s talk about chaos!
  63. Fire drills
  64. GameDay at Amazon: Creating Resiliency Through Destruction. https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  65. Chaos engineering. https://github.com/Netflix/SimianArmy
  66. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org
  67. Failure injection: start small & build confidence
      • Application level
      • Host failure
      • Resource attacks (CPU, memory, …)
      • Network attacks (dependencies, latency, …)
      • Region attacks
      • “Paul” attack
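At the application level, failure injection can start as small as a decorator that adds latency or raises an error on a fraction of calls. The sketch below is a hypothetical helper, not part of SimianArmy or any other tool named in the deck, and it should only be enabled where an experiment has been planned.

```python
import functools
import random
import time


class InjectedFailure(Exception):
    """Raised in place of the real call when a fault is injected."""


def inject_faults(failure_rate=0.0, max_latency=0.0, rng=random):
    """Wrap a function so some calls are delayed and some fail on purpose."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if max_latency > 0:
                time.sleep(rng.uniform(0, max_latency))   # latency attack
            if rng.random() < failure_rate:               # failure attack
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.1, max_latency=0.05)
def lookup_price(item: str) -> int:
    return 42  # placeholder for a real dependency call


try:
    print(lookup_price("book"))
except InjectedFailure as exc:
    print(f"caller must cope with: {exc}")
```

Starting with a low failure rate on a single call path matches the slide's advice to start small and build confidence before moving to host, network or region experiments.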
  68. Plan for the worst, prepare for the unexpected.
  69. Today (Thursday 16th)
      09:15 An Introduction to AWS for Developers
      13:40 Understanding Graph Databases
      15:40 Deep Learning Demystified: A (mostly) Effortless Introduction
      17:00 Resiliency and Availability Design Patterns for the Cloud
  70. Tomorrow (Friday 17th)
      10:30 AI & Machine Learning at Amazon
      13:40 How do web sites go serverless?
      15:40 How to build a multi-region application in the cloud?
      17:20 Simplify your web and mobile apps with a serverless backend in the cloud.
  71. Saturday 18th
      10:30 An Introduction to AWS for Developers
      13:40 Use artificial intelligence services in your applications without being a machine learning expert.
      15:30 Host your website on AWS, serverless and with continuous integration.
  72. Thank you, please share feedback! Sébastien Stormacq, Technical Evangelist, AWS EMEA, @sebsto