Patterns for building resilient software systems - 2019

We have traditionally built robust software systems by trying to avoid mistakes, by dodging failures when they occur in production, or by testing parts of the system in isolation from one another. Modern methods and techniques take a very different approach based on resiliency, which promotes embracing failure instead of trying to avoid it. Resilient architectures enhance observability, leverage well-known patterns such as graceful degradation, timeouts and circuit breakers, and embrace chaos engineering, a discipline that promotes breaking things on purpose in order to learn how to build more resilient systems. In this session, I will review the most useful patterns for building resilient software systems, introduce the chaos engineering methodology, and show the audience how they can benefit from breaking things on purpose.

Adrian Hornsby

January 30, 2019

Transcript

  1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark.

    Patterns for Building Resilient Software Systems. Adrian Hornsby, Sr. Technical Evangelist, Amazon Web Services, @adhorn
  2. Complex systems: Amazon, Twitter, Netflix
  3. Partial failure mode
  4. Resiliency: Ability for a system to handle and eventually recover from unexpected conditions
  5. How do we build resilient software systems?
  6. …
  7. “Quality is not an act, it is a habit” (Aristotle, some time around 350 BC)
  8. 10 Patterns for Resilient Architecture
  9. The Famous 9s of availability

    Availability          Downtime per year
    99% (2-nines)         3 days 15 hours
    99.9% (3-nines)       8 hours 45 minutes
    99.99% (4-nines)      52 minutes
    99.999% (5-nines)     5 minutes
    99.9999% (6-nines)    31 seconds

  10. Availability in parallel: $A = 1 - (1 - A_x)^2$ for two copies of Part X in parallel

  11. Availability in parallel

    Component              Availability          Downtime per year
    X                      99% (2-nines)         3 days 15 hours
    Two X in parallel      99.99% (4-nines)      52 minutes
    Three X in parallel    99.9999% (6-nines)    31 seconds
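The parallel-availability figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the redundant copies fail independently; the function names and the 365-day year are illustrative choices, not taken from the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def parallel_availability(a_x: float, n: int) -> float:
    """Availability of n redundant copies, each with availability a_x.

    Generalizes the slide's formula A = 1 - (1 - A_x)^2 to n copies,
    assuming the copies fail independently of one another.
    """
    return 1 - (1 - a_x) ** n


def downtime_seconds_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1 - availability) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for copies in (1, 2, 3):
        a = parallel_availability(0.99, copies)
        minutes = downtime_seconds_per_year(a) / 60
        print(f"{copies} x 99%: availability {a:.6f}, about {minutes:.1f} min/year down")
```

Correlated failures (a shared dependency, a bad deployment) break the independence assumption, so real systems usually land below these numbers.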
  12. AWS Region and availability zones (diagram): a Region contains n Availability Zones (typically 3+); each Availability Zone contains n data centers (1 or more)
  13. Pattern 1: Multi-AZ architecture (diagram): a Region with Availability zones a, b and c, Instances in each zone, a DB Instance and a standby DB instance
  14. Multi-AZ architecture (diagram): same layout, with a failure marked X
  15. Multi-AZ architecture (diagram): same layout, with a failure marked X
  16. Multi-AZ architecture (diagram): Instances in Availability zones a, b and c and a single DB Instance, with a failure marked X
  17. Multi-AZ architecture
    • Enables fault-tolerant applications
    • AWS regional services designed to withstand AZ failures
    • Leveraged by most AWS regional services (S3, DynamoDB, ELBs, …)
  18. Zonal services
    • Amazon EC2 instances
    • Amazon EBS volumes
    • Amazon EMR clusters
    • AWS CloudHSM instances
    • NAT gateways
    • etc.

    Regional services
    • Amazon S3
    • Amazon DynamoDB
    • Amazon EFS
    • Amazon Aurora Serverless
    • Amazon API Gateway
    • AWS Fargate
    • AWS Lambda
    • Amazon Kinesis
    • Amazon SQS
    • etc.
  19. Availability zone failure (diagram: Zone a, Zone a)
  20. Theoretical blast radius (diagram: Zone a, Zone a)
  21. Cell-based architecture
  22. Typical service application (diagram): a Cell containing Compute and Storage
  23. Pattern 2: Cell-based architecture (diagram): a Regional Service in front of Cell 0, Cell 1 … Cell n, each with its own Compute and Storage
  24. Cells and Availability zones (diagram: Zone a, Zone a)
  25. Theoretical blast radius (diagram: Zone a, Zone a)
  26. Cells and Availability Zones – zonal services (diagram: Zone a, Zone a)
  27. Availability zone failure (diagram: Zone a, Zone a)
  28. Partial availability zone failure (diagram: Zone a, Zone a)
  29. System properties (diagram: a Service in front of Cell 0, Cell 1 … Cell n, with a failure marked X)
    • Workload isolation
    • Failure containment
    • Scale-out vs. scale-up
    • Testability
    • Manageability
  30. System properties (diagram: a Service in front of Cell 0, Cell 1 … Cell n, with Cell n+1 added)
    • Workload isolation
    • Failure containment
    • Scale-out vs. scale-up
    • Testability
    • Manageability
  31. (diagram) X X X X X X X X ♤ ♡ ♢ ⚀ ⚁ ⚂ ⚃ ♧ ♢
  32. Cascading Failures
  33. Cell-based architecture (diagram) X X ♤ ♡ ♢ ⚀ ⚁ ⚂ ⚃ ♧ ♢
  34. Pattern 3: Shuffle sharding (diagram) X X ♤ ♡ ♢ ⚀ ⚁⚂ ⚃ ♡ ♤ ♧ ♢ ⚀⚂ ♧ ⚁⚃ ♢ ♢ ♡ ♧ ♢
  35. Shuffle sharding: Nodes = 8, Shard size = 2, Combinations = 28

    Overlap    % customers
    0          53.6%
    1          42.8%
    2          3.6%

  36. Shuffle sharding: Nodes = 100, Shard size = 5, Combinations = 75 million!

    Overlap    % customers
    0          77%
    1          21%
    2          1.8%
    3          0.06%
    4          0.0006%
    5          0.0000013%
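The overlap percentages in the two shuffle-sharding tables follow from counting combinations. The sketch below assumes every customer is assigned a shard chosen uniformly at random from all possible node combinations; the function name is illustrative.

```python
from math import comb
from typing import Dict


def overlap_distribution(nodes: int, shard_size: int) -> Dict[int, float]:
    """Fraction of other shards that share exactly k nodes with a given shard.

    Assumes shards are drawn uniformly from all C(nodes, shard_size)
    combinations, which makes the overlap hypergeometric.
    """
    total = comb(nodes, shard_size)
    return {
        k: comb(shard_size, k) * comb(nodes - shard_size, shard_size - k) / total
        for k in range(shard_size + 1)
    }


if __name__ == "__main__":
    for nodes, shard_size in ((8, 2), (100, 5)):
        print(f"nodes={nodes}, shard size={shard_size}, "
              f"combinations={comb(nodes, shard_size)}")
        for k, share in overlap_distribution(nodes, shard_size).items():
            print(f"  overlap {k}: {share:.7%}")
```

The useful property is in the last row: only a tiny fraction of customers share every node with any one customer, so a single "poison" workload takes very few others down completely.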
  37. Auto-Scaling (diagram: Fixed vs. Variable)
  38. Pattern 4: Auto-scaling for self-healing (diagram: an Auto Scaling group spanning Availability zone 1 and Availability zone 2 in an AWS Region, with a failure marked X)
  39. Pattern 5: Decoupling with async pattern (diagram). Synchronous: Process A is Waiting while Process B is Working, then gets the result. Asynchronous: Process A Continues while Process B is Working, then gets or fetches the result.
  40. (diagram: API Instances, Queue, Worker Instances, Cache node) Messages shown: API: {DO foo}; PUT JOB: {JobID: 0001, Task: DO foo}; API: {JobID: 0001}; GET JOB: {JobID: 0001, Task: DO foo}; {JobID: 0001, Result: bar}
  41. (diagram: User, API Instances, Queue, Worker Instances, Cache node) Labels: Push Notification; Fetch results
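The job flow on the two slides above (submit a task, get a job ID back at once, fetch the result later) can be sketched in-process. This is only an illustration under stated assumptions: a `queue.Queue` stands in for the Queue, a dictionary for the Cache node, and a thread for a Worker Instance; `submit`, `fetch` and `worker` are hypothetical names.

```python
import queue
import threading
import uuid

jobs = queue.Queue()             # stands in for the Queue on the slide
results = {}                     # stands in for the Cache node
results_lock = threading.Lock()


def submit(task):
    """API side: enqueue the task and return a job ID without waiting."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, task))
    return job_id


def fetch(job_id):
    """API side: return (True, result) if the job finished, else (False, None)."""
    with results_lock:
        if job_id in results:
            return True, results[job_id]
    return False, None


def worker():
    """Worker side: run queued tasks and store each outcome in the cache."""
    while True:
        job_id, task = jobs.get()
        try:
            outcome = task()
        except Exception as exc:  # keep the worker alive; store the error instead
            outcome = exc
        with results_lock:
            results[job_id] = outcome
        jobs.task_done()


# One daemon thread plays the role of a Worker Instance.
threading.Thread(target=worker, daemon=True).start()

if __name__ == "__main__":
    job = submit(lambda: "bar")
    jobs.join()                  # a real client would poll or be notified instead
    print(fetch(job))
```

The caller is never blocked on the worker: if workers slow down or fail, requests accumulate in the queue rather than tying up API instances.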
  42. Degrade & prioritize traffic with queues (diagram: API Instances feed a High Priority Queue and a Low Priority Queue, both consumed by Worker Instances)
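One way to read the two-queue slide is sketched below: workers always drain the high-priority queue first, and low-priority work is shed once its backlog grows too large. The function names and the backlog limit are assumptions made for illustration.

```python
import queue

high_priority = queue.Queue()
low_priority = queue.Queue()


def enqueue(job, critical, max_low_backlog=1000):
    """Accept a job; shed non-critical work when the low-priority backlog is full.

    Returns True if the job was queued, False if it was dropped.
    max_low_backlog is a hypothetical tuning knob.
    """
    if critical:
        high_priority.put(job)
        return True
    if low_priority.qsize() >= max_low_backlog:
        return False  # degrade: drop low-priority traffic under load
    low_priority.put(job)
    return True


def next_job():
    """Worker side: take high-priority work first, then low-priority, else None."""
    for q in (high_priority, low_priority):
        try:
            return q.get_nowait()
        except queue.Empty:
            continue
    return None
```

Strict priority can starve the low-priority queue under sustained load; that is the intended degradation here, but some systems reserve a small share of workers for low-priority jobs.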
  43. Let’s talk about timeouts & retries!
  44. What happens if the DB “slows down”? (diagram: Users → App → Conn Pool → DB, with four INSERTs in flight) Timeout client side = ? Timeout backend side = ?
  45. (diagram: User 1 → App → Conn Pool → DB) Timeout client side = 10s; Timeout backend side = default = Infinite. INSERT, Retry INSERT, Retry INSERT, Retry. ERROR: Failed to get connection from pool
  46. https://pypi.org/project/timeout-decorator/

    import requests
    import timeout_decorator

    # Abort the call after 5 seconds by raising StopIteration.
    @timeout_decorator.timeout(5, timeout_exception=StopIteration)
    def timed_get(url):
        return requests.get(url)
  47. Pattern 6: Set the timeouts!
  48. How else could we have prevented the error? (diagram: User 1 → Conn Pool → DB) INSERT, Retry INSERT, Retry INSERT, Retry. ERROR: Failed to get connection from pool
  49. Pattern 7: Backing off between retries (diagram: User 1 → Conn Pool → DB) Timeout client side = 10s; Timeout backend side = 10s. INSERT, then Wait 2s before Retry, Wait 4s before Retry, Wait 8s before Retry, Wait 16s before Retry. Backoff; Releasing connections
  50. Simple Exponential Backoff is not enough: Add Jitter (plots: No jitter vs. With jitter) https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
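The "Full Jitter" variant described in the linked AWS Architecture Blog post picks a random sleep between zero and the capped exponential delay. A minimal sketch follows; the `base` and `cap` defaults are assumptions, not values from the slides.

```python
import random


def full_jitter_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to sleep before retry number `attempt` (0-based).

    Full jitter: a uniform random value between 0 and
    min(cap, base * 2 ** attempt). base and cap are illustrative defaults.
    """
    return rng.uniform(0, min(cap, base * 2 ** attempt))


if __name__ == "__main__":
    for attempt in range(6):
        print(f"attempt {attempt}: sleep {full_jitter_delay(attempt):.2f}s")
```

Randomizing over the whole interval spreads retries from many clients across time, so they do not hit the recovering backend in synchronized waves.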
  51. Example: add jitter 0-1000ms

    import random
    import time

    import requests

    def get_item(self, url, n=1):
        MAX_TRIES = 12
        try:
            res = requests.get(url)
        except requests.exceptions.RequestException:
            if n > MAX_TRIES:
                return None
            n += 1
            # Exponential backoff (2 ** n seconds) plus 0-1000 ms of jitter.
            time.sleep((2 ** n) + (random.randint(0, 1000) / 1000.0))
            return self.get_item(url, n)
        else:
            return res
  52. https://pypi.org/project/backoff/

    import backoff

    # Retry with exponential backoff and full jitter for up to 60 seconds.
    # Exception is a placeholder; name the error your queue client raises.
    @backoff.on_exception(backoff.expo, Exception, max_time=60, jitter=backoff.full_jitter)
    def poll_for_message(queue):
        return queue.get()

    As of version 1.2, the default jitter function backoff.full_jitter implements the ‘Full Jitter’ algorithm as defined in the AWS Architecture Blog’s Exponential Backoff And Jitter post.
  53. Idempotent operation: No additional effect if it is called more than once with the same input parameters.
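Retries are only safe when the retried operation is idempotent. One common way to get there, sketched below, is to have the caller send an idempotency key and have the service replay the stored response for a key it has already seen. `PaymentService` and its fields are hypothetical and exist only for this illustration.

```python
class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self):
        self._responses = {}  # idempotency key -> response already returned
        self.charges = []     # side effects actually performed

    def charge(self, idempotency_key, account, amount):
        # A repeated key means a retry: replay the stored response
        # instead of charging the account a second time.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges.append((account, amount))
        response = {"status": "charged", "account": account, "amount": amount}
        self._responses[idempotency_key] = response
        return response


if __name__ == "__main__":
    service = PaymentService()
    first = service.charge("req-1", "alice", 10)
    retry = service.charge("req-1", "alice", 10)  # e.g. after a client timeout
    print(first == retry, len(service.charges))
```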
  54. Probing for health (diagram: Service A and Service B in Auto Scaling groups across Availability zone 1 and Availability zone 2 of an AWS Region, with a database, Email and a Cluster as dependencies)
  55. Shallow health check (diagram: Instance with Cache node, Email, database and Cluster behind it). Are you healthy? yes
  56. Shallow health check (diagram: Instance with Cache node, Email, database and Cluster behind it). Are you healthy? yes
  57. Deep health check (diagram: Instance with Cache node, Email, database and Cluster behind it). Are you healthy? yes. Dependencies: Are you healthy? yes, yes, yes, yes
  58. Deep health check (diagram: Instance with Cache node, Email, database and Cluster behind it). Are you healthy? no. Dependencies: Are you healthy? no, yes, yes, yes
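The difference between the shallow and deep checks on the slides above can be sketched as two functions. The probes are passed in as plain callables; their names and the return shape are assumptions made for illustration.

```python
def shallow_health_check():
    """Shallow check: only confirms that the instance itself can answer."""
    return True


def deep_health_check(dependency_probes):
    """Deep check: healthy only if every dependency probe succeeds.

    dependency_probes maps a dependency name to a zero-argument callable
    that returns truthy when healthy; a probe that raises counts as unhealthy.
    Returns (overall_healthy, per_dependency_status).
    """
    statuses = {}
    for name, probe in dependency_probes.items():
        try:
            statuses[name] = bool(probe())
        except Exception:
            statuses[name] = False
    return all(statuses.values()), statuses


if __name__ == "__main__":
    def broken_database():
        raise ConnectionError("database unreachable")

    probes = {"cache": lambda: True, "email": lambda: True, "database": broken_database}
    print(shallow_health_check())      # the instance still answers
    print(deep_health_check(probes))   # but it is unhealthy end to end
```

Deep checks catch more, but a failing shared dependency makes every instance report unhealthy at once, so it is worth deciding which checks are allowed to take instances out of rotation.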
  59. Service Degradation & Fallbacks
  60. Pattern 8: Circuit Breaker (diagram: Producer → Circuit Breaker → Consumer; Connection Monitoring, Timeouts, Breaking Circuit)
    • Wrap a protected function call in a circuit breaker object, which monitors for failures.
    • If failures reach a certain threshold, the circuit breaker trips.
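The two bullets above translate into a small wrapper object. The following is a minimal, single-threaded sketch rather than a production implementation; the threshold, the reset timeout and the half-open trial call are assumptions the slide does not spell out.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open
        self._clock = clock
        self._failures = 0
        self._opened_at = None                      # None means closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()     # trip, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers typically catch `CircuitOpenError` and serve a fallback (cached data, a default response), which ties this pattern to the service degradation slide before it.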
  61. Let’s talk about databases!
  62. Database Federation (diagram: Instances in front of a Users DB and a Products DB, each a DB Instance with a DB instance read replica)
  63. Database Sharding (diagram: Instances in front of shards A, B and C, each a DB Instance with a DB instance read replica)

    User      ShardID
    002345    A
    002346    B
    002347    C
    002348    B
    002349    A
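The User-to-ShardID table on the sharding slide acts as a lookup directory. A minimal sketch of routing a query through it follows; the endpoint hostnames are made up for illustration and the user IDs are the ones shown on the slide.

```python
# Directory copied from the slide: user ID -> shard ID.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Hypothetical endpoints, one primary DB instance per shard.
SHARD_ENDPOINTS = {
    "A": "db-a.example.internal",
    "B": "db-b.example.internal",
    "C": "db-c.example.internal",
}


def endpoint_for_user(user_id):
    """Return the database endpoint that holds this user's data."""
    shard_id = SHARD_DIRECTORY[user_id]  # raises KeyError for unknown users
    return SHARD_ENDPOINTS[shard_id]


if __name__ == "__main__":
    print(endpoint_for_user("002347"))
```

With a directory like this, the loss of one shard affects only the users mapped to it, which is the same blast-radius argument as for cells.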
  64. Pattern 9: Read / Write separation (diagram: Instances, one DB Instance and three DB instance read replicas). Supports degradation through Read-Only mode
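A sketch of what read/write separation with a read-only fallback might look like in application code. `ReadWriteRouter` is a hypothetical helper; connections are represented by plain strings, and at least one replica is assumed.

```python
import itertools


class ReadOnlyModeError(Exception):
    """Raised when a write is attempted while the service is in read-only mode."""


class ReadWriteRouter:
    """Hypothetical router: writes go to the primary, reads rotate over replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # assumes at least one replica
        self.read_only = False                      # flip to True if the primary is lost

    def for_read(self):
        """Pick the next read replica in round-robin order."""
        return next(self._replicas)

    def for_write(self):
        """Return the primary, or refuse writes while degraded to read-only."""
        if self.read_only:
            raise ReadOnlyModeError("writes are disabled; serving reads only")
        return self.primary


if __name__ == "__main__":
    router = ReadWriteRouter("primary", ["replica-1", "replica-2", "replica-3"])
    print(router.for_read(), router.for_write())
    router.read_only = True       # degrade: the primary is unavailable
    print(router.for_read())      # reads keep working
```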
  65. Pattern 9+: Transient state does not belong in the database.
  66. GameDay at Amazon: Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  67. Chaos engineering https://github.com/Netflix/SimianArmy
  68. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org
  69. Pattern 10: Failure injection
    • Start small & build confidence
    • Application level
    • Host failure
    • Resource attacks (CPU, memory, …)
    • Network attacks (dependencies, latency, …)
    • Region attacks
    • “Paul” attack
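Application-level failure injection, the smallest step in the list above, can be sketched as a decorator that adds latency or raises an error for a fraction of calls. The decorator name and its parameters are hypothetical; they are not the API of any chaos engineering tool.

```python
import functools
import random
import time


class InjectedFailure(Exception):
    """Error raised on purpose by the failure-injection wrapper."""


def inject_failure(error_rate=0.0, delay_seconds=0.0, rng=random, sleep=time.sleep):
    """Wrap a function so that calls are delayed and/or fail with a probability.

    error_rate is the probability (0.0 to 1.0) of raising InjectedFailure;
    delay_seconds is extra latency added before every call.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if delay_seconds:
                sleep(delay_seconds)          # simulate a slow dependency
            if rng.random() < error_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


if __name__ == "__main__":
    @inject_failure(error_rate=0.2, delay_seconds=0.05)
    def get_recommendations(user_id):
        return ["item-1", "item-2"]

    for _ in range(5):
        try:
            print(get_recommendations("u1"))
        except InjectedFailure as exc:
            print("fallback used:", exc)
```

Starting with a low error rate in a test environment and widening the scope once timeouts, retries and fallbacks behave as expected matches the "start small & build confidence" advice.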
  70. Plan for the worst, prepare for the unexpected.
  71. Thank you! @adhorn https://medium.com/@adhorn