Patterns for Resilient Architecture

As presented at the AWS Loft in Stockholm - October 22nd, 2018.

We have traditionally built robust software systems by trying to avoid mistakes, by dodging failures when they occur in production, or by testing parts of the system in isolation from one another. Modern methods and techniques take a very different approach based on resiliency, which promotes embracing failure instead of trying to avoid it. Resilient architectures enhance observability, leverage well-known patterns such as graceful degradation, timeouts and circuit breakers, and embrace chaos engineering, a discipline that promotes breaking things on purpose in order to learn how to build more resilient systems. In this session, I will review the most useful patterns for building resilient software systems, introduce the chaos engineering methodology, and show the audience how they can benefit from breaking things on purpose.

Adrian Hornsby

October 22, 2018

Transcript

  1. © 2018, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Patterns for Resilient Architecture Adrian Hornsby @adhorn
  2. https://xkcd.com/1428/ About me … @adhorn
  3. Complex systems: Amazon, Twitter, Netflix
  4. Partial failure mode
  5. Resiliency: Ability for a system to handle and eventually recover from unexpected conditions
  6. People, Application, Network & Data, Infrastructure
  7. Patterns for Resilient Architecture: Infrastructure
  8. About availability

    Availability          Downtime per year
    99% (2-nines)         3 days 15 hours
    99.9% (3-nines)       8 hours 45 minutes
    99.99% (4-nines)      52 minutes
    99.999% (5-nines)     5 minutes
    99.9999% (6-nines)    31 seconds
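
The downtime figures on slide 8 can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the original deck; the function names are illustrative and it assumes a 365-day year, so results can differ from the slide by rounding.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in seconds, for an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1.0 - availability_percent / 100.0) * SECONDS_PER_YEAR


def format_duration(seconds: float) -> str:
    """Format a number of seconds as days, hours, minutes and seconds (truncated)."""
    days, rest = divmod(int(seconds), 86_400)
    hours, rest = divmod(rest, 3_600)
    minutes, secs = divmod(rest, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
        print(f"{pct}%: {format_duration(downtime_seconds_per_year(pct))}")
```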
  9. System availability

    Availability = Normal Operation Time / Total Time = MTBF / (MTBF + MTTR)

    MTBF: Mean Time Between Failure
    MTTR: Mean Time To Repair
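
As a small illustration of the formula on slide 9, the sketch below computes availability from MTBF and MTTR. The function name and the choice of hours as the unit are assumptions made for this example.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

With the same MTBF, halving the time to repair raises availability, which is one reason fast recovery matters as much as avoiding failures.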
  10. Availability in series [Diagram: Part X followed by Part Y]

    A = Ax × Ay

  11. Availability in series

    Component             Availability          Downtime
    X                     99% (2-nines)         3 days 15 hours
    Y                     99.99% (4-nines)      52 minutes
    X and Y Combined      98.99%                3 days 16 hours 33 minutes

  12. Availability in parallel [Diagram: Part X in parallel with Part X]

    A = 1 – (1 – Ax)^2

  13. Availability in parallel

    Component             Availability          Downtime
    X                     99% (2-nines)         3 days 15 hours
    Two X in parallel     99.99% (4-nines)      52 minutes
    Three X in parallel   99.9999% (6-nines)    31 seconds
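
The series and parallel formulas from slides 10 to 13 can be checked with a short sketch. The function names are illustrative, and the parallel case is generalized from two copies to any number of identical, independent copies, which is an assumption consistent with the "Three X in parallel" row.

```python
from math import prod


def series_availability(*components: float) -> float:
    """All parts must work: multiply the individual availabilities (A = Ax * Ay)."""
    return prod(components)


def parallel_availability(component: float, copies: int) -> float:
    """At least one of N identical, independent copies must work: A = 1 - (1 - Ax)^N."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1.0 - (1.0 - component) ** copies
```

For example, `series_availability(0.99, 0.9999)` is about 0.9899, while `parallel_availability(0.99, 2)` is about 0.9999. Downtime derived from these values may differ from the slide by a few minutes depending on rounding.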
  14. Component redundancy increases availability significantly!
  15. Multi-AZ architecture [Diagram: Region; Availability zones a, b and c; Application]
  16. AWS Region and availability zones [Diagram: Region; Availability zones a, b and c; nine data centers]
  17. Redundancy across multiple regions
  18. 18 Geographic Regions, 55 Availability Zones (AZs); 4 regions and 12 more Availability Zones announced
  19. Improve latency for end-users [Diagram: ~300ms, ~140ms]
  20. Combined with disaster recovery [Diagram: Services 1 to 4 in US-WEST-2; Services 1 to 4 in US-EAST-1]
  21. Routing policies with Route 53 [Diagram: two Regions, each with an Application]
  22. Latency based routing [Diagram: two Regions, each with an Application]
  23. Geo-based routing [Diagram: Region us-east-1 and Region us-west-2, each with an Application]
  24. Weighted round robin routing [Diagram: Region us-east-1 and Region us-west-2, each with an Application]
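
To make the idea behind weighted routing concrete, here is a minimal sketch of weight-proportional selection between regions. It only illustrates the concept and does not use or describe the Route 53 API; the function name and the weights are assumptions.

```python
import random
from typing import Dict, Optional


def pick_region(weights: Dict[str, int], rng: Optional[random.Random] = None) -> str:
    """Pick a region with probability proportional to its weight."""
    chooser = rng or random
    candidates = {region: w for region, w in weights.items() if w > 0}
    if not candidates:
        raise ValueError("at least one region needs a positive weight")
    regions = list(candidates)
    region_weights = [candidates[r] for r in regions]
    return chooser.choices(regions, weights=region_weights, k=1)[0]
```

Setting a region's weight to 0 takes it out of rotation, which is one way to shift traffic gradually between regions.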
  25. DNS failover [Diagram: Region us-east-1 and Region us-west-2, each with an Application]
  26. Learn more about multi-region architecture
  27. Auto-Scaling
    • Compute efficiency
    • Node failure
    • Traffic spikes
    • Performance bugs
  28. Auto-scaling [Diagram: AWS Region; Availability zones 1 and 2; two Auto Scaling groups; Service A and Service B]
  29. Immutable Infrastructure
    • No updates on live systems
    • Always start from a new resource being provisioned
    • Deploy the new software
    • Test in different environments (dev, staging)
    • Deploy to prod (inactive)
    • Change references (DNS or Load Balancer)
    • Keep old version around (inactive)
    • Fast rollback if things go wrong
  30. Patterns for Resilient Architectures: Network & Data
  31. CAP Theorem (Distributed System)
    • Consistency: Data is consistent. All nodes see the same state.
    • Availability: Every request is non-failing.
    • Partition Tolerance: Service still responds as expected if some nodes crash.
    In the presence of a network partition, you must choose between consistency and availability!
  32. Embrace eventual consistency (Distributed System)
    “… if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.” https://en.wikipedia.org/wiki/Eventual_consistency
    Availability: Every request is non-failing.
    An eventually consistent system can return any value before it converges!!
  33. Synchronous vs. Asynchronous [Diagram: Process A and Process B; labels: Waiting, Working, Continues, Get result, get or fetch result]
  34. Decoupling with async pattern [Diagram: A, B, Queue, Pub-Sub, Listener]
  35. [Diagram: API Instances, Queue, Worker Instances, Cache node]
    API: {DO foo}
    PUT JOB: {JobID: 0001, Task: DO foo}
    API: {JobID: 0001}
    GET JOB: {JobID: 0001, Task: DO foo}
    {JobID: 0001, Result: bar}
  36. [Diagram: User, API Instances, Queue, Worker Instances, Cache node; labels: Push Notification, Fetch results]
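
The job flow on slides 35 and 36 (the API puts a job on a queue, a worker processes it, and the result is fetched later from a cache) can be sketched in-process as follows. This is an illustration under stated assumptions: an in-memory `queue.Queue` and a dict stand in for the real queue and cache node, and all function names are hypothetical.

```python
import queue
import uuid
from typing import Optional

job_queue = queue.Queue()  # stand-in for the Queue in the diagram
result_cache = {}          # stand-in for the Cache node


def submit_job(task: str) -> str:
    """API side: enqueue the task and return a job ID immediately."""
    job_id = uuid.uuid4().hex
    job_queue.put({"JobID": job_id, "Task": task})
    return job_id


def process_next_job() -> bool:
    """Worker side: take one job, do the work, store the result in the cache."""
    try:
        job = job_queue.get_nowait()
    except queue.Empty:
        return False
    result_cache[job["JobID"]] = f"result of {job['Task']}"  # placeholder work
    job_queue.task_done()
    return True


def fetch_result(job_id: str) -> Optional[str]:
    """API side: return the result if it is ready, otherwise None."""
    return result_cache.get(job_id)
```

The caller never waits on the worker: `submit_job` returns at once, and `fetch_result` returns `None` until a worker has stored the result.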
  37. Degrade & prioritize traffic with queues [Diagram: API Instances, High Priority Queue, Low Priority Queue, Worker Instances]
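
One possible reading of slide 37 is sketched below: workers always drain the high priority queue first, and low priority work is shed when the system is overloaded. The queue objects, function names, and the `overloaded` flag are assumptions made for this example.

```python
import queue
from typing import Any, Optional

high_priority = queue.Queue()
low_priority = queue.Queue()


def enqueue(job: Any, high: bool, overloaded: bool = False) -> bool:
    """Accept a job; when overloaded, shed low priority work instead of queueing it."""
    if high:
        high_priority.put(job)
        return True
    if overloaded:
        return False  # degrade: reject low priority work under load
    low_priority.put(job)
    return True


def next_job() -> Optional[Any]:
    """Return the next job, draining the high priority queue first."""
    for q in (high_priority, low_priority):
        try:
            return q.get_nowait()
        except queue.Empty:
            continue
    return None
```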
  38. Read / Write Sharding [Diagram: three Instances; one DB Instance; three DB instance read replicas]
  39. Database Federation [Diagram: three Instances; Users DB and Products DB, each a DB Instance with a DB instance read replica]
  40. Database Sharding [Diagram: three Instances; shards A, B and C, each a DB Instance with a DB instance read replica]

    User      ShardID
    002345    A
    002346    B
    002347    C
    002348    B
    002349    A
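
The lookup table on slide 40 maps each user to a shard. A minimal sketch of such a directory-based lookup follows; the dictionary mirrors the slide's table, while the host names and function names are hypothetical placeholders.

```python
# User ID -> shard ID, copied from the table on slide 40.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Shard ID -> database endpoint (hypothetical host names, for illustration only).
SHARD_ENDPOINTS = {
    "A": "shard-a.db.example.internal",
    "B": "shard-b.db.example.internal",
    "C": "shard-c.db.example.internal",
}


def shard_for_user(user_id: str) -> str:
    """Return the shard ID that holds the given user's data."""
    try:
        return SHARD_DIRECTORY[user_id]
    except KeyError:
        raise LookupError(f"no shard assigned for user {user_id}") from None


def endpoint_for_user(user_id: str) -> str:
    """Return the database endpoint the application should connect to."""
    return SHARD_ENDPOINTS[shard_for_user(user_id)]
```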
  41. Transient state does not belong in the database.
  42. Patterns for Resilient Architecture: Application
  43. Stateless Services [Diagram: AWS Region; Availability zones 1 and 2; two Auto Scaling groups; Service A and Service B]
  44. Cascading Failures
  45. Let’s talk about timeouts & retries!
  46. What happens if the DB “slows down”? [Diagram: Users, App, Conn Pool, DB; four INSERTs; Timeout client side: ?; Timeout backend side: ?]
  47. [Diagram: User 1, App, Conn Pool, DB; Timeout client side = 10s; Timeout backend side = Not implemented; INSERT, Retry INSERT, Retry INSERT, Retry; ERROR: Failed to get connection from pool]
  48. https://pypi.org/project/timeout-decorator/

        import requests
        import timeout_decorator

        # Abort the call if it takes longer than 5 seconds.
        @timeout_decorator.timeout(5, timeout_exception=StopIteration)
        def timed_get(url):
            return requests.get(url)
  49. How else could we have prevented the error? [Diagram: User 1, Conn Pool, DB; INSERT, Retry INSERT, Retry INSERT, Retry; ERROR: Failed to get connection from pool]
  50. Backoff: Backing off between retries, Releasing connections [Diagram: User 1, Conn Pool, DB; Timeout client side = 10s; Timeout backend side = 10s; INSERT; Wait 2s before Retry; Wait 4s before Retry; Wait 8s before Retry; Wait 16s before Retry]
  51. Simple Exponential Backoff is not enough: Add Jitter [Charts: No jitter, With jitter] https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
  52. Example: add jitter 0-1000ms

        import random
        import time

        import requests

        MAX_TRIES = 12

        # Method of a client class (hence the self argument).
        def get_item(self, url, n=1):
            try:
                res = requests.get(url)
            except requests.RequestException:
                if n > MAX_TRIES:
                    return None  # give up once the retry budget is spent
                n += 1
                # Exponential backoff (2 ** n seconds) plus 0-1000 ms of random jitter.
                time.sleep((2 ** n) + (random.randint(0, 1000) / 1000.0))
                return self.get_item(url, n)
            else:
                return res
  53. https://pypi.org/project/backoff/

        import backoff

        # Poll again, with exponential backoff and full jitter, for up to
        # 60 seconds while the queue returns a falsey value.
        @backoff.on_predicate(backoff.expo, jitter=backoff.full_jitter, max_time=60)
        def poll_for_message(queue):
            return queue.get()

    As of version 1.2, the default jitter function backoff.full_jitter implements the ‘Full Jitter’ algorithm as defined in the AWS Architecture Blog’s Exponential Backoff And Jitter post.
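
For readers who prefer to see the ‘Full Jitter’ idea without a library, here is a minimal standard-library sketch. It follows the formula from the AWS Architecture Blog post, sleep = random_between(0, min(cap, base * 2 ** attempt)); the function names, default values, and the injectable `sleep` parameter are assumptions made for this example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Return a random delay between 0 and min(cap, base * 2 ** attempt) seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def retry_with_full_jitter(
    operation: Callable[[], T],
    max_tries: int = 5,
    base: float = 1.0,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_tries is reached, backing off in between."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(max_tries):
        try:
            return operation()
        except Exception:
            if attempt == max_tries - 1:
                raise  # retry budget spent: surface the last error
            sleep(full_jitter_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

Capping the delay keeps the wait bounded, and drawing it at random spreads retries from many clients over time instead of synchronizing them.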
  54. Idempotent operation: No additional effect if it is called more than once with the same input parameters.
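
Retries are only safe when the retried operation is idempotent. One common way to get there, sketched below, is to have the caller send a request ID and to remember which IDs were already applied. The `Account` class and its method names are hypothetical, and the in-memory dict stands in for durable storage.

```python
class Account:
    """Hypothetical account whose deposits are idempotent per request ID."""

    def __init__(self) -> None:
        self.balance = 0
        self._applied = {}  # request ID -> balance returned for that request

    def deposit(self, request_id: str, amount: int) -> int:
        """Apply the deposit once; a repeated request ID has no additional effect."""
        if request_id in self._applied:
            return self._applied[request_id]
        self.balance += amount
        self._applied[request_id] = self.balance
        return self.balance
```

Calling `deposit("req-1", 10)` twice, for example after a timeout and a retry, leaves the balance at 10 rather than 20.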
  55. Probing for health [Diagram: AWS Region; Availability zones 1 and 2; two Auto Scaling groups; Service A and Service B; database, Email, Cluster]
  56. Shallow health check [Diagram: Instance; Cache node, Email, database, Cluster] Are you healthy? yes
  57. Shallow health check [Diagram: Instance; Cache node, Email, database, Cluster] Are you healthy? yes
  58. Deep health check [Diagram: Instance; Cache node, Email, database, Cluster] Are you healthy? yes. Dependencies asked “Are you healthy?”: yes, yes, yes, yes
  59. Deep health check [Diagram: Instance; Cache node, Email, database, Cluster] Are you healthy? no. Dependencies asked “Are you healthy?”: no, yes, yes, yes
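
The difference between the shallow and deep health checks on slides 56 to 59 can be sketched as follows: the shallow check only confirms the instance answers, while the deep check also probes each dependency. The function names and the shape of the probe callables are assumptions made for this example.

```python
from typing import Callable, Dict


def shallow_health_check() -> bool:
    """The instance itself is up and able to answer."""
    return True


def deep_health_check(probes: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Ask every dependency "are you healthy?" and record each answer."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False  # a probe that raises counts as unhealthy
    return results


def is_healthy(results: Dict[str, bool]) -> bool:
    """The instance reports healthy only if every dependency does."""
    return all(results.values())
```

A deep check gives a more truthful answer but couples the instance's health to its dependencies, so one failing dependency can mark every instance unhealthy at once.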
  60. Service Degradation & Fallbacks
  61. Circuit Breaker
    • Wrap a protected function call in a circuit breaker object, which monitors for failures.
    • If failures reach a certain threshold, the circuit breaker trips.
    [Diagram: Producer, Circuit Breaker, Consumer; labels: Connection, Monitoring, Timeouts, Breaking Circuit]
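
The two bullets on slide 61 can be turned into a small sketch: a wrapper object counts consecutive failures, trips once a threshold is reached, and fails fast until a reset timeout has passed. This is an illustration, not a production implementation; the class names, default values, and the injectable `clock` are assumptions.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open, failing fast")
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip (or re-trip) the breaker
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller can catch `CircuitOpenError` and serve a fallback, such as a cached or default response, which is one way to degrade gracefully.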
  62. Non-blocking UI https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158
  63. Exception Handling
  64. Patterns for Resilient Architecture: People
  65. Conway’s Law: Siloed Teams, Siloed Applications [Diagram: User, UI Team, Application Team, DBA Team]
    “Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html
  66. Conway’s Law: Cross-Functional Teams, Services
    “Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.” http://www.melconway.com/Home/Conways_Law.html
  67. Building confidence through testing
    Unit testing of components:
    • Tested in isolation to ensure function meets expectations.
    Functional testing of integrations:
    • Each execution path tested to assure expected results.
    Is it enough???
  68. GameDay at Amazon: Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  69. Chaos engineering https://github.com/Netflix/SimianArmy
  70. What “really” is Chaos Engineering?
  71. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org
  72. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.
  73. Failure injection
    • Start small & build confidence
    • Application level
    • Host failure
    • Resource attacks (CPU, memory, …)
    • Network attacks (dependencies, latency, …)
    • Region attacks
    • “Paul” attack
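
Application-level failure injection, the first rung after "start small", can be as simple as a decorator that adds latency or raises an error for a fraction of calls. The sketch below is one hypothetical way to do it; the decorator name, its parameters, and the `enabled` kill switch are assumptions, not an existing tool's API.

```python
import functools
import random
import time
from typing import Any, Callable


class InjectedFailure(Exception):
    """Error raised on purpose by the failure injection wrapper."""


def inject_failure(latency_s: float = 0.0, error_rate: float = 0.0,
                   enabled: Callable[[], bool] = lambda: True,
                   sleep: Callable[[float], None] = time.sleep) -> Callable:
    """Wrap a function so calls get extra latency and fail with probability error_rate."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if enabled():  # kill switch: the emergency STOP for the experiment
                if latency_s > 0:
                    sleep(latency_s)
                if random.random() < error_rate:
                    raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

For example, decorating a dependency call with `@inject_failure(latency_s=0.3)` is one way to explore a question like "what if latency increases by 300ms?" in a controlled scope.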
  74. “Chaos doesn’t cause problems. It reveals them.” Nora Jones, Senior Chaos Engineer, Netflix
  75. https://bit.ly/2uKOJMQ
  76. Steady State, Hypothesis, Design & Run Experiment, Verify & Learn, Fix
  77. What is steady state?
    • “normal” behavior of your system
    https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
  78. What is steady state?
    • “normal” behavior of your system
    • Business Metric
    https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
  79. Business metrics at work
    Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
    Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
    Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).
  80. What if…?
    “What if this load balancer breaks?”
    “What if Redis becomes slow?”
    “What if a host on Cassandra goes away?”
    “What if latency increases by 300ms?”
    “What if the database stops?”
    Make it everyone’s problem!
  81. Disclaimer! Don’t make a hypothesis that you know will break you!
  82. Design & Run Experiment
  83. Designing experiment
    • Pick hypothesis
    • Scope the experiment
    • Identify metrics
    • Notify the organization
  84. Rules of thumb
    • Start very small
    • As close as possible to production
    • Minimize the blast radius.
    • Have an emergency STOP!
  85. Running Chaos Experiment [Diagram: Users; Normal Version: 99% Users; Canary deployment: 1% Users] Start with ..
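
Routing about 1% of users to a canary, as on slide 85, needs a stable way to decide who is in the experiment. One possible approach, sketched here, hashes the user ID into a bucket so the same user always gets the same answer. The function name and the choice of SHA-256 with 10,000 buckets are assumptions for illustration.

```python
import hashlib


def in_canary(user_id: str, canary_percent: float = 1.0) -> bool:
    """Deterministically place roughly canary_percent of users in the canary group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < canary_percent * 100
```

Setting `canary_percent` to 0 acts as an emergency stop, since no user then lands in the canary group, which keeps the blast radius under control.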
  86. Quantifying the result of the experiment
    • Time to detect?
    • Time for notification? And escalation?
    • Time to public notification?
    • Time for graceful degradation to kick-in?
    • Time for self healing to happen?
    • Time to recovery – partial and full?
    • Time to all-clear and stable?
  87. DON’T blame that one person …
  88. PostMortems – COE (Correction of Errors). The 5 WHYs: Outage, Because of …, Because of …, Because of …, Because of …, NOT ENOUGH
  89. More questions to ask
    • Can you clarify if there were any preceding events?
    • Why would they believe acting in this way was the best course of action to deliver the desired outcome?
    • Is there another failure mode that could present here?
    • What decisions or events prior to this made this work before?
    • Why stop there – are there places to dig deeper that could shine a light more on this?
    • Did others step in to help, to advise, or to intercede?
  90. Rules to remember!
    1. Failure requires multiple faults
    2. There is no isolated ‘cause’ of an accident.
    3. There are multiple contributors to accidents.
  91. Never let a good crisis go to waste
  92. Big challenges to chaos engineering: Mostly Cultural
    • no time or flexibility to simulate disasters.
    • teams already spending all of their time fixing things.
    • can be very political.
    • might force deep conversations.
    • deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.
  93. Changing culture takes time! Be patient…
  94. Thank you! Adrian Hornsby @adhorn