Resiliency and Availability Design Patterns for the Cloud

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Resiliency and Availability Design Patterns for the Cloud
BAR4 KYIV 11.06.2019
{ "name": "Sébastien Stormacq", "role": "Technical Evangelist", "company": "Amazon Web Services", "twitter": "@sebsto", "github": "sebsto" }

Can you guess what will happen?

“Failures are a given and everything will eventually fail over time.” Werner Vogels, CTO, Amazon.com

Distributed systems are hard

Complex systems: Amazon, Twitter, Netflix

Resiliency: the ability of a system to handle and eventually recover from unexpected conditions

Partial failure mode

How do we build resilient software systems?

People, Application, Network & Data, Infrastructure

Let’s talk about availability

Availability in parallel: $A = 1 - (1 - A_x)^2$ for two copies of part X in parallel

Availability in parallel

Component              Availability          Downtime per year
X                      99% (2 nines)         3 days 15 hours
Two X in parallel      99.99% (4 nines)      52 minutes
Three X in parallel    99.9999% (6 nines)    31 seconds
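
The figures in the table can be reproduced with a few lines of Python. This is a minimal sketch that generalizes the slide's formula from 2 to n copies and assumes the copies fail independently; the function names are illustrative, not from the talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def parallel_availability(a_x, n):
    """A = 1 - (1 - A_x)^n: the system is down only if all n copies are down."""
    return 1 - (1 - a_x) ** n


def downtime_seconds_per_year(availability):
    """Expected downtime per (365-day) year for a given availability."""
    return (1 - availability) * SECONDS_PER_YEAR


for copies in (1, 2, 3):
    a = parallel_availability(0.99, copies)
    print(f"{copies} x X in parallel: {a:.4%} available, "
          f"{downtime_seconds_per_year(a):,.0f} seconds of downtime per year")
```

If failures are correlated (for example, all copies share one power feed), the real availability is lower than this formula suggests.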

Component redundancy increases availability significantly!

Fully-scaled Availability Zone

Highly redundant regional network

AWS Region and availability zones
[Diagram: one Region containing Availability Zones a, b, and c, each with several data centers.]
• 1 or more data centers per AZ
• 2 or more AZs per Region (new Regions: minimum 3)

Let’s talk about Multi-AZ

Multi-AZ architecture
[Diagram: Elastic Load Balancing (ELB) in front of instances in Availability Zones a, b, and c of one Region, with a DB instance and a standby DB instance.]

Multi-AZ architecture
[Diagram: the same architecture with a failure (X) marked.]

Multi-AZ architecture
[Diagram: after the failure (X), the standby DB instance is the new master.]

Multi-AZ architecture
• Enables fault-tolerant applications
• AWS regional services are designed to withstand AZ failures
• Leveraged by AWS regional services such as Amazon S3, Amazon DynamoDB, Amazon Aurora, Amazon ELB, etc.

Let’s talk about auto scaling

Auto Scaling: Fixed / Variable

Auto Scaling for self-healing
[Diagram: Elastic Load Balancing (ELB) in front of an Auto Scaling group that spans Availability Zones 1 and 2 in an AWS Region, with a failure (X) marked.]

Let’s talk about decoupling and async

Decoupling with async pattern
• Synchronous: Process A calls Process B and waits while B is working, then gets the result.
• Asynchronous: Process A calls Process B and continues while B is working, then gets or fetches the result.

[Diagram: API instances, a queue/stream, worker instances, and a cache node.]
• Client to API: {DO foo}
• API instance to queue: PUT JOB: {JobID: 0001, Task: DO foo}
• API to client: {JobID: 0001}
• Worker instance from queue: GET JOB: {JobID: 0001, Task: DO foo}
• Worker instance to cache node: {JobID: 0001, Result: bar}
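
The job flow above can be sketched in-process with Python's standard library. This is only an illustration under loud assumptions: `queue.Queue` stands in for the queue/stream, a dict stands in for the cache node, and `submit`, `fetch_result`, and `worker` are hypothetical names.

```python
import queue
import threading
import uuid

jobs = queue.Queue()   # stands in for the queue/stream
results = {}           # stands in for the cache node


def submit(task):
    """API side: enqueue the job and return its ID without waiting for the work."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, task))
    return job_id


def fetch_result(job_id):
    """API side: return the result if a worker has finished, else None."""
    return results.get(job_id)


def worker():
    """Worker side: take jobs off the queue and store results in the cache."""
    while True:
        job_id, task = jobs.get()
        results[job_id] = f"done: {task}"   # placeholder for the real work
        jobs.task_done()


threading.Thread(target=worker, daemon=True).start()
```

The caller gets a job ID back immediately and polls `fetch_result(job_id)` later, so a slow worker never blocks the API.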

[Diagram: user, API instances, a queue/stream, worker instances, and a cache node. The user either fetches results through the API or receives a push notification.]

Degrade & prioritize traffic with queues
[Diagram: API instances feed a high-priority queue and a low-priority queue, both consumed by worker instances.]
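
One way to read the two-queue diagram, sketched with in-memory queues. The bounded low-priority queue and the names `enqueue` and `next_task` are assumptions made for illustration; a real system would use a managed queue service.

```python
import queue

high = queue.Queue()
low = queue.Queue(maxsize=100)   # bounded: low-priority work is rejected when full


def enqueue(task, high_priority=False):
    """Accept the task, or return False when low-priority capacity is exhausted."""
    if high_priority:
        high.put(task)
        return True
    try:
        low.put_nowait(task)
        return True
    except queue.Full:
        return False   # degrade: shed low-priority work first


def next_task():
    """Return the next task, always preferring the high-priority queue."""
    for q in (high, low):
        try:
            return q.get_nowait()
        except queue.Empty:
            continue
    return None
```

Workers that call `next_task()` keep serving high-priority traffic even when the low-priority backlog is large.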

Let’s talk about failures in distributed systems

Preventing failures
[Diagram: a recommendation engine and three services. Label: “Preserve at all cost”.]

Some of the most important things to think about
[Diagram: a recommendation engine and three services. Label: “Preserve at all cost”.]

Let’s talk about timeouts, backoff & retries!

What happens if the DB “slows down”?
[Diagram: users, app, connection pool, and DB, with several INSERTs in flight.]
• Timeout client side: ?
• Timeout backend side: ?

[Diagram: User 1, app, connection pool, and DB.]
• Timeout client side = 10s
• Timeout backend side = default = infinite
• INSERT, retry INSERT, retry INSERT, retry
• ERROR: Failed to get connection from pool

https://docs.microsoft.com/en-us/dotnet/api/system.net.httpwebrequest.timeout

https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html

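Example with timeout-decorator (https://pypi.org/project/timeout-decorator/):

    import requests
    import timeout_decorator

    # Raise StopIteration if the call takes longer than 5 seconds.
    @timeout_decorator.timeout(5, timeout_exception=StopIteration)
    def timed_get(url):
        return requests.get(url)

Set the timeouts!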
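
A standard-library variant of the same idea, as a hedged sketch: `urllib.request.urlopen` accepts a `timeout` argument in seconds. The helper name `fetch` and the choice to return `None` on any network error are assumptions for illustration.

```python
import urllib.request


def fetch(url, timeout_s=5.0):
    """GET a URL, giving up after timeout_s seconds instead of waiting forever."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.read()
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return None
```

Without the `timeout` argument the call can block for as long as the server keeps the connection open, which is how a slow dependency ends up exhausting a connection pool.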

How else could we have prevented the error?
[Diagram: User 1, connection pool, and DB.]
• INSERT, retry INSERT, retry INSERT, retry
• ERROR: Failed to get connection from pool

Backoff
[Diagram: User 1, connection pool, and DB.]
• Timeout client side = 10s
• Timeout backend side = 10s
• INSERT, wait 2s before retry, wait 4s before retry, wait 8s before retry, wait 16s before retry
• Backing off between retries
• Releasing connections

Simple exponential backoff is not enough: add jitter
[Charts: no jitter vs. with jitter.]
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

Adding Jitter
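
A minimal sketch of capped exponential backoff with "full jitter", where each wait is drawn uniformly between 0 and the current exponential ceiling. The helper names, `base`, `cap`, and the injectable `sleep` parameter are assumptions for illustration, not code from the talk.

```python
import random
import time


def full_jitter_delay(attempt, base=1.0, cap=30.0):
    """Delay for a 0-based attempt: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def retry(operation, max_tries=5, base=1.0, cap=30.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_tries attempts have failed."""
    for attempt in range(max_tries):
        try:
            return operation()
        except Exception:
            if attempt == max_tries - 1:
                raise   # out of attempts: surface the last error
            sleep(full_jitter_delay(attempt, base, cap))
```

Randomizing the whole delay spreads retries from many clients over time, so they do not hit the recovering service in synchronized waves.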

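Example: add jitter 0-1000ms

    import random
    import time

    import requests

    MAX_TRIES = 12

    def get_item(self, url, n=1):
        try:
            # A request timeout keeps a slow server from blocking the caller.
            res = requests.get(url, timeout=10)
        except requests.exceptions.RequestException:
            if n > MAX_TRIES:
                return None
            n += 1
            # Exponential backoff (2**n seconds) plus 0-1000 ms of random jitter.
            time.sleep((2 ** n) + (random.randint(0, 1000) / 1000.0))
            return self.get_item(url, n)
        else:
            return res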

Example with the backoff library (https://pypi.org/project/backoff/):

    import backoff

    # Exception is a placeholder here; retry only on errors that are safe to retry.
    @backoff.on_exception(backoff.expo, Exception, max_time=60)
    def poll_for_message(queue):
        return queue.get()

As of version 1.2, the default jitter function backoff.full_jitter implements the ‘Full Jitter’ algorithm as defined in the AWS Architecture Blog’s Exponential Backoff And Jitter post.

Idempotent operation: no additional effect if it is called more than once with the same input parameters.
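
Retries are only safe when the retried operation is idempotent. One common way to get there is a caller-supplied request ID, sketched below; the in-memory `processed` dict and the `apply_payment` function are hypothetical, and a real system would keep the IDs in a durable store.

```python
processed = {}   # request_id -> result of the first successful call


def apply_payment(request_id, account, amount, balances):
    """Credit the account once per request_id; repeated calls return the first result."""
    if request_id in processed:
        return processed[request_id]   # duplicate delivery: no additional effect
    balances[account] = balances.get(account, 0) + amount
    processed[request_id] = balances[account]
    return processed[request_id]
```

A client that times out and retries with the same `request_id` cannot credit the account twice.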

Circuit Breaker
• Wrap a protected function call in a circuit breaker object, which monitors for failures.
• If failures reach a certain threshold, the circuit breaker trips.
[Diagram: Producer, Circuit Breaker, Consumer. Labels: connection monitoring, timeouts, breaking circuit.]
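
The two bullets above can be sketched as a small class. This is not how Hystrix or any particular library implements it; the class name, the `max_failures` and `reset_after_s` parameters, and the single-trial "half-open" behavior are assumptions chosen to keep the example short.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without running."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of tying up threads and connections on a dependency that is already struggling.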

https://github.com/Netflix/Hystrix

https://spring.io/guides/gs/circuit-breaker/

Let’s talk about health checking!

Probing for health
[Diagram: an AWS Region with Service A and Service B in Auto Scaling groups across Availability Zones 1 and 2, plus a database, email, and a cluster.]

Shallow health check
[Diagram: the instance is asked “Are you healthy?” and answers “yes” on its own; the cache node, email, database, and cluster are not probed.]

Deep health check
[Diagram: the instance is asked “Are you healthy?” and asks the cache node, email, database, and cluster in turn; all answer “yes”, so the instance answers “yes”.]

Deep health check
[Diagram: one dependency answers “no” and the others answer “yes”, so the instance answers “no”.]

Prioritize shallow health checks during hard times. Cache.
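
The difference between the two kinds of check, as a hedged sketch. The function names and the shape of the returned dict are assumptions; the dependency probes are passed in as plain callables.

```python
def shallow_health():
    """Only confirms that this process is up and able to answer."""
    return {"healthy": True}


def deep_health(dependency_probes):
    """Runs every dependency probe; unhealthy if any probe returns False or raises."""
    status = {}
    for name, probe in dependency_probes.items():
        try:
            status[name] = bool(probe())
        except Exception:
            status[name] = False
    return {"healthy": all(status.values()), "dependencies": status}
```

A deep check reports more, but a single failing dependency marks every instance unhealthy at once, which is one reason to fall back to the shallow check under stress.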

Let’s talk about load shedding.

Cheaply reject excess work

Be careful when selecting the right metric

Load Shedding
• Don’t be overly optimistic and take on more than you can handle.
• Find an operational metric to reject what you cannot take in.
• Favor cached and static content.
• Prioritize ELB health check (shallow) pings.
• In an overload situation you have precious resources; do not let any of them go to waste.
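
A minimal sketch of cheap rejection, using the number of in-flight requests as the operational metric. That choice of metric, the `LoadShedder` class, and the `handle` helper are assumptions for illustration; latency or queue depth could serve equally well.

```python
import threading


class LoadShedder:
    """Rejects work once max_in_flight requests are already being processed."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_acquire(self):
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return False
            self.in_flight += 1
            return True

    def release(self):
        with self.lock:
            self.in_flight -= 1


def handle(shedder, request, process):
    """Process the request, or reject it immediately when the server is saturated."""
    if not shedder.try_acquire():
        return 503, "overloaded, try again later"   # cheap rejection, no work done
    try:
        return 200, process(request)
    finally:
        shedder.release()
```

The rejection path does almost no work, so the capacity that remains goes to requests that can still finish.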

Service Degradation & Fallbacks

https://twitter.com/redditstatus/status/1116204502703493120

Let’s talk about shuffle sharding.

[Diagram: customers ♤ ♡ ♢ ⚀ ⚁ ⚂ ⚃ ♧ ♢ and eight failures (X) marked.]

Measure for this: blast radius

Blast radius
• How many customers?
• What functionality?
• How many locations?

Cell-based architecture
[Diagram: customers ♤ ♡ ♢ ⚀ ⚁ ⚂ ⚃ ♧ ♢ divided across cells, with two failures (X) marked.]

Shuffle sharding
[Diagram: customers ♤ ♡ ♢ ♧ ⚀ ⚁ ⚂ ⚃ assigned to overlapping groups of nodes, with two failures (X) marked.]

Shuffle sharding: Nodes = 8, Shard size = 2, Combinations = 28

Overlap    % customers
0          53.6%
1          42.8%
2          3.6%

Shuffle sharding: Nodes = 100, Shard size = 5, Combinations = 75 million!

Overlap    % customers
0          77%
1          21%
2          1.8%
3          0.06%
4          0.0006%
5          0.0000013%
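
The numbers in both tables follow from counting combinations. The sketch below assumes each customer's shard is chosen uniformly at random from all possible shards, so the overlap with a given shard follows a hypergeometric distribution; the function names are illustrative.

```python
from math import comb


def shard_combinations(nodes, shard_size):
    """Number of distinct shards of shard_size nodes out of nodes."""
    return comb(nodes, shard_size)


def overlap_fraction(nodes, shard_size, overlap):
    """Fraction of possible shards sharing exactly `overlap` nodes with a given shard."""
    shared = comb(shard_size, overlap)
    not_shared = comb(nodes - shard_size, shard_size - overlap)
    return shared * not_shared / comb(nodes, shard_size)


for nodes, shard_size in ((8, 2), (100, 5)):
    print(f"Nodes = {nodes}, shard size = {shard_size}, "
          f"combinations = {shard_combinations(nodes, shard_size):,}")
    for k in range(shard_size + 1):
        print(f"  overlap {k}: {overlap_fraction(nodes, shard_size, k):.7%}")
```

With 100 nodes and shards of 5, only about one customer in 75 million shares all five nodes with any given customer.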

Shuffle sharding

Let’s talk about chaos!

Fire Drills

GameDay at Amazon: Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s

Chaos engineering
https://github.com/Netflix/SimianArmy

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org

Failure injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks
• “Paul” attack
https://www.gremlin.com
https://github.com/Netflix/SimianArmy
https://chaostoolkit.org
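
Application-level injection can start very small. The sketch below is a hypothetical decorator, unrelated to Gremlin, SimianArmy, or Chaos Toolkit, that adds a delay and raises an error for a configurable share of calls; `inject_failure` and its parameters are assumptions for illustration.

```python
import functools
import random
import time


def inject_failure(error_rate=0.0, delay_s=0.0, rng=random.random, sleep=time.sleep):
    """Wrap a function so calls are delayed by delay_s and fail with probability error_rate."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if delay_s:
                sleep(delay_s)   # simulated latency
            if rng() < error_rate:
                raise RuntimeError("injected failure")
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Applying it to one dependency call with a low `error_rate` is a way to check that timeouts, retries, and fallbacks behave as expected before moving on to host or network attacks.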

Bananas for Monkeys

How to DDoS yourself
~ wrk -t12 -c400 -d30s http://127.0.0.1/api/health

Adding delay to the network
~ tc qdisc add dev eth0 root netem delay 200ms

https://github.com/Netflix/SimianArmy
A set of scheduled agents that:
• shut down services randomly
• slow down performance
• check conformity
• break an entire region
• integrate with Spinnaker (CI/CD)

Let’s talk about operational resiliency

Operational resilience: value realized by example
1. Scaled to handle a 400% increase in page views (Kurt Geiger)
2. Improved security posture (CapitalOne)
3. 8600 transactions/second (McDonalds)
4. Transfer of over 750 TB of data from pipeline inspection machinery (GE)
5. Processing over 75 billion market events daily (FINRA)
6. Critical applications run in multiple AZs, x-Regions for robust disaster recovery (Expedia)
7. Supports over 300,000 requests per minute to its API (Easy Taxi)
8. 60% reduced downtime (Trainline)
9. Migration of SAP on Oracle to AWS with zero unplanned downtime across five countries (Kellogg’s)
10. SAP availability boosted to 100% (MacMillan)

Operational resilience
• What is it? The benefit of improving SLAs and reducing unplanned outages
• Example: critical workloads run in multiple AZs and Regions for robust DR (Expedia)

The cost of downtime

Metric                                             Source               Cost
Annual Fortune 1000 application downtime costs     IDC                  $1.25B to $2.5B
Average cost of a data breach                      Ponemon Institute    $3.6M
Cost/hr of a critical application failure          IDC                  $500K to $1M
Average cost/hr of downtime                        Ponemon Institute    $474K
Average cost per lost or stolen record             Ponemon Institute    $141

Operational resilience: Quantifying cost

Cost category (% of total): definition
• Third Parties (1.3%): The cost of contractors, consultants, auditors and other specialists engaged to help resolve unplanned outages.
• Equipment (1.3%): The cost of new equipment purchases and repairs, including refurbishment.
• Ex-post Activities (1.1%): All after-the-fact incidental costs associated with business disruption and recovery.
• Recovery (2.9%): Activities and associated costs that relate to bringing the organization’s networks and core systems back to a state of readiness.
• Detection (3.6%): Activities associated with the initial discovery and subsequent investigation of the partial or complete outage incident.
• IT Productivity (8.4%): The lost time and related expenses associated with IT personnel downtime.
• End-user Productivity (18.7%): The lost time and related expenses associated with end-user downtime.
• Lost Revenue (28.2%): The total revenue loss from customers and potential customers because of their inability to access core systems during the outage period.
• Business disruption (34.6%): Additional economic loss of the outage, including reputational damages, customer churn and lost business opportunities.
• TOTAL: 100.0%

Operational resilience: Case studies
• Migrated to AWS in 6 weeks with no downtime and improved availability to 99.99%+
• Migrated all workloads to AWS to reduce downtime by 60% with an annual savings of £1.2M
• Rebuilt patient engagement portal on AWS and reduced downtime from 120 to <5 min / month
• Using AWS, Travelstart has seized opportunities in emerging markets and has cut operational costs by 43% and downtime by 25%
• With its on-premises setup, system availability was 98%; on its cloud infrastructure, this has risen to 99.965%
• Three 9’s to five 9’s
• “We no longer need to worry about data center, server, or hypervisor security…which allows us to focus our attention on securing our applications.”

And before we go.

DON’T blame people for failure…

“Quality is not an act, it is a habit.” Aristotle, some time around 350 BC

https://aws.amazon.com/wellarchitected

https://medium.com/@adhorn

Thank you!
