
Resilient Cloud Architecture Design Patterns


Resilient architecture is crucial for all cloud implementations. In this talk, we explore different design patterns to make a distributed application more resilient.

As part of this journey, for any process we need to ask: what if something goes wrong? We then plan a course of action so that the process can auto-heal without any human intervention, and lower risk by performing canary deployments. Design starts with understanding the requirements and performing empathy-map and value-chain analysis.

Treating the application as stateless for all API calls keeps the system available most of the time, but requires a cache for common distributed data. Next, we examine how to deal with cascading failures and timeout scenarios. As part of auto-healing, applications need to detect, prevent, recover, mitigate, and complement so that the service stays resilient.

Key takeaways for the audience are as follows:

Resiliency is essential for any feature in the cloud.
Understanding the value chain is critical to identifying failure points.
The challenges lie in determining whether a failure has occurred and in designing the system for auto-healing.
The focus should first be on preventing a failure from occurring.
Identify the key challenges in your company, along with tools and techniques to auto-heal and provide a sustainable solution.

Rohit Bhardwaj

March 14, 2019


Transcript

  1. Availability • Elimination of single points of failure. • Reliable

    crossover • Detection of failures as they occur.
  2. When something is strong and able to recover from damage

    quickly, call it resilient. https://www.vocabulary.com/dictionary/resilient
  3. Availability is the assurance that an enterprise’s IT infrastructure has

    suitable recoverability and protection from system failures, natural disasters or malicious attacks. https://www.gartner.com/it-glossary/availability
  4. Recovery Point Objective (RPO); Recovery Time Objective (RTO); WRT: work recovery time; MTD: Maximum Tolerable Downtime = RTO + WRT
  5. Serial availability: A = A_a × A_b. Service A: 99% availability, 3.65 days/year downtime; Service B: 99.99%, 52 minutes/year; A + B in series: 98.99%, 3.69 days/year.
  6. Parallel availability: A = 1 − (1 − A_a)(1 − A_b). Service A: 99% availability, 3.65 days/year downtime; Service B: 99.99%, 52 minutes/year; A + B in parallel: 99.9999%, 31 seconds/year.
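The serial and parallel figures above can be checked with a short sketch (the function names are my own, not from the deck):

```python
from datetime import timedelta

def downtime_per_year(availability: float) -> timedelta:
    """Expected downtime per year for a given availability (0..1)."""
    return timedelta(days=365) * (1 - availability)

def serial(*availabilities: float) -> float:
    """Services chained in series: every one must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities: float) -> float:
    """Redundant services: at least one must be up."""
    failure = 1.0
    for a in availabilities:
        failure *= 1 - a
    return 1 - failure

print(serial(0.99, 0.9999))    # ~0.9899  -> ~3.69 days/year down
print(parallel(0.99, 0.9999))  # ~0.999999 -> ~31 seconds/year down
```

Adding services in series can only lower availability; redundancy multiplies the failure probabilities instead, which is why the parallel pair gains four nines.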
  7. Zonal services: EC2 instances, EBS volumes, EMR clusters, NAT gateways. Regional services: S3, EFS, DynamoDB, Fargate, Kinesis, API Gateway, Lambda, SQS.
  8. Availability Zone failure (diagram): EC2 instances in Zones A, B, and C, each consuming the AWS SQS service.
  9. Availability Zone failure (diagram, continued): the same layout during the failure of one zone; the remaining zones keep consuming the AWS SQS service.
  10. Cell-based architecture (diagram): a cell router, itself a regional service, in front of Cells 0, 1, and 2, each with its own Application Load Balancer, compute, and storage.
  11. Region service (diagram): cells in Zones A, B, and C, each depending on the regional AWS SQS service.
  12. Region service outage (diagram): the same layout when the regional AWS SQS service fails, affecting cells in every zone.
  13. Zonal service (diagram): EC2 service cells in Zones A, B, and C (three cells per zone), compared with a single EC2 service per zone.
  14. Zonal service outage (diagram): the same layout when a zonal EC2 service fails.
  15. Partial zonal service outage (diagram): the same layout when only some EC2 service cells within a zone fail.
  16. Cell-based architecture (diagram, repeated): a cell router, itself a regional service, in front of Cells 0, 1, and 2, each with its own Application Load Balancer, compute, and storage.
  17. Services A, B, and C on shared compute behind a cell router that routes by customer (diagram); blast radius = all customers.
  18. Cell routing by customer (diagram): a regional cell router in front of Cells 0–3, each with an Application Load Balancer, compute, and storage, serving customers A/B, C/D, E/F, and G/H respectively.
  19. (Same diagram as the previous slide.)
  20. The same diagram, with shuffled customer assignments (A C B, A C E, B D) overlaid across the cells.
  21. The same diagram with shuffled customer assignments; blast radius = customers / combinations.
  22. Shuffle sharding: nodes = 8, shard size = 2. Combinations: C(n, k) = n! / (k! (n − k)!) = 8! / (2! (8 − 2)!) = 28. Overlap → % of customers impacted: 0 → 53.6%, 1 → 42.8%, 2 → 3.6%. https://threadreaderapp.com/thread/1034492056968736768.html
  23. Shuffle sharding: nodes = 100, shard size = 5. Overlap → % of customers impacted: 0 → 77%, 1 → 21%, 2 → 1.8%, 3 → 0.06%, 4 → 0.0006%, 5 → 0.0000013%. https://threadreaderapp.com/thread/1034492056968736768.html
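The overlap percentages in the two shuffle-sharding tables follow the hypergeometric distribution; a small sketch (function names are mine) reproduces them:

```python
from math import comb

def shard_combinations(nodes: int, shard_size: int) -> int:
    """Number of distinct shards: C(n, k) = n! / (k! (n-k)!)."""
    return comb(nodes, shard_size)

def overlap_probability(nodes: int, shard_size: int, overlap: int) -> float:
    """Probability that a random shard shares exactly `overlap` nodes
    with one fixed shard (hypergeometric distribution)."""
    return (comb(shard_size, overlap)
            * comb(nodes - shard_size, shard_size - overlap)
            / comb(nodes, shard_size))

print(shard_combinations(8, 2))           # 28 possible shards
for k in range(3):                        # overlap 0, 1, 2
    print(k, overlap_probability(8, 2, k))
```

With 8 nodes and shard size 2 this yields roughly 53.6%, 42.8%, and 3.6% for overlaps of 0, 1, and 2 — matching the table above; only the tiny full-overlap fraction of customers shares the whole blast radius.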
  24. Issue: Infrastructure failures causing delays • Network congestion or network

    partition • Request contention in non-scalable infrastructure services • I/O bandwidth saturation in storage systems • Software failures in shared services • Denial-of-service attacks that trigger service failure or resource exhaustion
  25. Issue: Latency between New York and San Francisco • 13.8 ms one way • 27.6 ms round trip • As per AT&T: 70 ms. Distance, as the crow flies, between New York and San Francisco = 2,565 miles; speed of light = 186,000 miles/second.
  26. Solution 2: Reducing the number of network requests • REST API design • Ask for ZIP code and phone number in the same REST request • Consolidate small details into one form
  27. Web app → DB connection pool → DB (diagram): frontend, backend, and database timeouts around a chain of INSERTs.
  28. The same diagram, with retries issued after the timeouts expire.
  29. Python — no timeout means waiting forever! • connect_timeout=31 # number of seconds to wait for the connection to be established • read_timeout=5 # maximum number of seconds between data reads • r = requests.get('https://github.com', timeout=(connect_timeout, read_timeout))
  30. MySQL 5.7 or higher (timeouts in milliseconds) • SELECT /*+ MAX_EXECUTION_TIME(1000) */ status, count(*) FROM articles GROUP BY status ORDER BY status; • SET SESSION MAX_EXECUTION_TIME=2000; • SET GLOBAL MAX_EXECUTION_TIME=2000;
  31. Web app → DB connection pool → DB (diagram): frontend, backend, and database timeouts around a chain of INSERTs.
  32. The same diagram with frontend, backend, and database timeouts of 10 s each; wait 2 s, 4 s, then 8 s before retries (backoff between retries), and release connections while waiting.
  33. • A request to a remote service timed out. •

    The request queue for the remote service is full, indicating that the remote service is unable to handle additional requests.
  34. Fallback Strategy • Fail transparently. Generates a custom response •

    Fail silently. Returns a null value for the request. • Fail fast. Generates a 5xx response code
  35. Web app → DB connection pool → DB (diagram): frontend, backend, and database timeouts around a chain of INSERTs.
  36. The same diagram: after waiting 2 s before the retry, the INSERT is already in the database!
  37. The same diagram with the INSERTs replaced by UPSERTs: how to do an UPSERT?
  38. ORDER BY • CREATE TABLE crossfit_gyms_by_location ( country_code text, state_province text, city text, gym_name text, PRIMARY KEY (country_code, state_province, city, gym_name) ) WITH CLUSTERING ORDER BY (state_province DESC, city ASC, gym_name ASC);
  39. hashed values • country_code | state_province | city | gym_name

    • --------------+----------------+---------------+-------------------------- • CAN | ON | Toronto | CrossFit Leslieville • CAN | ON | Toronto | CrossFit Toronto • CAN | BC | Vancouver | CrossFit BC • CAN | BC | Vancouver | CrossFit Vancouver • USA | NY | New York | CrossFit Metropolis • USA | NY | New York | CrossFit NYC • USA | NV | Las Vegas | CrossFit Las Vegas • USA | NV | Las Vegas | Kaizen CrossFit • USA | CA | San Francisco | LaLanne Fitness CrossFit • USA | CA | San Francisco | San Francisco CrossFit
  40. Inserting/updating data • INSERT INTO users (email, bio, birthday, active) VALUES ('[email protected]', 'Coach', 646464676600, true); • UPDATE users SET bio='Coach', birthday=646464676600, active=true WHERE email='[email protected]';
  41. Upsert • Under the hood, INSERT and UPDATE are the same • INSERT = UPDATE • Both operations require the primary key
  42. Deleting data • Deleting a row: DELETE FROM customers WHERE id = '2829'; • Deleting a column: DELETE email FROM customers WHERE id = '2829'; • UPDATE customers SET email = null WHERE id = '2829'; • INSERT INTO customers (id, email) VALUES ('2829', null);
  43. Idempotent HTTP methods — method (idempotent / safe): GET (yes / yes), HEAD (yes / yes), PUT (yes / no), DELETE (yes / no), POST (no / no), PATCH (no / no)
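Because POST is not idempotent, a timed-out POST cannot be blindly retried. One common remedy, sketched below with hypothetical names, is a client-supplied idempotency key that lets the server deduplicate retries:

```python
import uuid

class OrderService:
    """Toy in-memory service; a real one would persist the key mapping."""

    def __init__(self):
        self._processed = {}  # idempotency key -> stored result

    def create_order(self, idempotency_key: str, payload: dict) -> dict:
        # A retried request carrying the same key returns the original
        # result instead of creating a duplicate order.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        order = {"order_id": str(uuid.uuid4()), **payload}
        self._processed[idempotency_key] = order
        return order

svc = OrderService()
key = str(uuid.uuid4())            # generated once per logical request
first = svc.create_order(key, {"item": "book"})
retry = svc.create_order(key, {"item": "book"})   # e.g. after a timeout
print(first["order_id"] == retry["order_id"])     # no duplicate created
```

This is the same idea as the UPSERT slides earlier: make the write safe to repeat, then timeouts plus retries stop being dangerous.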
  44. Get latest customer orders • API composition: join on the API side • Get customers • Get most recent orders
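API composition can be sketched as an in-memory join in the API layer; both service calls below are hypothetical stand-ins for the real "get customers" and "get most recent orders" requests:

```python
def get_customers():
    # Stand-in for a call to the customer service.
    return [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]

def get_recent_orders():
    # Stand-in for a call to the order service.
    return [{"customer_id": 1, "order": "book"},
            {"customer_id": 1, "order": "pen"}]

def latest_customer_orders():
    """Join the two result sets on customer id in the API layer."""
    orders_by_customer = {}
    for o in get_recent_orders():
        orders_by_customer.setdefault(o["customer_id"], []).append(o["order"])
    return [{**c, "orders": orders_by_customer.get(c["id"], [])}
            for c in get_customers()]

print(latest_customer_orders())
```

In production the two calls would run concurrently, and the composer should degrade gracefully if one side fails (for example, return customers with empty order lists).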
  45. For Impatient Web Users, an Eye Blink Is Just Too Long to Wait — 250 ms is the magic number. https://www.nytimes.com/2012/03/01/technology/impatient-web-users-flee-slow-loading-sites.html?pagewanted=all&_r=0
  46. Fallacies of Distributed Systems • The network is reliable •

    Latency is zero • Bandwidth is infinite • The network is secure • Topology doesn't change • There is one administrator • Transport cost is zero • The network is homogeneous