
Resilient Cloud Architecture Design Patterns


Resilient architecture is crucial for all cloud implementations. In this talk, we explore different design patterns to make a distributed application more resilient.

As part of this journey, for any process we need to ask: what if something goes wrong? We then plan a course of action so that the process can auto-heal without any human intervention, and lower risk by performing canary deployments. Design starts with understanding the requirements and performing empathy-map and value-chain analysis.

Treating the application as stateless for all API calls keeps the system available most of the time, but requires a cache for common distributed data. Next, we examine how to deal with cascading failures and timeout scenarios. As part of auto-healing, applications need to detect, prevent, recover, mitigate, and complement so that the service stays resilient.

Key takeaways for the audience are as follows:

Resiliency is essential for any feature in the cloud.
Understanding the value chain is critical to identifying failure points.
The challenges lie in determining whether a failure has occurred and in designing the system for auto-healing.
The focus should first be on preventing a failure from occurring.
Identify the key challenges in your company, along with tools and techniques to auto-heal and provide a sustainable solution.

Rohit Bhardwaj

March 14, 2019


Transcript

  1. Availability • Elimination of single points of failure. • Reliable

    crossover • Detection of failures as they occur.
  2. When something is strong and able to recover from damage

    quickly, call it resilient. https://www.vocabulary.com/dictionary/resilient
  3. Availability is the assurance that an enterprise’s IT infrastructure has

    suitable recoverability and protection from system failures, natural disasters or malicious attacks. https://www.gartner.com/it-glossary/availability
  4. Recovery Point Objective (RPO); Recovery Time Objective (RTO); WRT: work recovery time; MTD: Maximum Tolerable Downtime = RTO + WRT
  5. Serial availability: A = A_a × A_b. Service A: 99% availability, 3.65 days/year downtime; Service B: 99.99%, 52 minutes/year; A + B in series: 98.99%, 3.69 days/year.
  6. Parallel availability: A = 1 − (1 − A_a)(1 − A_b). Service A: 99% availability, 3.65 days/year downtime; Service B: 99.99%, 52 minutes/year; A + B in parallel: 99.9999%, 31 seconds/year.
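The serial and parallel figures above can be checked with a short sketch (the function names are my own, not from the deck):

```python
from datetime import timedelta

def downtime_per_year(availability: float) -> timedelta:
    """Expected downtime per year for a given availability (0..1)."""
    return timedelta(days=365) * (1 - availability)

def serial(*availabilities: float) -> float:
    """Services chained in series: every one must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities: float) -> float:
    """Redundant services: at least one must be up."""
    failure = 1.0
    for a in availabilities:
        failure *= 1 - a
    return 1 - failure

print(serial(0.99, 0.9999))    # ~0.9899  -> ~3.69 days/year down
print(parallel(0.99, 0.9999))  # ~0.999999 -> ~31 seconds/year down
```

Adding services in series can only lower availability; redundancy multiplies the failure probabilities instead, which is why the parallel pair gains four nines.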
  7. Zonal services: EC2 instances, EBS volumes, EMR clusters, NAT gateways. Regional services: S3, EFS, DynamoDB, Fargate, Kinesis, API Gateway, Lambda, SQS.
  8. Availability Zone failure (diagram): EC2 instances in Zones A, B, and C, each consuming the AWS SQS service.
  9. Availability Zone failure (diagram, continued): the same layout during the failure of one zone; the remaining zones keep consuming the AWS SQS service.
  10. Cell-based architecture (diagram): a cell router, itself a regional service, in front of Cells 0, 1, and 2, each with its own Application Load Balancer, compute, and storage.
  11. Region service (diagram): cells in Zones A, B, and C, each depending on the regional AWS SQS service.
  12. Region service outage (diagram): the same layout when the regional AWS SQS service fails, affecting cells in every zone.
  13. Zonal service (diagram): EC2 service cells in Zones A, B, and C (three cells per zone), compared with a single EC2 service per zone.
  14. Zonal service outage (diagram): the same layout when a zonal EC2 service fails.
  15. Partial zonal service outage (diagram): the same layout when only some EC2 service cells within a zone fail.
  16. Cell-based architecture (diagram, repeated): a cell router, itself a regional service, in front of Cells 0, 1, and 2, each with its own Application Load Balancer, compute, and storage.
  17. Services A, B, and C on shared compute behind a cell router that routes by customer (diagram); blast radius = all customers.
  18. Cell routing by customer (diagram): a regional cell router in front of Cells 0–3, each with an Application Load Balancer, compute, and storage, serving customers A/B, C/D, E/F, and G/H respectively.
  19. (Same diagram as the previous slide.)
  20. The same diagram, with shuffled customer assignments (A C B, A C E, B D) overlaid across the cells.
  21. The same diagram with shuffled customer assignments; blast radius = customers / combinations.
  22. Shuffle sharding: nodes = 8, shard size = 2. Combinations: C(n, k) = n! / (k! (n − k)!) = 8! / (2! (8 − 2)!) = 28. Overlap → % of customers impacted: 0 → 53.6%, 1 → 42.8%, 2 → 3.6%. https://threadreaderapp.com/thread/1034492056968736768.html
  23. Shuffle sharding: nodes = 100, shard size = 5. Overlap → % of customers impacted: 0 → 77%, 1 → 21%, 2 → 1.8%, 3 → 0.06%, 4 → 0.0006%, 5 → 0.0000013%. https://threadreaderapp.com/thread/1034492056968736768.html
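The overlap percentages in the two shuffle-sharding tables follow the hypergeometric distribution; a small sketch (function names are mine) reproduces them:

```python
from math import comb

def shard_combinations(nodes: int, shard_size: int) -> int:
    """Number of distinct shards: C(n, k) = n! / (k! (n-k)!)."""
    return comb(nodes, shard_size)

def overlap_probability(nodes: int, shard_size: int, overlap: int) -> float:
    """Probability that a random shard shares exactly `overlap` nodes
    with one fixed shard (hypergeometric distribution)."""
    return (comb(shard_size, overlap)
            * comb(nodes - shard_size, shard_size - overlap)
            / comb(nodes, shard_size))

print(shard_combinations(8, 2))           # 28 possible shards
for k in range(3):                        # overlap 0, 1, 2
    print(k, overlap_probability(8, 2, k))
```

With 8 nodes and shard size 2 this yields roughly 53.6%, 42.8%, and 3.6% for overlaps of 0, 1, and 2 — matching the table above; only the tiny full-overlap fraction of customers shares the whole blast radius.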
  24. Issue: Infrastructure failures causing delays • Network congestion or network

    partition • Request contention in non-scalable infrastructure services • I/O bandwidth saturation in storage systems • Software failures in shared services • Denial-of-service attacks that trigger service failure or resource exhaustion
  25. Issue: Latency between New York and San Francisco • 13.8 ms one way • 27.6 ms round trip • As per AT&T: 70 ms. Distance, as the crow flies, between New York and San Francisco = 2,565 miles; speed of light = 186,000 miles/second.
  26. Solution 2: Reducing the number of network requests • REST API design • Ask for ZIP code and phone number in the same REST request • Consolidate small details into one form
  27. Web app → DB connection pool → DB (diagram): frontend, backend, and database timeouts around a chain of INSERTs.
  28. The same diagram, with retries issued after the timeouts expire.
  29. Python — no timeout means waiting forever! • connect_timeout=31 # number of seconds to wait for the connection to be established • read_timeout=5 # maximum number of seconds between data reads • r = requests.get('https://github.com', timeout=(connect_timeout, read_timeout))
  30. MySQL 5.7 or higher (timeouts in milliseconds) • SELECT /*+ MAX_EXECUTION_TIME(1000) */ status, count(*) FROM articles GROUP BY status ORDER BY status; • SET SESSION MAX_EXECUTION_TIME=2000; • SET GLOBAL MAX_EXECUTION_TIME=2000;
  31. Web app → DB connection pool → DB (diagram): frontend, backend, and database timeouts around a chain of INSERTs.
  32. The same diagram with frontend, backend, and database timeouts of 10 s each; wait 2 s, 4 s, then 8 s before retries (backoff between retries), and release connections while waiting.
  33. • A request to a remote service timed out. •

    The request queue for the remote service is full, indicating that the remote service is unable to handle additional requests.
  34. Fallback Strategy • Fail transparently. Generates a custom response •

    Fail silently. Returns a null value for the request. • Fail fast. Generates a 5xx response code
  35. Web app → DB connection pool → DB (diagram): frontend, backend, and database timeouts around a chain of INSERTs.
  36. The same diagram: after waiting 2 s before the retry, the INSERT is already in the database!
  37. The same diagram with the INSERTs replaced by UPSERTs: how to do an UPSERT?
  38. ORDER BY • CREATE TABLE crossfit_gyms_by_location ( country_code text, state_province text, city text, gym_name text, PRIMARY KEY (country_code, state_province, city, gym_name) ) WITH CLUSTERING ORDER BY (state_province DESC, city ASC, gym_name ASC);
  39. hashed values • country_code | state_province | city | gym_name

    • --------------+----------------+---------------+-------------------------- • CAN | ON | Toronto | CrossFit Leslieville • CAN | ON | Toronto | CrossFit Toronto • CAN | BC | Vancouver | CrossFit BC • CAN | BC | Vancouver | CrossFit Vancouver • USA | NY | New York | CrossFit Metropolis • USA | NY | New York | CrossFit NYC • USA | NV | Las Vegas | CrossFit Las Vegas • USA | NV | Las Vegas | Kaizen CrossFit • USA | CA | San Francisco | LaLanne Fitness CrossFit • USA | CA | San Francisco | San Francisco CrossFit
  40. Inserting/updating data • INSERT INTO users (email, bio, birthday, active) VALUES ('[email protected]', 'Coach', 646464676600, true); • UPDATE users SET bio='Coach', birthday=646464676600, active=true WHERE email='[email protected]';
  41. Upsert • Under the hood, INSERT and UPDATE are the same • INSERT = UPDATE • Both operations require the primary key
  42. Deleting data • Deleting a row: DELETE FROM customers WHERE id = '2829'; • Deleting a column: DELETE email FROM customers WHERE id = '2829'; • UPDATE customers SET email = null WHERE id = '2829'; • INSERT INTO customers (id, email) VALUES ('2829', null);
  43. Idempotent HTTP methods — method (idempotent / safe): GET (yes / yes), HEAD (yes / yes), PUT (yes / no), DELETE (yes / no), POST (no / no), PATCH (no / no)
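Because POST is not idempotent, a timed-out POST cannot be blindly retried. One common remedy, sketched below with hypothetical names, is a client-supplied idempotency key that lets the server deduplicate retries:

```python
import uuid

class OrderService:
    """Toy in-memory service; a real one would persist the key mapping."""

    def __init__(self):
        self._processed = {}  # idempotency key -> stored result

    def create_order(self, idempotency_key: str, payload: dict) -> dict:
        # A retried request carrying the same key returns the original
        # result instead of creating a duplicate order.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        order = {"order_id": str(uuid.uuid4()), **payload}
        self._processed[idempotency_key] = order
        return order

svc = OrderService()
key = str(uuid.uuid4())            # generated once per logical request
first = svc.create_order(key, {"item": "book"})
retry = svc.create_order(key, {"item": "book"})   # e.g. after a timeout
print(first["order_id"] == retry["order_id"])     # no duplicate created
```

This is the same idea as the UPSERT slides earlier: make the write safe to repeat, then timeouts plus retries stop being dangerous.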
  44. Get latest customer orders • API composition: join on the API side • Get customers • Get most recent orders
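API composition can be sketched as an in-memory join in the API layer; both service calls below are hypothetical stand-ins for the real "get customers" and "get most recent orders" requests:

```python
def get_customers():
    # Stand-in for a call to the customer service.
    return [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]

def get_recent_orders():
    # Stand-in for a call to the order service.
    return [{"customer_id": 1, "order": "book"},
            {"customer_id": 1, "order": "pen"}]

def latest_customer_orders():
    """Join the two result sets on customer id in the API layer."""
    orders_by_customer = {}
    for o in get_recent_orders():
        orders_by_customer.setdefault(o["customer_id"], []).append(o["order"])
    return [{**c, "orders": orders_by_customer.get(c["id"], [])}
            for c in get_customers()]

print(latest_customer_orders())
```

In production the two calls would run concurrently, and the composer should degrade gracefully if one side fails (for example, return customers with empty order lists).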
  45. For Impatient Web Users, an Eye Blink Is Just Too Long to Wait — 250 ms is the magic number. https://www.nytimes.com/2012/03/01/technology/impatient-web-users-flee-slow-loading-sites.html?pagewanted=all&_r=0
  46. Fallacies of Distributed Systems • The network is reliable •

    Latency is zero • Bandwidth is infinite • The network is secure • Topology doesn't change • There is one administrator • Transport cost is zero • The network is homogeneous