Reliability

Reliability - Piyush Verma

Who is more Reliable? Nginx, Fabio, Traefik, Kong

Who is more Reliable? Cassandra, InfluxDB, TimeseriesDB, Prometheus

Who is more Reliable? Zookeeper, Etcd, Consul, and so on

01 At least one server is online.
02 All servers are below 100%.
03 All servers are responding within x ms.
04 All of the above.

Sample Product
User sends SMS "Remind me to buy milk at 6:30 PM" to 53308. Service receives SMS. Cron gets activated when time is right. Call the user.

Sample Product: Inbound
User sends SMS "Remind me to buy milk at 6:30 PM" to 53308. Service receives SMS. Cron gets activated when time is right. Call the user.

Sample Product: Outbound
User sends SMS "Remind me to buy milk at 6:30 PM" to 53308. Service receives SMS. Cron gets activated when time is right. Call the user.

Inbound Connection

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
- Leslie Lamport, https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Distribution.pdf

Four Flavors of Failure: Disk, Network, CPU, Memory

Scope of Failures: Again

Who is more Reliable? AWS, GCP, Azure

#1 Server is Unavailable

Failover Available

Failover != Load Balanced

Load Balanced

Architecture of a Balancer

Who uses What? GCP, AWS, On-Prem, Azure

Failover + Load Balanced

Who is more Reliable? 3-Rack, 4-Rack, 5-Rack

Trilemma: Available, Economical, Endurable

Load Balancing

Who is more Reliable? Random, Round-Robin, Last Frequently Used, Least Connections
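
To make the comparison concrete, here is a minimal Python sketch of three of the strategies named on this slide: random, round-robin, and least connections. The server names, the `RoundRobin` class, and the connection-count dictionary are hypothetical illustrations, not taken from the talk or from any particular balancer.

```python
import itertools
import random
from typing import Dict, List


def pick_random(servers: List[str], rng: random.Random) -> str:
    """Random: every server is equally likely, regardless of its load."""
    return rng.choice(servers)


class RoundRobin:
    """Round-robin: hand out servers in a fixed, repeating order."""

    def __init__(self, servers: List[str]) -> None:
        self._cycle = itertools.cycle(servers)

    def pick(self) -> str:
        return next(self._cycle)


def pick_least_connections(open_connections: Dict[str, int]) -> str:
    """Least connections: choose the server with the fewest open connections."""
    return min(open_connections, key=open_connections.get)
```

None of these pickers knows whether a server is actually healthy; that has to come from health checks or from failures observed by the caller.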

Problems

Who is more Reliable? Client-side Load Balancing, Server-side Load Balancing, Look-aside Load Balancing

Server-side Load Balancing. Example: Fabio

Look-aside Load Balancing. Example: Consul/DNS

Client-side Load Balancing. Example: Ribbon

Client-side Load Balancing. Example: Ribbon + Curator

Load Shedding
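
One common form of load shedding is to cap the number of requests in flight and reject the rest immediately instead of queueing them. The sketch below assumes that interpretation; the `LoadShedder` class and its limit are hypothetical and not taken from the talk.

```python
import threading


class LoadShedder:
    """Admit at most `max_in_flight` concurrent requests; shed the rest."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True if the request may proceed, False if it is shed."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Call once when an admitted request has finished."""
        with self._lock:
            self._in_flight -= 1
```

A caller would typically answer a shed request with a fast error (for HTTP, a 503) so that the client can back off and retry later.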

“Load Balancing is almost Impossible”
- Tyler McMullen, https://www.infoq.com/presentations/load-balancing/

Sample Product: Outbound, Part 1
User sends SMS "Remind me to buy milk at 6:30 PM" to 53308. Service receives SMS. Cron gets activated when time is right. Call the user.

Outbound

Scope of Failure: Outbound

Retries

Retries: Transient Failures

Exponential Backoff: Short-term Transient Failures
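
A minimal Python sketch of retrying with exponential backoff, assuming the failing call signals a retryable failure with a hypothetical `TransientError`. The default values for `attempts`, `base`, and `cap`, and the use of random jitter, are illustrative choices rather than anything stated in the talk.

```python
import random
import time
from typing import Callable, List


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(base: float, cap: float, retries: int) -> List[float]:
    """Delay ceilings before each retry: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]


def retry_with_backoff(
    call: Callable[[], object],
    attempts: int = 5,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> object:
    """Try `call` up to `attempts` times, waiting longer after each failure."""
    for ceiling in backoff_delays(base, cap, attempts - 1):
        try:
            return call()
        except TransientError:
            # Jitter: sleep a random fraction of the ceiling so that many
            # clients do not all retry at the same instant.
            sleep(ceiling * rng())
    # Final attempt: let any exception propagate to the caller.
    return call()
```

The `sleep` and `rng` parameters exist only so the behaviour can be tested without real waiting.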

Circuit Breaking: Long-term Transient Failures
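
A minimal Python sketch of a circuit breaker: after a run of consecutive failures it stops calling the dependency and fails fast, then lets a single trial call through once a cool-down has passed. The class name, the threshold, and the cool-down value are hypothetical; production breakers usually add more, such as per-endpoint state and metrics.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], object]) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, so one trial call goes through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```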

Who is more Reliable? Retry-Once, Keep-Retrying, Circuit Breaking

Who is more Reliable? At-least Once, Exactly-Once, At-most Once

At-most-once delivery

At-least-once delivery

Exactly-once delivery

Exactly-once delivery = At-least-once Delivery + Exactly-once Processing

Keys to Only-Once delivery: Atomic, Window, Idempotent
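
One way to read "at-least-once delivery + exactly-once processing" is a consumer that remembers message IDs and ignores redeliveries. The Python sketch below assumes every message carries a unique ID; the `IdempotentConsumer` class is hypothetical. It keeps seen IDs in memory forever, whereas a real system would record the ID atomically with the side effect and would only retain IDs for a bounded window.

```python
from typing import Callable, Set


class IdempotentConsumer:
    """Process each message ID at most once, even if it is delivered repeatedly."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()

    def deliver(self, message_id: str, payload: str) -> bool:
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Recorded only after the handler succeeds, so a failed message
        # can be redelivered and retried.
        self._seen.add(message_id)
        return True
```

For the reminder example, this means a redelivered "call the user" job would not trigger a second phone call.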

Out-of-Order delivery

Revisit

Who is more Reliable? Kafka, Celery, RabbitMQ, Sidekiq

“Guaranteed delivery in multi-party system is almost Impossible”

Sample Product
User sends SMS "Remind me to buy milk at 6:30 PM" to 53308. Service receives SMS. Cron gets activated when time is right. Call the user.

Problems of State

Locked/Serialization

Master/Master/Slave

Clustering

Who is more Reliable? Master-Master, Master-Slave, Clustering, Eventually Consistent

CAP Theorem [Everyone is putting a cap on you, i.e. fooling you]

What is Better? Available, Partition, Consistent

PACELC Theorem

What is Better? Consistency, Latency

Revised Flow

“Absolute availability is almost Impossible”

Reliable System: Scalable, Correct, Transparent

Summary: Consistent, Available, Economical, Low Latency

Does anyone have any questions? Thanks
