Slide 1

Slide 1 text

Reliability - Piyush Verma

Slide 2

Slide 2 text

Who is more Reliable? Nginx, Fabio, Traefik, Kong

Slide 3

Slide 3 text

Who is more Reliable? Cassandra, InfluxDB, TimeseriesDB, Prometheus

Slide 4

Slide 4 text

Who is more Reliable? Zookeeper, Etcd, Consul, so-and-so

Slide 5

Slide 5 text

01 At least one server is online
02 All servers are below 100%
03 All servers are responding within x ms
04 All of the above

Slide 6

Slide 6 text


Slide 7

Slide 7 text

Sample Product
User sends SMS: "Remind me to buy milk at 6:30 PM" to 53308
Service receives SMS
Cron gets activated when the time is right
Call the User

Slide 8

Slide 8 text

Sample Product: Inbound
User sends SMS: "Remind me to buy milk at 6:30 PM" to 53308
Service receives SMS
Cron gets activated when the time is right
Call the User

Slide 9

Slide 9 text

Sample Product: Outbound
User sends SMS: "Remind me to buy milk at 6:30 PM" to 53308
Service receives SMS
Cron gets activated when the time is right
Call the User

Slide 10

Slide 10 text

Inbound Connection

Slide 11

Slide 11 text

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
- Leslie Lamport, https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Distribution.pdf

Slide 12

Slide 12 text

Four Flavors of Failure: Disk, Network, CPU, Memory

Slide 13

Slide 13 text

Scope of Failures: Again

Slide 14

Slide 14 text

Who is more Reliable? AWS, GCP, Azure

Slide 15

Slide 15 text

#1 Server is Unavailable

Slide 16

Slide 16 text

Failover Available

Slide 17

Slide 17 text

Failover Available

Slide 18

Slide 18 text

Failover != Load Balanced

Slide 19

Slide 19 text

Load Balanced

Slide 20

Slide 20 text

Architecture of a Balancer

Slide 21

Slide 21 text

Who uses What? GCP, AWS, On-Prem, Azure

Slide 22

Slide 22 text

Failover + Load Balanced

Slide 23

Slide 23 text

Who is more Reliable? 3-Rack, 4-Rack, 5-Rack

Slide 24

Slide 24 text

Trilemma: Available, Economical, Endurable

Slide 25

Slide 25 text

Load Balancing

Slide 26

Slide 26 text

Who is more Reliable? Random, Round-Robin, Last Frequently Used, Least Connections
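
As a rough illustration of three of the strategies named on this slide, here is a minimal Python sketch of random, round-robin, and least-connections selection. The class names and the pick/release interface are hypothetical and are not taken from the talk or from any particular balancer.

```python
import itertools
import random


class RoundRobin:
    """Cycle through the backends in a fixed order."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def pick(self):
        return next(self._cycle)


class RandomChoice:
    """Pick a backend uniformly at random."""

    def __init__(self, backends, rng=None):
        self._backends = list(backends)
        self._rng = rng or random.Random()

    def pick(self):
        return self._rng.choice(self._backends)


class LeastConnections:
    """Pick the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        # Hypothetical bookkeeping: in-flight request count per backend.
        self._active = {backend: 0 for backend in backends}

    def pick(self):
        # Ties go to the backend that was listed first.
        backend = min(self._active, key=self._active.get)
        self._active[backend] += 1
        return backend

    def release(self, backend):
        # Call when a request to this backend finishes.
        self._active[backend] -= 1
```

None of these pickers checks backend health, so each one can keep routing traffic to a server that is already failing.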

Slide 27

Slide 27 text

Problems

Slide 28

Slide 28 text

Who is more Reliable? Client-Side Load Balancing, Server-Side Load Balancing, Look-Aside Load Balancing

Slide 29

Slide 29 text

Server-side Load Balancing
Example: Fabio

Slide 30

Slide 30 text

Look-aside Load Balancing
Example: Consul/DNS

Slide 31

Slide 31 text

Client-side Load Balancing
Example: Ribbon

Slide 32

Slide 32 text

Client-side Load Balancing
Example: Ribbon + Curator

Slide 33

Slide 33 text

Load Shedding
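
One common form of load shedding is to reject new requests once too many are already in flight, instead of letting every request slow down. The sketch below assumes that interpretation; the names LoadShedder, Overloaded, and max_in_flight are hypothetical and do not come from the talk.

```python
import threading


class Overloaded(Exception):
    """Raised when a request is rejected to protect the service."""


class LoadShedder:
    def __init__(self, max_in_flight):
        self._max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def handle(self, operation):
        # Admit the request only if there is spare capacity.
        with self._lock:
            if self._in_flight >= self._max_in_flight:
                raise Overloaded("too many requests in flight")
            self._in_flight += 1
        try:
            return operation()
        finally:
            # Free the slot whether the operation succeeded or failed.
            with self._lock:
                self._in_flight -= 1
```

A caller would typically translate Overloaded into a fast error response (for example an HTTP 503), so that clients can back off.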

Slide 34

Slide 34 text

“Load Balancing is almost Impossible”
- Tyler McMullen, https://www.infoq.com/presentations/load-balancing/

Slide 35

Slide 35 text

Sample Product: Outbound
Part 1
User sends SMS: "Remind me to buy milk at 6:30 PM" to 53308
Service receives SMS
Cron gets activated when the time is right
Call the User

Slide 36

Slide 36 text

Outbound

Slide 37

Slide 37 text

Scope of Failure: Outbound

Slide 38

Slide 38 text

Retries

Slide 39

Slide 39 text

Retries: Transient Failures

Slide 40

Slide 40 text

Exponential Backoff: Short-term Transient Failures
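
A minimal sketch of retrying with exponential backoff, assuming the operation signals a transient failure by raising an exception. TransientError, retry_with_backoff, and all parameter defaults are hypothetical choices for illustration. The random jitter is an addition not mentioned on the slide; it spreads out retries from many clients.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call operation(), doubling the wait ceiling after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Ceiling grows as base_delay * 2^attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: sleep a random fraction of the ceiling.
            sleep(ceiling * rng())
```

The sleep and rng parameters exist only so the waits can be replaced in tests; a caller would normally leave them at their defaults.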

Slide 41

Slide 41 text

Circuit Breaking: Long-term Transient Failures
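
A circuit breaker stops calling a dependency that keeps failing and lets a trial call through only after a cool-down period. The following is a simplified sketch under those assumptions; CircuitBreaker, CircuitOpenError, and the threshold and timeout defaults are hypothetical, not the API of any specific library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, allow one trial call.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Unlike plain retries, an open breaker fails immediately, which keeps a long outage downstream from tying up the caller's resources.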

Slide 42

Slide 42 text

Who is more Reliable? Retry-Once, Keep-Retrying, Circuit Breaking

Slide 43

Slide 43 text

Who is more Reliable? At-least Once, Exactly-Once, At-most Once

Slide 44

Slide 44 text

At-most once delivery

Slide 45

Slide 45 text

At-least once delivery

Slide 46

Slide 46 text

Exactly once delivery

Slide 47

Slide 47 text

Exactly once delivery = At-least-once Delivery + Exactly-once Processing

Slide 48

Slide 48 text

Keys to Only-Once delivery: Atomic, Window, Idempotent
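
One way to get exactly-once processing on top of at-least-once delivery is to make the consumer idempotent by remembering recently seen message IDs. The sketch below assumes each message carries a unique ID and that duplicates arrive within a bounded window; IdempotentConsumer, message_id, and window_size are hypothetical names.

```python
from collections import OrderedDict


class IdempotentConsumer:
    """Process each message ID at most once within a bounded window."""

    def __init__(self, handler, window_size=1000):
        self._handler = handler
        self._window_size = window_size
        self._seen = OrderedDict()  # message IDs, oldest first

    def receive(self, message_id, payload):
        if message_id in self._seen:
            return False  # duplicate delivery: skip the side effect
        # Assumption: in a real system the side effect and the record of
        # the ID would be committed atomically (e.g. in one transaction).
        self._handler(payload)
        self._seen[message_id] = True
        if len(self._seen) > self._window_size:
            self._seen.popitem(last=False)  # forget the oldest ID
        return True
```

A duplicate that arrives after its ID has left the window is processed again, so the window has to be sized for the longest redelivery delay the sender can produce.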

Slide 49

Slide 49 text

Out-of-Order delivery

Slide 50

Slide 50 text

Revisit

Slide 51

Slide 51 text

Who is more Reliable? Kafka, Celery, RabbitMQ, Sidekiq

Slide 52

Slide 52 text

“Guaranteed delivery in a multi-party system is almost Impossible”

Slide 53

Slide 53 text

Sample Product
User sends SMS: "Remind me to buy milk at 6:30 PM" to 53308
Service receives SMS
Cron gets activated when the time is right
Call the User

Slide 54

Slide 54 text

Problems of State

Slide 55

Slide 55 text

Locked/Serialization

Slide 56

Slide 56 text

Master/Master/Slave

Slide 57

Slide 57 text

Clustering

Slide 58

Slide 58 text

Who is more Reliable? Master-Master, Master-Slave, Clustering, Eventually Consistent

Slide 59

Slide 59 text

CAP Theorem [Everyone is fooling you]

Slide 60

Slide 60 text

What is Better? Available, Partition, Consistent

Slide 61

Slide 61 text

PACELC Theorem

Slide 62

Slide 62 text

What is Better? Consistency, Latency

Slide 63

Slide 63 text

Revised Flow

Slide 64

Slide 64 text


Slide 65

Slide 65 text

“Absolute availability is almost Impossible”

Slide 66

Slide 66 text

Reliable System: Scalable, Correct, Transparent

Slide 67

Slide 67 text

Summary: Consistent, Available, Economical, Low Latency

Slide 68

Slide 68 text

Thanks
Does anyone have any questions? [email protected]