Slide 1

Slide 1 text

Reliability of Distributed Systems - Piyush Verma

Slide 2

Slide 2 text

Every product either dies a hero or lives long enough to hit reliability issues.

Slide 3

Slide 3 text

Take it and Go
01 Customer Empathy
02 No Chooran.
03 Cost to Everything
04 Architectures adapt to $ Priority

Slide 4

Slide 4 text

Sample Product
User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
Service receives SMS
Cron gets activated when time is right.
Call the User

Slide 5

Slide 5 text

Sample Product: Inbound
User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
Service receives SMS
Cron gets activated when time is right.
Call the User

Slide 6

Slide 6 text

Sample Product: Outbound
User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
Service receives SMS
Cron gets activated when time is right.
Call the User

Slide 7

Slide 7 text

Inbound Connection

Slide 8

Slide 8 text

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
- Leslie Lamport, https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Distribution.pdf

Slide 9

Slide 9 text

Four Flavors of Failure
Disk
Network
CPU
Memory

Slide 10

Slide 10 text

01 Network is Reliable
02 Intra-LAN latency is ~ Zero
03 Network is Homogeneous
04 Network cost is Zero

Slide 11

Slide 11 text

Scope of Failures: Again

Slide 12

Slide 12 text

01 At least one server is online
02 All servers are below 100%
03 All servers are responding within x ms.
04 All of the above.

Slide 13

Slide 13 text

#1 Server is Unavailable

Slide 14

Slide 14 text

Replication Available

Slide 15

Slide 15 text

Replication Available

Slide 16

Slide 16 text

Available != Load Balanced

Slide 17

Slide 17 text

Load Balanced

Slide 18

Slide 18 text

Architecture of a Balancer

Slide 19

Slide 19 text

Who uses What?
GCP
AWS
On-Prem
Azure

Slide 20

Slide 20 text

Trilemma

Slide 21

Slide 21 text

Trilemma
Available
Economical
Endurable

Slide 22

Slide 22 text

Available + Load Balanced

Slide 23

Slide 23 text

Load Balancing

Slide 24

Slide 24 text

Monty Hall Problem: Was Marilyn vos Savant right?
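The question on this slide can be checked by brute force. The sketch below assumes the classic three-door rules (the host knows where the car is and always opens a goat door that the player did not pick). The function name is made up for illustration. It enumerates every equally likely combination of car position and first pick, then counts how often staying and switching win.

```python
from fractions import Fraction
from itertools import product


def monty_hall_win_probabilities():
    """Return (P(win | stay), P(win | switch)) for the classic 3-door game."""
    stay_wins = 0
    switch_wins = 0
    cases = 0
    # Car position and first pick are independent and uniform over 3 doors.
    for car, pick in product(range(3), repeat=2):
        cases += 1
        if pick == car:
            # First pick was the car: staying wins, switching loses.
            stay_wins += 1
        else:
            # First pick was a goat: the host must open the other goat door,
            # so the only door left to switch to hides the car.
            switch_wins += 1
    return Fraction(stay_wins, cases), Fraction(switch_wins, cases)


stay, switch = monty_hall_win_probabilities()
print(f"stay wins: {stay}, switch wins: {switch}")
```

Under those rules switching wins two times out of three, which is the answer vos Savant gave.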

Slide 25

Slide 25 text

Server-side Load Balancing
Example: Fabio

Slide 26

Slide 26 text

Look-aside Load Balancing
Example: Consul / DNS

Slide 27

Slide 27 text

Client-side Load Balancing
Example: Ribbon

Slide 28

Slide 28 text

Client-side Load Balancing
Example: Ribbon + Curator
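To make the client-side idea concrete, here is a minimal sketch in which the caller itself holds the server list and picks the next target, with no balancer in the request path. It is not Ribbon's or Curator's API. The class name, the round-robin policy, and the `mark_down`/`mark_up` hooks are assumptions for illustration, and a real client would get the server list and health from a registry.

```python
from typing import Iterable, List, Set


class RoundRobinClientBalancer:
    """Hypothetical client-side balancer: round-robin over healthy servers."""

    def __init__(self, servers: Iterable[str]) -> None:
        self._servers: List[str] = list(servers)
        if not self._servers:
            raise ValueError("at least one server is required")
        self._healthy: Set[str] = set(self._servers)
        self._next = 0  # index of the next candidate in the rotation

    def mark_down(self, server: str) -> None:
        """Stop routing to a server the client believes is unavailable."""
        self._healthy.discard(server)

    def mark_up(self, server: str) -> None:
        """Resume routing to a known server."""
        if server in self._servers:
            self._healthy.add(server)

    def choose(self) -> str:
        """Return the next healthy server, skipping unhealthy ones."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next % len(self._servers)]
            self._next += 1
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinClientBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print([balancer.choose() for _ in range(4)])
```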

Slide 29

Slide 29 text

Problems

Slide 30

Slide 30 text

Load Shedding
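One simple way to shed load is to cap the number of requests in flight and reject the excess immediately instead of queueing it. The sketch below assumes that policy. The `LoadShedder` class, the `handle` helper, and the 503-style response tuple are illustrative names, not part of any framework.

```python
import threading
from typing import Callable, Tuple


class LoadShedder:
    """Hypothetical admission control: cap concurrent in-flight requests."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Admit the request if there is capacity; never block."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Free one slot after an admitted request finishes."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(shedder: LoadShedder, work: Callable[[], str]) -> Tuple[int, str]:
    """Run work if admitted, otherwise answer with an overload status."""
    if not shedder.try_acquire():
        return 503, "overloaded, try again later"
    try:
        return 200, work()
    finally:
        shedder.release()


shedder = LoadShedder(max_in_flight=100)
print(handle(shedder, lambda: "reminder stored"))
```

Rejecting early keeps latency bounded for the requests that are admitted, at the cost of pushing retries onto the caller.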

Slide 31

Slide 31 text

“Load Balancing is almost Impossible”
- Tyler McMullen, https://www.infoq.com/presentations/load-balancing/

Slide 32

Slide 32 text

Alternate Reliability

Slide 33

Slide 33 text

Asynchronous Architectures

Slide 34

Slide 34 text

Asynchronous Architectures
Examples: RabbitMQ, Kafka, Kinesis, SQS

Slide 35

Slide 35 text

Sample Product: Outbound
Part 1
User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
Service receives SMS
Cron gets activated when time is right.
Call the User

Slide 36

Slide 36 text

Outbound

Slide 37

Slide 37 text

Scope of Failure: Outbound

Slide 38

Slide 38 text

Retries

Slide 39

Slide 39 text

Retries: Transient Failures

Slide 40

Slide 40 text

Exponential Backoff: Short-term Transient Failures
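A minimal sketch of retrying with exponential backoff, for example around the outbound call to the user. The function name, the `TransientError` type, the default delays, and the use of "full jitter" (sleeping a random fraction of the current ceiling) are assumptions for illustration. `sleep` and `rng` are injectable so the behaviour can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call operation, doubling the wait ceiling after each transient failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Ceiling grows 0.5s, 1s, 2s, ... and is capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(rng() * ceiling)
    raise ValueError("max_attempts must be at least 1")
```

Only errors known to be transient are retried here. Anything else propagates immediately, since retrying a permanent failure only adds load.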

Slide 41

Slide 41 text

Circuit Breaking: Long-term Transient Failures
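A circuit breaker stops calling a dependency that keeps failing and only probes it again after a cool-down. The sketch below assumes a simple policy: open after a fixed number of consecutive failures, fail fast while open, allow one trial call once the timeout has elapsed. Class names, thresholds, and the injectable `clock` are illustrative, not taken from a specific library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without touching the dependency."""


class CircuitBreaker:
    """Hypothetical breaker: closed -> open -> half-open -> closed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Backoff suits failures that clear in seconds. The breaker covers the case where the dependency stays down long enough that continued retries would only waste resources.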

Slide 42

Slide 42 text

Revisited

Slide 43

Slide 43 text

Dilemma
At-least-once
Exactly-once
At-most-once

Slide 44

Slide 44 text

At-most-once delivery

Slide 45

Slide 45 text

At-least-once delivery

Slide 46

Slide 46 text

Exactly-once delivery

Slide 47

Slide 47 text

Exactly-once delivery = At-least-once delivery + Exactly-once processing

Slide 48

Slide 48 text

Keys to Only-Once delivery
Atomic
Window
Idempotent
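One possible reading of these three keys, sketched as a consumer on top of at-least-once delivery: every message carries an ID, the "seen?" check and the handler run as one atomic step (a lock here, a database transaction in a real system), and IDs are remembered only for a bounded window. The class name, the in-memory store, and the one-hour default window are assumptions for illustration.

```python
import threading
import time
from typing import Any, Callable, Dict


class IdempotentProcessor:
    """Hypothetical dedupe layer: process each message ID once per window."""

    def __init__(
        self,
        handler: Callable[[Any], None],
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._handler = handler
        self._window = window_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}  # message_id -> time processed
        self._lock = threading.Lock()

    def process(self, message_id: str, payload: Any) -> bool:
        """Return True if the handler ran, False if this was a duplicate."""
        with self._lock:  # check + handle + record happen as one step
            now = self._clock()
            # Forget IDs older than the window so memory stays bounded.
            expired = [m for m, t in self._seen.items() if now - t >= self._window]
            for m in expired:
                del self._seen[m]
            if message_id in self._seen:
                return False  # redelivery inside the window: skip
            # If the handler raises, the ID is not recorded, so a later
            # redelivery of the same message gets another attempt.
            self._handler(payload)
            self._seen[message_id] = now
            return True
```

The window is the trade-off: a duplicate that arrives after its ID has expired is processed again, so the window has to be longer than the broker's longest redelivery delay.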

Slide 49

Slide 49 text

Out-of-Order delivery

Slide 50

Slide 50 text

Revisit

Slide 51

Slide 51 text

Sample Product
User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
Service receives SMS
Cron gets activated when time is right.
Call the User

Slide 52

Slide 52 text

Problems of State

Slide 53

Slide 53 text

Problems of State

Slide 54

Slide 54 text

Locked / Serialization

Slide 55

Slide 55 text

Master / Master / Slave

Slide 56

Slide 56 text

Clustering

Slide 57

Slide 57 text

Scalability
Data Replication
Reduced Communication
Logic/Data Decentralization

Slide 58

Slide 58 text

CAP Theorem [Sab topi pehna rahe: Hindi idiom, literally "everyone is putting a cap on you", i.e. "everyone is fooling you"]

Slide 59

Slide 59 text

Trilemma
Available
Partition Tolerant
Consistent

Slide 60

Slide 60 text

PACELC Theorem

Slide 61

Slide 61 text

Dilemma
Consistency
Latency

Slide 62

Slide 62 text

Revised Flow

Slide 63

Slide 63 text

What about Spanner?
What about Calvin?

Slide 64

Slide 64 text

Reliable System
Scalable
Correct
Transparent

Slide 65

Slide 65 text

01 Access Transparency
02 Location Transparency
03 Concurrency Transparency
04 Failure Transparency

Slide 66

Slide 66 text

01 Size Scalability
02 Geographical Scalability

Slide 67

Slide 67 text

Summary
Consistent
Available
Economical
Low Latency

Slide 68

Slide 68 text

All Put Together

Slide 69

Slide 69 text

01 Embrace your Bugs
02 No Silver Bullet
03 Cost to Everything
04 Product First

Slide 70

Slide 70 text

Does anyone have any questions?
Thanks