Reliability of Distributed Systems

Piyush Verma

June 22, 2019

Transcript

  1. Reliability of Distributed Systems - Piyush Verma

  2. Every product either dies a hero or lives long enough to hit Reliability issues.

  3. Take it and Go: Customer Empathy; No Chooran; Cost to Everything; Architectures adapt to $ Priority

  4. Sample Product: User sends SMS “Remind me to buy milk at 6:30 PM” to 53308. Service receives SMS. Cron gets activated when time is right. Call the User.

  5. Sample Product: Inbound. User sends SMS “Remind me to buy milk at 6:30 PM” to 53308. Service receives SMS. Cron gets activated when time is right. Call the User.

  6. Sample Product: Outbound. User sends SMS “Remind me to buy milk at 6:30 PM” to 53308. Service receives SMS. Cron gets activated when time is right. Call the User.

  7. Inbound Connection

  8. “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable” (Leslie Lamport, https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Distribution.pdf)

  9. Four Flavors of Failure: Disk, Network, CPU, Memory

  10. Network is Reliable; Intra-LAN latency is ~Zero; Network is Homogeneous; Network cost is Zero

  11. Scope of Failures: Again

  12. At least one server is online; All servers are below 100%; All servers are responding within x ms; All of the above.

  13. #1 Server is Unavailable

  14. Replication Available

  15. Replication Available

  16. Available != Load Balanced

  17. Load Balanced

  18. Architecture of a Balancer

  19. Who uses What? GCP, AWS, On-Prem, Azure

  20. Trilemma

  21. Trilemma: Available, Economical, Endurable

  22. Available + Load Balanced

  23. Load Balancing

  24. Monty Hall Problem: Was Marilyn vos Savant right?

  25. Server-side Load Balancing. Example: Fabio

  26. Look-aside Load Balancing. Example: Consul/DNS

  27. Client-side Load Balancing. Example: Ribbon

  28. Client-side Load Balancing. Example: Ribbon + Curator

  29. Problems

  30. Load Shedding

  31. “Load Balancing is almost Impossible” (Tyler McMullen, https://www.infoq.com/presentations/load-balancing/)

  32. Alternate Reliability

  33. Asynchronous Architectures

  34. Asynchronous Architectures. Example: RabbitMQ, Kafka, Kinesis, SQS

  35. Sample Product: Outbound (Part 1). User sends SMS “Remind me to buy milk at 6:30 PM” to 53308. Service receives SMS. Cron gets activated when time is right. Call the User.

  36. Outbound

  37. Scope of Failure: Outbound

  38. Retries

  39. Retries: Transient Failures

  40. Exponential Backoff: Short term Transient Failures ✋ ✋ ✋ ✋ ✋

  41. Circuit Breaking: Long Term Transient Failures

  42. Revisited

  43. Dilemma: At-least Once, Exactly-Once, At-most Once

  44. At-most once delivery

  45. At-least once delivery

  46. Exactly once delivery

  47. Exactly once delivery = At-least-once Delivery + Exactly-once Processing

  48. Keys to Only-Once delivery: Atomic, Window, Idempotent

  49. Out-of-Order delivery

  50. Revisit

  51. Sample Product: User sends SMS “Remind me to buy milk at 6:30 PM” to 53308. Service receives SMS. Cron gets activated when time is right. Call the User.

  52. Problems of State

  53. Problems of State

  54. Locked/Serialization

  55. Master/Master/Slave

  56. Clustering

  57. Scalability: Data Replication, Reduced Communication, Logic/Data Decentralization

  58. CAP Theorem [Sab topi pehna rahe: “everyone is putting a cap on you”, a Hindi idiom for being fooled]

  59. Trilemma: Available, Partition, Consistent

  60. PACELC Theorem

  61. Dilemma: Consistency, Latency

  62. Revised Flow

  63. What about Spanner? What about Calvin?

  64. Reliable System: Scalable, Correct, Transparent

  65. Access Transparency; Location Transparency; Concurrency Transparency; Failure Transparency

  66. Size Scalability; Geographical Scalability

  67. Summary: Consistent, Available, Economical, Low Latency

  68. All Put Together

  69. Embrace your Bugs; No Silver Bullet; Cost to Everything; Product First

  70. Thanks. Does anyone have any questions? piyush@piyushverma.net
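
Slides 38 to 40 name retries and exponential backoff as the response to short-term transient failures on the outbound path. The slides carry no code, so the following is only a minimal sketch of the idea: the names `TransientError`, `backoff_delays`, and `retry_with_backoff`, the default delays, and the use of random jitter are all assumptions made for illustration, not something taken from the talk.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def backoff_delays(base=0.5, factor=2.0, cap=30.0, attempts=5):
    """Un-jittered delay ceilings: base, base*factor, ... limited by cap."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def retry_with_backoff(call, attempts=5, base=0.5, factor=2.0, cap=30.0,
                       sleep=time.sleep, rng=random.random):
    """Run call(); on TransientError wait a growing, jittered delay and retry."""
    delays = backoff_delays(base, factor, cap, attempts)
    for attempt, ceiling in enumerate(delays, start=1):
        try:
            return call()
        except TransientError:
            if attempt == attempts:
                raise  # retries exhausted, surface the failure to the caller
            # Random jitter spreads out clients that failed at the same moment.
            sleep(ceiling * rng())
```

In the sample product, `call` could be the function that places the reminder call to the user. The `sleep` and `rng` parameters exist only so the delays can be controlled in tests.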
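
Slide 41 pairs circuit breaking with long-term transient failures: once a dependency has failed repeatedly, callers stop trying for a while instead of retrying into it. A rough sketch of that behaviour, assuming a simple consecutive-failure threshold and a fixed reset timeout (`CircuitBreaker`, `CircuitOpenError`, and both parameters are hypothetical names and values chosen here, not from the talk), might look like this:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast, dependency presumed down")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            half_open = self.opened_at is not None
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit again
        return result
```

Production breakers usually track failure rates over a window and limit concurrent trial calls while half-open; this sketch leaves both out to keep the state machine visible.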
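
Slides 47 and 48 state that exactly-once delivery is at-least-once delivery plus exactly-once processing, with "Atomic, Window, Idempotent" as the keys. One way to read that, sketched below under stated assumptions, is a consumer that remembers recently seen message ids (the window), skips repeats (idempotence), and records the id in the same critical section as the side effect (atomicity). `IdempotentConsumer`, its `window` size, and the in-memory lock standing in for a real transaction are all hypothetical.

```python
import threading


class IdempotentConsumer:
    def __init__(self, handler, window=1000):
        self.handler = handler
        self.window = window            # how many recent ids to remember
        self._seen = {}                 # insertion-ordered dict used as a set
        self._lock = threading.Lock()   # stands in for a real transaction

    def deliver(self, message_id, payload):
        """Return True if processed, False if dropped as a duplicate."""
        with self._lock:
            if message_id in self._seen:
                return False            # redelivery inside the window: skip
            # If the handler raises, the id is never recorded, so the broker's
            # redelivery gets another chance to process the message.
            self.handler(payload)
            self._seen[message_id] = True
            if len(self._seen) > self.window:
                oldest = next(iter(self._seen))
                del self._seen[oldest]  # forget ids that fell out of the window
            return True
```

The window is the trade-off: a duplicate that arrives after its id has been evicted is processed again, so the window has to be at least as long as the broker's longest redelivery delay.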