
Reliability of Distributed Systems


Piyush Verma

June 22, 2019

Transcript

  1. Reliability of Distributed Systems
    - Piyush Verma

  2. Every product either dies a hero or lives long enough to hit Reliability issues.

  3. Take it and Go
    01 Customer Empathy
    02 No Chooran (Hindi slang: no gimmicks).
    03 Cost to Everything
    04 Architectures adapt to $ Priority

  4. Sample Product
    User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    Service receives SMS
    Cron gets activated when time is right.
    Call the User

  5. Sample Product: Inbound
    User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    Service receives SMS
    Cron gets activated when time is right.
    Call the User

  6. Sample Product: Outbound
    User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    Service receives SMS
    Cron gets activated when time is right.
    Call the User

  7. Inbound Connection

  8. “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
    — Leslie Lamport
    https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Distribution.pdf

  9. Four Flavors of Failure
    Disk
    Network
    CPU
    Memory

  10. 01 Network is Reliable
    02 Intra-LAN latency is ~ Zero
    03 Network is Homogeneous
    04 Network cost is Zero

  11. Scope of Failures: Again

  12. 01 At least one server is online
    02 All servers are below 100%
    03 All servers are responding within x ms.
    04 All of the above.

  13. #1 Server is Unavailable

  14. Replication Available

  15. Replication Available

  16. Available != Load Balanced

  17. Load Balanced

  18. Architecture of a Balancer

  19. Who uses What?
    GCP
    AWS
    On-Prem
    Azure

  20. Trilemma

  21. Trilemma
    Available
    Economical
    Endurable

  22. Available + Load Balanced

  23. Load Balancing

  24. Monty Hall Problem: Was Marilyn vos Savant right?

  25. Server-side Load Balancing
    Example: Fabio

  26. Look-aside Load Balancing
    Example: Consul/DNS

  27. Client-side Load Balancing
    Example: Ribbon
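To make the client-side idea concrete: the caller itself keeps a list of servers and picks one per request, with no balancer hop in between. The sketch below is a minimal round-robin chooser. The class name `RoundRobinBalancer` and its methods are hypothetical illustrations, not Ribbon's API, and real clients usually refresh the server list from a registry.

```python
import itertools


class RoundRobinBalancer:
    """Hypothetical client-side balancer: rotate over servers not marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.down = set()
        self._counter = itertools.count()

    def mark_down(self, server):
        self.down.add(server)

    def mark_up(self, server):
        self.down.discard(server)

    def choose(self):
        # Only consider servers the client currently believes are healthy.
        healthy = [s for s in self.servers if s not in self.down]
        if not healthy:
            raise RuntimeError("no healthy servers")
        return healthy[next(self._counter) % len(healthy)]
```

The client calls `choose()` before each request and `mark_down()` when a server fails, so a dead server stops receiving traffic without any central component.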

  28. Client-side Load Balancing
    Example: Ribbon + Curator

  29. Problems

  30. Load Shedding
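Load shedding means refusing new work once the service is already saturated, so that the requests it has accepted can still finish. A minimal sketch follows, assuming a fixed cap on in-flight requests; the names `LoadShedder` and `handle`, and the 503 response, are illustrative choices rather than anything from the talk.

```python
import threading


class LoadShedder:
    """Hypothetical limiter: admit at most max_in_flight concurrent requests."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False  # saturated: shed this request
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    # Reject early and cheaply instead of queueing work we cannot serve in time.
    if not shedder.try_acquire():
        return 503, "overloaded, try again later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

Rejecting with a fast error keeps latency bounded for admitted requests; the cap itself would need tuning against measured capacity.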

  31. “Load Balancing is almost Impossible”
    — Tyler McMullen
    https://www.infoq.com/presentations/load-balancing/

  32. Alternate Reliability

  33. Asynchronous Architectures

  34. Asynchronous Architectures
    Example: RabbitMQ, Kafka, Kinesis, SQS

  35. Sample Product: Outbound, Part 1
    User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    Service receives SMS
    Cron gets activated when time is right.
    Call the User

  36. Outbound

  37. Scope of Failure: Outbound

  38. Retries

  39. Retries: Transient Failures

  40. Exponential Backoff: Short term Transient Failures
    ✋ ✋ ✋ ✋
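Exponential backoff retries a failed call after waits that double each time, so a briefly unavailable dependency is not hammered while it recovers. The sketch below assumes a hypothetical `TransientError` that marks retryable failures and adds "full jitter" (a random fraction of each delay), which is a common refinement and not something the slide specifies.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(count, base=0.5, cap=30.0):
    # Upper bound for each wait: base * 2**n, never more than cap.
    return [min(cap, base * (2 ** n)) for n in range(count)]


def call_with_retries(operation, attempts=4, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    delays = backoff_delays(attempts - 1, base, cap)
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: wait a random fraction of the current bound.
            sleep(rng() * delays[attempt])
```

In the sample product, `operation` could be the outbound call to the user; `sleep` and `rng` are injectable only so the behaviour can be tested without real waiting.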

  41. Circuit Breaking: Long Term Transient Failures
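When a dependency stays down for longer, retrying every request only adds load. A circuit breaker stops calling it for a while and fails fast instead. The sketch below is one simple variant, assuming a consecutive-failure threshold and a fixed reset timeout; the class and parameter names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

Backoff and circuit breaking complement each other: the first spaces out retries of a single request, the second protects the dependency across all requests.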

  42. Revisited

  43. Dilemma
    At-least Once
    Exactly-Once
    At-most Once

  44. At-most once delivery

  45. At-least once delivery

  46. Exactly once delivery

  47. Exactly once delivery = At-least-once Delivery + Exactly-once Processing

  48. Keys to Only-Once delivery
    Atomic Window
    Idempotent
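One way to read the "at-least-once delivery + exactly-once processing" equation is that the consumer deduplicates: the broker may redeliver, but each message ID takes effect only once. The sketch below assumes every message carries a unique `id`; the `ReminderProcessor` class and the in-memory set are hypothetical stand-ins. In a real system the side effect and the "seen" record would have to be committed atomically, for example in one database transaction.

```python
class ReminderProcessor:
    """Hypothetical idempotent consumer for the reminder product."""

    def __init__(self):
        self.processed_ids = set()  # stand-in for a durable dedupe store
        self.calls_placed = []      # stand-in for the real side effect

    def handle(self, message):
        msg_id = message["id"]
        if msg_id in self.processed_ids:
            return False  # redelivery: already handled, do nothing
        self.calls_placed.append(message["phone"])
        self.processed_ids.add(msg_id)
        return True
```

Delivering the same message twice then places only one call, which is the behaviour the user actually cares about.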

  49. Out-of-Order delivery
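A common way to cope with out-of-order delivery is to tag messages with sequence numbers and hold back anything that arrives early. The sketch below assumes the producer assigns gap-free sequence numbers starting at 0; the `Resequencer` class is a hypothetical illustration, and the slide does not say which approach the talk used.

```python
class Resequencer:
    """Hypothetical buffer that releases payloads in sequence order."""

    def __init__(self, first_seq=0):
        self.next_seq = first_seq
        self.pending = {}

    def accept(self, seq, payload):
        """Return the payloads that can now be processed in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # stale or duplicate message
        self.pending[seq] = payload
        ready = []
        # Drain every consecutive message starting at the expected number.
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

The buffer grows while a gap is outstanding, so a production version would also need a timeout or size limit for messages that never arrive.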

  50. Revisit

  51. Sample Product
    User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    Service receives SMS
    Cron gets activated when time is right.
    Call the User

  52. Problems of State

  53. Problems of State

  54. Locked/Serialization

  55. Master/ Master/ Slave

  56. Clustering

  57. Scalability
    Data Replication
    Reduced Communication
    Logic/Data Decentralization

  58. CAP Theorem
    [Sab topi pehna rahe: Hindi idiom, literally "everyone is putting a cap on you", meaning "everyone is fooling you"]

  59. Trilemma
    Available
    Partition
    Consistent

  60. PACELC Theorem

  61. Dilemma
    Consistency
    Latency

  62. Revised Flow

  63. What about Spanner?
    What about Calvin?

  64. Reliable System
    Scalable
    Correct
    Transparent

  65. 01 Access Transparency
    02 Location Transparency
    03 Concurrency Transparency
    04 Failure Transparency

  66. 01 Size Scalability
    02 Geographical Scalability

  67. Summary
    Consistent
    Available
    Economical
    Low Latency

  68. All Put Together

  69. 01 Embrace your Bugs
    02 No Silver Bullet
    03 Cost to Everything
    04 Product First

  70. Thanks
    Does anyone have any questions?
    [email protected]