Reliability of Distributed Systems

Piyush Verma

June 22, 2019

Transcript

  1. Reliability of Distributed Systems
    - Piyush Verma

  2. Every product either dies a hero or lives long enough to hit Reliability issues.

  3. Take it and Go
    Customer Empathy
    No Chooran (no snake oil).
    Cost to Everything
    Architectures adapt to $ Priority

  4. Sample Product
    User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    Service receives SMS
    Cron gets activated when time is right.
    Call the User

  5. Sample Product: Inbound
    User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    Service receives SMS
    Cron gets activated when time is right.
    Call the User

  6. Sample Product: Outbound
    User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    Service receives SMS
    Cron gets activated when time is right.
    Call the User

  7. Inbound Connection

  8. “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
    - Leslie Lamport
    https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Distribution.pdf

  9. Four Flavors of Failure
    Disk
    Network
    CPU
    Memory

  10. Network is Reliable
    Intra-LAN latency is ~ Zero
    Network is Homogeneous
    Network cost is Zero

  11. Scope of Failures: Again

  12. At least one server is online
    All servers are below 100%
    All servers are responding within x ms.
    All of the above.

  13. #1 Server is Unavailable

  14. Replication
    Available

  15. Replication
    Available

  16. Available != Load Balanced

  17. Load Balanced

  18. Architecture of a Balancer

  19. Who uses What?
    GCP
    AWS
    On-Prem
    Azure

  20. Trilemma
    Available
    Economical
    Endurable

  21. Available + Load Balanced

  22. Load Balancing

  23. Monty Hall Problem: Was Marilyn vos Savant right?

  24. Server-side Load Balancing
    Example: Fabio

  25. Look-aside Load Balancing
    Example: Consul/DNS

  26. Client-side Load Balancing
    Example: Ribbon

  27. Client-side Load Balancing
    Example: Ribbon + Curator

  28. Load Shedding
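
A minimal sketch of load shedding, assuming a single-process service that caps the number of in-flight requests and rejects the rest quickly instead of queueing them. The names `LoadShedder` and `handle_request`, and the 503-style status code, are hypothetical illustrations and do not come from the talk.

```python
import threading


class LoadShedder:
    """Admit at most max_in_flight concurrent requests (hypothetical helper)."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Return True if the request is admitted, False if it should be shed.
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        # Called once an admitted request has finished.
        with self._lock:
            self._in_flight -= 1


def handle_request(shedder, work):
    # Shed early: a fast rejection is cheaper than a slow timeout.
    if not shedder.try_acquire():
        return 503, "overloaded, try again later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

The limit itself is a tuning decision; this sketch only shows the admit-or-reject step.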

  29. “Load Balancing is almost Impossible”
    - Tyler McMullen
    https://www.infoq.com/presentations/load-balancing/

  30. Alternate Reliability

  31. Asynchronous Architectures

  32. Asynchronous Architectures
    Example: RabbitMQ, Kafka, Kinesis, SQS

  33. Sample Product: Outbound (Part 1)
    User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    Service receives SMS
    Cron gets activated when time is right.
    Call the User

  34. Scope of Failure: Outbound

  35. Retries: Transient Failures

  36. Exponential Backoff: Short-term Transient Failures
    ✋ ✋ ✋ ✋
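
Retrying with exponential backoff could look roughly like the sketch below. The function name, the default delays, and the use of full jitter are assumptions made for illustration, not details taken from the talk. `sleep` and `rng` are injectable only so the behaviour can be checked without waiting.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles on every attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random fraction of the ceiling so that many
            # clients retrying at once do not hit the server in lockstep.
            sleep(rng() * ceiling)
```

In practice only errors known to be transient (timeouts, connection resets) would be retried; catching every `Exception` keeps the sketch short.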

  37. Circuit Breaking: Long-term Transient Failures
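
A circuit breaker for longer outages might be sketched as follows. The class name, thresholds, and the simplified closed/open/half-open handling are assumptions for illustration; libraries that implement this pattern differ in detail.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of piling load on a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Unlike a retry loop, the breaker stops calling the dependency for `reset_timeout` seconds, which suits failures that outlast a few backoff rounds.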

  38. Dilemma
    At-least-once
    Exactly-once
    At-most-once

  39. At-most-once delivery

  40. At-least-once delivery

  41. Exactly-once delivery

  42. Exactly-once delivery = At-least-once delivery + Exactly-once processing

  43. Keys to Only-Once delivery
    Atomic Window
    Idempotent
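
One way to read the "Idempotent" key is a consumer that tolerates redelivery: the broker may deliver a message more than once, but the side effect happens once per message ID. The sketch below assumes every message carries a unique `id`; the class name, the message fields, and the in-memory set are hypothetical stand-ins for a durable store.

```python
class ReminderConsumer:
    """Process each reminder message at most once, even if it is redelivered."""

    def __init__(self):
        self.processed_ids = set()  # stand-in for a durable deduplication store
        self.calls_placed = []      # stand-in for the real side effect

    def handle(self, message):
        msg_id = message["id"]
        if msg_id in self.processed_ids:
            return False  # duplicate delivery: acknowledge and do nothing
        # In a real system the side effect and the record of the ID would have
        # to be committed atomically; here they are two plain in-memory steps.
        self.calls_placed.append(message["phone"])
        self.processed_ids.add(msg_id)
        return True
```

Delivering the same message twice then produces a single call:

```python
consumer = ReminderConsumer()
msg = {"id": "reminder-1", "phone": "+10000000000"}
consumer.handle(msg)
consumer.handle(msg)  # redelivery is ignored
```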

  44. Out-of-Order delivery

  45. Sample Product
    User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    Service receives SMS
    Cron gets activated when time is right.
    Call the User

  46. Problems of State

  47. Problems of State

  48. Locked/Serialization

  49. Master/Master/Slave

  50. Clustering

  51. Scalability
    Data Replication
    Reduced Communication
    Logic/Data Decentralization

  52. CAP Theorem
    [Sab topi pehna rahe: "everyone is putting a cap on you", a Hindi idiom for being fooled]

  53. Trilemma
    Available
    Partition
    Consistent

  54. PACELC Theorem

  55. Dilemma
    Consistency
    Latency

  56. Revised Flow

  57. What about Spanner?
    What about Calvin?

  58. Reliable System
    Scalable
    Correct
    Transparent

  59. Access Transparency
    Location Transparency
    Concurrency Transparency
    Failure Transparency

  60. Size Scalability
    Geographical Scalability

  61. Summary
    Consistent
    Available
    Economical
    Low Latency

  62. All Put Together

  63. Embrace your Bugs
    No Silver Bullet
    Cost to Everything
    Product First

  64. Thanks
    Does anyone have any questions?
    [email protected]