Reliability

Piyush Verma
September 21, 2019

Every product either dies a hero or lives long enough to hit reliability issues.
Whether it’s your code or a service you connect to, a disk will fail, a network will partition, a CPU will throttle, or memory will fill up.
As you go about fixing this, what does failure cost, in both effort and lost business, and how much does each nine of reliability cost?
The talk takes a simple sample product and examines each failure point in depth. We take one fault at a time and introduce incremental changes to the architecture, the product, and the supporting structure, such as monitoring and logging, to detect and overcome those failures.
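To make "each nine" concrete, here is a small arithmetic sketch of the yearly downtime budget each additional nine leaves. The function name and the 365-day year are assumptions for illustration, not figures from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability with `nines` nines (2 means 99%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(1, 6):
        availability = 100 * (1 - 10 ** -n)
        budget = allowed_downtime_minutes(n)
        print(f"{availability:.3f}% available -> {budget:.1f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten: three nines allow about 8.8 hours of downtime a year, four nines under an hour.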

Transcript

  1. Reliability
    - Piyush Verma

  2. Who is more Reliable?
    - Nginx
    - Fabio
    - Traefik
    - Kong

  3. Who is more Reliable?
    - Cassandra
    - InfluxDB
    - TimeseriesDB
    - Prometheus

  4. Who is more Reliable?
    - Zookeeper
    - Etcd
    - Consul
    - So-and-so

  5. 01. At least one server is online
    02. All servers are below 100%
    03. All servers are responding within x ms.
    04. All of the above.

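The four candidate definitions of "reliable" on this slide can be written as predicates over a fleet of servers. This is only a sketch: `ServerStatus`, its fields, and the reading of "below 100%" as resource utilization are assumptions, since the slide does not say which metric is meant.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerStatus:
    online: bool
    utilization: float  # 0.0 to 1.0; which resource this measures is an assumption
    latency_ms: float


def at_least_one_online(fleet: List[ServerStatus]) -> bool:
    return any(s.online for s in fleet)


def all_below_capacity(fleet: List[ServerStatus]) -> bool:
    return all(s.utilization < 1.0 for s in fleet)


def all_within_latency(fleet: List[ServerStatus], max_ms: float) -> bool:
    return all(s.latency_ms <= max_ms for s in fleet)


def all_of_the_above(fleet: List[ServerStatus], max_ms: float) -> bool:
    return (
        at_least_one_online(fleet)
        and all_below_capacity(fleet)
        and all_within_latency(fleet, max_ms)
    )
```

A fleet can satisfy the first definition while failing the last, which is why the definition has to be agreed on before anyone measures it.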

  6.

  7. Sample Product
    - User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    - Service receives SMS
    - Cron gets activated when time is right
    - Call the User

  8. Sample Product: Inbound
    - User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    - Service receives SMS
    - Cron gets activated when time is right
    - Call the User

  9. Sample Product: Outbound
    - User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    - Service receives SMS
    - Cron gets activated when time is right
    - Call the User

  10. Inbound Connection

  11. “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
    - Leslie Lamport
    https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Distribution.pdf

  12. Four Flavors of Failure
    - Disk
    - Network
    - CPU
    - Memory

  13. Scope of Failures: Again

  14. Who is more Reliable?
    - AWS
    - GCP
    - Azure

  15. #1 Server is Unavailable

  16. Failover
    Available

  17. Failover
    Available

  18. Failover != Load Balanced

  19. Load Balanced

  20. Architecture of a Balancer

  21. Who uses What?
    - GCP
    - AWS
    - On-Prem
    - Azure

  22. Failover + Load Balanced

  23. Who is more Reliable?
    - 3-Rack
    - 4-Rack
    - 5-Rack

  24. Trilemma
    - Available
    - Economical
    - Endurable

  25. Load Balancing

  26. Who is more Reliable?
    - Random
    - Round-Robin
    - Last Frequently Used
    - Least Connections

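Three of the strategies named on this slide can be sketched as interchangeable backend pickers. The class and method names are hypothetical, and "Last Frequently Used" is left out because the slide does not define it.

```python
import itertools
import random


class RoundRobin:
    """Hands out backends in a fixed rotation."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def pick(self):
        return next(self._cycle)


class RandomChoice:
    """Picks a backend uniformly at random."""

    def __init__(self, backends, rng=None):
        self._backends = list(backends)
        self._rng = rng or random.Random()

    def pick(self):
        return self._rng.choice(self._backends)


class LeastConnections:
    """Picks the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        self._active = {backend: 0 for backend in backends}

    def pick(self):
        backend = min(self._active, key=self._active.get)
        self._active[backend] += 1
        return backend

    def release(self, backend):
        # Call when the request to `backend` has finished.
        self._active[backend] -= 1
```

Round-robin and random need no feedback from the backends, while least-connections only works if every finished request is reported back through `release`.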

  27. Problems

  28. Who is more Reliable?
    - Client Side Load Balancing
    - Server Side Load Balancing
    - Look Aside Load Balancing

  29. Server-side Load Balancing
    Example: Fabio

  30. Look-aside Load Balancing
    Example: Consul/DNS

  31. Client-side Load Balancing
    Example: Ribbon

  32. Client-side Load Balancing
    Example: Ribbon + Curator

  33. Load Shedding

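One common form of load shedding is to cap the number of in-flight requests and reject the excess immediately instead of queueing it. The sketch below assumes that form; `LoadShedder`, `Overloaded`, and the limit are hypothetical names, not something the slides specify.

```python
import threading


class Overloaded(Exception):
    """Raised when a request is rejected to protect the service."""


class LoadShedder:
    def __init__(self, max_in_flight: int):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def handle(self, request, handler):
        # Admit the request only if there is spare capacity.
        with self._lock:
            if self._in_flight >= self._max:
                raise Overloaded("shedding load")
            self._in_flight += 1
        try:
            return handler(request)
        finally:
            with self._lock:
                self._in_flight -= 1
```

Rejecting early keeps latency bounded for the requests that are admitted, at the price of failing some callers outright.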

  34. “Load Balancing is almost Impossible”
    - Tyler McMullen
    https://www.infoq.com/presentations/load-balancing/

  35. Sample Product: Outbound
    Part 1
    - User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    - Service receives SMS
    - Cron gets activated when time is right
    - Call the User

  36. Outbound

  37. Scope of Failure: Outbound

  38. Retries

  39. Retries: Transient Failures

  40. Exponential Backoff: Short term Transient Failures
    ✋ ✋ ✋ ✋

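A minimal sketch of retrying with exponential backoff: the wait doubles after each failed attempt, up to a cap. `TransientError`, the default delays, and the random jitter are assumptions added for illustration; the slide names only the technique.

```python
import random
import time


class TransientError(Exception):
    """A failure that is expected to clear up on its own (hypothetical type)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or `max_attempts` is exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter (an addition, not from the slide) spreads out clients
            # that would otherwise all retry at the same instant.
            sleep(delay * rng())
```

Passing `sleep` and `rng` as parameters keeps the function testable without real waiting.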

  41. Circuit Breaking: Long Term Transient Failures

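When a dependency stays down for a long time, retrying only adds load. A circuit breaker fails fast instead and probes again after a cool-down. The sketch below is one simple variant; the class name, threshold, and timeout are assumptions, not the talk's implementation.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let this one call through as a probe.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed probe
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Unlike backoff, the breaker shares its state across callers, so one caller's failures spare the others from waiting on a dead dependency.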

  42. Who is more Reliable?
    - Retry-Once
    - Keep-Retrying
    - Circuit Breaking

  43. Who is more Reliable?
    - At-least Once
    - Exactly-Once
    - At-most Once

  44. At-most once delivery

  45. At-least once delivery

  46. Exactly once delivery

  47. Exactly once delivery = At-least-once Delivery + Exactly-once Processing

  48. Keys to Only-Once delivery
    Atomic Window
    Idempotent

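Idempotent processing is what lets at-least-once delivery behave like exactly-once: a redelivered message is recognized and skipped. The sketch below assumes every message carries a unique ID and keeps the seen IDs in memory; `ReminderProcessor` and its fields are hypothetical. A real system would need a durable store and would have to record the ID atomically with the side effect.

```python
class ReminderProcessor:
    """Places a call for each reminder message at most once."""

    def __init__(self):
        self._processed_ids = set()  # in-memory only; an assumption for brevity
        self.calls_placed = []

    def handle(self, message_id: str, phone_number: str) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._processed_ids:
            return False  # redelivery: skip the side effect
        self.calls_placed.append(phone_number)  # stand-in for calling the user
        self._processed_ids.add(message_id)
        return True
```

Delivering the same message twice then results in a single call, for example `handle("m1", "555-0100")` followed by the same call again.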

  49. Out-of-Order delivery

  50. Revisit

  51. Who is more Reliable?
    - Kafka
    - Celery
    - RabbitMQ
    - Sidekiq

  52. “Guaranteed delivery in multi-party system is almost Impossible”

  53. Sample Product
    - User sends SMS "Remind me to buy milk at 6:30 PM" to 53308
    - Service receives SMS
    - Cron gets activated when time is right
    - Call the User

  54. Problems of State

  55. Locked/Serialization

  56. Master/ Master/ Slave

  57. Clustering

  58. Who is more Reliable?
    - Master-Master
    - Master-Slave
    - Clustering
    - Eventually Consistent

  59. CAP Theorem
    [Everyone is fooling you]

  60. What is Better?
    - Available
    - Partition
    - Consistent

  61. PACELC Theorem

  62. What is Better?
    - Consistency
    - Latency

  63. Revised Flow

  64.

  65. “Absolute availability is almost Impossible”

  66. Reliable System
    - Scalable
    - Correct
    - Transparent

  67. Summary
    - Consistent
    - Available
    - Economical
    - Low Latency

  68. Does anyone have any questions?
    [email protected]
    Thanks