Reliability

Piyush Verma
September 21, 2019

Every product either dies a hero or lives long enough to hit reliability issues.
Whether it’s your code or a service that you connect to, a disk will fail, a network will partition, a CPU will throttle, or memory will fill up.
While you go about fixing this, what does failure cost, in both effort and lost business, and how much does each additional nine of reliability cost?
The talk takes a simple sample product and examines each failure point in depth. We take one fault at a time and introduce incremental changes to the architecture, the product, and the support structure, such as monitoring and logging, to detect and overcome those failures.

Transcript

  1. Reliability - Piyush Verma

  2. Who is more Reliable? Nginx, Fabio, Traefik, Kong

  3. Who is more Reliable? Cassandra, InfluxDB, TimeseriesDB, Prometheus

  4. Who is more Reliable? Zookeeper, Etcd, Consul, Filaan Dhimkaan (so-and-so)

  5. 01 At least one server is online. 02 All servers are below 100%. 03 All servers are responding within x ms. 04 All of the above.

  6. [slide has no text]

  7. Sample Product: User sends SMS “Remind me to buy milk at 6:30 PM” to 53308. Service receives SMS. Cron gets activated when time is right. Call the user.

  8. Sample Product: Inbound. User sends SMS “Remind me to buy milk at 6:30 PM” to 53308. Service receives SMS. Cron gets activated when time is right. Call the user.

  9. Sample Product: Outbound. User sends SMS “Remind me to buy milk at 6:30 PM” to 53308. Service receives SMS. Cron gets activated when time is right. Call the user.

  10. Inbound Connection

  11. “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable” (Leslie Lamport, https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Distribution.pdf)

  12. Four Flavors of Failure: Disk, Network, CPU, Memory

  13. Scope of Failures: Again

  14. Who is more Reliable? AWS, GCP, Azure

  15. #1 Server is Unavailable

  16. Failover Available

  17. Failover Available

  18. Failover != Load Balanced

  19. Load Balanced

  20. Architecture of a Balancer

  21. Who uses What? GCP, AWS, On-Prem, Azure

  22. Failover + Load Balanced

  23. Who is more Reliable? 3-Rack, 4-Rack, 5-Rack

  24. Trilemma: Available, Economical, Endurable

  25. Load Balancing

  26. Who is more Reliable? Random, Round-Robin, Last Frequently Used, Least Connections

  27. Problems

  28. Who is more Reliable? Client-side Load Balancing, Server-side Load Balancing, Look-aside Load Balancing

  29. Server-side Load Balancing Example: Fabio

  30. Look-aside Load Balancing Example: Consul/DNS

  31. Client-side Load Balancing Example: Ribbon

  32. Client-side Load Balancing Example: Ribbon + Curator

  33. Load Shedding

  34. “Load Balancing is almost Impossible” (Tyler McMullen, https://www.infoq.com/presentations/load-balancing/)

  35. Sample Product: Outbound. User sends SMS “Remind me to buy milk at 6:30 PM” to 53308. Service receives SMS. Cron gets activated when time is right. Call the user. Part 1

  36. Outbound

  37. Scope of Failure: Outbound

  38. Retries

  39. Retries: Transient Failures

  40. Exponential Backoff: Short-term Transient Failures ✋ ✋ ✋ ✋ ✋

  41. Circuit Breaking: Long-term Transient Failures

  42. Who is more Reliable? Retry-Once, Keep-Retrying, Circuit Breaking

  43. Who is more Reliable? At-least Once, Exactly-Once, At-most Once

  44. At-most once delivery

  45. At-least once delivery

  46. Exactly once delivery

  47. Exactly once delivery = At-least-once Delivery + Exactly-once Processing

  48. Keys to Only-Once delivery: Atomic, Window, Idempotent

  49. Out-of-Order delivery

  50. Revisit

  51. Who is more Reliable? Kafka, Celery, RabbitMQ, Sidekiq

  52. “Guaranteed delivery in multi-party system is almost Impossible”

  53. Sample Product: User sends SMS “Remind me to buy milk at 6:30 PM” to 53308. Service receives SMS. Cron gets activated when time is right. Call the user.

  54. Problems of State

  55. Locked/Serialization

  56. Master/ Master/ Slave

  57. Clustering

  58. Who is more Reliable? Master-Master, Master-Slave, Clustering, Eventually Consistent

  59. CAP Theorem [Hindi idiom: everyone is putting a cap on you, i.e. fooling you]

  60. What is Better? Available, Partition, Consistent

  61. PACELC Theorem

  62. What is Better? Consistency, Latency

  63. Revised Flow

  64. [slide has no text]

  65. “Absolute availability is almost Impossible”

  66. Reliable System: Scalable, Correct, Transparent

  67. Summary: Consistent, Available, Economical, Low Latency

  68. Does anyone have any questions? Thanks. piyush@piyushverma.net
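
The ideas on slides 38 to 40 and 47 to 48 can be made concrete with a short sketch. The Python below is illustrative only and is not taken from the talk. It retries a send with exponential backoff and jitter, which gives at-least-once delivery, and pairs it with a receiver that deduplicates on a message id, which gives exactly-once processing. The names `TransientError`, `retry_with_backoff`, `IdempotentReceiver`, and `make_flaky_channel` are hypothetical, and the in-memory `seen` set is an assumption standing in for whatever durable store a real service would need.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def retry_with_backoff(send, message, max_attempts=5, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call send(message) until it succeeds: at-least-once delivery."""
    for attempt in range(max_attempts):
        try:
            return send(message)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; a circuit breaker could take over here
            # Exponential backoff with jitter: wait a random time in
            # [0, min(max_delay, base_delay * 2**attempt)].
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


class IdempotentReceiver:
    """Processes each message id once, even if it is delivered repeatedly."""

    def __init__(self):
        self.seen = set()      # assumption: in-memory, not durable
        self.processed = []

    def receive(self, message):
        msg_id, payload = message
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.processed.append(payload)
        return "ack"


def make_flaky_channel(receiver, lose_first_acks):
    """Delivers every message but loses the first N acknowledgements."""
    state = {"lost": 0}

    def send(message):
        result = receiver.receive(message)      # the message does arrive
        if state["lost"] < lose_first_acks:
            state["lost"] += 1
            raise TransientError("ack lost")    # sender cannot tell it arrived
        return result

    return send


if __name__ == "__main__":
    receiver = IdempotentReceiver()
    send = make_flaky_channel(receiver, lose_first_acks=2)
    retry_with_backoff(send, ("reminder-1", "buy milk at 6:30 PM"))
    print(receiver.processed)
```

In this sketch the sender delivers the same reminder three times because two acknowledgements are lost, yet the receiver records it once. The retry loop alone would call the user three times; the deduplication on the message id is what keeps the repeated deliveries harmless.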