A Commerce-Centric Take on Queueing Fairly at High Throughput

Logan Martel
September 21, 2022

When is a throttle more than a rate limiter? Historically, Shopify mitigated write-heavy traffic bursts of up to 5x our baseline throughput via rate limiting scripted in Nginx Lua modules at ingress on our load balancers. That solution served us well for years in scaling for some of the world's largest e-commerce flash sales. It also had drawbacks. Because overload protection at the edge tier was divorced from business logic in the application tier, the throttle was inflexible to test, to maintain, and to evolve toward a better waiting room UX. High traffic on one shop could be throttled disproportionately from one load balancer to another. Users could wait 30 minutes, only to discover that their cart's inventory had gone out of stock 20 minutes prior. The lessons we learned in moving from "off-the-shelf rate limiting" to "business-aware user queueing" apply broadly to any domain where traffic bursts could trigger a waiting room. This talk also covers our load testing and migration strategy for moving throttling away from the edge to the application tier of our Rails monolith.

Transcript

  1. Logan Martel | @martelogan • A Commerce-Centric take on High Throughput Fair Queueing
  2. 👈 Me • works on scaling Checkout @ Shopify • advocating stateful throttles today • shipped a scalable stateful throttle with: Scott Francis, Bassam Mansoob, Jay Lim, Osama Sidat, Jonathan Dupuis • 🛒 Docs as legal-lang
  3. Simple Example Queue

  4. The Plan (Roughly) • 01 “Flash Sale” Thundering Herds • 02 Prior Work & Drawbacks • 03 “Stateful Throttle” Solutions • 04 Test in prod!
  5. Flash Scale

  6. None
  7. Shopify handles some of the largest flash sales in the world
  8. • 32M Requests per minute (peak) • 11TB MySQL read I/O per second • 24B Background jobs performed • 42B API calls made to partner apps
  9. Shopify (Core) Tech Stack

  10. Shopify (Core) Tech Stack

  11. Shopify (Core) Tech Stack

  12. Browsing Storefront

  13. Add to Cart

  14. Writes during Checkout

  15. Payment Finalization

  16. Order Confirmation

  17. Shopify’s “Thundering Herd” → Write-heavy Bursts up to 5x our baseline traffic
  18. None
  19. Need Backpressure! What are some of our options?

  20. None
  21. None
  22. None
  23. Why not simply queue users in order (FIFO)? • blocking λ dequeues • “stateful” memory requirement • nevertheless, we’ll circle back to this idea
  24. None
  25. None
  26. • “leaky bucket as queue” → stateful FIFO equivalent → buffered in-order requests • “leaky bucket as meter” → stateless throttle → requests either dropped (z > β) or forwarded (z ≤ β)
  27. • token buckets are equivalent mirror images of “leaky bucket as meter” • both statelessly throttle at rate ρ → support “bursty traffic” up to burst size z ≤ β
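
As a minimal sketch of the token bucket described on this slide (not the code used in the talk), the Python below refills tokens at rate ρ up to a burst size β and either forwards or drops each request. The class name, the `allow` method, and the example numbers are all hypothetical; the only state kept is one counter and one timestamp, with no per-request queue.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `burst` tokens."""

    rate: float   # rho: sustained requests per second (assumed example value)
    burst: float  # beta: maximum burst size (assumed example value)
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        # Start full so that an initial burst of up to `burst` requests passes.
        self.tokens = self.burst

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # forwarded
        return False      # dropped: nothing is queued, the client must retry


bucket = TokenBucket(rate=2.0, burst=3.0)
burst_results = [bucket.allow(now=0.0) for _ in range(5)]
later_result = bucket.allow(now=1.0)
print(burst_results, later_result)
```

Because rejected requests are simply dropped, service order depends on who happens to retry at the right moment, which is the fairness concern the later slides return to.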
  28. Stateless Throttles

  29. Common Throttle Challenges • Capacity problem → limiting service rate to sustainable throughput • Starvation problem → ensuring prompt service for all buyers → (fast sellout) • Fairness problem → limiting deviations from FIFO service order (e.g. don’t incentivize a “race to poll”!)
  30. Compromises? Let’s consider some semi-stateful windowed approaches

  31. Fixed Window

  32. Fixed Window

  33. Fixed Window

  34. Fixed Window

  35. Fixed Window (Redis Transaction)

  36. Fixed Window (Redis via Lua)

  37. Problem: Boundary Bursts

  38. Problem: Boundary Bursts
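
The boundary-burst problem can be illustrated with a small in-memory sketch, assuming a limit of 100 requests per 60-second window (both numbers are made up). The dictionary of counters stands in for whatever shared store a real deployment would use; the class and method names are hypothetical.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Counts requests per fixed window and rejects once the limit is hit."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # window index -> requests accepted

    def allow(self, now: float) -> bool:
        window = int(now // self.window_seconds)
        if self.counts[window] >= self.limit:
            return False
        self.counts[window] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 100 requests just before the window boundary, 100 more just after it.
late = sum(limiter.allow(now=59.9) for _ in range(100))
early = sum(limiter.allow(now=60.1) for _ in range(100))
accepted_in_200ms = late + early
print(accepted_in_200ms)
```

Both batches are accepted in full because each lands in a different window, so twice the nominal limit passes within roughly 200 ms.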

  39. Adjust dynamically → Sliding Window

  40. Adjust dynamically → Sliding Window

  41. Adjust dynamically → Sliding Window

  42. Adjust dynamically → Sliding Window

  43. Adjust dynamically → Sliding Window
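
One common way to approximate a sliding window, sketched here under the assumption that the slides refer to something similar, is to weight the previous window's count by how much of it still overlaps the trailing window. All names and numbers below are hypothetical.

```python
class SlidingWindowCounter:
    """Approximates a sliding window from two adjacent fixed-window counts."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = 0
        self.current_count = 0
        self.previous_count = 0

    def _roll(self, now: float) -> None:
        window = int(now // self.window_seconds)
        if window == self.current_window:
            return
        # Carry the count over only if exactly one window has elapsed.
        if window == self.current_window + 1:
            self.previous_count = self.current_count
        else:
            self.previous_count = 0
        self.current_count = 0
        self.current_window = window

    def allow(self, now: float) -> bool:
        self._roll(now)
        # Fraction of the current window that has already elapsed.
        fraction = (now % self.window_seconds) / self.window_seconds
        estimated = self.previous_count * (1.0 - fraction) + self.current_count
        if estimated >= self.limit:
            return False
        self.current_count += 1
        return True


limiter = SlidingWindowCounter(limit=100, window_seconds=60)
late = sum(limiter.allow(now=59.9) for _ in range(100))
early = sum(limiter.allow(now=60.1) for _ in range(100))
print(late, early)
```

With the same traffic pattern that doubles throughput under a fixed window, almost none of the second batch is accepted here, because the previous window's count still weighs on the estimate.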

  44. Variations on Windows • Sliding Window log → track arrivals in-memory; pop outdated entries • Generic Cell Rate Algorithm (GCRA) → metered leaky bucket with predicted arrivals • Concurrency & congestion controls → counting semaphores & TCP-style adaptive window sizes
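
To make the "predicted arrivals" idea concrete, here is a minimal sketch of GCRA in its virtual-scheduling form. It tracks a single theoretical arrival time (TAT) and rejects requests that arrive too far ahead of it. The class name and the example values for the emission interval and burst tolerance are assumptions for illustration.

```python
class Gcra:
    """Generic Cell Rate Algorithm, tracking one theoretical arrival time."""

    def __init__(self, emission_interval: float, burst_tolerance: float) -> None:
        self.emission_interval = emission_interval  # seconds between requests at the sustained rate
        self.burst_tolerance = burst_tolerance      # how far ahead of schedule a request may arrive
        self.tat = 0.0                              # theoretical arrival time of the next request

    def allow(self, now: float) -> bool:
        tat = max(self.tat, now)
        if tat - now > self.burst_tolerance:
            return False  # too early relative to the predicted arrival time
        self.tat = tat + self.emission_interval
        return True


limiter = Gcra(emission_interval=1.0, burst_tolerance=2.0)
burst_results = [limiter.allow(now=0.0) for _ in range(4)]
later_result = limiter.allow(now=1.0)
print(burst_results, later_result)
```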
  45. None
  46. “Standard” Window Approaches Retro • Police Capacity → (limit concurrent buyers) ✅ • Don’t Starve Throughput → (fast sellout) ✅ • Promote Fairness → avoid a “race to poll” ❌
  47. Let’s try going a step beyond *just* throttling.

  48. Our North Star

  49. The Journey • 1. Stateless V1 • 2. Stateful V2 • 3. Rollout
  50. Step 1: Stateless V1 - Our Edge-tier Legacy Throttle

  51. None
  52. OpenResty Lua Module: Enables scripting NGINX load balancers to manipulate request & response traffic.
  53. OpenResty Hello World Example (with custom headers)

  54. Legacy Throttle Architecture

  55. Legacy Throttle Architecture

  56. Legacy Throttle Architecture

  57. Legacy Throttle Architecture

  58. Servicing polls first-in-first-out is unfair. New users could simply poll first to “jump the line”.
  59. Let’s issue (signed) tickets to each user for the timestamp when they arrived.
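
A signed arrival ticket could look roughly like the sketch below. The slides only say the tickets are signed; the use of HMAC-SHA256, the `payload.signature` ticket layout, the key, and the function names are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
from typing import Optional

SECRET_KEY = b"example-secret"  # hypothetical key; a real one would come from secure config


def _sign(payload: bytes, key: bytes) -> bytes:
    return base64.urlsafe_b64encode(hmac.new(key, payload, hashlib.sha256).digest())


def issue_ticket(arrival_ms: int, key: bytes = SECRET_KEY) -> str:
    """Returns '<arrival_ms>.<signature>' so clients cannot forge an earlier arrival."""
    payload = str(arrival_ms).encode()
    return payload.decode() + "." + _sign(payload, key).decode()


def verify_ticket(ticket: str, key: bytes = SECRET_KEY) -> Optional[int]:
    """Returns the arrival timestamp if the signature matches, else None."""
    payload, _, signature = ticket.partition(".")
    expected = _sign(payload.encode(), key)
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(signature.encode(), expected):
        return None
    return int(payload)


ticket = issue_ticket(1663770000000)
tampered = "0" + ticket[1:]  # client pretends to have arrived earlier
print(verify_ticket(ticket), verify_ticket(tampered))
```

Because the server can verify the timestamp without storing it, a client that polls early gains nothing: its place is determined by the ticket, not by when the poll lands.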
  60. None
  61. None
  62. None
  63. Control Theory Idea: Adjust our “accepted traffic window” on-the-fly (à la TCP). Seeking stable fair throughput just as thermostat “PID controllers” (Proportional-Integral-Derivative controllers) seek stable temperatures.
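
As a rough illustration of the control-theory idea, and not a reconstruction of the legacy throttle, the sketch below runs a textbook PID controller against a toy model in which throughput is simply 10 times the accepted window. The gains, the target of 100, and the toy model are invented for the example.

```python
from typing import Optional


class PidController:
    """Textbook PID controller driving a measurement toward a setpoint."""

    def __init__(self, kp: float, ki: float, kd: float, setpoint: float) -> None:
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.previous_error: Optional[float] = None

    def update(self, measurement: float, dt: float) -> float:
        error = self.setpoint - measurement
        self.integral += error * dt
        if self.previous_error is None:
            derivative = 0.0  # no history yet on the first update
        else:
            derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def toy_throughput(window_seconds: float) -> float:
    # Assumed toy model: each second of accepted window admits 10 requests/s.
    return 10.0 * window_seconds


controller = PidController(kp=0.02, ki=0.05, kd=0.005, setpoint=100.0)
window_seconds = 0.0
for _ in range(50):
    measured = toy_throughput(window_seconds)
    window_seconds = max(0.0, controller.update(measured, dt=1.0))
final_throughput = toy_throughput(window_seconds)
print(round(window_seconds, 3), round(final_throughput, 3))
```

In this idealized model the window settles quickly. Real traffic is noisy and delayed, which is consistent with the retro's note that the adaptive window was difficult to stabilize.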
  64. Adaptive “lag” slider

  65. Adaptive “lag” slider

  66. Adaptive “lag” slider

  67. Legacy Throttle “Adaptive Lag” Retro • worked well at prioritizing “very lagged” user poll traffic • difficult to stabilize → led to frequent window fluctuations • never “quite” stateless → led to inconsistent behaviour across load balancers → complicated scaling across regional clusters
  68. In Legacy throttle, users could also be queued for >30 mins only to discover that their cart's inventory had already gone out-of-stock
  69. Step 2: Stateful V2 - Our Application-tier Fair Waiting Room

  70. None
  71. Let’s issue (signed) tickets to each user for the timestamp when they arrived.
  72. Shopify’s “Thundering Herd” → Write-heavy Bursts up to 5x our baseline traffic
  73. None
  74. None
  75. None
  76. Consider a distribution of user arrival times

  77. Arrival tickets land in different buckets arrival tickets

  78. There’s more than one queue in this image arrival tickets

  79. Queue bins

  80. Intra-bin Queues [figure: x-axis = arrival second (integer-valued; x = 1s, 2s, 3s, . . .); y-axis = % into one-second bin (decimal-valued; e.g. y = 10%, 25.2%, 33.33% into 1s)]
  81. Idea: Limit unfairness between Queue Bins

  82. Tolerate unfairness within bins [figure: x-axis = arrival second (integer-valued; x = 1s, 2s, 3s, . . .); y-axis = % into one-second bin (decimal-valued; e.g. y = 10%, 25.2%, 33.33% into 1s)]
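
The two axes on these bin slides can be read as the quotient and remainder of an arrival timestamp divided by the bin width. The sketch below assumes millisecond timestamps and one-second bins; the function name and sample values are hypothetical.

```python
from typing import Tuple


def to_bin(arrival_ms: int, bin_width_ms: int = 1000) -> Tuple[int, float]:
    """Splits an arrival time into (bin index, percent of the way into that bin)."""
    bin_index, offset_ms = divmod(arrival_ms, bin_width_ms)
    return bin_index, 100 * offset_ms / bin_width_ms


first = to_bin(1100)   # 10% into the bin for second 1
second = to_bin(2252)  # 25.2% into the bin for second 2
third = to_bin(2900)   # same bin as `second`, later within it
print(first, second, third)
```

Under this reading, fairness is enforced on the integer part only: users in bin 1 are served before users in bin 2, while the relative order of `second` and `third` inside bin 2 is allowed to vary.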
  83. None
  84. None
  85. None
  86. Queue Library (Ruby Gem) Interface

  87. Queue Library (Ruby Gem) Interface

  88. Simple Mock Service to test Queue gem

  89. Bin Scheduling • latest_bin - bin # currently assigned to arriving users • client_bin - bin # assigned to particular user (signed & encoded) • working_bin - max eligible bin to accept poll traffic from clients
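
How the three bin counters might interact can be sketched in memory as follows. The slide defines only the three names; the rule for opening a new bin once a fixed capacity is reached, the `advance` policy, and every method name here are assumptions.

```python
class BinScheduler:
    """In-memory sketch of latest_bin / client_bin / working_bin bookkeeping."""

    def __init__(self, bin_capacity: int) -> None:
        self.bin_capacity = bin_capacity  # assumed: users per bin before a new bin opens
        self.latest_bin = 0               # bin currently assigned to arriving users
        self.latest_bin_size = 0
        self.working_bin = 0              # max bin currently allowed through

    def assign(self) -> int:
        """Returns the client_bin for a newly arrived user."""
        if self.latest_bin_size >= self.bin_capacity:
            self.latest_bin += 1
            self.latest_bin_size = 0
        self.latest_bin_size += 1
        return self.latest_bin

    def can_enter(self, client_bin: int) -> bool:
        """A poll is accepted only if the client's bin has become eligible."""
        return client_bin <= self.working_bin

    def advance(self, bins: int = 1) -> None:
        """Opens more bins, for example when capacity frees up."""
        self.working_bin = min(self.latest_bin, self.working_bin + bins)


scheduler = BinScheduler(bin_capacity=2)
client_bins = [scheduler.assign() for _ in range(5)]
before = [scheduler.can_enter(b) for b in client_bins]
scheduler.advance()
after = [scheduler.can_enter(b) for b in client_bins]
print(client_bins, before, after)
```

A late arrival in bin 2 cannot jump ahead by polling aggressively: its polls are refused until `working_bin` reaches 2, regardless of how often it asks.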
  90. Lua Routine Stored in Redis

  91. When should we ask clients to poll?

  92. Inventory Awareness - highly cacheable reads, powered by application tier. Enriched by a stateful React + GraphQL client.
  93. State as an enabler for scalability • Multi-layered caching - most requests don’t even reach Redis • Adaptive working_bin - increments can react to signals such as: ◦ compliance - do clients poll at advised poll times (not too early or late)? ◦ system health - do we have capacity to allow more traffic? • Sellout as backoff signal - traffic backoff after sellout → shorter queue times! • Horizontal Scaling - if needed, could shard bins over multiple Redis instances
  94. Step 3: Rollout! - Simulation-driven Migration

  95. None
  96. None
  97. Middleware Experiment in Prod

  98. Middleware Experiment in Prod

  99. None
  100. Similar Concept in Amusement Parks • See Defunctland’s “Fastpass: A Complicated History” on YouTube
  101. Simulates Diverse Polling Behavior

  102. Simulates Diverse Polling Behavior

  103. Simulates Diverse Polling Behavior

  104. Simulates Diverse Polling Behavior

  105. Redis Queue Simulator: goqueuesim

  106. Redis Queue Simulator: goqueuesim

  107. Example Metrics

  108. Example Metrics

  109. Example Metrics

  110. Mock Services

  111. Genghis: Our Load Testing Tool Talk on Genghis

  112. Simple Mock Service to test Queue gem

  113. Mock API for Test Shops in Production

  114. Experiment Results: Success!

  115. Some Takeaways • 01 Race to poll drawback in rate limiters • 02 Benefits of queue state to fairness & UX • 03 Horizontal & adaptive scaling options • 04 Simulation-driven migrations! • Thoughts? Chat with me sometime @martelogan!