A Commerce-Centric Take on Queueing Fairly at High Throughput

Logan Martel | @martelogan A Commerce-Centric take on High Throughput
Fair Queueing

👈 Me • works on scaling Checkout @ Shopify •
advocating stateful throttles today • shipped a scalable stateful throttle with: • Scott Francis • Bassam Mansoob • Jay Lim • Osama Sidat • Jonathan Dupuis 🛒 Docs as legal-lang

Simple Example Queue

The Plan (Roughly) 01 02 03 04 “Flash Sale” Thundering
Herds Prior Work & Drawbacks “Stateful Throttle” Solutions Test in prod!

Flash Scale

Shopify handles some of the largest ﬂash sales in the
world

8 32M Requests per minute (peak) 11TB MySQL read I/O
per second 24B Background jobs performed 42B API calls made to partner apps

Shopify (Core) Tech Stack

Browsing Storefront

Add to Cart

Writes during Checkout

Payment Finalization

Order Conﬁrmation

Shopify’s “Thundering Herd” → Write-heavy Bursts up to 5x our
baseline traﬃc

Need Backpressure! What are some of our options?

• blocking λ dequeues • “stateful” memory requirement • nevertheless,
we’ll circle back to this idea Why not simply queue users in order (FIFO)?

• “leaky bucket as queue” → stateful FIFO equivalent →
buffered in-order requests • “leaky bucket as metre” → stateless throttle → requests either dropped (z > β) or forwarded (z ≤ β)

• token buckets are equivalent mirror images to “leaky bucket
as metre” • both statelessly throttle at rate ρ → support “bursty traﬃc” up to burst size z ≤ β

Stateless Throttles

Common Throttle Challenges • Capacity problem → limiting service rate
to sustainable throughput • Starvation problem → ensuring prompt service for all buyers → (fast sellout) • Fairness problem → limiting deviations from FIFO service order (e.g. don’t incentivize a “race to poll”!)

Compromises? Let’s consider some semi-stateful windowed approaches

Fixed Window

Fixed Window (Redis Transaction)

Fixed Window (Redis via Lua)

Problem: Boundary Bursts

Adjust dynamically → Sliding Window

Variations on Windows • Sliding Window log→ track arrivals in-memory;
pop outdated entries • Generic Cell Rate (GCRA) → metered leaky bucket with predicted arrivals • Concurrency & congestion controls → counting semaphores & TCP-style adaptive window sizes

“Standard” Window Approaches Retro • Police Capacity → (limit concurrent
buyers) ✅ • Don’t Starve Throughput → (fast sellout) ✅ • Promote Fairness → avoid a “race to poll” ❌

Let’s try going a step beyond *just* throttling.

Our North Star Slides to follow along

The Journey 1. Stateless V1 2. Stateful V2 3. Rollout
Slides to follow along

Step 1: Stateless V1 Our Edge-tier Legacy Throttle

OpenResty Lua Module: Enables scripting NGINX load balancers to manipulate
request & response traﬃc.

OpenResty Hello World Example (with custom headers)

Legacy Throttle Architecture

Servicing polls first-in-first-out is unfair. New users could simply poll
first to “jump the line”

Let’s issue (signed) tickets to each user for the timestamp
when they arrived.

Control Theory Idea: Adjust our “accepted traﬃc window” on-the-ﬂy (à
la TCP) Seeking stable fair throughput just as Thermostat “PID controllers”1 seek stable temperatures 1 Proportional-Integral-Derivative Controllers

Adaptive “lag” slider

worked well at prioritizing “very lagged” user poll traffic difficult
to stabilize → led to frequent window fluctuations never “quite” stateless → led to inconsistent behaviour across load balancers → complicated scaling across regional clusters Legacy Throttle “Adaptive Lag” Retro

In Legacy throttle, users could also be queued for >30
mins only to discover that their cart's inventory had already gone out-of-stock

Step 2: Stateful V2 Our Application-tier Fair Waiting Room

Let’s issue (signed) tickets to each user for the timestamp
when they arrived.

Shopify’s “Thundering Herd” → Write-heavy Bursts up to 5x our
baseline traﬃc

Consider a distribution of user arrival times

Arrival tickets land in different buckets arrival tickets

There’s more than one queue in this image arrival tickets

Queue bins

Intra-bin Queues y=10% into 1s . . . x-axis =
arrival second x = 2s x = 1s . . . x = 3s y-axis = % into one-second bin y=25.2% into 1s y=33.33% into 1s (integer-valued) (decimal-valued)

Idea: Limit unfairness between Queue Bins

Tolerate unfairness within bins y=10% into 1s . . .
x-axis = arrival second x = 2s x = 1s . . . x = 3s y-axis = % into one-second bin y=25.2% into 1s y=33.33% into 1s (integer-valued) (decimal-valued)

Queue Library (Ruby Gem) Interface

Simple Mock Service to test Queue gem

Bin Scheduling • latest_bin - bin # currently assigned to
arriving users • client_bin - bin # assigned to particular user (signed & encoded) • working_bin - max eligible bin to accept poll traﬃc from clients

Lua Routine Stored in Redis

When should we ask clients to poll?

Inventory Awareness - highly cacheable reads, powered by application tier.
Enriched by a stateful React + GraphQL client.

State as an enabler for scalability • Multi-layered caching -
most requests don’t even reach Redis • Adaptive working_bin - increments can react to signals such as: ◦ compliance - do clients poll at advised poll times (not too early or late)? ◦ system health - do we have capacity to allow more traﬃc? • Sellout as backoff signal - traﬃc backoff after sellout → shorter queue times! • Horizontal Scaling - if needed, could shard bins over multiple Redis instances

Step 3: Rollout! Simulation-driven Migration

Middleware Experiment in Prod

Similar Concept in Amusement Parks See Defunctland’s “Fastpass: A Complicated
History” on YouTube

Simulates Diverse Polling Behavior

Redis Queue Simulator: goqueuesim

Example Metrics

Mock Services

Genghis: Our Load Testing Tool Talk on Genghis

Simple Mock Service to test Queue gem

Mock API for Test Shops in Production

Experiment Results: Success!

Some Takeaways 01 02 03 04 Race to poll drawback
in rate limiters Beneﬁts of queue state to fairness & UX Horizontal & adaptive scaling options Simulation-driven migrations! Thoughts? Chat with me sometime @martelogan !

A Commerce-Centric Take on Queueing Fairly at H...

A Commerce-Centric Take on Queueing Fairly at High Throughput

Other Decks in Technology

Featured

Transcript