Prioritization justice: lessons from making background jobs fair at scale

Are you treating your users fairly? They could be stuck in the queue while a greedy user monopolizes resources. And you might not even know it! In this talk, Alexander Baygeldin, backend engineer at Evil Martians, is going to show whether it's time for you to take background job prioritization seriously—and how to make it fair for all users.


Alexander Baygeldin

September 22, 2025

Transcript

  1. 🧑‍🍳 🧑‍🍳 🧑‍🍳 🍕 🍕 Latency = the time between placing the order and starting to prepare it
  2. 🧑‍🍳 🧑‍🍳 🧑‍🍳 🍕 🍕 🧑‍🍳 🧑‍🍳 🧢 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 part-time chefs
  3. 🧑‍🍳 🧑‍🍳 🧑‍🍳 🍕 🍕 🧑‍🍳 🧑‍🍳 🧢 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 part-time chefs. At some point, improving Quality of Service at higher operational cost leads to diminishing returns.
  4. G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2 n^2 \bar{x}} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2 n \sum_{i=1}^{n} x_i}
     \mathcal{J}(x_1, x_2, \ldots, x_n) = \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n \cdot \sum_{i=1}^{n} x_i^2} = \frac{\bar{x}^2}{\overline{x^2}} = \frac{1}{1 + \hat{c}_v^2}
     Jain's index? Gini index? Some other guy's index?
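
     (Not from the deck: a quick Ruby sketch of how one might compute both indices over per-tenant numbers to sanity-check fairness; the sample shares below are made up.)

     # Fairness indices over per-tenant shares (e.g., jobs processed per tenant).
     def jains_index(xs)
       n = xs.size.to_f
       (xs.sum**2) / (n * xs.sum { |x| x**2 })
     end

     def gini_index(xs)
       n = xs.size.to_f
       pair_diffs = xs.sum { |xi| xs.sum { |xj| (xi - xj).abs } }
       pair_diffs / (2 * n * xs.sum)
     end

     shares = [120, 115, 118, 900]  # one greedy tenant
     puts jains_index(shares)       # => roughly 0.46 (1.0 would be perfectly fair)
     puts gini_index(shares)        # => roughly 0.47 (0.0 would be perfectly fair)
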
  5. Are we OK? How do we know if we're not? ☑ FIFO queue ☑ High latency ☑ Greedy users
  6. 🧑‍🍳 🧑‍🍳 🧑‍🍳 per-tenant queues → worker's "in-progress" queues:
     LMOVE queue:tenant_1 queue:sq|<worker ID>|tenant_1 RIGHT LEFT
     LMOVE queue:tenant_2 queue:sq|<worker ID>|tenant_2 RIGHT LEFT
     ...
     LMOVE queue:tenant_N queue:sq|<worker ID>|tenant_N RIGHT LEFT
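
     (The worker code itself isn't in the deck; a rough sketch of such a reliable per-tenant fetch with the redis gem could look like this, where the worker id and tenant list are assumptions.)

     require "redis"

     REDIS     = Redis.new
     WORKER_ID = "worker-1"                       # assumed worker identifier
     TENANTS   = %w[tenant_1 tenant_2 tenant_3]   # assumed tenant list

     # LMOVE atomically pops a job from the tail of a tenant's queue and pushes it
     # onto this worker's per-tenant "in-progress" list, so nothing is lost on a crash.
     def fetch_job
       TENANTS.each do |tenant|
         job = REDIS.lmove("queue:#{tenant}", "queue:sq|#{WORKER_ID}|#{tenant}", "RIGHT", "LEFT")
         return [tenant, job] if job
       end
       nil
     end
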
  7. 🧑‍🍳 🧑‍🍳 🧑‍🍳 Solid Queue:
     SELECT job_id FROM solid_queue_ready_executions
     WHERE queue_name = 'tenant_N'
     ORDER BY priority ASC, job_id ASC
     LIMIT ?
     FOR UPDATE SKIP LOCKED;
  8. 🧑‍🍳 🧑‍🍳 🧑‍🍳 GoodJob 👍:
     WITH rows AS MATERIALIZED (
       SELECT id, active_job_id FROM good_jobs
       WHERE queue_name = 'tenant_N' AND (<more filters>)
       ORDER BY priority DESC NULLS LAST, created_at ASC
       LIMIT ?
     )
     SELECT id FROM rows
     WHERE pg_try_advisory_lock(<lock hash based on active_job_id>)
     LIMIT 1
  9. Let's fix this! (with what we have): 1. Shuffle-sharding 2. Interruptible iteration 3. Throttling 4. Per-tenant queues
  10. 🧑‍🍳 🧑‍🍳 🧑‍🍳 Alex 🍕 A to I J to R S to Z Amy 🍕 Joe 🍕 Sam 🍕 Sam 🍕 Amy 🍕
  11. 🧑‍🍳 🧑‍🍳 🧑‍🍳 Alex 🍕 A to I J to R S to Z Amy 🍕 Joe 🍕 Sam 🍕 Sam 🍕 Amy 🍕
  12. 🧑‍🍳 🧑‍🍳 🧑‍🍳 A to I J to R S to Z Amy 🍕 Amy 🍕 💤 Sam 🍕
  13. 🧑‍🍳 🐷 Pentagon 🍕 🧑‍🍳 🧑‍🍳 A to I J to R S to Z Amy 🍕 Amy 🍕 Sam 🍕 🐷 Pentagon 🍕
  14. 🧑‍🍳 🐷 Pentagon 🍕 🧑‍🍳 🧑‍🍳 A to I J to R S to Z Amy 🍕 Amy 🍕 Sam 🍕 🐷 Pentagon 🍕
  15. 🧑‍🍳 🐷 Pentagon 🍕 🧑‍🍳 🧑‍🍳 A to I J to R S to Z Amy 🍕 🐷 Pentagon 🍕 💤 Joe 🍕
  16. Shuffle-sharding (tl;dr: good when you have enough workload to fill all shards)
      No hogging? No, but it's <number of shards> times less likely.
      Perfectly fair? No, but it affects fewer people when it's not fair.
      Full resource utilization? No—underutilization is possible with too many shards.
      Does it scale? Yes, but it needs at least one worker per shard.
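
      (The deck doesn't include code for this; a minimal sketch of the idea, assuming jobs carry a tenant id and workers listen on a fixed set of shards, with push_to as a hypothetical enqueue helper.)

      require "digest"

      SHARDS            = %w[shard_0 shard_1 shard_2 shard_3 shard_4 shard_5]
      SHARDS_PER_TENANT = 2

      # Deterministically map each tenant to a small, stable subset of shards, so a
      # greedy tenant can only ever clog the couple of shards it was assigned to.
      def shards_for(tenant_id)
        seed = Digest::SHA256.hexdigest(tenant_id.to_s).to_i(16)
        SHARDS.sample(SHARDS_PER_TENANT, random: Random.new(seed))
      end

      def enqueue(tenant_id, job)
        queue = shards_for(tenant_id).sample  # any of the tenant's shards will do
        push_to(queue, job)                   # hypothetical enqueue helper
      end
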
  17. Interruptible iteration (tl;dr: good when the workload comes in large batches)
      No hogging? No—it could happen if one tenant enqueues multiple batches.
      Perfectly fair? No—it stops being fair when someone hogs the queue.
      Full resource utilization? Yes, but <queue size> must be greater than <worker count>.
      Does it scale? Yes—using cursors to track progress is cheap.
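
      (No implementation is shown here either; as a hand-wavy sketch with a generic ActiveJob rather than the job-iteration API, an interruptible job can carry a cursor and re-enqueue itself once it exhausts its time budget. Row and process are hypothetical stand-ins.)

      class ImportRowsJob < ApplicationJob   # ApplicationJob as in a typical Rails app
        MAX_RUNTIME = 30                     # seconds of work before yielding the worker

        def perform(tenant_id, cursor = 0)
          deadline = Time.now + MAX_RUNTIME

          # Row and process(row) stand in for whatever batch is being worked through.
          Row.where(tenant_id: tenant_id).find_each(start: cursor + 1) do |row|
            process(row)
            cursor = row.id
            next if Time.now < deadline
            # Budget exhausted: give the worker back and continue later from the cursor.
            self.class.perform_later(tenant_id, cursor)
            return
          end
        end
      end
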
  18. 🪣 🚿 🕳 💧 💧 💧 💧 💧 💧 💧 💧 💧 Each new order fills the bucket; the bucket has a limited capacity; over time, the bucket leaks, allowing further orders.
  19. 🪣 🚿 🕳 💧 💧 💧 💧 💧 💧 💧 💧 💦 💦 Uh-oh, overflow! Each new order fills the bucket; the bucket has a limited capacity; over time, the bucket leaks, allowing further orders.
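
      (Sticking with the metaphor, a toy in-memory leaky bucket could look like the sketch below; a real one would live in Redis or the database so that all workers share it.)

      class LeakyBucket
        def initialize(capacity:, leak_rate:)    # leak_rate = how much drains per second
          @capacity  = capacity
          @leak_rate = leak_rate
          @level     = 0.0
          @updated   = Time.now
        end

        # Returns true if the new order fits, false if the bucket would overflow.
        def try_fill(amount = 1)
          leak!
          return false if @level + amount > @capacity
          @level += amount
          true
        end

        private

        def leak!
          now      = Time.now
          @level   = [@level - (now - @updated) * @leak_rate, 0.0].max
          @updated = now
        end
      end

      bucket = LeakyBucket.new(capacity: 10, leak_rate: 0.5)
      queue  = bucket.try_fill ? "default" : "throttled"
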
  20. 🧑‍🍳 🧑‍🍳 🧑‍🍳 🍕 🍕 🍕 🍕 🎲 default queue (80% chance to be processed) throttled queue (20% chance to be processed)
  21. 🧑‍🍳 🧑‍🍳 🧑‍🍳 🍕 🍕 🍕 🍕 🎲 default queue (80% chance to be processed) throttled queue (20% chance to be processed) 🍕
  22. 🧑‍🍳 🧑‍🍳 🧑‍🍳 🍕 🍕 🎲 default queue (80% chance to be processed) throttled queue (20% chance to be processed) 🍕
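
      (The dice roll isn't spelled out in the deck; one way to sketch a weighted pick between the two queues, using the example 80/20 weights from the slide.)

      QUEUE_WEIGHTS = { "default" => 80, "throttled" => 20 }

      # Decide which queue the worker polls next, proportionally to its weight,
      # so throttled tenants still trickle through instead of starving completely.
      def pick_queue
        roll = rand(QUEUE_WEIGHTS.values.sum)   # 0...100 with the weights above
        QUEUE_WEIGHTS.each do |queue, weight|
          return queue if roll < weight
          roll -= weight
        end
      end
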
  23. Throttling (tl;dr: good when the workload is well distributed over time)
      No hogging? Yes—everyone will get at least a little work done.
      Perfectly fair? No—especially if the workload is bursty by nature.
      Full resource utilization? Yes, but it requires weighted queues support.
      Does it scale? Yes—especially with leaky buckets.
  24. 🍕 🍕 🍕 🎲 🍕 🍕 🍕 🍕 🍕 🧑‍🍳 🧑‍🍳 🧑‍🍳 📒 🍕 🍕 🍕 🧙
  25. Per-tenant queues + custom scheduler (tl;dr: good when the workload is heavy enough to forget about the bottleneck)
      No hogging? Yes—it's basically communism.
      Perfectly fair? Yes—as fair as you care to make it.
      Full resource utilization? Yes—with zero changes to the underlying infra.
      Does it scale? It depends... on how efficient your scheduler is.
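
      (The scheduler itself isn't shown; a deliberately naive round-robin sketch over per-tenant queues, with pop_job as a hypothetical helper that pops from queue:<tenant>.)

      # Visit tenants in a rotating order and take at most one job from each per pass,
      # so no single tenant can monopolize the chefs.
      class RoundRobinScheduler
        def initialize(tenants)
          @tenants = tenants.dup
        end

        def next_job
          @tenants.size.times do
            tenant = @tenants.shift
            @tenants.push(tenant)              # rotate regardless of the outcome
            job = pop_job(tenant)              # hypothetical: pop from queue:<tenant>
            return [tenant, job] if job
          end
          nil                                  # every per-tenant queue is empty
        end
      end
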
  26. Examples! (where fairness matters): ✨ AI workflows* ⬇ Data imports 🖼 Media processing 📊 Reports generation ... and more! (* less so with async-job)
  27. So, which one?
                                      Shuffle-sharding   Interruptible iteration   Throttling   Per-tenant queues
      No hogging?                     ❌                 ❌                        ✅           ✅
      Perfectly fair?                 ❌                 ❌                        ❌           ✅
      Full resource utilization?      🟧                 🟧                        ✅           ✅
      Does it scale?                  ✅                 ✅                        ✅           🟧
  28. Resources (part 2):
      • Workload Isolation with Queue Sharding (by Mike Perham)
      • Workload Isolation Using Shuffle-Sharding (by Colm MacCárthaigh)
  29. Resources (part 3):
      • job-iteration gem (from Shopify)
      • Sidekiq Iterable Jobs: With Great Power.... (by Jon Sully)
      • Active Job Continuations
  30. Resources (part 4):
      • “Fair” multi-tenant prioritization of Sidekiq jobs—and our gem for it! (by Andrey Novikov)
      • The unreasonable effectiveness of leaky buckets (and how to make one) (by Julik Tarkhanov)