Prioritization justice: lessons from making background jobs fair at scale

Are you treating your users fairly? They could be stuck in the queue while a greedy user monopolizes resources. And you might not even know it! In this talk, Alexander Baygeldin, backend engineer at Evil Martians, is going to show whether it's time for you to take background job prioritization seriously—and how to make it fair for all users.


Alexander Baygeldin

September 22, 2025

Transcript

  1. 🧑‍🍳 🧑‍🍳 🧑‍🍳 🍕 🍕 Latency = the time between placing the order and starting to prepare it
  2. 🧑‍🍳 🧑‍🍳 🧑‍🍳 🍕 🍕 🧑‍🍳 🧑‍🍳 🧢 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 part-time chefs
  3. 🧑‍🍳 🧑‍🍳 🧑‍🍳 🍕 🍕 🧑‍🍳 🧑‍🍳 🧢 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 🧑‍🍳 🧢 part-time chefs. At some point, improving Quality of Service at higher operational cost leads to diminishing returns.
  4. G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2 n^2 \bar{x}} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2 n \sum_{i=1}^{n} x_i}
     \mathcal{J}(x_1, x_2, \ldots, x_n) = \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n \cdot \sum_{i=1}^{n} x_i^2} = \frac{\bar{x}^2}{\overline{x^2}} = \frac{1}{1 + \hat{c}_v^2}
     Jain's index? Gini index? Some other guy's index?
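
     (Not from the deck: a quick Ruby sketch of how one might compute both indices over per-tenant numbers to sanity-check fairness; the sample shares below are made up.)

     # Fairness indices over per-tenant shares (e.g., jobs processed per tenant).
     def jains_index(xs)
       n = xs.size.to_f
       (xs.sum**2) / (n * xs.sum { |x| x**2 })
     end

     def gini_index(xs)
       n = xs.size.to_f
       pair_diffs = xs.sum { |xi| xs.sum { |xj| (xi - xj).abs } }
       pair_diffs / (2 * n * xs.sum)
     end

     shares = [120, 115, 118, 900]  # one greedy tenant
     puts jains_index(shares)       # => roughly 0.46 (1.0 would be perfectly fair)
     puts gini_index(shares)        # => roughly 0.47 (0.0 would be perfectly fair)
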
  5. Are we OK? How do we know if we're not? ☑ FIFO queue ☑ High latency ☑ Greedy users
  6. 🧑‍🍳 🧑‍🍳 🧑‍🍳 per-tenant queues → worker's "in-progress" queues:
     LMOVE queue:tenant_1 queue:sq|<worker ID>|tenant_1 RIGHT LEFT
     LMOVE queue:tenant_2 queue:sq|<worker ID>|tenant_2 RIGHT LEFT
     ...
     LMOVE queue:tenant_N queue:sq|<worker ID>|tenant_N RIGHT LEFT
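
     (The worker code itself isn't in the deck; a rough sketch of such a reliable per-tenant fetch with the redis gem could look like this, where the worker id and tenant list are assumptions.)

     require "redis"

     REDIS     = Redis.new
     WORKER_ID = "worker-1"                       # assumed worker identifier
     TENANTS   = %w[tenant_1 tenant_2 tenant_3]   # assumed tenant list

     # LMOVE atomically pops a job from the tail of a tenant's queue and pushes it
     # onto this worker's per-tenant "in-progress" list, so nothing is lost on a crash.
     def fetch_job
       TENANTS.each do |tenant|
         job = REDIS.lmove("queue:#{tenant}", "queue:sq|#{WORKER_ID}|#{tenant}", "RIGHT", "LEFT")
         return [tenant, job] if job
       end
       nil
     end
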
  7. 🧑‍🍳 🧑‍🍳 🧑‍🍳 Solid Queue:
     SELECT job_id FROM solid_queue_ready_executions
     WHERE queue_name = 'tenant_N'
     ORDER BY priority ASC, job_id ASC
     LIMIT ?
     FOR UPDATE SKIP LOCKED;
  8. 🧑‍🍳 🧑‍🍳 🧑‍🍳 GoodJob 👍:
     WITH rows AS MATERIALIZED (
       SELECT id, active_job_id FROM good_jobs
       WHERE queue_name = 'tenant_N' AND (<more filters>)
       ORDER BY priority DESC NULLS LAST, created_at ASC
       LIMIT ?
     )
     SELECT id FROM rows
     WHERE pg_try_advisory_lock(<lock hash based on active_job_id>)
     LIMIT 1
  9. Let's fix this! (with what we have): 1. Shuffle-sharding 2. Interruptible iteration 3. Throttling 4. Per-tenant queues
  10. 🧑‍🍳 🧑‍🍳 🧑‍🍳 Alex 🍕 A to I J to R S to Z Amy 🍕 Joe 🍕 Sam 🍕 Sam 🍕 Amy 🍕
  11. 🧑‍🍳 🧑‍🍳 🧑‍🍳 Alex 🍕 A to I J to R S to Z Amy 🍕 Joe 🍕 Sam 🍕 Sam 🍕 Amy 🍕
  12. 🧑‍🍳 🧑‍🍳 🧑‍🍳 A to I J to R S to Z Amy 🍕 Amy 🍕 💤 Sam 🍕
  13. 🧑‍🍳 🐷 Pentagon 🍕 🧑‍🍳 🧑‍🍳 A to I J to R S to Z Amy 🍕 Amy 🍕 Sam 🍕 🐷 Pentagon 🍕
  14. 🧑‍🍳 🐷 Pentagon 🍕 🧑‍🍳 🧑‍🍳 A to I J to R S to Z Amy 🍕 Amy 🍕 Sam 🍕 🐷 Pentagon 🍕
  15. 🧑‍🍳 🐷 Pentagon 🍕 🧑‍🍳 🧑‍🍳 A to I J to R S to Z Amy 🍕 🐷 Pentagon 🍕 💤 Joe 🍕
  16. Shuffle-sharding (tl;dr: good when you have enough workload to fill all shards)
      No hogging? No, but it's <number of shards> times less likely.
      Perfectly fair? No, but it affects fewer people when it's not fair.
      Full resource utilization? No—underutilization is possible with too many shards.
      Does it scale? Yes, but it needs at least one worker per shard.
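
      (The deck doesn't include code for this; a minimal sketch of the idea, assuming jobs carry a tenant id and workers listen on a fixed set of shards, with push_to as a hypothetical enqueue helper.)

      require "digest"

      SHARDS            = %w[shard_0 shard_1 shard_2 shard_3 shard_4 shard_5]
      SHARDS_PER_TENANT = 2

      # Deterministically map each tenant to a small, stable subset of shards, so a
      # greedy tenant can only ever clog the couple of shards it was assigned to.
      def shards_for(tenant_id)
        seed = Digest::SHA256.hexdigest(tenant_id.to_s).to_i(16)
        SHARDS.sample(SHARDS_PER_TENANT, random: Random.new(seed))
      end

      def enqueue(tenant_id, job)
        queue = shards_for(tenant_id).sample  # any of the tenant's shards will do
        push_to(queue, job)                   # hypothetical enqueue helper
      end
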
  17. Interruptible iteration (tl;dr: good when the workload comes in large batches)
      No hogging? No—it could happen if one tenant enqueues multiple batches.
      Perfectly fair? No—it stops being fair when someone hogs the queue.
      Full resource utilization? Yes, but <queue size> must be greater than <worker count>.
      Does it scale? Yes—using cursors to track progress is cheap.
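
      (No implementation is shown here either; as a hand-wavy sketch with a generic ActiveJob rather than the job-iteration API, an interruptible job can carry a cursor and re-enqueue itself once it exhausts its time budget. Row and process are hypothetical stand-ins.)

      class ImportRowsJob < ApplicationJob   # ApplicationJob as in a typical Rails app
        MAX_RUNTIME = 30                     # seconds of work before yielding the worker

        def perform(tenant_id, cursor = 0)
          deadline = Time.now + MAX_RUNTIME

          # Row and process(row) stand in for whatever batch is being worked through.
          Row.where(tenant_id: tenant_id).find_each(start: cursor + 1) do |row|
            process(row)
            cursor = row.id
            next if Time.now < deadline
            # Budget exhausted: give the worker back and continue later from the cursor.
            self.class.perform_later(tenant_id, cursor)
            return
          end
        end
      end
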
  18. 🪣 🚿 🕳 💧 💧 💧 💧 💧 💧 💧 💧 💧 Each new order fills the bucket; the bucket has a limited capacity; over time, the bucket leaks, allowing further orders.
  19. 🪣 🚿 🕳 💧 💧 💧 💧 💧 💧 💧 💧 💦 💦 Uh-oh, overflow! Each new order fills the bucket; the bucket has a limited capacity; over time, the bucket leaks, allowing further orders.
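
      (Sticking with the metaphor, a toy in-memory leaky bucket could look like the sketch below; a real one would live in Redis or the database so that all workers share it.)

      class LeakyBucket
        def initialize(capacity:, leak_rate:)    # leak_rate = how much drains per second
          @capacity  = capacity
          @leak_rate = leak_rate
          @level     = 0.0
          @updated   = Time.now
        end

        # Returns true if the new order fits, false if the bucket would overflow.
        def try_fill(amount = 1)
          leak!
          return false if @level + amount > @capacity
          @level += amount
          true
        end

        private

        def leak!
          now      = Time.now
          @level   = [@level - (now - @updated) * @leak_rate, 0.0].max
          @updated = now
        end
      end

      bucket = LeakyBucket.new(capacity: 10, leak_rate: 0.5)
      queue  = bucket.try_fill ? "default" : "throttled"
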
  20. 🧑‍🍳 🧑‍🍳 🧑‍🍳 🍕 🍕 🍕 🍕 🎲 default queue (80% chance to be processed) throttled queue (20% chance to be processed)
  21. 🧑‍🍳 🧑‍🍳 🧑‍🍳 🍕 🍕 🍕 🍕 🎲 default queue (80% chance to be processed) throttled queue (20% chance to be processed) 🍕
  22. 🧑‍🍳 🧑‍🍳 🧑‍🍳 🍕 🍕 🎲 default queue (80% chance to be processed) throttled queue (20% chance to be processed) 🍕
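
      (The dice roll isn't spelled out in the deck; one way to sketch a weighted pick between the two queues, using the example 80/20 weights from the slide.)

      QUEUE_WEIGHTS = { "default" => 80, "throttled" => 20 }

      # Decide which queue the worker polls next, proportionally to its weight,
      # so throttled tenants still trickle through instead of starving completely.
      def pick_queue
        roll = rand(QUEUE_WEIGHTS.values.sum)   # 0...100 with the weights above
        QUEUE_WEIGHTS.each do |queue, weight|
          return queue if roll < weight
          roll -= weight
        end
      end
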
  23. Throttling (tl;dr: good when the workload is well distributed over time)
      No hogging? Yes—everyone will get at least a little work done.
      Perfectly fair? No—especially if the workload is bursty by nature.
      Full resource utilization? Yes, but it requires weighted queues support.
      Does it scale? Yes—especially with leaky buckets.
  24. 🍕 🍕 🍕 🎲 🍕 🍕 🍕 🍕 🍕 🧑‍🍳 🧑‍🍳 🧑‍🍳 📒 🍕 🍕 🍕 🧙
  25. Per-tenant queues + custom scheduler (tl;dr: good when the workload is heavy enough to forget about the bottleneck)
      No hogging? Yes—it's basically communism.
      Perfectly fair? Yes—as fair as you care to make it.
      Full resource utilization? Yes—with zero changes to the underlying infra.
      Does it scale? It depends... on how efficient your scheduler is.
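
      (The scheduler itself isn't shown; a deliberately naive round-robin sketch over per-tenant queues, with pop_job as a hypothetical helper that pops from queue:<tenant>.)

      # Visit tenants in a rotating order and take at most one job from each per pass,
      # so no single tenant can monopolize the chefs.
      class RoundRobinScheduler
        def initialize(tenants)
          @tenants = tenants.dup
        end

        def next_job
          @tenants.size.times do
            tenant = @tenants.shift
            @tenants.push(tenant)              # rotate regardless of the outcome
            job = pop_job(tenant)              # hypothetical: pop from queue:<tenant>
            return [tenant, job] if job
          end
          nil                                  # every per-tenant queue is empty
        end
      end
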
  26. Examples! (where fairness matters): ✨ AI workflows* ⬇ Data imports 🖼 Media processing 📊 Reports generation ... and more! (* less so with async-job)
  27. So, which one?
                                      Shuffle-sharding   Interruptible iteration   Throttling   Per-tenant queues
      No hogging?                     ❌                 ❌                        ✅           ✅
      Perfectly fair?                 ❌                 ❌                        ❌           ✅
      Full resource utilization?      🟧                 🟧                        ✅           ✅
      Does it scale?                  ✅                 ✅                        ✅           🟧
  28. Resources (part 2):
      • Workload Isolation with Queue Sharding (by Mike Perham)
      • Workload Isolation Using Shuffle-Sharding (by Colm MacCárthaigh)
  29. Resources (part 3):
      • job-iteration gem (from Shopify)
      • Sidekiq Iterable Jobs: With Great Power.... (by Jon Sully)
      • Active Job Continuations
  30. Resources (part 4):
      • “Fair” multi-tenant prioritization of Sidekiq jobs—and our gem for it! (by Andrey Novikov)
      • The unreasonable effectiveness of leaky buckets (and how to make one) (by Julik Tarkhanov)