
Queue theory 101 (node.js edition)


Queueing theory is perhaps one of the most important mathematical theories in systems design and analysis, yet few engineers ever learn it. This talk introduces the basics of queueing theory and explores the ramifications of queue behavior on system performance and resiliency, with emphasis on async and Node.js behavior.

Avishai Ish-Shalom

November 15, 2021



  1. Attack of the killer queues They are everywhere! In your

    drivers, your sockets, your event loop! No one is safe
  2. • Distributions have width • Improbable results do happen •

    Aggregate effects vs. particular effects • A single numeric aggregate cannot capture the behavior The world is made of distributions
  3. Variability/Dispersion • How “wide” the distribution is • Various measures:

    stddev, Variance, IQD, MAD... • Distributions are infinite, our systems are not ⇒ cutoffs, timeouts • Easy to raise variation, hard to reduce it
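To make the measures above concrete, here is a small sketch (with a made-up latency sample) comparing stddev, which an outlier dominates, against MAD (median absolute deviation), which is robust to it:

```javascript
// Hypothetical latency sample in ms, with a single outlier.
const sample = [10, 12, 11, 9, 13, 200];

const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;

const stddev = xs => {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map(x => (x - m) ** 2)));
};

const median = xs => {
  const s = [...xs].sort((a, b) => a - b);
  const mid = s.length >> 1;
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
};

// Median absolute deviation: median distance from the median.
const mad = xs => {
  const med = median(xs);
  return median(xs.map(x => Math.abs(x - med)));
};

console.log(stddev(sample).toFixed(1)); // ~70.4: dominated by the outlier
console.log(mad(sample));               // 1.5: barely notices it
```

The same data yields wildly different "width" numbers depending on the measure, which is why no single aggregate tells the whole story.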
  4. Variability effects on utilization Suppose you need to get from

    Jerusalem to Tel-Aviv: • Train takes 40 minutes • Mean delay = 5 minutes • Delay P90 = 30 minutes • Delay P99 = 60 minutes How early should you leave to be in Tel-Aviv by noon? With which SLA? How much time are you wasting in total?
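Working through the slide's numbers (a rough sketch; the delay figures are the ones given above):

```javascript
// Numbers from the slide: a 40-minute ride plus a delay distribution.
const RIDE_MIN = 40;
const delay = { mean: 5, p90: 30, p99: 60 };

// To arrive by noon with a given confidence, budget for that delay percentile.
const leaveMinutesBeforeNoon = p => RIDE_MIN + delay[p];

console.log(leaveMinutesBeforeNoon('p90')); // 70: 90% SLA
console.log(leaveMinutesBeforeNoon('p99')); // 100: 99% SLA

// The *expected* trip is only 45 minutes, so meeting a 99% SLA
// wastes ~55 minutes on a typical day:
console.log(leaveMinutesBeforeNoon('p99') - (RIDE_MIN + delay.mean)); // 55
```

The gap between the mean (5 min) and the tail (60 min) is pure variability cost: the wider the delay distribution, the more capacity (here, your time) you must waste to hit the SLA.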
  5. The curse of high variation • Utilization is limited by

    high variation • Group work latency follows high percentiles (think Map/Reduce, Fork/Join) • Customer satisfaction follows high percentiles • Disasters follow tail behavior • Failure demand (e.g. retries)
  6. Head of line blocking • When some task takes longer,

    service center is “blocked” • Other tasks in the queue are blocked by the “head of line” • A single slow task will cause a bunch of other tasks to wait ◦ Bad news for high latency percentiles
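A toy FIFO simulation makes the head-of-line effect concrete (a sketch: all tasks arrive at once, one service center):

```javascript
// Toy FIFO queue: all tasks arrive at t=0, one service center.
// Returns how long each task waits before its service starts.
function fifoWaits(serviceTimes) {
  let clock = 0;
  return serviceTimes.map(s => {
    const wait = clock;
    clock += s;
    return wait;
  });
}

// One slow (10ms) task at the head, four fast (1ms) tasks behind it:
console.log(fifoWaits([10, 1, 1, 1, 1])); // [ 0, 10, 11, 12, 13 ]
// Every task behind the slow one waits >= 10ms, regardless of its own cost.
```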
  7. Tasks should be independent, but... • Shared resources have queues

    ◦ Disks, CPUs, Thread pools, connection pools, DB locks, sockets, event loop… • Event loop phases share the same service center • Head-of-line blocking → cross task interaction ◦ Slow tasks raise latency of unrelated tasks ◦ Arrival spikes • High variance makes this worse
  8. Capacity & latency • Queue length (and latency) rise to

    infinity as utilization approaches 1 • Decent latency ⇒ over-provisioned capacity • The slower the service, the higher the penalty ρ = arrival rate / service rate = utilization, Q = queue length http://queuemulator.gh.scylladb.com/
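For the simplest textbook model (M/M/1: Poisson arrivals, exponential service, one server), the curve the slide describes has a closed form, Q = ρ/(1−ρ). A quick sketch:

```javascript
// M/M/1 steady-state approximations. This is a simplification of real
// systems, but it shows the shape of the curve as utilization -> 1.
const utilization = (arrivalRate, serviceRate) => arrivalRate / serviceRate;
const queueLength = rho => rho / (1 - rho);                    // mean jobs in system
const latency = (rho, serviceTime) => serviceTime / (1 - rho); // mean time in system

for (const rho of [0.5, 0.8, 0.9, 0.99]) {
  console.log(rho, queueLength(rho).toFixed(1), latency(rho, 1).toFixed(1));
}
// rho = 0.5 -> Q = 1;  rho = 0.9 -> Q = 9;  rho = 0.99 -> Q = 99
// Queue length and latency blow up non-linearly as rho approaches 1.
```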
  9. Implications Infinite queues: • Memory pressure / OOM • High

    latency • Stale work Always limit queue size! Give work items a TTL
  10. • 10% fluctuation at 𝜌 = 0.5 will hardly affect

    latency (~ 1.1x) • 10% fluctuation at 𝜌 = 0.9 will kill you (~ 10x latency) • Be careful when overloading resources • During peak load we must be extra careful • Highly varied load must be capped Utilization fluctuates
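The slide's ~1.1x and ~10x figures fall straight out of the 1/(1−ρ) latency scaling; a quick check:

```javascript
// Latency scales like 1/(1 - rho). Compare latency before and after
// a fractional load bump (e.g. bump = 0.1 for a 10% fluctuation).
const latencyRatio = (rho, bump) => (1 - rho) / (1 - rho * (1 + bump));

console.log(latencyRatio(0.5, 0.1).toFixed(2)); // ~1.11: barely noticeable
console.log(latencyRatio(0.9, 0.1).toFixed(2)); // ~10.00: catastrophic
```

Same 10% wobble in load, wildly different outcomes: headroom at ρ = 0.5 absorbs it, while at ρ = 0.9 the same wobble consumes nearly all remaining capacity.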
  11. Kingman formula • The higher the variance, the worse the

    latency/utilization curve gets • On both service rate and arrival rate • High variance ⇒ run at low utilization * Oh and btw your percentile curve is worse too Queuemulator
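Kingman's approximation for a G/G/1 queue ties the slide together: mean wait ≈ (ρ/(1−ρ)) · ((Ca² + Cs²)/2) · τ, where Ca² and Cs² are the squared coefficients of variation of inter-arrival and service times, and τ is the mean service time. A sketch:

```javascript
// Kingman's formula (VUT equation) for mean wait in a G/G/1 queue:
//   Wq ~= rho/(1 - rho) * (ca2 + cs2)/2 * serviceTime
// ca2, cs2: squared coefficients of variation of arrivals and service.
const kingmanWait = (rho, ca2, cs2, serviceTime) =>
  (rho / (1 - rho)) * ((ca2 + cs2) / 2) * serviceTime;

// Same utilization (0.9), same mean service time (1):
console.log(kingmanWait(0.9, 1, 1, 1).toFixed(1)); // ~9.0  (M/M/1-like)
console.log(kingmanWait(0.9, 4, 1, 1).toFixed(1)); // ~22.5 (bursty arrivals)
```

Doubling the arrival burstiness (Ca² from 1 to 4) multiplies the wait by 2.5x at the *same* utilization, which is why high variance forces you to run at low utilization.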
  12. • High utilization → high latency ◦ Non-linear! • High

    variance → high latency • Never use unlimited queues • Interactive systems→ short queues; Batch systems → long queues • Maintain proper utilization Executive summary
  13. Node queueing summary • Event loop queues are unlimited • Easy

    to overload • Blocking ⇒ high latency • Large microtasks kill QoS • await/.then()/process.nextTick() can still hog the event loop
  14. Avoid blocking the event loop; specifically, CPU heavy tasks •

    Immediate suspects: large JSONs, RegEx, SSR ◦ REDoS, JSON DoS ◦ Size limits ◦ Use async/stream friendly JSON parsers (bfj, JSONStream) ◦ Offload server side rendering (react/vue/angular) to workers • Offload heavy tasks to workers, remote processes (piscina) • Limit loops, recursion, etc. • Avoid sync functions Thou shalt not block!
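The "size limits" bullet can be sketched as a guard in front of JSON.parse (names here are hypothetical; the limit is an example, not a recommendation):

```javascript
// Hypothetical guard: refuse to JSON.parse oversized payloads on the
// event loop. Parse time grows with input size and blocks everything
// else while it runs.
const MAX_JSON_BYTES = 1 << 20; // 1 MiB; pick a limit that fits your SLA

function parseJsonBounded(text) {
  if (Buffer.byteLength(text, 'utf8') > MAX_JSON_BYTES) {
    throw new Error('payload too large, refusing to parse synchronously');
  }
  return JSON.parse(text);
}

console.log(parseJsonBounded('{"ok":true}').ok); // true
```

For payloads that legitimately exceed the limit, stream-parse them (e.g. the bfj/JSONStream libraries the slide mentions) or hand them to a worker instead of rejecting.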
  15. • Split to small microtasks • Use setImmediate to unblock

    the loop • Work will continue in the check phase • Remember: Promise.then()/await/process.nextTick() requeue as microtasks and will not yield the loop When in doubt, defer const yieldControl = () => new Promise((resolve) => setImmediate(resolve)) // Do something await yieldControl() // Let other tasks run // Do more work after waking up
  16. Apply some backpressure baby If the upstream applies pressure on

    you, apply pressure backwards on the upstream! • Load needs to be controlled to avoid overload • How do we tell upstreams we’re overloaded? • Blocking semantics implicitly apply backpressure • Network protocols support this (TCP backpressure, HTTP 429, 503, etc)
  17. Backpressure? But how? • For the lazy: Limit HTTP connections

    (express, koa) ◦ TCP backpressure • Limit concurrency (promise-pool, token buckets) • Reject requests when event loop lag rises (node-toobusy) ◦ HTTP backpressure: 503, 429 • When in doubt, await
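The "token buckets" bullet can be sketched in a few lines (a minimal illustration, not a production limiter; real services would pair this with a 429/503 response):

```javascript
// Minimal token bucket: admit a request only while tokens remain,
// refilling at a fixed rate. Requests that find no token are shed
// (respond 429/503) rather than queued.
class TokenBucket {
  constructor(capacity, refillPerSec) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillPerSec = refillPerSec;
    this.last = Date.now();
  }
  tryRemove() {
    const now = Date.now();
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.last) / 1000) * this.refillPerSec
    );
    this.last = now;
    if (this.tokens >= 1) { this.tokens -= 1; return true; }
    return false; // shed instead of queueing
  }
}

const bucket = new TokenBucket(2, 10); // burst of 2, refill 10/sec
console.log(bucket.tryRemove(), bucket.tryRemove(), bucket.tryRemove());
// true true false -- the third request is rejected, keeping the queue bounded
```

Shedding early is the whole point: a rejected request costs a few microseconds, while a queued one at high utilization can wait far longer than its own service time.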
  18. TLDR • Never block the event loop • Break work into

    small microtasks, defer • Event loop queueing will kill your latency • Monitor event loop lag • Do not overload. Use backpressure and load shedding • Maintain proper (low) utilization • Reduce variation wherever possible