How continuous batching enables 23x throughput in LLM inference

Cade Daniel

About me
• At Anyscale for the last 1.5 years, working on Ray and LLMs
• Previously worked on a communication engine for LLM training at AWS
• Outside of work, I enjoy a good latte while liking hot takes on ML/AI Twitter 𝕏

Motivation: Serving dominates cost for applications

Goal: Show how continuous batching significantly reduces LLM serving costs
• LLM inference background
• Systems challenges that increase cost
• How continuous batching makes such an improvement (23x!)
• Benchmark results
Note: most of this talk is covered in our blog post “How continuous batching enables 23x throughput in LLM inference while reducing p50 latency”
Reduce serving costs → enable more LLM applications

Layers in an LLM serving stack

LLM inference background
How does text generation work?
• Iterative: each forward pass generates a single token
• Autoregressive: generation consumes prompt tokens + previously generated tokens
• Completion potentially decided by model: a generated token can be the end-of-sequence token
Legend:
• Yellow: prompt token
• Blue: generated token
• Red: end-of-sequence token
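The three properties on this slide can be sketched as a toy decoding loop. This is a minimal illustration, not a real model: `toy_forward`, the `EOS` id, and the 8-token vocabulary are all hypothetical stand-ins.

```python
import random

EOS = 0  # hypothetical end-of-sequence token id


def toy_forward(tokens, rng):
    """Stand-in for one model forward pass: returns a single next-token id.

    A real LLM would attend over `tokens`; this toy just samples uniformly
    from a tiny made-up vocabulary that includes EOS.
    """
    return rng.choice([EOS, 1, 2, 3, 4, 5, 6, 7])


def generate(prompt, max_new_tokens, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Iterative: one forward pass yields exactly one new token.
        # Autoregressive: the input is the prompt plus everything generated so far.
        next_token = toy_forward(tokens, rng)
        tokens.append(next_token)
        # Completion decided by the model: stop as soon as EOS is produced.
        if next_token == EOS:
            break
    return tokens


if __name__ == "__main__":
    print(generate(prompt=[5, 6, 7], max_new_tokens=16))
```

Because the stop condition depends on what the model emits, the number of generated tokens is not known in advance, which is what makes batching sequences together hard.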

Systems challenges that increase cost
• Size of LLM parameters >> size of LLM data
  ◦ Llama2 70B ~ 130GB to store float16 parameters
  ◦ 2x A100-80GB to store, 4x+ A100-80GB to maximize throughput
• Memory IO is a huge factor in latency
  ◦ For a single token, have to load 130 GB to compute cores
  ◦ CPU memory IO ~= 10-50 GB/s
  ◦ GPU memory IO ~= 2000 GB/s (A100 80GB)
• High throughput requires many FLOPS
  ◦ CPU can do real-time generation of a single sequence
  ◦ GPU can do real-time generation for many sequences
Figure from the FlashAttention paper: https://arxiv.org/pdf/2205.14135.pdf
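The memory IO numbers above imply a rough floor on per-token latency: if every parameter has to be streamed to the compute cores once per token, the time is at least parameter size divided by memory bandwidth. The sketch below does that arithmetic with the slide's figures. It is a back-of-the-envelope estimate and ignores compute time, caching, and the fact that a 130 GB model is spread across several GPUs.

```python
# Figures taken from the slide; everything else is simple division.
PARAM_SIZE_GB = 130.0  # Llama2 70B in float16

BANDWIDTH_GB_PER_S = {
    "CPU (low end)": 10.0,
    "CPU (high end)": 50.0,
    "GPU (A100 80GB)": 2000.0,
}


def seconds_per_token(param_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on latency if all parameters are read once per token."""
    return param_gb / bandwidth_gb_per_s


if __name__ == "__main__":
    for name, bandwidth in BANDWIDTH_GB_PER_S.items():
        latency = seconds_per_token(PARAM_SIZE_GB, bandwidth)
        print(f"{name}: at least {latency * 1000:.0f} ms per token")
```

One read of the parameters can serve every sequence in a batch, so the cost per generated token falls as the batch grows. That is why keeping the batch full matters for serving cost.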

KV cache: transformer-specific optimization
• Without a cache, autoregressive generation recomputes the same K and V for earlier tokens at every step
• Cache K,V to reduce recomputations
• K and V together take ~1MB per token for a 13B model
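The per-token cache size can be estimated from the model shape. The sketch below assumes OPT-13B's published configuration (40 layers, hidden size 5120) and float16 values. Under those assumptions K and V together come to roughly 0.8 MB per token, the same order of magnitude as the ~1MB figure on the slide. The function name is illustrative.

```python
def kv_cache_bytes_per_token(num_layers: int, hidden_size: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of K plus V stored for one token across all layers.

    Each layer keeps one K vector and one V vector of length hidden_size.
    """
    return 2 * num_layers * hidden_size * bytes_per_value


# Assumed shape for OPT-13B: 40 layers, hidden size 5120, float16 (2 bytes).
PER_TOKEN = kv_cache_bytes_per_token(num_layers=40, hidden_size=5120)

if __name__ == "__main__":
    print(f"per token: {PER_TOKEN} bytes (~{PER_TOKEN / 1e6:.2f} MB)")
    # A sequence with a 512-token prompt and 1536 generated tokens:
    seq_tokens = 512 + 1536
    print(f"per 2048-token sequence: {PER_TOKEN * seq_tokens / 2**30:.2f} GiB")
```

At this size the cache for a handful of long sequences competes with the model parameters for GPU memory, so cache memory limits how many sequences fit in a batch.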

Other optimizations
• Quantization – compress parameters but reduce model quality
  ◦ Treats model like black-box
• Custom CUDA kernels – e.g. FlashAttention, reduces memory IO needed
  ◦ Low-level, complicated
• Grouped Query Attention (GQA) – modify model architecture for optimized inference
  ◦ Requires changes to training
• Continuous Batching – modify how sequences are batched
  ◦ Works with any LLM!

Static batching
• Batching multiple sequences on GPU, aka “static batching”
• Problem: GPU utilization drops as sequences complete
Legend:
• Yellow: prompt token
• Blue: generated token
• Red: end-of-sequence token

Continuous batching: low-hanging fruit
• Continuous batching dynamically recreates batches
• Fills GPU capacity after each token generation

Continuous batching
Top: static batching
Bottom: continuous batching
Legend:
• Yellow: prompt token
• Blue: generated token
• Red: end-of-sequence token

Continuous batching
• Continuous batching dynamically recreates batches
• Fills GPU capacity after each token generation
• The more variance there is in sequence lengths, the larger the GPU utilization gain of continuous batching over static batching
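The difference between the two scheduling policies can be sketched with a small simulation. This is a simplified model under stated assumptions, not how any particular framework schedules: each sequence is reduced to its number of output tokens, every active slot produces one token per step, and the cost of processing a newly admitted prompt is ignored. The function names are illustrative.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Free slots are refilled from the queue before every step."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active sequence emits one token; finished ones leave the batch.
        running = [left - 1 for left in running if left - 1 > 0]
    return steps


def utilization(output_lengths, batch_size, steps):
    """Fraction of (step, slot) pairs that produced a token."""
    return sum(output_lengths) / (steps * batch_size)


if __name__ == "__main__":
    lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # high variance in output length
    for name, fn in [("static", static_batching_steps),
                     ("continuous", continuous_batching_steps)]:
        steps = fn(lengths, batch_size=4)
        print(f"{name}: {steps} steps, "
              f"utilization {utilization(lengths, 4, steps):.2f}")
```

With uniform lengths both policies take the same number of steps. The gap opens only when some sequences finish well before others in the same batch.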

Throughput experiments
• Hypothesis
  ◦ Continuous batching performs better the more variance there is in sequence lengths
• Frameworks
• Setup – hardware/model
• Setup – data
• Results

Throughput experiments: Frameworks
Static batching
• HuggingFace Pipelines
• NVIDIA FasterTransformer
Continuous batching
• HuggingFace text-generation-inference (TGI)
• Ray Serve
• vLLM

Throughput experiments: Hardware/model
• 1x NVIDIA A100-40GB SXM GPU
• Provided by Anyscale
• Meta’s OPT-13B
  ◦ dtype=float16 → 26GB for parameters
• No tensor parallelism

Throughput experiments: Data
• Hypothesis
  ◦ Continuous batching performs better the more variance there is in sequence lengths
• How to test?
  ◦ Generate 1000 prompts each with 512 input tokens
  ◦ Generate predetermined output length for each prompt, following an exponential distribution
  ◦ Configure model to ignore EOS token
• How to control variance in sequence lengths?
  ◦ Limit the random sequence lengths artificially
  ◦ E.g. to 32, 128, 512, and 1536 output tokens
  ◦ 4 experiments
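The workload described above can be sketched as follows: draw an output length per prompt from an exponential distribution, then cap it at the experiment's limit. The mean of 128 tokens, the seed, and the function name are assumptions for illustration; the slide gives only the distribution family and the four caps.

```python
import random
import statistics

NUM_PROMPTS = 1000                # from the slide
MAX_TOKENS = [32, 128, 512, 1536]  # the four caps from the slide
ASSUMED_MEAN = 128                # assumption: the slide does not state a mean


def sample_output_lengths(num_prompts, mean_tokens, max_tokens, seed=0):
    """Exponentially distributed output lengths, capped at max_tokens."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(num_prompts):
        # expovariate takes the rate (1 / mean); +1 keeps every length >= 1.
        raw = int(rng.expovariate(1.0 / mean_tokens)) + 1
        lengths.append(min(raw, max_tokens))
    return lengths


if __name__ == "__main__":
    for cap in MAX_TOKENS:
        lengths = sample_output_lengths(NUM_PROMPTS, ASSUMED_MEAN, cap)
        print(f"cap={cap:5d}  mean={statistics.mean(lengths):7.1f}  "
              f"variance={statistics.pvariance(lengths):10.1f}")
```

A low cap clips most samples to the same value, so the lengths are nearly uniform. Raising the cap lets the long tail through and increases the variance, which is the knob the four experiments turn.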

Throughput experiments: Results

How does vLLM beat TGI?
• Note – we ran experiments in June, TGI is now much closer to vLLM
• TGI and vLLM both use continuous batching
• vLLM uses PagedAttention – wastes less KV cache memory, which leaves room for larger batch sizes
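One way to see where the extra batch space comes from is to compare two KV cache allocation policies. The sketch below is an illustration under assumptions, not vLLM's or TGI's actual allocator. The first policy reserves the maximum sequence length for every sequence up front. The second hands out fixed-size blocks only as tokens arrive, which is the idea behind paging. The block size of 16 tokens and the example lengths are assumed values.

```python
import math


def slots_reserved_contiguous(current_lengths, max_len):
    """Token slots held when each sequence reserves max_len up front."""
    return len(current_lengths) * max_len


def slots_reserved_paged(current_lengths, block_size):
    """Token slots held when memory is handed out in fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in current_lengths)


if __name__ == "__main__":
    lengths = [40, 700, 120, 2048]  # tokens currently cached per sequence
    contiguous = slots_reserved_contiguous(lengths, max_len=2048)
    paged = slots_reserved_paged(lengths, block_size=16)
    print(f"contiguous reservation: {contiguous} token slots")
    print(f"paged allocation:       {paged} token slots")
    print(f"freed for more sequences: {contiguous - paged} token slots")
```

Under the paged policy the waste per sequence is bounded by one partially filled block, so the memory that up-front reservation would leave idle can hold additional sequences in the batch.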

How does vLLM beat TGI?
