Slide 1

How continuous batching enables 23x throughput in LLM inference
Cade Daniel

Slide 2

About me
● At Anyscale for the last 1.5 years, working on Ray and LLMs
● Previously worked on a communication engine for LLM training at AWS
● Outside of work, I enjoy a good latte while liking hot takes on ML/AI Twitter 𝕏

Slide 3

Motivation: Serving dominates cost for applications

Slide 4

Motivation: Serving dominates cost for applications

Slide 5

Goal: Show how continuous batching significantly reduces LLM serving costs
● LLM inference background
● Systems challenges that increase cost
● How continuous batching makes such an improvement (23x!)
● Benchmark results
Note: most of this talk is covered in our blog post "How continuous batching enables 23x throughput in LLM inference while reducing p50 latency"
Reduce serving costs → enable more LLM applications

Slide 6

Layers in an LLM serving stack

Slide 7

LLM inference background: How does text generation work?
Legend:
● Yellow: prompt token
● Blue: generated token
● Red: end-of-sequence token
Iterative: each forward pass generates a single token
Autoregressive: generation consumes prompt tokens + previously generated tokens
Completion potentially decided by the model: a generated token can be the end-of-sequence token
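As an illustration (not code from the talk), here is a minimal greedy decoding loop in Python, assuming a HuggingFace-style causal LM; the model name is just a small placeholder. Each forward pass consumes the prompt plus all previously generated tokens and emits exactly one new token, and generation stops early if the model produces the end-of-sequence token.

```python
# Minimal sketch of autoregressive text generation (illustrative only).
# Note: this version re-runs the full sequence every step; no KV cache yet.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(32):                                   # iterative: one token per forward pass
    logits = model(input_ids).logits                  # autoregressive: prompt + generated so far
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick of the next token
    input_ids = torch.cat([input_ids, next_id], dim=-1)
    if next_id.item() == tokenizer.eos_token_id:      # completion decided by the model
        break
print(tokenizer.decode(input_ids[0]))
```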

Slide 8

Systems challenges that increase cost
● Size of LLM parameters >> size of LLM data
  ○ Llama2 70B ~ 130GB to store float16 parameters
  ○ 2x A100-80GB to store, 4x+ A100-80GB to maximize throughput
● Memory IO huge factor in latency
  ○ For a single token, have to load 130 GB to compute cores
  ○ CPU memory IO ~= 10-50 GB/s
  ○ GPU memory IO ~= 2000 GB/s (A100 80GB)
● High throughput requires many FLOPS
  ○ CPU can do real-time generation of a single sequence
  ○ GPU can do real-time generation for many sequences
From the FlashAttention paper https://arxiv.org/pdf/2205.14135.pdf
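A back-of-the-envelope calculation using the slide's numbers shows why memory IO dominates per-token latency: every decode step has to stream the full set of weights to the compute units at least once, so memory bandwidth sets a hard floor on latency.

```python
# Rough lower bound on per-token latency from memory bandwidth alone
# (numbers taken from the slide; ignores compute and KV-cache traffic).
weights_gb = 130          # Llama2 70B in float16
cpu_bw_gbs = 50           # upper end of the slide's 10-50 GB/s CPU range
gpu_bw_gbs = 2000         # A100-80GB HBM bandwidth

cpu_ms_per_token = weights_gb / cpu_bw_gbs * 1000   # ~2600 ms per token
gpu_ms_per_token = weights_gb / gpu_bw_gbs * 1000   # ~65 ms per token
print(f"CPU lower bound: {cpu_ms_per_token:.0f} ms/token")
print(f"GPU lower bound: {gpu_ms_per_token:.0f} ms/token")
```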

Slide 9

KV cache: transformer-specific optimization
● Autoregressive generation recomputes constants K and V
● Cache K,V to reduce recomputations
● K,V are ~1MB each per token for a 13B model
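To make the caching concrete, here is a sketch of the same greedy loop with KV-cache reuse, again assuming a HuggingFace-style model (the model name is a placeholder): the prompt is processed once to fill the cache, and each later step feeds only the newest token while passing the cached K,V tensors back in.

```python
# Sketch of KV-cache reuse during decoding (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

input_ids = tokenizer("Continuous batching", return_tensors="pt").input_ids
out = model(input_ids, use_cache=True)          # prefill: compute and cache K,V for the prompt
past = out.past_key_values
next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

generated = [next_id]
for _ in range(16):
    # decode: only the newest token goes in; cached K,V cover everything before it
    out = model(next_id, past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated.append(next_id)
print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```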

Slide 10

Other optimizations
● Quantization – compress parameters but reduce model quality
  ○ Treats the model like a black box
● Custom CUDA kernels – e.g. FlashAttention, reduces memory IO needed
  ○ Low-level, complicated
● Grouped Query Attention (GQA) – modify model architecture for optimized inference
  ○ Requires changes to training
● Continuous batching – modify how sequences are batched
  ○ Works with any LLM!

Slide 11

Static batching
● Batching multiple sequences on GPU, aka "static batching"
● Problem: GPU utilization drops as sequences complete
Legend:
● Yellow: prompt token
● Blue: generated token
● Red: end-of-sequence token

Slide 12

Continuous batching: low-hanging fruit
● Continuous batching dynamically recreates batches
● Fills GPU capacity after each token generation
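To show the intuition, the toy simulation below (not how vLLM or TGI actually schedule) compares GPU slot utilization under static batching, where a batch is held until its longest sequence finishes, against continuous batching, where freed slots are refilled after every iteration.

```python
# Toy simulation contrasting static vs. continuous batching (illustrative only;
# real schedulers are far more involved). Each request needs a different number
# of decode steps; we count how many GPU "slots" do useful work per step.
import random
from collections import deque

random.seed(0)
requests = deque(random.randint(1, 100) for _ in range(64))  # remaining tokens per request
BATCH = 8

def static_batching(reqs):
    reqs, busy, steps = deque(reqs), 0, 0
    while reqs:
        batch = [reqs.popleft() for _ in range(min(BATCH, len(reqs)))]
        busy += sum(batch)          # useful slot-steps in this batch
        steps += max(batch)         # batch is held until its longest sequence ends
    return busy / (steps * BATCH)

def continuous_batching(reqs):
    reqs, running, busy, steps = deque(reqs), [], 0, 0
    while reqs or running:
        while reqs and len(running) < BATCH:   # refill free slots every iteration
            running.append(reqs.popleft())
        running = [r - 1 for r in running]     # one decode step for the whole batch
        busy += len(running)
        running = [r for r in running if r > 0]  # finished sequences leave immediately
        steps += 1
    return busy / (steps * BATCH)

print(f"static utilization:     {static_batching(requests):.0%}")
print(f"continuous utilization: {continuous_batching(requests):.0%}")
```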

Slide 13

Continuous batching
Top: static batching
Bottom: continuous batching
Legend:
● Yellow: prompt token
● Blue: generated token
● Red: end-of-sequence token

Slide 14

Continuous batching
● Continuous batching dynamically recreates batches
● Fills GPU capacity after each token generation
● As variance in sequence length increases, continuous batching increases GPU utilization

Slide 15

Throughput experiments
● Hypothesis
  ○ Continuous batching performs better the more variance there is in sequence lengths
● Frameworks
● Setup – hardware/model
● Setup – data
● Results

Slide 16

Throughput experiments: Frameworks
Static batching
● HuggingFace Pipelines (link)
● NVIDIA FasterTransformer (link)
Continuous batching
● HuggingFace text-generation-inference (TGI) (link)
● Ray Serve
● vLLM (link)

Slide 17

Throughput experiments: Hardware/model
● 1x NVIDIA A100-40GB SXM GPU
● Provided by Anyscale
● Meta's OPT-13B
  ○ dtype=float16 → 26GB for parameters
● No tensor parallelism
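A quick sanity check of the 26GB figure and the memory left over for the KV cache (rough arithmetic only; it ignores activations and framework overhead):

```python
# 13B float16 parameters at 2 bytes each, on a single A100-40GB.
params = 13e9
bytes_per_param = 2                      # float16
weight_gb = params * bytes_per_param / 1e9
print(f"weights: {weight_gb:.0f} GB")                       # ~26 GB, matching the slide
print(f"left for KV cache on A100-40GB: {40 - weight_gb:.0f} GB")
```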

Slide 18

Throughput experiments: Data
● Hypothesis
  ○ Continuous batching performs better the more variance there is in sequence lengths
● How to test?
  ○ Generate 1000 prompts, each with 512 input tokens
  ○ Generate predetermined output length for each prompt, following an exponential distribution
  ○ Configure model to ignore EOS token
● How to control variance in sequence lengths?
  ○ Limit the random sequence lengths artificially
  ○ E.g. to 32, 128, 512, and 1536 output tokens
  ○ 4 experiments
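A sketch of how such output lengths could be drawn (the caps and the exponential shape come from the slide; the mean used here is an assumption, not necessarily the benchmark's value):

```python
# Generate per-request output lengths: exponential, clipped at a cap per experiment.
import numpy as np

rng = np.random.default_rng(0)
NUM_PROMPTS = 1000
MEAN_OUTPUT_TOKENS = 128          # assumed; the slide does not state the distribution's mean

for cap in (32, 128, 512, 1536):  # one experiment per cap: tighter cap = less variance
    lengths = rng.exponential(MEAN_OUTPUT_TOKENS, NUM_PROMPTS)
    lengths = np.clip(lengths, 1, cap).astype(int)   # limit the random lengths artificially
    print(f"cap={cap:5d}  mean={lengths.mean():7.1f}  std={lengths.std():7.1f}")
```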

Slide 19

Throughput experiments: Results

Slide 20

Throughput experiments: Results

Slide 21

How does vLLM beat TGI?
● Note – we ran these experiments in June; TGI is now much closer to vLLM
● TGI and vLLM both use continuous batching
● vLLM uses PagedAttention – extra batch size space

Slide 22

How does vLLM beat TGI?