Slide 1

How continuous batching enables 23x throughput in LLM inference
Cade Daniel

Slide 2

About me
● At Anyscale for the last 1.5 years, working on Ray and LLMs
● Previously worked on a communication engine for LLM training at AWS
● Outside of work, I enjoy a good latte while liking hot takes on ML/AI Twitter 𝕏

Slide 3

Motivation: Serving dominates cost for applications

Slide 4

Motivation: Serving dominates cost for applications

Slide 5

Goal: Show how continuous batching significantly reduces LLM serving costs
● LLM inference background
● Systems challenges that increase cost
● How continuous batching makes such an improvement (23x!)
● Benchmark results
Note: most of this talk is covered in our blog post "How continuous batching enables 23x throughput in LLM inference while reducing p50 latency"
Reduce serving costs → enable more LLM applications

Slide 6

Layers in an LLM serving stack

Slide 7

LLM inference background: How does text generation work?
Legend:
● Yellow: prompt token
● Blue: generated token
● Red: end-of-sequence token
Iterative: each forward pass generates a single token
Autoregressive: generation consumes prompt tokens + previously generated tokens
Completion potentially decided by the model: a generated token can be the end-of-sequence token
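As an illustration (not code from the talk), here is a minimal greedy decoding loop in Python, assuming a HuggingFace-style causal LM; the model name is just a small placeholder. Each forward pass consumes the prompt plus all previously generated tokens and emits exactly one new token, and generation stops early if the model produces the end-of-sequence token.

```python
# Minimal sketch of autoregressive text generation (illustrative only).
# Note: this version re-runs the full sequence every step; no KV cache yet.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(32):                                   # iterative: one token per forward pass
    logits = model(input_ids).logits                  # autoregressive: prompt + generated so far
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick of the next token
    input_ids = torch.cat([input_ids, next_id], dim=-1)
    if next_id.item() == tokenizer.eos_token_id:      # completion decided by the model
        break
print(tokenizer.decode(input_ids[0]))
```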

Slide 8

Systems challenges that increase cost
● Size of LLM parameters >> size of LLM data
  ○ Llama2 70B ~ 130GB to store float16 parameters
  ○ 2x A100-80GB to store, 4x+ A100-80GB to maximize throughput
● Memory IO huge factor in latency
  ○ For a single token, have to load 130 GB to compute cores
  ○ CPU memory IO ~= 10-50 GB/s
  ○ GPU memory IO ~= 2000 GB/s (A100 80GB)
● High throughput requires many FLOPS
  ○ CPU can do real-time generation of a single sequence
  ○ GPU can do real-time generation for many sequences
From the FlashAttention paper https://arxiv.org/pdf/2205.14135.pdf
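A back-of-the-envelope calculation using the slide's numbers shows why memory IO dominates per-token latency: every decode step has to stream the full set of weights to the compute units at least once, so memory bandwidth sets a hard floor on latency.

```python
# Rough lower bound on per-token latency from memory bandwidth alone
# (numbers taken from the slide; ignores compute and KV-cache traffic).
weights_gb = 130          # Llama2 70B in float16
cpu_bw_gbs = 50           # upper end of the slide's 10-50 GB/s CPU range
gpu_bw_gbs = 2000         # A100-80GB HBM bandwidth

cpu_ms_per_token = weights_gb / cpu_bw_gbs * 1000   # ~2600 ms per token
gpu_ms_per_token = weights_gb / gpu_bw_gbs * 1000   # ~65 ms per token
print(f"CPU lower bound: {cpu_ms_per_token:.0f} ms/token")
print(f"GPU lower bound: {gpu_ms_per_token:.0f} ms/token")
```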

Slide 9

KV cache: transformer-specific optimization
● Autoregressive generation recomputes constants K and V
● Cache K,V to reduce recomputations
● K,V are ~1MB each per token for a 13B model
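To make the caching concrete, here is a sketch of the same greedy loop with KV-cache reuse, again assuming a HuggingFace-style model (the model name is a placeholder): the prompt is processed once to fill the cache, and each later step feeds only the newest token while passing the cached K,V tensors back in.

```python
# Sketch of KV-cache reuse during decoding (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

input_ids = tokenizer("Continuous batching", return_tensors="pt").input_ids
out = model(input_ids, use_cache=True)          # prefill: compute and cache K,V for the prompt
past = out.past_key_values
next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

generated = [next_id]
for _ in range(16):
    # decode: only the newest token goes in; cached K,V cover everything before it
    out = model(next_id, past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated.append(next_id)
print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```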

Slide 10

Other optimizations
● Quantization – compress parameters but reduce model quality
  ○ Treats the model like a black box
● Custom CUDA kernels – e.g. FlashAttention, reduces memory IO needed
  ○ Low-level, complicated
● Grouped Query Attention (GQA) – modify model architecture for optimized inference
  ○ Requires changes to training
● Continuous batching – modify how sequences are batched
  ○ Works with any LLM!

Slide 11

Static batching
● Batching multiple sequences on GPU, aka "static batching"
● Problem: GPU utilization drops as sequences complete
Legend:
● Yellow: prompt token
● Blue: generated token
● Red: end-of-sequence token

Slide 12

Continuous batching: low-hanging fruit
● Continuous batching dynamically recreates batches
● Fills GPU capacity after each token generation
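To show the intuition, the toy simulation below (not how vLLM or TGI actually schedule) compares GPU slot utilization under static batching, where a batch is held until its longest sequence finishes, against continuous batching, where freed slots are refilled after every iteration.

```python
# Toy simulation contrasting static vs. continuous batching (illustrative only;
# real schedulers are far more involved). Each request needs a different number
# of decode steps; we count how many GPU "slots" do useful work per step.
import random
from collections import deque

random.seed(0)
requests = deque(random.randint(1, 100) for _ in range(64))  # remaining tokens per request
BATCH = 8

def static_batching(reqs):
    reqs, busy, steps = deque(reqs), 0, 0
    while reqs:
        batch = [reqs.popleft() for _ in range(min(BATCH, len(reqs)))]
        busy += sum(batch)          # useful slot-steps in this batch
        steps += max(batch)         # batch is held until its longest sequence ends
    return busy / (steps * BATCH)

def continuous_batching(reqs):
    reqs, running, busy, steps = deque(reqs), [], 0, 0
    while reqs or running:
        while reqs and len(running) < BATCH:   # refill free slots every iteration
            running.append(reqs.popleft())
        running = [r - 1 for r in running]     # one decode step for the whole batch
        busy += len(running)
        running = [r for r in running if r > 0]  # finished sequences leave immediately
        steps += 1
    return busy / (steps * BATCH)

print(f"static utilization:     {static_batching(requests):.0%}")
print(f"continuous utilization: {continuous_batching(requests):.0%}")
```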

Slide 13

Continuous batching
Top: static batching
Bottom: continuous batching
Legend:
● Yellow: prompt token
● Blue: generated token
● Red: end-of-sequence token

Slide 14

Continuous batching
● Continuous batching dynamically recreates batches
● Fills GPU capacity after each token generation
● As variance in sequence length increases, continuous batching increases GPU utilization

Slide 15

Throughput experiments
● Hypothesis
  ○ Continuous batching performs better the more variance there is in sequence lengths
● Frameworks
● Setup – hardware/model
● Setup – data
● Results

Slide 16

Throughput experiments: Frameworks
Static batching
● HuggingFace Pipelines (link)
● NVIDIA FasterTransformer (link)
Continuous batching
● HuggingFace text-generation-inference (TGI) (link)
● Ray Serve
● vLLM (link)

Slide 17

Throughput experiments: Hardware/model
● 1x NVIDIA A100-40GB SXM GPU
● Provided by Anyscale
● Meta's OPT-13B
  ○ dtype=float16 → 26GB for parameters
● No tensor parallelism
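A quick sanity check of the 26GB figure and the memory left over for the KV cache (rough arithmetic only; it ignores activations and framework overhead):

```python
# 13B float16 parameters at 2 bytes each, on a single A100-40GB.
params = 13e9
bytes_per_param = 2                      # float16
weight_gb = params * bytes_per_param / 1e9
print(f"weights: {weight_gb:.0f} GB")                       # ~26 GB, matching the slide
print(f"left for KV cache on A100-40GB: {40 - weight_gb:.0f} GB")
```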

Slide 18

Throughput experiments: Data
● Hypothesis
  ○ Continuous batching performs better the more variance there is in sequence lengths
● How to test?
  ○ Generate 1000 prompts, each with 512 input tokens
  ○ Generate predetermined output length for each prompt, following an exponential distribution
  ○ Configure model to ignore EOS token
● How to control variance in sequence lengths?
  ○ Limit the random sequence lengths artificially
  ○ E.g. to 32, 128, 512, and 1536 output tokens
  ○ 4 experiments
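A sketch of how such output lengths could be drawn (the caps and the exponential shape come from the slide; the mean used here is an assumption, not necessarily the benchmark's value):

```python
# Generate per-request output lengths: exponential, clipped at a cap per experiment.
import numpy as np

rng = np.random.default_rng(0)
NUM_PROMPTS = 1000
MEAN_OUTPUT_TOKENS = 128          # assumed; the slide does not state the distribution's mean

for cap in (32, 128, 512, 1536):  # one experiment per cap: tighter cap = less variance
    lengths = rng.exponential(MEAN_OUTPUT_TOKENS, NUM_PROMPTS)
    lengths = np.clip(lengths, 1, cap).astype(int)   # limit the random lengths artificially
    print(f"cap={cap:5d}  mean={lengths.mean():7.1f}  std={lengths.std():7.1f}")
```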

Slide 19

Throughput experiments: Results

Slide 20

Throughput experiments: Results

Slide 21

How does vLLM beat TGI?
● Note – we ran these experiments in June; TGI is now much closer to vLLM
● TGI and vLLM both use continuous batching
● vLLM uses PagedAttention – extra batch size space

Slide 22

How does vLLM beat TGI?