How continuous batching enables 23x throughput in LLM inference

Due to the large GPU memory footprint and compute cost of LLMs, serving dominates the compute cost for most real-world applications. ML engineers often treat LLMs like "black boxes" that can only be optimized with internal changes such as quantization and custom CUDA kernels. However, this is not entirely the case. Because LLMs generate their output iteratively, and because LLM inference is often memory-bound rather than compute-bound, there are surprising system-level batching optimizations that make a 10x or greater difference in real-world workloads.

One such recently proposed optimization is continuous batching. In this talk we'll discuss what it is, how it works, and how it enables a 23x improvement in throughput over naive HuggingFace transformers on a production workload (3x over the previous SOTA).

Anyscale

August 31, 2023

Transcript

  1. How continuous batching enables 23x throughput in LLM inference
    Cade Daniel

  2. About me
    ● At Anyscale for the last 1.5 years, working on Ray and LLMs
    ● Previously worked on the communication engine for LLM training at AWS
    ● Outside of work, I enjoy a good latte while liking hot takes on ML/AI Twitter 𝕏

  3. Motivation: Serving dominates cost for applications

  4. Motivation: Serving dominates cost for applications

  5. Goal: Show how continuous batching significantly reduces LLM serving costs
    ● LLM inference background
    ● Systems challenges that increase cost
    ● How continuous batching makes such an improvement (23x!)
    ● Benchmark results
    Note: most of this talk is covered in our blog post "How continuous batching enables 23x throughput in LLM inference while reducing p50 latency"
    Reduce serving costs → enable more LLM applications

  6. Layers in an LLM serving stack

  7. LLM inference background
    How does text generation work?
    ● Iterative: each forward pass generates a single token
    ● Autoregressive: generation consumes prompt tokens + previously generated tokens
    ● Completion potentially decided by the model: a generated token can be the end-of-sequence token
    Legend:
    ● Yellow: prompt token
    ● Blue: generated token
    ● Red: end-of-sequence token
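    To make the loop above concrete, here is a minimal greedy-decoding sketch using HuggingFace transformers. It is illustrative only, not the serving code from the talk; the small "facebook/opt-125m" checkpoint and the 32-token budget are arbitrary choices.

    # Minimal sketch of iterative, autoregressive generation (greedy decoding).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

    input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

    for _ in range(32):                                   # one token per forward pass (iterative)
        logits = model(input_ids).logits                  # consumes prompt + previously generated tokens (autoregressive)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:   # the model can decide completion
            break

    print(tokenizer.decode(input_ids[0]))

    Note that this naive loop re-runs attention over the full sequence at every step; the KV cache discussed two slides later avoids exactly that recomputation.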

  8. Systems challenges that increase cost
    ● Size of LLM parameters >> size of LLM data
    ○ Llama2 70B ~ 130 GB to store float16 parameters
    ○ 2x A100-80GB to store, 4x+ A100-80GB to maximize throughput
    ● Memory IO is a huge factor in latency
    ○ For a single token, have to load 130 GB into the compute cores
    ○ CPU memory IO ~ 10-50 GB/s
    ○ GPU memory IO ~ 2000 GB/s (A100-80GB)
    ● High throughput requires many FLOPS
    ○ A CPU can do real-time generation of a single sequence
    ○ A GPU can do real-time generation for many sequences
    (Figure from the FlashAttention paper: https://arxiv.org/pdf/2205.14135.pdf)
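    A back-of-envelope calculation shows why memory IO dominates per-token latency: each generated token requires streaming the full set of parameters through the compute cores. The bandwidth numbers below are the rough figures from the slide, not measurements.

    # Rough lower bound on per-token latency when generation is memory-bandwidth bound.
    # All figures are the approximate numbers quoted on the slide.
    weights_gb = 130        # Llama2 70B parameters in float16
    cpu_bw_gb_s = 30        # somewhere in the ~10-50 GB/s range
    gpu_bw_gb_s = 2000      # ~2 TB/s on an A100-80GB

    print(f"CPU: >= {weights_gb / cpu_bw_gb_s:.1f} s per token")           # ~4.3 s/token
    print(f"GPU: >= {weights_gb / gpu_bw_gb_s * 1000:.0f} ms per token")   # ~65 ms/token

    # The weight load is shared by every sequence in a batch, so packing the GPU
    # with many sequences amortizes it and raises throughput.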

  9. KV cache: transformer-specific optimization
    ● Autoregressive generation recomputes K and V for all previous tokens, even though they don't change
    ● Cache K,V to avoid this recomputation
    ● Cached K,V take ~1 MB per token for a 13B model
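    The per-token figure can be sanity-checked with a hedged estimate: KV cache per token ≈ 2 (K and V) × layers × hidden size × bytes per value. The shapes below are 13B-class numbers (40 layers, hidden size 5120, float16), and the formula ignores grouped/multi-query attention variants.

    # Hedged estimate of KV-cache memory per token for a 13B-class model.
    num_layers = 40
    hidden_size = 5120
    bytes_per_value = 2                    # float16

    kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value   # K and V
    print(f"{kv_bytes_per_token / 1e6:.2f} MB per token")                 # ~0.8 MB

    # A single 2048-token sequence therefore holds ~1.7 GB of KV cache, which is
    # why cache memory, not compute, often limits the achievable batch size.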

  10. Other optimizations
    ● Quantization – compress parameters but reduce model quality
    ○ Treats model like black-box
    ● Custom CUDA kernels – e.g. FlashAttention, reduces memory IO needed
    ○ Low-level, complicated
    ● Grouped Query Attention (GQA) – modify model architecture for optimized inference
    ○ Requires changes to training
    ● Continuous Batching – modify how sequences are batched
    ○ Works with any LLM!

  11. Static batching
    ● Batching multiple sequences on GPU, aka “static batching”
    ● Problem: GPU utilization drops as sequences complete
    Legend:
    ● Yellow: prompt token
    ● Blue: generated token
    ● Red: end-of-sequence token
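    The utilization problem is easiest to see in a toy scheduler sketch. This is pure pseudo-scheduling in Python, not any framework's implementation; generate_one_token and the finished flag are hypothetical stand-ins for one forward pass and the request state.

    # Toy sketch of static batching: the batch composition is fixed until every
    # sequence in it finishes, so slots that hit end-of-sequence early sit idle.
    def static_batching(requests, batch_size, generate_one_token):
        for i in range(0, len(requests), batch_size):
            batch = requests[i:i + batch_size]
            while not all(seq.finished for seq in batch):
                for seq in batch:
                    if not seq.finished:
                        generate_one_token(seq)   # useful work
                    # finished sequences still occupy their slot: wasted GPU capacity
            # only after the longest sequence completes can the next batch start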

  12. Continuous batching: low-hanging fruit
    ● Continuous batching dynamically recreates batches
    ● Fills GPU capacity after each token generation

  13. Continuous batching
    Top: static batching
    Bottom: continuous batching
    Legend:
    ● Yellow: prompt token
    ● Blue: generated token
    ● Red: end-of-sequence token

  14. Continuous batching
    ● Continuous batching dynamically recreates batches
    ● Fills GPU capacity after each token generation
    ● As variance in sequence length increases, continuous batching increases GPU utilization
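    For contrast, a minimal sketch of the continuous-batching loop, again with hypothetical helpers (step_one_token runs one forward pass over the current batch); it is not the actual TGI or vLLM scheduler.

    # Toy sketch of continuous (iteration-level) batching: after every generation
    # step, finished sequences leave the batch and queued requests take their
    # slots, keeping GPU capacity filled.
    from collections import deque

    def continuous_batching(request_queue: deque, max_batch_size, step_one_token):
        running = []
        while request_queue or running:
            # refill free slots from the waiting queue before each iteration
            while request_queue and len(running) < max_batch_size:
                running.append(request_queue.popleft())

            step_one_token(running)   # one forward pass, one new token per running sequence

            # evict completed sequences so new requests can join on the next iteration
            running = [seq for seq in running if not seq.finished]

    The longer the tail of output lengths, the more often slots free up mid-batch, which is why the benefit grows with variance.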

  15. Throughput experiments
    ● Hypothesis
    ○ Continuous batching performs better the more variance there is in sequence lengths
    ● Frameworks
    ● Setup – hardware/model
    ● Setup – data
    ● Results

  16. Throughput experiments: Frameworks
    Static batching
    ● HuggingFace Pipelines (link)
    ● NVIDIA FasterTransformer (link)
    Continuous batching
    ● HuggingFace text-generation-inference (TGI) (link)
    ● Ray Serve
    ● vLLM (link)

  17. Throughput experiments: Hardware/model
    ● 1x NVIDIA A100-40GB SXM GPU
    ● Provided by Anyscale
    ● Meta’s OPT-13B
    ○ dtype=float16 → 26GB for parameters
    ● No tensor parallelism

  18. Throughput experiments: Data
    ● Hypothesis
    ○ Continuous batching performs better the more variance there is in sequence lengths
    ● How to test?
    ○ Generate 1000 prompts, each with 512 input tokens
    ○ Assign each prompt a predetermined output length, drawn from an exponential distribution
    ○ Configure the model to ignore the EOS token
    ● How to control variance in sequence lengths?
    ○ Cap the random output lengths artificially
    ○ E.g. at 32, 128, 512, and 1536 output tokens
    ○ 4 experiments, one per cap (see the sketch below)
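    A sketch of how the output-length distribution described above can be produced. The exponential mean of 128 tokens is illustrative; the slides do not state the exact parameter used in the benchmark.

    # Sketch of the benchmark data setup: 1000 prompts of 512 input tokens each,
    # predetermined output lengths drawn from an exponential distribution, then
    # capped to control variance (one cap per experiment).
    import numpy as np

    rng = np.random.default_rng(0)
    num_prompts, prompt_len = 1000, 512
    mean_output_len = 128                     # illustrative, not the benchmark's exact value

    for cap in (32, 128, 512, 1536):          # the four experiments
        lens = np.clip(rng.exponential(scale=mean_output_len, size=num_prompts), 1, cap).astype(int)
        print(f"cap={cap}: mean={lens.mean():.0f}, std={lens.std():.0f}")

    # During the run the model ignores EOS and generates exactly lens[i] tokens for prompt i.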

  19. Throughput experiments: Results

  20. Throughput experiments: Results

  21. How does vLLM beat TGI?
    ● Note – we ran these experiments in June; TGI has since gotten much closer to vLLM
    ● TGI and vLLM both use continuous batching
    ● vLLM uses PagedAttention – extra batch-size space (see the sketch below)
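    The "extra batch-size space" comes from how PagedAttention manages KV-cache memory: instead of reserving one contiguous max-length region per sequence, KV entries live in small fixed-size blocks handed out on demand. Below is a toy sketch of that block-allocation idea, not vLLM's actual implementation; the block size and class names are made up.

    # Toy sketch of block-based KV-cache allocation (the idea behind PagedAttention).
    # Blocks are allocated on demand, so memory that per-sequence max-length
    # reservations would waste can instead host additional sequences.
    BLOCK_SIZE = 16   # tokens per block (illustrative)

    class BlockAllocator:
        def __init__(self, num_blocks):
            self.free_blocks = list(range(num_blocks))
            self.block_tables = {}               # seq_id -> list of physical block ids

        def append_token(self, seq_id, tokens_already_stored):
            table = self.block_tables.setdefault(seq_id, [])
            if tokens_already_stored % BLOCK_SIZE == 0:
                # current block is full (or sequence is new): grab a fresh block;
                # running out of blocks would trigger preemption in a real scheduler
                table.append(self.free_blocks.pop())
            return table                          # logical -> physical mapping used by attention

        def free(self, seq_id):
            # return all of a finished sequence's blocks to the pool
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))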

  22. How does vLLM beat TGI?
