

Luca Baggi
June 07, 2025

LLM Inference Arithmetics: the Theory behind Model Serving

Have you ever asked yourself how the parameters of an LLM are counted, or wondered why Gemma 2B is actually closer to a 3B model? No clue what a KV-Cache is? (And, before you ask: no, it's not a Redis fork.) Do you want to find out how much GPU VRAM you need to run your model smoothly?

If your answer to any of these questions was "yes", or you have other questions about inference with LLMs - such as batching, or time-to-first-token - this talk is for you. Well, except for the Redis part.


Transcript

  1. LLM Inference Arithmetics
     All the maths to understand model serving, and when you should choose self-hosting
     👀 Luca Baggi 💼 AI Engineer @ xtream 🇬🇧 PyData London (2025/06/07)
  2. πŸ“Outline πŸ™Œ Disclaimers 🎯 Takeaways πŸ“ Three quantities along two

    dimensions 🧠 A simple neural network πŸ€– What about transformers? βš™ Implications when serving an LLM
  3. πŸ™Œ Disclaimers 1. There will be some simpli f ications,

    and omissions due to time. Let’s chat about them later! 2. Lots of praise to prof. Sasha Rush, who originally published Street Fighting Transformers. Check the video for more in-depth explanations!
  4. 🎯 Takeaways
     • Model training and inference differ substantially in the amount of compute and memory required. This mostly boils down to: transformer training is parallel, inference is serial.
     • We need to accept trade-offs between latency and optimal hardware usage when serving models. Increasing batch size is one answer, but it’s not a silver bullet and has its downsides in terms of latency.
     • Self-hosting can be viable with async batch processing and prompt-heavy tasks (long prompts, short completions).
  5. πŸ“ Three quantities along two dimensions β€’ Three quantities: β€’

    Number of parameters β€’ Memory (GB) β€’ Compute (FLOPs) β€’ Along two dimensions: β€’ Training, but mostly β€’ Inference β€’ (There’s also f ine tuning, but we pretend that does not exist)
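To make the link between the first two quantities concrete, here is a minimal back-of-envelope sketch of how a parameter count translates into weight memory at different precisions; the 7B parameter count and the precision list are assumptions for illustration, not numbers from the slides.

```python
# Back-of-envelope: memory needed just to hold the weights.
# The 7B parameter count and the precisions below are illustrative assumptions.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}

n_params = 7e9  # hypothetical model size

for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"{precision:>9}: ~{n_params * nbytes / 1e9:.0f} GB of weights")
```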
  6. 🧠 A simple neural net
     Compute: differences between training and inference
     • Batch size B: neural networks can process multiple inputs in parallel, so we introduce B to denote the number of samples we pass.
     • Why does training require three times as much compute?
       • One for the forward pass;
       • Two for the backward pass:
         • Once to compute the derivative of the loss with respect to the weights.
         • Once to compute the derivative of the loss with respect to the (intermediate) inputs (to propagate to other layers via the chain rule).
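A minimal sketch of the "training is roughly three forward passes" rule, assuming the common approximation of about 2 FLOPs (one multiply plus one add) per parameter per sample for a forward pass; the parameter count and batch size below are made-up examples.

```python
# Sketch: why training costs ~3x the compute of inference for the same batch.
# Assumes ~2 FLOPs per parameter per sample for the forward pass; the numbers
# below are made up for illustration.
n_params = 1e9   # hypothetical parameter count
batch_size = 64  # B samples processed in parallel

forward_flops = 2 * n_params * batch_size
backward_flops = 2 * forward_flops      # d(loss)/d(weights) + d(loss)/d(inputs)
training_flops = forward_flops + backward_flops

print(f"inference: {forward_flops:.2e} FLOPs")
print(f"training : {training_flops:.2e} FLOPs  (= 3x the forward pass)")
```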
  7. 🧠 A simple neural net
     Memory: differences between training and inference
     • Activations:
       • At inference time, we only need to store (at least) the largest activation for each sample in the batch.
       • At training time, we need all activations to backpropagate.
     • Optimiser state:
       • If using Adam, it’s equivalent to two additional copies of the weights.
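The optimiser-state point can be turned into a rough memory budget. The sketch below ignores activations, assumes fp32 weights, gradients and Adam state (its two moment buffers), and uses a made-up parameter count.

```python
# Rough training vs inference memory budget, ignoring activations.
# Assumes fp32 everywhere and Adam, whose state (m and v) is two extra
# weight-sized buffers; the parameter count is illustrative.
n_params = 1e9
bytes_per_value = 4  # fp32

weights = n_params * bytes_per_value
gradients = n_params * bytes_per_value       # training only
adam_state = 2 * n_params * bytes_per_value  # training only

print(f"inference: {weights / 1e9:.0f} GB (weights only)")
print(f"training : {(weights + gradients + adam_state) / 1e9:.0f} GB (weights + gradients + Adam)")
```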
  8. πŸ€– What about transformers? A bit of notation β€’ A

    transformer is made of three building blocks: β€’ An embedding layer of size V (the vocabulary size) β€’ N layers made up of: β€’ Feed forward β€’ Attention
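Under common simplifications (no biases, layer norms ignored, d_ff = 4·d_model, output head tied with the embedding) this notation gives a quick parameter-count estimate. The configuration below is illustrative rather than any specific model's real shape, but it shows how a large vocabulary can push a nominally "small" model well past its headline size.

```python
# Rough parameter count for a decoder-only transformer.
# Simplifications (assumptions, not from the slides): no biases, layer norms
# ignored, d_ff = 4 * d_model, output head tied with the embedding.
def approx_params(vocab_size: int, d_model: int, n_layers: int) -> float:
    embedding = vocab_size * d_model   # embedding layer of size V
    attention = 4 * d_model**2         # Q, K, V and output projections
    feed_forward = 8 * d_model**2      # up- and down-projections with d_ff = 4 * d_model
    return embedding + n_layers * (attention + feed_forward)

# Illustrative configuration only:
total = approx_params(vocab_size=256_000, d_model=2048, n_layers=18)
print(f"~{total / 1e9:.1f}B parameters, of which "
      f"{256_000 * 2048 / total:.0%} sit in the embedding table")
```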
  9. πŸ€– What about transformers? Attention block: compute (training) β€’ During

    training, the transformer block is easy to parallelise: β€’ Since we know the whole sequence beforehand, we can run each step (i.e., prediction of tokens in the sequence) as part of the same batch.
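A minimal numpy sketch of what "the whole sequence is known beforehand" buys us: every position's attention scores come out of a single matrix product, with a causal mask standing in for generation order (toy sizes, single head).

```python
import numpy as np

# Training-time attention sketch: all T positions scored in one matmul,
# with a causal mask hiding future tokens (toy sizes, single head).
T, D = 8, 16
Q = np.random.randn(T, D)
K = np.random.randn(T, D)

scores = Q @ K.T / np.sqrt(D)                     # (T, T): all steps at once
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
scores[mask] = -np.inf                            # each position only sees the past
```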
  10. πŸ€– What about transformers? Attention block: compute (inference) β€’ At

    inference time, we don’t know what tokens we are generating next by de f inition. β€’ This implies that we need to wait for token t to be generated, before generating t + 1. β€’ Is there a way we can mimic this behaviour when doing inference to reduce compute? Yes: by using more memory.
  11. πŸ€– What about transformers? Attention block: KV-Cache β€’ Attention β€œlooks

    back”, requiring: β€’ Q from the current step β€’ K, V (size 2D) from the past T-1 steps β€’ Instead of re-computing them, we store all K, V for all previous steps in memory
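A minimal, framework-free sketch of the idea (single head, toy dimension): each decoding step appends the new K and V to the cache instead of recomputing them for the previous T-1 steps.

```python
import numpy as np

# Naive KV-cache for a single attention head (illustrative only).
D = 16
k_cache: list[np.ndarray] = []
v_cache: list[np.ndarray] = []

def attend(q_t: np.ndarray, k_t: np.ndarray, v_t: np.ndarray) -> np.ndarray:
    # Store this step's K and V so later steps never recompute them.
    k_cache.append(k_t)
    v_cache.append(v_t)
    K = np.stack(k_cache)              # (t, D): keys for all steps so far
    V = np.stack(v_cache)              # (t, D): values for all steps so far
    scores = K @ q_t / np.sqrt(D)      # (t,): matrix-vector product
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                 # attention output for the current token

# One decoding step, with random stand-ins for the projected current token:
out = attend(np.random.randn(D), np.random.randn(D), np.random.randn(D))
```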
  12. πŸ€– What about transformers? Attention block: memory (inference) Inference engines

    (vLLM, SGLang, NVIDIA Triton/Dynamo…) always implement a KV-cache algorithm, more sophisticated than this one
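For a feel of the memory involved, here is a back-of-envelope KV-cache size: one K and one V vector of width d_model per token per layer. It assumes fp16 storage and full multi-head attention (GQA/MQA and the paged caches used by real engines shrink this), and all configuration numbers are made up.

```python
# Back-of-envelope KV-cache size. Assumes fp16 and full multi-head attention
# (no GQA/MQA); all configuration numbers are illustrative.
def kv_cache_bytes(n_layers: int, d_model: int, seq_len: int,
                   batch_size: int, bytes_per_value: int = 2) -> float:
    # 2 = one K and one V vector of size d_model per token per layer
    return 2 * n_layers * d_model * seq_len * batch_size * bytes_per_value

gb = kv_cache_bytes(n_layers=32, d_model=4096, seq_len=4096, batch_size=8) / 1e9
print(f"~{gb:.0f} GB of KV-cache, on top of the weights")
```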
  13. βš™ Implications when serving an LLM A recap β€’ The

    attention mechanism behaves di ff erently during training than during inference β€’ During training, self-attention can be computed e ff iciently since the whole sequence is known in advance and we generate only one token β€’ During inference, we are generating new tokens. To reduce the amount of compute (and latency), we invest more memory to store in a cache the key and value of past tokens in the sequence β€’ Hardware usage is still suboptimal, since values need to be read from the cache! (I/O bottleneck)
  14. βš™ Implications when serving an LLM Filling the KV-cache creates

    two stages in LLM inference β€’ Using a KV-cache splits generation in two steps: β€’ The pre f ill of the KV-cache, that can happen in parallel β€’ In this step we try to maximise GPU usage (matrix-matrix multiplication) β€’ The decoding phase, that happens auto-regressively, where one token is generated at a time (matrix-vector multiplication) β€’ In this phase, the hardware is still underutilised since we need to read data from the cache into the compute units!
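The two phases boil down to different matmul shapes. The toy numpy sketch below uses a single weight matrix as a stand-in for the whole model to show the matrix-matrix vs matrix-vector difference; the sizes are arbitrary.

```python
import numpy as np

# Prefill vs decode as matmul shapes (toy sizes, one layer standing in for the model).
d_model = 1024
W = np.random.randn(d_model, d_model)

prompt = np.random.randn(512, d_model)     # prefill: all prompt tokens at once
_ = prompt @ W                             # matrix-matrix: lots of work per byte of W loaded

next_token = np.random.randn(1, d_model)   # decode: one token per step
_ = next_token @ W                         # matrix-vector: W is reloaded for very little work
```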
  15. βš™ Implications when serving an LLM Better hardware utilisation with

    increased batch size β€’ To improve hardware utilisation during the decoding phase, we can increase the batch size: β€’ Batching increases the model’s arithmetic intensity by doing more computation for the same number of loads and stores from memory β€’ In other words, this increases the number of tokens/second (but there are diminishing returns) β€’ On the other hand, this decreases the time to f irst token, since we need to f ill the KV cache for more sequences!
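A toy estimate of how batching raises arithmetic intensity during decoding: the weights are loaded once per step but reused for every sequence in the batch. This ignores KV-cache reads, which do grow with batch size; the 7B parameter count and fp16 weights are assumptions.

```python
# Toy arithmetic intensity for one decode step: FLOPs scale with batch size,
# while the weight bytes moved stay roughly constant. Ignores KV-cache reads;
# the 7B parameter count and fp16 weights are assumptions.
def arithmetic_intensity(batch_size: int, n_params: float, bytes_per_param: int = 2) -> float:
    flops = 2 * n_params * batch_size         # one new token per sequence in the batch
    bytes_moved = n_params * bytes_per_param  # weights loaded once, reused across the batch
    return flops / bytes_moved

for b in (1, 8, 64):
    print(f"batch {b:>2}: ~{arithmetic_intensity(b, n_params=7e9):.0f} FLOPs/byte")
```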
  16. βš™ Implications when serving an LLM Other limits of increased

    batch size β€’ To avoid extreme latency or memory over f low, we can’t use static batching. Inference servers implement clever batching algorithms. β€’ Throughput doesn't increase linearly with batch size inde f initely β€’ You might still hit bottlenecks in memory bandwidth (moving data to/ from VRAM) and computation limits of the GPU.
  17. βš™ Implications when serving an LLM When should I use

    self-hosted models? β€’ In general, you might save money with self-hosting if: β€’ You perform batch processing jobs (asynchronously) β€’ Your completions are β€œprompt heavy” tasks, i.e. pre- f ill phase is dominant β€’ Run your own experiments πŸ€“
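In the spirit of "run your own experiments", here is a skeleton for a back-of-envelope cost comparison; every number is a placeholder to swap for your own measured throughput and the prices you actually pay.

```python
# Placeholder cost comparison for self-hosted batch processing.
# All numbers are assumptions: plug in your measured throughput and real prices.
gpu_cost_per_hour = 2.0      # $/hour for a rented GPU (assumed)
tokens_per_second = 2_000    # measured throughput of your async batch workload (assumed)

tokens_per_hour = tokens_per_second * 3600
self_hosted_cost_per_million = gpu_cost_per_hour / (tokens_per_hour / 1e6)
print(f"self-hosted: ~${self_hosted_cost_per_million:.2f} per million tokens")
# Compare against your API provider's per-token pricing for the same workload.
```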
  18. πŸ“š References Used to prepare this talk, and more β€’

    Street Fighting Transformers β€’ Transformers Optimization: Part 1 - KV Cache β€’ Mastering LLM Techniques: Inference Optimization β€’ A guide to LLM inference and performance β€’ Inference Characteristics of Llama