

Luca Baggi
June 07, 2025

LLM Inference Arithmetics: the Theory behind Model Serving

Have you ever asked yourself how the parameters of an LLM are counted, or wondered why Gemma 2B is actually closer to a 3B model? No clue what a KV-Cache is? (And, before you ask: no, it's not a Redis fork.) Do you want to find out how much GPU VRAM you need to run your model smoothly?

If your answer to any of these questions was "yes", or you have other questions about inference with LLMs - such as batching, or time-to-first-token - this talk is for you. Well, except for the Redis part.


Transcript

  1. LLM Inference Arithmetics
     All the maths to understand model serving, and when you should choose self-hosting
     👀 Luca Baggi 💼 AI Engineer @ xtream 🇬🇧 PyData London (2025/06/07)
  2. πŸ“Outline πŸ™Œ Disclaimers 🎯 Takeaways πŸ“ Three quantities along two

    dimensions 🧠 A simple neural network πŸ€– What about transformers? βš™ Implications when serving an LLM
  3. πŸ™Œ Disclaimers 1. There will be some simpli f ications,

    and omissions due to time. Let’s chat about them later! 2. Lots of praise to prof. Sasha Rush, who originally published Street Fighting Transformers. Check the video for more in-depth explanations!
  4. 🎯 Takeaways
     • Model training and inference differ substantially in the amount of compute and memory required. This mostly boils down to: transformer training is parallel, inference is serial.
     • We need to accept trade-offs between latency and optimal hardware usage when serving models. Increasing batch size is one answer, but it’s not a silver bullet and has its downsides in terms of latency.
     • Self-hosting can be viable with async batch processing and prompt-heavy tasks (long prompts, short completions).
  5. πŸ“ Three quantities along two dimensions β€’ Three quantities: β€’

    Number of parameters β€’ Memory (GB) β€’ Compute (FLOPs) β€’ Along two dimensions: β€’ Training, but mostly β€’ Inference β€’ (There’s also f ine tuning, but we pretend that does not exist)
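To make the link between the first two quantities concrete, here is a minimal back-of-envelope sketch of how a parameter count translates into weight memory at different precisions; the 7B parameter count and the precision list are assumptions for illustration, not numbers from the slides.

```python
# Back-of-envelope: memory needed just to hold the weights.
# The 7B parameter count and the precisions below are illustrative assumptions.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}

n_params = 7e9  # hypothetical model size

for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"{precision:>9}: ~{n_params * nbytes / 1e9:.0f} GB of weights")
```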
  6. 🧠 A simple neural net
     Compute: differences between training and inference
     • Batch size B: neural networks can process multiple inputs in parallel, so we introduce B to denote the number of samples we pass.
     • Why does training require three times as much compute?
       • One for the forward pass;
       • Two for the backward pass:
         • Once to compute the derivative of the loss with respect to the weights.
         • Once to compute the derivative of the loss with respect to the (intermediate) inputs (to propagate to other layers via the chain rule).
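A minimal sketch of the "training is roughly three forward passes" rule, assuming the common approximation of about 2 FLOPs (one multiply plus one add) per parameter per sample for a forward pass; the parameter count and batch size below are made-up examples.

```python
# Sketch: why training costs ~3x the compute of inference for the same batch.
# Assumes ~2 FLOPs per parameter per sample for the forward pass; the numbers
# below are made up for illustration.
n_params = 1e9   # hypothetical parameter count
batch_size = 64  # B samples processed in parallel

forward_flops = 2 * n_params * batch_size
backward_flops = 2 * forward_flops      # d(loss)/d(weights) + d(loss)/d(inputs)
training_flops = forward_flops + backward_flops

print(f"inference: {forward_flops:.2e} FLOPs")
print(f"training : {training_flops:.2e} FLOPs  (= 3x the forward pass)")
```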
  7. 🧠 A simple neural net
     Memory: differences between training and inference
     • Activations:
       • At inference time, we only need to store (at least) the largest activation for each sample in the batch.
       • At training time, we need all activations to backpropagate.
     • Optimiser state:
       • If using Adam, it’s equivalent to two additional copies of the weights.
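The optimiser-state point can be turned into a rough memory budget. The sketch below ignores activations, assumes fp32 weights, gradients and Adam state (its two moment buffers), and uses a made-up parameter count.

```python
# Rough training vs inference memory budget, ignoring activations.
# Assumes fp32 everywhere and Adam, whose state (m and v) is two extra
# weight-sized buffers; the parameter count is illustrative.
n_params = 1e9
bytes_per_value = 4  # fp32

weights = n_params * bytes_per_value
gradients = n_params * bytes_per_value       # training only
adam_state = 2 * n_params * bytes_per_value  # training only

print(f"inference: {weights / 1e9:.0f} GB (weights only)")
print(f"training : {(weights + gradients + adam_state) / 1e9:.0f} GB (weights + gradients + Adam)")
```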
  8. πŸ€– What about transformers? A bit of notation β€’ A

    transformer is made of three building blocks: β€’ An embedding layer of size V (the vocabulary size) β€’ N layers made up of: β€’ Feed forward β€’ Attention
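Under common simplifications (no biases, layer norms ignored, d_ff = 4·d_model, output head tied with the embedding) this notation gives a quick parameter-count estimate. The configuration below is illustrative rather than any specific model's real shape, but it shows how a large vocabulary can push a nominally "small" model well past its headline size.

```python
# Rough parameter count for a decoder-only transformer.
# Simplifications (assumptions, not from the slides): no biases, layer norms
# ignored, d_ff = 4 * d_model, output head tied with the embedding.
def approx_params(vocab_size: int, d_model: int, n_layers: int) -> float:
    embedding = vocab_size * d_model   # embedding layer of size V
    attention = 4 * d_model**2         # Q, K, V and output projections
    feed_forward = 8 * d_model**2      # up- and down-projections with d_ff = 4 * d_model
    return embedding + n_layers * (attention + feed_forward)

# Illustrative configuration only:
total = approx_params(vocab_size=256_000, d_model=2048, n_layers=18)
print(f"~{total / 1e9:.1f}B parameters, of which "
      f"{256_000 * 2048 / total:.0%} sit in the embedding table")
```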
  9. πŸ€– What about transformers? Attention block: compute (training) β€’ During

    training, the transformer block is easy to parallelise: β€’ Since we know the whole sequence beforehand, we can run each step (i.e., prediction of tokens in the sequence) as part of the same batch.
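A minimal numpy sketch of what "the whole sequence is known beforehand" buys us: every position's attention scores come out of a single matrix product, with a causal mask standing in for generation order (toy sizes, single head).

```python
import numpy as np

# Training-time attention sketch: all T positions scored in one matmul,
# with a causal mask hiding future tokens (toy sizes, single head).
T, D = 8, 16
Q = np.random.randn(T, D)
K = np.random.randn(T, D)

scores = Q @ K.T / np.sqrt(D)                     # (T, T): all steps at once
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
scores[mask] = -np.inf                            # each position only sees the past
```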
  10. πŸ€– What about transformers? Attention block: compute (inference) β€’ At

    inference time, we don’t know what tokens we are generating next by de f inition. β€’ This implies that we need to wait for token t to be generated, before generating t + 1. β€’ Is there a way we can mimic this behaviour when doing inference to reduce compute? Yes: by using more memory.
  11. πŸ€– What about transformers? Attention block: KV-Cache β€’ Attention β€œlooks

    back”, requiring: β€’ Q from the current step β€’ K, V (size 2D) from the past T-1 steps β€’ Instead of re-computing them, we store all K, V for all previous steps in memory
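A minimal, framework-free sketch of the idea (single head, toy dimension): each decoding step appends the new K and V to the cache instead of recomputing them for the previous T-1 steps.

```python
import numpy as np

# Naive KV-cache for a single attention head (illustrative only).
D = 16
k_cache: list[np.ndarray] = []
v_cache: list[np.ndarray] = []

def attend(q_t: np.ndarray, k_t: np.ndarray, v_t: np.ndarray) -> np.ndarray:
    # Store this step's K and V so later steps never recompute them.
    k_cache.append(k_t)
    v_cache.append(v_t)
    K = np.stack(k_cache)              # (t, D): keys for all steps so far
    V = np.stack(v_cache)              # (t, D): values for all steps so far
    scores = K @ q_t / np.sqrt(D)      # (t,): matrix-vector product
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                 # attention output for the current token

# One decoding step, with random stand-ins for the projected current token:
out = attend(np.random.randn(D), np.random.randn(D), np.random.randn(D))
```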
  12. πŸ€– What about transformers? Attention block: memory (inference) Inference engines

    (vLLM, SGLang, NVIDIA Triton/Dynamo…) always implement a KV-cache algorithm, more sophisticated than this one
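For a feel of the memory involved, here is a back-of-envelope KV-cache size: one K and one V vector of width d_model per token per layer. It assumes fp16 storage and full multi-head attention (GQA/MQA and the paged caches used by real engines shrink this), and all configuration numbers are made up.

```python
# Back-of-envelope KV-cache size. Assumes fp16 and full multi-head attention
# (no GQA/MQA); all configuration numbers are illustrative.
def kv_cache_bytes(n_layers: int, d_model: int, seq_len: int,
                   batch_size: int, bytes_per_value: int = 2) -> float:
    # 2 = one K and one V vector of size d_model per token per layer
    return 2 * n_layers * d_model * seq_len * batch_size * bytes_per_value

gb = kv_cache_bytes(n_layers=32, d_model=4096, seq_len=4096, batch_size=8) / 1e9
print(f"~{gb:.0f} GB of KV-cache, on top of the weights")
```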
  13. βš™ Implications when serving an LLM A recap β€’ The

    attention mechanism behaves di ff erently during training than during inference β€’ During training, self-attention can be computed e ff iciently since the whole sequence is known in advance and we generate only one token β€’ During inference, we are generating new tokens. To reduce the amount of compute (and latency), we invest more memory to store in a cache the key and value of past tokens in the sequence β€’ Hardware usage is still suboptimal, since values need to be read from the cache! (I/O bottleneck)
  14. βš™ Implications when serving an LLM Filling the KV-cache creates

    two stages in LLM inference β€’ Using a KV-cache splits generation in two steps: β€’ The pre f ill of the KV-cache, that can happen in parallel β€’ In this step we try to maximise GPU usage (matrix-matrix multiplication) β€’ The decoding phase, that happens auto-regressively, where one token is generated at a time (matrix-vector multiplication) β€’ In this phase, the hardware is still underutilised since we need to read data from the cache into the compute units!
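The two phases boil down to different matmul shapes. The toy numpy sketch below uses a single weight matrix as a stand-in for the whole model to show the matrix-matrix vs matrix-vector difference; the sizes are arbitrary.

```python
import numpy as np

# Prefill vs decode as matmul shapes (toy sizes, one layer standing in for the model).
d_model = 1024
W = np.random.randn(d_model, d_model)

prompt = np.random.randn(512, d_model)     # prefill: all prompt tokens at once
_ = prompt @ W                             # matrix-matrix: lots of work per byte of W loaded

next_token = np.random.randn(1, d_model)   # decode: one token per step
_ = next_token @ W                         # matrix-vector: W is reloaded for very little work
```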
  15. βš™ Implications when serving an LLM Better hardware utilisation with

    increased batch size β€’ To improve hardware utilisation during the decoding phase, we can increase the batch size: β€’ Batching increases the model’s arithmetic intensity by doing more computation for the same number of loads and stores from memory β€’ In other words, this increases the number of tokens/second (but there are diminishing returns) β€’ On the other hand, this decreases the time to f irst token, since we need to f ill the KV cache for more sequences!
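A toy estimate of how batching raises arithmetic intensity during decoding: the weights are loaded once per step but reused for every sequence in the batch. This ignores KV-cache reads, which do grow with batch size; the 7B parameter count and fp16 weights are assumptions.

```python
# Toy arithmetic intensity for one decode step: FLOPs scale with batch size,
# while the weight bytes moved stay roughly constant. Ignores KV-cache reads;
# the 7B parameter count and fp16 weights are assumptions.
def arithmetic_intensity(batch_size: int, n_params: float, bytes_per_param: int = 2) -> float:
    flops = 2 * n_params * batch_size         # one new token per sequence in the batch
    bytes_moved = n_params * bytes_per_param  # weights loaded once, reused across the batch
    return flops / bytes_moved

for b in (1, 8, 64):
    print(f"batch {b:>2}: ~{arithmetic_intensity(b, n_params=7e9):.0f} FLOPs/byte")
```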
  16. βš™ Implications when serving an LLM Other limits of increased

    batch size β€’ To avoid extreme latency or memory over f low, we can’t use static batching. Inference servers implement clever batching algorithms. β€’ Throughput doesn't increase linearly with batch size inde f initely β€’ You might still hit bottlenecks in memory bandwidth (moving data to/ from VRAM) and computation limits of the GPU.
  17. βš™ Implications when serving an LLM When should I use

    self-hosted models? β€’ In general, you might save money with self-hosting if: β€’ You perform batch processing jobs (asynchronously) β€’ Your completions are β€œprompt heavy” tasks, i.e. pre- f ill phase is dominant β€’ Run your own experiments πŸ€“
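In the spirit of "run your own experiments", here is a skeleton for a back-of-envelope cost comparison; every number is a placeholder to swap for your own measured throughput and the prices you actually pay.

```python
# Placeholder cost comparison for self-hosted batch processing.
# All numbers are assumptions: plug in your measured throughput and real prices.
gpu_cost_per_hour = 2.0      # $/hour for a rented GPU (assumed)
tokens_per_second = 2_000    # measured throughput of your async batch workload (assumed)

tokens_per_hour = tokens_per_second * 3600
self_hosted_cost_per_million = gpu_cost_per_hour / (tokens_per_hour / 1e6)
print(f"self-hosted: ~${self_hosted_cost_per_million:.2f} per million tokens")
# Compare against your API provider's per-token pricing for the same workload.
```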
  18. πŸ“š References Used to prepare this talk, and more β€’

    Street Fighting Transformers β€’ Transformers Optimization: Part 1 - KV Cache β€’ Mastering LLM Techniques: Inference Optimization β€’ A guide to LLM inference and performance β€’ Inference Characteristics of Llama