CafeGPT: Serving LLMs Like Coffee With Kubernetes

Madhav Jivrajani & Kartik Ramesh
Gopher credits: https://github.com/ashleymcnamara/gophers

Before we get into it, we’d like to start with what this talk isn’t!

This isn’t a deep dive into LLM inference and the internals of how Kubernetes enables it. There are talks much better suited to that topic at this conference and others!

We’ve been trying to educate ourselves on how many of these different pieces of the ecosystem fit together, and this is us hoping to share that picture with you.

Welcome to CafeGPT! ☕
• CafeGPT is the best cafe at KubeCon 2025.
• We have a new customer with a drink request.
• There’s a manager and a barista to help fulfill the request!
• The manager directs the request to the barista.
• The barista has access to coffee machines to help fulfill the request.
• Each coffee machine has many moving parts: the grinder, the steamer and the pressure valve.
• The barista uses a coffee machine to make the drink…
• … and finally serves the drink back to the customer via the manager.
• Our customer loves the coffee!
• They loved it because it was served in a timely manner and it’s not super expensive.

It’s analogy time!

Serving In CafeGPT
• The drink request is the… request.
• The manager is the router responsible for delegating the request to a barista.
• The barista is the inference engine, using the resources available to it to convert requests into coffee.
• The coffee machines are our GPUs that actually brew the responses!

Let’s continue making things a little more concrete… what does the lifecycle of an LLM request look like?

Lifecycle of an LLM request
LLM inference engines convert the user input into an output using a trained LLM on specialized hardware.

KV caching
1. Process the entire input sequence and store the KV caches.
2. Generate the next token using the KV cache, repeating until you hit <EOS>.
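The two steps above can be sketched as a small generation loop. This is a minimal illustration, not how a real inference engine is written: `toy_kv` and `toy_next_token` are hypothetical stand-ins for the model's key/value projection and its attention-plus-sampling step, and `EOS = 0` is an assumed token id.

```python
from typing import List, Tuple

EOS = 0  # assumed end-of-sequence token id (hypothetical)


def toy_kv(token: int) -> Tuple[int, int]:
    """Hypothetical stand-in for computing one token's key/value entry."""
    return (token * 3 + 1, token * 7 + 2)


def toy_next_token(kv_cache: List[Tuple[int, int]]) -> int:
    """Hypothetical stand-in for attention over the cache plus sampling."""
    return sum(k + v for k, v in kv_cache) % 5


def generate(prompt: List[int], max_new_tokens: int = 32) -> List[int]:
    # Step 1 (prefill): process the whole prompt once and store its KV entries.
    kv_cache = [toy_kv(token) for token in prompt]
    output: List[int] = []
    # Step 2 (decode): produce one token per iteration, reusing the cache.
    for _ in range(max_new_tokens):
        token = toy_next_token(kv_cache)
        if token == EOS:
            break
        output.append(token)
        # Only the new token's KV entry is computed; earlier entries are reused.
        kv_cache.append(toy_kv(token))
    return output
```

Note how the cache grows by one entry per generated token, which is why memory use depends on how long the decode runs.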

KV caching speeds up LLM inference by 5x [1], but it comes at the cost of higher memory use, which becomes a bottleneck for concurrent processing of requests.
For Llama 3.1 8B with 8K context length on an NVIDIA A100 GPU, you can only serve about 24 concurrent requests [2], since (40GB - 16GB) / 1GB per request = 24:
Memory available: 40GB
Model weights (FP16): 16GB
KV cache per request: 1GB
As a result, a lot of research has emerged to make better use of this memory [3].
1. https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms
2. https://lmcache.ai/kv_cache_calculator.html
3. https://arxiv.org/abs/2309.06180, https://lmcache.ai/tech_report.pdf
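The memory figures above can be reproduced with a short back-of-the-envelope calculation. The model shape used here (32 layers, 8 KV heads, head dimension 128) is an assumption taken from the publicly released Llama 3.1 8B configuration rather than from the slides, and the sketch treats GB as GiB and ignores activations and other runtime overhead.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama 3.1 8B shape, stored in FP16 (2 bytes per value).
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8,
                                     head_dim=128, bytes_per_value=2)
per_request = per_token * 8192            # 8K context length
free_memory = 40 * GIB - 16 * GIB         # GPU memory minus model weights
max_concurrent_requests = free_memory // per_request

print(per_token // 1024, "KiB per token")
print(per_request / GIB, "GiB per request")
print(max_concurrent_requests, "concurrent requests")
```

Under these assumptions the KV cache costs 128 KiB per token, so a full 8K-token request needs 1 GiB and the remaining 24 GiB fits 24 such requests.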

Prefill vs Decode
Serving at CafeGPT has two phases:
1. Prefill - Do the prep work.
2. Decode - Multiple iterations of small units of work.
Important workload characteristics:
1. Prefill is computationally intensive, but lasts a short duration.
2. Decode is memory intensive, and lasts a long duration.
3. You don’t know when the Decode will end.

Workload SLAs
From a customer point of view, we want:
1. Low Time to Coffee <-> Request latency
2. Low Time between drops <-> Inter Token Latency
3. Low Time to first drop <-> Time to first token
From a provider point of view, we want:
1. More Coffee per second <-> Throughput
2. High machine utilization
Different customers might have different SLAs.
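To make the customer-side metrics concrete, here is a minimal sketch that derives them from the timestamps at which each token of one response was emitted. The function name and the dictionary keys are illustrative choices, not part of any particular serving framework, and the per-request token rate is only a rough proxy for system-wide throughput.

```python
from typing import Dict, List


def request_metrics(arrival: float, token_times: List[float]) -> Dict[str, float]:
    """Compute serving metrics for one request from token timestamps (seconds)."""
    if not token_times:
        raise ValueError("need at least one generated token")
    # Gaps between consecutive tokens ("time between drops").
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    latency = token_times[-1] - arrival
    return {
        "time_to_first_token": token_times[0] - arrival,
        "inter_token_latency": sum(gaps) / len(gaps) if gaps else 0.0,
        "request_latency": latency,
        "tokens_per_second": len(token_times) / latency if latency > 0 else 0.0,
    }


metrics = request_metrics(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
print(metrics)
```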

Meet our novice barista

Batch Brewing

Batching
• Throughput increases by over 10x due to batching [1].
• Request latency improves due to lower queue latencies [1].
But batching has trade-offs:
• Interference between Prefill and Decode.
• Batch size controls iteration latency.
• Conflict between time to first drop and time between drops!
1. https://www.anyscale.com/blog/continuous-batching-llm-inference
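One way to see why batching strategy matters is a toy simulation that counts decode iterations. This sketch assumes each request needs a known number of decode steps (in practice this is unknown up front) and ignores prefill cost entirely. It compares static batching, where a batch only finishes when its longest request does, with continuous batching, where a freed slot is refilled immediately.

```python
from collections import deque
from typing import List


def static_batching_iterations(lengths: List[int], batch_size: int) -> int:
    # Each batch runs until its longest request is done.
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_iterations(lengths: List[int], batch_size: int) -> int:
    queue = deque(lengths)
    active: List[int] = []  # remaining decode steps per in-flight request
    iterations = 0
    while queue or active:
        # Fill any free slots before the next iteration.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode iteration: every active request emits one token.
        active = [remaining - 1 for remaining in active]
        active = [remaining for remaining in active if remaining > 0]
        iterations += 1
    return iterations


decode_lengths = [2, 8, 3, 8]
print(static_batching_iterations(decode_lengths, batch_size=2))      # 16
print(continuous_batching_iterations(decode_lengths, batch_size=2))  # 13
```

The same four requests finish in fewer iterations because short requests no longer hold a slot hostage until the longest request in their batch completes.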

Prefill / Decode Disaggregation
Disaggregation splits up prefill and decode processing across different instances.

That’s pretty cool, but how do I deploy and run this?

Same Infra, New Workload?
• Kubernetes has become the de facto choice for a large percentage of companies to build their platforms on.
• Inference is an interesting new workload with unique characteristics that can be served very well* without having to reinvent the wheel.
• In fact, the community has been relentlessly evolving the core and ecosystem projects to better support this workload.

What do these efforts look like? Let’s build bottom up.

Let’s say you want to host a model and serve
requests.

Espresso (Smol) Models
Maybe your model is small enough to need just one GPU.
    ...
    resources:
      limits:
        nvidia.com/gpu: 1
Maybe… you just need a fraction of a GPU.
    ...
    resources:
      limits:
        nvidia.com/mig-1g.5gb.shared: 1

Maximising Hardware Efficiency
As a vendor, you may statically pre-partition your devices and expose those partitions.
However, this may not maximise hardware efficiency. For example, if I need 1/8th of an H100, I might be forced to run on a slice larger than what I need, leading to wastage.
I also may not need a particular type of GPU. For example, all I may care about is 20Gi of GPU memory, and I don’t care where it comes from.
    ...
    resources:
      limits:
        nvidia.com/gpu: 1

DRA (Dynamic Resource Allocation) can help with that!
For example, vendors may be able to partition devices on the fly (KEP-4815):
    ...
    requests:
      ...
      "device.attributes['gpu.nvidia.com'].profile == '1g.5gb'"
Or you can ask for a device by capacity rather than by type:
    ...
    requests:
      ...
      "device.capacity['nvidia.com'].memory.compareTo(quantity('10Gi')) >= 0"

Now that we’ve made sure we’re getting the most bang
for our buck…

Seems Familiar?
We want our cafe to be able to handle many different kinds of scenarios:
1. Surges in the morning and at lunch, slowdowns at night.
2. Surges of specific drinks on occasions.
3. Different workloads.
For LLM serving, these changes manifest as conversation-heavy vs coding-heavy tasks, or increased requests for certain models.

Horizontal Pod Autoscaler
Let’s use Kubernetes HPA to scale our serving system!
• CPU / memory utilization: ❌
• Concurrency: ❌
  ◦ Summarization: 1000 input tokens, 100 output tokens
  ◦ Code generation: 100 input tokens, 1000 output tokens
Traditional autoscaling metrics are not a great fit for LLM workloads.

Queue Lengths
An increasing queue length indicates that our barista can’t keep up with their orders. New requests are blocked until all pending requests are completed, and we don’t know how long that will take.
Autoscaling on queue lengths boosts your processing throughput and reduces queuing delays. However, it does not lower the latency of requests that are already being processed.
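As a rough illustration of scaling on queue length, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, to an average per-replica queue length. The function name, the target of 4 queued requests, and the replica bounds are assumptions chosen for the example; in a real cluster the queue length would be exposed to the HPA as a custom metric rather than computed by hand.

```python
import math


def desired_replicas(current_replicas: int, avg_queue_length: float,
                     target_queue_length: float, tolerance: float = 0.1,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    ratio = avg_queue_length / target_queue_length
    # Skip scaling when the metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# Two baristas, each with 12 orders waiting, and a target of 4 per barista.
print(desired_replicas(current_replicas=2, avg_queue_length=12,
                       target_queue_length=4))  # 6
```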

Batch Sizes
If latency is a concern, keep batch sizes low and monitor the number of active batched tokens. Scaling on batch size can help you scale earlier than queue buildup and still meet SLAs, but you might overreact to temporary spikes.

Prefill Decode Disaggregation
For improving latency, scale up your Disaggregated Prefill deployments. Two options for scaling:
1. Scale up P:D instances keeping the ratio constant
   a. Good for increased requests for a constant workload
2. Scale up individual P or D instances
   a. Good if you don’t have a lot of GPUs, or if your workload has shifted.
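The two scaling options can be sketched as follows. The `Pools` type and both function names are hypothetical, and the rule used for option 2 is an assumption made for illustration: it treats a time-to-first-token violation as a sign that prefill is the bottleneck and an inter-token-latency violation as a sign that decode is.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pools:
    prefill: int  # number of prefill instances
    decode: int   # number of decode instances


def scale_constant_ratio(pools: Pools, load_factor: float) -> Pools:
    # Option 1: same workload mix, more traffic. Grow both pools together.
    return Pools(math.ceil(pools.prefill * load_factor),
                 math.ceil(pools.decode * load_factor))


def scale_individually(pools: Pools, ttft: float, ttft_target: float,
                       itl: float, itl_target: float) -> Pools:
    # Option 2: the workload shifted. Grow only the pool that misses its target.
    prefill = pools.prefill + (1 if ttft > ttft_target else 0)
    decode = pools.decode + (1 if itl > itl_target else 0)
    return Pools(prefill, decode)


print(scale_constant_ratio(Pools(prefill=1, decode=3), load_factor=2.0))
print(scale_individually(Pools(prefill=1, decode=3),
                         ttft=0.8, ttft_target=0.5, itl=0.02, itl_target=0.05))
```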

KV cache utilization
Often, KV cache space can be the main bottleneck.
1. Reactive: Monitor KV cache utilization and scale if it exceeds a threshold.
2. Proactive: Use your workload to estimate how much KV cache you need to serve requests while meeting SLAs.
Memory available: 40GB
Model weights (FP16): 16GB
KV cache per request: 1GB
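Both approaches can be expressed in a few lines. This is a sketch under simple assumptions: the 0.8 threshold and the expected concurrency of 100 requests are made-up example values, and the proactive estimate assumes every request uses the full per-request KV cache shown above.

```python
import math


def should_scale_reactive(kv_cache_utilization: float, threshold: float = 0.8) -> bool:
    # Reactive: scale once the observed utilization crosses a threshold.
    return kv_cache_utilization > threshold


def replicas_needed_proactive(expected_concurrent_requests: int,
                              kv_gb_per_request: float,
                              gpu_memory_gb: float,
                              model_weights_gb: float) -> int:
    # Proactive: estimate how many requests fit per replica, then size the fleet.
    kv_budget_gb = gpu_memory_gb - model_weights_gb
    requests_per_replica = math.floor(kv_budget_gb / kv_gb_per_request)
    if requests_per_replica < 1:
        raise ValueError("a single request does not fit on one replica")
    return math.ceil(expected_concurrent_requests / requests_per_replica)


print(should_scale_reactive(0.9))
print(replicas_needed_proactive(expected_concurrent_requests=100,
                                kv_gb_per_request=1,
                                gpu_memory_gb=40,
                                model_weights_gb=16))
```

With the slide's numbers each replica holds 24 requests, so 100 concurrent requests would need 5 replicas.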

Seems Familiar?
With autoscaling in place:
• How do we load balance across these replicas?
• Can we do better than round-robin?
  ◦ Yes! Sending requests round-robin might actually result in degraded performance.
  ◦ You may end up with hotspots because each workload is not the same.

Good Routing and Better SLAs
• Your inference engine is caching processed tokens (KV Cache).
• The router can use this information to better balance requests.
https://github.com/kubernetes-sigs/gateway-api-inference-extension

• You don’t want to route to the same instance all the time because you might overload it and make it the bottleneck.
• You might also want to route to another instance as an escape hatch. But which one?
• We can route to the least loaded replica! Signals to use:
  ◦ KV cache utilization
  ◦ Queue lengths
  ◦ Some custom metric that matters to you
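A minimal sketch of such a routing policy is shown below: prefer the replica that already has the longest matching prefix cached, but skip replicas that look overloaded and fall back to the least loaded one. The `Replica` type, the thresholds, and the tie-breaking order are assumptions for illustration; this is not the algorithm used by the Gateway API Inference Extension or any other project.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Replica:
    name: str
    kv_cache_utilization: float  # fraction between 0.0 and 1.0
    queue_length: int
    cached_prefixes: List[Sequence[int]] = field(default_factory=list)


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def best_cached_prefix(replica: Replica, prompt: Sequence[int]) -> int:
    return max((shared_prefix_len(p, prompt) for p in replica.cached_prefixes),
               default=0)


def pick_replica(replicas: List[Replica], prompt: Sequence[int],
                 max_utilization: float = 0.9, max_queue: int = 8) -> Replica:
    # Escape hatch: drop replicas that are close to overload.
    healthy = [r for r in replicas
               if r.kv_cache_utilization < max_utilization
               and r.queue_length < max_queue]
    candidates = healthy or list(replicas)
    # Longest cached prefix first, then lowest KV cache use, then shortest queue.
    return min(candidates,
               key=lambda r: (-best_cached_prefix(r, prompt),
                              r.kv_cache_utilization, r.queue_length))
```

For example, a replica holding the prompt's prefix wins while it is healthy, but once its KV cache utilization passes the threshold the request goes to the least loaded remaining replica instead.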

llm-d and AIBrix are two such ecosystem projects that help with better routing and load balancing.
https://llm-d.ai/, https://aibrix.readthedocs.io/latest/

Phew! That was a lot of info. Let’s zoom out and look at the picture we’ve been constructing.

In Conclusion

So, what’s the answer?

We hope we’ve convinced you that you don’t need to
be proficient in language modelling and that when you try to work with these systems you can ground yourself in the fact that serving LLMs is as tractable as serving coffee!

Thank you for visiting CafeGPT! 🥰

Feedback & Tips Welcome!
