PremDay #3 - Production AI at Scale: How Criteo is managing GPU Infrastructure from Hardware to Service

ANISSE ASTIER & GEOFFREY BEAUSIRE
Production AI at Scale: Managing GPU Infrastructure from Hardware to Service
PREMDAY 2026

Who we are
Anisse Astier, Hardware SRE
Geoffrey Beausire, Platform SRE

Intro: AI use cases @ Criteo
Adtech business, present in 10 datacenters worldwide
Training: concentrated, needs throughput
Inference: distributed, latency sensitive

Training: DLC
Datacenter DLC room
A few racks, 60 kW / rack
4 PDUs per rack
1 CDU per rack

Training: Servers & DLC
Why DLC: power & cooling is substantially cheaper
DLC challenges overview (see Alexis & Vincent's talk)
Still have fans (why?)

Training: Server specs & needs
High performance
No high speed network… for now (can be retrofitted, see Mathieu's talk)
Not many servers per rack, limit is power & air ratio

Training: Server delivery
All fans at 100%

Training: room fans

Training: HMC & BMC
All fans at 100% all the time
  o quickly fixed to 40%
Then, once the HMC & BMC talk to each other, all goes well

Inference: Air
Criteo still needs air cooling
No DLC option at our server vendors
  o cold plates exist, but out of warranty, etc.
  o not Criteo policy
4/8 PCIe accelerators in 4U

Inference: PSUs
PSUs & dual feed challenge
Vendors: no OpenRack, but 8 PSUs / server
Design limits reached

OS & drivers
Custom kernel & building drivers
  o open vs closed drivers
Chef deployment & testing
HGX stack is more complex than discrete GPUs

Benchmarking
Hardware specific tool is gpu-fryer
  o can max out compute and memory, but not NVSwitch
  o Criteo contributed
  o hwbench integration planned
Real source of truth is user workloads
https://github.com/huggingface/gpu-fryer

The main interface: Kubernetes
Standard way to deploy workloads at Criteo
Use Nvidia operator in production
▪ Almost invisible to users
▪ Users just have to request a simple resource
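
To illustrate what "request a simple resource" can look like, here is a minimal sketch that builds a Pod manifest asking for whole GPUs through the `nvidia.com/gpu` extended resource, which is the name advertised by the NVIDIA device plugin. The pod name, the image and the helper `gpu_pod_manifest` are hypothetical and are not taken from the talk.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest that requests `gpus` whole GPUs."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,  # hypothetical image name
                    # Extended resources such as GPUs are set under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest("gpu-smoke-test", "registry.example/cuda-job:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.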

Nvidia Operator: Pain points
End to End testing
▪ Catching problems before they break end users is hard
▪ No good solution at the moment, we are interested in feedback!
Operator vs Chef
▪ Nvidia Operator might modify OS level configuration (e.g. containerd)
▪ Can lead to fighting with Chef
Driver bump is risky

The AI Pipeline

Criteo’s use cases
10s of millions of inferences per second per datacenter
Focused on prediction and recommendation (deep tabular)
Small models = less compute = more scale and better latency
Moving from traditional on-CPU ML to Deep Learning/AI
LLMs have a minor footprint (at the moment)

PROPRIETARY & CONFIDENTIAL. COPYRIGHT © CRITEO 2026. ALL RIGHTS RESERVED.

Offline/Training: Ray
• Distributed platform in Python
• Jobs are written in Python
• Executed on a Ray cluster
• Great integration with
• Highly flexible core with dedicated libs:
  • Training
  • LLM fine tuning
  • Data/Batch Inference
  • Hyperparameter Tuning
  • Model serving (not used at Criteo)
https://github.com/ray-project/ray

Training challenges
Challenges
▪ Fragmentation
▪ Efficient sharing/Scheduling
Multi-tenant Kubernetes cluster
▪ Ray workers
▪ LLM
▪ Inference Jobs
▪ Custom GPU workloads

Avoiding Fragmentation
Resources fragmentation
▪ CPU
▪ Memory
▪ GPU
▪ SSD
Allocate static resources per GPU
Example for 1 GPU:
▪ 30 vCPU
▪ 200 GB of memory
▪ 1 TiB of SSD
Bursting to avoid waste
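
The static per-GPU allocation above amounts to simple multiplication. The sketch below uses the slide's example figures (30 vCPU, 200 GB of memory, 1 TiB of SSD per GPU). The names `PerGpuShare` and `resources_for` are hypothetical, and bursting above the guaranteed share is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerGpuShare:
    """Resources reserved alongside each GPU (example values from the slide)."""
    vcpus: int = 30
    memory_gb: int = 200
    ssd_tib: int = 1


def resources_for(gpus: int, share: PerGpuShare = PerGpuShare()) -> dict:
    """Return the guaranteed resources for a worker that asks for `gpus` GPUs."""
    if gpus < 1:
        raise ValueError("a worker needs at least one GPU")
    return {
        "gpu": gpus,
        "vcpu": gpus * share.vcpus,
        "memory_gb": gpus * share.memory_gb,
        "ssd_tib": gpus * share.ssd_tib,
    }


if __name__ == "__main__":
    print(resources_for(4))
```

Because every worker's CPU, memory and SSD scale with its GPU count, a node never ends up with free GPUs but too little of the other resources to use them.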

Avoiding Fragmentation
GPU fragmentation:
▪ 1 worker with 4 GPUs vs 4 workers with 1 GPU
▪ GPUs on a single node are more efficient
▪ More GPUs per worker = harder to schedule
▪ Scheduler can be configured for density/bin-packing
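
A toy model shows why density/bin-packing helps with GPU fragmentation. It places workers on nodes either on the fullest node that still fits ("binpack") or on the emptiest one ("spread"). This is a sketch of the general idea only, not the scheduler's actual algorithm, and the function `place` is hypothetical.

```python
def place(free_gpus, requests, policy):
    """Greedily place each GPU request on a node.

    free_gpus: free GPU count per node
    requests:  GPUs needed by each worker, in arrival order
    policy:    "binpack" (fullest node that fits) or "spread" (emptiest node)
    Returns (node index or None per request, remaining free GPUs per node).
    """
    free = list(free_gpus)
    placements = []
    for need in requests:
        candidates = [i for i, f in enumerate(free) if f >= need]
        if not candidates:
            placements.append(None)  # unschedulable right now
            continue
        if policy == "binpack":
            node = min(candidates, key=lambda i: free[i])
        else:
            node = max(candidates, key=lambda i: free[i])
        free[node] -= need
        placements.append(node)
    return placements, free


if __name__ == "__main__":
    nodes = [4, 4]                # two nodes with 4 GPUs each
    workers = [1, 1, 1, 1, 4]     # four 1-GPU workers, then one 4-GPU worker
    print(place(nodes, workers, "binpack"))  # ([0, 0, 0, 0, 1], [0, 0])
    print(place(nodes, workers, "spread"))   # ([0, 1, 0, 1, None], [2, 2])
```

With spreading, four GPUs remain free in total but split two and two, so the 4-GPU worker cannot be scheduled. With bin-packing, one node stays empty for it.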

Isolation and orchestration
Yunikorn as a scheduler
▪ Quotas per team
▪ Gang scheduling
▪ Preemptable workers
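
Gang scheduling means a job's workers are admitted all together or not at all, so a distributed training job never holds GPUs while waiting for its remaining workers. The sketch below illustrates that all-or-nothing rule with a simple largest-first greedy heuristic. It is an assumption-laden toy, not Yunikorn's implementation, and `schedule_gang` is a hypothetical name.

```python
def schedule_gang(free_gpus, worker_gpus):
    """Try to place every worker of a job; commit only if all of them fit.

    free_gpus:   free GPU count per node
    worker_gpus: GPUs needed by each worker of the job
    Returns (placement, new_free) on success, where placement is a list of
    (gpus, node) pairs, or (None, unchanged free list) if any worker cannot fit.
    """
    trial = list(free_gpus)  # work on a copy so a failure changes nothing
    placement = []
    for need in sorted(worker_gpus, reverse=True):  # largest workers first
        fits = [i for i, f in enumerate(trial) if f >= need]
        if not fits:
            return None, list(free_gpus)  # reject the whole gang
        node = min(fits, key=lambda i: trial[i])  # tightest fit
        trial[node] -= need
        placement.append((need, node))
    return placement, trial


if __name__ == "__main__":
    print(schedule_gang([2, 1], [1, 1, 1]))  # all three workers fit
    print(schedule_gang([2, 1], [2, 2]))     # second worker cannot fit: nothing placed
```

The greedy order is only a heuristic: it can reject a gang that a smarter search would manage to place, which is acceptable for illustrating the admission rule.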

Packaging models
• Models need to be flexible:
  • Can be called from different languages
  • And on different hardware
• Some ways to package models:
  • Pure python package
  • ONNX
  • Executorch
  • LLM: Safetensors/GGUF

Packaging models
• ONNX:
  • Graph of operators with associated weights/parameters
  • Portable:
    • Easy to run from any language/stack
    • Can use different hardware:
      • Either as direct providers
      • Or "transpilation" to another format
  • JIT optimizations of the graph
  • Multiple compute providers within a single execution
    • e.g. CPU/CUDA/TensorRT
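
As an illustration of "multiple compute providers within a single execution", ONNX Runtime takes an ordered list of execution providers and falls back down that list for operators a provider cannot handle. The sketch below only computes such an ordered list in plain Python. The provider names are ONNX Runtime's, but the helper `pick_providers` and the preference order are assumptions, not something described in the talk.

```python
# Preference order assumed for illustration: most specialised first.
PREFERRED = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that exist on this machine, in priority order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError("no usable execution provider found")
    return chosen


if __name__ == "__main__":
    # On a real host the list would come from onnxruntime.get_available_providers()
    # and the result would be passed as
    # onnxruntime.InferenceSession(model_path, providers=...).
    print(pick_providers(["CPUExecutionProvider", "CUDAExecutionProvider"]))
```

The same packaged model can then run on a CPU-only host or a GPU host without changes, which is the portability the slide refers to.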

Organizational aspect
ML Engineers and Researchers can focus on architecture and business performance
SE/SRE Engineers more focused on low level performance and serving
Testing new hardware (and new vendor) in days instead of weeks/months

Online Inference
• Using the GPU for one inference is not very efficient
• GPU saturation requires batches of inferences
  • How long to queue before executing a batch
  • Batch size
• Model parallelization
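
The two batching knobs above, maximum queueing delay and batch size, can be sketched with a small collector that waits for a first request and then gathers more until the batch is full or the delay expires. This is a generic illustration using only the standard library. `collect_batch` and its parameters are hypothetical and do not describe Criteo's client or any server's batcher.

```python
import queue
import time


def collect_batch(q, max_batch_size, max_delay_s):
    """Block for the first request, then fill the batch until full or timed out."""
    batch = [q.get()]  # wait for at least one request
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough: run a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


if __name__ == "__main__":
    requests = queue.Queue()
    for i in range(5):
        requests.put(i)
    print(collect_batch(requests, max_batch_size=4, max_delay_s=0.5))   # full batch
    print(collect_batch(requests, max_batch_size=4, max_delay_s=0.01))  # partial batch
```

A larger batch size or longer delay improves GPU utilisation at the cost of added latency per request, which is the trade-off the slide points at.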

Online Inference: Nvidia Triton
• gRPC
• A bunch of backends including Python, ONNX, TensorRT
• Provides a server side batching scheduler
• We also batch on client side
• Triton tends to be the bottleneck
https://github.com/triton-inference-server/server

Online Inference
• GPUs are growing in compute capabilities
• It is becoming harder to saturate the GPU with a single model
• Mixing models with different owners and different usage is not easy (isolation, lifecycle, etc.)
• With Nvidia: Multi-Instance GPU (MIG)
  • Allows splitting a GPU into slices (RTX6000 up to 4 quarters, B200 up to 7)

Optimizations

Thank you! QUESTIONS?
