PremDay #3 - Production AI at Scale: How Criteo is managing GPU Infrastructure from Hardware to Service

Criteo presents how AI was integrated into its infrastructure, including hardware and software considerations.

Premday

June 12, 2026

Transcript

  1. ANISSE ASTIER & GEOFFREY BEAUSIRE
     Production AI at Scale: Managing GPU Infrastructure from Hardware to Service
     PREMDAY 2026
  2. Intro: AI use cases @ Criteo
     Adtech business, present in 10 datacenters worldwide
     Training: concentrated, needs throughput
     Inference: distributed, latency sensitive
  3. Training: DLC
     Datacenter DLC room
     A few racks, 60 kW / rack
     4 PDUs per rack
     1 CDU per rack
  4. Training: Servers & DLC
     Why DLC: power & cooling is substantially cheaper
     DLC challenges overview (see Alexis & Vincent's talk)
     Still have fans (why?)
  5. Training: Server specs & needs
     High performance
     No high-speed network… for now (can be retrofitted, see Mathieu's talk)
     Not many servers per rack, limit is power & air ratio
  6. Training: HMC & BMC
     All fans at 100% all the time
       o quickly fixed to 40%
     Then after the HMC & BMC speak, all goes well
  7. Inference: Air
     Criteo still needs air cooling
     No DLC option at our server vendors
       o cold plates exist, but out of warranty, etc.
       o not Criteo policy
     4/8 PCIe accelerators in 4U
  8. Inference: PSUs
     PSUs & dual feed challenge
     Vendors: no OpenRack, but 8 PSUs / server
     Design limits reached
  9. OS & drivers
     Custom kernel & building drivers
       o open vs closed drivers
     Chef deployment & testing
     HGX stack is more complex than discrete GPUs
  10. Benchmarking
      Hardware-specific tool is gpu-fryer
        o can max compute, memory, not NVSwitch
        o Criteo contributed
        o hwbench integration planned
      Real source of truth is user workloads
      https://github.com/huggingface/gpu-fryer
  11. (slide without text)

  12. The main interface: Kubernetes
      Standard way to deploy workloads at Criteo
      Use Nvidia operator in production
      ▪ Almost invisible to users
      ▪ Just have to request a simple resource
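
To illustrate what "request a simple resource" can look like, here is a minimal sketch that builds a Pod manifest as a Python dictionary. It assumes the cluster exposes GPUs under the extended resource name `nvidia.com/gpu`, which is what the NVIDIA device plugin advertises by default; the slide does not name the resource, and the pod name, image and helper function are hypothetical.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest that requests whole GPUs.

    Assumption: GPUs are advertised as the extended resource
    "nvidia.com/gpu" (the NVIDIA device plugin default).
    """
    if gpus < 1:
        raise ValueError("gpus must be >= 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Extended resources are requested through limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, for illustration only.
    manifest = gpu_pod_manifest("gpu-demo", "example.org/my-image:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied like any other manifest; from the user's point of view the GPU is just one more entry under `resources`.
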
  13. Nvidia Operator: Pain points
      End-to-end testing
      ▪ Catching problems before they break end users is hard
      ▪ No good solution at the moment, we are interested in feedback!
      Operator vs Chef
      ▪ Nvidia Operator might modify OS-level configuration (e.g. containerd)
      ▪ Can lead to fighting with Chef
      Driver bump is risky
  14. Criteo’s use cases
      10s of millions of inferences per second per datacenter
      Focused on prediction and recommendation (deep tabular)
      Small models = less compute = more scale and better latency
      Moving from traditional on-CPU ML to Deep Learning/AI
      LLMs have a minor footprint (at the moment)
  15. Offline/Training: Ray
      • Distributed platform in Python
      • Jobs are written in Python
      • Executed on a Ray cluster
      • Great integration with
      • Highly flexible core with dedicated libs:
        • Training
        • LLM fine tuning
        • Data/Batch Inference
        • Hyperparameter Tuning
        • Model serving (not used at Criteo)
      https://github.com/ray-project/ray
  16. Training challenges
      Challenges
      ▪ Fragmentation
      ▪ Efficient sharing/Scheduling
      Multi-tenant Kubernetes cluster
      ▪ Ray workers
      ▪ LLM
      ▪ Inference Jobs
      ▪ Custom GPU workloads
  17. Avoiding Fragmentation
      Resources fragmentation
      ▪ CPU
      ▪ Memory
      ▪ GPU
      ▪ SSD
      Allocate static resources per GPU
      Example for 1 GPU:
      ▪ 30 vCPU
      ▪ 200 GB of memory
      ▪ 1 TiB of SSD
      Bursting to avoid waste
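
The per-GPU allocation on this slide can be sketched as a small helper that derives a worker's CPU, memory and SSD from its GPU count. The three ratios come from the slide's 1-GPU example; scaling them linearly for multi-GPU workers is an assumption, and the names `WorkerResources` and `resources_for` are hypothetical. Bursting is not modelled.

```python
from dataclasses import dataclass

# Ratios from the slide's "Example for 1 GPU".
VCPUS_PER_GPU = 30
MEMORY_GB_PER_GPU = 200
SSD_TIB_PER_GPU = 1


@dataclass(frozen=True)
class WorkerResources:
    gpus: int
    vcpus: int
    memory_gb: int
    ssd_tib: int


def resources_for(gpus: int) -> WorkerResources:
    """Derive the static allocation of a worker from its GPU count.

    Assumption: the per-GPU ratios scale linearly with the GPU count,
    so no CPU, memory or SSD is left stranded next to a free GPU.
    """
    if gpus < 1:
        raise ValueError("gpus must be >= 1")
    return WorkerResources(
        gpus=gpus,
        vcpus=gpus * VCPUS_PER_GPU,
        memory_gb=gpus * MEMORY_GB_PER_GPU,
        ssd_tib=gpus * SSD_TIB_PER_GPU,
    )


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(resources_for(n))
```

Tying every other resource to the GPU count means a node runs out of all four resources at the same time instead of ending up with idle GPUs that no pod can use because CPU or memory is exhausted.
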
  18. Avoiding Fragmentation
      GPU fragmentation:
      ▪ 1 worker with 4 GPUs vs 4 workers with 1 GPU
      ▪ GPUs on a single node are more efficient
      ▪ More GPUs per worker = harder to schedule
      ▪ Scheduler can be configured for density/bin-packing
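
The effect of density/bin-packing on GPU fragmentation can be shown with a toy simulation. This is only a sketch under stated assumptions: 4 nodes with 4 GPUs each, a greedy one-request-at-a-time placement, and two hypothetical policies named `binpack` and `spread`. It does not reproduce the behaviour of any real scheduler.

```python
def place(free, request, policy):
    """Pick a node index for `request` GPUs, or None if nothing fits.

    `free` is the list of free GPU counts per node.
    """
    candidates = [i for i, f in enumerate(free) if f >= request]
    if not candidates:
        return None
    if policy == "binpack":
        # Fullest node that still fits: keeps other nodes empty.
        return min(candidates, key=lambda i: (free[i], i))
    if policy == "spread":
        # Emptiest node first: balances load but fragments GPUs.
        return max(candidates, key=lambda i: (free[i], -i))
    raise ValueError(f"unknown policy: {policy}")


def simulate(policy, requests, nodes=4, gpus_per_node=4):
    """Place requests in order; return (free GPUs per node, unschedulable)."""
    free = [gpus_per_node] * nodes
    unschedulable = []
    for request in requests:
        node = place(free, request, policy)
        if node is None:
            unschedulable.append(request)
        else:
            free[node] -= request
    return free, unschedulable


if __name__ == "__main__":
    # Four 1-GPU workers arrive first, then three 4-GPU workers.
    requests = [1, 1, 1, 1, 4, 4, 4]
    for policy in ("binpack", "spread"):
        print(policy, simulate(policy, requests))
```

With these inputs, bin-packing places every worker, while spreading leaves 3 free GPUs on each node (12 in total) and still cannot place a single 4-GPU worker.
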
  19. Isolation and orchestration
      YuniKorn as a scheduler
      ▪ Quotas per team
      ▪ Gang scheduling
      ▪ Preemptable workers
  20. Packaging models
      • Models need to be flexible:
        • Can be called from different languages
        • And on different hardware
      • Some ways to package models:
        • Pure Python package
        • ONNX
        • ExecuTorch
        • LLM: Safetensors/GGUF
  21. Packaging models
      • ONNX:
        • Graph of operators with associated weights/parameters
        • Portable:
          • Easy to run from any language/stack
          • Can use different hardware:
            • Either as direct providers
            • Or "transpilation" to another format
        • JIT optimizations of the graph
        • Multiple compute providers within a single execution
          • e.g. CPU/CUDA/TensorRT
  22. Organizational aspect
      ML Engineers and Researchers can focus on architecture and business performance
      SE/SRE Engineers more focused on low-level performance and serving
      Testing new hardware (and new vendor) in days instead of weeks/months
  23. Online Inference
      • Using the GPU for one inference is not very efficient
      • GPU saturation requires batches of inferences
        • How long to queue before executing a batch
        • Batch size
      • Model parallelization
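
The two batching knobs on this slide, batch size and how long to queue, can be illustrated with a deterministic sketch that groups request arrival times into batches. It is not Triton's scheduler nor Criteo's implementation; the function name `form_batches` and its parameters are hypothetical, and times are plain integers in milliseconds.

```python
def form_batches(arrivals_ms, max_batch_size, max_delay_ms):
    """Group request arrival times into batches.

    A batch is dispatched as soon as it holds `max_batch_size` requests,
    or `max_delay_ms` after its first request arrived, whichever comes
    first. Returns a list of (dispatch_time_ms, [arrival times]).
    """
    batches = []
    current = []
    deadline = None
    for t in sorted(arrivals_ms):
        # The open batch timed out before this request arrived.
        if current and t > deadline:
            batches.append((deadline, current))
            current = []
        # The first request of a batch starts the delay timer.
        if not current:
            deadline = t + max_delay_ms
        current.append(t)
        # A full batch is dispatched immediately.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        batches.append((deadline, current))
    return batches


if __name__ == "__main__":
    arrivals = [0, 1, 2, 3, 10, 30]
    for dispatch, batch in form_batches(arrivals, max_batch_size=4, max_delay_ms=5):
        print(f"dispatch at {dispatch} ms: {batch}")
```

A larger `max_batch_size` or `max_delay_ms` fills the GPU better under load, while under light traffic every request can wait up to `max_delay_ms` before it is executed, which is the latency cost the slide alludes to.
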
  24. Online Inference: Nvidia Triton
      • gRPC
      • A bunch of backends including Python, ONNX, TensorRT
      • Provides a server-side batching scheduler
      • We also batch on client side
      • Triton tends to be the bottleneck
      https://github.com/triton-inference-server/server
  25. Online Inference
      • GPUs are growing in compute capabilities
      • It is becoming harder to saturate the GPU with a single model
      • Mixing models with different owners and different usage is not easy (isolation, lifecycle, etc.)
      • With Nvidia: Multi-Instance GPU (MIG)
        • Allows splitting a GPU in slices (RTX6000 up to 4 quarters, B200 up to 7)