PremDay #2 - Building Infrastructure for Inference at Scale

As AI inference workloads grow more complex, designing flexible and scalable hardware infrastructure is more critical than ever. In this session, we’ll explore how Cloudflare builds hardware for its global network, balancing cost, performance, and adaptability. From selecting the right inference accelerators to designing for flexibility in evolving workloads, we’ll discuss real-world trade-offs, system design constraints, and operational challenges. Join us for an inside look at how real deployment metrics shape our hardware roadmap and the lessons learned from scaling inference at the edge amidst shifting performance bottlenecks.

PremDay, April 18, 2025


Transcript

  1. Inference at the Edge: Building a Global, Scalable AI Inference Network
     Syona Sarma, Head of Hardware Engineering, Cloudflare
  2. Agenda
     1. Introduction to Cloudflare
     2. Cloudflare Infrastructure
     3. Cloudflare's AI Developer Platform
     4. Infrastructure Buildout - Hardware Challenges & Solutions
     5. Building the Software Stack
  3. Serverless Architecture - Flexible, Scalable, and Efficient
     • Distributed network that abstracts the hardware
     • Every server is "capable" of running every service
       ◦ Compute, storage, networking, and now inference workloads
     • Leverages the existing edge network for low-latency inference
     • Dynamically optimizes and leverages capacity across the network
     • Single unified code base
  4. AI initiatives face challenges: GPUs for training and inference
     Initiatives: reduce cyber risk, control operational costs, multi-cloud architecture
     Challenges:
     • High demand has led to scarcity of GPU resources
     • Cost of moving data across clouds
     • Lack of visibility into usage
     • Data leaks, privacy, shadow IT
  5. Cloudflare offers products at each step of the AI lifecycle
     • Training - Store your training data on R2 to avoid paying egress fees and retain flexibility for multi-cloud architectures.
     • Inference - Workers AI and Vectorize (launched Sept 2023). Run and deploy models through Workers AI, powered by our GPUs; store and query your embeddings for faster (and cheaper!) vector lookups with Vectorize.
     • Optimization - Connect your AI app to AI Gateway (launched Sept 2023) for better observability with analytics and controlled scalability via caching and rate limiting.
     • Security - Firewall for AI protects your AI services from bad actors, is easy to configure, and has no performance impact.
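A minimal sketch of the Workers AI calling pattern described above, assuming a Worker with an `AI` binding configured in wrangler.toml; the model ID and prompt are illustrative:

```typescript
// Minimal Cloudflare Worker calling Workers AI.
// Assumes an AI binding named "AI" in wrangler.toml; the model ID is
// illustrative, so substitute any model from the Workers AI catalog.
export interface Env {
  AI: Ai; // type provided by @cloudflare/workers-types
}

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      prompt: "Summarize the benefits of serverless inference in one sentence.",
    });
    return Response.json(result);
  },
};
```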
  6. Unified platform to develop GenAI applications - Cloudflare AI Products
     • Workers AI - serverless inference at a global scale
     • Vectorize - simple vector storage to power RAG applications
     • AI Gateway - ModelOps to control and observe your AI applications
     Together these enable the RAG pattern sketched below.
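A sketch of the RAG lookup these products enable, assuming a Worker with both an AI binding and a Vectorize index binding; the binding names, index, and embedding model are illustrative:

```typescript
// RAG lookup sketch: embed the question with Workers AI, then query
// Vectorize for the nearest stored chunks. Binding names and the
// embedding model are illustrative assumptions.
export interface Env {
  AI: Ai;
  VECTOR_INDEX: VectorizeIndex;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const question = new URL(request.url).searchParams.get("q") ?? "";

    // 1. Embed the query text.
    const embedding = (await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: [question],
    })) as { data: number[][] };

    // 2. Retrieve the top-K most similar stored vectors.
    const matches = await env.VECTOR_INDEX.query(embedding.data[0], { topK: 5 });

    return Response.json(matches);
  },
};
```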
  7. The next focus of AI is going to be on automating end-to-end tasks: Train → Infer → Automate
  8. How is AI evolving?
     • Predictive AI (Machine Learning)
       ◦ Definition: AI systems that recognize patterns from data and make predictions or classifications.
       ◦ Goal: To find patterns in data and make predictions or decisions based on historical data.
       ◦ Example: Lead scoring (i.e., "Based on previous examples, which customer is most likely to buy?")
     • Generative AI
       ◦ Definition: AI systems that create new content (e.g., text, images, or data).
       ◦ Goal: To generate new, realistic outputs based on learned patterns.
       ◦ Example: Text generation (i.e., "Help me draft an email to a customer.")
     • Agentic AI
       ◦ Definition: AI systems that can act autonomously, make decisions, and adapt to changing environments based on goals.
       ◦ Goal: To independently solve problems and take actions based on goals.
       ◦ Example: Autonomous campaign management (i.e., "Run a campaign targeting XYZ customers.")
  9. Augmentation → Automation
     • Augmentation (Generative AI): "Help me draft an email to a customer I met at a conference, reminding them of the benefits of Cloudflare." → "I'll help you draft a professional follow-up email. Here's a professional follow-up email that highlights Cloudflare's key benefits while maintaining a friendly tone. Would you like me to customize any specific parts based on your conversation at the conference? Just let me know what you'd like to adjust!"
     • Automation (Agentic AI): "Good morning, Rita. Run a campaign following up with everyone I met at the conference last week: get a list of customers, draft up an email, send it to me for sign-off, fire it off, and ping me when the customer responds."
  10. Run inference tasks on Workers AI, the first globally distributed serverless AI inference platform
     • Deploy from region: Earth
     • Code executes within 50 ms of 95% of the Internet-connected global population
     • 335 cities in 125 countries, including mainland China
     • Growing constellation of cities for AI inference powered by GPUs: 190 cities with GPUs
     Note: Number of cities with GPUs as of March 3, 2025; all other figures as of December 31, 2024.
  11. Hardware vs. Software Product Life Cycles
     The pace of innovation in hardware is measured in years, and the barrier to entry is high.
     Why Cloudflare went down the inference path:
     • Optimized for reduced latency due to an inherently efficient architecture
     • Consistent performance and increased utilization from the "every service runs everywhere" model
     • An AI platform solution for inference
  12. Infrastructure Buildout - Understanding Hardware Challenges
     1. Choosing the right inference accelerator
     2. Designing within system constraints
     3. Meeting future capacity demand
     4. Deployment complexity and operational challenges
     5. Architecting at scale
  13. Challenge 1: Choosing the right inference accelerator
     • Right-sizing the hardware to workload needs
     • Target workloads: inference / fine-tuning / BYOM (bring your own model)
     • Adjacent workloads: stream, video decode/encode
     • NVIDIA vs. alternatives / custom silicon
     • Open-source vs. custom SDK
     • Multi-pronged approach: multi-GPU tier / distributed tier
     • Solving for utilization
     Key theme: mapping workloads/KPIs to hardware resources
  14. Mapping workloads/KPIs to hardware resources
     • Cost and power efficiency - NVIDIA alternatives
     • Time to first token (TTFT) for customer experience - real-time latency
     • Throughput as performance (tokens/sec) - GPU compute / memory bandwidth
     • Accuracy - determined by precision/quantization
     • Faster model switching time and model size - memory capacity
     Making the right KPI tradeoffs; a measurement sketch follows.
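A sketch of how TTFT and throughput might be measured against a streaming endpoint; the URL, request shape, and chunk-as-token proxy are illustrative assumptions:

```typescript
// Sketch: measuring TTFT and throughput against a streaming inference
// endpoint. The URL, request shape, and the chunk count as a token
// proxy are illustrative assumptions.
async function measureStream(url: string, prompt: string) {
  const start = performance.now();
  const res = await fetch(url, { method: "POST", body: JSON.stringify({ prompt }) });

  const reader = res.body!.getReader();
  let ttftMs: number | null = null;
  let chunks = 0;

  for (;;) {
    const { done } = await reader.read();
    if (done) break;
    if (ttftMs === null) ttftMs = performance.now() - start; // time to first token
    chunks += 1; // a real harness would decode and count actual tokens
  }

  const totalSec = (performance.now() - start) / 1000;
  return { ttftMs, tokensPerSec: chunks / totalSec };
}

// Example: console.log(await measureStream("http://localhost:8000/v1/generate", "hi"));
```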
  15. Challenge 2: Designing within system constraints
     • Power limitations - future-proofing thermal solutions
     • Design choices to meet mechanical constraints
     • Hardware qualification with adequate stress testing - error rates
     • Performance tuning: cgroup resourcing, isolation, latency vs. throughput, benchmarking and telemetry (see the sweep sketch below)
     Goals: availability, performance, reliability
     Key theme: modular & disaggregated architecture
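A sketch of the latency-vs-throughput benchmarking mentioned above: sweep request concurrency against a hypothetical local inference endpoint and watch p50 latency rise as throughput climbs. Endpoint and payload are assumptions:

```typescript
// Sketch: a tiny benchmark loop that sweeps request concurrency to
// expose the latency-vs-throughput tradeoff on a single accelerator.
// The endpoint and payload are hypothetical.
async function sweep(url: string, levels: number[]) {
  for (const concurrency of levels) {
    const t0 = performance.now();
    const latencies: number[] = [];

    // Fire `concurrency` requests in parallel and record each latency.
    await Promise.all(
      Array.from({ length: concurrency }, async () => {
        const s = performance.now();
        await fetch(url, { method: "POST", body: JSON.stringify({ prompt: "ping" }) });
        latencies.push(performance.now() - s);
      }),
    );

    const wallSec = (performance.now() - t0) / 1000;
    const p50 = latencies.sort((a, b) => a - b)[Math.floor(latencies.length / 2)];
    console.log(`c=${concurrency} p50=${p50.toFixed(0)}ms throughput=${(concurrency / wallSec).toFixed(1)} req/s`);
  }
}

// Example: sweep("http://localhost:8000/v1/generate", [1, 2, 4, 8, 16]);
```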
  16. Modular, Composable & Disaggregated Hardware
     • Composability for the ultimate flexibility
     • Optimize GPU usage by dynamically composing GPUs and storage
     • Reuse: enable any compute node to be accelerated
     • Different layers of disaggregation (sketched below):
       ◦ Prefill vs. decode
       ◦ Rack level
       ◦ Colo/DC level
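A sketch of the prefill/decode layer of disaggregation, with hypothetical pool interfaces and a toy in-memory implementation; a real system would move the KV cache between compute-optimized and bandwidth-optimized nodes:

```typescript
// Sketch of prefill/decode disaggregation: the compute-heavy prefill
// phase and the memory-bandwidth-heavy decode phase run on separate
// worker pools, joined by a KV-cache handle. All names are hypothetical.
type KvCacheHandle = { node: string; id: string };

interface InferencePools {
  prefill(prompt: string): Promise<KvCacheHandle>;  // builds the KV cache
  decode(kv: KvCacheHandle): AsyncIterable<string>; // streams tokens from it
}

async function* generate(pools: InferencePools, prompt: string) {
  const kv = await pools.prefill(prompt); // on a compute-optimized node
  for await (const token of pools.decode(kv)) {
    yield token; // decoded on a memory-bandwidth-optimized node
  }
}

// Toy in-memory pools so the sketch runs end to end.
const pools: InferencePools = {
  async prefill(prompt) {
    return { node: "prefill-gpu-1", id: `kv-${prompt.length}` };
  },
  async *decode(kv) {
    for (const t of ["hello", "world"]) yield `${t}@${kv.node}`;
  },
};

(async () => {
  for await (const tok of generate(pools, "hi")) console.log(tok);
})();
```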
  17. Challenge 3: Meeting future capacity demand
     • Tradeoffs between performance and availability
     • Utilization through orchestration, scheduling, and management
     • Identifying model-specific metrics: batch vs. real time
     • Hardware-to-product metrics translation
     • Forecasting considerations (a back-of-envelope sketch follows)
     When to move to on-prem? A hybrid cloud approach?
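A back-of-envelope forecasting sketch; every number below is an illustrative assumption, not a Cloudflare figure:

```typescript
// Back-of-envelope capacity sketch: how many GPUs does a forecast
// demand imply? All inputs are illustrative assumptions.
function gpusNeeded(opts: {
  peakRequestsPerSec: number;  // forecast peak demand
  tokensPerRequest: number;    // average output length
  tokensPerSecPerGpu: number;  // measured per-GPU decode throughput
  headroom: number;            // e.g. 0.7 = plan for 70% target utilization
}): number {
  const demandTokensPerSec = opts.peakRequestsPerSec * opts.tokensPerRequest;
  return Math.ceil(demandTokensPerSec / (opts.tokensPerSecPerGpu * opts.headroom));
}

// Example: 200 req/s * 300 tokens = 60,000 tokens/s; at 2,500 tokens/s
// per GPU and 70% target utilization, ceil(60000 / 1750) = 35 GPUs.
console.log(gpusNeeded({
  peakRequestsPerSec: 200,
  tokensPerRequest: 300,
  tokensPerSecPerGpu: 2500,
  headroom: 0.7,
}));
```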
  18. Challenge 4: Deployment complexity and operational challenges
     • Standardizing racks - density/diversity considerations
     • Software enablement effort for a heterogeneous architecture
     • Installation and costs of retrofitting
     • Regional differences - compliance
     • TCO and supply-chain reliability; negotiating space and power by colo
     • Continuous monitoring of usage
     Multi-pronged approach: points of presence + cluster strategy
  19. Challenge 5: Architecting at scale
     • Data locality considerations
     • "One size fits all" vs. custom-built accelerators
     • Understanding bottlenecks - the non-negotiables
     • Balancing server lifecycles with evolving needs
     • Model optimization techniques to reduce hardware footprint (see the sizing sketch below)
     • Batching/parallelism techniques
     • DeepSeek / Mooncake
     Inference in the context of the larger system
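A sketch of one footprint-reduction lever, quantization: the arithmetic below shows weight memory shrinking with precision. Parameter count and formats are illustrative, and KV cache and activations add more on top:

```typescript
// Sketch: how precision/quantization shrinks a model's weight footprint,
// one of the optimization levers mentioned above. Weights only; the KV
// cache and activations need additional memory.
function weightFootprintGiB(params: number, bytesPerParam: number): number {
  return (params * bytesPerParam) / 1024 ** 3;
}

const params = 8e9; // e.g. an 8B-parameter model (illustrative)
for (const [name, bytes] of [["fp16", 2], ["int8", 1], ["int4", 0.5]] as const) {
  console.log(`${name}: ${weightFootprintGiB(params, bytes).toFixed(1)} GiB`);
}
// fp16 ≈ 14.9 GiB, int8 ≈ 7.5 GiB, int4 ≈ 3.7 GiB: quantization can
// halve or quarter the memory a serving GPU must dedicate to weights.
```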
  20. Building the software stack: three challenges we sought to address
     1. The state of the art is moving fast
     2. Software configurations can be complex
     3. We want to get the most out of our hardware
  21. Challenge 1: The state of the art is moving fast
     Just in the last two years we have seen the following developments:
     • Continuous batching (sketched below)
     • Flash attention
     • Paged attention
     • Sliding-window attention
     • Expert parallelism
     • MoE, multi-modal
     Goal: provide the latest capabilities to developers and data scientists
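A sketch of continuous batching, the first item above: requests join and leave the running batch at token granularity instead of waiting for the whole batch to drain. The decode step is a stub standing in for one fused forward pass over the batch:

```typescript
// Sketch of continuous (in-flight) batching. A real engine would run
// one fused forward pass over the whole batch per decode step.
interface Seq { id: number; remaining: number; } // tokens left to generate

async function decodeStep(batch: Seq[]): Promise<void> {
  for (const s of batch) s.remaining -= 1; // stub: +1 generated token each
}

async function serve(queue: Seq[], maxBatch = 4): Promise<void> {
  const active: Seq[] = [];
  while (queue.length > 0 || active.length > 0) {
    // Admit waiting requests into free slots at token granularity.
    while (active.length < maxBatch && queue.length > 0) {
      active.push(queue.shift()!);
    }
    await decodeStep(active);
    // Retire finished sequences immediately, freeing their slots
    // mid-batch so queued requests never wait for stragglers.
    for (let i = active.length - 1; i >= 0; i--) {
      if (active[i].remaining <= 0) {
        console.log(`seq ${active[i].id} finished`);
        active.splice(i, 1);
      }
    }
  }
}

serve([{ id: 1, remaining: 3 }, { id: 2, remaining: 8 }, { id: 3, remaining: 2 }]);
```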
  22. Challenge 2: Software configurations can be complex
     Deploying an AI model that uses an accelerator requires getting many interconnected hardware and software components aligned:
     • Hardware
     • Kernel
     • Device drivers
     • System libraries
     • Application libraries
     Goal: contain the complexity (one containment tactic is sketched below)
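One way to contain that complexity is to pin the whole stack as a single qualified artifact; the sketch below is illustrative, and the names and versions are invented, not Cloudflare's configuration:

```typescript
// Sketch: treat the accelerator software stack as one pinned, versioned
// artifact so every layer moves together. All names and versions here
// are invented for illustration.
interface AcceleratorStack {
  kernel: string;                      // host kernel the image is qualified on
  driver: string;                      // device driver version
  systemLibs: Record<string, string>;  // e.g. compute runtime, collectives
  appLibs: Record<string, string>;     // e.g. inference engine, tokenizer
}

const qualifiedStack: AcceleratorStack = {
  kernel: "6.6.x",
  driver: "550.xx",
  systemLibs: { "compute-runtime": "12.4", collectives: "2.21" },
  appLibs: { "inference-engine": "0.8.1", tokenizers: "0.19" },
};

// Fail fast if any layer drifts from the qualified combination, rather
// than debugging a misaligned stack in production.
export function assertStack(actual: AcceleratorStack): void {
  if (JSON.stringify(actual) !== JSON.stringify(qualifiedStack)) {
    throw new Error("Stack drift detected: rebuild and re-qualify the image");
  }
}
```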
  23. Challenge 3: We want to get the most out of our hardware
     Create measure-build-test loops in the system (sketched below):
     • Measure a metric we care about (latency, utilization, etc.)
     • Implement features we think will improve our target metric
     • Measure the result, and iterate
     Goal: get the most value out of the hardware we've invested in
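A sketch of such a loop; `runBenchmark`, the variant type, and the 2% acceptance threshold are illustrative assumptions:

```typescript
// Sketch of a measure-build-test loop: benchmark a baseline, apply a
// candidate change, and keep it only if the target metric improves.
type Variant = () => Promise<void>; // a build/config under test

async function runBenchmark(variant: Variant, iterations = 50): Promise<number> {
  const t0 = performance.now();
  for (let i = 0; i < iterations; i++) await variant();
  return (performance.now() - t0) / iterations; // mean latency in ms
}

async function evaluate(baseline: Variant, candidate: Variant): Promise<boolean> {
  const base = await runBenchmark(baseline);
  const cand = await runBenchmark(candidate);
  const improved = cand < base * 0.98; // require at least a 2% win
  console.log(`baseline=${base.toFixed(2)}ms candidate=${cand.toFixed(2)}ms keep=${improved}`);
  return improved; // iterate: the winner becomes the next baseline
}
```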
  24. Major components of the software stack
     • Scheduling - making sure the right number of models are running in the right places on Cloudflare's network
     • Routing - getting requests from users to the models as fast as possible (sketched below)
     • Models - selecting the models that are available, designing the APIs, optimizing
     • Enablement - giving developers the tools they need to access and use the models
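A sketch of the routing idea: prefer the closest location that has the model resident and spare capacity. Data structures and thresholds are illustrative, not the production design:

```typescript
// Sketch of latency-aware routing: send each request to the closest
// location that has the model loaded and spare capacity.
interface Location {
  name: string;
  rttMs: number;        // measured RTT from the user's ingress point
  hasModel: boolean;    // is the requested model resident here?
  utilization: number;  // 0..1 current GPU load
}

function route(locations: Location[]): Location | undefined {
  return locations
    .filter(l => l.hasModel && l.utilization < 0.9) // must be able to serve
    .sort((a, b) => a.rttMs - b.rttMs)[0];          // then prefer proximity
}

const pick = route([
  { name: "CDG", rttMs: 12, hasModel: true, utilization: 0.95 },
  { name: "AMS", rttMs: 18, hasModel: true, utilization: 0.40 },
  { name: "FRA", rttMs: 22, hasModel: false, utilization: 0.10 },
]);
console.log(pick?.name); // "AMS": CDG is overloaded, FRA lacks the model
```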
  25. A few final considerations
     • Deployment: CI/CD, packaging & containerization, gradual deployments (a rollout sketch follows)
     • Observability: global metrics, log capture, alerting, on-call rotation
     • Compliance & Security: data encryption, authentication & authorization, dependency management, vulnerability management
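A sketch of a gradual (percentage-based) rollout: hash each request to a stable bucket and serve the new build to a growing slice of traffic. The stages and the toy hash are illustrative:

```typescript
// Sketch of a gradual rollout: hash each request to a stable bucket in
// 0..99 and serve the new build to a growing percentage of traffic.
const STAGES = [1, 5, 25, 50, 100]; // percent of traffic on the new build

function bucket(requestId: string): number {
  let h = 0;
  for (const c of requestId) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h % 100;
}

export function useNewBuild(requestId: string, stage: number): boolean {
  const pct = STAGES[Math.min(stage, STAGES.length - 1)];
  return bucket(requestId) < pct;
}

// Promotion gate: advance `stage` only while error rates stay healthy,
// and drop back to stage 0 (1% of traffic) if alerting fires.
```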