Slide 1

Inference at the Edge: Building a Global, Scalable AI Inference Network
Syona Sarma, Head of Hardware Engineering, Cloudflare

Slide 2

1. Introduction to Cloudflare
2. Cloudflare Infrastructure
3. Cloudflare's AI Developer Platform
4. Infrastructure Buildout - Hardware Challenges & Solutions
5. Building the Software Stack

Slide 3

Confidential. Copyright © Cloudflare, Inc.

Slide 4

Serverless Architecture - Flexible, Scalable and Efficient
● Distributed network that abstracts the hardware
● Every server is "capable" of running every service
○ Compute, storage, networking, and now inference workloads
● Leverages the existing edge network for inference - latency
● Dynamically optimizes and leverages capacity across the network
● Single unified code base

Slide 5

Cloudflare AI Developer Platform

Slide 6

AI initiatives face challenges

Initiative → Challenges
● GPUs for training and inference → High demand has led to scarcity of resources
● Multi-cloud architecture → Cost of moving data across clouds
● Control operational costs → Lack of visibility into usage
● Reduce cyber risk → Data leaks, privacy, shadow IT

Slide 7

The AI Lifecycle Product Overview

Slide 8

Cloudflare offers products at each step of the AI lifecycle

Training - Storing training data without egress fees with R2. Store your training data on R2 to avoid paying egress fees and retain flexibility for multi-cloud architectures.

Inference - Powering inference with Workers AI and Vectorize (launched Sept 2023). Run and deploy models through Workers AI, powered by our GPUs. Store and query your embeddings for faster (and cheaper!) vector lookups with Vectorize.

Optimization - Observe & save costs with AI Gateway (launched Sept 2023). Connect your AI app to AI Gateway for better observability with analytics and controlled scalability via caching and rate limiting.

Security - Secure AI deployments with Firewall for AI. Firewall for AI protects your AI services from bad actors, is easy to configure, and has no performance impact.

Slide 9

Unified platform to develop GenAI applications

Cloudflare AI Products
● Workers AI - Serverless inference at a global scale
● Vectorize - Simple vector storage to power RAG applications
● AI Gateway - ModelOps to control and observe your AI applications
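To make the Workers AI piece concrete, here is a minimal sketch of calling it over Cloudflare's REST inference endpoint. The account ID and API token are placeholders, and the model slug is one illustrative example; only the request is constructed here, not sent.

```python
# Sketch: constructing a Workers AI REST inference request.
# ACCOUNT_ID and API_TOKEN are placeholders; MODEL is one example slug.
import json
import urllib.request

ACCOUNT_ID = "your-account-id"   # placeholder
API_TOKEN = "your-api-token"     # placeholder
MODEL = "@cf/meta/llama-3.1-8b-instruct"  # example model slug

def build_request(prompt: str) -> urllib.request.Request:
    """Construct (but do not send) a Workers AI inference request."""
    url = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}"
    body = json.dumps({"prompt": prompt}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request("What is serverless inference?")
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would return a JSON response with the model's output; streaming and per-model parameters vary by model.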

Slide 10

The next focus of AI is going to be on automating end-to-end tasks.

Train → Infer → Automate

Slide 11

How is AI evolving?

Predictive AI (Machine Learning)
● Definition: AI systems that recognize patterns from data and make predictions or classifications.
● Goal: To find patterns in data and make predictions or decisions based on historical data.
● Example: Lead scoring (i.e., "Based on previous examples, which customer is most likely to buy?")

Generative AI
● Definition: AI systems that create new content (e.g., text, images, or data).
● Goal: To generate new, realistic outputs based on learned patterns.
● Example: Text generation (i.e., "Help me draft an email to a customer.")

Agentic AI
● Definition: AI systems that can act autonomously, make decisions, and adapt to changing environments based on goals.
● Goal: To independently solve problems and take actions based on goals.
● Example: Autonomous campaign management (i.e., "Run a campaign targeting XYZ customers.")

Slide 12

Augmentation → Automation

Augmentation (Generative AI)
User: "Help me draft an email to a customer I met at a conference, reminding them of the benefits of Cloudflare."
Assistant: "I'll help you draft a professional follow-up email. Here's a professional follow-up email that highlights Cloudflare's key benefits while maintaining a friendly tone. Would you like me to customize any specific parts based on your conversation at the conference? Just let me know what you'd like to adjust!"

Automation (Agentic AI)
User: "Run a campaign following up with everyone I met at the conference last week: get a list of customers, draft up an email, send to me for sign-off, fire it off, and ping me when the customer responds."

Slide 13

Understanding Hardware Challenges

Slide 14

Run inference tasks on Workers AI, the first globally-distributed serverless AI inference platform

Deploy from region: Earth
● Code executes within 50ms of 95% of the Internet-connected global population
● 335 cities in 125 countries, including mainland China
● Growing constellation of cities for AI inference powered by GPUs: 190 cities with GPUs

Note: Number of cities with GPUs as of March 3, 2025. All other figures as of December 31, 2024.

Slide 15

Hardware Infrastructure

Hardware vs Software Product Life Cycles

The pace of innovation in hardware is measured in years, with a high barrier to entry.

Why did Cloudflare go down the inference path?
● Optimized for reduced latency due to an inherently efficient architecture
● Consistent performance and increased utilization ("every service runs everywhere" model)
● AI platform solution for inference

Slide 16

Infrastructure Buildout - Understanding Hardware Challenges
1. Choosing the right inference accelerator
2. Designing within system constraints
3. Meeting future capacity demand
4. Deployment complexity and operational challenges
5. Architecting at scale

Slide 17

1. Choosing the right inference accelerator
● Right-sizing the hardware to workload needs
● Target workloads - inference / fine-tuning / BYOM
● Adjacent workloads - stream, video decode/encode
● NVIDIA vs alternatives / custom silicon
● Open-source vs custom SDK
● Multi-pronged approach - multi-GPU tier / distributed tier
● Solving for utilization

Mapping workloads/KPIs to hardware resources

Slide 18

Mapping Workloads/KPIs to Hardware Resources
● Cost and power efficiency - NVIDIA alternatives
● Time To First Token for customer experience - real-time latency
● Throughput as performance (Tokens/Sec) - GPU compute / memory bandwidth
● Accuracy determined by precision/quantization
● Faster model switching time, size of models - memory capacity

Making the right KPI tradeoffs
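The two latency-facing KPIs here are easy to pin down precisely. A minimal sketch, with illustrative field names, of deriving Time To First Token and decode throughput from per-request timing data:

```python
# Sketch: deriving inference KPIs from raw per-request timestamps.
# Field names are illustrative, not from any particular serving stack.
from dataclasses import dataclass

@dataclass
class RequestTrace:
    submit_s: float        # request arrival time (seconds)
    first_token_s: float   # when the first output token was emitted
    done_s: float          # when the last output token was emitted
    output_tokens: int

def ttft(trace: RequestTrace) -> float:
    """Time To First Token: drives perceived interactivity."""
    return trace.first_token_s - trace.submit_s

def decode_throughput(trace: RequestTrace) -> float:
    """Tokens/sec during decode: typically bounded by GPU memory bandwidth."""
    return trace.output_tokens / (trace.done_s - trace.first_token_s)

t = RequestTrace(submit_s=0.0, first_token_s=0.25, done_s=2.25, output_tokens=100)
assert abs(ttft(t) - 0.25) < 1e-9
assert abs(decode_throughput(t) - 50.0) < 1e-9
```

The tradeoff the slide names falls out of these definitions: larger batches raise aggregate tokens/sec but tend to push TTFT up, so the two KPIs pull the hardware choice in different directions.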

Slide 19

2. Designing within system constraints
● Power limitations - future-proofing thermal solutions
● Design choices to meet mechanical constraints
● Hardware qualification with adequate stress testing - error rates
● Performance tuning
  ○ cgroup resourcing
  ○ Isolation
  ○ Latency vs throughput
● Benchmarking and telemetry

Availability, performance, reliability: a modular & disaggregated architecture

Slide 20

Modular, Composable & Disaggregated Hardware
● Composability for the ultimate flexibility
● Optimize GPU usage by dynamically composing GPUs & storage
● Reuse and enablement of any compute node to be accelerated
● Different layers of disaggregation
  ○ Prefill vs decode
  ○ Rack level
  ○ Colo/DC level
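The prefill/decode layer of disaggregation can be illustrated with a toy sketch: prefill is compute-bound (the whole prompt is processed in parallel) while decode is memory-bandwidth-bound (one token per step), so a request's two phases can run on pools optimized for each. The pool names and hand-off mechanics below are illustrative, not Cloudflare's.

```python
# Toy sketch of prefill/decode disaggregation. Pool names are hypothetical.
def serve(prompt_tokens: int, max_new_tokens: int):
    # Phase 1: prefill on a compute-optimized pool builds the KV cache.
    kv_cache_len = prompt_tokens
    trace = [("prefill-pool", kv_cache_len)]
    # Hand-off: the KV cache is transferred (or shared) to the decode pool.
    # Phase 2: decode generates one token per step; the KV cache grows by
    # one entry per generated token, which is why memory bandwidth dominates.
    for _ in range(max_new_tokens):
        kv_cache_len += 1
        trace.append(("decode-pool", kv_cache_len))
    return trace

trace = serve(prompt_tokens=128, max_new_tokens=3)
assert trace[0] == ("prefill-pool", 128)
assert trace[-1] == ("decode-pool", 131)
```

The practical consequence is that the two pools can be sized and refreshed independently: compute-heavy parts for prefill, bandwidth- and capacity-heavy parts for decode.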

Slide 21

3. Meeting future capacity demand
● Tradeoffs between performance and availability
● Utilization through orchestration, scheduling and management
● Identifying model-specific metrics - batch vs real time
● Hardware-to-product metrics translation
● Forecasting considerations

When to move to on-prem? A hybrid cloud approach?
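The forecasting step above is, at its simplest, a translation from product metrics (requests, tokens) to hardware metrics (GPUs). A back-of-envelope sketch, with made-up numbers and an assumed utilization headroom:

```python
# Back-of-envelope capacity forecast: translate product-level demand
# (requests/sec, tokens/request) into GPU count. All inputs are illustrative.
import math

def gpus_needed(peak_rps: float, tokens_per_req: float,
                gpu_tokens_per_s: float, headroom: float = 0.3) -> int:
    """GPUs required to serve peak token demand with utilization headroom."""
    demand = peak_rps * tokens_per_req            # tokens/sec at peak
    return math.ceil(demand * (1 + headroom) / gpu_tokens_per_s)

# 100 req/s * 200 tokens = 20,000 tok/s; +30% headroom = 26,000 tok/s;
# at 2,000 tok/s per GPU that is 13 GPUs.
assert gpus_needed(peak_rps=100, tokens_per_req=200, gpu_tokens_per_s=2000) == 13
```

Real forecasts layer on growth curves, model mix, and failure-domain redundancy, but the hardware-to-product translation has this shape.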

Slide 22

4. Deployment complexity and operational challenges
● Standardizing racks - density/diversity considerations
● Heterogeneous architecture software enabling effort
● Installation and costs of retrofitting
● Regional differences - compliance
● TCO and supply chain reliability; negotiating space and power by colo
● Continuous monitoring of usage

Multi-pronged approach: points of presence + cluster strategy

Slide 23

5. Architecting at scale
● Data locality considerations
● "One size fits all" vs custom-built accelerators
● Understanding bottlenecks - non-negotiables
● Balancing server lifecycles with evolving needs
● Model optimization techniques to reduce hardware footprint
● Batching/parallelism techniques
● DeepSeek / Mooncake

Inference in the context of the larger system
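One model optimization technique that reduces hardware footprint is quantization, and its effect on memory is simple arithmetic: weight memory is parameter count times bytes per parameter. A sketch (decimal GB, weights only, ignoring KV cache and activations):

```python
# Approximate weight memory for a model, weights only.
# Uses decimal GB; KV cache and activation memory are ignored.
def weight_memory_gb(params_b: float, bits: int) -> float:
    """Memory for a model with params_b billion parameters at a given precision."""
    return params_b * 1e9 * bits / 8 / 1e9

# An 8B-parameter model: FP16 needs ~16 GB, INT4 quantization ~4 GB,
# which is the difference between needing a large GPU and a small one.
assert weight_memory_gb(8, 16) == 16.0
assert weight_memory_gb(8, 4) == 4.0
```

This is why precision/quantization appears under both the accuracy KPI and the footprint lever: the same knob trades output quality against the number and size of GPUs a model occupies.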

Slide 24

Building the software stack

Slide 25

Three challenges we sought to address
1. The state of the art is moving fast
2. Software configurations can be complex
3. We want to get the most out of our hardware

Slide 26

1. The state of the art is moving fast

Just in the last two years we have seen the following developments:
● Continuous batching
● Flash attention
● Paged attention
● Sliding-window attention
● Expert parallelism
● MoE, multi-modal models

Goal: Provide the latest capabilities to developers and data scientists
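Continuous batching, the first item above, can be shown with a toy simulation: instead of waiting for an entire batch to drain, new requests join the in-flight batch as soon as a slot frees. The scheduler below is a deliberately simplified sketch, not any serving engine's actual logic.

```python
# Toy continuous batching: requests are (id, tokens_to_generate) pairs,
# and each decode step emits one token per in-flight request.
from collections import deque

def continuous_batching(requests, max_batch=4):
    waiting = deque(requests)
    in_flight = {}                 # request_id -> tokens remaining
    steps = 0
    while waiting or in_flight:
        # Admit queued requests into free slots. This is the key difference
        # from static batching, which waits for the whole batch to finish.
        while waiting and len(in_flight) < max_batch:
            rid, n = waiting.popleft()
            in_flight[rid] = n
        # One decode step: every in-flight request produces one token.
        for rid in list(in_flight):
            in_flight[rid] -= 1
            if in_flight[rid] == 0:
                del in_flight[rid]
        steps += 1
    return steps

# Short requests finish early and free slots for queued work: 5 steps here,
# versus 7 for static batching of the same workload (5 for the first batch
# of four, then 2 more for the queued fifth request).
assert continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)]) == 5
```

The same slot-recycling idea is what makes paged attention's block-level KV-cache allocation pay off: freed memory can be handed to newly admitted requests immediately.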

Slide 27

2. Software configurations can be complex

Deploying an AI model that uses an accelerator requires getting many interconnected hardware and software components aligned:
● Hardware
● Kernel
● Device drivers
● System libraries
● Application libraries

Goal: Contain the complexity

Slide 28

3. We want to get the most out of our hardware

Create measure-build-test loops in the system:
● Measure a metric we care about (latency, utilization, etc.)
● Implement features we think will improve our target metric
● Measure the result, and iterate

Goal: Get the most value out of the hardware we've invested in
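The loop above can be sketched in a few lines: measure a target metric, apply a candidate change, re-measure, and keep the change only if it actually helped. The function names and the 5% acceptance threshold below are illustrative.

```python
# Sketch of a measure-build-test loop for a latency metric.
# Names and the acceptance threshold are illustrative.
import statistics
import time

def measure_latency(run_inference, n=20):
    """Measure the metric we care about: median request latency in seconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        run_inference()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def accept_change(baseline_s: float, candidate_s: float, min_gain=0.05) -> bool:
    """Keep a change only if it beats the baseline by at least min_gain."""
    return candidate_s < baseline_s * (1 - min_gain)

assert accept_change(1.00, 0.90) is True    # 10% faster: keep it
assert accept_change(1.00, 0.99) is False   # within noise: reject it
```

Using the median and requiring a minimum gain guards against chasing measurement noise, which is the usual failure mode of naive before/after comparisons.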

Slide 29

Major Components
● Scheduling - Making sure the right number of models are running in the right places on Cloudflare's network
● Routing - Getting requests from users to the models as fast as possible
● Models - Selecting the models that are available, designing the APIs, optimizing
● Enablement - Giving developers the tools they need to access and use the models
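The interplay between scheduling and routing can be shown with a toy router: prefer the closest location that already has the model warm, and fall back to the closest location overall (accepting a cold start) when no warm copy exists. The location data and one-dimensional distance below are purely illustrative.

```python
# Toy router: nearest warm copy of a model wins; otherwise nearest overall.
# Locations and the 1-D "distance" are illustrative stand-ins for real
# network topology and latency measurements.
def route(request_location, model, locations):
    def dist(loc):
        return abs(loc["pos"] - request_location)
    warm = [l for l in locations if model in l["loaded_models"]]
    candidates = warm if warm else locations   # fall back to a cold start
    return min(candidates, key=dist)["name"]

locations = [
    {"name": "fra", "pos": 0, "loaded_models": {"llama"}},
    {"name": "ams", "pos": 1, "loaded_models": set()},
    {"name": "lhr", "pos": 3, "loaded_models": {"llama"}},
]
assert route(1, "llama", locations) == "fra"   # ams is closer but cold
assert route(1, "other", locations) == "ams"   # no warm copy anywhere
```

Scheduling's job, in this picture, is to keep the `loaded_models` sets populated so that routing rarely has to take the cold-start branch.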

Slide 30

A few final considerations

Deployment
● CI/CD
● Packaging & containerization
● Gradual deployments

Observability
● Global metrics
● Log capture
● Alerting
● On-call rotation

Compliance & Security
● Data encryption
● Authentication & authorization
● Dependency management
● Vulnerability management

Slide 31

Thank you