PremDay #2 - Building Infrastructure for Inference at Scale

As AI inference workloads grow more complex, designing flexible and scalable hardware infrastructure is more critical than ever. In this session, we’ll explore how Cloudflare builds hardware for its global network, balancing cost, performance, and adaptability. From selecting the right inference accelerators to designing for flexibility in evolving workloads, we’ll discuss real-world trade-offs, system design constraints, and operational challenges. Join us for an inside look at how real deployment metrics shape our hardware roadmap and the lessons learned from scaling inference at the edge amidst shifting performance bottlenecks.

PremDay, April 18, 2025


Transcript

  1. Inference at the Edge: Building a Global, Scalable AI Inference Network
     Syona Sarma, Head of Hardware Engineering, Cloudflare
  2. Agenda
     1. Introduction to Cloudflare
     2. Cloudflare Infrastructure
     3. Cloudflare's AI Developer Platform
     4. Infrastructure Buildout - Hardware Challenges & Solutions
     5. Building the Software Stack
  3. Serverless Architecture - Flexible, Scalable, and Efficient
     • Distributed network that abstracts the hardware
     • Every server is "capable" of running every service
       ◦ Compute, storage, networking, and now inference workloads
     • Leverages the existing edge network for low-latency inference
     • Dynamically optimizes and leverages capacity across the network
     • Single unified code base
  4. AI initiatives face challenges: GPUs for training and inference
     Initiatives: reduce cyber risk, control operational costs, multi-cloud architecture
     Challenges:
     • High demand has led to scarcity of GPU resources
     • Cost of moving data across clouds
     • Lack of visibility into usage
     • Data leaks, privacy, shadow IT
  5. Cloudflare offers products at each step of the AI lifecycle
     • Training - Store your training data on R2 to avoid paying egress fees and retain flexibility for multi-cloud architectures.
     • Inference - Workers AI and Vectorize (launched Sept 2023). Run and deploy models through Workers AI, powered by our GPUs; store and query your embeddings for faster (and cheaper!) vector lookups with Vectorize.
     • Optimization - Connect your AI app to AI Gateway (launched Sept 2023) for better observability with analytics and controlled scalability via caching and rate limiting.
     • Security - Firewall for AI protects your AI services from bad actors, is easy to configure, and has no performance impact.
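A minimal sketch of the Workers AI calling pattern described above, assuming a Worker with an `AI` binding configured in wrangler.toml; the model ID and prompt are illustrative:

```typescript
// Minimal Cloudflare Worker calling Workers AI.
// Assumes an AI binding named "AI" in wrangler.toml; the model ID is
// illustrative, so substitute any model from the Workers AI catalog.
export interface Env {
  AI: Ai; // type provided by @cloudflare/workers-types
}

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      prompt: "Summarize the benefits of serverless inference in one sentence.",
    });
    return Response.json(result);
  },
};
```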
  6. Unified platform to develop GenAI applications - Cloudflare AI Products
     • Workers AI - serverless inference at a global scale
     • Vectorize - simple vector storage to power RAG applications
     • AI Gateway - ModelOps to control and observe your AI applications
     Together these enable the RAG pattern sketched below.
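A sketch of the RAG lookup these products enable, assuming a Worker with both an AI binding and a Vectorize index binding; the binding names, index, and embedding model are illustrative:

```typescript
// RAG lookup sketch: embed the question with Workers AI, then query
// Vectorize for the nearest stored chunks. Binding names and the
// embedding model are illustrative assumptions.
export interface Env {
  AI: Ai;
  VECTOR_INDEX: VectorizeIndex;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const question = new URL(request.url).searchParams.get("q") ?? "";

    // 1. Embed the query text.
    const embedding = (await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: [question],
    })) as { data: number[][] };

    // 2. Retrieve the top-K most similar stored vectors.
    const matches = await env.VECTOR_INDEX.query(embedding.data[0], { topK: 5 });

    return Response.json(matches);
  },
};
```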
  7. The next focus of AI is going to be on automating end-to-end tasks: Train → Infer → Automate
  8. How is AI evolving?
     • Predictive AI (Machine Learning)
       ◦ Definition: AI systems that recognize patterns from data and make predictions or classifications.
       ◦ Goal: To find patterns in data and make predictions or decisions based on historical data.
       ◦ Example: Lead scoring (i.e., "Based on previous examples, which customer is most likely to buy?")
     • Generative AI
       ◦ Definition: AI systems that create new content (e.g., text, images, or data).
       ◦ Goal: To generate new, realistic outputs based on learned patterns.
       ◦ Example: Text generation (i.e., "Help me draft an email to a customer.")
     • Agentic AI
       ◦ Definition: AI systems that can act autonomously, make decisions, and adapt to changing environments based on goals.
       ◦ Goal: To independently solve problems and take actions based on goals.
       ◦ Example: Autonomous campaign management (i.e., "Run a campaign targeting XYZ customers.")
  9. Augmentation → Automation
     • Augmentation (Generative AI): "Help me draft an email to a customer I met at a conference, reminding them of the benefits of Cloudflare." → "I'll help you draft a professional follow-up email. Here's a professional follow-up email that highlights Cloudflare's key benefits while maintaining a friendly tone. Would you like me to customize any specific parts based on your conversation at the conference? Just let me know what you'd like to adjust!"
     • Automation (Agentic AI): "Good morning, Rita. Run a campaign following up with everyone I met at the conference last week: get a list of customers, draft up an email, send it to me for sign-off, fire it off, and ping me when the customer responds."
  10. Run inference tasks on Workers AI, the first globally distributed serverless AI inference platform
     • Deploy from region: Earth
     • Code executes within 50 ms of 95% of the Internet-connected global population
     • 335 cities in 125 countries, including mainland China
     • Growing constellation of cities for AI inference powered by GPUs: 190 cities with GPUs
     Note: Number of cities with GPUs as of March 3, 2025; all other figures as of December 31, 2024.
  11. Hardware vs. Software Product Life Cycles
     The pace of innovation in hardware is measured in years, and the barrier to entry is high.
     Why Cloudflare went down the inference path:
     • Optimized for reduced latency due to an inherently efficient architecture
     • Consistent performance and increased utilization from the "every service runs everywhere" model
     • An AI platform solution for inference
  12. Infrastructure Buildout - Understanding Hardware Challenges
     1. Choosing the right inference accelerator
     2. Designing within system constraints
     3. Meeting future capacity demand
     4. Deployment complexity and operational challenges
     5. Architecting at scale
  13. Challenge 1: Choosing the right inference accelerator
     • Right-sizing the hardware to workload needs
     • Target workloads: inference / fine-tuning / BYOM (bring your own model)
     • Adjacent workloads: stream, video decode/encode
     • NVIDIA vs. alternatives / custom silicon
     • Open-source vs. custom SDK
     • Multi-pronged approach: multi-GPU tier / distributed tier
     • Solving for utilization
     Key theme: mapping workloads/KPIs to hardware resources
  14. Mapping workloads/KPIs to hardware resources
     • Cost and power efficiency - NVIDIA alternatives
     • Time to first token (TTFT) for customer experience - real-time latency
     • Throughput as performance (tokens/sec) - GPU compute / memory bandwidth
     • Accuracy - determined by precision/quantization
     • Faster model switching time and model size - memory capacity
     Making the right KPI tradeoffs; a measurement sketch follows.
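A sketch of how TTFT and throughput might be measured against a streaming endpoint; the URL, request shape, and chunk-as-token proxy are illustrative assumptions:

```typescript
// Sketch: measuring TTFT and throughput against a streaming inference
// endpoint. The URL, request shape, and the chunk count as a token
// proxy are illustrative assumptions.
async function measureStream(url: string, prompt: string) {
  const start = performance.now();
  const res = await fetch(url, { method: "POST", body: JSON.stringify({ prompt }) });

  const reader = res.body!.getReader();
  let ttftMs: number | null = null;
  let chunks = 0;

  for (;;) {
    const { done } = await reader.read();
    if (done) break;
    if (ttftMs === null) ttftMs = performance.now() - start; // time to first token
    chunks += 1; // a real harness would decode and count actual tokens
  }

  const totalSec = (performance.now() - start) / 1000;
  return { ttftMs, tokensPerSec: chunks / totalSec };
}

// Example: console.log(await measureStream("http://localhost:8000/v1/generate", "hi"));
```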
  15. Challenge 2: Designing within system constraints
     • Power limitations - future-proofing thermal solutions
     • Design choices to meet mechanical constraints
     • Hardware qualification with adequate stress testing - error rates
     • Performance tuning: cgroup resourcing, isolation, latency vs. throughput, benchmarking and telemetry (see the sweep sketch below)
     Goals: availability, performance, reliability
     Key theme: modular & disaggregated architecture
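A sketch of the latency-vs-throughput benchmarking mentioned above: sweep request concurrency against a hypothetical local inference endpoint and watch p50 latency rise as throughput climbs. Endpoint and payload are assumptions:

```typescript
// Sketch: a tiny benchmark loop that sweeps request concurrency to
// expose the latency-vs-throughput tradeoff on a single accelerator.
// The endpoint and payload are hypothetical.
async function sweep(url: string, levels: number[]) {
  for (const concurrency of levels) {
    const t0 = performance.now();
    const latencies: number[] = [];

    // Fire `concurrency` requests in parallel and record each latency.
    await Promise.all(
      Array.from({ length: concurrency }, async () => {
        const s = performance.now();
        await fetch(url, { method: "POST", body: JSON.stringify({ prompt: "ping" }) });
        latencies.push(performance.now() - s);
      }),
    );

    const wallSec = (performance.now() - t0) / 1000;
    const p50 = latencies.sort((a, b) => a - b)[Math.floor(latencies.length / 2)];
    console.log(`c=${concurrency} p50=${p50.toFixed(0)}ms throughput=${(concurrency / wallSec).toFixed(1)} req/s`);
  }
}

// Example: sweep("http://localhost:8000/v1/generate", [1, 2, 4, 8, 16]);
```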
  16. Modular, Composable & Disaggregated Hardware
     • Composability for the ultimate flexibility
     • Optimize GPU usage by dynamically composing GPUs and storage
     • Reuse: enable any compute node to be accelerated
     • Different layers of disaggregation (sketched below):
       ◦ Prefill vs. decode
       ◦ Rack level
       ◦ Colo/DC level
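A sketch of the prefill/decode layer of disaggregation, with hypothetical pool interfaces and a toy in-memory implementation; a real system would move the KV cache between compute-optimized and bandwidth-optimized nodes:

```typescript
// Sketch of prefill/decode disaggregation: the compute-heavy prefill
// phase and the memory-bandwidth-heavy decode phase run on separate
// worker pools, joined by a KV-cache handle. All names are hypothetical.
type KvCacheHandle = { node: string; id: string };

interface InferencePools {
  prefill(prompt: string): Promise<KvCacheHandle>;  // builds the KV cache
  decode(kv: KvCacheHandle): AsyncIterable<string>; // streams tokens from it
}

async function* generate(pools: InferencePools, prompt: string) {
  const kv = await pools.prefill(prompt); // on a compute-optimized node
  for await (const token of pools.decode(kv)) {
    yield token; // decoded on a memory-bandwidth-optimized node
  }
}

// Toy in-memory pools so the sketch runs end to end.
const pools: InferencePools = {
  async prefill(prompt) {
    return { node: "prefill-gpu-1", id: `kv-${prompt.length}` };
  },
  async *decode(kv) {
    for (const t of ["hello", "world"]) yield `${t}@${kv.node}`;
  },
};

(async () => {
  for await (const tok of generate(pools, "hi")) console.log(tok);
})();
```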
  17. Challenge 3: Meeting future capacity demand
     • Tradeoffs between performance and availability
     • Utilization through orchestration, scheduling, and management
     • Identifying model-specific metrics: batch vs. real time
     • Hardware-to-product metrics translation
     • Forecasting considerations (a back-of-envelope sketch follows)
     When to move to on-prem? A hybrid cloud approach?
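A back-of-envelope forecasting sketch; every number below is an illustrative assumption, not a Cloudflare figure:

```typescript
// Back-of-envelope capacity sketch: how many GPUs does a forecast
// demand imply? All inputs are illustrative assumptions.
function gpusNeeded(opts: {
  peakRequestsPerSec: number;  // forecast peak demand
  tokensPerRequest: number;    // average output length
  tokensPerSecPerGpu: number;  // measured per-GPU decode throughput
  headroom: number;            // e.g. 0.7 = plan for 70% target utilization
}): number {
  const demandTokensPerSec = opts.peakRequestsPerSec * opts.tokensPerRequest;
  return Math.ceil(demandTokensPerSec / (opts.tokensPerSecPerGpu * opts.headroom));
}

// Example: 200 req/s * 300 tokens = 60,000 tokens/s; at 2,500 tokens/s
// per GPU and 70% target utilization, ceil(60000 / 1750) = 35 GPUs.
console.log(gpusNeeded({
  peakRequestsPerSec: 200,
  tokensPerRequest: 300,
  tokensPerSecPerGpu: 2500,
  headroom: 0.7,
}));
```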
  18. Challenge 4: Deployment complexity and operational challenges
     • Standardizing racks - density/diversity considerations
     • Software enablement effort for a heterogeneous architecture
     • Installation and costs of retrofitting
     • Regional differences - compliance
     • TCO and supply-chain reliability; negotiating space and power by colo
     • Continuous monitoring of usage
     Multi-pronged approach: points of presence + cluster strategy
  19. Challenge 5: Architecting at scale
     • Data locality considerations
     • "One size fits all" vs. custom-built accelerators
     • Understanding bottlenecks - the non-negotiables
     • Balancing server lifecycles with evolving needs
     • Model optimization techniques to reduce hardware footprint (see the sizing sketch below)
     • Batching/parallelism techniques
     • DeepSeek / Mooncake
     Inference in the context of the larger system
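A sketch of one footprint-reduction lever, quantization: the arithmetic below shows weight memory shrinking with precision. Parameter count and formats are illustrative, and KV cache and activations add more on top:

```typescript
// Sketch: how precision/quantization shrinks a model's weight footprint,
// one of the optimization levers mentioned above. Weights only; the KV
// cache and activations need additional memory.
function weightFootprintGiB(params: number, bytesPerParam: number): number {
  return (params * bytesPerParam) / 1024 ** 3;
}

const params = 8e9; // e.g. an 8B-parameter model (illustrative)
for (const [name, bytes] of [["fp16", 2], ["int8", 1], ["int4", 0.5]] as const) {
  console.log(`${name}: ${weightFootprintGiB(params, bytes).toFixed(1)} GiB`);
}
// fp16 ≈ 14.9 GiB, int8 ≈ 7.5 GiB, int4 ≈ 3.7 GiB: quantization can
// halve or quarter the memory a serving GPU must dedicate to weights.
```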
  20. Building the software stack: three challenges we sought to address
     1. The state of the art is moving fast
     2. Software configurations can be complex
     3. We want to get the most out of our hardware
  21. Challenge 1: The state of the art is moving fast
     Just in the last two years we have seen the following developments:
     • Continuous batching (sketched below)
     • Flash attention
     • Paged attention
     • Sliding-window attention
     • Expert parallelism
     • MoE, multi-modal
     Goal: provide the latest capabilities to developers and data scientists
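A sketch of continuous batching, the first item above: requests join and leave the running batch at token granularity instead of waiting for the whole batch to drain. The decode step is a stub standing in for one fused forward pass over the batch:

```typescript
// Sketch of continuous (in-flight) batching. A real engine would run
// one fused forward pass over the whole batch per decode step.
interface Seq { id: number; remaining: number; } // tokens left to generate

async function decodeStep(batch: Seq[]): Promise<void> {
  for (const s of batch) s.remaining -= 1; // stub: +1 generated token each
}

async function serve(queue: Seq[], maxBatch = 4): Promise<void> {
  const active: Seq[] = [];
  while (queue.length > 0 || active.length > 0) {
    // Admit waiting requests into free slots at token granularity.
    while (active.length < maxBatch && queue.length > 0) {
      active.push(queue.shift()!);
    }
    await decodeStep(active);
    // Retire finished sequences immediately, freeing their slots
    // mid-batch so queued requests never wait for stragglers.
    for (let i = active.length - 1; i >= 0; i--) {
      if (active[i].remaining <= 0) {
        console.log(`seq ${active[i].id} finished`);
        active.splice(i, 1);
      }
    }
  }
}

serve([{ id: 1, remaining: 3 }, { id: 2, remaining: 8 }, { id: 3, remaining: 2 }]);
```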
  22. Challenge 2: Software configurations can be complex
     Deploying an AI model that uses an accelerator requires getting many interconnected hardware and software components aligned:
     • Hardware
     • Kernel
     • Device drivers
     • System libraries
     • Application libraries
     Goal: contain the complexity (one containment tactic is sketched below)
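One way to contain that complexity is to pin the whole stack as a single qualified artifact; the sketch below is illustrative, and the names and versions are invented, not Cloudflare's configuration:

```typescript
// Sketch: treat the accelerator software stack as one pinned, versioned
// artifact so every layer moves together. All names and versions here
// are invented for illustration.
interface AcceleratorStack {
  kernel: string;                      // host kernel the image is qualified on
  driver: string;                      // device driver version
  systemLibs: Record<string, string>;  // e.g. compute runtime, collectives
  appLibs: Record<string, string>;     // e.g. inference engine, tokenizer
}

const qualifiedStack: AcceleratorStack = {
  kernel: "6.6.x",
  driver: "550.xx",
  systemLibs: { "compute-runtime": "12.4", collectives: "2.21" },
  appLibs: { "inference-engine": "0.8.1", tokenizers: "0.19" },
};

// Fail fast if any layer drifts from the qualified combination, rather
// than debugging a misaligned stack in production.
export function assertStack(actual: AcceleratorStack): void {
  if (JSON.stringify(actual) !== JSON.stringify(qualifiedStack)) {
    throw new Error("Stack drift detected: rebuild and re-qualify the image");
  }
}
```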
  23. Challenge 3: We want to get the most out of our hardware
     Create measure-build-test loops in the system (sketched below):
     • Measure a metric we care about (latency, utilization, etc.)
     • Implement features we think will improve our target metric
     • Measure the result, and iterate
     Goal: get the most value out of the hardware we've invested in
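A sketch of such a loop; `runBenchmark`, the variant type, and the 2% acceptance threshold are illustrative assumptions:

```typescript
// Sketch of a measure-build-test loop: benchmark a baseline, apply a
// candidate change, and keep it only if the target metric improves.
type Variant = () => Promise<void>; // a build/config under test

async function runBenchmark(variant: Variant, iterations = 50): Promise<number> {
  const t0 = performance.now();
  for (let i = 0; i < iterations; i++) await variant();
  return (performance.now() - t0) / iterations; // mean latency in ms
}

async function evaluate(baseline: Variant, candidate: Variant): Promise<boolean> {
  const base = await runBenchmark(baseline);
  const cand = await runBenchmark(candidate);
  const improved = cand < base * 0.98; // require at least a 2% win
  console.log(`baseline=${base.toFixed(2)}ms candidate=${cand.toFixed(2)}ms keep=${improved}`);
  return improved; // iterate: the winner becomes the next baseline
}
```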
  24. Major components of the software stack
     • Scheduling - making sure the right number of models are running in the right places on Cloudflare's network
     • Routing - getting requests from users to the models as fast as possible (sketched below)
     • Models - selecting the models that are available, designing the APIs, optimizing
     • Enablement - giving developers the tools they need to access and use the models
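A sketch of the routing idea: prefer the closest location that has the model resident and spare capacity. Data structures and thresholds are illustrative, not the production design:

```typescript
// Sketch of latency-aware routing: send each request to the closest
// location that has the model loaded and spare capacity.
interface Location {
  name: string;
  rttMs: number;        // measured RTT from the user's ingress point
  hasModel: boolean;    // is the requested model resident here?
  utilization: number;  // 0..1 current GPU load
}

function route(locations: Location[]): Location | undefined {
  return locations
    .filter(l => l.hasModel && l.utilization < 0.9) // must be able to serve
    .sort((a, b) => a.rttMs - b.rttMs)[0];          // then prefer proximity
}

const pick = route([
  { name: "CDG", rttMs: 12, hasModel: true, utilization: 0.95 },
  { name: "AMS", rttMs: 18, hasModel: true, utilization: 0.40 },
  { name: "FRA", rttMs: 22, hasModel: false, utilization: 0.10 },
]);
console.log(pick?.name); // "AMS": CDG is overloaded, FRA lacks the model
```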
  25. A few final considerations
     • Deployment: CI/CD, packaging & containerization, gradual deployments (a rollout sketch follows)
     • Observability: global metrics, log capture, alerting, on-call rotation
     • Compliance & Security: data encryption, authentication & authorization, dependency management, vulnerability management
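A sketch of a gradual (percentage-based) rollout: hash each request to a stable bucket and serve the new build to a growing slice of traffic. The stages and the toy hash are illustrative:

```typescript
// Sketch of a gradual rollout: hash each request to a stable bucket in
// 0..99 and serve the new build to a growing percentage of traffic.
const STAGES = [1, 5, 25, 50, 100]; // percent of traffic on the new build

function bucket(requestId: string): number {
  let h = 0;
  for (const c of requestId) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h % 100;
}

export function useNewBuild(requestId: string, stage: number): boolean {
  const pct = STAGES[Math.min(stage, STAGES.length - 1)];
  return bucket(requestId) < pct;
}

// Promotion gate: advance `stage` only while error rates stay healthy,
// and drop back to stage 0 (1% of traffic) if alerting fires.
```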