
Scaling AI on a Budget: A Startup's GPU Optimization Journey by Shannon Lal

Designstripe, led by serial entrepreneur François Arbour, revolutionizes AI-powered design. Designstripe’s CTO Shannon Lal reveals how they slashed cloud costs by 30% and boosted scaling speed by 25% for GPU-intensive features.

Learn GCP optimizations applicable to startups and developers alike.

https://youtu.be/4q_fiJ3e9js

DevFest Montreal 2024

GDG Montreal

November 15, 2024


Transcript

  1. Bio - Role: CTO of Designstripe - Previous Experience: RBC, InVue (acquired) - Tech Stack: NestJS/NextJS, Langchain, and Mongo - Other: Married, father, and unpaid Uber driver
  2. Designstripe - Designstripe is a smart design platform, like a

    combination of Canva and ChatGPT, that creates ready-to-publish social media content, customizable and pre-branded for your business - Key Features: - Social Post - 3D Mockups
  3. Poll 1 Got too excited doing the Macarena and elbowed

    myself Mistook the closet door for the bathroom door at 3 am and got a black eye
  4. Optimization Steps - HPA with PubSub - Docker Optimization -

    Image Streaming - Separate Nodes for Environments - Multiple Workloads
  5. HPA with PubSub - Horizontal Pod Autoscaler (HPA) - Configure a custom metric on your Google workload (Deployment) - External metric we use: Pub/Sub subscription num_undelivered_messages - Target total value = 1
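A minimal sketch of what an HPA driven by that Pub/Sub metric could look like on GKE (Deployment and subscription names are placeholders, and it assumes the Stackdriver custom-metrics adapter is installed in the cluster):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: render-worker-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: render-worker            # hypothetical GPU worker Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: pubsub.googleapis.com|subscription|num_undelivered_messages
        selector:
          matchLabels:
            resource.labels.subscription_id: render-jobs  # hypothetical subscription
      target:
        type: Value
        value: "1"                 # scale up whenever messages are waiting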
  6. Docker Optimizations - Remove unnecessary build files - Multi-stage builds - Container Registry: migrated to Artifact Registry - Single-region Artifact Registry
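Since the stack is NestJS, a multi-stage build might look roughly like this (base images, paths, and commands are illustrative, not taken from the talk). The final image ships only production dependencies and compiled output, which keeps it small and fast to pull:

```dockerfile
# Build stage: install all deps and compile TypeScript
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production deps and compiled output only
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
CMD ["node", "dist/main.js"]
```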
  7. Poll 2 Son’s aim with his XBox controller has gotten

    a lot better Tried to prove to my spouse that I could still do a cartwheel
  8. Artifact Registry: Image Streaming - Mounts the container data layer - Starts the container without the full image - Streams the image on demand - Leverages multi-level caching
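On GKE, image streaming can be switched on per cluster; a sketch of the command (cluster name and zone are placeholders, and images must live in Artifact Registry for streaming to apply):

```shell
# Enable GKE image streaming on an existing cluster
gcloud container clusters update my-cluster \
    --zone us-central1-a \
    --enable-image-streaming
```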
  9. Node Pools - Single GPU per node - The Nvidia CUDA image is 5GB - Model weights can get huge - Disk pressure will be a factor - Need to separate your node pools by workload
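One way to pin a workload to its own GPU node pool is a nodeSelector plus a toleration for the taint GKE places on GPU nodes (pool, pod, and image names here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: render-gpu-pod                                 # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: gpu-render-pool     # hypothetical dedicated pool
  tolerations:
  - key: nvidia.com/gpu                                # GKE taints GPU nodes with this key
    operator: Exists
    effect: NoSchedule
  containers:
  - name: render
    image: us-docker.pkg.dev/my-project/my-repo/render:latest  # placeholder
    resources:
      limits:
        nvidia.com/gpu: 1                              # one GPU per pod on a single-GPU node
```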
  10. Multiple Workloads - Reduced our memory consumption to run 2 renders on 1 GPU - LLMs can run multiple workloads on a single GPU (verify with nvidia-smi) - Multi-Instance GPU (MIG) on GKE
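On GKE, a MIG partition is selected through a node label while the container still requests nvidia.com/gpu; a sketch under those assumptions (partition size and names are illustrative, and MIG requires a capable GPU such as an A100):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-worker                                     # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-partition-size: 1g.5gb    # MIG slice size of the node pool
  containers:
  - name: llm
    image: us-docker.pkg.dev/my-project/my-repo/llm:latest  # placeholder
    resources:
      limits:
        nvidia.com/gpu: 1                              # one MIG partition, not a full GPU
```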
  11. Performance Results

     Test Scenario                 Result
     HPA Optimization              6 minutes
     Image Optimization            4 minutes
     Pre-Loaded Nodes              2 minutes
     Manual Autoscale with Nodes   10 seconds

     Cost reduction: 40% - Speed improvement: 30%
  12. Next Steps - Cloud Run with GPU - Baking the Nvidia base image into the node image - Optimistic loading of nodes