
Scaling AI on a Budget: A Startup's GPU Optimization Journey by Shannon Lal

Designstripe, led by serial entrepreneur François Arbour, revolutionizes AI-powered design. Designstripe’s CTO Shannon Lal reveals how they slashed cloud costs by 30% and boosted scaling speed by 25% for GPU-intensive features.

Learn GCP optimizations applicable to startups and developers alike.

https://youtu.be/4q_fiJ3e9js

DevFest Montreal 2024

GDG Montreal

November 15, 2024


Transcript

  1. Bio - Role: CTO of Designstripe - Previous Experience: RBC, InVue (acquired) - Tech Stack: NestJS/NextJS, Langchain, and Mongo - Other: Married, father, and unpaid Uber driver
  2. Designstripe - Designstripe is a smart design platform, like a

    combination of Canva and ChatGPT, that creates ready-to-publish social media content, customizable and pre-branded for your business - Key Features: - Social Post - 3D Mockups
  3. Poll 1 Got too excited doing the Macarena and elbowed

    myself Mistook the closet door for the bathroom door at 3 am and got a black eye
  4. Optimization Steps - HPA with PubSub - Docker Optimization -

    Image Streaming - Separate Nodes for Environments - Multiple Workloads
  5. HPA with PubSub - Horizontal Pod Autoscaler (HPA) - Configure a custom metric on your Google workload (Deployment) - External metric we use: Pub/Sub subscription num_undelivered_messages - Target total value = 1
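A minimal sketch of what an HPA driven by that Pub/Sub metric could look like on GKE (Deployment and subscription names are placeholders, and it assumes the Stackdriver custom-metrics adapter is installed in the cluster):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: render-worker-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: render-worker            # hypothetical GPU worker Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: pubsub.googleapis.com|subscription|num_undelivered_messages
        selector:
          matchLabels:
            resource.labels.subscription_id: render-jobs  # hypothetical subscription
      target:
        type: Value
        value: "1"                 # scale up whenever messages are waiting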
  6. Docker Optimizations - Remove unnecessary build files - Multi-stage builds - Container Registry: migrated to Artifact Registry - Single-region Artifact Registry
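Since the stack is NestJS, a multi-stage build might look roughly like this (base images, paths, and commands are illustrative, not taken from the talk). The final image ships only production dependencies and compiled output, which keeps it small and fast to pull:

```dockerfile
# Build stage: install all deps and compile TypeScript
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production deps and compiled output only
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
CMD ["node", "dist/main.js"]
```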
  7. Poll 2 Son’s aim with his XBox controller has gotten

    a lot better Tried to prove to my spouse that I could still do a cartwheel
  8. Artifact Registry: Image Streaming - Mounts the container data layer - Starts the container without the full image - Streams the image on demand - Leverages multi-level caching
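On GKE, image streaming can be switched on per cluster; a sketch of the command (cluster name and zone are placeholders, and images must live in Artifact Registry for streaming to apply):

```shell
# Enable GKE image streaming on an existing cluster
gcloud container clusters update my-cluster \
    --zone us-central1-a \
    --enable-image-streaming
```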
  9. Node Pools - Single GPU per node - The Nvidia CUDA image is 5GB - Model weights can get huge - Disk pressure will be a factor - Need to separate your node pools by workload
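One way to pin a workload to its own GPU node pool is a nodeSelector plus a toleration for the taint GKE places on GPU nodes (pool, pod, and image names here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: render-gpu-pod                                 # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: gpu-render-pool     # hypothetical dedicated pool
  tolerations:
  - key: nvidia.com/gpu                                # GKE taints GPU nodes with this key
    operator: Exists
    effect: NoSchedule
  containers:
  - name: render
    image: us-docker.pkg.dev/my-project/my-repo/render:latest  # placeholder
    resources:
      limits:
        nvidia.com/gpu: 1                              # one GPU per pod on a single-GPU node
```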
  10. Multiple Workloads - Reduced our memory consumption to run 2 renders on 1 GPU - LLMs can run multiple workloads on a single GPU (verify with nvidia-smi) - Multi-Instance GPU (MIG) on GKE
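On GKE, a MIG partition is selected through a node label while the container still requests nvidia.com/gpu; a sketch under those assumptions (partition size and names are illustrative, and MIG requires a capable GPU such as an A100):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-worker                                     # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-partition-size: 1g.5gb    # MIG slice size of the node pool
  containers:
  - name: llm
    image: us-docker.pkg.dev/my-project/my-repo/llm:latest  # placeholder
    resources:
      limits:
        nvidia.com/gpu: 1                              # one MIG partition, not a full GPU
```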
  11. Performance Results

     Test Scenario                 Result
     HPA Optimization              6 minutes
     Image Optimization            4 minutes
     Pre-Loaded Nodes              2 minutes
     Manual Autoscale with Nodes   10 seconds

     Cost reduction: 40% - Speed improvement: 30%
  12. Next Steps - Cloud Run with GPU - Baking the Nvidia base image into the node image - Optimistic loading of nodes