Scaling AI on a Budget:
A Startup’s GPU Optimization Journey
Slide 2
Agenda
- Quick Intro
- Steps to reduce costs and improve speed
- Next Steps
Slide 3
Elephant in the room
Slide 4
Bio
- Role: CTO of Designstripe
- Previous Experience: RBC, InVue (acquired)
- Tech Stack: NestJS/NextJS, LangChain, and Mongo
- Other: Married, father, and unpaid Uber driver
Slide 5
Designstripe
- Designstripe is a smart design platform, like a combination of Canva and ChatGPT, that creates ready-to-publish social media content, customizable and pre-branded for your business
- Key Features:
  - Social Post
  - 3D Mockups
Slide 6
AI Models and GPUs
Slide 7
Architecture Overview
Slide 8
Poll 1
- Got too excited doing the Macarena and elbowed myself
- Mistook the closet door for the bathroom door at 3 a.m. and got a black eye
Slide 9
Optimization Steps
- HPA with PubSub
- Docker Optimization
- Image Streaming
- Separate Nodes for Environments
- Multiple Workloads
Slide 10
HPA with PubSub
- Horizontal Pod Autoscaler (HPA)
- Configure a custom metric on your GKE workload (Deployment)
- External metric we use (see the manifest sketch below):
  - Subscription num_undelivered_messages
  - Target Total Value = 1
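A minimal manifest sketch of this setup. The Deployment and subscription names here are hypothetical, and it assumes the GKE custom-metrics Stackdriver adapter is installed so Pub/Sub metrics are visible to the HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: render-worker-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: render-worker              # the GPU worker Deployment (hypothetical)
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          # Pub/Sub backlog surfaced as an external metric on GKE
          name: pubsub.googleapis.com|subscription|num_undelivered_messages
          selector:
            matchLabels:
              resource.labels.subscription_id: render-jobs   # hypothetical subscription
        target:
          type: Value                # "total" target across all pods
          value: "1"

With a total target of 1, the HPA asks for roughly one pod per undelivered message, so the GPU workers scale with queue depth.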
Slide 11
Docker Optimization
Docker Optimizations
- Remove unnecessary build files
- Multi-stage builds (see the Dockerfile sketch below)
Container Registry
- Migrated to Artifact Registry
- Single-region Artifact Registry
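A minimal multi-stage Dockerfile sketch, assuming a NestJS-style Node.js service that compiles to dist/ (base images, paths, and entry point are all assumptions):

# Build stage: carries the full toolchain and dev dependencies
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: ships only production dependencies and build output
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
# dist/main.js is the default NestJS build entry point (an assumption here)
CMD ["node", "dist/main.js"]

The runtime image never sees the source tree or dev dependencies, which keeps the pushed layers small.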
Slide 12
Poll 2
- Son’s aim with his Xbox controller has gotten a lot better
- Tried to prove to my spouse that I could still do a cartwheel
Slide 13
Artifact Registry - Image Streaming
- Mounts the container data layer
- Starts containers without the full image
- Streams the image on demand
- Leverages multi-level caching (enabling it is sketched below)
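Image streaming is a cluster-level switch on GKE; a sketch with a hypothetical cluster name and zone (it only takes effect for images hosted in Artifact Registry):

# Turn on Image streaming for an existing GKE cluster
gcloud container clusters update my-cluster \
    --zone=us-central1-a \
    --enable-image-streaming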
Slide 14
Node Pools
- Single GPU per node
- The NVIDIA CUDA image is 5 GB
- Model weights can get huge
- Disk pressure will be a factor
- Need to separate your node pools by workload (see the example below)
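One way to do that separation is a dedicated, tainted GPU node pool; the cluster, pool, machine type, and GPU choice below are all assumptions:

# Dedicated GPU node pool that only tolerating workloads can land on
gcloud container node-pools create gpu-render-pool \
    --cluster=my-cluster \
    --zone=us-central1-a \
    --machine-type=n1-standard-8 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --node-taints=workload=render:NoSchedule \
    --enable-autoscaling --min-nodes=0 --max-nodes=5

Render pods then opt in with a matching toleration plus a nodeSelector on the GKE-provided cloud.google.com/gke-nodepool label, and everything else stays off the expensive nodes.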
Slide 15
Multiple Workloads
- Reduced our memory consumption to fit 2 renders on 1 GPU
- LLMs can run as multiple workloads on a single GPU (verify with nvidia-smi)
- Multi-Instance GPU on GKE (see the node pool example below)
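On GKE, Multi-Instance GPU is requested through a partition size on the node pool's accelerator flag; this sketch (cluster and pool names assumed) splits an A100 40GB into seven 1g.5gb slices:

# Node pool whose A100 GPUs are partitioned into MIG instances
gcloud container node-pools create mig-pool \
    --cluster=my-cluster \
    --zone=us-central1-a \
    --machine-type=a2-highgpu-1g \
    --accelerator=type=nvidia-tesla-a100,count=1,gpu-partition-size=1g.5gb \
    --num-nodes=1

Each pod still requests nvidia.com/gpu: 1 but receives a single MIG partition, so several inference workloads can share one physical card.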