Slide 1

Scaling AI on a Budget: A Startup’s GPU Optimization Journey

Slide 2

Agenda
- Quick Intro
- Steps to reduce costs and speed
- Next Steps

Slide 3

Elephant in the room

Slide 4

Bio
- Role: CTO of Designstripe
- Previous Experience: RBC, InVue (acquired)
- Tech Stack: NestJS/NextJS, LangChain, and Mongo
- Other: Married, father, and unpaid Uber driver

Slide 5

Designstripe
- A smart design platform, like a combination of Canva and ChatGPT, that creates ready-to-publish social media content, customizable and pre-branded for your business
- Key Features:
  - Social Posts
  - 3D Mockups

Slide 6

AI Models and GPUs

Slide 7

Architecture Overview

Slide 8

Poll 1
- Got too excited doing the Macarena and elbowed myself
- Mistook the closet door for the bathroom door at 3 am and got a black eye

Slide 9

Optimization Steps
- HPA with Pub/Sub
- Docker Optimization
- Image Streaming
- Separate Nodes for Environments
- Multiple Workloads

Slide 10

HPA with Pub/Sub
- Horizontal Pod Autoscaler (HPA)
- Configure a custom metric on your GKE workload (Deployment)
- External metric we use:
  - Subscription num_undelivered_messages
  - Target total value = 1
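A minimal sketch of what such an HPA manifest can look like, following the GKE Pub/Sub autoscaling sample. The Deployment name `render-worker` and subscription ID `render-jobs` are hypothetical placeholders:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: render-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: render-worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        # Stackdriver external metric for undelivered Pub/Sub messages
        name: pubsub.googleapis.com|subscription|num_undelivered_messages
        selector:
          matchLabels:
            resource.labels.subscription_id: render-jobs
      target:
        # Scale so the backlog stays near one undelivered message
        type: Value
        value: "1"
```

This requires the Custom Metrics Stackdriver Adapter to be installed in the cluster so the external Pub/Sub metric is visible to the HPA.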

Slide 11

Docker Optimization
Docker Optimizations:
- Remove unnecessary build files
- Multi-stage builds
Container Registry:
- Migrated to Artifact Registry
- Single-region Artifact Registry
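A multi-stage build keeps compilers and dev dependencies out of the final image, shrinking what nodes have to pull. A minimal sketch for a NestJS-style service (paths like `dist/main.js` are assumptions, not the actual project layout):

```dockerfile
# Build stage: full dev dependencies, compile TypeScript to dist/
FROM node:20-slim AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production dependencies and compiled output only
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
CMD ["node", "dist/main.js"]
```

A `.dockerignore` covering `node_modules`, test fixtures, and local build artifacts complements this by keeping unnecessary files out of the build context entirely.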

Slide 12

Poll 2
- Son’s aim with his Xbox controller has gotten a lot better
- Tried to prove to my spouse that I could still do a cartwheel

Slide 13

Artifact Registry: Image Streaming
- Mounts the container data layer
- Starts the container without the full image
- Streams image data on demand
- Leverages multi-level caching
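Image Streaming is a cluster-level toggle in GKE; images must be hosted in Artifact Registry for streaming to apply. A sketch of enabling it on an existing cluster (cluster name and zone are placeholders):

```
# Enable Image Streaming on an existing GKE cluster
gcloud container clusters update my-cluster \
    --zone=us-central1-a \
    --enable-image-streaming
```

The same flag works on `gcloud container clusters create` for new clusters.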

Slide 14

Node Pools
- Single GPU per node
- The Nvidia CUDA image is ~5 GB
- Model weights can get huge
- Disk pressure will be a factor
- Need to separate your node pools by workload
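One way to separate node pools by workload is a dedicated, tainted GPU pool per workload type, so only pods that tolerate the taint land there. A hedged sketch (pool name, cluster, zone, machine and GPU types are illustrative, not the actual setup):

```
# Hypothetical dedicated GPU node pool, tainted for render workloads only
gcloud container node-pools create render-gpu-pool \
    --cluster=my-cluster \
    --zone=us-central1-a \
    --machine-type=n1-standard-8 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --disk-size=200GB \
    --enable-autoscaling --min-nodes=0 --max-nodes=4 \
    --num-nodes=0 \
    --node-taints=workload=render:NoSchedule
```

A larger `--disk-size` helps with the disk pressure noted above, since the CUDA base image and model weights share node disk with everything else.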

Slide 15

Multiple Workloads
- Reduced our memory consumption to fit 2 renders on 1 GPU
- LLMs can run multiple workloads on a single GPU (verify with nvidia-smi)
- Multi-Instance GPU (MIG) on GKE
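On GKE, Multi-Instance GPU is configured when creating the node pool by adding a partition size to the accelerator spec, which splits a supported GPU (such as an A100) into isolated instances. A sketch with illustrative names and sizes:

```
# Hypothetical MIG node pool: one A100 split into 1g.5gb instances,
# letting several small workloads share a single physical GPU
gcloud container node-pools create llm-mig-pool \
    --cluster=my-cluster \
    --zone=us-central1-a \
    --machine-type=a2-highgpu-1g \
    --accelerator=type=nvidia-tesla-a100,count=1,gpu-partition-size=1g.5gb \
    --num-nodes=1
```

Each MIG partition shows up as a schedulable `nvidia.com/gpu` resource, and `nvidia-smi` on the node shows the partitions and the processes running on each.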

Slide 16

Performance Results

Test Scenario                  Results
HPA Optimization               6 minutes
Image Optimization             4 minutes
Pre-Loaded Nodes               2 minutes
Manual Autoscale with Nodes    10 seconds

Cost reduction: 40%
Speed improvement: 30%

Slide 17

Next Steps
- Cloud Run with GPU
- Baking the Nvidia base image into the node image
- Optimistic loading of nodes
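For the Cloud Run direction, GPU support is exposed through deploy-time flags on the beta track. A purely illustrative sketch (service name, image path, and region are placeholders, and availability is limited to specific regions and GPU types):

```
# Hypothetical: deploying a GPU-backed service on Cloud Run
gcloud beta run deploy render-service \
    --image=us-central1-docker.pkg.dev/my-project/my-repo/render:latest \
    --region=us-central1 \
    --gpu=1 \
    --gpu-type=nvidia-l4
```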

Slide 18

Big Reveal

Slide 19

Thank you
Contact Information:
- https://www.linkedin.com/in/shannonlal/
- https://calendly.com/shannonlal/30min

Slide 20

References
- HPA Autoscaler: https://cloud.google.com/kubernetes-engine/docs/samples/container-pubsub-horizontal-pod-autoscaler
- Docker Optimization:
  - https://cloud.google.com/blog/products/application-development/understanding-artifact-registry-vs-container-registry
  - https://cloud.google.com/artifact-registry/docs/repositories/repo-locations#:~:text=You%20can%20create%20repositories%20in,two%20or%20more%20geographic%20places
- Image Streaming: https://cloud.google.com/blog/products/containers-kubernetes/introducing-container-image-streaming-in-gke
- Multiple LLMs on a single GPU: https://dev.to/shannonlal/running-multiple-llms-on-a-single-gpu-255o