
BUILDING A MULTI-TENANT MACHINE LEARNING PLATFORM ON AWS EKS WITH RAY AND JUPYTERHUB

My presentation @ AWS Community Day Indonesia 2025


Daniel Pepuho

November 02, 2025

Transcript

  1. About Me Focus area: - Distributed Systems - Cloud-Native -

    Spends free time contributing to open-source projects. Connect: Daniel Pepuho (danielcristho)
  2. DISCLAIMER In this session, the discussion will focus on the

    infrastructure and architectural aspects of the Machine Learning platform, rather than on the development or fine-tuning of AI or LLM models.
  3. Background on Multi-Tenant ML Infrastructure Challenge: - ML and AI

    are growing rapidly - High cost of GPU investment and maintenance - Many GPUs remain idle or underutilized Focus: - Implementing this concept on AWS EKS - Integrating it with JupyterHub to provide a seamless user experience - Leveraging Ray for efficient distributed workload orchestration
  4. Multi-Tenant ML Platform Concept 1. Single infrastructure serving multiple users

    (researchers, teams, or students). 2. Each user has their own isolated workspace. 3. Resource pooling (CPU/GPU) is managed dynamically and fairly.
  5. Platform components (Component | Role | Technology):

    - Infrastructure | Running container workloads | Amazon EKS
    - Add-ons | Functionality (GPU, Ingress) | Ray, NVIDIA Operator, Nginx Ingress
    - User Interface | Interactive workspace | JupyterHub
    - Scheduling | Workload distribution | Ray
    - Storage | Storing datasets & artifacts | Amazon S3
    - Infrastructure as Code | Infrastructure provisioning | Terraform
  6. JupyterHub - Web-based platform providing interactive notebooks for multiple users

    (multi-tenant). - Integrates with Ray, Dask, Spark, or Airflow. - Auto-scales through the Kubernetes scheduler (CPU/GPU-aware).
  7. JupyterHub Core Components - Hub -> Central component for authentication

    and user session management. - Proxy -> Routes user traffic to the corresponding notebook server. - Single-User Notebook Server -> A dedicated container per user, deployed as a Kubernetes Pod. - Spawner (KubeSpawner) -> Manages the lifecycle of user pods and integrates with the EKS scheduler.
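
    Below is a minimal sketch of how these components can be wired together in jupyterhub_config.py: the Hub is pointed at KubeSpawner so that each single-user server becomes a Kubernetes pod. The image name, resource values, and namespace are illustrative assumptions, not the configuration shown in the talk.

    ```python
    # jupyterhub_config.py -- minimal KubeSpawner sketch (illustrative values).
    c = get_config()  # noqa: F821  (injected by JupyterHub at startup)

    # Hub -> spawn single-user notebook servers as Kubernetes pods via KubeSpawner.
    c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

    # Image and resources for each user's notebook pod (example values).
    c.KubeSpawner.image = "jupyter/datascience-notebook:latest"
    c.KubeSpawner.cpu_guarantee = 1
    c.KubeSpawner.cpu_limit = 2
    c.KubeSpawner.mem_guarantee = "2G"
    c.KubeSpawner.mem_limit = "4G"

    # Namespace where user pods are created (assumed name).
    c.KubeSpawner.namespace = "jupyterhub"
    ```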
  8. Spawn JupyterLab: the user's notebook image is pulled and the Jupyter

    Notebook pod is scheduled to an appropriate node; each user has a persistent volume.
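
    Continuing the jupyterhub_config.py sketch above, per-user persistent storage can be handled through KubeSpawner's PVC options; the storage class, capacity, and mount path below are assumptions for illustration.

    ```python
    # jupyterhub_config.py (continuing the sketch above) -- per-user persistent volume.
    # Storage class name, capacity, and mount path are assumed, not from the talk.
    c.KubeSpawner.storage_pvc_ensure = True               # create a PVC per user if missing
    c.KubeSpawner.pvc_name_template = "claim-{username}"  # one claim per user
    c.KubeSpawner.storage_class = "gp3"                   # assumed EBS storage class
    c.KubeSpawner.storage_capacity = "10Gi"

    # Mount the claim into the notebook's home directory.
    c.KubeSpawner.volumes = [
        {"name": "home", "persistentVolumeClaim": {"claimName": "claim-{username}"}}
    ]
    c.KubeSpawner.volume_mounts = [{"name": "home", "mountPath": "/home/jovyan"}]
    ```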
  9. Ray - Ray is an open-source framework for distributed computing

    and machine learning workloads. - It enables ML workloads to run in parallel across multiple nodes (CPU/GPU) within a “RayCluster”. - ML Libraries: Ray Train, Ray Tune, Ray Serve for training, tuning, serving
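
    A minimal Ray Core sketch of the "run in parallel across multiple nodes" idea; it assumes only that the ray package is installed (ray.init() with no address starts or attaches to a local Ray instance rather than the EKS cluster).

    ```python
    # Minimal Ray Core sketch: run the same function as parallel tasks.
    import ray

    ray.init()  # local Ray; see the next slide for connecting to a RayCluster

    @ray.remote
    def square(x: int) -> int:
        return x * x

    # Each call returns a future (ObjectRef); tasks execute in parallel on workers.
    futures = [square.remote(i) for i in range(8)]
    print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]

    ray.shutdown()
    ```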
  10. Ray Core Architecture - Ray Head -> The main node that acts

    as the scheduler and control plane; it coordinates tasks across worker nodes and stores cluster metadata. - Ray Worker -> Execution nodes that run tasks, actors, or ML jobs in parallel; they can be either CPU or GPU nodes. - Ray Client -> Interface that allows external connections (e.g., from a JupyterHub notebook) to the Ray Head via the ray:// protocol.
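
    A sketch of the Ray Client path described above, connecting a JupyterHub notebook to the Ray Head over ray://. The Kubernetes service name is an assumption based on a typical KubeRay head Service; 10001 is the default Ray Client port.

    ```python
    # Drive the shared RayCluster from a notebook via the Ray Client (ray:// protocol).
    # The head service DNS name below is an assumption for a typical KubeRay setup.
    import ray

    ray.init("ray://raycluster-head-svc.ray.svc.cluster.local:10001")

    # The notebook now schedules work on the cluster; inspect the pooled CPU/GPU resources.
    print(ray.cluster_resources())

    @ray.remote(num_gpus=1)  # placed on a GPU worker node (stays pending until one exists)
    def gpu_task() -> str:
        return "ran on a GPU worker"

    print(ray.get(gpu_task.remote()))
    ray.shutdown()
    ```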
  11. There are version differences between the Ray Server and the

    Ray Client that may affect compatibility.
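
    A small sanity check along these lines can be run in the notebook before connecting: Ray Client generally expects the client-side Ray (and Python) versions to match what the RayCluster image is running.

    ```python
    # Check the client-side versions before ray.init("ray://..."); a mismatch with the
    # Ray version baked into the RayCluster image typically causes the connection to fail.
    import sys
    import ray

    print("ray:", ray.__version__)
    print("python:", ".".join(map(str, sys.version_info[:3])))
    # Pinning the same Ray version in the notebook image and in the RayCluster image
    # (e.g. ray==2.9.3 in both) is the simplest way to keep client and server compatible.
    ```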
  12. Ray ML Libraries - Ray Core -> Run parallel tasks and actors

    across multiple nodes. - Ray Train -> Distributed training for PyTorch & TensorFlow. - Ray Tune -> Distributed hyperparameter optimization. - Ray Serve -> Model deployment and serving.
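
    As a concrete taste of one of these libraries, here is a minimal Ray Tune sketch for distributed hyperparameter optimization; the objective function and search space are toy assumptions, not an example from the talk.

    ```python
    # Minimal Ray Tune sketch: distributed hyperparameter search over a toy objective.
    import ray
    from ray import tune

    ray.init()  # or ray.init("ray://<head>:10001") to use the shared cluster

    def objective(config):
        # Stand-in for a real training loop; return the final metric to Tune.
        return {"score": (config["lr"] - 0.01) ** 2}

    tuner = tune.Tuner(
        objective,
        param_space={"lr": tune.loguniform(1e-4, 1e-1)},
        tune_config=tune.TuneConfig(metric="score", mode="min", num_samples=8),
    )
    results = tuner.fit()
    print(results.get_best_result().config)  # best learning rate found
    ```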
  13. Karpenter - Dynamic node autoscaler for EKS. - Analyzes

    pod specs (CPU, GPU, memory, taints/tolerations) and optimizes cost by choosing the best instance type (Spot/On-Demand).
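
    Karpenter reacts to what a pending pod requests, so the relevant piece on the JupyterHub side is the pod spec itself. A sketch (again via KubeSpawner, with an assumed GPU taint key and node label) of a notebook pod whose GPU request and toleration would lead Karpenter to launch a GPU node:

    ```python
    # jupyterhub_config.py sketch: pod spec fields that Karpenter reads when provisioning.
    # The node label and taint key below are assumptions, not values from the talk.
    c.KubeSpawner.extra_resource_limits = {"nvidia.com/gpu": "1"}  # request one GPU
    c.KubeSpawner.node_selector = {"node-type": "gpu"}             # assumed node label
    c.KubeSpawner.tolerations = [
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
    ]
    ```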
  14. Terraform - Terraform is an IaC tool used to automate

    the provisioning and management of the entire platform infrastructure, including: - EKS Cluster & Node Groups (CPU/GPU) - Networking (VPC, Subnets, Security Groups) - IAM & IRSA integration for service-account permissions - Helm add-ons: JupyterHub, Ray, NVIDIA Plugin, Karpenter - Automated setup of IAM roles, IRSA, and resource quotas
  15. Conclusion This platform brings together multiple open-source and AWS-native components:

    - EKS as the foundational layer, - Karpenter for auto-scaling, - JupyterHub for multi-tenant access, - Ray for distributed machine learning workloads, - and Terraform for automated infrastructure provisioning and management.
  16. References - Fine-tuning Foundation Models on Amazon EKS for AI/ML

    Workloads: https://github.com/aws-samples/gen-ai-on-eks - Ray cluster with examples running on Kubernetes (k3d): https://github.com/tekumara/ray-demo - Ray on Kubernetes (KubeRay): https://docs.ray.io/en/latest/cluster/kubernetes/index.html - Start Amazon EKS Cluster with GPUs for KubeRay: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.html - Launching Ray Clusters on AWS: https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/aws.html