
BUILDING A MULTI-TENANT MACHINE LEARNING PLATFORM ON AWS EKS WITH RAY AND JUPYTERHUB

My presentation @ AWS Community Day Indonesia 2025


Daniel Pepuho

November 02, 2025

Transcript

  1. About Me Focus area: - Distributed Systems - Cloud-Native -

    Spends free time contributing to open-source projects. Connect: Daniel Pepuho (danielcristho)
  2. DISCLAIMER In this session, the discussion will focus on the

    infrastructure and architectural aspects of the Machine Learning platform, rather than on the development or fine-tuning of AI or LLM models.
  3. Background on Multi-Tenant ML Infrastructure Challenge: - ML and AI

    are growing rapidly - High cost of GPU investment and maintenance - Many GPUs remain idle or underutilized Focus: - Implementing this concept on AWS EKS - Integrating it with JupyterHub to provide a seamless user experience - Leveraging Ray for efficient distributed workload orchestration
  4. Multi-Tenant ML Platform Concept 1. Single infrastructure serving multiple users

    (researchers, teams, or students). 2. Each user has their own isolated workspace. 3. Resource pooling (CPU/GPU) is managed dynamically and fairly.
  5. Platform components (Component | Role | Technology):

    - Infrastructure | Running container workloads | Amazon EKS
    - Add-ons | Functionality (GPU, Ingress) | Ray, NVIDIA Operator, Nginx Ingress
    - User Interface | Interactive workspace | JupyterHub
    - Scheduling | Workload distribution | Ray
    - Storage | Storing datasets & artifacts | Amazon S3
    - Infrastructure as Code | Infrastructure provisioning | Terraform
  6. JupyterHub - Web-based platform providing interactive notebooks for multiple users

    (multi-tenant). - Integrates with Ray, Dask, Spark, or Airflow. - Auto-scales through the Kubernetes scheduler (CPU/GPU-aware).
  7. JupyterHub Core Components - Hub -> Central component for authentication

    and user session management. - Proxy -> Routes user traffic to the corresponding notebook server. - Single-User Notebook Server -> A dedicated container per user, deployed as a Kubernetes Pod. - Spawner (KubeSpawner) -> Manages the lifecycle of user pods and integrates with the EKS scheduler.
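
    Below is a minimal sketch of how these components can be wired together in jupyterhub_config.py: the Hub is pointed at KubeSpawner so that each single-user server becomes a Kubernetes pod. The image name, resource values, and namespace are illustrative assumptions, not the configuration shown in the talk.

    ```python
    # jupyterhub_config.py -- minimal KubeSpawner sketch (illustrative values).
    c = get_config()  # noqa: F821  (injected by JupyterHub at startup)

    # Hub -> spawn single-user notebook servers as Kubernetes pods via KubeSpawner.
    c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

    # Image and resources for each user's notebook pod (example values).
    c.KubeSpawner.image = "jupyter/datascience-notebook:latest"
    c.KubeSpawner.cpu_guarantee = 1
    c.KubeSpawner.cpu_limit = 2
    c.KubeSpawner.mem_guarantee = "2G"
    c.KubeSpawner.mem_limit = "4G"

    # Namespace where user pods are created (assumed name).
    c.KubeSpawner.namespace = "jupyterhub"
    ```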
  8. Spawn JupyterLab: the user's notebook image is pulled and the Jupyter

    Notebook pod is scheduled to an appropriate node; each user has a persistent volume.
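
    Continuing the jupyterhub_config.py sketch above, per-user persistent storage can be handled through KubeSpawner's PVC options; the storage class, capacity, and mount path below are assumptions for illustration.

    ```python
    # jupyterhub_config.py (continuing the sketch above) -- per-user persistent volume.
    # Storage class name, capacity, and mount path are assumed, not from the talk.
    c.KubeSpawner.storage_pvc_ensure = True               # create a PVC per user if missing
    c.KubeSpawner.pvc_name_template = "claim-{username}"  # one claim per user
    c.KubeSpawner.storage_class = "gp3"                   # assumed EBS storage class
    c.KubeSpawner.storage_capacity = "10Gi"

    # Mount the claim into the notebook's home directory.
    c.KubeSpawner.volumes = [
        {"name": "home", "persistentVolumeClaim": {"claimName": "claim-{username}"}}
    ]
    c.KubeSpawner.volume_mounts = [{"name": "home", "mountPath": "/home/jovyan"}]
    ```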
  9. Ray - Ray is an open-source framework for distributed computing

    and machine learning workloads. - It enables ML workloads to run in parallel across multiple nodes (CPU/GPU) within a “RayCluster”. - ML Libraries: Ray Train, Ray Tune, Ray Serve for training, tuning, serving
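
    A minimal Ray Core sketch of the "run in parallel across multiple nodes" idea; it assumes only that the ray package is installed (ray.init() with no address starts or attaches to a local Ray instance rather than the EKS cluster).

    ```python
    # Minimal Ray Core sketch: run the same function as parallel tasks.
    import ray

    ray.init()  # local Ray; see the next slide for connecting to a RayCluster

    @ray.remote
    def square(x: int) -> int:
        return x * x

    # Each call returns a future (ObjectRef); tasks execute in parallel on workers.
    futures = [square.remote(i) for i in range(8)]
    print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]

    ray.shutdown()
    ```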
  10. Ray Core Architecture - Ray Head -> The main node that acts

    as the scheduler and control plane; it coordinates tasks across worker nodes and stores cluster metadata. - Ray Worker -> Execution nodes that run tasks, actors, or ML jobs in parallel; they can be either CPU or GPU nodes. - Ray Client -> Interface that allows external connections (e.g., from a JupyterHub notebook) to the Ray Head via the ray:// protocol.
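
    A sketch of the Ray Client path described above, connecting a JupyterHub notebook to the Ray Head over ray://. The Kubernetes service name is an assumption based on a typical KubeRay head Service; 10001 is the default Ray Client port.

    ```python
    # Drive the shared RayCluster from a notebook via the Ray Client (ray:// protocol).
    # The head service DNS name below is an assumption for a typical KubeRay setup.
    import ray

    ray.init("ray://raycluster-head-svc.ray.svc.cluster.local:10001")

    # The notebook now schedules work on the cluster; inspect the pooled CPU/GPU resources.
    print(ray.cluster_resources())

    @ray.remote(num_gpus=1)  # placed on a GPU worker node (stays pending until one exists)
    def gpu_task() -> str:
        return "ran on a GPU worker"

    print(ray.get(gpu_task.remote()))
    ray.shutdown()
    ```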
  11. There are version differences between the Ray Server and the

    Ray Client that may affect compatibility.
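
    A small sanity check along these lines can be run in the notebook before connecting: Ray Client generally expects the client-side Ray (and Python) versions to match what the RayCluster image is running.

    ```python
    # Check the client-side versions before ray.init("ray://..."); a mismatch with the
    # Ray version baked into the RayCluster image typically causes the connection to fail.
    import sys
    import ray

    print("ray:", ray.__version__)
    print("python:", ".".join(map(str, sys.version_info[:3])))
    # Pinning the same Ray version in the notebook image and in the RayCluster image
    # (e.g. ray==2.9.3 in both) is the simplest way to keep client and server compatible.
    ```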
  12. Ray ML Libraries - Ray Core -> Run parallel tasks and actors

    across multiple nodes. - Ray Train -> Distributed training for PyTorch & TensorFlow. - Ray Tune -> Distributed hyperparameter optimization. - Ray Serve -> Model deployment and serving.
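
    As a concrete taste of one of these libraries, here is a minimal Ray Tune sketch for distributed hyperparameter optimization; the objective function and search space are toy assumptions, not an example from the talk.

    ```python
    # Minimal Ray Tune sketch: distributed hyperparameter search over a toy objective.
    import ray
    from ray import tune

    ray.init()  # or ray.init("ray://<head>:10001") to use the shared cluster

    def objective(config):
        # Stand-in for a real training loop; return the final metric to Tune.
        return {"score": (config["lr"] - 0.01) ** 2}

    tuner = tune.Tuner(
        objective,
        param_space={"lr": tune.loguniform(1e-4, 1e-1)},
        tune_config=tune.TuneConfig(metric="score", mode="min", num_samples=8),
    )
    results = tuner.fit()
    print(results.get_best_result().config)  # best learning rate found
    ```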
  13. Karpenter - Dynamic node autoscaler for EKS. - Analyzes

    pod specs (CPU, GPU, memory, taints/tolerations) and optimizes cost by choosing the best instance type (Spot/On-Demand).
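
    Karpenter reacts to what a pending pod requests, so the relevant piece on the JupyterHub side is the pod spec itself. A sketch (again via KubeSpawner, with an assumed GPU taint key and node label) of a notebook pod whose GPU request and toleration would lead Karpenter to launch a GPU node:

    ```python
    # jupyterhub_config.py sketch: pod spec fields that Karpenter reads when provisioning.
    # The node label and taint key below are assumptions, not values from the talk.
    c.KubeSpawner.extra_resource_limits = {"nvidia.com/gpu": "1"}  # request one GPU
    c.KubeSpawner.node_selector = {"node-type": "gpu"}             # assumed node label
    c.KubeSpawner.tolerations = [
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
    ]
    ```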
  14. Terraform - Terraform is an IaC tool used to automate

    the provisioning and management of the entire platform infrastructure, including: - EKS Cluster & Node Groups (CPU/GPU) - Networking (VPC, Subnets, Security Groups) - IAM & IRSA integration for service-account permissions - Helm add-ons: JupyterHub, Ray, NVIDIA Plugin, Karpenter - Automated setup of IAM roles, IRSA, and resource quotas
  15. Conclusion This platform brings together multiple open-source and AWS-native components:

    - EKS as the foundational layer, - Karpenter for auto-scaling, - JupyterHub for multi-tenant access, - Ray for distributed machine learning workloads, - and Terraform for automated infrastructure provisioning and management.
  16. References - Fine-tuning Foundation Models on Amazon EKS for AI/ML

    Workloads: https://github.com/aws-samples/gen-ai-on-eks - Ray cluster with examples running on Kubernetes (k3d): https://github.com/tekumara/ray-demo - Ray on Kubernetes (KubeRay): https://docs.ray.io/en/latest/cluster/kubernetes/index.html - Start Amazon EKS Cluster with GPUs for KubeRay: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.html - Launching Ray Clusters on AWS: https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/aws.html