are growing rapidly
- High cost of GPU investment and maintenance
- Many GPUs remain idle or underutilized

Focus:
- Implementing this concept on AWS EKS
- Integrating it with JupyterHub to provide a seamless user experience
- Leveraging Ray for efficient distributed workload orchestration
and user session management.
- Proxy -> Routes user traffic to the corresponding notebook server.
- Single-User Notebook Server -> A dedicated container per user, deployed as a Kubernetes Pod.
- Spawner (KubeSpawner) -> Manages the lifecycle of user pods and integrates with the EKS scheduler.
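To make the Spawner's role concrete, here is a minimal `jupyterhub_config.py` sketch showing how KubeSpawner could be told to spawn each user's notebook server as a GPU-backed pod. The image tag, resource sizes, and node selector are illustrative placeholders, not the exact values of this setup.

```python
# jupyterhub_config.py -- illustrative sketch; values are placeholders.
c = get_config()  # provided by JupyterHub's config loader at startup

# Spawn each single-user notebook server as a Kubernetes Pod via KubeSpawner
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Single-user notebook image (placeholder tag)
c.KubeSpawner.image = "jupyter/datascience-notebook:latest"

# Per-user resource limits
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.mem_limit = "8G"

# Ask the EKS scheduler for one GPU and steer the pod to a GPU node group
c.KubeSpawner.extra_resource_limits = {"nvidia.com/gpu": "1"}
c.KubeSpawner.node_selector = {"node.kubernetes.io/instance-type": "g5.xlarge"}
```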
and machine learning workloads.
- It enables ML workloads to run in parallel across multiple nodes (CPU/GPU) within a “RayCluster”.
- ML Libraries: Ray Train, Ray Tune, and Ray Serve for training, tuning, and serving.
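A minimal sketch of what "running in parallel within a RayCluster" looks like in code; it assumes the script runs inside the cluster (e.g., on the Ray Head or a pod that is part of the RayCluster), and the task body is just a placeholder.

```python
# Minimal Ray sketch: fan work out across the nodes of a RayCluster.
import ray

ray.init()  # attaches to the existing cluster when RAY_ADDRESS is set

@ray.remote(num_cpus=1)
def preprocess(shard_id: int) -> int:
    # placeholder for a real preprocessing or training step
    return shard_id * 2

# Submit 8 tasks; Ray schedules them on whichever workers have free CPUs.
futures = [preprocess.remote(i) for i in range(8)]
print(ray.get(futures))
```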
as the scheduler and control plane. It coordinates tasks across worker nodes and stores cluster metadata.
- Ray Worker -> Execution nodes that run tasks, actors, or ML jobs in parallel. They can be either CPU or GPU nodes.
- Ray Client -> Interface that allows external connections (e.g., from a JupyterHub notebook) to the Ray Head via the ray:// protocol.
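From a JupyterHub notebook, the connection over the ray:// protocol might look like the sketch below. The head service name and namespace are assumptions (KubeRay's default is `<cluster-name>-head-svc`); 10001 is the default Ray Client port, and the worker image is assumed to ship with PyTorch.

```python
# Connect from a notebook pod to the Ray Head via the Ray Client (ray://).
import ray

ray.init("ray://raycluster-head-svc.ray.svc.cluster.local:10001")

@ray.remote(num_gpus=1)
def gpu_check() -> str:
    import torch  # assumes the Ray worker image includes PyTorch
    return f"CUDA available: {torch.cuda.is_available()}"

# The task is scheduled on a GPU worker node, not in the notebook pod itself.
print(ray.get(gpu_check.remote()))
```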
across multiple nodes
- Ray Train -> Distributed deep learning (training) for PyTorch & TensorFlow.
- Ray Tune -> Distributed hyperparameter optimization.
- Ray Serve -> Model deployment and serving.
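As a small illustration of Ray Tune (assuming Ray 2.x), the sketch below runs a toy hyperparameter search; the objective function and search space are dummies standing in for a real training loop, which in practice would typically wrap Ray Train.

```python
# Toy Ray Tune sketch: grid-search a single "lr" parameter on a dummy objective.
from ray import tune

def objective(config):
    # stand-in for a validation metric produced by a real training run
    score = (config["lr"] - 0.005) ** 2
    return {"score": score}  # returning a dict reports it as the trial result

tuner = tune.Tuner(
    objective,
    param_space={"lr": tune.grid_search([1e-4, 1e-3, 1e-2])},
)
results = tuner.fit()  # trials run in parallel across the cluster
print(results.get_best_result(metric="score", mode="min").config)
```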
the provisioning and management of the entire platform infrastructure, including:
- EKS Cluster & Node Groups (CPU/GPU)
- Networking (VPC, Subnets, Security Groups)
- IAM & IRSA integration for service account permissions
- Helm add-ons: JupyterHub, Ray, NVIDIA Plugin, Karpenter
- Automated setup of IAM roles, IRSA, and resource quotas
- EKS as the foundational layer,
- Karpenter for auto-scaling,
- JupyterHub for multi-tenant access,
- Ray for distributed machine learning workloads,
- and Terraform for automated infrastructure provisioning and management.
Workloads: https://github.com/aws-samples/gen-ai-on-eks
- Ray cluster with examples running on Kubernetes (k3d): https://github.com/tekumara/ray-demo
- Ray on Kubernetes (KubeRay): https://docs.ray.io/en/latest/cluster/kubernetes/index.html
- Start Amazon EKS Cluster with GPUs for KubeRay: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/aws-eks-gpu-cluster.html
- Launching Ray Clusters on AWS: https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/aws.html