
Multi-Region/Cloud Ray Pipeline with Distributed Caching

Anyscale
November 02, 2023


In some cases, the stages of a machine learning pipeline may be distributed across regions or clouds. Data preprocessing, model training, and inference run in different regions/clouds to leverage special resource types or services that exist in a particular cloud, and to reduce latency by placing inference near user-facing applications. Additionally, as GPUs remain scarce, it is increasingly common to set up training clusters remote from where the data resides. This multi-region/cloud scenario loses data locality, resulting in higher latency and expensive data egress costs.

In this talk, Beinan Wang, Senior Staff Software Engineer at Alluxio, discusses how Alluxio’s open-source distributed caching system integrates with Ray in the multi-region/cloud scenario:

* The data locality challenges in the multi-region/cloud ML pipeline
* How the Ray+PyTorch+Alluxio stack overcomes these challenges, optimizes model training performance, saves on costs, and improves reliability
* The architecture and integration of Ray+PyTorch+Alluxio using POSIX or RESTful APIs
* ResNet and BERT benchmark results showing performance gains and a cost savings analysis
* Real-world examples of how Zhihu, a top Q&A platform, combined Alluxio’s distributed caching and data management with Ray’s scalable distributed computing to optimize its multi-cloud model training performance

Transcript

  1. Challenges of Data Loading in AI/ML Pipeline
     • Large Data Volume
     • Remote Data Loading
     • Various Data Access Patterns
  2. Remote Data Loading
     (Diagram: an offline training platform where the training cluster reads training data through Alluxio, and an online ML platform where the inference cluster loads models through Alluxio; numbered arrows show training data and models flowing between the two.)
  3. Ray is Designed for Distributed Training
     • Ray uses a distributed scheduler to dispatch training jobs to available workers (CPUs/GPUs)
     • Enables seamless horizontal scaling of training jobs across multiple nodes
     • Provides a streaming data abstraction for ML training, with parallel and distributed preprocessing
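     As a rough illustration of that streaming abstraction (a sketch, not taken from the deck; the bucket path and batch size are placeholder assumptions), a Ray Data pipeline streams and preprocesses batches instead of downloading the whole dataset up front:

     import ray

     # Lazily reads images from remote storage; nothing is downloaded up front.
     ds = ray.data.read_images("s3://my-bucket/imagenet-mini/train")  # hypothetical path

     # Batches are materialized on demand and overlapped with training.
     for batch in ds.iter_torch_batches(batch_size=32):
         images = batch["image"]
         # ... run the training step on `images` ...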
  4. Why Streaming
     *"Downloading the entire training dataset to local disk may not make sense
     • If you don't have enough disk space on each node to store the entire dataset
     • If you want to overlap downloading the data with data preprocessing and training
     • If you want each worker node to read a different and random subset of the data on each epoch"
     *Source: https://www.anyscale.com/blog/fast-flexible-scalable-data-loading-for-ml-training-with-ray-data
  5. Performance & Cost Implications of Streaming
     • You might load the entire dataset again and again for each epoch
     • You cannot automatically cache the hottest data across multiple training jobs
     • You might suffer a cold start every time
  6. Alluxio’s Position in the Ray Ecosystem
     • Unified Compute - ML pipeline orchestration
     • ML Framework - model training/inference
     • Alluxio - high-performance data access layer
     • Storage - data storage
  7. AI Training with Alluxio
     (Diagram: GPU training workloads (CV, NLP) and an interactive notebook access dataset storage through Alluxio via the POSIX API (FUSE), REST API, or S3 API; the deployment runs on Kubernetes with the Alluxio Operator, a visualization dashboard, and the Alluxio dashboard.)
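     One way the POSIX API path is typically wired up (a sketch under assumptions, not taken from the deck; the mount point and transforms are hypothetical): once Alluxio is mounted via FUSE on the training nodes, cached remote data looks like a local directory, so an ordinary PyTorch loader works unchanged:

     import torch
     from torchvision import datasets, transforms

     ALLUXIO_MOUNT = "/mnt/alluxio/imagenet-mini/train"  # hypothetical FUSE mount point

     # Standard ImageFolder loader reading through the Alluxio FUSE mount.
     dataset = datasets.ImageFolder(
         ALLUXIO_MOUNT,
         transform=transforms.Compose([
             transforms.Resize(256),
             transforms.CenterCrop(224),
             transforms.ToTensor(),
         ]),
     )
     loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=8)

     for images, labels in loader:
         pass  # training step goes here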
  8. AI Inference with Alluxio
     (Diagram: GPU/CPU inference workloads and an interactive notebook load models from model storage through Alluxio via the POSIX API (FUSE), REST API, or S3 API; the deployment runs on Kubernetes with the Alluxio Operator and the Alluxio dashboard.)
  9. REST API - listing files
     RESTful API: HTTP GET http://<worker_ip>:<port>/v1/files?path=<path>
     1. The client sends an HTTP request to the worker to execute ls, e.g. http://worker01:28080/v1/files?path=/
     2. The worker responds with the JSON result:
        [ {"mType": "directory", "mName": "docs", "mLength": 0},
          {"mType": "file", "mName": "NOTICE", "mLength": 4352} ]
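     A minimal client for this endpoint might look like the following (a sketch; the worker address is an assumption, while the endpoint and JSON fields follow the slide above):

     import requests

     WORKER = "http://worker01:28080"  # hypothetical worker address

     # List the root directory via the worker's RESTful API.
     resp = requests.get(f"{WORKER}/v1/files", params={"path": "/"})
     resp.raise_for_status()

     for entry in resp.json():
         # Each entry carries mType (file/directory), mName, and mLength in bytes.
         print(entry["mType"], entry["mName"], entry["mLength"])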
  10. REST API - reading a file page by page
      RESTful API: HTTP GET http://<worker_ip>:<port>/v1/file/<file_id>/page/<page_index>
      1. The client sends an HTTP request to get a page's bytes, e.g.
         http://127.0.0.1:28080/v1/file/5f2829f08879b0e89d07174cffa8d891bdf08ba9e91218e30fe39503dd42e32c/page/0
      2. The worker responds with the bytes of the page
      (A cached file is addressed by its file ID and split into pages, e.g. pages #0 to #2.)
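      Reading a file through this endpoint amounts to fetching its pages in order and concatenating the bytes (a sketch; the worker address is an assumption, and the file ID and page range follow the slide's example):

      import requests

      WORKER = "http://127.0.0.1:28080"  # hypothetical worker address
      FILE_ID = "5f2829f08879b0e89d07174cffa8d891bdf08ba9e91218e30fe39503dd42e32c"

      data = b""
      for page_index in range(3):  # pages #0 to #2, as in the example
          resp = requests.get(f"{WORKER}/v1/file/{FILE_ID}/page/{page_index}")
          resp.raise_for_status()
          data += resp.content  # raw bytes of this page

      print(f"reassembled {len(data)} bytes from 3 pages")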
  11. Visualization Dashboard Results (w/o Alluxio)
      Training directly from storage:
      • > 80% of total time is spent in the DataLoader
      • Results in a low GPU utilization rate (< 20%)
  12. Visualization Dashboard Results (with Alluxio)
      Training with Alluxio:
      • Reduced the DataLoader share of time from 82% to 1% (82x)
      • Increased the GPU utilization rate from 17% to 93% (5x)
  13. Alluxio+Ray Benchmark Setup
      • Instance type
        ◦ m5.4xlarge, 16 vCPU, 64 GB memory
      • Ray head resources
        ◦ nohup ray start --head --memory=$((16 * 2**30)) --object-store-memory=$((4 * 2**30)) --dashboard-host=0.0.0.0 --metrics-export-port=8080 --block --num-cpus=14 --system-config='{"automatic_object_spilling_enabled": false}' &
      • Ray actual task resources
        ◦ python release/nightly_tests/dataset/multi_node_train_benchmark.py --num-workers 12 --file-type image --data-root s3://ai-ref-arch/imagenet-mini/train --object-store-memory $((4 * 2**30))
  14. Zhihu: High Performance AI Platform for LLM
      • 2-4x faster model training
      • Increased GPU utilization from 50% to 93%
      (Diagram: training data flows from HDFS in the offline cloud to model training in the training clouds; trained models are deployed for model inference in the online cloud, serving downstream applications, with model updates fed back into training.)
  15. Thank you! Any questions are welcome.
      Feel free to engage with me on Slack!
      Scan the QR code for the data access patterns white paper.