Multi-Region/Cloud Ray Pipeline with Distributed Caching
Beinan Wang, Architect @ Alluxio
Slide 2
Slide 2 text
Challenges in Data Loading
Slide 3
Slide 3 text
Challenges of Data Loading in the AI/ML Pipeline
● Large Data Volume
● Remote Data Loading
● Various Data Access Patterns
Slide 4
Slide 4 text
100,000,000,000,000,000,000,000 bytes (100 zettabytes) of data will be stored in the cloud by 2025
Source: Cybersecurity Ventures
Slide 5
Slide 5 text
Remote Data Loading
[Diagram: an offline training platform (Alluxio + training cluster) reads training data and produces models; an online ML platform (Alluxio + inference cluster) loads those models for serving. Numbered steps 1-5 trace training data and models moving between storage, the two Alluxio caches, and the clusters.]
Slide 6
Slide 6 text
What are Data Access Patterns?
Slide 7
Slide 7 text
Various Data Access Patterns in the ML Pipeline
Slide 8
Slide 8 text
Data Access Patterns in Model Training
Slide 9
Slide 9 text
Data Loading in Ray
Slide 10
Slide 10 text
Ray is Designed for Distributed Training
● Ray uses a distributed scheduler to dispatch training jobs to available
workers (CPUs/GPUs)
● Enables seamless horizontal scaling of training jobs across multiple nodes
● Provides a streaming data abstraction for ML training, enabling parallel and
distributed preprocessing (see the sketch below)
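A minimal sketch of what that streaming abstraction looks like with Ray Data (assumptions: Ray 2.x, and the S3 prefix below is a hypothetical example, not a path from this deck):

import ray

ray.init()

# Ray Data builds a streaming dataset: blocks are read lazily and spread
# across the cluster instead of being downloaded in full to any single node.
ds = ray.data.read_images("s3://example-bucket/imagenet-mini/train")  # hypothetical bucket

# Preprocessing runs in parallel on the workers while data streams through.
def to_float(batch):
    batch["image"] = [img.astype("float32") / 255.0 for img in batch["image"]]
    return batch

ds = ds.map_batches(to_float)

# The training loop consumes batches as they become available.
for batch in ds.iter_batches(batch_size=256):
    pass  # feed the batch to the training step here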
Slide 11
Slide 11 text
Why Streaming
*“Downloading the entire training dataset to local disk may not make sense
● If you don't have enough disk space on each node to store the entire dataset
● If you want to overlap downloading the data with data preprocessing and training
● If you want each worker node to read a different and random subset of the data on each epoch”
*Source: https://www.anyscale.com/blog/fast-flexible-scalable-data-loading-for-ml-training-with-ray-data
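The last point above, each worker reading a different random subset per epoch, maps to Ray Data's per-epoch shuffling. A hedged sketch (the dataset path and batch size are illustrative, not values from this deck):

import ray

ds = ray.data.read_images("s3://example-bucket/imagenet-mini/train")  # illustrative path

# Re-shuffling each epoch streams a different random ordering through the
# workers instead of materializing the whole dataset on local disk.
for epoch in range(3):
    for batch in ds.random_shuffle().iter_batches(batch_size=256):
        pass  # training step for this epoch's batch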
Slide 12
Slide 12 text
Performance & Cost Implication of Streaming
● You may reload the entire dataset for every epoch
● The hottest data is not cached automatically across multiple training jobs
● You may suffer a cold start every time a job runs
Slide 13
Slide 13 text
Caching
Solution
Slide 14
Slide 14 text
Alluxio's Position in the Ray Ecosystem
● Storage - data storage
● Alluxio - high-performance data access layer
● ML Framework - model training/inference
● Unified Compute - ML pipeline orchestration
Slide 15
Slide 15 text
AI Training with Alluxio
[Diagram: GPU training jobs (CV, NLP) and interactive notebooks running on Kubernetes access dataset storage through Alluxio via the POSIX API (FUSE), REST API, and S3 API; the deployment is managed by the Alluxio Operator and monitored through a visualization dashboard and the Alluxio dashboard.]
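One common way to wire this up is the POSIX API (FUSE): the cached dataset appears as a local directory, so existing loaders work unchanged. A minimal sketch, assuming Alluxio is FUSE-mounted at /mnt/alluxio on every node (the mount point is an assumption, not a value from this deck):

import ray

# With Alluxio mounted via FUSE, cached remote data looks like local files.
# The "local://" scheme tells Ray the path is visible at the same location on every node.
ds = ray.data.read_images("local:///mnt/alluxio/imagenet-mini/train")  # assumed mount point

for batch in ds.iter_batches(batch_size=256):
    pass  # training step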
Slide 16
Slide 16 text
AI Inference with Alluxio
[Diagram: GPU/CPU inference services and interactive notebooks running on Kubernetes load models from model storage through Alluxio via the POSIX API (FUSE), REST API, and S3 API; the deployment is managed by the Alluxio Operator and monitored through the Alluxio dashboard.]
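Since Alluxio also exposes an S3-compatible API, existing S3 clients can load models through the cache. A hedged sketch with boto3; the endpoint URL, bucket, and key are placeholders for your own deployment, not values from this deck:

import boto3

# Point a standard S3 client at the Alluxio S3-compatible endpoint
# (placeholder address; use the endpoint of your Alluxio deployment).
s3 = boto3.client(
    "s3",
    endpoint_url="http://alluxio.example.internal:39999/api/v1/s3",  # placeholder
    aws_access_key_id="dummy",
    aws_secret_access_key="dummy",
)

# Read model bytes through the cache instead of going back to remote storage.
obj = s3.get_object(Bucket="models", Key="resnet50/model.pt")  # placeholder bucket/key
model_bytes = obj["Body"].read()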
Slide 17
Slide 17 text
REST API
Client → Worker
HTTP GET list-files RESTful API:
http://<worker_host>:<port>/v1/files?path=<path>
Example request (listing a directory containing docs, NOTICE, overview.md, security.md):
http://worker01:28080/v1/files?path=/
JSON response:
[
  {"mType": "directory", "mName": "docs", "mLength": 0},
  {"mType": "file", "mName": "NOTICE", "mLength": 4352}
]
1. Send an HTTP request to execute the listing (ls)
2. Respond with the JSON result
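A minimal sketch of calling this list-files endpoint from Python (the worker host, port, and path are the example values above; adjust them for your deployment):

import requests

# List files under a path via the Alluxio worker's REST API.
resp = requests.get("http://worker01:28080/v1/files", params={"path": "/"})
resp.raise_for_status()

for entry in resp.json():
    print(entry["mType"], entry["mName"], entry["mLength"])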
Slide 18
Slide 18 text
REST API
Client → Worker
HTTP GET get-page RESTful API:
http://<worker_host>:<port>/v1/file/<file_id>/page/<page_index>
Example request (file ID 5f2829f08879b0e89d07174cffa8d891bdf08ba9e91218e30fe39503dd42e32c, pages #0 to #2):
http://127.0.0.1:28080/v1/file/5f2829f08879b0e89d07174cffa8d891bdf08ba9e91218e30fe39503dd42e32c/page/0
1. Send an HTTP request to get a page's bytes
2. Respond with the bytes of the page
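A matching sketch for the get-page endpoint, reading a file page by page (the worker address, file ID, and page count are the example values above):

import requests

worker = "http://127.0.0.1:28080"
file_id = "5f2829f08879b0e89d07174cffa8d891bdf08ba9e91218e30fe39503dd42e32c"

# Fetch pages #0 to #2 and reassemble their raw bytes.
data = b""
for page in range(3):
    resp = requests.get(f"{worker}/v1/file/{file_id}/page/{page}")
    resp.raise_for_status()
    data += resp.content  # raw bytes of this page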
Slide 19
Slide 19 text
Benchmarks
Slide 20
Slide 20 text
Training Benchmark for Each API
Slide 21
Slide 21 text
Training Directly from Storage
- > 80% of total time is spent in the DataLoader
- Results in a low GPU utilization rate (<20%)
Visualization Dashboard Results (w/o Alluxio)
Slide 22
Slide 22 text
Visualization Dashboard Results (with Alluxio)
Training with Alluxio
- Reduced the DataLoader share of total time from 82% to 1% (82X)
- Increased the GPU utilization rate from 17% to 93% (5X)
Slide 23
Slide 23 text
Alluxio+Ray Benchmark Setup
● Instance Type
○ m5.4xlarge (16 vCPU, 64 GB memory)
● Ray head resources
○ nohup ray start --head --memory=$((16 * 2**30)) --object-store-memory=$((4 * 2**30)) --dashboard-host=0.0.0.0 --metrics-export-port=8080 --block --num-cpus=14 --system-config='{"automatic_object_spilling_enabled": false}' &
● Ray actual task resources
○ python release/nightly_tests/dataset/multi_node_train_benchmark.py --num-workers 12 --file-type image --data-root s3://ai-ref-arch/imagenet-mini/train --object-store-memory $((4 * 2**30))
Slide 24
Slide 24 text
Alluxio+Ray Benchmark – I/O Throughput
[Chart: I/O throughput in Mbps, without Alluxio vs. with Alluxio]
Slide 25
Slide 25 text
Cost Saving – Egress/Data Transfer Fees
Slide 26
Slide 26 text
Cost Saving – API Calls/S3 Operations (List, Get)
Slide 27
Slide 27 text
Case Study
Slide 28
Slide 28 text
Zhihu: High-Performance AI Platform for LLM
- Increased GPU utilization from 50% to 93%
- 2-4X faster model training
[Diagram: training data and models flow from HDFS across the training clouds, offline cloud, and online cloud - through model training, model deployment, model inference, downstream applications, and model update.]
Slide 29
Slide 29 text
Thank you! Questions are welcome.
Feel free to engage with me on Slack!
Scan the QR code for the data access patterns white paper: