Slide 1

Slide 1 text

Multi-Region/Cloud Ray Pipeline with Distributed Caching
Beinan Wang, Architect @ Alluxio

Slide 2

Slide 2 text

Challenges In Data Loading

Slide 3

Slide 3 text

Challenges of Data Loading in the AI/ML Pipeline
● Large Data Volume
● Remote Data Loading
● Various Data Access Patterns

Slide 4

Slide 4 text

100,000,000,000,000,000,000,000 bytes (100 zettabytes) of data will be stored in the cloud by 2025
Source: Cybersecurity Ventures

Slide 5

Slide 5 text

Remote Data Loading
[Diagram: an offline training platform (training cluster) loads training data through Alluxio and produces models; the models are published to an online ML platform, where the inference cluster loads them through Alluxio for serving (steps 1-5).]

Slide 6

Slide 6 text

What are Data Access Patterns?

Slide 7

Slide 7 text

Various Data Access Patterns in the ML Pipeline

Slide 8

Slide 8 text

Data Access Patterns in Model Training

Slide 9

Slide 9 text

Data Loading in Ray

Slide 10

Slide 10 text

Ray is Designed for Distributed Training
● Ray uses a distributed scheduler to dispatch training jobs to available workers (CPUs/GPUs)
● Enables seamless horizontal scaling of training jobs across multiple nodes
● Provides a streaming data abstraction (Ray Data) for parallel and distributed preprocessing in ML training
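As a rough illustration of that streaming abstraction, the sketch below uses Ray Data to read an image dataset and iterate over it in batches; the bucket path and batch size are placeholders, not values from the benchmark later in this deck.

    import ray

    # Lazily read an image dataset from object storage (placeholder bucket/path).
    ds = ray.data.read_images("s3://example-bucket/imagenet-mini/train")

    # Stream batches into the training loop instead of downloading the full dataset first.
    for batch in ds.iter_batches(batch_size=256):
        images = batch["image"]  # feed into the training step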

Slide 11

Slide 11 text

Why Streaming
*“Downloading the entire training dataset to local disk may not make sense
● If you don't have enough disk space on each node to store the entire dataset
● If you want to overlap downloading the data with data preprocessing and training
● If you want each worker node to read a different and random subset of the data on each epoch”
*Source: https://www.anyscale.com/blog/fast-flexible-scalable-data-loading-for-ml-training-with-ray-data
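The last point (a different random subset per epoch) can be sketched with Ray Data's per-epoch shuffle; the dataset path and epoch count below are illustrative assumptions.

    import ray

    ds = ray.data.read_parquet("s3://example-bucket/train")  # placeholder dataset

    for epoch in range(3):  # illustrative number of epochs
        # Re-shuffle every epoch so each pass sees the data in a different random order.
        for batch in ds.random_shuffle().iter_batches(batch_size=256):
            ...  # run a training step on this batch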

Slide 12

Slide 12 text

Performance & Cost Implication of Streaming ● You might load the entire dataset again and again for each epoch ● You cannot cache the hottest data among multiple training jobs automatically ● You might be suffering from a cold start every time.

Slide 13

Slide 13 text

Caching Solution

Slide 14

Slide 14 text

Alluxio’s Position in the Ray Ecosystem
● Storage - Data storage
● Alluxio - High-performance data access layer
● ML Framework - Model training/inference
● Unified Compute - ML pipeline orchestration

Slide 15

Slide 15 text

AI Training with Alluxio
[Architecture diagram: GPU training workloads (CV, NLP) and interactive notebooks on Kubernetes access dataset storage through Alluxio via the POSIX API (FUSE), REST API, or S3 API; deployment is managed by the Alluxio Operator, with a visualization dashboard and the Alluxio dashboard for monitoring.]
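For example, with the POSIX API the dataset can be read from an Alluxio FUSE mount exactly like a local directory; the mount point below is an assumption for illustration, not a path from this deployment.

    import ray

    # Assumed Alluxio FUSE mount point, present on every node (adjust to your deployment).
    DATASET_PATH = "/mnt/alluxio/imagenet-mini/train"

    # Reads go through the local FUSE mount, so hot files are served from the Alluxio cache
    # instead of being fetched from remote storage on every epoch.
    ds = ray.data.read_images(DATASET_PATH)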

Slide 16

Slide 16 text

AI Inference with Alluxio
[Architecture diagram: GPU/CPU inference services and interactive notebooks on Kubernetes load models from model storage through Alluxio via the POSIX API (FUSE), REST API, or S3 API; deployment is managed by the Alluxio Operator and monitored through the Alluxio dashboard.]
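A minimal sketch of the same idea on the inference side, assuming model checkpoints are exposed through an Alluxio FUSE mount; the path, file name, and checkpoint format are placeholders.

    import torch

    # Placeholder path: model storage mounted on the inference pods via Alluxio FUSE.
    MODEL_PATH = "/mnt/alluxio/models/resnet50.pt"

    # The checkpoint is read through the Alluxio cache, so many replicas starting at once
    # do not all hit the remote model store directly.
    model = torch.load(MODEL_PATH, map_location="cpu")
    model.eval()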

Slide 17

Slide 17 text

REST API – List Files
1. The client sends an HTTP GET request to a worker's RESTful API to list a path:
   http://<worker_host>:<port>/v1/files?path=<path>
   e.g. http://worker01:28080/v1/files?path=/
2. The worker responds with the listing as JSON:
   [ {"mType": "directory", "mName": "docs", "mLength": 0},
     {"mType": "file", "mName": "NOTICE", "mLength": 4352} ]
(The example directory contains docs, NOTICE, overview.md, and security.md.)
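A quick sketch of calling this listing endpoint from Python, reusing the worker host and port from the slide's example (adjust both to your cluster):

    import requests

    # Worker address taken from the slide's example; adjust to your deployment.
    resp = requests.get("http://worker01:28080/v1/files", params={"path": "/"})
    resp.raise_for_status()

    for entry in resp.json():
        print(entry["mType"], entry["mName"], entry["mLength"])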

Slide 18

Slide 18 text

REST API – Get Page
1. The client sends an HTTP GET request to fetch one page of a file by file ID and page index:
   http://<worker_host>:<port>/v1/file/<file_id>/page/<page_index>
   e.g. http://127.0.0.1:28080/v1/file/5f2829f08879b0e89d07174cffa8d891bdf08ba9e91218e30fe39503dd42e32c/page/0
2. The worker responds with the raw bytes of the requested page (pages #0 to #2 for this file ID in the example).
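A matching sketch for fetching page bytes, reusing the file ID and worker address from the slide; how the file ID is obtained is not shown here.

    import requests

    FILE_ID = "5f2829f08879b0e89d07174cffa8d891bdf08ba9e91218e30fe39503dd42e32c"
    BASE_URL = "http://127.0.0.1:28080/v1/file"  # worker address from the slide's example

    # Fetch pages #0 to #2 and reassemble them in order.
    data = b""
    for page in range(3):
        resp = requests.get(f"{BASE_URL}/{FILE_ID}/page/{page}")
        resp.raise_for_status()
        data += resp.content  # raw bytes of this page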

Slide 19

Slide 19 text

Benchmarks

Slide 20

Slide 20 text

Training Benchmark for Each API

Slide 21

Slide 21 text

Visualization Dashboard Results (w/o Alluxio)
Training Directly from Storage
- > 80% of total time is spent in the DataLoader
- Results in a low GPU utilization rate (<20%)

Slide 22

Slide 22 text

Visualization Dashboard Results (with Alluxio)
Training with Alluxio
- Reduced the DataLoader share of time from 82% to 1% (82X)
- Increased the GPU utilization rate from 17% to 93% (5X)

Slide 23

Slide 23 text

Alluxio+Ray Benchmark Setup
● Instance Type
○ m5.4xlarge (16 vCPU, 64 GB memory)
● Ray head resources
○ nohup ray start --head --memory=$((16 * 2**30)) --object-store-memory=$((4 * 2**30)) --dashboard-host=0.0.0.0 --metrics-export-port=8080 --block --num-cpus=14 --system-config='{"automatic_object_spilling_enabled": false}' &
● Ray actual task resources
○ python release/nightly_tests/dataset/multi_node_train_benchmark.py --num-workers 12 --file-type image --data-root s3://ai-ref-arch/imagenet-mini/train --object-store-memory $((4 * 2**30))

Slide 24

Slide 24 text

Alluxio+Ray Benchmark – I/O Throughput
[Chart: I/O throughput in Mbps, without Alluxio vs. with Alluxio]

Slide 25

Slide 25 text

Cost Saving – Egress/Data Transfer Fees

Slide 26

Slide 26 text

Cost Saving – API Calls/S3 Operations (List, Get)

Slide 27

Slide 27 text

Case Study

Slide 28

Slide 28 text

Zhihu: High Performance AI Platform for LLM
- Increased GPU utilization from 50% to 93%
- 2 - 4X faster model training
[Diagram: training data flows from HDFS on the offline cloud to model training on the training clouds; trained models are deployed for model inference on the online cloud, serving downstream applications and feeding model updates back into training.]

Slide 29

Slide 29 text

Thank you! Any questions are welcome.
Feel free to engage with me on Slack!
Scan the QR code for the data access patterns white paper.