
Multi-Region/Cloud Ray Pipeline with Distributed Caching

Anyscale
November 02, 2023


In some cases, the stages of a machine learning pipeline may be distributed across regions or clouds. Data preprocessing, model training, and inference run in different regions/clouds to leverage special resource types or services that exist in a particular cloud, and to reduce latency by placing inference near user-facing applications. Additionally, as GPUs remain scarce, it is increasingly common to set up training clusters remote from where the data resides. This multi-region/cloud scenario loses data locality, resulting in higher latency and expensive data egress costs.

In this talk, Beinan Wang, Senior Staff Software Engineer at Alluxio, discusses how Alluxio’s open-source distributed caching system integrates with Ray in the multi-region/cloud scenario:

* The data locality challenges in the multi-region/cloud ML pipeline
* How the Ray+PyTorch+Alluxio stack overcomes these challenges, optimizes model training performance, saves on costs, and improves reliability
* The architecture and integration of Ray+PyTorch+Alluxio using POSIX or RESTful APIs
* ResNet and BERT benchmark results showing performance gains and a cost savings analysis
* Real-world examples of how Zhihu, a top Q&A platform, combined Alluxio’s distributed caching and data management with Ray’s scalable distributed computing to optimize its multi-cloud model training performance

Transcript

  1. Challenges of Data Loading in AI/ML Pipeline
     • Large Data Volume
     • Remote Data Loading
     • Various Data Access Patterns
  2. Remote Data Loading
     (Diagram: an offline training platform where the training cluster reads training data through Alluxio, and an online ML platform where the inference cluster loads models through Alluxio; numbered arrows show training data and models flowing between the two.)
  3. Ray is Designed for Distributed Training
     • Ray uses a distributed scheduler to dispatch training jobs to available workers (CPUs/GPUs)
     • Enables seamless horizontal scaling of training jobs across multiple nodes
     • Provides a streaming data abstraction for ML training, with parallel and distributed preprocessing
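     As a rough illustration of that streaming abstraction (a sketch, not taken from the deck; the bucket path and batch size are placeholder assumptions), a Ray Data pipeline streams and preprocesses batches instead of downloading the whole dataset up front:

     import ray

     # Lazily reads images from remote storage; nothing is downloaded up front.
     ds = ray.data.read_images("s3://my-bucket/imagenet-mini/train")  # hypothetical path

     # Batches are materialized on demand and overlapped with training.
     for batch in ds.iter_torch_batches(batch_size=32):
         images = batch["image"]
         # ... run the training step on `images` ...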
  4. Why Streaming
     *"Downloading the entire training dataset to local disk may not make sense
     • If you don't have enough disk space on each node to store the entire dataset
     • If you want to overlap downloading the data with data preprocessing and training
     • If you want each worker node to read a different and random subset of the data on each epoch"
     *Source: https://www.anyscale.com/blog/fast-flexible-scalable-data-loading-for-ml-training-with-ray-data
  5. Performance & Cost Implications of Streaming
     • You might load the entire dataset again and again for each epoch
     • You cannot automatically cache the hottest data across multiple training jobs
     • You might suffer a cold start every time
  6. Alluxio’s Position in the Ray Ecosystem
     • Unified Compute - ML pipeline orchestration
     • ML Framework - model training/inference
     • Alluxio - high-performance data access layer
     • Storage - data storage
  7. AI Training with Alluxio
     (Diagram: GPU training workloads (CV, NLP) and an interactive notebook access dataset storage through Alluxio via the POSIX API (FUSE), REST API, or S3 API; the deployment runs on Kubernetes with the Alluxio Operator, a visualization dashboard, and the Alluxio dashboard.)
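     One way the POSIX API path is typically wired up (a sketch under assumptions, not taken from the deck; the mount point and transforms are hypothetical): once Alluxio is mounted via FUSE on the training nodes, cached remote data looks like a local directory, so an ordinary PyTorch loader works unchanged:

     import torch
     from torchvision import datasets, transforms

     ALLUXIO_MOUNT = "/mnt/alluxio/imagenet-mini/train"  # hypothetical FUSE mount point

     # Standard ImageFolder loader reading through the Alluxio FUSE mount.
     dataset = datasets.ImageFolder(
         ALLUXIO_MOUNT,
         transform=transforms.Compose([
             transforms.Resize(256),
             transforms.CenterCrop(224),
             transforms.ToTensor(),
         ]),
     )
     loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=8)

     for images, labels in loader:
         pass  # training step goes here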
  8. AI Inference with Alluxio
     (Diagram: GPU/CPU inference workloads and an interactive notebook load models from model storage through Alluxio via the POSIX API (FUSE), REST API, or S3 API; the deployment runs on Kubernetes with the Alluxio Operator and the Alluxio dashboard.)
  9. REST API - listing files
     RESTful API: HTTP GET http://<worker_ip>:<port>/v1/files?path=<path>
     1. The client sends an HTTP request to the worker to execute ls, e.g. http://worker01:28080/v1/files?path=/
     2. The worker responds with the JSON result:
        [ {"mType": "directory", "mName": "docs", "mLength": 0},
          {"mType": "file", "mName": "NOTICE", "mLength": 4352} ]
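     A minimal client for this endpoint might look like the following (a sketch; the worker address is an assumption, while the endpoint and JSON fields follow the slide above):

     import requests

     WORKER = "http://worker01:28080"  # hypothetical worker address

     # List the root directory via the worker's RESTful API.
     resp = requests.get(f"{WORKER}/v1/files", params={"path": "/"})
     resp.raise_for_status()

     for entry in resp.json():
         # Each entry carries mType (file/directory), mName, and mLength in bytes.
         print(entry["mType"], entry["mName"], entry["mLength"])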
  10. REST API - reading a file page by page
      RESTful API: HTTP GET http://<worker_ip>:<port>/v1/file/<file_id>/page/<page_index>
      1. The client sends an HTTP request to get a page's bytes, e.g.
         http://127.0.0.1:28080/v1/file/5f2829f08879b0e89d07174cffa8d891bdf08ba9e91218e30fe39503dd42e32c/page/0
      2. The worker responds with the bytes of the page
      (A cached file is addressed by its file ID and split into pages, e.g. pages #0 to #2.)
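      Reading a file through this endpoint amounts to fetching its pages in order and concatenating the bytes (a sketch; the worker address is an assumption, and the file ID and page range follow the slide's example):

      import requests

      WORKER = "http://127.0.0.1:28080"  # hypothetical worker address
      FILE_ID = "5f2829f08879b0e89d07174cffa8d891bdf08ba9e91218e30fe39503dd42e32c"

      data = b""
      for page_index in range(3):  # pages #0 to #2, as in the example
          resp = requests.get(f"{WORKER}/v1/file/{FILE_ID}/page/{page_index}")
          resp.raise_for_status()
          data += resp.content  # raw bytes of this page

      print(f"reassembled {len(data)} bytes from 3 pages")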
  11. Visualization Dashboard Results (w/o Alluxio)
      Training directly from storage:
      • > 80% of total time is spent in the DataLoader
      • Results in a low GPU utilization rate (< 20%)
  12. Visualization Dashboard Results (with Alluxio)
      Training with Alluxio:
      • Reduced the DataLoader share of time from 82% to 1% (82x)
      • Increased the GPU utilization rate from 17% to 93% (5x)
  13. Alluxio+Ray Benchmark Setup
      • Instance type
        ◦ m5.4xlarge, 16 vCPU, 64 GB memory
      • Ray head resources
        ◦ nohup ray start --head --memory=$((16 * 2**30)) --object-store-memory=$((4 * 2**30)) --dashboard-host=0.0.0.0 --metrics-export-port=8080 --block --num-cpus=14 --system-config='{"automatic_object_spilling_enabled": false}' &
      • Ray actual task resources
        ◦ python release/nightly_tests/dataset/multi_node_train_benchmark.py --num-workers 12 --file-type image --data-root s3://ai-ref-arch/imagenet-mini/train --object-store-memory $((4 * 2**30))
  14. Zhihu: High Performance AI Platform for LLM
      • 2-4x faster model training
      • Increased GPU utilization from 50% to 93%
      (Diagram: training data flows from HDFS in the offline cloud to model training in the training clouds; trained models are deployed for model inference in the online cloud, serving downstream applications, with model updates fed back into training.)
  15. Thank you! Any questions are welcome.
      Feel free to engage with me on Slack!
      Scan the QR code for the data access patterns white paper.