
Multi-Region/Cloud Ray Pipeline with Distributed Caching

Anyscale
November 02, 2023

In some cases, the stages of a machine learning pipeline are distributed across regions or clouds: data preprocessing, model training, and inference run in different regions/clouds to leverage special resource types or services that exist only in a particular cloud, and to reduce latency by placing inference near user-facing applications. Additionally, as GPUs remain scarce, it is increasingly common to set up training clusters remote from where the data resides. This multi-region/cloud scenario sacrifices data locality, resulting in high latency and expensive data egress costs.

In this talk, Beinan Wang, Senior Staff Software Engineer at Alluxio, will discuss how Alluxio’s open-source distributed caching system integrates with Ray in the multi-region/cloud scenario:

* The data locality challenges in the multi-region/cloud ML pipeline
* The Ray+PyTorch+Alluxio stack that overcomes these challenges, optimizes model training performance, saves on costs, and improves reliability
* The architecture and integration of Ray+PyTorch+Alluxio using POSIX or RESTful APIs
* ResNet and BERT benchmark results showing performance gains, with a cost savings analysis
* Real-world examples of how Zhihu, a top Q&A platform, combined Alluxio’s distributed caching and data management with Ray’s scalable distributed computing to optimize its multi-cloud model training performance


Transcript

  1. Multi-Region/Cloud
    Ray Pipeline with
    Distributed Caching
    Beinan Wang, Architect @ Alluxio

  2. Challenges
    In Data Loading

  3. Challenges of Data Loading in AI/ML Pipeline
    ● Large Data Volume
    ● Remote Data Loading
    ● Various Data Access Patterns

  4. 100,000,000,000,000,000,000,000
    bytes of data (100 zettabytes) will be stored in the cloud by 2025
    Source: Cybersecurity Ventures

  5. Remote Data Loading
    [Diagram: numbered steps 1-5 trace data flow across two platforms. An
    offline training platform runs a training cluster behind Alluxio and
    reads training data; an online ML platform runs an inference cluster
    behind Alluxio and loads the resulting models.]

  6. What are Data Access Patterns?

  7. Various Data Access Patterns in the ML Pipeline

  8. Data Access Patterns in Model Training

  9. Data Loading
    in Ray

  10. Ray is Designed for Distributed Training
    ● Ray uses a distributed scheduler to dispatch training jobs to available
    workers (CPUs/GPUs)
    ● Enables seamless horizontal scaling of training jobs across multiple nodes
    ● Provides a streaming data abstraction for ML training, enabling parallel
    and distributed preprocessing (see the sketch below)
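
A minimal sketch of this streaming pattern with Ray Data; the bucket path, transform, and batch size are illustrative assumptions, not from the talk:

    import ray

    ray.init()

    # Streaming read: blocks are fetched and decoded in parallel across
    # the cluster rather than downloaded up front.
    ds = ray.data.read_images("s3://my-bucket/imagenet-mini/train")

    def preprocess(batch):
        # Placeholder transform; a real pipeline would resize/normalize here.
        return batch

    ds = ds.map_batches(preprocess)

    # Iterating consumes the stream; Ray overlaps I/O, preprocessing,
    # and training compute.
    for batch in ds.iter_batches(batch_size=64):
        pass  # train_step(batch)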

  11. Why Streaming
    *“Downloading the entire training dataset to local disk may not make sense
    ● If you don’t have enough disk space on each node to store the entire dataset
    ● If you want to overlap downloading the data with data preprocessing and training
    ● If you want each worker node to read a different and random subset of the data on each epoch”
    *Source: https://www.anyscale.com/blog/fast-flexible-scalable-data-loading-for-ml-training-with-ray-data
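
A hedged sketch of the third point, reading a different random subset per epoch, using Ray Data's per-epoch shuffle (the dataset path and sizes are assumptions):

    import ray

    ds = ray.data.read_parquet("s3://my-bucket/train-data")

    for epoch in range(10):
        # Each epoch streams a freshly shuffled view of the dataset;
        # without a cache, the underlying blocks are re-read from remote
        # storage every time (the cost discussed on the next slide).
        for batch in ds.random_shuffle().iter_batches(batch_size=256):
            pass  # train_step(batch)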

  12. Performance & Cost Implications of Streaming
    ● You might load the entire dataset again and again, once per epoch
    ● You cannot automatically cache the hottest data across multiple training jobs
    ● You might suffer a cold start every time a job begins

  13. Caching
    Solution

  14. Alluxio’s Position In the Ray Ecosystem
    ● Storage - data storage
    ● Alluxio - high-performance data access layer
    ● ML Framework - model training/inference
    ● Unified Compute - ML pipeline orchestration

  15. AI Training with Alluxio
    [Diagram: GPU training workloads (CV, NLP) and an interactive notebook
    on Kubernetes access dataset storage through Alluxio via the POSIX API
    (FUSE), REST API, or S3 API; the Alluxio Operator manages the
    deployment, with an Alluxio dashboard and a visualization dashboard
    for monitoring.]
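
Under the POSIX API, training code sees the Alluxio namespace as a local directory. A minimal sketch with a PyTorch DataLoader (the mount point /mnt/alluxio and the dataset layout are assumptions about the deployment):

    import torch
    from torchvision import datasets, transforms

    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])

    # Reads go through the Alluxio FUSE mount: cold data is fetched from
    # the dataset storage (e.g. S3), hot data is served from the cache.
    train_set = datasets.ImageFolder("/mnt/alluxio/imagenet-mini/train",
                                     transform=transform)
    loader = torch.utils.data.DataLoader(train_set, batch_size=64,
                                         num_workers=8)

    for images, labels in loader:
        pass  # train_step(images, labels)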

  16. AI Inference with Alluxio
    [Diagram: the same stack for serving: GPU/CPU inference workloads and
    an interactive notebook on Kubernetes read model storage through
    Alluxio via the POSIX API (FUSE), REST API, or S3 API, managed by the
    Alluxio Operator with an Alluxio dashboard.]
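
The same mount serves inference: model files load as if they were local, and hot models come from the cache, so new replicas avoid re-downloading from model storage. A short sketch (the path and file name are assumptions):

    import torch

    # Weights are read through the Alluxio FUSE mount.
    state_dict = torch.load("/mnt/alluxio/models/resnet50.pt",
                            map_location="cpu")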

  17. REST API
    List-files endpoint (HTTP GET):
    http://<worker_host>:<port>/v1/files?path=<path>
    1. The client sends an HTTP request to a worker to exec ls, e.g.
    http://worker01:28080/v1/files?path=/
    2. The worker responds with the JSON result; for a directory containing
    docs, NOTICE, overview.md, and security.md:
    [
    {"mType": "directory", "mName": "docs", "mLength": 0},
    {"mType": "file", "mName": "NOTICE", "mLength": 4352}
    ]
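
A minimal client for the list-files call shown above (the host, port, and path follow the slide's example; error handling omitted):

    import requests

    resp = requests.get("http://worker01:28080/v1/files",
                        params={"path": "/"})
    for entry in resp.json():
        # mType is "file" or "directory"; mLength is the size in bytes.
        print(entry["mType"], entry["mName"], entry["mLength"])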

  18. REST API
    Get-page endpoint (HTTP GET):
    http://<worker_host>:<port>/v1/file/<file_id>/page/<page_index>
    Cached files are split into pages (here pages #0 to #2) and addressed
    by file ID, e.g.
    5f2829f08879b0e89d07174cffa8d891bdf08ba9e91218e30fe39503dd42e32c
    1. The client sends an HTTP request to get a page's bytes, e.g.
    http://127.0.0.1:28080/v1/file/5f2829f08879b0e89d07174cffa8d891bdf08ba9e91218e30fe39503dd42e32c/page/0
    2. The worker responds with the bytes of the page
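
A minimal client for the get-page call, reassembling a file from its pages (the file ID and page count come from the slide; treating the pages as a contiguous byte range is an assumption):

    import requests

    BASE = "http://127.0.0.1:28080/v1/file"
    FILE_ID = ("5f2829f08879b0e89d07174cffa8d891"
               "bdf08ba9e91218e30fe39503dd42e32c")

    data = bytearray()
    for page in range(3):  # pages #0 to #2
        resp = requests.get(f"{BASE}/{FILE_ID}/page/{page}")
        data.extend(resp.content)  # raw bytes of the page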

  19. Training Benchmark for Each API

  20. Visualization Dashboard Results (w/o Alluxio)
    Training Directly from Storage
    - Over 80% of total time is spent in the DataLoader
    - Results in a low GPU utilization rate (<20%)

  21. Visualization Dashboard Results (with Alluxio)
    Training with Alluxio
    - Reduced DataLoader overhead from 82% to 1% of total time (82x)
    - Increased GPU utilization rate from 17% to 93% (5x)

  22. Alluxio+Ray Benchmark Setup
    ● Instance type
    ○ m5.4xlarge: 16 vCPUs, 64 GB memory
    ● Ray head resources
    ○ nohup ray start --head --memory=$((16 * 2**30)) \
      --object-store-memory=$((4 * 2**30)) --dashboard-host=0.0.0.0 \
      --metrics-export-port=8080 --block --num-cpus=14 \
      --system-config='{"automatic_object_spilling_enabled": false}' &
    ● Ray actual task resources
    ○ python release/nightly_tests/dataset/multi_node_train_benchmark.py \
      --num-workers 12 --file-type image \
      --data-root s3://ai-ref-arch/imagenet-mini/train \
      --object-store-memory $((4 * 2**30))

  23. Alluxio+Ray Benchmark – I/O Throughput
    [Chart: I/O throughput in Mbps, without Alluxio vs. with Alluxio]

  24. Cost Saving – Egress/Data Transfer Fees

  25. Cost Saving – API Calls/S3 Operations (List, Get)

  26. Zhihu: High Performance AI Platform for LLM
    [Diagram: a multi-cloud pipeline spanning training clouds, an offline
    cloud, and an online cloud. Training data flows from HDFS to model
    training; trained models move through model deployment to model
    inference and downstream applications, with model updates fed back.]
    Results: GPU utilization increased from 50% to 93%; 2-4x faster model
    training

  27. Thank you! Any questions are welcome.
    Feel free to engage with me on Slack!
    Scan the QR code for the data access
    patterns white paper:
