Ray Train: Production-ready distributed deep learning

Anyscale
February 10, 2022

Today, most frameworks for prototyping, training, and distributing deep learning workloads across a cluster are either powerful but inflexible, or nimble but toy-like. Data scientists are forced to choose between a great developer experience and a production-ready framework.

To close this gap, the Ray ML team has developed Ray Train.

Ray Train is a library built on top of the Ray ecosystem that simplifies distributed deep learning. Currently in stable beta in Ray 1.9, Ray Train offers the following features:

- Scales to multi-GPU and multi-node training with zero code changes
- Runs seamlessly on any cloud (AWS, GCP, Azure, Kubernetes, or on-prem)
- Supports PyTorch, TensorFlow, and Horovod
- Distributes data shuffling and loading with Ray Datasets
- Distributes hyperparameter tuning with Ray Tune
- Provides built-in loggers for TensorBoard and MLflow

In this webinar, we'll talk through some of the challenges in large-scale computer vision ML training, and show a demo of Ray Train in action.
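
As a taste of the API, here is a minimal sketch of a Ray Train script using the Ray 1.9-era PyTorch integration (the same pattern is walked through later in the deck). The tiny linear model, random dataset, and hyperparameters are placeholders for illustration only.

    import torch
    from ray import train
    import ray.train.torch  # makes the train.torch utilities available
    from ray.train import Trainer

    def train_func():
        # Ordinary single-process PyTorch setup: placeholder model and data.
        model = torch.nn.Linear(10, 1)
        dataset = torch.utils.data.TensorDataset(torch.randn(128, 10), torch.randn(128, 1))
        dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

        # Ray Train wraps the model (DDP) and data loader (distributed sampler).
        model = train.torch.prepare_model(model)
        dataloader = train.torch.prepare_data_loader(dataloader)

        loss_fn = torch.nn.MSELoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        for _ in range(2):  # training loop
            for X, y in dataloader:
                optimizer.zero_grad()
                loss = loss_fn(model(X), y)
                loss.backward()
                optimizer.step()

    # Scaling out is a matter of num_workers (and use_gpu=True for GPUs).
    trainer = Trainer(backend="torch", num_workers=2)
    trainer.start()
    trainer.run(train_func)
    trainer.shutdown()

Running the same script against a multi-node cluster only changes num_workers, as the slides below show.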

Transcript

  1. Ray Train: Distributed Deep Learning
    Amog Kamsetty
    Matthew Deng
    Will Drevo

  2. Overview
    I. Problems with model training
    II. Why Ray Train?
    III. Ray Train: development goals
      A. Developer velocity
      B. Production-ready
      C. Batteries included
    IV. Roadmap for H1 2022
    V. CIFAR-10 deep learning demo

  3. Problems in model training
    The two reasons ML training is hard

  4. Data science is iterative
    What people think data science projects should look like
    [Chart: progress toward the goal (% to goal), ending at "Value!"]

  5. Data science is iterative
    What people think data science projects should look like vs. what they actually look like
    [Chart: the actual path loops back on itself repeatedly before reaching "Value!"]

  6. Today, iteration is expensive
    Let’s break it down:
    (1) Compute time (and therefore also cloud $$$)
    We try to fix this by distributing, but in doing so we often introduce...
    (2) Developer cognitive overhead
    Which reduces development & experimentation velocity

  7. Goal: reduce the length of the line (time to value)
    Size of loop ≈ how long training takes
    Distance between loops ≈ developer velocity

  8. Problem #1: Compute time ...

  9. It gets worse…
    Models are increasing in compute demand (2018 study by OpenAI):
    ● Compute requirements have doubled every 3.4 months since 2012 (roughly 35x every 18 months)
    ● 300,000x increase in compute from AlexNet to AlphaGo Zero

  10. Problem #2: Developer cognitive overhead
    ● Configuring GPUs
    ● Converting code
    ● Starting a job
    ● Spinning up machines

  11. What is Ray Train?
    A library built on Ray that simplifies distributed deep learning training
    • Scale to multi-GPU and multi-node training with 0 code changes
    • Runs seamlessly on any cloud (AWS, GCP, Azure, Kubernetes, or on-prem)
    • Supports PyTorch, TensorFlow, and Horovod
    • Distributed data loading (Datasets) and hyperparameter tuning (Tune)
    • Built-in loggers for TensorBoard and MLflow

  12. What is Ray?
    [Diagram: the Ray ecosystem, with Ray Train as one of the libraries built on Ray]

  13. Growth of Ray open source
    ● 13,000+ Ray users
    ● 1,600+ repositories depend on Ray
    ● 449+ open-source contributors
    ● 18.7K GitHub stars as of 1/1/22

  14. Where does Ray Train fit in?
    Motivations & context from the deep learning frameworks space

  15. Deep learning framework tradeoffs
    [Chart: frameworks plotted by production-readiness vs. ease of development.
    Heavyweight frameworks are inflexible and hard to customize; lightweight ones are
    nimble but unscalable. Ray Train aims for both production-readiness and ease of development.]

  16. Comparing scalable solutions
    | Name | Open source | Torch | TF | Horovod | AWS | GCP | K8s | On-prem | Scalable | Local → cluster | Tuning | Pythonic / no containerization required ⭐ |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Ray Train | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
    | Kubeflow (requires Kubernetes) | ✅ | ✅ | ✅ | ✅ | ❌ * | ❌ * | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
    | AWS SageMaker | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |
    | Google AI Platform | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |
    | Azure ML | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |
    * Kubeflow does run on AWS/GCP, but requires K8s so it isn’t as flexible (can’t run on bare VMs)

  17. Goals of Ray Train
    1) Developer velocity
    2) Production-ready
    3) Batteries included

  18. GOAL 1: Developer velocity
    Quickly go from ideas to scaling horizontally

  19. Ray Train startup time is very fast

  20. Coding is fun, infra ... not always
    Every data scientist or MLE wants to do modeling, but infrastructure and operations work is an impediment to showing value from ML!

  21. K8s-based solutions often have high overhead

  22. Trying to understand all the k8s network policies in a cluster

  23. Even non-k8s-based hosted platforms require containerization
    Docker images aren’t fun to build and constantly revise

  24. Ray Train: easy as 1, 2, 3
    Scale up in your code, not in your infrastructure

    from ray import train
    from ray.train import Trainer

    # Step 1: Put your training code in one function
    def train_func():
        …  # existing model and data loader setup

        # Step 2: Wrap your model and dataset
        model = train.torch.prepare_model(model)
        dataloader = train.torch.prepare_data_loader(dataloader)

        for _ in range(num_epochs):
            …  # training loop

    # Step 3: Create your Trainer and run!
    trainer = Trainer(backend="torch", num_workers=4)
    trainer.start()
    results = trainer.run(train_func)
    trainer.shutdown()

  25. Ray Train: easy to scale
    ● Scale to multiple machines – use the Ray Cluster Launcher
    ● Just a one-line change in the code

    my_script.py (on my laptop):
    ...
    num_workers = 100
    trainer = Trainer(backend="torch", num_workers=num_workers)

    $ ray up my_cluster.yaml
    $ ray submit my_cluster.yaml my_script.py

  26. Ray Train: use GPUs!
    ● Train with multiple GPUs – also a one-line change

    my_script.py (on my laptop):
    ...
    num_workers = 100
    trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=True)

    $ ray up my_cluster.yaml
    $ ray submit my_cluster.yaml my_script.py

  27. Ray Train: run from your laptop or on a cluster
    Multiple ways to deploy:
    1. Use Ray Client for interactive runs directly from your laptop
    2. Use Ray Jobs for production-ready runs

    # Connect to a remote cluster with Ray Client
    ray.init("ray://<head_node_host>:<port>")
    # or use a managed service, like Anyscale
    # ray.init("anyscale://")
    ...
    trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=True)

  28. GOAL 2: Production-ready
    Future-proof your ML training system

  29. What does “production-ready” mean?
    In a word: Future-proof.
    More concretely:
    ● Cost effective
    ● Stability & reliability
    ● Plays nicely with other “production-ready” tools
    It means you’re never scared that you’ll have to scrap all your work once you reach “scale” or your data size increases.

  30. Future-proofing #1: Cost effectiveness
    How can we achieve this?
    1. Maximizing performance, minimizing compute usage
    2. Cloud agnostic (avoid lock-in)
    Ray Train supports spot instances (both CPU and GPU)
    Ray Train works on any cloud Ray runs on: AWS, GCP, Azure, K8s, on-prem...
    So as your usage grows, you can be confident you can migrate to wherever cloud costs are lowest.

  31. Future-proofing #2: Trusted by the best
    ● 13,000+ Ray users
    ● 1,600+ repositories depend on Ray
    ● 449+ open-source contributors

  32. Future-proofing #3: Plays nicely with other production-ready tools
    How can we achieve this?
    1. Monitoring (Grafana, etc.)
    2. Metadata tracking
       a. MLflow
       b. TensorBoard
       c. Weights & Biases (integration coming soon!)
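
    For metadata tracking, the Ray 1.9-era API exposes logger callbacks that can be passed to trainer.run(). Below is a minimal sketch with the TensorBoard callback; the MLflow and W&B callbacks follow the same pattern, though exact names and availability depend on the Ray version, and the metric values here are placeholders.

    from ray import train
    from ray.train import Trainer
    from ray.train.callbacks import TBXLoggerCallback  # writes TensorBoard event files

    def train_func():
        for epoch in range(3):
            loss = 1.0 / (epoch + 1)  # placeholder metric
            train.report(epoch=epoch, loss=loss)  # reported metrics flow to the callbacks

    trainer = Trainer(backend="torch", num_workers=2)
    trainer.start()
    # Every train.report() call is logged as a TensorBoard event for later inspection.
    trainer.run(train_func, callbacks=[TBXLoggerCallback()])
    trainer.shutdown()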

  33. GOAL 3: Batteries included
    Compatible with the entire Python and Ray ecosystem

  34. Why “batteries included”?
    Training is never done in a vacuum.
    1. Upstream: feature processing, data ingest
    2. Downstream: re-training, tuning, serving, monitoring
    Ray Train’s batteries included are:
    ● Dask on Ray or Spark on Ray for feature engineering or ETLs
    ● Ray Datasets for parallel training data ingest
    ● Ray Tune for hyperparameter tuning
    ● Ray Serve for model serving and composition
    ● Integrations with MLflow, W&B, Tensorboard
    ... and anything in the Python ecosystem! Ray Train can parallelize anything in Python.

  35. Distributed Data Loading with Ray Datasets
    1. Sharded datasets
    2. Windowed datasets: train on data larger than RAM
    3. Pipelined Execution: keep your GPUs fully saturated
    4. Global shuffling: improve performance over no shuffling or local shuffling
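
    An illustrative sketch of how the Datasets integration looks with the Ray 1.9-era API: the Trainer shards a Ray Dataset across workers, and each worker pulls its shard inside the training function. The S3 path and batch size are placeholders.

    import ray
    from ray import train
    from ray.train import Trainer

    # Placeholder source; any Ray Datasets reader (Parquet, CSV, ...) works here.
    dataset = ray.data.read_parquet("s3://my-bucket/training-data/")

    def train_func():
        shard = train.get_dataset_shard()  # this worker's portion of the dataset
        for _ in range(2):  # epochs
            for batch in shard.iter_batches(batch_size=1024):
                ...  # forward/backward pass on the batch

    trainer = Trainer(backend="torch", num_workers=4)
    trainer.start()
    trainer.run(train_func, dataset=dataset)  # the dataset is sharded across the 4 workers
    trainer.shutdown()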

  36. Hyperparameter Optimization with Ray Tune
    Perform distributed hyperparameter tuning / training in just a few lines of code:
    trainable = trainer.to_tune_trainable(train_func)
    analysis = tune.run(trainable, config=...)
    1. Minimal code changes
    2. Automatic resource management
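
    A slightly fuller sketch of the same pattern, assuming the Ray 1.9-era to_tune_trainable API; the search space, metric name, and num_samples are illustrative only.

    from ray import tune
    from ray.train import Trainer

    def train_func(config):
        lr = config["lr"]  # hyperparameter supplied by Tune for this trial
        ...  # training loop that calls train.report(loss=...)

    trainer = Trainer(backend="torch", num_workers=4)
    trainable = trainer.to_tune_trainable(train_func)

    # Each Tune trial launches a full distributed training run with its own config.
    analysis = tune.run(
        trainable,
        config={"lr": tune.loguniform(1e-4, 1e-1)},
        num_samples=4,
    )
    print(analysis.get_best_config(metric="loss", mode="min"))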

  37. Review so far
    I. Problems with model training
    II. Why Ray Train?
    III. Ray Train: development goals
      A. Developer velocity
      B. Production-ready
      C. Batteries included
    IV. Roadmap for H1 2022
    V. CIFAR-10 deep learning demo

  38. Ray Train: Roadmap
    What to look forward to this year

  39. Ray Train H1 2022 Roadmap
    Q1, 2022:
    ● Elastic training, better metrics/results handling, model parallelism
    ● DeepSpeed, fp16 support
    ● Integrations: W&B, PyTorch profiler
    ● Unified ML API alpha
    Q2, 2022:
    ● Better checkpointing, stability
    ● Advanced operations: GNNs, parameter serving, benchmarking
    ● Unified ML API beta

  40. Demo: Training a deep learning classification model with PyTorch
    Training a PyTorch image model at scale
