
Ray Train: Production-ready distributed deep learning

Anyscale
February 10, 2022


Today, most frameworks for deep learning prototyping, training, and distributing to a cluster are either powerful and inflexible, or nimble and toy-like. Data scientists are forced to choose between a great developer experience and a production-ready framework.

To close this gap, the Ray ML team has developed Ray Train.

Ray Train is a library built on top of the Ray ecosystem that simplifies distributed deep learning. Currently in stable beta in Ray 1.9, Ray Train offers the following features:

- Scales to multi-GPU and multi-node training with zero code changes
- Runs seamlessly on any cloud (AWS, GCP, Azure, Kubernetes, or on-prem)
- Supports PyTorch, TensorFlow, and Horovod
- Distributed data shuffling and loading with Ray Datasets
- Distributed hyperparameter tuning with Ray Tune
- Built-in loggers for TensorBoard and MLflow

In this webinar, we'll talk through some of the challenges in large-scale computer vision ML training, and show a demo of Ray Train in action.


Transcript

  1. Anyscale Overview: Overview

    - I. Problems with model training
    - II. Why Ray Train?
    - III. Ray Train: development goals (A. Developer velocity; B. Production-ready; C. Batteries included)
    - IV. Roadmap for H1 2022
    - V. CIFAR-10 deep learning demo
  2. Data science is iterative

    What people think data science projects should look like, versus what data science projects actually look like. (Chart labels: "% to goal", "Value!")
  3. Today, iteration is expensive

    Let’s break it down:

    - (1) Compute time (and therefore also cloud $$$). We try to fix this by distributing, but in doing so we often introduce...
    - (2) Developer cognitive overhead, which reduces development & experimentation velocity.
  4. Goal: reduce the length of the line (time to value)

    - Size of loop ≈ how long training takes
    - Distance between loops ≈ developer velocity

    (Chart label: "Value!")
  5. It gets worse… Models are increasing in compute demand

    2018 study by OpenAI:

    - Compute requirement doubling every 3.4 months since 2012
    - 300,000x increase in compute from AlexNet to AlphaGo Zero
    - 35x every 18 months
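The growth figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes pure exponential growth with a fixed doubling time; the helper names are illustrative and not from the deck. Under that assumption a 3.4-month doubling time works out to roughly 39x per 18 months, while 35x per 18 months corresponds to a doubling time of about 3.5 months, so the two quoted numbers agree only approximately.

```python
import math


def growth_factor(doubling_months: float, period_months: float) -> float:
    """Multiplicative growth over period_months, given a fixed doubling time."""
    return 2.0 ** (period_months / doubling_months)


def doubling_time(factor: float, period_months: float) -> float:
    """Doubling time in months implied by growing factor-fold over period_months."""
    return period_months * math.log(2.0) / math.log(factor)


# Figures quoted on the slide.
print(f"3.4-month doubling over 18 months: {growth_factor(3.4, 18):.1f}x")
print(f"Doubling time implied by 35x per 18 months: {doubling_time(35, 18):.2f} months")

# Time needed for a 300,000x increase at a 3.4-month doubling time.
months = 3.4 * math.log2(300_000)
print(f"300,000x at a 3.4-month doubling time: {months / 12:.1f} years")
```

At a 3.4-month doubling time, a 300,000x increase takes a little over five years, which is in line with the AlexNet-to-AlphaGo Zero span the slide mentions.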
  6. What is Ray Train?

    A library built on Ray that simplifies distributed deep learning training:

    - Scale to multi-GPU and multi-node training with 0 code changes
    - Runs seamlessly on any cloud (AWS, GCP, Azure, Kubernetes, or on-prem)
    - Supports PyTorch, TensorFlow, and Horovod
    - Distributed data loading (Datasets) and hyperparameter tuning (Tune)
    - Built-in loggers for TensorBoard and MLflow
  7. Growth of Ray open-source

    - Ray Users: 13000+
    - Repositories Depend on Ray: 1600+
    - Open Source Contributors: 449+
    - 18.7K stars as of 1/1/22
  8. Where does Ray Train fit in?

    Motivations & context from the deep learning frameworks space
  9. Deep learning framework tradeoffs

    Chart axes: production-readiness vs. ease of development.

    - Heavyweight: inflexible, hard to customize
    - Lightweight: nimble, but unscalable
    - Ray Train
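  10. Comparing scalable solutions

    Torch, TF, and Horovod are the supported libraries; AWS, GCP, K8s, and On prem are the cloud types.

    | Name | Open source | Torch | TF | Horovod | AWS | GCP | K8s | On prem | Scalable | Local → Cluster | Tuning | Pythonic / no containerization required |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | ⭐ Ray Train | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
    | Kubeflow (requires Kubernetes) | ✅ | ✅ | ✅ | ✅ | ❌ * | ❌ * | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
    | AWS Sagemaker | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |
    | Google AI Platform | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |
    | Azure ML | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |

    \* Kubeflow does run on AWS/GCP, but requires K8s so it isn’t as flexible (can’t run on bare VMs)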
  11. Goals of Ray Train

    - 1) Developer velocity
    - 2) Production-ready
    - 3) Batteries included
  12. Coding is fun, infra ... not always

    Every data scientist or MLE wants to do modeling, but infrastructure and operations work is an impediment to showing value from ML!
  13. Ray Train: easy as 1, 2, 3

    Scale up in your code, not in your infrastructure

        from ray import train
        import ray.train.torch  # makes train.torch available
        from ray.train import Trainer

        def train_func():
            ...  # existing model and data loader setup
            model = train.torch.prepare_model(model)
            dataloader = train.torch.prepare_data_loader(dataloader)
            for _ in range(num_epochs):
                ...  # training loop

        trainer = Trainer(backend="torch", num_workers=4)
        trainer.start()
        results = trainer.run(train_func)
        trainer.shutdown()

    - Step 1: Put training code in one function
    - Step 2: Wrap your model and dataset
    - Step 3: Create your Trainer and run!
  14. Ray Train: easy to scale

    - Scale to multiple machines: use Ray Cluster Launcher
    - Just need a one-line change in the code

    my_script.py (on my laptop):

        ...
        num_workers = 100
        trainer = Trainer(backend="torch", num_workers=num_workers)

    Terminal:

        $ ray up my_cluster.yaml
        $ ray submit my_cluster.yaml my_script.py
  15. Ray Train: use GPUs!

    - Train with multiple GPUs: also a one-line change

    my_script.py (on my laptop):

        ...
        num_workers = 100
        trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=True)

    Terminal:

        $ ray up my_cluster.yaml
        $ ray submit my_cluster.yaml my_script.py
  16. Ray Train: run from your laptop or on a cluster

    Multiple ways to deploy:

    1. Use Ray Client for interactive runs directly on your laptop
    2. Use Ray Jobs for production-ready runs

        import ray

        ray.init("ray://<head_node_host>:<port>")
        # or use a managed service, like Anyscale
        # ray.init("anyscale://<cluster_name>")
        ...
        trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=True)
  17. What does “production-ready” mean?

    In a word: future-proof. More concretely:

    - Cost effective
    - Stability & reliability
    - Plays nicely with other “production-ready” tools

    It means you’re never scared you’ll have to scrap all your work once you get to “scale” or data size increases.
  18. Future-proofing #1: Cost effectiveness

    How can we achieve this?

    1. Maximizing performance, minimizing compute usage
    2. Cloud agnostic (avoid lock-in)

    - Ray Train supports spot instances (CPU and GPU)
    - Ray Train works on any cloud

    Ray runs on: AWS, GCP, Azure, K8s, on-prem...

    So as your usage grows, you can be confident you can migrate to wherever cloud costs are lowest.
  19. Future-proofing #2: Trusted by the best

    - Ray Users: 13000+
    - Repositories Depend on Ray: 1600+
    - Open Source Contributors: 449+
  20. Future-proofing #3: Plays nicely with other production-ready tools

    How can we achieve this?

    1. Monitoring (Grafana, etc.)
    2. Metadata tracking
        - a. MLflow
        - b. TensorBoard
        - c. Weights & Biases (integration coming soon!)
  21. Why “batteries included”?

    Training is never done in a vacuum.

    1. Upstream: feature processing, data ingest
    2. Downstream: re-training, tuning, serving, monitoring

    The batteries included with Ray Train are:

    - Dask on Ray or Spark on Ray for feature engineering or ETLs
    - Ray Datasets for parallel training data ingest
    - Ray Tune for hyperparameter tuning
    - Ray Serve for model serving and composition
    - Integrations with MLflow, W&B, TensorBoard

    ... and anything in the Python ecosystem! Ray Train can parallelize anything in Python.
  22. Distributed Data Loading with Ray Datasets

    1. Sharded datasets
    2. Windowed datasets: train on data larger than RAM
    3. Pipelined execution: keep your GPUs fully saturated
    4. Global shuffling: improve performance over no shuffling or local shuffling
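To make the terms on this slide concrete, here is a minimal pure-Python sketch of sharding, windowing, and the difference between local and global shuffling. It does not use the Ray Datasets API; every function name below is hypothetical and chosen only for illustration, and pipelined execution is not modeled.

```python
import random
from typing import Iterator, List, Sequence


def shard(items: Sequence[int], num_workers: int) -> List[List[int]]:
    """Split items into num_workers contiguous, near-equal shards."""
    base, extra = divmod(len(items), num_workers)
    shards, start = [], 0
    for rank in range(num_workers):
        size = base + (1 if rank < extra else 0)
        shards.append(list(items[start:start + size]))
        start += size
    return shards


def local_shuffle(shards: List[List[int]], seed: int) -> List[List[int]]:
    """Shuffle within each shard; records never move between workers."""
    rng = random.Random(seed)
    shuffled = []
    for worker_shard in shards:
        copy = list(worker_shard)
        rng.shuffle(copy)
        shuffled.append(copy)
    return shuffled


def global_shuffle(items: Sequence[int], num_workers: int, seed: int) -> List[List[int]]:
    """Shuffle the whole dataset first, then shard; any record can reach any worker."""
    rng = random.Random(seed)
    copy = list(items)
    rng.shuffle(copy)
    return shard(copy, num_workers)


def windows(items: Sequence[int], window_size: int) -> Iterator[List[int]]:
    """Yield fixed-size windows so only one window must be held in memory at a time."""
    for start in range(0, len(items), window_size):
        yield list(items[start:start + window_size])


records = list(range(12))
per_worker = shard(records, num_workers=3)
print("sharded:       ", per_worker)
print("local shuffle: ", local_shuffle(per_worker, seed=0))
print("global shuffle:", global_shuffle(records, num_workers=3, seed=0))
print("windows of 5:  ", list(windows(records, window_size=5)))
```

With local shuffling each worker only ever sees its original shard in a different order, whereas global shuffling mixes records across workers, which is the property the slide credits with better model performance.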
  23. Hyperparameter Optimization with Ray Tune

    Perform distributed hyperparameter tuning / training in just a few lines of code:

        from ray import tune

        trainable = trainer.to_tune_trainable(train_func)
        analysis = tune.run(trainable, config=...)

    1. Minimal code changes
    2. Automatic resource management
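As a rough picture of what a tuner automates, the sketch below runs one training function once per hyperparameter configuration and keeps the best result. It is a sequential, standard-library grid search and not Ray Tune itself: the toy `train_func`, the `grid_search` helper, and the search space are all assumptions made for illustration, and the distribution and resource management that the slide attributes to Tune are left out.

```python
import itertools
from typing import Dict, List


def train_func(config: Dict[str, float]) -> Dict[str, float]:
    """Toy stand-in for a training loop: gradient descent on f(w) = (w - 3)^2."""
    w = 0.0
    for _ in range(int(config["num_epochs"])):
        grad = 2.0 * (w - 3.0)
        w -= config["lr"] * grad
    return {"loss": (w - 3.0) ** 2}


def grid_search(search_space: Dict[str, List[float]]) -> List[dict]:
    """Run train_func once for every combination of values in the search space."""
    keys = sorted(search_space)
    results = []
    for values in itertools.product(*(search_space[k] for k in keys)):
        config = dict(zip(keys, values))
        results.append({"config": config, **train_func(config)})
    return results


results = grid_search({"lr": [0.01, 0.1, 0.5], "num_epochs": [5, 20]})
best = min(results, key=lambda r: r["loss"])
print("trials run:", len(results))
print("best config:", best["config"], "loss:", best["loss"])
```

Each trial here is independent of the others, which is what makes hyperparameter search a natural fit for running in parallel across a cluster.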
  24. Review so far

    - I. Problems with model training
    - II. Why Ray Train?
    - III. Ray Train: development goals (A. Developer velocity; B. Production-ready; C. Batteries included)
    - IV. Roadmap for H1 2022
    - V. CIFAR-10 deep learning demo
  25. Ray Train H1 2022 Roadmap

    Q1, 2022:

    - Elastic training, better metrics/results handling, model parallelism
    - DeepSpeed, fp16 support
    - Integrations: W&B, PyTorch profiler
    - Unified ML API alpha

    Q2, 2022:

    - Better checkpointing, stability
    - Advanced operations: GNNs, parameter serving, benchmarking
    - Unified ML API beta
  26. Demo: Training a deep learning classification model with PyTorch

    Training a PyTorch image model at scale