Ray Train: Production-ready distributed deep learning

Anyscale
February 10, 2022

Today, most frameworks for prototyping, training, and distributing deep learning workloads across a cluster are either powerful but inflexible, or nimble but toy-like. Data scientists are forced to choose between a great developer experience and a production-ready framework.

To close this gap, the Ray ML team has developed Ray Train.

Ray Train is a library built on top of the Ray ecosystem that simplifies distributed deep learning. Currently in stable beta in Ray 1.9, Ray Train offers the following features:

- Scales to multi-GPU and multi-node training with zero code changes
- Runs seamlessly on any cloud (AWS, GCP, Azure, Kubernetes, or on-prem)
- Supports PyTorch, TensorFlow, and Horovod
- Distributes data shuffling and loading with Ray Datasets
- Distributes hyperparameter tuning with Ray Tune
- Provides built-in loggers for TensorBoard and MLflow

In this webinar, we'll talk through some of the challenges in large-scale computer vision ML training, and show a demo of Ray Train in action.
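
As a taste of the API, here is a minimal sketch of a Ray Train script using the Ray 1.9-era PyTorch integration (the same pattern is walked through later in the deck). The tiny linear model, random dataset, and hyperparameters are placeholders for illustration only.

    import torch
    from ray import train
    import ray.train.torch  # makes the train.torch utilities available
    from ray.train import Trainer

    def train_func():
        # Ordinary single-process PyTorch setup: placeholder model and data.
        model = torch.nn.Linear(10, 1)
        dataset = torch.utils.data.TensorDataset(torch.randn(128, 10), torch.randn(128, 1))
        dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

        # Ray Train wraps the model (DDP) and data loader (distributed sampler).
        model = train.torch.prepare_model(model)
        dataloader = train.torch.prepare_data_loader(dataloader)

        loss_fn = torch.nn.MSELoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        for _ in range(2):  # training loop
            for X, y in dataloader:
                optimizer.zero_grad()
                loss = loss_fn(model(X), y)
                loss.backward()
                optimizer.step()

    # Scaling out is a matter of num_workers (and use_gpu=True for GPUs).
    trainer = Trainer(backend="torch", num_workers=2)
    trainer.start()
    trainer.run(train_func)
    trainer.shutdown()

Running the same script against a multi-node cluster only changes num_workers, as the slides below show.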

Transcript

  1. Ray Train: Distributed Deep Learning
    Amog Kamsetty
    Matthew Deng
    Will Drevo

  2. Overview
    I. Problems with model training
    II. Why Ray Train?
    III. Ray Train: development goals
      A. Developer velocity
      B. Production-ready
      C. Batteries included
    IV. Roadmap for H1 2022
    V. CIFAR-10 deep learning demo

  3. Problems in model training
    The two reasons ML training is hard

  4. Data science is iterative
    What people think data science projects should look like
    [Chart: progress toward the goal (% to goal), ending at "Value!"]

  5. Data science is iterative
    What people think data science projects should look like vs. what they actually look like
    [Chart: the actual path loops back on itself repeatedly before reaching "Value!"]

  6. Today, iteration is expensive
    Let’s break it down:
    (1) Compute time (and therefore also cloud $$$)
    We try to fix this by distributing, but in doing so we often introduce...
    (2) Developer cognitive overhead
    Which reduces development & experimentation velocity

  7. Goal: reduce the length of the line (time to value)
    Size of loop ≈ how long training takes
    Distance between loops ≈ developer velocity

  8. Problem #1: Compute time ...

  9. It gets worse…
    Models are increasing in compute demand (2018 study by OpenAI):
    ● Compute requirements have doubled every 3.4 months since 2012 (roughly 35x every 18 months)
    ● 300,000x increase in compute from AlexNet to AlphaGo Zero

  10. Problem #2: Developer cognitive overhead
    ● Configuring GPUs
    ● Converting code
    ● Starting a job
    ● Spinning up machines

  11. What is Ray Train?
    A library built on Ray that simplifies distributed deep learning training
    • Scale to multi-GPU and multi-node training with 0 code changes
    • Runs seamlessly on any cloud (AWS, GCP, Azure, Kubernetes, or on-prem)
    • Supports PyTorch, TensorFlow, and Horovod
    • Distributed data loading (Datasets) and hyperparameter tuning (Tune)
    • Built-in loggers for TensorBoard and MLflow

  12. What is Ray?
    [Diagram: the Ray ecosystem, with Ray Train as one of the libraries built on Ray]

  13. Growth of Ray open source
    ● 13,000+ Ray users
    ● 1,600+ repositories depend on Ray
    ● 449+ open-source contributors
    ● 18.7K GitHub stars as of 1/1/22

  14. Where does Ray Train fit in?
    Motivations & context from the deep learning frameworks space

  15. Deep learning framework tradeoffs
    [Chart: frameworks plotted by production-readiness vs. ease of development.
    Heavyweight frameworks are inflexible and hard to customize; lightweight ones are
    nimble but unscalable. Ray Train aims for both production-readiness and ease of development.]

  16. Comparing scalable solutions
    | Name | Open source | Torch | TF | Horovod | AWS | GCP | K8s | On-prem | Scalable | Local → cluster | Tuning | Pythonic / no containerization required ⭐ |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Ray Train | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
    | Kubeflow (requires Kubernetes) | ✅ | ✅ | ✅ | ✅ | ❌ * | ❌ * | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
    | AWS SageMaker | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |
    | Google AI Platform | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |
    | Azure ML | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ |
    * Kubeflow does run on AWS/GCP, but requires K8s so it isn’t as flexible (can’t run on bare VMs)

  17. Goals of Ray Train
    1) Developer velocity
    2) Production-ready
    3) Batteries included

  18. GOAL 1: Developer velocity
    Quickly go from ideas to scaling horizontally

  19. Ray Train startup time is very fast

  20. Coding is fun, infra ... not always
    Every data scientist or MLE wants to do modeling, but infrastructure and operations work is an impediment to showing value from ML!

  21. K8s-based solutions often have high overhead

  22. Trying to understand all the k8s network policies in a cluster

  23. Even non-k8s-based hosted platforms require containerization
    Docker images aren’t fun to build and constantly revise

  24. Ray Train: easy as 1, 2, 3
    Scale up in your code, not in your infrastructure

    from ray import train
    from ray.train import Trainer

    # Step 1: Put your training code in one function
    def train_func():
        …  # existing model and data loader setup

        # Step 2: Wrap your model and dataset
        model = train.torch.prepare_model(model)
        dataloader = train.torch.prepare_data_loader(dataloader)

        for _ in range(num_epochs):
            …  # training loop

    # Step 3: Create your Trainer and run!
    trainer = Trainer(backend="torch", num_workers=4)
    trainer.start()
    results = trainer.run(train_func)
    trainer.shutdown()

  25. Ray Train: easy to scale
    ● Scale to multiple machines – use the Ray Cluster Launcher
    ● Just a one-line change in the code

    my_script.py (on my laptop):
    ...
    num_workers = 100
    trainer = Trainer(backend="torch", num_workers=num_workers)

    $ ray up my_cluster.yaml
    $ ray submit my_cluster.yaml my_script.py

  26. Ray Train: use GPUs!
    ● Train with multiple GPUs – also a one-line change

    my_script.py (on my laptop):
    ...
    num_workers = 100
    trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=True)

    $ ray up my_cluster.yaml
    $ ray submit my_cluster.yaml my_script.py

  27. Ray Train: run from your laptop or on a cluster
    Multiple ways to deploy:
    1. Use Ray Client for interactive runs directly from your laptop
    2. Use Ray Jobs for production-ready runs

    # Connect to a remote cluster with Ray Client
    ray.init("ray://<head_node_host>:<port>")
    # or use a managed service, like Anyscale
    # ray.init("anyscale://")
    ...
    trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=True)

  28. GOAL 2: Production-ready
    Future-proof your ML training system

  29. What does “production-ready” mean?
    In a word: Future-proof.
    More concretely:
    ● Cost effective
    ● Stability & reliability
    ● Plays nicely with other “production-ready” tools
    It means you’re never scared that you’ll have to scrap all your work once you reach “scale” or your data size increases.

  30. Future-proofing #1: Cost effectiveness
    How can we achieve this?
    1. Maximizing performance, minimizing compute usage
    2. Cloud agnostic (avoid lock-in)
    Ray Train supports spot instances (both CPU and GPU)
    Ray Train works on any cloud Ray runs on: AWS, GCP, Azure, K8s, on-prem...
    So as your usage grows, you can be confident you can migrate to wherever cloud costs are lowest.

  31. Future-proofing #2: Trusted by the best
    ● 13,000+ Ray users
    ● 1,600+ repositories depend on Ray
    ● 449+ open-source contributors

  32. Future-proofing #3: Plays nicely with other production-ready tools
    How can we achieve this?
    1. Monitoring (Grafana, etc.)
    2. Metadata tracking
       a. MLflow
       b. TensorBoard
       c. Weights & Biases (integration coming soon!)
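
    For metadata tracking, the Ray 1.9-era API exposes logger callbacks that can be passed to trainer.run(). Below is a minimal sketch with the TensorBoard callback; the MLflow and W&B callbacks follow the same pattern, though exact names and availability depend on the Ray version, and the metric values here are placeholders.

    from ray import train
    from ray.train import Trainer
    from ray.train.callbacks import TBXLoggerCallback  # writes TensorBoard event files

    def train_func():
        for epoch in range(3):
            loss = 1.0 / (epoch + 1)  # placeholder metric
            train.report(epoch=epoch, loss=loss)  # reported metrics flow to the callbacks

    trainer = Trainer(backend="torch", num_workers=2)
    trainer.start()
    # Every train.report() call is logged as a TensorBoard event for later inspection.
    trainer.run(train_func, callbacks=[TBXLoggerCallback()])
    trainer.shutdown()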

  33. GOAL 3: Batteries included
    Compatible with the entire Python and Ray ecosystem

  34. Why “batteries included”?
    Training is never done in a vacuum.
    1. Upstream: feature processing, data ingest
    2. Downstream: re-training, tuning, serving, monitoring
    Ray Train’s batteries included are:
    ● Dask on Ray or Spark on Ray for feature engineering or ETLs
    ● Ray Datasets for parallel training data ingest
    ● Ray Tune for hyperparameter tuning
    ● Ray Serve for model serving and composition
    ● Integrations with MLflow, W&B, Tensorboard
    ... and anything in the Python ecosystem! Ray Train can parallelize anything in Python.

  35. Distributed Data Loading with Ray Datasets
    1. Sharded datasets
    2. Windowed datasets: train on data larger than RAM
    3. Pipelined Execution: keep your GPUs fully saturated
    4. Global shuffling: improve performance over no shuffling or local shuffling
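
    An illustrative sketch of how the Datasets integration looks with the Ray 1.9-era API: the Trainer shards a Ray Dataset across workers, and each worker pulls its shard inside the training function. The S3 path and batch size are placeholders.

    import ray
    from ray import train
    from ray.train import Trainer

    # Placeholder source; any Ray Datasets reader (Parquet, CSV, ...) works here.
    dataset = ray.data.read_parquet("s3://my-bucket/training-data/")

    def train_func():
        shard = train.get_dataset_shard()  # this worker's portion of the dataset
        for _ in range(2):  # epochs
            for batch in shard.iter_batches(batch_size=1024):
                ...  # forward/backward pass on the batch

    trainer = Trainer(backend="torch", num_workers=4)
    trainer.start()
    trainer.run(train_func, dataset=dataset)  # the dataset is sharded across the 4 workers
    trainer.shutdown()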

  36. Hyperparameter Optimization with Ray Tune
    Perform distributed hyperparameter tuning / training in just a few lines of code:
    trainable = trainer.to_tune_trainable(train_func)
    analysis = tune.run(trainable, config=...)
    1. Minimal code changes
    2. Automatic resource management
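
    A slightly fuller sketch of the same pattern, assuming the Ray 1.9-era to_tune_trainable API; the search space, metric name, and num_samples are illustrative only.

    from ray import tune
    from ray.train import Trainer

    def train_func(config):
        lr = config["lr"]  # hyperparameter supplied by Tune for this trial
        ...  # training loop that calls train.report(loss=...)

    trainer = Trainer(backend="torch", num_workers=4)
    trainable = trainer.to_tune_trainable(train_func)

    # Each Tune trial launches a full distributed training run with its own config.
    analysis = tune.run(
        trainable,
        config={"lr": tune.loguniform(1e-4, 1e-1)},
        num_samples=4,
    )
    print(analysis.get_best_config(metric="loss", mode="min"))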

  37. Review so far
    I. Problems with model training
    II. Why Ray Train?
    III. Ray Train: development goals
      A. Developer velocity
      B. Production-ready
      C. Batteries included
    IV. Roadmap for H1 2022
    V. CIFAR-10 deep learning demo

  38. Ray Train: Roadmap
    What to look forward to this year

  39. Ray Train H1 2022 Roadmap
    Q1, 2022:
    ● Elastic training, better metrics/results handling, model parallelism
    ● DeepSpeed, fp16 support
    ● Integrations: W&B, PyTorch profiler
    ● Unified ML API alpha
    Q2, 2022:
    ● Better checkpointing, stability
    ● Advanced operations: GNNs, parameter serving, benchmarking
    ● Unified ML API beta

  40. Demo: Training a deep learning classification model with PyTorch
    Training a PyTorch image model at scale
