
Anyscale
March 03, 2022

[Ray Meetup] Ray Train, PyTorch, TorchX, and distributed deep learning

Welcome to our second Ray meetup, where we focus on Ray’s native libraries for scaling machine learning workloads.

We'll discuss Ray Train, a production-ready distributed training library for deep learning workloads. We'll also present the TorchX and Ray integration: through it, PyTorch developers can submit PyTorch-based scripts and workloads to a Ray cluster using TorchX's SDK and CLI via its new Ray scheduler.



Transcript

  1. Anyscale Overview Welcome to the Bay Area Ray Meetup! March 2, 2022. Ray Train, TorchX, and Distributed Deep Learning with Ray
  2. Agenda: Virtual Meetup. Welcome Remarks, Introduction, Announcements: Jules S. Damji, Anyscale. Talk 1: Ray Train: Production-ready Distributed Deep Learning, Will Drevo, Amog Kamsetty, & Matthew Deng, Anyscale Inc. Talk 2: Large Scale Distributed Training with TorchX and Ray, Mark Saroufim, Meta AI & PyTorch Engineering.
  3. Production RL Summit, MARCH 29 - VIRTUAL - FREE. A reinforcement learning event for practitioners. Speakers: Ben Kasper, Sumitra Ganesh, Sergey Levine, Marc Weber, Volkmar Sterzing, Adam Kelloway. Register: https://tinyurl.com/mr9rd32h
  4. HANDS-ON TUTORIAL: Contextual Bandits & RL with RLlib. Instructor: Sven Mika, Lead maintainer, RLlib. Learn how to apply cutting-edge RL in production with RLlib. Tutorial covers: • Brief overview of RL concepts • Train and tune contextual bandits and the SlateQ algorithm • Offline RL using cutting-edge algos • Deploy RL models into a live service. $75 $50 (use code MEETUP50). Register: https://tinyurl.com/mr9rd32h. Part of the Production RL Summit, MARCH 29 - VIRTUAL, a reinforcement learning event for practitioners.
  5. Anyscale Overview Ray Train: A high-level library for deep learning training. Amog Kamsetty, Matthew Deng, Will Drevo
  6. Anyscale Overview Overview: I. Problems in DL training; II. Ray Train (A. Simple scaling, B. Flexible, C. High-level API, low-level optimizations); III. Roadmap for H1 2022; IV. CIFAR-10 deep learning demo
  7. Ray Train: https://tinyurl.com/ray-train; reach out at [email protected] Problems seen today in Deep Learning: 1. Training takes too long 2. Too much data to fit in one node 3. Large models that do not fit on one device
  8. Ray Train: https://tinyurl.com/ray-train; reach out at [email protected] OK, so we go distributed. Now what? We see a bunch of new problems: • Managing a new infra stack • Rewriting all your training code • Dealing with increased cost • Setting up new optimizations • Tuning hyperparameters. Also, it is horribly painful from a developer's perspective! (diagram: Machine 1 through Machine 4)
  9. How do we fix the problems? My ideal solution is a tool that… • is easy to onboard onto. • abstracts away the infrastructure. • gives me extremely fast iteration speed. • allows me to use affordable GPUs from any cloud provider. • integrates well in an end-to-end machine learning pipeline.
  10. Anyscale Overview Deep learning framework tradeoffs (chart plotting production-readiness against ease of development): Heavyweight: inflexible, hard to customize. Lightweight: nimble, but unscalable. Ray Train.
  11. Ray Train: https://tinyurl.com/ray-train; reach out at [email protected] What is Ray Train? (diagram: Ray Train, Distributed Training, Deep Learning Frameworks, Compute, Data Processing, Model Tuning / Serving)
  12. DL: division of labor. DL frameworks like PyTorch, Horovod, and TensorFlow do a great job at: • NN modules, components, & patterns • Writing training loops • Gradient communication protocols • Having a great developer community. Ray's strengths are: • Managing compute • Anticipating and scheduling around data locality and constraints • A seamless way to distribute Python • Distributed systems. Ray Train is the union of these different competencies!
  13. Ray Train: easy as 1, 2, 3. Scale up in your code, not in your infrastructure. Step 1: Put training code in one function. Step 2: Wrap your model and dataset. Step 3: Create your Trainer and run! Slide code (reconstructed as a runnable sketch below): from ray import train; def train_func(): … # existing model and data loader setup; model = train.torch.prepare_model(model); dataloader = train.torch.prepare_data_loader(dataloader); for _ in range(num_epochs): … # training loop; trainer = Trainer(backend="torch", num_workers=4); trainer.start(); results = trainer.run(train_func); trainer.shutdown()
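A runnable reconstruction of the slide's snippet, using the legacy Ray Train Trainer API (circa Ray 1.x, as shown on the slide). The toy model, random data, and hyperparameters are placeholders added for illustration; only the structure comes from the slide.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    from ray import train
    from ray.train import Trainer
    import ray.train.torch  # makes the train.torch utilities available

    def train_func():
        # Step 1: existing single-process model and data loader setup
        # (placeholder toy model and random data, not from the slide)
        model = nn.Linear(10, 1)
        dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
        dataloader = DataLoader(dataset, batch_size=64)

        # Step 2: wrap the model and data loader for distributed training
        model = train.torch.prepare_model(model)
        dataloader = train.torch.prepare_data_loader(dataloader)

        loss_fn = nn.MSELoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
        for _ in range(3):  # training loop
            for X, y in dataloader:
                loss = loss_fn(model(X), y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

    # Step 3: create the Trainer and run the function on 4 workers
    trainer = Trainer(backend="torch", num_workers=4)
    trainer.start()
    results = trainer.run(train_func)
    trainer.shutdown()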
  14. Ray Train: https://tinyurl.com/ray-train; reach out at [email protected] Move between laptop & cluster, CPU & GPU, (un)distributed. With Ray, moving from local to cluster is as easy as: $ ray up cluster-config.yaml; $ ray job submit cluster-config.yaml -- python my_script.py. And scaling up your workload is even easier: multi-node and multi-GPU by changing trainer = Trainer(backend="torch", num_workers=1) to trainer = Trainer(backend="torch", num_workers=100, use_gpu=True).
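The scaling knob in the slide is just the Trainer configuration; a minimal sketch using the same legacy Trainer API as above, with train_func left unchanged:

    from ray.train import Trainer

    # laptop: a single worker, CPU only
    trainer = Trainer(backend="torch", num_workers=1)

    # cluster: 100 GPU workers; the training function itself does not change
    trainer = Trainer(backend="torch", num_workers=100, use_gpu=True)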
  15. Train iteratively, or in production. Method 1: Ray Client (docs): good for interactive runs; can use ipdb or the Ray debugger. ray.init("ray://<head_node_host>:<port>") # or use a managed service, like Anyscale: ray.init("anyscale://<cluster_name>"). Method 2: Ray Jobs (docs): good for longer-running jobs (i.e., "close the laptop") or production jobs; submit script.py to the Ray cluster with $ ray up my_cluster.yaml and $ ray job submit -- "python script.py"
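A minimal sketch of Method 1 (Ray Client) as written on the slide: the driver script runs locally while work executes on the remote cluster. The head-node host and cluster name are placeholders; 10001 is Ray Client's default port.

    import ray

    # Method 1: interactive development against a remote cluster via Ray Client
    ray.init("ray://<head_node_host>:10001")

    # or, on a managed service such as Anyscale:
    # ray.init("anyscale://<cluster_name>")

Method 2 (Ray Jobs) stays on the command line, as on the slide: bring the cluster up with ray up my_cluster.yaml, then submit with ray job submit -- "python script.py".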
  16. ML training: upstream and downstream (pipeline diagram): ETL, feature preprocessing ("SQL land"); "last mile" data ingest ("Dataloaders"); ML/DL training; hyperparameter tuning; model serving, A/B testing; monitoring. Covered on Ray by Ray Datasets, Ray Train, Ray Tune, and Ray Serve.
  17. The Ray Train Ecosystem! (stack diagram: user ML apps and 3rd-party training libraries on top of Ray Train and the other Ray distributed libraries, including Datasets and Workflows, which run on Ray Core over the compute cluster; "Your app/library here!")
  18. Distributed Data Loading with Ray Datasets: 1. Sharded datasets: easily split data across workers. 2. Windowed datasets: train on data larger than RAM. 3. Pipelined execution: keep your GPUs fully saturated. 4. Global shuffling: improve model accuracy. (See the sketch below.)
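A minimal sketch (not from the slide) of these four features, assuming the Ray Datasets API from the Ray 1.x era of this talk; the S3 path, shard count, window size, and batch size are placeholders.

    import ray

    ds = ray.data.read_parquet("s3://my-bucket/train-data/")  # placeholder path

    # 1. Sharded datasets: one equal shard per training worker
    shards = ds.split(4, equal=True)

    # 2 + 3. Windowed datasets with pipelined execution: stream data larger
    # than cluster RAM while keeping GPUs busy
    pipe = ds.window(blocks_per_window=20)

    # 4. Global shuffle of each window to improve model accuracy
    pipe = pipe.random_shuffle_each_window()

    for batch in pipe.iter_batches(batch_size=1024):
        pass  # hand batches to the training loop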
  19. Hyperparameter Optimization with Ray Tune: perform distributed hyperparameter tuning / training in 2 lines of code! trainable = trainer.to_tune_trainable(train_func); analysis = tune.run(trainable, config=...) (expanded in the sketch below)
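Expanding the slide's two lines into a hedged sketch: trainer and train_func are the objects defined earlier, the search space is purely illustrative, and train_func is assumed to report a "loss" metric (e.g., via train.report()).

    from ray import tune

    trainable = trainer.to_tune_trainable(train_func)

    analysis = tune.run(
        trainable,
        config={
            "lr": tune.loguniform(1e-4, 1e-1),         # illustrative search space
            "batch_size": tune.choice([32, 64, 128]),
        },
        num_samples=8,   # 8 trials, each itself a distributed training run
        metric="loss",
        mode="min",
    )
    print(analysis.best_config)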
  20. Ray Tune: code example. Choose from state-of-the-art searchers, and search over anything you can parameterize. Easily load your model from a checkpoint in cloud storage later! (See the sketch below.)
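A small, hedged follow-on to the tuning run above (not the slide's actual code): with metric and mode set, the best trial's checkpoint can be retrieved afterwards; if the run was configured to sync results to cloud storage, that checkpoint lives in the bucket.

    # best checkpoint of the best trial from the tune.run(...) call above;
    # how it is loaded depends on what train_func saved during training
    best_checkpoint = analysis.best_checkpoint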
  21. High-level API! Tensor communication is handled by the DL frameworks; the rest comes through Ray Train: multi-GPU, spot instances, DeepSpeed, fp16, zero-copy reads, ... future optimizations here.
  22. DL: division of labor. DL frameworks like PyTorch, Horovod, and TensorFlow do a great job at: • NN modules, components, & patterns • Writing training loops • Gradient communication protocols • Having a great developer community. Ray's strengths are: • Managing compute • Anticipating and scheduling around data locality and constraints • A seamless way to distribute Python • Distributed systems. Ray Train is the union of these different competencies!
  23. Ray Train H1 2022 Roadmap. Q1 2022: • Elastic training, better metrics/results handling, model parallelism • DeepSpeed, fp16 support • Integrations: W&B, PyTorch profiler • Unified ML API alpha. Q2 2022: • Better checkpointing • Advanced operations: GNNs, parameter servers, benchmarking • Unified ML API beta.
  24. Anyscale Overview Demo: Training a deep learning classification model with PyTorch. Training a PyTorch image model at scale.
  25. Start learning Ray and contributing… Getting started: pip install ray. Documentation (docs.ray.io): quick start example, reference guides, etc. Join the Ray Meetup: revived in Jan 2022; next meetup March 2nd; we meet each month and publish the recording for members. https://www.meetup.com/Bay-Area-Ray-Meetup/ Forums (discuss.ray.io): learn from and share with the broader Ray community, including the core team. Ray Slack: connect with the Ray team and community. Social media (@raydistributed, @anyscalecompute): follow us on Twitter and LinkedIn.