
Large Scale Distributed Training with TorchX and Ray

Large-scale model training has generally been out of reach for people in open source because it requires engineers to learn how to set up infrastructure, build composable software systems, and write robust machine learning scripts.

To that end, we’ve built the TorchX Ray scheduler, which leverages the newly created Ray Job API so that scientists can focus on writing their scripts while infrastructure and systems setup stays relatively easy.

1. Setting up a multi-GPU cluster on any cloud provider is as easy as running ray up against a cluster.yaml
2. TorchX embraces a component-based approach to designing systems, which makes your ops workflows composable
3. Running a distributed PyTorch script is then as simple as calling torchx run (see the sketch below this list)
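
To make the first and third points concrete, here is a minimal sketch assuming a cluster config named cluster.yaml and a training script named train.py; dist.ddp is a TorchX builtin component, and exact flags may vary by TorchX version and setup (you may also need scheduler -cfg options to point at your cluster).

    # launch (or update) the Ray cluster described in cluster.yaml
    ray up -y cluster.yaml

    # submit a distributed PyTorch job to that cluster via the TorchX Ray scheduler
    # (-j 2x2 requests 2 nodes with 2 workers each)
    torchx run -s ray dist.ddp -j 2x2 --script train.py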

In this session, we’ll walk through a practical live demo of training multi-GPU models, set up the infrastructure live, and share some tips and best practices for productionizing such workflows.

Anyscale

March 07, 2022

Transcript

  1. About me
     • Maintain pytorch/serve
     • Contribute to pytorch/torchx, pytorch/pytorch
     • Charter is open source production story
     • twitter.com/marksaroufim
  2. Key problems
     1. Setting up an infrastructure
     2. Submitting jobs against that infrastructure
     3. Getting logs and job status
     4. Deploying an end-to-end system on that infrastructure
     Data scientist != Infra engineer
     (see the job submission and log CLI sketch after the transcript)
  3. Acknowledgements
     Meta
     • Can Balioglu for developing the original interface for the Ray scheduler
     • Tristan Rice for unblocking CI issues
     • Aliaksandr Ivanou for rigorous code reviews and fixing the last few bugs
     • Geeta Chauhan, Kiuk Chung & Diamond Bishop for leadership guidance
     Anyscale
     • Amog Kamsetty for many Ray coaching sessions
     • Jiao Dong for building a great Ray Job API
     • Jules Damji for writing most of the blog post
     • Richard Liaw for leadership guidance
  4. Setting up the infrastructure
     aws configure
     ray up -y ray_cluster.yaml
     A typical YAML file determines:
     • Cloud provider
     • Machine types
     • Docker images
     (an example ray_cluster.yaml sketch follows the transcript)
  5. What’s a component?
     • Python application definition
       ◦ Entrypoint
       ◦ Resources associated with a job
       ◦ Environment variables
       ◦ Docker image
     • What about a trainer.py?
     (a component definition sketch follows the transcript)
  6. TorchX & Ray
     Run TorchX components on Ray
     • Model serving (TorchServe)
     • Elastic job launcher script
     • Hyperparameter optimization
     • Setting env variables
     • Metric logging
     • Configuration management
     • PyTorch Lightning training loop with callbacks
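
For the “Key problems” slide: once a job has been submitted with torchx run, the TorchX CLI can also report job status and fetch logs. A minimal sketch, assuming the app handle printed by torchx run; exact handle formats vary by scheduler and TorchX version.

    # torchx run prints an app handle for the submitted job;
    # status and logs are then queried with that handle
    torchx status <app_handle>
    torchx log <app_handle>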
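
For the “Setting up the infrastructure” slide: a minimal ray_cluster.yaml sketch in the Ray cluster launcher format for AWS. The region, instance types, and Docker image below are illustrative placeholders, not values from the talk.

    cluster_name: torchx-demo            # name of the Ray cluster
    max_workers: 2

    provider:                            # cloud provider
      type: aws
      region: us-west-2                  # placeholder region

    docker:                              # Docker image every node runs in
      image: rayproject/ray-ml:latest-gpu
      container_name: ray_container

    available_node_types:                # machine types
      head_node:
        node_config:
          InstanceType: g4dn.xlarge      # placeholder head machine
      gpu_worker:
        min_workers: 1
        node_config:
          InstanceType: g4dn.12xlarge    # placeholder GPU worker machine

    head_node_type: head_node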
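
For the “What’s a component?” slide: a rough sketch of a TorchX component in Python, i.e. a function returning a torchx.specs.AppDef that ties together the entrypoint, Docker image, resources, and environment variables for a job. The names trainer_component, my_trainer, and the image tag are illustrative placeholders.

    # a sketch of a TorchX component wrapping the trainer.py from the slide
    import torchx.specs as specs

    def trainer_component(script: str = "trainer.py",
                          image: str = "my_registry/trainer:latest") -> specs.AppDef:
        return specs.AppDef(
            name="my_trainer",
            roles=[
                specs.Role(
                    name="trainer",
                    image=image,                # Docker image
                    entrypoint="python",        # entrypoint
                    args=[script],
                    env={"LOGLEVEL": "INFO"},   # environment variables
                    resource=specs.Resource(cpu=4, gpu=1, memMB=16000),  # resources for the job
                )
            ],
        )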