
Large Scale Distributed Training with TorchX and Ray

Large-scale model training has generally been out of reach for people in open source because it requires engineers to learn how to set up infrastructure, build composable software systems, and write robust machine learning scripts.

To that end, we’ve built the TorchX Ray scheduler, which leverages the newly created Ray Job API so that scientists can focus on writing their scripts while infrastructure and systems setup stays relatively easy.

1. Setting up a multi-GPU cluster on any cloud provider is as easy as running ray up against a cluster.yaml
2. TorchX embraces a component-based approach to designing systems, which makes your ops workflows composable
3. Running a distributed PyTorch script is then as simple as calling torchx run (see the sketch below this list)
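
To make the first and third points concrete, here is a minimal sketch assuming a cluster config named cluster.yaml and a training script named train.py; dist.ddp is a TorchX builtin component, and exact flags may vary by TorchX version and setup (you may also need scheduler -cfg options to point at your cluster).

    # launch (or update) the Ray cluster described in cluster.yaml
    ray up -y cluster.yaml

    # submit a distributed PyTorch job to that cluster via the TorchX Ray scheduler
    # (-j 2x2 requests 2 nodes with 2 workers each)
    torchx run -s ray dist.ddp -j 2x2 --script train.py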

In this session, we’ll walk through a practical live demo of training multi-GPU models, set up the infrastructure live, and share some tips and best practices for productionizing such workflows.

Anyscale

March 07, 2022

Transcript

  1. About me
     • Maintain pytorch/serve
     • Contribute to pytorch/torchx, pytorch/pytorch
     • Charter is open source production story
     • twitter.com/marksaroufim
  2. Key problems
     1. Setting up an infrastructure
     2. Submitting jobs against that infrastructure
     3. Getting logs and job status
     4. Deploying an end-to-end system on that infrastructure
     Data scientist != Infra engineer
     (see the job submission and log CLI sketch after the transcript)
  3. Acknowledgements
     Meta
     • Can Balioglu for developing the original interface for the Ray scheduler
     • Tristan Rice for unblocking CI issues
     • Aliaksandr Ivanou for rigorous code reviews and fixing the last few bugs
     • Geeta Chauhan, Kiuk Chung & Diamond Bishop for leadership guidance
     Anyscale
     • Amog Kamsetty for many Ray coaching sessions
     • Jiao Dong for building a great Ray Job API
     • Jules Damji for writing most of the blog post
     • Richard Liaw for leadership guidance
  4. Setting up the infrastructure
     aws configure
     ray up -y ray_cluster.yaml
     A typical YAML file determines:
     • Cloud provider
     • Machine types
     • Docker images
     (an example ray_cluster.yaml sketch follows the transcript)
  5. What’s a component?
     • Python application definition
       ◦ Entrypoint
       ◦ Resources associated with a job
       ◦ Environment variables
       ◦ Docker image
     • What about a trainer.py?
     (a component definition sketch follows the transcript)
  6. TorchX & Ray
     Run TorchX components on Ray
     • Model serving (TorchServe)
     • Elastic job launcher script
     • Hyperparameter optimization
     • Setting env variables
     • Metric logging
     • Configuration management
     • PyTorch Lightning training loop with callbacks
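
For the “Key problems” slide: once a job has been submitted with torchx run, the TorchX CLI can also report job status and fetch logs. A minimal sketch, assuming the app handle printed by torchx run; exact handle formats vary by scheduler and TorchX version.

    # torchx run prints an app handle for the submitted job;
    # status and logs are then queried with that handle
    torchx status <app_handle>
    torchx log <app_handle>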
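
For the “Setting up the infrastructure” slide: a minimal ray_cluster.yaml sketch in the Ray cluster launcher format for AWS. The region, instance types, and Docker image below are illustrative placeholders, not values from the talk.

    cluster_name: torchx-demo            # name of the Ray cluster
    max_workers: 2

    provider:                            # cloud provider
      type: aws
      region: us-west-2                  # placeholder region

    docker:                              # Docker image every node runs in
      image: rayproject/ray-ml:latest-gpu
      container_name: ray_container

    available_node_types:                # machine types
      head_node:
        node_config:
          InstanceType: g4dn.xlarge      # placeholder head machine
      gpu_worker:
        min_workers: 1
        node_config:
          InstanceType: g4dn.12xlarge    # placeholder GPU worker machine

    head_node_type: head_node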
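
For the “What’s a component?” slide: a rough sketch of a TorchX component in Python, i.e. a function returning a torchx.specs.AppDef that ties together the entrypoint, Docker image, resources, and environment variables for a job. The names trainer_component, my_trainer, and the image tag are illustrative placeholders.

    # a sketch of a TorchX component wrapping the trainer.py from the slide
    import torchx.specs as specs

    def trainer_component(script: str = "trainer.py",
                          image: str = "my_registry/trainer:latest") -> specs.AppDef:
        return specs.AppDef(
            name="my_trainer",
            roles=[
                specs.Role(
                    name="trainer",
                    image=image,                # Docker image
                    entrypoint="python",        # entrypoint
                    args=[script],
                    env={"LOGLEVEL": "INFO"},   # environment variables
                    resource=specs.Resource(cpu=4, gpu=1, memMB=16000),  # resources for the job
                )
            ],
        )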