Slide 1

Slide 1 text

Large scale training with TorchX and Ray
Mark Saroufim

Slide 2

Slide 2 text

About me
● Maintain pytorch/serve
● Contribute to pytorch/torchx, pytorch/pytorch
● Charter is open source production story
● twitter.com/marksaroufim

Slide 3

Slide 3 text

Key problems
1. Setting up an infrastructure
2. Submitting jobs against that infrastructure
3. Getting logs and job status
4. Deploying an end to end system on that infrastructure
Data scientist != Infra engineer

Slide 4

Slide 4 text

Acknowledgements
Meta
● Can Balioglu for developing the original interface for the Ray scheduler
● Tristian Rice for unblocking CI issues
● Aliaksandr Ivanou for rigorous code reviews and fixing last few bugs
● Geeta Chauhan, Kiuk Chung & Diamond Bishop for leadership guidance
Anyscale
● Amog Kamsetty for many Ray coaching sessions
● Jiao Dong for building a great Ray Job API
● Jules Damji for writing most of the blog post
● Richard Liaw for leadership guidance

Slide 5

Slide 5 text

Setting up the infrastructure
aws configure
ray up -y ray_cluster.yaml
A typical YAML file specifies
● Cloud provider
● Machine types
● Docker images
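To make the three items concrete, here is a rough sketch that builds a small cluster config in Python and prints it as YAML. The top-level keys are meant to follow Ray's cluster launcher config format; the cluster name, region, instance types, image, and node type names are placeholder assumptions, not values from the talk, and should be checked against the Ray docs before use.

```python
# Placeholder values throughout: the cluster name, region, instance types,
# Docker image and node type names are assumptions for illustration only.
cluster_config = {
    "cluster_name": "torchx-demo",
    "max_workers": 2,
    # Cloud provider
    "provider": {"type": "aws", "region": "us-west-2"},
    # Docker image
    "docker": {"image": "rayproject/ray:latest", "container_name": "ray_container"},
    # Machine types
    "available_node_types": {
        "head_node": {
            "node_config": {"InstanceType": "m5.large"},
            "resources": {"CPU": 2},
        },
        "worker_node": {
            "min_workers": 0,
            "max_workers": 2,
            "node_config": {"InstanceType": "m5.large"},
            "resources": {"CPU": 2},
        },
    },
    "head_node_type": "head_node",
}


def to_yaml_lines(mapping, indent=0):
    """Render nested dicts of plain scalars as indented YAML lines."""
    pad = "  " * indent
    lines = []
    for key, value in mapping.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml_lines(value, indent + 1))
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


yaml_text = "\n".join(to_yaml_lines(cluster_config)) + "\n"
print(yaml_text)
```

The tiny emitter only handles nested dicts of simple scalars, which is enough for this sketch; a real config would normally be written by hand or with a YAML library.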

Slide 6

Slide 6 text

Follow along in Google Colab shorturl.at/wJS19

Slide 7

Slide 7 text

Submitting a job

Slide 8

Slide 8 text

TorchX

Slide 9

Slide 9 text

Install torchx
pip install torchx-nightly

Slide 10

Slide 10 text

Submitting a job
torchx run -s ray -cfg dashboard_address=$public_ip:20002,working_dir=. ./component.py:trainer
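If the command is launched from a script rather than a shell, it can help to assemble the `-cfg` value from a dict so the comma-separated list has no stray spaces. A minimal sketch follows; `build_torchx_run` is a hypothetical helper, and `127.0.0.1` stands in for the cluster's public IP.

```python
import shlex


def build_torchx_run(scheduler, cfg, component, component_args=()):
    """Return the argv list for a `torchx run` invocation (hypothetical helper)."""
    # -cfg takes key=value pairs joined by commas, with no spaces in between.
    cfg_str = ",".join(f"{key}={value}" for key, value in cfg.items())
    return ["torchx", "run", "-s", scheduler, "-cfg", cfg_str, component, *component_args]


cmd = build_torchx_run(
    "ray",
    {"dashboard_address": "127.0.0.1:20002", "working_dir": "."},  # placeholder address
    "./component.py:trainer",
)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would run it, assuming `torchx` is installed and on the PATH.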

Slide 11

Slide 11 text

What’s a component?
● Python Application Definition
  ○ Entrypoint
  ○ Resources associated with a job
  ○ Environment variables
  ○ Docker image
● What about a trainer.py?
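The shape of a component can be sketched with plain dataclasses. This is a stand-in, not the TorchX API: real components return a `torchx.specs.AppDef` built from `Role` and `Resource` objects, while the `Component` and `Resources` classes, the image name, and the environment variable below are hypothetical. The `trainer.py` script is what the component points at; the component itself only describes how to run it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resources:
    """Resources requested per replica (hypothetical stand-in)."""
    cpu: int = 1
    gpu: int = 0
    mem_mb: int = 1024


@dataclass
class Component:
    """Stand-in for an application definition: what to run, where, and with what."""
    name: str
    image: str                       # Docker image
    entrypoint: str                  # Entrypoint
    args: List[str] = field(default_factory=list)
    env: Dict[str, str] = field(default_factory=dict)   # Environment variables
    resources: Resources = field(default_factory=Resources)
    num_replicas: int = 1


def trainer(script: str = "trainer.py", num_replicas: int = 2) -> Component:
    """A component is a Python function that returns the job description."""
    return Component(
        name="trainer",
        image="example/trainer:latest",      # placeholder image name
        entrypoint="python",
        args=[script],
        env={"LOGLEVEL": "INFO"},            # placeholder environment variable
        resources=Resources(cpu=2, gpu=0, mem_mb=4096),
        num_replicas=num_replicas,
    )


app = trainer()
print(app)
```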

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Job Status or Logs
Launched app: ray://torchx/54.214.124.247:20002-raysubmit_ntquG1dDV6CtFUC5
torchx describe ray://torchx/54.214.124.247:20002-raysubmit_ntquG1dDV6CtFUC5
torchx log ray://torchx/54.214.124.247:20002-raysubmit_ntquG1dDV6CtFUC5
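The app handle packs several pieces into one string. Reading it as `scheduler://session/app_id`, with the Ray app id being the dashboard address and the Ray job id joined by a dash, it can be split apart as below. That reading is inferred from the handle shown on the slide, and `parse_ray_handle` is a hypothetical helper, not a TorchX function.

```python
from typing import NamedTuple


class RayAppHandle(NamedTuple):
    scheduler: str
    session: str
    dashboard_address: str
    job_id: str


def parse_ray_handle(handle: str) -> RayAppHandle:
    """Split a handle like ray://torchx/<address>-<job_id> into its parts."""
    scheduler, sep, rest = handle.partition("://")
    if not sep:
        raise ValueError(f"missing '://' in handle: {handle!r}")
    session, sep, app_id = rest.partition("/")
    if not sep:
        raise ValueError(f"missing session or app id in handle: {handle!r}")
    # Assumption: the address contains no dash, so the first dash is the separator.
    address, sep, job_id = app_id.partition("-")
    if not sep:
        raise ValueError(f"missing job id in handle: {handle!r}")
    return RayAppHandle(scheduler, session, address, job_id)


handle = "ray://torchx/54.214.124.247:20002-raysubmit_ntquG1dDV6CtFUC5"
parsed = parse_ray_handle(handle)
print(parsed)
```

Splitting on the first dash works for an IP address like the one above; a hostname containing dashes would need a different rule.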

Slide 14

Slide 14 text

Torchx describe

Slide 15

Slide 15 text

Torchx log

Slide 16

Slide 16 text

Under the hood - Ray Job SDK
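The code from this slide is not preserved in the text, so the following is only a sketch of how the Ray Job SDK is typically used to submit a job, not necessarily what TorchX does internally. `JobSubmissionClient` and `submit_job` come from `ray.job_submission`; the helper names and the example address are assumptions.

```python
def dashboard_url(address: str) -> str:
    """Prefix a bare host:port with http:// (hypothetical helper)."""
    if address.startswith(("http://", "https://")):
        return address
    return f"http://{address}"


def submit_to_ray(address: str, entrypoint: str, working_dir: str = "."):
    """Submit a shell command as a Ray job; returns (client, job_id)."""
    # Imported lazily so the module can be loaded without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_url(address))
    job_id = client.submit_job(
        entrypoint=entrypoint,
        # Uploads the local directory so the job can find its scripts.
        runtime_env={"working_dir": working_dir},
    )
    return client, job_id


# Example usage against a running cluster (placeholder address):
#   client, job_id = submit_to_ray("127.0.0.1:8265", "python trainer.py")
#   print(client.get_job_status(job_id))
#   print(client.get_job_logs(job_id))
```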

Slide 17

Slide 17 text

Ray Job status
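Checking job status usually comes down to polling until the job reaches a terminal state. A minimal sketch of that loop follows, written against any callable that returns a status so it runs without a cluster. The state names match Ray's job statuses (PENDING, RUNNING, SUCCEEDED, FAILED, STOPPED); `wait_for_job` is a hypothetical helper.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(get_status, job_id, timeout_s=60.0, poll_interval_s=1.0, sleep=time.sleep):
    """Poll get_status(job_id) until a terminal state is reached or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(job_id)
        # Accept either plain strings or enum members that carry a .value.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {name} after {timeout_s}s")
        sleep(poll_interval_s)


# Demo with a fake status source instead of a live cluster.
fake_statuses = iter(["PENDING", "RUNNING", "RUNNING", "SUCCEEDED"])
final = wait_for_job(lambda _job_id: next(fake_statuses), "job-1", sleep=lambda _s: None)
print(final)
```

With a live cluster, `get_status` could be the `get_job_status` method of a Ray `JobSubmissionClient`.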

Slide 18

Slide 18 text

ray_driver.py

Slide 19

Slide 19 text

ray_driver.py

Slide 20

Slide 20 text

ray_driver.py

Slide 21

Slide 21 text

TorchX & Ray
Run TorchX components on Ray
● Model serving (TorchServe)
● Elastic job launcher script
● Hyperparameter optimization
● Setting env variables
● Metric logging
● Configuration management
● PyTorch Lightning training loop with callbacks

Slide 22

Slide 22 text

TorchX apps/pipelines store running on Ray

Slide 23

Slide 23 text

Give us a star ⭐
https://github.com/pytorch/torchx
https://github.com/ray-project/ray