Slide 1

Slide 1 text

Large-Scale Dataloading / Data Preprocessing with Ray
Wei Chen, NVIDIA

Slide 2

Slide 2 text

About Me
● My name is Wei Chen.
● I am a DL Software Engineer at NVIDIA.
● I work on building AI infrastructure for NVIDIA’s autonomous vehicles.

Slide 3

Slide 3 text

Outline
1. Dataloading in the DL pipeline.
2. Dataloading in PyTorch.
3. Re-implementing dataloading with the Ray engine.
4. Lessons learned.

Slide 4

Slide 4 text

Dataloading in the DL Pipeline

Slide 5

Slide 5 text

Dataloading is one of the most complex components in our DL pipeline.
[Pipeline diagram: Loading Features → Feature Processing and Loading Labels → Label Processing, both feeding Making a Batch → Model]

Slide 6

Slide 6 text

Why is dataloading in DL challenging?
- It needs to be fast.
- It needs to be stable.
[Diagram: Storage / Filesystem → Dataloader (CPU) → Model (GPU)]

Slide 7

Slide 7 text

Dataloading in PyTorch

Slide 8

Slide 8 text

The logical view
[Diagram: the Sampler yields an index via __iter__; the Dataset returns List[features] and List[labels] via __getitem__; the feature processor and label processor transform them; the Batcher assembles batched samples; the Dataloader hands the batched samples to the Trainer]
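To make the logical view concrete, here is a minimal sketch built from stock PyTorch pieces; MyDataset and my_collate are illustrative placeholders, not the actual pipeline code.

```python
# Minimal sketch of the logical view above, using stock PyTorch pieces.
# MyDataset and my_collate are illustrative placeholders.
import torch
from torch.utils.data import Dataset, DataLoader, SequentialSampler

class MyDataset(Dataset):
    def __getitem__(self, index):          # Dataset: index -> (features, label)
        features = torch.randn(4)          # stand-in for feature loading/processing
        label = index % 2                  # stand-in for label loading/processing
        return features, label

    def __len__(self):
        return 100

def my_collate(samples):                   # Batcher: list of samples -> one batch
    features = torch.stack([f for f, _ in samples])
    labels = torch.tensor([l for _, l in samples])
    return features, labels

dataset = MyDataset()
loader = DataLoader(
    dataset,
    batch_size=8,
    sampler=SequentialSampler(dataset),    # Sampler: yields indices via __iter__
    collate_fn=my_collate,
    num_workers=0,                         # single process, purely the logical view
)

for features, labels in loader:            # Trainer consumes batched samples
    pass
```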

Slide 9

Slide 9 text

The Multiprocess Engine
[Diagram: the main process (Sampler, Batcher) forks/spawns workers, each holding the Dataset]
- The main process forks/spawns workers.
- Data is copied back through shared memory.
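A sketch of how this multiprocess engine is driven through the standard DataLoader arguments; ToyDataset, the worker count, and the start method are illustrative values, not the actual pipeline settings.

```python
# Sketch: the same dataloading, but with worker processes doing the loading.
# ToyDataset, num_workers, and the start method are illustrative values.
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Illustrative stand-in for the real dataset."""
    def __getitem__(self, index):
        return torch.randn(4), index % 2

    def __len__(self):
        return 100

if __name__ == "__main__":                   # required when workers are spawned
    loader = DataLoader(
        ToyDataset(),
        batch_size=8,
        num_workers=4,                       # main process forks/spawns 4 workers
        multiprocessing_context="fork",      # or "spawn", depending on the platform
    )
    # Each worker runs __getitem__ for its assigned indices and ships the
    # resulting tensors back to the main process through shared memory.
    for features, labels in loader:
        pass
```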

Slide 10

Slide 10 text

But!!! We have issues with the Python multiprocessing engine:
- Fork relies on copy-on-write (COW).
- This approach introduces race conditions with any other code running multiple threads.
Dataloader deadlock issues:
https://github.com/pytorch/pytorch/issues/1355
https://github.com/pytorch/pytorch/issues/1595

Slide 11

Slide 11 text

Re-implementing dataloading with Ray

Slide 12

Slide 12 text

Why Ray?
- It is fast:
  ● Parallel actors.
  ● Shared memory.
  ● Achieves the same speed as the PyTorch dataloader.
- The actor model makes the program easier to write:
  ● Launching an actor = calling a function.
- Transparent to the upper layer:
  ● Easy to swap the multiprocess engine with the Ray engine.

Slide 13

Slide 13 text

Dataset Ray Actor
- Wrap the Dataset in a Ray actor.
- Pass the idx to the actor.
- The actor can return any serializable Python object.
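A minimal sketch of what such a dataset actor can look like; DatasetActor and the toy list dataset are illustrative names, not the actual pipeline code.

```python
# Sketch: wrapping an indexable dataset in a Ray actor.
# DatasetActor and the toy list dataset are illustrative placeholders.
import ray

ray.init()

@ray.remote
class DatasetActor:
    def __init__(self, dataset):
        self.dataset = dataset

    def get_item(self, idx):
        # Anything serializable (tensors, dicts, tuples, ...) can be returned.
        return self.dataset[idx]

dataset = list(range(100))                  # any indexable dataset works here
actor = DatasetActor.remote(dataset)        # launching an actor looks like calling a function
sample = ray.get(actor.get_item.remote(0))  # pass the idx, get the sample back
```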

Slide 14

Slide 14 text

Dataset Ray Actor
- We keep the sampler and dataset unchanged when switching to Ray.
- We can control the granularity of parallelism through actors.
- We can further make this asynchronous by calling the ray.wait() API, as sketched below.
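A short sketch of the asynchronous fetch with ray.wait(); it reuses the DatasetActor from the previous sketch, and the batch of eight indices is illustrative.

```python
# Sketch: issue several get_item calls up front and consume results as they
# complete, using ray.wait(). Reuses DatasetActor from the previous sketch.
pending = [actor.get_item.remote(i) for i in range(8)]

while pending:
    done, pending = ray.wait(pending, num_returns=1)  # returns as soon as one is ready
    sample = ray.get(done[0])
    # hand the sample to the batcher/trainer while the remaining loads continue
```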

Slide 15

Slide 15 text

Lessons Learned

Slide 16

Slide 16 text

Handling Multi-GPU Training with Ray
- We instantiate Ray inside each training job.
- We run our multi-GPU training using MPI.
- When starting up:
  ● Rank 0 calls ray.init().
  ● Ranks x > 0 call ray.init(address=’auto’).
- When shutting down:
  ● Ranks x > 0 exit first.
  ● Rank 0 waits to exit until all the other ranks have exited.
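A minimal sketch of this startup/shutdown ordering, assuming mpi4py for rank coordination; the real pipeline's coordination may differ.

```python
# Sketch of the startup/shutdown ordering described above, assuming mpi4py.
import ray
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    ray.init()                    # rank 0 starts the Ray instance for this job
comm.Barrier()                    # make sure the instance is up before others connect
if rank > 0:
    ray.init(address="auto")      # ranks > 0 attach to the existing instance

# ... multi-GPU training loop using Ray-backed dataloading ...

comm.Barrier()                    # rank 0 waits here until every other rank arrives
if rank == 0:
    ray.shutdown()                # only then does rank 0 tear Ray down and exit
```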

Slide 17

Slide 17 text

Customized Config to Adapt to Our Cluster
We identified a few configs that need to be adapted to our environment:
- worker_register_timeout_seconds (default 30).
- num_heartbeats_timeout (default 30).
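A hedged sketch of how these two timeouts can be overridden when starting the head node; the _system_config argument and the override values are assumptions that depend on the Ray version in use.

```python
# Sketch: overriding the two timeouts named above on the head node.
# The _system_config argument and the chosen values are assumptions; the exact
# mechanism and defaults depend on the Ray version.
import ray

ray.init(
    _system_config={
        "worker_register_timeout_seconds": 120,  # default 30; allow slower worker startup
        "num_heartbeats_timeout": 300,           # default 30; tolerate busy or slow nodes
    }
)
```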

Slide 18

Slide 18 text

Thanks!!!