Large Scale Data Loading and Data Preprocessing with Ray (Wei Chen, NVIDIA)

Large-scale dataloading/data preprocessing with Ray
Wei Chen, NVIDIA

About Me
• My name is Wei Chen.
• I am a DL software engineer at NVIDIA.
• I work on building AI infrastructure for NVIDIA's autonomous vehicles.

Outline
1. Dataloading in the DL pipeline.
2. Dataloading in PyTorch.
3. Re-implementing dataloading using the Ray engine.
4. Lessons learned.

Dataloading in DL Pipeline

Dataloading is one of the most complex components in our DL pipeline.
Diagram stages: loading features, loading labels, feature processing, label processing, making a batch, model.

Why is dataloading in DL challenging?
- It needs to be fast.
- It needs to be stable.
Diagram labels: CPU, GPU, storage, filesystem, dataloader, model.

Dataloading in PyTorch

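The logic view
Diagram: the Sampler yields an index (__iter__); the Dataset maps the index to a sample (__getitem__) using the feature processor and the label processor; the Batcher collects List[features] and List[labels] into batched samples; the Dataloader hands the batched samples to the Trainer.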
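
The logic view can be sketched in plain Python, without PyTorch, to make the division of labor concrete. This is only an illustration: `ToyDataset`, `ShuffledSampler`, and `batches` are hypothetical names that do not appear in the talk, and the feature and label "processing" is a stand-in.

```python
import random
from typing import Iterator, List, Tuple

Sample = Tuple[List[float], int]


class ToyDataset:
    """Hypothetical dataset: maps an index to one (features, label) sample."""

    def __init__(self, size: int) -> None:
        self.size = size

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, index: int) -> Sample:
        features = [float(index), float(index) * 2.0]  # stand-in for feature loading/processing
        label = index % 2                              # stand-in for label loading/processing
        return features, label


class ShuffledSampler:
    """Hypothetical sampler: yields every dataset index once, in a seeded random order."""

    def __init__(self, size: int, seed: int = 0) -> None:
        self.size = size
        self.seed = seed

    def __iter__(self) -> Iterator[int]:
        order = list(range(self.size))
        random.Random(self.seed).shuffle(order)
        return iter(order)


def batches(dataset: ToyDataset, sampler: ShuffledSampler,
            batch_size: int) -> Iterator[Tuple[List[List[float]], List[int]]]:
    """Batcher: groups samples into (List[features], List[labels]) pairs."""
    features, labels = [], []
    for index in sampler:
        sample_features, sample_label = dataset[index]
        features.append(sample_features)
        labels.append(sample_label)
        if len(features) == batch_size:
            yield features, labels
            features, labels = [], []
    if features:  # final, possibly smaller, batch
        yield features, labels
```

A trainer would simply iterate over `batches(ToyDataset(5), ShuffledSampler(5), batch_size=2)`; the sampler and the dataset never need to know how the batches are produced.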

The multiprocess engine
- The main process forks/spawns workers.
- Data is copied through shared memory.
Diagram labels: main process, workers, sampler, dataset, batcher.
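
A rough sketch of such an engine, using only the standard library, may help. It is not PyTorch's implementation: the function names are hypothetical, the workers are forked explicitly, and results travel back through an ordinary `multiprocessing` queue (pickled over a pipe) rather than through shared memory.

```python
import multiprocessing as mp
from typing import List, Tuple

Sample = Tuple[List[float], int]


def load_sample(index: int) -> Sample:
    """Stand-in for Dataset.__getitem__ (hypothetical)."""
    return [float(index)], index % 2


def worker_loop(index_queue, result_queue) -> None:
    """Worker: receive a batch of indices, load the samples, send them back."""
    while True:
        task = index_queue.get()
        if task is None:  # sentinel from the main process: no more work
            break
        batch_id, indices = task
        result_queue.put((batch_id, [load_sample(i) for i in indices]))


def load_with_workers(index_batches: List[List[int]],
                      num_workers: int = 2) -> List[List[Sample]]:
    """Main process: fork workers, hand out index batches, collect results in order."""
    ctx = mp.get_context("fork")  # assumption: a POSIX system where fork is available
    index_queue, result_queue = ctx.Queue(), ctx.Queue()
    workers = [
        ctx.Process(target=worker_loop, args=(index_queue, result_queue), daemon=True)
        for _ in range(num_workers)
    ]
    for worker in workers:
        worker.start()
    for batch_id, indices in enumerate(index_batches):
        index_queue.put((batch_id, indices))
    for _ in workers:
        index_queue.put(None)
    # Workers may finish out of order, so results are keyed by batch id.
    results = dict(result_queue.get() for _ in index_batches)
    for worker in workers:
        worker.join()
    return [results[batch_id] for batch_id in range(len(index_batches))]
```

Calling `load_with_workers([[0, 1], [2, 3], [4]])` returns the three batches in submission order, even though the workers load them concurrently.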

But!!! We have issues with the Python multiprocess engine
- It uses COW (copy-on-write) when forking workers.
- This approach introduces race conditions with any other code running multiple threads.
Dataloader deadlock issues:
https://github.com/pytorch/pytorch/issues/1355
https://github.com/pytorch/pytorch/issues/1595

Re-implementing dataloading with Ray

Why Ray?
- It is fast:
  • Parallel actors.
  • Shared memory.
  • Achieves the same speed as the PyTorch dataloader.
- The actor model makes the program easier to write:
  • Launching an actor = calling a function.
- Transparent to the upper layer:
  • Easy to swap the multiprocess engine for the Ray engine.

Dataset Ray actor
- Dataset actor.
- Pass in the idx.
- Return any serializable Python object.

Dataset Ray actor
- We keep the sampler and dataset unchanged when switching to Ray.
- We can control the granularity of parallelism through actors.
- We can further make this asynchronous by calling the ray.wait() API.

Lessons Learned

Handle multi-GPU training in Ray
- We instantiate Ray in each training job.
- We run our multi-GPU training using MPI.
- When starting up:
  • Rank 0 calls ray.init().
  • Every rank x > 0 calls ray.init(address='auto').
- When shutting down:
  • Every rank x > 0 exits.
  • Rank 0 waits to exit until all the other ranks have exited.
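
The startup rule can be sketched as follows. The helper names are hypothetical, and reading the rank from `OMPI_COMM_WORLD_RANK` is an assumption that holds for Open MPI launchers; other MPI implementations expose the rank differently. Ranks above 0 can only attach once rank 0 has started Ray, so some form of synchronization (not shown, and not described in the talk) is needed between the two calls.

```python
import os
from typing import Dict, Mapping, Optional


def get_rank(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the MPI rank; OMPI_COMM_WORLD_RANK is an Open MPI convention (assumption)."""
    env = os.environ if env is None else env
    return int(env.get("OMPI_COMM_WORLD_RANK", "0"))


def ray_init_kwargs(rank: int) -> Dict[str, str]:
    """Rank 0 starts a new Ray instance; every other rank attaches to it."""
    return {} if rank == 0 else {"address": "auto"}


def start_ray(rank: int):
    """Call ray.init() with the arguments appropriate for this rank."""
    import ray  # imported lazily so the helpers above work without Ray installed

    return ray.init(**ray_init_kwargs(rank))
```

In a training script this would be called once per process, for example `start_ray(get_rank())`, before any actor is created.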

Customized config to adapt to our cluster
We identified a few configs that need to be adapted to our environment:
- worker_register_timeout_seconds (default 30).
- num_heartbeats_timeout (default 30).
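
As a sketch, the two settings could be collected in a dictionary like the one below. The values are placeholders, not the ones used in the talk, and the way such internal settings are passed to Ray has changed between releases, so the calls in the comments should be checked against the installed version before use.

```python
import json

# Hypothetical values: the talk names the settings and their defaults (30),
# but does not say what they were raised to.
ray_system_config = {
    "worker_register_timeout_seconds": 120,
    "num_heartbeats_timeout": 120,
}

# Serialized form, for Ray versions that expect a JSON string.
ray_system_config_json = json.dumps(ray_system_config)

# Assumption: depending on the Ray version, internal settings are passed as
#   ray.init(_internal_config=ray_system_config_json)   # older releases
#   ray.init(_system_config=ray_system_config)          # later releases
# and a given release may no longer recognize every key listed here.
```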

Thanks!!!
