Large Scale Data Loading and Data Preprocessing with Ray (Wei Chen, NVIDIA)

Data loading is one of the most crucial steps in the DL pipeline. It needs to be designed and implemented in both a flexible and performant manner so that (1) it can be reused to support different DNN models, (2) it can match the speed of GPU compute, and (3) it can scale to multiple cores and even multiple nodes. However, achieving these design goals is not trivial, especially given that the most commonly used language in DL is Python, in which there is no good support for parallel programming.

In this talk, we will show how we can use Ray to implement our data loading pipeline. Powered by the Ray actor, we are able to reuse most of our Python modules and run our data loading pipeline in parallel without worrying about the overhead of managing it at scale. We will also talk about the experience and lessons we learned during our implementation and production deployment.



July 21, 2021


  1. Large-scale data loading/data preprocessing with Ray (Wei Chen, NVIDIA)

  2. About Me
    • My name is Wei Chen.
    • I am a DL Software Engineer at NVIDIA.
    • I am working on building AI infrastructure for NVIDIA’s autonomous vehicles.
  3. Outline
    1. Dataloading in DL pipeline.
    2. Dataloading in PyTorch.
    3. Re-implement dataloading using Ray engine.
    4. Lessons learned.
  4. Dataloading in DL Pipeline

  5. Dataloading is one of the most complex components in our DL pipeline.
    Diagram labels: Loading Features, Loading Labels, Feature Processing, Label Processing, Making a Batch, Model
  6. Why is dataloading in DL challenging?
    - It needs to be fast.
    - It needs to be stable.
    Diagram labels: CPU, GPU, Dataloader, Model, Storage, Filesystem
  7. Dataloading in PyTorch

  8. The logic view
    Diagram labels: Sampler, Dataset, Batcher, Dataloader, Trainer, Batched Samples, __getitem__, __iter__, index, List[features], List[labels], Feature processor, Label processor
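
The logic view on this slide can be sketched in plain Python, without importing PyTorch. This is only an illustration of how a sampler, a dataset and a batcher fit together; the class and function names (`SequentialSampler`, `ToyDataset`, `dataloader`) and the toy processing steps are assumptions made for the example, not the speaker's code.

```python
from typing import Iterator, List, Tuple

Sample = Tuple[List[int], int]


class SequentialSampler:
    """The 'Sampler' box: yields dataset indices in order via __iter__."""

    def __init__(self, size: int) -> None:
        self.size = size

    def __iter__(self) -> Iterator[int]:
        return iter(range(self.size))


class ToyDataset:
    """The 'Dataset' box: maps an index to one processed (features, label) sample."""

    def _process_features(self, raw: List[int]) -> List[int]:
        # Hypothetical feature processor: double every raw value.
        return [2 * x for x in raw]

    def _process_label(self, raw: int) -> int:
        # Hypothetical label processor: reduce the raw label to a parity bit.
        return raw % 2

    def __getitem__(self, idx: int) -> Sample:
        raw_features, raw_label = [idx, idx + 1], idx
        return self._process_features(raw_features), self._process_label(raw_label)


def dataloader(
    dataset: ToyDataset, sampler: SequentialSampler, batch_size: int
) -> Iterator[Tuple[List[List[int]], List[int]]]:
    """The 'Batcher' box: groups samples into (List[features], List[labels])."""
    features: List[List[int]] = []
    labels: List[int] = []
    for idx in sampler:
        f, l = dataset[idx]
        features.append(f)
        labels.append(l)
        if len(features) == batch_size:
            yield features, labels
            features, labels = [], []
    if features:  # emit the final, possibly smaller, batch
        yield features, labels
```

A trainer would then simply iterate: `for features, labels in dataloader(ToyDataset(), SequentialSampler(5), batch_size=2): ...`.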
  9. The Multiprocess Engine
    - Main process forks/spawns workers.
    - Data is copied through shared memory.
    Diagram labels: Main Process, Worker, Worker, Dataset, Batcher, Sampler, Fork/Spawn
  10. But!!! We have issues with the Python multiprocess engine
    - Uses COW (copy-on-write).
    - This approach introduces race conditions with any other code running multiple threads.
    Dataloader deadlock issues:
    https://github.com/pytorch/pytorch/issues/1355
    https://github.com/pytorch/pytorch/issues/1595
  11. Re-implementing dataloading with Ray

  12. Why Ray?
    - It is fast:
      • Parallel actors
      • Shared memory
      • Achieves the same speed as the PyTorch dataloader
    - The actor model makes programming easier:
      • Launching an actor = calling a function.
    - Transparent to the upper layer:
      • Easy to swap the multiprocess engine with the Ray engine.
  13. Dataset Ray actor
    - Dataset Actor
    - Passing the idx
    - Return any serializable Python object
  14. Dataset Ray actor
    - We keep the sampler and dataset unchanged when switching to Ray.
    - We can control the granularity of parallelism through actors.
    - We can further make this asynchronous by calling the ray.wait() API.
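
A minimal sketch of the dataset-actor idea from the two slides above, assuming Ray is installed. The names `ToyDataset`, `DatasetWorker`, `load_with_ray` and `num_actors` are hypothetical and chosen for illustration; the slides do not show the actual implementation. The dataset class stays untouched, a thin wrapper is turned into an actor with `ray.remote`, indices are passed in, and `ray.wait()` hands back samples as soon as any of them is ready.

```python
from typing import Any, Iterable, Iterator, List, Tuple


class ToyDataset:
    """Stand-in for an existing dataset; it is not modified to work with Ray."""

    def __getitem__(self, idx: int) -> Tuple[List[int], int]:
        return [idx, idx + 1], idx % 2


class DatasetWorker:
    """Thin wrapper that owns one dataset copy; intended to run as a Ray actor."""

    def __init__(self, dataset: Any) -> None:
        self.dataset = dataset

    def get_item(self, idx: int) -> Any:
        # Any serializable Python object can be returned to the caller.
        return self.dataset[idx]


def load_with_ray(
    dataset: Any, indices: Iterable[int], num_actors: int = 2
) -> Iterator[Any]:
    """Yield samples in completion order using `num_actors` dataset actors."""
    import ray  # imported lazily so the classes above work without Ray installed

    if not ray.is_initialized():
        ray.init()
    worker_cls = ray.remote(DatasetWorker)
    # The number of actors controls the granularity of parallelism.
    actors = [worker_cls.remote(dataset) for _ in range(num_actors)]
    # Round-robin the indices over the actors; each call returns a future.
    pending = [
        actors[i % num_actors].get_item.remote(idx) for i, idx in enumerate(indices)
    ]
    while pending:
        # ray.wait() returns as soon as one result is ready.
        ready, pending = ray.wait(pending, num_returns=1)
        yield ray.get(ready[0])
```

Because results are yielded in completion order, a caller that needs deterministic ordering would have to return the index alongside each sample and reorder, or call `ray.get` on the full list of futures instead.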
  15. Lessons Learned

  16. Handle Multi-GPU training in Ray
    - We instantiate Ray in each training job.
    - We run our multi-GPU training using MPI.
    - When starting up:
      • Rank 0 calls ray.init()
      • Rank x > 0 calls ray.init(address='auto')
    - When shutting down:
      • Rank x > 0 exits.
      • Rank 0 waits to exit until all the other ranks exit.
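
The startup and shutdown order on this slide might look roughly like the sketch below. It assumes `mpi4py` as the MPI binding and uses barriers to enforce the ordering; both are assumptions, since the slide only says that MPI is used. The helper names `ray_init_kwargs` and `run_training_job` are hypothetical.

```python
from typing import Dict


def ray_init_kwargs(rank: int) -> Dict[str, str]:
    """Rank 0 starts a new Ray instance; every other rank attaches to it."""
    return {} if rank == 0 else {"address": "auto"}


def run_training_job() -> None:
    # Lazy imports: this function needs mpi4py and Ray, the helper above does not.
    from mpi4py import MPI
    import ray

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        ray.init(**ray_init_kwargs(rank))
    # Assumption: a barrier keeps the other ranks from connecting before
    # rank 0 has finished starting Ray.
    comm.Barrier()
    if rank != 0:
        ray.init(**ray_init_kwargs(rank))

    try:
        pass  # the training loop would go here
    finally:
        if rank != 0:
            ray.shutdown()  # non-zero ranks disconnect first
        # Rank 0 waits here until all the other ranks have disconnected.
        comm.Barrier()
        if rank == 0:
            ray.shutdown()
```

Such a script would typically be launched with something like `mpirun -np 4 python train.py`, with `run_training_job()` called from the entry point.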
  17. Customized config to adapt to our cluster
    We identified a few configs that need to adapt to our environment:
    - worker_register_timeout_seconds (default 30).
    - num_heartbeats_timeout (default 30).
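
One way such settings could be applied is through the underscore-prefixed `_system_config` argument of `ray.init()`. This is a sketch under assumptions: the slide does not say how the values were set, the value 60 is purely illustrative, and these keys are internal settings that may not be accepted by every Ray release. The helper names are hypothetical.

```python
from typing import Dict


def cluster_system_config(
    register_timeout_s: int = 60, heartbeats_timeout: int = 60
) -> Dict[str, int]:
    """Build overrides for the two settings named on the slide (values are examples)."""
    return {
        "worker_register_timeout_seconds": register_timeout_s,
        "num_heartbeats_timeout": heartbeats_timeout,
    }


def init_ray_with_custom_config() -> None:
    import ray  # lazy import so the helper above is usable without Ray installed

    # Assumption: the installed Ray version still recognizes both keys.
    ray.init(_system_config=cluster_system_config())
```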
  18. Thanks!!!