Collective-on-Ray: High-performance Collective Communication for Distributed Machine Learning on Ray (Hao Zhang, UC Berkeley)

Collective communication is the cornerstone of many distributed machine learning systems. Existing ML systems such as distributed TensorFlow have built-in collective communication infrastructure, but this functionality is often bundled with the framework, invisible to users, and does not extend to ML programs that are not written in the framework's own language.

In this talk, we introduce a set of Ray-native, Python-based collective communication primitives for Ray clusters with distributed CPUs or GPUs. They can be used in Ray task or actor code to speed up distributed communication, such as that arising in distributed ML training.

Built on top of these communication primitives, we introduce a Ray-native distributed ML training library that offers Python-based implementations of, and interfaces to, a variety of data- and model-parallel training strategies, such as parameter server and pipeline parallelism, to enable training of various DL and non-DL models on a Ray cluster.

Anyscale

July 21, 2021
Transcript

  1. Collective-on-Ray: High-performance Collective Communication for Distributed Machine Learning on Ray

    Hao Zhang
  2. Distributed Communication

    Point-to-point communication between two processes

  3. Ray Simplifies P2P Communication A Lot!

  4. Collective Communication Patterns

    [Figure: GPU Workers 0-3, each holding a local gradient $\nabla\theta_i^{t}$]
    Derive gradients: $\nabla\theta_0^{t} = \nabla L(\theta^{t}, D_0^{t})$
    Coordinated updates: $\theta^{t+1} \leftarrow \theta^{t} + \sum_{i=0}^{3} \nabla\theta_i^{t}$
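
The two steps on this slide can be traced with a small simulation. This is a minimal sketch in plain Python, assuming a toy quadratic loss and a step size `lr`; neither comes from the talk, and the names `local_update` and `coordinated_step` are hypothetical. Each local term is treated as the already-scaled update so that the coordinated step is a plain sum, as written on the slide.

```python
# Toy data-parallel step with 4 simulated workers (illustrative assumptions:
# loss L(theta, D) = 0.5 * mean((theta - x)^2 for x in D), step size lr).

def local_update(theta, shard, lr=0.1):
    """Derive gradients: one worker's update term from its own data shard."""
    grad = sum(theta - x for x in shard) / len(shard)
    # Fold the step size and sign into the term so the global step is a sum.
    return -lr * grad

def coordinated_step(theta, shards):
    """Coordinated update: theta_next = theta + sum of all workers' terms."""
    updates = [local_update(theta, shard) for shard in shards]
    return theta + sum(updates), updates

shards = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0], [5.0, -1.0]]  # one shard per worker
theta_next, updates = coordinated_step(0.0, shards)
```

The sum over workers in `coordinated_step` is the part that a collective allreduce performs in a real cluster.
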
  5. Assembling Collective Using P2P?

    [Figure: GPU Workers 0-3 exchanging gradients $\nabla\theta_i^{t}$]
    Assemble a collective pattern using multiple P2Ps
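
One straightforward way to assemble an allreduce from point-to-point transfers is to reduce everything onto worker 0 and then send the result back out. The sketch below simulates that pattern in plain Python and counts messages; the function name `p2p_allreduce` is hypothetical and no Ray API is involved.

```python
def p2p_allreduce(buffers):
    """Allreduce (sum) built only from simulated point-to-point sends.

    buffers[i] is the list of numbers held by worker i. Returns the
    per-worker results and the number of P2P messages used.
    """
    n = len(buffers)
    messages = 0

    # Phase 1: workers 1..n-1 each send their buffer to worker 0, which sums.
    reduced = list(buffers[0])
    for src in range(1, n):
        for k, value in enumerate(buffers[src]):
            reduced[k] += value
        messages += 1

    # Phase 2: worker 0 sends the reduced buffer to every other worker.
    results = [list(reduced)]
    for _dst in range(1, n):
        results.append(list(reduced))
        messages += 1

    return results, messages

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
results, messages = p2p_allreduce(grads)
```

With n workers this takes 2(n-1) messages, and all of them pass through worker 0, which becomes the bottleneck as n grows.
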
  6. A More Complex Case: Ring AllReduce

    Figure from Sergeev et al. 2018
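
For readers without the figure, the ring pattern can be simulated in a few lines. This is a sketch of the standard two-phase ring allreduce (reduce-scatter, then all-gather), assuming one number per chunk and exactly as many chunks as workers; the function name `ring_allreduce` is hypothetical.

```python
def ring_allreduce(buffers):
    """Simulate ring allreduce (sum). buffers[r][c] is worker r's chunk c."""
    n = len(buffers)
    assert all(len(b) == n for b in buffers), "one chunk per worker assumed"
    data = [list(b) for b in buffers]

    # Phase 1, reduce-scatter: in each step every worker sends one chunk to
    # its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [((r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            data[(r + 1) % n][chunk] += value
    # Now worker r holds the fully reduced chunk (r + 1) % n.

    # Phase 2, all-gather: the reduced chunks travel around the ring and
    # overwrite the stale copies.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            data[(r + 1) % n][chunk] = value

    return data

out = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each worker sends 2(n-1) chunks of size 1/n of the buffer, so per-worker traffic stays roughly constant as workers are added. Getting the chunk indices right is exactly the kind of bookkeeping that is easy to get wrong when hand-assembling the pattern from P2P calls.
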
  7. Introducing Collective Primitives to Ray

    [Figure: GPU Workers 0-3 exchanging gradients $\nabla\theta_i^{t}$ through a single collective primitive]
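
As a reference for what the most common collective primitives compute, here is a sketch of their semantics as plain Python functions over a list of per-rank buffers. These are illustrative definitions only, not the Ray implementation, and the function names are chosen for this example.

```python
def broadcast(buffers, src_rank=0):
    """Every rank ends with a copy of the source rank's buffer."""
    return [list(buffers[src_rank]) for _ in buffers]

def allreduce(buffers):
    """Every rank ends with the element-wise sum over all ranks."""
    total = [sum(column) for column in zip(*buffers)]
    return [list(total) for _ in buffers]

def allgather(buffers):
    """Every rank ends with the list of all ranks' buffers, in rank order."""
    return [[list(b) for b in buffers] for _ in buffers]

def reduce_scatter(buffers):
    """Rank r ends with element r of the element-wise sum.

    Assumes each buffer has exactly one element per rank.
    """
    assert all(len(b) == len(buffers) for b in buffers)
    total = [sum(column) for column in zip(*buffers)]
    return [[total[r]] for r in range(len(buffers))]

ranks = [[1, 2], [10, 20]]
```

For example, `allreduce(ranks)` gives every rank `[11, 22]`, while `reduce_scatter(ranks)` leaves rank 0 with `[11]` and rank 1 with `[22]`.
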
  8. Pre-cap: Advantages

    Programmatically simpler
    • Specialized APIs for each collective communication pattern
    • Less error-prone than assembling them from scratch

    Better performance
    • They are optimized for different computing devices, e.g., GPUs or CPUs.
    • They can make the best use of the network hardware (Ethernet, InfiniBand, etc.) to provide the greatest communication performance.
  9. Walk-through Example: Data-parallel Training

    Right figure from NCCL documentation
  10. Training Workers as GPU Actors

        import ray
        import cupy

        @ray.remote(num_gpus=1)
        class GPUWorker:
            def __init__(self):
                # Each worker holds its local gradients on its own GPU.
                self.gradients = cupy.ones((10,), dtype=cupy.float32)

        num_workers = 16
        workers = [GPUWorker.remote() for _ in range(num_workers)]
  11. Assembling AllReduce Using ray.get()

        # Allreduce the gradients using Ray APIs.
        # Let all workers put their gradients into the Ray object store.
        gradient_ids = [worker.put_gradients.remote() for worker in workers]
        ray.wait(gradient_ids, num_returns=len(gradient_ids), timeout=None)

        # Let worker 0 reduce the gradients.
        reduced_id_ref = workers[0].reduce_gradients.remote(gradient_ids)

        # Every worker then fetches the reduced gradients.
        results = []
        for worker in workers:
            results.append(worker.get_reduced_gradient.remote([reduced_id_ref]))
        ray.get(results)
  12. Assembling AllReduce Using ray.get()

        @ray.remote(num_gpus=1)
        class GPUWorker:
            def __init__(self):
                self.gradients = cupy.ones((10,), dtype=cupy.float32)

            def put_gradients(self):
                return ray.put(self.gradients)

            def reduce_gradients(self, grad_id_refs):
                grad_ids = ray.get(grad_id_refs)
                # Accumulate the sum starting from zeros.
                reduced_result = cupy.zeros((10,), dtype=cupy.float32)
                for grad_id in grad_ids:
                    array = ray.get(grad_id)
                    reduced_result += array
                result_id = ray.put(reduced_result)
                return result_id

            def get_reduced_gradient(self, reduced_gradient_id_ref):
                reduced_gradient_id = ray.get(reduced_gradient_id_ref)
                reduced_gradient = ray.get(reduced_gradient_id)
                # do whatever with the reduced gradients
                return True
  13. Assembling AllReduce Using ray.get()

    (Same driver code as slide 11.)
  14. Overheads

    [Figure: Process 1 and Process 2, each holding four GPU arrays, and the Ray Object Store]
  15. Using Ray Collective Primitive APIs

        import cupy
        import ray
        import ray.util.collective as col

        @ray.remote(num_gpus=1)
        class GPUWorker:
            def __init__(self):
                self.gradients = cupy.ones((10,), dtype=cupy.float32)

            def setup(self, world_size, rank):
                # Join the collective group; every worker must call this once.
                col.init_collective_group(
                    world_size=world_size, rank=rank, backend="nccl")

            def allreduce(self):
                # In-place allreduce of the local gradients across the group.
                col.allreduce(self.gradients)
                return self.gradients

        # workers: the 16 GPUWorker actors created as on slide 10
        setup_rets = ray.get([w.setup.remote(16, i) for i, w in enumerate(workers)])
        results = ray.get([w.allreduce.remote() for w in workers])
  16. Behind the Scene: CCL Backends

    [Figure: software stack]
    Python: Ray Collective APIs; CollectiveGroup; PyGloo Group and APIs, NCCL Group and APIs, TensorPipe APIs
    pybind bindings (one per backend)
    C++ backend: Gloo, NCCL, TensorPipe
  17. Microbenchmarks #1

    A node with 2 GPUs; each worker is spawned on 1 GPU. NVLink is enabled (Y-axis is log-scale).
  18. Microbenchmark #2

    A cluster with 7 nodes, each node with 2 GPUs; each worker is spawned on 1 GPU (hence 14 workers in total). Y-axis is log-scale.
  19. Case: Spacy NER + Parameter Server

    # workers     spacy-ray         spacy-ray w/ collective backend   speedup
    1 worker      137.5 ± 2.1       116.7 ± 2.51                      1.18x
    2 workers     354.1 ± 16.8      171.1 ± 1.11                      2.07x
    4 workers     523.9 ± 10.4      179.6 ± 2.91                      2.92x
    8 workers     710.1 ± 3.0       205.8 ± 1.20                      3.45x
    16 workers    1296.1 ± 42.1     248.3 ± 3.63                      5.22x

    Check it out at: https://github.com/explosion/spacy-ray
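
The speedup column is the ratio of the two measurements in each row. The short check below recomputes it from the mean values on the slide (the error bars are ignored, and the slide does not state the unit of measurement).

```python
# (workers) -> (spacy-ray, spacy-ray with collective backend), means from the slide
measurements = {
    1: (137.5, 116.7),
    2: (354.1, 171.1),
    4: (523.9, 179.6),
    8: (710.1, 205.8),
    16: (1296.1, 248.3),
}

# Speedup = baseline / collective-backend value, rounded as on the slide.
speedups = {
    n: round(baseline / collective, 2)
    for n, (baseline, collective) in measurements.items()
}
```

The recomputed values match the slide, and the ratio grows with the number of workers.
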
  20. Supporting Matrix

    Backend           GLOO          NCCL
    Device            CPU    GPU    CPU    GPU
    send              ✓      ✘      ✘      ✓
    recv              ✓      ✘      ✘      ✓
    broadcast         ✓      ✘      ✘      ✓
    all_reduce        ✓      ✘      ✘      ✓
    reduce            ✓      ✘      ✘      ✓
    all_gather        ✓      ✘      ✘      ✓
    gather            ✓      ✘      ✘      ✘
    scatter           WIP    ✘      ✘      ✘
    reduce_scatter    ✓      ✘      ✘      ✓
    all_to_all        ✘      ✘      ✘      ✓
    barrier           ✓      ✘      ✘      ✓
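
The matrix can be encoded as a small lookup table, for example to fail early when a program asks for an unsupported combination. This is a sketch that transcribes the ✓ entries as of the talk; `SUPPORTED` and `is_supported` are hypothetical names, not part of the Ray API, and `scatter` (marked WIP) is treated as unsupported.

```python
# Primitives marked with a check mark on the slide, per (backend, device).
SUPPORTED = {
    ("gloo", "cpu"): {
        "send", "recv", "broadcast", "all_reduce", "reduce",
        "all_gather", "gather", "reduce_scatter", "barrier",
    },
    ("nccl", "gpu"): {
        "send", "recv", "broadcast", "all_reduce", "reduce",
        "all_gather", "reduce_scatter", "all_to_all", "barrier",
    },
}

def is_supported(backend, device, primitive):
    """Return True if the slide lists the primitive as supported."""
    key = (backend.lower(), device.lower())
    return primitive in SUPPORTED.get(key, set())
```

For instance, `is_supported("NCCL", "GPU", "all_to_all")` is true, while `is_supported("gloo", "gpu", "send")` is false.
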
  21. More APIs: Faster P2P

        @ray.remote(num_gpus=1)
        class Worker:
            def __init__(self):
                self.buffer = cupy.ones((10,), dtype=cupy.float32)

            def get_buffer(self):
                return self.buffer

            def do_send(self, target_rank=0):
                # this call is blocking
                col.send(self.buffer, target_rank)

            def do_recv(self, src_rank=0):
                # this call is blocking
                col.recv(self.buffer, src_rank)

        # Create two actors
        A = Worker.remote()
        B = Worker.remote()

        # Point-to-point communication with NCCL backend.
        # (A and B must already belong to a collective group with ranks 0 and 1.)
        ray.get([A.do_send.remote(target_rank=1), B.do_recv.remote(src_rank=0)])
  22. Limitations

    • Out-of-band communication
    • No object store involved, hence fewer guarantees
    • Process groups must be managed manually
    • Risk of deadlocks
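
The deadlock risk comes from blocking calls that must be paired across actors by hand. The sketch below is a simplified model, assuming that a send completes only when the peer's next call is the matching recv; real backends may buffer small messages, so this is an illustration rather than a description of NCCL or Gloo behavior. The function name `run_blocking_p2p` is hypothetical.

```python
def run_blocking_p2p(programs):
    """Simulate blocking send/recv calls under rendezvous semantics.

    programs[r] is rank r's ordered list of ("send" | "recv", peer) calls.
    Returns True if every call completes, False if the ranks deadlock.
    """
    pending = [list(ops) for ops in programs]
    progressed = True
    while progressed:
        progressed = False
        for rank, ops in enumerate(pending):
            if not ops or ops[0][0] != "send":
                continue
            peer = ops[0][1]
            peer_ops = pending[peer]
            # A send completes only if the peer is waiting in the matching recv.
            if peer_ops and peer_ops[0] == ("recv", rank):
                ops.pop(0)
                peer_ops.pop(0)
                progressed = True
    # Calls left over with no progress possible means the ranks are stuck.
    return all(not ops for ops in pending)

# Rank 0 sends while rank 1 receives, then the roles swap: completes.
ordered = [[("send", 1), ("recv", 1)], [("recv", 0), ("send", 0)]]
# Both ranks send first: each waits forever for the other to receive.
both_send_first = [[("send", 1), ("recv", 1)], [("send", 0), ("recv", 0)]]
```

Under this model `run_blocking_p2p(ordered)` succeeds and `run_blocking_p2p(both_send_first)` reports a deadlock, which is why the call order on each actor has to be planned together.
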
  23. More APIs: Declarative Collectives

        @ray.remote
        class Worker:
            def __init__(self):
                self.buffer = cupy.ones((10,), dtype=cupy.float32)

            def get_buffer(self):
                return self.buffer

        # Create two actors and create a collective group
        A = Worker.remote()
        B = Worker.remote()
        col.declare_collective_group([A, B], options={rank=[0, 1], ...})

        # Specify a collective allreduce "completely" instead of "partially" on each actor
        col.allreduce_refs([A.get_buffer.remote(), B.get_buffer.remote()])
  24. Future Steps: Distributed ML on Ray

    [Figure: software stack with layers Ray Collectives/Ray, Distributed ML Training Strategies, User ML Code, and Auto-parallel compilation/optimization]
  25. Take-aways

    • Collective communication is a cornerstone of distributed ML
    • We provide a set of CCL-library-backed collective primitive APIs on Ray
      • In pure Python
      • Support both CPU and GPU arrays, and transfers between CPU<->CPU, GPU<->GPU, CPU<->GPU
      • Support many primitives, as well as P2P communication
      • Imperative and declarative versions
      • 10-1000x faster than RPC-based collectives in specific use cases
    • We are developing more generic distributed ML systems based on them.
    • Check them out: https://github.com/ray-project/ray/tree/master/python/ray/util/collective
  26. Thank you!