Collective-on-Ray: High-performance Collective Communication for Distributed Machine Learning on Ray (Hao Zhang, UC Berkeley)


Distributed Communication
Point-to-point communication between two processes

Ray Simplifies P2P Communication A Lot!

Collective Communication Patterns
[Figure: GPU Workers 0-3 each derive a local gradient, then apply a coordinated update.]
Derive gradients (shown for worker 0): $\nabla\theta_0^t = \nabla L(\theta^t, D_0^t)$
Coordinated updates: $\theta^{t+1} \leftarrow \theta^t + \sum_{i=0}^{3} \nabla\theta_i^t$
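
The coordinated update above can be sketched without any GPU or Ray dependency. This is a minimal illustration, not the talk's code: plain Python lists stand in for GPU arrays, the per-worker gradient values are made up, and `allreduce_sum` is a hypothetical helper that computes the sum in one place rather than over a network.

```python
def allreduce_sum(per_worker_grads):
    """Return the elementwise sum of all gradients, replicated once per worker."""
    total = [sum(values) for values in zip(*per_worker_grads)]
    return [list(total) for _ in per_worker_grads]


# Every worker starts from the same parameters theta^t.
theta = [0.0, 0.0, 0.0]

# Made-up local gradients: worker i holds (i + 1, i + 1, i + 1).
grads = [[float(rank + 1)] * 3 for rank in range(4)]

# After the allreduce, every worker sees the same summed gradient ...
reduced = allreduce_sum(grads)

# ... so every worker applies the same update theta^{t+1} = theta^t + sum_i grad_i.
new_thetas = [[t + g for t, g in zip(theta, summed)] for summed in reduced]
```

Because each worker receives an identical sum, the replicas of the parameters stay in sync without a separate broadcast of the updated values.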

Assembling Collective Using P2P?
[Figure: GPU Workers 0-3 with gradients labelled $\nabla\theta_0^t$.]
Assemble a collective pattern using multiple P2Ps
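
One way to assemble an allreduce from point-to-point messages is to reduce everything onto worker 0 and then send the result back. The sketch below simulates that in a single process. `Mailbox` and `p2p_allreduce` are hypothetical names introduced for illustration, and the in-memory queues are only a stand-in for a real transport.

```python
from collections import defaultdict, deque


class Mailbox:
    """In-process stand-in for a P2P transport; counts the messages sent."""

    def __init__(self):
        self.queues = defaultdict(deque)
        self.sent = 0

    def send(self, src, dst, payload):
        self.queues[(src, dst)].append(list(payload))
        self.sent += 1

    def recv(self, src, dst):
        return self.queues[(src, dst)].popleft()


def p2p_allreduce(buffers, box):
    n = len(buffers)
    # Phase 1: every non-root worker sends its buffer to worker 0, which sums them.
    for rank in range(1, n):
        box.send(rank, 0, buffers[rank])
    total = list(buffers[0])
    for rank in range(1, n):
        incoming = box.recv(rank, 0)
        total = [a + b for a, b in zip(total, incoming)]
    # Phase 2: worker 0 sends the sum back to every other worker.
    for rank in range(1, n):
        box.send(0, rank, total)
    return [total if rank == 0 else box.recv(0, rank) for rank in range(n)]


box = Mailbox()
result = p2p_allreduce([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], box)
```

With n workers this takes 2(n - 1) messages, and all of them pass through worker 0, which is why the pattern is simple to write but concentrates traffic on one process.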

A More Complex Case: Ring AllReduce
[Figure from Sergeev et al. 2018]
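
The ring pattern can be simulated in plain Python to show why it is harder to assemble by hand. This is a sketch of the commonly described two-phase ring algorithm (scatter-reduce, then allgather), not a description of what any particular backend does internally. It assumes the buffer length is divisible by the number of workers, and `ring_allreduce` is a hypothetical helper name.

```python
def ring_allreduce(buffers):
    n = len(buffers)
    size = len(buffers[0]) // n  # assumes len(buffer) is divisible by n
    # chunks[r][c] is chunk c of worker r's buffer.
    chunks = [[list(buf[c * size:(c + 1) * size]) for c in range(n)]
              for buf in buffers]

    # Phase 1, scatter-reduce: at each step worker r sends chunk (r - step) % n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        outgoing = [(r, (r - step) % n) for r in range(n)]
        payloads = [list(chunks[r][c]) for r, c in outgoing]  # snapshot: sends are simultaneous
        for (src, c), data in zip(outgoing, payloads):
            dst = (src + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], data)]

    # Worker r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2, allgather: pass the finished chunks around the ring.
    for step in range(n - 1):
        outgoing = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [list(chunks[r][c]) for r, c in outgoing]
        for (src, c), data in zip(outgoing, payloads):
            chunks[(src + 1) % n][c] = data

    return [[x for chunk in worker for x in chunk] for worker in chunks]


buffers = [[float(r * 4 + j) for j in range(4)] for r in range(4)]
result = ring_allreduce(buffers)
```

Each worker sends 2(n - 1) chunks of size 1/n of the buffer and talks only to its neighbour, so no single worker becomes a bottleneck. Getting the chunk indices right at every step is the error-prone part.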

Introducing Collective Primitives to Ray
[Figure: GPU Workers 0-3 with gradients labelled $\nabla\theta_0^t$.]

Pre-cap: Advantages
Programmatically simpler
• Specialized APIs for each collective communication pattern
• Less error-prone than assembling them from scratch
Better performance
• They are optimized for different computing devices, e.g., GPUs or CPUs.
• They make the best use of the network hardware (Ethernet, InfiniBand, etc.) to maximize communication performance.

Walk-through Example: Data-parallel Training
[Right figure from the NCCL documentation]

Training Workers as GPU Actors

    import ray
    import cupy

    @ray.remote(num_gpus=1)
    class GPUWorker:
        def __init__(self):
            # A small GPU array stands in for this worker's gradients.
            self.gradients = cupy.ones((10,), dtype=cupy.float32)

    num_workers = 16
    workers = [GPUWorker.remote() for i in range(num_workers)]

Assembling AllReduce Using ray.get()

    # Allreduce the gradients using Ray APIs.
    # Let all workers put their gradients into the Ray object store.
    gradient_ids = [worker.put_gradients.remote() for worker in workers]
    ray.wait(gradient_ids, num_returns=len(gradient_ids), timeout=None)

    # Let worker 0 reduce the gradients.
    reduced_id_ref = workers[0].reduce_gradients.remote(gradient_ids)

    # All other workers get the reduced gradients.
    results = []
    for i, worker in enumerate(workers):
        results.append(worker.get_reduced_gradient.remote([reduced_id_ref]))
    ray.get(results)

Assembling AllReduce Using ray.get()

    @ray.remote(num_gpus=1)
    class GPUWorker:
        def __init__(self):
            self.gradients = cupy.ones((10,), dtype=cupy.float32)

        def put_gradients(self):
            # Copy this worker's gradients into the object store.
            return ray.put(self.gradients)

        def reduce_gradients(self, grad_id_refs):
            grad_ids = ray.get(grad_id_refs)
            # Start from zeros so the result is exactly the sum of the gradients.
            reduced_result = cupy.zeros((10,), dtype=cupy.float32)
            for grad_id in grad_ids:
                array = ray.get(grad_id)
                reduced_result += array
            result_id = ray.put(reduced_result)
            return result_id

        def get_reduced_gradient(self, reduced_gradient_id_ref):
            reduced_gradient_id = ray.get(reduced_gradient_id_ref)
            reduced_gradient = ray.get(reduced_gradient_id)
            # do whatever with the reduced gradients
            return True


Overheads
[Figure: GPU arrays in Process 1 and GPU arrays in Process 2, with the Ray Object Store between them.]

Using Ray Collective Primitive APIs

    import cupy
    import ray
    import ray.util.collective as col

    @ray.remote(num_gpus=1)
    class GPUWorker:
        def __init__(self):
            self.gradients = cupy.ones((10,), dtype=cupy.float32)

        def setup(self, world_size, rank):
            # Every worker joins the same collective group with its own rank.
            col.init_collective_group(
                world_size=world_size,
                rank=rank,
                backend="nccl")

        def allreduce(self):
            # The allreduce updates self.gradients in place.
            col.allreduce(self.gradients)
            return self.gradients

    # `workers` is a list of 16 GPUWorker actors.
    setup_rets = ray.get([w.setup.remote(16, i) for i, w in enumerate(workers)])
    results = ray.get([w.allreduce.remote() for w in workers])

Behind the Scenes: CCL Backends
[Figure: layered architecture, top to bottom]
Ray Collective APIs
CollectiveGroup
Python: PyGloo Group and APIs, NCCL Group and APIs, TensorPipe APIs
pybind (one binding per backend)
C++ backend: Gloo, NCCL, TensorPipe

Microbenchmark #1
A node with 2 GPUs; each worker is spawned on 1 GPU. NVLink is enabled (Y-axis is log-scale).

Microbenchmark #2
A cluster with 7 nodes, each node with 2 GPUs; each worker is spawned on 1 GPU (hence 14 workers in total; Y-axis is log-scale).

Case: spaCy NER + Parameter Server

# workers     spacy-ray        spacy-ray w/ collective backend   speedup
1 worker      137.5 ± 2.1      116.7 ± 2.51                      1.18x
2 workers     354.1 ± 16.8     171.1 ± 1.11                      2.07x
4 workers     523.9 ± 10.4     179.6 ± 2.91                      2.92x
8 workers     710.1 ± 3.0      205.8 ± 1.20                      3.45x
16 workers    1296.1 ± 42.1    248.3 ± 3.63                      5.22x

Check it out at: https://github.com/explosion/spacy-ray
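
The speedup column can be reproduced from the two measurement columns. The slide does not state the unit of the measurements; the sketch below only assumes that lower is better, which is consistent with speedup being the baseline value divided by the collective-backend value.

```python
# (baseline spacy-ray, spacy-ray with collective backend), means only, from the table.
measurements = {
    1: (137.5, 116.7),
    2: (354.1, 171.1),
    4: (523.9, 179.6),
    8: (710.1, 205.8),
    16: (1296.1, 248.3),
}

# Speedup = baseline / collective, rounded to two decimals as on the slide.
speedups = {n: round(base / coll, 2) for n, (base, coll) in measurements.items()}

for n, s in speedups.items():
    print(f"{n:>2} workers: {s:.2f}x")
```

The computed ratios match the slide's speedup column, and the gap widens as workers are added: the baseline grows by roughly 9.4x from 1 to 16 workers, while the collective-backend figure grows by roughly 2.1x.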

Supporting Matrix

Backend           GLOO           NCCL
Device            CPU     GPU    CPU     GPU
send              ✓       ✘      ✘       ✓
recv              ✓       ✘      ✘       ✓
broadcast         ✓       ✘      ✘       ✓
all_reduce        ✓       ✘      ✘       ✓
reduce            ✓       ✘      ✘       ✓
all_gather        ✓       ✘      ✘       ✓
gather            ✓       ✘      ✘       ✘
scatter           WIP     ✘      ✘       ✘
reduce_scatter    ✓       ✘      ✘       ✓
all_to_all        ✘       ✘      ✘       ✓
barrier           ✓       ✘      ✘       ✓

More APIs: Faster P2P

    @ray.remote(num_gpus=1)
    class Worker:
        def __init__(self):
            self.buffer = cupy.ones((10,), dtype=cupy.float32)

        def get_buffer(self):
            return self.buffer

        def do_send(self, target_rank=0):
            # this call is blocking
            col.send(self.buffer, target_rank)

        def do_recv(self, src_rank=0):
            # this call is blocking
            col.recv(self.buffer, src_rank)

    # Create two actors
    A = Worker.remote()
    B = Worker.remote()

    # Point-to-point communication with NCCL backend.
    # Both actors must already belong to the same collective group.
    ray.get([A.do_send.remote(target_rank=1), B.do_recv.remote(src_rank=0)])

Limitations
• Out-of-band communication
  – No object store involved, hence fewer guarantees
• Manually manage the process groups
• Risk of deadlocks
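
The deadlock risk comes from blocking calls that are issued in an order the peer cannot match. The sketch below models that with a tiny checker. It assumes strict rendezvous semantics, where a blocking send completes only when the peer is sitting at the matching recv; real backends may buffer small messages, so this is a simplified model. `runs_to_completion` is a hypothetical helper, not part of any library.

```python
def runs_to_completion(programs):
    """programs[rank] is a list of ("send", peer) or ("recv", peer) operations.

    Returns True if every rank can finish under rendezvous semantics,
    False if some ranks end up waiting on each other forever.
    """
    pc = [0] * len(programs)  # index of the next operation for each rank
    progressed = True
    while progressed:
        progressed = False
        for rank, ops in enumerate(programs):
            if pc[rank] >= len(ops):
                continue  # this rank has finished
            kind, peer = ops[pc[rank]]
            if kind != "send" or pc[peer] >= len(programs[peer]):
                continue
            # A send completes only if the peer is blocked on the matching recv.
            if programs[peer][pc[peer]] == ("recv", rank):
                pc[rank] += 1
                pc[peer] += 1
                progressed = True
    return all(pc[rank] == len(ops) for rank, ops in enumerate(programs))


# Rank 0 sends then receives; rank 1 receives then sends: the orders match.
ok = runs_to_completion([
    [("send", 1), ("recv", 1)],
    [("recv", 0), ("send", 0)],
])

# Both ranks send first: each waits for a recv the other never reaches.
stuck = runs_to_completion([
    [("send", 1), ("recv", 1)],
    [("send", 0), ("recv", 0)],
])
```

When communication goes through the object store, Ray tracks the dependencies between tasks. With out-of-band collectives, matching the call order across actors is the programmer's job.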

More APIs: Declarative Collectives

    @ray.remote
    class Worker:
        def __init__(self):
            self.buffer = cupy.ones((10,), dtype=cupy.float32)

        def get_buffer(self):
            return self.buffer

    # Create two actors and create a collective group
    A = Worker.remote()
    B = Worker.remote()
    # The remaining options are elided on the slide.
    col.declare_collective_group([A, B], options={rank=[0, 1], ...})

    # Specify a collective allreduce "completely" instead of "partially" on each actor
    col.allreduce_refs([A.get_buffer.remote(), B.get_buffer.remote()])

Future Steps: Distributed ML on Ray
[Figure: stack diagram with the layers Ray Collectives/Ray, Distributed ML Training Strategies, User ML Code, and Auto-parallel compilation/optimization.]

Take-aways
• Collective communication is a cornerstone of distributed ML
• We provide a set of CCL-library-backed collective primitive APIs on Ray
• In pure Python
• Support both CPU and GPU arrays, and transfers between CPU<->CPU, GPU<->GPU, CPU<->GPU
• Support many primitives, as well as P2P communication
• Imperative and declarative versions
• 10-1000x faster than RPC-based collectives in specific use cases
• We are developing more generic distributed ML systems based on them.
• Check them out: https://github.com/ray-project/ray/tree/master/python/ray/util/collective

Thank you!
