A Growing Ecosystem of Scalable ML Libraries on Ray (Amog Kamsetty, Anyscale)

The open-source Python ML ecosystem has seen rapid growth in recent years. As these libraries mature, there is increased demand for distributed execution frameworks that allow programmers to handle large amounts of data and coordinate computational resources. In this talk, we discuss our experiences collaborating with the open-source Python ML ecosystem as maintainers of Ray, a popular distributed execution framework. We will cover how distributed computing has shaped the way machine learning is done, and go through case studies on how three popular open-source ML libraries (Horovod, HuggingFace transformers, and spaCy) benefit from Ray for distributed training.

Anyscale

July 21, 2021

Transcript

  1. The Growing Ecosystem of ML Libraries on Ray
    Amog Kamsetty, Anyscale

  2. Overview of Talk
    ● Why distributed machine learning?
    ● Distributed ML Architectures & Challenges
    ● Ray Walkthrough
    ● Ray and the ML Ecosystem

  3. ML needs to go distributed.

  4. Necessitated by 2 Trends
    1. Increasing compute requirements to train state-of-the-art ML models
    2. End of Moore’s Law. Have to scale out, not scale up.

  5. Models are Increasing in Compute Demand
    2018 study by OpenAI (https://openai.com/blog/ai-and-compute/):
    ● Compute requirement doubling every 3.4 months since 2012 (35x every 18 months)
    ● 300,000x increase in compute from AlexNet to AlphaGo Zero

  6. Many more Hyperparameters to Tune
    https://openai.com/blog/ai-and-compute/
    https://towardsdatascience.com/gpt-3-the-new-mighty-language-model-from-openai-a74ff35346fc
    https://arxiv.org/abs/1907.11692

  7. Necessitated by 2 Trends
    1. Increasing compute requirements to train state-of-the-art ML models
    2. End of Moore’s Law. Have to scale out, not scale up.

  8. End of Moore’s Law
    From 2x every 18 months to 1.05x every 18 months.

  9. Hardware Cannot Keep Up
    [Chart from https://openai.com/blog/ai-and-compute/: ML compute demand growing 35x every 18 months vs. CPU performance]

  10. Hardware Cannot Keep Up
    [Chart: ML compute demand (35x every 18 months) vs. Moore’s Law (2x every 18 months) for CPUs]

  11. Specialized Hardware is not enough
    [Chart: ML compute demand (35x every 18 months) vs. Moore’s Law (2x every 18 months), with CPU, GPU, and TPU curves]

  12. Specialized Hardware is not enough
    [Same chart as the previous slide]
    No way out but distributed!

  13. It’s not all about training.
    ● Main challenges in distributed ML
    ● Why Ray solves a big part of the problem
    ● How the ecosystem is adopting Ray

  14. Challenges with Distributed ML

  15. Cutting edge approaches require ad-hoc distributed computation
    Retrieval Augmented Generation (RAG) Model

  16. Elastic Training and failure handling can be complex
    • Support training with cheaper spot instances
    • Handle worker failures at any point during the training process
    [Diagram: Worker 1, Worker 2, Worker 3, Worker 4]

  17. ML workloads are locality / placement sensitive.
    • Colocation: specify certain processes that need to be on the same node
    • Homogeneous setup: ensure each node has the same number of workers

  18. How does Ray simplify distributed ML?

  19. What is Ray?
    1. A simple and powerful distributed computing toolkit
    2. An ecosystem of libraries for everything from web applications to data processing to ML/RL
    [Diagram: Ray as a universal framework for distributed computing, with an ecosystem of native libraries and 3rd-party libraries on top]

  20. Three Key Benefits
    • Simple API
    • Autoscaling / elastic workload support
    • Ability to handle complex worker/task placement

  21. Ray API
    Execute remote functions as tasks, and instantiate remote classes as actors
    • Support both stateful and stateless computations
    Asynchronous execution using futures
    • Enable parallelism

  22. API
    Functions -> Tasks

    def read_array(file):
        # read array "a" from "file"
        return a

    def add(a, b):
        return np.add(a, b)

  23. API
    Functions -> Tasks

    @ray.remote
    def read_array(file):
        # read array "a" from "file"
        return a

    @ray.remote
    def add(a, b):
        return np.add(a, b)

  24. API
    Functions -> Tasks

    @ray.remote
    def read_array(file):
        # read array "a" from "file"
        return a

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    id1 = read_array.remote("/input1")
    id2 = read_array.remote("/input2")
    id3 = add.remote(id1, id2)

    Classes -> Actors

  25. API
    Functions -> Tasks

    @ray.remote
    def read_array(file):
        # read array "a" from "file"
        return a

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    id1 = read_array.remote("/input1")
    id2 = read_array.remote("/input2")
    id3 = add.remote(id1, id2)

    Classes -> Actors

    @ray.remote
    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            self.value += 1
            return self.value

  26. API
    Functions -> Tasks

    @ray.remote
    def read_array(file):
        # read array "a" from "file"
        return a

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    id1 = read_array.remote("/input1")
    id2 = read_array.remote("/input2")
    id3 = add.remote(id1, id2)

    Classes -> Actors

    @ray.remote
    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            self.value += 1
            return self.value

    c = Counter.remote()
    id4 = c.inc.remote()
    id5 = c.inc.remote()
    ray.get([id4, id5])
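
    For readers following along outside the talk, a minimal end-to-end version of the task API above, runnable as a single script; the file arguments and array contents are placeholders, since the slide elides the body of read_array.

    import numpy as np
    import ray

    ray.init()  # start Ray locally

    @ray.remote
    def read_array(file):
        # Placeholder: pretend each file holds a small array.
        return np.arange(3)

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    # .remote() returns object refs (futures) immediately; the work runs in parallel.
    id1 = read_array.remote("/input1")
    id2 = read_array.remote("/input2")
    id3 = add.remote(id1, id2)  # Ray resolves the upstream refs before calling add

    print(ray.get(id3))  # block until the result is ready -> [0 2 4]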

  27. Ray Placement Groups
    Interface for custom placement of tasks and actors
    • Create bundles of resources and schedule workers on each bundle
    • Use strategies for placement of bundles
      PACK = place on the same node
      SPREAD = place on different nodes

    # Initialize Ray.
    import ray
    from ray.util.placement_group import placement_group

    ray.init(num_gpus=2, resources={"extra_resource": 2})

    bundle1 = {"GPU": 2}
    bundle2 = {"extra_resource": 2}
    pg = placement_group([bundle1, bundle2], strategy="PACK")
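
    The slide stops at creating the placement group. As a hedged sketch of how it might be used, a worker actor can be scheduled onto one of its bundles; the Trainer actor below is hypothetical, and the keyword arguments follow the Ray 1.x-era .options() API current at the time of this talk.

    @ray.remote(num_gpus=2)
    class Trainer:
        def ready(self):
            return True

    # Wait until the cluster has reserved both bundles, then place the actor
    # into the first bundle (bundle1, which holds the 2 GPUs).
    ray.get(pg.ready())
    trainer = Trainer.options(
        placement_group=pg,
        placement_group_bundle_index=0,
    ).remote()
    ray.get(trainer.ready.remote())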

  28. Ray
    [Diagram: Ray Task, Actor, and Object APIs together with the Ray Autoscaler]

  29. Unifying the ML ecosystem with Ray

  30. Ray unifies the distributed ML Ecosystem
    [Diagram: lower-level communicators, models/algorithms, and higher-level trainers, all running on Ray]

  31. Ray unifies the distributed ML Ecosystem
    [Same diagram as the previous slide]

  32. Horovod
    Open source library for fast and easy distributed training on any deep learning framework (TF, Torch, Keras, MXNet)
    All-reduce communication protocol with excellent scaling efficiency
    Elastic Horovod was released in 2020 to allow for dynamic scaling during training
    It did not implement the actual operation of adding/removing nodes or making resource requests… this is where Ray comes in

  33. Horovod on Ray
    • Autoscaling on any cloud provider/orchestrator
    • Custom placement strategies, object store, resource management
    • Leverage the Ray ecosystem (data processing, tuning)
    • Support for Jupyter notebooks

    import ray
    from horovod.ray import RayExecutor, ElasticRayExecutor

    ray.init(address="auto")  # attach to the Ray cluster

    # Use the standard RayExecutor
    executor = RayExecutor(settings, use_gpu=True, num_workers=2)

    # Or use elastic training
    executor = ElasticRayExecutor(settings, use_gpu=True)

    executor.start()
    executor.run(training_fn)
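
    The slide leaves settings and training_fn undefined; a minimal sketch of how they might be defined (the training body below is a placeholder, not from the talk):

    import horovod.torch as hvd
    from horovod.ray import RayExecutor

    # Settings controlling how the Ray workers are started (timeouts, etc.).
    settings = RayExecutor.create_settings(timeout_s=30)

    def training_fn():
        # Runs on every Ray worker; Horovod coordinates the all-reduce.
        hvd.init()
        print(f"Horovod worker {hvd.rank()} of {hvd.size()}")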

  34. Horovod on Ray Architecture
    [Architecture diagram: Worker 1, Worker 2, Worker 3, Worker 4]

  35. Horovod on Ray Adoption
    • Integrated as a backend in the Horovod repo
    • Users in the open source community, dozens of issues
    • Uber is moving their deep learning workloads to Horovod on Ray

  36. Ray unifies the distributed ML Ecosystem
    [Same diagram as before]

  37. Retrieval Augmented Generation (RAG)
    New NLP architecture by Facebook AI
    Implemented in the Huggingface suite of NLP models
    Leverages external documents for state-of-the-art results in knowledge-intensive tasks like Q&A

  38. Document Retrieval with Torch Distributed

  39. Document Retrieval with Ray

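    The diagrams for these two slides are not in the transcript. As a rough illustration of the pattern described in the talk, document retrieval can be hosted in a shared Ray actor that training workers query, rather than being coordinated through torch.distributed; this sketch is purely illustrative (the DocumentRetriever class and its toy scoring are hypothetical, not the actual Huggingface RAG integration).

    import ray

    ray.init()

    @ray.remote
    class DocumentRetriever:
        """Holds the document index once, shared by all training workers."""
        def __init__(self, passages):
            self.passages = passages

        def retrieve(self, query, k=2):
            # Toy scoring by word overlap; real RAG uses a dense FAISS index.
            scored = sorted(self.passages,
                            key=lambda p: -sum(w in p for w in query.split()))
            return scored[:k]

    retriever = DocumentRetriever.remote([
        "ray is a distributed execution framework",
        "rag retrieves documents to condition generation",
    ])

    @ray.remote
    def training_step(query, retriever):
        # Each training worker calls the shared retriever concurrently.
        return ray.get(retriever.retrieve.remote(query))

    print(ray.get(training_step.remote("what does rag retrieve", retriever)))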

  40. Ray unifies the distributed ML Ecosystem
    [Same diagram as before]

  41. Huggingface Transformers
    Trainer interface for Huggingface transformer models
    Integrates with Ray Tune for hyperparameter optimization

    trainer = Trainer(
        model_init=get_model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset)

    trainer.hyperparameter_search(
        hp_space=...,
        backend="ray")
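
    The hp_space argument is elided on the slide. With the Ray backend it can be a callable returning Ray Tune search spaces; a minimal sketch, with placeholder ranges:

    from ray import tune

    def hp_space(trial):
        # Ray Tune search spaces; the trial argument is unused with backend="ray".
        return {
            "learning_rate": tune.loguniform(1e-6, 1e-4),
            "per_device_train_batch_size": tune.choice([8, 16, 32]),
        }

    best_run = trainer.hyperparameter_search(
        hp_space=hp_space,
        backend="ray",
        n_trials=10,  # number of Tune trials
    )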

  42. PyTorch Lightning
    Open source library that provides a high-level interface on PyTorch
    Allows developers to focus on research code rather than boilerplate
    Distributed PyTorch Lightning is not easy to deploy:
    • Have to write bash scripts & ssh into each node
    • No cluster launching or autoscaling capabilities
    Why Ray:
    • Single Python script to launch a job
    • Integrates with Ray Tune for HPO

  43. Ray Lightning Library
    https://github.com/ray-project/ray_lightning
    PyTorch Distributed Data Parallel:

    import pytorch_lightning as pl
    from ray_lightning import RayPlugin

    # Create your PyTorch Lightning model here.
    ptl_model = MNISTClassifier(...)

    plugin = RayPlugin(num_workers=4, cpus_per_worker=1, use_gpu=True)
    trainer = pl.Trainer(..., plugins=[plugin])
    trainer.fit(ptl_model)

  44. Ray Lightning Library
    https://github.com/ray-project/ray_lightning

    import pytorch_lightning as pl
    from ray_lightning import HorovodRayPlugin

    # Create your PyTorch Lightning model here.
    ptl_model = MNISTClassifier(...)

    plugin = HorovodRayPlugin(num_hosts=2, num_slots=4, use_gpu=True)
    trainer = pl.Trainer(..., plugins=[plugin])
    trainer.fit(ptl_model)

  45. Ray Lightning Library
    https://github.com/ray-project/ray_lightning
    Fairscale Sharded Distributed Data Parallel:

    import pytorch_lightning as pl
    from ray_lightning import RayShardedPlugin

    # Create your PyTorch Lightning model here.
    ptl_model = MNISTClassifier(...)

    plugin = RayShardedPlugin(num_workers=4, cpus_per_worker=1, use_gpu=True)
    trainer = pl.Trainer(..., plugins=[plugin])
    trainer.fit(ptl_model)

  46. Ray Lightning Architecture
    [Architecture diagram: Worker 1, Worker 2, Worker 3, Worker 4]

  47. What’s Next?
    • More support for model parallel training
    • Integrations with DeepSpeed
    • Tying Ray data processing efforts with training
    • Providing a serverless experience for distributed training on Ray
    • Research projects at UC Berkeley
    • Ray Collective Communications Library

  48. Connect with us
    GitHub:
    Ray: https://github.com/ray-project/ray
    Horovod: https://github.com/horovod/horovod
    Ray Lightning: https://github.com/ray-project/ray_lightning
    Huggingface transformers: https://github.com/huggingface/transformers
    Join the Ray Discussion Forum: https://discuss.ray.io/

  49. Thank You