
A Growing Ecosystem of Scalable ML Libraries on Ray (Amog Kamsetty, Anyscale)

The open-source Python ML ecosystem has seen rapid growth in recent years. As these libraries mature, there is increased demand for distributed execution frameworks that allow programmers to handle large amounts of data and coordinate computational resources. In this talk, we discuss our experiences collaborating with the open source Python ML ecosystem as maintainers of Ray, a popular distributed execution framework. We will cover how distributed computing has shaped the way machine learning is done, and go through case studies on how three popular open source ML libraries (Horovod, HuggingFace transformers, and spaCy) benefit from Ray for distributed training.

Anyscale

July 21, 2021

Transcript

  1. The Growing Ecosystem of ML Libraries on Ray (Amog Kamsetty, Anyscale)
  2. Overview of Talk

    • Why distributed machine learning?
    • Distributed ML architectures and challenges
    • Ray walkthrough
    • Ray and the ML ecosystem
  3. ML needs to go distributed.

  4. Necessitated by two trends:

    1. Increasing compute requirements to train state-of-the-art ML models.
    2. The end of Moore's Law. We have to scale out, not scale up.
  5. Models are Increasing in Compute Demand

    A 2018 study by OpenAI found that the compute required to train state-of-the-art models has doubled every 3.4 months since 2012, i.e. roughly 35x every 18 months (2^(18/3.4) ≈ 39), and a 300,000x increase in compute from AlexNet to AlphaGo Zero. https://openai.com/blog/ai-and-compute/
  6. Many More Hyperparameters to Tune

    https://openai.com/blog/ai-and-compute/
    https://towardsdatascience.com/gpt-3-the-new-mighty-language-model-from-openai-a74ff35346fc
    https://arxiv.org/abs/1907.11692

  7. Necessitated by two trends:

    1. Increasing compute requirements to train state-of-the-art ML models.
    2. The end of Moore's Law. We have to scale out, not scale up.
  8. End of Moore's Law: from 2x every 18 months to 1.05x every 18 months.
  9. Hardware Cannot Keep Up (chart: ML compute demand growing 35x every 18 months, against CPU performance). https://openai.com/blog/ai-and-compute/

  10. Hardware Cannot Keep Up (chart: 35x every 18 months versus Moore's Law, 2x every 18 months, for CPUs).
  11. Specialized Hardware Is Not Enough (chart adds GPU and TPU curves; they still trail the 35x-every-18-months demand).
  12. Specialized Hardware Is Not Enough: GPU and TPU still trail 35x every 18 months. No way out but distributed!
  13. • The main challenges in distributed ML
    • Why Ray solves a big part of the problem
    • How the ecosystem is adopting Ray. It's not all about training.
  14. Challenges with Distributed ML

  15. Cutting-edge approaches require ad-hoc distributed computation, e.g. the Retrieval Augmented Generation (RAG) model.
  16. Elastic training and failure handling can be complex:

    • Support training with cheaper spot instances
    • Handle worker failures at any point during the training process
  17. ML workloads are locality/placement sensitive:

    • Colocation: specify certain processes that need to be on the same node
    • Homogeneous setup: ensure each node has the same number of workers
  18. How does Ray simplify distributed ML?

  19. What is Ray?

    1. A simple and powerful distributed computing toolkit
    2. An ecosystem of libraries, both native and third-party, for everything from web applications to data processing to ML/RL: a universal framework for distributed computing.
  20. Three key benefits:

    • Simple API
    • Autoscaling/elastic workload support
    • Ability to handle complex worker/task placement
  21. Ray API

    • Execute remote functions as tasks, and instantiate remote classes as actors: supports both stateful and stateless computation
    • Asynchronous execution using futures: enables parallelism
  22.-26. API: Functions -> Tasks, Classes -> Actors (built up incrementally across these slides)

    Functions -> Tasks:

    @ray.remote
    def read_array(file):
        # read array "a" from "file"
        return a

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    id1 = read_array.remote("/input1")
    id2 = read_array.remote("/input2")
    id3 = add.remote(id1, id2)

    Classes -> Actors:

    @ray.remote
    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            self.value += 1
            return self.value

    c = Counter.remote()
    id4 = c.inc.remote()
    id5 = c.inc.remote()
    ray.get([id4, id5])
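    The slide code is schematic (read_array's body is elided). A minimal runnable sketch of the same task pattern, with a hypothetical make_array standing in for the file read:

    import numpy as np
    import ray

    ray.init()  # start a local Ray instance

    @ray.remote
    def make_array(n):
        # Stand-in for read_array: build an array instead of reading a file.
        return np.arange(n)

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    # .remote() returns futures immediately; the work runs in parallel.
    id1 = make_array.remote(5)
    id2 = make_array.remote(5)
    id3 = add.remote(id1, id2)  # futures can be passed directly to other tasks

    print(ray.get(id3))  # block until the result is ready: [0 2 4 6 8]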
  27. Ray Placement Groups

    Interface for custom placement of tasks and actors: create bundles of resources and schedule workers on each bundle, using strategies for the placement of bundles (PACK = place on the same node, SPREAD = place on different nodes).

    # Initialize Ray.
    import ray
    from ray.util.placement_group import placement_group

    ray.init(num_gpus=2, resources={"extra_resource": 2})

    bundle1 = {"GPU": 2}
    bundle2 = {"extra_resource": 2}

    pg = placement_group([bundle1, bundle2], strategy="PACK")
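    A hedged sketch of how a reserved placement group might then be consumed, using the Ray 1.x-era placement_group option on .options() (current Ray versions use a scheduling-strategy object instead):

    import ray
    from ray.util.placement_group import placement_group

    ray.init(num_cpus=4)

    # Two CPU bundles that must be PACKed onto the same node.
    pg = placement_group([{"CPU": 2}, {"CPU": 2}], strategy="PACK")
    ray.get(pg.ready())  # block until the bundles are reserved

    @ray.remote(num_cpus=2)
    def work():
        return "scheduled inside the placement group"

    ref = work.options(placement_group=pg).remote()  # Ray 1.x-style option
    print(ray.get(ref))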
  28. Ray architecture (diagram): the Ray Task, Actor, and Object APIs sit on top of the Ray Autoscaler.
  29. Unifying the ML ecosystem with Ray

  30. Ray unifies the distributed ML ecosystem: lower-level communicators, models/algorithms, and higher-level trainers.
  31. Ray unifies the distributed ML ecosystem: lower-level communicators, models/algorithms, and higher-level trainers.
  32. Horovod

    Open-source library for fast and easy distributed training on any deep learning framework (TF, Torch, Keras, MXNet), using an all-reduce communication protocol with excellent scaling efficiency. Elastic Horovod was released in 2020 to allow dynamic scaling during training, but it did not implement the actual operations of adding/removing nodes and making resource requests. This is where Ray comes in.
  33. Horovod on Ray

    • Autoscaling on any cloud provider/orchestrator
    • Custom placement strategies, object store, resource management
    • Leverage the Ray ecosystem (data processing, tuning)
    • Jupyter Notebook support

    import ray
    from horovod.ray import RayExecutor, ElasticRayExecutor

    ray.init(address="auto")  # attach to the Ray cluster

    # Use the standard RayExecutor...
    executor = RayExecutor(settings, use_gpu=True, num_workers=2)
    # ...or use elastic training.
    executor = ElasticRayExecutor(settings, use_gpu=True)

    executor.start()
    executor.run(training_fn)
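    The training_fn passed to the executor is user-defined. A hedged sketch of what it might contain, using standard Horovod PyTorch calls (the model and hyperparameters here are hypothetical, not from the talk):

    import horovod.torch as hvd
    import torch

    def training_fn():
        hvd.init()  # one Horovod worker per Ray actor
        model = torch.nn.Linear(10, 1)  # hypothetical toy model
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
        # Wrap the optimizer so gradients are all-reduced across workers.
        optimizer = hvd.DistributedOptimizer(
            optimizer, named_parameters=model.named_parameters())
        # Start all workers from the same weights.
        hvd.broadcast_parameters(model.state_dict(), root_rank=0)
        for _ in range(10):
            optimizer.zero_grad()
            loss = model(torch.randn(4, 10)).sum()
            loss.backward()
            optimizer.step()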
  34. Horovod on Ray Architecture (diagram: Workers 1-4)
  35. Horovod on Ray: Adoption

    • Integrated as a backend in the Horovod repo
    • Users in the open-source community; dozens of issues
    • Uber is moving their deep learning workloads to Horovod on Ray
  36. Ray unifies the distributed ML ecosystem: lower-level communicators, models/algorithms, and higher-level trainers.
  37. Retrieval Augmented Generation (RAG)

    A new NLP architecture by Facebook AI, implemented in the Hugging Face suite of NLP models. It leverages external documents for state-of-the-art results on knowledge-intensive tasks like Q&A.
  38. Document Retrieval with Torch Distributed

  39. Document Retrieval with Ray

  40. Ray unifies the distributed ML ecosystem: lower-level communicators, models/algorithms, and higher-level trainers.
  41. Hugging Face Transformers

    Trainer interface for Hugging Face transformer models; integrates with Ray Tune for hyperparameter optimization.

    trainer = Trainer(
        model_init=get_model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset)

    trainer.hyperparameter_search(
        hp_space=...,
        backend="ray")
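    The elided hp_space is a function that returns a Ray Tune search space. A hedged sketch (the specific hyperparameters and ranges are illustrative, not from the talk):

    from ray import tune

    def hp_space(trial):
        # Keys must correspond to TrainingArguments fields.
        return {
            "learning_rate": tune.loguniform(1e-5, 5e-4),
            "per_device_train_batch_size": tune.choice([8, 16, 32]),
            "num_train_epochs": tune.choice([2, 3, 4]),
        }

    best_run = trainer.hyperparameter_search(
        hp_space=hp_space, backend="ray", n_trials=8)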
  42. PyTorch Lightning

    Open-source library that provides a high-level interface on PyTorch, letting developers focus on research code rather than boilerplate. Distributed PyTorch Lightning is not easy to deploy:

    • You have to write bash scripts and ssh into each node
    • No cluster launching or autoscaling capabilities

    Why Ray:

    • A single Python script launches the job
    • Integrates with Ray Tune for HPO
  43. Ray Lightning Library (https://github.com/ray-project/ray_lightning)

    PyTorch Distributed Data Parallel:

    import pytorch_lightning as pl
    from ray_lightning import RayPlugin

    # Create your PyTorch Lightning model here.
    ptl_model = MNISTClassifier(...)
    plugin = RayPlugin(num_workers=4, cpus_per_worker=1, use_gpu=True)

    trainer = pl.Trainer(..., plugins=[plugin])
    trainer.fit(ptl_model)
  44. Ray Lightning Library (https://github.com/ray-project/ray_lightning)

    Horovod:

    import pytorch_lightning as pl
    from ray_lightning import HorovodRayPlugin

    # Create your PyTorch Lightning model here.
    ptl_model = MNISTClassifier(...)
    plugin = HorovodRayPlugin(num_hosts=2, num_slots=4, use_gpu=True)

    trainer = pl.Trainer(..., plugins=[plugin])
    trainer.fit(ptl_model)
  45. Ray Lightning Library (https://github.com/ray-project/ray_lightning)

    FairScale Sharded Distributed Data Parallel:

    import pytorch_lightning as pl
    from ray_lightning import RayShardedPlugin

    # Create your PyTorch Lightning model here.
    ptl_model = MNISTClassifier(...)
    plugin = RayShardedPlugin(num_workers=4, cpus_per_worker=1, use_gpu=True)

    trainer = pl.Trainer(..., plugins=[plugin])
    trainer.fit(ptl_model)
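    All three plugins share the same shape: construct a plugin, hand it to pl.Trainer, and call fit. A self-contained, CPU-only sketch with a hypothetical toy model in place of MNISTClassifier:

    import pytorch_lightning as pl
    import ray
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from ray_lightning import RayPlugin

    ray.init()  # start a local Ray instance

    class TinyModel(pl.LightningModule):
        # Hypothetical stand-in for MNISTClassifier.
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.mse_loss(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)

    data = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)),
                      batch_size=8)
    plugin = RayPlugin(num_workers=2, use_gpu=False)
    trainer = pl.Trainer(max_epochs=1, plugins=[plugin])
    trainer.fit(TinyModel(), data)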
  46. Ray Lightning Architecture (diagram: Workers 1-4)
  47. What's Next?

    • More support for model-parallel training
    • Integrations with DeepSpeed
    • Tying Ray's data processing efforts to training
    • Providing a serverless experience for distributed training on Ray
    • Research projects at UC Berkeley
    • Ray Collective Communications Library
  48. Connect with Us

    GitHub:
    • Ray: https://github.com/ray-project/ray
    • Horovod: https://github.com/horovod/horovod
    • Ray Lightning: https://github.com/ray-project/ray_lightning
    • Hugging Face Transformers: https://github.com/huggingface/transformers

    Join the Ray Discussion Forum: https://discuss.ray.io/
  49. Thank You