
A Growing Ecosystem of Scalable ML Libraries on Ray (Amog Kamsetty, Anyscale)

The open-source Python ML ecosystem has seen rapid growth in recent years. As these libraries mature, there is increased demand for distributed execution frameworks that let programmers handle large amounts of data and coordinate computational resources. In this talk, we discuss our experiences collaborating with the open-source Python ML ecosystem as maintainers of Ray, a popular distributed execution framework. We will cover how distributed computing has shaped the way machine learning is done, and go through case studies on how three popular open source ML libraries (Horovod, HuggingFace transformers, and spaCy) benefit from Ray for distributed training.

Anyscale

July 21, 2021



Transcript

  1. Overview of Talk
     • Why distributed machine learning?
     • Distributed ML Architectures & Challenges
     • Ray Walkthrough
     • Ray and the ML Ecosystem
  2. Necessitated by 2 Trends
     1. Increasing compute requirements to train state-of-the-art ML models
     2. End of Moore’s Law. Have to scale out, not scale up.
  3. Models are Increasing in Compute Demand
     • 2018 study by OpenAI: compute requirements have doubled every 3.4 months since 2012 (35x every 18 months)
     • 300,000x increase in compute from AlexNet to AlphaGo Zero
     • https://openai.com/blog/ai-and-compute/
  4. Necessitated by 2 Trends
     1. Increasing compute requirements to train state-of-the-art ML models
     2. End of Moore’s Law. Have to scale out, not scale up.
  5. Specialized Hardware is not enough
     • Compute demand grows 35x every 18 months; Moore’s Law gives 2x every 18 months
     • CPUs, GPUs, and TPUs alone cannot close that gap
     • No way out but distributed!
  6. • Main challenges in distributed ML
     • Why Ray solves a big part of the problem
     • How the ecosystem is adopting Ray. It’s not all about training.
  7. Elastic Training and failure handling can be complex
     • Support training with cheaper spot instances
     • Handle worker failures at any point during the training process
     [Diagram: Worker 1, Worker 2, Worker 3, Worker 4]
  8. ML workloads are locality / placement sensitive
     • Colocation: specify certain processes that need to be on the same node
     • Homogeneous setup: ensure each node has the same number of workers
  9. What is Ray?
     1. A simple and powerful distributed computing toolkit
     2. An ecosystem of libraries for everything from web applications to data processing to ML/RL
     [Diagram: Ray as a universal framework for distributed computing, with native and 3rd-party library ecosystems]
  10. Three Key Benefits
      • Simple API
      • Autoscaling / elastic workload support
      • Ability to handle complex worker/task placement
  11. Ray API
      • Execute remote functions as tasks, and instantiate remote classes as actors; supports both stateful and stateless computations
      • Asynchronous execution using futures; enables parallelism
  12. API
      Functions -> Tasks:

      def read_array(file):
          # read array "a" from "file"
          return a

      def add(a, b):
          return np.add(a, b)
  13. API
      Functions -> Tasks:

      @ray.remote
      def read_array(file):
          # read array "a" from "file"
          return a

      @ray.remote
      def add(a, b):
          return np.add(a, b)
  14. API
      Functions -> Tasks:

      @ray.remote
      def read_array(file):
          # read array "a" from "file"
          return a

      @ray.remote
      def add(a, b):
          return np.add(a, b)

      id1 = read_array.remote("/input1")
      id2 = read_array.remote("/input2")
      id3 = add.remote(id1, id2)

      Classes -> Actors:
  15. API
      Functions -> Tasks:

      @ray.remote
      def read_array(file):
          # read array "a" from "file"
          return a

      @ray.remote
      def add(a, b):
          return np.add(a, b)

      id1 = read_array.remote("/input1")
      id2 = read_array.remote("/input2")
      id3 = add.remote(id1, id2)

      Classes -> Actors:

      @ray.remote
      class Counter(object):
          def __init__(self):
              self.value = 0

          def inc(self):
              self.value += 1
              return self.value
  16. API
      Functions -> Tasks:

      @ray.remote
      def read_array(file):
          # read array "a" from "file"
          return a

      @ray.remote
      def add(a, b):
          return np.add(a, b)

      id1 = read_array.remote("/input1")
      id2 = read_array.remote("/input2")
      id3 = add.remote(id1, id2)

      Classes -> Actors:

      @ray.remote
      class Counter(object):
          def __init__(self):
              self.value = 0

          def inc(self):
              self.value += 1
              return self.value

      c = Counter.remote()
      id4 = c.inc.remote()
      id5 = c.inc.remote()
      ray.get([id4, id5])
  17. Ray Placement Groups
      • Interface for custom placement of tasks and actors
      • Create bundles of resources and schedule workers on each bundle
      • Use strategies for placement of bundles: PACK = place on the same node, SPREAD = place on different nodes

      # Initialize Ray.
      import ray
      from ray.util.placement_group import placement_group

      ray.init(num_gpus=2, resources={"extra_resource": 2})

      bundle1 = {"GPU": 2}
      bundle2 = {"extra_resource": 2}

      pg = placement_group([bundle1, bundle2], strategy="PACK")
  18. Horovod
      • Open source library for fast and easy distributed training on any deep learning framework (TF, Torch, Keras, MXNet)
      • All-reduce communication protocol, excellent scaling efficiency
      • Elastic Horovod was released in 2020 to allow for dynamic scaling during training
      • It did not implement the actual operations of adding/removing nodes or making resource requests… this is where Ray comes in
  19. Horovod on Ray
      • Autoscaling on any cloud provider/orchestrator
      • Custom placement strategies, object store, resource management
      • Leverage the Ray ecosystem (data processing, tuning)
      • Supports Jupyter notebooks

      import ray
      from horovod.ray import RayExecutor, ElasticRayExecutor

      ray.init(address="auto")  # attach to the Ray cluster

      # Use the standard RayExecutor
      executor = RayExecutor(settings, use_gpu=True, num_workers=2)
      # Or use elastic training
      executor = ElasticRayExecutor(settings, use_gpu=True)

      executor.start()
      executor.run(training_fn)
  20. Horovod on Ray Adoption
      • Integrated as a backend in the Horovod repo
      • Users in the open source community, dozens of issues
      • Uber moving their deep learning workloads to Horovod on Ray
  21. Retrieval Augmented Generation (RAG)
      • New NLP architecture by Facebook AI
      • Implemented in the Hugging Face suite of NLP models
      • Leverages external documents for state-of-the-art results in knowledge-intensive tasks like Q&A
  22. Hugging Face Transformers
      • Trainer interface for Hugging Face transformer models
      • Integrates with Ray Tune for hyperparameter optimization

      trainer = Trainer(
          model_init=get_model,
          train_dataset=train_dataset,
          eval_dataset=eval_dataset)

      trainer.hyperparameter_search(
          hp_space=...,
          backend="ray")
  23. PyTorch Lightning
      • Open source library that provides a high-level interface on top of PyTorch
      • Allows developers to focus on research code and not boilerplate
      • Distributed PyTorch Lightning is not easy to deploy:
        • Have to write bash scripts & ssh into each node
        • No cluster launching or autoscaling capabilities
      • Why Ray:
        • Single Python script to launch a job
        • Integrates with Ray Tune for HPO
  24. Ray Lightning Library (https://github.com/ray-project/ray_lightning)
      PyTorch Distributed Data Parallel:

      import pytorch_lightning as pl
      from ray_lightning import RayPlugin

      # Create your PyTorch Lightning model here.
      ptl_model = MNISTClassifier(...)

      plugin = RayPlugin(num_workers=4, cpus_per_worker=1, use_gpu=True)
      trainer = pl.Trainer(..., plugins=[plugin])
      trainer.fit(ptl_model)
  25. Ray Lightning Library (https://github.com/ray-project/ray_lightning)

      import pytorch_lightning as pl
      from ray_lightning import HorovodRayPlugin

      # Create your PyTorch Lightning model here.
      ptl_model = MNISTClassifier(...)

      plugin = HorovodRayPlugin(num_hosts=2, num_slots=4, use_gpu=True)
      trainer = pl.Trainer(..., plugins=[plugin])
      trainer.fit(ptl_model)
  26. Ray Lightning Library (https://github.com/ray-project/ray_lightning)
      FairScale Sharded Distributed Data Parallel:

      import pytorch_lightning as pl
      from ray_lightning import RayShardedPlugin

      # Create your PyTorch Lightning model here.
      ptl_model = MNISTClassifier(...)

      plugin = RayShardedPlugin(num_workers=4, cpus_per_worker=1, use_gpu=True)
      trainer = pl.Trainer(..., plugins=[plugin])
      trainer.fit(ptl_model)
  27. What’s Next?
      • More support for model parallel training
      • Integrations with DeepSpeed
      • Tying Ray data processing efforts with training
      • Providing a serverless experience for distributed training on Ray
      • Research projects at UC Berkeley
      • Ray Collective Communications Library