A Growing Ecosystem of Scalable ML Libraries on Ray (Amog Kamsetty, Anyscale)

The open-source Python ML ecosystem has seen rapid growth in recent years. As these libraries mature, there is increased demand for distributed execution frameworks that allow programmers to handle large amounts of data and coordinate computational resources. In this talk, we discuss our experiences collaborating with the open-source Python ML ecosystem as maintainers of Ray, a popular distributed execution framework. We will cover how distributed computing has shaped the way machine learning is done, and go through case studies on how three popular open-source ML libraries (Horovod, HuggingFace transformers, and spaCy) benefit from Ray for distributed training.

Anyscale

July 21, 2021

Transcript

  1. The Growing Ecosystem of ML Libraries on Ray
    Amog Kamsetty, Anyscale

  2. Overview of Talk
    ● Why distributed machine learning?
    ● Distributed ML Architectures & Challenges
    ● Ray Walkthrough
    ● Ray and the ML Ecosystem

  3. ML needs to go distributed.

  4. Necessitated by 2 Trends
    1. Increasing compute requirements to train state-of-the-art ML models
    2. End of Moore’s Law. Have to scale out, not scale up.

  5. Models are Increasing in Compute Demand
    2018 study by OpenAI (https://openai.com/blog/ai-and-compute/):
    ● Compute requirement doubling every 3.4 months since 2012 (35x every 18 months)
    ● 300,000x increase in compute from AlexNet to AlphaGo Zero

  6. Many more Hyperparameters to Tune
    https://openai.com/blog/ai-and-compute/
    https://towardsdatascience.com/gpt-3-the-new-mighty-language-model-from-openai-a74ff35346fc
    https://arxiv.org/abs/1907.11692

  7. Necessitated by 2 Trends
    1. Increasing compute requirements to train state-of-the-art ML models
    2. End of Moore’s Law. Have to scale out, not scale up.

  8. End of Moore’s Law
    From 2x every 18 months to 1.05x every 18 months.

  9. Hardware Cannot Keep Up
    [Chart from https://openai.com/blog/ai-and-compute/: ML compute demand growing 35x every 18 months vs. CPU performance]

  10. Hardware Cannot Keep Up
    [Chart: ML compute demand (35x every 18 months) vs. Moore’s Law (2x every 18 months) for CPUs]

  11. Specialized Hardware is not enough
    [Chart: ML compute demand (35x every 18 months) vs. Moore’s Law (2x every 18 months), with CPU, GPU, and TPU curves]

  12. Specialized Hardware is not enough
    [Same chart as the previous slide]
    No way out but distributed!

  13. It’s not all about training.
    ● Main challenges in distributed ML
    ● Why Ray solves a big part of the problem
    ● How the ecosystem is adopting Ray

  14. Challenges with Distributed ML

  15. Cutting edge approaches require ad-hoc distributed computation
    Retrieval Augmented Generation (RAG) Model

  16. Elastic Training and failure handling can be complex
    • Support training with cheaper spot instances
    • Handle worker failures at any point during the training process
    [Diagram: Worker 1, Worker 2, Worker 3, Worker 4]

  17. ML workloads are locality / placement sensitive.
    • Colocation: specify certain processes that need to be on the same node
    • Homogeneous setup: ensure each node has the same number of workers

  18. How does Ray simplify distributed ML?

  19. What is Ray?
    1. A simple and powerful distributed computing toolkit
    2. An ecosystem of libraries for everything from web applications to data processing to ML/RL
    [Diagram: Ray as a universal framework for distributed computing, with an ecosystem of native libraries and 3rd-party libraries on top]

  20. Three Key Benefits
    • Simple API
    • Autoscaling / elastic workload support
    • Ability to handle complex worker/task placement

  21. Ray API
    Execute remote functions as tasks, and instantiate remote classes as actors
    • Support both stateful and stateless computations
    Asynchronous execution using futures
    • Enable parallelism

  22. API
    Functions -> Tasks

    def read_array(file):
        # read array "a" from "file"
        return a

    def add(a, b):
        return np.add(a, b)

  23. API
    Functions -> Tasks

    @ray.remote
    def read_array(file):
        # read array "a" from "file"
        return a

    @ray.remote
    def add(a, b):
        return np.add(a, b)

  24. API
    Functions -> Tasks

    @ray.remote
    def read_array(file):
        # read array "a" from "file"
        return a

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    id1 = read_array.remote("/input1")
    id2 = read_array.remote("/input2")
    id3 = add.remote(id1, id2)

    Classes -> Actors

  25. API
    Functions -> Tasks

    @ray.remote
    def read_array(file):
        # read array "a" from "file"
        return a

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    id1 = read_array.remote("/input1")
    id2 = read_array.remote("/input2")
    id3 = add.remote(id1, id2)

    Classes -> Actors

    @ray.remote
    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            self.value += 1
            return self.value

  26. API
    Functions -> Tasks

    @ray.remote
    def read_array(file):
        # read array "a" from "file"
        return a

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    id1 = read_array.remote("/input1")
    id2 = read_array.remote("/input2")
    id3 = add.remote(id1, id2)

    Classes -> Actors

    @ray.remote
    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            self.value += 1
            return self.value

    c = Counter.remote()
    id4 = c.inc.remote()
    id5 = c.inc.remote()
    ray.get([id4, id5])
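
    For readers following along outside the talk, a minimal end-to-end version of the task API above, runnable as a single script; the file arguments and array contents are placeholders, since the slide elides the body of read_array.

    import numpy as np
    import ray

    ray.init()  # start Ray locally

    @ray.remote
    def read_array(file):
        # Placeholder: pretend each file holds a small array.
        return np.arange(3)

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    # .remote() returns object refs (futures) immediately; the work runs in parallel.
    id1 = read_array.remote("/input1")
    id2 = read_array.remote("/input2")
    id3 = add.remote(id1, id2)  # Ray resolves the upstream refs before calling add

    print(ray.get(id3))  # block until the result is ready -> [0 2 4]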

  27. Ray Placement Groups
    Interface for custom placement of tasks and actors
    • Create bundles of resources and schedule workers on each bundle
    • Use strategies for placement of bundles
      PACK = place on the same node
      SPREAD = place on different nodes

    # Initialize Ray.
    import ray
    from ray.util.placement_group import placement_group

    ray.init(num_gpus=2, resources={"extra_resource": 2})

    bundle1 = {"GPU": 2}
    bundle2 = {"extra_resource": 2}
    pg = placement_group([bundle1, bundle2], strategy="PACK")
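
    The slide stops at creating the placement group. As a hedged sketch of how it might be used, a worker actor can be scheduled onto one of its bundles; the Trainer actor below is hypothetical, and the keyword arguments follow the Ray 1.x-era .options() API current at the time of this talk.

    @ray.remote(num_gpus=2)
    class Trainer:
        def ready(self):
            return True

    # Wait until the cluster has reserved both bundles, then place the actor
    # into the first bundle (bundle1, which holds the 2 GPUs).
    ray.get(pg.ready())
    trainer = Trainer.options(
        placement_group=pg,
        placement_group_bundle_index=0,
    ).remote()
    ray.get(trainer.ready.remote())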

  28. Ray
    [Diagram: Ray Task, Actor, and Object APIs together with the Ray Autoscaler]

  29. Unifying the ML ecosystem with Ray

  30. Ray unifies the distributed ML Ecosystem
    [Diagram: lower-level communicators, models/algorithms, and higher-level trainers, all running on Ray]

  31. Ray unifies the distributed ML Ecosystem
    [Same diagram as the previous slide]

  32. Horovod
    Open source library for fast and easy distributed training on any deep learning framework (TF, Torch, Keras, MXNet)
    All-reduce communication protocol with excellent scaling efficiency
    Elastic Horovod was released in 2020 to allow for dynamic scaling during training
    It did not implement the actual operation of adding/removing nodes or making resource requests… this is where Ray comes in

  33. Horovod on Ray
    • Autoscaling on any cloud provider/orchestrator
    • Custom placement strategies, object store, resource management
    • Leverage the Ray ecosystem (data processing, tuning)
    • Support for Jupyter notebooks

    import ray
    from horovod.ray import RayExecutor, ElasticRayExecutor

    ray.init(address="auto")  # attach to the Ray cluster

    # Use the standard RayExecutor
    executor = RayExecutor(settings, use_gpu=True, num_workers=2)

    # Or use elastic training
    executor = ElasticRayExecutor(settings, use_gpu=True)

    executor.start()
    executor.run(training_fn)
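
    The slide leaves settings and training_fn undefined; a minimal sketch of how they might be defined (the training body below is a placeholder, not from the talk):

    import horovod.torch as hvd
    from horovod.ray import RayExecutor

    # Settings controlling how the Ray workers are started (timeouts, etc.).
    settings = RayExecutor.create_settings(timeout_s=30)

    def training_fn():
        # Runs on every Ray worker; Horovod coordinates the all-reduce.
        hvd.init()
        print(f"Horovod worker {hvd.rank()} of {hvd.size()}")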

  34. Horovod on Ray Architecture
    [Architecture diagram: Worker 1, Worker 2, Worker 3, Worker 4]

  35. Horovod on Ray Adoption
    • Integrated as a backend in the Horovod repo
    • Users in the open source community, dozens of issues
    • Uber is moving their deep learning workloads to Horovod on Ray

  36. Ray unifies the distributed ML Ecosystem
    [Same diagram as before]

  37. Retrieval Augmented Generation (RAG)
    New NLP architecture by Facebook AI
    Implemented in the Huggingface suite of NLP models
    Leverages external documents for state-of-the-art results in knowledge-intensive tasks like Q&A

  38. Document Retrieval with Torch Distributed

  39. Document Retrieval with Ray

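    The diagrams for these two slides are not in the transcript. As a rough illustration of the pattern described in the talk, document retrieval can be hosted in a shared Ray actor that training workers query, rather than being coordinated through torch.distributed; this sketch is purely illustrative (the DocumentRetriever class and its toy scoring are hypothetical, not the actual Huggingface RAG integration).

    import ray

    ray.init()

    @ray.remote
    class DocumentRetriever:
        """Holds the document index once, shared by all training workers."""
        def __init__(self, passages):
            self.passages = passages

        def retrieve(self, query, k=2):
            # Toy scoring by word overlap; real RAG uses a dense FAISS index.
            scored = sorted(self.passages,
                            key=lambda p: -sum(w in p for w in query.split()))
            return scored[:k]

    retriever = DocumentRetriever.remote([
        "ray is a distributed execution framework",
        "rag retrieves documents to condition generation",
    ])

    @ray.remote
    def training_step(query, retriever):
        # Each training worker calls the shared retriever concurrently.
        return ray.get(retriever.retrieve.remote(query))

    print(ray.get(training_step.remote("what does rag retrieve", retriever)))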

  40. Ray unifies the distributed ML Ecosystem
    [Same diagram as before]

  41. Huggingface Transformers
    Trainer interface for Huggingface transformer models
    Integrates with Ray Tune for hyperparameter optimization

    trainer = Trainer(
        model_init=get_model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset)

    trainer.hyperparameter_search(
        hp_space=...,
        backend="ray")
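
    The hp_space argument is elided on the slide. With the Ray backend it can be a callable returning Ray Tune search spaces; a minimal sketch, with placeholder ranges:

    from ray import tune

    def hp_space(trial):
        # Ray Tune search spaces; the trial argument is unused with backend="ray".
        return {
            "learning_rate": tune.loguniform(1e-6, 1e-4),
            "per_device_train_batch_size": tune.choice([8, 16, 32]),
        }

    best_run = trainer.hyperparameter_search(
        hp_space=hp_space,
        backend="ray",
        n_trials=10,  # number of Tune trials
    )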

  42. PyTorch Lightning
    Open source library that provides a high-level interface on PyTorch
    Allows developers to focus on research code rather than boilerplate
    Distributed PyTorch Lightning is not easy to deploy:
    • Have to write bash scripts & ssh into each node
    • No cluster launching or autoscaling capabilities
    Why Ray:
    • Single Python script to launch a job
    • Integrates with Ray Tune for HPO

  43. Ray Lightning Library
    https://github.com/ray-project/ray_lightning
    PyTorch Distributed Data Parallel:

    import pytorch_lightning as pl
    from ray_lightning import RayPlugin

    # Create your PyTorch Lightning model here.
    ptl_model = MNISTClassifier(...)

    plugin = RayPlugin(num_workers=4, cpus_per_worker=1, use_gpu=True)
    trainer = pl.Trainer(..., plugins=[plugin])
    trainer.fit(ptl_model)

  44. Ray Lightning Library
    https://github.com/ray-project/ray_lightning

    import pytorch_lightning as pl
    from ray_lightning import HorovodRayPlugin

    # Create your PyTorch Lightning model here.
    ptl_model = MNISTClassifier(...)

    plugin = HorovodRayPlugin(num_hosts=2, num_slots=4, use_gpu=True)
    trainer = pl.Trainer(..., plugins=[plugin])
    trainer.fit(ptl_model)

  45. Ray Lightning Library
    https://github.com/ray-project/ray_lightning
    Fairscale Sharded Distributed Data Parallel:

    import pytorch_lightning as pl
    from ray_lightning import RayShardedPlugin

    # Create your PyTorch Lightning model here.
    ptl_model = MNISTClassifier(...)

    plugin = RayShardedPlugin(num_workers=4, cpus_per_worker=1, use_gpu=True)
    trainer = pl.Trainer(..., plugins=[plugin])
    trainer.fit(ptl_model)

  46. Ray Lightning Architecture
    [Architecture diagram: Worker 1, Worker 2, Worker 3, Worker 4]

  47. What’s Next?
    • More support for model parallel training
    • Integrations with DeepSpeed
    • Tying Ray data processing efforts with training
    • Providing a serverless experience for distributed training on Ray
    • Research projects at UC Berkeley
    • Ray Collective Communications Library

  48. Connect with us
    GitHub:
    Ray: https://github.com/ray-project/ray
    Horovod: https://github.com/horovod/horovod
    Ray Lightning: https://github.com/ray-project/ray_lightning
    Huggingface transformers: https://github.com/huggingface/transformers
    Join the Ray Discussion Forum: https://discuss.ray.io/

  49. Thank You