
The Quick Journey to Using Ray: How We Implement Ray and Anyscale to Speed up our ML Processes (Juan Roberto Honorato & Domingo Ortuzar, Anastasia)


When developers need to scale Ray applications, the go-to resource has been Ray’s own Cluster Launcher. Recently, Anyscale launched in private beta as a fully managed alternative. In this talk, we look at a real-life workflow running in both scenarios and highlight the key differences between them.

Anyscale

July 14, 2021

Transcript

  1. The quick journey to using Ray: how we implement Ray and Anyscale
     to speed up our ML processes. Ray Summit 2021.
     Juan Roberto Honorato [email protected], Domingo Ortuzar [email protected]
  2. About Anastasia.ai: We provide a powerful platform with business
     solutions that enables organizations to operate AI capabilities at
     scale, with a fraction of the resources and effort traditionally
     required.
  3. What this talk is going to be about:
     • Initially pure Python.
     • The need to scale horizontally.
     • Our journey towards Ray and Anyscale.
     • Astonishing results.
  4. Initially pure Python. Demand prediction: problem description.
     • Demand prediction reduces Operational Expenses (OpEx).
     • This business problem brings lots of challenges and nuances.
     • We can boil it down to a time series forecasting problem.
     • Our approach: a multimodel ensemble for every item.
  5. Initially pure Python. Multimodel ensemble for every item.
     [Diagram: an items-sold dataset (daily sales per item, e.g. shirts,
     pants, shoes, 2010–2021) is fed through a pool of prediction models
     (Model 1 … Model N); each model emits raw per-item forecasts, which
     are combined into the final predictions for that item.]
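The per-item ensemble on this slide can be sketched in plain Python. The two stand-in models and the averaging step below are our own illustrative placeholders, not the actual models or combination rule used at Anastasia:

```python
from statistics import mean

# Hypothetical stand-in models; the talk does not name the real ones.
def last_value_model(series):
    # naive forecast: repeat the most recent observation
    return series[-1]

def mean_model(series):
    # forecast the historical average
    return mean(series)

MODEL_POOL = [last_value_model, mean_model]

def ensemble_forecast(series, models=MODEL_POOL):
    # each model in the pool emits a raw prediction; average them
    raw = [model(series) for model in models]
    return mean(raw)

# one time series per item, as in the items-sold dataset on the slide
items = {"shirt": [100, 300, 500], "pants": [221, 180, 132]}
predictions = {item: ensemble_forecast(s) for item, s in items.items()}
```

Running every model on every item like this, sequentially, is exactly the workload the later slides parallelize.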
  6. Initially pure Python. Baseline description: where we were.
     • One EC2 instance for each type of model.
     • Parallelization within each instance.
     • Instances reported full CPU usage.
     We were happy with the results, as we considered the code very
     optimized. Little did we know...
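The baseline's per-instance parallelization could look roughly like this with Python's built-in multiprocessing: one process pool per machine, one model type per machine, spreading items across the local cores. The forecasting function is a placeholder, not the real model:

```python
from multiprocessing import Pool

def forecast_one_item(series):
    # placeholder model: predict the mean of the series
    return sum(series) / len(series)

def run_model_type(all_series, workers=4):
    # spread the items of one model type across this instance's cores
    with Pool(processes=workers) as pool:
        return pool.map(forecast_one_item, all_series)

if __name__ == "__main__":
    series_list = [[100, 500], [221, 132], [13, 15]]
    preds = run_model_type(series_list, workers=2)
```

This saturates one machine's cores (hence the "full CPU usage" above), but it cannot grow beyond that machine, which is the ceiling the next slide describes.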
  7. The need to scale horizontally. Baseline issues: scaling and cost.
     • We hit a ceiling with vertical scaling.
     • Big machines mean big money.
     • Big data would take forever to process.
     • Vendor lock-in.
  8. The need to scale horizontally. Why it's hard.
     Scaling:
     • Horizontal scaling in AWS Batch.
     • Big code changes required.
     • Can’t scale automatically.
     Costs:
     • No spot instances.
  9. The need to scale horizontally. Solution alternatives.
     • AWS Batch’s Multi-node Parallel Jobs: fault tolerance is difficult.
     • AWS EMR: designed for Spark workloads.
     • AWS SageMaker suite: mostly the same as Batch’s alternative.
     • Ray library + Autoscaler: very small code changes, and it solves
       the scaling and cost issues.
  10. Our journey towards Ray and Anyscale. What is going on inside
      the nodes: Actors.
      • Number of Actors as a parameter.
      • Each Consumer Actor runs an end-to-end ML pipeline.
      • Data Transfer Actors write to S3.
      • This implementation lowers the development time for new models.
      • Ray Queues feed data to Consumer Actors.
  11. Our journey towards Ray and Anyscale. Ray usage results: cheaper
      and faster.
      • We tested our existing implementation against our Ray
        implementation in Anyscale, for the same set of items, data and
        models.
      • We managed to reduce cost by lowering the number of CPU cores
        used.
      • We got faster results because we no longer use one Batch instance
        per model, but all instances for all models. This means that all
        the machines are used all the time.
      • There was a noticeable improvement in per-core speed compared to
        Python’s built-in multiprocessing module.
  12. Astonishing results. Comparison: Ray vs. AWS Batch.
      Test job: 100,000 time series, 120 data points per item and 384 CPU
      cores.

                   Pure Python               Ray
      Instances    Isolated per model type   True cluster setup
      Pricing      On-demand                 Spot
  13. Astonishing results. Ray Autoscaler brings more sophisticated
      issues:
      • A production Ray autoscaler requires careful DevOps processes.
      • Bottlenecks are hard to identify from your laptop.
      • Sharing a cluster is not easy.
  14. Astonishing results. Anyscale usage results. Anyscale brings us:
      • Much easier DevOps with the Python SDK and API.
      • Versioning and governance.
      • Seamlessly manage, share and run clusters across multiple teams.
      • Optimal orchestration of Ray Clusters is taken care of by Ray’s
        own creators.
  15. Conclusions.
      • Implementing end-to-end AI workloads with Ray makes it really
        simple for developers to scale.
      • It improves performance and lowers infrastructure costs.
      • Anyscale helps you manage and automate Ray clusters.