
The Quick Journey to Using Ray: How We Implement Ray and Anyscale to Speed up our ML Processes (Juan Roberto Honorato & Domingo Ortuzar, Anastasia)


When developers need to scale Ray applications, the go-to resource has been Ray’s own Cluster Launcher. Recently, Anyscale launched in private beta as a fully managed alternative. In this talk, we look at a real-life workflow running in both scenarios and highlight the key differences between them.

Anyscale

July 14, 2021

Transcript

  1. The quick journey to using Ray: how we implement Ray and Anyscale
     to speed up our ML processes. Ray Summit 2021.
     Juan Roberto Honorato [email protected], Domingo Ortuzar [email protected]
  2. About Anastasia.ai: We provide a powerful platform with business
     solutions that enables organizations to operate AI capabilities at
     scale, with a fraction of the resources and effort traditionally
     required.
  3. What this talk is going to be about:
     • Initially pure Python.
     • The need to scale horizontally.
     • Our journey towards Ray and Anyscale.
     • Astonishing results.
  4. Initially pure Python. Demand prediction: problem description.
     • Demand prediction reduces Operational Expenses (OpEx).
     • This business problem brings lots of challenges and nuances.
     • We can boil it down to a time series forecasting problem.
     • Our approach: a multimodel ensemble for every item.
  5. Initially pure Python. Multimodel ensemble for every item.
     [Diagram: an items-sold dataset (daily sales per item, e.g. shirts,
     pants, shoes, 2010–2021) is fed through a pool of prediction models
     (Model 1 … Model N); each model emits raw per-item forecasts, which
     are combined into the final predictions for that item.]
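The per-item ensemble on this slide can be sketched in plain Python. The two stand-in models and the averaging step below are our own illustrative placeholders, not the actual models or combination rule used at Anastasia:

```python
from statistics import mean

# Hypothetical stand-in models; the talk does not name the real ones.
def last_value_model(series):
    # naive forecast: repeat the most recent observation
    return series[-1]

def mean_model(series):
    # forecast the historical average
    return mean(series)

MODEL_POOL = [last_value_model, mean_model]

def ensemble_forecast(series, models=MODEL_POOL):
    # each model in the pool emits a raw prediction; average them
    raw = [model(series) for model in models]
    return mean(raw)

# one time series per item, as in the items-sold dataset on the slide
items = {"shirt": [100, 300, 500], "pants": [221, 180, 132]}
predictions = {item: ensemble_forecast(s) for item, s in items.items()}
```

Running every model on every item like this, sequentially, is exactly the workload the later slides parallelize.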
  6. Initially pure Python. Baseline description: where we were.
     • One EC2 instance for each type of model.
     • Parallelization within each instance.
     • Instances reported full CPU usage.
     We were happy with the results, as we considered the code very
     optimized. Little did we know...
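The baseline's per-instance parallelization could look roughly like this with Python's built-in multiprocessing: one process pool per machine, one model type per machine, spreading items across the local cores. The forecasting function is a placeholder, not the real model:

```python
from multiprocessing import Pool

def forecast_one_item(series):
    # placeholder model: predict the mean of the series
    return sum(series) / len(series)

def run_model_type(all_series, workers=4):
    # spread the items of one model type across this instance's cores
    with Pool(processes=workers) as pool:
        return pool.map(forecast_one_item, all_series)

if __name__ == "__main__":
    series_list = [[100, 500], [221, 132], [13, 15]]
    preds = run_model_type(series_list, workers=2)
```

This saturates one machine's cores (hence the "full CPU usage" above), but it cannot grow beyond that machine, which is the ceiling the next slide describes.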
  7. The need to scale horizontally. Baseline issues: scaling and cost.
     • We hit a ceiling with vertical scaling.
     • Big machines mean big money.
     • Big data would take forever to process.
     • Vendor lock-in.
  8. The need to scale horizontally. Why it's hard.
     Scaling:
     • Horizontal scaling in AWS Batch.
     • Big code changes required.
     • Can’t scale automatically.
     Costs:
     • No spot instances.
  9. The need to scale horizontally. Solution alternatives.
     • AWS Batch’s Multi-node Parallel Jobs: fault tolerance is difficult.
     • AWS EMR: designed for Spark workloads.
     • AWS SageMaker suite: mostly the same as Batch’s alternative.
     • Ray library + Autoscaler: very small code changes, and it solves
       the scaling and cost issues.
  10. Our journey towards Ray and Anyscale. What is going on inside
      the nodes: Actors.
      • Number of Actors as a parameter.
      • Each Consumer Actor runs an end-to-end ML pipeline.
      • Data Transfer Actors write to S3.
      • This implementation lowers the development time for new models.
      • Ray Queues feed data to Consumer Actors.
  11. Our journey towards Ray and Anyscale. Ray usage results: cheaper
      and faster.
      • We tested our existing implementation against our Ray
        implementation in Anyscale, for the same set of items, data and
        models.
      • We managed to reduce cost by lowering the number of CPU cores
        used.
      • We got faster results because we no longer use one Batch instance
        per model, but all instances for all models. This means that all
        the machines are used all the time.
      • There was a noticeable improvement in per-core speed compared to
        Python’s built-in multiprocessing module.
  12. Astonishing results. Comparison: Ray vs. AWS Batch.
      Test job: 100,000 time series, 120 data points per item and 384 CPU
      cores.

                   Pure Python               Ray
      Instances    Isolated per model type   True cluster setup
      Pricing      On-demand                 Spot
  13. Astonishing results. Ray Autoscaler brings more sophisticated
      issues:
      • A production Ray autoscaler requires careful DevOps processes.
      • Bottlenecks are hard to identify from your laptop.
      • Sharing a cluster is not easy.
  14. Astonishing results. Anyscale usage results. Anyscale brings us:
      • Much easier DevOps with the Python SDK and API.
      • Versioning and governance.
      • Seamlessly manage, share and run clusters across multiple teams.
      • Optimal orchestration of Ray Clusters is taken care of by Ray’s
        own creators.
  15. Conclusions.
      • Implementing end-to-end AI workloads with Ray makes it really
        simple for developers to scale.
      • It improves performance and lowers infrastructure costs.
      • Anyscale helps you manage and automate Ray clusters.