
Scaling ML + Python Workloads On-Demand 10.12

Anyscale
October 12, 2022

Transcript

  1. Simplify Scaling ML and Any Python Workload • Phi Nguyen,

    GTM Tech Lead • Antoni Baum, ML Software Engineer
  2. Agenda • Scaling Machine Learning and Python Workloads •

    Instant Scaling From a Laptop to the Cloud • Use Cases • Demo • Questions
  3. Fundamental Challenges • AI compute demands have increased exponentially •

    AI scaling has lacked a universal framework to ease scaling and enable distributed computing for machine learning & Python workloads • MLOps is complex, making developer iteration and scaling difficult, which hampers developer productivity, consumes excessive compute, and slows time-to-market
  4. Python is Growing Everywhere • Python is the most popular

    programming language, with a solid footprint across: data science, analytics, HPC / scientific computing, IoT, web development, and general applications
  5. Exponential Growth of Unstructured Data • Deep learning breakthrough innovation

    is based on unstructured data • Data science libraries are primarily based in Python but do not scale well • How do you scale text, images, logs, geospatial, video, and sensor data?
  6. The Rise of Operational ML • “Applying AI-driven forecasting to

    supply chain management, for example, can reduce errors by between 20 and 50 percent” • Demand forecasting • 100k SKUs – 50 categories • 210 DMAs (Designated Market Areas) • Time granularity (monthly, weekly, daily, etc.)
  7. Ray | The Fastest-Growing Framework to Scale AI Apps •

    1,000+ organizations using Ray • 21,000+ GitHub stars • 4,000+ repositories depend on Ray • 700+ community contributors • 600+
  8. Scaling ML & Python – Proven Use Cases • CPG /

    Retail: supply chain, anomaly detection, forecasting factory, demand forecasting, recsys / personalization • Financial Services: algorithmic trading, financial modelling, market simulation, backtesting • Pharma / Biotech: drug screening, DNA sequence processing, protein folding • Digital Native Businesses: recsys / personalization, ETAs, dynamic pricing, product classification, NLP / CV • Gaming: game testing, in-game personalization • AI Core Businesses: RL / DL / ML at scale, complex data processing, foundation models
  9. Scale Any Python or AI Workload • Reinforcement learning •

    Computer vision / NLP • Recommendation systems • Complex / unstructured data processing • Large-scale inference • Simulation • Embarrassingly parallel workloads • Batch training • Tuning • AutoML • Time series forecast factory • Backtesting
  10. Batch Training & Tuning • “Before Ray, we used 10

    containers with Celery running on AWS Batch, and it used to take 2-3 days to train ~8,000 models weekly for our marketplace use case. After doing a quick POC with Ray, we are now able to train 1,000 models in 20 minutes.” – ML Engineer, Instacart • “We did an internal benchmark for our forecast factory use cases and found 10x better performance compared to SageMaker.” – Product Manager, Manufacturing Conglomerate • “With Ray we were able to create a self-service marketing attribution model. The service trains and tunes many models and combinations and provides the best model based on user inputs. This has allowed us to scale and provide an on-demand service for our customers.” – Chief Technology Officer, AI-Powered Marketing Co. • “We used Ray to solve our demand prediction use case and obtained some astonishing results. Specifically, compared to our AWS Batch implementation, our Ray implementation is 9x faster and has reduced cost by 87%.” – Chief Technology Officer, anastasia.ai
  11. Batch Inference & Data Processing • “In order to generate

    MSA data from sequence data for ~100,000+ proteins, one would need 10+ years on a laptop. However, with Ray and a simple ~100-line Python script, I was able to perform this task on 8,192 cores in one day.” – Laksh Aithani, CEO at CHARM Therapeutics • “Using Ray and spacy.io, we can now process 15M documents in hours instead of weeks. This has allowed us to accelerate our NLP pipelines and provide faster value to our clients.” – Chief Technology Officer, AI Startup • “With Ray, we can now scale the time series backtesting part of our algorithmic trading workbench. This has allowed us to test and deliver more robust models in a shorter period of time.” – Thomas Kutschera, CEO @ Axovision • “Ray is 11x faster and 3x cheaper than traditional methods for our ultra-high-resolution drone imagery, allowing us to significantly accelerate our restoration efforts in a cost-effective way.” – Richard Decal, SWE at Dendra Systems
  12. Ray | A Unified Framework for Scaling AI & Python

    • Ray AI Runtime: a unified framework for scalable computing – data loading, training, tuning, reinforcement learning, and model serving • Built on Ray Core • One framework to scale all workloads!
  13. Anyscale | A Unified Compute Platform for Scaling and

    Fast Time-to-Market • Managed Ray platform that eases development, deployment, scaling, and management • Managed service: observability, access control, workspaces, jobs, services – a fully managed, scalable compute platform • Ray AI Runtime: unified framework for scalable computing (data loading, training, tuning, reinforcement learning, model serving) on Ray Core • Surrounding ecosystem: data / features, orchestration, explainability / observability, experiment management, hyperparameter tuning, training, serving / applications
  14. Instant Scaling to the Cloud • Develop on your laptop,

    then seamlessly scale to the cloud with no code changes! • Develop and debug on your laptop • Develop, test, and debug on a cluster – no code changes • Run and monitor in a production deployment
  15. • Effortlessly scale your Python and AI workloads – no

    code changes to go from laptop to cloud • Speed time-to-market – train in hours or days rather than weeks or months • Unified distributed framework – parallelize any ML and Python code • Simplify your MLOps – a single script for data preprocessing, training, tuning, and serving
  16. Scaling Design Patterns • Batch training / inference: different data,

    same function • Batch tuning: different data, same function, different hyperparameters per job • AutoML: same data, different functions
  17. Anatomy of a Ray Cluster • Decentralized scheduler – the

    cluster auto-scales based on Python calls! • Low task / actor overhead (milliseconds) • Head node: driver, worker, Global Control Store (GCS), scheduler, object store, raylet • Worker nodes (#1 … #N): workers, scheduler, object store, raylet
  18. Python → Ray: Basic Patterns • Function → Task (stateless)

    • Class → Actor (stateful process) • Object → Distributed object (immutable)
  19. Function → Task / Class → Actor (plain Python)

    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            self.value += 1
            return self.value

    c = Counter()
    c.inc()
    c.inc()

    def read_array(path):
        # ... read ndarray "a" from path
        return a

    def add(a, b):
        return np.add(a, b)

    a = read_array(path1)
    b = read_array(path2)
    sum = add(a, b)
  20. Function → Task / Class → Actor (with @ray.remote)

    @ray.remote
    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            self.value += 1
            return self.value

    c = Counter.remote()
    id4 = c.inc.remote()
    id5 = c.inc.remote()

    @ray.remote
    def read_array(path):
        # ... read ndarray "a" from path
        return a

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    id1 = read_array.remote(path1)
    id2 = read_array.remote(path2)
    id = add.remote(id1, id2)
    sum = ray.get(id)
  22. Function → Task / Class → Actor (specifying resources)

    @ray.remote(num_gpus=1, num_cpus=4)
    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            self.value += 1
            return self.value

    c = Counter.remote()
    id4 = c.inc.remote()
    id5 = c.inc.remote()

    @ray.remote
    def read_array(path):
        # ... read ndarray "a" from path
        return a

    @ray.remote(num_gpus=1, accelerator_type=TESLA_V100)
    def add(a, b):
        return np.add(a, b)

    id1 = read_array.remote(path1)
    id2 = read_array.remote(path2)
    id = add.remote(id1, id2)
    sum = ray.get(id)
  23. Ray Demos • Simple composable AutoML for time series: M5

    dataset – fit ETS and AutoArima per series, then pick the best model • Batch forecasting: NYC Taxi dataset – one forecast per pickup location (PU Loc 1 … PU Loc N) • Complex data processing: LightShot images (Img 1 … Img N) – OCR, then language detection
  24. Questions? Join the community: • discuss.ray.io • github.com/ray-project/ray •

    @raydistributed • @anyscalecompute • Fill out our survey for feedback: https://bit.ly/3CoqLX3 • Request a demo of the Anyscale Platform – go to www.anyscale.com and select ‘Try It Now’