
Scaling ML + Python Workloads On-Demand 10.12

Anyscale
October 12, 2022

Transcript

  1. Simplify Scaling ML and Any Python Workload
    Phi Nguyen, GTM Tech Lead
    Antoni Baum, ML Software Engineer


  2. 2
    Agenda
    • Scaling Machine Learning and Python
    Workloads
    • Instant Scaling From a Laptop to the Cloud
    • Use Cases
    • Demo
    • Questions


  3. Fundamental Challenges
    • AI compute demands have increased exponentially.
    • AI has lacked a universal framework to ease scaling and enable
    distributed computing for machine learning & Python workloads.
    • MLOps is complex: developer iteration and scaling are hard, hampering
    developer productivity, consuming excessive compute, and slowing
    time-to-market.


  4. Python is Growing Everywhere
    Python is the most popular programming language, with a solid footprint
    across: data science, analytics, HPC / scientific computing, IoT,
    web development, and general applications.


  5. Exponential Growth of Unstructured Data
    • Deep learning's breakthrough innovation is based on unstructured data
    • Data science libraries are primarily based in Python but do not scale well
    • How do you scale text, images, logs, geospatial, video and sensor data?


  6. The Rise of Operational ML
    “Applying AI-driven forecasting to supply chain management, for example,
    can reduce errors by between 20 and 50 percent.”
    Demand forecasting:
    • 100k SKUs, 50 categories
    • 210 DMAs (Designated Market Areas)
    • Time granularity (monthly, weekly, daily, etc.)


  7. Ray & Anyscale


  8. Ray | The Fastest-Growing Framework to Scale AI Apps
    • 1,000+ organizations using Ray
    • 21,000+ GitHub stars
    • 4,000+ repositories depend on Ray
    • 700+ community contributors
    • 600+


  9. Scaling ML & Python – Proven Use Cases
    CPG / Retail
    • Supply chain
    • Anomaly detection
    • Forecasting factory
    • Demand forecasting
    • Recsys / personalization
    Financial Services
    • Algorithmic trading
    • Financial modelling
    • Market simulation
    • Backtesting
    Pharma / Biotech
    • Drug screening
    • DNA sequence processing
    • Protein folding
    Digital Native Businesses
    • Recsys / personalization
    • ETAs
    • Dynamic pricing
    • Product classification
    • NLP / CV
    Gaming
    • Game testing
    • In-game personalization
    AI Core Businesses
    • RL / DL / ML at scale
    • Complex data processing
    • Foundation models


  10. Scale Any Python or AI Workload
    • Reinforcement learning
    • Computer vision / NLP
    • Recommendation systems
    • Complex / unstructured data processing
    • Large-scale inference
    • Simulation
    • Embarrassingly parallel workloads
    • Batch training
    • Tuning / AutoML
    • Time series forecast factory
    • Backtesting


  11. 11
    Batch Training & Tuning
    Before Ray, we used 10 containers with celery
    running on AWS Batch and it used to take 2-3 days
    to train ~8000 models weekly for our marketplace
    use case. After doing a quick POC with Ray, we are
    now able to train 1000 models in 20 min.
    – ML Engineer, Instacart
    We did an internal benchmark for our
    forecast factory use cases and we found a
    10x better performance compared to
    SageMaker.
    – Product Manager, Manufacturing
    Conglomerate
    With Ray we were able to create a self-service
    marketing attribution model. The service would
    train and tune many models & combinations and
    provide the best model based on user inputs.
    This has allowed us to scale and provide an
    on-demand service for our customers.
    – Chief Technology Officer, AI Powered Marketing
    Co.
    We used Ray to solve our demand prediction use
    case and we obtained some astonishing results.
    Specifically, compared to our AWS Batch
    implementation, our Ray implementation is 9x
    faster and has reduced the cost by 87%.
    – Chief Technology Officer, anastasia.ai


  12. 12
    Batch Inference & Data processing
    In order to generate MSA data from sequence data
    for ~100,000+ proteins, one would need 10+ years
    on a laptop. However, with Ray and a simple
    ~100-line python script, I was able to perform this
    task on 8,192 cores in one day.
    – Laksh Aithani, CEO at CHARM Therapeutics
    Using Ray and spacy.io, we can now process 15M
    documents in hours instead of weeks. This has
    allowed us to accelerate our NLP pipelines and
    provide faster value to our clients.
    – Chief Technology Officer, AI Startup
    With Ray, we can now scale the time series
    backtesting part of our algorithmic trading
    workbench. This has allowed us to test and provide
    more robust models in a shorter period of time.
    – Thomas Kutschera, CEO @ Axovision
    Ray is 11x faster and 3x cheaper than
    traditional methods for our ultra-high
    resolution drone imagery — allowing us to
    significantly accelerate our restoration
    efforts in a cost-effective way.
    – Richard Decal, SWE at Dendra Systems


  13. Ray | A Unified Framework for Scaling AI & Python
    Ray AI Runtime: a unified framework for scalable computing, with
    libraries for data loading, training, tuning, reinforcement learning
    and model serving, all built on Ray Core.
    One framework to scale all workloads!


  14. Anyscale | A Unified Compute Platform for Scaling and Fast Time-to-Market
    Managed Ray Platform: eases development, deployment, scaling and management.
    • Fully-managed scalable compute platform: managed service, observability,
    access control; workspaces, jobs, services
    • Ray AI Runtime, a unified framework for scalable computing: data loading,
    training, tuning, reinforcement learning and model serving, on Ray Core
    • Ecosystem integrations: data / features, orchestration, explainability /
    observability, experiment management, hyperparameter tuning,
    serving / applications, training


  15. Instant Scaling to the Cloud
    Develop on your laptop, seamlessly scale to the cloud with no code changes!
    Develop on Laptop (develop, debug)
    → no code changes →
    Develop on Cluster (develop, test, debug)
    → no code changes →
    Production Deployment (run, monitor)
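    A minimal sketch of this laptop-to-cloud workflow: locally, ray.init()
    starts Ray on the machine; to target a cluster, only the init call gains
    an address argument while the task code is untouched. The endpoint string
    in the comment is a placeholder, not from the slides.

    ```python
    import ray

    # Local development: starts a Ray instance on the laptop.
    # On a cluster, only this line changes, e.g.
    # ray.init(address="ray://<head-node>:10001")  # placeholder endpoint
    ray.init()

    @ray.remote
    def square(x):
        return x * x

    # Fan out four tasks and gather the results in order.
    print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]
    ```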


  16. • Effortlessly scale your Python and AI workloads
    No code changes to go from laptop to the cloud
    • Speed time-to-market
    Train in hours or days rather than weeks or months
    • Unified distributed framework
    Parallelize any ML and Python code
    • Simplify your MLOps
    Single script for data preprocessing, training,
    tuning and serving


  17. Ray Fundamentals


  18. Scaling Design Patterns
    • Batch training / inference: different data, same function
    • Batch tuning: different data, same function, different hyperparameters per job
    • AutoML: same data, different function
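    The "different data / same function" pattern can be sketched with Ray
    tasks: one remote call per data partition, fanned out in parallel. The
    partitions and the fit_partition function here are illustrative
    placeholders, not from the deck.

    ```python
    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def fit_partition(partition):
        # Placeholder "model fit": a real job would train one model per
        # SKU / region / series on its partition.
        return sum(partition) / len(partition)

    partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # one batch per job
    # Launch all jobs at once; ray.get gathers the results in order.
    results = ray.get([fit_partition.remote(p) for p in partitions])
    print(results)  # [2.0, 5.0, 8.0]
    ```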


  19. Anatomy of a Ray Cluster
    • Decentralized scheduler
    • Cluster auto-scales based on Python calls!
    • Low task / actor overhead (ms)
    Head node: driver and worker processes, the Global Control Store (GCS),
    and a raylet (scheduler + object store).
    Worker nodes #1 … #N: worker processes and a raylet (scheduler +
    object store) each.


  20. Ray Core API


  21. Python → Ray: Basic Patterns
    Function → Task (stateless)
    Class → Actor (stateful process)
    Object → Distributed object (immutable)
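    This mapping can be sketched in a few lines; the names double and
    Accumulator are illustrative, not from the deck.

    ```python
    import ray

    ray.init(ignore_reinit_error=True)

    # Function -> Task: stateless, scheduled anywhere in the cluster.
    @ray.remote
    def double(x):
        return 2 * x

    # Class -> Actor: a stateful worker process; method calls run in order.
    @ray.remote
    class Accumulator:
        def __init__(self):
            self.total = 0

        def add(self, x):
            self.total += x
            return self.total

    # Object -> Distributed object: ray.put stores an immutable value
    # in the object store and returns a reference usable as a task argument.
    ref = ray.put(21)
    print(ray.get(double.remote(ref)))  # 42

    acc = Accumulator.remote()
    acc.add.remote(40)
    print(ray.get(acc.add.remote(2)))  # 42
    ```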


  22. Function → Task | Class → Actor

    # Function (plain Python)
    def read_array(path):
        # ... read ndarray "a" from path
        return a

    def add(a, b):
        return np.add(a, b)

    a = read_array(path1)
    b = read_array(path2)
    sum = add(a, b)

    # Class (plain Python)
    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            self.value += 1
            return self.value

    c = Counter()
    c.inc()
    c.inc()


  23. Function → Task | Class → Actor

    # Function → Task
    @ray.remote
    def read_array(path):
        # ... read ndarray "a" from path
        return a

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    id1 = read_array.remote(path1)
    id2 = read_array.remote(path2)
    id = add.remote(id1, id2)
    sum = ray.get(id)

    # Class → Actor
    @ray.remote
    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            self.value += 1
            return self.value

    c = Counter.remote()
    id4 = c.inc.remote()
    id5 = c.inc.remote()



  25. Function → Task | Class → Actor, with resource requests

    # Task with resource requests
    @ray.remote
    def read_array(path):
        # ... read ndarray "a" from path
        return a

    @ray.remote(num_gpus=1, accelerator_type=TESLA_V100)
    def add(a, b):
        return np.add(a, b)

    id1 = read_array.remote(path1)
    id2 = read_array.remote(path2)
    id = add.remote(id1, id2)
    sum = ray.get(id)

    # Actor with resource requests
    @ray.remote(num_gpus=1, num_cpus=4)
    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            self.value += 1
            return self.value

    c = Counter.remote()
    id4 = c.inc.remote()
    id5 = c.inc.remote()


  26. Ray Demos
    • Simple composable AutoML for time series (M5 dataset): fit ETS and
    AutoArima per series, keep the best model.
    • Batch forecasting (NYC Taxi dataset): one forecast per pickup
    location (PU Loc 1 … PU Loc N).
    • Complex data processing (LightShot images): OCR and language
    detection over Img 1 … Img N.


  27. Questions?
    Join the community:
    • discuss.ray.io
    • github.com/ray-project/ray
    • @raydistributed
    • @anyscalecompute
    Fill out our feedback survey: https://bit.ly/3CoqLX3
    Request a demo of the Anyscale Platform: go to www.anyscale.com and
    select ‘Try It Now’.


  28. Thank You
    October 2022
