Anyscale
October 25, 2022

Ray AIR: A Scalable Toolkit for End-to-end ML Applications

Existing production machine learning systems often suffer from problems that make them hard to use. For example, data scientists and ML practitioners often spend most of their time fighting YAML files and refactoring code to push models to production.

To address this, the Ray community has built Ray AI Runtime (AIR), an open-source toolkit for building large-scale end-to-end ML applications. By leveraging Ray’s distributed compute layer and library ecosystem, AIR brings scalability and programmability to ML platforms.

Ray AI Runtime focuses on providing the compute layer for Python-based ML workloads and is designed to interoperate with other systems for storage and metadata needs.

In this session, we’ll explore and discuss the following:

* How AIR is different from existing ML platform tools like TFX, SageMaker, and Kubeflow
* How AIR allows you to program and scale your machine learning workloads easily
* Interoperability and easy integration points with other systems for storage and metadata needs
* AIR’s cutting-edge features for accelerating the machine learning lifecycle, such as data preprocessing, last-mile data ingestion, tuning and training, and serving at scale

Key takeaways for attendees are:

* Understand how Ray AI Runtime can be used to implement scalable, programmable machine learning workflows.
* Learn how to pass and share data across distributed trainers and Ray native libraries: Tune, Serve, Train, RLlib, etc.
* Learn how to scale Python-based workloads across supported public clouds.

Transcript

  1. Ray AIR: A Scalable Toolkit for End-to-end ML Applications

    Richard Liaw, Anyscale; Xiaowei Jiang, Anyscale. SF Bay ACM Meetup, 10/24/2022
  2. Who we are: Original creators of Ray

    What we do: Unified compute platform to develop, deploy, and manage scalable AI & Python applications with Ray.

    Why do it: Scaling is a necessity, scaling is hard; make distributed computing easy and simple for everyone.
  3. What is Ray

    → A simple/general-purpose library for distributed computing
    → A unified Python toolkit, Ray AI Runtime (for scaling ML and more)
    → Runs on laptop, public cloud, K8s, on-premise

    A layered cake of functionality and capabilities for scaling ML workloads
  4. A Layered Cake and Ecosystem

    [Diagram labels: Run anywhere; general-purpose framework for distributed computing; Library + app ecosystem; Ray Core]
  5. An anatomy of a Ray cluster

    [Diagram labels: Head Node with Driver, Worker, Global Control Store (GCS), and Raylet (Scheduler, Object Store); Worker Node #1 through Worker Node #N, each with Workers and a Raylet (Scheduler, Object Store); callout "Unique to Ray"]
  6. Python → Ray APIs

    Function → Task (distributed across nodes):

          def f(x):
              # do something with x
              y = …
              return y

          @ray.remote
          def f(x):
              # do something with x
              y = …
              return y

          # each call returns a future immediately; tasks run across the cluster
          futures = [f.remote(i) for i in range(10000)]

    Class → Actor (distributed across nodes):

          class Cls():
              def __init__(self, x): …
              def f(self, a): …
              def g(self, a): …

          @ray.remote(num_cpus=2, num_gpus=4)
          class Cls():
              def __init__(self, x): …
              def f(self, a): …
              def g(self, a): …

          cls = Cls.remote(x)
          cls.f.remote(a)
          del cls

    NumPy array → Distributed immutable object:

          import numpy as np
          a = np.arange(1, 10e6)
          b = a * 2

          import numpy as np
          import ray
          a = np.arange(1, 10e6)
          obj_a = ray.put(a)        # store the array in the object store
          b = ray.get(obj_a) * 2    # fetch it back and compute
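The task pattern on the slide above rests on futures: a `.remote()` call returns a handle immediately, and the result is fetched later. As a rough analogy only, the sketch below uses Python's standard-library `concurrent.futures` on a single machine instead of Ray. The function `square` and the pool size are illustrative assumptions, and Ray's scheduling across nodes is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Stand-in for the slide's f(x): any pure function of its input."""
    return x * x


def run_tasks(inputs, max_workers=4):
    """Submit one task per input, then gather results in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # submit() returns a Future right away, much like f.remote() returns a handle
        futures = [pool.submit(square, x) for x in inputs]
        # result() blocks until that task is done, much like ray.get()
        return [fut.result() for fut in futures]


results = run_tasks(range(8))
print(results)
```

Submitting everything first and collecting afterwards is what lets the work overlap; calling `result()` right after each `submit()` would serialize the tasks.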
  7. Project Overview

    The Ray team has worked with ML users and infra groups at companies such as Uber, Ant, Shopify, Cruise, and OpenAI for several years. AIR is an effort to synthesize lessons learned into a simple toolkit for the community.
    - Built on Ray's existing scalable libraries
    - Unified APIs for e2e ML
    - Simplify ML infra
  8. Still not easy to go from dev to prod at scale.

    [Diagram labels: preprocess.py, train.py, eval.py, run_workflow.py]
  9. What happens when your ML infra gets out of date?

    [Diagram labels: preprocess.py, train.py, eval.py, run_workflow.py]
  10. Key problems of existing ML infrastructure

    Scaling is hard, especially for data scientists. Platform solutions can limit flexibility, but custom distributed apps are too hard.
  12. Ray AI Runtime (AIR) is a scalable toolkit for end-to-end ML applications

    Built on Ray Core for open and flexible ML compute end-to-end. Since Ray focuses on compute, AIR leverages integrations for storage and tracking.
  13. Ray AI Runtime (AIR) is a scalable runtime for end-to-end ML applications

    High-level libraries that make scaling easy for both data scientists and ML engineers.
  14. Importance of scalable library layer

    With non-scalable libraries: high friction between the data science team (development environment) and the eng team (production environment).
    • Non-distributed systems / libraries
    • Opinionated distributed systems / libraries

    With scalable libraries: seamless handoff between the data science team (development environment) and the eng team (production environment).
    • Easy to scale with Ray AIR and libraries of their choice
    • More performant, robust and scalable
  15. Ray AI Runtime (AIR) is a scalable toolkit for end-to-end ML applications

    Scalable integrations with best-of-breed libraries/MLOps tools
  16. AIR Integrations

    • Built-in integrations
    • Integrations API to easily add integrations
    • Custom scalable components can be built on Ray Core
  17. AIR simplifies scalable ML infrastructure

    • Integrations with best-of-breed libraries/MLOps tools
    • A unified end-to-end ML runtime
    • Make scaling easy for both data scientists and ML engineers
  18. When would you use Ray AIR?

    • Scale a single type of workload
    • Scale end-to-end ML applications
    • Run ecosystem libraries using a unified API
    • Build a custom ML platform
  19. AIR is for the entire ML org

    • Scale a single type of workload
    • Scale end-to-end ML applications
    • Run ecosystem libraries using a unified API
    • Build a custom ML platform

    A scalable, unified toolkit for both data scientists and software engineers.
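  20. Ray AIR vs Ray Core

    |                   | Ray AIR                                        | Ray Core                    |
    |-------------------|------------------------------------------------|-----------------------------|
    | Who should use... | Data Scientists & ML Engineers                 | Advanced Infra & ML Groups  |
    | If you want...    | Easy to get started and ecosystem integrations | Customizability and control |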
  21. What comes out of the box with AIR?

    • Training
    • Tuning
    • Batch Prediction
    • Data Preprocessing
    • Serving
  22. Scalable Data Prep and Loading with Ray Data

    • Dataset library built for ML tasks
    • Seamlessly load distributed data from MB to TB scale
    • Preprocessors for unified training<>inference

    [Diagram labels: Dataset, Trainer.fit, Trainer, Workers]

          dataset = ray.data.read_csv("...")
          preprocessor = ray.data.preprocessors.MinMaxScaler(["value"])
          trainer = ray.train.TorchTrainer(
              ..., preprocessor=preprocessor, dataset=dataset)
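The "unified training<>inference" bullet is about fitting preprocessing statistics once on training data and reusing them unchanged at prediction time. A minimal plain-Python sketch of that idea follows. `SimpleMinMaxScaler` is a hypothetical class written for illustration and is not Ray's `MinMaxScaler` implementation.

```python
class SimpleMinMaxScaler:
    """Hypothetical scaler: learns min/max of one column, maps values to [0, 1]."""

    def __init__(self, column):
        self.column = column
        self.min_ = None
        self.max_ = None

    def fit(self, rows):
        values = [row[self.column] for row in rows]
        self.min_, self.max_ = min(values), max(values)
        return self

    def transform(self, rows):
        if self.min_ is None:
            raise RuntimeError("call fit() before transform()")
        span = self.max_ - self.min_
        out = []
        for row in rows:
            scaled = dict(row)  # copy so the input rows are left untouched
            # a constant column has zero span; map it to 0.0 instead of dividing by zero
            scaled[self.column] = (row[self.column] - self.min_) / span if span else 0.0
            out.append(scaled)
        return out


train_rows = [{"value": 0.0}, {"value": 5.0}, {"value": 10.0}]
scaler = SimpleMinMaxScaler("value").fit(train_rows)
scaled_train = scaler.transform(train_rows)

# At inference time the same fitted statistics are applied to unseen rows.
inference_rows = [{"value": 2.5}, {"value": 20.0}]
scaled_inference = scaler.transform(inference_rows)
print(scaled_inference)
```

Values outside the training range land outside [0, 1]; that is expected, because the statistics come from the training data only.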
  23. Scalable Model Training with Ray Train

    • Single API to run the most popular ML training frameworks
    • Seamless integration with other AIR libraries

    [Diagram labels: Trainer, Checkpoint, Datasets, Tuner]

          trainer = ray.train.TorchTrainer(
              train_loop,
              scaling_config=ScalingConfig(
                  num_workers=100, use_gpu=True),
              preprocessor=preprocessor,
              dataset=dataset)
          result = trainer.fit()
  24. Scalable Hyperparameter Tuning with Ray Tune

    • Run multiple concurrent training jobs
    • Cutting-edge optimization algorithms
    • Fault tolerance at scale

    [Diagram labels: Tuner.fit, Tuner, Trial, Trainer, Workers]

          trainer = TorchTrainer(...)
          tuner = Tuner(
              trainer,
              param_space={
                  "batch_size": tune.grid_search([1, 2, 3])})
          results = tuner.fit()
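A grid search such as the `batch_size` example above amounts to running one trial per combination of the listed values. The sketch below expands a parameter space in plain Python. `expand_grid` and the `lr` entry are hypothetical additions for illustration, and this is not how Ray Tune is implemented internally.

```python
from itertools import product


def expand_grid(param_space):
    """Return one config dict per combination of the listed values."""
    names = sorted(param_space)  # fixed key order keeps the output deterministic
    combos = product(*(param_space[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# "batch_size" mirrors the slide; "lr" is an assumed second parameter.
param_space = {"batch_size": [1, 2, 3], "lr": [0.1, 0.01]}
trial_configs = expand_grid(param_space)

for config in trial_configs:
    print(config)
```

The number of trials is the product of the list lengths, which is why grid search becomes expensive as more parameters are added.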
  25. Scalable Batch Prediction with AIR's BatchPredictor

    • Execute inference on distributed data using CPUs and GPUs
    • Bring your own model or load existing checkpoints from Train

    [Diagram labels: Model, Batch Predictor, predict, Workers (CPU and GPU), Ray Dataset, Shards]

          predictor = BatchPredictor.from_checkpoint(
              checkpoint, XGBoostPredictor)
          results = predictor.predict(dataset)
          results.write_parquet("s3://...")
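Conceptually, batch prediction splits a dataset into shards, scores each shard on a worker, and concatenates the outputs. The following single-machine sketch shows that flow with a thread pool. `make_shards`, `predict_shard`, and the toy doubling model are assumptions for illustration, not part of `BatchPredictor`.

```python
from concurrent.futures import ThreadPoolExecutor


def make_shards(rows, num_shards):
    """Split rows into at most num_shards contiguous, near-equal shards."""
    size, extra = divmod(len(rows), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty shards when there are fewer rows than shards
            shards.append(rows[start:end])
        start = end
    return shards


def predict_shard(model, shard):
    """Score every row of one shard with the given model callable."""
    return [model(row) for row in shard]


def batch_predict(model, rows, num_workers=3):
    shards = make_shards(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields shard outputs in shard order, so row order is preserved
        shard_outputs = pool.map(lambda s: predict_shard(model, s), shards)
        return [pred for output in shard_outputs for pred in output]


def toy_model(x):
    """Assumed stand-in for a real trained model."""
    return 2 * x


predictions = batch_predict(toy_model, list(range(10)))
print(predictions)
```

Keeping shards contiguous and collecting them in order means the output lines up row for row with the input, which matters when predictions are later joined back to the source data.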
  26. Scalable Online Inference with Ray Serve

    • Deploy single models as HA inference services in Ray
    • Build multi-model pipelines with custom business logic

    [Diagram labels: Prediction requests, Predictor Deployment, Model]

          deployment = PredictorDeployment.options(
              name="XGBoostService")
          deployment.deploy(XGBoostPredictor, checkpoint, ...)
          print(deployment.url)
  28. Using Ray AIR for a single workload

    [Diagram labels: Orchestrator, Kubernetes/Cloud, Data loading, Preprocess, Training, Batch Prediction on Ray, Prediction Results]
  32. Scalable Batch Prediction on Ray AIR

    Here’s an example of using Ray for one part of your ML pipeline.

    [Diagram labels: Batch Prediction, Ray AIR, Prediction results]

          data_url = "s3://YOUR_IMAGE_DATA"
          model = models.resnet18(pretrained=True)
          dataset = ray.data.read_datasource(
              ImageFolderDatasource(), paths=[data_url])
          ckpt = TorchCheckpoint.from_model(model)
          predictor = BatchPredictor.from_checkpoint(
              ckpt, TorchPredictor)
          outputs = predictor.predict(
              dataset, column=["image"])
          outputs.write_s3(...)
  33. Ray AIR vs SageMaker Batch Inference

    Compared to SageMaker Batch Inference…
    → Create Airflow DAG
    → Create Docker image
    → Test locally
    → Push Docker image to ECR
    → Decide how many machines
    → Partition work across all machines
    → Copy files from S3 to local
    → Read all results from machines
    → Collate results
    → Tear all down

    Ray AIR in 3 steps
    → Start a Ray cluster
    → Submit your Python script
    → [Maybe] Shut down your Ray cluster
  35. Using Ray AIR for E2E ML Workflows

    [Diagram labels: Data Processing on Ray, Distributed Training on Ray, Hyperparameter Tuning on Ray, Batch Prediction on Ray]
  40. Using Ray AIR to scale E2E ML Workflows

          # Scalable Data Processing (Ray Data)
          dataset = ray.data.read_csv(...)
          train_ds, valid_ds = train_test_split(
              dataset, test_size=0.3)
          test_ds = valid_ds.drop_columns(["target"])
          preprocessor = StandardScaler(columns=["mean radius"])

          # Scalable Model Training (Ray Train)
          trainer = ray.train.xgboost.XGBoostTrainer(
              scaling_config=ScalingConfig(num_workers=128),
              label_column="target",
              datasets=dict(train=train_ds, valid=valid_ds),
              preprocessor=preprocessor)
          result = trainer.fit()

          # Scalable Model Tuning (Ray Tune)
          tuner = ray.tune.Tuner(
              trainer,
              param_space={"params": {"max_depth": tune.randint(1, 9)}},
              tune_config=TuneConfig(
                  num_samples=5, metric="logloss", mode="min"),
          )
          checkpoint = tuner.fit().get_best_result().checkpoint

          # Scalable Batch Prediction (Predictors)
          batch_predictor = BatchPredictor.from_checkpoint(
              checkpoint, XGBoostPredictor)
          predicted_probabilities = batch_predictor.predict(test_ds)
          predicted_probabilities.show()

    Scale out to a cluster with a one-line change.
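The tuning step above draws `num_samples` configurations and keeps the one with the best metric. A plain-Python sketch of that selection logic follows. `fake_logloss` is a made-up objective standing in for real training, and the exclusive upper bound on `max_depth` is an assumption about how `tune.randint(1, 9)` samples.

```python
import random


def fake_logloss(max_depth):
    """Made-up objective: pretend depth 4 is ideal and loss grows away from it."""
    return 0.1 + 0.05 * abs(max_depth - 4)


def random_search(objective, num_samples, mode="min", seed=0):
    """Sample configs at random, score each, and return the best one plus all trials."""
    rng = random.Random(seed)  # seeded so repeated runs draw the same configs
    trials = []
    for _ in range(num_samples):
        config = {"max_depth": rng.randrange(1, 9)}  # integers 1 through 8
        trials.append({"config": config, "logloss": objective(config["max_depth"])})
    pick = min if mode == "min" else max
    best = pick(trials, key=lambda t: t["logloss"])
    return best, trials


best, trials = random_search(fake_logloss, num_samples=5)
print(best)
```

In the slide's workflow each trial would be a full distributed training run, so the trials can execute concurrently; here they run one after another for simplicity.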
  42. Integrations with:

    • Data Ecosystem
    • ML frameworks
    • Optimization Libraries
    • Model Monitoring
    • Model Serving
  43. Custom Integrations with Ray AIR

    [Diagram labels: Scalable Data Preprocessing (Ray Data), Scalable Model Training (Ray Train), Scalable Model Tuning (Ray Tune), Scalable Batch Prediction (Predictors)]

          from ray.train import DataParallelTrainer
          class JaxTrainer(DataParallelTrainer):
              # define custom training logic
              ...
          trainer = JaxTrainer(dataset={..})
          trainer.fit()

          from ray.air.callbacks import Callback
          class CustomMLflowTracker(Callback):
              # define custom training logic
              ...
          tuner = Tuner(trainer, run_config=RunConfig(
              callback=CustomMLflowTracker()))
          tuner.fit()

          from ray.data.datasource import Datasource
          class DeltaLakeDatasource(Datasource):
              # define custom data source
              ...
          ds.read_datasource(DeltaLakeDatasource(...))
          ds.write_datasource(DeltaLakeDatasource(...))
  47. Using Ray AIR as the compute core for your ML platform

    [Diagram labels: Ray AIR programs on KubeRay / Anyscale; Model Registry, Monitoring, Experiment Tracking, Feature Store, Lakehouse, Notebook Service, Job Scheduler]
  48. Ray AIR Roadmap

    In Ray 2.0, Ray AIR is released as beta.

    In future releases, we plan to add:
    • Improved integrations with data sources, feature stores, and model monitoring services
    • Improved scalability and performance benchmarks for data-intensive use cases
  49. Users are excited about Ray AIR!

    “I’d say the productivity at least doubled if not more… I can’t wait until AIR is released. I’m using the nightly builds and it’s already a massive productivity boost.” - Data Scientist from large music streaming company

    “Ray AIR has greatly improved my developer experience … through intuitive abstractions like Ray Datasets and Train. In two weeks I was able to recreate and outperform a data ingest pipeline I built by hand over the course of 6 months.” - ML Engineer at telematics startup
  50. How to get involved?

    - Chat with the developers on the Ray Slack (#air-dogfooding channel!)
    - Come talk afterwards -- maybe we can form a recurring meetup in Seattle!