
Simplify and scale your XGBoost model using Ray on Anyscale

In this webinar, we will cover how Ray, a universal distributed computing framework running on Anyscale, simplifies the end-to-end machine learning lifecycle and provides serverless compute without limits. We will go through an example from beginning to end using XGBoost.

See firsthand how to:
- Load data with Ray Datasets
- Train an XGBoost model on Ray
- Perform hyperparameter tuning with Ray Tune
- Scale from your laptop to Anyscale with zero code changes
- Track experiments with Weights & Biases

Anyscale
March 16, 2022

Transcript

  1. Scaling XGBoost with Ray
    Phi Nguyen, ML Solution Architect, Anyscale
    Antoni Baum, ML Software Engineer, Anyscale
  2. Agenda
    1. State of Machine Learning Platform
    2. Next Generation ML Platform with Ray & Anyscale
    3. Simplify and Scale XGBoost on Ray
    4. Demo
    5. Questions
  3. The World Today: Distributed apps will become the norm
    “AI is the new electricity” - Andrew Ng
    Distributed computing necessary to achieve the promise of ML
    - The end of Moore’s Law
    - Data growing faster than compute
    MLOps is very hard and is a key initiative for many CDOs/CTOs as a competitive differentiator.
  4. Fundamental Challenges: Distributed apps will become the norm
    - ML ecosystem moving at a breakneck pace - new frameworks, new algorithms
    - Lack of a universal framework for distributed computing - often custom built or stitched together
    - Developers are stuck managing infrastructure instead of building applications
  5. State of ML Platform

  6. Uber Michelangelo in 2019

  7. Existing ML Pipelines Preprocess Save data Train

  8. Existing ML Pipelines

  9. Challenges
    • Performance overheads
      ◦ Serialization/Deserialization
      ◦ Data materialized to external storage
    • Implementation/Operational Complexity
      ◦ Impedance mismatch: Cross-languages, cross-workload
      ◦ CPUs vs GPUs
    • Missing operations
      ◦ Per-epoch shuffling
        ▪ How to do a fast, in-memory, distributed shuffle?
    • MLOps
      ◦ Often requires bespoke ML CI/CD tooling
  10. Simplify and Scale your ML Pipeline on Ray

  11. What is Ray?
    • A simple and general library for distributed computing
      ◦ Single machine or 1000s of nodes
      ◦ Agnostic to the type of work
    • An ecosystem of libraries (for scaling ML and more)
      ◦ e.g. Ray RLlib, Ray Train, Ray Tune, Ray Serve
    • Tools for launching clusters on any cloud provider
  12. Native Libraries 3rd Party Libraries Your app here! Universal framework

    for distributed computing Run anywhere Library + app ecosystem
  13. Ray Users 13000+ Repositories Depend on Ray 1600+ Open Source

    Contributors 449+ Who uses Ray?
  14. Ray Users 13000+ Repositories Depend on Ray 1600+ Open Source

    Contributors 449+ Growth of Ray open-source 19K stars as of 1/2/22
  15. Anyscale: The best way to develop, scale, and deploy AI apps on Ray
  16. Supercharge your Ray journey on Anyscale
    Accelerate time to market; Enterprise ready
    - Observability: Get full visibility into your Ray workloads
    - Multi-Cloud: Diversify and deploy your workloads across public clouds with the click of a button
    - Fully-managed service: Focus on innovation, not infra ops
    - From the creators of Ray: Access to Ray experts
    - Built for dev -> prod journey: Scale from laptop to cloud seamlessly; easy CI/CD integration
  17. Simplify your MLOps with Anyscale
    Effortlessly deploy AI workflows and models into production with your existing CI/CD tools.
    - Production jobs & services: Deploy ML workflows & models into production with ease
    - Observability: Monitor health with event logs and prebuilt dashboards
    - App packaging: Package apps, incl. all code and library dependencies
    - APIs & SDKs: Automate and integrate into your workflows (e.g. CI/CD)
  18. Infinite Laptop experience w/ power of the cloud
    - Client makes it as easy to run on the cloud as on your laptop
    - Built-in dashboards, integration with TensorBoard, Grafana
  19. Unified Compute
    Unify the end-to-end ML lifecycle on a single, scalable platform with a rich ecosystem of distributed machine learning libraries.
    [Diagram labels: Data Processing, Training, Serving, Hyper. Tuning, Others; Ray ecosystem + Native Ray + Anyscale; universal framework for distributed computing; Business Logic]
  20. Uber Michelangelo in 2021 - All in on Ray!

  21. Ray Datasets
    - Universal Data Loading
    - Last Mile Preprocessing
    - Parallel GPU/CPU Compute
    [Diagram: a ray.data.Dataset split into blocks spread across Node 1, Node 2, and Node 3]
  22. Ray Tune
    - Efficient algorithms that enable running trials in parallel
    - Effective orchestration of distributed trials
    - Easy to use APIs
    - Cutting edge optimization algorithms
    - Minimal code changes to work in distributed settings
    - Compatible with ML ecosystem
  23. XGBoost-Ray

  24. Motivation
    • There are existing solutions for distributed XGBoost
      ◦ E.g. Spark, Dask, Kubernetes
    • But most existing solutions lack one or more of:
      ◦ Dynamic computation graphs
      ◦ Fault tolerance handling
      ◦ GPU support
      ◦ Integration with hyperparameter tuning libraries
  25. XGBoost-Ray
    • Ray actors for stateful training workers
    • Advanced fault tolerance mechanisms
    • Full (multi) GPU support
    • Locality-aware distributed data loading
    • Integration with Ray Tune
  26. Recap: XGBoost
    Gradient boosting:
    • Add a new model at each iteration
    • Trees or linear models
    • Each step tries to fit the residuals using loss gradients
    • (XGBoost: 2nd order Taylor approximations)
    Tree 1 + Tree 2 + Tree 3 + ...
  27. Recap: Distributed XGBoost

  28. Architecture: Distributed data loading
    [Diagram: a Driver and four @ray.remote actors (Worker 1 to Worker 4), each calling load_data()]
  29. Architecture: Distributed data loading, Tree-based allreduce (Rabit)
    [Diagram: a Driver and Worker 1 to Worker 4, each calling load_data() and then xgb.train()]
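The tree-based allreduce named on this slide can be illustrated with a single-process simulation. This is a sketch of the general pattern, not Rabit's implementation: it assumes workers are arranged as an implicit binary tree (worker `i` has parent `(i - 1) // 2`) and that each worker holds a small vector of statistics; `tree_allreduce` is a hypothetical name.

```python
def tree_allreduce(worker_values):
    """Sum per-worker vectors up a binary tree, then broadcast the total down."""
    n = len(worker_values)
    partial = [list(v) for v in worker_values]

    # Reduce phase: walk from the last worker to the first so that every node
    # has already absorbed its children before it is added to its parent.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] = [a + b for a, b in zip(partial[parent], partial[i])]

    # Broadcast phase: the root (worker 0) now holds the global sum and each
    # worker copies the result from its parent.
    result = [None] * n
    result[0] = list(partial[0])
    for i in range(1, n):
        result[i] = list(result[(i - 1) // 2])
    return result

# Four workers, each with two locally computed statistics.
local_stats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
reduced = tree_allreduce(local_stats)
print(reduced[0])
```

After the call every worker holds the same element-wise sum, which is what lets each worker continue training from identical global statistics.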
  30. Architecture: Distributed data loading, Tree-based allreduce (Rabit), Checkpoints, Eval results
    [Diagram: a Driver and Worker 1 to Worker 4, each calling load_data() and then xgb.train()]
  31. Performance: 100K rows, 1K features, 2 classes, 10 boosting rounds, all on GPU aside from hist
  32. Distributed data loading
    [Diagram: Partitions A to H of a distributed dataframe (e.g. Ray Datasets, Dask) spread across Node 1 to Node 4 and loaded by XGBoost-Ray Workers 1 to 4]
  33. Fault tolerance
    • In distributed training, some worker nodes are bound to fail eventually
    • Default: Simple (cold) restart from last checkpoint
    • Non-elastic training (warm restart): Only failing worker restarts
    • Elastic training: Continue training with fewer workers until failed actor is back
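The warm-restart and elastic modes listed above map onto options of `xgboost_ray.RayParams`. The sketch below assumes the `xgboost_ray` package is installed for the final step; the field names (`elastic_training`, `max_failed_actors`, `max_actor_restarts`) follow that library, while the specific values and variable names are illustrative assumptions.

```python
# Keyword arguments for the two restart modes described above.
# Field names follow xgboost_ray.RayParams; the values are illustrative.
warm_restart_kwargs = {
    "num_actors": 4,
    "elastic_training": False,  # wait for the failed actor before resuming
    "max_actor_restarts": 2,    # how many times a failed actor may restart
}
elastic_kwargs = {
    "num_actors": 4,
    "elastic_training": True,   # keep training with the remaining actors
    "max_failed_actors": 1,     # tolerate at most one missing actor
    "max_actor_restarts": 2,
}

try:
    from xgboost_ray import RayParams
except ImportError:
    RayParams = None  # library not installed; the dicts above still document the options

if RayParams is not None:
    warm_restart_params = RayParams(**warm_restart_kwargs)
    elastic_params = RayParams(**elastic_kwargs)
```

Either object would then be passed as `ray_params=` to `xgboost_ray.train`.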
  34. Fault tolerance: Simple (cold) restart
    [Timeline diagram: Workers 1 to 4 over time; legend: Training, Paused, Failed, Stopped, Loading data]
  35. Fault tolerance: Non-elastic training (warm restart)
    [Timeline diagram: Workers 1 to 4 over time; legend: Training, Paused, Failed, Stopped, Loading data]
  36. Fault tolerance: Elastic training
    [Timeline diagram: Workers 1 to 4 over time; legend: Training, Paused, Failed, Stopped, Loading data; annotation: Finishes earlier]
  37. Fault tolerance: Benchmarks
    Condition         Affected workers  Eval error  Time (s)
    Baseline          0                 0.133326    1441.44
    Fewer workers     1                 0.134000    1227.45
    Fewer workers     2                 0.133977    1249.45
    Fewer workers     3                 0.133333    1291.54
    Non elastic       1                 0.133552    2205.95
    Non elastic       2                 0.133211    2226.96
    Non elastic       3                 0.133552    2033.94
    Elastic training  1                 0.133763    1231.58
    Elastic training  2                 0.133771    1197.55
    Elastic training  3                 0.133704    1259.37
    30M rows, 500 features, 2 classes, 100 boosting rounds, 10 workers
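To compare the conditions at a glance, the timings from the benchmark slide can be averaged per condition and expressed relative to the baseline. The numbers below are copied from the slide; the averaging across affected-worker counts is a convenience of this sketch, not something the slide reports.

```python
# Wall-clock times in seconds, copied from the benchmark slide.
baseline_seconds = 1441.44
times = {
    "Fewer workers": [1227.45, 1249.45, 1291.54],
    "Non elastic": [2205.95, 2226.96, 2033.94],
    "Elastic training": [1231.58, 1197.55, 1259.37],
}

# Mean time per condition across the 1, 2 and 3 affected-worker runs.
mean_seconds = {name: sum(values) / len(values) for name, values in times.items()}
# Ratio to the failure-free baseline: above 1 is slower, below 1 is faster.
relative_to_baseline = {
    name: mean / baseline_seconds for name, mean in mean_seconds.items()
}

for name in times:
    print(f"{name}: {mean_seconds[name]:.2f} s, "
          f"{relative_to_baseline[name]:.2f}x baseline")
```

On these figures the non-elastic runs average roughly 1.5x the baseline time, while elastic training averages below the baseline, at a similar eval error.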
  38. Hyperparameter tuning
    [Diagram: Trial 1 (eta: 0.1, gamma: 0.2), Trial ... (eta: 0.3, gamma: 0.1), Trial n (eta: 0.2, gamma: 0.0); each trial runs Worker 1 to Worker m]
    - Early stopping
    - Searchers (e.g. BO, TPE)
    - Report checkpoints and results
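The interplay of a searcher and early stopping can be sketched without any tuning library. This toy uses random search as the searcher and successive halving as the early-stopping rule; it is not Ray Tune's API, it runs trials sequentially rather than in parallel, and `validation_error` is a hypothetical stand-in for training XGBoost and reporting a metric.

```python
import random

def validation_error(eta, gamma, rounds):
    """Hypothetical stand-in for training for `rounds` boosting rounds and evaluating."""
    return (eta - 0.2) ** 2 + (gamma - 0.1) ** 2 + 1.0 / rounds

# "Searcher": sample eight configurations at random from the search space.
rng = random.Random(0)
trials = [
    {"eta": rng.uniform(0.01, 0.3), "gamma": rng.uniform(0.0, 0.5)}
    for _ in range(8)
]

# "Early stopping": after each rung, stop the worse half of the trials and
# give the survivors twice the training budget.
rounds = 10
survivors = trials
rung_sizes = []
while len(survivors) > 1:
    rung_sizes.append(len(survivors))
    ranked = sorted(
        survivors,
        key=lambda t: validation_error(t["eta"], t["gamma"], rounds),
    )
    survivors = ranked[: len(ranked) // 2]
    rounds *= 2

best = survivors[0]
print(f"trials per rung: {rung_sizes}, best eta={best['eta']:.3f}, gamma={best['gamma']:.3f}")
```

Only the most promising configurations receive the full budget, which is the saving that early stopping is meant to provide.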
  39. API example

    # Single-node XGBoost
    from sklearn.datasets import load_breast_cancer
    from xgboost import DMatrix, train

    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = DMatrix(train_x, train_y)

    bst = train(
        {"objective": "binary:logistic"},
        train_set
    )
    bst.save_model("trained.xgb")

    # Distributed XGBoost-Ray: swap the import and the DMatrix, add ray_params
    from xgboost_ray import RayDMatrix, RayParams, train

    train_set = RayDMatrix(train_x, train_y)

    bst = train(
        {"objective": "binary:logistic"},
        train_set,
        ray_params=RayParams(num_actors=2)
    )
    bst.save_model("trained.xgb")
  40. Demo

  41. Questions & Answers