Distributed XGBoost on Ray (Kai Fricke & Michael Mui)

Michael Mui and Kai Fricke discuss XGBoost-Ray, a scalable backend for distributed XGBoost training.

Anyscale

July 13, 2021

Transcript

  1. Distributed XGBoost on Ray. Kai Fricke (Anyscale), Michael Mui (Uber)

  2. Overview
    Part 1: Design and features of XGBoost-Ray
    • Motivation
    • Architecture
    • Distributed data loading
    • Fault tolerance
    • Hyperparameter optimization
    Part 2: Ray and XGBoost-Ray at Uber
    • Distributed ML and DL challenges at scale
    • Ray and distributed XGBoost on Ray at Uber
    • Next steps
  3. Motivation
    • There are existing solutions for distributed XGBoost
      • E.g. Spark, Dask, Kubernetes
    • But most existing solutions are lacking
      • Dynamic computation graphs
      • Fault tolerance handling
      • GPU support
      • Integration with hyperparameter tuning libraries
  4. XGBoost-Ray
    • Ray actors for stateful training workers
    • Advanced fault tolerance mechanisms
    • Full (multi) GPU support
    • Locality-aware distributed data loading
    • Integration with Ray Tune
  5. Recap: XGBoost
    Gradient boosting:
    • Add a new model at each iteration
    • Trees or linear models
    • Each step tries to fit the residuals using loss gradients
    • (XGBoost: 2nd order Taylor approximations)
    Tree 1 + Tree 2 + Tree 3 + ...
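
The second-order Taylor approximation mentioned on this slide can be written out as follows. This is the standard form from the XGBoost literature, not something shown in the deck; the symbols ($l$ for the loss, $f_t$ for the model added in round $t$, $\Omega$ for the regularizer) are notation introduced here for illustration.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\big(y_i, \hat{y}_i^{(t-1)}\big)
  + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}
```

Here $g_i$ and $h_i$ are the per-row gradient and Hessian of the loss at the previous round's prediction. In distributed training, sums of these statistics are what the workers need to agree on.
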
  6. Recap: Distributed XGBoost

  7. Architecture
    [Diagram: a Driver and Workers 1-4, which are @ray.remote actors. Each worker calls load_data(). Label: Distributed data loading.]
  8. Architecture
    [Diagram: a Driver and Workers 1-4. Each worker calls load_data() and then xgb.train(). Labels: Distributed data loading; Tree-based allreduce (Rabit).]
  9. Architecture
    [Diagram: a Driver and Workers 1-4. Each worker calls load_data() and then xgb.train(). Labels: Distributed data loading; Tree-based allreduce (Rabit); Checkpoints; Eval results.]
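
The tree-based allreduce named on the architecture slides can be illustrated with a small in-process simulation. This is only a sketch of the general pattern (reduce up a binary tree, then broadcast down); it is not Rabit's implementation, which runs over the network and has its own topology and recovery logic. The function name `tree_allreduce` is made up for this example.

```python
from typing import List


def tree_allreduce(values: List[float]) -> List[float]:
    """Simulate a sum-allreduce over workers arranged as a binary tree.

    Worker i has parent (i - 1) // 2. Partial sums flow up to the root
    (reduce phase), then the total flows back down (broadcast phase).
    """
    n = len(values)
    partial = list(values)

    # Reduce phase: visit workers from last to first so that every child
    # has already accumulated its own subtree before sending upward.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] += partial[i]

    # Broadcast phase: the root holds the global sum; push it down the tree.
    result = [0.0] * n
    if n:
        result[0] = partial[0]
    for i in range(1, n):
        parent = (i - 1) // 2
        result[i] = result[parent]
    return result


# Four workers, each holding one local statistic (for example a gradient sum).
local_stats = [1.0, 2.0, 3.0, 4.0]
print(tree_allreduce(local_stats))  # every worker ends up with the same total
```

After the call, every worker holds the same aggregated value, which is what lets each one build an identical tree from its own shard of the data.
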
  10. Distributed data loading
    [Diagram: Partitions A-H of a distributed dataframe (e.g. Modin) spread across Nodes 1-4, mapped to XGBoost-Ray Workers 1-4.]
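
A minimal sketch of what locality-aware assignment could look like: each partition goes to a worker on the node that already holds it, falling back to any worker otherwise. The function name, the node and worker labels, and the two-partitions-per-node layout are all hypothetical, and the actual XGBoost-Ray assignment logic may differ (for example in how it balances shard counts).

```python
from collections import defaultdict
from typing import Dict, List


def assign_partitions(
    partition_nodes: Dict[str, str],
    worker_nodes: Dict[str, str],
) -> Dict[str, List[str]]:
    """Assign each partition to a worker, preferring one on the same node."""
    workers_by_node = defaultdict(list)
    for worker, node in worker_nodes.items():
        workers_by_node[node].append(worker)

    assignment: Dict[str, List[str]] = {worker: [] for worker in worker_nodes}
    for partition, node in sorted(partition_nodes.items()):
        # Prefer co-located workers; fall back to all workers otherwise.
        candidates = workers_by_node.get(node) or list(worker_nodes)
        # Among the candidates, pick the least-loaded worker.
        target = min(candidates, key=lambda w: len(assignment[w]))
        assignment[target].append(partition)
    return assignment


# Hypothetical layout: eight partitions, two per node, one worker per node.
partition_nodes = {
    "A": "node1", "B": "node1", "C": "node2", "D": "node2",
    "E": "node3", "F": "node3", "G": "node4", "H": "node4",
}
worker_nodes = {
    "worker1": "node1", "worker2": "node2",
    "worker3": "node3", "worker4": "node4",
}
shards = assign_partitions(partition_nodes, worker_nodes)
print(shards)
```

With this layout no partition has to cross a node boundary, which is the point of locality-aware loading.
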
  11. Fault tolerance
    • In distributed training, some worker nodes are bound to fail eventually
    • Default: Simple (cold) restart from last checkpoint
    • Non-elastic training (warm restart): Only failing worker restarts
    • Elastic training: Continue training with fewer workers until failed actor is back
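
The two actor-level modes above map onto `RayParams` settings. The sketch below follows the options described in the XGBoost-Ray README (`elastic_training`, `max_failed_actors`, `max_actor_restarts`); the numeric values are illustrative, not taken from the talk, and the import is guarded so the settings can be read even where `xgboost_ray` is not installed.

```python
# Keyword arguments for the two restart modes. Values are illustrative only.
non_elastic = dict(
    num_actors=4,
    elastic_training=False,  # pause until the failed actor has been replaced
    max_actor_restarts=2,    # number of retries when actors fail
)
elastic = dict(
    num_actors=4,
    elastic_training=True,   # keep training on the remaining actors
    max_failed_actors=3,     # how many failed actors training may continue with
    max_actor_restarts=2,
)

try:
    from xgboost_ray import RayParams
except ImportError:
    RayParams = None  # library not installed; the dicts above still show the settings

if RayParams is not None:
    non_elastic_params = RayParams(**non_elastic)
    elastic_params = RayParams(**elastic)
```

Either object would then be passed as `ray_params=` to `xgboost_ray.train`.
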
  12. Fault tolerance: Simple (cold) restart
    [Timeline diagram: Workers 1-4 over time. Legend: Training, Paused, Failed, Stopped, Loading data.]
  13. Fault tolerance: Non-elastic training (warm restart)
    [Timeline diagram: Workers 1-4 over time. Legend: Training, Paused, Failed, Stopped, Loading data.]
  14. Fault tolerance: Elastic training
    [Timeline diagram: Workers 1-4 over time. Legend: Training, Paused, Failed, Stopped, Loading data. Annotation: Finishes earlier.]
  15. Fault tolerance: Benchmarks
    30M rows, 500 features, 2 classes, 100 boosting rounds, 10 workers

    Condition          Affected workers  Eval error  Time (s)
    Baseline           0                 0.133326    1441.44
    Fewer workers      1                 0.134000    1227.45
    Fewer workers      2                 0.133977    1249.45
    Fewer workers      3                 0.133333    1291.54
    Non elastic        1                 0.133552    2205.95
    Non elastic        2                 0.133211    2226.96
    Non elastic        3                 0.133552    2033.94
    Elastic training   1                 0.133763    1231.58
    Elastic training   2                 0.133771    1197.55
    Elastic training   3                 0.133704    1259.37
  16. Hyperparameter tuning
    [Diagram: Trials 1 ... n, each with its own hyperparameters (eta: 0.1, gamma: 0.2; eta: 0.3, gamma: 0.1; eta: 0.2, gamma: 0.0) and its own Workers 1 ... m.]
    • Early stopping
    • Searchers (e.g. BO, TPE)
    • Report checkpoints and results
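
The trial layout on this slide, with each trial running its own group of training actors, could be set up roughly as below. The sketch is modeled on the Ray Tune example in the XGBoost-Ray README from around the time of the talk; the search ranges, actor counts, and the `run_search` wrapper are assumptions for illustration, and newer Ray releases favor `tune.Tuner` over `tune.run`. Imports sit inside the function so the snippet loads without Ray installed.

```python
def run_search(num_samples: int = 4):
    """Run a small Ray Tune search over XGBoost-Ray training.

    Requires ray[tune], xgboost_ray and scikit-learn to be installed.
    """
    from ray import tune
    from sklearn.datasets import load_breast_cancer
    from xgboost_ray import RayDMatrix, RayParams, train

    # Each trial starts its own set of training actors.
    ray_params = RayParams(num_actors=2, cpus_per_actor=1)

    def train_model(config):
        train_x, train_y = load_breast_cancer(return_X_y=True)
        train_set = RayDMatrix(train_x, train_y)
        # When run inside a Tune trial, XGBoost-Ray reports eval results to Tune.
        train(
            params=config,
            dtrain=train_set,
            evals=[(train_set, "train")],
            verbose_eval=False,
            ray_params=ray_params,
        )

    # Illustrative search space over the two parameters shown on the slide.
    search_space = {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "eta": tune.loguniform(1e-4, 1e-1),
        "gamma": tune.uniform(0.0, 0.5),
    }

    return tune.run(
        train_model,
        config=search_space,
        metric="train-error",
        mode="min",
        num_samples=num_samples,
        # Reserve resources for all training actors of each trial.
        resources_per_trial=ray_params.get_tune_resources(),
    )
```

Calling `run_search()` on a machine with Ray available returns the Tune analysis object, from which the best configuration can be read.
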
  17. API example
    Plain XGBoost:

        from sklearn.datasets import load_breast_cancer
        from xgboost import DMatrix, train

        train_x, train_y = load_breast_cancer(return_X_y=True)
        train_set = DMatrix(train_x, train_y)

        bst = train(
            {"objective": "binary:logistic"},
            train_set
        )
        bst.save_model("trained.xgb")

    XGBoost-Ray (changed lines: the import, RayDMatrix, and ray_params):

        from sklearn.datasets import load_breast_cancer
        from xgboost_ray import RayDMatrix, RayParams, train

        train_x, train_y = load_breast_cancer(return_X_y=True)
        train_set = RayDMatrix(train_x, train_y)

        bst = train(
            {"objective": "binary:logistic"},
            train_set,
            ray_params=RayParams(num_actors=2)
        )
        bst.save_model("trained.xgb")
  18. Ray and Distributed XGBoost on Ray at Uber Michael Mui

  19. Overview
    Part 1: Design and features of XGBoost-Ray
    • Motivation
    • Architecture
    • Distributed data loading
    • Fault tolerance
    • Hyperparameter optimization
    Part 2: Ray and XGBoost-Ray at Uber
    • Distributed ML and DL challenges at scale
    • Ray and distributed XGBoost on Ray at Uber
    • Next steps
  20. Distributed ML and DL Challenges at Scale
    • Distributed Training
      • Fault-Tolerance and Auto-Scaling
      • Resource-aware Scheduling and Job Provisioning
    • Distributed Hyperparameter Optimization
      • Budget-constraint Patterns (e.g. Successive-Halving, Population-based)
      • Dynamic Resource Allocation
    • Unified Compute and API
      • Heterogeneous compute across workflow stages
      • Moving data across stages
    Elastic Horovod on Ray (2021): https://eng.uber.com/horovod-ray/
  21. Ray: A General-Purpose Distributed Framework
    1. Open Platform
      a. Open source
      b. Run anywhere
    2. Set of general distributed compute primitives
      a. Fault-Tolerance / Auto-Scaling
      b. Dynamic task graph and actors (stateful API)
      c. Shared memory
    3. Rich ML ecosystem
      a. Distributed Hyperparameter Search: Ray Tune
  22. How does Ray help?
    • Unified backend for Distributed Training
      • XGBoost or Horovod + Ray
    • Unified backend for Distributed Hyperparameter Search
      • XGBoost or Horovod + Ray Tune
    • General-purpose compute backend for ML Workflows
      • XGBoost or Horovod + Ray Tune + Dask on Ray
    Elastic Horovod on Ray (2021): https://eng.uber.com/horovod-ray/
  23. End-to-end ML at Uber
    Unified ML Model Representation Blog (2019): https://eng.uber.com/michelangelo-machine-learning-model-representation/
  24. Model Training in Production
    [Logos: Apache Spark + Ray = ?]
    How do we combine distributed training on Ray with Apache Spark?
    Apache Spark logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
  25. Distributed Training in Spark: Challenges
    1. DataFrames / RDDs not well-suited to distributed training
    2. Spark applications typically run on CPU, distributed training on GPU
    Spark
    • Jobs typically easy to fan out with cheap CPU machines
    • Transformations do not benefit as much from GPU acceleration
    Distributed Training
    • Compute bound, not data bound
    • Computations easy to represent with linear algebra
  26. Spark ML Pipeline Abstractions
    Unified ML Model Representation Blog (2019): https://eng.uber.com/michelangelo-machine-learning-model-representation/
  27. Unit Contract between Training and Serving
    Serving is essentially a series of chained calls to each Transformer's transform or score interface (Spark Summit 2019):

        PREDICTIONS = TXn.score(... TX2.score(TX1.score(DATA)))
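
The chained-call contract on this slide can be mimicked in a few lines of plain Python. The two transformer classes and the `serve` helper are hypothetical stand-ins invented for this sketch; only the idea of feeding each `score` output into the next transformer comes from the slide.

```python
from functools import reduce
from typing import Any, List, Sequence


class Scaler:
    """Hypothetical transformer: multiplies every value by a factor."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def score(self, data: List[float]) -> List[float]:
        return [x * self.factor for x in data]


class Thresholder:
    """Hypothetical transformer: maps each value to 1 if above a cutoff, else 0."""

    def __init__(self, cutoff: float) -> None:
        self.cutoff = cutoff

    def score(self, data: List[float]) -> List[int]:
        return [1 if x > self.cutoff else 0 for x in data]


def serve(transformers: Sequence[Any], data: Any) -> Any:
    """Apply TX1.score, then TX2.score, and so on up to TXn.score."""
    return reduce(lambda acc, tx: tx.score(acc), transformers, data)


predictions = serve([Scaler(2.0), Thresholder(1.0)], [0.2, 0.5, 0.9])
print(predictions)  # [0, 0, 1]
```

Because every stage exposes the same method, a trained model is just one more element in the list, which is what allows an estimator backed by Ray to slot into an existing pipeline.
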
  28. Data Preparation with Spark

        from pyspark.ml import Pipeline
        from pyspark.ml.feature import StringIndexer
        # StringToFloatEncoder and train_test_split are not defined on the
        # slide; they are presumably Uber-internal helpers.

        data_processing_pipeline = Pipeline(stages=[
            StringToFloatEncoder(...),
            StringIndexer(...)])

        # Train/Test Split. Fit the data transformation Pipeline to generate
        # the data transformer
        (train_df, test_df) = train_test_split(spark_dataframe, split_ratio=0.1)
        data_pipeline = data_processing_pipeline.fit(train_df)

        # Transform the original train/test data set with the fitted data transformer
        transformed_train_data = data_pipeline.transform(train_df)
        transformed_test_data = data_pipeline.transform(test_df)
  29. Ray Estimator

        from pyspark.ml import Pipeline
        from xgboost_ray import RayParams, train
        from malib.ray.common.estimator import RayEstimator

        # Create the Ray Estimator
        xgboost_estimator = RayEstimator(
            remote_fn=train,
            ray_params=RayParams(num_actors=10),
            config={"eta": 0.3, "max_depth": 10},
            ...)

        # Create Spark Pipeline with Ray Estimator
        pipeline = Pipeline(stages=[..., xgboost_estimator, ...])

        # Fit the entire Pipeline on Train dataset and do batch serving
        trained_pipeline = pipeline.fit(train_df)
        pred_df = trained_pipeline.transform(test_df)
  31. Creating Ray Context

        # Create the Ray Context (Backend)
        job_config = raylib.RayJobConfig(
            zone="dummy-zone",
            timeout=36000,
            num_cpus=8,
            num_gpus=2,
            num_workers=50,
            custom_docker="docker")
        ray_ctx = raylib.RayContext(job_config)
  32. Hyperparameter Optimization with Ray Tune

        # Create the Ray Estimator for Hyperparameter Search
        def start_tune():
            from xgboost_ray import train
            from ray import tune
            tune.run(train, ...)
            ...

        xgboost_tune_estimator = RayEstimator(
            remote_fn=start_tune,
            ray_params=RayParams(num_actors=10),
            config={"max_trials": 100, "scheduler": "hyperband"},
            ...)
  33. Next Steps: Unified Compute Infrastructure for ML and DL
    • Moving towards:
      • Ray Tune + Horovod on Ray for DL auto-scaling training + hyperopt
      • Ray Tune + XGBoost on Ray for ML auto-scaling training + hyperopt
      • Dask on Ray for preprocessing
    • No need to maintain / provision separate infra resources
    • No need to materialize preprocessed data to disk
    • Leverage data locality
    • Colocation and shared memory
    Elastic Horovod on Ray (2021): https://eng.uber.com/horovod-ray/
  34. XGBoost on Ray: https://docs.ray.io/en/master/xgboost-ray.html
    Uber Engineering Blogs:
    • https://eng.uber.com/elastic-xgboost-ray/
    • https://eng.uber.com/horovod-ray/
    Thank you!