
Distributed XGBoost on Ray (Kai Fricke & Michael Mui)


Michael Mui and Kai Fricke discuss XGBoost-Ray, a scalable backend for distributed XGBoost training.

Anyscale

July 13, 2021

Transcript

  1. Overview
     Part 1: Design and features of XGBoost-Ray
     • Motivation
     • Architecture
     • Distributed data loading
     • Fault tolerance
     • Hyperparameter optimization
     Part 2: Ray and XGBoost-Ray at Uber
     • Distributed ML and DL challenges at scale
     • Ray and distributed XGBoost on Ray at Uber
     • Next steps
  2. Motivation
     • There are existing solutions for distributed XGBoost, e.g. Spark, Dask, Kubernetes
     • But most existing solutions lack:
       • Dynamic computation graphs
       • Fault tolerance handling
       • GPU support
       • Integration with hyperparameter tuning libraries
  3. XGBoost-Ray
     • Ray actors for stateful training workers
     • Advanced fault tolerance mechanisms
     • Full (multi) GPU support
     • Locality-aware distributed data loading
     • Integration with Ray Tune
  4. Recap: XGBoost
     Gradient boosting:
     • Add a new model at each iteration (trees or linear models)
     • Each step tries to fit the residuals using loss gradients
     • (XGBoost: 2nd order Taylor approximations)
     (Diagram: ensemble built additively as Tree 1 + Tree 2 + Tree 3 + ...)
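     As a refresher (notation taken from the XGBoost paper, not from this talk), the objective minimized at boosting round t uses the first and second loss derivatives g_i and h_i:

     \mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i\, f_t(x_i) + \tfrac{1}{2} h_i\, f_t(x_i)^2 \Big] + \Omega(f_t),
     \qquad g_i = \partial_{\hat{y}_i^{(t-1)}} \ell\big(y_i, \hat{y}_i^{(t-1)}\big),
     \quad h_i = \partial^2_{\hat{y}_i^{(t-1)}} \ell\big(y_i, \hat{y}_i^{(t-1)}\big)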
  5. Architecture: the driver creates @ray.remote actors (Worker 1-4), and each worker loads its own data shard via load_data() (distributed data loading).
  6. Architecture: each worker then runs xgb.train() on its shard, with gradients exchanged via tree-based allreduce (Rabit).
  7. Architecture: during training, the workers report checkpoints and eval results back to the driver.
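     A minimal sketch of this actor layout (illustrative only, not the actual xgboost_ray internals; class and method names here are made up):

     import ray
     import xgboost as xgb

     @ray.remote
     class TrainingWorker:
         """Stateful Ray actor that holds one data shard between calls."""

         def load_data(self, X, y):
             # Each worker keeps its own DMatrix in memory.
             self.dtrain = xgb.DMatrix(X, label=y)

         def train(self, params, num_boost_round=10):
             # In xgboost_ray the workers additionally join a Rabit ring for
             # tree-based allreduce; here each actor just trains on its shard.
             return xgb.train(params, self.dtrain, num_boost_round=num_boost_round)

     # Driver side: create four stateful workers, then call load_data() and
     # train() on them remotely.
     ray.init()
     workers = [TrainingWorker.remote() for _ in range(4)]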
  8. Distributed data loading: partitions A-H of a distributed dataframe (e.g. Modin) are spread across Nodes 1-4; XGBoost-Ray workers 1-4 are co-located with those nodes and load the partitions stored locally.
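     For example, a RayDMatrix can be built directly from an already-distributed dataframe, so each worker reads the partitions that live on its node (a sketch; the path and label column are placeholders):

     import modin.pandas as mpd                            # assumes Modin is installed
     from xgboost_ray import RayDMatrix

     df = mpd.read_parquet("s3://bucket/training_data/")   # hypothetical path
     train_set = RayDMatrix(df, label="target")            # "target" column used as the label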
  9. Fault tolerance
     • In distributed training, some worker nodes are bound to fail eventually
     • Default: simple (cold) restart from last checkpoint
     • Non-elastic training (warm restart): only the failing worker restarts
     • Elastic training: continue training with fewer workers until the failed actor is back
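     These modes are configured through RayParams (a sketch; the parameter names follow the 2021-era xgboost_ray releases, so check the current docs):

     from xgboost_ray import RayParams

     ray_params = RayParams(
         num_actors=10,
         elastic_training=True,    # keep training with fewer workers after a failure
         max_failed_actors=3,      # how many actors may be dead at the same time
         max_actor_restarts=2,     # how often failed actors may be restarted
     )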
  10. Fault tolerance: simple (cold) restart
      (Timeline: when one worker fails, all workers stop, reload data, and training restarts from the last checkpoint)
  11. Fault tolerance: non-elastic training (warm restart)
      (Timeline: when one worker fails, the remaining workers pause; only the failed worker reloads data before training resumes)
  12. Fault tolerance: elastic training
      (Timeline: training continues with the remaining workers while the failed worker reloads its data, so the job finishes earlier)
  13. Fault tolerance: Benchmarks
      (30M rows, 500 features, 2 classes, 100 boosting rounds, 10 workers)

      Condition        | Affected workers | Eval error | Time (s)
      Baseline         | 0                | 0.133326   | 1441.44
      Fewer workers    | 1                | 0.134000   | 1227.45
      Fewer workers    | 2                | 0.133977   | 1249.45
      Fewer workers    | 3                | 0.133333   | 1291.54
      Non-elastic      | 1                | 0.133552   | 2205.95
      Non-elastic      | 2                | 0.133211   | 2226.96
      Non-elastic      | 3                | 0.133552   | 2033.94
      Elastic training | 1                | 0.133763   | 1231.58
      Elastic training | 2                | 0.133771   | 1197.55
      Elastic training | 3                | 0.133704   | 1259.37
  14. Hyperparameter tuning: Ray Tune runs many trials (e.g. Trial 1: eta 0.1, gamma 0.2; Trial n: eta 0.2, gamma 0.0), each with its own set of XGBoost-Ray workers (Worker 1..m). Searchers (e.g. BO, TPE) propose configurations, trials report checkpoints and results, and underperforming trials are stopped early.
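      A sketch of this integration using the open-source APIs (the search space is illustrative; get_tune_resources() reserves each trial's actor resources in recent xgboost_ray versions):

      from sklearn.datasets import load_breast_cancer
      from ray import tune
      from xgboost_ray import RayDMatrix, RayParams, train

      def train_model(config):
          # Each Tune trial trains a distributed XGBoost model with 2 actors.
          train_x, train_y = load_breast_cancer(return_X_y=True)
          train_set = RayDMatrix(train_x, train_y)
          train(config, train_set, ray_params=RayParams(num_actors=2))

      analysis = tune.run(
          train_model,
          config={
              "objective": "binary:logistic",
              "eta": tune.loguniform(1e-4, 3e-1),
              "gamma": tune.uniform(0.0, 0.5),
          },
          resources_per_trial=RayParams(num_actors=2).get_tune_resources(),
          num_samples=10,
      )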
  15. API example

      # Single-node XGBoost
      from sklearn.datasets import load_breast_cancer
      from xgboost import DMatrix, train

      train_x, train_y = load_breast_cancer(return_X_y=True)

      train_set = DMatrix(train_x, train_y)
      bst = train(
          {"objective": "binary:logistic"},
          train_set
      )
      bst.save_model("trained.xgb")

      # Distributed with XGBoost-Ray
      from xgboost_ray import RayDMatrix, RayParams, train

      train_set = RayDMatrix(train_x, train_y)
      bst = train(
          {"objective": "binary:logistic"},
          train_set,
          ray_params=RayParams(num_actors=2)
      )
      bst.save_model("trained.xgb")
  16. Overview
      Part 1: Design and features of XGBoost-Ray
      • Motivation
      • Architecture
      • Distributed data loading
      • Fault tolerance
      • Hyperparameter optimization
      Part 2: Ray and XGBoost-Ray at Uber
      • Distributed ML and DL challenges at scale
      • Ray and distributed XGBoost on Ray at Uber
      • Next steps
  17. Distributed ML and DL Challenges at Scale
      • Distributed training
        • Fault tolerance and auto-scaling
        • Resource-aware scheduling and job provisioning
      • Distributed hyperparameter optimization
        • Budget-constrained patterns (e.g. Successive Halving, Population-Based)
        • Dynamic resource allocation
      • Unified compute and API
        • Heterogeneous compute across workflow stages
        • Moving data across stages
      Elastic Horovod on Ray (2021): https://eng.uber.com/horovod-ray/
  18. Ray: A General-Purpose Distributed Framework
      1. Open platform
         a. Open source
         b. Run anywhere
      2. Set of general distributed compute primitives
         a. Fault tolerance / auto-scaling
         b. Dynamic task graph and actors (stateful API)
         c. Shared memory
      3. Rich ML ecosystem
         a. Distributed hyperparameter search: Ray Tune
  19. How does Ray help?
      • Unified backend for distributed training: XGBoost or Horovod + Ray
      • Unified backend for distributed hyperparameter search: XGBoost or Horovod + Ray Tune
      • General-purpose compute backend for ML workflows: XGBoost or Horovod + Ray Tune + Dask on Ray
      Elastic Horovod on Ray (2021): https://eng.uber.com/horovod-ray/
  20. End-to-end ML at Uber
      Unified ML Model Representation Blog (2019): https://eng.uber.com/michelangelo-machine-learning-model-representation/
  21. Model Training in Production
      How do we combine distributed training on Ray with Apache Spark?
      Apache Spark logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
  22. Distributed Training in Spark: Challenges
      1. DataFrames / RDDs not well-suited to distributed training
      2. Spark applications typically run on CPU, distributed training on GPU
      Spark:
      • Jobs typically easy to fan out with cheap CPU machines
      • Transformations do not benefit as much from GPU acceleration
      Distributed training:
      • Compute bound, not data bound
      • Computations easy to represent with linear algebra
  23. Spark ML Pipeline Abstractions
      Unified ML Model Representation Blog (2019): https://eng.uber.com/michelangelo-machine-learning-model-representation/
  24. Unit Contract between Training and Serving
      Serving is essentially a series of chained calls to each Transformer's transform or score interface (Spark Summit 2019):
      PREDICTIONS = TXn.score(... TX2.score(TX1.score(DATA)) ...)
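      In Python terms (a hypothetical minimal interface, not Michelangelo's actual classes), the contract amounts to folding the data through each fitted Transformer in pipeline order:

      class Scale:                        # stand-ins for TX1 .. TXn
          def score(self, rows): return [v * 2.0 for v in rows]

      class Offset:
          def score(self, rows): return [v + 1.0 for v in rows]

      def serve(data, transformers):
          # PREDICTIONS = TXn.score(... TX2.score(TX1.score(DATA)) ...)
          for tx in transformers:
              data = tx.score(data)
          return data

      predictions = serve([1.0, 2.0], [Scale(), Offset()])   # -> [3.0, 5.0]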
  25. Data Preparation with Spark

      data_processing_pipeline = Pipeline(stages=[
          StringToFloatEncoder(...),
          StringIndexer(...)])

      # Train/Test split. Fit the data transformation Pipeline to generate the data transformer
      (train_df, test_df) = train_test_split(spark_dataframe, split_ratio=0.1)
      data_pipeline = data_processing_pipeline.fit(train_df)

      # Transform the original train/test data set with the fitted data transformer
      transformed_train_data = data_pipeline.transform(train_df)
      transformed_test_data = data_pipeline.transform(test_df)
  26. Ray Estimator

      from xgboost_ray import train
      from malib.ray.common.estimator import RayEstimator

      # Create the Ray Estimator
      xgboost_estimator = RayEstimator(
          remote_fn=train,
          ray_params=RayParams(num_actors=10),
          config={"eta": 0.3, "max_depth": 10},
          ...)

      # Create Spark Pipeline with Ray Estimator
      pipeline = Pipeline(stages=[..., xgboost_estimator, ...])

      # Fit the entire Pipeline on the train dataset and do batch serving
      trained_pipeline = pipeline.fit(train_df)
      pred_df = trained_pipeline.transform(test_df)
  27. Apache Spark logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
  28. Creating Ray Context

      # Create the Ray Context (Backend)
      job_config = raylib.RayJobConfig(
          zone="dummy-zone",
          timeout=36000,
          num_cpus=8,
          num_gpus=2,
          num_workers=50,
          custom_docker="docker")
      ray_ctx = raylib.RayContext(job_config)
  29. Hyperparameter Optimization with Ray Tune

      # Create the Ray Estimator for hyperparameter search
      def start_tune():
          from xgboost_ray import train
          from ray import tune
          tune.run(train, ...)
          ...

      xgboost_tune_estimator = RayEstimator(
          remote_fn=start_tune,
          ray_params=RayParams(num_actors=10),
          config={"max_trials": 100, "scheduler": "hyperband"},
          ...)
  30. Next Steps: Unified Compute Infrastructure for ML and DL
      Moving towards:
      • Ray Tune + Horovod on Ray for DL auto-scaling training + hyperopt
      • Ray Tune + XGBoost on Ray for ML auto-scaling training + hyperopt
      • Dask on Ray for preprocessing
      • No need to maintain / provision separate infra resources
      • No need to materialize preprocessed data to disk
      • Leverage data locality (colocation and shared memory)
      Elastic Horovod on Ray (2021): https://eng.uber.com/horovod-ray/
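      A sketch of what that unified pipeline can look like with the open-source pieces (paths and column names are placeholders; RayDMatrix support for Dask dataframes depends on the xgboost_ray version):

      import ray
      from ray.util.dask import enable_dask_on_ray
      import dask.dataframe as dd
      from xgboost_ray import RayDMatrix, RayParams, train

      ray.init()
      enable_dask_on_ray()   # route Dask task graphs through the Ray scheduler

      # Preprocess with Dask on Ray; nothing is materialized to disk in between.
      df = dd.read_parquet("s3://bucket/raw_training_data/")   # hypothetical path
      df["feature"] = df["feature"].fillna(0.0)                # example preprocessing step

      # Hand the distributed dataframe straight to XGBoost-Ray for training.
      train_set = RayDMatrix(df, label="target")               # "target" is a placeholder label column
      bst = train({"objective": "binary:logistic"}, train_set,
                  ray_params=RayParams(num_actors=4))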