Overview

Part 1: Design and features of XGBoost-Ray
● Motivation
● Architecture
● Distributed Data Loading
● Fault Tolerance
● Hyperparameter Optimization

Part 2: Ray and XGBoost-Ray at Uber
● Distributed ML and DL Challenges at Scale
● Ray and Distributed XGBoost on Ray at Uber
● Next Steps
Motivation
• There are existing solutions for distributed XGBoost
  • E.g. Spark, Dask, Kubernetes
• But most existing solutions are lacking:
  • Dynamic computation graphs
  • Fault tolerance handling
  • GPU support
  • Integration with hyperparameter tuning libraries
XGBoost-Ray
• Ray actors for stateful training workers
• Advanced fault tolerance mechanisms
• Full (multi-)GPU support
• Locality-aware distributed data loading
• Integration with Ray Tune
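For concreteness, a minimal training sketch using the public xgboost_ray API (the dataset, parameter values, and output path are placeholders, not taken from the talk):

from sklearn.datasets import load_breast_cancer
from xgboost_ray import RayDMatrix, RayParams, train

# Wrap the data in a RayDMatrix, the distributed counterpart of xgboost.DMatrix
data, labels = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(data, labels)

evals_result = {}
bst = train(
    {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    train_set,
    evals_result=evals_result,
    evals=[(train_set, "train")],
    num_boost_round=10,
    # Two training actors with one CPU each; GPUs can be requested via gpus_per_actor
    ray_params=RayParams(num_actors=2, cpus_per_actor=1))

bst.save_model("model.xgb")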
Recap: XGBoost
Gradient boosting:
• Add a new model at each iteration (trees or linear models)
• Each step tries to fit the residuals using loss gradients
• (XGBoost: 2nd-order Taylor approximations)
[Diagram: Tree 1 + Tree 2 + Tree 3 + ... added sequentially]
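For reference, the objective XGBoost minimizes at iteration t, with the loss l replaced by its second-order Taylor expansion around the previous prediction (standard formulation from the XGBoost paper; the notation is added here, not from the slides):

\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i\, f_t(x_i) + \tfrac{1}{2} h_i\, f_t^2(x_i) \Big] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}_i^{(t-1)}} l\big(y_i, \hat{y}_i^{(t-1)}\big),
\qquad
h_i = \partial^2_{\hat{y}_i^{(t-1)}} l\big(y_i, \hat{y}_i^{(t-1)}\big)

where f_t is the new tree added at iteration t and \Omega(f_t) is its regularization term.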
Distributed data loading
[Diagram: a distributed dataframe (e.g. Modin) holds Partitions A–H across Nodes 1–4; XGBoost-Ray Workers 1–4 load the partitions that are local to their node]
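A sketch of what locality-aware loading looks like with RayDMatrix, assuming parquet shards on a shared filesystem (the paths and column name are hypothetical):

from xgboost_ray import RayDMatrix

# Each training actor reads only the shards assigned to it, preferring shards
# stored on its own node instead of routing all data through the driver
dtrain = RayDMatrix(
    ["/data/part-00.parquet", "/data/part-01.parquet"],  # hypothetical shard paths
    label="target",      # label column inside the parquet files
    distributed=True)    # load on the actors, not centrally

# A Modin dataframe can also be passed directly; its partitions are assigned to
# actors on the nodes that already hold them:
# import modin.pandas as mpd
# dtrain = RayDMatrix(mpd.read_parquet("/data/"), label="target")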
Fault tolerance
• In distributed training, some worker nodes are bound to fail eventually
• Default: simple (cold) restart from the last checkpoint
• Non-elastic training (warm restart): only the failing worker restarts
• Elastic training: continue training with fewer workers until the failed actor is back
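These modes map onto fault-tolerance settings in RayParams; a minimal sketch with illustrative values:

from xgboost_ray import RayParams

ray_params = RayParams(
    num_actors=4,
    max_actor_restarts=2,    # restart failed actors and resume from the last checkpoint
    elastic_training=True,   # keep training with the remaining workers while an actor is down
    max_failed_actors=1,     # tolerate at most this many missing actors at a time
    checkpoint_frequency=5)  # boosting rounds between checkpoints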
Overview

Part 1: Design and features of XGBoost-Ray
● Motivation
● Architecture
● Distributed Data Loading
● Fault Tolerance
● Hyperparameter Optimization

Part 2: Ray and XGBoost-Ray at Uber
● Distributed ML and DL Challenges at Scale
● Ray and Distributed XGBoost on Ray at Uber
● Next Steps
Distributed ML and DL Challenges at Scale
• Distributed Training
  • Fault-Tolerance and Auto-Scaling
  • Resource-aware Scheduling and Job Provisioning
• Distributed Hyperparameter Optimization
  • Budget-constrained patterns (e.g. Successive Halving, Population-Based)
  • Dynamic Resource Allocation
• Unified Compute and API
  • Heterogeneous compute across workflow stages
  • Moving data across stages

Elastic Horovod on Ray (2021): https://eng.uber.com/horovod-ray/
Ray: A General-Purpose Distributed Framework
1. Open Platform
   a. Open source
   b. Run anywhere
2. Set of general distributed compute primitives (see the sketch below)
   a. Fault tolerance / auto-scaling
   b. Dynamic task graph and actors (stateful API)
   c. Shared memory
3. Rich ML ecosystem
   a. Distributed hyperparameter search: Ray Tune
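To make the primitives in item 2 concrete, a minimal Ray tasks-and-actors sketch (illustrative only, not from the slides):

import ray

ray.init()

@ray.remote
def square(x):
    # Stateless task, scheduled dynamically on any node
    return x * x

@ray.remote
class Counter:
    # Stateful actor: lives on one worker and keeps state across calls
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

futures = [square.remote(i) for i in range(4)]  # results live in the shared-memory object store
print(ray.get(futures))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1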
How does Ray help?
• Unified backend for Distributed Training
  • XGBoost or Horovod + Ray
• Unified backend for Distributed Hyperparameter Search
  • XGBoost or Horovod + Ray Tune
• General-purpose compute backend for ML Workflows
  • XGBoost or Horovod + Ray Tune + Dask on Ray

Elastic Horovod on Ray (2021): https://eng.uber.com/horovod-ray/
Model Training in Production
How do we combine distributed training on Ray with Apache Spark?
[Diagram: Ray + Apache Spark = ?]

Apache Spark logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Distributed Training in Spark: Challenges
1. DataFrames / RDDs not well-suited to distributed training
2. Spark applications typically run on CPU, distributed training on GPU

Spark
• Jobs typically easy to fan out with cheap CPU machines
• Transformations do not benefit as much from GPU acceleration

Distributed Training
• Compute bound, not data bound
• Computations easy to represent with linear algebra
Unit Contract between Training and Serving
Serving is essentially a series of chained calls to each Transformer's transform or score interface (Spark Summit 2019):

PREDICTIONS = TXn.score(... TX2.score(TX1.score(DATA)) ...)
Data Preparation with Spark

data_processing_pipeline = Pipeline(stages=[
    StringToFloatEncoder(...),
    StringIndexer(...)])

# Train/test split; fit the data transformation Pipeline to generate the data transformer
(train_df, test_df) = train_test_split(spark_dataframe, split_ratio=0.1)
data_pipeline = data_processing_pipeline.fit(train_df)

# Transform the original train/test datasets with the fitted data transformer
transformed_train_data = data_pipeline.transform(train_df)
transformed_test_data = data_pipeline.transform(test_df)
Ray Estimator

from xgboost_ray import RayParams, train
from malib.ray.common.estimator import RayEstimator

# Create the Ray Estimator
xgboost_estimator = RayEstimator(
    remote_fn=train,
    ray_params=RayParams(num_actors=10),
    config={"eta": 0.3, "max_depth": 10},
    ...)

# Create a Spark Pipeline with the Ray Estimator
pipeline = Pipeline(stages=[..., xgboost_estimator, ...])

# Fit the entire Pipeline on the train dataset and do batch serving
trained_pipeline = pipeline.fit(train_df)
pred_df = trained_pipeline.transform(test_df)
Hyperparameter Optimization with Ray Tune

# Create the Ray Estimator for hyperparameter search
def start_tune():
    from xgboost_ray import train
    from ray import tune
    tune.run(train, ...)
    ...

xgboost_tune_estimator = RayEstimator(
    remote_fn=start_tune,
    ray_params=RayParams(num_actors=10),
    config={"max_trials": 100, "scheduler": "hyperband"},
    ...)
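Under the hood this follows the standard xgboost_ray + Ray Tune pattern; a minimal sketch of that pattern (not Uber's internal RayEstimator; the dataset and search space are illustrative):

from sklearn.datasets import load_breast_cancer
from ray import tune
from xgboost_ray import RayDMatrix, RayParams, train

ray_params = RayParams(num_actors=2, cpus_per_actor=1)

def train_model(config):
    # Each Tune trial launches its own distributed xgboost_ray training job;
    # evaluation results are reported back to Tune
    data, labels = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(data, labels)
    train(
        config,
        train_set,
        evals=[(train_set, "train")],
        evals_result={},
        num_boost_round=50,
        ray_params=ray_params)

analysis = tune.run(
    train_model,
    config={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "eta": tune.loguniform(1e-4, 1e-1),
        "max_depth": tune.randint(2, 10)},
    metric="train-error",
    mode="min",
    num_samples=10,
    # Reserve resources for the trial driver plus its training actors
    resources_per_trial=ray_params.get_tune_resources())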
Next Steps: Unified Compute Infrastructure for ML and DL
• Moving towards:
  • Ray Tune + Horovod on Ray for DL auto-scaling training + hyperopt
  • Ray Tune + XGBoost on Ray for ML auto-scaling training + hyperopt
  • Dask on Ray for preprocessing (see the sketch below)
• No need to maintain / provision separate infra resources
• No need to materialize preprocessed data to disk
• Leverage data locality
  • Colocation and shared memory

Elastic Horovod on Ray (2021): https://eng.uber.com/horovod-ray/
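For the preprocessing piece, a sketch of enabling Dask on Ray so preprocessing and training share one cluster (the input path and transformation are placeholders):

import dask.dataframe as dd
import ray
from ray.util.dask import enable_dask_on_ray

ray.init()
enable_dask_on_ray()  # route Dask's task scheduling through Ray

df = dd.read_parquet("/data/raw/")          # hypothetical input path
df = df[df["fare_amount"] > 0].fillna(0)    # illustrative preprocessing
# The resulting Dask dataframe can be handed to RayDMatrix without writing to
# disk, keeping intermediate data in Ray's shared-memory object store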