
Distributed XGBoost on Ray (Kai Fricke & Michael Mui)


Michael Mui and Kai Fricke discuss XGBoost-Ray, a scalable backend for distributed XGBoost training.

Anyscale

July 13, 2021

Transcript

  1. Overview
     Part 1: Design and features of XGBoost-Ray
     • Motivation
     • Architecture
     • Distributed data loading
     • Fault tolerance
     • Hyperparameter optimization
     Part 2: Ray and XGBoost-Ray at Uber
     • Distributed ML and DL challenges at scale
     • Ray and distributed XGBoost on Ray at Uber
     • Next steps
  2. Motivation
     • There are existing solutions for distributed XGBoost, e.g. Spark, Dask, Kubernetes
     • But most existing solutions lack:
       • Dynamic computation graphs
       • Fault tolerance handling
       • GPU support
       • Integration with hyperparameter tuning libraries
  3. XGBoost-Ray
     • Ray actors for stateful training workers
     • Advanced fault tolerance mechanisms
     • Full (multi) GPU support
     • Locality-aware distributed data loading
     • Integration with Ray Tune
  4. Recap: XGBoost
     Gradient boosting:
     • Add a new model at each iteration (trees or linear models)
     • Each step tries to fit the residuals using loss gradients
     • (XGBoost: 2nd order Taylor approximations)
     (Diagram: ensemble built additively as Tree 1 + Tree 2 + Tree 3 + ...)
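     As a refresher (notation taken from the XGBoost paper, not from this talk), the objective minimized at boosting round t uses the first and second loss derivatives g_i and h_i:

     \mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i\, f_t(x_i) + \tfrac{1}{2} h_i\, f_t(x_i)^2 \Big] + \Omega(f_t),
     \qquad g_i = \partial_{\hat{y}_i^{(t-1)}} \ell\big(y_i, \hat{y}_i^{(t-1)}\big),
     \quad h_i = \partial^2_{\hat{y}_i^{(t-1)}} \ell\big(y_i, \hat{y}_i^{(t-1)}\big)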
  5. Architecture: the driver creates @ray.remote actors (Worker 1-4), and each worker loads its own data shard via load_data() (distributed data loading).
  6. Architecture: each worker then runs xgb.train() on its shard, with gradients exchanged via tree-based allreduce (Rabit).
  7. Architecture: during training, the workers report checkpoints and eval results back to the driver.
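     A minimal sketch of this actor layout (illustrative only, not the actual xgboost_ray internals; class and method names here are made up):

     import ray
     import xgboost as xgb

     @ray.remote
     class TrainingWorker:
         """Stateful Ray actor that holds one data shard between calls."""

         def load_data(self, X, y):
             # Each worker keeps its own DMatrix in memory.
             self.dtrain = xgb.DMatrix(X, label=y)

         def train(self, params, num_boost_round=10):
             # In xgboost_ray the workers additionally join a Rabit ring for
             # tree-based allreduce; here each actor just trains on its shard.
             return xgb.train(params, self.dtrain, num_boost_round=num_boost_round)

     # Driver side: create four stateful workers, then call load_data() and
     # train() on them remotely.
     ray.init()
     workers = [TrainingWorker.remote() for _ in range(4)]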
  8. Distributed data loading: partitions A-H of a distributed dataframe (e.g. Modin) are spread across Nodes 1-4; XGBoost-Ray workers 1-4 are co-located with those nodes and load the partitions stored locally.
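     For example, a RayDMatrix can be built directly from an already-distributed dataframe, so each worker reads the partitions that live on its node (a sketch; the path and label column are placeholders):

     import modin.pandas as mpd                            # assumes Modin is installed
     from xgboost_ray import RayDMatrix

     df = mpd.read_parquet("s3://bucket/training_data/")   # hypothetical path
     train_set = RayDMatrix(df, label="target")            # "target" column used as the label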
  9. Fault tolerance
     • In distributed training, some worker nodes are bound to fail eventually
     • Default: simple (cold) restart from last checkpoint
     • Non-elastic training (warm restart): only the failing worker restarts
     • Elastic training: continue training with fewer workers until the failed actor is back
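     These modes are configured through RayParams (a sketch; the parameter names follow the 2021-era xgboost_ray releases, so check the current docs):

     from xgboost_ray import RayParams

     ray_params = RayParams(
         num_actors=10,
         elastic_training=True,    # keep training with fewer workers after a failure
         max_failed_actors=3,      # how many actors may be dead at the same time
         max_actor_restarts=2,     # how often failed actors may be restarted
     )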
  10. Fault tolerance: simple (cold) restart
      (Timeline: when one worker fails, all workers stop, reload data, and training restarts from the last checkpoint)
  11. Fault tolerance: non-elastic training (warm restart)
      (Timeline: when one worker fails, the remaining workers pause; only the failed worker reloads data before training resumes)
  12. Fault tolerance: elastic training
      (Timeline: training continues with the remaining workers while the failed worker reloads its data, so the job finishes earlier)
  13. Fault tolerance: Benchmarks
      (30M rows, 500 features, 2 classes, 100 boosting rounds, 10 workers)

      Condition        | Affected workers | Eval error | Time (s)
      Baseline         | 0                | 0.133326   | 1441.44
      Fewer workers    | 1                | 0.134000   | 1227.45
      Fewer workers    | 2                | 0.133977   | 1249.45
      Fewer workers    | 3                | 0.133333   | 1291.54
      Non-elastic      | 1                | 0.133552   | 2205.95
      Non-elastic      | 2                | 0.133211   | 2226.96
      Non-elastic      | 3                | 0.133552   | 2033.94
      Elastic training | 1                | 0.133763   | 1231.58
      Elastic training | 2                | 0.133771   | 1197.55
      Elastic training | 3                | 0.133704   | 1259.37
  14. Hyperparameter tuning: Ray Tune runs many trials (e.g. Trial 1: eta 0.1, gamma 0.2; Trial n: eta 0.2, gamma 0.0), each with its own set of XGBoost-Ray workers (Worker 1..m). Searchers (e.g. BO, TPE) propose configurations, trials report checkpoints and results, and underperforming trials are stopped early.
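      A sketch of this integration using the open-source APIs (the search space is illustrative; get_tune_resources() reserves each trial's actor resources in recent xgboost_ray versions):

      from sklearn.datasets import load_breast_cancer
      from ray import tune
      from xgboost_ray import RayDMatrix, RayParams, train

      def train_model(config):
          # Each Tune trial trains a distributed XGBoost model with 2 actors.
          train_x, train_y = load_breast_cancer(return_X_y=True)
          train_set = RayDMatrix(train_x, train_y)
          train(config, train_set, ray_params=RayParams(num_actors=2))

      analysis = tune.run(
          train_model,
          config={
              "objective": "binary:logistic",
              "eta": tune.loguniform(1e-4, 3e-1),
              "gamma": tune.uniform(0.0, 0.5),
          },
          resources_per_trial=RayParams(num_actors=2).get_tune_resources(),
          num_samples=10,
      )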
  15. API example

      # Single-node XGBoost
      from sklearn.datasets import load_breast_cancer
      from xgboost import DMatrix, train

      train_x, train_y = load_breast_cancer(return_X_y=True)

      train_set = DMatrix(train_x, train_y)
      bst = train(
          {"objective": "binary:logistic"},
          train_set
      )
      bst.save_model("trained.xgb")

      # Distributed with XGBoost-Ray
      from xgboost_ray import RayDMatrix, RayParams, train

      train_set = RayDMatrix(train_x, train_y)
      bst = train(
          {"objective": "binary:logistic"},
          train_set,
          ray_params=RayParams(num_actors=2)
      )
      bst.save_model("trained.xgb")
  16. Overview
      Part 1: Design and features of XGBoost-Ray
      • Motivation
      • Architecture
      • Distributed data loading
      • Fault tolerance
      • Hyperparameter optimization
      Part 2: Ray and XGBoost-Ray at Uber
      • Distributed ML and DL challenges at scale
      • Ray and distributed XGBoost on Ray at Uber
      • Next steps
  17. Distributed ML and DL Challenges at Scale
      • Distributed training
        • Fault tolerance and auto-scaling
        • Resource-aware scheduling and job provisioning
      • Distributed hyperparameter optimization
        • Budget-constrained patterns (e.g. Successive Halving, Population-Based)
        • Dynamic resource allocation
      • Unified compute and API
        • Heterogeneous compute across workflow stages
        • Moving data across stages
      Elastic Horovod on Ray (2021): https://eng.uber.com/horovod-ray/
  18. Ray: A General-Purpose Distributed Framework
      1. Open platform
         a. Open source
         b. Run anywhere
      2. Set of general distributed compute primitives
         a. Fault tolerance / auto-scaling
         b. Dynamic task graph and actors (stateful API)
         c. Shared memory
      3. Rich ML ecosystem
         a. Distributed hyperparameter search: Ray Tune
  19. How does Ray help?
      • Unified backend for distributed training: XGBoost or Horovod + Ray
      • Unified backend for distributed hyperparameter search: XGBoost or Horovod + Ray Tune
      • General-purpose compute backend for ML workflows: XGBoost or Horovod + Ray Tune + Dask on Ray
      Elastic Horovod on Ray (2021): https://eng.uber.com/horovod-ray/
  20. End-to-end ML at Uber
      Unified ML Model Representation Blog (2019): https://eng.uber.com/michelangelo-machine-learning-model-representation/
  21. Model Training in Production
      How do we combine distributed training on Ray with Apache Spark?
      Apache Spark logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
  22. Distributed Training in Spark: Challenges
      1. DataFrames / RDDs not well-suited to distributed training
      2. Spark applications typically run on CPU, distributed training on GPU
      Spark:
      • Jobs typically easy to fan out with cheap CPU machines
      • Transformations do not benefit as much from GPU acceleration
      Distributed training:
      • Compute bound, not data bound
      • Computations easy to represent with linear algebra
  23. Spark ML Pipeline Abstractions
      Unified ML Model Representation Blog (2019): https://eng.uber.com/michelangelo-machine-learning-model-representation/
  24. Unit Contract between Training and Serving
      Serving is essentially a series of chained calls to each Transformer's transform or score interface (Spark Summit 2019):
      PREDICTIONS = TXn.score(... TX2.score(TX1.score(DATA)) ...)
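      In Python terms (a hypothetical minimal interface, not Michelangelo's actual classes), the contract amounts to folding the data through each fitted Transformer in pipeline order:

      class Scale:                        # stand-ins for TX1 .. TXn
          def score(self, rows): return [v * 2.0 for v in rows]

      class Offset:
          def score(self, rows): return [v + 1.0 for v in rows]

      def serve(data, transformers):
          # PREDICTIONS = TXn.score(... TX2.score(TX1.score(DATA)) ...)
          for tx in transformers:
              data = tx.score(data)
          return data

      predictions = serve([1.0, 2.0], [Scale(), Offset()])   # -> [3.0, 5.0]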
  25. Data Preparation with Spark

      data_processing_pipeline = Pipeline(stages=[
          StringToFloatEncoder(...),
          StringIndexer(...)])

      # Train/Test split. Fit the data transformation Pipeline to generate the data transformer
      (train_df, test_df) = train_test_split(spark_dataframe, split_ratio=0.1)
      data_pipeline = data_processing_pipeline.fit(train_df)

      # Transform the original train/test data set with the fitted data transformer
      transformed_train_data = data_pipeline.transform(train_df)
      transformed_test_data = data_pipeline.transform(test_df)
  26. Ray Estimator

      from xgboost_ray import train
      from malib.ray.common.estimator import RayEstimator

      # Create the Ray Estimator
      xgboost_estimator = RayEstimator(
          remote_fn=train,
          ray_params=RayParams(num_actors=10),
          config={"eta": 0.3, "max_depth": 10},
          ...)

      # Create Spark Pipeline with Ray Estimator
      pipeline = Pipeline(stages=[..., xgboost_estimator, ...])

      # Fit the entire Pipeline on the train dataset and do batch serving
      trained_pipeline = pipeline.fit(train_df)
      pred_df = trained_pipeline.transform(test_df)
  27. Apache Spark logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
  28. Creating Ray Context

      # Create the Ray Context (Backend)
      job_config = raylib.RayJobConfig(
          zone="dummy-zone",
          timeout=36000,
          num_cpus=8,
          num_gpus=2,
          num_workers=50,
          custom_docker="docker")
      ray_ctx = raylib.RayContext(job_config)
  29. Hyperparameter Optimization with Ray Tune

      # Create the Ray Estimator for hyperparameter search
      def start_tune():
          from xgboost_ray import train
          from ray import tune
          tune.run(train, ...)
          ...

      xgboost_tune_estimator = RayEstimator(
          remote_fn=start_tune,
          ray_params=RayParams(num_actors=10),
          config={"max_trials": 100, "scheduler": "hyperband"},
          ...)
  30. Next Steps: Unified Compute Infrastructure for ML and DL
      Moving towards:
      • Ray Tune + Horovod on Ray for DL auto-scaling training + hyperopt
      • Ray Tune + XGBoost on Ray for ML auto-scaling training + hyperopt
      • Dask on Ray for preprocessing
      • No need to maintain / provision separate infra resources
      • No need to materialize preprocessed data to disk
      • Leverage data locality (colocation and shared memory)
      Elastic Horovod on Ray (2021): https://eng.uber.com/horovod-ray/
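      A sketch of what that unified pipeline can look like with the open-source pieces (paths and column names are placeholders; RayDMatrix support for Dask dataframes depends on the xgboost_ray version):

      import ray
      from ray.util.dask import enable_dask_on_ray
      import dask.dataframe as dd
      from xgboost_ray import RayDMatrix, RayParams, train

      ray.init()
      enable_dask_on_ray()   # route Dask task graphs through the Ray scheduler

      # Preprocess with Dask on Ray; nothing is materialized to disk in between.
      df = dd.read_parquet("s3://bucket/raw_training_data/")   # hypothetical path
      df["feature"] = df["feature"].fillna(0.0)                # example preprocessing step

      # Hand the distributed dataframe straight to XGBoost-Ray for training.
      train_set = RayDMatrix(df, label="target")               # "target" is a placeholder label column
      bst = train({"objective": "binary:logistic"}, train_set,
                  ray_params=RayParams(num_actors=4))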