
Distributed XGBoost on Ray (Kai Fricke & Michael Mui)


Michael Mui and Kai Fricke discuss XGBoost-Ray, a scalable backend for distributed XGBoost training.

Anyscale

July 13, 2021



Transcript

  1. Distributed XGBoost
    on Ray
    Kai Fricke, Anyscale
    Michael Mui, Uber


  2. Overview
    Part 1: Design and features of XGBoost-Ray
    ● Motivation
    ● Architecture
    ● Distributed Data loading
    ● Fault-Tolerance
    ● Hyperparameter Optimization
    Part 2: Ray and XGBoost-Ray at Uber
    ● Distributed ML and DL Challenges at Scale
    ● Ray and Distributed XGBoost on Ray at Uber
    ● Next Steps


  3. Motivation
    • There are existing solutions for distributed XGBoost
    • E.g. Spark, Dask, Kubernetes
    • But most existing solutions lack:
    • Dynamic computation graphs
    • Fault tolerance handling
    • GPU support
    • Integration with hyperparameter tuning libraries


  4. XGBoost-Ray
    • Ray actors for stateful training workers
    • Advanced fault tolerance mechanisms
    • Full (multi) GPU support
    • Locality-aware distributed data loading
    • Integration with Ray Tune


  5. Recap: XGBoost
    Gradient boosting:
    • Add a new model at each iteration
    • Trees or linear models
    • Each step tries to fit the residuals using loss gradients
    • (XGBoost: 2nd order Taylor approximations; see the formulation below)
    [Diagram: Tree 1 + Tree 2 + Tree 3 + ... (additive ensemble of trees)]
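    For reference (not on the slide), the standard formulation from the XGBoost
    paper: the ensemble prediction after t rounds, and the second-order Taylor
    approximation of the objective that the new tree f_t is fit against:

    $$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i), \qquad
      \mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$$

    $$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big] + \Omega(f_t),
      \quad g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}),
      \quad h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$$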


  6. Recap: Distributed XGBoost


  7. Architecture
    [Diagram: a Driver creates four @ray.remote actors (Worker 1-4); each worker
    calls load_data() as part of distributed data loading.]


  8. Architecture
    [Diagram: each of the four workers calls load_data() and xgb.train(); during
    training the workers synchronize via tree-based allreduce (Rabit).]


  9. Architecture
    [Diagram: as on the previous slide, with the workers additionally reporting
    checkpoints and eval results back to the Driver.]
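    A minimal sketch of the actor pattern behind this architecture, using only the
    ray package; the class and method names are illustrative, not XGBoost-Ray's
    internal API:

    import ray

    ray.init()

    @ray.remote
    class TrainingWorker:
        """Stateful worker: keeps its data shard in memory across boosting rounds."""

        def __init__(self, rank):
            self.rank = rank
            self.shard = None

        def load_data(self, shard):
            # In XGBoost-Ray this would load the worker's partition of the dataset.
            self.shard = shard

        def train(self, params):
            # Stand-in for the per-worker xgb.train() call; in the real system the
            # workers also join a Rabit tree-based allreduce ring here.
            return {"rank": self.rank, "rows": len(self.shard)}

    workers = [TrainingWorker.remote(rank) for rank in range(4)]
    shards = [list(range(i, 100, 4)) for i in range(4)]
    ray.get([w.load_data.remote(s) for w, s in zip(workers, shards)])
    print(ray.get([w.train.remote({"objective": "binary:logistic"}) for w in workers]))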


  10. Distributed data loading
    [Diagram: partitions A-H of a distributed dataframe (e.g. Modin) are spread
    across Nodes 1-4; the XGBoost-Ray workers running on those same nodes load
    the co-located partitions, so data is read locally rather than shuffled
    across nodes.]
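    A brief sketch of feeding sharded data to XGBoost-Ray. The parquet paths and
    the "target" label column are made-up placeholders; per the project docs,
    RayDMatrix also accepts in-memory arrays and distributed dataframes such as
    Modin:

    from xgboost_ray import RayDMatrix, RayParams, train

    # Hypothetical parquet shards; workers prefer partitions that are already
    # on their node (locality-aware loading).
    paths = ["/data/part-000.parquet", "/data/part-001.parquet",
             "/data/part-002.parquet", "/data/part-003.parquet"]

    train_set = RayDMatrix(paths, label="target")  # "target" is the label column

    bst = train(
        {"objective": "binary:logistic"},
        train_set,
        ray_params=RayParams(num_actors=4),
    )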


  11. Fault tolerance
    • In distributed training, some worker nodes are bound to fail eventually
    • Default: simple (cold) restart from the last checkpoint
    • Non-elastic training (warm restart): only the failing worker restarts
    • Elastic training: continue training with fewer workers until the failed
      actor is back
    (A configuration sketch for these modes follows below.)
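    A hedged sketch of selecting these modes via RayParams, based on the options
    documented for XGBoost-Ray around this time (elastic_training,
    max_failed_actors, max_actor_restarts); the dataset and numbers are
    placeholders:

    from sklearn.datasets import load_breast_cancer
    from xgboost_ray import RayDMatrix, RayParams, train

    train_x, train_y = load_breast_cancer(return_X_y=True)

    ray_params = RayParams(
        num_actors=10,
        elastic_training=True,   # continue with fewer workers while an actor is down
        max_failed_actors=3,     # tolerate up to 3 failed actors at a time
        max_actor_restarts=2,    # restart a failed actor at most twice
    )

    bst = train(
        {"objective": "binary:logistic"},
        RayDMatrix(train_x, train_y),
        ray_params=ray_params,
    )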


  12. Fault tolerance: Simple (cold) restart
    [Timeline diagram; worker states: Training / Paused / Failed / Stopped /
    Loading data. When one worker fails, all workers stop, reload their data,
    and resume training from the last checkpoint.]


  13. Fault tolerance: Non-elastic training (warm restart)
    [Timeline diagram, same states. Only the failed worker restarts and reloads
    its data; the remaining workers pause until it rejoins, then training resumes.]


  14. Fault tolerance: Elastic training
    [Timeline diagram, same states. Training continues with the remaining workers
    while the failed worker reloads its data and rejoins; the run finishes earlier
    than with a full restart.]


  15. Fault tolerance: Benchmarks
    Condition Affected workers Eval error Time (s)
    Baseline 0 0.133326 1441.44
    Fewer workers 1 0.134000 1227.45
    Fewer workers 2 0.133977 1249.45
    Fewer workers 3 0.133333 1291.54
    Non elastic 1 0.133552 2205.95
    Non elastic 2 0.133211 2226.96
    Non elastic 3 0.133552 2033.94
    Elastic training 1 0.133763 1231.58
    Elastic training 2 0.133771 1197.55
    Elastic training 3 0.133704 1259.37
    30M rows, 500 features, 2 classes, 100 boosting rounds, 10 workers


  16. Hyperparameter tuning
    [Diagram: Ray Tune runs Trials 1..n, each with its own hyperparameters
    (e.g. eta, gamma) and its own set of XGBoost-Ray Workers 1..m. Searchers
    (e.g. BO, TPE) propose configurations, trials report checkpoints and results
    back to Tune, and early stopping terminates unpromising trials.]
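    A minimal sketch of this integration, assuming the 2021-era tune.run API and
    the RayParams.get_tune_resources() helper from the XGBoost-Ray docs; the
    dataset and search space are illustrative only:

    from sklearn.datasets import load_breast_cancer
    from ray import tune
    from xgboost_ray import RayDMatrix, RayParams, train

    ray_params = RayParams(num_actors=2)

    def train_model(config):
        train_x, train_y = load_breast_cancer(return_X_y=True)
        train_set = RayDMatrix(train_x, train_y)
        # Each Tune trial launches its own set of XGBoost-Ray actors; evaluation
        # results are reported back to Tune via the built-in integration.
        train(
            config,
            train_set,
            evals=[(train_set, "train")],
            ray_params=ray_params,
        )

    analysis = tune.run(
        train_model,
        config={
            "objective": "binary:logistic",
            "eta": tune.loguniform(1e-3, 3e-1),
            "max_depth": tune.randint(2, 10),
        },
        num_samples=4,
        # Reserve enough resources for each trial's nested XGBoost-Ray actors.
        resources_per_trial=ray_params.get_tune_resources(),
    )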


  17. API example
    # Plain XGBoost:
    from sklearn.datasets import load_breast_cancer
    from xgboost import DMatrix, train

    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = DMatrix(train_x, train_y)

    bst = train(
        {"objective": "binary:logistic"},
        train_set
    )
    bst.save_model("trained.xgb")

    # XGBoost-Ray: swap the imports, use RayDMatrix, and pass RayParams:
    from xgboost_ray import RayDMatrix, RayParams, train

    train_set = RayDMatrix(train_x, train_y)

    bst = train(
        {"objective": "binary:logistic"},
        train_set,
        ray_params=RayParams(num_actors=2)
    )
    bst.save_model("trained.xgb")


  18. Ray and Distributed XGBoost
    on Ray at Uber
    Michael Mui


  19. Overview
    Part 1: Design and features of XGBoost-Ray
    ● Motivation
    ● Architecture
    ● Distributed Data loading
    ● Fault-Tolerance
    ● Hyperparameter Optimization
    Part 2: Ray and XGBoost-Ray at Uber
    ● Distributed ML and DL Challenges at Scale
    ● Ray and Distributed XGBoost on Ray at Uber
    ● Next Steps


  20. Distributed ML and DL Challenges at Scale
    • Distributed Training
    • Fault-Tolerance and Auto-Scaling
    • Resource-aware Scheduling and Job Provisioning
    • Distributed Hyperparameter Optimization
    • Budget-constrained Patterns (e.g. Successive-Halving,
    Population-based)
    • Dynamic Resource Allocation
    • Unified Compute and API
    • Heterogeneous compute across workflow stages
    • Moving data across stages
    Elastic Horovod on Ray (2021):
    https://eng.uber.com/horovod-ray/


  21. Ray: A General-Purpose Distributed Framework
    1. Open Platform
    a. Open source
    b. Run anywhere
    2. Set of general distributed compute primitives
    a. Fault-Tolerance / Auto-Scaling
    b. Dynamic task graph and actors (stateful API)
    c. Shared memory
    3. Rich ML ecosystem
    a. Distributed Hyperparameter Search: Ray Tune


  22. How does Ray help?
    • Unified backend for Distributed Training
    • XGBoost or Horovod + Ray
    • Unified backend for Distributed Hyperparameter Search
    • XGBoost or Horovod + Ray Tune
    • General-purpose compute backend for ML Workflows
    • XGBoost or Horovod + Ray Tune + Dask on Ray
    Elastic Horovod on Ray (2021):
    https://eng.uber.com/horovod-ray/


  23. End-to-end ML at Uber
    Unified ML Model Representation Blog (2019): https://eng.uber.com/michelangelo-machine-learning-model-representation/


  24. Model Training in Production
    [Diagram: Spark + Ray = ?]
    How do we combine distributed training on Ray with Apache Spark?
    Apache Spark logos are either registered trademarks or
    trademarks of the Apache Software Foundation in the United
    States and/or other countries. No endorsement by The Apache
    Software Foundation is implied by the use of these marks.


  25. Distributed Training in Spark: Challenges
    1. DataFrames / RDDs not well-suited to distributed training
    2. Spark applications typically run on CPU, distributed training on GPU
    Spark
    • Jobs typically easy to fan out with cheap CPU machines
    • Transformations do not benefit as much from GPU acceleration
    Distributed Training
    • Compute bound, not data bound
    • Computations easy to represent with linear algebra


  26. Spark ML Pipeline Abstractions
    Unified ML Model Representation Blog (2019):
    https://eng.uber.com/michelangelo-machine-learning-model-representation/


  27. Unit Contract between Training and Serving
    Serving is essentially a series of chained calls to each Transformer's
    transform or score interface (Spark Summit 2019):
    PREDICTIONS = TXn.score(... TX2.score(TX1.score(DATA)) ...)
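    A toy illustration of that contract (hypothetical minimal score() interface,
    not Michelangelo's actual Transformer classes):

    from functools import reduce

    class AddOne:
        def score(self, data):
            return [x + 1 for x in data]

    class Double:
        def score(self, data):
            return [x * 2 for x in data]

    def serve(transformers, data):
        # Chain TX1 ... TXn left to right, feeding each output into the next.
        return reduce(lambda out, tx: tx.score(out), transformers, data)

    predictions = serve([AddOne(), Double()], [1, 2, 3])  # -> [4, 6, 8]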


  28. Data Preparation with Spark
    data_processing_pipeline = Pipeline(stages=[
        StringToFloatEncoder(...),
        StringIndexer(...)])

    # Train/Test Split. Fit the data transformation Pipeline to generate the
    # data transformer
    (train_df, test_df) = train_test_split(spark_dataframe, split_ratio=0.1)
    data_pipeline = data_processing_pipeline.fit(train_df)

    # Transform the original train/test data set with the fitted data transformer
    transformed_train_data = data_pipeline.transform(train_df)
    transformed_test_data = data_pipeline.transform(test_df)


  29. Ray Estimator
    from xgboost_ray import train
    from malib.ray.common.estimator import RayEstimator

    # Create the Ray Estimator
    xgboost_estimator = RayEstimator(remote_fn=train,
                                     ray_params=RayParams(num_actors=10),
                                     config={"eta": 0.3, "max_depth": 10},
                                     ...)

    # Create Spark Pipeline with Ray Estimator
    pipeline = Pipeline(stages=[..., xgboost_estimator, ...])

    # Fit the entire Pipeline on Train dataset and do batch serving
    trained_pipeline = pipeline.fit(train_df)
    pred_df = trained_pipeline.transform(test_df)


  30. Apache Spark logos are either registered trademarks or
    trademarks of the Apache Software Foundation in the United
    States and/or other countries. No endorsement by The Apache
    Software Foundation is implied by the use of these marks.


  31. Creating Ray Context
    # Create the Ray Context (Backend)
    job_config = raylib.RayJobConfig(zone="dummy-zone",
                                     timeout=36000,
                                     num_cpus=8, num_gpus=2,
                                     num_workers=50,
                                     custom_docker="docker")
    ray_ctx = raylib.RayContext(job_config)


  32. Hyperparameter Optimization with Ray Tune
    # Create the Ray Estimator for Hyperparameter Search
    def start_tune():
        from xgboost_ray import train
        from ray import tune
        tune.run(train, ...)
        ...

    xgboost_tune_estimator = RayEstimator(remote_fn=start_tune,
                                          ray_params=RayParams(num_actors=10),
                                          config={"max_trials": 100,
                                                  "scheduler": "hyperband"},
                                          ...)


  33. Next Steps: Unified Compute Infrastructure for ML
    and DL
    • Moving towards:
    • Ray Tune + Horovod on Ray for DL auto-scaling training + hyperopt
    • Ray Tune + XGBoost on Ray for ML auto-scaling training + hyperopt
    • Dask on Ray for preprocessing
    • No need to maintain / provision separate infra resources
    • No need to materialize preprocessed data to disk
    • Leverage data locality
    • Colocation and shared memory
    Elastic Horovod on Ray (2021):
    https://eng.uber.com/horovod-ray/


  34. XGBoost on Ray:
    https://docs.ray.io/en/master/xgboost-ray.html
    Uber Engineering Blogs:
    https://eng.uber.com/elastic-xgboost-ray/
    https://eng.uber.com/horovod-ray/
    Thank you!
