Fast and efficient hyperparameter tuning with Ray Tune

Hyperparameter tuning, or hyperparameter optimization, finds the best-performing machine learning (ML) model by exploring and optimizing the model's hyperparameters (e.g., learning rate, tree depth). It is a compute-intensive problem that lends itself well to distributed execution.

Ray Tune is a Python library, built on Ray, that allows you to easily run distributed hyperparameter tuning at scale. Ray Tune is framework-agnostic and supports all the popular training frameworks including PyTorch, TensorFlow, XGBoost, LightGBM, and Keras.

Anyscale

October 21, 2021

Transcript

  1. Ray Tune
    Fast and efficient hyperparameter tuning with Ray Tune
    Will Drevo
    Amog Kamsetty
    Xiaowei Jiang

  2. Agenda
    1. What brings you here
    2. Challenges in hyperparameter tuning
    3. Ray Tune’s approach
    4. Demo

  3. About us!
    Hi! I’m Will Drevo.
    • PM at Anyscale for the open-source ML team (Ray Tune, Train, RLlib)
    • Previously:
      ○ an ML engineer at Coinbase
      ○ founded a data labeling company for ML teams
      ○ founded a pharma SaaS for accelerating clinical trials
    • BS, MEng CS @ MIT in ML & distributed systems
    • I like to DJ, make electronic music, travel, and eat Ethiopian food
    Joining us is Amog Kamsetty, a software engineer at Anyscale and a Ray Tune & Train lead developer

  4. (1) What brings you here?
    What is your stack?

  5. (2) Why we tune

  6. The future of ML: models with lots of parameters
    Note the log axis!

  7. Hyperparameter tuning
    “choosing a set of optimal hyperparameters for a learning algorithm”
    How many layers? What kinds of layers? Learning rate schedule?
    Every number here is a hyperparameter!
    Example: what network structure is best for your binary classification problem?

  8. Why we tune
    Performance: achieve performance you might otherwise not find
    ● Minimize fraud loss
    ● Reduce unsold goods (forecasting)
    ● Lower error rate in object detection
    ● ... etc
    These can also “cost” you!
    Cost ($): don’t go broke with compute ($$) or developer hours to train!

  9. Pure parallelization isn’t enough

  10. Why we tune
    Cost
    Performance per unit
    Performance
    Time (Money)

  11. (2) Challenges in hyperparameter tuning

  12. Challenges in tuning
    ● Scaling memory / compute
      ○ Hold more than one copy of the data in memory, likely many more
    ● Algorithmic (cost) efficiency
      ○ Cleverly & quickly search the parameter space
    ● Ease of use
      ○ Quick to get started
      ○ Local → cluster
      ○ Extensible

  13. Scaling memory
    ● Parallelism: how do you distribute computation?
    ● Does your distributed training framework work with your tuning framework? Hard to get both.

  14. Algorithmic cost efficiency
    Two ways to accomplish this:
    1. Sampling (searching)
    2. Scheduling (pruning)
    Sample better: choose the next parameter set to evaluate more intelligently
    Schedule better: allocate more time to promising parameter sets

  15. Algorithmic cost efficiency
    ● Better search algorithms have wildly better performance
    ● Deep learning == more complex, more parameters

  16. Algorithmic cost efficiency: scheduler (SHA)

  17. Algorithmic cost efficiency: sampler + searcher (BOHB)

  18. Many algorithms out there...
    ● [2012] Random search
    ● [2015] Successive Halving Algorithm (SHA)
    ● [2017] HyperBand (HB / AHB)
    ● [2017] Population Based Training (PBT)
    ● [2017] Google Vizier
    ● [2018] Bayesian Optimization and HyperBand (BOHB)
    ● [2018] Async Successive Halving Algorithm (ASHA)
    ● ...
    Today, ASHA is likely the best choice for real-world, highly distributed tuning.
    PBT is also a good choice, specifically for deep learning.

  19. Ease of use
    A very real cost!
    Cost(development) + Cost(tuning compute) < Cost(bad model)
    ● Library interoperability
    ● Maintenance cost over time
    ● Ease of getting started, debugging

  20. (3) Ray Tune’s approach

  21. Why use Ray Tune?
    ● Open source!
    ● Interoperability with Ray ecosystem
    ● Distributed-first, GPU-first
    ● Work with as many Python libraries as possible
    ● Constantly updated with state-of-the-art tuning algorithms
    ● Orchestrate other searchers, samplers: HyperOpt, Optuna, etc

  22. What is Ray
    ● Universal framework for distributed computing
    ● Run anywhere
    ● Library + app ecosystem
    [Diagram: native libraries and 3rd party libraries built on Ray, plus a slot labeled “Your app here!”]

  23. Zooming in: Ray’s native libraries
    ● Universal framework for distributed computing
    ● Run anywhere
    ● Library + app ecosystem

  24. Ray Tune - distributed HPO
    ● Efficient algorithms that enable running trials in parallel
    ● Effective orchestration of distributed trials
    ● Easy to use APIs
    Cutting edge optimization algorithms
    Minimal code changes to work in distributed settings
    Compatible with ML ecosystem

  25. A universal framework for distributed computing
    Notable users of Ray
    Ray Tune users!

  26. Tool | Open source | No cloud lock-in | Distributed | SOTA algorithms | Bring your own framework | Also runs
    Ray Tune | ✅ | ✅ | ✅ | ⭐⭐⭐ | ✅ | HyperOpt, Optuna, SigOpt
    HyperOpt | ✅ | ✅ | ✅ | ⭐⭐ | ✅ |
    Optuna | ✅ | ✅ | ❌ | ⭐⭐ | ✅ |
    SigOpt | ❌ | ❌ | ✅ | ⭐ | ✅ |
    Vertex AI (Vizier) | ❌ | ❌ | ✅ | ⭐ | ❌ |
    Sagemaker | ❌ | ❌ | ✅ | ⭐ | ❌ |
    Azure ML | ❌ | ❌ | ✅ | ⭐ | ❌ |
    Katib | ✅ | ✅ | ✅ | ⭐⭐ | ❌ | HyperOpt, Optuna
    Spark ML | ✅ | ✅ | ✅ | ❌ | ❌ | HyperOpt

  27. Ray Tune benefits
    ● Provides efficient cutting edge HPO algorithms
    ● Distributes and coordinates parallel trials in a fault-tolerant and elastic manner
    ● Saves you time and cost every step of HPO
    Ray Tune benchmark on 2 weeks of production data at Uber: 2X efficiency improvement in terms of GPU/CPU-hours

  28. Case studies: Ray Tune

  29. (4) Code example!

  30. Ray Tune with PyTorch
    ● Simply use your train func (i.e., train_cifar()) and just call tune.run()!
    ● Easily specify hyperparameter ranges to search over
    ● ASHA scheduler with loss specification
    See this example on the PyTorch official docs

  31. Tune-sklearn
    ● A scikit-learn wrapper for Ray Tune
      ○ Drop-in replacement for the scikit-learn model selection module (RandomizedSearchCV and GridSearchCV)
    ● Provides a familiar and simple API for advanced, distributed HPO
    ● https://github.com/ray-project/tune-sklearn

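  32. Drop-in replacement
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV
    # X_train, y_train: your training data (not defined on the slide)
    parameters = {
        'alpha': [1e-4, 1e-1, 1],
        'epsilon': [0.01, 0.1]
    }
    search = GridSearchCV(
        SGDClassifier(),
        parameters,
        n_jobs=-1  # use every core on this machine
    )
    search.fit(X_train, y_train)
    Use all the cores on the single machine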

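  33. Drop-in replacement
    from sklearn.linear_model import SGDClassifier
    from tune_sklearn import TuneSearchCV
    # X_train, y_train: your training data (not defined on the slide)
    parameters = {
        'alpha': [1e-4, 1e-1, 1],
        'epsilon': [0.01, 0.1]
    }
    search = TuneSearchCV(
        SGDClassifier(),
        parameters,
        n_jobs=-1  # use all available resources
    )
    search.fit(X_train, y_train)
    Use all the resources throughout the entire cluster!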

  34. tune-sklearn demo
    ● Driver safety prediction
    ● Jupyter notebook that runs on the head node

  35. Get started with Ray Tune
    ● Website (ray.io/ray-tune): one stop for all Ray Tune resources
    ● Documentation (docs.ray.io): quick start example, reference guides, etc
    ● Forums (discuss.ray.io): learn / share with the broader Ray Tune community, including developers
    ● Ray Slack: connect with the Ray team and community

  36. A short history of techniques
    Report the best model from trying...
    ● Grid search: N evenly spaced trials
    ● [2012] Random search: N randomly sampled trials
    ● [2013-14] Bayesian optimization: sample, then train a meta-model to choose the next sample. Repeat
    ● [2015] Successive Halving Algorithm (SHA): N randomly sampled, then keep the best N/2, then N/4...
    ● [2017] Population Based Training (PBT): genetic algorithm-like optimizer that starts with a set of parameterizations and spawns new parameterizations from well-performing ones
    ● [2017] HyperBand: do SHA, but choose N more intelligently over time
    ● [2018] Bayesian Optimization and HyperBand (BOHB): use HyperBand to choose budgets per parameterization, but suggest new examples via Bayesian methods instead of randomly
    ● [2018] Asynchronous SHA (ASHA): SHA, but don’t wait for stragglers, and replace failed runs with new parameterizations
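The successive halving idea from this slide (sample N configurations, keep the best half, repeat) can be sketched in plain Python. This is a minimal illustration, not Ray Tune's implementation: `toy_loss`, the learning-rate range, and the budget schedule are assumptions invented for the example.

```python
import random


def toy_loss(lr, budget):
    """Hypothetical stand-in for training: loss shrinks as the budget grows
    and is lowest when lr is close to 0.1."""
    return (lr - 0.1) ** 2 + 1.0 / budget


def successive_halving(configs, min_budget=1, eta=2):
    """Keep the best 1/eta of the configs each round and multiply the
    per-config budget by eta, until a single config is left."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Evaluate every surviving config with the current budget.
        scored = sorted(survivors, key=lambda lr: toy_loss(lr, budget))
        # Keep the top 1/eta (at least one) and give them a larger budget.
        survivors = scored[: max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0], budget


rng = random.Random(0)
candidates = [rng.uniform(0.001, 1.0) for _ in range(16)]
best_lr, final_budget = successive_halving(candidates)
print(f"best lr: {best_lr:.4f}, final budget: {final_budget}")
```

Most of the compute goes to the few configurations that survive several rounds, which is the "schedule better" half of the story; ASHA applies the same promotion rule without waiting for a whole round to finish.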

  37. A short history of techniques
    [Figure: grid search versus random search (2012)]
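One common argument for random search over grid search is that, for the same number of trials, it tries more distinct values of each individual hyperparameter. A small sketch of that effect follows; the hyperparameter names and ranges are illustrative assumptions, not taken from the slides.

```python
import itertools
import random

N_TRIALS = 9

# Grid search: a 3 x 3 grid spends 9 trials but only tries
# 3 distinct values of each hyperparameter.
lr_grid = [0.001, 0.01, 0.1]
dropout_grid = [0.0, 0.25, 0.5]
grid_trials = list(itertools.product(lr_grid, dropout_grid))

# Random search: 9 independent draws try a fresh value of each
# hyperparameter in every trial.
rng = random.Random(42)
random_trials = [
    (10 ** rng.uniform(-3, -1), rng.uniform(0.0, 0.5))
    for _ in range(N_TRIALS)
]

distinct_lr_grid = len({lr for lr, _ in grid_trials})
distinct_lr_random = len({lr for lr, _ in random_trials})
print("grid:  ", len(grid_trials), "trials,", distinct_lr_grid, "distinct learning rates")
print("random:", len(random_trials), "trials,", distinct_lr_random, "distinct learning rates")
```

If only one of the two hyperparameters matters much, the random draws explore it far more densely than the grid does for the same budget.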

  38. Example
    How many layers? What kinds of layers? - Every number here is a
    hyperparameter!

  39. Hyperparameters
    Hyperparameters (set before training)
    ● Model type and architecture
    ● Learning and training related parameters
    ● Pipeline configurations
    Model parameters (learnt during training)

  40. Example
    Total: ~15 hyperparameters to tune!
    ● Imputer
      ○ Type: simple or iterative?
      ○ Simple strategy: mean, median, or constant?
    ● Categorical encoder
      ○ Type: one-hot encoding or label encoding?
    ● Under/oversampler
      ○ Type: SMOTE or random undersampling?
      ○ Number of neighbors?
    ● XGBoost
      ○ 6 - 10 hyperparameters to tune
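The imputer, encoder, sampler, and XGBoost choices on this slide can be written down as one search space. The sketch below uses plain Python rather than any tuning library; the option names, the numeric ranges, and the four XGBoost hyperparameters shown are assumptions chosen for illustration.

```python
import random

# Each entry maps a hyperparameter name to a function that draws one value.
search_space = {
    "imputer_type": lambda r: r.choice(["simple", "iterative"]),
    "imputer_strategy": lambda r: r.choice(["mean", "median", "constant"]),
    "encoder_type": lambda r: r.choice(["one_hot", "label"]),
    "sampler_type": lambda r: r.choice(["smote", "random_undersampling"]),
    "sampler_neighbors": lambda r: r.randint(3, 10),
    # A few XGBoost hyperparameters (illustrative ranges only).
    "max_depth": lambda r: r.randint(3, 10),
    "learning_rate": lambda r: 10 ** r.uniform(-3, 0),
    "subsample": lambda r: r.uniform(0.5, 1.0),
    "n_estimators": lambda r: r.randint(50, 500),
}


def sample_config(space, rng):
    """Draw one full pipeline configuration (one trial) from the space."""
    return {name: draw(rng) for name, draw in space.items()}


rng = random.Random(0)
config = sample_config(search_space, rng)
for name, value in config.items():
    print(f"{name}: {value}")
```

Even this reduced space mixes categorical, integer, and log-scaled choices, which is why exhaustive grids become impractical quickly.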

  41. Ray Tune - HPO algorithms
    01 Exhaustive Search
    02 Bayesian Optimization
    03 Advanced Scheduling
    ● Over 15 algorithms natively provided or integrated
    ● Easy to swap out different algorithms with no code change

  42. Bayesian optimization
    ● Uses results from previous combinations (trials) to decide which trial to try next
    ● Inherently sequential
    ● Popular libraries:
      ○ hyperopt
      ○ optuna
    https://www.wikiwand.com/en/Hyperparameter_optimization

  43. Parallel exploration and exploitation
    Advanced scheduling
    • Fan out parallel trials during the initial exploration phase
    • Make decisions based on intermediate cross-trial evaluations
    • Allocate resources to more promising trials
    • Early stopping
    • Population based training

  44. Advanced Scheduling - Early stopping
    ● Fan out parallel trials during the initial exploration phase
    ● Use intermediate results (epochs, trees) to prune underperforming trials, saving time and computing resources
    ● Median stopping, ASHA/Hyperband
    ● Can be combined with Bayesian Optimization (BOHB)
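As an illustration of pruning on intermediate results, here is a simplified version of the median stopping rule in plain Python. It is a sketch under stated assumptions (lower loss is better, a hypothetical `grace_period`, made-up loss histories), not the scheduler that Ray Tune ships.

```python
from statistics import median


def should_stop(trial_losses, other_trials_losses, grace_period=2):
    """Simplified median stopping rule (lower loss is better).

    Stop the trial if its best loss so far is worse than the median of the
    other trials' running-average losses at the same step.
    """
    step = len(trial_losses)
    if step < grace_period:
        return False  # too early to judge this trial
    # Running average of every other trial, truncated to the same step.
    averages = [
        sum(losses[:step]) / step
        for losses in other_trials_losses
        if len(losses) >= step
    ]
    if not averages:
        return False  # nothing to compare against yet
    return min(trial_losses) > median(averages)


history = [
    [1.0, 0.8, 0.6],
    [0.9, 0.7, 0.5],
    [1.1, 0.9, 0.8],
]
lagging = should_stop([1.5, 1.4, 1.3], history)
promising = should_stop([0.9, 0.6, 0.4], history)
print("stop lagging trial:", lagging)
print("stop promising trial:", promising)
```

The lagging trial is pruned and its resources can go to a new configuration, while the promising one keeps training.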

  45. Advanced Scheduling - Population Based Training
    ● Evolutionary algorithm
    ● Evaluate a population in parallel
    ● Terminate lowest performers
    ● Copy weights of the best performing trials and mutate them
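The exploit-and-explore step described in these bullets can be sketched as follows. This is a simplified, library-free illustration of Population Based Training; the member layout (`weights`, `lr`, `score`), the 25% truncation fraction, and the 0.8 / 1.2 perturbation factors are assumptions made for the example.

```python
import random


def pbt_step(population, rng, truncation=0.25):
    """One exploit/explore step of a simplified Population Based Training.

    Each member is a dict with 'weights', 'lr' and 'score' (higher is better).
    The bottom `truncation` fraction copies the weights of a random top
    performer, then perturbs (mutates) the copied learning rate. Scores are
    left untouched; they would be refreshed by the next training interval.
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    cutoff = max(1, int(len(ranked) * truncation))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        parent = rng.choice(top)
        member["weights"] = list(parent["weights"])  # exploit: copy weights
        member["lr"] = parent["lr"] * rng.choice([0.8, 1.2])  # explore: mutate
    return population


rng = random.Random(0)
population = [
    {"weights": [float(i)], "lr": 0.01 * (i + 1), "score": float(i)}
    for i in range(8)
]
pbt_step(population, rng)
for member in population:
    print(member)
```

Unlike early stopping alone, the replaced trials do not restart from scratch: they continue from a strong member's weights with slightly different hyperparameters.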

  46. Woohoo!
    Let’s review what we have talked about.

  47. ● There are various HPO algorithms with a trend of going parallel
    ● More advanced ones are often hard to implement
    ○ Even more so in a distributed setting

  48. Good news!
    ● Ray Tune implements and integrates with all these algorithms
    ● Allows user to swap out different algorithms very easily

  49. Architecture requirements
    ● Granular control over when to start, pause, early stop, restore,
    or mutate each trial at specific iterations with little overhead
    ● Master-worker architecture that centralizes decision making
    ● Elasticity and fault tolerance

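  50. Ray Tune - distributed HPO
    [Diagram: on the head node, the driver process calls tune.run(train_func) and acts as the orchestrator running the HPO algorithm.]
    from ray import tune  # function API as of Ray 1.x
    def train_func(config):
        # ConvNet and epochs are placeholders from the slide, not defined here
        model = ConvNet(config)
        for i in range(epochs):
            current_loss = model.train()
            tune.report(loss=current_loss)  # send the metric to the orchestrator
    tune.run(
        train_func,
        config={"alpha": tune.uniform(0.001, 0.1)},
        num_samples=100,
        metric="loss",  # added: the scheduler and searcher need a metric and a mode
        mode="min",
        scheduler="asha",
        search_alg="optuna",
    )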

  51. Ray Tune - distributed HPO
    Each actor evaluates one hyperparameter combination (a trial).
    [Diagram: on the head node, the driver process calls tune.run(train_func) and acts as the orchestrator running the HPO algorithm. It launches six worker processes across three worker nodes; each worker process hosts an actor that runs train_func.]

  52. Ray Tune - distributed HPO
    Orchestrator keeps track of all the trials’ progress and metrics.
    [Diagram: same layout as the previous slide; the actors on the worker nodes report metrics back to the orchestrator on the head node.]

  53. Ray Tune - distributed HPO
    Based on the metrics, the orchestrator may stop/pause/mutate trials or launch new trials when resources are available.
    [Diagram: same layout; the orchestrator tells one trial to stop early and the others to continue.]

  54. Ray Tune - distributed HPO
    Resources are repurposed to explore new trials.
    [Diagram: same layout; the orchestrator launches a new trial on the freed worker process.]

  55. Ray Tune - distributed HPO
    Orchestrator also manages checkpoint state.
    [Diagram: same layout; trials are checkpointed to cloud storage.]

  56. Ray Tune - distributed HPO
    A worker process crashes.
    [Diagram: same layout, with one worker process marked as failed.]

  57. Ray Tune - distributed HPO
    A new actor comes up fresh and the crashed trial is restored from the remote checkpoint.
    [Diagram: same layout; the replacement actor loads the checkpoint from cloud storage and restores the trial.]
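The checkpoint-and-restore flow from the last few slides can be mimicked in a few lines. This is a toy sketch, not Ray Tune's checkpointing API: a local temporary directory stands in for cloud storage, and `run_trial`, its JSON state, and the simulated crash are all hypothetical.

```python
import json
import os
import tempfile


def run_trial(checkpoint_path, total_steps, crash_at=None):
    """Run a toy trial, saving a checkpoint after every step.

    If a checkpoint already exists, resume from it instead of starting over.
    `crash_at` simulates a worker failure when that step is reached.
    """
    state = {"step": 0, "loss": 1.0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # restore the trial state
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated worker crash")
        state["step"] += 1
        state["loss"] *= 0.9  # pretend one training step made progress
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)  # checkpoint after every step
    return state


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trial_0.json")
    try:
        run_trial(path, total_steps=5, crash_at=3)
    except RuntimeError:
        pass  # an orchestrator would detect the failure here
    with open(path) as f:
        resumed_from = json.load(f)["step"]
    # A fresh "actor" picks the trial up from the last checkpoint.
    final_state = run_trial(path, total_steps=5)

print("resumed from step", resumed_from, "and finished at step", final_state["step"])
```

Because the state lives outside the worker, only the steps after the last checkpoint are lost when a process dies.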

  58. Woohoo!
    Let’s review what we have talked about.

  59. What makes Ray Tune special
    ● Provides efficient HPO algorithms
    ● Distributes and coordinates parallel trials in a fault-tolerant and elastic manner
    ● Integrated with ML ecosystem

  61. Thank You
    Let’s keep in touch!
    ● https://ray.io/
    ● https://discuss.ray.io/
    ● Ray slack
    ● https://github.com/ray-project/tune-sklearn

  62. Q & A

  63. Appendix

  67. Distributed apps will become the norm

  71. Thank you for listening!

  73. tune-sklearn
    demo

  74. Thank You

  76. Ray: a universal framework for distributed computing
    ● Berkeley RISELab → Ray → Anyscale
    ● Over 500 contributors from a huge number of companies
