2.5x Speedup of GPSampler by Batching (PFN 2025 Summer Domestic Internship)

This deck presents the results of Kaichi Irie's project in the PFN 2025 summer domestic internship. He worked on speeding up Optuna's GPSampler by batching and achieved a 2.5x speedup.

Preferred Networks

November 17, 2025


Transcript

  1. Kaichi Irie (入江 海地)
    2.5x Speedup of GPSampler by Batching
    Mentor: Shuhei Watanabe
    Keywords: Optuna, GPSampler, Batching
  2. Summary
    The Challenge
    • Optuna's GPSampler is a powerful sampler, but it is slow, creating a critical bottleneck in real-world, large-scale optimization tasks
    Objective
    • Accelerate GPSampler without compromising its optimization accuracy, the library's usability, or its maintainability
    Approach
    • Replace sequential multi-start optimization with batching
    Key Achievements & Impact
    • 2.0-2.5x speedup🚀 without degrading accuracy or usability
    • Merged into master🎉: will be available in Optuna v4.6+
  3. About Optuna and GPSampler
    • Optuna is the de facto standard library for black-box optimization
    • GPSampler is a powerful sampler, especially for continuous optimization
      ◦ TPESampler is the default sampler due to its broader applicability
        ▪ e.g., search spaces with conditional parameters
    https://optuna.readthedocs.io/en/stable/reference/samplers/index.html
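As a quick illustration of the API (a minimal sketch; the quadratic objective is a placeholder, not from the deck), switching a study to GPSampler is a one-line change:

```python
import optuna

def objective(trial):
    # Placeholder continuous objective; GPSampler targets spaces like this.
    x = trial.suggest_float("x", -10.0, 10.0)
    return x ** 2

# GPSampler is selected explicitly; TPESampler remains Optuna's default.
study = optuna.create_study(sampler=optuna.samplers.GPSampler(seed=0))
study.optimize(objective, n_trials=100)
print(study.best_params)
```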
  4. The Challenge: GPSampler Is Slow
    • Cubic time complexity: GPSampler's runtime scales as O(N³), where N is the number of trials (sketched below)
      ◦ Example (500 trials):
        ▪ GPSampler: 1,000-10,000 seconds
        ▪ TPESampler: 10-100 seconds
    • Real-world impact: GPSampler sometimes becomes a bottleneck, as its sampling process can take as long as the objective function evaluation itself
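The cubic term is the usual Gaussian-process cost: fitting the surrogate to N trials involves factorizing an N × N kernel matrix. A standalone sketch of that step (generic GP math, not Optuna's internals):

```python
import numpy as np

# Illustration of where the O(N^3) comes from: the Gram matrix over all
# N observed trials must be factorized to fit the GP surrogate.
N, d = 500, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))

sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sqdist)                      # RBF Gram matrix, N x N
L = np.linalg.cholesky(K + 1e-6 * np.eye(N))   # the O(N^3) step
```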
  5. The Inner Optimizations Are the Current Bottleneck
    • Surrogate optimization inside the hyperparameter optimization routine
      ◦ Occupies 50+% of the overall runtime, yet is executed sequentially
    • PyTorch batching is faster than a simple for-loop (see the sketch below)
    Currently sequential, but can be batched!
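A toy comparison of the two evaluation styles (illustrative function, not the actual acquisition function): the loop launches one small tensor computation per point, while the batched version does one vectorized computation for all points:

```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:  # x: (d,) -> scalar
    return (x ** 2).sum()

xs = torch.randn(1024, 16)  # 1024 candidate points, 16 dimensions

# Sequential: one small tensor computation per point.
ys_loop = torch.stack([f(x) for x in xs])

# Batched: a single vectorized computation over all points at once.
ys_batch = (xs ** 2).sum(dim=1)

assert torch.allclose(ys_loop, ys_batch)
```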
  6. Batching Should Be Faster… But How to Batch?
    Prior Work: BoTorch suggests that stacking speeds things up
    • Stacking: one of the batching approaches
    • BoTorch: a library for Bayesian optimization built on PyTorch
    Open Questions:
    1. Which batching methods should we consider?
      ◦ cf. Investigation
    2. What are the pros & cons of each method?
      ◦ cf. Investigation + Speedup
    3. How should we implement it, and is it acceptable for Optuna?
      ◦ cf. Investigation + Master merge
  7. My Contribution 1: Investigation
    Batching Methods
    • Stacking (used in BoTorch):
      ◦ ✅ Fast when the dimension is small & simple implementation
      ◦ ❌ Degrades optimization accuracy
    • Multiprocessing:
      ◦ ✅ Simple implementation
      ◦ ❌ Not so fast (~1.5x) & breaks the Sampler API
    • Batched Evaluation: ⭐ Chosen
      ◦ ✅ Fastest
      ◦ ❌ Requires design discussion for the master merge
  8. My Contribution 2: Speedup🚀
    • 2.0-2.5x speedup by batching
    • Avg. runtime over 300 trials on 5 objectives*1 × 3 seeds
    • Batched Evaluation scales better across dimensionalities
      ◦ dimensions: the search space dimension
    (Figure: runtime comparison, with Batched Evaluation highlighted, across dimensionalities)
    *1 f1, f6, f10, f15, f20 from the blackbox optimization benchmarking (https://hub.optuna.org/benchmarks/bbob/)
  9. OK, Then Let's Merge… Wait! 😱
    Technical Hurdles
    • SciPy API limitations: the SciPy API does not allow batched evaluation
    • Built-in Python multithreading: too slow
    Design Question: Who should handle batching? 🤔
    • The optimizer? The function?
    Project Constraints
    • Avoid additional packages
    • Avoid using internal modules of third-party libraries
    • Keep API compatibility
  10. My Contribution 3: Master Merge 🎉 (#6268)
    • Batched evaluation with greenlet
      ◦ greenlet: a fast coroutine library (& already one of Optuna's requirements)
    • ✅ 2.0-2.5x faster
    • ✅ No additional packages required
    • ✅ Same accuracy, same usability
    • Will be available in Optuna v4.6+
      ◦ Run: pip install -U optuna
    https://github.com/optuna/optuna/pull/6268
  11. What is Multi-Start Optimization?
    • An optimization algorithm can only find a local minimum
    • Multi-start optimization: find the best solution by optimizing from multiple starting points
    • Each optimization is independent (see the sketch below)
    (Figure: principle of the multi-start algorithm on a 1-parameter optimization problem)
    https://www.researchgate.net/figure/Principle-of-the-MultiStart-algorithm-explained-on-a-1-parameter-optimization-problem_fig38_346062586
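A minimal multi-start sketch with SciPy (hypothetical multimodal objective): run one local optimization per starting point and keep the best result:

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    # A multimodal 1-D objective with several local minima.
    return float(np.sin(3 * x[0]) + 0.1 * x[0] ** 2)

rng = np.random.default_rng(0)
starts = rng.uniform(-5.0, 5.0, size=(8, 1))  # 8 independent starting points

# Each restart is an independent local optimization; keep the best result.
results = [minimize(f, x0, method="L-BFGS-B") for x0 in starts]
best = min(results, key=lambda r: r.fun)
print(best.x, best.fun)
```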
  12. Inside a SciPy Optimization Routine
    The algorithm minimizes a function f by repeating a two-step cycle:
    Step 1: Evaluate the current point x: calculate f(x) and the gradient g(x)
    Step 2: Update x: use the value and gradient to decide the next point
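The cycle can be seen directly in SciPy's API: with jac=True, the user-supplied function returns both f(x) and g(x) for one point per call (a minimal sketch with a toy objective):

```python
import numpy as np
from scipy.optimize import minimize

def f_and_grad(x):
    # Step 1: evaluate the current point -- value f(x) and gradient g(x).
    value = float((x ** 2).sum())
    grad = 2.0 * x
    return value, grad

# Step 2 (the update) happens inside L-BFGS-B: it uses the returned value
# and gradient to pick the next x, then calls f_and_grad again -- one point
# per call, which is exactly why batching does not fit the SciPy API directly.
res = minimize(f_and_grad, x0=np.ones(3), jac=True, method="L-BFGS-B")
print(res.x)
```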
  13. Equivalence of GD on Stacked Problems
    Stack the B starting points into a single problem: minimize F(x₁, …, x_B) = Σᵢ f(xᵢ). The gradient decomposes as ∇F = (∇f(x₁), …, ∇f(x_B)), so one gradient descent step on the stacked problem, xᵢ ← xᵢ − η∇f(xᵢ), performs exactly B independent gradient descent steps. The updates are the same!
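A small numerical check of this equivalence (toy objective f(x) = ‖x‖², assumed for illustration):

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = ||x||^2; applying it row-wise to a stacked array
    # gives the gradient of F(X) = sum_i f(x_i).
    return 2.0 * x

eta = 0.1
xs = np.random.default_rng(1).normal(size=(4, 3))  # 4 starts, 3 dims

# B independent gradient descent steps, one per starting point.
indep = np.stack([x - eta * grad_f(x) for x in xs])

# One gradient descent step on the stacked problem.
stacked = xs - eta * grad_f(xs)

assert np.allclose(indep, stacked)  # the updates are the same
```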
  14. What is Batched Evaluation?
    • The bottleneck: evaluating the acquisition function requires expensive computations on large tensors
    • The solution: the optimizer evaluates many points; we can extract these evaluations and run all the tensor computations at once in a batch
  15. Batch Evaluation with greenlet
    (Figure: timeline with a main worker and per-batch workers alternating between L-BFGS iterations and batched evaluation)
    1. Request & Pause: instead of calculating f(x), each worker pauses and sends its point x to the main worker
    2. Batch Evaluation: the main worker collects all the points [x₁, x₂, …] and runs one batch evaluation
    3. Resume & Update: the main worker sends the results back; each optimizer resumes its process with the value it needs
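A minimal sketch of this coroutine dance with greenlet (a toy gradient-descent "optimizer" in place of L-BFGS-B, and a hypothetical batched_eval; not the code from #6268):

```python
import greenlet
import numpy as np

N_STEPS = 50

def worker(x0, main):
    # A toy "optimizer" that never evaluates f itself: it pauses and
    # hands its current point to the main greenlet instead.
    x = x0
    for _ in range(N_STEPS):
        fx, gx = main.switch(x)   # 1. request & pause
        x = x - 0.1 * gx          # 3. resume & update
    return x

def batched_eval(points):
    # 2. batch evaluation: values and gradients for ALL points at once.
    xs = np.asarray(points)
    return (xs ** 2).sum(axis=1), 2.0 * xs

main = greenlet.getcurrent()
workers = [greenlet.greenlet(worker) for _ in range(3)]

# The first switch starts each worker; it returns that worker's first request.
points = [w.switch(np.full(2, float(i + 1)), main) for i, w in enumerate(workers)]

for _ in range(N_STEPS):
    values, grads = batched_eval(points)
    # Resuming a worker returns its next request (or its final x on the
    # last step, when the worker function returns and the greenlet dies).
    points = [w.switch((v, g)) for w, v, g in zip(workers, values, grads)]

print("optimized points:", points)
```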
  16. Stacking: The Good & The Bad
    👍 The Good Stuff
    • 2x-2.5x faster in low dimensions
    • Fewer iterations in low dimensions (for some reason!)
    • Simple to implement with no new dependencies
    👎 The Downsides
    • It changes the optimization problem, which can change the result
    • Suffers in high dimensions: more iterations mean longer runtimes, or lower-quality solutions if you limit them
      ◦ Doesn't work well with quasi-Newton methods (e.g., L-BFGS-B): a poor Hessian approximation causes more iterations (see the sketch below)
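A toy sketch of stacking with SciPy (illustrative objective, not Optuna's acquisition function): B restarts become one L-BFGS-B run over a B·d-dimensional vector, which is also why the quasi-Newton Hessian approximation gets shared across restarts:

```python
import numpy as np
from scipy.optimize import minimize

B, d = 8, 3  # 8 restarts stacked into one problem of dimension B * d

def f_stacked(flat):
    # F(X) = sum_i f(x_i) with f(x) = sum_j sin(3 x_j) + 0.1 x_j^2.
    X = flat.reshape(B, d)
    return float((np.sin(3 * X) + 0.1 * X ** 2).sum())

x0 = np.random.default_rng(0).uniform(-5.0, 5.0, size=B * d)

# One L-BFGS-B run replaces B separate runs -- but the optimizer now
# maintains a single Hessian approximation for the joint B*d problem,
# which is where the iteration/accuracy issues in high dimensions arise.
res = minimize(f_stacked, x0, method="L-BFGS-B")
candidates = res.x.reshape(B, d)  # B candidate solutions from one run
```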
  17. Multiprocessing: The Good & The Bad
    👍 The Good Stuff
    • Consistent ~1.5x speedup in any dimension
    • Super simple to code and to understand
    • Same problem, same results (iterations and accuracy are identical)
    👎 The Downsides
    • Smaller speedup than the other techniques
    • Requires API changes to the Sampler to handle processes
    • Too much overhead to start mid-optimization
    • A nightmare for developers to debug (logging, profiling, etc.)
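For contrast, a minimal multiprocessing sketch (same toy objective as above, assumed for illustration): each process solves an unchanged problem, so results match the sequential version:

```python
import numpy as np
from multiprocessing import Pool
from scipy.optimize import minimize

def f(x):
    return float(np.sin(3 * x).sum() + 0.1 * (x ** 2).sum())

def solve_one(x0):
    # The optimization problem is unchanged, so iterations and accuracy
    # match the sequential multi-start version exactly.
    return minimize(f, x0, method="L-BFGS-B")

if __name__ == "__main__":
    starts = np.random.default_rng(0).uniform(-5.0, 5.0, size=(8, 3))
    with Pool(processes=4) as pool:   # process startup is the overhead
        results = pool.map(solve_one, list(starts))
    best = min(results, key=lambda r: r.fun)
    print(best.x, best.fun)
```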