2.5x Speedup of GPSampler by Batching (PFN 2025 Summer Domestic Internship)

This deck presents the results of Kaichi Irie's project in the PFN 2025 summer domestic internship. He worked on speeding up Optuna's GPSampler by batching and achieved a 2.5x speedup.

Preferred Networks

November 17, 2025


Transcript

  1. Kaichi Irie (入江 海地)
    2.5x Speedup of GPSampler by Batching
    Mentor: Shuhei Watanabe
    Keywords: Optuna, GPSampler, Batching
  2. Summary
    The Challenge
    • Optuna's GPSampler is a powerful sampler, but it is slow, creating a critical bottleneck in real-world, large-scale optimization tasks
    Objective
    • Accelerate GPSampler without compromising its optimization accuracy, the library's usability, or its maintainability
    Approach
    • Replace sequential multi-start optimization with batching
    Key Achievements & Impact
    • 2.0-2.5x speedup🚀 without degrading accuracy or usability
    • Merged into master🎉: will be available in Optuna v4.6+
  3. About Optuna and GPSampler
    • Optuna is the de facto standard library for black-box optimization
    • GPSampler is a powerful sampler, especially for continuous optimization
      ◦ TPESampler is the default sampler due to its broader applicability
        ▪ e.g., search spaces with conditional parameters
    https://optuna.readthedocs.io/en/stable/reference/samplers/index.html
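As a quick illustration of the API (a minimal sketch; the quadratic objective is a placeholder, not from the deck), switching a study to GPSampler is a one-line change:

```python
import optuna

def objective(trial):
    # Placeholder continuous objective; GPSampler targets spaces like this.
    x = trial.suggest_float("x", -10.0, 10.0)
    return x ** 2

# GPSampler is selected explicitly; TPESampler remains Optuna's default.
study = optuna.create_study(sampler=optuna.samplers.GPSampler(seed=0))
study.optimize(objective, n_trials=100)
print(study.best_params)
```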
  4. The Challenge: GPSampler Is Slow
    • Cubic time complexity: GPSampler's runtime scales as O(N³), where N is the number of trials (sketched below)
      ◦ Example (500 trials):
        ▪ GPSampler: 1,000-10,000 seconds
        ▪ TPESampler: 10-100 seconds
    • Real-world impact: GPSampler sometimes becomes a bottleneck, as its sampling process can take as long as the objective function evaluation itself
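The cubic term is the usual Gaussian-process cost: fitting the surrogate to N trials involves factorizing an N × N kernel matrix. A standalone sketch of that step (generic GP math, not Optuna's internals):

```python
import numpy as np

# Illustration of where the O(N^3) comes from: the Gram matrix over all
# N observed trials must be factorized to fit the GP surrogate.
N, d = 500, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))

sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sqdist)                      # RBF Gram matrix, N x N
L = np.linalg.cholesky(K + 1e-6 * np.eye(N))   # the O(N^3) step
```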
  5. The Inner Optimizations Are the Current Bottleneck
    • Surrogate optimization inside the hyperparameter optimization routine
      ◦ Occupies 50+% of the overall runtime, yet is executed sequentially
    • PyTorch batching is faster than a simple for-loop (see the sketch below)
    Currently sequential, but can be batched!
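A toy comparison of the two evaluation styles (illustrative function, not the actual acquisition function): the loop launches one small tensor computation per point, while the batched version does one vectorized computation for all points:

```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:  # x: (d,) -> scalar
    return (x ** 2).sum()

xs = torch.randn(1024, 16)  # 1024 candidate points, 16 dimensions

# Sequential: one small tensor computation per point.
ys_loop = torch.stack([f(x) for x in xs])

# Batched: a single vectorized computation over all points at once.
ys_batch = (xs ** 2).sum(dim=1)

assert torch.allclose(ys_loop, ys_batch)
```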
  6. Batching Should Be Faster… But How to Batch?
    Prior Work: BoTorch suggests that stacking speeds things up
    • Stacking: one of the batching approaches
    • BoTorch: a library for Bayesian optimization built on PyTorch
    Open Questions:
    1. Which batching methods should we consider?
      ◦ cf. Investigation
    2. What are the pros & cons of each method?
      ◦ cf. Investigation + Speedup
    3. How should we implement it, and is it acceptable for Optuna?
      ◦ cf. Investigation + Master merge
  7. My Contribution 1: Investigation
    Batching Methods
    • Stacking (used in BoTorch):
      ◦ ✅ Fast when the dimension is small & simple implementation
      ◦ ❌ Degrades optimization accuracy
    • Multiprocessing:
      ◦ ✅ Simple implementation
      ◦ ❌ Not so fast (~1.5x) & breaks the Sampler API
    • Batched Evaluation: ⭐ Chosen
      ◦ ✅ Fastest
      ◦ ❌ Requires design discussion for the master merge
  8. My Contribution 2: Speedup🚀
    • 2.0-2.5x speedup by batching
    • Avg. runtime over 300 trials on 5 objectives*1 × 3 seeds
    • Batched Evaluation scales better across dimensionalities
      ◦ dimensions: the search space dimension
    (Figure: runtime comparison, with Batched Evaluation highlighted, across dimensionalities)
    *1 f1, f6, f10, f15, f20 from the blackbox optimization benchmarking (https://hub.optuna.org/benchmarks/bbob/)
  9. OK, Then Let's Merge… Wait! 😱
    Technical Hurdles
    • SciPy API limitations: the SciPy API does not allow batched evaluation
    • Built-in Python multithreading: too slow
    Design Question: Who should handle batching? 🤔
    • The optimizer? The function?
    Project Constraints
    • Avoid additional packages
    • Avoid using internal modules of third-party libraries
    • Keep API compatibility
  10. My Contribution 3: Master Merge 🎉 (#6268)
    • Batched evaluation with greenlet
      ◦ greenlet: a fast coroutine library (& already one of Optuna's requirements)
    • ✅ 2.0-2.5x faster
    • ✅ No additional packages required
    • ✅ Same accuracy, same usability
    • Will be available in Optuna v4.6+
      ◦ Run: pip install -U optuna
    https://github.com/optuna/optuna/pull/6268
  11. What is Multi-Start Optimization?
    • An optimization algorithm can only find a local minimum
    • Multi-start optimization: find the best solution by optimizing from multiple starting points
    • Each optimization is independent (see the sketch below)
    (Figure: principle of the multi-start algorithm on a 1-parameter optimization problem)
    https://www.researchgate.net/figure/Principle-of-the-MultiStart-algorithm-explained-on-a-1-parameter-optimization-problem_fig38_346062586
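A minimal multi-start sketch with SciPy (hypothetical multimodal objective): run one local optimization per starting point and keep the best result:

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    # A multimodal 1-D objective with several local minima.
    return float(np.sin(3 * x[0]) + 0.1 * x[0] ** 2)

rng = np.random.default_rng(0)
starts = rng.uniform(-5.0, 5.0, size=(8, 1))  # 8 independent starting points

# Each restart is an independent local optimization; keep the best result.
results = [minimize(f, x0, method="L-BFGS-B") for x0 in starts]
best = min(results, key=lambda r: r.fun)
print(best.x, best.fun)
```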
  12. Inside a SciPy Optimization Routine
    The algorithm minimizes a function f by repeating a two-step cycle:
    Step 1: Evaluate the current point x: calculate f(x) and the gradient g(x)
    Step 2: Update x: use the value and gradient to decide the next point
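The cycle can be seen directly in SciPy's API: with jac=True, the user-supplied function returns both f(x) and g(x) for one point per call (a minimal sketch with a toy objective):

```python
import numpy as np
from scipy.optimize import minimize

def f_and_grad(x):
    # Step 1: evaluate the current point -- value f(x) and gradient g(x).
    value = float((x ** 2).sum())
    grad = 2.0 * x
    return value, grad

# Step 2 (the update) happens inside L-BFGS-B: it uses the returned value
# and gradient to pick the next x, then calls f_and_grad again -- one point
# per call, which is exactly why batching does not fit the SciPy API directly.
res = minimize(f_and_grad, x0=np.ones(3), jac=True, method="L-BFGS-B")
print(res.x)
```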
  13. Equivalence of GD on Stacked Problems
    Stack the B starting points into a single problem: minimize F(x₁, …, x_B) = Σᵢ f(xᵢ). The gradient decomposes as ∇F = (∇f(x₁), …, ∇f(x_B)), so one gradient descent step on the stacked problem, xᵢ ← xᵢ − η∇f(xᵢ), performs exactly B independent gradient descent steps. The updates are the same!
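A small numerical check of this equivalence (toy objective f(x) = ‖x‖², assumed for illustration):

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = ||x||^2; applying it row-wise to a stacked array
    # gives the gradient of F(X) = sum_i f(x_i).
    return 2.0 * x

eta = 0.1
xs = np.random.default_rng(1).normal(size=(4, 3))  # 4 starts, 3 dims

# B independent gradient descent steps, one per starting point.
indep = np.stack([x - eta * grad_f(x) for x in xs])

# One gradient descent step on the stacked problem.
stacked = xs - eta * grad_f(xs)

assert np.allclose(indep, stacked)  # the updates are the same
```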
  14. What is Batched Evaluation?
    • The bottleneck: evaluating the acquisition function requires expensive computations on large tensors
    • The solution: the optimizer evaluates many points; we can extract these evaluations and run all the tensor computations at once in a batch
  15. Batch Evaluation with greenlet
    (Figure: timeline with a main worker and per-batch workers alternating between L-BFGS iterations and batched evaluation)
    1. Request & Pause: instead of calculating f(x), each worker pauses and sends its point x to the main worker
    2. Batch Evaluation: the main worker collects all the points [x₁, x₂, …] and runs one batch evaluation
    3. Resume & Update: the main worker sends the results back; each optimizer resumes its process with the value it needs
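A minimal sketch of this coroutine dance with greenlet (a toy gradient-descent "optimizer" in place of L-BFGS-B, and a hypothetical batched_eval; not the code from #6268):

```python
import greenlet
import numpy as np

N_STEPS = 50

def worker(x0, main):
    # A toy "optimizer" that never evaluates f itself: it pauses and
    # hands its current point to the main greenlet instead.
    x = x0
    for _ in range(N_STEPS):
        fx, gx = main.switch(x)   # 1. request & pause
        x = x - 0.1 * gx          # 3. resume & update
    return x

def batched_eval(points):
    # 2. batch evaluation: values and gradients for ALL points at once.
    xs = np.asarray(points)
    return (xs ** 2).sum(axis=1), 2.0 * xs

main = greenlet.getcurrent()
workers = [greenlet.greenlet(worker) for _ in range(3)]

# The first switch starts each worker; it returns that worker's first request.
points = [w.switch(np.full(2, float(i + 1)), main) for i, w in enumerate(workers)]

for _ in range(N_STEPS):
    values, grads = batched_eval(points)
    # Resuming a worker returns its next request (or its final x on the
    # last step, when the worker function returns and the greenlet dies).
    points = [w.switch((v, g)) for w, v, g in zip(workers, values, grads)]

print("optimized points:", points)
```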
  16. Stacking: The Good & The Bad
    👍 The Good Stuff
    • 2x-2.5x faster in low dimensions
    • Fewer iterations in low dimensions (for some reason!)
    • Simple to implement with no new dependencies
    👎 The Downsides
    • It changes the optimization problem, which can change the result
    • Suffers in high dimensions: more iterations mean longer runtimes, or lower-quality solutions if you limit them
      ◦ Doesn't work well with quasi-Newton methods (e.g., L-BFGS-B): a poor Hessian approximation causes more iterations (see the sketch below)
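A toy sketch of stacking with SciPy (illustrative objective, not Optuna's acquisition function): B restarts become one L-BFGS-B run over a B·d-dimensional vector, which is also why the quasi-Newton Hessian approximation gets shared across restarts:

```python
import numpy as np
from scipy.optimize import minimize

B, d = 8, 3  # 8 restarts stacked into one problem of dimension B * d

def f_stacked(flat):
    # F(X) = sum_i f(x_i) with f(x) = sum_j sin(3 x_j) + 0.1 x_j^2.
    X = flat.reshape(B, d)
    return float((np.sin(3 * X) + 0.1 * X ** 2).sum())

x0 = np.random.default_rng(0).uniform(-5.0, 5.0, size=B * d)

# One L-BFGS-B run replaces B separate runs -- but the optimizer now
# maintains a single Hessian approximation for the joint B*d problem,
# which is where the iteration/accuracy issues in high dimensions arise.
res = minimize(f_stacked, x0, method="L-BFGS-B")
candidates = res.x.reshape(B, d)  # B candidate solutions from one run
```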
  17. Multiprocessing: The Good & The Bad
    👍 The Good Stuff
    • Consistent ~1.5x speedup in any dimension
    • Super simple to code and to understand
    • Same problem, same results (iterations and accuracy are identical)
    👎 The Downsides
    • Smaller speedup than the other techniques
    • Requires API changes to the Sampler to handle processes
    • Too much overhead to start mid-optimization
    • A nightmare for developers to debug (logging, profiling, etc.)
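For contrast, a minimal multiprocessing sketch (same toy objective as above, assumed for illustration): each process solves an unchanged problem, so results match the sequential version:

```python
import numpy as np
from multiprocessing import Pool
from scipy.optimize import minimize

def f(x):
    return float(np.sin(3 * x).sum() + 0.1 * (x ** 2).sum())

def solve_one(x0):
    # The optimization problem is unchanged, so iterations and accuracy
    # match the sequential multi-start version exactly.
    return minimize(f, x0, method="L-BFGS-B")

if __name__ == "__main__":
    starts = np.random.default_rng(0).uniform(-5.0, 5.0, size=(8, 3))
    with Pool(processes=4) as pool:   # process startup is the overhead
        results = pool.map(solve_one, list(starts))
    best = min(results, key=lambda r: r.fun)
    print(best.x, best.fun)
```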