TMPA-2021: Bayesian Optimization with Time-Decaying Jitter for Hyperparameter Tuning of Neural Networks

Exactpro
November 25, 2021

Konstantin Maslov

TMPA is an annual International Conference on Software Testing, Machine Learning and Complex Process Analysis. The conference will focus on the application of modern methods of data science to the analysis of software quality.

To learn more about Exactpro, visit our website https://exactpro.com/

Follow us on
LinkedIn https://www.linkedin.com/company/exactpro-systems-llc
Twitter https://twitter.com/exactpro

Transcript

  1. 25-27 NOVEMBER. SOFTWARE TESTING, MACHINE LEARNING AND COMPLEX PROCESS ANALYSIS

    Bayesian Optimization with Time-Decaying Jitter for Hyperparameter Tuning of Neural Networks. Konstantin A. Maslov, [email protected], Tomsk Polytechnic University
  2. Introduction

    • The performance of deep learning algorithms can depend significantly on the values of their hyperparameters
    • In a number of recent studies, Bayesian optimization methods have become particularly popular for hyperparameter tuning because, unlike other methods, they can significantly reduce the number of calls to the objective function
    • An optimization algorithm should make it possible to balance exploration and exploitation. Bayesian optimization algorithms, however, typically consist of an explicit random search phase followed by a phase of further search for optimal values
    • This split forces the researcher to set an additional parameter of the optimization algorithm, the number of random search iterations, and may also lead to suboptimal results
  3. Aim

    Modify the ordinary Bayesian optimization algorithm by introducing a time-decaying parameter ξ (jitter) that dynamically balances exploration and exploitation

    Objectives
    • Design the modified algorithm
    • Implement the ordinary and the modified algorithms
    • Evaluate them on a number of artificial landscapes
    • Evaluate the algorithms on a practical hyperparameter tuning problem
  4. Algorithm 1: Ordinary Bayesian Optimization Algorithm with Constant Jitter

    Input: f: the objective, D: the search domain, ξ: the jitter parameter, N: the number of iterations in the first (random search) phase, M: the number of iterations in the second phase.
    Output: θ*: the best hyperparameters found.

    best_value = −∞
    repeat N times
        θ = random(D) // sample uniformly from the search domain
        value = f(θ)
        if value > best_value then
            best_value = value
            θ* = θ
        end if
    end repeat
    repeat M times
        fit a Gaussian process g(θ) approximating f(θ)
        θ+ = argmax EIξ(θ) // using g(θ), maximize the EI function parameterized with ξ
        value = f(θ+)
        if value > best_value then
            best_value = value
            θ* = θ+
        end if
    end repeat
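The slide does not spell out EIξ. As a reference point only, a commonly used form of expected improvement with a jitter term for a maximization problem is sketched below. The symbols μ(θ) and σ(θ) (posterior mean and standard deviation of the Gaussian process g), f⁺ (the current best_value), and Φ, φ (standard normal CDF and PDF) are assumptions introduced here, not notation taken from the slides.

```latex
\mathrm{EI}_{\xi}(\theta) =
\begin{cases}
\bigl(\mu(\theta) - f^{+} - \xi\bigr)\,\Phi(Z) + \sigma(\theta)\,\varphi(Z), & \sigma(\theta) > 0,\\[4pt]
0, & \sigma(\theta) = 0,
\end{cases}
\qquad
Z = \frac{\mu(\theta) - f^{+} - \xi}{\sigma(\theta)}
```

Under this form, a larger ξ discounts the predicted gain near the incumbent, so the uncertainty term carries more weight and the search leans toward exploration; a smaller ξ leans toward exploitation.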
  5. Algorithm 2: Bayesian Optimization Algorithm with Time-Decaying Jitter

    Input: f: the objective, D: the search domain, ξ': the initial jitter parameter, N: the number of iterations.
    Output: θ*: the best hyperparameters found.

    θ* = random(D) // sample uniformly from the search domain
    best_value = f(θ*)
    for i = 2, …, N do
        ξ = ξ' / i
        fit a Gaussian process g(θ) approximating f(θ)
        θ+ = argmax EIξ(θ) // using g(θ), maximize the EI function parameterized with ξ
        value = f(θ+)
        if value > best_value then
            best_value = value
            θ* = θ+
        end if
    end for
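A minimal, standard-library-only sketch of Algorithm 2 for a one-dimensional search domain follows. It is an illustration under stated assumptions, not the author's implementation: the Gaussian process uses a fixed squared-exponential kernel with a hypothetical length scale and noise term, argmax EIξ is approximated by scanning a fixed candidate grid, and all function names (`rbf`, `cholesky`, `expected_improvement`, `bo_time_decaying_jitter`) are invented here.

```python
import math
import random


def rbf(a, b, length=0.2):
    """Squared-exponential kernel for scalar inputs (length scale is an assumption)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A; tiny pivots are clamped for stability."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L


def forward(L, b):
    """Solve L z = b by forward substitution."""
    z = []
    for i, bi in enumerate(b):
        z.append((bi - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z


def backward(L, z):
    """Solve L^T x = z by back substitution."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def expected_improvement(mu, sigma, best, xi):
    """Expected improvement for maximization with jitter xi (assumed common form)."""
    gap = mu - best - xi
    if sigma < 1e-12:
        return max(gap, 0.0)
    z = gap / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return gap * cdf + sigma * pdf


def bo_time_decaying_jitter(f, low, high, xi0, n_iter, grid=201, noise=1e-4, seed=0):
    """Maximize f on [low, high]; the jitter at iteration i is xi0 / i."""
    rng = random.Random(seed)
    xs = [rng.uniform(low, high)]          # first point is sampled uniformly
    ys = [f(xs[0])]
    best_x, best_y = xs[0], ys[0]
    candidates = [low + (high - low) * k / (grid - 1) for k in range(grid)]
    for i in range(2, n_iter + 1):
        xi = xi0 / i                       # time-decaying jitter
        mean_y = sum(ys) / len(ys)         # GP is fitted to centered observations
        K = [[rbf(a, b) + (noise if p == q else 0.0) for q, b in enumerate(xs)]
             for p, a in enumerate(xs)]
        L = cholesky(K)
        alpha = backward(L, forward(L, [y - mean_y for y in ys]))
        best_ei, x_next = -math.inf, None
        for c in candidates:
            if c in xs:                    # skip points that were already evaluated
                continue
            k = [rbf(a, c) for a in xs]
            mu = mean_y + sum(ki * ai for ki, ai in zip(k, alpha))
            v = forward(L, k)
            sigma = math.sqrt(max(1.0 - sum(vi * vi for vi in v), 0.0))
            ei = expected_improvement(mu, sigma, best_y, xi)
            if ei > best_ei:
                best_ei, x_next = ei, c
        y_next = f(x_next)
        xs.append(x_next)
        ys.append(y_next)
        if y_next > best_y:
            best_x, best_y = x_next, y_next
    return best_x, best_y
```

For example, `bo_time_decaying_jitter(lambda x: -(x - 0.3) ** 2, -1.0, 1.0, 0.5, 12)` runs twelve evaluations with ξ' = 0.5. There is no separate random search phase: early iterations explore because ξ is large, and later ones exploit as ξ shrinks.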
  6. Artificial Landscapes (1)

    Global minima and search domains are given for i = 1,…, d.

    | Function | Definition | Global minimum | Search domain |
    |---|---|---|---|
    | Sphere | $f(\mathbf{x}) = \sum_{i=1}^{d} x_i^2$ | f(x) = 0, xi = 0 | −2 ≤ xi ≤ 2 |
    | Zakharov | $f(\mathbf{x}) = \sum_{i=1}^{d} x_i^2 + \left(\frac{1}{2}\sum_{i=1}^{d} i x_i\right)^2 + \left(\frac{1}{2}\sum_{i=1}^{d} i x_i\right)^4$ | f(x) = 0, xi = 0 | −5 ≤ xi ≤ 10 |
    | Rosenbrock | $f(\mathbf{x}) = \sum_{i=1}^{d-1} \left[100\left(x_{i+1} - x_i^2\right)^2 + \left(1 - x_i\right)^2\right]$ | f(x) = 0, xi = 1 | −2.048 ≤ xi ≤ 2.048 |
    | Styblinski-Tang | $f(\mathbf{x}) = \frac{1}{2}\sum_{i=1}^{d} \left(x_i^4 - 16 x_i^2 + 5 x_i\right)$ | f(x) = −39.16599d, xi = −2.903534 | −5 ≤ xi ≤ 5 |
  7. Artificial Landscapes (2)

    Global minima and search domains are given for i = 1,…, d.

    | Function | Definition | Global minimum | Search domain |
    |---|---|---|---|
    | Schwefel | $f(\mathbf{x}) = -\sum_{i=1}^{d} x_i \sin\left(\sqrt{\lvert x_i \rvert}\right)$ | f(x) = −418.9829d, xi = 420.9687 | −500 ≤ xi ≤ 500 |
    | Rastrigin | $f(\mathbf{x}) = 10d + \sum_{i=1}^{d} \left(x_i^2 - 10\cos(2\pi x_i)\right)$ | f(x) = 0, xi = 0 | −5.12 ≤ xi ≤ 5.12 |
    | Griewank | $f(\mathbf{x}) = \sum_{i=1}^{d} \frac{x_i^2}{4000} - \prod_{i=1}^{d} \cos\left(\frac{x_i}{\sqrt{i}}\right) + 1$ | f(x) = 0, xi = 0 | −600 ≤ xi ≤ 600 |
    | Ackley | $f(\mathbf{x}) = -a \exp\left(-b\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}\right) - \exp\left(\frac{1}{d}\sum_{i=1}^{d} \cos(c x_i)\right) + a + e$ | f(x) = 0, xi = 0 | −32.768 ≤ xi ≤ 32.768 |
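The eight artificial landscapes can be written as plain Python functions, sketched here so that the listed global minima can be checked numerically. The function names are chosen for illustration. The Ackley defaults a = 20, b = 0.2, c = 2π are the values commonly used in the literature and are an assumption, since the slides leave a, b and c symbolic.

```python
import math


def sphere(x):
    return sum(v * v for v in x)


def zakharov(x):
    s = sum(0.5 * i * v for i, v in enumerate(x, start=1))
    return sum(v * v for v in x) + s ** 2 + s ** 4


def rosenbrock(x):
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))


def styblinski_tang(x):
    return 0.5 * sum(v ** 4 - 16.0 * v ** 2 + 5.0 * v for v in x)


def schwefel(x):
    return -sum(v * math.sin(math.sqrt(abs(v))) for v in x)


def rastrigin(x):
    return 10.0 * len(x) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) for v in x)


def griewank(x):
    prod = math.prod(math.cos(v / math.sqrt(i)) for i, v in enumerate(x, start=1))
    return sum(v * v for v in x) / 4000.0 - prod + 1.0


def ackley(x, a=20.0, b=0.2, c=2.0 * math.pi):
    # a, b, c defaults are assumed; the slides do not give their values
    d = len(x)
    root = math.sqrt(sum(v * v for v in x) / d)
    mean_cos = sum(math.cos(c * v) for v in x) / d
    return -a * math.exp(-b * root) - math.exp(mean_cos) + a + math.e
```

Algorithms 1 and 2 maximize their objective, so one way to use these landscapes with them is to pass the negated function, for example `lambda x: -sphere(x)`.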
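  8. Evaluation of Artificial Landscapes (1)

    | Function | d | T | p-value | Statistically significant? (α = 0.05) | Better algorithm |
    |---|---|---|---|---|---|
    | Sphere | 5 | 0 | 1.7e−6 | Yes | Algorithm 1 |
    | Sphere | 10 | 220 | 0.80 | No | Equivalent |
    | Sphere | 50 | 8 | 3.9e−6 | Yes | Algorithm 2 |
    | Zakharov | 5 | 232 | 0.99 | No | Equivalent |
    | Zakharov | 10 | 106 | 9.3e−3 | Yes | Algorithm 1 |
    | Zakharov | 50 | 166 | 0.17 | No | Equivalent |
    | Rosenbrock | 5 | 27 | 2.4e−5 | Yes | Algorithm 2 |
    | Rosenbrock | 10 | 175 | 0.24 | No | Equivalent |
    | Rosenbrock | 50 | 148 | 8.2e−2 | No | Equivalent |
    | Styblinski-Tang | 5 | 229 | 0.94 | No | Equivalent |
    | Styblinski-Tang | 10 | 81 | 1.8e−3 | Yes | Algorithm 1 |
    | Styblinski-Tang | 50 | 180 | 0.28 | No | Equivalent |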
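  9. Evaluation of Artificial Landscapes (2)

    | Function | d | T | p-value | Statistically significant? (α = 0.05) | Better algorithm |
    |---|---|---|---|---|---|
    | Schwefel | 5 | 222 | 0.83 | No | Equivalent |
    | Schwefel | 10 | 187 | 0.35 | No | Equivalent |
    | Schwefel | 50 | 209 | 0.63 | No | Equivalent |
    | Rastrigin | 5 | 152 | 9.8e−2 | No | Equivalent |
    | Rastrigin | 10 | 206 | 0.59 | No | Equivalent |
    | Rastrigin | 50 | 166 | 0.17 | No | Equivalent |
    | Griewank | 5 | 171 | 0.21 | No | Equivalent |
    | Griewank | 10 | 216 | 0.73 | No | Equivalent |
    | Griewank | 50 | 213 | 0.69 | No | Equivalent |
    | Ackley | 5 | 181 | 0.29 | No | Equivalent |
    | Ackley | 10 | 212 | 0.67 | No | Equivalent |
    | Ackley | 50 | 216 | 0.73 | No | Equivalent |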
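The slides name neither the statistical test nor the number of runs behind T and the p-values. The reported pairs are, however, numerically consistent with a two-sided Wilcoxon signed-rank test on n = 30 paired runs using the normal approximation without continuity correction. That reading is an inference from the numbers, not something the source states, and the sketch below only shows how such p-values would be obtained under that assumption. The function name is hypothetical.

```python
import math


def wilcoxon_p_normal(t_stat, n):
    """Two-sided p-value for a signed-rank statistic T under the normal approximation.

    Assumes n paired samples with no ties or zero differences and no continuity
    correction; both n and the test itself are assumptions about the slides.
    """
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (t_stat - mean) / sd
    # 2 * (1 - Phi(|z|)) expressed through the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))


for t in (0, 8, 106, 220):
    print(t, wilcoxon_p_normal(t, 30))
```

With n = 30 this gives values close to the reported 1.7e−6, 3.9e−6, 9.3e−3 and 0.80 for T = 0, 8, 106 and 220 respectively.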
  10. U-Net

    • A fully convolutional network for semantic image segmentation
    • Was applied to the segmentation of damaged Pinus sibirica trees (five classes of tree condition plus background, C = 6)
    • During training, the soft Jaccard coefficient was maximized:

    $J(\mathbf{T}, \mathbf{P}) = \frac{1}{C} \sum_{c=1}^{C} \frac{\sum_{i=1}^{H} \sum_{j=1}^{W} \mathrm{LS}(\mathbf{T})_{ijc} \mathbf{P}_{ijc} + \theta_{loss\_s}}{\sum_{i=1}^{H} \sum_{j=1}^{W} \left( \mathrm{LS}(\mathbf{T})_{ijc} + \left(1 - \mathrm{LS}(\mathbf{T})_{ijc}\right) \mathbf{P}_{ijc} \right) + \theta_{loss\_s}} \to \max,$

    $\mathrm{LS}(\mathbf{T}) = \mathbf{T}\left(1 - \theta_{label\_s}\right) + \frac{\theta_{label\_s}}{C}$
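A small pure-Python sketch of a smoothed soft Jaccard coefficient with label smoothing is given below to make the roles of the two smoothing hyperparameters concrete. It assumes the usual formulation in which the loss smoothing coefficient is added once to the numerator and once to the denominator of each class term; the exact placement in the original slide formula could not be verified from the transcript. The function names and the `[C][H*W]` list layout are choices made for this example.

```python
def smooth_labels(t, label_s, n_classes):
    """Label smoothing: LS(t) = t * (1 - label_s) + label_s / C."""
    return t * (1.0 - label_s) + label_s / n_classes


def soft_jaccard(target, pred, loss_s=1e-5, label_s=0.0):
    """Mean soft Jaccard over classes.

    target, pred: per-class lists of flattened pixel values, shape [C][H*W].
    loss_s is added to numerator and denominator (assumed placement).
    """
    n_classes = len(target)
    total = 0.0
    for t_c, p_c in zip(target, pred):
        ls = [smooth_labels(t, label_s, n_classes) for t in t_c]
        intersection = sum(l * p for l, p in zip(ls, p_c))
        union = sum(l + (1.0 - l) * p for l, p in zip(ls, p_c))
        total += (intersection + loss_s) / (union + loss_s)
    return total / n_classes


# Two classes, two pixels: a perfect prediction scores 1, a fully wrong one near 0.
truth = [[1.0, 0.0], [0.0, 1.0]]
print(soft_jaccard(truth, truth))
print(soft_jaccard(truth, [[0.0, 1.0], [1.0, 0.0]]))
```

The smoothing term keeps the ratio defined for classes that are absent from both the target and the prediction, where the sums would otherwise give 0/0.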
  11. Identified Hyperparameters

    | # | Hyperparameter | Description | Search domain |
    |---|---|---|---|
    | 1 | log10 θlr | Logarithm of the learning rate | −8 ≤ log10 θlr ≤ −2 |
    | 2 | θd | Spatial dropout rate | 0 ≤ θd ≤ 0.5 |
    | 3 | log10 θloss_s | Logarithm of the loss smoothing coefficient | −7 ≤ log10 θloss_s ≤ −3 |
    | 4 | θlabel_s | Label smoothing coefficient | 0 ≤ θlabel_s ≤ 0.8 |
    | 5 | θz | Zoom rate for random clipping | 0.5 ≤ θz ≤ 1 |
    | 6 | θb | Brightness change rate | 0 ≤ θb ≤ 0.6 |
    | 7 | θc | Contrast change rate | 0 ≤ θc ≤ 1 |
    | 8 | θα | Elastic transformation coefficients | 30 ≤ θα ≤ 300 |
    | 9 | θσ | Elastic transformation coefficients | 3 ≤ θσ ≤ 20 |
    | 10 | θlr_d | Exponential learning rate decay rate | 0.7 ≤ θlr_d ≤ 1 |
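The search domain above can be encoded as a dictionary of bounds, with the two log-scaled hyperparameters converted back to their natural scale before training. This is a hypothetical sketch: the key names, `sample` and `decode` are invented for illustration, and only the numeric bounds come from the slide.

```python
import random

# Bounds copied from the slide; key names are illustrative.
DOMAIN = {
    "log10_lr": (-8.0, -2.0),
    "dropout": (0.0, 0.5),
    "log10_loss_s": (-7.0, -3.0),
    "label_s": (0.0, 0.8),
    "zoom": (0.5, 1.0),
    "brightness": (0.0, 0.6),
    "contrast": (0.0, 1.0),
    "elastic_alpha": (30.0, 300.0),
    "elastic_sigma": (3.0, 20.0),
    "lr_decay": (0.7, 1.0),
}


def sample(domain, rng):
    """Draw one point uniformly from the box-shaped search domain, as random(D) does."""
    return {name: rng.uniform(low, high) for name, (low, high) in domain.items()}


def decode(theta):
    """Map the log-scaled entries back to the values a training run would use."""
    params = {k: v for k, v in theta.items() if not k.startswith("log10_")}
    params["lr"] = 10.0 ** theta["log10_lr"]
    params["loss_s"] = 10.0 ** theta["log10_loss_s"]
    return params


theta = sample(DOMAIN, random.Random(0))
print(decode(theta))
```

Searching over log10 of the learning rate and of the loss smoothing coefficient means a uniform sample covers every order of magnitude in the range equally often.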
  12. Evaluation on Semantic Image Segmentation Problem

    • Both algorithms are suitable for the addressed hyperparameter tuning problem
    • Although Algorithm 2 with ξ' = 0.5 produced the best result (J = 0.6412), no conclusions can be drawn about the statistical significance of the obtained results
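  13. Segmentation Results

    | Jc: Background | Jc: Living | Jc: Dying | Jc: Recently dead | Jc: Long dead | J |
    |---|---|---|---|---|---|
    | 0.8561 | 0.7509 | 0.4072 | 0.7808 | 0.6919 | 0.6974 |

    [Figure panels: test area, ground truth, output]

    $J_c = \frac{TP_c}{TP_c + FP_c + FN_c}, \qquad J = \frac{1}{C} \sum_{c=1}^{C} J_c$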
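The two formulas on this slide are simple enough to check by hand. The sketch below computes a per-class Jaccard coefficient from true-positive, false-positive and false-negative counts, and confirms that averaging the five per-class values listed on the slide reproduces the reported J. The function names and the example counts are illustrative.

```python
def jaccard(tp, fp, fn):
    """Per-class Jaccard coefficient J_c = TP / (TP + FP + FN)."""
    return tp / (tp + fp + fn)


def mean_jaccard(per_class):
    """Mean of the per-class coefficients."""
    return sum(per_class) / len(per_class)


# Per-class values as listed on the slide:
# background, living, dying, recently dead, long dead.
slide_values = [0.8561, 0.7509, 0.4072, 0.7808, 0.6919]
print(round(mean_jaccard(slide_values), 4))

# Illustrative counts: 3 true positives, 1 false positive, 1 false negative.
print(jaccard(3, 1, 1))
```

The mean of the five listed values is 0.69738, which rounds to the reported J = 0.6974.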
  14. Conclusion

    • This study proposed a Bayesian optimization algorithm with time-decaying jitter for finding the optimal hyperparameters of neural networks
    • The proposed algorithm has fewer parameters than the ordinary one
    • On some landscapes (Sphere, d = 50; Rosenbrock, d = 5) the proposed algorithm gives statistically significantly better results than the ordinary one; on others (Sphere, d = 5; Zakharov, d = 10; Styblinski-Tang, d = 10) it gives worse results; and for most of the artificial landscapes, no statistically significant difference was found
    • The proposed algorithm showed comparable performance when tuning the hyperparameters of a neural network for semantic image segmentation, but no claims can be made about the non-randomness of these results (not enough experiments)
    • ‘While not a silver bullet, the proposed algorithm seems to be a good addition to the toolbox of a data scientist’ (one of the reviewers)