TMPA-2021: Bayesian Optimization with Time-Decaying Jitter for Hyperparameter Tuning of Neural Networks

1 25-27 NOVEMBER
SOFTWARE TESTING, MACHINE LEARNING AND COMPLEX PROCESS ANALYSIS
Bayesian Optimization with Time-Decaying Jitter for Hyperparameter Tuning of Neural Networks
Konstantin A. Maslov, [email protected]
Tomsk Polytechnic University

2 Introduction
• The performance of deep learning algorithms can depend significantly on the values of their hyperparameters
• In a number of recent studies, Bayesian optimization methods have gained particular popularity for hyperparameter tuning because, unlike other methods, they can significantly reduce the number of calls to the objective function
• An optimization algorithm should make it possible to balance exploration and exploitation. Bayesian optimization algorithms, however, typically have an explicit random search phase followed by a separate phase of further search for optimal values
• This split forces the researcher to set an additional parameter of the optimization algorithm, the number of random search iterations, and may also lead to suboptimal results

3 Aim
Modify the ordinary Bayesian optimization algorithm by introducing a time-decaying parameter ξ (jitter) that dynamically balances exploration and exploitation
Objectives
• Design the modified algorithm
• Implement the ordinary and the modified algorithms
• Evaluate them on a number of artificial landscapes
• Evaluate the algorithms on a practical hyperparameter tuning problem

4 Algorithm 1: Ordinary Bayesian Optimization Algorithm with Constant Jitter
Input: f: the objective, D: the search domain, ξ: the jitter parameter, N: the number of iterations in the first (random search) phase, M: the number of iterations in the second phase.
Output: θ*: the best hyperparameters found.
best_value = −∞
repeat N times
    θ = random(D) // sample uniformly from the search domain
    value = f(θ)
    if value > best_value then
        best_value = value
        θ* = θ
    end if
end repeat
repeat M times
    fit a Gaussian process g(θ) approximating f(θ)
    θ+ = argmax EIξ(θ) // using g(θ), maximize the expected improvement (EI) function parameterized with ξ
    value = f(θ+)
    if value > best_value then
        best_value = value
        θ* = θ+ // store the point just evaluated, not the last random sample
    end if
end repeat
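The slide uses EIξ(θ) without spelling it out. A minimal sketch of the commonly used closed form for a maximization problem is given here, assuming the Gaussian process returns a posterior mean `mu` and standard deviation `sigma` at the candidate point. The function name, the argument names, and the convention of returning 0 when `sigma` is 0 are assumptions for illustration, not taken from the slides.

```python
import math


def expected_improvement(mu, sigma, best_value, xi):
    """Expected improvement over best_value for a maximization problem.

    mu, sigma: posterior mean and standard deviation at the candidate point.
    xi: jitter; larger values demand a bigger predicted gain, which favors
    uncertain (unexplored) points over confident ones.
    """
    if sigma <= 0.0:
        # Assumed convention: no uncertainty means no expected improvement.
        return 0.0
    improvement = mu - best_value - xi
    z = improvement / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))           # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)    # standard normal PDF
    return improvement * cdf + sigma * pdf
```

With `best_value = 1.0`, a confident candidate (`mu = 1.05`, `sigma = 0.01`) scores higher than an uncertain one (`mu = 0.9`, `sigma = 0.1`) when `xi = 0`, and lower when `xi = 0.1`, which is the exploration effect the jitter is meant to control.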

5 Algorithm 2: Bayesian Optimization Algorithm with Time-Decaying Jitter
Input: f: the objective, D: the search domain, ξ': the initial jitter parameter, N: the number of iterations.
Output: θ*: the best hyperparameters found.
θ* = random(D) // sample uniformly from the search domain
best_value = f(θ*)
for i = 2, …, N do
    ξ = ξ' / i
    fit a Gaussian process g(θ) approximating f(θ)
    θ+ = argmax EIξ(θ) // using g(θ), maximize the expected improvement (EI) function parameterized with ξ
    value = f(θ+)
    if value > best_value then
        best_value = value
        θ* = θ+ // store the point just evaluated
    end if
end for
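The outer loop of Algorithm 2 can be sketched in a few lines of Python. This is only a skeleton under stated assumptions: fitting the Gaussian process and maximizing EI are hidden behind a hypothetical `propose(history, xi)` callback, and `sample_domain` is a hypothetical function that draws one point uniformly from D. Neither name comes from the slides.

```python
def jitter_schedule(xi0, n_iterations):
    """Yield (i, xi0 / i) for i = 2..N, the decay step of Algorithm 2."""
    for i in range(2, n_iterations + 1):
        yield i, xi0 / i


def optimize_with_decaying_jitter(f, sample_domain, propose, xi0, n_iterations):
    """Maximize f with a single phase whose jitter decays as xi0 / i.

    propose(history, xi) stands in for "fit a Gaussian process on history,
    then return the argmax of EI with jitter xi" (hypothetical callback).
    """
    best_theta = sample_domain()            # the only purely random evaluation
    best_value = f(best_theta)
    history = [(best_theta, best_value)]    # all (theta, value) pairs seen so far
    for _, xi in jitter_schedule(xi0, n_iterations):
        theta = propose(history, xi)
        value = f(theta)
        history.append((theta, value))
        if value > best_value:
            best_value, best_theta = value, theta
    return best_theta, best_value
```

With ξ' = 0.5 and N = 5, the schedule gives ξ = 0.25, 0.1667, 0.125, 0.1: early iterations lean toward exploration and later ones toward exploitation, without a separate random search phase or its iteration count.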

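6 Artificial Landscapes (1)
Sphere
    Definition: $f(\mathbf{x}) = \sum_{i=1}^{d} x_i^2$
    Global minimum (i = 1,…, d): f(x) = 0, xi = 0
    Search domain (i = 1,…, d): −2 ≤ xi ≤ 2
Zakharov
    Definition: $f(\mathbf{x}) = \sum_{i=1}^{d} x_i^2 + \left( \frac{1}{2} \sum_{i=1}^{d} i x_i \right)^2 + \left( \frac{1}{2} \sum_{i=1}^{d} i x_i \right)^4$
    Global minimum (i = 1,…, d): f(x) = 0, xi = 0
    Search domain (i = 1,…, d): −5 ≤ xi ≤ 10
Rosenbrock
    Definition: $f(\mathbf{x}) = \sum_{i=1}^{d-1} \left[ 100 \left( x_{i+1} - x_i^2 \right)^2 + \left( 1 - x_i \right)^2 \right]$
    Global minimum (i = 1,…, d): f(x) = 0, xi = 1
    Search domain (i = 1,…, d): −2.048 ≤ xi ≤ 2.048
Styblinski-Tang
    Definition: $f(\mathbf{x}) = \frac{1}{2} \sum_{i=1}^{d} \left( x_i^4 - 16 x_i^2 + 5 x_i \right)$
    Global minimum (i = 1,…, d): f(x) = −39.16599d, xi = −2.903534
    Search domain (i = 1,…, d): −5 ≤ xi ≤ 5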

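7 Artificial Landscapes (2)
Schwefel
    Definition: $f(\mathbf{x}) = -\sum_{i=1}^{d} x_i \sin\left( \sqrt{|x_i|} \right)$
    Global minimum (i = 1,…, d): f(x) = −418.9829d, xi = 420.9687
    Search domain (i = 1,…, d): −500 ≤ xi ≤ 500
Rastrigin
    Definition: $f(\mathbf{x}) = 10d + \sum_{i=1}^{d} \left[ x_i^2 - 10 \cos\left( 2 \pi x_i \right) \right]$
    Global minimum (i = 1,…, d): f(x) = 0, xi = 0
    Search domain (i = 1,…, d): −5.12 ≤ xi ≤ 5.12
Griewank
    Definition: $f(\mathbf{x}) = \frac{1}{4000} \sum_{i=1}^{d} x_i^2 - \prod_{i=1}^{d} \cos\left( \frac{x_i}{\sqrt{i}} \right) + 1$
    Global minimum (i = 1,…, d): f(x) = 0, xi = 0
    Search domain (i = 1,…, d): −600 ≤ xi ≤ 600
Ackley
    Definition: $f(\mathbf{x}) = -a \exp\left( -b \sqrt{ \frac{1}{d} \sum_{i=1}^{d} x_i^2 } \right) - \exp\left( \frac{1}{d} \sum_{i=1}^{d} \cos\left( c x_i \right) \right) + a + e$
    Global minimum (i = 1,…, d): f(x) = 0, xi = 0
    Search domain (i = 1,…, d): −32.768 ≤ xi ≤ 32.768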
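For reference, the eight landscapes from slides 6 and 7 can be written in plain Python as below. This is a sketch, not the author's implementation. The Ackley constants a = 20, b = 0.2, c = 2π are the conventional defaults and are an assumption, since the slides do not give them. All functions are minimized, so they would need to be negated before being passed to the maximizing algorithms above.

```python
import math


def sphere(x):
    return sum(xi ** 2 for xi in x)


def zakharov(x):
    s = sum(0.5 * i * xi for i, xi in enumerate(x, start=1))
    return sum(xi ** 2 for xi in x) + s ** 2 + s ** 4


def rosenbrock(x):
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))


def styblinski_tang(x):
    return 0.5 * sum(xi ** 4 - 16.0 * xi ** 2 + 5.0 * xi for xi in x)


def schwefel(x):
    return -sum(xi * math.sin(math.sqrt(abs(xi))) for xi in x)


def rastrigin(x):
    return 10.0 * len(x) + sum(xi ** 2 - 10.0 * math.cos(2.0 * math.pi * xi) for xi in x)


def griewank(x):
    quadratic = sum(xi ** 2 for xi in x) / 4000.0
    product = math.prod(math.cos(xi / math.sqrt(i)) for i, xi in enumerate(x, start=1))
    return quadratic - product + 1.0


def ackley(x, a=20.0, b=0.2, c=2.0 * math.pi):
    # a, b, c: conventional defaults (assumed, not stated on the slides).
    d = len(x)
    root_mean_square = math.sqrt(sum(xi ** 2 for xi in x) / d)
    mean_cosine = sum(math.cos(c * xi) for xi in x) / d
    return -a * math.exp(-b * root_mean_square) - math.exp(mean_cosine) + a + math.e
```

Evaluating each function at the tabulated minimizer, for example `styblinski_tang([-2.903534] * d)` against −39.16599d, is a quick sanity check on the definitions.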

8 8 Artificial landscapes, d = 2

9 Evaluation of Artificial Landscapes (1)
Function         d    T    p-value  Statistically significant? (α = 0.05)  Better algorithm
Sphere           5    0    1.7e−6   Yes                                    Algorithm 1
Sphere           10   220  0.80     No                                     Equivalent
Sphere           50   8    3.9e−6   Yes                                    Algorithm 2
Zakharov         5    232  0.99     No                                     Equivalent
Zakharov         10   106  9.3e−3   Yes                                    Algorithm 1
Zakharov         50   166  0.17     No                                     Equivalent
Rosenbrock       5    27   2.4e−5   Yes                                    Algorithm 2
Rosenbrock       10   175  0.24     No                                     Equivalent
Rosenbrock       50   148  8.2e−2   No                                     Equivalent
Styblinski-Tang  5    229  0.94     No                                     Equivalent
Styblinski-Tang  10   81   1.8e−3   Yes                                    Algorithm 1
Styblinski-Tang  50   180  0.28     No                                     Equivalent

10 Evaluation of Artificial Landscapes (2)
Function         d    T    p-value  Statistically significant? (α = 0.05)  Better algorithm
Schwefel         5    222  0.83     No                                     Equivalent
Schwefel         10   187  0.35     No                                     Equivalent
Schwefel         50   209  0.63     No                                     Equivalent
Rastrigin        5    152  9.8e−2   No                                     Equivalent
Rastrigin        10   206  0.59     No                                     Equivalent
Rastrigin        50   166  0.17     No                                     Equivalent
Griewank         5    171  0.21     No                                     Equivalent
Griewank         10   216  0.73     No                                     Equivalent
Griewank         50   213  0.69     No                                     Equivalent
Ackley           5    181  0.29     No                                     Equivalent
Ackley           10   212  0.67     No                                     Equivalent
Ackley           50   216  0.73     No                                     Equivalent
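The slides do not name the statistical test behind T and the p-values. The numbers are consistent with a two-sided Wilcoxon signed-rank test on 30 paired runs using the normal approximation without continuity correction. Both the test and the sample size are inferences from the table, not statements from the source. Under that assumption, the p-value follows from T as sketched below (the function name is hypothetical).

```python
import math


def wilcoxon_p_value(t_statistic, n_pairs):
    """Two-sided p-value for a Wilcoxon signed-rank statistic T.

    Uses the normal approximation without continuity correction and
    ignores ties and zero differences (assumed setup, see lead-in).
    """
    mean = n_pairs * (n_pairs + 1) / 4.0
    std = math.sqrt(n_pairs * (n_pairs + 1) * (2 * n_pairs + 1) / 24.0)
    z = (t_statistic - mean) / std
    # 2 * Phi(-|z|) expressed through the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))
```

With `n_pairs = 30`, T = 0 gives roughly 1.7e−6, T = 106 roughly 9.3e−3, and T = 220 roughly 0.80, in line with the Sphere (d = 5), Zakharov (d = 10), and Sphere (d = 10) rows.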

11 U-Net
• A fully convolutional network for semantic image segmentation
• Applied to the segmentation of damaged Pinus sibirica trees (five classes of tree condition plus background, C = 6)
• During training, the soft Jaccard coefficient was maximized:
$$J(\mathbf{T}, \mathbf{P}) = \frac{1}{C} \sum_{c=1}^{C} \frac{\sum_{i=1}^{H} \sum_{j=1}^{W} \mathrm{LS}(\mathbf{T})_{ijc} P_{ijc} + \theta_{loss\_s}}{\sum_{i=1}^{H} \sum_{j=1}^{W} \left[ \mathrm{LS}(\mathbf{T})_{ijc} + \left( 1 - \mathrm{LS}(\mathbf{T})_{ijc} \right) P_{ijc} \right] + \theta_{loss\_s}} \to \max$$
$$\mathrm{LS}(\mathbf{T}) = \left( 1 - \theta_{label\_s} \right) \mathbf{T} + \frac{\theta_{label\_s}}{C}$$
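A small pure-Python sketch of the soft Jaccard objective with label smoothing follows. It reads T as a one-hot ground-truth tensor and P as predicted class probabilities, both of shape H × W × C and stored as nested lists. That reading, the function names, and the default values are assumptions. A real training loop would use a tensor library instead.

```python
def smooth_labels(t, label_s, n_classes):
    """Label smoothing: LS(t) = (1 - label_s) * t + label_s / C."""
    return (1.0 - label_s) * t + label_s / n_classes


def soft_jaccard(target, pred, loss_s=1e-5, label_s=0.0):
    """Mean soft Jaccard coefficient over classes.

    target, pred: nested lists indexed as [row][column][class].
    loss_s: smoothing term added to numerator and denominator.
    """
    n_classes = len(target[0][0])
    total = 0.0
    for c in range(n_classes):
        intersection = 0.0
        union = 0.0
        for row_t, row_p in zip(target, pred):
            for pixel_t, pixel_p in zip(row_t, row_p):
                t = smooth_labels(pixel_t[c], label_s, n_classes)
                p = pixel_p[c]
                intersection += t * p
                union += t + (1.0 - t) * p   # equals t + p - t * p
        total += (intersection + loss_s) / (union + loss_s)
    return total / n_classes
```

Without label smoothing, a prediction identical to the one-hot target scores exactly 1. With `label_s > 0` the same prediction scores slightly below 1, because the smoothed target never reaches 0 or 1.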

12 Identified Hyperparameters
#   Hyperparameter  Description                                      Search domain
1   log10 θlr       Logarithm of the learning rate                   −8 ≤ log10 θlr ≤ −2
2   θd              Spatial dropout rate                             0 ≤ θd ≤ 0.5
3   log10 θloss_s   Logarithm of the loss smoothing coefficient      −7 ≤ log10 θloss_s ≤ −3
4   θlabel_s        Label smoothing coefficient                      0 ≤ θlabel_s ≤ 0.8
5   θz              Zoom rate for random clipping                    0.5 ≤ θz ≤ 1
6   θb              Brightness change rate                           0 ≤ θb ≤ 0.6
7   θc              Contrast change rate                             0 ≤ θc ≤ 1
8   θα              Elastic transformation coefficients              30 ≤ θα ≤ 300
9   θσ              Elastic transformation coefficients              3 ≤ θσ ≤ 20
10  θlr_d           Exponential learning rate decay rate             0.7 ≤ θlr_d ≤ 1
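The random(D) step of the algorithms can be illustrated on this search domain. The sketch below samples each hyperparameter uniformly within its bounds, with the learning rate and loss smoothing coefficient sampled in log10 space as in the table. The dictionary keys and function names are hypothetical labels chosen for this example.

```python
import random

# Bounds copied from the table; key names are illustrative only.
SEARCH_DOMAIN = {
    "log10_lr": (-8.0, -2.0),
    "dropout": (0.0, 0.5),
    "log10_loss_s": (-7.0, -3.0),
    "label_s": (0.0, 0.8),
    "zoom": (0.5, 1.0),
    "brightness": (0.0, 0.6),
    "contrast": (0.0, 1.0),
    "elastic_alpha": (30.0, 300.0),
    "elastic_sigma": (3.0, 20.0),
    "lr_decay": (0.7, 1.0),
}


def sample_hyperparameters(rng):
    """Draw one point uniformly from the box-shaped search domain."""
    return {name: rng.uniform(low, high) for name, (low, high) in SEARCH_DOMAIN.items()}


def to_training_config(theta):
    """Convert log10-scaled entries back to their natural scale."""
    config = dict(theta)
    config["lr"] = 10.0 ** config.pop("log10_lr")
    config["loss_s"] = 10.0 ** config.pop("log10_loss_s")
    return config
```

Searching the learning rate in log space means that a uniform draw covers each order of magnitude between 1e−8 and 1e−2 equally, instead of concentrating almost all samples near the upper bound.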

13 Evaluation on Semantic Image Segmentation Problem
• Both algorithms are suitable for the addressed hyperparameter tuning problem
• Although Algorithm 2 with ξ' = 0.5 produced the best result (J = 0.6412), no conclusions can be drawn about the statistical significance of the obtained results

14 Segmentation Results
[Figure panels: Test area, Ground truth, Output]
Class          Jc
Background     0.8561
Living         0.7509
Dying          0.4072
Recently dead  0.7808
Long dead      0.6919
J              0.6974
$$J_c = \frac{TP_c}{TP_c + FP_c + FN_c}, \qquad J = \frac{1}{C} \sum_{c=1}^{C} J_c$$
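The two formulas on this slide translate directly into code. The sketch below uses hypothetical function names and also checks that the reported J equals the mean of the five per-class values listed on the slide.

```python
def jaccard(tp, fp, fn):
    """Per-class Jaccard coefficient from true positive, false positive,
    and false negative pixel counts (assumes at least one is non-zero)."""
    return tp / (tp + fp + fn)


def mean_jaccard(per_class):
    """Average of the per-class Jaccard coefficients."""
    return sum(per_class) / len(per_class)


# Per-class values from the slide: Background, Living, Dying,
# Recently dead, Long dead.
reported = [0.8561, 0.7509, 0.4072, 0.7808, 0.6919]
print(round(mean_jaccard(reported), 4))  # 0.6974
```

The weakest class by a wide margin is Dying (0.4072), so it pulls the mean down the most.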

15 Conclusion
• This study proposed a Bayesian optimization algorithm with time-decaying jitter for finding the optimal hyperparameters of neural networks
• The proposed algorithm has fewer parameters than the ordinary one
• On some landscapes (Sphere, d = 50; Rosenbrock, d = 5) the proposed algorithm consistently performs better than the ordinary one, on others (Sphere, d = 5; Zakharov, d = 10; Styblinski-Tang, d = 10) it performs worse, and on most of the artificial landscapes no statistically significant difference was found
• The proposed algorithm showed comparable performance when tuning the hyperparameters of a neural network for semantic image segmentation, but no claims can be made about the non-randomness of these results (not enough experiments)
• ‘While not a silver bullet, the proposed algorithm seems to be a good addition to the toolbox of a data scientist’ (one of the reviewers)

16 Thank You! Follow TMPA on Facebook TMPA-2021 Conference
