TMPA-2021: Bayesian Optimization with Time-Decaying Jitter for Hyperparameter Tuning of Neural Networks

Exactpro
November 25, 2021

Konstantin Maslov

TMPA is an annual International Conference on Software Testing, Machine Learning and Complex Process Analysis. The conference will focus on the application of modern methods of data science to the analysis of software quality.

To learn more about Exactpro, visit our website https://exactpro.com/

Follow us on
LinkedIn https://www.linkedin.com/company/exactpro-systems-llc
Twitter https://twitter.com/exactpro

Transcript

  1. 1
    25-27 NOVEMBER
    SOFTWARE TESTING, MACHINE LEARNING
    AND COMPLEX PROCESS ANALYSIS
    Bayesian Optimization with Time-
    Decaying Jitter for Hyperparameter
    Tuning of Neural Networks
    Konstantin A. Maslov
    Tomsk Polytechnic University

  2. 2
    Introduction
    ● The performance of deep learning algorithms can depend significantly on the values of their
    hyperparameters
    ● In a number of recent studies, Bayesian optimization methods have gained particular
    popularity for hyperparameter tuning because, unlike other methods, they can significantly
    reduce the number of calls to the objective function
    ● An optimization algorithm should make it possible to balance exploration and exploitation.
    Bayesian optimization algorithms, however, typically have an explicit random search phase
    followed by a phase of further search for optimal values
    ● This split forces the researcher to set an additional parameter of the optimization algorithm,
    the number of random search iterations, and may also lead to suboptimal results

  3. 3
    Aim
    Modify the ordinary Bayesian optimization algorithm by introducing a time-decaying
    parameter ξ (jitter) that dynamically balances exploration and exploitation
    Objectives
    ● Design the modified algorithm
    ● Implement the ordinary and the modified algorithms
    ● Evaluate them on a number of artificial landscapes
    ● Evaluate the algorithms on a practical problem of hyperparameter tuning

  4. 4
    Algorithm 1: Ordinary Bayesian Optimization Algorithm with Constant Jitter
    Input: f: the objective, D: the search domain, ξ: the jitter parameter, N: the number of iterations
    at the first (random search) phase, M: the number of iterations at the second phase.
    Output: θ*: the best hyperparameters found.
    best_value = −∞
    repeat N times
        θ = random(D) // sample uniformly from the search domain
        value = f(θ)
        if value > best_value then
            best_value = value
            θ* = θ
        end if
    end repeat
    repeat M times
        fit a Gaussian process g(θ) approximating f(θ)
        θ+ = argmax EIξ(θ) // utilizing g(θ), maximize the EI function parameterized with ξ
        value = f(θ+)
        if value > best_value then
            best_value = value
            θ* = θ+
        end if
    end repeat
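
The slides do not spell out the EI (expected improvement) function. One commonly used form for a maximization problem is given below as an assumption, not as the author's exact definition. Here μ(θ) and σ(θ) denote the posterior mean and standard deviation of the Gaussian process g(θ), f_best corresponds to best_value, and Φ and φ are the standard normal CDF and PDF.

```latex
\mathrm{EI}_\xi(\theta) =
\begin{cases}
\bigl(\mu(\theta) - f_{\text{best}} - \xi\bigr)\,\Phi(Z) + \sigma(\theta)\,\varphi(Z), & \sigma(\theta) > 0,\\[4pt]
\max\bigl(\mu(\theta) - f_{\text{best}} - \xi,\; 0\bigr), & \sigma(\theta) = 0,
\end{cases}
\qquad
Z = \frac{\mu(\theta) - f_{\text{best}} - \xi}{\sigma(\theta)}
```

In this form, a larger ξ raises the threshold that a candidate's predicted value must clear, so points with high posterior uncertainty gain relative weight (exploration), while ξ close to 0 favors points with a high predicted mean (exploitation).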

  5. 5
    Algorithm 2: Bayesian Optimization Algorithm with Time-Decaying Jitter
    Input: f: the objective, D: the search domain, ξ': the initial jitter parameter, N: the number of
    iterations.
    Output: θ*: the best hyperparameters found.
    θ* = random(D) // sample uniformly from the search domain
    best_value = f(θ*)
    for i = 2, …, N do
        ξ = ξ' / i
        fit a Gaussian process g(θ) approximating f(θ)
        θ+ = argmax EIξ(θ) // utilizing g(θ), maximize the EI function parameterized with ξ
        value = f(θ+)
        if value > best_value then
            best_value = value
            θ* = θ+
        end if
    end for
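
The loop of Algorithm 2 can be sketched in Python roughly as follows. This is a minimal illustration under several assumptions that are not in the slides: a hand-written Gaussian process with an RBF kernel and a fixed length scale, the common closed-form EI for maximization, and an argmax over random candidate points instead of a proper inner optimizer. All function and parameter names (`bo_time_decaying_jitter`, `xi_init`, `n_candidates`, and so on) are hypothetical.

```python
import math
import numpy as np


def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between the rows of a and b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and std of a GP with constant mean and unit prior variance."""
    y_mean = y_obs.mean()
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_no = rbf_kernel(x_new, x_obs)
    mu = y_mean + k_no @ np.linalg.solve(k_oo, y_obs - y_mean)
    var = 1.0 - np.einsum("ij,ji->i", k_no, np.linalg.solve(k_oo, k_no.T))
    return mu, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mu, sigma, best_value, xi):
    """EI for maximization (assumed form); xi raises the improvement threshold."""
    z = (mu - best_value - xi) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (mu - best_value - xi) * cdf + sigma * pdf


def bo_time_decaying_jitter(f, low, high, xi_init, n_iter, n_candidates=512, seed=0):
    rng = np.random.default_rng(seed)
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)

    def to_domain(u):
        # The GP works in the unit cube; map back to the search domain D
        return low + u * (high - low)

    # Step 1-2: a single uniformly sampled starting point
    u_obs = rng.random((1, low.size))
    y_obs = np.array([f(to_domain(u_obs[0]))], dtype=float)
    for i in range(2, n_iter + 1):
        xi = xi_init / i  # time-decaying jitter
        u_cand = rng.random((n_candidates, low.size))
        mu, sigma = gp_posterior(u_obs, y_obs, u_cand)
        ei = expected_improvement(mu, sigma, y_obs.max(), xi)
        u_next = u_cand[np.argmax(ei)]  # approximate argmax of EI
        u_obs = np.vstack([u_obs, u_next])
        y_obs = np.append(y_obs, f(to_domain(u_next)))
    # Picking the best observation at the end is equivalent to tracking it in the loop
    best = int(np.argmax(y_obs))
    return to_domain(u_obs[best]), float(y_obs[best])


def neg_sphere(x):
    """Negated Sphere function, so that maximizing it minimizes Sphere."""
    return -float(np.sum(np.asarray(x) ** 2))


theta_best, best_value = bo_time_decaying_jitter(
    neg_sphere, low=[-2.0, -2.0], high=[2.0, 2.0], xi_init=0.5, n_iter=15
)
```

Because the algorithms maximize the objective while the benchmark landscapes are stated as minimization problems, the usage example negates the function. The only tuning knobs left besides the budget are the initial jitter and the iteration count, which is the reduction in parameters the algorithm aims for.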

  6. 6
    Artificial Landscapes (1)
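    Function | Definition | Global minimum (i = 1, …, d) | Search domain (i = 1, …, d)
    Sphere | $f(\mathbf{x}) = \sum_{i=1}^{d} x_i^2$ | $f(\mathbf{x}) = 0$, $x_i = 0$ | $-2 \le x_i \le 2$
    Zakharov | $f(\mathbf{x}) = \sum_{i=1}^{d} x_i^2 + \left(\sum_{i=1}^{d} \frac{i x_i}{2}\right)^2 + \left(\sum_{i=1}^{d} \frac{i x_i}{2}\right)^4$ | $f(\mathbf{x}) = 0$, $x_i = 0$ | $-5 \le x_i \le 10$
    Rosenbrock | $f(\mathbf{x}) = \sum_{i=1}^{d-1} \left[100 \left(x_{i+1} - x_i^2\right)^2 + \left(x_i - 1\right)^2\right]$ | $f(\mathbf{x}) = 0$, $x_i = 1$ | $-2.048 \le x_i \le 2.048$
    Styblinski-Tang | $f(\mathbf{x}) = \frac{1}{2} \sum_{i=1}^{d} \left(x_i^4 - 16 x_i^2 + 5 x_i\right)$ | $f(\mathbf{x}) = -39.16599d$, $x_i = -2.903534$ | $-5 \le x_i \le 5$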
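
As a quick sanity check, the four definitions above can be written as plain Python functions and evaluated at their stated global minima. This is an illustrative sketch; the function names are chosen here and are not taken from the slides.

```python
def sphere(x):
    return sum(xi**2 for xi in x)


def zakharov(x):
    # Weighted sum of i * x_i / 2 with 1-based index i
    s = sum(0.5 * i * xi for i, xi in enumerate(x, start=1))
    return sum(xi**2 for xi in x) + s**2 + s**4


def rosenbrock(x):
    return sum(
        100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1.0) ** 2
        for i in range(len(x) - 1)
    )


def styblinski_tang(x):
    return 0.5 * sum(xi**4 - 16.0 * xi**2 + 5.0 * xi for xi in x)


d = 5
minima = {
    "sphere": sphere([0.0] * d),
    "zakharov": zakharov([0.0] * d),
    "rosenbrock": rosenbrock([1.0] * d),
    "styblinski_tang": styblinski_tang([-2.903534] * d),
}
```

The first three entries evaluate to zero, and the Styblinski-Tang value agrees with −39.16599d to within about 1e−3 for d = 5.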

  7. 7
    Artificial Landscapes (2)
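    Function | Definition | Global minimum (i = 1, …, d) | Search domain (i = 1, …, d)
    Schwefel | $f(\mathbf{x}) = -\sum_{i=1}^{d} x_i \sin\left(\sqrt{\lvert x_i \rvert}\right)$ | $f(\mathbf{x}) = -418.9829d$, $x_i = 420.9687$ | $-500 \le x_i \le 500$
    Rastrigin | $f(\mathbf{x}) = 10d + \sum_{i=1}^{d} \left[x_i^2 - 10 \cos\left(2 \pi x_i\right)\right]$ | $f(\mathbf{x}) = 0$, $x_i = 0$ | $-5.12 \le x_i \le 5.12$
    Griewank | $f(\mathbf{x}) = \sum_{i=1}^{d} \frac{x_i^2}{4000} - \prod_{i=1}^{d} \cos\left(\frac{x_i}{\sqrt{i}}\right) + 1$ | $f(\mathbf{x}) = 0$, $x_i = 0$ | $-600 \le x_i \le 600$
    Ackley | $f(\mathbf{x}) = -a \exp\left(-b \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2}\right) - \exp\left(\frac{1}{d} \sum_{i=1}^{d} \cos\left(c x_i\right)\right) + a + e$ | $f(\mathbf{x}) = 0$, $x_i = 0$ | $-32.768 \le x_i \le 32.768$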
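
The multimodal landscapes above can be checked the same way. The sketch below is illustrative only; the slides do not give values for the Ackley constants a, b and c, so the widely used defaults a = 20, b = 0.2, c = 2π are assumed here.

```python
import math


def schwefel(x):
    return -sum(xi * math.sin(math.sqrt(abs(xi))) for xi in x)


def rastrigin(x):
    return 10.0 * len(x) + sum(
        xi**2 - 10.0 * math.cos(2.0 * math.pi * xi) for xi in x
    )


def griewank(x):
    quadratic = sum(xi**2 for xi in x) / 4000.0
    product = math.prod(
        math.cos(xi / math.sqrt(i)) for i, xi in enumerate(x, start=1)
    )
    return quadratic - product + 1.0


def ackley(x, a=20.0, b=0.2, c=2.0 * math.pi):
    # a, b, c are assumed defaults; the slides leave them symbolic
    d = len(x)
    root_mean_square = math.sqrt(sum(xi**2 for xi in x) / d)
    mean_cos = sum(math.cos(c * xi) for xi in x) / d
    return -a * math.exp(-b * root_mean_square) - math.exp(mean_cos) + a + math.e


d = 3
schwefel_min = schwefel([420.9687] * d)
rastrigin_min = rastrigin([0.0] * d)
griewank_min = griewank([0.0] * d)
ackley_min = ackley([0.0] * d)
```

Evaluating each function at its listed minimizer gives the tabulated minimum up to floating-point and rounding error.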

  8. 8
    [Figure: artificial landscapes, d = 2]

  9. 9
    Evaluation of Artificial Landscapes (1)
    Function | d | T | p-value | Statistically significant? (α = 0.05) | Better algorithm
    Sphere | 5 | 0 | 1.7e−6 | Yes | Algorithm 1
    Sphere | 10 | 220 | 0.80 | No | Equivalent
    Sphere | 50 | 8 | 3.9e−6 | Yes | Algorithm 2
    Zakharov | 5 | 232 | 0.99 | No | Equivalent
    Zakharov | 10 | 106 | 9.3e−3 | Yes | Algorithm 1
    Zakharov | 50 | 166 | 0.17 | No | Equivalent
    Rosenbrock | 5 | 27 | 2.4e−5 | Yes | Algorithm 2
    Rosenbrock | 10 | 175 | 0.24 | No | Equivalent
    Rosenbrock | 50 | 148 | 8.2e−2 | No | Equivalent
    Styblinski-Tang | 5 | 229 | 0.94 | No | Equivalent
    Styblinski-Tang | 10 | 81 | 1.8e−3 | Yes | Algorithm 1
    Styblinski-Tang | 50 | 180 | 0.28 | No | Equivalent

  10. 10
    Evaluation of Artificial Landscapes (2)
    Function | d | T | p-value | Statistically significant? (α = 0.05) | Better algorithm
    Schwefel | 5 | 222 | 0.83 | No | Equivalent
    Schwefel | 10 | 187 | 0.35 | No | Equivalent
    Schwefel | 50 | 209 | 0.63 | No | Equivalent
    Rastrigin | 5 | 152 | 9.8e−2 | No | Equivalent
    Rastrigin | 10 | 206 | 0.59 | No | Equivalent
    Rastrigin | 50 | 166 | 0.17 | No | Equivalent
    Griewank | 5 | 171 | 0.21 | No | Equivalent
    Griewank | 10 | 216 | 0.73 | No | Equivalent
    Griewank | 50 | 213 | 0.69 | No | Equivalent
    Ackley | 5 | 181 | 0.29 | No | Equivalent
    Ackley | 10 | 212 | 0.67 | No | Equivalent
    Ackley | 50 | 216 | 0.73 | No | Equivalent
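
The slides do not name the statistical test behind the T and p-value columns. As a hedged observation, several of the reported pairs are reproduced by the normal approximation of a two-sided Wilcoxon signed-rank test on 30 paired runs without continuity correction. Both the choice of test and the run count are inferred here from the numbers, not stated in the source, and the function name below is hypothetical.

```python
import math


def wilcoxon_two_sided_p(t_stat, n_pairs):
    """Two-sided p-value of the signed-rank statistic via the normal approximation."""
    mean = n_pairs * (n_pairs + 1) / 4.0
    std = math.sqrt(n_pairs * (n_pairs + 1) * (2 * n_pairs + 1) / 24.0)
    z = (t_stat - mean) / std
    # 2 * Phi(-|z|) expressed through the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))


N_PAIRS = 30  # assumed number of paired runs (inferred, not given in the slides)
p_sphere_5 = wilcoxon_two_sided_p(0, N_PAIRS)      # reported: 1.7e-6
p_sphere_10 = wilcoxon_two_sided_p(220, N_PAIRS)   # reported: 0.80
p_zakharov_10 = wilcoxon_two_sided_p(106, N_PAIRS) # reported: 9.3e-3
```

Under these assumptions the three computed values round to the reported ones, which suggests (but does not prove) that T is the smaller signed-rank sum over 30 runs per configuration.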

  11. 11
    U-Net
    ● A fully-convolutional network for semantic image segmentation
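    ● Applied to the segmentation of damaged Pinus sibirica trees (five classes of tree
    condition and background, C = 6)
    ● During training, the soft Jaccard coefficient was maximized:
    $J(\mathbf{T}, \mathbf{P}) = \frac{1}{C} \sum_{c=1}^{C} \frac{\sum_{i=1}^{H} \sum_{j=1}^{W} \mathrm{LS}(\mathbf{T})_{ijc} \mathbf{P}_{ijc} + \theta_{\text{loss\_s}}}{\sum_{i=1}^{H} \sum_{j=1}^{W} \left[\mathrm{LS}(\mathbf{T})_{ijc} + \left(1 - \mathrm{LS}(\mathbf{T})_{ijc}\right) \mathbf{P}_{ijc}\right] + \theta_{\text{loss\_s}}} \to \max,$
    $\mathrm{LS}(\mathbf{T}) = \left(1 - \theta_{\text{label\_s}}\right) \mathbf{T} + \frac{\theta_{\text{label\_s}}}{C}$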
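
A small pure-Python sketch of the soft Jaccard coefficient with label smoothing may help make the roles of the two smoothing coefficients concrete. It assumes that T and P are H × W × C arrays (nested lists here), that T is one-hot, and that the loss smoothing coefficient is added once to the numerator and once to the denominator of each class term. The function names are hypothetical.

```python
def smooth_labels(target, label_s):
    """LS(T) = (1 - label_s) * T + label_s / C, applied to every pixel."""
    n_classes = len(target[0][0])
    return [
        [[(1.0 - label_s) * t + label_s / n_classes for t in pixel] for pixel in row]
        for row in target
    ]


def soft_jaccard(target, pred, loss_s=1e-5, label_s=0.0):
    """Mean over classes of (sum T*P + s) / (sum [T + (1 - T) * P] + s)."""
    smoothed = smooth_labels(target, label_s)
    n_classes = len(target[0][0])
    total = 0.0
    for c in range(n_classes):
        intersection = 0.0
        union = 0.0
        for row_t, row_p in zip(smoothed, pred):
            for t, p in zip(row_t, row_p):
                intersection += t[c] * p[c]
                union += t[c] + (1.0 - t[c]) * p[c]
        total += (intersection + loss_s) / (union + loss_s)
    return total / n_classes


# A 2 x 2 image with C = 2 classes, one-hot encoded
target = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]]
flipped = [[[1.0 - v for v in pixel] for pixel in row] for row in target]

j_perfect = soft_jaccard(target, target)
j_wrong = soft_jaccard(target, flipped)
j_smoothed = soft_jaccard(target, target, label_s=0.2)
```

A perfect prediction gives a coefficient of 1, a fully wrong one gives a value close to 0, and with label smoothing even a perfect prediction scores below 1 because the smoothed targets are no longer one-hot.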

  12. 12
    Identified Hyperparameters
    # | Hyperparameter | Description | Search domain
    1 | log10 θ_lr | Logarithm of the learning rate | −8 ≤ log10 θ_lr ≤ −2
    2 | θ_d | Spatial dropout rate | 0 ≤ θ_d ≤ 0.5
    3 | log10 θ_loss_s | Logarithm of the loss smoothing coefficient | −7 ≤ log10 θ_loss_s ≤ −3
    4 | θ_label_s | Label smoothing coefficient | 0 ≤ θ_label_s ≤ 0.8
    5 | θ_z | Zoom rate for random clipping | 0.5 ≤ θ_z ≤ 1
    6 | θ_b | Brightness change rate | 0 ≤ θ_b ≤ 0.6
    7 | θ_c | Contrast change rate | 0 ≤ θ_c ≤ 1
    8 | θ_α | Elastic transformation coefficient | 30 ≤ θ_α ≤ 300
    9 | θ_σ | Elastic transformation coefficient | 3 ≤ θ_σ ≤ 20
    10 | θ_lr_d | Exponential learning rate decay rate | 0.7 ≤ θ_lr_d ≤ 1
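
One way to encode this search domain and draw a uniform sample from it, as the random(D) step of the algorithms requires, is sketched below. The dictionary keys and helper names are hypothetical; only the bounds come from the table. The learning rate and the loss smoothing coefficient are sampled on a log10 scale and converted back before use.

```python
import random

# Bounds copied from the table; key names are illustrative
SEARCH_DOMAIN = {
    "log10_lr": (-8.0, -2.0),
    "dropout": (0.0, 0.5),
    "log10_loss_s": (-7.0, -3.0),
    "label_s": (0.0, 0.8),
    "zoom": (0.5, 1.0),
    "brightness": (0.0, 0.6),
    "contrast": (0.0, 1.0),
    "elastic_alpha": (30.0, 300.0),
    "elastic_sigma": (3.0, 20.0),
    "lr_decay": (0.7, 1.0),
}


def sample_uniform(domain, rng):
    """Draw one point uniformly from the box-shaped search domain."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in domain.items()}


def to_training_config(theta):
    """Undo the log10 scaling so the values can be passed to a training run."""
    config = {k: v for k, v in theta.items() if not k.startswith("log10_")}
    config["lr"] = 10.0 ** theta["log10_lr"]
    config["loss_s"] = 10.0 ** theta["log10_loss_s"]
    return config


theta = sample_uniform(SEARCH_DOMAIN, random.Random(0))
config = to_training_config(theta)
```

Sampling the two log-scaled parameters uniformly in the exponent spreads the draws evenly across orders of magnitude, which is the usual reason for searching over log10 of a learning rate.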

  13. 13
    Evaluation on Semantic Image Segmentation Problem
    ● Both algorithms are suitable for the addressed hyperparameter tuning problem
    ● Although Algorithm 2 with ξ' = 0.5 produced the best result (J = 0.6412), it is not possible
    to draw any conclusions about the statistical significance of the obtained results

  14. 14
    Segmentation Results
    J_c (Background) | J_c (Living) | J_c (Dying) | J_c (Recently dead) | J_c (Long dead) | J
    0.8561 | 0.7509 | 0.4072 | 0.7808 | 0.6919 | 0.6974
    [Figure panels: Test area, Ground truth, Output]
    $J_c = \frac{TP_c}{TP_c + FP_c + FN_c}$
    $J = \frac{1}{C} \sum_{c=1}^{C} J_c$
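
The two formulas are easy to check numerically. In the sketch below the pixel counts passed to `class_jaccard` are made up for illustration, while the per-class values are the ones listed on the slide. The helper names are hypothetical.

```python
def class_jaccard(tp, fp, fn):
    """Jaccard coefficient of one class from its pixel counts."""
    return tp / (tp + fp + fn)


def mean_jaccard(per_class):
    """Unweighted mean of the per-class Jaccard coefficients."""
    return sum(per_class) / len(per_class)


# Illustrative counts only: 8 true positives, 1 false positive, 1 false negative
example_jc = class_jaccard(tp=8, fp=1, fn=1)

# Per-class values as listed on the slide
reported_jc = {
    "Background": 0.8561,
    "Living": 0.7509,
    "Dying": 0.4072,
    "Recently dead": 0.7808,
    "Long dead": 0.6919,
}
mean_of_listed = mean_jaccard(list(reported_jc.values()))
```

The unweighted mean of the five listed per-class values rounds to 0.6974, which matches the J reported in the last column.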
     

  15. 15
    Conclusion
    ● This study proposed a Bayesian optimization algorithm with time-decaying jitter for finding the
    optimal hyperparameters of neural networks
    ● The proposed algorithm has fewer parameters than the ordinary one
    ● On some landscapes (Sphere, d = 50; Rosenbrock, d = 5) the proposed algorithm consistently
    performs better than the ordinary one, on others (Sphere, d = 5; Zakharov, d = 10;
    Styblinski-Tang, d = 10) it performs worse, and for most of the artificial landscapes no
    statistically significant difference was found
    ● The proposed algorithm showed comparable performance when tuning the hyperparameters
    of a neural network for semantic image segmentation, but no claims can be made about
    the non-randomness of these results (not enough experiments)
    ● ‘While not a silver bullet, the proposed algorithm seems to be a good addition to the toolbox of
    a data scientist’ (one of the reviewers)

  16. 16
    Thank You!
    Follow TMPA on Facebook
    TMPA-2021 Conference
