
Bayesian Optimization with TensorFlow/Keras by Keisuke Kamataki - TMLS #2


Keisuke talked about hyperparameter tuning issues in machine learning, focusing mainly on Bayesian Optimization techniques.


Transcript

  1. INTRODUCTION
    A general prediction problem: $y = f(x; \vec{\theta})$
    A general optimization problem: $\vec{\theta} = \arg\min_{\vec{\theta}} \left[ L(X, Y; \vec{\theta}) + \Omega(\vec{\theta}) \right]$
  2. INTRODUCTION
    A prediction problem with hyperparameters: $y = f(x; \vec{\theta}, \vec{\theta}_H)$
    A parameter optimization problem with hyperparameters: $\vec{\theta} = \arg\min_{\vec{\theta}} \left[ L(X, Y; \vec{\theta}; \vec{\theta}_H) + \Omega(\vec{\theta}; \vec{\theta}_H) \right]$
    $\vec{\theta}_H$ affects how the parameters $\vec{\theta}$ are learned.
  3. INTRODUCTION
    A brief clarification on terminology:
    • $\vec{\theta}$: Parameters. Control how algorithms make prediction decisions, together with the data. Automatically learned during the training phase.
    • $\vec{\theta}_H$: Hyperparameters. Affect how algorithms learn the parameters. Machine learning algorithms usually don't learn them automatically.
  4. A NON-COMPREHENSIVE LIST OF HYPER PARAMETERS
    • Bayesian Models: parameters of prior distributions, …
    • Random Forest: number of trees, number of features for node splits, …
    • Support Vector Machines: choice of kernel (and its parameters), soft-margin parameter, …
    • (Deep) Neural Networks: learning rate, number of hidden units/layers, dropout rate, weight decay, activation functions, …
  5. HOW THEY AFFECT
    Support Vector Machines (with RBF kernel on the iris dataset)
    $K(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2)$
    $\min_{w} \ \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{n} \max\left(0,\ 1 - y^{(i)}(w^T x^{(i)} + b)\right)$
    http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
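To make the effect concrete, here is a minimal scikit-learn sketch (not the slide's exact script) that sweeps C and gamma for an RBF SVM on iris; the value grids are illustrative choices:

```python
# Sweep C and gamma for an RBF-kernel SVM on iris (illustrative values).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

for C in (0.1, 1.0, 10.0):
    for gamma in (0.01, 0.1, 1.0):
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        acc = cross_val_score(clf, X, y, cv=5).mean()  # accuracy shifts with (C, gamma)
        print("C=%-5s gamma=%-5s accuracy=%.3f" % (C, gamma, acc))
```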
  6. HOW THEY AFFECT
    Bayesian Models (Latent Dirichlet Allocation on patent abstracts)
    http://dirichlet.net/pdf/wallach09rethinking.pdf
    https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
  7. HOW THEY AFFECT
    Deep Neural Nets (2 different cases)
    • Dropout rate (TIMIT speech recognition benchmark): https://arxiv.org/pdf/1207.0580.pdf
    • Long Short Term Memory (speech, handwriting, music data): https://arxiv.org/pdf/1503.04069.pdf
  8. HOW CAN WE OPTIMIZE
    Hyperparameter optimization is difficult. Top Kagglers say… (interviews from Kaggle's "no free hunch" blog, http://blog.kaggle.com)
    • "I do this very manually." (Mr. Marios Michailidis)
    • "I mostly tune by hand." (Mr. Owen Zhang)
    • "I often use brute force searching manually based on my past experience." (Mr. Gilberto Titericz)
    • "I look for params used in previous competitions and tweak them one by one until the outcome stops improving." (Mr. Lucas Eustaquio Gomes da Silva)
  9. A LITTLE PHILOSOPHY DEBATE
    Hyperparameter optimization, what is it after all? To be fair, these masters also say…
    • "I feel I learn more about the algorithms and why they work the way they do by doing this manually. At the same time I feel that 'I do something, it's not only the machine!'" (Mr. Marios Michailidis)
    • "Sometimes I think I should eventually be more disciplined and systematic, using some Bayesian optimizer. However that would take some fun away from the process, I think." (Mr. Owen Zhang)
  10. ANYWAY, IN PRACTICE
    • Complex algorithms: many hyperparameters to tune
    • Large datasets: evaluation becomes time consuming
    It is therefore important to find good hyperparameters efficiently.
  11. HOW TO FIND GOOD HYPER PARAMETERS
    The goal: $\max_{\vec{\theta}_H^*} F(f(x; \vec{\theta}); \vec{\theta}_H^*)$
    [Figure: a 2D search space over $\theta_{H(1)}$ and $\theta_{H(2)}$, each on $[0, 1]$.]
  12. HOW TO FIND GOOD HYPER PARAMETERS
    If the optimum sits at the point $(x, y)$, then $\vec{\theta}_H^* = \{ \theta_{H(1)} = x,\ \theta_{H(2)} = y \}$.
    [Figure: the same 2D search space, with $\vec{\theta}_H^*$ marked at $(x, y)$.]
  13. HOW TO FIND GOOD HYPER PARAMETERS
    But in practice we have no idea where $\vec{\theta}_H^*$ lies. ???
    [Figure: the same 2D search space, with the location of $\vec{\theta}_H^*$ unknown.]
  14. COMMON SITUATIONS
    We don't know the analytic form of $F(f(x; \vec{\theta}); \vec{\theta}_H)$: F is a black box.
    • No gradient information with respect to $\vec{\theta}_H$
    • No dependency/correlation information among the elements of $\vec{\theta}_H$
    • Unknown underlying distribution over F (maybe non-convex)
    • F can be seen as a function to be optimized, but it could be noisy and expensive to evaluate
  15. WELL KNOWN APPROACHES
    • Grid Search: easy to try, but some crucial drawbacks
    • Random Search: often leads to better results than grid search
    • Bayesian Optimization: Gaussian Process based, Random Forest based, Deep Neural Net based, Tree Parzen Estimators based
  16. BASIC APPROACHES
    Grid Search
    • Easy to try, but trying k different values for each of n hyperparameters costs $O(k^n)$ evaluations of F
    [Table: a 5×5 grid evaluating $F(f(\vec{\theta}, x))$ at every combination of one hyperparameter in $\{0, 2.5, 5, 7.5, 10\}$ and another in $\{0, 25, 50, 75, 100\}$.]
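A minimal Python sketch of the idea, with a toy objective standing in for the expensive F; the $k^n$ blow-up is explicit in the itertools.product loop:

```python
# Exhaustive grid search: k values per hyperparameter, n hyperparameters,
# hence k**n evaluations of the expensive objective F.
import itertools

def grid_search(objective, grid):
    """grid maps hyperparameter name -> list of k candidate values."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[n] for n in names)):  # k**n combos
        params = dict(zip(names, values))
        score = objective(params)  # each call may mean a full training run
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Illustrative 5x5 grid in the spirit of the slide's table, with a toy objective:
toy = lambda p: -(p["a"] - 5) ** 2 - (p["b"] - 50) ** 2
print(grid_search(toy, {"a": [0, 2.5, 5, 7.5, 10], "b": [0, 25, 50, 75, 100]}))
```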
  17. BASIC APPROACHES
    Random Search (Bergstra et al., 2012) http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
    • Can exploit important hyperparameters more efficiently than a grid based approach
    • For a finite search space, the maximum of 60 random observations lies within the top 5% of the true maximum with 95% probability*, since $1 - (1 - 0.05)^n > 0.95$ for $n \geq 60$
    * http://blog.dato.com/how-to-evaluate-machine-learning-models-part-4-hyperparameter-tuning
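The 60-draw claim is easy to verify numerically, and random search itself takes only a few lines; the 2D space below is the same illustrative one as in the grid sketch above:

```python
# Check the slide's claim: with n independent random draws,
# P(at least one draw lands in the top 5% of the space) = 1 - (1 - 0.05)**n.
n = 60
print(1 - (1 - 0.05) ** n)  # ~0.954, i.e. > 0.95

# Random search over the same illustrative 2D space as the grid sketch:
import random

def random_search(objective, n_iter=60):
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {"a": random.uniform(0, 10), "b": random.uniform(0, 100)}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```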
  18. BAYESIAN OPTIMIZATION
    Gaussian Process (Snoek et al., 2012) https://arxiv.org/abs/1206.2944
    • Approximates F with a multivariate Gaussian distribution
    [Figures: GP learning sequence; detailed 1D illustration]
  19. BAYESIAN OPTIMIZATION
    Gaussian Process
    • Advantage: allows us to have a rich distribution, and acquisition functions can be defined as we want :)
    • Drawback: inference time grows as $O(n^3)$ in the number n of observed data points, due to the covariance estimation
    [Figures: left, an acquisition function; right, an acquisition function for parallelism]
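A minimal GP-based loop, sketched with scikit-learn's GaussianProcessRegressor and an expected-improvement acquisition (Spearmint and the paper use more refined kernels and acquisition handling; everything here is an illustrative assumption):

```python
# GP surrogate + expected improvement, 1D sketch of the Bayesian optimization loop.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(mu, sigma, best):
    z = (mu - best) / np.maximum(sigma, 1e-9)
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def gp_maximize(f, low, high, n_init=3, n_iter=20):
    X = np.random.uniform(low, high, size=(n_init, 1))
    y = np.array([f(x[0]) for x in X])
    for _ in range(n_iter):
        gp = GaussianProcessRegressor().fit(X, y)          # O(n^3) per fit
        cand = np.random.uniform(low, high, size=(1000, 1))
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(expected_improvement(mu, sigma, y.max()))]
        X = np.vstack([X, [x_next]])                       # observe the new point
        y = np.append(y, f(x_next[0]))
    return X[np.argmax(y)], y.max()
```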
  20. BAYESIAN OPTIMIZATION
    Tree-structured Parzen Estimators (Bergstra et al., 2011)
    • Approximates F using a density ratio of Gaussian Mixture Models
    • The acquisition function is proportional to $\ell(x) / g(x)$, where $\ell(x)$ and $g(x)$ are the densities of configurations whose observed loss y fell below and above the threshold $y^*$, a fixed quantile of the losses observed so far
    https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
    http://jmlr.org/proceedings/papers/v28/bergstra13.pdf
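In practice TPE is most easily tried through Hyperopt (listed on the libraries slide later in the talk); a minimal sketch, where the objective and the search space are placeholders:

```python
# TPE via Hyperopt: fmin samples configurations from the l(x)/g(x)-driven
# proposal and minimizes the returned loss.
from hyperopt import fmin, tpe, hp

space = {
    "lr": hp.loguniform("lr", -10, 0),        # learning rate
    "dropout": hp.uniform("dropout", 0.0, 0.8),
}

def objective(params):
    # Placeholder: train a model with params and return its validation loss.
    return (params["dropout"] - 0.5) ** 2

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100)
print(best)
```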
  21. BAYESIAN OPTIMIZATION
    Empirical performance compared with the Gaussian Process approach (see Eggensperger et al., 2013 in the references)
    • Drawback: poor at optimizing dependent hyperparameters
    • Advantage: usually faster than a Gaussian Process
  22. BAYESIAN OPTIMIZATION
    Random Forest
    • The approximated F still corresponds to a Gaussian, whose $\mu$ and $\sigma^2$ correspond to the empirical mean and variance of the trees' outputs
    • Works particularly well when $\vec{\theta}_H$ is complex and contains categorical values (Hutter et al., 2013)
    http://aad.informatik.uni-freiburg.de/~hutter/ML3.pdf
    http://www.cs.ubc.ca/~hutter/papers/13-BayesOpt_EmpiricalFoundation.pdf
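The surrogate itself is simple to sketch with scikit-learn (SMAC's actual random-forest model differs in details): $\mu$ and $\sigma^2$ at a candidate point are just the empirical statistics over the individual trees, shown here on synthetic placeholder data:

```python
# Per-tree empirical mean and variance as the surrogate's mu and sigma^2
# (scikit-learn stand-in for SMAC's random-forest model; data is synthetic).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_obs = rng.rand(50, 2)                        # hyperparameter settings tried so far
y_obs = np.sin(3 * X_obs[:, 0]) + X_obs[:, 1]  # their observed objective values
X_cand = rng.rand(10, 2)                       # candidate settings to score next

rf = RandomForestRegressor(n_estimators=100).fit(X_obs, y_obs)
per_tree = np.array([tree.predict(X_cand) for tree in rf.estimators_])
mu = per_tree.mean(axis=0)      # surrogate mean at each candidate
sigma2 = per_tree.var(axis=0)   # surrogate variance, fed into the acquisition
```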
  23. BAYESIAN OPTIMIZATION
    Deep Neural Net (Snoek et al., 2015) https://arxiv.org/abs/1502.05700
    • Overcomes the scalability issue of the Gaussian Process based approach
    • Approximates F with an adaptive basis regression over Deep Neural Networks
  24. BAYESIAN OPTIMIZATION
    Deep Neural Net
    • So, the authors worked on optimizing the hyperparameter optimizer :)
    • The adaptive basis regression employs a Bayesian linear regressor on top of the network: the predictive mean and variance are computed from basis functions $\phi(x)$ given by the network's output values (see the paper for the exact equations)
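In isolation, that final Bayesian linear regression step looks like the following numpy sketch; alpha, beta, and the feature matrices are assumptions for illustration, and in the paper $\Phi$ would come from the trained network's last layer rather than arbitrary features:

```python
# Bayesian linear regression on basis features phi(x), the cheap final stage
# of the DNGO-style model. alpha = prior precision, beta = noise precision
# (both illustrative); Phi matrices are assumed given.
import numpy as np

def bayes_linreg_predict(Phi_train, y_train, Phi_test, alpha=1.0, beta=100.0):
    d = Phi_train.shape[1]
    S = np.linalg.inv(alpha * np.eye(d) + beta * Phi_train.T @ Phi_train)  # posterior cov
    m = beta * S @ Phi_train.T @ y_train                                   # posterior mean
    mean = Phi_test @ m                                                    # predictive mean
    var = 1.0 / beta + np.einsum("nd,dk,nk->n", Phi_test, S, Phi_test)     # predictive var
    return mean, var  # cost is cubic in the basis size d, not in the number of observations
```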
  25. BAYESIAN OPTIMIZATION
    Deep Neural Net
    • Experiments in general settings (HPOlib benchmarks)
    • A more challenging task: image caption generation
  26. BAYESIAN OPTIMIZATION BEYOND HYPER-PARAMETER TUNING
    Assessing hyperparameter importance (Hutter et al., 2014)
    • Quantifies the importance of each single hyperparameter and the interactions between them
    Enhancing deep generative models (Tran et al., 2016)
    • Enriches the posterior distributions of generative models with the Variational Gaussian Process
    More info on black box methods
    • Check http://www.blackboxworkshop.org/
  27. WHAT WAS NOT COVERED IN THIS TALK
    • Parallelized/distributed optimization
    • Simulated gradient based approaches: hyper gradients, etc.
    • Other gradient free approaches: genetic algorithms, simulated annealing, the Nelder-Mead method, etc.
  28. EXISTING LIBRARIES (Implements / Programming Interface / Licence)
    • Spearmint: Gaussian Process / Python / Academic and non-commercial research
    • BayesOpt: Gaussian Process / Java, Matlab / GPL
    • MOE: Gaussian Process / Python, C++, REST / Apache 2.0
    • Hyperopt: Tree Parzen Estimators, Random Search / Python / BSD
    • SMAC: Random Forest / Java / AGPLv3
    • Hyper Gradients: Hyper Gradient / Python / MIT
    And so on…
  29. DEMO
    Spearmint integration with Keras and TensorFlow. Basically, we just need to prepare 2 files:
    • A config file to specify the hyperparameters to optimize
    • A main function file to calculate the objective value
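A hedged sketch of what those two files can look like with HIPS Spearmint (the file names, hyperparameter ranges, model, and dataset are illustrative assumptions, not the talk's exact demo code):

```python
# config.json (shown here as a comment; Spearmint reads it as JSON):
# {
#   "language": "PYTHON",
#   "main-file": "train.py",
#   "experiment-name": "keras-demo",
#   "variables": {
#     "lr":      {"type": "FLOAT", "size": 1, "min": 0.0001, "max": 0.1},
#     "dropout": {"type": "FLOAT", "size": 1, "min": 0.0,    "max": 0.8}
#   }
# }

# train.py -- Spearmint imports this module and calls main(job_id, params),
# where params maps each variable name to an array of sampled values.
from keras.datasets import mnist
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.optimizers import SGD
from keras.utils import to_categorical

(x_tr, y_tr), (x_va, y_va) = mnist.load_data()
x_tr, x_va = x_tr.reshape(-1, 784) / 255.0, x_va.reshape(-1, 784) / 255.0
y_tr, y_va = to_categorical(y_tr, 10), to_categorical(y_va, 10)

def main(job_id, params):
    lr, dropout = float(params["lr"][0]), float(params["dropout"][0])
    model = Sequential([
        Dense(128, activation="relu", input_dim=784),
        Dropout(dropout),
        Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=SGD(lr=lr), loss="categorical_crossentropy")
    model.fit(x_tr, y_tr, epochs=3, batch_size=128, verbose=0)
    return float(model.evaluate(x_va, y_va, verbose=0))  # Spearmint minimizes this
```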
  30. REFERENCES
    Papers
    • Hanna Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems 22, pages 1973–1981, 2009.
    • G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
    • K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.
    • James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
    • E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. TR-2009-23, UBC, 2009.
    • Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, pages 2951–2959, 2012.
    • James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24, 2011.
    • J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of ICML, 2013.
  31. REFERENCES
    Papers
    • J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. M. A. Patwary, Prabhat, and R. P. Adams. Scalable Bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
    • K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, and K. Leyton-Brown. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization in Theory and Practice, 2013.
    • Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st International Conference on Machine Learning, pages 754–762, 2014.
    • D. Tran, R. Ranganath, and D. M. Blei. The variational Gaussian process. In International Conference on Learning Representations (ICLR), 2016.
  32. REFERENCES
    Web Sites
    • RBF SVM Parameters: http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
    • no free hunch: http://blog.kaggle.com
    • How to Evaluate Machine Learning Models: Hyperparameter Tuning: http://blog.dato.com/how-to-evaluate-machine-learning-models-part-4-hyperparameter-tuning
    • Optimization of Machine Learning Hyperparameters: http://aad.informatik.uni-freiburg.de/~hutter/ML3.pdf
    Books
    • Bishop, C. M. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.