Bayesian Optimization with TensorFlow/Keras by Keisuke Kamataki - TMLS #2

Keisuke talked about hyperparameter tuning issues in machine learning, mainly focusing on Bayesian optimization techniques.

May 27, 2016

Transcript

4. INTRODUCTION

A general prediction problem: $y = f(x; \vec{\theta})$

A general optimization problem: $\vec{\theta} = \arg\min_{\vec{\theta}} \big[ L(X, Y; \vec{\theta}) + \Omega(\vec{\theta}) \big]$

5. INTRODUCTION

A prediction problem with hyperparameters: $y = f(x; \vec{\theta}, \vec{\theta}_H)$

A parameter optimization problem with hyperparameters ($\vec{\theta}_H$ does affect how parameters are learned): $\vec{\theta} = \arg\min_{\vec{\theta}} \big[ L(X, Y; \vec{\theta}, \vec{\theta}_H) + \Omega(\vec{\theta}, \vec{\theta}_H) \big]$

6. INTRODUCTION

A brief clarification on terminology:

• $\vec{\theta}$: Parameters. Control how algorithms make prediction decisions, along with the data. Automatically learned during the training phase.
• $\vec{\theta}_H$: Hyperparameters. Affect how algorithms learn the parameters. Machine learning algorithms usually don't learn them automatically.

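The distinction can be made concrete with a toy fit (a hypothetical sketch, not from the talk): in the gradient-descent loop below, `theta` is a parameter learned from data, while the learning rate `lr` and regularization strength `reg` are hyperparameters fixed before training.

```python
def fit_linear(data, lr=0.1, reg=0.01, epochs=200):
    """Fit y ~ theta * x by gradient descent on squared error + L2 penalty.

    theta is a parameter: learned automatically from the data.
    lr and reg are hyperparameters: fixed before training, yet they
    control how (and how well) theta is learned.
    """
    theta = 0.0
    for _ in range(epochs):
        grad = sum(2 * (theta * x - y) * x for x, y in data) / len(data)
        grad += 2 * reg * theta  # the regularizer Omega contributes too
        theta -= lr * grad
    return theta

data = [(x, 2.0 * x) for x in [0.0, 1.0, 2.0, 3.0]]
theta = fit_linear(data)              # converges close to the true slope 2.0
bad_theta = fit_linear(data, lr=1.0)  # a bad hyperparameter choice diverges
```

Same data, same algorithm: only the hyperparameter changed, and the learned parameter is completely different.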
7. A NON-COMPREHENSIVE LIST OF HYPER PARAMETERS

• Bayesian Models: parameters of prior distributions, …
• Random Forest: number of trees, number of features for node split, …
• Support Vector Machines: selection of kernel (and its parameters), soft-margin parameter, …
• (Deep) Neural Networks: learning rate, number of hidden units/layers, dropout rate, weight decay, activation functions, …

8. HOW THEY AFFECT

Support Vector Machines (with RBF kernel on the iris dataset)
http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

$K(x_i, x_j) = \exp(-\gamma \, \lVert x_i - x_j \rVert^2)$

$\min_{w} \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \max\!\big(0,\, 1 - y^{(i)}(w^T x^{(i)} + b)\big)$

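For reference, the RBF kernel above is a one-liner (a minimal sketch; here $\gamma$ is the kernel hyperparameter, while $C$ is the soft-margin hyperparameter in the objective):

```python
import math

def rbf_kernel(xi, xj, gamma=0.5):
    # K(xi, xj) = exp(-gamma * ||xi - xj||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)

# larger gamma -> narrower kernel -> a more flexible (and overfit-prone) boundary
k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical points give 1.0
k_far  = rbf_kernel([1.0, 2.0], [4.0, 6.0])   # decays with squared distance
```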
9. HOW THEY AFFECT

Bayesian Models (Latent Dirichlet Allocation on patent abstracts)
(See Wallach et al., "Rethinking LDA: Why priors matter", in the references, and https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

10. HOW THEY AFFECT

Deep Neural Nets (2 different cases)
• Dropout rate (TIMIT speech recognition benchmark); see Hinton et al., 2012, in the references
• Long Short-Term Memory (speech, handwriting, music data); see Greff et al., 2015, in the references

11. HOW CAN WE OPTIMIZE

Hyperparameter optimization is difficult. Top Kagglers say…

• "I do this very manually." (Mr. Marios Michailidis)
• "I mostly tune by hand." (Mr. Owen Zhang)
• "I often use brute force searching manually based on my past experience." (Mr. Gilberto Titericz)
• "I look for params used in previous competitions and tweak them one by one until the outcome stops improving." (Mr. Lucas Eustaquio Gomes da Silva)

(Quotes from the Kaggle "no free hunch" blog profiles: http://blog.kaggle.com)

12. A LITTLE PHILOSOPHY DEBATE

Hyperparameter optimization: what is it, after all? To be fair, these masters also say…

• "I feel I learn more about the algorithms and why they work the way they do by doing this manually. At the same time I feel that I do something, it's not only the machine!" (Mr. Marios Michailidis)
• "Sometimes I think I should eventually be more disciplined and systematic, using some Bayesian optimizer. However that would take some fun away from the process, I think." (Mr. Owen Zhang)

13. ANYWAY IN PRACTICE

Complex algorithms and large datasets mean:
• Many hyperparameters to tune
• Evaluation becomes time consuming

So it is important to find good hyperparameters efficiently.

15. HOW TO FIND GOOD HYPER PARAMETERS

(Figure: a 2-D hyperparameter space over $\theta_{H(1)}$ and $\theta_{H(2)}$, each on $[0, 1]$.)

$\vec{\theta}_{H^*} = \arg\max_{\vec{\theta}_H} F\big(f(x; \vec{\theta});\, \vec{\theta}_H\big)$

16. HOW TO FIND GOOD HYPER PARAMETERS

(Figure: the same 2-D space, with the optimum $\vec{\theta}_{H^*}$ marked at a point $(x, y)$.)

$\vec{\theta}_{H^*} = \{\theta_{H(1)} = x,\ \theta_{H(2)} = y\}$

17. HOW TO FIND GOOD HYPER PARAMETERS

(Figure: the same 2-D space, but the location of $\vec{\theta}_{H^*}$ is unknown: ???)

18. COMMON SITUATIONS

We don't know the analytic form of $F\big(f(x; \vec{\theta});\, \vec{\theta}_H\big)$:
• No dependency/correlation information between the elements of $\vec{\theta}_H$
• No gradient information for $\vec{\theta}_H$
• Unknown (maybe non-convex) underlying distribution over $F$

$F$ can be seen as a black-box function to be optimized:
• $F$ could be noisy and expensive to evaluate

19. WELL KNOWN APPROACHES

• Grid Search: easy to try, but some crucial drawbacks
• Random Search: often leads to better results than grid search
• Bayesian Optimization: Gaussian Process based, Random Forest based, Deep Neural Net based, Tree Parzen Estimators based

20. BASIC APPROACHES

Grid Search
• Easy to try, but the complexity of trying $k$ different values for each of $n$ hyperparameters is $O(k^n)$

(Table: a 5×5 grid of evaluations $F(f(\theta, x)_{\{a, b\}})$ over one hyperparameter $a \in \{0, 2.5, 5, 7.5, 10\}$ and another $b \in \{0, 25, 50, 75, 100\}$.)

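A grid search like the table above can be sketched in a few lines (a toy quadratic objective stands in for $F$; with $k$ values per axis and $n$ axes, the product loop makes $k^n$ evaluations):

```python
from itertools import product

def toy_objective(a, b):
    # stand-in for F(f(x; theta); theta_H): pretend the best score
    # sits at a = 5, b = 50
    return -((a - 5.0) ** 2 + ((b - 50.0) / 10.0) ** 2)

a_grid = [0.0, 2.5, 5.0, 7.5, 10.0]
b_grid = [0.0, 25.0, 50.0, 75.0, 100.0]

# k^n evaluations: here 5^2 = 25
scores = {(a, b): toy_objective(a, b) for a, b in product(a_grid, b_grid)}
best = max(scores, key=scores.get)   # (5.0, 50.0) on this toy surface
```

With 10 hyperparameters and 5 values each, the same loop would need $5^{10} \approx 10^7$ model evaluations, which is why grid search breaks down quickly.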
21. BASIC APPROACHES

Random Search (Bergstra et al., 2012)
• Can exploit important hyperparameters more efficiently than the grid-based approach
• For a finite search space, the maximum performance out of 60 random observations lies within the top 5% of the true maximum with 95% probability*: $1 - (1 - 0.05)^n > 0.95$

*See "How to Evaluate Machine Learning Models: Hyperparameter Tuning" (http://blog.dato.com) in the references.

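The 60-observations claim follows from the bound on the slide: each independent random draw misses the top-5% region with probability $1 - 0.05$, so all $n$ draws miss it with probability $(1 - 0.05)^n$. A quick check of the arithmetic (a sketch, not from the talk):

```python
import math

def hit_probability(n, top=0.05):
    # P(at least one of n random draws lands in the top-5% region)
    return 1.0 - (1.0 - top) ** n

# smallest n with 1 - (1 - 0.05)^n > 0.95
n_min = math.ceil(math.log(0.05) / math.log(0.95))  # 59

p60 = hit_probability(60)  # about 0.954, hence the "60 observations" rule
```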
22. BAYESIAN OPTIMIZATION

Gaussian Process (Snoek et al., 2012)
• Approximates $F$ with a multivariate Gaussian distribution

(Figures: the GP learning sequence; a detailed 1-D illustration.)

23. BAYESIAN OPTIMIZATION

Gaussian Process
• Advantage: allows us to have a rich distribution, and acquisition functions can be defined as we want :)
• Drawback: inference time grows with $O(n^3)$ complexity for $n$ observed data points, due to the covariance estimation

(Figures: left, an acquisition function; right, an acquisition function for parallelism.)

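As an example of "acquisition functions can be defined as we want", here is the common expected-improvement acquisition computed from a GP's predictive mean and standard deviation at a candidate point (a minimal sketch for minimization; `mu`, `sigma`, and `best_so_far` are assumed to come from a fitted GP, which is not shown):

```python
from statistics import NormalDist

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    """EI for minimization: expected improvement over the best observed
    loss, given the GP's predictive N(mu, sigma^2) at this point."""
    if sigma <= 0.0:
        return 0.0
    nd = NormalDist()
    z = (best_so_far - mu - xi) / sigma
    return (best_so_far - mu - xi) * nd.cdf(z) + sigma * nd.pdf(z)

# a lower predicted loss raises the acquisition value...
ei_promising = expected_improvement(mu=0.1, sigma=0.3, best_so_far=0.5)
ei_poor      = expected_improvement(mu=0.6, sigma=0.3, best_so_far=0.5)
```

Note that `ei_poor` is still positive: even a point with a poor predicted mean keeps some value through its uncertainty, which is how the acquisition trades off exploitation against exploration.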
24. BAYESIAN OPTIMIZATION

Tree of Parzen Estimators (Bergstra et al., 2011)
• Approximates $F$ using the density ratio of Gaussian mixture models: $\ell(x)$ and $g(x)$ are the densities of $x$ according to whether the loss $y$ falls below or above $y^*$
• The acquisition function is based on $\ell(x)/g(x)$, where $y^*$ is a fixed quantile of the losses observed so far

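The density-ratio idea can be sketched in pure Python (a toy illustration with hypothetical names, not the actual hyperopt implementation): split the observations at the quantile $y^*$, model each side with a Parzen estimator (a Gaussian mixture with one component per observed point), and propose the candidate maximizing $\ell(x)/g(x)$.

```python
from statistics import NormalDist

def parzen_density(points, x, bandwidth=0.5):
    # Parzen estimator: a Gaussian mixture with one component per observation
    if not points:
        return 1e-12
    return sum(NormalDist(p, bandwidth).pdf(x) for p in points) / len(points)

def tpe_suggest(history, candidates, gamma=0.25):
    """history: list of (x, loss). Split at y*, the gamma-quantile of losses;
    l(x) models the good side, g(x) the bad side; propose argmax l(x)/g(x)."""
    losses = sorted(loss for _, loss in history)
    y_star = losses[max(0, int(gamma * len(losses)) - 1)]
    good = [x for x, loss in history if loss <= y_star]
    bad = [x for x, loss in history if loss > y_star]
    return max(candidates,
               key=lambda x: parzen_density(good, x) /
                             (parzen_density(bad, x) + 1e-12))

# past evaluations: the low losses cluster near x = 2
history = [(2.0, 0.10), (2.1, 0.12), (0.0, 0.90), (4.0, 0.80),
           (5.0, 1.00), (6.0, 1.20), (1.9, 0.15), (7.0, 1.50)]
proposal = tpe_suggest(history, candidates=[float(c) for c in range(8)])
```

Because the trees of densities factorize per hyperparameter, this is cheap, but it also explains the drawback on the next slide: dependencies between hyperparameters are not modeled.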
25. BAYESIAN OPTIMIZATION

Empirical performance compared with the Gaussian Process
• Advantage: usually faster than the Gaussian Process
• Drawback: poor at optimizing dependent hyperparameters

26. BAYESIAN OPTIMIZATION

Random Forest
• Approximates $F\big(f(x; \vec{\theta});\, \vec{\theta}_H\big)$ with random forest regression trees

27. BAYESIAN OPTIMIZATION

Random Forest
• The approximated $F$ still corresponds to a Gaussian, whose $\mu$ and $\sigma^2$ correspond to the empirical output of the trees
• Works particularly well when $\vec{\theta}_H$ is complex and consists of categorical values (Hutter et al., 2013)

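The $\mu$/$\sigma^2$ idea can be illustrated with a tiny bagged ensemble of depth-1 regression stumps (a toy stand-in with hypothetical names, not SMAC's actual forest): each tree votes a prediction, and the surrogate's Gaussian at a point takes the empirical mean and variance across the trees.

```python
import random
from statistics import mean, pvariance

def fit_stump(sample):
    """Depth-1 regression tree: pick the split minimizing squared error."""
    xs = sorted({x for x, _ in sample})
    if len(xs) < 2:                      # degenerate bootstrap sample
        c = mean(y for _, y in sample)
        return lambda x: c
    best = None
    for s in xs[1:]:
        left = [y for x, y in sample if x < s]
        right = [y for x, y in sample if x >= s]
        ml, mr = mean(left), mean(right)
        err = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, s, ml, mr)
    _, s, ml, mr = best
    return lambda x: ml if x < s else mr

def fit_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    # bagging: each tree sees a bootstrap resample of the observations
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def forest_posterior(forest, x):
    # the surrogate's Gaussian at x: empirical mean/variance of tree outputs
    preds = [tree(x) for tree in forest]
    return mean(preds), pvariance(preds)

# toy objective: a step at x = 5
data = [(float(x), 0.0 if x < 5 else 1.0) for x in range(10)]
forest = fit_forest(data)
mu_low, var_low = forest_posterior(forest, 1.0)    # mu near 0
mu_high, var_high = forest_posterior(forest, 9.0)  # mu near 1
```

Unlike a GP, nothing here needs a kernel over the inputs, which is one intuition for why tree-based surrogates cope well with categorical hyperparameters.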
28. BAYESIAN OPTIMIZATION

Deep Neural Net (Snoek et al., 2015)
• Overcomes the scalability issue of the Gaussian Process based approach
• Approximates $F$ with an adaptive basis regression function over deep neural networks

29. BAYESIAN OPTIMIZATION

Deep Neural Net
• So, the authors worked on optimizing the hyperparameter optimizer :)
• The adaptive basis regression function employs a Bayesian linear regressor, whose predictive mean and variance are computed with the network output values as the basis

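The Bayesian linear regressor on top of the network can be sketched in one dimension (a toy illustration with assumed names: `phi` stands for the network's last-layer output for an input, `alpha` is the weight-prior precision, `beta` the noise precision; the paper uses the multivariate version over many basis outputs):

```python
def bayes_linreg_posterior(phis, ys, alpha=1.0, beta=25.0):
    """Closed-form posterior for y = w * phi + noise, with prior
    w ~ N(0, 1/alpha) and noise ~ N(0, 1/beta).
    Returns the posterior mean and precision of w."""
    precision = alpha + beta * sum(p * p for p in phis)
    m = beta * sum(p * y for p, y in zip(phis, ys)) / precision
    return m, precision

def predict(m, precision, phi, beta=25.0):
    # predictive mean and variance at a new basis value phi
    mean = m * phi
    var = 1.0 / beta + phi * phi / precision   # noise floor + weight uncertainty
    return mean, var

# pretend the network maps inputs to these basis values; true w = 0.8
phis = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0.8 * p for p in phis]
m, prec = bayes_linreg_posterior(phis, ys)
mu, var = predict(m, prec, phi=1.0)   # mu close to 0.8
```

The point of this construction is cost: only the linear layer is treated in a Bayesian way, so the predictive update is cheap while the expensive neural network is fit once, sidestepping the GP's $O(n^3)$ growth.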
30. BAYESIAN OPTIMIZATION

Deep Neural Net
• Experiments in general settings (HPOLib benchmarks)
• A more challenging task (image caption generation)

31. BAYESIAN OPTIMIZATION BEYOND HYPER-PARAMETER TUNING

Assessing hyperparameter importance (Hutter et al., 2014)
• Quantifies the importance of each single hyperparameter and the interactions between them

Enhancing deep generative models (Ranganath et al., 2016)
• Enriches the posterior distributions of generative models with the Variational Gaussian Process

More info on black-box methods
• Check http://www.blackboxworkshop.org/

32. WHAT NOT COVERED IN THIS TALK

• Parallelized/distributed optimization
• Simulated-gradient based approaches: Hyper Gradient, etc.
• Other gradient-free approaches: genetic algorithms, simulated annealing, Nelder-Mead method, etc.

33. EXISTING LIBRARIES

Library          Implements                    Programming Interface   Licence
Spearmint        Gaussian Process              Python                  Academic/Non-commercial Research
Bayesopt         Gaussian Process              Java, Matlab            GNU Public
MOE              Gaussian Process              Python, C++, REST       Apache ver2.0
HyperOpt         Tree Parzen / Random Search   Python                  BSD
SMAC             Random Forest                 Java                    AGPLv3
Hyper Gradients  Hyper Gradient                Python                  MIT

And so on…

35. DEMO

Spearmint integration with Keras and TensorFlow. Basically, we just need to prepare 2 files:
• A config file to specify the hyperparameters to optimize
• A main function file to calculate the objective value
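The two files might look roughly like this (a hypothetical sketch following HIPS/Spearmint conventions; the variable names and the stand-in objective are illustrative, the exact config schema depends on the Spearmint version, and a real `main` would build and fit a Keras/TensorFlow model):

```python
import json

# --- file 1: config.json (shown here as a string for illustration) ---
CONFIG = json.loads("""
{
    "experiment-name": "keras-tuning",
    "language": "PYTHON",
    "main-file": "train.py",
    "variables": {
        "learning_rate": {"type": "FLOAT", "size": 1, "min": 0.0001, "max": 0.1},
        "n_hidden": {"type": "INT", "size": 1, "min": 16, "max": 256}
    }
}
""")

# --- file 2: train.py exposes main(job_id, params), returning the objective ---
def main(job_id, params):
    lr = float(params["learning_rate"][0])
    n_hidden = int(params["n_hidden"][0])
    # here one would compile and fit the Keras model with these
    # hyperparameters and return the validation loss; a toy stand-in:
    validation_loss = (lr - 0.01) ** 2 + (n_hidden - 128) ** 2 / 1e4
    return float(validation_loss)
```

Spearmint repeatedly calls `main` with hyperparameter settings proposed by its GP surrogate, so the model code never needs to know the search strategy.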

38. REFERENCES

Papers
• Hanna Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems 22, pages 1973–1981, 2009.
• G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
• Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmidhuber, J. LSTM: a search space odyssey. arXiv preprint arXiv:1503.04069, 2015.
• James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
• Brochu, E., Cora, M., and de Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. TR-2009-23, UBC, 2009.
• Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. In NIPS, pp. 2951–2959, 2012.
• James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24, 2011.
• J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proc. of ICML, 2013.

39. REFERENCES

Papers
• J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. M. A. Patwary, Prabhat, and R. P. Adams. Scalable Bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
• K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, and K. Leyton-Brown. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization in Theory and Practice, 2013.
• Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st International Conference on Machine Learning, pp. 754–762, 2014.
• Tran, D., Ranganath, R., and Blei, D. M. The variational Gaussian process. In International Conference on Learning Representations (ICLR), 2016.

40. REFERENCES

Web Sites
• RBF SVM Parameters: http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
• no free hunch: http://blog.kaggle.com
• How to Evaluate Machine Learning Models: Hyperparameter Tuning: http://blog.dato.com/how-to-evaluate-machine-learning-models-part-4-hyperparameter-tuning
• Optimization of Machine Learning Hyperparameters: http://aad.informatik.uni-freiburg.de/~hutter/ML3.pdf

Books
• Bishop, C. M. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.