Slide 1

Bayesian Dropout and Beyond
Lukasz Krawczyk, 29th June 2017
Homo Apriorius → Homo Pragmaticus → Homo Friquentistus → Homo Sapiens → Homo Bayesianis

Slide 2

Agenda
● About me
● Bayesian Neural Networks
● Regularization
● Dropout

Slide 3

About me
● Data Scientist at Asurion Japan Holdings
● Data Scientist & Full Stack Developer at Abeja Inc.
● MSc degree from Jagiellonian University, Poland
● Open Source & Bayesian Inference advocate

Slide 4

PART 1: Bayesian Neural Networks

Slide 5

Bayesian Inference
● General-purpose framework
● Generative models
● Clarity of FS + Power of ML
  – White-box modelling
  – Black-box fitting (MCMC, VI)
  – Uncertainty → intuitive insights
● Learning from very small datasets
● Probabilistic Programming

Slide 6

Bayesian Inference
Prior (μ, σ) + Data → Model → Inference (MCMC or VI) → Posterior → Credible Region / Uncertainty → Better Insights
Assumptions about the data are controlled by the prior.
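A minimal PyMC3 sketch of this workflow; the toy data and the particular priors are illustrative assumptions, not from the slides:

import numpy as np
import pymc3 as pm

data = np.random.normal(loc=1.0, scale=2.0, size=200)       # toy observations

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=10)                        # prior: assumptions about the data
    sigma = pm.HalfNormal('sigma', sd=10)
    obs = pm.Normal('obs', mu=mu, sd=sigma, observed=data)   # model
    trace = pm.sample(1000)                                  # inference: MCMC (NUTS by default)

# posterior → credible region / uncertainty → better insights
print(np.percentile(trace['mu'], [2.5, 97.5]))               # 95% credible interval for mu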

Slide 7

Bayesian Inference
Very easy way to cook your laptop

Slide 8

Bayesian Neural Networks
● Replace weights with probability distributions
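A one-line illustration of this idea in PyMC3 (the layer sizes are placeholders, not from the slides):

import pymc3 as pm

n_in, n_hidden = 2, 5
with pm.Model():
    # the weight matrix is a random variable with a prior, not a fixed point estimate
    W = pm.Normal('W', mu=0, sd=1, shape=(n_in, n_hidden))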

Slide 9

Example – standard NN
Data:
  x1    x2    y
  0.1   1.0   0
  0.1  -1.3   1
  …
Activations: sigmoid, tanh. Trained with backpropagation (a sketch follows below).
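For contrast with the Bayesian version on the following slides, a hedged Lasagne/Theano sketch of such a standard NN trained by backpropagation (layer sizes and learning rate are illustrative):

import theano
import theano.tensor as T
import lasagne

X = T.matrix('X')
y = T.ivector('y')

l_in = lasagne.layers.InputLayer((None, 2), input_var=X)
l_1 = lasagne.layers.DenseLayer(l_in, 5, nonlinearity=lasagne.nonlinearities.tanh)
l_out = lasagne.layers.DenseLayer(l_1, 2, nonlinearity=lasagne.nonlinearities.softmax)

p = lasagne.layers.get_output(l_out)
loss = lasagne.objectives.categorical_crossentropy(p, y).mean()
params = lasagne.layers.get_all_params(l_out, trainable=True)
updates = lasagne.updates.sgd(loss, params, learning_rate=0.1)   # backpropagation + SGD
train_fn = theano.function([X, y], loss, updates=updates)        # call train_fn(X_batch, y_batch) in a loop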

Slide 10

Example – NN using Bayesian Inference
Data:
  x1    x2    y
  0.1   1.0   [0,1,...]
  0.1  -1.3   [1,1,...]
  …
Bayesian approximation via:
● MCMC: NUTS, HMC, Metropolis, Gibbs
● VI: ADVI, OPVI
(invocation sketched below)
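How the listed methods are invoked in PyMC3, as a hedged sketch (assumes `model` is the Bayesian NN defined on the code slides that follow):

with model:
    trace = pm.sample(2000)                    # MCMC; NUTS is used by default where possible

with model:
    approx = pm.fit(n=50000, method='advi')    # VI: ADVI
    trace_vi = approx.sample(1000)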

Slide 11

Results – binary
[Plots: NN output and NN output uncertainty; a sketch of how to compute them follows below]
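A hedged sketch of how such output / uncertainty maps can be produced: swap the shared input for a grid of test points and draw posterior predictive samples (`X_shared`, `y_shared`, `model` and `trace` come from the code slides; the grid is illustrative):

import numpy as np

grid = np.mgrid[-3:3:100j, -3:3:100j].reshape(2, -1).T        # 2-D grid of test points
X_shared.set_value(grid)
y_shared.set_value(np.zeros(grid.shape[0], dtype='int64'))    # dummy labels, only the shape matters

with model:
    ppc = pm.sample_ppc(trace, samples=500)                   # posterior predictive samples of 'out'

pred_mean = ppc['out'].mean(axis=0)   # NN output (probability of class 1 in the binary case)
pred_std = ppc['out'].std(axis=0)     # NN output uncertainty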

Slide 12

Results – n classes
[Plots: NN output and NN output uncertainty]

Slide 13

PART 2: Bayesian Regularization

Slide 14

Bayesian Regularization
● Laplace(μ, b) prior on the weights ↔ L1 regularization
● Normal(μ, σ) prior on the weights ↔ L2 regularization
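Why these priors correspond to L1/L2 (a standard MAP identity, added for completeness): maximizing the posterior over the weights w is the same as minimizing

-log p(D | w) + (1 / 2σ²) · Σ w_i²  + const    for a Normal(0, σ) prior   (L2 penalty)
-log p(D | w) + (1 / b) · Σ |w_i|   + const    for a Laplace(0, b) prior  (L1 penalty)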

Slide 15

Code (pymc3 + Lasagne)

with pm.Model() as model:
    # weights with L2 regularization
    w_in_1 = Normal('w_in_1', 0, sd=1, shape=(n_in, n_hidden))
    w_1_2 = Normal('w_1_2', 0, sd=1, shape=(n_hidden, n_hidden))
    w_2_out = Normal('w_2_out', 0, sd=1, shape=(n_hidden, n_out))

    # layers
    l_in = InputLayer(in_shape, input_var=X_shared)
    l_1 = DenseLayer(l_in, n_hidden, W=w_in_1, nonlinearity=tanh)
    l_2 = DenseLayer(l_1, n_hidden, W=w_1_2, nonlinearity=tanh)
    l_out = DenseLayer(l_2, n_out, W=w_2_out, nonlinearity=softmax)

    p = Deterministic('p', lasagne.layers.get_output(l_out))
    out = Categorical('out', p=p, observed=y_shared)
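The snippet above omits its setup; a hedged reconstruction of what it assumes (identifiers match the slide, the concrete sizes and the data arrays X_train / y_train are illustrative):

import numpy as np
import theano
import theano.tensor as T
import lasagne
from lasagne.layers import InputLayer, DenseLayer
from lasagne.nonlinearities import tanh, softmax
import pymc3 as pm
from pymc3 import Normal, HalfNormal, Bernoulli, Categorical, Deterministic

n_in, n_hidden, n_out = 2, 5, 2
in_shape = (None, n_in)

X_shared = theano.shared(X_train.astype(np.float64))   # training inputs as a Theano shared variable
y_shared = theano.shared(y_train.astype(np.int64))     # integer class labels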

Slide 16

Bayesian Regularization
● L2 regularization with automated hyperparameter optimization: the weight scale σ itself gets a prior (hyperprior), as in the code on the next slide

Slide 17

Code (pymc3 + Lasagne)

with pm.Model() as model:
    # regularization hyperparameters
    r_in_1 = HalfNormal('r_in_1', sd=1)
    r_1_2 = HalfNormal('r_1_2', sd=1)
    r_2_out = HalfNormal('r_2_out', sd=1)

    # weights with L2 regularization (scale learned from the data)
    w_in_1 = Normal('w_in_1', 0, sd=r_in_1, shape=(n_in, n_hidden))
    w_1_2 = Normal('w_1_2', 0, sd=r_1_2, shape=(n_hidden, n_hidden))
    w_2_out = Normal('w_2_out', 0, sd=r_2_out, shape=(n_hidden, n_out))

    # layers
    l_in = InputLayer(in_shape, input_var=X_shared)
    l_1 = DenseLayer(l_in, n_hidden, W=w_in_1, nonlinearity=tanh)
    l_2 = DenseLayer(l_1, n_hidden, W=w_1_2, nonlinearity=tanh)
    l_out = DenseLayer(l_2, n_out, W=w_2_out, nonlinearity=softmax)

    p = Deterministic('p', lasagne.layers.get_output(l_out))
    out = Categorical('out', p=p, observed=y_shared)

Slide 18

Results
● Learning regularization directly from data
● Transferring knowledge to other models (a sketch follows below)
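A hedged sketch of what this looks like in practice: sample the model from the previous slide and read the learned per-layer weight scales off the posterior; their posterior means could then be reused as L2 strengths in a standard NN:

with model:
    trace = pm.sample(2000)

# learned regularization strengths (posterior over the weight-scale hyperparameters)
for name in ['r_in_1', 'r_1_2', 'r_2_out']:
    print(name, trace[name].mean(), trace[name].std())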

Slide 19

PART 3: Bayesian Dropout

Slide 20

Dropout
● Standard dropout is already a form of Bayesian approximation
● Experiments show it has a positive influence on the learning process
● The output is predicted by a weighted average of sub-model predictions (MC-dropout sketch below)
[Illustration: network with dropout probability p = 0.5]
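A hedged sketch of the "weighted average of sub-model predictions" idea at test time (MC dropout, after the Gal & Ghahramani paper linked at the end): keep dropout active during prediction and average many stochastic forward passes. `l_out` is assumed to be a Lasagne network containing DropoutLayers with input variable `X`:

import numpy as np
import theano
import lasagne

p_stochastic = lasagne.layers.get_output(l_out, deterministic=False)   # dropout stays ON at test time
predict_once = theano.function([X], p_stochastic)

samples = np.stack([predict_once(X_test) for _ in range(100)])   # 100 stochastic forward passes
pred_mean = samples.mean(axis=0)   # averaged prediction over sampled sub-models
pred_std = samples.std(axis=0)     # predictive uncertainty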

Slide 21

Bayesian Dropout
layer_n = σ(W · Z · layer_{n-1} + b),  where W ~ Normal(μ, σ) and Z ~ Bernoulli(p)

Slide 22

Code (pymc3 + Lasagne)

with pm.Model() as model:
    # weights with L2 regularization
    w_in_1 = Normal('w_in_1', 0, sd=1, shape=(n_in, n_hidden))
    w_1_2 = Normal('w_1_2', 0, sd=1, shape=(n_hidden, n_hidden))
    w_2_out = Normal('w_2_out', 0, sd=1, shape=(n_hidden, n_out))

    # dropout masks
    d_in_1 = Bernoulli('d_in_1', p=0.5, shape=n_in)
    d_1_2 = Bernoulli('d_1_2', p=0.5, shape=n_hidden)
    d_2_out = Bernoulli('d_2_out', p=0.5, shape=n_hidden)

    # layers (each weight matrix is masked row-wise by its dropout vector)
    l_in = InputLayer(in_shape, input_var=X_shared)
    l_1 = DenseLayer(l_in, n_hidden, W=T.dot(T.nlinalg.diag(d_in_1), w_in_1), nonlinearity=tanh)
    l_2 = DenseLayer(l_1, n_hidden, W=T.dot(T.nlinalg.diag(d_1_2), w_1_2), nonlinearity=tanh)
    l_out = DenseLayer(l_2, n_out, W=T.dot(T.nlinalg.diag(d_2_out), w_2_out), nonlinearity=softmax)

    p = Deterministic('p', lasagne.layers.get_output(l_out))
    out = Categorical('out', p=p, observed=y_shared)

Slide 23

Results
● Approximating the dropout rate node-wise!
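Concretely, the node-wise rates can be read from the posterior over the Bernoulli masks of the model above (a sketch, assuming `trace` was obtained by sampling that model):

keep_in_1 = trace['d_in_1'].mean(axis=0)    # per-input-node posterior keep probability
keep_1_2 = trace['d_1_2'].mean(axis=0)      # per-hidden-node keep probability
keep_2_out = trace['d_2_out'].mean(axis=0)

# low keep rates point at features / nodes that could be pruned (cf. next slide)
print(keep_in_1)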

Slide 24

Results
● https://github.com/uhho/bnn-experiments
● We can use this information for building a standard NN:
  – Feature selection
  – Architecture decisions

Slide 25

Summary
Scientific perspective
● NN models with small datasets
● Complex hierarchical neural networks (Bayesian CNN/RNN)
● Reduced overfitting
● Faster training
Business perspective
● Clear and intuitive models
● Uncertainty in Finance & Insurance is extremely important
● Better trust and adoption of Neural Network-based models

Slide 26

Links & Sources
● Code
  – https://github.com/uhho/bnn-experiments
● Papers
  – "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning", Y. Gal, Z. Ghahramani (Cambridge University), http://mlg.eng.cam.ac.uk/yarin/PDFs/NIPS_2015_deep_learning_uncertainty.pdf
  – "Bayesian Dropout", T. Herlau, M. Mørup, M. N. Schmidt (Technical University of Denmark), https://www.researchgate.net/publication/280970177_Bayesian_Dropout
  – "A Bayesian encourages dropout", S. Maeda (Kyoto University), https://arxiv.org/pdf/1412.7003.pdf

Slide 27

Thank you!