How to combine expressive statistical models with the power of Deep Learning. And why this could be a game changer.
Presented at 1st AI Meetup in Stuttgart.
• Developer and consultant for software systems, architecture and processes. • Mainly interested in businesses around engineering, manufacturing and logistics. • Self-employed for 5 years after high school. • Currently doing a PhD at the Institute of Theoretical Physics in Ulm while still consulting.
• Probabilistic programming allows very flexible creation of custom probabilistic models. • It is mainly concerned with insight and learning from your data. • The approach is inherently Bayesian: we can specify priors to inform and constrain our models and obtain uncertainty estimates in the form of a posterior distribution. • Using MCMC sampling algorithms we can draw samples from this posterior to estimate these models very flexibly. • Variational inference algorithms instead fit a distribution (e.g. a normal) to the posterior, turning a sampling problem into an optimization problem. Neal (2011), MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo. Kucukelbir et al. (2016), Automatic differentiation variational inference. arXiv:1603.00788.
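A minimal sketch of the two inference routes in Edward, on an assumed toy model (a single unknown mean with a normal prior); the sample counts, step sizes and iteration numbers are illustrative choices, not from the talk:

```python
import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal, Empirical

# Toy model: unknown mean mu with a standard normal prior, 50 observations.
mu = Normal(loc=0.0, scale=1.0)
y = Normal(loc=mu, scale=1.0, sample_shape=50)
y_data = np.random.normal(0.5, 1.0, 50).astype(np.float32)

# MCMC route: draw posterior samples with Hamiltonian Monte Carlo.
qmu_hmc = Empirical(params=tf.Variable(tf.zeros(1000)))
ed.HMC({mu: qmu_hmc}, data={y: y_data}).run(step_size=0.1, n_steps=5)

# Variational route: fit a normal to the posterior by optimization (KL divergence).
qmu_vi = Normal(loc=tf.Variable(0.0), scale=tf.nn.softplus(tf.Variable(0.0)))
ed.KLqp({mu: qmu_vi}, data={y: y_data}).run(n_iter=500)
```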
• Bayesian linear regression assumes a linear relationship between the inputs $\mathbf{x}$ and the output $y$. • For a set of $N$ data points $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ the model looks like: $y_n \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x}_n + b,\; \sigma_y^2)$ • The variances: ◦ Priors: $\sigma_w^2,\ \sigma_b^2$ (fixed prior variances) ◦ Likelihood: $\sigma_y^2$ (fixed observation noise) • The latent variables: ◦ Model weights: $\mathbf{w} \sim \mathcal{N}(\mathbf{0},\ \sigma_w^2 I)$ ◦ Intercept/bias: $b \sim \mathcal{N}(0,\ \sigma_b^2)$ Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.
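A minimal sketch of this model in Edward (the library named later in the stack); the data dimensions, the toy data and the unit variances are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal

N, D = 40, 5  # number of data points, number of features (illustrative)

# Toy data, purely for illustration.
X_train = np.random.randn(N, D).astype(np.float32)
y_train = (X_train.dot(np.ones(D)) + 0.1 * np.random.randn(N)).astype(np.float32)

# Model: Gaussian priors on weights and bias, Gaussian likelihood.
X = tf.placeholder(tf.float32, [N, D])
w = Normal(loc=tf.zeros(D), scale=tf.ones(D))        # prior on the model weights
b = Normal(loc=tf.zeros(1), scale=tf.ones(1))        # prior on the intercept/bias
y = Normal(loc=ed.dot(X, w) + b, scale=tf.ones(N))   # likelihood
```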
• Let the variational approximation be a fully factorized normal across the weights. • Run variational inference with the Kullback-Leibler divergence, using 250 iterations and 5 latent-variable samples in the algorithm. • The inference is done by minimizing the divergence measure using the reparameterization gradient. • Probabilistic programming emphasizes expressiveness instead of scale.
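Continuing the sketch above, this is how such a variational approximation might look in Edward; the variable initializations are assumptions, while the iteration and sample counts mirror the numbers quoted here:

```python
# Fully factorized normal approximation over the latent variables.
qw = Normal(loc=tf.Variable(tf.random_normal([D])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([D]))))
qb = Normal(loc=tf.Variable(tf.random_normal([1])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))

# KL(q || p) minimized with the reparameterization gradient (ed.KLqp).
inference = ed.KLqp({w: qw, b: qb}, data={X: X_train, y: y_train})
inference.run(n_iter=250, n_samples=5)
```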
• Random variables are represented as nodes on the computational graph. • Fetching x from the graph generates a binary vector. • All computation is represented on the graph, enabling us to leverage the model structure during inference.
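A small illustration of that behaviour with Edward's Bernoulli random variable; the vector length of 10 is an assumption:

```python
import tensorflow as tf
from edward.models import Bernoulli

# The random variable x is a node on the TensorFlow graph.
x = Bernoulli(probs=0.5 * tf.ones(10))

# Fetching x runs the graph and draws a sample: a binary vector.
with tf.Session() as sess:
    print(sess.run(x))   # e.g. [0 1 1 0 1 0 0 1 1 0]
```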
• Nodes on the graph are operations: arithmetic, looping, slicing. • Edges are tensors communicated between nodes. • Everything is formulated with respect to the graph. [Figure: the TensorFlow stack: a distributed execution engine running on CPU, GPU, Android, ...; Python and C++ frontends; higher-level libraries such as Layers, Estimator, Keras and Edward on top.] Abadi et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. Whitepaper.
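A minimal TensorFlow 1.x-style example of this graph model; the operation and the fed values are arbitrary:

```python
import tensorflow as tf

# Nodes are operations, edges are the tensors flowing between them.
a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
c = a * b   # builds a multiplication node; nothing is computed yet

# Computation only happens when the graph is executed in a session.
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 3.0, b: 4.0}))   # 12.0
```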
• A neural network is a parameterized function which is fitted to data. • Data: $\mathcal{D} = \{(\mathbf{x}^{(n)}, y^{(n)})\}_{n=1}^{N}$; the parameters $\theta$ are the weights of the neural net. • Feedforward neural nets model $p(y^{(n)} \mid \mathbf{x}^{(n)}, \theta)$ as a nonlinear function of $\theta$ and $\mathbf{x}$, e.g.: $p(y^{(n)} = 1 \mid \mathbf{x}^{(n)}, \theta) = \sigma\big(\textstyle\sum_i \theta_i x_i^{(n)}\big)$
• Multilayer / deep neural networks model the overall function as a composition of functions (layers), e.g. $f(\mathbf{x}) = f_L\big(\cdots f_2(f_1(\mathbf{x}))\big)$ with layers of the form $f_l(\mathbf{z}) = g(W_l \mathbf{z} + \mathbf{b}_l)$. • Usually trained to maximise the likelihood using variants of stochastic gradient descent (SGD) optimization.
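A minimal sketch of such a network with tf.keras; the layer sizes, data shapes and the mean-squared-error loss (i.e. a Gaussian likelihood) are illustrative assumptions:

```python
import numpy as np
from tensorflow import keras

# Two hidden layers g(W x + b), composed, with a linear output layer.
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),
])
# MSE loss = maximum likelihood under a Gaussian noise model; trained with SGD.
model.compile(optimizer="sgd", loss="mse")

X = np.random.randn(256, 10).astype("float32")
y = np.random.randn(256, 1).astype("float32")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```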
• Deep learning systems are neural network models similar to those popular in the '80s and '90s, with: ◦ some architectural and algorithmic innovations, ◦ much larger data sets, ◦ much better software tools, ◦ orders-of-magnitude larger compute resources, ◦ vastly increased investment. LeCun, Bengio & Hinton (2015), Deep learning. Nature 521: 436-444.
• Several limitations make applying deep learning in certain domains hard: ◦ Very data hungry. ◦ Very compute-intensive to train and deploy. ◦ Poor at representing uncertainty. ◦ Easily fooled by adversarial examples. ◦ Tricky to optimize: non-convex, and the choice of architecture, learning procedure and initialization matters. ◦ Uninterpretable black boxes, lacking in transparency, difficult to trust. ◦ Hard to incorporate prior knowledge into the model.
• Probabilistic Programming: domain & prior knowledge; very principled & well understood. • Deep Learning: huge & complex models; many heuristics; amazing predictions. • Combining the two, Deep Learning gains intelligibility and Probabilistic Programming scales to larger models and data: awesome. Gal (2016), Uncertainty in Deep Learning. PhD thesis, University of Cambridge.
• Many aspects of learning and intelligence depend crucially on the careful probabilistic representation of uncertainty: ◦ Forecasting. ◦ Decision making. ◦ Learning from limited, noisy, and missing data. ◦ Learning complex personalised models. ◦ Data compression. ◦ Automating scientific modelling, discovery, and experiment design. ◦ Incorporation of domain-specific knowledge.
• Getting systems that know when they don't know. • AI in real-life settings needs safety, e.g. in the medical domain, drones, cars, finance. • Model confidence and result distributions matter for human-machine interaction. • Low-level errors can propagate up to top-level rule-based systems (the Tesla incident). • A framework for model confidence is necessary. • Adapt learning to uncertainty: active learning & deep reinforcement learning.
• Averaging predictions over the weight posterior corresponds to an ensembling of networks. • Uncertainty in the weights informs us about the stability of the learned representations in the network. • There is a duality between distributions over the weights and dropout. • Evidence Lower Bound objective from variational inference: $\mathcal{L}(\theta) = \mathbb{E}_{q_\theta(\omega)}\big[\log p(\mathbf{Y} \mid \mathbf{X}, \omega)\big] - \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big)$ • Dropout objective: $\mathcal{L}_{\mathrm{dropout}} = \frac{1}{N}\sum_{n=1}^{N} E\big(y_n, \hat{y}_n\big) + \lambda \sum_{l}\big(\lVert W_l \rVert_2^2 + \lVert \mathbf{b}_l \rVert_2^2\big)$
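A minimal Monte Carlo dropout sketch in the spirit of Gal (2016), using tf.keras: dropout stays active at prediction time and several stochastic forward passes are averaged; the architecture, dropout rate and number of passes are assumptions:

```python
import numpy as np
from tensorflow import keras

# Network with dropout; under the dropout-as-variational-inference view the
# dropout masks play the role of samples from the weight distribution.
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")
# ... fit the model on training data as usual ...

# Keep dropout on at test time (training=True) and average T stochastic passes.
X_test = np.random.randn(8, 10).astype("float32")
passes = np.stack([model(X_test, training=True).numpy() for _ in range(50)])
pred_mean, pred_std = passes.mean(axis=0), passes.std(axis=0)  # predictive uncertainty
```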
• Each sub-network concentrates on a certain aspect of the problem, but is informed about the representation of the overall population. [Figure: a microscopic probabilistic model for the inputs, a pattern-recognition / connection network, and a macroscopic probabilistic model for the outputs.]
• Flexibly adjust the size and shape of the hidden layers. • More freedom in building network architectures. • Change the shape and structure of the network during training. • Avoids costly hyperparameter optimization and "tribal" knowledge. • Optimally scale the network architecture to the problem during training.
• Probabilistic modelling offers a framework for building systems that learn from data. • Advantages include better estimates of uncertainty, automatic ways of learning structure and avoiding overfitting, and a principled foundation. • Disadvantages include higher computational cost, depending on the approximate inference algorithm. • Bayesian neural networks have a long history and are undergoing a tremendous wave of revival. • There could be a lot of practical benefit in marrying Probabilistic Programming and Neural Networks.