
Probabilistic Programming & Deep Learning


How to combine expressive statistical models with the power of Deep Learning. And why this could be a game changer.
Presented at the 1st A.I. Meetup in Stuttgart.

Joachim Rosskopf

June 25, 2017



Transcript

  1. Probabilistic Programming & Deep Learning How to combine expressive statistical

    models with the power of Deep Learning. And why this could be a game changer. A.I. Meetup Stuttgart #1 - 25.06.17 - Joachim Rosskopf - @jrosskopf
  2. Trends in Machine Learning (overview diagram): Big Data, SIMD hardware and cloud

    storage; statistics and data; Bayesian probabilistic programming; deep learning with convolutions, recurrence and reinforcement; applications? Expectations, experience and opportunity.
  3. Who am I? • Joachim has worked for about 15 years as a

    developer and consultant for software systems, architecture and processes. • Mainly interested in businesses around engineering, manufacturing and logistics. • Self-employed for 5 years after high school. • Right now he is doing a PhD at the Institute of Theoretical Physics in Ulm and is still consulting.
  4. Probabilistic Programming (Model → Infer → Criticize loop over Data) • Probabilistic Programming allows

    very flexible creation of custom probabilistic models. • It is mainly concerned with gaining insight and learning from your data. • The approach is inherently Bayesian, so we can specify priors to inform and constrain our models and get uncertainty estimates in the form of a posterior distribution. • Using MCMC sampling algorithms we can draw samples from this posterior to estimate these models very flexibly. • Variational inference algorithms fit a distribution (e.g. a normal) to the posterior, turning a sampling problem into an optimization problem. Neal (2011), MCMC using Hamiltonian dynamics, Handbook of Markov Chain Monte Carlo. Kucukelbir et al. (2016), Automatic differentiation variational inference, arXiv:1603.00788.
  5. Bayesian Linear Regression - Data • Simulate 40 data points.

    • Inputs x_n ∈ R^D. • Outputs y_n ∈ R. • Linear dependence between inputs and outputs. • Normally distributed noise.
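
A minimal NumPy sketch of the simulated data described on this slide; the input dimensionality D, the true parameters and the noise level are assumptions, since the slide only fixes the number of points (40):

```python
import numpy as np

def build_toy_dataset(N=40, D=10, noise_std=0.1):
    """Simulate N points whose outputs depend linearly on the inputs, plus Gaussian noise."""
    w_true = np.random.randn(D)            # assumed true weights
    b_true = 0.0                           # assumed true intercept
    X = np.random.randn(N, D)              # inputs x_n in R^D
    y = X.dot(w_true) + b_true + np.random.normal(0.0, noise_std, size=N)  # outputs y_n in R
    return X.astype(np.float32), y.astype(np.float32)

X_train, y_train = build_toy_dataset()
```
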
  6. Bayesian Linear Regression - Model • Bayesian Linear Regression assumes

    a linear relationship between inputs x and output y. • For a set of N data points the model looks like: y_n ~ Normal(w^T x_n + b, σ_y^2) for n = 1, ..., N. • The variances: ◦ Priors: σ_w^2 for the weights, σ_b^2 for the bias. ◦ Likelihood: noise variance σ_y^2. • The latent variables: ◦ Model weights: w ~ Normal(0, σ_w^2 I). ◦ Intercept/bias: b ~ Normal(0, σ_b^2). Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.
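
A hedged sketch of this model in Edward's TensorFlow-based API (Edward appears in the software stack on slide 11); the unit prior and noise scales are assumptions, not values from the slide:

```python
import edward as ed
import tensorflow as tf
from edward.models import Normal

N, D = 40, 10   # matches the simulated data above

X = tf.placeholder(tf.float32, [N, D])
w = Normal(loc=tf.zeros(D), scale=tf.ones(D))        # prior over the model weights
b = Normal(loc=tf.zeros(1), scale=tf.ones(1))        # prior over the intercept/bias
y = Normal(loc=ed.dot(X, w) + b, scale=tf.ones(N))   # likelihood with (assumed) unit noise scale
```
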
  7. Bayesian Linear Regression - Inference Define the variational model to

    be a fully factorized normal across the weights. Run variational inference with the Kullback-Leibler divergence, using 250 iterations and 5 latent variable samples in the algorithm. The inference is done by minimizing the divergence measure using the reparameterization gradient. Probabilistic Programming emphasizes expressiveness instead of scale.
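
Continuing the model and data sketches above, this is roughly how that inference setup could look with Edward's KLqp, which minimizes the KL divergence using reparameterization gradients; the 250 iterations and 5 latent-variable samples follow the slide:

```python
# fully factorized normal variational family over the weights and the bias
qw = Normal(loc=tf.get_variable("qw/loc", [D]),
            scale=tf.nn.softplus(tf.get_variable("qw/scale", [D])))
qb = Normal(loc=tf.get_variable("qb/loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qb/scale", [1])))

inference = ed.KLqp({w: qw, b: qb}, data={X: X_train, y: y_train})
inference.run(n_samples=5, n_iter=250)
```
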
  8. Bayesian Linear Regression - Inference • The model has learned the linear

    relationship. • The plot shows only the first dimension of the D-dimensional x.
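
One hedged way to produce such a plot, continuing the sketches above: draw posterior samples of (w, b) and overlay the implied lines along the first input dimension (matplotlib assumed):

```python
import matplotlib.pyplot as plt

sess = ed.get_session()
w_post, b_post = sess.run([qw.sample(50), qb.sample(50)])    # 50 posterior draws

xs = X_train[:, 0]                                           # first dim of the D-dim inputs
for wk, bk in zip(w_post, b_post):
    plt.plot(xs, wk[0] * xs + bk[0], color="C1", alpha=0.1)  # regression lines from posterior draws
plt.scatter(xs, y_train, s=10)
plt.xlabel("x[0]"); plt.ylabel("y")
plt.show()
```
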
  9. Modeling a Coin Flip • Beta-Bernoulli model • Representation as

    a graph. • Fetching x from the graph generates a binary vector. • All computation is represented on the graph, which makes it possible to leverage the model structure during inference.
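
A hedged Edward sketch of the Beta-Bernoulli model; within a session, fetching x from the graph yields a binary vector as described:

```python
import edward as ed
import tensorflow as tf
from edward.models import Bernoulli, Beta

theta = Beta(1.0, 1.0)                      # prior over the coin's bias
x = Bernoulli(probs=tf.ones(10) * theta)    # 10 flips, all sharing the same latent bias

sess = ed.get_session()
print(sess.run(x))                          # e.g. [0 1 1 0 1 0 0 1 1 0]
```
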
  10. What a coincidence! Modern Probabilistic Programming reuses recent multidimensional

    array/tensor and symbolic automatic differentiation frameworks!
  11. Digression to Tensorflow • Computational graph framework. • Nodes are

    operations: arithmetic, looping, slicing. • Edges are tensors communicated between nodes. • Everything is formulated with respect to the graph. (Stack diagram: CPU, GPU, ..., Android → TensorFlow Distributed Execution Engine → Python frontend, C++ frontend, ... → Layers, Estimator, Keras, Edward.) Abadi et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. Whitepaper.
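
A tiny TensorFlow 1.x example of the graph idea: the Python frontend only builds nodes (operations) and edges (tensors); the execution engine runs the requested subgraph:

```python
import tensorflow as tf

a = tf.placeholder(tf.float32, name="a")   # graph inputs
b = tf.placeholder(tf.float32, name="b")
c = a * b + 2.0                            # nodes are ops, edges are the tensors flowing between them

with tf.Session() as sess:                 # the execution engine evaluates the graph on CPU/GPU/...
    print(sess.run(c, feed_dict={a: 3.0, b: 4.0}))   # 14.0
```
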
  12. What is a Neural Network? (Diagram: inputs → hidden units → outputs, connected by

    weights.) A neural network is a parameterized function which is fitted to data; the parameters are the weights of the neural net. Feedforward neural nets model data y as a nonlinear function of the inputs x and the parameters θ, e.g. y = f(x; θ).
  13. What is a Neural Network? (Same diagram: inputs → hidden units → outputs, connected by

    weights.) A neural network is a parameterized function which is fitted to data. Multilayer / deep neural networks model the overall function as a composition of functions / layers, e.g. f(x) = f_L(... f_2(f_1(x)) ...). They are usually trained to maximise the likelihood using variants of stochastic gradient descent (SGD) optimization.
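
A minimal NumPy sketch of this composition-of-layers view; the depth, widths and tanh nonlinearity are arbitrary illustrative choices, and training would adjust the weights with SGD on a likelihood-based loss:

```python
import numpy as np

def layer(W, b, x):
    """One layer: affine map followed by a nonlinearity."""
    return np.tanh(W @ x + b)

def feedforward(x, params):
    """Deep net as a composition of layers: f(x) = f3(f2(f1(x)))."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = layer(W1, b1, x)    # first hidden layer
    h2 = layer(W2, b2, h1)   # second hidden layer
    return W3 @ h2 + b3      # linear output layer
```
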
  14. Deep Learning Deep learning systems are neural network models similar

    to those popular in the ’80s and ’90s, with: • Some architectural and algorithmic innovations. • Much larger data sets. • Much better software tools. • Orders of magnitude more compute resources. • Vastly increased investment. LeCun, Bengio & Hinton (2015), Deep learning. Nature 521: 436-444.
  15. Deep Learning Deep learning also has non-negligible limitations which make application in

    certain domains hard. • Very data hungry. • Very compute-intensive to train and deploy. • Poor at representing uncertainty. • Easily fooled by adversarial examples. • Tricky to optimize: non-convex objective and choices of architecture, learning procedure and initialization. • Uninterpretable black boxes, lacking in transparency, difficult to trust. • Hard to incorporate prior knowledge into the model.
  16. Bridging DL and PP • Probabilistic Programming: small & focused models,

    domain & prior knowledge, very principled & well understood. • Deep Learning: huge & complex models, many heuristics, amazing predictions. • Bridging them: gain intelligibility on one side, scale model & data size on the other. Awesome. Gal (2016), PhD Thesis, Uncertainty in Deep Learning, University of Cambridge.
  17. Necessity of the Probabilistic Approach Many aspects of learning and

    intelligence depend crucially on the careful probabilistic representation of uncertainty: • Forecasting. • Decision making. • Learning from limited, noisy, and missing data. • Learning complex personalised models. • Data compression. • Automating scientific modelling, discovery, and experiment design. • Incorporation of domain-specific knowledge.
  18. Benefits - Uncertainty in Predictions Calibrated model and prediction uncertainty:

    Getting systems that know when they don’t know. • AI in real-life settings needs safety, e.g. the medical domain, drones, cars, finance. • Model confidence and result distributions matter for human-machine interaction. • Low-level errors can propagate to top-level rule-based systems (cf. the Tesla incident). • A framework for model confidence is necessary. • Adapt learning to uncertainty: active learning & deep reinforcement learning.
  19. Benefits - Uncertainty in Representations Uncertainty in Representation leads to

    ensembling of networks. • Uncertainty in the weights informs about the stability of the learned representation in the network. • Duality between distributions over the weights and dropout. • Evidence Lower Bound objective from Variational Inference and dropout objective: see the reconstruction after this slide.
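
The two objectives are referenced without their formulas in the transcript; a hedged reconstruction following Gal (2016), with q(ω) the approximate distribution over the weights ω, E the per-example loss and λ the weight-decay factor:

```latex
% Evidence Lower Bound (ELBO) maximized by variational inference
\mathcal{L}_{\mathrm{VI}} = \sum_{n=1}^{N} \mathbb{E}_{q(\omega)}\!\left[\log p(y_n \mid x_n, \omega)\right]
  - \mathrm{KL}\!\left(q(\omega) \,\|\, p(\omega)\right)

% Dropout training objective (loss plus L2 weight decay), whose minimization
% corresponds to maximizing an ELBO of this form for a particular q(omega)
\mathcal{L}_{\mathrm{dropout}} = \frac{1}{N}\sum_{n=1}^{N} E\!\left(y_n, \hat{y}_n\right)
  + \lambda \sum_{l} \left( \lVert W_l \rVert_2^2 + \lVert b_l \rVert_2^2 \right)
```
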
  20. Benefits - Transfer Learning by Informed Priors Bootstrap the learning

    by placing informed priors centered around weights retrieved from other pre-trained networks.
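
A hedged Edward sketch of such an informed prior: center a Normal prior at weights taken from a pre-trained network; the array of pre-trained weights is a stand-in, and the prior scale (how strongly the old weights constrain the new task) is an assumption:

```python
import numpy as np
import tensorflow as tf
from edward.models import Normal

# stand-in for weights retrieved from a pre-trained network
pretrained_W = np.random.randn(784, 100).astype(np.float32)

# informed prior: centered at the pre-trained weights, with a small scale
W = Normal(loc=tf.constant(pretrained_W),
           scale=0.1 * tf.ones_like(pretrained_W))
```
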
  21. Benefits - Hierarchical Neural Networks Hierarchy of neural networks where

    each sub-network concentrates on a certain aspect of the problem, but is informed about the representation of the overall population. (Diagram labels: microscopic probabilistic model for inputs; macroscopic probabilistic model for outputs; pattern recognition and connection network.)
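
A hedged sketch of the hierarchical idea in Edward: a macroscopic, population-level distribution over weights, and per-sub-network weights drawn around it so that each sub-network is partially pooled toward the population; the sizes and scales are illustrative assumptions:

```python
import tensorflow as tf
from edward.models import Normal

K, D = 5, 20   # K sub-networks, D weights each (illustrative sizes)

# macroscopic model: population-level distribution over the weights
mu = Normal(loc=tf.zeros(D), scale=tf.ones(D))

# microscopic models: each sub-network's weights are drawn around the population mean
w_sub = [Normal(loc=mu, scale=0.1 * tf.ones(D)) for _ in range(K)]
```
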
  22. Benefits - Hybrid Architectures Bayesian non-parametrics could be used to

    flexibly adjust the size and shape of the hidden layers. • More freedom in building network architectures. • Change the shape and structure of the network during training. • Avoid costly hyperparameter optimization and “tribal” knowledge. • Optimally scale the network architecture to the problem during training.
  23. Conclusions • Probabilistic Programming offers a general framework for building

    systems that learn from data. • Advantages include better estimates of uncertainty, automatic ways of learning structure and avoiding overfitting, and a principled foundation. • Disadvantages include a higher computational cost, depending on the approximate inference algorithm. • Bayesian neural networks have a long history and are undergoing a tremendous wave of revival. • There could be a lot of practical benefit in marrying Probabilistic Programming and Neural Networks.
  24. Thank you for the opportunity to present my ideas! Probabilistic

    Programming & Deep Learning. My Email: [email protected] My Twitter: @jrosskopf A.I. Meetup Stuttgart #1 - 25.06.17 - Joachim Rosskopf