
Trends in Machine Learning

SciPy 2013, Austin, TX

Video of the presentation here: http://www.youtube.com/watch?v=S6IbD86Dbvc

Olivier Grisel

June 27, 2013

Transcript

  1. Trends in Machine Learning and the SciPy Community. SciPy - Austin, TX - June 2013
  2. Outline • Black Box Models with scikit-learn • Probabilistic Programming with PyMC • Deep Learning with PyLearn2 & Theano
  3. Spam Classification. [Figure: feature matrix X whose rows are emails (email 1-5) and whose columns are word counts (word 1-6); target vector y: spam or not for each email.]
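     A minimal sketch of this layout with scikit-learn (the toy emails and the
     choice of vectorizer and classifier are illustrative, not from the talk):

       # emails -> word-count matrix X -> spam target y
       from sklearn.feature_extraction.text import CountVectorizer
       from sklearn.naive_bayes import MultinomialNB

       emails = ["cheap pills buy now", "meeting notes attached", "buy cheap now"]
       y = [1, 0, 1]  # 1 = spam, 0 = not spam

       vectorizer = CountVectorizer()
       X = vectorizer.fit_transform(emails)  # rows: emails, columns: word counts

       clf = MultinomialNB().fit(X, y)
       print(clf.predict(vectorizer.transform(["cheap pills meeting"])))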
  4. Topic Classification. [Figure: the same word-count matrix X with rows news 1-5; target y assigns each article one of Sport, Business, Tech.]
  5. Sentiment Analysis. [Figure: word-count matrix X with rows review 1-5; target y: positive or not for each review.]
  6. Vegetation Cover Type. [Figure: feature matrix X with rows location 1-5 and columns Latitude, Altitude, Distance to closest river, Altitude of closest river, Slope, Slope orientation; target y assigns each location one of Rain forest, Grassland, Arid, Ice.]
  7. Object Classification in Images. [Figure: matrix X of SIFT visual-word counts (SIFT word 1-6) with rows image 1-5; target y assigns each image one of Cat, Car, Pedestrian.]
  8. Many more applications... • Product Recommendations, given past purchase history of all users • Ad-Placement / bidding in Web Pages, given user browsing history / keywords • Fraud detection, given features derived from behavior
  9. Problem #1: Not So Blackbox • Feature Extraction: highly domain specific • + Feature Normalization / Transformation • Unmet Statistical Assumptions • Linear Separability of the target classes • Correlations between features • Natural metric for the features
  10. Problem #2: Lack of Explainability. Blackbox models can rarely explain what they learned. Expert knowledge is required to understand the model behavior and gain deeper insight into the data: this is model specific.
  11. Possible Solutions • Problem #1: Costly Feature Engineering • Unsupervised feature extraction with Deep Learning • Problem #2: Lack of Explainability • Probabilistic Programming with generic inference engines
  12. What is Prob. Programming? • Model unknown causes of a phenomenon with random variables • Write a programmatic story to derive observables from unknown variables • Plug data into observed variables • Use engine to invert the story and assign prob. distributions to unknown params.
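     A minimal sketch of that story with PyMC (PyMC 2-style API; the coin-flip
     model and the numbers are illustrative assumptions, not from the talk):

       import numpy as np
       import pymc as pm

       data = np.array([1, 0, 1, 1, 0, 1, 1, 1])      # observed coin flips

       p = pm.Uniform('p', lower=0, upper=1)           # unknown cause: bias of the coin
       obs = pm.Bernoulli('obs', p, value=data,
                          observed=True)               # story: flips are Bernoulli(p)

       mcmc = pm.MCMC([p, obs])                        # invert the story with MCMC
       mcmc.sample(iter=20000, burn=5000)
       print(mcmc.trace('p')[:].mean())                # posterior mean of the bias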
  13. Inverting the Story w/ Bayesian Inference: p(H|D) = p(D|H) * p(H) / p(D), where D: data, H: hypothesis (e.g. parameters), p(D|H): likelihood, p(H): prior, p(H|D): posterior, p(D): evidence
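     A tiny worked instance of the formula (the numbers are made up for
     illustration):

       # H = "email is spam", D = "email contains the word 'cheap'"
       p_H = 0.2                 # prior: 20% of emails are spam
       p_D_given_H = 0.5         # likelihood: 50% of spam emails contain "cheap"
       p_D_given_not_H = 0.05    # 5% of non-spam emails contain "cheap"
       p_D = p_D_given_H * p_H + p_D_given_not_H * (1 - p_H)   # evidence = 0.14
       p_H_given_D = p_D_given_H * p_H / p_D                   # posterior ~= 0.71
       print(p_H_given_D)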
  14. Generic Inference with MCMC • Markov Chain Monte Carlo • Start from a random point • Move variable values randomly • Accept or reject the new sample randomly depending on a likelihood test • Accumulate the non-rejected samples and call it the trace
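     A minimal sketch of those steps as a random-walk Metropolis sampler (a
     generic illustration, not the exact sampler used by PyMC or Stan):

       import numpy as np

       def metropolis(log_likelihood, n_samples=10000, step=0.5):
           x = np.random.randn()                        # start from a random point
           trace = []
           for _ in range(n_samples):
               proposal = x + step * np.random.randn()  # move the value randomly
               # accept or reject based on a likelihood ratio test
               if np.log(np.random.rand()) < log_likelihood(proposal) - log_likelihood(x):
                   x = proposal
               trace.append(x)                          # accumulate samples: the trace
           return np.array(trace)

       # example: sample from a standard normal "posterior"
       samples = metropolis(lambda x: -0.5 * x ** 2)
       print(samples.mean(), samples.std())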
  15. Alternatives to MCMC • Closed-Form Solutions • Belief Propagation • Deterministic Approximations: ‣ Mean Field Approximation ‣ Variational Bayes and VMP
  16. Alternatives to MCMC • Closed-Form Solutions • Belief Propagation • Deterministic Approximations: ‣ Mean Field Approximation ‣ Variational Bayes and VMP. Only VMP seems as generic as MCMC for Prob. Programming
  17. Implementations • Prob. Programming with MCMC • Stan: in C++ with R bindings • PyMC: in Python / NumPy / Theano • Prob. Programming with VMP • Infer.NET (C#, F#..., academic use only) • Infer.py (pythonnet bindings, very alpha)
  18. Why is Probabilistic Programming so hot? • Open Model that tells a Generative Story • Story Telling is good for Human Understanding and Persuasion • Grounded in Quantitative Analysis and the sound theory of Bayesian Inference • Black Box Inference Engine (e.g. MCMC): ‣ can be treated as Compiler Optimization
  19. Why Bayesian Inference? • Makes it possible to explicitly model uncertainty caused by lack of data using priors
  20. Prob. Programming not so hot (yet)? • Scalability? Accuracy? • Highly nonlinear dependencies lead to highly multi-modal posteriors • Hard to mix between posterior modes: slow convergence • How to best build models? How to choose priors?
  21. Old idea but recent developments • No-U-Turn Sampler (2011): breakthrough for scalability of MCMC for some model classes (in Stan and PyMC3 with Theano) • VMP (orig. paper 2005, generalized in 2011) in Infer.NET • New DARPA Program (2013-2017) to fund research on Prob. Programming.
  22. Learning Prob. Programming • Probabilistic Programming and Bayesian Methods for Hackers: Creative Commons book on GitHub, uses PyMC & the IPython notebook • Doing Bayesian Data Analysis: book with examples in R and BUGS
  23. A bit of history. It all started with connectionist models in the late 50s / early 60s
  24. [Timeline, 60s-10s: Perceptron by Rosenblatt (1957); Backprop by Rumelhart, Hinton & Williams (1986).]
  25. 2 Major Problems • In practice Backpropagation stops working with more than 1 or 2 hidden layers • Overfitting: forces early stopping
  26. Overfitting with Neural Networks. [Figure: learning curves; x-axis: number of passes over the training set (epochs); y-axis: average number of misclassified examples; one curve for the error on the training set, one for the error on the test set.]
  27. One exception • For Computer Vision: Convolutional Networks can learn deep hierarchies • Shared weights in the convolution kernels reduce the total number of parameters, hence limit the over-fitting problem of such nets • Only works if the task is translation invariant in the original feature space
  28. [Timeline, 60s-10s: Perceptron by Rosenblatt (1957); Backprop by Rumelhart, Hinton & Williams (1986); ConvNet by LeCun (1998).]
  29. What is depth in ML? • Architectural depth, not decision tree depth • The number of non-linearities between the unobserved “True”, “Real-World” factors of variation (causes) and the observed data (e.g. pixels in a robot’s camera) • A decision tree prediction function can be factored as a sum of products: depth = 1
  30. Common ML Architectures by Depth • Depth 0: Perceptron, Linear SVM, Logistic Regression, Multinomial Naive Bayes • Depth 1: NN with 1 hidden layer, Non-linear SVM, Decision Trees • Depth 2: NN with 2 hidden layers, Ensembles of Trees
  31. Generalizing the XOR problem to N dim • The Parity Function: given N boolean variables, return 1 if the number of positive values is even, 0 otherwise • Depth 1 models can learn the parity function but: • Need ~2^N hidden nodes / SVs • Require 1 example per local variation
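     A minimal sketch of the parity target as defined above (the code itself is
     illustrative):

       import numpy as np

       def parity(bits):
           # 1 if the number of positive values is even, 0 otherwise
           return 1 if int(np.sum(bits)) % 2 == 0 else 0

       # all 2^N inputs for N = 4: flipping any single bit flips the label,
       # which is why depth-1 models need ~2^N hidden nodes / support vectors
       N = 4
       X = np.array([[(i >> j) & 1 for j in range(N)] for i in range(2 ** N)])
       y = np.array([parity(x) for x in X])
       print(X.shape, y)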
  32. Depth 2+ models can be more compact • The parity function can be learned by a depth-2 NN with a number of hidden units that grows linearly with the dimensionality of the problem • Similar results for the Checker Board learning task
  33. So in the 90s and early 00s • The ML community moved away from NN • SVM with kernels: fewer hyper-parameters • Random Forests / Boosted Trees often beat all other models given enough labeled data and CPU time • The majority of Kaggle winners use ensembles of trees (up until recently...)
  34. But in 2006... • Breakthrough by Geoff Hinton at the U. of Toronto • Unsupervised Pre-training of Deep Architectures (Deep Belief Networks) • Can be unfolded into a traditional NN for fine tuning
  35. [Timeline, 60s-10s: Perceptron by Rosenblatt (1957); Backprop by Rumelhart, Hinton & Williams (1986); ConvNet by LeCun (1998); Unsupervised Pre-training by Hinton (2006).]
  36. [Diagram: stack of RBMs; the input data feeds an RBM producing Hidden Representation #1, a second RBM produces Hidden Representation #2, a third RBM produces Hidden Representation #3.]
  37. [Diagram: the same RBM stack, with Hidden Representation #3 feeding a classifier (clf) trained against the labels.]
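     A minimal sketch of such a pipeline with scikit-learn's BernoulliRBM feeding
     a classifier (a single RBM layer for brevity; the dataset and
     hyper-parameters are illustrative):

       from sklearn.datasets import load_digits
       from sklearn.linear_model import LogisticRegression
       from sklearn.neural_network import BernoulliRBM
       from sklearn.pipeline import Pipeline

       X, y = load_digits(return_X_y=True)
       X = X / 16.0                      # scale pixel values to [0, 1] for the RBM

       model = Pipeline([
           ("rbm", BernoulliRBM(n_components=100, learning_rate=0.05, n_iter=20)),
           ("clf", LogisticRegression(max_iter=1000)),
       ])
       model.fit(X, y)
       print(model.score(X, y))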
  38. Soon replicated and extended... • Bengio et al. at U. of Montreal • Ng et al. at Stanford • Replaced RBMs with various other models such as Auto-Encoders in a denoising setting or with a sparsity penalty • Started to reach state of the art in speech recognition
  39. Example: Convolutional DBN. "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations" by Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng
  40. [Timeline, 60s-10s: Perceptron by Rosenblatt (1957); Backprop by Rumelhart, Hinton & Williams (1986); ConvNet by LeCun (1998); Unsupervised Pre-training by Hinton (2006); Dropout by Hinton (2012).]
  41. Dropout • New way to train deep supervised neural networks with much less overfitting and without unsupervised pre-training • Allows NN to beat state of the art approaches on ImageNet (object classification in images)
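     A minimal sketch of the dropout idea on one hidden layer with NumPy
     (illustrative, not the paper's implementation):

       import numpy as np

       def hidden_layer(x, W, p_drop=0.5, training=True):
           h = np.maximum(0, x @ W)                      # ReLU activations
           if training:
               mask = np.random.rand(*h.shape) > p_drop  # randomly drop half the units
               return h * mask
           return h * (1 - p_drop)                       # rescale at test time

       x = np.random.randn(8, 20)    # batch of 8 examples, 20 features
       W = np.random.randn(20, 50)   # weights to 50 hidden units
       print(hidden_layer(x, W).shape)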
  42. Dropout: the end of overfitting? [Figure: learning curves; x-axis: number of passes over the training set (epochs); y-axis: average number of misclassified examples; one curve for the test error without dropout, one for the test error with dropout.]
  43. Even more recently • Maxout networks: • New non-linearity optimized for Dropout • Easier / faster to train • Implementation in Python / Theano • http://deeplearning.net/software/pylearn2/
  44. [Timeline, 60s-10s: Perceptron by Rosenblatt (1957); Backprop by Rumelhart, Hinton & Williams (1986); ConvNet by LeCun (1998); Unsupervised Pre-training by Hinton (2006); Dropout by Hinton (2012); Maxout, Fast Dropout, DropConnect, ... (2013).]
  45. Why is Deep Learning so hot? • Can automatically extract high level, invariant, discriminative features from raw data (pixels, sound frequencies...) • Starting to reach or beat State of the Art in several Speech Understanding and Computer Vision tasks • Stacked Abstractions and Composability might be a path to build a real AI
  46. Why is Deep Learning not so practical (yet)? • Requires lots of (labeled) training data • Typically requires running a GPU for days to fit a model + many hyperparameters • Not yet that useful with high level abstract input (e.g. text data): shallow models can already do very well for text classification
  47. Deep Learning: very hot research area • Big Industry Players (Google, Microsoft, IBM...) investing in DL for speech understanding and computer vision • Many top ML researchers are starting to look at DL, some on the theory side
  48. In Production for Speech Recognition. Google and Microsoft use Deep Auto Encoders for extracting features for Speech Recognition in Chrome, Android and Windows Phone
  49. In Production for Computer Vision
  50. Learn Prob. Programming • If you want to do data analysis with a priori knowledge about the data generation process from hidden causes • If you want to model the uncertainty of hidden causes using probability distributions • But don’t expect high predictive accuracy • PyMC is a good place to start in Python
  51. Learn Deep Learning • If you have many labeled samples • If you are a researcher in Speech Recognition or Computer Vision (or NLP) • If you are ready to invest time in learning the latest tricks • If you are ready to mess with GPUs • http://deeplearning.net
  52. Otherwise stick with scikit-learn for now • K-Means, Regularized Linear Models and Ensembles of Trees can get you pretty far • Fewer parameters to tune • Faster to train on CPUs • http://scikit-learn.org http://kaggle.com https://www.coursera.org/course/ml
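     A minimal sketch of that recipe with scikit-learn (the dataset and
     hyper-parameters are illustrative):

       from sklearn.datasets import load_digits
       from sklearn.ensemble import RandomForestClassifier
       from sklearn.model_selection import cross_val_score

       X, y = load_digits(return_X_y=True)
       forest = RandomForestClassifier(n_estimators=200, n_jobs=-1)
       print(cross_val_score(forest, X, y, cv=5).mean())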