
Trends in Machine Learning

SciPy 2013, Austin, TX

Video of the presentation here: http://www.youtube.com/watch?v=S6IbD86Dbvc

Olivier Grisel

June 27, 2013

Transcript

  1. Trends in Machine Learning and the SciPy Community. SciPy - Austin, TX - June 2013
  2. Outline • Black Box Models with scikit-learn • Probabilistic Programming with PyMC • Deep Learning with PyLearn2 & Theano
  3. Spam Classification. [Figure: feature matrix X whose rows are emails (email 1-5) and whose columns are word counts (word 1-6); target vector y: spam or not for each email.]
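     A minimal sketch of this layout with scikit-learn (the toy emails and the
     choice of vectorizer and classifier are illustrative, not from the talk):

       # emails -> word-count matrix X -> spam target y
       from sklearn.feature_extraction.text import CountVectorizer
       from sklearn.naive_bayes import MultinomialNB

       emails = ["cheap pills buy now", "meeting notes attached", "buy cheap now"]
       y = [1, 0, 1]  # 1 = spam, 0 = not spam

       vectorizer = CountVectorizer()
       X = vectorizer.fit_transform(emails)  # rows: emails, columns: word counts

       clf = MultinomialNB().fit(X, y)
       print(clf.predict(vectorizer.transform(["cheap pills meeting"])))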
  4. Topic Classification. [Figure: the same word-count matrix X with rows news 1-5; target y assigns each article one of Sport, Business, Tech.]
  5. Sentiment Analysis. [Figure: word-count matrix X with rows review 1-5; target y: positive or not for each review.]
  6. Vegetation Cover Type. [Figure: feature matrix X with rows location 1-5 and columns Latitude, Altitude, Distance to closest river, Altitude of closest river, Slope, Slope orientation; target y assigns each location one of Rain forest, Grassland, Arid, Ice.]
  7. Object Classification in Images. [Figure: matrix X of SIFT visual-word counts (SIFT word 1-6) with rows image 1-5; target y assigns each image one of Cat, Car, Pedestrian.]
  8. Many more applications... • Product Recommendations, given past purchase history of all users • Ad-Placement / bidding in Web Pages, given user browsing history / keywords • Fraud detection, given features derived from behavior
  9. Problem #1: Not So Blackbox • Feature Extraction: highly domain specific • + Feature Normalization / Transformation • Unmet Statistical Assumptions • Linear Separability of the target classes • Correlations between features • Natural metric for the features
  10. Problem #2: Lack of Explainability. Blackbox models can rarely explain what they learned. Expert knowledge is required to understand the model behavior and gain deeper insight into the data: this is model specific.
  11. Possible Solutions • Problem #1: Costly Feature Engineering • Unsupervised feature extraction with Deep Learning • Problem #2: Lack of Explainability • Probabilistic Programming with generic inference engines
  12. What is Prob. Programming? • Model unknown causes of a phenomenon with random variables • Write a programmatic story to derive observables from unknown variables • Plug data into observed variables • Use engine to invert the story and assign prob. distributions to unknown params.
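     A minimal sketch of that story with PyMC (PyMC 2-style API; the coin-flip
     model and the numbers are illustrative assumptions, not from the talk):

       import numpy as np
       import pymc as pm

       data = np.array([1, 0, 1, 1, 0, 1, 1, 1])      # observed coin flips

       p = pm.Uniform('p', lower=0, upper=1)           # unknown cause: bias of the coin
       obs = pm.Bernoulli('obs', p, value=data,
                          observed=True)               # story: flips are Bernoulli(p)

       mcmc = pm.MCMC([p, obs])                        # invert the story with MCMC
       mcmc.sample(iter=20000, burn=5000)
       print(mcmc.trace('p')[:].mean())                # posterior mean of the bias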
  13. Inverting the Story w/ Bayesian Inference: p(H|D) = p(D|H) * p(H) / p(D), where D: data, H: hypothesis (e.g. parameters), p(D|H): likelihood, p(H): prior, p(H|D): posterior, p(D): evidence
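     A tiny worked instance of the formula (the numbers are made up for
     illustration):

       # H = "email is spam", D = "email contains the word 'cheap'"
       p_H = 0.2                 # prior: 20% of emails are spam
       p_D_given_H = 0.5         # likelihood: 50% of spam emails contain "cheap"
       p_D_given_not_H = 0.05    # 5% of non-spam emails contain "cheap"
       p_D = p_D_given_H * p_H + p_D_given_not_H * (1 - p_H)   # evidence = 0.14
       p_H_given_D = p_D_given_H * p_H / p_D                   # posterior ~= 0.71
       print(p_H_given_D)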
  14. Generic Inference with MCMC • Markov Chain Monte Carlo • Start from a random point • Move variable values randomly • Accept or reject the new sample randomly depending on a likelihood test • Accumulate the non-rejected samples and call it the trace
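     A minimal sketch of those steps as a random-walk Metropolis sampler (a
     generic illustration, not the exact sampler used by PyMC or Stan):

       import numpy as np

       def metropolis(log_likelihood, n_samples=10000, step=0.5):
           x = np.random.randn()                        # start from a random point
           trace = []
           for _ in range(n_samples):
               proposal = x + step * np.random.randn()  # move the value randomly
               # accept or reject based on a likelihood ratio test
               if np.log(np.random.rand()) < log_likelihood(proposal) - log_likelihood(x):
                   x = proposal
               trace.append(x)                          # accumulate samples: the trace
           return np.array(trace)

       # example: sample from a standard normal "posterior"
       samples = metropolis(lambda x: -0.5 * x ** 2)
       print(samples.mean(), samples.std())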
  15. Alternatives to MCMC • Closed-Form Solutions • Belief Propagation • Deterministic Approximations: ‣ Mean Field Approximation ‣ Variational Bayes and VMP
  16. Alternatives to MCMC • Closed-Form Solutions • Belief Propagation • Deterministic Approximations: ‣ Mean Field Approximation ‣ Variational Bayes and VMP. Only VMP seems as generic as MCMC for Prob. Programming
  17. Implementations • Prob. Programming with MCMC • Stan: in C++ with R bindings • PyMC: in Python / NumPy / Theano • Prob. Programming with VMP • Infer.NET (C#, F#..., academic use only) • Infer.py (pythonnet bindings, very alpha)
  18. Why is Probabilistic Programming so hot? • Open Model that tells a Generative Story • Story Telling is good for Human Understanding and Persuasion • Grounded in Quantitative Analysis and the sound theory of Bayesian Inference • Black Box Inference Engine (e.g. MCMC): ‣ can be treated as Compiler Optimization
  19. Why Bayesian Inference? • Makes it possible to explicitly model uncertainty caused by lack of data using priors
  20. Prob. Programming not so hot (yet)? • Scalability? Accuracy? • Highly nonlinear dependencies lead to highly multi-modal posteriors • Hard to mix between posterior modes: slow convergence • How to best build models? How to choose priors?
  21. Old idea but recent developments • No-U-Turn Sampler (2011): breakthrough for scalability of MCMC for some model classes (in Stan and PyMC3 with Theano) • VMP (orig. paper 2005, generalized in 2011) in Infer.NET • New DARPA Program (2013-2017) to fund research on Prob. Programming.
  22. Learning Prob. Programming • Probabilistic Programming and Bayesian Methods for Hackers: Creative Commons book on GitHub, uses PyMC & the IPython notebook • Doing Bayesian Data Analysis: book with examples in R and BUGS
  23. A bit of history. It all started with connectionist models in the late 50s / early 60s
  24. [Timeline, 60s-10s: Perceptron by Rosenblatt (1957); Backprop by Rumelhart, Hinton & Williams (1986).]
  25. 2 Major Problems • In practice Backpropagation stops working with more than 1 or 2 hidden layers • Overfitting: forces early stopping
  26. Overfitting with Neural Networks. [Figure: learning curves; x-axis: number of passes over the training set (epochs); y-axis: average number of misclassified examples; one curve for the error on the training set, one for the error on the test set.]
  27. One exception • For Computer Vision: Convolutional Networks can learn deep hierarchies • Shared weights in the convolution kernels reduce the total number of parameters, hence limit the over-fitting problem of such nets • Only works if the task is translation invariant in the original feature space
  28. [Timeline, 60s-10s: Perceptron by Rosenblatt (1957); Backprop by Rumelhart, Hinton & Williams (1986); ConvNet by LeCun (1998).]
  29. What is depth in ML? • Architectural depth, not decision tree depth • The number of non-linearities between the unobserved “True”, “Real-World” factors of variation (causes) and the observed data (e.g. pixels in a robot’s camera) • A decision tree prediction function can be factored as a sum of products: depth = 1
  30. Common ML Architectures by Depth • Depth 0: Perceptron, Linear SVM, Logistic Regression, Multinomial Naive Bayes • Depth 1: NN with 1 hidden layer, Non-linear SVM, Decision Trees • Depth 2: NN with 2 hidden layers, Ensembles of Trees
  31. Generalizing the XOR problem to N dim • The Parity Function: given N boolean variables, return 1 if the number of positive values is even, 0 otherwise • Depth 1 models can learn the parity function but: • Need ~2^N hidden nodes / SVs • Require 1 example per local variation
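     A minimal sketch of the parity target as defined above (the code itself is
     illustrative):

       import numpy as np

       def parity(bits):
           # 1 if the number of positive values is even, 0 otherwise
           return 1 if int(np.sum(bits)) % 2 == 0 else 0

       # all 2^N inputs for N = 4: flipping any single bit flips the label,
       # which is why depth-1 models need ~2^N hidden nodes / support vectors
       N = 4
       X = np.array([[(i >> j) & 1 for j in range(N)] for i in range(2 ** N)])
       y = np.array([parity(x) for x in X])
       print(X.shape, y)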
  32. Depth 2+ models can be more compact • The parity function can be learned by a depth-2 NN with a number of hidden units that grows linearly with the dimensionality of the problem • Similar results for the Checker Board learning task
  33. So in the 90s and early 00s • The ML community moved away from NN • SVM with kernels: fewer hyper-parameters • Random Forests / Boosted Trees often beat all other models given enough labeled data and CPU time • The majority of Kaggle winners use ensembles of trees (up until recently...)
  34. But in 2006... • Breakthrough by Geoff Hinton at the U. of Toronto • Unsupervised Pre-training of Deep Architectures (Deep Belief Networks) • Can be unfolded into a traditional NN for fine tuning
  35. [Timeline, 60s-10s: Perceptron by Rosenblatt (1957); Backprop by Rumelhart, Hinton & Williams (1986); ConvNet by LeCun (1998); Unsupervised Pre-training by Hinton (2006).]
  36. [Diagram: stack of RBMs; the input data feeds an RBM producing Hidden Representation #1, a second RBM produces Hidden Representation #2, a third RBM produces Hidden Representation #3.]
  37. [Diagram: the same RBM stack, with Hidden Representation #3 feeding a classifier (clf) trained against the labels.]
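     A minimal sketch of such a pipeline with scikit-learn's BernoulliRBM feeding
     a classifier (a single RBM layer for brevity; the dataset and
     hyper-parameters are illustrative):

       from sklearn.datasets import load_digits
       from sklearn.linear_model import LogisticRegression
       from sklearn.neural_network import BernoulliRBM
       from sklearn.pipeline import Pipeline

       X, y = load_digits(return_X_y=True)
       X = X / 16.0                      # scale pixel values to [0, 1] for the RBM

       model = Pipeline([
           ("rbm", BernoulliRBM(n_components=100, learning_rate=0.05, n_iter=20)),
           ("clf", LogisticRegression(max_iter=1000)),
       ])
       model.fit(X, y)
       print(model.score(X, y))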
  38. Soon replicated and extended... • Bengio et al. at U. of Montreal • Ng et al. at Stanford • Replaced RBMs with various other models such as Auto-Encoders in a denoising setting or with a sparsity penalty • Started to reach state of the art in speech recognition
  39. Example: Convolutional DBN. "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations" by Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng
  40. [Timeline, 60s-10s: Perceptron by Rosenblatt (1957); Backprop by Rumelhart, Hinton & Williams (1986); ConvNet by LeCun (1998); Unsupervised Pre-training by Hinton (2006); Dropout by Hinton (2012).]
  41. Dropout • New way to train deep supervised neural networks with much less overfitting and without unsupervised pre-training • Allows NN to beat state of the art approaches on ImageNet (object classification in images)
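     A minimal sketch of the dropout idea on one hidden layer with NumPy
     (illustrative, not the paper's implementation):

       import numpy as np

       def hidden_layer(x, W, p_drop=0.5, training=True):
           h = np.maximum(0, x @ W)                      # ReLU activations
           if training:
               mask = np.random.rand(*h.shape) > p_drop  # randomly drop half the units
               return h * mask
           return h * (1 - p_drop)                       # rescale at test time

       x = np.random.randn(8, 20)    # batch of 8 examples, 20 features
       W = np.random.randn(20, 50)   # weights to 50 hidden units
       print(hidden_layer(x, W).shape)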
  42. Dropout: the end of overfitting? [Figure: learning curves; x-axis: number of passes over the training set (epochs); y-axis: average number of misclassified examples; one curve for the test error without dropout, one for the test error with dropout.]
  43. Even more recently • Maxout networks: • New non-linearity optimized for Dropout • Easier / faster to train • Implementation in Python / Theano • http://deeplearning.net/software/pylearn2/
  44. [Timeline, 60s-10s: Perceptron by Rosenblatt (1957); Backprop by Rumelhart, Hinton & Williams (1986); ConvNet by LeCun (1998); Unsupervised Pre-training by Hinton (2006); Dropout by Hinton (2012); Maxout, Fast Dropout, DropConnect, ... (2013).]
  45. Why is Deep Learning so hot? • Can automatically extract high level, invariant, discriminative features from raw data (pixels, sound frequencies...) • Starting to reach or beat State of the Art in several Speech Understanding and Computer Vision tasks • Stacked Abstractions and Composability might be a path to build a real AI
  46. Why is Deep Learning not so practical (yet)? • Requires lots of (labeled) training data • Typically requires running a GPU for days to fit a model + many hyperparameters • Not yet that useful with high level abstract input (e.g. text data): shallow models can already do very well for text classification
  47. Deep Learning: very hot research area • Big Industry Players (Google, Microsoft, IBM...) investing in DL for speech understanding and computer vision • Many top ML researchers are starting to look at DL, some on the theory side
  48. In Production for Speech Recognition. Google and Microsoft use Deep Auto Encoders for extracting features for Speech Recognition in Chrome, Android and Windows Phone
  49. In Production for Computer Vision
  50. Learn Prob. Programming • If you want to do data analysis with a priori knowledge about the data generation process from hidden causes • If you want to model the uncertainty of hidden causes using probability distributions • But don’t expect high predictive accuracy • PyMC is a good place to start in Python
  51. Learn Deep Learning • If you have many labeled samples • If you are a researcher in Speech Recognition or Computer Vision (or NLP) • If you are ready to invest time in learning the latest tricks • If you are ready to mess with GPUs • http://deeplearning.net
  52. Otherwise stick with scikit-learn for now • K-Means, Regularized Linear Models and Ensembles of Trees can get you pretty far • Fewer parameters to tune • Faster to train on CPUs • http://scikit-learn.org http://kaggle.com https://www.coursera.org/course/ml
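     A minimal sketch of that recipe with scikit-learn (the dataset and
     hyper-parameters are illustrative):

       from sklearn.datasets import load_digits
       from sklearn.ensemble import RandomForestClassifier
       from sklearn.model_selection import cross_val_score

       X, y = load_digits(return_X_y=True)
       forest = RandomForestClassifier(n_estimators=200, n_jobs=-1)
       print(cross_val_score(forest, X, y, cv=5).mean())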