
Deep Learning: An Introduction from the NLP Perspective by Kevin Duh

Jie Bao
June 19, 2013


Transcript

  1. Disclaimer

    I am not (yet) an expert in Deep Learning. Let me know if these slides contain any mistakes.
    The focus here is Natural Language Processing (NLP); I'm glossing over much active work in Vision & Speech.
    There is a lot of good tutorial material online, some of it borrowed here:
    [Bengio, 2009]: excellent short book summarizing the area
    [Socher et al., 2012a]: tutorial with video
    Step-by-step code based on the Theano Python library: http://deeplearning.net/tutorial/

  2. Outline

    1 Introduction
    2 Neural Networks: Preliminaries; 1-Layer & 2-Layer Nets; Neural Language Models
    3 Deep Learning Approach 1: Deep Belief Nets (Preliminaries; Restricted Boltzmann Machines; Deep Belief Nets)
    4 Deep Learning Approach 2: Stacked Auto-Encoders (Auto-Encoders; Stacked Auto-Encoders; Denoising Auto-Encoders and Variants)

  3. What is Deep Learning?

    A model (e.g. a neural network) with many layers, trained in a layer-wise way.
    An approach for unsupervised learning of feature representations, at successively higher levels.
    These two definitions are closely related, but correspond to different motivations.

  4. Why explore Deep Learning?

    1 It can model complex non-linear phenomena
    2 It learns a distributed feature representation
    3 It learns a hierarchical feature representation
    4 It can exploit unlabeled data

  5. #1 Modeling complex non-linearities

    Given the same number of units (with non-linear activations), a deeper architecture is more expressive than a shallow one [Bishop, 1995].

  6. #2 Distributed Feature Representations

    One-hot representations are common in NLP (vector dimension = vocabulary size):
    "dog" = [1, 0, 0, ..., 0], "cat" = [0, 1, 0, ..., 0], "the" = [0, 0, 0, ..., 1]
    "dog" and "cat" share zero similarity, just like "dog" and "the".

    Word clustering has proven effective in many tasks (vector dimension = number of clusters):
    "dog" = [1, 0, 0, 0], "cat" = [1, 0, 0, 0] ("dog" and "cat" were clustered together), "the" = [0, 1, 0, 0]
    similarity(dog, cat) > similarity(dog, the) = 0

    A distributed representation (distinct from a "distributional representation") is a multi-clustering, modeling factors like POS & semantics:
    "dog" = [1, 0, 0.9, 0.0], "cat" = [1, 0, 0.5, 0.2], "the" = [0, 1, 0.0, 0.0]

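    To make the three representations concrete, here is a small NumPy sketch; the toy vectors and the cosine-similarity helper are invented for illustration and are not from the slides:

      import numpy as np

      def cosine(a, b):
          # Cosine similarity; 0 for orthogonal one-hot vectors.
          return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

      # One-hot: every pair of distinct words has similarity 0.
      dog_1hot, cat_1hot, the_1hot = np.eye(3)

      # Hard clustering: "dog" and "cat" fall in the same cluster.
      dog_clust, cat_clust = np.array([1., 0., 0., 0.]), np.array([1., 0., 0., 0.])
      the_clust = np.array([0., 1., 0., 0.])

      # Distributed: real-valued features (e.g. POS-like and semantic-like factors).
      dog_dist = np.array([1, 0, 0.9, 0.0])
      cat_dist = np.array([1, 0, 0.5, 0.2])
      the_dist = np.array([0, 1, 0.0, 0.0])

      print(cosine(dog_1hot, cat_1hot))    # 0.0
      print(cosine(dog_clust, cat_clust))  # 1.0
      print(cosine(dog_dist, cat_dist))    # positive, while cosine(dog_dist, the_dist) = 0.0
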
  7. #3 Hierarchical Feature Representations

    Hierarchical features effectively capture part-and-whole relationships and naturally address multi-task problems [Lee et al., 2009].

  8. #4 Exploiting Unlabeled Data

    Unsupervised & semi-supervised learning will be standard (my prediction for 2020):
    Engineering question: unlabeled data is more abundant than labeled data.
    Scientific question: children learn language (syntax, meaning, etc.) mostly from raw unlabeled data.
    Layer-wise pre-training in Deep Learning: a good model of the input P(X) can help in training P(Y|X).
    "If you want to do computer vision, first learn computer graphics." – Geoff Hinton

  9. Some (personal) skepticism

    1 There are other ways to learn distributed representations, e.g. topic models for documents; concatenating multiple word-clustering solutions (has anyone tried this?); dictionary learning and sparse reconstruction methods.
    2 Are multiple levels of representation really necessary in NLP? For vision problems there is a clear analogy to the brain's structure, but for language? Maybe: compositionality and recursion in natural language.
    3 Is black magic required for effective training, e.g. hyper-parameter tuning and large computational resources?

  10. Research Opportunities in NLP

    1 Improving on current state-of-the-art results on standard tasks.
    2 Encoding linguistic knowledge into the training process: current methods are relatively generic and incorporate little domain knowledge.
    3 Integrating deep learning into current NLP pipelines, in particular how to handle structured prediction problems over sequences and trees.

  11. What we'll cover here

    1 Neural Language Models & Distributed Word Representations: not sure if they're "deep", but they're relevant to what we're interested in, and the basic math here is useful for later material.
    2 Restricted Boltzmann Machines & Deep Belief Nets: Deep Learning Approach #1, the original generative model.
    3 Auto-Encoders, Denoising Auto-Encoders, and Stacked Denoising Auto-Encoders: Deep Learning Approach #2, competitive with #1 and perhaps easier to train.

  12. Aside: A Brief History

    Early days of AI: invention of the artificial neuron [McCulloch and Pitts, 1943] & the perceptron [Rosenblatt, 1958].
    AI Winter: [Minsky and Papert, 1969] showed the perceptron only learns linearly separable concepts.
    Revival in the 1980s: Multi-Layer Perceptrons (MLP) and back-propagation [Rumelhart et al., 1986].
    Other directions (1990s to present): SVMs, Bayesian networks.
    Revival in 2006: deep learning [Hinton et al., 2006].
    Recent successes in applications: speech at IBM/Toronto [Sainath et al., 2011] and Microsoft [Dahl et al., 2012]; vision at Google/Stanford [Le et al., 2012].

  13. Outline

    Next: 2 Neural Networks (Preliminaries; 1-Layer & 2-Layer Nets; Neural Language Models).

  14. Basic Setup of Machine Learning

    Training data: a set of pairs (x(m), y(m)), m = 1, 2, .., M, where each input x(m) ∈ R^d and each output y(m) ∈ {0, 1}; e.g. x = document, y = spam or not.
    Goal: learn a function f : x → y that predicts correctly on new inputs x.
    Step 1: choose a model family, e.g.
    f(x) = σ(w^T x)  (logistic regression, a.k.a. a 1-layer net)
    f(x) = sign(w^T x)  (perceptron)
    f(x) = sign(Σ_m w_m k(x, x(m)))  (SVM)
    Step 2: optimize the parameters w on the training data, e.g. minimize the loss function min_w Σ_{m=1..M} (f_w(x(m)) − y(m))^2.

  15. 1-Layer Nets (logistic regression)

    Function model: f(x) = σ(w^T x + b).
    Parameters: a vector w ∈ R^d and a scalar bias term b.
    σ is a non-linearity: σ(z) = 1 / (1 + exp(−z)).
    For simplicity we sometimes write f(x) = σ(w^T x), where w = [w; b] and x = [x; 1].
    Non-linearity will be important for the expressiveness of multi-layer nets. Other non-linearities are also used, e.g. tanh.

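    A minimal NumPy sketch of this 1-layer net, with the bias absorbed into the weight vector as in the slide; the toy weights and input are invented:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def one_layer_net(w, x):
          # f(x) = sigma(w^T x), where w = [w; b] and x = [x; 1].
          x_aug = np.append(x, 1.0)        # absorb the bias term
          return sigmoid(w @ x_aug)

      w = np.array([0.5, -1.0, 2.0, 0.1])  # last entry plays the role of b
      x = np.array([1.0, 0.0, 3.0])
      print(one_layer_net(w, x))           # a value in (0, 1)
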
  16. Training 1-Layer Nets

    Easiest method: gradient descent.
    Let Loss(w) = Σ_m (σ(w^T x(m)) − y(m))^2.
    Gradient: ∇_w Loss = Σ_m 2 (σ(w^T x(m)) − y(m)) σ(w^T x(m)) (1 − σ(w^T x(m))) x(m).
    General form of the gradient: Error ∗ σ′(in) ∗ x.
    Stochastic gradient descent algorithm:
    1 Initialize w
    2 For each sample (x(m), y(m)) in the training set
    3     w ← w − γ (Error ∗ σ′(in) ∗ x(m))
    4 Repeat steps 2–3 until some condition is satisfied
    Some practical tricks for the learning rate γ & the stopping condition are needed for quick training and good generalization.

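    A sketch of this stochastic gradient descent loop for the squared loss, using the Error ∗ σ′(in) ∗ x form of the gradient; the learning rate, epoch count, and toy data are assumptions:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def sgd_train(X, y, gamma=0.5, epochs=1000):
          # X: (M, d) inputs already augmented with a constant-1 column; y: (M,) labels in {0, 1}.
          w = np.zeros(X.shape[1])                       # 1. initialize w
          for _ in range(epochs):                        # 4. repeat until some condition is satisfied
              for x_m, y_m in zip(X, y):                 # 2. for each training sample
                  f = sigmoid(w @ x_m)
                  error = f - y_m                        # prediction error (constant factor absorbed into gamma)
                  w -= gamma * error * f * (1 - f) * x_m # 3. w <- w - gamma * Error * sigma'(in) * x
          return w

      X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
      y = np.array([0., 0., 0., 1.])                     # a linearly separable toy problem (logical AND)
      w = sgd_train(X, y)
      print(sigmoid(X @ w))                              # the last (positive) example should score highest
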
  17. 2-Layer Nets (MLP, Multi-Layer Perceptron)

    [Figure: a 2-layer network with inputs x_i, a hidden layer h_j connected by weights w_ij, and an output y connected by weights w_j.]
    f(x) = σ(Σ_j w_j h_j) = σ(Σ_j w_j σ(Σ_i w_ij x_i))

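    A NumPy sketch of this forward pass; the weight shapes and values are invented:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def mlp_forward(W1, w2, x):
          # f(x) = sigma( sum_j w_j * sigma( sum_i w_ij * x_i ) )
          h = sigmoid(W1 @ x)        # hidden layer: h_j = sigma(sum_i w_ij x_i)
          return sigmoid(w2 @ h)     # output: sigma(sum_j w_j h_j)

      rng = np.random.default_rng(0)
      W1 = rng.normal(size=(3, 4))   # 4 inputs -> 3 hidden units
      w2 = rng.normal(size=3)        # 3 hidden units -> 1 output
      x = np.array([1.0, 0.5, -0.2, 0.0])
      print(mlp_forward(W1, w2, x))
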
  18. Training 2-Layer Nets: Backpropagation

    Recall that the gradient for 1-layer nets has the form ∂Loss/∂w_j = Error ∗ σ′(in) ∗ x_j.
    We just need the chain rule to take derivatives through 2 layers. For the 2-layer network on the previous slide:
    ∂Loss/∂w_j = [y − f(x)] f′(x) h_j
    ∂Loss/∂w_ij = [y − f(x)] f′(x) w_j σ′(Σ_i w_ij x_i) x_i
    where f′(x) denotes the derivative of the output non-linearity.
    Note:
    1 First, run the sample through the network to get the output f(x).
    2 Then "errors" are propagated back, and weights are adjusted according to their "responsibility".
    3 The problem is not convex (it may have several local optima).

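    A sketch of this backward pass under squared loss; note the slide writes the error as [y − f(x)], whereas the sketch below computes gradients of Loss = 0.5 (f − y)^2, which flips the sign for a descent update. The toy data and shapes are invented:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def backprop(W1, w2, x, y):
          # Forward pass.
          a = W1 @ x                                  # pre-activations of the hidden layer
          h = sigmoid(a)
          f = sigmoid(w2 @ h)                         # network output
          # Backward pass for Loss = 0.5 * (f - y)^2.
          delta_out = (f - y) * f * (1 - f)           # Error * sigma'(in) at the output
          grad_w2 = delta_out * h                     # dLoss/dw_j  = delta_out * h_j
          delta_hid = delta_out * w2 * h * (1 - h)    # propagate "responsibility" to hidden units
          grad_W1 = np.outer(delta_hid, x)            # dLoss/dw_ij = delta_hid_j * x_i
          return grad_W1, grad_w2

      rng = np.random.default_rng(1)
      W1, w2 = rng.normal(size=(3, 4)), rng.normal(size=3)
      x, y = rng.normal(size=4), 1.0
      gW1, gw2 = backprop(W1, w2, x, y)
      print(gW1.shape, gw2.shape)                     # (3, 4) (3,)
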
  19. Definition of "Depth"

    Depth depends on the elementary computational elements: weighted sum, product, single neuron, kernel, logic gate.
    1-layer (linear classifiers): logistic regression / maximum entropy classifier, perceptron, linear SVM.
    2-layer (universal approximators): most MLPs (except some convolutional neural nets), SVMs with kernels, Gaussian processes, decision trees.
    3-layer or more (compact universal approximators): Deep Learning, boosted decision trees, random forests.

  20. Neural Language Models [Bengio et al., 2003]

    Motivation: use neural nets to learn continuous distributed representations of words. This addresses the curse of dimensionality arising from one-hot representations of discrete variables.
    Architecture (see the figure on the next slide):
    C(·) are the learned word representations, of dimension m.
    The history context x = [C(w_{t−1}); C(w_{t−2}); C(w_{t−3})] is compressed to a hidden layer of h nodes via tanh(Hx).
    A final output mapping with a softmax gives the probabilities p(w_t | x).

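    A minimal NumPy sketch of this forward pass: look up the embeddings C, concatenate the context, apply tanh(Hx), and map to softmax probabilities. The vocabulary size, dimensions, and weights are invented, and the direct input-to-output connections of the original model are omitted:

      import numpy as np

      def softmax(z):
          z = z - z.max()                     # numerical stability
          e = np.exp(z)
          return e / e.sum()

      V, m, h, n = 10, 4, 5, 3                # vocab size, embedding dim, hidden units, context length
      rng = np.random.default_rng(0)
      C = rng.normal(size=(V, m))             # learned word representations C(w)
      H = rng.normal(size=(h, n * m))         # hidden-layer weights
      U = rng.normal(size=(V, h))             # output weights

      def neural_lm_probs(context_word_ids):
          # x = [C(w_{t-1}); C(w_{t-2}); C(w_{t-3})], compressed via tanh(Hx), then softmax.
          x = np.concatenate([C[w] for w in context_word_ids])
          hidden = np.tanh(H @ x)
          return softmax(U @ hidden)          # p(w_t | context), a vector over the vocabulary

      p = neural_lm_probs([7, 2, 5])
      print(p.shape, p.sum())                 # probabilities over the vocabulary; sums to 1
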
  21. Neural Language Models [Bengio et al., 2003]

    [Figure: the neural language model architecture of Bengio et al., 2003.]

  22. Distributed Representations: many possibilities

    1 Neural networks & neural language models: hidden layers serve as learned representations; we can view this as analogous to learning a kernel.
    2 Principal Component Analysis (PCA), factor analysis: a linear transform to decorrelated features, h = W^T x + b.
    3 Sparse coding: h* = argmin_h ||x − W h||_2^2 + λ ||h||_1.
    4 Also: manifold embeddings, ICA, and various other unsupervised methods.

  23. Summary: things to remember about Neural Nets

    1 Stacking layers of non-linearities (e.g. σ) is critical for the expressive power of neural nets.
    2 The hidden layers of neural nets can serve as distributed representations.
    3 Backpropagation training is just gradient descent, applied with the chain rule.
    4 Unfortunately, training beyond 2 layers is often difficult due to local optima and vanishing gradients.

  24. Minimal Reading List for Neural Language Models

    Original neural LM paper: [Bengio et al., 2003]
    Alternate training criteria & architecture: [Collobert et al., 2011]
    Hierarchical distributed representations: [Mnih and Hinton, 2008]
    Handling large data (code also available): [Mikolov et al., 2011, Schwenk et al., 2012]
    Applications in NLP: [Turian et al., 2010]

  25. Outline

    Next: 3 Deep Learning Approach 1: Deep Belief Nets (Preliminaries; Restricted Boltzmann Machines; Deep Belief Nets).

  26. Motivation

    Goal: discover useful latent features h from data x.
    One possibility, directed graphical models: model p(x, h) = p(x|h) p(h), where p(x|h) is the likelihood and p(h) is the prior.
    Directed: we can think of h as a "cause". Given h = 1, what is the probability of x?

  27. The explaining-away effect in directed graphical models

    p(h1) and p(h2) are a priori independent, but dependent given x: p(h1, h2 | x) ≠ p(h1 | x) · p(h2 | x).
    Thus the posterior p(h|x), which is needed for features or for deep learning, is not easy to compute.
    Example: x = the grass is wet; h1 = it rained last night; h2 = the water sprinkler was on.

  28. Undirected Graphical Models (aka MRFs, Markov Random Fields)

    An MRF models p(x, h) = (1/Z_θ) Π_i φ_i(x) Π_j η_j(h) Π_k ν_k(x, h) as a product of un-normalized potentials.
    θ are the parameters; Z_θ is the (potentially expensive) normalization constant.
    The clique potentials φ_i(x), η_j(h), ν_k(x, h) describe interactions among input, hidden, and input–hidden variables.
    Boltzmann Machines define p(x, h) = (1/Z_θ) exp(−E_θ(x, h)), where x and h are binary variables and
    E_θ(x, h) = −(1/2) x^T U x − (1/2) h^T V h − x^T W h − b^T x − d^T h,
    with θ = {U, V, W, b, d} as parameters.
    The posterior p(h|x) of a Boltzmann Machine is also intractable, e.g. p(h_j | x) = Σ_{h_1} .. Σ_{h_{j−1}} Σ_{h_{j+1}} .. p(h | x).

  29. Restricted Boltzmann Machine (RBM)

    An RBM is p(x, h) = (1/Z_θ) exp(−E_θ(x, h)) with only h–x interactions: E_θ(x, h) = −x^T W h − b^T x − d^T h.
    [Figure: a bipartite graph between visible units x_1, x_2, x_3 and hidden units h_1, h_2, h_3.]
    The conditional distribution over the hidden units factorizes: p(h|x) = Π_j p(h_j|x), with p(h_j = 1 | x) = σ(Σ_i w_ij x_i + d_j).
    Similarly, p(x|h) = Π_i p(x_i|h), with p(x_i = 1 | h) = σ(Σ_j w_ij h_j + b_i).
    Computing the posterior p(h|x), or features E[h|x], is easy.
    Note that the partition function Z_θ is still expensive, so approximation is required during parameter learning.

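    A small NumPy sketch of these factorized conditionals for a binary RBM; the sizes and weights are invented:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def p_h_given_x(W, d, x):
          # p(h_j = 1 | x) = sigma(sum_i w_ij x_i + d_j); factorizes over j.
          return sigmoid(W.T @ x + d)

      def p_x_given_h(W, b, h):
          # p(x_i = 1 | h) = sigma(sum_j w_ij h_j + b_i); factorizes over i.
          return sigmoid(W @ h + b)

      rng = np.random.default_rng(0)
      n_visible, n_hidden = 6, 3
      W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
      b, d = np.zeros(n_visible), np.zeros(n_hidden)

      x = rng.integers(0, 2, size=n_visible).astype(float)
      ph = p_h_given_x(W, d, x)                      # easy-to-compute features E[h | x]
      h_sample = (rng.random(n_hidden) < ph) * 1.0   # Bernoulli sample of the hidden units
      print(ph, h_sample)
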
  30. Training RBMs

    Gradient of the log-likelihood with respect to a weight w_ij:

    ∇_wij log P_w(x = x(m))
      = ∇_wij log Σ_h P_w(x = x(m), h)                                                      (1)
      = ∇_wij log Σ_h (1/Z_w) exp(−E_w(x(m), h))                                            (2)
      = −∇_wij log Z_w + ∇_wij log Σ_h exp(−E_w(x(m), h))                                   (3)
      = (1/Z_w) Σ_{x,h} exp(−E_w(x, h)) ∇_wij E_w(x, h)
        − (1 / Σ_h exp(−E_w(x(m), h))) Σ_h exp(−E_w(x(m), h)) ∇_wij E_w(x(m), h)
      = Σ_{x,h} P_w(x, h) [∇_wij E_w(x, h)] − Σ_h P_w(h | x(m)) [∇_wij E_w(x(m), h)]        (4)
      = −E_{p(x,h)}[x_i · h_j] + E_{p(h|x=x(m))}[x_i(m) · h_j]                              (5)

  31. Training RBMs with Contrastive Divergence

    In the previous derivation, the first term E_{p(x,h)}[x_i · h_j] is expensive to compute.
    Gibbs sampling (sample x, then h, iteratively) works, but re-running it for each gradient step is slow.
    Contrastive Divergence is a faster but biased method that initializes the chain with the training data:
    1 ĥ ∼ P(h | x(m))
    2 x̃ ∼ P(x | ĥ);  h̃ ∼ P(h | x̃)
    3 w_ij ← w_ij + γ Σ_batch (x_i(m) · ĥ_j − x̃_i · h̃_j)

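    A sketch of one CD-1 update following these three steps; the learning rate, sizes, and batch are assumptions, and the bias updates are the usual analogues rather than something given on the slide:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def cd1_update(W, b, d, x_batch, rng, gamma=0.05):
          # One Contrastive Divergence (CD-1) step on a batch of binary visible vectors (rows of x_batch).
          ph = sigmoid(x_batch @ W + d)
          h_hat = (rng.random(ph.shape) < ph) * 1.0                 # 1. h_hat ~ P(h | x^(m))
          px = sigmoid(h_hat @ W.T + b)
          x_tilde = (rng.random(px.shape) < px) * 1.0               # 2. x_tilde ~ P(x | h_hat)
          ph_tilde = sigmoid(x_tilde @ W + d)
          h_tilde = (rng.random(ph_tilde.shape) < ph_tilde) * 1.0   #    h_tilde ~ P(h | x_tilde)
          # 3. w_ij <- w_ij + gamma * sum_batch (x_i^(m) h_hat_j - x_tilde_i h_tilde_j)
          W += gamma * (x_batch.T @ h_hat - x_tilde.T @ h_tilde)
          b += gamma * (x_batch - x_tilde).sum(axis=0)              # analogous updates for the biases
          d += gamma * (h_hat - h_tilde).sum(axis=0)
          return W, b, d

      rng = np.random.default_rng(1)
      n_visible, n_hidden = 6, 3
      W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
      b, d = np.zeros(n_visible), np.zeros(n_hidden)
      x_batch = rng.integers(0, 2, size=(8, n_visible)).astype(float)
      W, b, d = cd1_update(W, b, d, x_batch, rng)
      print(np.abs(W).mean())
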
  32. Deep Belief Nets (DBN)

    A DBN stacks RBMs layer by layer to get a deep architecture.
    Layer-wise pre-training is critical: first, train an RBM to learn a 1st layer of features h from the input x; then treat h as the input and learn a 2nd layer of features. Each added layer improves a variational lower bound on the log probability of the training data.
    Further fine-tuning can be obtained with the Wake-Sleep algorithm:
    Do a stochastic bottom-up pass (adjust weights to reconstruct the layer below).
    Do a few iterations of Gibbs sampling at the top-level RBM.
    Do a stochastic top-down pass (adjust weights to reconstruct the layer above).
    Note: not to be confused with Dynamic Bayesian Nets or Deep Boltzmann Machines.

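    In outline, the greedy layer-wise loop looks like the following sketch. The layer trainer here is only a random-projection stub standing in for real RBM training (e.g. the CD-1 sketch above); it is an assumption for illustration, not the actual training procedure:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def train_layer_stub(X, n_hidden, rng):
          # Placeholder for RBM training; returns weights and hidden biases.
          # Here it is just a random projection, for illustration only.
          W = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))
          d = np.zeros(n_hidden)
          return W, d

      def greedy_pretrain(X, layer_sizes, rng):
          # Train layer 1 on x, then treat E[h|x] as the "data" for layer 2, and so on.
          layers, data = [], X
          for n_hidden in layer_sizes:
              W, d = train_layer_stub(data, n_hidden, rng)
              layers.append((W, d))
              data = sigmoid(data @ W + d)      # features fed to the next layer
          return layers

      rng = np.random.default_rng(0)
      X = rng.integers(0, 2, size=(100, 20)).astype(float)
      stack = greedy_pretrain(X, layer_sizes=[10, 5], rng=rng)
      print([W.shape for W, _ in stack])        # [(20, 10), (10, 5)]
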
  33. Summary: things to remember about DBNs

    1 Layer-wise pre-training is the innovation that enabled training deep architectures.
    2 Pre-training optimizes the likelihood of the data, not the target label. The philosophy is to first model p(x) in order to do better on p(y|x).
    3 Why use an undirected graphical model like the RBM? Because p(h|x) is computationally tractable (no "explaining-away" effect), so stacking RBMs into DBNs is feasible.
    4 Learning an RBM still requires approximate inference (e.g. contrastive divergence), since the partition function is expensive.

  34. Minimal Reading List for RBM/DBN

    Original DBN paper: [Hinton et al., 2006]
    Why does unsupervised pre-training help deep learning? [Erhan et al., 2010]
    A successful application in collaborative filtering: [Salakhutdinov et al., 2007]

  35. Outline

    Next: 4 Deep Learning Approach 2: Stacked Auto-Encoders (Auto-Encoders; Stacked Auto-Encoders; Denoising Auto-Encoders and Variants).

  36. Auto-Encoders

    Auto-encoders are a simpler, non-probabilistic alternative to RBMs. Define an encoder and a decoder and pass the data through them:
    Encoder: h = f_θ(x), e.g. h = σ(Wx + b)
    Decoder: x' = g_θ'(h), e.g. x' = σ(W'h + d)
    W and W' need not be tied, but often are in practice.
    Encourage θ to give small reconstruction error, e.g. Loss = Σ_m ||x(m) − g_θ'(f_θ(x(m)))||^2.
    A linear encoder/decoder with squared reconstruction error learns the same subspace as PCA.
    A sigmoid encoder/decoder gives the same form of p(h|x) and p(x|h) as an RBM.

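    A NumPy sketch of a tied-weight auto-encoder (W' = W^T) and its reconstruction loss; the sizes and data are invented, and no training step is shown:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def encode(W, b, x):
          return sigmoid(W @ x + b)             # h = f_theta(x) = sigma(Wx + b)

      def decode(W, d, h):
          return sigmoid(W.T @ h + d)           # x' = g(h) = sigma(W'h + d), tied: W' = W^T

      def reconstruction_loss(W, b, d, X):
          # Loss = sum_m || x^(m) - g(f(x^(m))) ||^2
          return sum(np.sum((x - decode(W, d, encode(W, b, x))) ** 2) for x in X)

      rng = np.random.default_rng(0)
      n_in, n_hid = 8, 3                        # fewer hidden units than inputs (a bottleneck)
      W = rng.normal(scale=0.1, size=(n_hid, n_in))
      b, d = np.zeros(n_hid), np.zeros(n_in)
      X = rng.random((50, n_in))
      print(reconstruction_loss(W, b, d, X))
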
  37. Architecture: Stacked Auto-Encoders

    Auto-encoders can be stacked in the same way RBMs are stacked to give deep architectures.
    Hidden unit size: the hidden layer should be lower-dimensional, or else the auto-encoder may just learn the identity mapping.
    Alternatively, allow more hidden units but enforce sparsity.

  38. Denoising Auto-Encoders

    First, perturb the input data x to x̃ using invariances from domain knowledge. Then reconstruct the original data, e.g. Loss = Σ_m ||x(m) − g_θ'(f_θ(x̃(m)))||^2.
    [Vincent et al., 2010] explored Gaussian noise and salt-and-pepper noise for vision data.
    [Glorot et al., 2011] explored masking noise (randomly setting inputs to 0) for text data.

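    Relative to the plain auto-encoder sketch, the only change is corrupting the input before encoding while still reconstructing the clean input. A sketch with masking noise; the corruption rate and sizes are assumptions:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def mask_noise(x, rng, corruption=0.3):
          # Masking noise: randomly set a fraction of the inputs to 0.
          keep = rng.random(x.shape) >= corruption
          return x * keep

      def denoising_loss(W, b, d, X, rng):
          # Loss = sum_m || x^(m) - g(f(x_tilde^(m))) ||^2 : reconstruct the CLEAN input.
          total = 0.0
          for x in X:
              x_tilde = mask_noise(x, rng)
              h = sigmoid(W @ x_tilde + b)
              x_rec = sigmoid(W.T @ h + d)      # tied decoder, as in the previous sketch
              total += np.sum((x - x_rec) ** 2)
          return total

      rng = np.random.default_rng(0)
      W = rng.normal(scale=0.1, size=(3, 8))
      b, d = np.zeros(3), np.zeros(8)
      X = rng.random((20, 8))
      print(denoising_loss(W, b, d, X, rng))
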
  39. Predictive Sparse Decomposition [Kavukcuoglu et al., 2008]

    Objective (minimized with respect to h, W, and θ): Σ_m λ ||h(m)||_1 + ||x(m) − W h(m)||_2^2 + ||h(m) − f_θ(x(m))||_2^2.
    The first two terms are similar to sparse coding; the third term learns a fast encoder that approximates the sparse code.

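    A sketch that only evaluates this objective for given codes and parameters; the actual method alternates optimization over h and the parameters, and the simple linear encoder below is an invented stand-in for the fast encoder f_θ:

      import numpy as np

      def psd_objective(H, W, We, be, X, lam=0.1):
          # sum_m  lam*||h^(m)||_1 + ||x^(m) - W h^(m)||_2^2 + ||h^(m) - f_theta(x^(m))||_2^2
          # with a simple linear encoder f_theta(x) = We x + be standing in for the fast encoder.
          total = 0.0
          for h, x in zip(H, X):
              f = We @ x + be
              total += lam * np.abs(h).sum() + np.sum((x - W @ h) ** 2) + np.sum((h - f) ** 2)
          return total

      rng = np.random.default_rng(0)
      n_in, n_code, M = 8, 12, 10               # over-complete code, kept sparse by the L1 term
      W = rng.normal(scale=0.1, size=(n_in, n_code))
      We, be = rng.normal(scale=0.1, size=(n_code, n_in)), np.zeros(n_code)
      X = rng.random((M, n_in))
      H = rng.random((M, n_code))
      print(psd_objective(H, W, We, be, X))
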
  40. Summary: things to remember about Stacked Auto-Encoders

    1 Auto-encoders are computationally cheaper alternatives to RBMs. We stack them into deep architectures in the same way we stack RBMs into DBNs.
    2 Auto-encoders learn to "compress" and "reconstruct" the input data. Low reconstruction error corresponds to an encoding that captures the main variations in the data. Again, the focus is on modeling p(x) first.
    3 Many variants of encoders exist, and some provide effective ways to incorporate expert domain knowledge.

  41. Minimal Reading List for Stacked Auto-Encoders

    Original stacked auto-encoder paper: [Bengio et al., 2006]
    Comparison of optimization methods: [Le et al., 2011]
    Speeding up the reconstruction-error computation for large word vectors: [Dauphin et al., 2011]
    Denoising auto-encoders: [Vincent et al., 2010]

  42. Selected Readings for NLPers

    Deep Learning applications in NLP:
    Sentiment analysis: [Glorot et al., 2011]
    Parsing: [Socher et al., 2011b, Collobert et al., 2011, Collobert, 2011]
    Paraphrase detection: [Socher et al., 2011a]
    Learning lexical semantics: [Huang et al., 2012, Socher et al., 2012b]
    Applications in other fields, but worth reading:
    A good reference that defines many terms popular in Deep Learning vision papers: [Jarrett et al., 2009]
    "Deep learning of cats": entirely unsupervised learning of high-level features on massive datasets [Le et al., 2012]

  43. References

    Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. NOW Publishers.
    Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research.
    Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2006). Greedy layer-wise training of deep networks. In NIPS'06, pages 153–160.
    Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
    Collobert, R. (2011). Deep learning for efficient discriminative parsing. In AISTATS.
    Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.
    Dahl, G., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Deep Learning for Speech and Language Processing.
    Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with reconstruction sampling. In ICML'11, pages 945–952.
    Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010). Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660.
    Glorot, X., Bordes, A., and Bengio, Y. (2011). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML.
    Hinton, G., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554.
    Huang, E., Socher, R., Manning, C., and Ng, A. (2012). Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–882, Jeju Island, Korea.
    Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision (ICCV).
    Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008). Fast inference in sparse coding algorithms with applications to object recognition. Technical Report CBLL-TR-2008-12-01, Computational and Biological Learning Lab, Courant Institute, NYU.
    Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. (2011). On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML'11), pages 265–272, New York, NY, USA. ACM.
    Le, Q. V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., and Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning. In ICML.
    Lee, H., Grosse, R., Ranganath, R., and Ng, A. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML.
    McCulloch, W. S. and Pitts, W. H. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–137.
    Mikolov, T., Deoras, A., Povey, D., Burget, L., and Černocký, J. (2011). Strategies for training large scale neural network language models. In ASRU.
    Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
    Mnih, A. and Hinton, G. (2008). A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems 21 (NIPS 2008).
    Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408.
    Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323:533–536.
    Sainath, T. N., Kingsbury, B., Ramabhadran, B., Fousek, P., Novak, P., and Mohamed, A. (2011). Making deep belief networks effective for large vocabulary continuous speech recognition. In ASRU.
    Salakhutdinov, R., Mnih, A., and Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning (ICML'07), pages 791–798.
    Schwenk, H., Rousseau, A., and Attik, M. (2012). Large, pruned or continuous space language models on a GPU for statistical machine translation. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pages 11–19, Montréal, Canada.
    Socher, R., Bengio, Y., and Manning, C. (2012a). Deep learning for NLP (without the magic). ACL Tutorials. http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial
    Socher, R., Huang, E. H., Pennin, J., Ng, A. Y., and Manning, C. D. (2011a). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS.
    Socher, R., Huval, B., Manning, C. D., and Ng, A. Y. (2012b). Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 1201–1211, Jeju Island, Korea.
    Socher, R., Lin, C., Ng, A. Y., and Manning, C. D. (2011b). Parsing natural scenes and natural language with recursive neural networks. In ICML.
    Turian, J., Ratinov, L.-A., and Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden.
    Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408.