Recent Advances in Deep Learning by Kevin Duh

Jie Bao
June 19, 2013

Transcript

  1. Recent Advances in Deep Learning Kevin Duh Nara Institute of

    Science and Technology Graduate School of Information Science Feb 15, 2013
  2. Outline 1 Background Knowledge in Deep Learning Motivation and Definitions

    Approach 1: Deep Belief Nets [Hinton et al., 2006] Approach 2: Stacked Auto-Encoders [Bengio et al., 2006] 2 Recent Advances in Applications Computer Vision Speech Recognition Natural Language Processing 3/47
  3. Outline 1 Background Knowledge in Deep Learning Motivation and Definitions

    Approach 1: Deep Belief Nets [Hinton et al., 2006] Approach 2: Stacked Auto-Encoders [Bengio et al., 2006] 2 Recent Advances in Applications Computer Vision Speech Recognition Natural Language Processing 4/47
  4. Basic Problem Setup in Machine Learning Training Data: a set

    of (x^(m), y^(m)), m = 1, 2, ..., M pairs, where input x^(m) ∈ R^d and output y^(m) ∈ {0, 1}, e.g. x = document, y = spam or not. Goal: Learn a function f : x → y that predicts correctly on new inputs x. 5/47
  5. Basic Problem Setup in Machine Learning Training Data: a set

    of (x^(m), y^(m)), m = 1, 2, ..., M pairs, where input x^(m) ∈ R^d and output y^(m) ∈ {0, 1}, e.g. x = document, y = spam or not. Goal: Learn a function f : x → y that predicts correctly on new inputs x. Step 1: Choose a model family: e.g. logistic regression, support vector machines, neural networks 5/47
  6. Basic Problem Setup in Machine Learning Training Data: a set

    of (x^(m), y^(m)), m = 1, 2, ..., M pairs, where input x^(m) ∈ R^d and output y^(m) ∈ {0, 1}, e.g. x = document, y = spam or not. Goal: Learn a function f : x → y that predicts correctly on new inputs x. Step 1: Choose a model family: e.g. logistic regression, support vector machines, neural networks Step 2: Optimize the parameters w on the Training Data, e.g. minimize the loss function min_w Σ_{m=1}^{M} (f_w(x^(m)) − y^(m))² 5/47
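A minimal sketch of the Step 1 / Step 2 recipe above, assuming a logistic-regression model family and plain gradient descent on the squared loss; the function names and hyper-parameters are illustrative, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=100):
    """Step 2: fit parameters w, b by minimizing sum_m (f_w(x^(m)) - y^(m))^2.

    X: (M, d) array of inputs x^(m); y: (M,) array of labels in {0, 1}.
    """
    M, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        f = sigmoid(X @ w + b)                  # predictions f_w(x^(m))
        grad_z = 2 * (f - y) * f * (1 - f)      # squared-loss gradient w.r.t. pre-activation
        w -= lr * (X.T @ grad_z) / M            # gradient step on weights
        b -= lr * grad_z.mean()                 # gradient step on bias
    return w, b
```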
  7. Example: Neural Networks [diagram: feed-forward network with inputs x1–x4, hidden units h1–h3, and output y; weights w_ij connect inputs to hidden units, weights w_j connect hidden units to the output]

    f(x) = σ(Σ_j w_j · h_j) = σ(Σ_j w_j · σ(Σ_i w_ij x_i)) 6/47
  8. Example: Neural Networks [diagram: the same network, annotated with "Predict f(x^(m))"]

    1. Predict f(x^(m)) given the current weights w. 7/47
  9. Example: Neural Networks [diagram: the same network, annotated with "Predict f(x^(m))" and "Adjust weights"]

    2. If f(x^(m)) ≠ y^(m), back-propagate the error and adjust the weights. 8/47
  10. Example: Neural Networks [diagram: the same network, annotated with "Predict f(x^(m))" and "Adjust weights"]

    2. If f(x^(m)) ≠ y^(m), back-propagate the error and adjust the weights. More layers = more expressive, but networks with more than 2 layers were difficult to train due to local optima. 8/47
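A hedged sketch of the prediction and weight-adjustment steps on slides 7–10, for the single-hidden-layer network f(x) = σ(Σ_j w_j · σ(Σ_i w_ij x_i)) with squared error; parameter shapes and the learning rate are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W, w, lr=0.1):
    """x: (d,) input; y: target in {0, 1}; W: (d, k) input-to-hidden weights w_ij;
    w: (k,) hidden-to-output weights w_j. Updates W and w in place."""
    # 1. Predict f(x) given the current weights.
    h = sigmoid(x @ W)                        # hidden activations h_j
    f = sigmoid(h @ w)                        # output f(x)
    # 2. If f(x) != y, back-propagate the error and adjust the weights.
    delta_out = (f - y) * f * (1 - f)         # output-layer error signal
    delta_hid = delta_out * w * h * (1 - h)   # hidden-layer error signals
    w -= lr * delta_out * h                   # adjust hidden-to-output weights
    W -= lr * np.outer(x, delta_hid)          # adjust input-to-hidden weights
    return f
```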
  11. What is Deep Learning? Two definitions. A model (e.g. neural

    network) with many layers, trained in a layer-wise way 9/47
  12. What is Deep Learning? Two definitions. A model (e.g. neural

    network) with many layers, trained in a layer-wise way An approach for unsupervised learning of feature representations, at successively higher levels 9/47
  13. Advantages of Deep Learning 1 It can model complex non-linear

    phenomena 2 It learns a distributed and hierarchical feature representation 3 It can exploit unlabeled data 10/47
  14. #1 Modeling complex non-linearities Given same number of units (with

    non-linear activation), a deeper architecture is more expressive than a shallow one [Bishop, 1995] 11/47
  15. #2 Distributed & Hierarchical Feature Representations Distributed and Hierarchical features

    parsimoniously model part-and-whole relationships in data [Lee et al., 2009] 12/47
  16. #3 Exploiting Unlabeled Data Labels are expensive. Unlabeled data are

    abundant. Unlike Neural Nets, Deep Learning can exploit unlabeled data. Focus on modeling the input P(X), rather than P(Y|X). "If you want to do computer vision, first learn computer graphics." – Geoff Hinton 13/47
  17. #3 Exploiting Unlabeled Data Labels are expensive. Unlabeled data are

    abundant. Unlike Neural Nets, Deep Learning can exploit unlabeled data. Focus on modeling the input P(X), rather than P(Y|X). "If you want to do computer vision, first learn computer graphics." – Geoff Hinton Scientific question: Children learn mostly from raw unlabeled data. Can we teach computers to do this too? 13/47
  18. Outline 1 Background Knowledge in Deep Learning Motivation and Definitions

    Approach 1: Deep Belief Nets [Hinton et al., 2006] Approach 2: Stacked Auto-Encoders [Bengio et al., 2006] 2 Recent Advances in Applications Computer Vision Speech Recognition Natural Language Processing 14/47
  19. General Approach for Deep Learning Recall the problem setup: Learn

    function f : x → y But rather than doing this directly, we first learn hidden features h that model the input x, i.e. x → h → y 15/47
  20. General Approach for Deep Learning Recall the problem setup: Learn

    function f : x → y But rather than doing this directly, we first learn hidden features h that model the input x, i.e. x → h → y How do we discover useful latent features h from the data x? Different Deep Learning methods differ by this basic component, e.g. Deep Belief Nets use Restricted Boltzmann Machines (RBMs) 15/47
  21. Deep Belief Nets (DBN) = Stacked RBM

    [diagram: inputs x1–x3 connected to hidden units h1–h3 (Layer 1 RBM), whose hidden units are in turn connected to a second layer of hidden units (Layer 2 RBM)] 16/47
  22. Key Idea: Layer-wise Pre-training [diagram: the stacked RBMs, with "Train Layer 1 RBM" highlighted]

    Train one layer at a time. Each layer improves the bound on the data likelihood. 17/47
  23. Key Idea: Layer-wise Pre-training [diagram: the stacked RBMs, with "Train Layer 2 RBM" highlighted]

    Train one layer at a time. Each layer improves the bound on the data likelihood. 18/47
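A hedged sketch of the greedy layer-wise pre-training loop described on slides 22–23; the helper train_rbm (e.g. contrastive-divergence training of a single RBM, as sketched further below) is a hypothetical function, not something defined on the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_dbn(X, hidden_sizes, train_rbm):
    """X: (M, d) unlabeled data; hidden_sizes: e.g. [500, 500, 200].
    train_rbm(data, n_hidden) -> (W, b_visible, d_hidden) is assumed."""
    layers = []
    data = X
    for n_hidden in hidden_sizes:
        W, b_vis, d_hid = train_rbm(data, n_hidden)  # train this layer's RBM only
        layers.append((W, b_vis, d_hid))
        data = sigmoid(data @ W + d_hid)             # feed p(h|x) upward as the next layer's data
    return layers
```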
  24. Supervised Fine-Tuning [diagram: the pre-trained stack x → h → h' with an output unit y on top, annotated with "Predict f(x^(m))" and "Adjust weights"]

    Pre-training seems to initialize the weights so that fine-tuning is fast and has good generalization [Erhan et al., 2010] 19/47
  25. Restricted Boltzmann Machine (RBM) RBM (a simple Markov Random Field):

    p(x, h) = (1/Z_θ) exp(−E_θ(x, h)), with only h–x interactions: E_θ(x, h) = −xᵀWh − bᵀx − dᵀh [diagram: bipartite graph between visible units x1–x3 and hidden units h1–h3] 20/47
  26. Restricted Boltzmann Machine (RBM) RBM (a simple Markov Random Field):

    p(x, h) = (1/Z_θ) exp(−E_θ(x, h)), with only h–x interactions: E_θ(x, h) = −xᵀWh − bᵀx − dᵀh [diagram: bipartite graph between visible units x1–x3 and hidden units h1–h3] Computing the posterior p(h|x) is easy! p(h|x) = Π_j p(h_j|x), p(h_j = 1|x) = σ(Σ_i w_ij x_i + d_j) 20/47
  27. Restricted Boltzmann Machine (RBM) RBM (a simple Markov Random Field):

    p(x, h) = (1/Z_θ) exp(−E_θ(x, h)), with only h–x interactions: E_θ(x, h) = −xᵀWh − bᵀx − dᵀh [diagram: bipartite graph between visible units x1–x3 and hidden units h1–h3] Computing the posterior p(h|x) is easy! p(h|x) = Π_j p(h_j|x), p(h_j = 1|x) = σ(Σ_i w_ij x_i + d_j) Note the partition function Z_θ is still expensive, so approximation is required during parameter learning 20/47
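A hedged sketch of the tractable posterior p(h_j = 1 | x) = σ(Σ_i w_ij x_i + d_j) together with one step of CD-1 (contrastive divergence), a standard approximation used because Z_θ is intractable; the slides do not prescribe this particular learning rule, so treat the details as assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(x, W, b, d, lr=0.01):
    """x: (n_visible,) binary input; W: (n_visible, n_hidden); b, d: visible/hidden biases."""
    # Positive phase: exact posterior over hidden units given the data.
    p_h = sigmoid(x @ W + d)                                  # p(h_j = 1 | x)
    h = (np.random.rand(*p_h.shape) < p_h).astype(float)      # sample h ~ p(h | x)
    # Negative phase: one Gibbs step back down and up again ("reconstruction").
    p_x_recon = sigmoid(h @ W.T + b)
    p_h_recon = sigmoid(p_x_recon @ W + d)
    # Approximate gradient of the data log-likelihood (contrastive divergence).
    W += lr * (np.outer(x, p_h) - np.outer(p_x_recon, p_h_recon))
    b += lr * (x - p_x_recon)
    d += lr * (p_h - p_h_recon)
    return W, b, d
```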
  28. Summary: things to remember about DBNs 1 Layer-wise pre-training is

    the innovation that enabled deep learning. 21/47
  29. Summary: things to remember about DBNs 1 Layer-wise pre-training is

    the innovation that enabled deep learning. 2 Pre-training focuses on optimizing likelihood on the data, not the target label. First model p(x) in order to do better on p(y|x). 21/47
  30. Summary: things to remember about DBNs 1 Layer-wise pre-training is

    the innovation that enabled deep learning. 2 Pre-training focuses on optimizing likelihood on the data, not the target label. First model p(x) in order to do better on p(y|x). 3 Why RBM? p(h|x) is tractable (no "explaining away" effect), so it's easy to stack. 21/47
  31. Summary: things to remember about DBNs 1 Layer-wise pre-training is

    the innovation that enabled deep learning. 2 Pre-training focuses on optimizing likelihood on the data, not the target label. First model p(x) in order to do better on p(y|x). 3 Why RBM? p(h|x) is tractable (no "explaining away" effect), so it's easy to stack. 4 Current research focus: (1) faster Bayesian inference, (2) alternative probabilistic models 21/47
  32. Outline 1 Background Knowledge in Deep Learning Motivation and Definitions

    Approach 1: Deep Belief Nets [Hinton et al., 2006] Approach 2: Stacked Auto-Encoders [Bengio et al., 2006] 2 Recent Advances in Applications Computer Vision Speech Recognition Natural Language Processing 22/47
  33. Auto-Encoders: simpler alternatives to RBMs [diagram: inputs x1–x3 encoded into hidden units h1–h2, then decoded back into reconstructions x̂1–x̂3]

    Encoder: h = σ(Wx + b) Decoder: x̂ = σ(W′h + d) 23/47
  34. Auto-Encoders: simpler alternatives to RBMs [diagram: inputs x1–x3 encoded into hidden units h1–h2, then decoded back into reconstructions x̂1–x̂3]

    Encoder: h = σ(Wx + b) Decoder: x̂ = σ(W′h + d) Encourage h to give a small reconstruction error, e.g. Loss = Σ_m ||x^(m) − DECODER(ENCODER(x^(m)))||² 23/47
  35. Auto-Encoders: simpler alternatives to RBMs [diagram: inputs x1–x3 encoded into hidden units h1–h2, then decoded back into reconstructions x̂1–x̂3]

    Encoder: h = σ(Wx + b) Decoder: x̂ = σ(W′h + d) Encourage h to give a small reconstruction error, e.g. Loss = Σ_m ||x^(m) − DECODER(ENCODER(x^(m)))||² A linear encoder/decoder with squared reconstruction error learns the same subspace as PCA. 23/47
  36. Auto-Encoders: simpler alternatives to RBMs [diagram: inputs x1–x3 encoded into hidden units h1–h2, then decoded back into reconstructions x̂1–x̂3]

    Encoder: h = σ(Wx + b) Decoder: x̂ = σ(W′h + d) Encourage h to give a small reconstruction error, e.g. Loss = Σ_m ||x^(m) − DECODER(ENCODER(x^(m)))||² A linear encoder/decoder with squared reconstruction error learns the same subspace as PCA. A sigmoid encoder/decoder gives the same form of p(h|x), p(x|h) as RBMs. 23/47
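A hedged sketch of training one sigmoid auto-encoder by gradient descent on the reconstruction loss Σ_m ||x^(m) − DECODER(ENCODER(x^(m)))||²; the initialization, learning rate, and untied decoder weights W′ below are assumptions for illustration, not details from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=200, seed=0):
    """X: (M, d) inputs scaled to [0, 1]. Returns encoder (W, b) and decoder (Wp, d) parameters."""
    rng = np.random.default_rng(seed)
    M, d_in = X.shape
    W = rng.normal(0.0, 0.1, (d_in, n_hidden))    # encoder weights W
    Wp = rng.normal(0.0, 0.1, (n_hidden, d_in))   # decoder weights W'
    b = np.zeros(n_hidden)
    d = np.zeros(d_in)
    for _ in range(epochs):
        H = sigmoid(X @ W + b)                    # ENCODER: h = sigmoid(Wx + b)
        X_hat = sigmoid(H @ Wp + d)               # DECODER: x_hat = sigmoid(W'h + d)
        err = X_hat - X                           # reconstruction error
        g_out = err * X_hat * (1 - X_hat)         # back-propagate through the decoder
        g_hid = (g_out @ Wp.T) * H * (1 - H)      # ... and through the encoder
        Wp -= lr * (H.T @ g_out) / M
        d -= lr * g_out.mean(axis=0)
        W -= lr * (X.T @ g_hid) / M
        b -= lr * g_hid.mean(axis=0)
    return W, b, Wp, d
```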
  37. Architecture: Stacked Auto-Encoders Auto-encoders can be stacked in the same

    way RBMs are stacked to give Deep Architectures Many variants: Enforce compression (lower dimensional h) Enforce sparsity (high dimensional h) [Ranzato et al., 2006] Incorporate domain knowledge, e.g. denoising auto-encoders [Vincent et al., 2010] 24/47
  38. Denoising Auto-Encoders [diagram: corrupted inputs x̃1–x̃3 encoded into hidden units h1–h2, then decoded back into reconstructions x̂1–x̂3]

    Encoder: h = σ(Wx̃ + b) Decoder: x̂ = σ(W′h + d) Perturb the input data x to x̃ using invariances from domain knowledge. Force ||x − x̂|| to be small. 25/47
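A hedged sketch of the "perturb x to x̃" step: masking noise is one common choice of corruption (the slide only says to use invariances from domain knowledge); training then proceeds exactly as for a plain auto-encoder, except that the clean x is reconstructed from the corrupted x̃.

```python
import numpy as np

def corrupt(X, drop_prob=0.3, seed=0):
    """Masking noise: randomly zero out a fraction of each input's dimensions."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) >= drop_prob
    return X * mask

# Usage sketch: X_tilde = corrupt(X); encode X_tilde, decode to X_hat,
# and minimize ||X - X_hat||^2, where the target is the uncorrupted X.
```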
  39. Summary: things to remember about Stacked Autoencoders 1 Auto-encoders are

    cheaper alternatives to RBMs. 2 Auto-encoders learn to "compress" and "re-construct" input data. Again, the focus is on modeling p(x) first. 26/47
  40. Summary: things to remember about Stacked Autoencoders 1 Auto-encoders are

    cheaper alternatives to RBMs. 2 Auto-encoders learn to "compress" and "re-construct" input data. Again, the focus is on modeling p(x) first. 3 Many variants, some provide ways to incorporate domain knowledge. 4 Research focus: (1) more variants, esp. sparse coding, (2) large-scale implementations 26/47
  41. Outline 1 Background Knowledge in Deep Learning Motivation and Definitions

    Approach 1: Deep Belief Nets [Hinton et al., 2006] Approach 2: Stacked Auto-Encoders [Bengio et al., 2006] 2 Recent Advances in Applications Computer Vision Speech Recognition Natural Language Processing 27/47
  42. Building High-Level Features using Large Scale Unsupervised Learning [Le et

    al., 2012] Question: Is it possible to learn a face detector using only unlabeled images? 28/47
  43. Building High-Level Features using Large Scale Unsupervised Learning [Le et

    al., 2012] Question: Is it possible to learn a face detector using only unlabeled images? Answer: yes. 28/47
  44. Building High-Level Features using Large Scale Unsupervised Learning [Le et

    al., 2012] Question: Is it possible to learn a face detector using only unlabeled images? Answer: yes. Using a 9-layer network with 1 billion parameters, 10 million images (sampled from YouTube), 1000 machines (16,000 cores) × 1 week. 28/47
  45. Architecture

    min_{W_d, W_e, H} ||W_d W_e x^(m) − x^(m)||²  (1)  + H(W_e x^(m))²  (2)   (1): autoencoder   (2): pooling/sparsity 29/47
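A hedged sketch of evaluating the two-part objective above: a reconstruction term (1) plus a pooling/sparsity term (2). The exact pooling function and weighting used by Le et al. (2012) are not shown on the slide, so the simple squared pooling response and the weight lam below are assumptions for illustration.

```python
import numpy as np

def objective(x, W_e, W_d, H, lam=0.1):
    """x: (d,) input; W_e: (k, d) encoder; W_d: (d, k) decoder; H: (p, k) pooling matrix."""
    code = W_e @ x
    reconstruction = np.sum((W_d @ code - x) ** 2)   # term (1): autoencoder reconstruction
    pooling = np.sum((H @ code) ** 2)                # term (2): pooling/sparsity penalty
    return reconstruction + lam * pooling
```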
  46. Outline 1 Background Knowledge in Deep Learning Motivation and Definitions

    Approach 1: Deep Belief Nets [Hinton et al., 2006] Approach 2: Stacked Auto-Encoders [Bengio et al., 2006] 2 Recent Advances in Applications Computer Vision Speech Recognition Natural Language Processing 36/47
  47. Deep Neural Networks for Acoustic Modeling in Speech Recognition: 4

    research groups share their views [Hinton et al., 2012] Deep Learning has had an impressive impact on speech. Already commercialized! 37/47
  48. Deep Neural Networks for Acoustic Modeling in Speech Recognition: 4

    research groups share their views [Hinton et al., 2012] Deep Learning has had an impressive impact on speech. Already commercialized! Word Error Rate Results: 37/47
  49. Training Procedure 1 Train Deep Belief Nets on speech features:

    typically 3–8 layers, 2000 units/layer, 15 frames of input, 6000 outputs 2 Fine-tune with per-frame phone labels obtained from traditional Gaussian models 3 Further discriminative training in conjunction with a higher-level Hidden Markov Model Why it works: larger context and less hand-engineered preprocessing. 38/47
  50. Outline 1 Background Knowledge in Deep Learning Motivation and Definitions

    Approach 1: Deep Belief Nets [Hinton et al., 2006] Approach 2: Stacked Auto-Encoders [Bengio et al., 2006] 2 Recent Advances in Applications Computer Vision Speech Recognition Natural Language Processing 39/47
  51. My Interest: Learning Robust Representations of Words How should a

    word be represented in a natural language understanding system? 40/47
  52. My Interest: Learning Robust Representations of Words How should a

    word be represented in a natural language understanding system? How to handle new words, i.e. linguistic productivity? What is the mechanism for combining words into meanings? How does the brain do it? 40/47
  53. My Interest: Learning Robust Representations of Words How should a

    word be represented in a natural language understanding system? How to handle new words, i.e. linguistic productivity? What is the mechanism for combining words into meanings? How does the brain do it? Humans can handle such novel words or meanings; computers can't. "Can you friend me on Facebook?" 40/47
  54. My Interest: Learning Robust Representations of Words How should a

    word be represented in a natural language understanding system? How to handle new words, i.e. linguistic productivity? What is the mechanism for combining words into meanings? How does the brain do it? Humans can handle such novel words or meanings; computers can't. "Can you friend me on Facebook?" "He is such a Republicrat!" 40/47
  55. My Interest: Learning Robust Representations of Words How should a

    word be represented in a natural language understanding system? How to handle new words, i.e. linguistic productivity? What is the mechanism for combining words into meanings? How does the brain do it? Humans can handle such novel words or meanings; computers can't. "Can you friend me on Facebook?" "He is such a Republicrat!" "I am on the train to ZaverLandde." 40/47
  56. Inspirations from NeuroImaging Predicting Human Brain Activity Associated with the

    Meanings of Nouns, Science 2008 [Mitchell et al., 2008] 41/47
  57. A Deep Learning Approach [diagram: a word and its context mapped into layers of hidden feature units]

    Map words to hidden features. Model words as distributed representations of hidden features. 42/47
  58. A Deep Learning Approach [diagram: a word and its context mapped into layers of hidden feature units]

    Map words to hidden features. Model words as distributed representations of hidden features. For new words, context helps provide an interpretation. 42/47
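A hedged toy sketch of the idea on slides 57–58 (not the speaker's actual model): known words map to learned feature vectors, and a new word such as "ZaverLandde" falls back on an average of its context words' vectors, so the context helps provide an interpretation. The dictionary and its values are placeholders.

```python
import numpy as np

EMBEDDINGS = {                              # hypothetical learned word features
    "train": np.array([0.2, 0.9, 0.1]),
    "on": np.array([0.1, 0.1, 0.1]),
    "the": np.array([0.0, 0.1, 0.0]),
    "to": np.array([0.1, 0.0, 0.1]),
}

def represent(word, context):
    """Return the word's feature vector, or a context-based guess if the word is unseen."""
    if word in EMBEDDINGS:
        return EMBEDDINGS[word]
    context_vecs = [EMBEDDINGS[w] for w in context if w in EMBEDDINGS]
    return np.mean(context_vecs, axis=0) if context_vecs else np.zeros(3)

# e.g. represent("ZaverLandde", ["on", "the", "train", "to"])
```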
  59. Conclusions Deep Learning is a (new) suite of techniques that

    is garnering much excitement! Idea: Learn successively higher levels of features from large datasets 43/47
  60. Conclusions Deep Learning is a (new) suite of techniques that

    is garnering much excitement! Idea: Learn successively higher levels of features from large datasets Basic component: RBM or Auto-Encoder 43/47
  61. Conclusions Deep Learning is a (new) suite of techniques that

    is garnering much excitement! Idea: Learn successively higher levels of features from large datasets Basic component: RBM or Auto-Encoder Many research topics remain: Faster and more scalable algorithms Incorporating domain knowledge (e.g. sparsity) Novel applications 43/47
  62. References I Bengio, Y., Lamblin, P., Popovici, D., and Larochelle,

    H. (2006). Greedy layer-wise training of deep networks. In NIPS'06, pages 153–160. Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press. Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010). Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660. 44/47
  63. References II Hinton, G., Deng, L., Yu, D., Dahl, G.,

    Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29. Hinton, G., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554. Le, Q. V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., and Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning. In ICML. 45/47
  64. References III Lee, H., Grosse, R., Ranganath, R., and Ng,

    A. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML. Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K.-M., Malave, V. L., Mason, R. A., and Just, M. A. (2008). Predicting human brain activity associated with the meanings of nouns. Science, 320(5880):1191–1195. Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2006). Sparse feature learning for deep belief networks. In NIPS. 46/47
  65. References IV Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y.,

    and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408. 47/47