Hugo Larochelle: Neural Networks

Hugo Larochelle
Google Brain
Slides from CIFAR Deep Learning Summer School

ML Review
July 04, 2017

Transcript

  1. SUPERVISED LEARNING
     Topics: supervised learning
     • Training time
       ‣ data: { x^(t), y^(t) }
       ‣ setting: x^(t), y^(t) ~ p(x, y)
     • Test time
       ‣ data: { x^(t), y^(t) }
       ‣ setting: x^(t), y^(t) ~ p(x, y)
     • Example
       ‣ classification
       ‣ regression
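The train/test setup above can be sketched in a few lines of numpy; the data-generating p(x, y) below is a made-up toy regression, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy p(x, y): x ~ N(0, 1), y = 2x + 1 + small noise (assumed for illustration).
def sample(n):
    x = rng.normal(size=n)
    y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=n)
    return x, y

# Training time: pairs {x^(t), y^(t)} drawn from p(x, y).
x_train, y_train = sample(1000)

# Fit a linear regression by least squares.
X = np.stack([x_train, np.ones_like(x_train)], axis=1)
w, b = np.linalg.lstsq(X, y_train, rcond=None)[0]

# Test time: fresh pairs from the same p(x, y).
x_test, y_test = sample(1000)
test_mse = np.mean((w * x_test + b - y_test) ** 2)
print(f"w={w:.2f}, b={b:.2f}, test MSE={test_mse:.4f}")
```

Because train and test pairs come from the same distribution, the fitted parameters transfer to the test set; later slides change exactly this assumption.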
  2. UNSUPERVISED LEARNING
     Topics: unsupervised learning
     • Training time
       ‣ data: { x^(t) }
       ‣ setting: x^(t) ~ p(x)
     • Test time
       ‣ data: { x^(t) }
       ‣ setting: x^(t) ~ p(x)
     • Example
       ‣ distribution estimation
       ‣ dimensionality reduction
  3. SEMI-SUPERVISED LEARNING
     Topics: semi-supervised learning
     • Training time
       ‣ data: labeled pairs { x^(t), y^(t) } and unlabeled inputs { x^(t) }
       ‣ setting: x^(t), y^(t) ~ p(x, y) for the labeled pairs, x^(t) ~ p(x) for the unlabeled inputs
     • Test time
       ‣ data: { x^(t), y^(t) }
       ‣ setting: x^(t), y^(t) ~ p(x, y)
  4. MULTITASK LEARNING
     Topics: multitask learning
     • Training time
       ‣ data: { x^(t), y_1^(t), ..., y_M^(t) }
       ‣ setting: x^(t), y_1^(t), ..., y_M^(t) ~ p(x, y_1, ..., y_M)
     • Test time
       ‣ data: { x^(t), y_1^(t), ..., y_M^(t) }
       ‣ setting: x^(t), y_1^(t), ..., y_M^(t) ~ p(x, y_1, ..., y_M)
     • Example
       ‣ object recognition in images with multiple objects
  5. MULTITASK LEARNING
     Topics: multitask learning
     • Slide shows a feedforward neural network with shared hidden layers feeding a task-specific output y, built from the "Feedforward neural network" slide math:
       ‣ pre-activation: a(x) = b + Σ_i w_i x_i = b + w^T x
       ‣ hidden unit: h(x) = g(a(x)) = g(b + Σ_i w_i x_i), with inputs x_1 … x_d, bias b, weights w_1 … w_d
       ‣ activation functions: g(a) = a; g(a) = sigm(a) = 1 / (1 + exp(−a)); g(a) = tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)) = (exp(2a) − 1) / (exp(2a) + 1); g(a) = reclin(a) = max(0, a)
       ‣ output layer: o(a) = softmax(a) = [ exp(a_1)/Σ_c exp(a_c), …, exp(a_C)/Σ_c exp(a_c) ]^T, giving p(y = c | x)
       ‣ layer recursion: a^(k)(x) = b^(k) + W^(k) h^(k−1)(x), with h^(0)(x) = x; h^(k)(x) = g(a^(k)(x)); h^(L+1)(x) = o(a^(L+1)(x)) = f(x), with parameters W^(1), W^(2), W^(3), b^(1), b^(2), b^(3)
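The feedforward equations on this slide translate directly into numpy; the layer sizes below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(a):
    # reclin (ReLU) activation: g(a) = max(0, a)
    return np.maximum(0.0, a)

def softmax(a):
    # o(a) = softmax(a); subtracting the max is for numerical stability
    e = np.exp(a - a.max())
    return e / e.sum()

# Layer sizes: d inputs, two hidden layers, C output classes (arbitrary).
sizes = [5, 4, 3, 2]
W = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

def f(x):
    h = x  # h^(0)(x) = x
    for k in range(len(W) - 1):
        h = g(b[k] + W[k] @ h)         # h^(k)(x) = g(a^(k)(x))
    return softmax(b[-1] + W[-1] @ h)   # h^(L+1)(x) = o(a^(L+1)(x)) = f(x)

p = f(rng.normal(size=5))  # p(y = c | x) for each class c
print(p)
```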
  6. MULTITASK LEARNING
     Topics: multitask learning
     • Slide shows the same feedforward network equations as the previous slide, with the diagram extended so that shared hidden layers branch into task-specific outputs y_1, y_2, y_3.
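A minimal sketch of the multitask architecture in these two slides: one shared trunk h(x) with a separate output head per task. Sizes, activations, and the number of tasks M = 3 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared trunk h(x) = g(b + W x), plus one output head (V_m, c_m) per task.
d, hdim, M = 5, 4, 3
W = rng.normal(scale=0.1, size=(hdim, d))
b = np.zeros(hdim)
heads = [(rng.normal(scale=0.1, size=(2, hdim)), np.zeros(2)) for _ in range(M)]

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_all_tasks(x):
    h = np.maximum(0.0, b + W @ x)  # shared representation for every task
    # Each head m produces its own prediction y_m from the same h.
    return [sigm(c + V @ h) for V, c in heads]

ys = predict_all_tasks(rng.normal(size=d))
print(len(ys))  # one prediction vector per task
```

The shared trunk is what lets the tasks transfer information to each other: its weights receive gradients from every head.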
  7. TRANSFER LEARNING
     Topics: transfer learning
     • Training time
       ‣ data: { x^(t), y_1^(t), ..., y_M^(t) }
       ‣ setting: x^(t), y_1^(t), ..., y_M^(t) ~ p(x, y_1, ..., y_M)
     • Test time
       ‣ data: { x^(t), y_1^(t) }
       ‣ setting: x^(t), y_1^(t) ~ p(x, y_1)
  8. STRUCTURED OUTPUT PREDICTION
     Topics: structured output prediction
     • Training time
       ‣ data: { x^(t), y^(t) }, with y of arbitrary structure (vector, sequence, graph)
       ‣ setting: x^(t), y^(t) ~ p(x, y)
     • Test time
       ‣ data: { x^(t), y^(t) }
       ‣ setting: x^(t), y^(t) ~ p(x, y)
     • Example
       ‣ image caption generation
       ‣ machine translation
  9. DOMAIN ADAPTATION
     Topics: domain adaptation, covariate shift
     • Training time
       ‣ data: labeled source pairs { x^(t), y^(t) } and unlabeled target inputs { x̄^(t') }
       ‣ setting: x^(t) ~ p(x), y^(t) ~ p(y | x^(t)), x̄^(t) ~ q(x), with q(x) ≉ p(x)
     • Test time
       ‣ data: { x̄^(t), y^(t) }
       ‣ setting: x̄^(t) ~ q(x), y^(t) ~ p(y | x̄^(t))
     • Example
       ‣ classify sentiment in reviews of different products
 10. DOMAIN ADAPTATION
     Topics: domain adaptation, covariate shift
     • Slide shows a network diagram: a hidden layer h(x) = sigm(b + Wx) with parameters W, b feeds both a class predictor o(h(x)) with parameters V, c and a domain classifier with parameters w, d.
     • Domain-adversarial networks (Ganin et al. 2015) train the hidden layer representation to be
       1. predictive of the target class
       2. indiscriminate of the domain
     • Trained by stochastic gradient descent
       ‣ for each random pair x^(t), x̄^(t'):
         1. update W, V, b, c in the opposite direction of the gradient
         2. update w, d in the direction of the gradient
 11. DOMAIN ADAPTATION
     Topics: domain adaptation, covariate shift
     • Same slide as the previous one, with one addition: the domain-adversarial approach may also be used to promote fair and unbiased models.
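A toy numpy sketch of the adversarial update on these slides, using one common formulation (gradient reversal): the domain classifier's parameters descend on the domain loss while the feature extractor's parameters ascend on it. The label-prediction branch is omitted for brevity, and all shapes and values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

# Linear feature map h = W x and a logistic domain classifier
# p = sigm(w . h + d); sizes are illustrative assumptions.
d_in, d_h, lr = 4, 3, 0.1
W = rng.normal(scale=0.5, size=(d_h, d_in))
w = rng.normal(scale=0.5, size=d_h)
d = 0.0

def domain_step(x, t):
    """One update on the domain loss; t = 1 for source, 0 for target."""
    global W, w, d
    h = W @ x
    p = sigm(w @ h + d)
    dz = p - t                          # gradient of cross-entropy wrt the logit
    grad_w, grad_d = dz * h, dz
    grad_W = np.outer(dz * w, x)        # gradient backpropagated to the features
    w, d = w - lr * grad_w, d - lr * grad_d  # domain classifier: descend
    W = W + lr * grad_W                       # feature extractor: ascend (reversal)
    return p

p0 = domain_step(rng.normal(size=d_in), 1.0)
print(p0)
```

The sign flip on the feature update is the whole trick: the representation is pushed toward fooling the domain classifier, i.e. toward being indiscriminate of the domain.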
 12. ONE-SHOT LEARNING
     Topics: one-shot learning
     • Training time
       ‣ data: { x^(t), y^(t) }
       ‣ setting: x^(t), y^(t) ~ p(x, y), subject to y^(t) ∈ {1, ..., C}
     • Test time
       ‣ data: { x^(t), y^(t) }
       ‣ setting: x^(t), y^(t) ~ p(x, y), subject to y^(t) ∈ {C+1, ..., C+M}
       ‣ side information: a single labeled example from each of the M new classes
     • Example
       ‣ recognizing a person based on a single picture of them
 13. ONE-SHOT LEARNING
     Topics: one-shot learning
     • Slide shows a Siamese architecture (figure and excerpt taken from Salakhutdinov and Hinton, 2007): two networks with shared weights W map inputs x_a and x_b to codes y_a and y_b, and a similarity metric D[y_a, y_b] is learned by maximizing the log probability of the pairs in the training set. The normalizing term scales with the number of training cases rather than the number of pixels, because only the pairings are modeled, not the structure of the individual images.
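The Siamese idea can be sketched as one shared encoder applied to both inputs, with similarity measured as a distance between the resulting codes. The encoder here is a random one-layer stand-in, not the deep network of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# One encoder with shared weights W, used for both inputs of a pair.
d_in, d_code = 8, 3
W = rng.normal(scale=0.3, size=(d_code, d_in))

def encode(x):
    return np.tanh(W @ x)  # code y = g(W x)

def distance(x_a, x_b):
    # D[y_a, y_b]: Euclidean distance between the two codes
    return np.linalg.norm(encode(x_a) - encode(x_b))

x = rng.normal(size=d_in)
x_other = rng.normal(size=d_in)
d_same = distance(x, x)
d_diff = distance(x, x_other)
print(d_same, d_diff)
```

Because the weights are shared, training the metric on pairs of known classes gives an encoder that can compare a test image against a single labeled example of a new class.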
 14. ZERO-SHOT LEARNING
     Topics: zero-shot learning, zero-data learning
     • Training time
       ‣ data: { x^(t), y^(t) }
       ‣ setting: x^(t), y^(t) ~ p(x, y), subject to y^(t) ∈ {1, ..., C}
       ‣ side information: description vector z_c of each of the C classes
     • Test time
       ‣ data: { x^(t), y^(t) }
       ‣ setting: x^(t), y^(t) ~ p(x, y), subject to y^(t) ∈ {C+1, ..., C+M}
       ‣ side information: description vector z_c of each of the new M classes
     • Example
       ‣ recognizing an object based on a worded description of it
 15. ZERO-SHOT LEARNING
     Topics: zero-shot learning, zero-data learning
     • Slide shows the model of Ba, Swersky, Fidler, Salakhutdinov (arXiv 2015), trained end-to-end: a CNN f embeds the image, an MLP g embeds a TF-IDF vector of the class's Wikipedia article, and the class score is the dot product of the two embeddings (illustrated with a Wikipedia article on the Cardinals, a family of passerine birds found in North and South America).
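The dot-product scoring in this model can be sketched with random stand-ins for the two embedding networks (the real model uses a CNN for images and an MLP on TF-IDF text features; all sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random linear maps standing in for the embedding networks:
# F plays the CNN f (images), G plays the MLP g (class descriptions z_c).
d_img, d_txt, k, C = 10, 6, 4, 5
F = rng.normal(scale=0.3, size=(k, d_img))
G = rng.normal(scale=0.3, size=(k, d_txt))

def class_scores(x, Z):
    fx = F @ x                                   # image embedding f(x)
    return np.array([fx @ (G @ z) for z in Z])   # score(c) = f(x) . g(z_c)

Z = rng.normal(size=(C, d_txt))  # one description vector z_c per class
scores = class_scores(rng.normal(size=d_img), Z)
pred = int(np.argmax(scores))
print(scores.shape, pred)
```

Because classes enter only through their description vectors z_c, new classes can be scored at test time without any training images of them.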
 16. DESIGNING NEW ARCHITECTURES
     Topics: designing new architectures
     • Tackling a new learning problem often requires designing an adapted neural architecture
     • Approach 1: use our intuition for how a human would reason about the problem
     • Approach 2: take an existing algorithm/procedure and turn it into a neural network
 17. DESIGNING NEW ARCHITECTURES
     Topics: designing new architectures
     • Many other examples
       ‣ structured prediction by unrolling probabilistic inference in an MRF
       ‣ planning by unrolling the value iteration algorithm (Tamar et al., NIPS 2016)
       ‣ few-shot learning by unrolling gradient descent on a small training set (Ravi and Larochelle, ICLR 2017)
     • Slide shows Figure 1 from Ravi and Larochelle (ICLR 2017): the computational graph for the forward pass of the meta-learner, with the learning algorithm itself cast as a neural network.
 18. THEY CAN MAKE DUMB ERRORS
     Topics: adversarial examples
     • Intriguing Properties of Neural Networks
       Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, Fergus, ICLR 2014
     • Slide shows image triplets from the paper: a correctly classified image, a badly classified (adversarial) image, and their difference.
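The phenomenon can be illustrated on a toy linear classifier: in high dimensions, a small per-coordinate perturbation aligned against the weights accumulates into a large change of the score and flips the prediction. This sign-based construction follows later work on adversarial examples, not this paper, and all numbers are made up:

```python
import numpy as np

# A fixed logistic classifier: predict class 1 if w . x + b > 0 (toy, hand-set).
w = np.array([0.5, -0.4, 0.3, 0.2, -0.1] * 20)  # 100 input dimensions
b = 0.0
x = np.full(100, 0.02)  # correctly classified input with a modest margin
margin = w @ x + b      # 0.2 > 0, so x is classified as class 1

# Move every coordinate a tiny amount against the classifier's weights.
eps = 0.05
x_adv = x - eps * np.sign(w)

adv_score = w @ x_adv + b   # margin - eps * ||w||_1 = 0.2 - 1.5 = -1.3
flipped = (margin > 0) != (adv_score > 0)
print(margin, adv_score, flipped)
```

Each coordinate moved by only 0.05, yet the score fell by eps times the L1 norm of w, which is why high-dimensional models are so easy to fool this way.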
 19. THEY CAN MAKE DUMB ERRORS
     Topics: adversarial examples
     • Humans have adversarial examples too
     • However, they don't match those of neural networks
 20. THEY CAN MAKE DUMB ERRORS
     Topics: adversarial examples
     • Humans have adversarial examples too
     • However, they don't match those of neural networks
 21. THEY ARE STRANGELY NON-CONVEX
     Topics: non-convexity, saddle points
     • Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization
       Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS 2014
     • Slide shows a plot of average loss against the parameter θ.
 22. THEY ARE STRANGELY NON-CONVEX
     Topics: non-convexity, saddle points
     • Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization
       Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS 2014
     • Slide shows a plot of average loss against the parameter θ.
 23. THEY ARE STRANGELY NON-CONVEX
     Topics: non-convexity, saddle points
     • Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization
       Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS 2014
 24. THEY ARE STRANGELY NON-CONVEX
     Topics: non-convexity, saddle points
     • Qualitatively Characterizing Neural Network Optimization Problems
       Goodfellow, Vinyals, Saxe, ICLR 2015
     • Slide shows a figure from the paper, with a caption fragment ending "…struggles to make progress."
 25. THEY ARE STRANGELY NON-CONVEX
     Topics: non-convexity, saddle points
     • If a dataset is created by labeling points using a neural network with N hidden units
       ‣ training another N-hidden-unit network is likely to fail
       ‣ but training a larger neural network is more likely to work!
         (saddle points seem to be a blessing)
 26. THEY WORK BEST WHEN BADLY TRAINED
     Topics: sharp vs. flat minima
     • Flat Minima
       Hochreiter, Schmidhuber, Neural Computation 1997
     • Excerpt shown on the slide: Hochreiter & Schmidhuber (1997) informally define a flat minimizer x̄ as one for which the function varies slowly in a relatively large neighborhood of x̄. In contrast, a sharp minimizer x̂ is such that the function increases rapidly in a small neighborhood of x̂. A flat minimum can be described with low precision, whereas a sharp minimum requires high precision. The large sensitivity of the training function at a sharp minimizer negatively impacts the ability of the trained model to generalize on new data. This can be explained through the lens of minimum description length (MDL) theory, which states that statistical models that require fewer bits to describe (i.e., are of low complexity) generalize better (Rissanen, 1983). Since flat minimizers can be specified with lower precision than sharp minimizers, they tend to have better generalization performance. Alternative explanations are proffered through the Bayesian view of learning (MacKay, 1992) and through the lens of free Gibbs energy; see e.g. Chaudhari et al. (2016).
     • Slide shows a sketch of a flat minimum and a sharp minimum of the training function f(x), with the testing function shifted so that the sharp minimum generalizes poorly.
 27. THEY WORK BEST WHEN BADLY TRAINED
     Topics: sharp vs. flat minima
     • On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
       Keskar, Mudigere, Nocedal, Smelyanskiy, Tang, ICLR 2017
       ‣ found that using large batch sizes tends to find sharper minima and generalize worse
     • This means that we can't talk about generalization without taking the training algorithm into account
 28. THEY CAN EASILY MEMORIZE
     Topics: model capacity vs. training algorithm
     • Understanding Deep Learning Requires Rethinking Generalization
       Zhang, Bengio, Hardt, Recht, Vinyals, ICLR 2017
 29. THEY CAN BE COMPRESSED
     Topics: knowledge distillation
     • Distilling the Knowledge in a Neural Network
       Hinton, Vinyals, Dean, arXiv 2015
     • Slide shows a feedforward network diagram, built from the "Feedforward neural network" slide math.
 30. THEY CAN BE COMPRESSED
     Topics: knowledge distillation
     • Distilling the Knowledge in a Neural Network
       Hinton, Vinyals, Dean, arXiv 2015
     • Slide shows two feedforward network diagrams side by side, built from the "Feedforward neural network" slide math.
 31. THEY CAN BE COMPRESSED
     Topics: knowledge distillation
     • Distilling the Knowledge in a Neural Network
       Hinton, Vinyals, Dean, arXiv 2015
     • Slide shows the two feedforward network diagrams again, now connected through an output y.
 32. THEY CAN BE COMPRESSED
     Topics: knowledge distillation
     • Can successfully distill
       ‣ a large neural network
       ‣ an ensemble of neural networks
     • Works better than training the small network from scratch!
       ‣ Do Deep Nets Really Need to be Deep?
         Jimmy Ba, Rich Caruana, NIPS 2014
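The distillation recipe from the Hinton et al. paper softens the teacher's outputs with a temperature T and trains the student to match them. A sketch of the loss, with random logits standing in for both networks:

```python
import numpy as np

def softmax(z, T=1.0):
    # Softmax with temperature T; higher T gives softer probabilities.
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=5)  # stand-in for the big net's outputs
student_logits = rng.normal(size=5)  # stand-in for the small net's outputs

T = 4.0
p_teacher = softmax(teacher_logits, T)
p_student = softmax(student_logits, T)

# Distillation loss: cross-entropy between softened teacher and student.
distill_loss = -np.sum(p_teacher * np.log(p_student))
print(p_teacher, distill_loss)
```

The softened targets carry information about class similarities that hard labels discard, which is one reason distilling can beat training the small network from scratch.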
 33. THEY ARE INFLUENCED BY INITIALIZATION
     Topics: impact of initialization
     • Why Does Unsupervised Pre-Training Help Deep Learning?
       Erhan, Bengio, Courville, Manzagol, Vincent, JMLR 2010
     • Slide shows a figure from the paper: 2D visualizations of training trajectories for 2-layer networks with and without pre-training, capturing local and global structure, with each point colored by training iteration to help follow the trajectory movement.
 34. THEY ARE INFLUENCED BY FIRST EXAMPLES
     Topics: impact of early examples
     • Why Does Unsupervised Pre-Training Help Deep Learning?
       Erhan, Bengio, Courville, Manzagol, Vincent, JMLR 2010
     • Excerpt shown on the slide: vary the first million examples (across 10 different random draws, sampling a different set of 1 million examples each time) and keep the other ones fixed. After training the 10 models, measure the variance (across the 10 draws) of the output of the networks on a fixed test set (i.e., the variance in function space). Then vary the next million examples in the same fashion, and so on, to see how much each part of the training set influenced the final function.
 35. YET THEY FORGET WHAT THEY LEARNED
     Topics: lifelong learning, continual learning
     • Overcoming Catastrophic Forgetting in Neural Networks
       Kirkpatrick et al., PNAS 2017
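One method from the Kirkpatrick et al. paper, elastic weight consolidation (EWC), counters forgetting with a quadratic penalty that anchors weights important to old tasks. A sketch with made-up numbers:

```python
import numpy as np

# EWC penalty: L(theta) = L_new(theta)
#              + (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2
# where F_i (diagonal Fisher information) measures how important weight i
# was for the old task. All values below are invented for illustration.
theta_old = np.array([1.0, -2.0, 0.5])
F = np.array([10.0, 0.1, 5.0])
lam = 1.0

def ewc_penalty(theta):
    return 0.5 * lam * np.sum(F * (theta - theta_old) ** 2)

# Moving an important weight (index 0) costs far more than moving an
# unimportant one (index 1); this is what slows down forgetting.
move_important = ewc_penalty(theta_old + np.array([0.5, 0.0, 0.0]))
move_unimportant = ewc_penalty(theta_old + np.array([0.0, 0.5, 0.0]))
print(move_important, move_unimportant)
```

During training on a new task, this penalty is simply added to the new task's loss, so gradient descent is free to reuse unimportant weights while protecting the important ones.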