Trends in Deep Learning
Kyle Kastner
Southwest Research Institute (SwRI)
University of Texas - San Antonio (UTSA)
Slide 2
The W’s
Slide 3
Anatomy
[Figure: activation functions f(x): sigmoid, tanh, and rectified linear (reLU); from CS294A Notes, Andrew Ng]
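A minimal numpy sketch of the three activations named in the figure; the test values are illustrative only, not from the slides.

import numpy as np

def sigmoid(x):
    # squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes any real input into (-1, 1)
    return np.tanh(x)

def relu(x):
    # rectified linear: zero for negative inputs, identity otherwise
    return np.maximum(0.0, x)

x = np.linspace(-3, 3, 7)
print(sigmoid(x))
print(tanh(x))
print(relu(x))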
Slide 4
Anatomy (c.)
● Input layer is L1
● Hidden layer is L2
● Output layer is L3
● Bias Units
○ +1 in diagram
● 6-3-6 autoencoder shown (forward pass sketched below)
● “Deep” is ill-defined
[Figure: 6-3-6 autoencoder with +1 bias units; from CS294A Notes, Andrew Ng]
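A hedged sketch of one forward pass through the 6-3-6 autoencoder above, writing the bias units out explicitly; the random weights and the choice of sigmoid here are placeholders, not values from the slides.

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = rng.rand(6)                        # one 6-dimensional input example
W1 = rng.randn(3, 6) * 0.1             # input (L1) -> hidden (L2) weights
b1 = np.zeros(3)                       # bias into the hidden layer (the "+1" unit)
W2 = rng.randn(6, 3) * 0.1             # hidden (L2) -> output (L3) weights
b2 = np.zeros(6)                       # bias into the output layer

hidden = sigmoid(W1.dot(x) + b1)       # 3-dimensional code
output = sigmoid(W2.dot(hidden) + b2)  # 6-dimensional reconstruction of x
print(hidden.shape, output.shape)      # (3,) (6,)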
Slide 5
Anatomy (c.)
● Trained with a cost and the gradient of that cost
● Negative log likelihood (supervised)
● Mean squared error (unsupervised); both costs sketched below
● 3-3-1 classifier shown
[Figure: 3-3-1 classifier network; from CS294A Notes, Andrew Ng]
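A minimal numpy sketch of the two costs named above; the function names and array shapes are assumptions made for illustration.

import numpy as np

def mean_squared_error(x, reconstruction):
    # unsupervised cost: how far the reconstruction is from the input
    return np.mean((x - reconstruction) ** 2)

def negative_log_likelihood(class_probs, true_labels):
    # supervised cost: class_probs is (n_examples, n_classes) predicted
    # probabilities, true_labels is (n_examples,) integer class indices
    n = class_probs.shape[0]
    return -np.mean(np.log(class_probs[np.arange(n), true_labels]))

probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(negative_log_likelihood(probs, labels))
print(mean_squared_error(np.ones(4), np.zeros(4)))   # 1.0

Training then follows the gradient of whichever cost applies with respect to the weights (backpropagation plus some form of gradient descent).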
Slide 6
Terminology
Overfit
● Performs well on training data, but poorly on test data
Hyperparameters
● “Knobs” to be tweaked, specific to each algorithm
Slide 7
Terminology (c.)
Learning Rate
○ Size of the “step” taken down the error gradient
○ Very important hyperparameter (VI...H?)
Dropout
○ Roughly equivalent to averaging many neural networks
○ Randomly zero out units for each training example
○ Drop 20% of input units, 50% of hidden units
Momentum
○ Analogous to physics: keep part of the previous update’s velocity
○ Want to settle in the lowest “valley” (all three are sketched in code below)
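A hedged sketch of how these three “knobs” enter a single training update; sgd_momentum_step, the weight shapes, and the placeholder gradient are illustrative names, not code from the talk.

import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(150, 784) * 0.01         # some weight matrix
velocity = np.zeros_like(W)
learning_rate = 0.01                   # size of each step down the error gradient
momentum = 0.9                         # fraction of the previous step to keep

def sgd_momentum_step(W, velocity, grad):
    # one update: remember part of the last step, then step down the gradient
    velocity = momentum * velocity - learning_rate * grad
    return W + velocity, velocity

grad = rng.randn(*W.shape)             # placeholder for a real cost gradient
W, velocity = sgd_momentum_step(W, velocity, grad)

# Dropout: randomly zero out units (activations), not weights, during training
h = rng.rand(1000)                     # some hidden-layer activations
keep = rng.rand(1000) > 0.5            # keep ~50% of the hidden units
h_dropped = h * keep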
Slide 8
Ex1: Feature Learning
● Autoencoder for feature extraction, 784-150-784
● Feed the coded values (150 dimensions) into a classifier (pipeline sketched below)
Score on raw features: 0.8961
Score on encoded features: 0.912
[Image credits: “Borat”, 20th Century Fox; The Dawson Academy, thedawsonacademy.org]
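A hedged sketch of this pipeline, assuming current scikit-learn and numpy; the small 8x8 digits dataset stands in for MNIST, so the layer sizes (64-32-64 instead of 784-150-784) and the resulting scores will not match the numbers above, and the encoded features may need more training or tuning to beat the raw ones.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X = X / 16.0                                     # scale pixel values to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
n_in, n_hid = X.shape[1], 32                     # a 64-32-64 autoencoder
W1 = rng.randn(n_hid, n_in) * 0.1
b1 = np.zeros(n_hid)
W2 = rng.randn(n_in, n_hid) * 0.1
b2 = np.zeros(n_in)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

lr = 0.5
for epoch in range(200):                         # plain batch gradient descent on MSE
    h = sigmoid(X_train.dot(W1.T) + b1)
    r = sigmoid(h.dot(W2.T) + b2)
    d2 = (r - X_train) * r * (1 - r)             # backprop through the output layer
    d1 = d2.dot(W2) * h * (1 - h)                # backprop through the hidden layer
    n = X_train.shape[0]
    W2 -= lr * d2.T.dot(h) / n
    b2 -= lr * d2.mean(axis=0)
    W1 -= lr * d1.T.dot(X_train) / n
    b1 -= lr * d1.mean(axis=0)

def encode(A):
    # the hidden codes are the learned features
    return sigmoid(A.dot(W1.T) + b1)

clf_raw = LogisticRegression(max_iter=1000).fit(X_train, y_train)
clf_enc = LogisticRegression(max_iter=1000).fit(encode(X_train), y_train)
print("raw features:    ", clf_raw.score(X_test, y_test))
print("encoded features:", clf_enc.score(encode(X_test), y_test))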
Slide 9
Ex2: Deep Classifier
● Using reLU, add equal-size layers until the net overfits
● Once overfitting, add dropout
● 784-1000-1000-1000-1000-10 architecture (forward pass sketched below)
● Example achieves ~1.8% error on MNIST
● State of the art is < 0.8% on MNIST digits!
[Image credit: www.frontiersin.org]
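A hedged sketch of one forward pass through this architecture with dropout applied as the slides suggest (20% on the input, 50% on each hidden layer); the random weights, the softmax output, and the test-time rescaling convention are assumptions for illustration, not a trained model.

import numpy as np

rng = np.random.RandomState(0)
layer_sizes = [784, 1000, 1000, 1000, 1000, 10]
weights = [rng.randn(n_out, n_in) * 0.01
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, train=True):
    keep_input, keep_hidden = 0.8, 0.5          # drop 20% of inputs, 50% of hiddens
    if train:
        x = x * (rng.rand(x.shape[0]) < keep_input)
    else:
        x = x * keep_input                      # rescale instead of dropping at test time
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        a = W.dot(h) + b
        if i < len(weights) - 1:                # hidden layers: reLU + dropout
            h = relu(a)
            if train:
                h = h * (rng.rand(h.shape[0]) < keep_hidden)
            else:
                h = h * keep_hidden
        else:
            h = softmax(a)                      # output layer: class probabilities
    return h

probs = forward(rng.rand(784), train=True)
print(probs.shape, probs.sum())                 # (10,) and ~1.0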
Slide 10
Ex3: Deep Autoencoder
● Once again, uses MNIST digits
● 784-250-150-30-150-250-784 architecture (forward pass sketched below)
● Very difficult to set hyperparameters
● “Supermarket” search
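A hedged sketch of the symmetric architecture above, forward pass only; the random weights and sigmoid layers are placeholders, and no training or hyperparameter search is shown.

import numpy as np

rng = np.random.RandomState(0)
layer_sizes = [784, 250, 150, 30, 150, 250, 784]
weights = [rng.randn(n_out, n_in) * 0.01
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = rng.rand(784)                        # one flattened MNIST-sized image
h = x
activations = []
for W, b in zip(weights, biases):
    h = sigmoid(W.dot(h) + b)
    activations.append(h)

code = activations[2]                    # the 30-dimensional bottleneck
reconstruction = activations[-1]         # same shape as the input
print(code.shape, reconstruction.shape)  # (30,) (784,)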
Slide 11
In The Wild
Applications
● Google+ Image Search
● Android Speech to Text
● Microsoft Speech Translation
Conferences
● PyTexas (whoo!)
● SciPy, PyCon, PyData...
● ICML, NIPS, ICLR, ICASSP, IJCNN, AAAI
Slide 12
In The Wild (c.)
Python!
● pylearn2 (http://github.com/lisa-lab/pylearn2)
● theano-nets (http://github.com/lmjohns3/theano-nets)
● scikit-learn (http://github.com/scikit-learn/scikit-learn)
● hyperopt (http://github.com/jbergstra/hyperopt)
● Theano (https://github.com/Theano/Theano)
Other
● Torch (http://torch.ch)
● Deep Learning Toolbox (http://github.com/rasmusbergpalm/DeepLearnToolbox)
Slide 13
References
Pure python autoencoder:
http://easymachinelearning.blogspot.com/p/sparse-auto-encoders.html
Tutorial:
http://deeplearning.net/tutorial/
CS294A Notes:
http://www.stanford.edu/class/cs294a/sparseAutoencoder.pdf
Slide 14
Questions?
Slides and examples:
http://github.com/kastnerkyle/PyTexas2013
theano-nets:
http://github.com/lmjohns3/theano-nets
Thank you!
Slide 15
BONUS!
Time to spare?
Slide 16
Difficulties
● Many “knobs” (hyperparameters)
● Optimization is difficult
● Optimization can’t fix poor initialization
● Compute power and time
Slide 17
Current Strategies
Hyperparameters
● Random search (Bergstra ‘12); see the sketch after this list
● Don’t grid search
● Most hyperparameters have little effect
Optimization
● Hessian Free (HF) (Martens ‘10)
● Stochastic Gradient Descent (SGD)
● Layerwise pretraining + finetuning (Hinton ‘06)
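A hedged sketch of random search in the spirit of Bergstra ‘12: sample each “knob” independently instead of walking a grid. The hyperparameter ranges and the train_and_score stand-in are hypothetical, not from the talk.

import numpy as np

rng = np.random.RandomState(0)

def sample_hyperparameters():
    # each "knob" is drawn independently from its own range
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),    # log-uniform step size
        "momentum": rng.uniform(0.5, 0.99),
        "n_hidden": int(rng.randint(100, 2000)),
        "dropout_hidden": float(rng.choice([0.0, 0.5])),
    }

def train_and_score(params):
    # placeholder: build a model from `params`, train it, and return its
    # validation score; a random number stands in for that here
    return rng.rand()

results = []
for _ in range(25):                                    # 25 random trials
    params = sample_hyperparameters()
    results.append((train_and_score(params), params))

best_score, best_params = max(results, key=lambda r: r[0])
print(best_score, best_params)

In practice, a library like hyperopt (listed on the earlier tools slide) automates this kind of search.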
Slide 18
Current Strategies (c.)
Dropout
● Acts like “bagging” for neural nets
● Randomly zero out units (20% input, 50% hidden)
Activations
● Rectified linear (reLU) with dropout for classification
● Sigmoid or tanh for autoencoders (no dropout!)