Trends in Deep Learning
Kyle Kastner
Southwest Research Institute (SwRI)
University of Texas - San Antonio (UTSA)
Slide 2
The W’s
Slide 3
Anatomy
[Figure: activation functions f(x): sigmoid, tanh, and rectified linear (reLU); from CS294A Notes, Andrew Ng]
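A minimal numpy sketch of the three activations named in the figure; the test values are illustrative only, not from the slides.

import numpy as np

def sigmoid(x):
    # squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes any real input into (-1, 1)
    return np.tanh(x)

def relu(x):
    # rectified linear: zero for negative inputs, identity otherwise
    return np.maximum(0.0, x)

x = np.linspace(-3, 3, 7)
print(sigmoid(x))
print(tanh(x))
print(relu(x))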
Slide 4
Anatomy (c.)
● Input layer is L1
● Hidden layer is L2
● Output layer is L3
● Bias Units
○ +1 in diagram
● 6-3-6 autoencoder shown (forward pass sketched below)
● “Deep” is ill-defined
[Figure: 6-3-6 autoencoder with +1 bias units; from CS294A Notes, Andrew Ng]
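A hedged sketch of one forward pass through the 6-3-6 autoencoder above, writing the bias units out explicitly; the random weights and the choice of sigmoid here are placeholders, not values from the slides.

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = rng.rand(6)                        # one 6-dimensional input example
W1 = rng.randn(3, 6) * 0.1             # input (L1) -> hidden (L2) weights
b1 = np.zeros(3)                       # bias into the hidden layer (the "+1" unit)
W2 = rng.randn(6, 3) * 0.1             # hidden (L2) -> output (L3) weights
b2 = np.zeros(6)                       # bias into the output layer

hidden = sigmoid(W1.dot(x) + b1)       # 3-dimensional code
output = sigmoid(W2.dot(hidden) + b2)  # 6-dimensional reconstruction of x
print(hidden.shape, output.shape)      # (3,) (6,)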
Slide 5
Anatomy (c.)
● Trained with a cost and the gradient of that cost
● Negative log likelihood (supervised)
● Mean squared error (unsupervised); both costs sketched below
● 3-3-1 classifier shown
[Figure: 3-3-1 classifier network; from CS294A Notes, Andrew Ng]
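A minimal numpy sketch of the two costs named above; the function names and array shapes are assumptions made for illustration.

import numpy as np

def mean_squared_error(x, reconstruction):
    # unsupervised cost: how far the reconstruction is from the input
    return np.mean((x - reconstruction) ** 2)

def negative_log_likelihood(class_probs, true_labels):
    # supervised cost: class_probs is (n_examples, n_classes) predicted
    # probabilities, true_labels is (n_examples,) integer class indices
    n = class_probs.shape[0]
    return -np.mean(np.log(class_probs[np.arange(n), true_labels]))

probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(negative_log_likelihood(probs, labels))
print(mean_squared_error(np.ones(4), np.zeros(4)))   # 1.0

Training then follows the gradient of whichever cost applies with respect to the weights (backpropagation plus some form of gradient descent).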
Slide 6
Terminology
Overfit
● Performs well on training data, but poorly on test data
Hyperparameters
● “Knobs” to be tweaked, specific to each algorithm
Slide 7
Terminology (c.)
Learning Rate
○ Size of the “step” taken down the error gradient
○ Very important hyperparameter (VI...H?)
Dropout
○ Roughly equivalent to averaging many neural networks
○ Randomly zero out units for each training example
○ Drop 20% of input units, 50% of hidden units
Momentum
○ Analogous to physics: keep part of the previous update’s velocity
○ Want to settle in the lowest “valley” (all three are sketched in code below)
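A hedged sketch of how these three “knobs” enter a single training update; sgd_momentum_step, the weight shapes, and the placeholder gradient are illustrative names, not code from the talk.

import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(150, 784) * 0.01         # some weight matrix
velocity = np.zeros_like(W)
learning_rate = 0.01                   # size of each step down the error gradient
momentum = 0.9                         # fraction of the previous step to keep

def sgd_momentum_step(W, velocity, grad):
    # one update: remember part of the last step, then step down the gradient
    velocity = momentum * velocity - learning_rate * grad
    return W + velocity, velocity

grad = rng.randn(*W.shape)             # placeholder for a real cost gradient
W, velocity = sgd_momentum_step(W, velocity, grad)

# Dropout: randomly zero out units (activations), not weights, during training
h = rng.rand(1000)                     # some hidden-layer activations
keep = rng.rand(1000) > 0.5            # keep ~50% of the hidden units
h_dropped = h * keep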
Slide 8
Ex1: Feature Learning
● Autoencoder for feature extraction, 784-150-784
● Feed the coded values (150 dimensions) into a classifier (pipeline sketched below)
Score on raw features: 0.8961
Score on encoded features: 0.912
[Image credits: “Borat”, 20th Century Fox; The Dawson Academy, thedawsonacademy.org]
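A hedged sketch of this pipeline, assuming current scikit-learn and numpy; the small 8x8 digits dataset stands in for MNIST, so the layer sizes (64-32-64 instead of 784-150-784) and the resulting scores will not match the numbers above, and the encoded features may need more training or tuning to beat the raw ones.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X = X / 16.0                                     # scale pixel values to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
n_in, n_hid = X.shape[1], 32                     # a 64-32-64 autoencoder
W1 = rng.randn(n_hid, n_in) * 0.1
b1 = np.zeros(n_hid)
W2 = rng.randn(n_in, n_hid) * 0.1
b2 = np.zeros(n_in)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

lr = 0.5
for epoch in range(200):                         # plain batch gradient descent on MSE
    h = sigmoid(X_train.dot(W1.T) + b1)
    r = sigmoid(h.dot(W2.T) + b2)
    d2 = (r - X_train) * r * (1 - r)             # backprop through the output layer
    d1 = d2.dot(W2) * h * (1 - h)                # backprop through the hidden layer
    n = X_train.shape[0]
    W2 -= lr * d2.T.dot(h) / n
    b2 -= lr * d2.mean(axis=0)
    W1 -= lr * d1.T.dot(X_train) / n
    b1 -= lr * d1.mean(axis=0)

def encode(A):
    # the hidden codes are the learned features
    return sigmoid(A.dot(W1.T) + b1)

clf_raw = LogisticRegression(max_iter=1000).fit(X_train, y_train)
clf_enc = LogisticRegression(max_iter=1000).fit(encode(X_train), y_train)
print("raw features:    ", clf_raw.score(X_test, y_test))
print("encoded features:", clf_enc.score(encode(X_test), y_test))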
Slide 9
Ex2: Deep Classifier
● Using reLU, add equal-size layers until the net overfits
● Once overfitting, add dropout
● 784-1000-1000-1000-1000-10 architecture (forward pass sketched below)
● Example achieves ~1.8% error on MNIST
● State of the art is < 0.8% on MNIST digits!
[Image credit: www.frontiersin.org]
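A hedged sketch of one forward pass through this architecture with dropout applied as the slides suggest (20% on the input, 50% on each hidden layer); the random weights, the softmax output, and the test-time rescaling convention are assumptions for illustration, not a trained model.

import numpy as np

rng = np.random.RandomState(0)
layer_sizes = [784, 1000, 1000, 1000, 1000, 10]
weights = [rng.randn(n_out, n_in) * 0.01
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, train=True):
    keep_input, keep_hidden = 0.8, 0.5          # drop 20% of inputs, 50% of hiddens
    if train:
        x = x * (rng.rand(x.shape[0]) < keep_input)
    else:
        x = x * keep_input                      # rescale instead of dropping at test time
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        a = W.dot(h) + b
        if i < len(weights) - 1:                # hidden layers: reLU + dropout
            h = relu(a)
            if train:
                h = h * (rng.rand(h.shape[0]) < keep_hidden)
            else:
                h = h * keep_hidden
        else:
            h = softmax(a)                      # output layer: class probabilities
    return h

probs = forward(rng.rand(784), train=True)
print(probs.shape, probs.sum())                 # (10,) and ~1.0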
Slide 10
Ex3: Deep Autoencoder
● Once again, uses MNIST digits
● 784-250-150-30-150-250-784 architecture (forward pass sketched below)
● Very difficult to set hyperparameters
● “Supermarket” search
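A hedged sketch of the symmetric architecture above, forward pass only; the random weights and sigmoid layers are placeholders, and no training or hyperparameter search is shown.

import numpy as np

rng = np.random.RandomState(0)
layer_sizes = [784, 250, 150, 30, 150, 250, 784]
weights = [rng.randn(n_out, n_in) * 0.01
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = rng.rand(784)                        # one flattened MNIST-sized image
h = x
activations = []
for W, b in zip(weights, biases):
    h = sigmoid(W.dot(h) + b)
    activations.append(h)

code = activations[2]                    # the 30-dimensional bottleneck
reconstruction = activations[-1]         # same shape as the input
print(code.shape, reconstruction.shape)  # (30,) (784,)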
Slide 11
In The Wild
Applications
● Google+ Image Search
● Android Speech to Text
● Microsoft Speech Translation
Conferences
● PyTexas (whoo!)
● SciPy, PyCon, PyData...
● ICML, NIPS, ICLR, ICASSP, IJCNN, AAAI
Slide 12
In The Wild (c.)
Python!
● pylearn2 (http://github.com/lisa-lab/pylearn2)
● theano-nets (http://github.com/lmjohns3/theano-nets)
● scikit-learn (http://github.com/scikit-learn/scikit-learn)
● hyperopt (http://github.com/jbergstra/hyperopt)
● Theano (https://github.com/Theano/Theano)
Other
● Torch (http://torch.ch)
● Deep Learning Toolbox (http://github.com/rasmusbergpalm/DeepLearnToolbox)
Slide 13
References
Pure python autoencoder:
http://easymachinelearning.blogspot.com/p/sparse-auto-encoders.html
Tutorial:
http://deeplearning.net/tutorial/
CS294A Notes:
http://www.stanford.edu/class/cs294a/sparseAutoencoder.pdf
Slide 14
Questions?
Slides and examples:
http://github.com/kastnerkyle/PyTexas2013
theano-nets:
http://github.com/lmjohns3/theano-nets
Thank you!
Slide 15
BONUS!
Time to spare?
Slide 16
Difficulties
● Many “knobs” (hyperparameters)
● Optimization is difficult
● Optimization can’t fix poor initialization
● Compute power and time
Slide 17
Current Strategies
Hyperparameters
● Random search (Bergstra ‘12); see the sketch after this list
● Don’t grid search
● Most hyperparameters have little effect
Optimization
● Hessian Free (HF) (Martens ‘10)
● Stochastic Gradient Descent (SGD)
● Layerwise pretraining + finetuning (Hinton ‘06)
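A hedged sketch of random search in the spirit of Bergstra ‘12: sample each “knob” independently instead of walking a grid. The hyperparameter ranges and the train_and_score stand-in are hypothetical, not from the talk.

import numpy as np

rng = np.random.RandomState(0)

def sample_hyperparameters():
    # each "knob" is drawn independently from its own range
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),    # log-uniform step size
        "momentum": rng.uniform(0.5, 0.99),
        "n_hidden": int(rng.randint(100, 2000)),
        "dropout_hidden": float(rng.choice([0.0, 0.5])),
    }

def train_and_score(params):
    # placeholder: build a model from `params`, train it, and return its
    # validation score; a random number stands in for that here
    return rng.rand()

results = []
for _ in range(25):                                    # 25 random trials
    params = sample_hyperparameters()
    results.append((train_and_score(params), params))

best_score, best_params = max(results, key=lambda r: r[0])
print(best_score, best_params)

In practice, a library like hyperopt (listed on the earlier tools slide) automates this kind of search.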
Slide 18
Current Strategies (c.)
Dropout
● Acts like “bagging” for neural nets
● Randomly zero out units (20% input, 50% hidden)
Activations
● Rectified linear (reLU) with dropout for classification
● Sigmoid or tanh for autoencoders (no dropout!)