
pytexas2013

Trends in Deep Learning, a talk from PyTexas 2013

Kyle Kastner

August 14, 2013

Transcript

  1. Anatomy
     • Input is L1
     • Hidden is L2
     • Output is L3
     • Bias units ◦ +1 in diagram
     • 6-3-6 autoencoder shown
     • “Deep” is ill-defined
     (Figure (c.) CS294A Notes, Andrew Ng; the +1 node is the bias unit)
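To make the anatomy above concrete, here is a minimal numpy sketch of a 6-3-6 autoencoder forward pass: an input layer L1, a hidden layer L2, an output layer L3, and a bias term feeding each non-input layer. The weights and activation choice are illustrative, not taken from the talk.

```python
# Minimal 6-3-6 autoencoder anatomy: L1 (6 units) -> L2 (3 units) -> L3 (6 units),
# with a bias ("+1") term feeding each non-input layer. Illustrative values only.
import numpy as np

rng = np.random.RandomState(0)
W1, b1 = rng.randn(6, 3) * 0.1, np.zeros(3)   # L1 -> L2 weights and bias
W2, b2 = rng.randn(3, 6) * 0.1, np.zeros(6)   # L2 -> L3 weights and bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.rand(6)                 # one 6-dimensional input (activations of L1)
h = sigmoid(x @ W1 + b1)        # hidden code (activations of L2)
x_hat = sigmoid(h @ W2 + b2)    # reconstruction (activations of L3)
print(h.shape, x_hat.shape)     # (3,), (6,)
```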
  2. Anatomy
     • Trained with cost and gradient of cost
     • Negative log likelihood (supervised)
     • Mean squared error (unsupervised)
     • 3-3-1 classifier shown
     (Figure (c.) CS294A Notes, Andrew Ng)
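A small sketch of the two costs named on the slide, computed in plain numpy; the numbers are made up for illustration.

```python
import numpy as np

# Supervised: negative log likelihood of the correct class under predicted probabilities.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])           # softmax outputs for 2 examples, 3 classes
targets = np.array([0, 1])                    # correct class indices
nll = -np.mean(np.log(probs[np.arange(len(targets)), targets]))

# Unsupervised (autoencoder): mean squared error between input and reconstruction.
x = np.array([0.0, 1.0, 0.5])
x_hat = np.array([0.1, 0.9, 0.4])
mse = np.mean((x - x_hat) ** 2)

print(nll, mse)
```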
  3. Terminology
     Overfit
     • Performs well on training data, but poorly on test data
     Hyperparameters
     • “Knobs” to be tweaked, specific to each algorithm
  4. Terminology (cont.)
     Learning Rate
     ◦ Amount to “step” in the direction of the error gradient
     ◦ Very important hyperparameter (VI...H?)
     Dropout
     ◦ Equivalent to averaging many neural networks
     ◦ Randomly zero out weights for each training example
     ◦ Drop 20% input, 50% hidden
     Momentum
     ◦ Analogous to physics
     ◦ Want to settle in the lowest “valley”
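The three terms defined above can be sketched in a few lines of numpy. This is an illustrative update rule, not the speaker's code: a gradient step scaled by the learning rate, a momentum "velocity" that accumulates across steps, and a random dropout mask (applied to unit activations, which is the usual formulation).

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(100, 10) * 0.01
velocity = np.zeros_like(W)
learning_rate, momentum, p_drop = 0.1, 0.9, 0.5   # illustrative hyperparameter values

def sgd_momentum_step(W, velocity, grad):
    # Momentum: keep a running "velocity" so the update settles into a valley.
    velocity[:] = momentum * velocity - learning_rate * grad
    W += velocity
    return W, velocity

def dropout(activations, p_drop, rng):
    # Randomly zero units for this training example; scale so the expected
    # activation matches test time (the "inverted dropout" convention).
    mask = rng.binomial(1, 1.0 - p_drop, size=activations.shape)
    return activations * mask / (1.0 - p_drop)

h = dropout(rng.rand(100), p_drop, rng)           # hidden activations, 50% dropped
grad = rng.randn(*W.shape)                        # placeholder for a real gradient
W, velocity = sgd_momentum_step(W, velocity, grad)
```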
  5. Ex1: Feature Learning
     • Autoencoder for feature extraction, 784-150-784
     • Input coded values (150) to classifier
     • Score on raw features: 0.8961
     • Score on encoded features: 0.912
     (Images: “Borat”, 20th Century Fox; The Dawson Academy, thedawsonacademy.org)
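A rough sketch of the Ex1 recipe using modern scikit-learn rather than the 2013-era tooling named later in the deck: train a small autoencoder, take its hidden-layer activations as the "coded" features, and compare a classifier on raw versus encoded features. The 8x8 digits dataset stands in for MNIST, and the exact scores will differ from the slide.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = load_digits(return_X_y=True)   # 8x8 digits (64 features), stand-in for MNIST's 784
X = X / 16.0                          # scale pixels to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Autoencoder": train the network to reconstruct its own input
# (64-32-64 here, analogous to the 784-150-784 architecture on the slide).
ae = MLPRegressor(hidden_layer_sizes=(32,), activation="relu",
                  max_iter=2000, random_state=0)
ae.fit(X_train, X_train)

def encode(X):
    # Hidden-layer activations = the learned features (the "coded values").
    return np.maximum(0.0, X @ ae.coefs_[0] + ae.intercepts_[0])

clf_raw = LogisticRegression(max_iter=5000).fit(X_train, y_train)
clf_enc = LogisticRegression(max_iter=5000).fit(encode(X_train), y_train)
print("raw features:    ", clf_raw.score(X_test, y_test))
print("encoded features:", clf_enc.score(encode(X_test), y_test))
```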
  6. Ex2: Deep Classifier
     • Using reLU, add equal-size layers until overfit
     • Once overfitting, add dropout
     • 784-1000-1000-1000-1000-10 architecture
     • Example achieves ~1.8% error on MNIST
     • State of the art is < 0.8% on MNIST digits!
     (Image: www.frontiersin.org)
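A hypothetical sketch of the Ex2 architecture in PyTorch, a library that postdates this talk; the layer sizes and dropout rates follow the slides, but the optimizer settings and the fake data batch are purely illustrative.

```python
import torch
import torch.nn as nn

# 784-1000-1000-1000-1000-10 classifier with reLU units and 50% hidden dropout.
model = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1000, 1000), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1000, 1000), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1000, 1000), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1000, 10),                       # 10-way output; softmax is in the loss
)
# The slide's 20% input dropout would be an extra nn.Dropout(0.2) before the first layer.
loss_fn = nn.CrossEntropyLoss()                # applies log-softmax internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

x = torch.rand(64, 784)                        # stand-in batch of flattened MNIST digits
y = torch.randint(0, 10, (64,))
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```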
  7. Ex3: Deep Autoencoder
     • Once again, uses MNIST digits
     • 784-250-150-30-150-250-784 architecture
     • Very difficult to set hyperparameters
     • “Supermarket” search
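A hypothetical PyTorch sketch of the Ex3 deep autoencoder; only the layer sizes come from the slide, while the sigmoid activations and absence of dropout follow the "Current Strategies" slides later in the deck.

```python
import torch
import torch.nn as nn

# 784-250-150-30-150-250-784 autoencoder: encoder down to a 30-unit code, decoder back up.
encoder = nn.Sequential(
    nn.Linear(784, 250), nn.Sigmoid(),
    nn.Linear(250, 150), nn.Sigmoid(),
    nn.Linear(150, 30),
)
decoder = nn.Sequential(
    nn.Linear(30, 150), nn.Sigmoid(),
    nn.Linear(150, 250), nn.Sigmoid(),
    nn.Linear(250, 784), nn.Sigmoid(),
)
autoencoder = nn.Sequential(encoder, decoder)

x = torch.rand(64, 784)                               # stand-in batch of flattened digits
loss = nn.functional.mse_loss(autoencoder(x), x)      # unsupervised reconstruction cost
loss.backward()
```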
  8. In The Wild
     Applications
     • Google+ Image Search
     • Android Speech to Text
     • Microsoft Speech Translation
     Conferences
     • PyTexas (whoo!)
     • SciPy, PyCon, PyData...
     • ICML, NIPS, ICLR, ICASSP, IJCNN, AAAI
  9. In The Wild (cont.)
     Python!
     • pylearn2 (http://github.com/lisa-lab/pylearn2)
     • theano-nets (http://github.com/lmjohns3/theano-nets)
     • scikit-learn (http://github.com/scikit-learn/scikit-learn)
     • hyperopt (http://github.com/jbergstra/hyperopt)
     • Theano (https://github.com/Theano/Theano)
     Other
     • Torch (http://torch.ch)
     • Deep Learning Toolbox (http://github.com/rasmusbergpalm/DeepLearnToolbox)
  10. Difficulties
     • Many “knobs” (hyperparameters)
     • Optimization is difficult
     • Optimization can’t fix poor initialization
     • Compute power and time
  11. Current Strategies
     Hyperparameters
     • Random search (Bergstra ‘12)
     • Don’t grid search
     • Most hyperparameters have little effect
     Optimization
     ◦ Hessian Free (HF) (Martens ‘10)
     ◦ Stochastic Gradient Descent (SGD)
     ◦ Layerwise pretraining + finetuning (Hinton ‘06)
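A minimal sketch of the random-search idea (Bergstra ‘12): sample hyperparameters, for example the learning rate on a log scale, instead of walking a grid. Here train_and_score is a hypothetical placeholder for a real training run.

```python
import numpy as np

rng = np.random.RandomState(0)

def train_and_score(learning_rate, n_hidden):
    # Placeholder: train a model with these hyperparameters and return
    # validation accuracy. Replaced by a fake score for illustration.
    return rng.rand()

trials = []
for _ in range(25):
    params = {
        "learning_rate": 10 ** rng.uniform(-4, 0),   # log-uniform over [1e-4, 1]
        "n_hidden": rng.randint(100, 2000),
    }
    trials.append((train_and_score(**params), params))

best_score, best_params = max(trials, key=lambda t: t[0])
print(best_score, best_params)
```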
  12. Current Strategies (cont.)
     Dropout
     • Acts like “bagging” for neural nets
     • Randomly zero out units (20% input, 50% hidden)
     Activations
     • Rectified linear (reLU) with dropout, for classification
     • Sigmoid or tanh for autoencoders (no dropout!)
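One way to see the "bagging" analogy in code: each training example passes through a randomly thinned network, and scaling the full network's activations by the keep probability at test time approximates averaging the predictions of those many thinned networks. A toy numpy check, with illustrative values:

```python
import numpy as np

rng = np.random.RandomState(0)
h = rng.rand(1000)              # hidden activations for one example
p_drop = 0.5                    # 50% hidden dropout, as on the slide

# Training: one randomly thinned network per example.
mask = rng.binomial(1, 1.0 - p_drop, size=h.shape)
h_train = h * mask

# Test: keep every unit and scale by the keep probability (no random masking).
h_test = h * (1.0 - p_drop)

# The test-time activation is close to the average over many sampled masks.
avg = np.mean([h * rng.binomial(1, 1.0 - p_drop, size=h.shape)
               for _ in range(2000)], axis=0)
print(np.max(np.abs(avg - h_test)))   # small: scaling ≈ averaging thinned networks
```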
  13. Current Strategies (cont.)
     Initialization
     • Sparse initialization (Martens ‘10, Sutskever ‘13)
     • sqrt(6/fan) initialization (Glorot & Bengio, ‘10)
     Momentum (SGD)
     • Nesterov’s Accelerated Gradient (Sutskever ‘13)
     Learning Rate (SGD)
     • Adaptive learning rate (Schaul ‘13)
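A sketch of the two initialization schemes cited above. The sqrt(6/fan) range is written here with fan_in + fan_out in the denominator, which is one common reading of the Glorot & Bengio rule; the sparse initialization uses an illustrative 15 nonzero incoming weights per unit.

```python
import numpy as np

rng = np.random.RandomState(0)
fan_in, fan_out = 784, 1000

# Glorot / "Xavier" initialization: uniform in [-limit, limit].
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_glorot = rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Sparse initialization: a handful of nonzero Gaussian weights per unit, rest zero.
n_connections = 15                              # illustrative choice
W_sparse = np.zeros((fan_in, fan_out))
for j in range(fan_out):
    idx = rng.choice(fan_in, size=n_connections, replace=False)
    W_sparse[idx, j] = rng.randn(n_connections)
```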
  14. Moving Forward
     • Simplify hyperparameter whack-a-mole
     • Validate research results
     • Apply to new datasets
     • Try to avoid a SKYNET situation...