An Introduction to Deep Learning

Britefury
January 05, 2016

Deep Learning for computer vision. First presented at the School of Computing Sciences, University of East Anglia, Norwich, UK.
This version was presented at PyData London, January 2016.

Transcript

  1. Deep Learning: An Introductory Tutorial. G. French, University of East

    Anglia & King's College London. Image montages from http://www.image-net.org
  2. ImageNet in 2012 Best approaches used hand-crafted features (SIFT, HOGs,

    Fisher vectors, etc) + classifier Top-5 error rate: ~25%
  3. In the last few years, more modern networks have achieved

    better results still [Simonyan14, He15] Top-5 error rates of ~5-7%
  4. Neural network (diagram): inputs → input layer → hidden layer 0 →

    hidden layer 1 → ⋯ → output layer → outputs
  5. Single layer of a neural network (diagram): input vector x, weighted

    connections W, bias b, activation function / non-linearity f, layer activation y = f(Wx + b)
  6. x = input (M-element vector), y = output (N-element vector), W = network

    weights (NxM matrix), b = bias (N-element vector), f = activation function: tanh / sigmoid / ReLU.  y = f(Wx + b)
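
    A minimal NumPy sketch of this single-layer rule; the sizes and the choice of ReLU below are illustrative, not taken from the deck:

        import numpy as np

        def relu(z):
            return np.maximum(z, 0.0)

        def fc_layer(x, W, b, f=relu):
            # y = f(Wx + b)
            return f(W.dot(x) + b)

        rng = np.random.RandomState(0)
        x = rng.randn(4)         # M = 4 inputs
        W = rng.randn(3, 4)      # NxM weight matrix
        b = np.zeros(3)          # N-element bias
        y = fc_layer(x, W, b)    # 3-element layer activation
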
  7. Multiple layers: input vector x → f(W_0 x + b_0) → hidden layer 0

    activation y_0 → f(W_1 y_0 + b_1) → hidden layer 1 activation y_1 → ⋯ → f(W_L y_{L-1} + b_L) → final layer activation (output) y_L
  8. Repeat for each layer: y_0 = f(W_0 x + b_0),

    y_1 = f(W_1 y_0 + b_1), ⋯, y_L = f(W_L y_{L-1} + b_L)
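
    A rough NumPy sketch of this repeated rule; the layer widths and the small random initial weights are placeholders for illustration:

        import numpy as np

        def relu(z):
            return np.maximum(z, 0.0)

        def forward(x, params, f=relu):
            # params: list of (W_i, b_i) pairs; applies y_i = f(W_i y_{i-1} + b_i)
            y = x
            for W, b in params:
                y = f(W.dot(y) + b)
            return y

        rng = np.random.RandomState(0)
        sizes = [784, 256, 256, 10]
        params = [(rng.randn(n, m) * 0.01, np.zeros(n))
                  for m, n in zip(sizes[:-1], sizes[1:])]
        out = forward(rng.randn(784), params)   # 10-element output
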
  9. As a classifier: input vector (image pixels) → hidden layer 0 activation

    → ⋯ → final layer activation with softmax (output), giving class probabilities, e.g. 0.25, 0.5, 0.1, 0.15
  10. The cost (sometimes called the loss) is a measure of the

    difference between the network output and the ground-truth output
  11. Update parameters: W_0' = W_0 - γ ∂C/∂W_0,

    b_0' = b_0 - γ ∂C/∂b_0, where C = cost and γ = learning rate
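
    A sketch of this update in plain Python (works on NumPy arrays); the gradients are assumed to be supplied by backpropagation, which the deck does not spell out on this slide:

        def sgd_update(W, b, dC_dW, dC_db, gamma=0.1):
            # W' = W - gamma * dC/dW,  b' = b - gamma * dC/db
            W_new = W - gamma * dC_dW
            b_new = b - gamma * dC_db
            return W_new, b_new
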
  12. In practice this is done on a mini-batch of examples

    (e.g. 128) in parallel per pass Compute cost for each example, then average. Compute derivative of average cost w.r.t. params.
  13. Final layer: softmax as the activation function f; output is a vector of

    class probabilities. Cost: negative log-likelihood / categorical cross-entropy
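
    A NumPy sketch of the softmax output and the mini-batch-averaged negative log-likelihood cost; the batch size of 128 follows the earlier slide, and integer class labels are assumed:

        import numpy as np

        def softmax(logits):
            # logits: (batch, classes); subtract the max for numerical stability
            z = logits - logits.max(axis=1, keepdims=True)
            e = np.exp(z)
            return e / e.sum(axis=1, keepdims=True)

        def categorical_cross_entropy(probs, targets):
            # targets: integer class labels, shape (batch,)
            batch = np.arange(probs.shape[0])
            return -np.mean(np.log(probs[batch, targets]))

        rng = np.random.RandomState(0)
        logits = rng.randn(128, 10)              # mini-batch of 128, 10 classes
        targets = rng.randint(0, 10, size=128)
        cost = categorical_cross_entropy(softmax(logits), targets)
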
  14. Simplest model: each unit in each layer is connected to

    all units in the previous layer. All we have considered so far
  15. MNIST hand-written digit dataset 28x28 images, 10 classes 60K training

    examples, 10K validation, 10K test Examples from MNIST
  16. Network: 1 hidden layer of 64 units (input 784 = 28x28 image pixels → hidden 64 → output 10).

    After 300 iterations over the training set: 2.85% validation error. Hidden layer weights visualised as 28x28 images
  17. Network: 2 hidden layers, both 256 units (input 784 = 28x28 image pixels → hidden 256 → hidden 256 → output 10).

    After 300 iterations over the training set: 1.83% validation error
  18. The fully connected networks so far have a weakness: No

    translation invariance; learned features are position dependent
  19. For more general imagery: requires a training set large enough

    to see all features in all possible positions… Requires network with enough units to represent this…
  20. Recap: FC (fully-connected) layer (diagram): input vector x, weighted connections W, bias b,

    activation function / non-linearity f, layer activation y
  21. The values of the weights form a convolution kernel. For

    practical computer vision, more than one kernel must be used to extract a variety of features
  22. Still y = f(Wx + b), as convolution can be expressed

    as multiplication by a weight matrix
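
    A minimal single-channel, 'valid'-mode 2D convolution in NumPy (cross-correlation, as is conventional in conv nets); each output value is a weighted sum over an input patch, which is why the whole operation can also be written as multiplication by a structured weight matrix. Sizes are illustrative:

        import numpy as np

        def conv2d_valid(image, kernel):
            kh, kw = kernel.shape
            oh = image.shape[0] - kh + 1
            ow = image.shape[1] - kw + 1
            out = np.zeros((oh, ow))
            for i in range(oh):
                for j in range(ow):
                    # weighted sum of one input patch
                    out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
            return out

        rng = np.random.RandomState(0)
        img = rng.randn(28, 28)
        kern = rng.randn(5, 5)
        result = conv2d_valid(img, kern)   # shape (24, 24)
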
  23. Max-pooling ‘layer’ [Ciresan12]: take the maximum value from each (p, q)

    pooling region. Down-samples the image by a factor of (p, q). Operates on channels independently
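
    A NumPy sketch of (p, q) max-pooling, assuming the height and width divide evenly by the pooling size:

        import numpy as np

        def max_pool(x, p=2, q=2):
            # x: (channels, height, width); channels handled independently
            c, h, w = x.shape
            x = x.reshape(c, h // p, p, w // q, q)
            return x.max(axis=(2, 4))

        rng = np.random.RandomState(0)
        fmap = rng.randn(20, 24, 24)
        pooled = max_pool(fmap)   # shape (20, 12, 12)
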
  24. Simplified LeNet for MNIST digits: input 1x28x28 → conv, 20 5x5 kernels → 20x24x24

    → max-pool 2x2 → 20x12x12 → conv, 50 5x5 kernels → 50x8x8 → max-pool 2x2 → 50x4x4 → (flatten) fully connected 256 → fully connected 10 → output
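
    A small sketch that just tracks the feature-map shapes (channels, height, width) through this simplified LeNet, using 'valid' convolutions and 2x2 pooling as above:

        def conv_shape(c_in, h, w, n_kernels, k):   # 'valid' convolution with k x k kernels
            return n_kernels, h - k + 1, w - k + 1

        def pool_shape(c, h, w, p):                 # p x p max-pooling
            return c, h // p, w // p

        shape = (1, 28, 28)                             # input image
        shape = conv_shape(*shape, n_kernels=20, k=5)   # (20, 24, 24)
        shape = pool_shape(*shape, p=2)                 # (20, 12, 12)
        shape = conv_shape(*shape, n_kernels=50, k=5)   # (50, 8, 8)
        shape = pool_shape(*shape, p=2)                 # (50, 4, 4)
        flattened = shape[0] * shape[1] * shape[2]      # 800, then FC 256 -> FC 10
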
  25. After 300 iterations over the training set: 99.21% validation accuracy. Model / error:

    FC64: 2.85%; FC256--FC256: 1.83%; 20C5--MP2--50C5--MP2--FC256: 0.79%
  26. Image processing requires large networks with perhaps millions of parameters.

    Lots of training examples needed to train. Easily results in billions or even trillions of FLOPS
  27. As of now, nVidia is the most popular make of

    GPU. Cheaper gaming cards perfectly adequate. Only use Tesla in production
  28. ReLU works better than tanh / sigmoid in many cases.

    I don’t really understand the reasons (to be honest! :-) See [Glorot11], [Glorot10]; written by people who do!
  29. Previously: rules of thumb often used, e.g. normal distribution with

    σ = 0.01. Problems arise when training deep networks with > 8 layers [Simonyan14], [He15]
  30. More recent approaches choose initial weights to maintain unit variance

    (as much as possible) throughout layers Otherwise layers can reduce or magnify magnitudes of signals exponentially
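
    A quick NumPy experiment illustrating the point: with a fixed small standard deviation the signal collapses over a dozen ReLU layers, while a variance-preserving scale (the He-style choice described on the next slides) keeps it roughly constant. Widths and depth are illustrative:

        import numpy as np

        rng = np.random.RandomState(0)
        x = rng.randn(1000, 256)                    # batch of random inputs

        for scale, label in [(0.01, 'sigma = 0.01'), (np.sqrt(2.0 / 256), 'variance-preserving')]:
            h = x
            for layer in range(12):
                W = rng.randn(256, 256) * scale
                h = np.maximum(h.dot(W.T), 0.0)     # 12 ReLU layers
            print(label, 'std after 12 layers:', h.std())
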
  31. Recent approach by He et al. [He15]: σ = sqrt(g / n_in), where

    n_in is the fan-in, the number of incoming connections, and g is the gain (for the ReLU activation function use g = 2)
  32. For an FC layer: n_in = M = size of x / width

    of W (x is an M-element vector, W is an NxM matrix)
  33. For a convolutional layer: n_in = product of kernel width, kernel height

    and number of channels incoming from previous layer
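
    A NumPy sketch of this initialisation: weights drawn from a zero-mean normal with std = sqrt(g / n_in), g = 2 for ReLU, using the FC and convolutional fan-in rules from these slides; the layer sizes are illustrative:

        import numpy as np

        def he_std(fan_in, gain=2.0):
            return np.sqrt(gain / fan_in)

        rng = np.random.RandomState(0)

        # FC layer: x has M = 784 elements, W is NxM = 256x784, so n_in = 784
        W_fc = rng.randn(256, 784) * he_std(784)

        # Conv layer: 20 kernels of 5x5 over 1 incoming channel, so n_in = 5 * 5 * 1
        W_conv = rng.randn(20, 1, 5, 5) * he_std(5 * 5 * 1)
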
  34. Over-fitting always a problem in ML Model over-fits when it

    is very good at matching samples in training set but not those in validation/test
  35. DropOut [Hinton12]: during training, randomly choose units to ‘drop out’

    by setting their output to 0, with probability p, usually around 0.5 (compensate by multiplying the remaining values by 1/(1 - p))
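
    A NumPy sketch of DropOut in its 'inverted' form, applying the 1/(1 - p) compensation at training time so the layer can be used unchanged at test time:

        import numpy as np

        def dropout(activations, p=0.5, rng=np.random):
            # drop each unit with probability p, keep with probability 1 - p
            mask = rng.uniform(size=activations.shape) >= p
            return activations * mask / (1.0 - p)

        h = np.ones(8)
        h_train = dropout(h, p=0.5)   # roughly half zeroed, survivors scaled to 2.0
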
  36. Sampling a different subset of the network for each training

    example. Kind of like model averaging with only one model :-)
  37. Dataset: MNIST digits. Network: single hidden layer, fully connected, 256

    units, p = 0.4, 5000 iterations over the training set
  38. DropOut OFF Train loss: 0.0003 Validation loss: 0.094 Validation error:

    1.9% DropOut ON Train loss: 0.0034 Validation loss: 0.077 Validation error: 1.56%
  39. Layers usually described using custom config/language CAFFE uses Google Protocol

    Buffers for base syntax (YAML/JSON like) and for data (since GPB is binary)
  40. In comparison: advantages and disadvantages

    Network toolkit (e.g. CAFFE). Advantages: • CAFFE is fast • Most likely easier to get going • Bindings for MATLAB, Python, command line access. Disadvantages: • Less flexible; harder to extend (need to learn architecture, manual differentiation)
    Expression compiler (e.g. Theano). Advantages: • Extensible; new layer type or cost function: no problem • See what goes on under the hood • Being adventurous is easier! Disadvantages: • Slower (Theano) • Debugging can be tricky (compiled expressions are a step away from your code) • Typically only works with one language (e.g. Python for Theano)
  41. https://github.com/Newmu/Theano-Tutorials Very simple Python code examples proceeding through logistic

    regression, fully connected and convolutional models. Shows complete mathematical expressions and training procedures
  42. Deep Neural Networks are Easily Fooled: High Confidence Predictions for

    Unrecognizable Images [Nguyen15] Generate images that are unrecognizable to human eyes but are recognized by the network
  43. Deep Neural Networks are Easily Fooled: High Confidence Predictions for

    Unrecognizable Images [Nguyen15] Image taken from [Nguyen15]
  44. Learning to generate chairs with convolutional neural networks [Dosovitskiy15] Network

    run in reverse: orientation, design, colour, etc. parameters as input, rendered images as output (training images shown)
  45. A Neural Algorithm of Artistic Style [Gatys15] Take an OxfordNet

    model [Simonyan14] and extract texture features from one of the convolutional layers, given a target style / painting as input. Use gradient descent to iterate the photo – not the weights – so that its texture features match those of the target image.
  46. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Nets [Radford15]

    Train two networks: one given random parameters to generate an image, another to discriminate between a generated image and one from the training set
  47. Deep learning is a fascinating field with lots going on

    Very flexible, wide range of techniques and applications
  48. Deep neural networks have proved to be highly effective* for

    computer vision, speech recognition and other areas *like with every other shiny new toy, see the small-print!
  49. [Ciresan12] Ciresan, Meier and Schmidhuber; Multi-column deep neural networks for

    image classification, Computer Vision and Pattern Recognition (CVPR), 2012
  50. [Glorot10] Glorot, Bengio; Understanding the difficulty of training deep feedforward

    neural networks, International conference on artificial intelligence and statistics, 2010
  51. [He15] He, Zhang, Ren and Sun; Delving Deep into Rectifiers:

    Surpassing Human-Level Performance on ImageNet Classification, arXiv 2015
  52. [Hinton12] Hinton, Srivastava, Krizhevsky, Sutskever and

    Salakhutdinov; Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint arXiv:1207.0580, 2012
  53. [Jones87] Jones and Palmer; An evaluation of the

    two-dimensional Gabor filter model of simple receptive fields in cat striate cortex, J. Neurophysiol 58(6): 1233–1258, 1987
  54. [LeCun95] LeCun et al.; Comparison of learning algorithms for

    handwritten digit recognition, International conference on artificial neural networks, 1995
  55. [Nguyen15] Nguyen, Yosinski and Clune; Deep Neural Networks are Easily

    Fooled: High Confidence Predictions for Unrecognizable Images, Computer Vision and Pattern Recognition (CVPR) 2015
  56. [Simonyan14] Simonyan and Zisserman; Very deep convolutional networks for

    large-scale image recognition, arXiv:1409.1556, 2014