
Introduction to Deep Learning - Cambridge Python User Group

An introduction to deep learning, given at Cambridge Python User Group, 02/Feb/2016

Britefury

February 02, 2016

Transcript

  1. Deep Learning: An Introductory Tutorial. G. French, King's College London
     & University of East Anglia. Image montages from http://www.image-net.org
  2. ImageNet in 2012 Best approaches used hand-crafted features (SIFT, HOGs,

    Fisher vectors, etc) + classifier Top-5 error rate: ~25%
  3. In the last few years, more modern networks have achieved

    better results still [Simonyan14, He15] Top-5 error rates of ~5-7%
  4. Neural network (diagram): inputs -> input layer -> hidden layer 0 -> hidden layer 1
     -> ⋯ -> output layer -> outputs
  5. Single layer of a neural network (diagram): input vector, weighted
     connections, bias, activation function / non-linearity, layer activation
  6. x = input (M-element vector); y = output (N-element vector); W = network
     weights (NxM matrix); b = bias (N-element vector); f = activation function:
     tanh / sigmoid / ReLU. The layer computes y = f(Wx + b)
  7. Multiple layers: input vector x -> hidden layer 0 activation y_0 = f(W_0 x + b_0)
     -> hidden layer 1 activation y_1 = f(W_1 y_0 + b_1) -> ⋯ -> final layer
     activation (output) y_L = f(W_L y_{L-1} + b_L)
  8. Repeat for each layer: y_0 = f(W_0 x + b_0); y_1 = f(W_1 y_0 + b_1); ⋯ ;
     y_L = f(W_L y_{L-1} + b_L)
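A minimal NumPy sketch of the forward pass described on slides 6-8. The layer sizes and the tanh non-linearity are illustrative assumptions rather than values from the talk:

```python
import numpy as np

def dense_layer(x, W, b, f=np.tanh):
    # One layer: y = f(Wx + b)
    return f(W.dot(x) + b)

rng = np.random.RandomState(0)
x = rng.randn(784)                      # e.g. a flattened 28x28 image

# Two hidden layers of 256 units and a 10-unit output layer (assumed sizes)
shapes = [(256, 784), (256, 256), (10, 256)]
params = [(rng.randn(n, m) * 0.01, np.zeros(n)) for n, m in shapes]

y = x
for W, b in params:                     # repeat y = f(Wy + b) for each layer
    y = dense_layer(y, W, b)
print(y.shape)                          # -> (10,)
```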
  9. As a classifier (diagram): input vector (image pixels) -> hidden layer 0
     activation -> ⋯ -> final layer activation with softmax (output): class
     probabilities, e.g. 0.25, 0.5, 0.1, 0.15
  10. The cost (sometimes called loss) is a measure of the
      difference between the network output and the ground-truth output
  11. Update parameters: W_0' = W_0 − γ dC/dW_0; b_0' = b_0 − γ dC/db_0;
      γ = learning rate, C = cost
  12. In practice this is done on a mini-batch of examples (e.g. 128) in parallel
      per pass: compute the cost for each example, then average; compute the
      derivative of the average cost w.r.t. the parameters.
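A sketch of the mini-batch gradient-descent update from slides 11-12, shown for a toy linear model with a squared-error cost; the model and cost are assumptions made purely to keep the example self-contained:

```python
import numpy as np

def grads(W, b, X, T):
    # Average dC/dW and dC/db over the mini-batch for a squared-error cost
    Y = X.dot(W) + b                  # predictions for the whole batch in parallel
    err = Y - T
    dW = 2.0 * X.T.dot(err) / len(X)
    db = 2.0 * err.mean(axis=0)
    return dW, db

rng = np.random.RandomState(0)
W, b = rng.randn(784, 10) * 0.01, np.zeros(10)
X = rng.randn(128, 784)               # a mini-batch of 128 examples
T = rng.randn(128, 10)                # matching targets
gamma = 0.1                           # learning rate

dW, db = grads(W, b, X, T)
W = W - gamma * dW                    # W' = W - gamma * dC/dW
b = b - gamma * db                    # b' = b - gamma * dC/db
```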
  13. Final layer: softmax as activation function; output vector of
      class probabilities. Cost: negative log-likelihood / categorical cross-entropy
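The softmax output and categorical cross-entropy cost from slide 13, written out in NumPy as a small sketch:

```python
import numpy as np

def softmax(z):
    # Class probabilities from final-layer pre-activations (rows = examples)
    e = np.exp(z - z.max(axis=1, keepdims=True))   # subtract max for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(p, targets):
    # Negative log-likelihood of the correct class, averaged over the batch
    return -np.log(p[np.arange(len(targets)), targets]).mean()

z = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.0]])       # pre-activations for 2 examples, 3 classes
p = softmax(z)
cost = categorical_cross_entropy(p, np.array([0, 1]))
print(p, cost)
```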
  14. Simplest model: each unit in each layer is connected to
      all units in the previous layer. All we have considered so far
  15. MNIST hand-written digit dataset 28x28 images, 10 classes 60K training

    examples, 10K validation, 10K test Examples from MNIST
  16. Network: 1 hidden layer of 64 units; after 300 iterations over the training
      set: 2.85% validation error. Hidden layer weights visualised as 28x28 images.
      Architecture: input 784 (28x28 images) -> hidden 64 -> output 10
  17. Network: 2 hidden layers, both 256 units; after 300 iterations over the
      training set: 1.83% validation error. Architecture: input 784 (28x28 images)
      -> hidden 256 -> hidden 256 -> output 10
  18. The fully connected networks so far have a weakness: No

    translation invariance; learned features are position dependent
  19. For more general imagery: requires a training set large enough

    to see all features in all possible positions… Requires network with enough units to represent this…
  20. Recap: FC (fully-connected) layer (diagram): input vector, weighted connections,
      bias, activation function / non-linearity, layer activation
  21. The values of the weights form a convolution kernel. For
      practical computer vision, more than one kernel must be used to extract a variety of features
  22. Still y = f(Wx + b), as convolution can be expressed
      as multiplication by a weight matrix
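A small check of the claim on slide 22 that convolution is a matrix multiplication, using a 1D 'valid'-mode example (the signal and kernel values are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([0.25, 0.5, 0.25])

# Write the convolution as a banded weight matrix W so that y = W @ x
n_out = len(x) - len(k) + 1
W = np.zeros((n_out, len(x)))
for i in range(n_out):
    W[i, i:i + len(k)] = k

y_matrix = W.dot(x)
y_conv = np.correlate(x, k, mode='valid')   # what a conv layer computes (cross-correlation)
print(np.allclose(y_matrix, y_conv))        # True
```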
  23. Max-pooling ‘layer’ [Ciresan12]: take the maximum value from each (p, q)
      pooling region; down-samples the image by that factor; operates on channels independently
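A sketch of non-overlapping 2x2 max-pooling in NumPy, assuming the image dimensions are divisible by the pooling factor:

```python
import numpy as np

def max_pool_2x2(x):
    # x has shape (channels, height, width); pooling acts on each channel independently
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
print(max_pool_2x2(x).shape)   # (2, 2, 2): down-sampled by a factor of 2 in each dimension
```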
  24. Simplified LeNet for MNIST digits (diagram): input 1x28x28 -> conv, 20 5x5
      kernels -> 20x24x24 -> maxpool 2x2 -> 20x12x12 -> conv, 50 5x5 kernels ->
      50x8x8 -> maxpool 2x2 -> 50x4x4 -> (flatten and) fully connected, 256 units
      -> fully connected, 10 units -> output
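The simplified LeNet above could be written, for example, with the Lasagne/Theano stack. Lasagne is not named in the talk, so treat the layer classes and defaults below as an assumed sketch rather than the speaker's code:

```python
from lasagne.layers import InputLayer, Conv2DLayer, MaxPool2DLayer, DenseLayer
from lasagne.nonlinearities import softmax

net = InputLayer(shape=(None, 1, 28, 28))                     # 1x28x28 input images
net = Conv2DLayer(net, num_filters=20, filter_size=(5, 5))    # -> 20x24x24
net = MaxPool2DLayer(net, pool_size=(2, 2))                   # -> 20x12x12
net = Conv2DLayer(net, num_filters=50, filter_size=(5, 5))    # -> 50x8x8
net = MaxPool2DLayer(net, pool_size=(2, 2))                   # -> 50x4x4
net = DenseLayer(net, num_units=256)                          # flatten + fully connected
net = DenseLayer(net, num_units=10, nonlinearity=softmax)     # class probabilities
```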
  25. After 300 iterations over the training set: 99.21% validation accuracy

      Model                          Error
      FC64                           2.85%
      FC256--FC256                   1.83%
      20C5--MP2--50C5--MP2--FC256    0.79%
  26. Image processing requires large networks with perhaps millions of parameters.
      Lots of training examples are needed to train them. Easily results in billions
      or even trillions of FLOPs
  27. As of now, NVIDIA is the most popular make of
      GPU. Cheaper gaming cards are perfectly adequate; only use Tesla cards in production
  28. ReLU works better than tanh / sigmoid in many cases.
      I don't really understand the reasons (to be honest!); see [Glorot11], [Glorot10], written by people who do!
  29. PROBLEM: magnitudes of activations can vary considerably from layer to layer.
      If each layer ‘multiplies’ the magnitude by some factor, activations explode or vanish
  30. y = f(Wx + b). Assume σ(x) = 1; then σ(y)
      depends on the distribution of W: normal, uniform, std-dev, etc.
  31. In the past: to initialise W, rules of thumb were often used,
      e.g. a normal distribution with σ = 0.01. Problems arise when training deep networks with > 8 layers [Simonyan14], [He15]
  32. E.g. the approach by He et al. [He15]: σ(W) = g √(1/n), where
      n is the fan-in (the number of incoming connections) and g is the gain
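A sketch of this style of initialisation in NumPy, assuming the rule std-dev = gain × sqrt(1 / fan-in), with gain = sqrt(2) for ReLU layers:

```python
import numpy as np

def he_init(n_out, n_in, gain=np.sqrt(2.0), rng=np.random):
    # Weights for a layer with n_in incoming connections (the fan-in)
    std = gain * np.sqrt(1.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

W = he_init(256, 784)
print(W.std())   # close to sqrt(2/784), roughly 0.05
```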
  33. Layer equation becomes: y = f(γ (Wx − μ)/σ + β), where μ = mean(Wx) and
      σ = std(Wx); γ (scale, not needed if f is ReLU) and β (bias) are learned parameters
  34. Note: For a fully connected layer, each unit/output should have

    its own mean and std-dev; aggregate across examples in the mini-batch
  35. For a convolutional layer, each channel should have its own
      mean and std-dev; aggregate across examples in the mini-batch and across image rows and columns
  36. During training, keep a running exponential moving average of mean

    and std-dev During test time, use the averaged mean and std-dev
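A NumPy sketch of batch normalisation for a fully connected layer at training time (slides 33-36); the momentum value and epsilon are illustrative assumptions:

```python
import numpy as np

def batch_norm_train(z, gamma, beta, running_mean, running_std, momentum=0.99, eps=1e-5):
    # z = Wx for a mini-batch, shape (batch, units); returns normalised activations
    mu = z.mean(axis=0)                       # one mean per unit
    sigma = z.std(axis=0)                     # one std-dev per unit
    z_hat = (z - mu) / (sigma + eps)          # eps avoids division by zero
    # Exponential moving averages; use these instead of mu/sigma at test time
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_std = momentum * running_std + (1 - momentum) * sigma
    return gamma * z_hat + beta, running_mean, running_std

z = np.random.randn(128, 256) * 5 + 3         # a mini-batch of pre-activations
gamma, beta = np.ones(256), np.zeros(256)
out, rm, rs = batch_norm_train(z, gamma, beta, np.zeros(256), np.ones(256))
print(out.mean(), out.std())                  # roughly 0 and 1
```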
  37. Over-fitting is always a problem in ML. A model over-fits when it
      is very good at matching samples in the training set but not those in validation/test
  38. Two techniques: DropOut (quite a lot of people use batch
      normalisation instead) and dataset augmentation
  39. DropOut [Hinton12]: during training, randomly choose units to ‘drop out’
      by setting their output to 0, with probability p, usually around 0.5 (compensate by multiplying values by 1/(1−p))
  40. Sampling a different subset of the network for each training

    example Kind of like model averaging with only one model 
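A NumPy sketch of DropOut at training time as described on slide 39: each unit's output is zeroed with probability p and the survivors are scaled by 1/(1−p) so the expected activation is unchanged:

```python
import numpy as np

def dropout(y, p=0.5, rng=np.random):
    # Apply dropout to a layer activation y during training
    mask = rng.uniform(size=y.shape) >= p     # True for units that are kept
    return y * mask / (1.0 - p)

y = np.ones(10)
print(dropout(y, p=0.5))                      # roughly half the entries are 0, the rest 2.0
```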
  41. Dataset: MNIST digits. Network: single hidden layer, fully connected, 256
      units, p = 0.4, 5000 iterations over the training set
  42. DropOut OFF Train loss: 0.0003 Validation loss: 0.094 Validation error:

    1.9% DropOut ON Train loss: 0.0034 Validation loss: 0.077 Validation error: 1.56%
  43. Layers are usually described using a custom config language. CAFFE uses Google Protocol
      Buffers for the base syntax (YAML/JSON-like) and for data (since GPB is binary)
  44. In comparison:
      Network toolkit (e.g. CAFFE)
      • Advantages: CAFFE is fast; most likely easier to get going; bindings for MATLAB, Python and command-line access
      • Disadvantages: less flexible; harder to extend (need to learn the architecture, manual differentiation)
      Expression compiler (e.g. Theano)
      • Advantages: extensible (a new layer type or cost function is no problem); see what goes on under the hood; being adventurous is easier!
      • Disadvantages: slower (Theano); debugging can be tricky (compiled expressions are a step away from your code); typically works with only one language (e.g. Python for Theano)
  45. https://github.com/Newmu/Theano-Tutorials Very simple Python code examples proceeding through logistic
      regression, fully connected and convolutional models. Shows complete mathematical expressions and training procedures
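For flavour, a minimal Theano expression-compiler example in the spirit of those tutorials: logistic regression with automatic differentiation. The shapes and learning rate are assumptions:

```python
import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')                      # a mini-batch of inputs
t = T.ivector('t')                     # integer class labels

W = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

p = T.nnet.softmax(T.dot(X, W) + b)                      # class probabilities
cost = T.nnet.categorical_crossentropy(p, t).mean()      # negative log-likelihood

gW, gb = T.grad(cost, [W, b])                            # automatic differentiation
lr = 0.1
train = theano.function([X, t], cost,
                        updates=[(W, W - lr * gW), (b, b - lr * gb)])
predict = theano.function([X], T.argmax(p, axis=1))
```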
  46. CCTV for Fisheries Project involving Dr. M. Fisher, Dr. M.

    Mackiewicz Funded by Marine Scotland
  47. Automatically quantify the amount of fish discarded by fishing trawlers

    (preferably by species) Process surveillance footage of discard belt
  48. STEPS: Segment fish from background Separate fish from one another

    Classify individual fish (TODO) Measure individual fish to estimate mass (TODO)
  49. N⁴-Fields: use a network to transform an input image patch into a 16-element
      codeword vector; the codeword vector is used to look up the target (foreground or edge) patch that most closely matches it, in a dictionary of words
  50. Use N⁴-Fields to transform the input image to a foreground map; use
      N⁴-Fields to transform the input image to an edge map; use the Watershed algorithm to separate the image into regions
  51. Deep Neural Networks are Easily Fooled: High Confidence Predictions for
      Unrecognizable Images [Nguyen15] Generate images that are unrecognizable to human eyes but are recognized by the network
  52. Deep Neural Networks are Easily Fooled: High Confidence Predictions for
      Unrecognizable Images [Nguyen15] Image taken from [Nguyen15]
  53. Learning to generate chairs with convolutional neural networks [Dosovitskiy15] A network
      run in reverse: orientation, design, colour, etc. parameters as input; rendered images as output (figure: training images)
  54. A Neural Algorithm of Artistic Style [Gatys15] Take an OxfordNet
      model [Simonyan14] and extract texture features from one of the convolutional layers, given a target style / painting as input. Use gradient descent to iterate the photo (not the weights) so that its texture features match those of the target image.
  55. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [Radford15]
      Train two networks: one is given random parameters to generate an image, the other discriminates between a generated image and one from the training set
  56. Deep learning is a fascinating field with lots going on

    Very flexible, wide range of techniques and applications
  57. Deep neural networks have proved to be highly effective* for

    computer vision, speech recognition and other areas *like with every other shiny new toy, see the small-print!
  58. [Ciresan12] Ciresan, Meier and Schmidhuber; Multi-column deep neural networks for

    image classification, Computer vision and Pattern Recognition (CVPR), 2012
  59. [Ganin14] Ganin, Lempitsky; N⁴-Fields: Neural Network Nearest Neighbor Fields for
      Image Transforms, 12th Asian Conference on Computer Vision, 2014
  60. [Glorot10] Glorot, Bengio; Understanding the difficulty of training deep feedforward

    neural networks, International conference on artificial intelligence and statistics, 2010
  61. [He15] He, Zhang, Ren and Sun; Delving Deep into Rectifiers:
      Surpassing Human-Level Performance on ImageNet Classification, arXiv 2015
  62. [Hinton12] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and

    R. R. Salakhutdinov; Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  63. [Ioffe15] Ioffe, S.; Szegedy, C.; Batch Normalization: Accelerating Deep
      Network Training by Reducing Internal Covariate Shift, ICML 2015, arXiv:1502.03167
  64. [Jones87] Jones, J.P.; Palmer, L.A.; An evaluation of the
      two-dimensional Gabor filter model of simple receptive fields in cat striate cortex, J. Neurophysiol 58 (6): 1233–1258, 1987
  65. [LeCun95] LeCun, Y. et al.; Comparison of learning algorithms for
      handwritten digit recognition, International conference on artificial neural networks, 1995
  66. [Nguyen15] Nguyen, Yosinski and Clune; Deep Neural Networks are Easily

    Fooled: High Confidence Predictions for Unrecognizable Images, Computer Vision and Pattern Recognition (CVPR) 2015
  67. [Simonyan14] K. Simonyan and Zisserman; Very deep convolutional networks for

    large-scale image recognition, arXiv:1409.1556, 2014