Deep Learning Tutorial - advanced techniques - PyData London 2016

The main presentation slides from my tutorial 'Deep Learning - Advanced Techniques' that I gave on 6th/May/2016.

Britefury

May 06, 2016

Transcript

  1. Deep Learning Tutorial Advanced Techniques G. French King's College London

    University of East Anglia Image montages from http://www.image-net.org
  2. Theano What it is and how it works Review: Multi-layer

    perceptron The basic model Convolutional networks Neural networks for computer vision
  3. Lasagne and VGG-19 Explain Lasagne and use it with a

    convolutional network trained by the VGG group at Oxford University Deep learning tricks of the trade tips to save you some time Active learning less training data by careful choice
  4. Amazon AMI (Use GPU machine) AMI ID: ami-5f789e32 AMI Name:

    PyData London 2016 deep learning adv tutorial - Ubuntu-14.04 Anaconda2-4.0.0 Cuda-7.5 cuDNN-5 Theano-0.8 Lasagne Fuel
  5. In comparison: Network toolkit (e.g. CAFFE) Advantages: • CAFFE

    is fast • Most likely easier to get going • Bindings for MATLAB, Python, command line access. Disadvantages: • Less flexible; harder to extend (need to learn architecture, manual differentiation). Expression compiler (e.g. Theano) Advantages: • Extensible; new layer type or cost function: no problem • See what goes on under the hood • Being adventurous is easier! Disadvantages: • Slower (Theano) • Debugging can be tricky (compiled expressions are a step away from your code) • Typically only works with one language (e.g. Python for Theano)
  6. x = input (M-element vector); y = output (N-element vector); W = weights

    parameter (N x M matrix); b = bias parameter (N-element vector); f = activation function, normally ReLU but can be tanh or sigmoid. y = f(Wx + b)
  7. Repeat for each layer: input vector x; hidden

    layer 0 activation y_0 = f(W_0 x + b_0); hidden layer 1 activation y_1 = f(W_1 y_0 + b_1); ⋯ ; final layer activation (output) y_L = f(W_L y_{L-1} + b_L)
  8. Update parameters: W' = W - γ ∂L/∂W and

    b' = b - γ ∂L/∂b, where L is the loss and γ = learning rate
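    As a minimal sketch of the model and update rule above, assuming Theano (introduced later in the tutorial) and hypothetical layer sizes; for a batch of row vectors the weight matrix is stored transposed relative to the slide notation:

        import numpy as np
        import theano
        import theano.tensor as T

        n_in, n_out = 784, 256                      # hypothetical layer sizes
        rng = np.random.RandomState(0)
        W = theano.shared(rng.normal(0, 0.01, (n_in, n_out)).astype('float32'))
        b = theano.shared(np.zeros(n_out, dtype='float32'))

        x = T.matrix('x')                           # mini-batch of inputs, one row per sample
        y = T.nnet.relu(T.dot(x, W) + b)            # y = f(Wx + b) with f = ReLU

        loss = y.mean()                             # placeholder loss, just to show the update
        gamma = 0.1                                 # learning rate
        updates = [(p, p - gamma * T.grad(loss, p)) for p in (W, b)]
        step = theano.function([x], loss, updates=updates)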
  9. (Obligatory) MNIST example: 2 hidden layers, both 256 units. After

    300 iterations over the training set: 1.83% validation error. Architecture: input 784 (28x28 images) -> hidden 256 -> hidden 256 -> output 10
  10. The fully connected networks so far have a weakness: No

    translation invariance; learned features are position dependent
  11. For more general imagery: requires a training set large enough

    to see all features in all possible positions… Requires network with enough units to represent this…
  12. Recap: FC (fully-connected) layer, y = f(Wx + b): input vector, weighted connections, bias,

    activation function (non-linearity), layer activation
  13. The values of the weights form a convolution kernel. For

    practical computer vision, more than one kernel must be used to extract a variety of features
  14. Still y = f(Wx + b), as convolution can be expressed

    as multiplication by a weight matrix
  15. Another way of looking at it: a single kernel of

    a convolutional layer (e.g. 5x5) is a bit like…
  16. a fully-connected layer with a 5x5 input image, repeated across the whole

    image; a new ‘fully-connected layer’ for each filter
  17. Max-pooling ‘layer’ [Ciresan12] Take the maximum value from each 2 x

    2 pooling region (p x p in the general case) Down-samples the image by a factor of p Operates on channels independently
  18. Simplified LeNet for MNIST digits: input 1x28x28 ->

    Conv: 20 5x5 kernels -> 20x24x24 -> Maxpool 2x2 -> 20x12x12 -> Conv: 50 5x5 kernels -> 50x8x8 -> Maxpool 2x2 -> 50x4x4 -> (flatten and) fully connected -> 256 -> fully connected -> output 10
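    A sketch of that architecture in Lasagne (the toolkit introduced later); the layer sizes follow the slide, the rest is illustrative:

        import lasagne
        from lasagne.layers import InputLayer, Conv2DLayer, MaxPool2DLayer, DenseLayer

        l = InputLayer((None, 1, 28, 28))                  # 1x28x28 MNIST input
        l = Conv2DLayer(l, num_filters=20, filter_size=5)  # -> 20x24x24
        l = MaxPool2DLayer(l, pool_size=2)                 # -> 20x12x12
        l = Conv2DLayer(l, num_filters=50, filter_size=5)  # -> 50x8x8
        l = MaxPool2DLayer(l, pool_size=2)                 # -> 50x4x4
        l = DenseLayer(l, num_units=256)                   # flatten + fully connected, 256 units
        l = DenseLayer(l, num_units=10, nonlinearity=lasagne.nonlinearities.softmax)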
  19. After 300 iterations over the training set: 99.21% validation accuracy. Model

    (error): FC64 (2.85%); FC256--FC256 (1.83%); 20C5--MP2--50C5--MP2--FC256 (0.79%)
  20. Provides an API for: constructing the layers of a network; getting Theano

    expressions representing output, loss, etc.
  21. Lasagne is quite a thin layer on top of Theano,

    so understanding Theano is helpful. On the plus side, implementing custom layers, loss functions, etc. is quite doable.
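    For example, roughly how a training function can be built from a final layer l; the variable names and the choice of Nesterov momentum as the update rule are illustrative assumptions, not the deck's code:

        import theano
        import theano.tensor as T
        import lasagne

        targets = T.ivector('targets')                       # integer class labels
        prediction = lasagne.layers.get_output(l)            # Theano expression for the output
        loss = lasagne.objectives.categorical_crossentropy(prediction, targets).mean()

        params = lasagne.layers.get_all_params(l, trainable=True)
        updates = lasagne.updates.nesterov_momentum(loss, params,
                                                    learning_rate=0.01, momentum=0.9)
        input_var = lasagne.layers.get_all_layers(l)[0].input_var
        train_fn = theano.function([input_var, targets], loss, updates=updates)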
  22. Layers (input: 3 x 224 x 224 RGB image,

    zero-mean): 1: 64C3; 2: 64C3, MP2; 3: 128C3; 4: 128C3, MP2; 5: 256C3; 6: 256C3; 7: 256C3; 8: 256C3, MP2. Notation: 64C3 = convolutional layer with 64 3x3 filters; MP2 = max-pooling, 2x2
  23. Layers (continued): 9: 512C3; 10: 512C3; 11: 512C3; 12: 512C3,

    MP2; 13: 512C3; 14: 512C3; 15: 512C3; 16: 512C3, MP2; 17: FC4096 (drop 50%); 18: FC4096 (drop 50%); 19: FC1000, soft-max. Notation: FC4096 = fully-connected layer, 4096 channels; drop 50% = 50% drop-out during training
  24. Full layer listing (input: 3 x 224 x 224 RGB image,

    zero-mean): 1: 64C3; 2: 64C3, MP2; 3: 128C3; 4: 128C3, MP2; 5: 256C3; 6: 256C3; 7: 256C3; 8: 256C3, MP2; 9: 512C3; 10: 512C3; 11: 512C3; 12: 512C3, MP2; 13: 512C3; 14: 512C3; 15: 512C3; 16: 512C3, MP2; 17: FC4096 (drop 50%); 18: FC4096 (drop 50%); 19: FC1000, soft-max
  25. These kinds of architectures tend to work well: Small convolution

    kernels (3x3) Interspersed with max-pooling
  26. For lower levels, this involves repeating many of the same

    computations, getting the same result
  27. If we could apply the first convolutional layer across the

    whole image rather than many 224x224 blocks, we could re-use those computations…
  28. Then we could also do this for the rest of

    the convolutional layers further down…
  29. In fact we can use the whole network in a

    convolutional fashion; we just need to convert the fully-connected layers to convolutional layers.
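    A hypothetical sketch of that conversion for one FC layer, assuming it originally followed a 512x7x7 feature map (as with a 224x224 input); fc and incoming stand for existing Lasagne layers and are assumptions:

        import numpy as np
        from lasagne.layers import Conv2DLayer

        fc_W = fc.W.get_value()                        # DenseLayer weights, shape (512*7*7, 4096)
        conv_W = fc_W.T.reshape((4096, 512, 7, 7))     # one 7x7x512 kernel per FC unit
        l = Conv2DLayer(incoming, num_filters=4096, filter_size=7,
                        W=conv_W.astype(np.float32), b=fc.b,
                        flip_filters=False)            # keep the plain dot product (no kernel flip)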
  30. This is a trick used when doing image segmentation, when

    we want to determine which parts of an image belong to which class, at both training time and prediction time
  31. Small mini-batches: maybe around ~8. Good but slower training: a small

    mini-batch results in regularization (due to noise), reaching lower error rates in the end [Goodfellow16]; when using very small mini-batches, you need to compensate with a lower learning rate and more epochs. Slow due to low parallelism: does not use all cores of the GPU. Low memory usage: fewer neuron activations kept in RAM
  32. Large mini-batches: 1000s. Ineffective training: won’t reach the same error

    rate as with smaller batches and may not learn at all. Can be fast due to high parallelism: uses GPU parallelism (there are limits; gains only achievable if there are unused CUDA cores). High memory usage: lots of neuron activations kept around; can run out of RAM on large networks
  33. Happy medium (where you want to be): maybe around 64-256;

    lots of experiments use ~100. Effective training: learns reasonably quickly (in terms of improvement per epoch) and reaches an acceptable error rate or loss. Medium performance: acceptable in many cases. Medium memory usage: fine for modest-sized networks
  34. Increasing mini-batch size will improve performance up to the point

    where all GPU units are in use. Increasing it further will not improve performance; it will reduce accuracy
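    A simple mini-batch iterator sketch in plain numpy; batch_size is the knob discussed above, and the array names are illustrative assumptions:

        import numpy as np

        def iterate_minibatches(X, y, batch_size=128, shuffle=True):
            # Yield (inputs, targets) mini-batches drawn from X and y
            indices = np.arange(len(X))
            if shuffle:
                np.random.shuffle(indices)
            for start in range(0, len(X) - batch_size + 1, batch_size):
                batch = indices[start:start + batch_size]
                yield X[batch], y[batch]

        # e.g. for xb, yb in iterate_minibatches(train_X, train_y, 128): train_fn(xb, yb)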
  35. Caveat When working in a convolutional fashion - like the

    example of using VGG-net to find the peacock – or when doing image segmentation
  36. In such cases, pushing large patches of an image through

    as a single batch along with a correspondingly large output patch re-uses data due to convolutions and results in substantial savings
  37. My experience: Use patches that are as large as possible

    Although it’s a tricky balance with accuracy of the final result
  38. Batch normalization [Ioffe15] is recommended in most cases Speeds up

    training Loss and error drop faster per-epoch
  39. Although epochs take longer (around 2x in my experience) Can

    (ultimately) reach lower error rates Lets you build deeper networks
  40. Standardise activations (zero-mean, unit variance) per-channel between network layers. Solves

    problems caused by exponential growth or shrinkage of layer activations
  41. σ_in = 1, σ_out = 2 σ_in = 2

    Assume that a layer (the grey square in the diagram) produces activations whose std-dev is twice that of the input
  42. σ_0 = 1, σ_1 = 2 σ_0, σ_2 = 2 σ_1, ⋯ When layers

    are stacked together, the std-dev doubles at each layer
  43. The magnitude of activations and therefore gradients either explode or

    vanish (if the layers reduce the magnitude of activations rather than magnify them)
  44. Can be partially addressed with careful weight initialization [He15a]. Batch

    normalization between layers keeps things sane; can train networks with hundreds of layers [He15b].
  45. Lasagne batch normalization inserts itself into a layer before the

    non-linearity, so it’s nice and easy to use: l = lasagne.layers.batch_norm(l)
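    For instance, wrapping each hidden layer when building a network (an illustrative sketch; batch_norm drops the wrapped layer's bias and applies the normalisation before its nonlinearity):

        import lasagne
        from lasagne.layers import InputLayer, Conv2DLayer, DenseLayer, batch_norm

        l = InputLayer((None, 1, 28, 28))
        l = batch_norm(Conv2DLayer(l, num_filters=20, filter_size=5))
        l = batch_norm(DenseLayer(l, num_units=256))
        l = DenseLayer(l, num_units=10, nonlinearity=lasagne.nonlinearities.softmax)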
  46. Standardise input data In case of regression, standardise output data

    too (don’t forget to invert the standardisation of network predictions!)
  47. Standardisation Extract samples (pixels in the case of images) into

    an array Compute distribution and standardise
  48. Either: Zero the mean and scale std-dev to 1, per

    channel (RGB for images): x' = (x - μ) / σ
  49. Or better still: Use PCA whitening (retain all channels –

    we don’t want to reduce dimensionality)
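    A rough numpy sketch of both options, assuming X holds the extracted samples as rows (e.g. N pixels x 3 channels); apply the same transform to data at prediction time:

        import numpy as np

        mean = X.mean(axis=0)
        std = X.std(axis=0)
        X_standardised = (X - mean) / std              # zero mean, unit std-dev per channel

        # PCA whitening: rotate into the eigenbasis of the covariance and
        # rescale every component to unit variance (all channels retained)
        Xc = X - mean
        eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        X_whitened = Xc.dot(eigvecs / np.sqrt(eigvals + 1e-5))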
  50. The previous fish-based examples all used batch normalisation and

    still benefited from data standardisation, so no: batch normalisation does not remove the need for it
  51. Learns to predict a constant value; optimises the constant value for the best

    loss. A constant value is a local minimum that the network won’t get out of (neural networks ‘cheat’ like crazy!)
  52. Saliency maps Determine which parts of an image the network

    is using to make its prediction Tells you what the network is ‘looking at’
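    One common way to build such a map (an illustrative sketch, not necessarily the deck's exact method) is to occlude image regions and measure how much the predicted probability drops; predict_fn is an assumed compiled prediction function returning class probabilities:

        import numpy as np

        def occlusion_saliency(predict_fn, image, target_class, patch=16, stride=8):
            # image: (channels, height, width); larger values = bigger drop when that region is hidden
            _, h, w = image.shape
            base = predict_fn(image[np.newaxis])[0, target_class]
            rows = (h - patch) // stride + 1
            cols = (w - patch) // stride + 1
            saliency = np.zeros((rows, cols))
            for i, y0 in enumerate(range(0, h - patch + 1, stride)):
                for j, x0 in enumerate(range(0, w - patch + 1, stride)):
                    occluded = image.copy()
                    occluded[:, y0:y0 + patch, x0:x0 + patch] = 0.0
                    saliency[i, j] = base - predict_fn(occluded[np.newaxis])[0, target_class]
            return saliency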
  53. Theoretically possible to use a single network, with enough training

    data (where enough is an impractical amount)
  54. Example Identifying right whales, by Felix Lau 2nd place in

    Kaggle competition http://felixlaumon.github.io/2015/01/08/kaggle-right-whale.html
  55. Identifying right whales, by Felix Lau The first naïve solution

    – training a classifier to identify individuals – did not work well
  56. Region-based saliency map revealed that the network had ‘locked on’

    to features in the ocean shape rather than the whales
  57. Lau’s solution: Train a keypoint finder to locate two keypoints

    on the whale’s head to identify its orientation
  58. Active learning by confidence The maximum probability is that of

    the predicted class and its value is the confidence
  59. Active learning by confidence Choose the samples with the lowest

    confidence as the next candidates for labelling
  60. Note: Predicted probabilities from neural nets are often very close

    to 0.0 or 1.0; maybe 1e-6 away. Would be nice if they were ‘smoother’
  61. Start with 500 labelled training samples. Each round, of the

    remaining training samples, choose the 500 with the least confidence and add them to the dataset
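    One round of that selection, as a short sketch; predict_fn and the arrays are assumptions:

        import numpy as np

        probs = predict_fn(unlabelled_X)          # (N, n_classes) predicted probabilities
        confidence = probs.max(axis=1)            # probability of the predicted class
        query = np.argsort(confidence)[:500]      # indices of the 500 least-confident samples
        # label unlabelled_X[query], move them into the training set, retrain, repeat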
  62. [Plot: Active learning on MNIST, random choice vs least-confidence choice;

    prediction error (%) against number of labelled samples, 500 to 5000]
  63. MNIST: Only needs 5k out of 50k samples to reach

    (very nearly) the same accuracy
  64. When training a network, we use gradient descent to iteratively

    modify weights given images and ground truths
  65. Deep Dreams: Take an image to hallucinate from Choose a

    layer, e.g. ‘pool4’ of VGG-19; choice depends on scale and level of features desired
  66. Deep Dreams: Compute gradient of L-norm of layer w.r.t. image

    Use gradient ascent to increase L-norm
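    A rough Theano/Lasagne sketch of that gradient-ascent loop, taking the norm as the squared L2 norm of the layer's activations; net (a dict of layers keyed by name), input_var, initial_image and the step size are all assumptions:

        import numpy as np
        import theano
        import theano.tensor as T
        import lasagne

        layer_out = lasagne.layers.get_output(net['pool4'], deterministic=True)
        score = T.sqr(layer_out).sum()                 # (squared) L2 norm of the chosen layer
        grad = T.grad(score, input_var)                # gradient w.r.t. the input image
        step_fn = theano.function([input_var], [score, grad])

        img = initial_image.copy()                     # image to hallucinate from
        for _ in range(100):
            s, g = step_fn(img)
            img += 0.01 * g / (np.abs(g).mean() + 1e-8)   # normalised gradient ascent step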
  67. [He15a] He, Zhang, Ren and Sun; Delving Deep into Rectifiers:

    Surpassing Human-Level Performance on ImageNet Classification, arXiv 2015
  68. [He15b] He, Kaiming, et al. "Deep Residual Learning for Image

    Recognition." arXiv preprint arXiv:1512.03385 (2015).
  69. [Hinton12] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and

    R. R. Salakhutdinov; Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  70. [Ioffe15] Ioffe, S.; Szegedy C.. (2015). “Batch Normalization: Accelerating Deep

    Network Training by Reducing Internal Covariate Shift". ICML 2015, arXiv:1502.03167
  71. [Jones87] Jones, J.P.; Palmer, L.A. (1987). "An evaluation of the

    two-dimensional Gabor filter model of simple receptive fields in cat striate cortex". J. Neurophysiol 58 (6): 1233–1258
  72. [Lin13] Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in

    network." arXiv preprint arXiv:1312.4400 (2013).
  73. [Nesterov83] Nesterov, Y. A method of solving a convex programming

    problem with convergence rate O(1/sqr(k)). Soviet Mathematics Doklady, 27:372–376 (1983).
  74. [Sutskever13] Sutskever, Ilya, et al. On the importance of initialization

    and momentum in deep learning. Proceedings of the 30th international conference on machine learning (ICML-13). 2013.
  75. [Simonyan14] K. Simonyan and A. Zisserman; Very deep convolutional networks for

    large-scale image recognition, arXiv:1409.1556, 2014
  76. [Wang14] Wang, Dan, and Yi Shang. "A new active labeling

    method for deep learning." Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014.