Deep Learning - Advanced Techniques Tutorial - PyData Amsterdam 2017

Slides from my PyData Amsterdam 2017 tutorial on deep learning.

Britefury

April 07, 2017

Transcript

  1. Deep Learning Tutorial Advanced Techniques PyData Amsterdam 2017 G. French

    – University of East Anglia Image montages from http://www.image-net.org
  2. Theano What it is and how it works Review: Multi-layer

    perceptron The basic model Convolutional networks Neural networks for computer vision
  3. Lasagne and VGG-16 Explain Lasagne and use it with a

    convolutional network trained by the VGG group at Oxford University Transfer learning Re-using pre-trained networks Deep learning tricks of the trade tips to save you some time
  4. Amazon AMI (Use GPU machine) AMI ID: ami-5f789e32 AMI Name:

    PyData London 2016 deep learning adv tutorial - Ubuntu-14.04 Anaconda2-4.0.0 Cuda-7.5 cuDNN-5 Theano-0.8 Lasagne Fuel
  5. Neural network software exists on a spectrum (API style): Neural network

    toolkits - higher level, specify the network in terms of layers; Expression compilers - flexible and powerful, specify the network in terms of mathematical expressions
  6. Neural network software exists on a spectrum (debugging): Neural network

    toolkits - less to debug, so easier; Expression compilers - depends on the toolkit (can end up with confusing errors in compiled code, e.g. Theano)
  7. Neural network software exists on a spectrum (examples): CAFFE, Theano,

    Tensorflow, Torch, ranging from neural network toolkits to expression compilers
  8. x = input (M-element vector); y = output (N-element vector); W = weights

    parameter (N x M matrix); b = bias parameter (N-element vector); f = activation function, normally ReLU but can be tanh or sigmoid. Layer output: y = f(Wx + b)
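
To make the layer formula concrete, here is a minimal NumPy sketch of y = f(Wx + b); the sizes and random weights are illustrative only, not values from the tutorial.

    import numpy as np

    M, N = 784, 256                                       # illustrative sizes
    x = np.random.randn(M).astype(np.float32)             # input: M-element vector
    W = np.random.randn(N, M).astype(np.float32) * 0.01   # weights: N x M matrix
    b = np.zeros(N, dtype=np.float32)                     # bias: N-element vector

    def relu(a):                                          # activation function f
        return np.maximum(a, 0.0)

    y = relu(W.dot(x) + b)                                # output: N-element vector
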
  9. Repeat for each layer: Input vector x; Hidden layer 0

    activation y_0 = f(W_0 x + b_0); Hidden layer 1 activation y_1 = f(W_1 y_0 + b_1); ⋯ Final layer activation (output) y_L = f(W_L y_{L-1} + b_L)
  10. To train the network: Compute the derivative of cost w.r.t.

    parameters (W and b) (More on the cost/loss later)
  11. Update parameters: W' = W - γ ∂C/∂W, b' =

    b - γ ∂C/∂b, where γ = learning rate and C = cost
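
As a hedged illustration of the update rule, here is a minimal Theano sketch for a single-layer softmax classifier; variable names and sizes are illustrative, not the tutorial's own code.

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.matrix('x')                                  # batch of input vectors
    t = T.ivector('t')                                 # integer class targets
    W = theano.shared(np.zeros((784, 10), dtype=np.float32), name='W')
    b = theano.shared(np.zeros(10, dtype=np.float32), name='b')

    p = T.nnet.softmax(T.dot(x, W) + b)
    cost = T.nnet.categorical_crossentropy(p, t).mean()

    gamma = 0.01                                       # learning rate
    dW, db = T.grad(cost, [W, b])                      # dC/dW, dC/db
    updates = [(W, W - gamma * dW),                    # W' = W - gamma * dC/dW
               (b, b - gamma * db)]                    # b' = b - gamma * dC/db
    train_step = theano.function([x, t], cost, updates=updates)
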
  12. (Obligatory) MNIST example: 2 hidden layers, both 256 units; after

    300 iterations over training set: 1.83% validation error. Layers: Input 784 (28x28 images), Hidden 256, Hidden 256, Output 10
  13. The fully connected networks so far have a weakness: No

    translation invariance; learned features are position dependent
  14. For more general imagery: requires a training set large enough

    to see all features in all possible positions… Requires network with enough units to represent this…
  15. The final non-linearity and the corresponding loss function are important

    Their choice depends on the desired output of the network
  16. Let's take a neural network: Input vector x; Hidden

    layer 0 activation y_0 = f(W_0 x + b_0); Hidden layer 1 activation y_1 = f(W_1 y_0 + b_1); ⋯ Final layer activation (output) y_L = f(W_L y_{L-1} + b_L)
  17. Drive its activation function directly (no matrix multiplication): Final layer

    activation (output) y = f(z). We are going to ‘simulate’ the network up to that point by providing values for the logits z directly
  18. Create logit values: keep the logit for class 0

    constant at 0.0; vary the logit for class 1 from -5.0 to 5.0 in steps of 1.0
  19. Let's add the predicted probabilities generated by softmax, p_i =

    exp(z_i) / Σ_j exp(z_j):
    z_0    z_1    p_0        p_1
    0.0   -5.0   0.9933072  0.0066929
    0.0   -4.0   0.9820138  0.0179862
    0.0   -3.0   0.9525741  0.0474259
    0.0   -2.0   0.8807970  0.1192029
    0.0   -1.0   0.7310586  0.2689414
    0.0    0.0   0.5000000  0.5000000
    0.0    1.0   0.2689414  0.7310586
    0.0    2.0   0.1192029  0.8807971
    0.0    3.0   0.0474259  0.9525741
    0.0    4.0   0.0179862  0.9820138
    0.0    5.0   0.0066929  0.9933072
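
The table above can be reproduced with a few lines of NumPy; this sketch also prints the negative log-loss that the next slide plots, assuming class 1 is the target class.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())               # subtract max for numerical stability
        return e / e.sum()

    for z1 in np.arange(-5.0, 6.0, 1.0):
        p = softmax(np.array([0.0, z1]))      # logit for class 0 fixed at 0.0
        nll = -np.log(p[1])                   # negative log-loss if class 1 is the true class
        print('%5.1f  %.7f  %.7f  %.4f' % (z1, p[0], p[1], nll))
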
  20. Add negative log-loss: Note: the loss is high when the

    logit for the correct class is negative, and tends to 0 when it is positive
  21. Learning occurs via gradient descent Note: the gradient is in range [-1,

    0]; it is negative when the logit for the correct class is negative, and when that logit is positive the gradient is close to 0 (correct answer, not much learning to do)
  22. Probability regression Use for predicting single value in [0, 1]

    range Can use for binary classification Could use to generate mask for an image
  23. In general When network gives correct answer: Gradient of the

    loss will be near 0 (no more learning)
  24. In general When network gives incorrect answer: Gradient of loss

    will be of magnitude 1 pushing the network to learn
  25. A gradient of 1 The gradient of the final non-linearity

    + loss function having a magnitude of 1 is worth consideration
  26. A gradient of 1 They won’t scale up or down

    the gradient that is back-propagated throughout the earlier parts of the network
  27. A gradient of 1 Keeping the gradient magnitudes sane is

    important (*) Hence the success of weight initialisation schemes and batch normalisation * Particularly for GANs
  28. Regression with squared error loss Loss: squared error; ŷ =

    prediction, y = true value: ℓ(y, ŷ) = (y - ŷ)²
  29. Regression with squared error loss plot The gradient can have

    a large magnitude when ŷ and y differ greatly
  30. Regression with Huber loss Uses squared error when difference between

    prediction and target is small, absolute error otherwise Used in [Girshick15]
  31. Regression with Huber loss Loss: Huber loss; ŷ = prediction,

    y = true value: h(y, ŷ) = ½(ŷ - y)² if |ŷ - y| ≤ 1, otherwise |ŷ - y| - ½
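
A minimal NumPy sketch of the Huber loss above, with the threshold fixed at 1 as on the slide:

    import numpy as np

    def huber(y_true, y_pred):
        d = np.abs(y_pred - y_true)
        # quadratic near zero, linear for large errors, so the gradient
        # magnitude never exceeds 1
        return np.where(d <= 1.0, 0.5 * d ** 2, d - 0.5)
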
  32. In summary: Final non-linearity and loss function depend on: What

    you want your network to generate Effects on gradients can be worth considering
  33. Recap: FC (fully-connected) layer, y = f(Wx + b): input vector, weighted connections, bias,

    activation function (non-linearity), layer activation
  34. The values of the weights form a convolution kernel For

    practical computer vision, more than one kernel must be used to extract a variety of features
  35. Still y = f(Wx + b), as convolution can be expressed

    as multiplication by a weight matrix
  36. Another way of looking at it: A single kernel of

    an e.g. 5x5 convolutional layer is a bit like…
  37. a fully-connected layer with a 5x5 input image, repeated across the whole

    image, with a new ‘fully-connected layer’ for each filter
  38. Down-sampling: max-pooling ‘layer’ [Ciresan12] Take maximum value from each 2

    x 2 pooling region (p x p in the general case) Down-samples the image by a factor of p Operates on channels independently
  39. Down-sampling: striding Only retain 1 pixel in every p; skip

    the rest Often fast; it is built into the convolution operation of many neural network libraries
  40. Simplified LeNet for MNIST digits: Input 1 x 28 x 28 ->

    Conv: 20 5x5 kernels -> 20 x 24 x 24 -> Maxpool 2x2 -> 20 x 12 x 12 -> Conv: 50 5x5 kernels -> 50 x 8 x 8 -> Maxpool 2x2 -> 50 x 4 x 4 -> (flatten and) fully connected -> 256 -> fully connected -> Output 10
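
A sketch of the simplified LeNet above expressed with Lasagne layers (shapes in the comments follow the slide; this is an illustration, not the tutorial's exact code):

    from lasagne.layers import InputLayer, Conv2DLayer, MaxPool2DLayer, DenseLayer
    from lasagne.nonlinearities import softmax

    l_in = InputLayer((None, 1, 28, 28))                        # 1 x 28 x 28 input
    l = Conv2DLayer(l_in, num_filters=20, filter_size=(5, 5))   # -> 20 x 24 x 24
    l = MaxPool2DLayer(l, pool_size=(2, 2))                     # -> 20 x 12 x 12
    l = Conv2DLayer(l, num_filters=50, filter_size=(5, 5))      # -> 50 x 8 x 8
    l = MaxPool2DLayer(l, pool_size=(2, 2))                     # -> 50 x 4 x 4
    l = DenseLayer(l, num_units=256)                            # flatten + FC 256
    l_out = DenseLayer(l, num_units=10, nonlinearity=softmax)   # 10-way soft-max output
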
  41. After 300 iterations over training set: 99.21% validation accuracy. Model /

    Error: FC64 2.85%; FC256--FC256 1.83%; 20C5--MP2--50C5--MP2--FC256 0.79%
  42. Provides API for: constructing layers of a network getting Theano

    expressions representing output, loss, etc.
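
For example, given a network such as the LeNet sketch above (with input layer l_in and final layer l_out), Lasagne yields Theano expressions that can then be compiled; a hedged sketch, not the tutorial's exact code:

    import theano
    import theano.tensor as T
    import lasagne

    y = T.ivector('y')                                   # target classes
    prediction = lasagne.layers.get_output(l_out)        # Theano expression for the output
    loss = lasagne.objectives.categorical_crossentropy(prediction, y).mean()
    params = lasagne.layers.get_all_params(l_out, trainable=True)
    updates = lasagne.updates.adam(loss, params, learning_rate=1e-3)
    train_fn = theano.function([l_in.input_var, y], loss, updates=updates)
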
  43. Lasagne is quite a thin layer on top of Theano,

    so understanding Theano is helpful. On the plus side, implementing custom layers, loss functions, etc. is quite doable.
  44. An aside note about other libraries Definitely check out Keras

    http://keras.io Works on both Theano and Tensorflow
  45. An aside note about other libraries Keras API can be

    simpler to get to grips with than Lasagne Lots of cool examples
  46. Notation: 64C3 = convolutional layer with 64 3x3 filters; MP2 = max-pooling,

    2x2. Layers: Input 3 x 224 x 224 (RGB image, zero-mean); 1: 64C3; 2: 64C3, MP2; 3: 128C3; 4: 128C3, MP2; 5: 256C3; 6: 256C3; 7: 256C3, MP2
  47. Notation: FC4096 = fully-connected layer, 4096 channels; drop 50% = 50%

    drop-out during training. Layers: 8: 512C3; 9: 512C3; 10: 512C3, MP2; 11: 512C3; 12: 512C3; 13: 512C3, MP2; 14: FC4096 (drop 50%); 15: FC4096 (drop 50%); 16: FC1000, soft-max
  48. Layers: Input 3 x 224 x 224 (RGB image,

    zero-mean); 1: 64C3; 2: 64C3, MP2; 3: 128C3; 4: 128C3, MP2; 5: 256C3; 6: 256C3; 7: 256C3, MP2; 8: 512C3; 9: 512C3; 10: 512C3, MP2; 11: 512C3; 12: 512C3; 13: 512C3, MP2; 14: FC4096 (drop 50%); 15: FC4096 (drop 50%); 16: FC1000, soft-max
  49. These kinds of architectures tend to work well: Small convolution

    kernels (3x3) Interspersed with max-pooling
  50. For lower levels, this involves repeating many of the same

    computations, getting the same result
  51. If we could apply the first convolutional layer across the

    whole image rather than many 224x224 blocks, we could re-use those computations…
  52. Then we could also do this for the rest of

    the convolutional layers further down…
  53. In fact we can use the whole network in a

    convolutional fashion; we just need to convert the fully-connected layers to convolutional layers.
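
One way to do that conversion in Lasagne is to reshape a fully-connected layer's weight matrix into convolution kernels. The sketch below is an assumption-laden illustration (fc is a hypothetical DenseLayer with 4096 units sitting on a 512 x 7 x 7 feature map, prev_layer the layer beneath it), not code from the tutorial:

    from lasagne.layers import Conv2DLayer

    W_dense = fc.W.get_value()                          # shape (512*7*7, 4096)
    W_conv = W_dense.T.reshape((4096, 512, 7, 7))       # 4096 kernels of size 512 x 7 x 7
    l_conv = Conv2DLayer(prev_layer, num_filters=4096, filter_size=(7, 7),
                         W=W_conv, b=fc.b.get_value(),
                         nonlinearity=fc.nonlinearity,
                         flip_filters=False)            # correlation, so weights line up with the FC layer
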
  54. This is a trick used when doing image segmentation, when

    we want to determine which parts of an image belong to which class At both training time and prediction time
  55. Training a neural network is notoriously data-hungry Preparing training data

    with ground truths is expensive and time consuming
  56. The ImageNet dataset is huge; millions of images with ground

    truths What if we could somehow use it to help us with a different task?
  57. Example; can re-use part of VGG-16 net for: Classifying images

    with classes that weren’t part of the original ImageNet dataset
  58. Example; can re-use part of VGG-16 net for: Localisation (find

    location of object in image) Segmentation (find exact boundary around object in image)
  59. Layers: Input 3 x 224 x 224 (RGB image,

    zero-mean); 1: 64C3; 2: 64C3, MP2; 3: 128C3; 4: 128C3, MP2; 5: 256C3; 6: 256C3; 7: 256C3, MP2; 8: 512C3; 9: 512C3; 10: 512C3, MP2; 11: 512C3; 12: 512C3; 13: 512C3, MP2; 14: FC4096 (drop 50%); 15: FC4096 (drop 50%); 16: FC1000, soft-max
  60. Remove the last layers, e.g. the fully-connected ones (just 14, 15, 16;

    those in the left box are hidden here for brevity!) Layers: 8: 512C3; 9: 512C3; 10: 512C3, MP2; 11: 512C3; 12: 512C3; 13: 512C3, MP2
  61. Build new, randomly initialised layers to replace them (the number

    of layers created and their size are only for illustration here) Layers: 8: 512C3; 9: 512C3; 10: 512C3, MP2; 11: 512C3; 12: 512C3; 13: 512C3, MP2; FC1024 (drop 50%); FC21, soft-max
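
A sketch of building those new, randomly initialised layers on top of the pre-trained network, using the vgg16.network['pool5'] layer that appears in the code on the following slides (the 1024/21 sizes are the illustrative ones from the slide):

    from lasagne.layers import DenseLayer, DropoutLayer
    from lasagne.nonlinearities import softmax

    l = vgg16.network['pool5']                    # top pre-trained layer
    l = DenseLayer(l, num_units=1024)             # FC1024, randomly initialised
    l = DropoutLayer(l, p=0.5)                    # drop 50% during training
    final_layer = DenseLayer(l, num_units=21, nonlinearity=softmax)   # FC21 soft-max
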
  62. Let's start with some code to get the pre-trained

    and new layer parameters separately
  63. # Get all parameters in network; `get_all_params` works backward

    # from the final layer through the network
    all_params = lasagne.layers.get_all_params(final_layer, trainable=True)

    # Get parameters from pre-trained layers; give the top pre-trained layer
    pretrained_params = lasagne.layers.get_all_params(
        vgg16.network['pool5'], trainable=True)

    new_params = [p for p in all_params if p not in pretrained_params]
  64. Transfer learning: train new layers Train the network with your

    training data, only learning parameters for the new layers
  65. Transfer learning: fine-tuning Learn parameters for new layers Fine-tuning: learn

    parameters for pre-trained layers using a lower learning rate; normally 1/10th
  66. # Update new layers with standard learning rate

    new_updates = lasagne.updates.adam(
        training_loss, new_params, learning_rate=lr)

    # Update pre-trained layers with a reduced learning rate
    pretrained_updates = lasagne.updates.adam(
        training_loss, pretrained_params, learning_rate=lr * 0.1)

    # Combine updates
    updates = new_updates.copy()
    updates.update(pretrained_updates)
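
The combined update dictionary can then be handed to theano.function; a short sketch, assuming X_var and y_var are the Theano input and target variables that training_loss was built from:

    import theano

    # compile the training function using the combined update dictionary
    train_fn = theano.function([X_var, y_var], training_loss, updates=updates)
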
  67. Validation score does not decrease monotonically, so the final score

    is not necessarily the best Dogs vs cats, with transfer learning and data augmentation
  68. Sometimes validation score can start getting gradually worse as the

    network overfits Once you reach this point, there’s no point training any more
  69. Geometric patience Set the patience (index of epoch that you

    are prepared to wait until) to be the number of epochs elapsed so far times some multiple e.g. 2
  70. Geometric patience An example can be seen in the Logistic

    Regression and MLP Theano tutorials: http://deeplearning.net/tutorial/mlp.html
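
A minimal sketch of a geometric-patience training loop (not the Theano tutorial's exact code; train_one_epoch, compute_validation_error and save_network_params are assumed helper functions):

    best_val_err = float('inf')
    patience = 10            # initial number of epochs we are prepared to wait
    patience_factor = 2.0    # the multiple mentioned on the slide

    epoch = 0
    while epoch < patience:
        train_one_epoch()
        val_err = compute_validation_error()
        if val_err < best_val_err:
            best_val_err = val_err
            # extend the patience to a multiple of the epochs elapsed so far
            patience = max(patience, int((epoch + 1) * patience_factor))
            save_network_params()     # keep the best model seen so far
        epoch += 1
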
  71. Small mini-batches Maybe around ~8 Good but slower training Small

    mini-batch results in regularization (due to noise), reaching lower error rates in the end [Goodfellow16]. When using very small mini-batches, you need to compensate with a lower learning rate and more epochs. Slow due to low parallelism Does not use all cores of GPU Low memory usage Fewer neuron activations kept in RAM
  72. Large mini-batches 1000s Ineffective training Won’t reach the same error

    rate as with smaller batches and may not learn at all. Can be fast due to high parallelism Uses GPU parallelism (there are limits; gains only achievable if there are unused CUDA cores) High memory usage Lots of neuron activations kept around; can run out of RAM on large networks
  73. Happy medium (where you want to be) Maybe around 64-256,

    lots of experiments use ~100 Effective training Learns reasonably quickly – in terms of improvement per epoch – and reaches acceptable error rate or loss Medium performance Acceptable in many cases Medium memory usage Fine for modest sized networks
  74. Increasing mini-batch size will improve performance up to the point

    where all GPU units are in use Increasing it further will not improve performance; it will reduce accuracy
  75. Caveat When working in a convolutional fashion - like the

    example of using VGG-net to find the peacock - or when doing image segmentation
  76. In such cases, pushing large patches of an image through

    as a single batch along with a correspondingly large output patch re-uses data due to convolutions and results in substantial savings
  77. My experience: Use patches that are as large as possible

    Although it’s a tricky balance with accuracy of the final result
  78. Batch normalization [Ioffe15] is recommended in most cases Speeds up

    training Loss and error drop faster per-epoch
  79. Although epochs take longer (around 2x in my experience) Can

    (ultimately) reach lower error rates Lets you build deeper networks
  80. Standardise activations (zero-mean, unit variance) per-channel between network layers Solves

    problems caused by exponential growth or shrinkage of layer activations
  81. Assume that a layer (grey square in the slide figure) produces

    activations whose std-dev is twice that of the input: σ_in = 1, σ_out = 2 σ_in = 2
  82. When n such layers are stacked together, the factor compounds:

    σ_in = 1, σ_out = 2^n
  83. The magnitude of activations and therefore gradients either explode or

    vanish (if the layers reduce the magnitude of activations rather than magnify them)
  84. Can be partially addressed with careful weight initialization [He15a]. Batch

    normalization between layers keeps things sane; can train networks with hundreds of layers [He15b].
  85. Lasagne batch normalization inserts itself into a layer before the

    non-linearity, so it's nice and easy to use: l = lasagne.layers.batch_norm(l)
  86. Standardise input data In case of regression, standardise output data

    too (don’t forget to invert the standardisation of network predictions!)
  87. Standardisation Extract samples (pixels in the case of images) into

    an array Compute distribution and standardise
  88. Either: Zero the mean and scale std-dev to 1, per

    channel (RGB for images): x' = (x - μ) / σ
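
A minimal NumPy sketch of per-channel standardisation for an (N, 3, H, W) array of images; keep the mean and std-dev so predictions can be un-standardised later, as the earlier slide notes:

    import numpy as np

    def standardise(images):
        mean = images.mean(axis=(0, 2, 3), keepdims=True)    # per-channel mean
        std = images.std(axis=(0, 2, 3), keepdims=True)      # per-channel std-dev
        return (images - mean) / (std + 1e-8), mean, std

    # invert with: standardised * std + mean
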
  89. Or better still: Use PCA whitening (retain all channels –

    we don’t want to reduce dimensionality)
  90. ZCA whitening is very similar Notice that PCA whitening rotated

    the RGB distribution ZCA whitening doesn’t; it only scales along the axes found by PCA
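
A minimal NumPy sketch of PCA/ZCA whitening of RGB pixel values, where pixels is an (N, 3) array of samples extracted from the training images:

    import numpy as np

    def fit_zca(pixels, eps=1e-6):
        mean = pixels.mean(axis=0)
        cov = np.cov(pixels - mean, rowvar=False)
        evals, evecs = np.linalg.eigh(cov)
        pca = np.diag(1.0 / np.sqrt(evals + eps)).dot(evecs.T)   # PCA whitening matrix
        zca = evecs.dot(pca)                                     # rotate back: ZCA keeps the RGB axes
        return mean, zca

    def apply_zca(pixels, mean, zca):
        return (pixels - mean).dot(zca.T)
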
  91. The previous fish-based examples all used batch normalisation and

    still benefited from data standardisation, so no.
  92. For images: Transformations: move, scale, rotate, reflect, etc. Consider the

    domain: horizontally flipping characters for character recognition would be a bad idea
  93. For images: Lighting: Compute principal components of RGB pixel values

    Add normally distributed random multiples of 10% of the std-dev in each principal component
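
A hedged sketch of the PCA lighting augmentation described above (in the spirit of [Krizhevsky12]); pixels is an (N, 3) array of RGB values from the training set and image a (3, H, W) array:

    import numpy as np

    cov = np.cov(pixels, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)            # principal components of RGB values

    def augment_lighting(image, scale=0.1):
        alpha = np.random.normal(0.0, scale, size=3)       # ~10% random multiples
        shift = evecs.dot(alpha * np.sqrt(evals))          # std-dev along each component
        return image + shift[:, None, None]
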
  94. Learns to predict constant value; optimises constant value for best

    loss A constant value is a local minimum that the network won’t get out of (neural networks ‘cheat’ like crazy!)
  95. Saliency maps Determine which parts of an image the network

    is using to make its prediction Tells you what the network is ‘looking at’
  96. Theoretically possible to use a single network, with enough training

    data (where enough is an impractical amount)
  97. Example Identifying right whales, by Felix Lau 2nd place in

    Kaggle competition http://felixlaumon.github.io/2015/01/08/kaggle-right-whale.html
  98. Identifying right whales, by Felix Lau The first naïve solution

    – training a classifier to identify individuals – did not work well
  99. Region-based saliency map revealed that the network had ‘locked on’

    to features in the ocean shape rather than the whales
  100. Lau’s solution: Train a keypoint finder to locate two keypoints

    on the whale’s head to identify its orientation
  101. When training a network, we use gradient descent to iteratively

    modify weights given images and ground truths
  102. Deep Dreams: Take an image to hallucinate from Choose a

    layer, e.g. ‘pool4’ of VGG-19; choice depends on scale and level of features desired
  103. Deep Dreams: Compute gradient of the L2 norm of the layer's activations w.r.t. the image

    Use gradient ascent to increase that L2 norm
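
A hedged Theano/Lasagne sketch of that gradient-ascent loop (net is assumed to be a dict of pre-trained layers with a 'pool4' entry, and image a preprocessed (1, 3, H, W) float32 array; step size and iteration count are illustrative):

    import numpy as np
    import theano
    import theano.tensor as T
    import lasagne

    X = T.tensor4('X')
    feats = lasagne.layers.get_output(net['pool4'], X, deterministic=True)
    objective = T.sum(feats ** 2)                    # squared L2 norm of the layer activations
    grad_fn = theano.function([X], T.grad(objective, X))

    step = 1.0
    for _ in range(50):
        g = grad_fn(image)
        image += step * g / (np.abs(g).mean() + 1e-8)   # normalised gradient ascent step
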
  104. Plug & Play Generative Networks: Conditional Iterative Generation of Images

    in Latent Space [Nguyen17] Image taken from http://www.evolvingai.org/ppgn
  105. Deep Neural Networks are Easily Fooled: High Confidence Predictions in

    Recognizable Images [Nguyen15] Generate images that are unrecognizable to human eyes but are recognized by the network
  106. Deep Neural Networks are Easily Fooled: High Confidence Predictions in

    Recognizable Images [Nguyen15] Image taken from [Nguyen15]
  107. Learning to generate chairs with convolutional neural networks [Dosovitskiy15] Network

    in reverse: orientation, design, colour, etc. parameters as input, rendered images as output (training images shown in the slide figure)
  108. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [Radford15]

    Train two networks: one given random parameters to generate an image, another to discriminate between a generated image and one from the training set
  109. There has been a lot of work on GANs lately;

    mainly focused on improving their output quality
  110. A Neural Algorithm of Artistic Style [Gatys15] Take an OxfordNet

    model [Simonyan14] and extract texture features from one of the convolutional layers, given a target style / painting as input Use gradient descent to iterate photo – not weights – so that its texture features match those of the target image.
  111. Much work on style transfer has focused on either improving

    quality or performance The original algorithm used gradient descent, like Deep Dream, so it's quite slow
  112. Deep Photo Style Transfer [Luan17] (also iterative, but just awesome)

    Image taken from https://github.com/luanfujun/deep-photo-styletransfer
  113. [Berthelot17] Berthelot D., Schumm T., Metz L.; “BEGAN: Boundary Equilibrium

    Generative Adversarial Networks”, arXiv 1703.10717 (2017).
  114. [Girshick15] Girshick, Ross; “Fast R- CNN”, Proceedings of the IEEE

    International Conference on Computer Vision, 2015
  115. [He15a] He, Zhang, Ren and Sun; “Delving Deep into Rectifiers:

    Surpassing Human-Level Performance on ImageNet Classification”, arXiv 2015
  116. [He15b] He, Kaiming, et al. "Deep Residual Learning for Image

    Recognition." arXiv preprint arXiv:1512.03385 (2015).
  117. [Hinton12] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and

    R. R. Salakhutdinov; “Improving neural networks by preventing co-adaptation of feature detectors.” arXiv preprint arXiv:1207.0580, 2012.
  118. [Ioffe15] Ioffe, S.; Szegedy C.. (2015). “Batch Normalization: Accelerating Deep

    Network Training by Reducing Internal Covariate Shift". ICML 2015, arXiv:1502.03167
  119. [Jones87] Jones, J.P.; Palmer, L.A. (1987). "An evaluation of the

    two-dimensional gabor filter model of simple receptive fields in cat striate cortex". J. Neurophysiol 58 (6): 1233–1258
  120. [Lin13] Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in

    network." arXiv preprint arXiv:1312.4400 (2013).
  121. [Luan17] Luan F., Paris S., Shechtman E., Bala K. “Deep

    Photo Style Transfer" arXiv:1703.07511 (2017).
  122. [Krizhevsky12] Krizhevsky, A., Sutskever, I. and Hinton, G. E. "ImageNet

    Classification with Deep Convolutional Neural Networks." NIPS 2012.
  123. [Nesterov83] Nesterov, Y. A method of solving a convex programming

    problem with convergence rate O(1/sqr(k)). Soviet Mathematics Doklady, 27:372–376 (1983).
  124. [Nguyen17] Nguyen A., Yosinski J., Bengio Y., Dosovitskiy A., Clune

    J. “Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space”. CVPR 2017.
  125. [Sutskever13] Sutskever, Ilya, et al. On the importance of initialization

    and momentum in deep learning. Proceedings of the 30th international conference on machine learning (ICML-13). 2013.
  126. [Simonyan14] K. Simonyan and A. Zisserman; “Very deep convolutional networks for

    large-scale image recognition”, arXiv:1409.1556, 2014
  127. [Wang14] Wang, Dan, and Yi Shang. "A new active labeling

    method for deep learning." Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014.