
Deep Learning workshop - PyCon UK 2016

Britefury
September 18, 2016


These are the slides for a deep learning workshop I gave at PyCon UK 2016


Transcript

  1. Deep Learning Tutorial PyCon UK 2016 G. French University of

    East Anglia Image montages from http://www.image-net.org
  2. Intro, gradient descent, Theano: Getting started

    What is a neural network? The basic model; the multi-layer perceptron
    Convolutional networks: Neural networks for computer vision
  3. Lasagne and VGG-19: Explain Lasagne and use it with a

    convolutional network trained by the VGG group at Oxford University
    Deep learning tricks of the trade: Tips to save you some time
    When things go wrong: Detecting problems and debugging
  4. Designing a computer vision pipeline: Neural networks aren’t a magic

    bullet; how to use them practically
    Cool work in the field: some awesome work done by others
  5. Slides

    https://speakerdeck.com/britefury
    Intro to Machine Learning: https://speakerdeck.com/britefury/intro-to-machine-learning-for-deep-learning-talk-at-pycon-uk-2016
    Intro to Theano and Lasagne: https://speakerdeck.com/britefury/intro-to-theano-and-lasagne-for-deep-learning
  6. Amazon AMI (use a GPU machine) AMI ID: ami-5f789e32 AMI Name:

    Britefury deep learning - Ubuntu-14.04 Anaconda2-4.0.0 Cuda-7.5 cuDNN-5 Theano-0.8 Lasagne Fuel
  7. ImageNet in 2012 Best approaches used hand-crafted features (SIFT, HOGs,

    Fisher vectors, etc) + classifier Top-5 error rate: ~25%
  8. In the last few years, more modern networks have achieved

    better results still [Simonyan14, He15] Top-5 error rates of ~5-7%
  9. For a very quick Machine Learning intro, see the notebook

    INTRO ML 01 - Machine learning - a very basic introduction
  10. Temperature conversion Simple linear model: y = ax + b, where x and y are the

    temperatures in Fahrenheit and Kelvin respectively
  11. Sample temperatures

    Sample                   Fahrenheit (x)   Kelvin (y)
    Boiling point of He      -452.1           4.22
    Boiling point of N       -320.4           77.36
    Melting point of H2O     32.0             273.20
    Body temperature         98.6             310.50
    Boiling point of H2O     212.0            373.20
  12. First step: initialise parameters; come up with a guess Randomly

    initialise a (scale); initialise b (offset) to 0
  13. With a=1.982141 and b=0

    Sample                   Fahrenheit (x)   Kelvin (y)   Prediction (ax+b)   Squared err (ϵ)
    Boiling point of He      -452.1           4.22         -896.126276         810623.417462
    Boiling point of N       -320.4           77.36        -635.078210         507568.203773
    Melting point of H2O     32.0             273.20       63.428535           44004.067369
    Body temperature         98.6             310.50       195.439175          13238.993532
    Boiling point of H2O     212.0            373.20       420.214047          2210.320605
  14. The error tells us how far away we are from

    the optimal parameter values. If ϵ = 0 then our model is 100% accurate
  15. With a=0.55581 and b=255.484

    Sample                   Fahrenheit (x)   Kelvin (y)   Prediction (ax+b)   Squared err (ϵ)
    Boiling point of He      -452.1           4.22         4.202747            0.000298
    Boiling point of N       -320.4           77.36        77.402958           0.001845
    Melting point of H2O     32.0             273.20       273.270495          0.004969
    Body temperature         98.6             310.50       310.287458          0.045174
    Boiling point of H2O     212.0            373.20       373.316342          0.013535

    True values: a=0.556, b=255.372
  16. NOTE In the example notebook, a very low learning rate

    (γ) and a large number of iterations were required
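
    As a minimal sketch of the gradient-descent fit described above, here is the linear model trained with plain NumPy; the learning rate and iteration count are illustrative values, not the ones from the workshop notebook:

        import numpy as np

        # Fahrenheit -> Kelvin training data from the table above
        x = np.array([-452.1, -320.4, 32.0, 98.6, 212.0])
        y = np.array([4.22, 77.36, 273.20, 310.50, 373.20])

        a, b = np.random.randn(), 0.0   # random scale, zero offset
        lr = 1e-5                       # very low learning rate (illustrative)

        for i in range(500000):         # large number of iterations (illustrative)
            y_pred = a * x + b
            err = y_pred - y
            cost = (err ** 2).mean()          # mean squared error
            grad_a = 2.0 * (err * x).mean()   # d(cost)/da
            grad_b = 2.0 * err.mean()         # d(cost)/db
            a -= lr * grad_a
            b -= lr * grad_b

        print(a, b)   # slowly approaches a=0.556, b=255.372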
  17. Data standardisation Subtract mean; mean will now be 0 Divide

    by std-dev; std-dev will now be 1 Discussed later
  18. In comparison

    Network toolkit (e.g. CAFFE)
    • Advantages: CAFFE is fast; most likely easier to get going; bindings for MATLAB, Python, command line access
    • Disadvantages: less flexible; harder to extend (need to learn the architecture, manual differentiation)

    Expression compiler (e.g. Theano)
    • Advantages: extensible; a new layer type or cost function is no problem; see what goes on under the hood; being adventurous is easier!
    • Disadvantages: slower (Theano); debugging can be tricky (compiled expressions are a step away from your code); typically only works with one language (e.g. Python for Theano)
  19. Neural network image classifier Inputs → Hidden → Hidden → Outputs:

    class probabilities (e.g. 0.003, 0.002, 0.005, 0.9)
  20. Neural network image regressor Inputs → Hidden → Hidden → Outputs:

    real values (e.g. avg. width = 2.1, avg. length = 14.2, …)
  21. Neural network: Inputs → Input layer → Hidden layer 0 → Hidden layer 1

    → ⋯ → Output layer → Outputs
  22. Single layer of a neural network: y = f(Wx + b) Input vector x, weighted

    connections W, bias b, activation function / non-linearity f, layer activation y
  23. x = input (M-element vector), y = output (N-element vector), W = weights

    parameter (NxM matrix), b = bias parameter (N-element vector), f = activation function; normally ReLU but can be tanh or sigmoid. y = f(Wx + b)
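
    A minimal NumPy sketch of one such layer (the sizes are illustrative and the ReLU is written inline):

        import numpy as np

        M, N = 784, 256                     # input and output sizes (illustrative)
        x = np.random.randn(M)              # input (M-element vector)
        W = np.random.randn(N, M) * 0.01    # weights parameter (N x M matrix)
        b = np.zeros(N)                     # bias parameter (N-element vector)

        def relu(z):
            return np.maximum(z, 0.0)       # activation function / non-linearity

        y = relu(W.dot(x) + b)              # layer activation: y = f(Wx + b)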
  24. Repeat for each layer Input vector x → f(W_0 x + b_0) → hidden

    layer 0 activation → f(W_1 x_0 + b_1) → hidden layer 1 activation → ⋯ → f(W_L x_{L-1} + b_L) → final layer activation (output)
  25. In mathematical notation: x_0 = f(W_0 x + b_0), x_1 =

    f(W_1 x_0 + b_1), ⋯, x_L = f(W_L x_{L-1} + b_L)
  26. As a classifier Input vector (image pixels) → hidden layer 0 activation

    → ⋯ → final layer activation (with softmax non-linearity) → class probabilities (e.g. 0.25, 0.5, 0.1, 0.15)
  27. Summary; a neural network is: Built from layers, each of

    which is: a matrix multiplication, then add bias, then apply non-linearity.
  28. Want to see it in action? Check out ConvNetJS by

    Andrej Karpathy Try: http://cs.stanford.edu/people/karpathy/convnetjs/index.html Github: https://github.com/karpathy/convnetjs
  29. Weight initialisation [He15a] provides a good rule of thumb Most

    toolkits such as Lasagne and Keras do this anyway, so there is no need to worry about it
  30. For each example x_train from the training set, evaluate the network prediction

    y_pred given the training input x = x_train Measure cost (error): the difference between y_pred and the ground-truth output y_train
  31. Classification (which of these categories best describes this?) Final layer:

    softmax as non-linearity ; output vector of class probabilities Cost: negative-log-likelihood / categorical cross-entropy
  32. Theano performs symbolic differentiation for you! dCdW = theano.grad(cost, W)

    (other toolkits – such as Torch and Tensorflow – can also do this)
  33. Update parameters: W_i' = W_i − γ ∂C/∂W_i and b_i' =

    b_i − γ ∂C/∂b_i, where γ = learning rate
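
    A minimal Theano sketch tying the two previous slides together (symbolic gradients plus the update rule) for a single softmax layer; the shapes and learning rate are illustrative:

        import numpy as np
        import theano
        import theano.tensor as T

        x = T.matrix('x')   # mini-batch of inputs
        t = T.matrix('t')   # one-hot ground-truth outputs

        # shared variables hold the parameters
        W = theano.shared(np.random.randn(10, 784).astype('float32') * 0.01, name='W')
        b = theano.shared(np.zeros(10, dtype='float32'), name='b')

        y = T.nnet.softmax(T.dot(x, W.T) + b)                  # network prediction
        cost = T.mean(T.nnet.categorical_crossentropy(y, t))   # negative log-likelihood

        # Theano performs symbolic differentiation for us
        dCdW = theano.grad(cost, W)
        dCdb = theano.grad(cost, b)

        gamma = 0.1   # learning rate (illustrative)
        updates = [(W, W - gamma * dCdW),
                   (b, b - gamma * dCdb)]

        train_fn = theano.function([x, t], cost, updates=updates)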
  34. Randomly split the training set into mini-batches of ~100 samples.

    Train on a mini-batch in a single step. The mini-batch cost is the mean of the costs of the samples in the mini-batch.
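
    A sketch of the random mini-batch split in plain Python/NumPy (the batch size, array names and the train_fn from the earlier sketch are illustrative):

        import numpy as np

        def iterate_minibatches(X, y, batch_size=100):
            # randomly split the training set into mini-batches of ~100 samples
            order = np.random.permutation(len(X))
            for start in range(0, len(X), batch_size):
                batch = order[start:start + batch_size]
                yield X[batch], y[batch]

        # one epoch: train on each mini-batch in a single step
        # for X_batch, y_batch in iterate_minibatches(X_train, y_train):
        #     batch_cost = train_fn(X_batch, y_batch)   # mean cost over the mini-batch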
  35. Training on mini-batches means that ~100 samples are processed in

    parallel – very good for running GPUs that do lots of operations in parallel
  36. Training on (enough mini-batches to cover) all examples in the

    training set is called an epoch Run multiple epochs (often 200-300)
  37. Summary; train a neural network: Take mini-batch of training samples

    Evaluate (run/execute) the network Measure the average error/cost across mini- batch Use gradient descent to modify parameters to reduce cost REPEAT ABOVE UNTIL DONE
  38. (Obligatory) MNIST example: 2 hidden layers, both 256 units after

    300 iterations over training set: 1.83% validation error Architecture: input 784 (28x28 images) → hidden 256 → hidden 256 → output 10
  39. Each image visualises the weights connecting pixels to a specific

    unit in the first hidden layer Note the stroke features detected by the various units
  40. The fully connected networks so far have a weakness: No

    translation invariance; learned features are position dependent
  41. For more general imagery: requires a training set large enough

    to see all features in all possible positions… Requires network with enough units to represent this…
  42. Multiply image pixels by filter weights and sum Take a region

    of pixels from the image, multiply element-wise by the filter weights and sum to give one result value Do this for all possible positions in the image
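
    A minimal NumPy sketch of this multiply-and-sum over all positions (single channel, 'valid' positions only, written as cross-correlation for clarity):

        import numpy as np

        def conv2d_valid(image, filt):
            fh, fw = filt.shape
            oh, ow = image.shape[0] - fh + 1, image.shape[1] - fw + 1
            out = np.zeros((oh, ow))
            for i in range(oh):
                for j in range(ow):
                    region = image[i:i + fh, j:j + fw]   # region of pixels from image
                    out[i, j] = (region * filt).sum()    # multiply and sum
            return out

        # e.g. a 3x3 vertical-edge filter applied to a random 'image'
        image = np.random.rand(28, 28)
        filt = np.array([[1.0, 0.0, -1.0]] * 3)
        response = conv2d_valid(image, filt)   # 26x26 map of filter responses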
  43. An output pixel shows the strength of the filter response for

    the corresponding region of the input
  44. Convolution detects features in a position independent manner -- Convolutional

    neural networks learn position independent filters (feature detectors)
  45. Recap: FC (fully-connected) layer: y = f(Wx + b) Input vector x, weighted

    connections W, bias b, activation function (non-linearity) f, layer activation y
  46. The values of the weights form a filter For practical

    computer vision, more than one filter must be used to extract a variety of features
  47. Still y = f(Wx + b), as convolution can be expressed

    as multiplication by a weight matrix
  48. Another way of looking at it: A single filter of

    an e.g. 5x5 convolutional layer is a bit like…
  49. a fully-connected layer with a 5x5 input image, repeated across the whole

    image, with a new ‘fully-connected layer’ for each filter
  50. Max-pooling ‘layer’ [Ciresan12] Take the maximum value from each 2 x

    2 pooling region (p x p in the general case) Down-samples the image by a factor of p Operates on channels independently
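
    A minimal NumPy sketch of 2x2 max-pooling on a single channel (the input size is illustrative):

        import numpy as np

        def max_pool_2x2(image):
            # take the maximum value from each non-overlapping 2x2 pooling region,
            # down-sampling the image by a factor of 2
            h, w = image.shape
            blocks = image[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
            return blocks.max(axis=(1, 3))

        pooled = max_pool_2x2(np.random.rand(24, 24))   # 24x24 -> 12x12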
  51. Down-sampling: striding Can also down-sample using strided convolution; generate output

    for 1 in every n pixels Faster, and can work as well as max-pooling
  52. Simplified LeNet for MNIST digits Input 1x28x28 → Conv: 20 5x5 filters

    → 20x24x24 → Maxpool 2x2 → 20x12x12 → Conv: 50 5x5 filters → 50x8x8 → Maxpool 2x2 → 50x4x4 → (flatten and) fully connected → 256 → fully connected → 10 Output
  53. after 300 iterations over training set: 99.21% validation accuracy

    Model                           Error
    FC64                            2.85%
    FC256--FC256                    1.83%
    20C5--MP2--50C5--MP2--FC256     0.79%
  54. What about the learned kernels? Image taken from paper [Krizhevsky12]

    (ImageNet dataset, not MNIST) Gabor filters
  55. Provides API for: constructing layers of a network getting Theano

    expressions representing output, loss, etc.
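
    A minimal sketch of that API in use, building a small (illustrative) network for 28x28 images and getting Theano expressions for its output and loss:

        import theano
        import theano.tensor as T
        import lasagne

        X = T.tensor4('X')
        y = T.ivector('y')

        # constructing the layers of a network
        l = lasagne.layers.InputLayer((None, 1, 28, 28), input_var=X)
        l = lasagne.layers.DenseLayer(l, num_units=256)
        l = lasagne.layers.DenseLayer(l, num_units=10,
                                      nonlinearity=lasagne.nonlinearities.softmax)

        # Theano expressions representing output and loss
        pred = lasagne.layers.get_output(l)
        loss = lasagne.objectives.categorical_crossentropy(pred, y).mean()

        params = lasagne.layers.get_all_params(l, trainable=True)
        updates = lasagne.updates.sgd(loss, params, learning_rate=0.01)
        train_fn = theano.function([X, y], loss, updates=updates)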
  56. Lasagne is quite a thin layer on top of Theano,

    so understanding Theano is helpful On the plus side, implementing custom layers, loss functions, etc is quite doable.
  57. The VGG group at Oxford University trained VGG-16 and VGG-19

    for ImageNet classification We will use VGG-19; the 19-layer model
  58. Input: 3 x 224 x 224 (RGB image, zero-mean)

    #  Layer
    1  64C3
    2  64C3
       MP2
    3  128C3
    4  128C3
       MP2
    5  256C3
    6  256C3
    7  256C3
    8  256C3
       MP2

    Notation: 64C3 = convolutional layer with 64 3x3 filters; MP2 = max-pooling, 2x2
  59. # Layer (continued)

    9   512C3
    10  512C3
    11  512C3
    12  512C3
        MP2
    13  512C3
    14  512C3
    15  512C3
    16  512C3
        MP2
    17  FC4096 (drop 50%)
    18  FC4096 (drop 50%)
    19  FC1000 soft-max

    Notation: FC4096 = fully-connected layer, 4096 channels; drop 50% = with 50% drop-out during training
  60. The full architecture (the two previous slides side by side): Input 3 x 224 x 224

    (RGB image, zero-mean) → 64C3, 64C3, MP2, 128C3, 128C3, MP2, 256C3, 256C3, 256C3, 256C3, MP2, 512C3, 512C3, 512C3, 512C3, MP2, 512C3, 512C3, 512C3, 512C3, MP2, FC4096 (drop 50%), FC4096 (drop 50%), FC1000 soft-max
  61. These kinds of architectures tend to work well: Small convolution

    filters (3x3) Interspersed with max-pooling
  62. Standardise input data In case of regression, standardise output data

    too (don’t forget to invert the standardisation of network predictions!)
  63. Standardisation Extract samples (pixels in the case of images) into

    an array Compute distribution and standardise
  64. Either: Zero the mean and scale std-dev to 1, per

    channel (RGB for images): x' = (x − μ) / σ
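
    A minimal NumPy sketch of this per-channel standardisation for a batch of RGB images (the (N, 3, H, W) array layout is an assumption):

        import numpy as np

        images = np.random.rand(100, 3, 32, 32)   # N RGB images

        # per-channel mean and std-dev over all samples and pixels
        mean = images.mean(axis=(0, 2, 3), keepdims=True)
        std = images.std(axis=(0, 2, 3), keepdims=True)

        images_std = (images - mean) / std   # x' = (x - mu) / sigma

        # apply the SAME mean/std to validation/test data, and for regression
        # remember to invert the standardisation of network predictions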
  65. Small mini-batches Maybe around ~8

    • Good but slower training Small mini-batches result in regularization (due to noise), reaching lower error rates in the end [Goodfellow16]. When using very small mini-batches, compensate with a lower learning rate and more epochs.
    • Slow due to low parallelism Does not use all cores of the GPU
    • Low memory usage Fewer neuron activations kept in RAM
  66. Large mini-batches 1000s

    • Ineffective training Won’t reach the same error rate as with smaller batches, and may not learn at all
    • Can be fast due to high parallelism Uses GPU parallelism (there are limits; gains only achievable if there are unused CUDA cores)
    • High memory usage Lots of neuron activations kept around; can run out of RAM on large networks
  67. Happy medium (where you want to be) Maybe around 64-256;

    lots of experiments use ~100
    • Effective training Learns reasonably quickly – in terms of improvement per epoch – and reaches an acceptable error rate or loss
    • Medium performance Acceptable in many cases
    • Medium memory usage Fine for modest-sized networks
  68. Increasing mini-batch size will improve performance up to the point

    where all GPU units are in use Increasing it further will not improve performance; it will reduce accuracy
  69. Normally applied after later, fully connected layers

    lyr = lasagne.layers.DenseLayer(lyr, num_units=256)
    lyr = lasagne.layers.DropoutLayer(lyr, p=0.5)
  70. Over-fitting is a well-known problem in machine learning, and it affects neural

    networks particularly A model over-fits when it is very good at correctly predicting samples in the training set but fails to generalise to samples outside it
  71. DropOut [Hinton12] During training, randomly choose units to ‘drop out’

    by setting their output to 0, with probability p, usually around 0.5 (compensate by multiplying the remaining values by 1/(1−p))
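
    A minimal NumPy sketch of this (the 'inverted dropout' form, where the compensation is applied during training):

        import numpy as np

        def dropout(activations, p=0.5):
            # zero each unit's output with probability p and
            # compensate by multiplying the survivors by 1/(1-p)
            keep = np.random.rand(*activations.shape) >= p
            return activations * keep / (1.0 - p)

        h = np.random.rand(100, 256)    # a mini-batch of hidden-layer activations
        h_train = dropout(h, p=0.5)     # during training only; at test time use h unchanged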
  72. Turning on a different subset of units for each sample:

    causes units to learn more robust features that cannot rely on the presence of other specific features to cover for flaws
  73. Lasagne batch normalization inserts itself into a layer before the

    non-linearity, so it’s nice and easy to use:
    l = lasagne.layers.batch_norm(l)
  74. Batch normalization [Ioffe15] is recommended in most cases Lets you

    build deeper networks Speeds up training; loss and error drop faster per-epoch
  75. Standardise activations (zero-mean, unit variance) per-channel between network layers Solves

    problems caused by exponential growth or shrinkage of layer activations
  76. σ_0 = 1, σ_1 = 2, σ_{i+1} = 2 σ_i

    Assume that a layer – the grey square in the diagram – produces activations whose std-dev is twice that of its input:
  77. When such layers are stacked together: σ_0 = 1, σ_n =

    2^n σ_0 = 2^n ⋯
  78. The magnitude of activations and therefore gradients either explode or

    vanish (if the layers reduce the magnitude of activations rather than magnify them)
  79. Active learning Predict which un-labelled samples are hardest to classify;

    where labels/ground truth would be most helpful
  80. Active learning by confidence The maximum probability is that of

    the predicted class and its value is the confidence
  81. Active learning by confidence Choose the samples with the lowest

    confidence as the next candidates for labelling
  82. Start with 500 labelled training samples Each round, of the

    remaining training samples, choose the 500 with the least confidence Add to dataset
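
    A sketch of the least-confidence selection step in NumPy; predict_proba and unlabelled_pool are hypothetical placeholders for the trained network's class-probability output and the remaining training samples:

        import numpy as np

        def least_confident_indices(probs, n=500):
            # probs: (num_samples, num_classes) predicted class probabilities
            confidence = probs.max(axis=1)      # probability of the predicted class
            return np.argsort(confidence)[:n]   # the n least confident samples

        # each round:
        # probs = predict_proba(unlabelled_pool)
        # chosen = least_confident_indices(probs, n=500)
        # ...label unlabelled_pool[chosen] and add them to the training set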
  83. Active learning: MNIST Random choice vs least-confidence choice [Plot:

    prediction error (%) against # labelled samples (500 to 5000), for random order vs confidence-based selection]
  84. MNIST: Only needs 5k out of 50k samples to reach

    (very nearly) the same accuracy
  85. Learns to predict constant value; optimizes constant value for best

    loss A constant value is a local minimum that the network won’t get out of (neural networks ‘cheat’ like crazy!)
  86. Saliency maps Determine which parts of an image the network

    is using to make its prediction Tells you what the network is ‘looking at’
  87. Theoretically possible to use a single network, with enough training

    data (where enough is an impractical amount)
  88. Example Identifying right whales, by Felix Lau 2nd place in

    Kaggle competition http://felixlaumon.github.io/2015/01/0 8/kaggle-right-whale.html
  89. Identifying right whales, by Felix Lau The first naïve solution

    – training a classifier to identify individuals – did not work well
  90. Region-based saliency map revealed that the network had ‘locked on’

    to features in the ocean shape rather than the whales
  91. Lau’s solution: Train a keypoint finder to locate two keypoints

    on the whale’s head to identify its orientation
  92. Deep Neural Networks are Easily Fooled: High Confidence Predictions for

    Unrecognizable Images [Nguyen15] Generate images that are unrecognizable to human eyes but are recognized by the network
  93. Deep Neural Networks are Easily Fooled: High Confidence Predictions for

    Unrecognizable Images [Nguyen15] Image taken from [Nguyen15]
  94. Learning to generate chairs with convolutional neural networks [Dosovitskiy15] Network

    in reverse: orientation, design, colour, etc. parameters as input; rendered images as output [Figure: training images]
  95. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [Radford15]

    Train two networks: one given random parameters to generate an image, the other to discriminate between a generated image and one from the training set
  96. A Neural Algorithm of Artistic Style [Gatys15] Take an OxfordNet

    model [Simonyan14] and extract texture features from one of the convolutional layers, given a target style / painting as input Use gradient descent to iteratively modify the photo – not the weights – so that its texture features match those of the target image.
  97. [He15a] He, Zhang, Ren and Sun; Delving Deep into Rectifiers:

    Surpassing Human-Level Performance on ImageNet Classification, arXiv 2015
  98. [He15b] He, Kaiming, et al. "Deep Residual Learning for Image

    Recognition." arXiv preprint arXiv:1512.03385 (2015).
  99. [Hinton12] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and

    R. R. Salakhutdinov; Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  100. [Ioffe15] Ioffe, S.; Szegedy C.. (2015). “Batch Normalization: Accelerating Deep

    Network Training by Reducing Internal Covariate Shift". ICML 2015, arXiv:1502.03167
  101. [Simonyan14] K. Simonyan and A. Zisserman; Very deep convolutional networks for

    large-scale image recognition, arXiv:1409.1556, 2014
  102. [Wang14] Wang, Dan, and Yi Shang. "A new active labeling

    method for deep learning." Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014.