Slide 1

Slide 1 text

Deep Learning Tutorial: Advanced Techniques. G. French, King's College London / University of East Anglia. Image montages from http://www.image-net.org

Slide 2

Slide 2 text

Focus

Slide 3

Slide 3 text

Image processing, using Theano [1] and Lasagne [2]. [1] http://deeplearning.net/software/theano/ [2] https://github.com/Lasagne/Lasagne

Slide 4

Slide 4 text

What we’ll cover

Slide 5

Slide 5 text

Theano: what it is and how it works. Review: the multi-layer perceptron, the basic model. Convolutional networks: neural networks for computer vision

Slide 6

Slide 6 text

Lasagne and VGG-19: explain Lasagne and use it with a convolutional network trained by the VGG group at Oxford University. Deep learning tricks of the trade: tips to save you some time. Active learning: less training data by careful choice

Slide 7

Slide 7 text

Tutorial materials

Slide 8

Slide 8 text

Github Repo: https://github.com/Britefury/deep-learning-tutorial-pydata2016 The notebooks are viewable on Github

Slide 9

Slide 9 text

Intro to Theano and Lasagne slides: https://speakerdeck.com/britefury https://speakerdeck.com/britefury/intro-to-theano-and-lasagne-for-deep-learning

Slide 10

Slide 10 text

Amazon AMI (Use GPU machine) AMI ID: ami-5f789e32 AMI Name: PyData London 2016 deep learning adv tutorial - Ubuntu-14.04 Anaconda2-4.0.0 Cuda-7.5 cuDNN-5 Theano-0.8 Lasagne Fuel

Slide 11

Slide 11 text

Theano

Slide 12

Slide 12 text

Neural network software comes in two flavours: neural network toolkits and expression compilers

Slide 13

Slide 13 text

Neural network toolkit Specify structure of neural network in terms of layers

Slide 14

Slide 14 text

Expression compilers Describe network architecture in terms of mathematical expressions

Slide 15

Slide 15 text

In comparison:
Network toolkit (e.g. CAFFE). Advantages: CAFFE is fast; most likely easier to get going; bindings for MATLAB, Python and command-line access. Disadvantages: less flexible; harder to extend (need to learn the architecture, manual differentiation).
Expression compiler (e.g. Theano). Advantages: extensible; a new layer type or cost function is no problem; see what goes on under the hood; being adventurous is easier! Disadvantages: slower (Theano); debugging can be tricky (compiled expressions are a step away from your code); typically only works with one language (e.g. Python for Theano)

Slide 16

Slide 16 text

Theano An expression compiler

Slide 17

Slide 17 text

Write NumPy-style expressions; Theano compiles them to either C (CPU) or CUDA (NVIDIA GPU)

Slide 18

Slide 18 text

Notebook: Theano basics. Expressions; modifying shared variables; variables and functions; gradients and updates
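As a quick orientation before the notebook, here is a minimal sketch of that workflow – a symbolic expression, shared variables, a compiled function and a gradient-based update. The sizes and the cost are arbitrary placeholders, not the notebook's exact code:

```python
import numpy as np
import theano
import theano.tensor as T

# Symbolic input and shared (stateful) parameters
x = T.matrix('x')
w = theano.shared(np.random.randn(5, 3).astype(theano.config.floatX), name='w')
b = theano.shared(np.zeros(3, dtype=theano.config.floatX), name='b')

# Build an expression graph -- nothing is computed yet
y = T.nnet.sigmoid(T.dot(x, w) + b)
cost = T.mean((y - 1.0) ** 2)

# Symbolic gradients and a simple gradient-descent update
gw, gb = T.grad(cost, [w, b])
train = theano.function(
    inputs=[x], outputs=cost,
    updates=[(w, w - 0.1 * gw), (b, b - 0.1 * gb)])

print(train(np.random.randn(8, 5).astype(theano.config.floatX)))
```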

Slide 19

Slide 19 text

There is much more to Theano For more information: http://deeplearning.net/

Slide 20

Slide 20 text

Review: MLP (multi-layer perceptron)

Slide 21

Slide 21 text

x = input (M-element vector); y = output (N-element vector); W = weights parameter (NxM matrix); b = bias parameter (N-element vector); f = activation function, normally ReLU but can be tanh or sigmoid. y = f(Wx + b)

Slide 22

Slide 22 text

In a nutshell: y = f(Wx + b)

Slide 23

Slide 23 text

Repeat for each layer: input vector x → hidden layer 0 activation y_0 = f(W_0 x + b_0) → hidden layer 1 activation y_1 = f(W_1 y_0 + b_1) → ⋯ → final layer activation (output) y_L = f(W_L y_{L-1} + b_L)

Slide 24

Slide 24 text

To train the network: compute the derivative of the cost w.r.t. the parameters (W and b)

Slide 25

Slide 25 text

Update parameters: W' = W − γ ∂C/∂W, b' = b − γ ∂C/∂b, where γ = learning rate

Slide 26

Slide 26 text

Theano takes care of the differentiation for you!

Slide 27

Slide 27 text

(Obligatory) MNIST example: two hidden layers, both 256 units (784 → 256 → 256 → 10; input is 28x28 images). After 300 iterations over the training set: 1.83% validation error

Slide 28

Slide 28 text

MNIST is quite a special case: digits are nicely centred within the image and scaled to approximately the same size

Slide 29

Slide 29 text

The fully connected networks so far have a weakness: No translation invariance; learned features are position dependent

Slide 30

Slide 30 text

For more general imagery: requires a training set large enough to see all features in all possible positions… Requires network with enough units to represent this…

Slide 31

Slide 31 text

Convolutional networks

Slide 32

Slide 32 text

Convolution Slide a convolution kernel over an image Multiply image pixels by kernel pixels and sum

Slide 33

Slide 33 text

Convolution Convolutions are often used for feature detection
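As a quick illustration (not part of the tutorial notebooks), a horizontal-edge-detecting Sobel kernel applied with SciPy; the random array is just a stand-in for a greyscale image:

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel kernel: responds strongly to horizontal edges
kernel = np.array([[ 1,  2,  1],
                   [ 0,  0,  0],
                   [-1, -2, -1]], dtype=np.float32)

image = np.random.rand(28, 28).astype(np.float32)   # stand-in greyscale image

# Slide the kernel over the image, multiplying and summing at each position
response = convolve2d(image, kernel, mode='same')
```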

Slide 34

Slide 34 text

A brief detour…

Slide 35

Slide 35 text

Gabor filters

Slide 36

Slide 36 text

Back on track to… Convolutional networks

Slide 37

Slide 37 text

Recap: FC (fully-connected) layer, y = f(Wx + b): input vector x, weighted connections W, bias b, activation function (non-linearity) f, layer activation y

Slide 38

Slide 38 text

Convolutional layer Each unit only connected to units in its neighbourhood

Slide 39

Slide 39 text

Convolutional layer Weights are shared Red weights have same value As do greens… And yellows

Slide 40

Slide 40 text

The values of the weights form a convolution kernel. For practical computer vision, more than one kernel must be used to extract a variety of features

Slide 41

Slide 41 text

Convolutional layer Different weight-kernels: Output is image with multiple channels

Slide 42

Slide 42 text

Still y = f(Wx + b), as convolution can be expressed as multiplication by a weight matrix

Slide 43

Slide 43 text

Note In subsequent layers, each kernel connects to pixels in ALL channels in previous layer

Slide 44

Slide 44 text

Another way of looking at it: A single kernel of an e.g. 5x5 convolutional layer is a bit like…

Slide 45

Slide 45 text

a fully-connected layer with a 5x5 input image, repeated across the whole image, with a new ‘fully-connected layer’ for each filter

Slide 46

Slide 46 text

Max-pooling ‘layer’ [Ciresan12]: take the maximum value from each 2 x 2 pooling region (p x p in the general case). Down-samples the image by a factor of p. Operates on channels independently

Slide 47

Slide 47 text

Example: A Simplified LeNet [LeCun95] for MNIST digits

Slide 48

Slide 48 text

Simplified LeNet for MNIST digits: input 1x28x28 → conv, 20 5x5 kernels (20x24x24) → max-pool 2x2 (20x12x12) → conv, 50 5x5 kernels (50x8x8) → max-pool 2x2 (50x4x4) → (flatten and) fully connected, 256 units → fully connected, 10 outputs

Slide 49

Slide 49 text

After 300 iterations over the training set: 99.21% validation accuracy.
Model: error
FC64: 2.85%
FC256--FC256: 1.83%
20C5--MP2--50C5--MP2--FC256: 0.79%

Slide 50

Slide 50 text

Lasagne and VGG-19

Slide 51

Slide 51 text

Lasagne is a neural network library built on Theano

Slide 52

Slide 52 text

Provides API for: constructing layers of a network getting Theano expressions representing output, loss, etc.

Slide 53

Slide 53 text

Lasagne is quite a thin layer on top of Theano, so understanding Theano is helpful. On the plus side, implementing custom layers, loss functions, etc. is quite doable.

Slide 54

Slide 54 text

Notebook: Lasagne basics Build network: modified LeNet for MNIST Train the network
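A rough sketch of what building and training such a network looks like in Lasagne; the layer sizes follow the simplified LeNet above, while details like the Adam update rule are placeholder choices rather than the notebook's exact code:

```python
import theano
import theano.tensor as T
import lasagne

# Network: the simplified LeNet described earlier
l = lasagne.layers.InputLayer(shape=(None, 1, 28, 28))
l = lasagne.layers.Conv2DLayer(l, num_filters=20, filter_size=(5, 5))
l = lasagne.layers.MaxPool2DLayer(l, pool_size=(2, 2))
l = lasagne.layers.Conv2DLayer(l, num_filters=50, filter_size=(5, 5))
l = lasagne.layers.MaxPool2DLayer(l, pool_size=(2, 2))
l = lasagne.layers.DenseLayer(l, num_units=256)
l = lasagne.layers.DenseLayer(l, num_units=10,
                              nonlinearity=lasagne.nonlinearities.softmax)

# Theano expressions for predictions, loss and parameter updates
X = T.tensor4('X')
y = T.ivector('y')
pred = lasagne.layers.get_output(l, X)
loss = lasagne.objectives.categorical_crossentropy(pred, y).mean()
params = lasagne.layers.get_all_params(l, trainable=True)
updates = lasagne.updates.adam(loss, params)

train_fn = theano.function([X, y], loss, updates=updates)
```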

Slide 55

Slide 55 text

Using a pre-trained VGG-19 convnet

Slide 56

Slide 56 text

Use VGG-19, the 19-layer model: a 1000-class image classifier, trained on ImageNet

Slide 57

Slide 57 text

VGG models are simple but effective Consist of: 3x3 convolutions 2x2 max pooling fully connected

Slide 58

Slide 58 text

Input: 3 x 224 x 224 (RGB image, zero-mean)
1: 64C3
2: 64C3, MP2
3: 128C3
4: 128C3, MP2
5: 256C3
6: 256C3
7: 256C3
8: 256C3, MP2
Notation: 64C3 = convolutional layer with 64 3x3 filters; MP2 = 2x2 max-pooling

Slide 59

Slide 59 text

9: 512C3
10: 512C3
11: 512C3
12: 512C3, MP2
13: 512C3
14: 512C3
15: 512C3
16: 512C3, MP2
17: FC4096 (drop 50%)
18: FC4096 (drop 50%)
19: FC1000, soft-max
Notation: FC4096 = fully-connected layer with 4096 channels; drop 50% = 50% drop-out during training

Slide 60

Slide 60 text

The full architecture (combining the two halves above): input 3x224x224 → 64C3, 64C3, MP2 → 128C3, 128C3, MP2 → 256C3 x4, MP2 → 512C3 x4, MP2 → 512C3 x4, MP2 → FC4096 (drop 50%), FC4096 (drop 50%), FC1000 soft-max

Slide 61

Slide 61 text

These kinds of architectures tend to work well: Small convolution kernels (3x3) Interspersed with max-pooling

Slide 62

Slide 62 text

A good starting point when choosing a network architecture

Slide 63

Slide 63 text

Exercise / Demo Classifying an image with VGG-19

Slide 64

Slide 64 text

What about using VGG-19 to find a peacock in a photo?

Slide 65

Slide 65 text

Can extract square patches from image in sliding window fashion and classify:
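A sketch of that sliding-window loop; classify_fn is a hypothetical function (e.g. a compiled Theano function wrapping VGG-19) mapping a patch to class probabilities, and the window and step sizes are illustrative:

```python
import numpy as np

def sliding_window_predictions(img, classify_fn, win=224, step=32):
    """Classify square patches extracted in sliding-window fashion."""
    h, w = img.shape[:2]
    results = []
    for y in range(0, h - win + 1, step):
        for x in range(0, w - win + 1, step):
            patch = img[y:y + win, x:x + win]
            # classify_fn returns a vector of class probabilities for the patch
            results.append(((y, x), classify_fn(patch)))
    return results
```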

Slide 66

Slide 66 text

Exercise / Demo Finding a peacock with VGG-19: part 1

Slide 67

Slide 67 text

Inefficient

Slide 68

Slide 68 text

Using convolutions to your advantage

Slide 69

Slide 69 text

Adjacent windows share majority of their pixels

Slide 70

Slide 70 text

For lower levels, this involves repeating many of the same computations, getting the same result

Slide 71

Slide 71 text

If we could apply the first convolutional layer across the whole image rather than many 224x224 blocks, we could re-use those computations…

Slide 72

Slide 72 text

Then we could also do this for the rest of the convolutional layers further down…

Slide 73

Slide 73 text

In fact we can use the whole network in a convolutional fashion; we just need to convert the fully-connected layers to convolutional layers.
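One way the conversion can be sketched in Lasagne. Here fc_layer and prev are assumed names for VGG-19's first FC4096 layer and the pooling layer before it (output shape (None, 512, 7, 7)); the exact weight layout and the flip_filters setting should be checked against the model you actually load:

```python
import lasagne

# Hypothetical names: fc_layer is VGG-19's first FC4096 layer, prev the layer
# before it, whose output shape is (None, 512, 7, 7).
W = fc_layer.W.get_value()          # shape (512*7*7, 4096)
b = fc_layer.b.get_value()          # shape (4096,)

# Equivalent 7x7 convolution over 512 input channels, producing 4096 channels.
# flip_filters=False so the convolution matches the dense dot product.
conv_equiv = lasagne.layers.Conv2DLayer(
    prev, num_filters=4096, filter_size=(7, 7),
    nonlinearity=fc_layer.nonlinearity, flip_filters=False)

# Reshape the FC weights into convolution kernels: (4096, 512, 7, 7)
conv_equiv.W.set_value(W.T.reshape((4096, 512, 7, 7)))
conv_equiv.b.set_value(b)
# The remaining fully-connected layers become 1x1 convolutions in the same way.
```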

Slide 74

Slide 74 text

Exercise / Demo Finding a peacock with VGG-19: part 2

Slide 75

Slide 75 text

This is a trick used when doing image segmentation, when we want to determine which parts of an image belong to which class, at both training time and prediction time

Slide 76

Slide 76 text

Deep learning tricks of the trade

Slide 77

Slide 77 text

Choosing: mini-batch size

Slide 78

Slide 78 text

Small mini-batches (maybe around ~8): good but slower training. Small mini-batches act as regularization (due to noise), reaching lower error rates in the end [Goodfellow16]; when using very small mini-batches, you need to compensate with a lower learning rate and more epochs. Slow due to low parallelism: does not use all cores of the GPU. Low memory usage: fewer neuron activations kept in RAM

Slide 79

Slide 79 text

Large mini-batches (1000s): ineffective training; won't reach the same error rate as with smaller batches and may not learn at all. Can be fast due to high parallelism: uses GPU parallelism (there are limits; gains are only achievable while there are unused CUDA cores). High memory usage: lots of neuron activations kept around; can run out of RAM on large networks

Slide 80

Slide 80 text

Happy medium (where you want to be): maybe around 64-256; lots of experiments use ~100. Effective training: learns reasonably quickly – in terms of improvement per epoch – and reaches an acceptable error rate or loss. Medium performance: acceptable in many cases. Medium memory usage: fine for modest-sized networks

Slide 81

Slide 81 text

~100 seems to work well; gets good results

Slide 82

Slide 82 text

Increasing the mini-batch size will improve performance up to the point where all GPU units are in use. Increasing it further will not improve performance; it will reduce accuracy

Slide 83

Slide 83 text

Caveat: when working in a convolutional fashion – like the example of using VGG-net to find the peacock – or when doing image segmentation

Slide 84

Slide 84 text

In such cases, pushing large patches of an image through as a single batch along with a correspondingly large output patch re-uses data due to convolutions and results in substantial savings

Slide 85

Slide 85 text

My experience: Use patches that are as large as possible Although it’s a tricky balance with accuracy of the final result

Slide 86

Slide 86 text

Batch normalization

Slide 87

Slide 87 text

Batch normalization [Ioffe15] is recommended in most cases Speeds up training Loss and error drop faster per-epoch

Slide 88

Slide 88 text

Although epochs take longer (around 2x in my experience) Can (ultimately) reach lower error rates Lets you build deeper networks

Slide 89

Slide 89 text

Standardise activations (zero-mean, unit variance) per-channel between network layers Solves problems caused by exponential growth or shrinkage of layer activations

Slide 90

Slide 90 text

Assume that a layer (the grey square in the figure) produces activations whose std-dev is twice that of its input: σ_in = 1, σ_out = 2σ_in = 2

Slide 91

Slide 91 text

When layers are stacked together, the std-dev of the activations multiplies at each layer: 1 → 2 → 4 → ⋯ → 2^n after n such layers

Slide 92

Slide 92 text

The magnitude of activations and therefore gradients either explode or vanish (if the layers reduce the magnitude of activations rather than magnify them)

Slide 93

Slide 93 text

Can be partially addressed with careful weight initialization [He15a]. Batch normalization between layers keeps things sane; can train networks with hundreds of layers [He15b].

Slide 94

Slide 94 text

After convolutional, fully-connected or network-in-network* layers, before the non-linearity. (* kind of like a 1x1 convolutional layer [Lin13])

Slide 95

Slide 95 text

Lasagne batch normalization inserts itself into a layer before the non-linearity, so it's nice and easy to use: l = lasagne.layers.batch_norm(l)
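A minimal sketch of how it composes with ordinary layers (the sizes are arbitrary):

```python
from lasagne.layers import InputLayer, Conv2DLayer, DenseLayer, batch_norm

# batch_norm() wraps a layer: it removes the layer's non-linearity, inserts a
# BatchNormLayer, then re-applies the non-linearity afterwards.
l = InputLayer(shape=(None, 3, 32, 32))
l = batch_norm(Conv2DLayer(l, num_filters=32, filter_size=(3, 3)))
l = batch_norm(DenseLayer(l, num_units=256))
```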

Slide 96

Slide 96 text

Data standardisation

Slide 97

Slide 97 text

Standardise your data Ensure zero-mean and unit standard deviation

Slide 98

Slide 98 text

Standardise input data In case of regression, standardise output data too (don’t forget to invert the standardisation of network predictions!)

Slide 99

Slide 99 text

Autoencoder; image reconstruction (regression), PCA whitening

Slide 100

Slide 100 text

Autoencoder; image reconstruction (regression), no standardisation

Slide 101

Slide 101 text

Still a good idea to do this, even when using batch normalization

Slide 102

Slide 102 text

Autoencoder; edge map reconstruction (regression), standardisation

Slide 103

Slide 103 text

Autoencoder; edge map reconstruction (regression), no standardisation

Slide 104

Slide 104 text

Standardisation Extract samples (pixels in the case of images) into an array Compute distribution and standardise

Slide 105

Slide 105 text

Either: zero the mean and scale the std-dev to 1, per channel (RGB for images): x' = (x − μ) / σ
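A minimal NumPy sketch of per-channel standardisation; the random array stands in for an (N, H, W, 3) set of training images:

```python
import numpy as np

X = np.random.rand(100, 32, 32, 3).astype(np.float32)   # stand-in training images

# Per-channel statistics over every pixel of the training set
mean = X.reshape(-1, 3).mean(axis=0)
std = X.reshape(-1, 3).std(axis=0)

X_std = (X - mean) / std    # broadcasting applies it per channel (RGB)

# Apply the *same* mean/std to validation and test data, and invert the
# transform on network predictions when doing regression.
```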

Slide 106

Slide 106 text

CIFAR10 RGB distribution

Slide 107

Slide 107 text

CIFAR10 RGB – standardised

Slide 108

Slide 108 text

Or better still: Use PCA whitening (retain all channels – we don’t want to reduce dimensionality)

Slide 109

Slide 109 text

CIFAR10 RGB – with principal components

Slide 110

Slide 110 text

CIFAR10 RGB - principal components aligned with standard basis

Slide 111

Slide 111 text

CIFAR10 RGB – PCA whitened

Slide 112

Slide 112 text

PCA whitening From Scikit-learn use PCA or IncrementalPCA
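A sketch with scikit-learn, treating each pixel as a 3-channel sample and keeping all components (the random array is a stand-in for real image data):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 32, 32, 3).astype(np.float32)   # stand-in images

# One row per pixel; keep all 3 components -- no dimensionality reduction
pixels = X.reshape(-1, 3)
pca = PCA(n_components=3, whiten=True)
white = pca.fit_transform(pixels)
X_white = white.reshape(X.shape)

# pca.inverse_transform() undoes the whitening, e.g. for visualisation
```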

Slide 113

Slide 113 text

Could batch normalisation make standardisation unnecessary?

Slide 114

Slide 114 text

The previous fish-based examples all used batch normalisation and still benefited from data standardisation, so no.

Slide 115

Slide 115 text

When training goes wrong and what to look for

Slide 116

Slide 116 text

Loss becomes NaN

Slide 117

Slide 117 text

Classification error rate equivalent to that of a random guess (it's not learning)

Slide 118

Slide 118 text

Learns to predict a constant value; optimises the constant value for the best loss. A constant value is a local minimum that the network won't get out of (neural networks ‘cheat’ like crazy!)

Slide 119

Slide 119 text

Debugging your network

Slide 120

Slide 120 text

Neural networks (most) often DON’T learn what you want or expect them to

Slide 121

Slide 121 text

Local minima will be the bane of your existence

Slide 122

Slide 122 text

So, what has your network learnt? This is often a good question.

Slide 123

Slide 123 text

Saliency maps Determine which parts of an image the network is using to make its prediction Tells you what the network is ‘looking at’

Slide 124

Slide 124 text

Two approaches

Slide 125

Slide 125 text

1. Region-level saliency Blank out different regions of the image and compute the difference in prediction
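A sketch of the idea; predict_fn is a hypothetical function mapping an image to class probabilities, and the block size and zero fill are illustrative choices:

```python
import numpy as np

def region_saliency(img, predict_fn, target_class, block=32):
    """Blank out each block and record the drop in the target class's probability."""
    base = predict_fn(img)[target_class]
    h, w = img.shape[:2]
    sal = np.zeros((h // block, w // block))
    for by in range(h // block):
        for bx in range(w // block):
            blanked = img.copy()
            blanked[by * block:(by + 1) * block,
                    bx * block:(bx + 1) * block] = 0
            sal[by, bx] = base - predict_fn(blanked)[target_class]
    return sal
```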

Slide 126

Slide 126 text

Exercise / Demo Block-level image saliency

Slide 127

Slide 127 text

2. Pixel-level saliency Compute the gradient of the prediction of a specific class w.r.t. the image pixels
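In Theano this is a one-line gradient; a sketch assuming net is the output layer of a Lasagne classification network built on a standard InputLayer:

```python
import theano
import theano.tensor as T
import lasagne

X = T.tensor4('X')
class_idx = T.iscalar('class_idx')

# Class probabilities for the input image; dropout etc. switched off
prob = lasagne.layers.get_output(net, X, deterministic=True)

# Gradient of the chosen class's probability w.r.t. the input pixels
saliency = T.grad(prob[0, class_idx], X)
saliency_fn = theano.function([X, class_idx], saliency)
```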

Slide 128

Slide 128 text

Exercise / Demo Pixel-level image saliency

Slide 129

Slide 129 text

Designing a computer vision pipeline

Slide 130

Slide 130 text

Simple problems may be solved with just a neural network

Slide 131

Slide 131 text

Not sufficient for more complex problems

Slide 132

Slide 132 text

Theoretically possible to use a single network, with enough training data (where enough is an impractical amount)

Slide 133

Slide 133 text

For more complex problems, the problem should be broken down

Slide 134

Slide 134 text

Example: Identifying right whales, by Felix Lau. 2nd place in a Kaggle competition. http://felixlaumon.github.io/2015/01/08/kaggle-right-whale.html

Slide 135

Slide 135 text

Identifying right whales, by Felix Lau The first naïve solution – training a classifier to identify individuals – did not work well

Slide 136

Slide 136 text

Region-based saliency map revealed that the network had ‘locked on’ to features in the ocean shape rather than the whales

Slide 137

Slide 137 text

Lau’s solution: Train a localiser to locate the whale in the image

Slide 138

Slide 138 text

Lau’s solution: Train a keypoint finder to locate two keypoints on the whale’s head to identify its orientation

Slide 139

Slide 139 text

Lau’s solution: Train classifier on oriented and cropped whale head images

Slide 140

Slide 140 text

Active learning

Slide 141

Slide 141 text

Training deep neural networks is data hungry Labelled training data can be expensive to acquire or produce

Slide 142

Slide 142 text

Active learning reduces the amount of data required Can therefore reduce cost

Slide 143

Slide 143 text

Assumption 1: classification problem Assumption 2: unlimited or large quantities of un-labelled data available

Slide 144

Slide 144 text

Train a network with the labelled data we have

Slide 145

Slide 145 text

Predict which un-labelled samples are hardest to classify; where ground truths would be most useful

Slide 146

Slide 146 text

Get ground truth labels for those samples (by manual labelling say)

Slide 147

Slide 147 text

Train with enlarged dataset

Slide 148

Slide 148 text

Repeat as necessary; go back to predicting difficulty of un-labelled samples

Slide 149

Slide 149 text

How to determine which un-labelled samples are most worth labelling?

Slide 150

Slide 150 text

[Wang14] discusses a few different approaches, confidence being simple and effective

Slide 151

Slide 151 text

Active learning by confidence

Slide 152

Slide 152 text

Active learning by confidence Predict class probabilities of un-labelled sample

Slide 153

Slide 153 text

Active learning by confidence The maximum probability is that of the predicted class and its value is the confidence

Slide 154

Slide 154 text

Active learning by confidence Choose the samples with the lowest confidence as the next candidates for labelling

Slide 155

Slide 155 text

Note: Predicted probabilities from neural nets are often very close to 0.0 or 1.0; maybe 1e-6 away. Would be nice if they were ‘smoother’

Slide 156

Slide 156 text

Can use softmax with ‘temperature’ to smooth the predictions

Slide 157

Slide 157 text

Softmax: p_i = exp(q_i) / Σ_j exp(q_j)

Slide 158

Slide 158 text

Softmax with temperature t (just divide the logits by t first): p_i = exp(q_i / t) / Σ_j exp(q_j / t)

Slide 159

Slide 159 text

A higher temperature softens the predicted probabilities; I find a value of 3 for t works well
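A NumPy sketch of the softened softmax (t = 3 as suggested above):

```python
import numpy as np

def softmax_with_temperature(logits, t=3.0):
    """Softmax over the last axis, with the logits divided by temperature t."""
    z = logits / t
    z = z - z.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```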

Slide 160

Slide 160 text

Active learning example: MNIST 50k training, 10k validation, 10k test

Slide 161

Slide 161 text

Start with 500 labelled training samples. Each round, of the remaining training samples, choose the 500 with the least confidence and add them to the dataset
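The selection step itself is short; a sketch assuming probs holds the (temperature-softened) predicted class probabilities for the remaining un-labelled samples:

```python
import numpy as np

# probs: shape (n_unlabelled, n_classes)
confidence = probs.max(axis=1)               # confidence = probability of predicted class
query_idx = np.argsort(confidence)[:500]     # the 500 least-confident samples
# ...obtain ground-truth labels for query_idx and add them to the training set
```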

Slide 162

Slide 162 text

[Chart] Active learning on MNIST: prediction error (%) against number of labelled samples (500 to 5000), comparing random-order selection with least-confidence selection

Slide 163

Slide 163 text

MNIST: only needs 5k out of 50k samples to reach (very nearly) the same accuracy

Slide 164

Slide 164 text

MNIST is a special, easy case. SVHN results are less marked; can save maybe a third of the data

Slide 165

Slide 165 text

Saving even 25% of labelled data requirements could result in substantial cost saving though

Slide 166

Slide 166 text

Just for fun

Slide 167

Slide 167 text

Deep Dreams

Slide 168

Slide 168 text

When training a network, we use gradient descent to iteratively modify weights given images and ground truths

Slide 169

Slide 169 text

We can just as easily use gradient descent to modify an image, given the weights

Slide 170

Slide 170 text

Deep Dreams: Take an image to hallucinate from Choose a layer, e.g. ‘pool4’ of VGG-19; choice depends on scale and level of features desired

Slide 171

Slide 171 text

Deep Dreams: compute the gradient of the L-norm of the layer w.r.t. the image; use gradient ascent to increase the L-norm
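A rough sketch of the gradient-ascent loop in Theano/Lasagne; layers is assumed to be the dict of VGG-19 layers (keyed as in the Lasagne model-zoo recipe), and the squared-mean objective and step size are illustrative choices:

```python
import numpy as np
import theano
import theano.tensor as T
import lasagne

# layers is assumed to be the dict of VGG-19 layers, e.g. layers['pool4']
X = T.tensor4('X')
act = lasagne.layers.get_output(layers['pool4'], X, deterministic=True)
objective = (act ** 2).mean()                    # an L2-style norm of the layer
grad = T.grad(objective, X)
step_fn = theano.function([X], [objective, grad])

img = np.random.randn(1, 3, 224, 224).astype(np.float32)
for _ in range(100):
    obj, g = step_fn(img)
    img += 1.0 * g / (np.abs(g).mean() + 1e-8)   # normalised gradient-ascent step
```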

Slide 172

Slide 172 text

Exercise / Demo Deep Dreams

Slide 173

Slide 173 text

Hope you’ve found it helpful!

Slide 174

Slide 174 text

Thank you!

Slide 175

Slide 175 text

References

Slide 176

Slide 176 text

[He15a] He, Zhang, Ren and Sun; Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, arXiv 2015

Slide 177

Slide 177 text

[He15b] He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015).

Slide 178

Slide 178 text

[Hinton12] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov; Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Slide 179

Slide 179 text

[Ioffe15] Ioffe, S.; Szegedy, C. (2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". ICML 2015, arXiv:1502.03167

Slide 180

Slide 180 text

[Jones87] Jones, J.P.; Palmer, L.A. (1987). "An evaluation of the two-dimensional gabor filter model of simple receptive fields in cat striate cortex". J. Neurophysiol 58 (6): 1233–1258

Slide 181

Slide 181 text

[Lin13] Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).

Slide 182

Slide 182 text

[Nesterov83] Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/sqr(k)). Soviet Mathematics Doklady, 27:372–376 (1983).

Slide 183

Slide 183 text

[Sutskever13] Sutskever, Ilya, et al. On the importance of initialization and momentum in deep learning. Proceedings of the 30th international conference on machine learning (ICML-13). 2013.

Slide 184

Slide 184 text

[Simonyan14] K. Simonyan and A. Zisserman; Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014

Slide 185

Slide 185 text

[Wang14] Wang, Dan, and Yi Shang. "A new active labeling method for deep learning." Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014.