Slide 1

Slide 1 text

Deep Learning Tutorial: Advanced Techniques. PyData Amsterdam 2017. G. French – University of East Anglia. Image montages from http://www.image-net.org

Slide 2

Slide 2 text

Focus

Slide 3

Slide 3 text

Image processing, using Theano [1] and Lasagne [2]. [1] http://deeplearning.net/software/theano/ [2] https://github.com/Lasagne/Lasagne

Slide 4

Slide 4 text

What we’ll cover

Slide 5

Slide 5 text

Theano What it is and how it works Review: Multi-layer perceptron The basic model Convolutional networks Neural networks for computer vision

Slide 6

Slide 6 text

Lasagne and VGG-16 Explain Lasagne and use it with a convolutional network trained by the VGG group at Oxford University Transfer learning Re-using pre-trained networks Deep learning tricks of the trade tips to save you some time

Slide 7

Slide 7 text

Tutorial materials

Slide 8

Slide 8 text

Github Repo: https://github.com/Britefury/deep-learning-tutorial-pydata The notebooks are viewable on Github

Slide 9

Slide 9 text

Intro to Theano and Lasagne slides: https://speakerdeck.com/britefury https://speakerdeck.com/britefury/intro-to-theano-and-lasagne-for-deep-learning

Slide 10

Slide 10 text

Amazon AMI (Use GPU machine) AMI ID: ami-5f789e32 AMI Name: PyData London 2016 deep learning adv tutorial - Ubuntu-14.04 Anaconda2-4.0.0 Cuda-7.5 cuDNN-5 Theano-0.8 Lasagne Fuel

Slide 11

Slide 11 text

Theano

Slide 12

Slide 12 text

Neural network software exists on a spectrum. API style: neural network toolkits are higher level – you specify the network in terms of layers; expression compilers are flexible and powerful – you specify the network in terms of mathematical expressions.

Slide 13

Slide 13 text

Neural network software exists on a spectrum. Debugging: with neural network toolkits there is less to debug, so it's easier; with expression compilers it depends on the toolkit (you can end up with confusing errors in compiled code, e.g. Theano).

Slide 14

Slide 14 text

Neural network software exists on a spectrum. Examples include CAFFE, Theano, Tensorflow and Torch, ranging from neural network toolkits to expression compilers.

Slide 15

Slide 15 text

Theano An expression compiler

Slide 16

Slide 16 text

Write numpy style expressions Compiles to either C (CPU) or CUDA (nVidia GPU)
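As a minimal sketch (not taken from the tutorial notebooks), a symbolic expression is built in numpy style and then compiled into a callable function:

import numpy as np
import theano
import theano.tensor as T

# Declare symbolic variables (no values yet)
a = T.matrix('a')
b = T.matrix('b')

# Build an expression in numpy style
c = a * 2.0 + b ** 2

# Compile the expression to C (CPU) or CUDA (GPU) code
f = theano.function([a, b], c)

# Call the compiled function with concrete numpy arrays
print(f(np.ones((2, 2), dtype=theano.config.floatX),
        np.ones((2, 2), dtype=theano.config.floatX)))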

Slide 17

Slide 17 text

Notebook: Theano basics Expressions Modify shared variables Variables and functions Gradient and updates

Slide 18

Slide 18 text

There is much more to Theano For more information: http://deeplearning.net/

Slide 19

Slide 19 text

Review: MLP (multi-layer perceptron)

Slide 20

Slide 20 text

x = input (M-element vector); y = output (N-element vector); W = weights parameter (NxM matrix); b = bias parameter (N-element vector); f = activation function, normally ReLU but can be tanh or sigmoid. y = f(Wx + b)

Slide 21

Slide 21 text

In a nutshell: y = f(Wx + b)

Slide 22

Slide 22 text

Repeat for each layer: input vector x; hidden layer 0 activation y0 = f(W0 x + b0); hidden layer 1 activation y1 = f(W1 y0 + b1); ... ; final layer activation (output) yL = f(WL yL-1 + bL)
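A plain numpy sketch of this stacked forward pass (the layer sizes here are illustrative, not part of the slides):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forward(x, layers):
    # layers is a list of (W, b) pairs; ReLU is applied on the hidden layers
    for W, b in layers[:-1]:
        x = relu(np.dot(W, x) + b)
    W, b = layers[-1]
    return np.dot(W, x) + b   # final layer returns logits; the output non-linearity comes later

# Example: 784 -> 256 -> 256 -> 10
rng = np.random.RandomState(0)
layers = [(rng.randn(256, 784) * 0.01, np.zeros(256)),
          (rng.randn(256, 256) * 0.01, np.zeros(256)),
          (rng.randn(10, 256) * 0.01, np.zeros(10))]
logits = mlp_forward(rng.randn(784), layers)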

Slide 23

Slide 23 text

To train the network: compute the derivative of the cost w.r.t. the parameters (W and b) (More on the cost/loss later)

Slide 24

Slide 24 text

Update parameters: W' = W − γ ∂C/∂W and b' = b − γ ∂C/∂b, where γ = learning rate

Slide 25

Slide 25 text

Theano takes care of the differentiation for you!
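For example, a sketch (not the notebook code) of a softmax classifier whose gradients and updates are generated by Theano; the variable names are illustrative:

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')    # input batch
t = T.ivector('t')   # integer class targets

W = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

p = T.nnet.softmax(T.dot(x, W) + b)
cost = T.nnet.categorical_crossentropy(p, t).mean()

gamma = 0.1   # learning rate
dW, db = T.grad(cost, [W, b])   # Theano differentiates the cost for us
updates = [(W, W - gamma * dW), (b, b - gamma * db)]

train_fn = theano.function([x, t], cost, updates=updates)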

Slide 26

Slide 26 text

MNIST Multi-layer Perceptron

Slide 27

Slide 27 text

(Obligatory) MNIST example: 2 hidden layers, both 256 units. After 300 iterations over the training set: 1.83% validation error. Architecture: input 784 (28x28 images), hidden 256, hidden 256, output 10.

Slide 28

Slide 28 text

MNIST is quite a special case Digits nicely centred within image Scaled to approx. same size

Slide 29

Slide 29 text

The fully connected networks so far have a weakness: No translation invariance; learned features are position dependent

Slide 30

Slide 30 text

For more general imagery: requires a training set large enough to see all features in all possible positions… Requires network with enough units to represent this…

Slide 31

Slide 31 text

Non-linearities and loss functions

Slide 32

Slide 32 text

The final non-linearity and the corresponding loss function is important Their choice depends on the desired output of the network

Slide 33

Slide 33 text

Classification Final non-linearity: softmax Softmax produces the predicted probability vector q from the logits z: q_c = e^(z_c) / Σ_j e^(z_j)

Slide 34

Slide 34 text

Classification Loss: negative log probability (categorical cross-entropy); q = prediction, p = true value: L = − Σ_c p_c ln q_c

Slide 35

Slide 35 text

Let's dig a little deeper and do an experiment.

Slide 36

Slide 36 text

Let's take a neural network: input vector x; hidden layer 0 activation f(W0 x + b0); hidden layer 1 activation f(W1 y0 + b1); ... ; final layer activation (output) f(WL yL-1 + bL)

Slide 37

Slide 37 text

Remove all but the final layer: final layer activation (output) y = f(Wx + b) Final layer

Slide 38

Slide 38 text

Drive its activation function directly (no matrix multiplication): final layer activation (output) y = f(z) We are going to 'simulate' the network up to that point by providing values for the logits z directly

Slide 39

Slide 39 text

Create logit values: keep the logit for class 0 constant at 0.0; vary the logit for class 1 from -5.0 to 5.0 in steps of 1.0, giving the pairs (0.0, -5.0), (0.0, -4.0), ..., (0.0, 5.0)

Slide 40

Slide 40 text

Let's add the predicted probabilities generated by softmax, q_c = e^(z_c) / Σ_j e^(z_j):
(0.0, -5.0) -> (0.9933072, 0.0066929)
(0.0, -4.0) -> (0.9820138, 0.0179862)
(0.0, -3.0) -> (0.9525741, 0.0474259)
(0.0, -2.0) -> (0.8807970, 0.1192029)
(0.0, -1.0) -> (0.7310586, 0.2689414)
(0.0, 0.0) -> (0.5000000, 0.5000000)
(0.0, 1.0) -> (0.2689414, 0.7310586)
(0.0, 2.0) -> (0.1192029, 0.8807971)
(0.0, 3.0) -> (0.0474259, 0.9525741)
(0.0, 4.0) -> (0.0179862, 0.9820138)
(0.0, 5.0) -> (0.0066929, 0.9933072)
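The table above can be reproduced with a few lines of numpy (a sketch of the experiment, not the notebook code):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))   # subtract the max for numerical stability
    return e / e.sum(axis=1, keepdims=True)

# logit for class 0 fixed at 0.0; logit for class 1 varies from -5 to 5
z = np.stack([np.zeros(11), np.linspace(-5.0, 5.0, 11)], axis=1)
q = softmax(z)
nll = -np.log(q[:, 1])   # negative log-likelihood if class 1 is the true class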

Slide 41

Slide 41 text

Plot the predicted probability q1 from softmax and the true probability p1 (assume a value of 1)

Slide 42

Slide 42 text

Add negative log-loss: Note: the loss is high when the class-1 logit z1 is negative, and tends to 0 when it is positive

Slide 43

Slide 43 text

Add gradient of negative log-loss:

Slide 44

Slide 44 text

Learning occurs via gradient descent Note: the gradient w.r.t. z1 is in the range [-1, 0]; it is negative when z1 is negative, and when z1 is positive the gradient is close to 0 (correct answer, not much learning to do)

Slide 45

Slide 45 text

Probability regression Use for predicting single value in [0, 1] range Can use for binary classification Could use to generate mask for an image

Slide 46

Slide 46 text

Probability regression Final non-linearity: sigmoid Sigmoid outputs values in range [0, 1]: q = σ(z) = 1 / (1 + e^(−z))

Slide 47

Slide 47 text

Probability regression Loss: binary cross-entropy; q = prediction, p = true value: L = − Σ_i (p_i ln q_i + (1 − p_i) ln(1 − q_i))
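A quick numpy sketch of these two pieces (illustrative, not from the notebooks):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(q, p):
    # q = predicted probability, p = true value (0 or 1)
    return -(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))

z = np.linspace(-5.0, 5.0, 11)
q = sigmoid(z)
loss = binary_crossentropy(q, 1.0)   # loss when the true value is 1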

Slide 48

Slide 48 text

Experiment Drive sigmoid directly Sigmoid takes a scalar per sample, rather than a vector

Slide 49

Slide 49 text

Probability regression plot: just like classification with softmax & NLL (the maths are similar)

Slide 50

Slide 50 text

In general When network gives correct answer: Gradient of the loss will be near 0 (no more learning)

Slide 51

Slide 51 text

In general When network gives incorrect answer: Gradient of loss will be of magnitude 1 pushing the network to learn

Slide 52

Slide 52 text

A gradient of 1 The gradient of the final non-linearity + loss function having a magnitude of 1 is worth consideration

Slide 53

Slide 53 text

A gradient of 1 They won’t scale up or down the gradient that is back-propagated throughout the earlier parts of the network

Slide 54

Slide 54 text

A gradient of 1 Keeping the gradient magnitudes sane is important (*) Hence the success of weight initialisation schemes and batch normalisation * Particularly for GANs

Slide 55

Slide 55 text

Regression with squared error loss Use for predicting real values NO final non-linearity (linear)

Slide 56

Slide 56 text

Regression with squared error loss Loss: squared error; ŷ = prediction, y = true value: L = Σ (y − ŷ)²

Slide 57

Slide 57 text

Regression with squared error loss plot The gradient can have a large magnitude when ŷ and y differ greatly

Slide 58

Slide 58 text

Regression with Huber loss Uses squared error when difference between prediction and target is small, absolute error otherwise Used in [Girshick15]

Slide 59

Slide 59 text

Regression with Huber loss NO final non-linearity (linear)

Slide 60

Slide 60 text

Regression with Huber loss Loss: Huber loss; ŷ = prediction, y = true value: h(y, ŷ) = ½ (ŷ − y)² if |ŷ − y| ≤ 1, and h(y, ŷ) = |ŷ − y| − ½ otherwise
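A numpy sketch of this loss, with the threshold fixed at 1 as in the formula above:

import numpy as np

def huber(y_true, y_pred):
    d = np.abs(y_pred - y_true)
    # squared error for small differences, (shifted) absolute error otherwise
    return np.where(d <= 1.0, 0.5 * d ** 2, d - 0.5)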

Slide 61

Slide 61 text

Regression with Huber loss plot Keeps the gradient magnitude ≤ 1

Slide 62

Slide 62 text

In summary: Final non-linearity and loss function depend on: What you want your network to generate Effects on gradients can be worth considering

Slide 63

Slide 63 text

Convolutional networks

Slide 64

Slide 64 text

Convolution Slide a convolution kernel over an image Multiply image pixels by kernel pixels and sum
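A naive numpy sketch of that sliding multiply-and-sum (strictly speaking this is cross-correlation, which is what most deep learning libraries compute under the name 'convolution'):

import numpy as np

def conv2d_valid(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # multiply the image patch by the kernel and sum
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out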

Slide 65

Slide 65 text

Convolution Convolutions are often used for feature detection

Slide 66

Slide 66 text

A brief detour…

Slide 67

Slide 67 text

Gabor filters

Slide 68

Slide 68 text

Back on track to… Convolutional networks

Slide 69

Slide 69 text

Recap: FC (fully-connected) layer, y = f(Wx + b): input vector x, weighted connections W, bias b, activation function (non-linearity) f, layer activation y

Slide 70

Slide 70 text

Convolutional layer Each unit only connected to units in its neighbourhood

Slide 71

Slide 71 text

Convolutional layer Weights are shared Red weights have same value As do greens… And yellows

Slide 72

Slide 72 text

The values of the weights form a convolution kernel For practical computer vision, more than one kernel must be used to extract a variety of features

Slide 73

Slide 73 text

Convolutional layer Different weight-kernels: Output is image with multiple channels

Slide 74

Slide 74 text

Still y = f(Wx + b), as convolution can be expressed as multiplication by a weight matrix

Slide 75

Slide 75 text

Note In subsequent layers, each kernel connects to pixels in ALL channels in previous layer

Slide 76

Slide 76 text

Another way of looking at it: A single kernel of an e.g. 5x5 convolutional layer is a bit like…

Slide 77

Slide 77 text

a fully-connected layer with a 5x5 input image, repeated across the whole image; a new 'fully-connected layer' for each filter

Slide 78

Slide 78 text

Down-sampling Max-pooling or Striding

Slide 79

Slide 79 text

Down-sampling: max-pooling 'layer' [Ciresan12] Take the maximum value from each 2 x 2 pooling region (p x p in the general case) Down-samples the image by a factor of p Operates on channels independently

Slide 80

Slide 80 text

Down-sampling: striding Only retain 1 pixel in every p; skip the rest Often fast; striding is built into the convolution operation of many neural network libraries
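A numpy sketch of both operations on a single channel (illustrative; the real libraries provide fast implementations):

import numpy as np

def max_pool_2x2(x):
    h = x.shape[0] // 2 * 2    # crop to an even size
    w = x.shape[1] // 2 * 2
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def stride(x, p=2):
    return x[::p, ::p]         # only retain 1 pixel in every p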

Slide 81

Slide 81 text

Example: A Simplified LeNet [LeCun95] for MNIST digits

Slide 82

Slide 82 text

Simplified LeNet for MNIST digits: input 1x28x28 -> conv (20 5x5 kernels) -> 20x24x24 -> maxpool 2x2 -> 20x12x12 -> conv (50 5x5 kernels) -> 50x8x8 -> maxpool 2x2 -> 50x4x4 -> (flatten and) fully connected 256 -> fully connected 10 (output)

Slide 83

Slide 83 text

After 300 iterations over the training set: 99.21% validation accuracy
Model / error:
FC64: 2.85%
FC256--FC256: 1.83%
20C5--MP2--50C5--MP2--FC256: 0.79%

Slide 84

Slide 84 text

Lasagne and VGG-16

Slide 85

Slide 85 text

Lasagne is a neural network library built on Theano

Slide 86

Slide 86 text

Provides API for: constructing layers of a network getting Theano expressions representing output, loss, etc.
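A sketch of what this looks like for the simplified LeNet shown earlier (hyperparameters and variable names are illustrative):

import theano
import theano.tensor as T
import lasagne
from lasagne.layers import InputLayer, Conv2DLayer, MaxPool2DLayer, DenseLayer

x = T.tensor4('x')
t = T.ivector('t')

# Construct the layers
l = InputLayer((None, 1, 28, 28), input_var=x)
l = Conv2DLayer(l, num_filters=20, filter_size=(5, 5))
l = MaxPool2DLayer(l, pool_size=(2, 2))
l = Conv2DLayer(l, num_filters=50, filter_size=(5, 5))
l = MaxPool2DLayer(l, pool_size=(2, 2))
l = DenseLayer(l, num_units=256)
l = DenseLayer(l, num_units=10, nonlinearity=lasagne.nonlinearities.softmax)

# Get Theano expressions for the output and the loss
pred = lasagne.layers.get_output(l)
loss = lasagne.objectives.categorical_crossentropy(pred, t).mean()

# Get the parameters, build updates and compile a training function
params = lasagne.layers.get_all_params(l, trainable=True)
updates = lasagne.updates.adam(loss, params, learning_rate=1e-3)
train_fn = theano.function([x, t], loss, updates=updates)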

Slide 87

Slide 87 text

Lasagne is quite a thin layer on top of Theano, so understanding Theano is helpful On the plus side, implementing custom layers, loss functions, etc. is quite doable.

Slide 88

Slide 88 text

An aside note about other libraries Definitely check out Keras http://keras.io Works on both Theano and Tensorflow

Slide 89

Slide 89 text

An aside note about other libraries Keras API can be simpler to get to grips with than Lasagne Lots of cool examples

Slide 90

Slide 90 text

Notebook: Lasagne basics Build network: modified LeNet for MNIST Train the network

Slide 91

Slide 91 text

Using a pre-trained VGG-16 Conv-net

Slide 92

Slide 92 text

Use VGG-16; the 16-layer model 1000-class image classifier, trained on ImageNet

Slide 93

Slide 93 text

VGG models are simple but effective Consist of: 3x3 convolutions 2x2 max pooling fully connected

Slide 94

Slide 94 text

Notation: 64C3 convolutional layer with 64 3x3 filters MP2 max-pooling, 2x2 # Layer Input: 3 x 224 x 224 (RGB image, zero-mean) 1 64C3 2 64C3 MP2 3 128C3 4 128C3 MP2 5 256C3 6 256C3 7 256C3 MP2

Slide 95

Slide 95 text

Notation: FC4096 fully-connected layer 4096 channels drop 50% With 50% drop-out during training # Layer 8 512C3 9 512C3 10 512C3 MP2 11 512C3 12 512C3 13 512C3 MP2 14 FC4096 (drop 50%) 15 FC4096 (drop 50%) 16 FC1000 soft-max

Slide 96

Slide 96 text

# Layer Input: 3 x 224 x 224 (RGB image, zero-mean) 1 64C3 2 64C3 MP2 3 128C3 4 128C3 MP2 5 256C3 6 256C3 7 256C3 MP2 # Layer 8 512C3 9 512C3 10 512C3 MP2 11 512C3 12 512C3 13 512C3 MP2 14 FC4096 (drop 50%) 15 FC4096 (drop 50%) 16 FC1000 soft-max

Slide 97

Slide 97 text

These kinds of architectures tend to work well: Small convolution kernels (3x3) Interspersed with max-pooling

Slide 98

Slide 98 text

Good first start when choosing a network architecture

Slide 99

Slide 99 text

Exercise / Demo Classifying an image with VGG-16

Slide 100

Slide 100 text

What about using VGG-16 to find a peacock in a photo?

Slide 101

Slide 101 text

Can extract square patches from image in sliding window fashion and classify:

Slide 102

Slide 102 text

Exercise / Demo Finding a peacock with VGG-16: part 1

Slide 103

Slide 103 text

Inefficient

Slide 104

Slide 104 text

Using convolutions to your advantage

Slide 105

Slide 105 text

Adjacent windows share majority of their pixels

Slide 106

Slide 106 text

For lower levels, this involves repeating many of the same computations, getting the same result

Slide 107

Slide 107 text

If we could apply the first convolutional layer across the whole image rather than many 224x224 blocks we could re-use those computations…

Slide 108

Slide 108 text

Then we could also do this for the rest of the convolutional layers further down…

Slide 109

Slide 109 text

In fact we can use the whole network in a convolutional fashion; we just need to convert the fully-connected layers to convolutional layers.
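One way this conversion can be sketched in Lasagne (an illustration, not the notebook code): the pre-trained FC layer's weight matrix is reshaped into a bank of convolution kernels. For VGG-16's first FC layer the dense input is 512 x 7 x 7, so its (25088, 4096) matrix becomes 4096 kernels of shape 512 x 7 x 7; fc_W, fc_b and prev_layer are hypothetical names, and flip_filters=False keeps the operation identical to the dense layer's dot product.

from lasagne.layers import Conv2DLayer

# fc_W: (512*7*7, 4096) weight matrix and fc_b: (4096,) bias taken from
# the pre-trained FC4096 layer (hypothetical variable names)
conv_W = fc_W.T.reshape((4096, 512, 7, 7))
l = Conv2DLayer(prev_layer, num_filters=4096, filter_size=(7, 7),
                W=conv_W, b=fc_b, flip_filters=False)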

Slide 110

Slide 110 text

Exercise / Demo Finding a peacock with VGG-16: part 2

Slide 111

Slide 111 text

This is a trick used when doing image segmentation, when we want to determine which parts of an image belong to which class At both training time and prediction time

Slide 112

Slide 112 text

Transfer learning (network re-use)

Slide 113

Slide 113 text

Training a neural network is notoriously data-hungry Preparing training data with ground truths is expensive and time consuming

Slide 114

Slide 114 text

What if we don’t have enough training data to get good results?

Slide 115

Slide 115 text

The ImageNet dataset is huge; millions of images with ground truths What if we could somehow use it to help us with a different task?

Slide 116

Slide 116 text

Good news: we can!

Slide 117

Slide 117 text

Transfer learning Re-use part (often most) of a pre-trained network for a new task

Slide 118

Slide 118 text

Example; can re-use part of VGG-16 net for: Classifying images with classes that weren’t part of the original ImageNet dataset

Slide 119

Slide 119 text

Example; can re-use part of VGG-16 net for: Localisation (find location of object in image) Segmentation (find exact boundary around object in image)

Slide 120

Slide 120 text

Transfer learning: how to Take existing network such as VGG-16

Slide 121

Slide 121 text

# Layer Input: 3 x 224 x 224 (RGB image, zero-mean) 1 64C3 2 64C3 MP2 3 128C3 4 128C3 MP2 5 256C3 6 256C3 7 256C3 MP2 # Layer 8 512C3 9 512C3 10 512C3 MP2 11 512C3 12 512C3 13 512C3 MP2 14 FC4096 (drop 50%) 15 FC4096 (drop 50%) 16 FC1000 soft-max

Slide 122

Slide 122 text

Remove last layers e.g. the fully-connected ones (just 14, 15, 16; those in the left box are hidden here for brevity!) # Layer 8 512C3 9 512C3 10 512C3 MP2 11 512C3 12 512C3 13 512C3 MP2

Slide 123

Slide 123 text

Build new randomly-initialised layers to replace them (the number of layers created and their size are only for illustration here) # Layer 8 512C3 9 512C3 10 512C3 MP2 11 512C3 12 512C3 13 512C3 MP2 FC1024 (drop 50%) FC21 soft-max

Slide 124

Slide 124 text

Training when using transfer learning Two approaches: Train new layers only Fine-tuning

Slide 125

Slide 125 text

Let's start with some code to get the pre-trained and new layer parameters separately

Slide 126

Slide 126 text

# Get all parameters in network; `get_all_params` works backward
# from the final layer through the network
all_params = lasagne.layers.get_all_params(final_layer, trainable=True)

# Get parameters from pre-trained layers; give the top pre-trained layer
pretrained_params = lasagne.layers.get_all_params(
    vgg16.network['pool5'], trainable=True)

new_params = [p for p in all_params if p not in pretrained_params]

Slide 127

Slide 127 text

Transfer learning: train new layers Train the network with your training data, only learning parameters for the new layers

Slide 128

Slide 128 text

# Update new layers only
new_updates = lasagne.updates.adam(
    training_loss, new_params, learning_rate=lr)

Slide 129

Slide 129 text

Transfer learning: fine-tuning Learn parameters for new layers Fine-tuning: learn parameters for pre-trained layers using a lower learning rate; normally 1/10th

Slide 130

Slide 130 text

# Update new layers with standard learning rate
new_updates = lasagne.updates.adam(
    training_loss, new_params, learning_rate=lr)

# Update pre-trained layers with a lower learning rate
pretrained_updates = lasagne.updates.adam(
    training_loss, pretrained_params, learning_rate=lr * 0.1)

# Combine updates
updates = new_updates.copy()
updates.update(pretrained_updates)

Slide 131

Slide 131 text

Result Nice shiny network with good performance that was trained with much less of our training data

Slide 132

Slide 132 text

Exercise / Demo Solving Dogs vs Cats with Transfer Learning

Slide 133

Slide 133 text

Our work Quantifying fish discards on-board fishing trawlers We use VGG-16 based image segmentation

Slide 134

Slide 134 text

No content

Slide 135

Slide 135 text

Foreground segmentation

Slide 136

Slide 136 text

No content

Slide 137

Slide 137 text

Contour detection

Slide 138

Slide 138 text

No content

Slide 139

Slide 139 text

Deep learning tricks of the trade

Slide 140

Slide 140 text

Early stopping

Slide 141

Slide 141 text

Early stopping Can get better accuracy Allows you to terminate training earlier, saving time

Slide 142

Slide 142 text

Split data into: training set validation set test set

Slide 143

Slide 143 text

During training, regularly – e.g. after each epoch – evaluate network performance on validation set

Slide 144

Slide 144 text

Validation score does not decrease monotonically, so the final score is not necessarily the best Dogs vs cats, with transfer learning and data augmentation

Slide 145

Slide 145 text

Each time validation score improves: Save network state Restore saved state once training is finished

Slide 146

Slide 146 text

Sometimes validation score can start getting gradually worse as the network overfits Once you reach this point, there’s no point training any more

Slide 147

Slide 147 text

Fixed patience Stop training if no improvement detected after a fixed number of epochs e.g. 50

Slide 148

Slide 148 text

Geometric patience Set the patience (index of epoch that you are prepared to wait until) to be the number of epochs elapsed so far times some multiple e.g. 2
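A rough sketch of an early-stopping loop with geometric patience (train_epoch, compute_val_loss, max_epochs and final_layer are hypothetical stand-ins for your own code):

import numpy as np
import lasagne

best_val = np.inf
best_state = None
patience = 10                         # initial patience, in epochs
for epoch in range(max_epochs):
    train_epoch()                     # your pass over the training set
    val_loss = compute_val_loss()     # your evaluation on the validation set
    if val_loss < best_val:
        best_val = val_loss
        best_state = lasagne.layers.get_all_param_values(final_layer)   # save network state
        patience = max(patience, (epoch + 1) * 2)   # geometric patience: wait up to 2x as long
    if epoch + 1 >= patience:
        break                         # no improvement for long enough; stop training
lasagne.layers.set_all_param_values(final_layer, best_state)             # restore the best state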

Slide 149

Slide 149 text

Geometric patience An example can be seen in the Logistic Regression and MLP Theano tutorials: http://deeplearning.net/tutorial/mlp.html

Slide 150

Slide 150 text

Choosing: mini-batch size

Slide 151

Slide 151 text

Small mini-batches Maybe around ~8 Good but slower training Small mini-batches result in regularization (due to noise), reaching lower error rates in the end [Goodfellow16]. When using very small mini-batches, you need to compensate with a lower learning rate and more epochs. Slow due to low parallelism Does not use all cores of GPU Low memory usage Fewer neuron activations kept in RAM

Slide 152

Slide 152 text

Large mini-batches 1000s Ineffective training Won’t reach the same error rate as with smaller batches and may not learn at all. Can be fast due to high parallelism Uses GPU parallelism (there are limits; gains only achievable if there are unused CUDA cores) High memory usage Lots of neuron activations kept around; can run out of RAM on large networks

Slide 153

Slide 153 text

Happy medium (where you want to be) Maybe around 64-256, lots of experiments use ~100 Effective training Learns reasonably quickly – in terms of improvement per epoch – and reaches acceptable error rate or loss Medium performance Acceptable in many cases Medium memory usage Fine for modest sized networks

Slide 154

Slide 154 text

~100 seems to work well; gets good results

Slide 155

Slide 155 text

Increasing mini-batch size will improve performance up to the point where all GPU units are in use Increasing it further will not improve performance; it will reduce accuracy

Slide 156

Slide 156 text

Caveat When working in a convolutional fashion – like the example of using VGG-net to find the peacock – or when doing image segmentation

Slide 157

Slide 157 text

In such cases, pushing large patches of an image through as a single batch along with a correspondingly large output patch re-uses data due to convolutions and results in substantial savings

Slide 158

Slide 158 text

My experience: Use patches that are as large as possible Although it’s a tricky balance with accuracy of the final result

Slide 159

Slide 159 text

Batch normalization

Slide 160

Slide 160 text

Batch normalization [Ioffe15] is recommended in most cases Speeds up training Loss and error drop faster per-epoch

Slide 161

Slide 161 text

Although epochs take longer (around 2x in my experience) Can (ultimately) reach lower error rates Lets you build deeper networks

Slide 162

Slide 162 text

Standardise activations (zero-mean, unit variance) per-channel between network layers Solves problems caused by exponential growth or shrinkage of layer activations

Slide 163

Slide 163 text

Assume that a layer – the grey square in the slide's figure – produces activations whose std-dev is twice that of its input: σ_in = 1, σ_out = 2 σ_in = 2

Slide 164

Slide 164 text

When N such layers are stacked together: σ_in = 1, σ_out = 2^N

Slide 165

Slide 165 text

The magnitude of activations and therefore gradients either explode or vanish (if the layers reduce the magnitude of activations rather than magnify them)

Slide 166

Slide 166 text

Can be partially addressed with careful weight initialization [He15a]. Batch normalization between layers keeps things sane; can train networks with hundreds of layers [He15b].

Slide 167

Slide 167 text

After convolutional, fully-connected or network-in-network* layers, before the non-linearity (*) kind of like a 1x1 convolutional layer [Lin13]

Slide 168

Slide 168 text

Lasagne batch normalization inserts itself into a layer before the non-linearity, so it's nice and easy to use: l = lasagne.layers.batch_norm(l)

Slide 169

Slide 169 text

Data standardisation

Slide 170

Slide 170 text

Standardise your data Ensure zero-mean and unit standard deviation

Slide 171

Slide 171 text

Standardise input data In case of regression, standardise output data too (don’t forget to invert the standardisation of network predictions!)

Slide 172

Slide 172 text

Autoencoder; image reconstruction (regression), PCA whitening

Slide 173

Slide 173 text

Autoencoder; image reconstruction (regression), no standardisation

Slide 174

Slide 174 text

Still a good idea to do this, even when using batch normalization

Slide 175

Slide 175 text

Autoencoder; edge map reconstruction (regression), standardisation

Slide 176

Slide 176 text

Autoencoder; edge map reconstruction (regression), no standardisation

Slide 177

Slide 177 text

Standardisation Extract samples (pixels in the case of images) into an array Compute distribution and standardise

Slide 178

Slide 178 text

Either: zero the mean and scale the std-dev to 1, per channel (RGB for images): x' = (x − μ) / σ
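As a numpy sketch, where pixels is a hypothetical (N, 3) array of RGB samples extracted from the training set:

import numpy as np

mean = pixels.mean(axis=0)            # per-channel mean
std = pixels.std(axis=0)              # per-channel std-dev
pixels_std = (pixels - mean) / std    # zero mean, unit variance per channel

# Apply the SAME mean and std-dev to validation and test data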

Slide 179

Slide 179 text

CIFAR10 RGB distribution

Slide 180

Slide 180 text

CIFAR10 RGB – standardised

Slide 181

Slide 181 text

Or better still: Use PCA whitening (retain all channels – we don’t want to reduce dimensionality)

Slide 182

Slide 182 text

CIFAR10 RGB – with principal components

Slide 183

Slide 183 text

CIFAR10 RGB - principal components aligned with standard basis

Slide 184

Slide 184 text

CIFAR10 RGB – PCA whitened

Slide 185

Slide 185 text

PCA whitening From Scikit-learn use PCA or IncrementalPCA
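For example, a sketch using scikit-learn, with pixels again a hypothetical (N, 3) array of training-set RGB samples:

from sklearn.decomposition import PCA

pca = PCA(whiten=True)               # no n_components given, so all 3 channels are kept
pca.fit(pixels)                      # fit on training-set pixels only
pixels_white = pca.transform(pixels)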

Slide 186

Slide 186 text

ZCA whitening is very similar Notice that PCA whitening rotated the RGB distribution ZCA whitening doesn’t; it only scales along the axes found by PCA

Slide 187

Slide 187 text

Could batch normalisation make standardisation unnecessary?

Slide 188

Slide 188 text

The previous fish-based examples all used batch normalisation and still benefited from data standardisation, so no.

Slide 189

Slide 189 text

Data augmentation

Slide 190

Slide 190 text

Reduce over-fitting by artificially enlarging training set Modify existing training samples to make new ones

Slide 191

Slide 191 text

For images: Transformations: move, scale, rotate, reflect, etc. Consider the domain: horizontally flipping characters for character recognition would be a bad idea

Slide 192

Slide 192 text

For images: Lighting: Compute principal components of RGB pixel values Add normally distributed random multiples of 10% of the std-dev in each principal component
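A sketch following the description above (and [Krizhevsky12]); pixels is a hypothetical (N, 3) array of training-set RGB values, and the exact scaling constant varies between implementations:

import numpy as np

cov = np.cov(pixels, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # principal components of the RGB values
stddevs = np.sqrt(eigvals)                 # std-dev along each component

def augment_lighting(image, rng):
    # image: (H, W, 3); add a random multiple of 10% of the std-dev along each component
    alpha = rng.normal(0.0, 1.0, size=3)
    offset = eigvecs.dot(alpha * 0.1 * stddevs)
    return image + offset[np.newaxis, np.newaxis, :]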

Slide 193

Slide 193 text

See [Krizhevsky12]; their paper discusses these techniques, which have since been used by many others

Slide 194

Slide 194 text

Exercise / Demo Solving Dogs vs Cats with Transfer Learning and Data Augmentation

Slide 195

Slide 195 text

When training goes wrong and what to look for

Slide 196

Slide 196 text

Loss becomes NaN

Slide 197

Slide 197 text

Classification error rate equivalent to a random guess (it's not learning)

Slide 198

Slide 198 text

Learns to predict constant value; optimises constant value for best loss A constant value is a local minimum that the network won’t get out of (neural networks ‘cheat’ like crazy!)

Slide 199

Slide 199 text

Debugging your network

Slide 200

Slide 200 text

Neural networks (most) often DON’T learn what you want or expect them to

Slide 201

Slide 201 text

Local minima will be the bane of your existence

Slide 202

Slide 202 text

So, what has your network learnt? This is often a good question.

Slide 203

Slide 203 text

Saliency maps Determine which parts of an image the network is using to make its prediction Tells you what the network is ‘looking at’

Slide 204

Slide 204 text

Two approaches

Slide 205

Slide 205 text

1. Region-level saliency Blank out different regions of the image and compute the difference in prediction
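A sketch of this idea; predict_fn is a hypothetical compiled prediction function that returns class probabilities for a batch of images:

import numpy as np

def region_saliency(image, predict_fn, class_index, block=16):
    # image: (channels, height, width), already standardised for the network
    base = predict_fn(image[np.newaxis])[0, class_index]
    h, w = image.shape[1:]
    sal = np.zeros((h // block, w // block))
    for by in range(sal.shape[0]):
        for bx in range(sal.shape[1]):
            blanked = image.copy()
            blanked[:, by * block:(by + 1) * block, bx * block:(bx + 1) * block] = 0.0
            # how much does blanking this region reduce the predicted probability?
            sal[by, bx] = base - predict_fn(blanked[np.newaxis])[0, class_index]
    return sal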

Slide 206

Slide 206 text

Exercise / Demo Block-level image saliency

Slide 207

Slide 207 text

2. Pixel-level saliency Compute the gradient of the prediction of a specific class w.r.t. the image pixels
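In Theano/Lasagne terms this is a single T.grad call (a sketch; final_layer is a hypothetical name for the network's output layer):

import theano
import theano.tensor as T
import lasagne

x = T.tensor4('x')     # input image batch
k = T.iscalar('k')     # index of the class of interest

prob = lasagne.layers.get_output(final_layer, x, deterministic=True)
saliency = T.grad(prob[0, k], wrt=x)   # gradient of the class prediction w.r.t. the image pixels

saliency_fn = theano.function([x, k], saliency)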

Slide 208

Slide 208 text

Exercise / Demo Pixel-level image saliency

Slide 209

Slide 209 text

Designing a computer vision pipeline

Slide 210

Slide 210 text

Simple problems may be solved with just a neural network

Slide 211

Slide 211 text

Not sufficient for more complex problems

Slide 212

Slide 212 text

Theoretically possible to use a single network, with enough training data (where enough is an impractical amount)

Slide 213

Slide 213 text

For more complex problems, the problem should be broken down

Slide 214

Slide 214 text

Example Identifying right whales, by Felix Lau 2nd place in Kaggle competition http://felixlaumon.github.io/2015/01/08/kaggle-right-whale.html

Slide 215

Slide 215 text

Identifying right whales, by Felix Lau The first naïve solution – training a classifier to identify individuals – did not work well

Slide 216

Slide 216 text

Region-based saliency map revealed that the network had ‘locked on’ to features in the ocean shape rather than the whales

Slide 217

Slide 217 text

Lau’s solution: Train a localiser to locate the whale in the image

Slide 218

Slide 218 text

Lau’s solution: Train a keypoint finder to locate two keypoints on the whale’s head to identify its orientation

Slide 219

Slide 219 text

Lau’s solution: Train classifier on oriented and cropped whale head images

Slide 220

Slide 220 text

Just for fun

Slide 221

Slide 221 text

Deep Dreams

Slide 222

Slide 222 text

When training a network, we use gradient descent to iteratively modify weights given images and ground truths

Slide 223

Slide 223 text

We just as easily use gradient descent to modify an image given weights

Slide 224

Slide 224 text

Deep Dreams: Take an image to hallucinate from Choose a layer, e.g. ‘pool4’ of VGG-19; choice depends on scale and level of features desired

Slide 225

Slide 225 text

Deep Dreams: Compute the gradient of the L2 norm of the chosen layer's activations w.r.t. the image Use gradient ascent to increase that norm
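A rough sketch of the gradient-ascent loop (dream_layer and start_image are hypothetical names; the step size and gradient normalisation are illustrative choices):

import numpy as np
import theano
import theano.tensor as T
import lasagne

x = T.tensor4('x')
feats = lasagne.layers.get_output(dream_layer, x, deterministic=True)
objective = T.sqr(feats).sum()        # squared L2 norm of the layer's activations
grad = T.grad(objective, wrt=x)
step_fn = theano.function([x], grad)

img = start_image.copy()              # (1, 3, H, W), standardised like the VGG input
for i in range(100):
    g = step_fn(img)
    img += 1.0 * g / (np.abs(g).mean() + 1e-8)   # normalised gradient ascent step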

Slide 226

Slide 226 text

Exercise / Demo Deep Dreams

Slide 227

Slide 227 text

For other cool examples Look at the Keras library, they have loads http://keras.io

Slide 228

Slide 228 text

Some cool work in the field that might be of interest

Slide 229

Slide 229 text

Visualizing and understanding convolutional networks [Zeiler14] Visualisations of responses of layers to images

Slide 230

Slide 230 text

Visualizing and understanding convolutional networks [Zeiler14] Image taken from [Zeiler14]

Slide 231

Slide 231 text

Visualizing and understanding convolutional networks [Zeiler14] Image taken from [Zeiler14]

Slide 232

Slide 232 text

Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space [Nguyen17] Image taken from http://www.evolvingai.org/ppgn

Slide 233

Slide 233 text

Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen15] Generate images that are unrecognizable to human eyes but are recognized by the network

Slide 234

Slide 234 text

Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen15] Image taken from [Nguyen15]

Slide 235

Slide 235 text

Learning to generate chairs with convolutional neural networks [Dosovitskiy15] The network runs in reverse: orientation, design, colour, etc. parameters as input; rendered images as output, with rendered images used as training targets

Slide 236

Slide 236 text

Learning to generate chairs with convolutional neural networks [Dosovitskiy15] Image taken from [Dosovitskiy15]

Slide 237

Slide 237 text

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [Radford15] Train two networks; one given random parameters to generate an image, another to discriminate between a generated image and one from the training set

Slide 238

Slide 238 text

Generative Adversarial Nets [Radford15] Images of bedrooms generated using neural net Image taken from [Radford15]

Slide 239

Slide 239 text

Generative Adversarial Nets [Radford15] Image taken from [Radford15]

Slide 240

Slide 240 text

There has been a lot of work on GANs lately; mainly focused on improving their output quality

Slide 241

Slide 241 text

BEGAN: Boundary Equilibrium Generative Adversarial Networks [Berthelot17] Image taken from [Berthelot17]

Slide 242

Slide 242 text

A Neural Algorithm of Artistic Style [Gatys15] Take an OxfordNet model [Simonyan14] and extract texture features from one of the convolutional layers, given a target style / painting as input Use gradient descent to iterate photo – not weights – so that its texture features match those of the target image.

Slide 243

Slide 243 text

A Neural Algorithm of Artistic Style [Gatys15] Image taken from [Gatys15]

Slide 244

Slide 244 text

Much work on style transfer has focused on either improving quality or performance The original algorithm used gradient descent, like Deep Dream, so it's quite slow

Slide 245

Slide 245 text

Deep Photo Style Transfer [Luan17] (also iterative, but just awesome) Image taken from https://github.com/luanfujun/deep-photo-styletransfer

Slide 246

Slide 246 text

Hope you’ve found it helpful!

Slide 247

Slide 247 text

Thank you!

Slide 248

Slide 248 text

References

Slide 249

Slide 249 text

[Berthelot17] Berthelot D., Schumm T., Metz L.; “BEGAN: Boundary Equilibrium Generative Adversarial Networks”, arXiv 1703.10717 (2017).

Slide 250

Slide 250 text

[Girshick15] Girshick, Ross; “Fast R- CNN”, Proceedings of the IEEE International Conference on Computer Vision, 2015

Slide 251

Slide 251 text

[He15a] He, Zhang, Ren and Sun; “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, arXiv 2015

Slide 252

Slide 252 text

[He15b] He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015).

Slide 253

Slide 253 text

[Hinton12] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov; “Improving neural networks by preventing co-adaptation of feature detectors.” arXiv preprint arXiv:1207.0580, 2012.

Slide 254

Slide 254 text

[Ioffe15] Ioffe, S.; Szegedy C.. (2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". ICML 2015, arXiv:1502.03167

Slide 255

Slide 255 text

[Jones87] Jones, J.P.; Palmer, L.A. (1987). "An evaluation of the two-dimensional gabor filter model of simple receptive fields in cat striate cortex". J. Neurophysiol 58 (6): 1233–1258

Slide 256

Slide 256 text

[Lin13] Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).

Slide 257

Slide 257 text

[Luan17] Luan F., Paris S., Shechtman E., Bala K. “Deep Photo Style Transfer" arXiv:1703.07511 (2017).

Slide 258

Slide 258 text

[Krizhevsky12] Krizhevsky, A., Sutskever, I. and Hinton, G. E. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS 2012.

Slide 259

Slide 259 text

[Nesterov83] Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/sqr(k)). Soviet Mathematics Doklady, 27:372–376 (1983).

Slide 260

Slide 260 text

[Nguyen17] Nguyen A., Yosinski J., Bengio Y., Dosovitskiy A., Clune J. “Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space”. CVPR 2017.

Slide 261

Slide 261 text

[Sutskever13] Sutskever, Ilya, et al. On the importance of initialization and momentum in deep learning. Proceedings of the 30th international conference on machine learning (ICML-13). 2013.

Slide 262

Slide 262 text

[Simonyan14] K. Simonyan and Zisserman; “Very deep convolutional networks for large-scale image recognition”, arXiv:1409.1556, 2014

Slide 263

Slide 263 text

[Wang14] Wang, Dan, and Yi Shang. "A new active labeling method for deep learning."Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014.