Slide 1

Slide 1 text

Deep Learning: An Introductory Tutorial. G. French, King's College London & University of East Anglia. Image montages from http://www.image-net.org

Slide 2

Slide 2 text

ImageNet

Slide 3

Slide 3 text

Image classification dataset

Slide 4

Slide 4 text

~1,000,000 images ~1,000 classes Ground truths prepared manually through Amazon Mechanical Turk

Slide 5

Slide 5 text

ImageNet Top-5 challenge: you score if the ground truth class is one of your top 5 predictions
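To make the scoring rule concrete, here is a minimal numpy sketch (the scores array and class index are made up for illustration):

import numpy as np

def top5_correct(class_scores, true_class):
    # Indices of the 5 highest-scoring classes for one image
    top5 = np.argsort(class_scores)[::-1][:5]
    return true_class in top5

scores = np.random.rand(1000)               # e.g. scores over 1000 ImageNet classes
print(top5_correct(scores, true_class=42))  # True if class 42 is in the top 5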

Slide 6

Slide 6 text

ImageNet in 2012 Best approaches used hand-crafted features (SIFT, HOGs, Fisher vectors, etc) + classifier Top-5 error rate: ~25%

Slide 7

Slide 7 text

Then the game changed.

Slide 8

Slide 8 text

Krizhevsky, Sutskever and Hinton; ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky12] Top-5 error rate of ~15%

Slide 9

Slide 9 text

In the last few years, more modern networks have achieved better results still [Simonyan14, He15] Top-5 error rates of ~5-7%

Slide 10

Slide 10 text

I hope this talk will give you an idea of how!

Slide 11

Slide 11 text

What is a neural network?

Slide 12

Slide 12 text

Multiple layers Data propagates through layers Transformed by each layer

Slide 13

Slide 13 text

Neural network Input layer Hidden layer 0 Hidden layer 1 Output layer ⋯ Inputs Outputs ⋯

Slide 14

Slide 14 text

Single layer of a neural network, y = f(Wx + b): input vector x, weighted connections W, bias b, activation function / non-linearity f, layer activation y

Slide 15

Slide 15 text

x = input (M-element vector), y = output (N-element vector), W = network weights (NxM matrix), b = bias (N-element vector), f = activation function: tanh / sigmoid / ReLU. y = f(Wx + b)

Slide 16

Slide 16 text

Multiple layers: input vector x → hidden layer 0 activation y_0 = f(W_0 x + b_0) → hidden layer 1 activation y_1 = f(W_1 y_0 + b_1) → ⋯ → final layer activation (output) y_L = f(W_L y_{L-1} + b_L)

Slide 17

Slide 17 text

In a nutshell: y = f(Wx + b)

Slide 18

Slide 18 text

Repeat for each layer: y_0 = f(W_0 x + b_0), y_1 = f(W_1 y_0 + b_1), ⋯, y_L = f(W_L y_{L-1} + b_L)
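As a rough numpy sketch of these stacked layer equations (random placeholder weights, ReLU as the activation f; a real classifier would use softmax on the final layer, as on the next slide):

import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def forward(x, layers):
    # layers is a list of (W, b) pairs; apply y = f(Wx + b) repeatedly
    y = x
    for W, b in layers:
        y = relu(W @ y + b)
    return y

rng = np.random.RandomState(0)
layers = [(rng.randn(64, 784) * 0.01, np.zeros(64)),   # hidden layer 0
          (rng.randn(10, 64) * 0.01, np.zeros(10))]    # output layer
print(forward(rng.randn(784), layers).shape)           # (10,)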

Slide 19

Slide 19 text

As a classifier: image pixels form the input vector → hidden layer 0 activation → ⋯ → final layer activation with softmax (output) → class probabilities, e.g. 0.25, 0.5, 0.1, 0.15

Slide 20

Slide 20 text

Training a neural network

Slide 21

Slide 21 text

Learn values for the parameters W and b (for each layer). Use back-propagation

Slide 22

Slide 22 text

Initialise weights randomly (more on this later) Initialise biases to 0

Slide 23

Slide 23 text

Iteratively modify parameters with gradient descent

Slide 24

Slide 24 text

Take an example from the training set and evaluate the network given the training input: y = f(x)

Slide 25

Slide 25 text

The cost (sometimes called loss) is a measure of the difference between the network output and the ground truth output

Slide 26

Slide 26 text

Compute the derivative of the cost w.r.t. the params (W and b)

Slide 27

Slide 27 text

Update parameters: W_0' = W_0 − γ ∂C/∂W_0, b_0' = b_0 − γ ∂C/∂b_0 (and similarly for the other layers), where C is the cost and γ = learning rate

Slide 28

Slide 28 text

In practice this is done on a mini-batch of examples (e.g. 128) in parallel per pass: compute the cost for each example, then average; compute the derivative of the average cost w.r.t. the params.
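A rough numpy sketch of one mini-batch update for a single linear layer with a squared-error cost (the gradients are worked out by hand here purely for illustration; real toolkits obtain them by back-propagation / automatic differentiation):

import numpy as np

def sgd_step(W, b, x_batch, t_batch, lr=0.1):
    # Forward pass for the whole mini-batch: rows of x_batch are examples
    y = x_batch @ W.T + b
    # Gradients of the average (half) squared-error cost w.r.t. W and b
    err = y - t_batch
    dW = err.T @ x_batch / len(x_batch)
    db = err.mean(axis=0)
    # Gradient descent: param' = param - learning_rate * d(cost)/d(param)
    return W - lr * dW, b - lr * db

rng = np.random.RandomState(0)
W, b = rng.randn(10, 784) * 0.01, np.zeros(10)
W, b = sgd_step(W, b, rng.randn(128, 784), rng.randn(128, 10))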

Slide 29

Slide 29 text

Using mini-batches works well when using a GPU since computations can be parallelised

Slide 30

Slide 30 text

Cost function

Slide 31

Slide 31 text

Regression Final layer: no activation function / identity. Cost: Sum of squared differences

Slide 32

Slide 32 text

Classification Just like logistic regression

Slide 33

Slide 33 text

Final layer: softmax as activation function; outputs a vector of class probabilities. Cost: negative log-likelihood / categorical cross-entropy
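A small numpy sketch of the softmax output and the categorical cross-entropy / negative log-likelihood cost (made-up numbers):

import numpy as np

def softmax(z):
    # Subtract the per-row max for numerical stability; rows are examples
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def nll(probs, true_classes):
    # Mean negative log-probability assigned to the correct class
    return -np.log(probs[np.arange(len(true_classes)), true_classes]).mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
p = softmax(logits)
print(p, nll(p, np.array([0, 2])))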

Slide 34

Slide 34 text

Fully connected neural networks

Slide 35

Slide 35 text

Simplest model: each unit in each layer is connected to all units in the previous layer. This is all we have considered so far.

Slide 36

Slide 36 text

How well does this perform on image classification?

Slide 37

Slide 37 text

MNIST hand-written digit dataset 28x28 images, 10 classes 60K training examples, 10K validation, 10K test Examples from MNIST

Slide 38

Slide 38 text

Network: 1 hidden layer of 64 units. After 300 iterations over the training set: 2.85% validation error. Hidden layer weights visualised as 28x28 images. Architecture: input 784 (28x28 images) → hidden 64 → output 10

Slide 39

Slide 39 text

Network: 2 hidden layers, both 256 units. After 300 iterations over the training set: 1.83% validation error. Architecture: input 784 (28x28 images) → hidden 256 → hidden 256 → output 10

Slide 40

Slide 40 text

MNIST is quite a special case: digits are nicely centred within the image and scaled to approx. the same size

Slide 41

Slide 41 text

The fully connected networks so far have a weakness: No translation invariance; learned features are position dependent

Slide 42

Slide 42 text

For more general imagery: requires a training set large enough to see all features in all possible positions… Requires network with enough units to represent this…

Slide 43

Slide 43 text

Convolutional networks

Slide 44

Slide 44 text

Convolution: slide a convolution kernel over an image; multiply image pixels by kernel pixels and sum

Slide 45

Slide 45 text

Convolution Convolutions are often used for feature detection
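A direct (and deliberately slow) numpy sketch of the slide-multiply-sum operation on a single-channel image, with a Sobel-like kernel as an example feature detector:

import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply the image patch by the kernel pixels and sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

edge_kernel = np.array([[1., 0., -1.],
                        [2., 0., -2.],
                        [1., 0., -1.]])   # responds to vertical edges
print(conv2d(np.random.rand(28, 28), edge_kernel).shape)   # (26, 26)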

Slide 46

Slide 46 text

A brief detour…

Slide 47

Slide 47 text

Gabor filters

Slide 48

Slide 48 text

Used for texture classification. Bears similarity to the low levels of the cat visual system [Jones87]

Slide 49

Slide 49 text

Back on track to… Convolutional networks

Slide 50

Slide 50 text

Recap: FC (fully-connected) layer, y = f(Wx + b): input vector x, weighted connections W, bias b, activation function / non-linearity f, layer activation y

Slide 51

Slide 51 text

Convolutional layer Each unit only connected to units in its neighbourhood

Slide 52

Slide 52 text

Convolutional layer Weights are shared Red weights have same value As do greens… And yellows

Slide 53

Slide 53 text

The values of the weights form a convolution kernel. For practical computer vision, more than one kernel must be used to extract a variety of features

Slide 54

Slide 54 text

Convolutional layer Different weight-kernels: Output is vector/image with multiple channels

Slide 55

Slide 55 text

Still y = f(Wx + b), as convolution can be expressed as multiplication by a weight matrix
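A tiny 1D illustration of why this is still y = f(Wx + b): the sliding kernel can be written as a sparse weight matrix whose rows are shifted copies of the same shared values (example of my own, not from the slides):

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
k = np.array([1., 0., -1.])              # a 3-tap kernel

W = np.zeros((3, 5))
for i in range(3):
    W[i, i:i+3] = k                      # each row is the kernel, shifted

print(W @ x)                             # [-2. -2. -2.]
print(np.correlate(x, k, mode='valid'))  # the same numbers via a sliding window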

Slide 56

Slide 56 text

Note In subsequent layers, each kernel connects to pixels in ALL channels in previous layer

Slide 57

Slide 57 text

Max-pooling ‘layer’ [Ciresan12]: take the maximum value from each (p, q) pooling region. Down-samples the image by that factor. Operates on channels independently
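A minimal numpy sketch of 2x2 max-pooling on one channel (batch and channel handling omitted for brevity):

import numpy as np

def maxpool2x2(channel):
    h, w = channel.shape
    # Group pixels into non-overlapping 2x2 blocks and take the max of each
    blocks = channel[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

img = np.arange(16).reshape(4, 4).astype(float)
print(maxpool2x2(img))   # each output entry is the max of a 2x2 region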

Slide 58

Slide 58 text

These are the models that have been getting excellent ImageNet results

Slide 59

Slide 59 text

How about an example?

Slide 60

Slide 60 text

A Simplified LeNet [LeCun95] for MNIST digits

Slide 61

Slide 61 text

Simplified LeNet for MNIST digits: input 28x28x1 → conv (20 5x5 kernels) → 24x24x20 → maxpool 2x2 → 12x12x20 → conv (50 5x5 kernels) → 8x8x50 → maxpool 2x2 → 4x4x50 → flatten → fully connected (256) → fully connected (10) → output
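As a sanity check on the sizes above, a little arithmetic sketch (assuming 'valid' 5x5 convolutions with no padding and non-overlapping 2x2 pooling):

def conv_out(size, kernel):     # 'valid' convolution: no padding
    return size - kernel + 1

def pool_out(size, factor):     # non-overlapping max-pooling
    return size // factor

s = 28
s = conv_out(s, 5)   # 24  (20 channels)
s = pool_out(s, 2)   # 12
s = conv_out(s, 5)   # 8   (50 channels)
s = pool_out(s, 2)   # 4
print(s * s * 50)    # 800 values flattened into the fully connected layers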

Slide 62

Slide 62 text

After 300 iterations over the training set: 99.21% validation accuracy. Model / Error: FC64: 2.85%; FC256--FC256: 1.83%; 20C5--MP2--50C5--MP2--FC256: 0.79%

Slide 63

Slide 63 text

What about the learned kernels? They resemble Gabor filters. Image taken from [Krizhevsky12]

Slide 64

Slide 64 text

Neural networks – recent developments

Slide 65

Slide 65 text

GOOD NEWS Training neural networks is more practical now

Slide 66

Slide 66 text

More processing power Less of a black art

Slide 67

Slide 67 text

IMPROVEMENTS Processing power ReLU activation function Batch normalisation DropOut

Slide 68

Slide 68 text

(most important) IMPROVEMENT Processing power

Slide 69

Slide 69 text

Image processing requires large networks with perhaps millions of parameters. Lots of training examples are needed to train them. This easily results in billions or even trillions of FLOPs

Slide 70

Slide 70 text

Neural networks are ‘embarrassingly parallelisable’ therefore ideally suited to GPUs Use GPUs for all but the smallest of networks

Slide 71

Slide 71 text

As of now, NVIDIA is the most popular make of GPU. Cheaper gaming cards are perfectly adequate; only use Tesla cards in production

Slide 72

Slide 72 text

IMPROVEMENT New popular activation function: ReLU - Rectified Linear Unit

Slide 73

Slide 73 text

ReLU - Rectified Linear Unit: f(x) = max(x, 0)

Slide 74

Slide 74 text

ReLU works better than tanh / sigmoid in many cases I don’t really understand the reasons (to be honest! ) See [Glorot11] [Glorot10]; written by people who do!

Slide 75

Slide 75 text

IMPROVEMENT Batch normalisation

Slide 76

Slide 76 text

PROBLEM: Magnitudes of activations can vary considerably from layer to layer. If each layer ‘multiplies’ the magnitude by some factor, they explode or vanish

Slide 77

Slide 77 text

SOLUTIONS Initially: careful weight initialisation. Now: replaced by batch normalisation

Slide 78

Slide 78 text

y = f(Wx + b). Assume σ(x) = 1. σ(y) depends on the distribution of W: normal, uniform, std-dev, etc.

Slide 79

Slide 79 text

In the past: to initialise W, rules of thumb were often used, e.g. a normal distribution with σ = 0.01. Problems arise when training deep networks with > 8 layers [Simonyan14], [He15]

Slide 80

Slide 80 text

Previously: carefully choose the distribution of W so that σ(y) ≈ σ(x)

Slide 81

Slide 81 text

E.g. the approach by He et al. [He15]: σ = g √(1/n), where n is the fan-in (the number of incoming connections) and g is the gain
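A hedged numpy sketch of this style of initialisation, reading the formula as std-dev = gain x sqrt(1 / fan-in), with gain sqrt(2) for ReLU layers as in [He15]:

import numpy as np

def he_init(fan_in, fan_out, gain=np.sqrt(2.0), rng=np.random):
    # Draw weights with std-dev = gain * sqrt(1 / fan_in)
    std = gain * np.sqrt(1.0 / fan_in)
    return rng.randn(fan_out, fan_in) * std

W = he_init(784, 256)
print(W.std())   # roughly sqrt(2 / 784), about 0.05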

Slide 82

Slide 82 text

New approach: BATCH NORMALISATION [Ioffe15] Keep distribution of activations sane as part of the network architecture

Slide 83

Slide 83 text

BATCH NORMALISATION For each mini-batch of examples during training Normalise using mean and standard deviation

Slide 84

Slide 84 text

Layer equation becomes: y = f(γ(Wx − μ)/σ + β), where μ = mean(Wx) and σ = std(Wx); γ (scale, not needed if f is ReLU) and β (bias) are learned parameters

Slide 85

Slide 85 text

Note: For a fully connected layer, each unit/output should have its own mean and std-dev; aggregate across examples in the mini-batch

Slide 86

Slide 86 text

For a convolutional layer, each channel should have its own mean and std-dev; aggregate across examples in the mini-batch and across image rows and columns

Slide 87

Slide 87 text

During training, keep a running exponential moving average of mean and std-dev During test time, use the averaged mean and std-dev
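A rough numpy sketch of training-time batch normalisation for a fully connected layer, with one mean/std-dev per unit aggregated over the mini-batch and running averages kept for test time (names and the momentum value are my own; for a convolutional layer you would aggregate over the batch and the image rows/columns instead):

import numpy as np

def batchnorm_train(z, gamma, beta, running, momentum=0.99, eps=1e-5):
    # z: (batch, units) pre-activations, i.e. Wx for the whole mini-batch
    mu = z.mean(axis=0)        # per-unit mean over the mini-batch
    sigma = z.std(axis=0)      # per-unit std-dev over the mini-batch
    z_norm = (z - mu) / (sigma + eps)
    # Exponential moving averages, used in place of mu/sigma at test time
    running['mu'] = momentum * running['mu'] + (1 - momentum) * mu
    running['sigma'] = momentum * running['sigma'] + (1 - momentum) * sigma
    return gamma * z_norm + beta   # the non-linearity f is applied afterwards

running = {'mu': np.zeros(256), 'sigma': np.ones(256)}
z = np.random.randn(128, 256)      # a mini-batch of 128 examples, 256 units
out = batchnorm_train(z, np.ones(256), np.zeros(256), running)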

Slide 88

Slide 88 text

Reducing Over-fitting

Slide 89

Slide 89 text

Over-fitting is always a problem in ML. A model over-fits when it is very good at matching samples in the training set but not those in the validation/test set

Slide 90

Slide 90 text

Neural networks are very prone to over-fitting

Slide 91

Slide 91 text

Two techniques DropOut (quite a lot of people use batch normalisation instead) Dataset augmentation

Slide 92

Slide 92 text

DropOut [Hinton12] During training, randomly choose units to ‘drop out’ by setting their output to 0, with probability p, usually around 0.5 (compensate by multiplying the remaining values by 1/(1−p))

Slide 93

Slide 93 text

During test/predict: Run as normal (no DropOut)
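A minimal numpy sketch of this 'inverted' DropOut scheme: drop each unit with probability p during training, rescale the survivors by 1/(1−p), and do nothing at test time:

import numpy as np

def dropout(activations, p=0.5, train=True, rng=np.random):
    if not train:
        return activations                      # test/predict: run as normal
    keep = rng.rand(*activations.shape) >= p    # drop each unit with probability p
    return activations * keep / (1.0 - p)       # compensate the surviving units

h = np.random.randn(128, 256)                   # hidden layer activations
print(dropout(h, p=0.5).shape, dropout(h, train=False).shape)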

Slide 94

Slide 94 text

Normally applied to later, fully connected layers

Slide 95

Slide 95 text

Dropout OFF Input layer Hidden layer 0 Output layer

Slide 96

Slide 96 text

Dropout ON (1) Input layer Hidden layer 0 Output layer

Slide 97

Slide 97 text

Dropout ON (2) Input layer Hidden layer 0 Output layer

Slide 98

Slide 98 text

Sampling a different subset of the network for each training example Kind of like model averaging with only one model 

Slide 99

Slide 99 text

What effect does it have? (approx. replication of [Hinton12])

Slide 100

Slide 100 text

Dataset: MNIST Digits. Network: single hidden layer, fully connected, 256 units, p = 0.4. 5000 iterations over the training set

Slide 101

Slide 101 text

DropOut OFF: train loss 0.0003, validation loss 0.094, validation error 1.9%. DropOut ON: train loss 0.0034, validation loss 0.077, validation error 1.56%

Slide 102

Slide 102 text

Dataset augmentation: take the existing dataset and expand it by adding transformed versions of the existing samples

Slide 103

Slide 103 text

Dataset augmentation for images [Krizhevsky12] Cropping and translation Scaling Rotation Lighting/colour modifications
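A simple numpy sketch of two of these augmentations for a single image, random cropping and horizontal flipping (the crop size and flip probability are illustrative):

import numpy as np

def augment(image, crop=24, rng=np.random):
    h, w = image.shape[:2]
    # Random crop: take a crop-sized window at a random position
    top = rng.randint(0, h - crop + 1)
    left = rng.randint(0, w - crop + 1)
    out = image[top:top + crop, left:left + crop]
    # Random horizontal flip (scaling, rotation and colour jitter work similarly)
    if rng.rand() < 0.5:
        out = out[:, ::-1]
    return out

print(augment(np.random.rand(28, 28)).shape)   # (24, 24)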

Slide 104

Slide 104 text

Neural network software

Slide 105

Slide 105 text

Two categories of software: Neural network toolkit (normally faster) Expression compilers

Slide 106

Slide 106 text

Neural network toolkit Most popular is CAFFE (from Berkeley) http://caffe.berkeleyvision.org/

Slide 107

Slide 107 text

Specify network architecture in terms of layers

Slide 108

Slide 108 text

Layers usually described using custom config/language CAFFE uses Google Protocol Buffers for base syntax (YAML/JSON like) and for data (since GPB is binary)

Slide 109

Slide 109 text

CAFFE can be used from: command line MATLAB Python

Slide 110

Slide 110 text

Expression compilers: Theano (from the University of Montreal), Torch 7, TensorFlow (more recent)

Slide 111

Slide 111 text

Describe network architecture in terms of mathematical expressions Expressions compiled to CUDA code and executed on GPU

Slide 112

Slide 112 text

Theano, Torch 7 and TensorFlow provide automatic symbolic differentiation: a big win; fewer bugs and less manual work
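As an illustration of what expressions plus automatic differentiation buy you, a small Theano sketch of softmax regression, close in spirit to the deeplearning.net tutorial (treat the exact calls as indicative rather than definitive):

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')     # mini-batch of inputs
t = T.ivector('t')    # integer class labels
W = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

p = T.nnet.softmax(T.dot(x, W) + b)   # network output as a symbolic expression
cost = -T.mean(T.log(p)[T.arange(t.shape[0]), t])

# Symbolic differentiation: no hand-derived gradients required
gW, gb = T.grad(cost, [W, b])
train = theano.function([x, t], cost,
                        updates=[(W, W - 0.1 * gW), (b, b - 0.1 * gb)])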

Slide 113

Slide 113 text

In comparison: Network toolkit (e.g. CAFFE). Advantages: • CAFFE is fast • Most likely easier to get going • Bindings for MATLAB, Python, command line access. Disadvantages: • Less flexible; harder to extend (need to learn the architecture, manual differentiation). Expression compiler (e.g. Theano). Advantages: • Extensible; a new layer type or cost function is no problem • See what goes on under the hood • Being adventurous is easier! Disadvantages: • Slower (Theano) • Debugging can be tricky (compiled expressions are a step away from your code) • Typically only works with one language (e.g. Python for Theano)

Slide 114

Slide 114 text

Resources and tutorials to get you going

Slide 115

Slide 115 text

http://cs.stanford.edu/people/karpathy/convnetjs/ Neural networks running in your web browser. Excellent demos that show how they work and what they can do

Slide 116

Slide 116 text

https://github.com/Newmu/Theano-Tutorials Very simple Python code examples proceeding through logistic regression, fully connected and convolutional models. Shows complete mathematical expressions and training procedures

Slide 117

Slide 117 text

http://deeplearning.net/tutorial/ More Theano tutorials More complete; explains mathematics behind them Code is longer than previous examples

Slide 118

Slide 118 text

CAFFE: http://caffe.berkeleyvision.org/ Plenty of documentation and tutorials

Slide 119

Slide 119 text

Our work

Slide 120

Slide 120 text

CCTV for Fisheries Project involving Dr. M. Fisher, Dr. M. Mackiewicz Funded by Marine Scotland

Slide 121

Slide 121 text

Automatically quantify the amount of fish discarded by fishing trawlers (preferably by species) Process surveillance footage of discard belt

Slide 122

Slide 122 text

No content

Slide 123

Slide 123 text

STEPS: Segment fish from background Separate fish from one another Classify individual fish (TODO) Measure individual fish to estimate mass (TODO)

Slide 124

Slide 124 text

Approach: use the N⁴-Fields approach [Ganin14]

Slide 125

Slide 125 text

N⁴-Fields: use a network to transform an input image patch into a 16-element codeword vector. The codeword vector is used to look up the target (foreground or edge) patch that most closely matches, in a dictionary of words

Slide 126

Slide 126 text

Use N⁴-Fields to transform the input image to a foreground map. Use N⁴-Fields to transform the input image to an edge map. Use the watershed algorithm to separate the image into regions

Slide 127

Slide 127 text

No content

Slide 128

Slide 128 text

No content

Slide 129

Slide 129 text

No content

Slide 130

Slide 130 text

Videos

Slide 131

Slide 131 text

Has problems with shadows Need more training data Much to do yet!

Slide 132

Slide 132 text

Some cool work in the field that might be of interest

Slide 133

Slide 133 text

Visualizing and understanding convolutional networks [Zeiler14] Visualisations of responses of layers to images

Slide 134

Slide 134 text

Visualizing and understanding convolutional networks [Zeiler14] Image taken from [Zeiler14]

Slide 135

Slide 135 text

Visualizing and understanding convolutional networks [Zeiler14] Image taken from [Zeiler14]

Slide 136

Slide 136 text

Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen15] Generate images that are unrecognizable to human eyes but are recognized by the network with high confidence

Slide 137

Slide 137 text

Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen15] Image taken from [Nguyen15]

Slide 138

Slide 138 text

Learning to generate chairs with convolutional neural networks [Dosovitskiy15] A network in reverse: orientation, design, colour, etc. parameters as input; rendered images as the outputs and training targets

Slide 139

Slide 139 text

Learning to generate chairs with convolutional neural networks [Dosovitskiy15] Image taken from [Dosovitskiy15]

Slide 140

Slide 140 text

A Neural Algorithm of Artistic Style [Gatys15] Take an OxfordNet model [Simonyan14] and extract texture features from one of the convolutional layers, given a target style / painting as input. Use gradient descent to iterate the photo (not the weights) so that its texture features match those of the target image.

Slide 141

Slide 141 text

A Neural Algorithm of Artistic Style [Gatys15] Image taken from [Gatys15]

Slide 142

Slide 142 text

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Nets [Radford15] Train two networks: one is given random parameters and generates an image, the other discriminates between a generated image and one from the training set

Slide 143

Slide 143 text

Generative Adversarial Nets [Radford15] Images of bedrooms generated using neural net Image taken from [Radford15]

Slide 144

Slide 144 text

Generative Adversarial Nets [Radford15] Image taken from [Radford15]

Slide 145

Slide 145 text

Finishing words

Slide 146

Slide 146 text

Deep learning is a fascinating field with lots going on Very flexible, wide range of techniques and applications

Slide 147

Slide 147 text

Deep neural networks have proved to be highly effective* for computer vision, speech recognition and other areas *like with every other shiny new toy, see the small-print!

Slide 148

Slide 148 text

SMALL-PRINT Sufficient training data required (curse of dimensionality) Dataset augmentation advisable

Slide 149

Slide 149 text

SMALL-PRINT The model will only represent the training examples; it may not (probably won't) generalise

Slide 150

Slide 150 text

SMALL-PRINT Choose architecture carefully Use a GPU

Slide 151

Slide 151 text

I hope this has proved to be a good introduction to the topic!

Slide 152

Slide 152 text

Thank you!

Slide 153

Slide 153 text

References

Slide 154

Slide 154 text

[Ciresan12] Ciresan, Meier and Schmidhuber; Multi-column deep neural networks for image classification, Computer Vision and Pattern Recognition (CVPR), 2012

Slide 155

Slide 155 text

[Dosovitskiy15] Dosovitskiy, Springenberg and Brox; Learning to generate chairs with convolutional neural networks, arXiv preprint, 2015

Slide 156

Slide 156 text

[Ganin14] Ganin and Lempitsky; N⁴-Fields: Neural Network Nearest Neighbor Fields for Image Transforms, 12th Asian Conference on Computer Vision, 2014

Slide 157

Slide 157 text

[Gatys15] Gatys, Ecker and Bethge; A Neural Algorithm of Artistic Style, arXiv:1508.06576, 2015

Slide 158

Slide 158 text

[Glorot10] Glorot, Bengio; Understanding the difficulty of training deep feedforward neural networks, International conference on artificial intelligence and statistics, 2010

Slide 159

Slide 159 text

[Glorot11] Glorot, Bordes, Bengio; Deep Sparse Rectifier Neural Networks, JMLR 2011

Slide 160

Slide 160 text

[He15] He, Zhang, Ren and Sun; Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, arXiv 2015

Slide 161

Slide 161 text

[Hinton12] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov; Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Slide 162

Slide 162 text

[Ioffe15] Ioffe and Szegedy; Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, ICML 2015, arXiv:1502.03167

Slide 163

Slide 163 text

[Jones87] Jones and Palmer; An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex, J. Neurophysiol. 58(6): 1233-1258, 1987

Slide 164

Slide 164 text

[Krizhevsky12] Krizhevsky, Sutskever and Hinton; ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012

Slide 165

Slide 165 text

[LeCun95] LeCun et al.; Comparison of learning algorithms for handwritten digit recognition, International Conference on Artificial Neural Networks, 1995

Slide 166

Slide 166 text

[Nguyen15] Nguyen, Yosinski and Clune; Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images, Computer Vision and Pattern Recognition (CVPR) 2015

Slide 167

Slide 167 text

[Radford15] Radford, Metz, Chintala; Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, arXiv:1511.06434, 2015

Slide 168

Slide 168 text

[Simonyan14] Simonyan and Zisserman; Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014

Slide 169

Slide 169 text

[Zeiler14] Zeiler and Fergus; Visualizing and understanding convolutional networks, Computer Vision - ECCV 2014