Slide 1

Slide 1 text

Deep Learning: An Introductory Tutorial. G. French, University of East Anglia & King's College London. Image montages from http://www.image-net.org

Slide 2

Slide 2 text

ImageNet

Slide 3

Slide 3 text

Image classification dataset

Slide 4

Slide 4 text

~1,000,000 images ~1,000 classes Ground truths prepared manually through Amazon Mechanical Turk

Slide 5

Slide 5 text

ImageNet Top-5 challenge: You score if the ground truth class is one of your top 5 predictions
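As a concrete illustration of the scoring rule, a minimal NumPy sketch (the function name and arguments are hypothetical, not part of the challenge tooling):

```python
import numpy as np

def scores_top5(class_probs, true_class):
    """Return True if the ground-truth class is among the 5 most probable classes."""
    top5 = np.argsort(class_probs)[-5:]   # indices of the 5 largest probabilities
    return true_class in top5
```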

Slide 6

Slide 6 text

ImageNet in 2012 Best approaches used hand-crafted features (SIFT, HOGs, Fisher vectors, etc) + classifier Top-5 error rate: ~25%

Slide 7

Slide 7 text

Then the game changed.

Slide 8

Slide 8 text

Krizhevsky, Sutskever and Hinton; ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky12] Top-5 error rate of ~15%

Slide 9

Slide 9 text

In the last few years, more modern networks have achieved better results still [Simonyan14, He15] Top-5 error rates of ~5-7%

Slide 10

Slide 10 text

I hope this talk will give you an idea of how!

Slide 11

Slide 11 text

What is a neural network?

Slide 12

Slide 12 text

Multiple layers Data propagates through layers Transformed by each layer

Slide 13

Slide 13 text

Neural network (diagram): inputs → input layer → hidden layer 0 → hidden layer 1 → ⋯ → output layer → outputs

Slide 14

Slide 14 text

Single layer of a neural network: y = f(Wx + b). Input vector x, weighted connections W, bias b, activation function / non-linearity f, layer activation y

Slide 15

Slide 15 text

x = input (M-element vector); y = output (N-element vector); W = network weights (N×M matrix); b = bias (N-element vector); f = activation function: tanh / sigmoid / ReLU; y = f(Wx + b)
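A minimal NumPy sketch of this single-layer computation (names and sizes are illustrative only):

```python
import numpy as np

def dense_layer(x, W, b, f=np.tanh):
    """Single layer: y = f(Wx + b)."""
    return f(W.dot(x) + b)

rng = np.random.RandomState(0)
x = rng.randn(4)              # input, M = 4
W = rng.randn(3, 4) * 0.01    # weights, N x M = 3 x 4
b = np.zeros(3)               # bias, N = 3
y = dense_layer(x, W, b)      # layer activation, N = 3
```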

Slide 16

Slide 16 text

Multiple layers: input vector x → hidden layer 0 activation y_0 = f(W_0 x + b_0) → hidden layer 1 activation y_1 = f(W_1 y_0 + b_1) → ⋯ → final layer activation (output) y_L = f(W_L y_{L-1} + b_L)

Slide 17

Slide 17 text

In a nutshell: y = f(Wx + b)

Slide 18

Slide 18 text

Repeat for each layer: y_0 = f(W_0 x + b_0); y_1 = f(W_1 y_0 + b_1); ⋯; y_L = f(W_L y_{L-1} + b_L)
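A sketch of this layer-by-layer propagation, assuming the parameters are held as a list of (W, b) pairs (a toy illustration, not a full framework):

```python
import numpy as np

def forward(x, layers, f=np.tanh):
    """Propagate x through the layers: y_i = f(W_i y_{i-1} + b_i)."""
    y = x
    for W, b in layers:       # layers = [(W_0, b_0), (W_1, b_1), ...]
        y = f(W.dot(y) + b)
    return y
```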

Slide 19

Slide 19 text

As a classifier (diagram): image pixels as the input vector → hidden layer 0 activation → ⋯ → final layer activation with softmax (output), giving class probabilities (e.g. 0.25, 0.5, 0.1, 0.15)
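The softmax that turns the final-layer values into class probabilities can be sketched as follows (the example inputs are chosen so the outputs come out close to the probabilities on the slide):

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalise so the outputs form a probability distribution."""
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

probs = softmax(np.array([1.0, 1.7, 0.1, 0.5]))
# probs is approximately [0.25, 0.50, 0.10, 0.15] and sums to 1
```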

Slide 20

Slide 20 text

Training a neural network

Slide 21

Slide 21 text

Learn values for the parameters W and b (for each layer). Use back-propagation

Slide 22

Slide 22 text

Initialise weights randomly (more on this later) Initialise biases to 0

Slide 23

Slide 23 text

Iteratively modify parameters with gradient descent

Slide 24

Slide 24 text

Take an example from the training set and evaluate the network given the training input x, obtaining the predicted output y

Slide 25

Slide 25 text

The cost (sometimes called loss) is a measure of the difference between the network output and the ground-truth output

Slide 26

Slide 26 text

Compute the derivative of the cost w.r.t. the params (W and b)

Slide 27

Slide 27 text

Update parameters: W_0' = W_0 − γ ∂C/∂W_0; b_0' = b_0 − γ ∂C/∂b_0 (and likewise for each layer); γ = learning rate, C = cost
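A sketch of the update rule, assuming the gradients have already been computed by back-propagation:

```python
def sgd_step(params, grads, learning_rate=0.1):
    """Gradient-descent update: p' = p - gamma * dC/dp for every parameter."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

# e.g. params = [W0, b0, W1, b1, ...] with grads holding the matching derivatives
```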

Slide 28

Slide 28 text

In practice this is done on a mini-batch of examples (e.g. 128) in parallel per pass: compute the cost for each example, then average; compute the derivative of the average cost w.r.t. the params.
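A sketch of mini-batch iteration (batch size and names are illustrative):

```python
import numpy as np

def iterate_minibatches(X, y, batch_size=128, rng=np.random):
    """Yield shuffled mini-batches; cost and gradients are averaged over each batch."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]
```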

Slide 29

Slide 29 text

Using mini-batches works well when using a GPU since computations can be parallelised

Slide 30

Slide 30 text

Cost function

Slide 31

Slide 31 text

Regression Final layer: no activation function / identity. Cost: Sum of squared differences

Slide 32

Slide 32 text

Classification Just like logistic regression

Slide 33

Slide 33 text

Final layer: softmax as the activation function; output is a vector of class probabilities. Cost: negative log-likelihood / categorical cross-entropy
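Both costs are simple to write down; a NumPy sketch of the two cases (regression and classification):

```python
import numpy as np

def squared_error(pred, target):
    """Regression cost: sum of squared differences."""
    return np.sum((pred - target) ** 2)

def categorical_cross_entropy(class_probs, true_class):
    """Classification cost: negative log-likelihood of the ground-truth class."""
    return -np.log(class_probs[true_class])
```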

Slide 34

Slide 34 text

Fully connected neural networks

Slide 35

Slide 35 text

Simplest model Each unit in each layer is connected to all units in the previous layer All we have considered so far

Slide 36

Slide 36 text

How well does this perform on image classification?

Slide 37

Slide 37 text

MNIST hand-written digit dataset 28x28 images, 10 classes 60K training examples, 10K validation, 10K test Examples from MNIST

Slide 38

Slide 38 text

Network: 1 hidden layer of 64 units (input: 784 = 28x28 image pixels → hidden: 64 → output: 10). After 300 iterations over the training set: 2.85% validation error. Hidden layer weights visualised as 28x28 images.

Slide 39

Slide 39 text

Network: 2 hidden layers, both 256 units (input: 784 = 28x28 image pixels → hidden: 256 → hidden: 256 → output: 10). After 300 iterations over the training set: 1.83% validation error.

Slide 40

Slide 40 text

MNIST is quite a special case Digits nicely centred within image Scaled to approx. same size

Slide 41

Slide 41 text

The fully connected networks so far have a weakness: No translation invariance; learned features are position dependent

Slide 42

Slide 42 text

For more general imagery: requires a training set large enough to see all features in all possible positions… Requires network with enough units to represent this…

Slide 43

Slide 43 text

Convolutional networks

Slide 44

Slide 44 text

Convolution Slide a convolution kernel over an image Multiply image pixels by kernel pixels and sum
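A naive sketch of the operation (strictly a cross-correlation, which is what convolutional layers usually compute; 'valid' borders, single channel):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; multiply and sum at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```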

Slide 45

Slide 45 text

Convolution Convolutions are often used for feature detection

Slide 46

Slide 46 text

A brief detour…

Slide 47

Slide 47 text

Gabor filters (image ∗ kernel examples)

Slide 48

Slide 48 text

Used for texture classification Bears similarity to low levels of the cat visual system [Jones87]

Slide 49

Slide 49 text

Back on track to… Convolutional networks

Slide 50

Slide 50 text

Recap: FC (fully-connected) layer: y = f(Wx + b). Input vector x, weighted connections W, bias b, activation function / non-linearity f, layer activation y

Slide 51

Slide 51 text

Convolutional layer Each unit only connected to units in its neighbourhood

Slide 52

Slide 52 text

Convolutional layer Weights are shared Red weights have same value As do greens… And yellows

Slide 53

Slide 53 text

The values of the weights form a convolution kernel For practical computer vision, more than one kernel must be used to extract a variety of features

Slide 54

Slide 54 text

Convolutional layer Different weight-kernels: Output is vector/image with multiple channels

Slide 55

Slide 55 text

Still y = f(Wx + b), as convolution can be expressed as multiplication by a weight matrix
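A 1D sketch showing the equivalence: build the (sparse, weight-sharing) matrix whose product with the input gives the same result as the convolution:

```python
import numpy as np

def conv_as_matrix(kernel, input_len):
    """Weight matrix whose rows contain shifted copies of the kernel."""
    k = len(kernel)
    out_len = input_len - k + 1
    W = np.zeros((out_len, input_len))
    for i in range(out_len):
        W[i, i:i + k] = kernel
    return W

x = np.arange(6.0)
kernel = np.array([1.0, 0.0, -1.0])
W = conv_as_matrix(kernel, len(x))
direct = np.array([np.dot(kernel, x[i:i + 3]) for i in range(4)])
assert np.allclose(W.dot(x), direct)   # same result, so y = f(Wx + b) still applies
```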

Slide 56

Slide 56 text

Note In subsequent layers, each kernel connects to pixels in ALL channels in previous layer

Slide 57

Slide 57 text

Max-pooling ‘layer’ [Ciresan12] Take the maximum value from each (p, q) pooling region Down-samples the image by a factor of (p, q) Operates on channels independently
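A NumPy sketch of (p, q) max-pooling on a single channel (apply it to each channel independently in practice):

```python
import numpy as np

def maxpool(image, p=2, q=2):
    """Take the maximum of each p x q region, down-sampling by (p, q)."""
    h, w = image.shape[0] // p, image.shape[1] // q
    return image[:h * p, :w * q].reshape(h, p, w, q).max(axis=(1, 3))
```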

Slide 58

Slide 58 text

These are the models that have been getting excellent ImageNet results

Slide 59

Slide 59 text

How about an example?

Slide 60

Slide 60 text

A Simplified LeNet [LeCun95] for MNIST digits

Slide 61

Slide 61 text

Simplified LeNet for MNIST digits: input 28x28x1 → conv (20 5x5 kernels) → 24x24x20 → maxpool 2x2 → 12x12x20 → conv (50 5x5 kernels) → 8x8x50 → maxpool 2x2 → 4x4x50 → (flatten and) fully connected → 256 → fully connected → 10 outputs

Slide 62

Slide 62 text

After 300 iterations over the training set: 99.21% validation accuracy.
Model: Error
FC64: 2.85%
FC256--FC256: 1.83%
20C5--MP2--50C5--MP2--FC256: 0.79%

Slide 63

Slide 63 text

What about the learned kernels? They resemble Gabor filters. Image taken from [Krizhevsky12]

Slide 64

Slide 64 text

Neural networks – recent developments

Slide 65

Slide 65 text

GOOD NEWS Training neural networks is more practical now

Slide 66

Slide 66 text

More processing power Less of a black art

Slide 67

Slide 67 text

(most important) IMPROVEMENT Processing power

Slide 68

Slide 68 text

Image processing requires large networks with perhaps millions of parameters, and lots of training examples are needed to train them Easily results in billions or even trillions of FLOPs

Slide 69

Slide 69 text

Neural networks are ‘embarrassingly parallelisable’ therefore ideally suited to GPUs Use GPUs for all but the smallest of networks

Slide 70

Slide 70 text

As of now, NVIDIA is the most popular make of GPU. Cheaper gaming cards are perfectly adequate; Tesla cards are only needed for production use

Slide 71

Slide 71 text

IMPROVEMENT New popular activation function: ReLU - Rectified Linear Unit

Slide 72

Slide 72 text

ReLU - Rectified Linear Unit: f(x) = max(x, 0)
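In code the ReLU is a one-liner:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: element-wise max(x, 0)."""
    return np.maximum(x, 0)
```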

Slide 73

Slide 73 text

ReLU works better than tanh / sigmoid in many cases. I don't really understand the reasons (to be honest! :-) See [Glorot11] and [Glorot10], written by people who do!

Slide 74

Slide 74 text

IMPROVEMENT Random weight initialisation

Slide 75

Slide 75 text

Previously: rules of thumb were often used, e.g. a normal distribution with σ = 0.01. Problems arise when training deep networks with > 8 layers [Simonyan14], [He15]

Slide 76

Slide 76 text

More recent approaches choose initial weights to maintain unit variance (as much as possible) throughout layers Otherwise layers can reduce or magnify magnitudes of signals exponentially

Slide 77

Slide 77 text

Recent approach by He et al. [He15]: σ = √(g / n), where n is the fan-in (the number of incoming connections) and g is the gain (for the ReLU activation function use g = 2)

Slide 78

Slide 78 text

For an FC layer y = Wx + b: n = size of x / width of W (x is a P-element vector, W is a Q×P matrix, so n = P)

Slide 79

Slide 79 text

For a convolutional layer: n = product of kernel width, kernel height and the number of channels incoming from the previous layer

Slide 80

Slide 80 text

This will ensure that var(y) ≈ var(x)
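A sketch of the initialisation scheme from the previous slides, assuming the σ = √(gain / fan_in) form (function names are illustrative):

```python
import numpy as np

def he_init_fc(n_out, n_in, gain=2.0, rng=np.random):
    """FC layer weights drawn with std = sqrt(gain / fan_in); fan_in = n_in."""
    return rng.randn(n_out, n_in) * np.sqrt(gain / n_in)

def he_init_conv(n_kernels, n_channels_in, kh, kw, gain=2.0, rng=np.random):
    """Conv layer fan_in = kernel height * kernel width * incoming channels."""
    fan_in = kh * kw * n_channels_in
    return rng.randn(n_kernels, n_channels_in, kh, kw) * np.sqrt(gain / fan_in)
```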

Slide 81

Slide 81 text

Reducing Over-fitting

Slide 82

Slide 82 text

Over-fitting is always a problem in ML. A model over-fits when it is very good at matching samples in the training set but not those in the validation/test set

Slide 83

Slide 83 text

Neural networks are very prone to over-fitting

Slide 84

Slide 84 text

Two techniques DropOut Dataset augmentation

Slide 85

Slide 85 text

DropOut [Hinton12] During training, randomly choose units to ‘drop out’ by setting their output to 0, with probability p, usually around 0.5 (compensate by multiplying the remaining values by 1/(1−p))

Slide 86

Slide 86 text

During test/predict: Run as normal (no DropOut)

Slide 87

Slide 87 text

Normally applied to later, fully connected layers
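A sketch of the mechanism from the last few slides (the 1/(1−p) compensation during training; at test time the layer just passes values through):

```python
import numpy as np

def dropout(activations, p=0.5, train=True, rng=np.random):
    """Zero each unit with probability p during training, scaling survivors by 1/(1-p)."""
    if not train:
        return activations                       # test/predict: run as normal
    mask = rng.binomial(1, 1.0 - p, size=activations.shape)
    return activations * mask / (1.0 - p)
```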

Slide 88

Slide 88 text

Dropout OFF Input layer Hidden layer 0 Output layer

Slide 89

Slide 89 text

Dropout ON (1) Input layer Hidden layer 0 Output layer

Slide 90

Slide 90 text

Dropout ON (2) Input layer Hidden layer 0 Output layer

Slide 91

Slide 91 text

Sampling a different subset of the network for each training example. Kind of like model averaging with only one model :-)

Slide 92

Slide 92 text

What effect does it have? (approx. replication of [Hinton12])

Slide 93

Slide 93 text

Dataset: MNIST Digits. Network: single hidden layer, fully connected, 256 units, p = 0.4; 5000 iterations over the training set

Slide 94

Slide 94 text

DropOut OFF: train loss 0.0003, validation loss 0.094, validation error 1.9%
DropOut ON: train loss 0.0034, validation loss 0.077, validation error 1.56%

Slide 95

Slide 95 text

Loss plots (epoch 100 onwards, for scale)

Slide 96

Slide 96 text

Dataset augmentation Take an existing dataset and expand it by adding transformed versions of the existing samples

Slide 97

Slide 97 text

Dataset augmentation for images [Krizhevsky12] Cropping and translation Scaling Rotation Lighting/colour modifications
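A toy sketch of augmentation by random translation and mirroring (real pipelines such as [Krizhevsky12]'s also crop, scale, rotate and perturb colour):

```python
import numpy as np

def augment(image, max_shift=2, rng=np.random):
    """Return a randomly translated, possibly mirrored copy of a 2D image."""
    dy, dx = rng.randint(-max_shift, max_shift + 1, size=2)
    out = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
    if rng.rand() < 0.5:
        out = out[:, ::-1]                       # horizontal flip
    return out
```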

Slide 98

Slide 98 text

Neural network software

Slide 99

Slide 99 text

Two categories of software: neural network toolkits (normally faster) and expression compilers

Slide 100

Slide 100 text

Neural network toolkit Most popular is CAFFE (from Berkeley) http://caffe.berkeleyvision.org/

Slide 101

Slide 101 text

Specify network architecture in terms of layers

Slide 102

Slide 102 text

Layers are usually described using a custom config language. CAFFE uses Google Protocol Buffers for the base syntax (YAML/JSON-like) and for data (since GPB is binary)

Slide 103

Slide 103 text

CAFFE can be used from: command line MATLAB Python

Slide 104

Slide 104 text

Expression compilers: Theano (from the University of Montreal), Torch 7, TensorFlow (more recent)

Slide 105

Slide 105 text

Describe network architecture in terms of mathematical expressions Expressions compiled to CUDA code and executed on GPU

Slide 106

Slide 106 text

Theano, Torch 7 and TensorFlow provide automatic symbolic differentiation: Big win; fewer bugs and less manual work
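A minimal Theano sketch of what symbolic differentiation buys you: T.grad derives the gradient expressions, so no derivative is written by hand (a tiny linear model, purely for illustration):

```python
import numpy as np
import theano
import theano.tensor as T

x = T.vector('x')
t = T.scalar('t')
w = theano.shared(np.zeros(3), name='w')
b = theano.shared(0.0, name='b')

pred = T.dot(w, x) + b                 # tiny linear model
cost = (pred - t) ** 2                 # squared-error cost

gw, gb = T.grad(cost, [w, b])          # gradients derived symbolically

train = theano.function(               # compiled training step with SGD updates
    inputs=[x, t], outputs=cost,
    updates=[(w, w - 0.1 * gw), (b, b - 0.1 * gb)])
```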

Slide 107

Slide 107 text

In comparison
Network toolkit (e.g. CAFFE)
Advantages:
• CAFFE is fast
• Most likely easier to get going
• Bindings for MATLAB, Python, command line access
Disadvantages:
• Less flexible; harder to extend (need to learn the architecture, manual differentiation)
Expression compiler (e.g. Theano)
Advantages:
• Extensible; new layer type or cost function: no problem
• See what goes on under the hood
• Being adventurous is easier!
Disadvantages:
• Slower (Theano)
• Debugging can be tricky (compiled expressions are a step away from your code)
• Typically only work with one language (e.g. Python for Theano)

Slide 108

Slide 108 text

Resources and tutorials to get you going

Slide 109

Slide 109 text

http://cs.stanford.edu/people/karpathy/convnetjs/ Neural networks running in your web browser Excellent demos that show how they work and what they can do

Slide 110

Slide 110 text

https://github.com/Newmu/Theano-Tutorials Very simple Python code examples proceeding through logistic regression, fully connected and convolutional models. Shows complete mathematical expressions and training procedures

Slide 111

Slide 111 text

http://deeplearning.net/tutorial/ More Theano tutorials More complete; explains mathematics behind them Code is longer than previous examples

Slide 112

Slide 112 text

CAFFE: http://caffe.berkeleyvision.org/ Plenty of documentation and tutorials

Slide 113

Slide 113 text

Some cool work in the field that might be of interest

Slide 114

Slide 114 text

Visualizing and understanding convolutional networks [Zeiler14] Visualisations of responses of layers to images

Slide 115

Slide 115 text

Visualizing and understanding convolutional networks [Zeiler14] Image taken from [Zeiler14]

Slide 116

Slide 116 text

Visualizing and understanding convolutional networks [Zeiler14] Image taken from [Zeiler14]

Slide 117

Slide 117 text

Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen15] Generate images that are unrecognizable to human eyes but are recognised with high confidence by the network

Slide 118

Slide 118 text

Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen15] Image taken from [Nguyen15]

Slide 119

Slide 119 text

Learning to generate chairs with convolutional neural networks [Dosovitskiy15] A network run in reverse: orientation, design, colour, etc. parameters as input; rendered images as output and as training targets

Slide 120

Slide 120 text

Learning to generate chairs with convolutional neural networks [Dosovitskiy15] Image taken from [Dosovitskiy15]

Slide 121

Slide 121 text

A Neural Algorithm of Artistic Style [Gatys15] Take an OxfordNet model [Simonyan14] and extract texture features from one of the convolutional layers, given a target style / painting as input. Use gradient descent to iterate the photo (not the weights) so that its texture features match those of the target image.

Slide 122

Slide 122 text

A Neural Algorithm of Artistic Style [Gatys15] Image taken from [Gatys15]

Slide 123

Slide 123 text

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [Radford15] Train two networks: one is given random parameters to generate an image, the other discriminates between a generated image and one from the training set

Slide 124

Slide 124 text

Generative Adversarial Nets [Radford15] Images of bedrooms generated using a neural net Image taken from [Radford15]

Slide 125

Slide 125 text

Generative Adversarial Nets [Radford15] Image taken from [Radford15]

Slide 126

Slide 126 text

Finishing words

Slide 127

Slide 127 text

Deep learning is a fascinating field with lots going on Very flexible, wide range of techniques and applications

Slide 128

Slide 128 text

Deep neural networks have proved to be highly effective* for computer vision, speech recognition and other areas *like with every other shiny new toy, see the small-print!

Slide 129

Slide 129 text

SMALL-PRINT Sufficient training data required (curse of dimensionality) Dataset augmentation advisable

Slide 130

Slide 130 text

SMALL-PRINT The model will only represent the training examples; it may not (and probably won't) generalise beyond them

Slide 131

Slide 131 text

SMALL-PRINT Choose architecture carefully Use a GPU

Slide 132

Slide 132 text

I hope this has proved to be a good introduction to the topic!

Slide 133

Slide 133 text

Thank you!

Slide 134

Slide 134 text

References

Slide 135

Slide 135 text

[Ciresan12] Ciresan, Meier and Schmidhuber; Multi-column deep neural networks for image classification, Computer vision and Pattern Recognition (CVPR), 2012

Slide 136

Slide 136 text

[Dosovitskiy15] Dosovitskiy, Springenberg and Brox; Learning to generate chairs with convolutional neural networks, arXiv preprint, 2015

Slide 137

Slide 137 text

[Gatys15] Gatys, Ecker and Bethge; A Neural Algorithm of Artistic Style, arXiv:1508.06576, 2015

Slide 138

Slide 138 text

[Glorot10] Glorot, Bengio; Understanding the difficulty of training deep feedforward neural networks, International conference on artificial intelligence and statistics, 2010

Slide 139

Slide 139 text

[Glorot11] Glorot, Bordes, Bengio; Deep Sparse Rectifier Neural Networks, JMLR 2011

Slide 140

Slide 140 text

[He15] He, Zhang, Ren and Sun; Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, arXiv 2015

Slide 141

Slide 141 text

[Hinton12] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov; Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Slide 142

Slide 142 text

[Jones87] Jones, J.P.; Palmer, L.A. (1987). "An evaluation of the two-dimensional gabor filter model of simple receptive fields in cat striate cortex". J. Neurophysiol 58 (6): 1233–1258

Slide 143

Slide 143 text

[Krizhevsky12] Krizhevsky, Sutskever and Hinton; ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012

Slide 144

Slide 144 text

[LeCun95] LeCun, Yann et. al.; Comparison of learning algorithms for handwritten digit recognition, International conference on artificial neural networks, 1995

Slide 145

Slide 145 text

[Nguyen15] Nguyen, Yosinski and Clune; Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images, Computer Vision and Pattern Recognition (CVPR) 2015

Slide 146

Slide 146 text

[Radford15] Radford, Metz, Chintala; Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, arXiv:1511.06434, 2015

Slide 147

Slide 147 text

[Simonyan14] Simonyan and Zisserman; Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014

Slide 148

Slide 148 text

[Zeiler14] Zeiler and Fergus; Visualizing and understanding convolutional networks, Computer Vision - ECCV 2014