Introduction to Deep Learning - Cambridge Python User Group

Deep Learning An Introductory Tutorial G. French Kings College London
& University of East Anglia Image montages from http://www.image-net.org

ImageNet

Image classification dataset

~1,000,000 images ~1,000 classes Ground truths prepared manually through Amazon
Mechanical Turk

ImageNet Top-5 challenge: You score if ground truth class is
one your top 5 predictions

ImageNet in 2012 Best approaches used hand-crafted features (SIFT, HOGs,
Fisher vectors, etc) + classifier Top-5 error rate: ~25%

Then the game changed.

Krizhevsky, Sutskever and Hinton; ImageNet Classification with Deep Convolutional Neural
networks [Krizhevsky12] Top-5 error rate of ~15%

In the last few years, more modern networks have achieved
better results still [Simonyan14, He15] Top-5 error rates of ~5-7%

I hope this talk will give you an idea of
how!

What is a neural network?

Multiple layers Data propagates through layers Transformed by each layer

Neural network Input layer Hidden layer 0 Hidden layer 1
Output layer ⋯ Inputs Outputs ⋯

Single layer of a neural network () Input vector Weighted
connections Bias Activation function / non-linearity Layer activation

= input (M-element vector) = output (N-element vector) = network
weights (NxM matrix) = bias (N-element vector) = activation function; tanh / sigmoid / ReLU = ( + )

Multiple layers Input vector ( + ) Hidden layer 0
activation ( + ) Hidden layer 1 activation ( + ) Final layer activation (output) ( + ) ⋯

In a nutshell: = ( + )

Repeat for each layer: 0 = (0 + 0 )
1 = 1 0 + 1 ⋯ = ( −1 + )

As a classifier Input vector Hidden layer 0 activation Final
layer activation with softmax (output) ⋯ Image pixels = 0.25 = 0.5 = 0.1 = 0.15 Class probabilities

Training a neural network

Learn values for parameters; and (for each layer) Use back-propagation

Initialise weights randomly (more on this later) Initialise biases to
0

Iteratively modify parameters with gradient descent

Take example from training set, evaluate network given the training
input; =

The cost (sometimes called loss), is a measure of the
difference between network output and ground truth output

Compute the derivative of cost w.r.t. params ( and )

Update parameters: 0 ′ = 0 − 0 0 ′
= 0 − 0 γ = learning rate

In practice this is done on a mini-batch of examples
(e.g. 128) in parallel per pass Compute cost for each example, then average. Compute derivative of average cost w.r.t. params.

Using mini-batches works well when using a GPU since computations
can be parallelised

Cost function

Regression Final layer: no activation function / identity. Cost: Sum
of squared differences

Classification Just like logistic regression

Final layer: softmax as activation function ; output vector of
class probabilities Cost: negative-log-likelihood / categorical cross-entropy

Fully connected neural networks

Simplest model Each unit in each layer is connected too
all units in previous layer All we have considered so far

How well does this perform on image classification?

MNIST hand-written digit dataset 28x28 images, 10 classes 60K training
examples, 10K validation, 10K test Examples from MNIST

Network: 1 hidden layer of 64 units after 300 iterations
over training set: 2.85% validation error Hidden layer weights visualised as 28x28 images Input Hidden Output 784 (28x28 images) 64 10

Network: 2 hidden layers, both 256 units after 300 iterations
over training set: 1.83% validation error Input Hidden 784 (28x28 images) 256 Hidden Output 256 10

MNIST is quite a special case Digits nicely centred within
image Scaled to approx. same size

The fully connected networks so far have a weakness: No
translation invariance; learned features are position dependent

For more general imagery: requires a training set large enough
to see all features in all possible positions… Requires network with enough units to represent this…

Convolutional networks

Convolution Slide a convolution kernel over an image Multiply image
pixels by kernel pixels and sum

Convolution Convolutions are often used for feature detection

A brief detour…

Gabor filters ∗

Used for texture classification Bares similarity to low levels of
cat visual system [Jones87]

Back on track to… Convolutional networks

Recap: FC (fully-connected) layer () Input vector Weighted connections Bias
Activation function / non-linearity Layer activation

Convolutional layer Each unit only connected to units in its
neighbourhood

Convolutional layer Weights are shared Red weights have same value
As do greens… And yellows

The values of the weights form a convolution kernel For
practical computer vision, more an one kernel must be used to extract a variety of features

Convolutional layer Different weight-kernels: Output is vector/image with multiple channels

Still = ( + ) As convolution can be expressed
as multiplication by weight matrix

Note In subsequent layers, each kernel connects to pixels in
ALL channels in previous layer

Max-pooling ‘layer’ [Ciresan12] Take maximum value from each (, )
pooling region Down-samples image by factor Operates on channels independently

These are the models that have been getting excellent ImageNet
results

How about an example?

A Simplified LeNet [LeCun95] for MNIST digits

Simplified LeNet for MNIST digits 28 28 24 24 Input
Output 1 20 Conv: 20 5x5 kernels Maxpool 2x2 12 8 8 20 50 4 4 50 Conv: 50 5x5 kernels Maxpool 2x2 256 10 Fully connected (flatten and) fully connected 12

after 300 iterations over training set: 99.21% validation accuracy Model
Error FC64 2.85% FC256--FC256 1.83% 20C5--MP2--50C5--MP2--FC256 0.79%

What about the learned kernels? Image taken from paper [Krizhevsky12]
Gabor filters

Neural networks – recent developments

GOOD NEWS Training neural networks is more practical now

More processing power Less of a black art

IMPROVEMENTS Processing power ReLU activation function Batch normalisation DropOut

(most important) IMPROVEMENT Processing power

Image processing requires large networks with perhaps millions of parameters
Lots of training examples need to train Easily results in billions or even trillions of FLOPS

Neural networks are ‘embarrassingly parallelisable’ therefore ideally suited to GPUs
Use GPUs for all but the smallest of networks

As of now, nVidia is the most popular make of
GPU. Cheaper gaming cards perfectly adequate Only use Tesla in production

IMPROVEMENT New popular activation function: ReLU - Rectified Linear Unit

ReLU - Rectified Linear Unit = max(, 0)

ReLU works better than tanh / sigmoid in many cases
I don’t really understand the reasons (to be honest! ) See [Glorot11] [Glorot10]; written by people who do!

IMPROVEMENT Batch normalisation

PROBLEM: Magnitudes of activations can vary considerably, layer to layer
If each layer ‘multiplies’ magnitude by some factor, they explode or vanish

SOLUTIONS Initally: careful weight initialisation Now: replaced by batch normalisation

y = ( + ) Assume: σ = 1 σ
depends on distribution of W; normal, uniform, std-dev, etc

In the past: initialise W, rules of thumb often used,
e.g. normal distribution with = 0.01 Problems arise when training deep networks with > 8 layers [Simonyan14], [He15]

Previously: carefully choose distribution of W so that ≈ ()

E.g. approach by He et. Al. [He15]: = 1 Where
is the fan-in; the number of incoming connections, and is the gain

New approach: BATCH NORMALISATION [Ioffe15] Keep distribution of activations sane
as part of the network architecture

BATCH NORMALISATION For each mini-batch of examples during training Normalise
using mean and standard deviation

Layer equation becomes: (scale, not needed if is ReLU) and
(bias) are learned parameters σ = () μ = () = ( − + )

Note: For a fully connected layer, each unit/output should have
its own mean and std-dev; aggregate across examples in the mini-batch

For a convolutional layer, each channel should have its own
mean and std-dev; aggregate across examples in the mini- batch and across image rows and columns

During training, keep a running exponential moving average of mean
and std-dev During test time, use the averaged mean and std-dev

Reducing Over-fitting

Over-fitting always a problem in ML Model over-fits when it
is very good at matching samples in training set but not those in validation/test

Neural networks are very prone to over- fitting

Two techniques DropOut (quite a lot of people use batch
normalisation instead) Dataset augmentation

DropOut [Hinton12] During training, randomly choose units to ‘drop out’
by setting their output to 0, with probability , usually around 0.5 (compensate by multiplying values by 1 1− )

During test/predict: Run as normal (no DropOut)

Normally applied to later, fully connected layers

Dropout OFF Input layer Hidden layer 0 Output layer

Dropout ON (1) Input layer Hidden layer 0 Output layer

Dropout ON (2) Input layer Hidden layer 0 Output layer

Sampling a different subset of the network for each training
example Kind of like model averaging with only one model 

What effect does it have? (approx. replication of [Hinton12])

Dataset: MNIST Digits Network: Single hidden layer, fully connected, 256
units, = 0.4 5000 iterations over training set

DropOut OFF Train loss: 0.0003 Validation loss: 0.094 Validation error:
1.9% DropOut ON Train loss: 0.0034 Validation loss: 0.077 Validation error: 1.56%

Dataset augmentation Take existing dataset and expand by adding transformed
version of existing samples

Dataset augmentation for images [Krizhevsky12] Cropping and translation Scaling Rotation
Lighting/colour modifications

Neural network software

Two categories of software: Neural network toolkit (normally faster) Expression
compilers

Neural network toolkit Most popular is CAFFE (from Berkeley) http://caffe.berkeleyvision.org/

Specify network architecture in terms of layers

Layers usually described using custom config/language CAFFE uses Google Protocol
Buffers for base syntax (YAML/JSON like) and for data (since GPB is binary)

CAFFE can be used from: command line MATLAB Python

Expression compilers Theano (from University of Montreal) Torch 7 Tensorflow
(more recent)

Describe network architecture in terms of mathematical expressions Expressions compiled
to CUDA code and executed on GPU

Theano, Torch 7 and Tensorflow provide automatic symbolic differentiation: Big
win; less bugs and less manual work

In comparison Advantages Disadvantages Network toolkit (e.g. CAFFE) • CAFFE
is fast • Most likely easer to get going • Bindings for MATLAB, Python, command line access • Less flexible; harder to extend (need to learn architecture, manual differentiation) Expression compiler (e.g. Theano) • Extensible; new layer type or cost function: no problem • See what goes on under the hood • Being adventurous is easier! • Slower (Theano) • Debugging can be tricky (compiled expressions are a step away from your code) • Typically only work with one language (e.g. Python for Theano)

Resources and tutorials to get you going

http://cs.stanford.edu/people/karpathy /convnetjs/ Neural networks running in your web browser Excellent
demos that shows how they work and what they can do

https://github.com/Newmu/Theano- Tutorials Very simple Python code examples proceeding through logistic
regression, fully connected and convolutional models. Shows complete mathematical expressions and training procedures

http://deeplearning.net/tutorial/ More Theano tutorials More complete; explains mathematics behind them
Code is longer than previous examples

CAFFE: http://caffe.berkeleyvision.org/ Plenty of documentation and tutorials

Our work

CCTV for Fisheries Project involving Dr. M. Fisher, Dr. M.
Mackiewicz Funded by Marine Scotland

Automatically quantify the amount of fish discarded by fishing trawlers
(preferably by species) Process surveillance footage of discard belt

STEPS: Segment fish from background Separate fish from one another
Classify individual fish (TODO) Measure individual fish to estimate mass (TODO)

Approach Use 4-Fields approach [Ganin14]

-Fields Use network to transform input image patch into 16-element
codeword vector Codeword vector used to look up a target (foreground or edge) patch that most closely matches, in dictionary of words

Use 4-Fields to transform input image to foreground map Use
4-Fields to transform input image to edge map Use Watershed algorithm to separate image into regions

Videos

Has problems with shadows Need more training data Much to
do yet!

Some cool work in the field that might be of
interest

Visualizing and understanding convolutional networks [Zeiler14] Visualisations of responses of
layers to images

Visualizing and understanding convolutional networks [Zeiler14] Image taken from [Zeiler14]

Deep Neural Networks are Easily Fooled: High Confidence Predictions in
Recognizable Images [Nguyen15] Generate images that are unrecognizable to human eyes but are recognized by the network

Deep Neural Networks are Easily Fooled: High Confidence Predictions in
Recognizable Images [Nguyen15] Image taken from [Nguyen15]

Learning to generate chairs with convolutional neural networks [Dosovitskiy15] Network
in reverse; orientation, design colour, etc parameters as input, rendered images as output training images

Learning to generate chairs with convolutional neural networks [Dosovitskiy15] Image
taken from [Dosovitskiy15]

A Neural Algorithm of Artistic Style [Gatys15] Take an OxfordNet
model [Simonyan14] and extract texture features from one of the convolutional layers, given a target style / painting as input Use gradient descent to iterate photo – not weights – so that its texture features match those of the target image.

A Neural Algorithm of Artistic Style [Gatys15] Image taken from
[Gatys15]

Unsupervised representation Learning with Deep Convolutional Generative Adversarial Nets [Radford
15] Train two networks; one given random parameters to generate an image, another to discriminate between a generated image and one from the training set

Generative Adversarial Nets [Radford15] Images of bedrooms generated using neural
net Image taken from [Radford15]

Generative Adversarial Nets [Radford15] Image taken from [Radford15]

Finishing words

Deep learning is a fascinating field with lots going on
Very flexible, wide range of techniques and applications

Deep neural networks have proved to be highly effective* for
computer vision, speech recognition and other areas *like with every other shiny new toy, see the small-print!

SMALL-PRINT Sufficient training data required (curse of dimensionality) Dataset augmentation
advisable

SMALL-PRINT Model will only represent training examples; may not probably
wont generalise

SMALL-PRINT Choose architecture carefully Use a GPU

I hope this has proved to be a good introduction
to the topic!

Thank you!

References

[Ciresan12] Ciresan, Meier and Schmidhuber; Multi-column deep neural networks for
image classification, Computer vision and Pattern Recognition (CVPR), 2012

[Dosovitskiy15] Dosovitskiy, Springenberg and Box; Learning to generate chairs with
convolutional neural networks, arXiv preprint, 2015

[Ganin14] Ganin, Lempitsky; 4-Fields: Neural Network Nearest Neighbor Fields for
Image Transforms, 12th Asian Conference on Computer Vision, 2014

[Gatys15] Gatys, Echer, Bethge; A Neural Algorithm of Artistic Style,
arXiv: 1508.06576, 2015

[Glorot10] Glorot, Bengio; Understanding the difficulty of training deep feedforward
neural networks, International conference on artificial intelligence and statistics, 2010

[Glorot11] Glorot, Bordes, Bengio; Deep Sparse Rectifier Neural Networks, JMLR
2011

[He15] He, Zhang, Ren and Sun; Delving Deep into Rectifiers:
Surpassing Human- Level Performance on ImageNet Classification, arXiv 2015

[Hinton12] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and
R. R. Salakhutdinov; Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[Ioffe15] Ioffe, S.; Szegedy C.. (2015). “Batch Normalization: Accelerating Deep
Network Training by Reducing Internal Covariate Shift". ICML 2015, arXiv:1502.03167

[Jones87] Jones, J.P.; Palmer, L.A. (1987). "An evaluation of the
two-dimensional gabor filter model of simple receptive fields in cat striate cortex". J. Neurophysiol 58 (6): 1233–1258

[Krizhevsky12] Krizhevsky, Sutskever and Hinton; ImageNet Classification with Deep Convolutional
Neural networks, NIPS 2012

[LeCun95] LeCun, Yann et. al.; Comparison of learning algorithms for
handwritten digit recognition, International conference on artificial neural networks, 1995

[Nguyen15] Nguyen, Yosinski and Clune; Deep Neural Networks are Easily
Fooled: High Confidence Predictions for Unrecognizable Images, Computer Vision and Pattern Recognition (CVPR) 2015

[Radford15] Radford, Metz, Chintala; Unsupervised Representation Learning with Deep Convolutional
Generative Adversarial Networks, arXiv:1511.06434, 2015

[Simonyan14] K. Simonyan and Zisserman; Very deep convolutional networks for
large-scale image recognition, arXiv:1409.1556, 2014

[Zeiler14] Zeiler and Fergus; Visualizing and understanding convolutional networks, Computer
Vision - ECCV 2014

Introduction to Deep Learning - Cambridge Pytho...

Introduction to Deep Learning - Cambridge Python User Group

More Decks by Britefury

Other Decks in Programming

Featured

Transcript