
Deep Learning workshop - PyCon UK 2016

Britefury
September 18, 2016


These are the slides for a deep learning workshop I gave at PyCon UK 2016


Transcript

  1. Deep Learning Tutorial PyCon UK 2016 G. French University of

    East Anglia Image montages from http://www.image-net.org
  2. Intro, gradient descent, Theano: Getting started

    What is a neural network? The basic model; the multi-layer perceptron
    Convolutional networks: Neural networks for computer vision
  3. Lasagne and VGG-19: Explain Lasagne and use it with a

    convolutional network trained by the VGG group at Oxford University
    Deep learning tricks of the trade: Tips to save you some time
    When things go wrong: Detecting problems and debugging
  4. Designing a computer vision pipeline: Neural networks aren’t a magic

    bullet; how to use them practically
    Cool work in the field: some awesome work done by others
  5. Slides

    https://speakerdeck.com/britefury
    Intro to Machine Learning: https://speakerdeck.com/britefury/intro-to-machine-learning-for-deep-learning-talk-at-pycon-uk-2016
    Intro to Theano and Lasagne: https://speakerdeck.com/britefury/intro-to-theano-and-lasagne-for-deep-learning
  6. Amazon AMI (use a GPU machine) AMI ID: ami-5f789e32 AMI Name:

    Britefury deep learning - Ubuntu-14.04 Anaconda2-4.0.0 Cuda-7.5 cuDNN-5 Theano-0.8 Lasagne Fuel
  7. ImageNet in 2012 Best approaches used hand-crafted features (SIFT, HOGs,

    Fisher vectors, etc) + classifier Top-5 error rate: ~25%
  8. In the last few years, more modern networks have achieved

    better results still [Simonyan14, He15] Top-5 error rates of ~5-7%
  9. For a very quick Machine Learning intro, see the notebook

    INTRO ML 01 - Machine learning - a very basic introduction
  10. Temperature conversion Simple linear model: y = ax + b, where x and y are the

    temperatures in Fahrenheit and Kelvin respectively
  11. Sample temperatures

    Sample                   Fahrenheit (x)   Kelvin (y)
    Boiling point of He      -452.1           4.22
    Boiling point of N       -320.4           77.36
    Melting point of H2O     32.0             273.20
    Body temperature         98.6             310.50
    Boiling point of H2O     212.0            373.20
  12. First step: initialise parameters; come up with a guess Randomly

    initialise a (scale); initialise b (offset) to 0
  13. With a=1.982141 and b=0

    Sample                   Fahrenheit (x)   Kelvin (y)   Prediction (ax+b)   Squared err (ϵ)
    Boiling point of He      -452.1           4.22         -896.126276         810623.417462
    Boiling point of N       -320.4           77.36        -635.078210         507568.203773
    Melting point of H2O     32.0             273.20       63.428535           44004.067369
    Body temperature         98.6             310.50       195.439175          13238.993532
    Boiling point of H2O     212.0            373.20       420.214047          2210.320605
  14. The error tells us how far away we are from

    the optimal parameter values. If ϵ = 0 then our model is 100% accurate
  15. With a=0.55581 and b=255.484

    Sample                   Fahrenheit (x)   Kelvin (y)   Prediction (ax+b)   Squared err (ϵ)
    Boiling point of He      -452.1           4.22         4.202747            0.000298
    Boiling point of N       -320.4           77.36        77.402958           0.001845
    Melting point of H2O     32.0             273.20       273.270495          0.004969
    Body temperature         98.6             310.50       310.287458          0.045174
    Boiling point of H2O     212.0            373.20       373.316342          0.013535

    True values: a=0.556, b=255.372
  16. NOTE In the example notebook, a very low learning rate

    (γ) and a large number of iterations were required
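
    As a minimal sketch of the gradient-descent fit described above, here is the linear model trained with plain NumPy; the learning rate and iteration count are illustrative values, not the ones from the workshop notebook:

        import numpy as np

        # Fahrenheit -> Kelvin training data from the table above
        x = np.array([-452.1, -320.4, 32.0, 98.6, 212.0])
        y = np.array([4.22, 77.36, 273.20, 310.50, 373.20])

        a, b = np.random.randn(), 0.0   # random scale, zero offset
        lr = 1e-5                       # very low learning rate (illustrative)

        for i in range(500000):         # large number of iterations (illustrative)
            y_pred = a * x + b
            err = y_pred - y
            cost = (err ** 2).mean()          # mean squared error
            grad_a = 2.0 * (err * x).mean()   # d(cost)/da
            grad_b = 2.0 * err.mean()         # d(cost)/db
            a -= lr * grad_a
            b -= lr * grad_b

        print(a, b)   # slowly approaches a=0.556, b=255.372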
  17. Data standardisation Subtract mean; mean will now be 0 Divide

    by std-dev; std-dev will now be 1 Discussed later
  18. In comparison

    Network toolkit (e.g. CAFFE)
    • Advantages: CAFFE is fast; most likely easier to get going; bindings for MATLAB, Python, command line access
    • Disadvantages: less flexible; harder to extend (need to learn the architecture, manual differentiation)

    Expression compiler (e.g. Theano)
    • Advantages: extensible; a new layer type or cost function is no problem; see what goes on under the hood; being adventurous is easier!
    • Disadvantages: slower (Theano); debugging can be tricky (compiled expressions are a step away from your code); typically only works with one language (e.g. Python for Theano)
  19. Neural network image classifier Inputs → Hidden → Hidden → Outputs:

    class probabilities (e.g. 0.003, 0.002, 0.005, 0.9)
  20. Neural network image regressor Inputs → Hidden → Hidden → Outputs:

    real values (e.g. avg. width = 2.1, avg. length = 14.2, …)
  21. Neural network: Inputs → Input layer → Hidden layer 0 → Hidden layer 1

    → ⋯ → Output layer → Outputs
  22. Single layer of a neural network: y = f(Wx + b) Input vector x, weighted

    connections W, bias b, activation function / non-linearity f, layer activation y
  23. x = input (M-element vector), y = output (N-element vector), W = weights

    parameter (NxM matrix), b = bias parameter (N-element vector), f = activation function; normally ReLU but can be tanh or sigmoid. y = f(Wx + b)
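
    A minimal NumPy sketch of one such layer (the sizes are illustrative and the ReLU is written inline):

        import numpy as np

        M, N = 784, 256                     # input and output sizes (illustrative)
        x = np.random.randn(M)              # input (M-element vector)
        W = np.random.randn(N, M) * 0.01    # weights parameter (N x M matrix)
        b = np.zeros(N)                     # bias parameter (N-element vector)

        def relu(z):
            return np.maximum(z, 0.0)       # activation function / non-linearity

        y = relu(W.dot(x) + b)              # layer activation: y = f(Wx + b)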
  24. Repeat for each layer Input vector x → f(W_0 x + b_0) → hidden

    layer 0 activation → f(W_1 x_0 + b_1) → hidden layer 1 activation → ⋯ → f(W_L x_{L-1} + b_L) → final layer activation (output)
  25. In mathematical notation: x_0 = f(W_0 x + b_0), x_1 =

    f(W_1 x_0 + b_1), ⋯, x_L = f(W_L x_{L-1} + b_L)
  26. As a classifier Input vector (image pixels) → hidden layer 0 activation

    → ⋯ → final layer activation (with softmax non-linearity) → class probabilities (e.g. 0.25, 0.5, 0.1, 0.15)
  27. Summary; a neural network is: Built from layers, each of

    which is: a matrix multiplication, then add bias, then apply non-linearity.
  28. Want to see it in action? Check out ConvNetJS by

    Andrej Karpathy Try: http://cs.stanford.edu/people/karpathy/convnetjs/index.html Github: https://github.com/karpathy/convnetjs
  29. Weight initialisation [He15a] provides a good rule of thumb Most

    toolkits such as Lasagne and Keras do this anyway, so there is no need to worry about it
  30. For each example x_train from the training set, evaluate the network prediction

    y_pred given the training input x = x_train Measure cost (error): the difference between y_pred and the ground-truth output y_train
  31. Classification (which of these categories best describes this?) Final layer:

    softmax as non-linearity ; output vector of class probabilities Cost: negative-log-likelihood / categorical cross-entropy
  32. Theano performs symbolic differentiation for you! dCdW = theano.grad(cost, W)

    (other toolkits – such as Torch and Tensorflow – can also do this)
  33. Update parameters: W_i' = W_i − γ ∂C/∂W_i and b_i' =

    b_i − γ ∂C/∂b_i, where γ = learning rate
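
    A minimal Theano sketch tying the two previous slides together (symbolic gradients plus the update rule) for a single softmax layer; the shapes and learning rate are illustrative:

        import numpy as np
        import theano
        import theano.tensor as T

        x = T.matrix('x')   # mini-batch of inputs
        t = T.matrix('t')   # one-hot ground-truth outputs

        # shared variables hold the parameters
        W = theano.shared(np.random.randn(10, 784).astype('float32') * 0.01, name='W')
        b = theano.shared(np.zeros(10, dtype='float32'), name='b')

        y = T.nnet.softmax(T.dot(x, W.T) + b)                  # network prediction
        cost = T.mean(T.nnet.categorical_crossentropy(y, t))   # negative log-likelihood

        # Theano performs symbolic differentiation for us
        dCdW = theano.grad(cost, W)
        dCdb = theano.grad(cost, b)

        gamma = 0.1   # learning rate (illustrative)
        updates = [(W, W - gamma * dCdW),
                   (b, b - gamma * dCdb)]

        train_fn = theano.function([x, t], cost, updates=updates)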
  34. Randomly split the training set into mini-batches of ~100 samples.

    Train on a mini-batch in a single step. The mini-batch cost is the mean of the costs of the samples in the mini-batch.
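
    A sketch of the random mini-batch split in plain Python/NumPy (the batch size, array names and the train_fn from the earlier sketch are illustrative):

        import numpy as np

        def iterate_minibatches(X, y, batch_size=100):
            # randomly split the training set into mini-batches of ~100 samples
            order = np.random.permutation(len(X))
            for start in range(0, len(X), batch_size):
                batch = order[start:start + batch_size]
                yield X[batch], y[batch]

        # one epoch: train on each mini-batch in a single step
        # for X_batch, y_batch in iterate_minibatches(X_train, y_train):
        #     batch_cost = train_fn(X_batch, y_batch)   # mean cost over the mini-batch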
  35. Training on mini-batches means that ~100 samples are processed in

    parallel – very good for running GPUs that do lots of operations in parallel
  36. Training on (enough mini-batches to cover) all examples in the

    training set is called an epoch Run multiple epochs (often 200-300)
  37. Summary; train a neural network: Take mini-batch of training samples

    Evaluate (run/execute) the network Measure the average error/cost across mini- batch Use gradient descent to modify parameters to reduce cost REPEAT ABOVE UNTIL DONE
  38. (Obligatory) MNIST example: 2 hidden layers, both 256 units after

    300 iterations over training set: 1.83% validation error Architecture: input 784 (28x28 images) → hidden 256 → hidden 256 → output 10
  39. Each image visualises the weights connecting pixels to a specific

    unit in the first hidden layer Note the stroke features detected by the various units
  40. The fully connected networks so far have a weakness: No

    translation invariance; learned features are position dependent
  41. For more general imagery: requires a training set large enough

    to see all features in all possible positions… Requires network with enough units to represent this…
  42. Multiply image pixels by filter weights and sum Take a region

    of pixels from the image, multiply element-wise by the filter weights and sum to give one result value Do this for all possible positions in the image
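
    A minimal NumPy sketch of this multiply-and-sum over all positions (single channel, 'valid' positions only, written as cross-correlation for clarity):

        import numpy as np

        def conv2d_valid(image, filt):
            fh, fw = filt.shape
            oh, ow = image.shape[0] - fh + 1, image.shape[1] - fw + 1
            out = np.zeros((oh, ow))
            for i in range(oh):
                for j in range(ow):
                    region = image[i:i + fh, j:j + fw]   # region of pixels from image
                    out[i, j] = (region * filt).sum()    # multiply and sum
            return out

        # e.g. a 3x3 vertical-edge filter applied to a random 'image'
        image = np.random.rand(28, 28)
        filt = np.array([[1.0, 0.0, -1.0]] * 3)
        response = conv2d_valid(image, filt)   # 26x26 map of filter responses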
  43. An output pixel shows the strength of the filter response for

    the corresponding region of the input
  44. Convolution detects features in a position independent manner -- Convolutional

    neural networks learn position independent filters (feature detectors)
  45. Recap: FC (fully-connected) layer: y = f(Wx + b) Input vector x, weighted

    connections W, bias b, activation function (non-linearity) f, layer activation y
  46. The values of the weights form a filter For practical

    computer vision, more than one filter must be used to extract a variety of features
  47. Still y = f(Wx + b), as convolution can be expressed

    as multiplication by a weight matrix
  48. Another way of looking at it: A single filter of

    an e.g. 5x5 convolutional layer is a bit like…
  49. a fully-connected layer with a 5x5 input image, repeated across the whole

    image, with a new ‘fully-connected layer’ for each filter
  50. Max-pooling ‘layer’ [Ciresan12] Take the maximum value from each 2 x

    2 pooling region (p x p in the general case) Down-samples the image by a factor of p Operates on channels independently
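
    A minimal NumPy sketch of 2x2 max-pooling on a single channel (the input size is illustrative):

        import numpy as np

        def max_pool_2x2(image):
            # take the maximum value from each non-overlapping 2x2 pooling region,
            # down-sampling the image by a factor of 2
            h, w = image.shape
            blocks = image[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
            return blocks.max(axis=(1, 3))

        pooled = max_pool_2x2(np.random.rand(24, 24))   # 24x24 -> 12x12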
  51. Down-sampling: striding Can also down-sample using strided convolution; generate output

    for 1 in every n pixels Faster, and can work as well as max-pooling
  52. Simplified LeNet for MNIST digits Input 1x28x28 → Conv: 20 5x5 filters

    → 20x24x24 → Maxpool 2x2 → 20x12x12 → Conv: 50 5x5 filters → 50x8x8 → Maxpool 2x2 → 50x4x4 → (flatten and) fully connected → 256 → fully connected → 10 Output
  53. after 300 iterations over training set: 99.21% validation accuracy

    Model                           Error
    FC64                            2.85%
    FC256--FC256                    1.83%
    20C5--MP2--50C5--MP2--FC256     0.79%
  54. What about the learned kernels? Image taken from paper [Krizhevsky12]

    (ImageNet dataset, not MNIST) Gabor filters
  55. Provides API for: constructing layers of a network getting Theano

    expressions representing output, loss, etc.
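
    A minimal sketch of that API in use, building a small (illustrative) network for 28x28 images and getting Theano expressions for its output and loss:

        import theano
        import theano.tensor as T
        import lasagne

        X = T.tensor4('X')
        y = T.ivector('y')

        # constructing the layers of a network
        l = lasagne.layers.InputLayer((None, 1, 28, 28), input_var=X)
        l = lasagne.layers.DenseLayer(l, num_units=256)
        l = lasagne.layers.DenseLayer(l, num_units=10,
                                      nonlinearity=lasagne.nonlinearities.softmax)

        # Theano expressions representing output and loss
        pred = lasagne.layers.get_output(l)
        loss = lasagne.objectives.categorical_crossentropy(pred, y).mean()

        params = lasagne.layers.get_all_params(l, trainable=True)
        updates = lasagne.updates.sgd(loss, params, learning_rate=0.01)
        train_fn = theano.function([X, y], loss, updates=updates)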
  56. Lasagne is quite a thin layer on top of Theano,

    so understanding Theano is helpful On the plus side, implementing custom layers, loss functions, etc is quite doable.
  57. The VGG group at Oxford University trained VGG-16 and VGG-19

    for ImageNet classification We will use VGG-19; the 19-layer model
  58. Input: 3 x 224 x 224 (RGB image, zero-mean)

    #  Layer
    1  64C3
    2  64C3
       MP2
    3  128C3
    4  128C3
       MP2
    5  256C3
    6  256C3
    7  256C3
    8  256C3
       MP2

    Notation: 64C3 = convolutional layer with 64 3x3 filters; MP2 = max-pooling, 2x2
  59. # Layer (continued)

    9   512C3
    10  512C3
    11  512C3
    12  512C3
        MP2
    13  512C3
    14  512C3
    15  512C3
    16  512C3
        MP2
    17  FC4096 (drop 50%)
    18  FC4096 (drop 50%)
    19  FC1000 soft-max

    Notation: FC4096 = fully-connected layer, 4096 channels; drop 50% = with 50% drop-out during training
  60. The full architecture (the two previous slides side by side): Input 3 x 224 x 224

    (RGB image, zero-mean) → 64C3, 64C3, MP2, 128C3, 128C3, MP2, 256C3, 256C3, 256C3, 256C3, MP2, 512C3, 512C3, 512C3, 512C3, MP2, 512C3, 512C3, 512C3, 512C3, MP2, FC4096 (drop 50%), FC4096 (drop 50%), FC1000 soft-max
  61. These kinds of architectures tend to work well: Small convolution

    filters (3x3) Interspersed with max-pooling
  62. Standardise input data In case of regression, standardise output data

    too (don’t forget to invert the standardisation of network predictions!)
  63. Standardisation Extract samples (pixels in the case of images) into

    an array Compute distribution and standardise
  64. Either: Zero the mean and scale std-dev to 1, per

    channel (RGB for images): x' = (x − μ) / σ
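
    A minimal NumPy sketch of this per-channel standardisation for a batch of RGB images (the (N, 3, H, W) array layout is an assumption):

        import numpy as np

        images = np.random.rand(100, 3, 32, 32)   # N RGB images

        # per-channel mean and std-dev over all samples and pixels
        mean = images.mean(axis=(0, 2, 3), keepdims=True)
        std = images.std(axis=(0, 2, 3), keepdims=True)

        images_std = (images - mean) / std   # x' = (x - mu) / sigma

        # apply the SAME mean/std to validation/test data, and for regression
        # remember to invert the standardisation of network predictions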
  65. Small mini-batches Maybe around ~8

    • Good but slower training Small mini-batches result in regularization (due to noise), reaching lower error rates in the end [Goodfellow16]. When using very small mini-batches, compensate with a lower learning rate and more epochs.
    • Slow due to low parallelism Does not use all cores of the GPU
    • Low memory usage Fewer neuron activations kept in RAM
  66. Large mini-batches 1000s

    • Ineffective training Won’t reach the same error rate as with smaller batches, and may not learn at all
    • Can be fast due to high parallelism Uses GPU parallelism (there are limits; gains only achievable if there are unused CUDA cores)
    • High memory usage Lots of neuron activations kept around; can run out of RAM on large networks
  67. Happy medium (where you want to be) Maybe around 64-256;

    lots of experiments use ~100
    • Effective training Learns reasonably quickly – in terms of improvement per epoch – and reaches an acceptable error rate or loss
    • Medium performance Acceptable in many cases
    • Medium memory usage Fine for modest-sized networks
  68. Increasing mini-batch size will improve performance up to the point

    where all GPU units are in use Increasing it further will not improve performance; it will reduce accuracy
  69. Normally applied after later, fully connected layers

    lyr = lasagne.layers.DenseLayer(lyr, num_units=256)
    lyr = lasagne.layers.DropoutLayer(lyr, p=0.5)
  70. Over-fitting is a well-known problem in machine learning, and it affects neural

    networks particularly A model over-fits when it is very good at correctly predicting samples in the training set but fails to generalise to samples outside it
  71. DropOut [Hinton12] During training, randomly choose units to ‘drop out’

    by setting their output to 0, with probability p, usually around 0.5 (compensate by multiplying the remaining values by 1/(1−p))
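
    A minimal NumPy sketch of this (the 'inverted dropout' form, where the compensation is applied during training):

        import numpy as np

        def dropout(activations, p=0.5):
            # zero each unit's output with probability p and
            # compensate by multiplying the survivors by 1/(1-p)
            keep = np.random.rand(*activations.shape) >= p
            return activations * keep / (1.0 - p)

        h = np.random.rand(100, 256)    # a mini-batch of hidden-layer activations
        h_train = dropout(h, p=0.5)     # during training only; at test time use h unchanged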
  72. Turning on a different subset of units for each sample:

    causes units to learn more robust features that cannot rely on the presence of other specific features to cover for flaws
  73. Lasagne batch normalization inserts itself into a layer before the

    non-linearity, so it’s nice and easy to use:
    l = lasagne.layers.batch_norm(l)
  74. Batch normalization [Ioffe15] is recommended in most cases Lets you

    build deeper networks Speeds up training; loss and error drop faster per-epoch
  75. Standardise activations (zero-mean, unit variance) per-channel between network layers Solves

    problems caused by exponential growth or shrinkage of layer activations
  76. σ_0 = 1, σ_1 = 2, σ_{i+1} = 2 σ_i

    Assume that a layer – the grey square in the diagram – produces activations whose std-dev is twice that of its input:
  77. When such layers are stacked together: σ_0 = 1, σ_n =

    2^n σ_0 = 2^n ⋯
  78. The magnitude of activations and therefore gradients either explode or

    vanish (if the layers reduce the magnitude of activations rather than magnify them)
  79. Active learning Predict which un-labelled samples are hardest to classify;

    where labels/ground truth would be most helpful
  80. Active learning by confidence The maximum probability is that of

    the predicted class and its value is the confidence
  81. Active learning by confidence Choose the samples with the lowest

    confidence as the next candidates for labelling
  82. Start with 500 labelled training samples Each round, of the

    remaining training samples, choose the 500 with the least confidence Add to dataset
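
    A sketch of the least-confidence selection step in NumPy; predict_proba and unlabelled_pool are hypothetical placeholders for the trained network's class-probability output and the remaining training samples:

        import numpy as np

        def least_confident_indices(probs, n=500):
            # probs: (num_samples, num_classes) predicted class probabilities
            confidence = probs.max(axis=1)      # probability of the predicted class
            return np.argsort(confidence)[:n]   # the n least confident samples

        # each round:
        # probs = predict_proba(unlabelled_pool)
        # chosen = least_confident_indices(probs, n=500)
        # ...label unlabelled_pool[chosen] and add them to the training set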
  83. Active learning: MNIST Random choice vs least-confidence choice [Plot:

    prediction error (%) against # labelled samples (500 to 5000), for random order vs confidence-based selection]
  84. MNIST: Only needs 5k out of 50k samples to reach

    (very nearly) the same accuracy
  85. Learns to predict constant value; optimizes constant value for best

    loss A constant value is a local minimum that the network won’t get out of (neural networks ‘cheat’ like crazy!)
  86. Saliency maps Determine which parts of an image the network

    is using to make its prediction Tells you what the network is ‘looking at’
  87. Theoretically possible to use a single network, with enough training

    data (where enough is an impractical amount)
  88. Example Identifying right whales, by Felix Lau 2nd place in

    Kaggle competition http://felixlaumon.github.io/2015/01/0 8/kaggle-right-whale.html
  89. Identifying right whales, by Felix Lau The first naïve solution

    – training a classifier to identify individuals – did not work well
  90. Region-based saliency map revealed that the network had ‘locked on’

    to features in the ocean shape rather than the whales
  91. Lau’s solution: Train a keypoint finder to locate two keypoints

    on the whale’s head to identify its orientation
  92. Deep Neural Networks are Easily Fooled: High Confidence Predictions for

    Unrecognizable Images [Nguyen15] Generate images that are unrecognizable to human eyes but are recognized by the network
  93. Deep Neural Networks are Easily Fooled: High Confidence Predictions for

    Unrecognizable Images [Nguyen15] Image taken from [Nguyen15]
  94. Learning to generate chairs with convolutional neural networks [Dosovitskiy15] Network

    in reverse: orientation, design, colour, etc. parameters as input; rendered images as output [Figure: training images]
  95. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [Radford15]

    Train two networks: one given random parameters to generate an image, the other to discriminate between a generated image and one from the training set
  96. A Neural Algorithm of Artistic Style [Gatys15] Take an OxfordNet

    model [Simonyan14] and extract texture features from one of the convolutional layers, given a target style / painting as input Use gradient descent to iteratively modify the photo – not the weights – so that its texture features match those of the target image.
  97. [He15a] He, Zhang, Ren and Sun; Delving Deep into Rectifiers:

    Surpassing Human-Level Performance on ImageNet Classification, arXiv 2015
  98. [He15b] He, Kaiming, et al. "Deep Residual Learning for Image

    Recognition." arXiv preprint arXiv:1512.03385 (2015).
  99. [Hinton12] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and

    R. R. Salakhutdinov; Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  100. [Ioffe15] Ioffe, S.; Szegedy C.. (2015). “Batch Normalization: Accelerating Deep

    Network Training by Reducing Internal Covariate Shift". ICML 2015, arXiv:1502.03167
  101. [Simonyan14] K. Simonyan and A. Zisserman; Very deep convolutional networks for

    large-scale image recognition, arXiv:1409.1556, 2014
  102. [Wang14] Wang, Dan, and Yi Shang. "A new active labeling

    method for deep learning." Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014.