Deep Learning Tutorial - advanced techniques - PyData London 2016

The main presentation slides from my tutorial 'Deep Learning - Advanced Techniques' that I gave on 6th/May/2016.

Britefury

May 06, 2016

Transcript

  1. Deep Learning Tutorial Advanced Techniques G. French King's College London

    University of East Anglia Image montages from http://www.image-net.org
  2. Theano What it is and how it works Review: Multi-layer

    perceptron The basic model Convolutional networks Neural networks for computer vision
  3. Lasagne and VGG-19 Explain Lasagne and use it with a

    convolutional network trained by the VGG group at Oxford University Deep learning tricks of the trade tips to save you some time Active learning less training data by careful choice
  4. Amazon AMI (Use GPU machine) AMI ID: ami-5f789e32 AMI Name:

    PyData London 2016 deep learning adv tutorial - Ubuntu-14.04 Anaconda2-4.0.0 Cuda-7.5 cuDNN-5 Theano-0.8 Lasagne Fuel
  5. In comparison: Network toolkit (e.g. CAFFE) Advantages: • CAFFE

    is fast • Most likely easier to get going • Bindings for MATLAB, Python, command line access. Disadvantages: • Less flexible; harder to extend (need to learn architecture, manual differentiation). Expression compiler (e.g. Theano) Advantages: • Extensible; new layer type or cost function: no problem • See what goes on under the hood • Being adventurous is easier! Disadvantages: • Slower (Theano) • Debugging can be tricky (compiled expressions are a step away from your code) • Typically only works with one language (e.g. Python for Theano)
  6. x = input (M-element vector); y = output (N-element vector); W = weights

    parameter (N x M matrix); b = bias parameter (N-element vector); f = activation function, normally ReLU but can be tanh or sigmoid. y = f(Wx + b)
  7. Repeat for each layer: input vector x; hidden

    layer 0 activation y_0 = f(W_0 x + b_0); hidden layer 1 activation y_1 = f(W_1 y_0 + b_1); ⋯ ; final layer activation (output) y_L = f(W_L y_{L-1} + b_L)
  8. Update parameters: W' = W - γ ∂L/∂W and

    b' = b - γ ∂L/∂b, where L is the loss and γ = learning rate
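    As a minimal sketch of the model and update rule above, assuming Theano (introduced later in the tutorial) and hypothetical layer sizes; for a batch of row vectors the weight matrix is stored transposed relative to the slide notation:

        import numpy as np
        import theano
        import theano.tensor as T

        n_in, n_out = 784, 256                      # hypothetical layer sizes
        rng = np.random.RandomState(0)
        W = theano.shared(rng.normal(0, 0.01, (n_in, n_out)).astype('float32'))
        b = theano.shared(np.zeros(n_out, dtype='float32'))

        x = T.matrix('x')                           # mini-batch of inputs, one row per sample
        y = T.nnet.relu(T.dot(x, W) + b)            # y = f(Wx + b) with f = ReLU

        loss = y.mean()                             # placeholder loss, just to show the update
        gamma = 0.1                                 # learning rate
        updates = [(p, p - gamma * T.grad(loss, p)) for p in (W, b)]
        step = theano.function([x], loss, updates=updates)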
  9. (Obligatory) MNIST example: 2 hidden layers, both 256 units. After

    300 iterations over the training set: 1.83% validation error. Architecture: input 784 (28x28 images) -> hidden 256 -> hidden 256 -> output 10
  10. The fully connected networks so far have a weakness: No

    translation invariance; learned features are position dependent
  11. For more general imagery: requires a training set large enough

    to see all features in all possible positions… Requires network with enough units to represent this…
  12. Recap: FC (fully-connected) layer, y = f(Wx + b): input vector, weighted connections, bias,

    activation function (non-linearity), layer activation
  13. The values of the weights form a convolution kernel. For

    practical computer vision, more than one kernel must be used to extract a variety of features
  14. Still y = f(Wx + b), as convolution can be expressed

    as multiplication by a weight matrix
  15. Another way of looking at it: a single kernel of

    a convolutional layer (e.g. 5x5) is a bit like…
  16. a fully-connected layer with a 5x5 input image, repeated across the whole

    image; a new ‘fully-connected layer’ for each filter
  17. Max-pooling ‘layer’ [Ciresan12] Take the maximum value from each 2 x

    2 pooling region (p x p in the general case) Down-samples the image by a factor of p Operates on channels independently
  18. Simplified LeNet for MNIST digits: input 1x28x28 ->

    Conv: 20 5x5 kernels -> 20x24x24 -> Maxpool 2x2 -> 20x12x12 -> Conv: 50 5x5 kernels -> 50x8x8 -> Maxpool 2x2 -> 50x4x4 -> (flatten and) fully connected -> 256 -> fully connected -> output 10
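    A sketch of that architecture in Lasagne (the toolkit introduced later); the layer sizes follow the slide, the rest is illustrative:

        import lasagne
        from lasagne.layers import InputLayer, Conv2DLayer, MaxPool2DLayer, DenseLayer

        l = InputLayer((None, 1, 28, 28))                  # 1x28x28 MNIST input
        l = Conv2DLayer(l, num_filters=20, filter_size=5)  # -> 20x24x24
        l = MaxPool2DLayer(l, pool_size=2)                 # -> 20x12x12
        l = Conv2DLayer(l, num_filters=50, filter_size=5)  # -> 50x8x8
        l = MaxPool2DLayer(l, pool_size=2)                 # -> 50x4x4
        l = DenseLayer(l, num_units=256)                   # flatten + fully connected, 256 units
        l = DenseLayer(l, num_units=10, nonlinearity=lasagne.nonlinearities.softmax)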
  19. After 300 iterations over the training set: 99.21% validation accuracy. Model

    (error): FC64 (2.85%); FC256--FC256 (1.83%); 20C5--MP2--50C5--MP2--FC256 (0.79%)
  20. Provides an API for: constructing the layers of a network; getting Theano

    expressions representing output, loss, etc.
  21. Lasagne is quite a thin layer on top of Theano,

    so understanding Theano is helpful. On the plus side, implementing custom layers, loss functions, etc. is quite doable.
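    For example, roughly how a training function can be built from a final layer l; the variable names and the choice of Nesterov momentum as the update rule are illustrative assumptions, not the deck's code:

        import theano
        import theano.tensor as T
        import lasagne

        targets = T.ivector('targets')                       # integer class labels
        prediction = lasagne.layers.get_output(l)            # Theano expression for the output
        loss = lasagne.objectives.categorical_crossentropy(prediction, targets).mean()

        params = lasagne.layers.get_all_params(l, trainable=True)
        updates = lasagne.updates.nesterov_momentum(loss, params,
                                                    learning_rate=0.01, momentum=0.9)
        input_var = lasagne.layers.get_all_layers(l)[0].input_var
        train_fn = theano.function([input_var, targets], loss, updates=updates)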
  22. Layers (input: 3 x 224 x 224 RGB image,

    zero-mean): 1: 64C3; 2: 64C3, MP2; 3: 128C3; 4: 128C3, MP2; 5: 256C3; 6: 256C3; 7: 256C3; 8: 256C3, MP2. Notation: 64C3 = convolutional layer with 64 3x3 filters; MP2 = max-pooling, 2x2
  23. Layers (continued): 9: 512C3; 10: 512C3; 11: 512C3; 12: 512C3,

    MP2; 13: 512C3; 14: 512C3; 15: 512C3; 16: 512C3, MP2; 17: FC4096 (drop 50%); 18: FC4096 (drop 50%); 19: FC1000, soft-max. Notation: FC4096 = fully-connected layer, 4096 channels; drop 50% = 50% drop-out during training
  24. Full layer listing (input: 3 x 224 x 224 RGB image,

    zero-mean): 1: 64C3; 2: 64C3, MP2; 3: 128C3; 4: 128C3, MP2; 5: 256C3; 6: 256C3; 7: 256C3; 8: 256C3, MP2; 9: 512C3; 10: 512C3; 11: 512C3; 12: 512C3, MP2; 13: 512C3; 14: 512C3; 15: 512C3; 16: 512C3, MP2; 17: FC4096 (drop 50%); 18: FC4096 (drop 50%); 19: FC1000, soft-max
  25. These kinds of architectures tend to work well: Small convolution

    kernels (3x3) Interspersed with max-pooling
  26. For lower levels, this involves repeating many of the same

    computations, getting the same result
  27. If we could apply the first convolutional layer across the

    whole image rather than many 224x224 blocks, we could re-use those computations…
  28. Then we could also do this for the rest of

    the convolutional layers further down…
  29. In fact we can use the whole network in a

    convolutional fashion; we just need to convert the fully-connected layers to convolutional layers.
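    A hypothetical sketch of that conversion for one FC layer, assuming it originally followed a 512x7x7 feature map (as with a 224x224 input); fc and incoming stand for existing Lasagne layers and are assumptions:

        import numpy as np
        from lasagne.layers import Conv2DLayer

        fc_W = fc.W.get_value()                        # DenseLayer weights, shape (512*7*7, 4096)
        conv_W = fc_W.T.reshape((4096, 512, 7, 7))     # one 7x7x512 kernel per FC unit
        l = Conv2DLayer(incoming, num_filters=4096, filter_size=7,
                        W=conv_W.astype(np.float32), b=fc.b,
                        flip_filters=False)            # keep the plain dot product (no kernel flip)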
  30. This is a trick used when doing image segmentation, when

    we want to determine which parts of an image belong to which class, at both training time and prediction time
  31. Small mini-batches: maybe around ~8. Good but slower training: a small

    mini-batch results in regularization (due to noise), reaching lower error rates in the end [Goodfellow16]; when using very small mini-batches, you need to compensate with a lower learning rate and more epochs. Slow due to low parallelism: does not use all cores of the GPU. Low memory usage: fewer neuron activations kept in RAM
  32. Large mini-batches: 1000s. Ineffective training: won’t reach the same error

    rate as with smaller batches and may not learn at all. Can be fast due to high parallelism: uses GPU parallelism (there are limits; gains only achievable if there are unused CUDA cores). High memory usage: lots of neuron activations kept around; can run out of RAM on large networks
  33. Happy medium (where you want to be): maybe around 64-256;

    lots of experiments use ~100. Effective training: learns reasonably quickly (in terms of improvement per epoch) and reaches an acceptable error rate or loss. Medium performance: acceptable in many cases. Medium memory usage: fine for modest-sized networks
  34. Increasing mini-batch size will improve performance up to the point

    where all GPU units are in use. Increasing it further will not improve performance; it will reduce accuracy
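    A simple mini-batch iterator sketch in plain numpy; batch_size is the knob discussed above, and the array names are illustrative assumptions:

        import numpy as np

        def iterate_minibatches(X, y, batch_size=128, shuffle=True):
            # Yield (inputs, targets) mini-batches drawn from X and y
            indices = np.arange(len(X))
            if shuffle:
                np.random.shuffle(indices)
            for start in range(0, len(X) - batch_size + 1, batch_size):
                batch = indices[start:start + batch_size]
                yield X[batch], y[batch]

        # e.g. for xb, yb in iterate_minibatches(train_X, train_y, 128): train_fn(xb, yb)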
  35. Caveat When working in a convolutional fashion - like the

    example of using VGG-net to find the peacock – or when doing image segmentation
  36. In such cases, pushing large patches of an image through

    as a single batch along with a correspondingly large output patch re-uses data due to convolutions and results in substantial savings
  37. My experience: Use patches that are as large as possible

    Although it’s a tricky balance with accuracy of the final result
  38. Batch normalization [Ioffe15] is recommended in most cases Speeds up

    training Loss and error drop faster per-epoch
  39. Although epochs take longer (around 2x in my experience) Can

    (ultimately) reach lower error rates Lets you build deeper networks
  40. Standardise activations (zero-mean, unit variance) per-channel between network layers. Solves

    problems caused by exponential growth or shrinkage of layer activations
  41. σ_in = 1, σ_out = 2 σ_in = 2

    Assume that a layer (the grey square in the diagram) produces activations whose std-dev is twice that of the input
  42. σ_0 = 1, σ_1 = 2 σ_0, σ_2 = 2 σ_1, ⋯ When layers

    are stacked together, the std-dev doubles at each layer
  43. The magnitude of activations and therefore gradients either explode or

    vanish (if the layers reduce the magnitude of activations rather than magnify them)
  44. Can be partially addressed with careful weight initialization [He15a]. Batch

    normalization between layers keeps things sane; can train networks with hundreds of layers [He15b].
  45. Lasagne batch normalization inserts itself into a layer before the

    non-linearity, so it’s nice and easy to use: l = lasagne.layers.batch_norm(l)
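    For instance, wrapping each hidden layer when building a network (an illustrative sketch; batch_norm drops the wrapped layer's bias and applies the normalisation before its nonlinearity):

        import lasagne
        from lasagne.layers import InputLayer, Conv2DLayer, DenseLayer, batch_norm

        l = InputLayer((None, 1, 28, 28))
        l = batch_norm(Conv2DLayer(l, num_filters=20, filter_size=5))
        l = batch_norm(DenseLayer(l, num_units=256))
        l = DenseLayer(l, num_units=10, nonlinearity=lasagne.nonlinearities.softmax)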
  46. Standardise input data In case of regression, standardise output data

    too (don’t forget to invert the standardisation of network predictions!)
  47. Standardisation Extract samples (pixels in the case of images) into

    an array Compute distribution and standardise
  48. Either: Zero the mean and scale std-dev to 1, per

    channel (RGB for images): x' = (x - μ) / σ
  49. Or better still: Use PCA whitening (retain all channels –

    we don’t want to reduce dimensionality)
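    A rough numpy sketch of both options, assuming X holds the extracted samples as rows (e.g. N pixels x 3 channels); apply the same transform to data at prediction time:

        import numpy as np

        mean = X.mean(axis=0)
        std = X.std(axis=0)
        X_standardised = (X - mean) / std              # zero mean, unit std-dev per channel

        # PCA whitening: rotate into the eigenbasis of the covariance and
        # rescale every component to unit variance (all channels retained)
        Xc = X - mean
        eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        X_whitened = Xc.dot(eigvecs / np.sqrt(eigvals + 1e-5))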
  50. The previous fish-based examples all used batch normalisation and

    still benefited from data standardisation, so no: batch normalisation does not remove the need for it
  51. Learns to predict a constant value; optimises the constant value for the best

    loss. A constant value is a local minimum that the network won’t get out of (neural networks ‘cheat’ like crazy!)
  52. Saliency maps Determine which parts of an image the network

    is using to make its prediction Tells you what the network is ‘looking at’
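    One common way to build such a map (an illustrative sketch, not necessarily the deck's exact method) is to occlude image regions and measure how much the predicted probability drops; predict_fn is an assumed compiled prediction function returning class probabilities:

        import numpy as np

        def occlusion_saliency(predict_fn, image, target_class, patch=16, stride=8):
            # image: (channels, height, width); larger values = bigger drop when that region is hidden
            _, h, w = image.shape
            base = predict_fn(image[np.newaxis])[0, target_class]
            rows = (h - patch) // stride + 1
            cols = (w - patch) // stride + 1
            saliency = np.zeros((rows, cols))
            for i, y0 in enumerate(range(0, h - patch + 1, stride)):
                for j, x0 in enumerate(range(0, w - patch + 1, stride)):
                    occluded = image.copy()
                    occluded[:, y0:y0 + patch, x0:x0 + patch] = 0.0
                    saliency[i, j] = base - predict_fn(occluded[np.newaxis])[0, target_class]
            return saliency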
  53. Theoretically possible to use a single network, with enough training

    data (where enough is an impractical amount)
  54. Example Identifying right whales, by Felix Lau 2nd place in

    Kaggle competition http://felixlaumon.github.io/2015/01/08/kaggle-right-whale.html
  55. Identifying right whales, by Felix Lau The first naïve solution

    – training a classifier to identify individuals – did not work well
  56. Region-based saliency map revealed that the network had ‘locked on’

    to features in the ocean shape rather than the whales
  57. Lau’s solution: Train a keypoint finder to locate two keypoints

    on the whale’s head to identify its orientation
  58. Active learning by confidence The maximum probability is that of

    the predicted class and its value is the confidence
  59. Active learning by confidence Choose the samples with the lowest

    confidence as the next candidates for labelling
  60. Note: Predicted probabilities from neural nets are often very close

    to 0.0 or 1.0; maybe 1e-6 away. Would be nice if they were ‘smoother’
  61. Start with 500 labelled training samples. Each round, of the

    remaining training samples, choose the 500 with the least confidence and add them to the dataset
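    One round of that selection, as a short sketch; predict_fn and the arrays are assumptions:

        import numpy as np

        probs = predict_fn(unlabelled_X)          # (N, n_classes) predicted probabilities
        confidence = probs.max(axis=1)            # probability of the predicted class
        query = np.argsort(confidence)[:500]      # indices of the 500 least-confident samples
        # label unlabelled_X[query], move them into the training set, retrain, repeat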
  62. [Plot: Active learning on MNIST, random choice vs least-confidence choice;

    prediction error (%) against number of labelled samples, 500 to 5000]
  63. MNIST: Only needs 5k out of 50k samples to reach

    (very nearly) the same accuracy
  64. When training a network, we use gradient descent to iteratively

    modify weights given images and ground truths
  65. Deep Dreams: Take an image to hallucinate from Choose a

    layer, e.g. ‘pool4’ of VGG-19; choice depends on scale and level of features desired
  66. Deep Dreams: Compute gradient of L-norm of layer w.r.t. image

    Use gradient ascent to increase L-norm
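    A rough Theano/Lasagne sketch of that gradient-ascent loop, taking the norm as the squared L2 norm of the layer's activations; net (a dict of layers keyed by name), input_var, initial_image and the step size are all assumptions:

        import numpy as np
        import theano
        import theano.tensor as T
        import lasagne

        layer_out = lasagne.layers.get_output(net['pool4'], deterministic=True)
        score = T.sqr(layer_out).sum()                 # (squared) L2 norm of the chosen layer
        grad = T.grad(score, input_var)                # gradient w.r.t. the input image
        step_fn = theano.function([input_var], [score, grad])

        img = initial_image.copy()                     # image to hallucinate from
        for _ in range(100):
            s, g = step_fn(img)
            img += 0.01 * g / (np.abs(g).mean() + 1e-8)   # normalised gradient ascent step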
  67. [He15a] He, Zhang, Ren and Sun; Delving Deep into Rectifiers:

    Surpassing Human-Level Performance on ImageNet Classification, arXiv 2015
  68. [He15b] He, Kaiming, et al. "Deep Residual Learning for Image

    Recognition." arXiv preprint arXiv:1512.03385 (2015).
  69. [Hinton12] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and

    R. R. Salakhutdinov; Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  70. [Ioffe15] Ioffe, S.; Szegedy C.. (2015). “Batch Normalization: Accelerating Deep

    Network Training by Reducing Internal Covariate Shift". ICML 2015, arXiv:1502.03167
  71. [Jones87] Jones, J.P.; Palmer, L.A. (1987). "An evaluation of the

    two-dimensional Gabor filter model of simple receptive fields in cat striate cortex". J. Neurophysiol 58 (6): 1233–1258
  72. [Lin13] Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in

    network." arXiv preprint arXiv:1312.4400 (2013).
  73. [Nesterov83] Nesterov, Y. A method of solving a convex programming

    problem with convergence rate O(1/sqr(k)). Soviet Mathematics Doklady, 27:372–376 (1983).
  74. [Sutskever13] Sutskever, Ilya, et al. On the importance of initialization

    and momentum in deep learning. Proceedings of the 30th international conference on machine learning (ICML-13). 2013.
  75. [Simonyan14] K. Simonyan and A. Zisserman; Very deep convolutional networks for

    large-scale image recognition, arXiv:1409.1556, 2014
  76. [Wang14] Wang, Dan, and Yi Shang. "A new active labeling

    method for deep learning." Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014.