Deep Learning Tutorial
Advanced Techniques
G. French
King’s College London
University of East Anglia
Image montages from http://www.image-net.org
Slide 2
Slide 2 text
Focus
Slide 3
Slide 3 text
Image processing
Using Theano [1] and Lasagne [2]
[1] http://deeplearning.net/software/theano/
[2] https://github.com/Lasagne/Lasagne
Slide 4
Slide 4 text
What we’ll cover
Slide 5
Slide 5 text
Theano
What it is and how it works
Review: Multi-layer perceptron
The basic model
Convolutional networks
Neural networks for computer vision
Slide 6
Slide 6 text
Lasagne and VGG-19
Explain Lasagne and use it with a convolutional network trained
by the VGG group at Oxford University
Deep learning tricks of the trade
tips to save you some time
Active learning
less training data by careful choice
Slide 7
Slide 7 text
Tutorial materials
Slide 8
Slide 8 text
Github Repo:
https://github.com/Britefury/deep-learning-tutorial-pydata2016
The notebooks are viewable on Github
Slide 9
Slide 9 text
Intro to Theano and Lasagne slides:
https://speakerdeck.com/britefury
https://speakerdeck.com/britefury/intro-to-theano-and-lasagne-for-deep-learning
Slide 10
Slide 10 text
Amazon AMI (Use GPU machine)
AMI ID: ami-5f789e32
AMI Name:
PyData London 2016 deep learning adv tutorial -
Ubuntu-14.04 Anaconda2-4.0.0 Cuda-7.5 cuDNN-5
Theano-0.8 Lasagne Fuel
Slide 11
Slide 11 text
Theano
Slide 12
Slide 12 text
Neural network software comes in two
flavours:
Neural network toolkits
Expression compilers
Slide 13
Slide 13 text
Neural network toolkit
Specify structure of neural network in
terms of layers
Slide 14
Slide 14 text
Expression compilers
Describe network architecture in terms
of mathematical expressions
Slide 15
Slide 15 text
In comparison

Network toolkit (e.g. CAFFE)
Advantages:
• CAFFE is fast
• Most likely easier to get going
• Bindings for MATLAB, Python, command line access
Disadvantages:
• Less flexible; harder to extend (need to learn the architecture, manual differentiation)

Expression compiler (e.g. Theano)
Advantages:
• Extensible; new layer type or cost function: no problem
• See what goes on under the hood
• Being adventurous is easier!
Disadvantages:
• Slower (Theano)
• Debugging can be tricky (compiled expressions are a step away from your code)
• Typically only work with one language (e.g. Python for Theano)
Slide 16
Slide 16 text
Theano
An expression compiler
Slide 17
Slide 17 text
Write numpy style expressions
Compiles to either C (CPU) or CUDA
(nVidia GPU)
Slide 18
Slide 18 text
Notebook: Theano basics
Expressions
Modify shared variables
Variables and functions
Gradient and updates
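A minimal sketch of those four ideas (the variable names here are my own, not necessarily the notebook’s):

import numpy as np
import theano
import theano.tensor as T

x = T.vector('x')                                                      # symbolic variable
w = theano.shared(np.ones(3, dtype=theano.config.floatX), name='w')   # shared (stateful) variable
cost = T.dot(w, x)                                                     # a symbolic expression
g = T.grad(cost, w)                                                    # symbolic gradient of cost w.r.t. w
# Compile a function that evaluates cost and updates w by one gradient-descent step:
f = theano.function([x], cost, updates=[(w, w - 0.1 * g)])
print(f(np.asarray([1.0, 2.0, 3.0], dtype=theano.config.floatX)))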
Slide 19
Slide 19 text
There is much more to Theano
For more information:
http://deeplearning.net/
Slide 20
Slide 20 text
Review: MLP (multi-layer
perceptron)
Slide 21
Slide 21 text
x = input (M-element vector)
y = output (N-element vector)
W = weights parameter (NxM matrix)
b = bias parameter (N-element vector)
f = activation function; normally ReLU but can be tanh or sigmoid
y = f(Wx + b)
(Obligatory) MNIST example:
2 hidden layers, both 256 units
after 300 iterations over training set:
1.83% validation error
Architecture: input 784 (28x28 images) → hidden 256 → hidden 256 → output 10
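As a worked sketch of the formula above in plain numpy (random weights here, purely for illustration):

import numpy as np

M, N = 784, 256                                   # layer sizes from the MNIST example
x = np.random.rand(M).astype('float32')           # e.g. a flattened 28x28 image
W = (np.random.randn(N, M) * 0.01).astype('float32')
b = np.zeros(N, dtype='float32')
y = np.maximum(W.dot(x) + b, 0.0)                 # y = f(Wx + b) with a ReLU activation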
Slide 28
Slide 28 text
MNIST is quite a special case
Digits nicely centred within image
Scaled to approx. same size
Slide 29
Slide 29 text
The fully connected networks so far have a
weakness:
No translation invariance; learned features are
position dependent
Slide 30
Slide 30 text
For more general imagery:
requires a training set large enough to see all features in all possible positions…
requires a network with enough units to represent this…
Slide 31
Slide 31 text
Convolutional networks
Slide 32
Slide 32 text
Convolution
Slide a convolution kernel over an image
Multiply image pixels by kernel pixels
and sum
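A tiny worked example of that multiply-and-sum, using scipy (the kernel is an arbitrary Laplacian-style example of my own):

import numpy as np
from scipy.signal import convolve2d

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0., 1., 0.],
                   [1., -4., 1.],
                   [0., 1., 0.]])
# At each position the kernel is multiplied element-wise with the image patch and summed
# (this kernel is symmetric, so convolution and correlation give the same result):
response = convolve2d(image, kernel, mode='valid')
print(response)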
Slide 33
Slide 33 text
Convolution
Convolutions are often used for feature
detection
Slide 34
Slide 34 text
A brief detour…
Slide 35
Slide 35 text
Gabor filters
Slide 36
Slide 36 text
Back on track to…
Convolutional networks
Slide 37
Slide 37 text
Recap: FC (fully-connected) layer
y = f(Wx + b): input vector x, weighted connections W, bias b, activation function (non-linearity) f, layer activation y
Slide 38
Slide 38 text
Convolutional layer
Each unit only connected to units
in its neighbourhood
Slide 39
Slide 39 text
Convolutional layer
Weights are shared
The weights drawn in matching colours in the diagram share the same value: the same kernel is re-used at every position in the image
Slide 40
Slide 40 text
The values of the weights form a
convolution kernel
For practical computer vision, more than one kernel must be used to extract a variety of features
Slide 41
Slide 41 text
Convolutional layer
Different
weight-kernels:
Output is image
with multiple
channels
Slide 42
Slide 42 text
Still
y = f(Wx + b)
as convolution can be expressed as multiplication by a weight matrix
Slide 43
Slide 43 text
Note
In subsequent layers, each kernel
connects to pixels in ALL channels in
previous layer
Slide 44
Slide 44 text
Another way of looking at it:
A single kernel of, say, a 5x5 convolutional layer is a bit like…
Slide 45
Slide 45 text
a fully-connected layer with a 5x5 input image,
repeated across the whole image,
with a new ‘fully-connected layer’ for each filter
Slide 46
Slide 46 text
Max-pooling ‘layer’ [Ciresan12]
Take the maximum value from each 2x2 pooling region (p x p in the general case)
Down-samples the image by a factor of p
Operates on channels independently
Slide 47
Slide 47 text
Example:
A Simplified LeNet [LeCun95] for MNIST
digits
after 300 iterations over training set:
99.21% validation accuracy
Model                           Error
FC64                            2.85%
FC256--FC256                    1.83%
20C5--MP2--50C5--MP2--FC256     0.79%
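A sketch of how the last model in the table might be written in Lasagne (default non-linearities assumed; the notebook’s version may differ):

import lasagne
from lasagne.layers import InputLayer, Conv2DLayer, MaxPool2DLayer, DenseLayer
from lasagne.nonlinearities import softmax

l = InputLayer(shape=(None, 1, 28, 28))                       # MNIST: 1-channel 28x28 images
l = Conv2DLayer(l, num_filters=20, filter_size=(5, 5))        # 20C5
l = MaxPool2DLayer(l, pool_size=(2, 2))                       # MP2
l = Conv2DLayer(l, num_filters=50, filter_size=(5, 5))        # 50C5
l = MaxPool2DLayer(l, pool_size=(2, 2))                       # MP2
l = DenseLayer(l, num_units=256)                              # FC256
network = DenseLayer(l, num_units=10, nonlinearity=softmax)   # 10-way output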
Slide 50
Slide 50 text
Lasagne and VGG-19
Slide 51
Slide 51 text
Lasagne is a neural network library built
on Theano
Slide 52
Slide 52 text
Provides API for:
constructing layers of a network
getting Theano expressions
representing output, loss, etc.
Slide 53
Slide 53 text
Lasagne is quite a thin layer on top of
Theano, so understanding Theano is
helpful
On the plus side, implementing custom layers, loss functions, etc. is quite doable.
Slide 54
Slide 54 text
Notebook: Lasagne basics
Build network: modified LeNet for
MNIST
Train the network
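The training side looks roughly like this (a sketch, assuming `network` is the final Lasagne layer built on the Theano input variable `input_var`, and that `iterate_minibatches` is a helper you define yourself):

import theano
import theano.tensor as T
import lasagne

target_var = T.ivector('y')
prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(prediction, target_var).mean()
params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.nesterov_momentum(loss, params, learning_rate=0.01, momentum=0.9)
train_fn = theano.function([input_var, target_var], loss, updates=updates)
# for epoch in range(num_epochs):
#     for X_batch, y_batch in iterate_minibatches(X_train, y_train, batchsize=100):
#         train_fn(X_batch, y_batch)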
Slide 55
Slide 55 text
Using a pre-trained VGG-19 Conv-net
Slide 56
Slide 56 text
Use VGG-19, the 19-layer model: a 1000-class image classifier trained on ImageNet [Simonyan14]
Slide 57
Slide 57 text
VGG models are simple but effective
They consist of:
3x3 convolutions
2x2 max pooling
fully-connected layers
These kinds of architectures tend to
work well:
Small convolution kernels (3x3)
Interspersed with max-pooling
Slide 62
Slide 62 text
A good starting point when choosing a network architecture
Slide 63
Slide 63 text
Exercise / Demo
Classifying an image with VGG-19
Slide 64
Slide 64 text
What about using VGG-19 to find a
peacock in a photo?
Slide 65
Slide 65 text
Can extract square patches from image
in sliding window fashion and classify:
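A sketch of the sliding-window idea, assuming a `predict_prob` function that takes an (n, 3, 224, 224) batch (as the VGG-19 classifier does) and returns class probabilities:

import numpy as np

def sliding_window_probs(img, predict_prob, win=224, stride=32):
    # img: (3, H, W); returns one probability vector per window position
    _, H, W = img.shape
    probs = []
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            patch = img[np.newaxis, :, y:y + win, x:x + win]
            probs.append(predict_prob(patch)[0])
    return np.array(probs)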
Slide 66
Slide 66 text
Exercise / Demo
Finding a peacock with VGG-19: part 1
Slide 67
Slide 67 text
Inefficient
Slide 68
Slide 68 text
Using convolutions to your
advantage
Slide 69
Slide 69 text
Adjacent windows share the majority of their pixels
Slide 70
Slide 70 text
For the lower layers, this involves repeating many of the same computations and getting the same results
Slide 71
Slide 71 text
If we could apply the first convolutional layer across the whole image rather than to many 224x224 blocks, we could re-use those computations…
Slide 72
Slide 72 text
Then we could also do this for the rest
of the convolutional layers further
down…
Slide 73
Slide 73 text
In fact we can use the whole network in
a convolutional fashion; we just need to
convert the fully-connected layers to
convolutional layers.
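A sketch of the conversion for a VGG-style fc6 layer (the shapes and the stand-in weight arrays below are illustrative; in practice you would reshape the pre-trained weights, and the notebook’s own code may differ):

import numpy as np
from lasagne.layers import InputLayer, Conv2DLayer

# Stand-ins for the last pooling layer and the trained fc6 weights (512-channel 7x7 input):
prev_layer = InputLayer(shape=(None, 512, None, None))
fc6_W = np.random.randn(512 * 7 * 7, 4096).astype('float32')
fc6_b = np.zeros(4096, dtype='float32')

# The dense layer becomes a 7x7 convolution with 4096 filters; its weights are the dense
# weight matrix reshaped. flip_filters=False so it computes the same dot product the
# fully-connected layer did.
conv6 = Conv2DLayer(prev_layer, num_filters=4096, filter_size=(7, 7),
                    W=fc6_W.T.reshape((4096, 512, 7, 7)), b=fc6_b,
                    flip_filters=False)
# fc7 and fc8 become 1x1 convolutions in the same way.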
Slide 74
Slide 74 text
Exercise / Demo
Finding a peacock with VGG-19: part 2
Slide 75
Slide 75 text
This is a trick used when doing image
segmentation, when we want to
determine which parts of an image
belong to which class
At both training time and prediction
time
Slide 76
Slide 76 text
Deep learning tricks of the trade
Slide 77
Slide 77 text
Choosing: mini-batch size
Slide 78
Slide 78 text
Small mini-batches
Maybe around ~8
Good but slower training
Small mini-batches result in regularization (due to noise), reaching lower error rates in the end [Goodfellow16]. When using very small mini-batches, you need to compensate with a lower learning rate and more epochs.
Slow due to low parallelism
Does not use all cores of GPU
Low memory usage
Fewer neuron activations kept in RAM
Slide 79
Slide 79 text
Large mini-batches
1000s
Ineffective training
Won’t reach the same error rate as with smaller batches and may not
learn at all.
Can be fast due to high parallelism
Uses GPU parallelism (there are limits; gains only achievable if there are
unused CUDA cores)
High memory usage
Lots of neuron activations kept around; can run out of RAM on large
networks
Slide 80
Slide 80 text
Happy medium (where you want to be)
Maybe around 64-256, lots of experiments use ~100
Effective training
Learns reasonably quickly – in terms of improvement per epoch – and
reaches acceptable error rate or loss
Medium performance
Acceptable in many cases
Medium memory usage
Fine for modest sized networks
Slide 81
Slide 81 text
~100 seems to work well; gets good
results
Slide 82
Slide 82 text
Increasing mini-batch size will improve
performance up to the point where all
GPU units are in use
Increasing it further will not improve
performance; it will reduce accuracy
Slide 83
Slide 83 text
Caveat
When working in a convolutional fashion (like the example of using VGG-net to find the peacock), or when doing image segmentation…
Slide 84
Slide 84 text
In such cases, pushing large patches of
an image through as a single batch
along with a correspondingly large
output patch re-uses data due to
convolutions and results in substantial
savings
Slide 85
Slide 85 text
My experience:
Use patches that are as large as possible
Although it’s a tricky balance with
accuracy of the final result
Slide 86
Slide 86 text
Batch normalization
Slide 87
Slide 87 text
Batch normalization [Ioffe15] is
recommended in most cases
Speeds up training
Loss and error drop faster per-epoch
Slide 88
Slide 88 text
Although epochs take longer (around 2x
in my experience)
Can (ultimately) reach lower error rates
Lets you build deeper networks
Slide 89
Slide 89 text
Standardise activations (zero-mean, unit
variance) per-channel between network
layers
Solves problems caused by exponential
growth or shrinkage of layer activations
Slide 90
Slide 90 text
Assume that a layer (the grey square in the diagram) produces activations whose std-dev is twice that of its input: σ_out = 2 σ_in, e.g. σ_in = 1 gives σ_out = 2
Slide 91
Slide 91 text
When layers are stacked together the effect compounds: starting from σ_0 = 1, after n such layers σ_n = 2^n
Slide 92
Slide 92 text
The magnitude of activations and
therefore gradients either explode or
vanish (if the layers reduce the
magnitude of activations rather than
magnify them)
Slide 93
Slide 93 text
Can be partially addressed with careful weight initialization [He15a].
Batch normalization between layers
keeps things sane; can train networks
with hundreds of layers [He15b].
Slide 94
Slide 94 text
After convolutional, fully-connected or
network-in-network* layers, before the
non-linearity
(*) kind of like a 1x1 convolutional layer
[Lin13]
Slide 95
Slide 95 text
Lasagne batch normalization inserts itself into a layer before the non-linearity, so it’s nice and easy to use:
l = lasagne.layers.batch_norm(l)
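For example, wrapping a convolutional layer (a sketch; the filter counts are arbitrary):

from lasagne.layers import Conv2DLayer, batch_norm

# batch_norm() wraps a layer, inserting batch normalisation before its non-linearity
l = batch_norm(Conv2DLayer(l, num_filters=64, filter_size=(3, 3)))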
Slide 96
Slide 96 text
Data standardisation
Slide 97
Slide 97 text
Standardise your data
Ensure zero-mean and unit standard
deviation
Slide 98
Slide 98 text
Standardise input data
In case of regression, standardise
output data too (don’t forget to invert
the standardisation of network
predictions!)
Example figure: an autoencoder for edge-map reconstruction (regression), trained without standardisation
Slide 104
Slide 104 text
Standardisation
Extract samples (pixels in the case of
images) into an array
Compute distribution and standardise
Slide 105
Slide 105 text
Either:
Zero the mean and scale the std-dev to 1, per channel (RGB for images):
x′ = (x − μ) / σ
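A numpy sketch of the per-channel version, assuming X is a float32 array of images with shape (N, 3, H, W):

import numpy as np

mean = X.mean(axis=(0, 2, 3), keepdims=True)   # per-channel mean
std = X.std(axis=(0, 2, 3), keepdims=True)     # per-channel std-dev
X_std = (X - mean) / std
# Apply the *same* mean and std to validation and test data.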
Slide 106
Slide 106 text
CIFAR10 RGB distribution
Slide 107
Slide 107 text
CIFAR10 RGB – standardised
Slide 108
Slide 108 text
Or better still:
Use PCA whitening
(retain all channels – we don’t want to
reduce dimensionality)
Slide 109
Slide 109 text
CIFAR10 RGB – with principal
components
Slide 110
Slide 110 text
CIFAR10 RGB - principal components
aligned with standard basis
Slide 111
Slide 111 text
CIFAR10 RGB – PCA whitened
Slide 112
Slide 112 text
PCA whitening
From Scikit-learn use PCA or
IncrementalPCA
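A sketch with scikit-learn, assuming `pixels` is an (n_pixels, 3) array of RGB samples drawn from the training images:

from sklearn.decomposition import PCA

pca = PCA(n_components=3, whiten=True)   # keep all 3 channels; whiten=True rescales them to unit variance
pca.fit(pixels)
pixels_white = pca.transform(pixels)     # pca.inverse_transform maps back if needed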
Slide 113
Slide 113 text
Could batch normalisation make
standardisation unnecessary?
Slide 114
Slide 114 text
The previous fish-based examples all used batch normalisation and still benefited from data standardisation, so no.
Slide 115
Slide 115 text
When training goes wrong and
what to look for
Slide 116
Slide 116 text
Loss becomes NaN
Slide 117
Slide 117 text
Classification error rate is equivalent to a random guess (it’s not learning)
Slide 118
Slide 118 text
Learns to predict constant value;
optimises constant value for best loss
A constant value is a local minimum
that the network won’t get out of
(neural networks ‘cheat’ like crazy!)
Slide 119
Slide 119 text
Debugging your network
Slide 120
Slide 120 text
Neural networks (most) often DON’T
learn what you want or expect them to
Slide 121
Slide 121 text
Local minima will be the bane of your
existence
Slide 122
Slide 122 text
So, what has your network learnt?
This is often a good question.
Slide 123
Slide 123 text
Saliency maps
Determine which parts of an image the
network is using to make its prediction
Tells you what the network is ‘looking
at’
Slide 124
Slide 124 text
Two approaches
Slide 125
Slide 125 text
1. Region-level saliency
Blank out different regions of the image
and compute the difference in
prediction
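A sketch of the occlusion approach, assuming `predict_prob(img)` returns class probabilities for a single (3, H, W) image and `cls` is the class of interest:

import numpy as np

def region_saliency(img, predict_prob, cls, block=16):
    base = predict_prob(img)[cls]                   # prediction for the un-occluded image
    _, H, W = img.shape
    sal = np.zeros((H // block, W // block))
    for by in range(H // block):
        for bx in range(W // block):
            occluded = img.copy()
            occluded[:, by * block:(by + 1) * block, bx * block:(bx + 1) * block] = 0.0
            sal[by, bx] = base - predict_prob(occluded)[cls]   # drop in prediction = importance
    return sal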
Slide 126
Slide 126 text
Exercise / Demo
Block-level image saliency
Slide 127
Slide 127 text
2. Pixel-level saliency
Compute the gradient of the prediction
of a specific class w.r.t. the image pixels
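In Theano/Lasagne this is only a couple of lines (a sketch, assuming `network` is the final layer built on the input variable `input_var`):

import theano
import theano.tensor as T
import lasagne

cls = T.iscalar('cls')
prob = lasagne.layers.get_output(network, deterministic=True)
saliency = T.grad(prob[0, cls], input_var)            # gradient of the class score w.r.t. the pixels
saliency_fn = theano.function([input_var, cls], saliency)
# sal = np.abs(saliency_fn(img_batch, class_index)).max(axis=1)   # per-pixel magnitude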
Slide 128
Slide 128 text
Exercise / Demo
Pixel-level image saliency
Slide 129
Slide 129 text
Designing a computer vision
pipeline
Slide 130
Slide 130 text
Simple problems may be solved with
just a neural network
Slide 131
Slide 131 text
Not sufficient for more complex
problems
Slide 132
Slide 132 text
Theoretically possible to use a single
network, with enough training data
(where enough is an impractical
amount)
Slide 133
Slide 133 text
For more complex problems, the
problem should be broken down
Slide 134
Slide 134 text
Example
Identifying right whales, by Felix Lau
2nd place in Kaggle competition
http://felixlaumon.github.io/2015/01/08/kaggle-right-whale.html
Slide 135
Slide 135 text
Identifying right whales, by Felix Lau
The first naïve solution – training a
classifier to identify individuals – did
not work well
Slide 136
Slide 136 text
Region-based saliency map revealed that
the network had ‘locked on’ to features
in the ocean shape rather than the
whales
Slide 137
Slide 137 text
Lau’s solution:
Train a localiser to locate the whale in
the image
Slide 138
Slide 138 text
Lau’s solution:
Train a keypoint finder to locate two
keypoints on the whale’s head to
identify its orientation
Slide 139
Slide 139 text
Lau’s solution:
Train classifier on oriented and cropped
whale head images
Slide 140
Slide 140 text
Active learning
Slide 141
Slide 141 text
Training deep neural networks is data
hungry
Labelled training data can be expensive
to acquire or produce
Slide 142
Slide 142 text
Active learning reduces the amount of
data required
Can therefore reduce cost
Slide 143
Slide 143 text
Assumption 1: classification problem
Assumption 2: unlimited or large
quantities of un-labelled data available
Slide 144
Slide 144 text
Train a network with the labelled data
we have
Slide 145
Slide 145 text
Predict which un-labelled samples are
hardest to classify; where ground truths
would be most useful
Slide 146
Slide 146 text
Get ground truth labels for those
samples (by manual labelling say)
Slide 147
Slide 147 text
Train with enlarged dataset
Slide 148
Slide 148 text
Repeat as necessary; go back to
predicting difficulty of un-labelled
samples
Slide 149
Slide 149 text
How to determine which un-labelled
samples are most worth labelling?
Slide 150
Slide 150 text
[Wang14] discusses a few different approaches, confidence being simple and effective
Slide 151
Slide 151 text
Active learning by confidence
Slide 152
Slide 152 text
Active learning by confidence
Predict class probabilities of un-labelled
sample
Slide 153
Slide 153 text
Active learning by confidence
The maximum probability is that of the
predicted class and its value is the
confidence
Slide 154
Slide 154 text
Active learning by confidence
Choose the samples with the lowest
confidence as the next candidates for
labelling
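In code this is just a sort (a sketch, assuming `probs` holds the predicted probabilities for the un-labelled pool):

import numpy as np

confidence = probs.max(axis=1)                  # probability of the predicted class, per sample
query_indices = np.argsort(confidence)[:500]    # e.g. the 500 least-confident samples to label next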
Slide 155
Slide 155 text
Note:
Predicted probabilities from neural nets are often very close to 0.0 or 1.0 (maybe 1e-6 away).
It would be nice if they were ‘smoother’
Slide 156
Slide 156 text
Can use softmax with ‘temperature’ to
smooth the predictions
Slide 157
Slide 157 text
Softmax:
p_i = exp(y_i) / Σ_j exp(y_j)
Slide 158
Slide 158 text
Softmax with temperature t (just divide by t first):
z_i = y_i / t
p_i = exp(z_i) / Σ_j exp(z_j)
Slide 159
Slide 159 text
A higher temperature softens the
predicted probabilities
I find a value of 3 for t works well
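A numpy sketch, applying the temperature to the pre-softmax outputs (logits):

import numpy as np

def softmax_with_temperature(logits, t=3.0):
    z = logits / t                            # divide by the temperature first
    z = z - z.max(axis=-1, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)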
Slide 160
Slide 160 text
Active learning example: MNIST
50k training, 10k validation, 10k test
Slide 161
Slide 161 text
Start with 500 labelled training samples
Each round, choose the 500 remaining training samples with the least confidence
Add them to the labelled dataset
Slide 162
Slide 162 text
Figure: Active learning on MNIST: prediction error (%) vs. number of labelled samples (500 to 5000), comparing random order with least-confidence choice
Slide 163
Slide 163 text
MNIST:
Only needs 5k out of 50k samples to reach (very nearly) the same accuracy
Slide 164
Slide 164 text
MNIST is a special, easy case
SVHN results are less marked; can save maybe 1/3 of the data
Slide 165
Slide 165 text
Saving even 25% of labelled data
requirements could result in substantial
cost saving though
Slide 166
Slide 166 text
Just for fun
Slide 167
Slide 167 text
Deep Dreams
Slide 168
Slide 168 text
When training a network, we use
gradient descent to iteratively modify
weights given images and ground truths
Slide 169
Slide 169 text
We can just as easily use gradient descent to modify an image, given the weights
Slide 170
Slide 170 text
Deep Dreams:
Take an image to hallucinate from
Choose a layer, e.g. ‘pool4’ of VGG-19;
choice depends on scale and level of
features desired
Slide 171
Slide 171 text
Deep Dreams:
Compute the gradient of the norm of the layer’s activations w.r.t. the image
Use gradient ascent to increase that norm
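A sketch of the update in Theano/Lasagne, assuming `net['pool4']` is the chosen VGG-19 layer built on `input_var`, and using the squared L2 norm of its activations as the score (an assumption; the notebook may differ):

import theano
import theano.tensor as T
import lasagne

act = lasagne.layers.get_output(net['pool4'], deterministic=True)
score = (act ** 2).sum()                      # squared L2 norm of the layer's activations
grad = T.grad(score, input_var)
grad_fn = theano.function([input_var], grad)
# Gradient ascent on the image itself:
# for i in range(n_iter):
#     g = grad_fn(img)
#     img += step_size * g / (np.abs(g).mean() + 1e-8)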
Slide 172
Slide 172 text
Exercise / Demo
Deep Dreams
Slide 173
Slide 173 text
Hope you’ve found it helpful!
Slide 174
Slide 174 text
Thank you!
Slide 175
Slide 175 text
References
Slide 176
Slide 176 text
[He15a] He, Zhang, Ren and Sun;
Delving Deep into Rectifiers: Surpassing
Human-Level Performance on ImageNet
Classification, arXiv 2015
Slide 177
Slide 177 text
[He15b] He, Kaiming, et al. "Deep
Residual Learning for Image
Recognition." arXiv preprint
arXiv:1512.03385 (2015).
Slide 178
Slide 178 text
[Hinton12] G.E. Hinton, N. Srivastava, A.
Krizhevsky, I. Sutskever and R. R.
Salakhutdinov; Improving neural
networks by preventing co-adaptation of
feature detectors. arXiv preprint
arXiv:1207.0580, 2012.
Slide 179
Slide 179 text
[Ioffe15] Ioffe, S.; Szegedy C.. (2015).
“Batch Normalization: Accelerating Deep
Network Training by Reducing Internal
Covariate Shift". ICML 2015,
arXiv:1502.03167
Slide 180
Slide 180 text
[Jones87] Jones, J.P.; Palmer, L.A. (1987).
"An evaluation of the two-dimensional
gabor filter model of simple receptive
fields in cat striate cortex". J.
Neurophysiol 58 (6): 1233–1258
Slide 181
Slide 181 text
[Lin13] Lin, Min, Qiang Chen, and
Shuicheng Yan. "Network in
network." arXiv preprint
arXiv:1312.4400 (2013).
Slide 182
Slide 182 text
[Nesterov83] Nesterov, Y. A method of
solving a convex programming problem
with convergence rate O(1/sqr(k)). Soviet
Mathematics Doklady, 27:372–376
(1983).
Slide 183
Slide 183 text
[Sutskever13] Sutskever, Ilya, et al. On
the importance of initialization and
momentum in deep
learning. Proceedings of the 30th
international conference on machine
learning (ICML-13). 2013.
Slide 184
Slide 184 text
[Simonyan14] K. Simonyan and
Zisserman; Very deep convolutional
networks for large-scale image
recognition, arXiv:1409.1556, 2014
Slide 185
Slide 185 text
[Wang14] Wang, Dan, and Yi Shang. "A
new active labeling method for deep
learning."Neural Networks (IJCNN), 2014
International Joint Conference on. IEEE,
2014.