In practice this is done on a mini-batch
of examples (e.g. 128) in parallel per
pass
Compute cost for each example, then
average. Compute derivative of average
cost w.r.t. params.
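A minimal sketch of one such mini-batch update, assuming a single linear layer with a squared-error cost (the names W, b, lr and the arrays X, t are illustrative, not from the slides):

```python
import numpy as np

def minibatch_step(W, b, X, t, lr=0.1):
    """One gradient-descent step on a mini-batch.

    X: (batch, n_in) inputs, t: (batch, n_out) targets; the cost is the
    average squared error over the mini-batch.
    """
    y = X @ W + b                                   # forward pass, all examples in parallel
    err = y - t
    cost = 0.5 * np.mean(np.sum(err ** 2, axis=1))  # average cost over the batch
    dW = X.T @ err / X.shape[0]                     # derivative of the *average* cost
    db = err.mean(axis=0)
    return W - lr * dW, b - lr * db, cost
```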
Slide 29
Slide 29 text
Using mini-batches works well when
using a GPU since computations can be
parallelised
Slide 30
Slide 30 text
Cost function
Slide 31
Slide 31 text
Regression
Final layer: no activation function /
identity.
Cost: Sum of squared differences
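For a network output y and target t this cost is, in the usual formulation (not verbatim from the slides): C = ½ ‖y − t‖² = ½ Σ_k (y_k − t_k)², averaged over the examples in the mini-batch.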
Slide 32
Slide 32 text
Classification
Just like logistic regression
Slide 33
Slide 33 text
Final layer: softmax as activation function; output is a vector of class probabilities
Cost: negative-log-likelihood /
categorical cross-entropy
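A minimal numpy sketch of the softmax output and the categorical cross-entropy cost on a mini-batch (variable names are illustrative):

```python
import numpy as np

def softmax(z):
    # subtract the row-wise max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, targets):
    # targets are one-hot rows; negative log-likelihood of the true class
    return -np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1))
```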
Slide 34
Slide 34 text
Fully connected neural networks
Slide 35
Slide 35 text
Simplest model
Each unit in each layer is connected to all units in the previous layer
All we have considered so far
Slide 36
Slide 36 text
How well does this perform on image
classification?
Slide 37
Slide 37 text
MNIST hand-written digit dataset
28x28 images, 10 classes
60K training examples, 10K validation, 10K
test
Examples from MNIST
Slide 38
Slide 38 text
Network: 1 hidden layer of 64 units
after 300 iterations over training set:
2.85% validation error
Hidden layer weights visualised as 28x28 images
[Network diagram: 784 inputs (28x28 images) → 64 hidden units → 10 outputs]
Slide 39
Slide 39 text
Network: 2 hidden layers, both 256 units
after 300 iterations over training set:
1.83% validation error
[Network diagram: 784 inputs (28x28 images) → 256 hidden units → 256 hidden units → 10 outputs]
Slide 40
Slide 40 text
MNIST is quite a special case
Digits nicely centred within image
Scaled to approx. same size
Slide 41
Slide 41 text
The fully connected networks so far have a
weakness:
No translation invariance; learned features are
position dependent
Slide 42
Slide 42 text
For more general imagery:
Requires a training set large enough to see all features in all possible positions…
Requires a network with enough units to represent this…
Slide 43
Slide 43 text
Convolutional networks
Slide 44
Slide 44 text
Convolution
Slide a convolution kernel over an image
Multiply image pixels by kernel pixels
and sum
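A direct, loop-based sketch of the operation described above (real implementations use optimised library routines; strictly speaking this computes cross-correlation, which is what most deep-learning code does anyway):

```python
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # multiply the image patch by the kernel and sum
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out
```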
Slide 45
Slide 45 text
Convolution
Convolutions are often used for feature
detection
Slide 46
Slide 46 text
A brief detour…
Slide 47
Slide 47 text
Gabor filters
Slide 48
Slide 48 text
Used for texture classification
Bears similarity to low levels of the cat visual system [Jones87]
Slide 49
Slide 49 text
Back on track to…
Convolutional networks
Slide 50
Slide 50 text
Recap: FC (fully-connected) layer
y = f(Wx + b)
[Diagram: input vector x, weighted connections W, bias b, activation function / non-linearity f, layer activation y]
Slide 51
Slide 51 text
Convolutional layer
Each unit only connected to units
in its neighbourhood
Slide 52
Slide 52 text
Convolutional layer
Weights are shared
Weights of the same colour in the diagram (red, green, yellow) have the same value
Slide 53
Slide 53 text
The values of the weights form a
convolution kernel
For practical computer vision, more than one kernel must be used to extract a variety of features
Slide 54
Slide 54 text
Convolutional layer
Different weight-kernels: output is a vector/image with multiple channels
Slide 55
Slide 55 text
Still y = f(Wx + b)
As convolution can be expressed as multiplication by a weight matrix
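That claim can be checked with a tiny 1D example (purely illustrative): build the sparse, weight-shared matrix whose rows are shifted copies of the kernel and compare against the sliding-window result.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
k = np.array([0.25, 0.5, 0.25])                 # 3-tap kernel

# Each row of W is the kernel shifted one position along the input
W = np.zeros((len(x) - len(k) + 1, len(x)))
for i in range(W.shape[0]):
    W[i, i:i + len(k)] = k

direct = np.array([np.dot(k, x[i:i + len(k)]) for i in range(W.shape[0])])
assert np.allclose(W @ x, direct)               # same result either way
```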
Slide 56
Slide 56 text
Note
In subsequent layers, each kernel
connects to pixels in ALL channels in
previous layer
Slide 57
Slide 57 text
Max-pooling ‘layer’ [Ciresan12]
Take the maximum value from each (p, q) pooling region
Down-samples the image by that factor
Operates on channels independently
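A minimal sketch of p × p max-pooling on a single channel (non-overlapping regions; any remainder at the edges is cropped):

```python
import numpy as np

def max_pool(image, p=2):
    h, w = image.shape
    # group pixels into non-overlapping p x p blocks and keep the maximum of each
    blocks = image[:h - h % p, :w - w % p].reshape(h // p, p, w // p, p)
    return blocks.max(axis=(1, 3))
```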
Slide 58
Slide 58 text
These are the models that have been
getting excellent ImageNet results
after 300 iterations over training set:
99.21% validation accuracy
Model Error
FC64 2.85%
FC256--FC256 1.83%
20C5--MP2--50C5--MP2--FC256 0.79%
Slide 63
Slide 63 text
What about the learned kernels?
Image taken from paper
[Krizhevsky12]
Gabor filters
Slide 64
Slide 64 text
Neural networks – recent
developments
Slide 65
Slide 65 text
GOOD NEWS
Training neural networks is more
practical now
Slide 66
Slide 66 text
More processing power
Less of a black art
Slide 67
Slide 67 text
IMPROVEMENTS
Processing power
ReLU activation function
Batch normalisation
DropOut
Slide 68
Slide 68 text
(most important)
IMPROVEMENT
Processing power
Slide 69
Slide 69 text
Image processing requires large networks
with perhaps millions of parameters
Lots of training examples are needed for training
Easily results in billions or even trillions of
FLOPS
Slide 70
Slide 70 text
Neural networks are ‘embarrassingly
parallelisable’ therefore ideally suited to
GPUs
Use GPUs for all but the smallest of
networks
Slide 71
Slide 71 text
As of now, nVidia is the most popular
make of GPU.
Cheaper gaming cards perfectly
adequate
Only use Tesla in production
Slide 72
Slide 72 text
IMPROVEMENT
New popular activation function:
ReLU - Rectified Linear Unit
Slide 73
Slide 73 text
ReLU - Rectified Linear Unit
f(x) = max(x, 0)
Slide 74
Slide 74 text
ReLU works better than tanh / sigmoid in
many cases
I don’t really understand the reasons (to be honest!)
See [Glorot11] [Glorot10]; written by people
who do!
Slide 75
Slide 75 text
IMPROVEMENT
Batch normalisation
Slide 76
Slide 76 text
PROBLEM:
Magnitudes of activations can vary
considerably, layer to layer
If each layer ‘multiplies’ magnitude by
some factor, they explode or vanish
Slide 77
Slide 77 text
SOLUTIONS
Initially: careful weight initialisation
Now: replaced by batch normalisation
Slide 78
Slide 78 text
y = f(Wx + b)
Assume: σ(x) = 1
σ(y) depends on the distribution of W: normal or uniform, its std-dev, etc.
Slide 79
Slide 79 text
In the past: initialise W using rules of thumb, e.g. a normal distribution with σ = 0.01
Problems arise when training deep
networks with > 8 layers [Simonyan14],
[He15]
Slide 80
Slide 80 text
Previously: carefully choose the distribution of W so that σ(y) ≈ σ(x)
Slide 81
Slide 81 text
E.g. the approach by He et al. [He15]:
σ = g √(1 / f_in)
where f_in is the fan-in (the number of incoming connections) and g is the gain
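A sketch of that rule (the gain √2 is the value [He15] recommends for ReLU; the symbols are a guess at the slide's notation):

```python
import numpy as np

def he_init(fan_in, fan_out, gain=np.sqrt(2.0)):
    # std-dev chosen so activation magnitudes stay roughly constant layer to layer
    std = gain * np.sqrt(1.0 / fan_in)
    return np.random.normal(0.0, std, size=(fan_in, fan_out))
```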
Slide 82
Slide 82 text
New approach:
BATCH NORMALISATION [Ioffe15]
Keep distribution of activations sane as
part of the network architecture
Slide 83
Slide 83 text
BATCH NORMALISATION
For each mini-batch of examples during
training
Normalise using mean and standard
deviation
Slide 84
Slide 84 text
Layer equation becomes:
y = f(γ (x − μ) / σ + β)
where μ = mean(x) and σ = std(x), computed over the mini-batch
γ (scale; not needed if f is ReLU) and β (bias) are learned parameters
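A minimal training-time sketch for one fully connected layer (ε avoids division by zero; γ and β would be updated by gradient descent along with the weights):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, units); statistics are per unit, aggregated across the mini-batch
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    x_hat = (x - mu) / (sigma + eps)
    return gamma * x_hat + beta, mu, sigma
```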
Slide 85
Slide 85 text
Note:
For a fully connected layer, each
unit/output should have its own mean
and std-dev; aggregate across examples
in the mini-batch
Slide 86
Slide 86 text
For a convolutional layer, each channel
should have its own mean and std-dev;
aggregate across examples in the mini-
batch and across image rows and
columns
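With activations stored as (batch, channels, height, width), that aggregation is just a choice of axes (the shapes here are illustrative):

```python
import numpy as np

x = np.random.randn(128, 20, 24, 24)            # (batch, channels, height, width)
mu = x.mean(axis=(0, 2, 3), keepdims=True)      # one mean per channel
sigma = x.std(axis=(0, 2, 3), keepdims=True)    # one std-dev per channel
```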
Slide 87
Slide 87 text
During training, keep a running
exponential moving average of mean
and std-dev
During test time, use the averaged mean
and std-dev
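A sketch of those two steps (the momentum value is illustrative):

```python
def update_running(run_mu, run_sigma, mu, sigma, momentum=0.99):
    # exponential moving average, updated once per mini-batch during training
    return (momentum * run_mu + (1 - momentum) * mu,
            momentum * run_sigma + (1 - momentum) * sigma)

def batch_norm_test(x, gamma, beta, run_mu, run_sigma, eps=1e-5):
    # at test time the batch statistics are replaced by the running averages
    return gamma * (x - run_mu) / (run_sigma + eps) + beta
```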
Slide 88
Slide 88 text
Reducing Over-fitting
Slide 89
Slide 89 text
Over-fitting always a problem in ML
Model over-fits when it is very good at
matching samples in training set but not
those in validation/test
Slide 90
Slide 90 text
Neural networks are very prone to over-
fitting
Slide 91
Slide 91 text
Two techniques
DropOut
(quite a lot of people use batch
normalisation instead)
Dataset augmentation
Slide 92
Slide 92 text
DropOut [Hinton12]
During training, randomly choose units to ‘drop out’ by setting their output to 0, with probability p, usually around 0.5
(compensate by multiplying the remaining values by 1/(1−p))
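A sketch of this ‘inverted dropout’ formulation (drop with probability p, scale the survivors by 1/(1−p) so that nothing special is needed at test time):

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    if not training:
        return x                           # test/predict: run as normal
    mask = np.random.rand(*x.shape) >= p   # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)            # compensate for the dropped units
```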
Slide 93
Slide 93 text
During test/predict:
Run as normal (no DropOut)
Slide 94
Slide 94 text
Normally applied to later, fully
connected layers
Slide 95
Slide 95 text
Dropout OFF
[Network diagram: input layer, hidden layer 0, output layer]
Slide 96
Slide 96 text
Dropout ON (1)
[Network diagram: input layer, hidden layer 0, output layer]
Slide 97
Slide 97 text
Dropout ON (2)
[Network diagram: input layer, hidden layer 0, output layer]
Slide 98
Slide 98 text
Sampling a different subset of the
network for each training example
Kind of like model averaging with only
one model
Slide 99
Slide 99 text
What effect does it have?
(approx. replication of [Hinton12])
Slide 100
Slide 100 text
Dataset: MNIST Digits
Network: single hidden layer, fully connected, 256 units, p = 0.4
5000 iterations over training set
Dataset augmentation
Take the existing dataset and expand it by adding transformed versions of existing samples
Slide 103
Slide 103 text
Dataset augmentation for images
[Krizhevsky12]
Cropping and translation
Scaling
Rotation
Lighting/colour modifications
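A sketch of simple on-the-fly augmentation with a random crop/translation and horizontal flip (crop size and probabilities are illustrative; flips would be skipped for digits, where mirroring changes the class):

```python
import numpy as np

def augment(image, crop=24, flip=True):
    h, w = image.shape[:2]
    y = np.random.randint(0, h - crop + 1)
    x = np.random.randint(0, w - crop + 1)
    patch = image[y:y + crop, x:x + crop]      # random crop / translation
    if flip and np.random.rand() < 0.5:
        patch = patch[:, ::-1]                 # random horizontal flip
    return patch
```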
Slide 104
Slide 104 text
Neural network software
Slide 105
Slide 105 text
Two categories of software:
Neural network toolkit (normally faster)
Expression compilers
Slide 106
Slide 106 text
Neural network toolkit
Most popular is CAFFE (from Berkeley)
http://caffe.berkeleyvision.org/
Slide 107
Slide 107 text
Specify network architecture in terms of
layers
Slide 108
Slide 108 text
Layers usually described using custom
config/language
CAFFE uses Google Protocol Buffers for
base syntax (YAML/JSON like) and for
data (since GPB is binary)
Slide 109
Slide 109 text
CAFFE can be used from:
command line
MATLAB
Python
Slide 110
Slide 110 text
Expression compilers
Theano (from University of Montreal)
Torch 7
Tensorflow (more recent)
Slide 111
Slide 111 text
Describe network architecture in terms
of mathematical expressions
Expressions compiled to CUDA code
and executed on GPU
Slide 112
Slide 112 text
Theano, Torch 7 and Tensorflow
provide automatic symbolic
differentiation:
Big win; fewer bugs and less manual work
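A minimal Theano sketch of what that buys you: the gradient expression is derived symbolically rather than coded by hand (assumes Theano is installed; the model and learning rate are illustrative):

```python
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
t = T.matrix('t')
w = theano.shared(np.zeros((784, 10)), name='w')

y = T.nnet.softmax(T.dot(x, w))
cost = T.nnet.categorical_crossentropy(y, t).mean()

grad_w = T.grad(cost, w)                    # symbolic gradient: no manual differentiation
train = theano.function([x, t], cost, updates=[(w, w - 0.1 * grad_w)])
```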
Slide 113
Slide 113 text
In comparison
Network toolkit (e.g. CAFFE)
Advantages:
• CAFFE is fast
• Most likely easier to get going
• Bindings for MATLAB, Python, command line access
Disadvantages:
• Less flexible; harder to extend (need to learn the architecture, manual differentiation)

Expression compiler (e.g. Theano)
Advantages:
• Extensible; a new layer type or cost function is no problem
• See what goes on under the hood
• Being adventurous is easier!
Disadvantages:
• Slower (Theano)
• Debugging can be tricky (compiled expressions are a step away from your code)
• Typically only works with one language (e.g. Python for Theano)
Slide 114
Slide 114 text
Resources and tutorials to get you
going
Slide 115
Slide 115 text
http://cs.stanford.edu/people/karpathy
/convnetjs/
Neural networks running in your web
browser
Excellent demos that show how they work and what they can do
Slide 116
Slide 116 text
https://github.com/Newmu/Theano-
Tutorials
Very simple Python code examples
proceeding through logistic regression, fully
connected and convolutional models.
Shows complete mathematical expressions
and training procedures
Slide 117
Slide 117 text
http://deeplearning.net/tutorial/
More Theano tutorials
More complete; explains mathematics
behind them
Code is longer than previous examples
Slide 118
Slide 118 text
CAFFE:
http://caffe.berkeleyvision.org/
Plenty of documentation and tutorials
Slide 119
Slide 119 text
Our work
Slide 120
Slide 120 text
CCTV for Fisheries
Project involving Dr. M. Fisher, Dr. M.
Mackiewicz
Funded by Marine Scotland
Slide 121
Slide 121 text
Automatically quantify the amount of
fish discarded by fishing trawlers
(preferably by species)
Process surveillance footage of discard
belt
Slide 122
Slide 122 text
No content
Slide 123
Slide 123 text
STEPS:
Segment fish from background
Separate fish from one another
Classify individual fish (TODO)
Measure individual fish to estimate
mass (TODO)
Slide 124
Slide 124 text
Approach
Use the N⁴-Fields approach [Ganin14]
Slide 125
Slide 125 text
N⁴-Fields
Use a network to transform an input image patch into a 16-element codeword vector
The codeword vector is used to look up the target (foreground or edge) patch that most closely matches, in a dictionary of words
Slide 126
Slide 126 text
Use N⁴-Fields to transform the input image to a foreground map
Use N⁴-Fields to transform the input image to an edge map
Use Watershed algorithm to separate image
into regions
Slide 127
Slide 127 text
No content
Slide 128
Slide 128 text
No content
Slide 129
Slide 129 text
No content
Slide 130
Slide 130 text
Videos
Slide 131
Slide 131 text
Has problems with shadows
Need more training data
Much to do yet!
Slide 132
Slide 132 text
Some cool work in the field that
might be of interest
Slide 133
Slide 133 text
Visualizing and understanding
convolutional networks [Zeiler14]
Visualisations of responses of layers to
images
Slide 134
Slide 134 text
Visualizing and understanding convolutional
networks [Zeiler14]
Image taken from [Zeiler14]
Slide 135
Slide 135 text
Visualizing and understanding convolutional
networks [Zeiler14]
Image taken from [Zeiler14]
Slide 136
Slide 136 text
Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen15]
Generate images that are
unrecognizable to human eyes but are
recognized by the network
Slide 137
Slide 137 text
Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen15]
Image taken from [Nguyen15]
Slide 138
Slide 138 text
Learning to generate chairs with
convolutional neural networks
[Dosovitskiy15]
Network in reverse: orientation, design, colour, etc. parameters as input; rendered images as the training outputs
Slide 139
Slide 139 text
Learning to generate chairs with convolutional
neural networks [Dosovitskiy15]
Image taken from [Dosovitskiy15]
Slide 140
Slide 140 text
A Neural Algorithm of Artistic Style
[Gatys15]
Take an OxfordNet model [Simonyan14] and
extract texture features from one of the
convolutional layers, given a target style /
painting as input
Use gradient descent to iteratively modify the photo – not the weights – so that its texture features match those of the target image.
Slide 141
Slide 141 text
A Neural Algorithm of Artistic Style
[Gatys15]
Image taken from [Gatys15]
Slide 142
Slide 142 text
Unsupervised representation Learning with
Deep Convolutional Generative Adversarial
Nets [Radford15]
Train two networks; one given random
parameters to generate an image, another to
discriminate between a generated image and
one from the training set
Slide 143
Slide 143 text
Generative Adversarial Nets [Radford15]
Images of bedrooms generated using
neural net
Image taken from [Radford15]
Slide 144
Slide 144 text
Generative Adversarial Nets [Radford15]
Image taken from [Radford15]
Slide 145
Slide 145 text
Finishing words
Slide 146
Slide 146 text
Deep learning is a fascinating field with
lots going on
Very flexible, wide range of techniques
and applications
Slide 147
Slide 147 text
Deep neural networks have proved to be
highly effective* for computer vision,
speech recognition and other areas
*like with every other shiny new toy, see
the small-print!
Slide 148
Slide 148 text
SMALL-PRINT
Sufficient training data required (curse
of dimensionality)
Dataset augmentation advisable
Slide 149
Slide 149 text
SMALL-PRINT
Model will only represent training examples; may not (and probably won’t) generalise
Slide 150
Slide 150 text
SMALL-PRINT
Choose architecture carefully
Use a GPU
Slide 151
Slide 151 text
I hope this has proved to be a good
introduction to the topic!
Slide 152
Slide 152 text
Thank you!
Slide 153
Slide 153 text
References
Slide 154
Slide 154 text
[Ciresan12] Ciresan, Meier and
Schmidhuber; Multi-column deep neural
networks for image classification,
Computer Vision and Pattern Recognition (CVPR), 2012
Slide 155
Slide 155 text
[Dosovitskiy15] Dosovitskiy, Springenberg and Brox; Learning to generate chairs with convolutional neural networks, arXiv preprint, 2015
Slide 156
Slide 156 text
[Ganin14] Ganin, Lempitsky; N⁴-Fields: Neural Network Nearest Neighbor Fields for Image Transforms, 12th Asian Conference on Computer Vision, 2014
Slide 157
Slide 157 text
[Gatys15] Gatys, Ecker, Bethge; A Neural Algorithm of Artistic Style, arXiv:1508.06576, 2015
Slide 158
Slide 158 text
[Glorot10] Glorot, Bengio;
Understanding the difficulty of training
deep feedforward neural networks,
International Conference on Artificial Intelligence and Statistics, 2010
[He15] He, Zhang, Ren and Sun; Delving
Deep into Rectifiers: Surpassing Human-
Level Performance on ImageNet
Classification, arXiv 2015
Slide 161
Slide 161 text
[Hinton12] G.E. Hinton, N. Srivastava, A.
Krizhevsky, I. Sutskever and R. R.
Salakhutdinov; Improving neural
networks by preventing co-adaptation of
feature detectors. arXiv preprint
arXiv:1207.0580, 2012.
Slide 162
Slide 162 text
[Ioffe15] Ioffe, Szegedy; Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, ICML 2015, arXiv:1502.03167
Slide 163
Slide 163 text
[Jones87] Jones, Palmer; An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex, J. Neurophysiol. 58(6): 1233–1258, 1987
Slide 164
Slide 164 text
[Krizhevsky12] Krizhevsky, Sutskever
and Hinton; ImageNet Classification with
Deep Convolutional Neural networks,
NIPS 2012
Slide 165
Slide 165 text
[LeCun95] LeCun et al.; Comparison of learning algorithms for handwritten digit recognition, International Conference on Artificial Neural Networks, 1995
Slide 166
Slide 166 text
[Nguyen15] Nguyen, Yosinski and
Clune; Deep Neural Networks are Easily
Fooled: High Confidence Predictions for
Unrecognizable Images, Computer
Vision and Pattern Recognition (CVPR)
2015
Slide 167
Slide 167 text
[Radford15] Radford, Metz, Chintala;
Unsupervised Representation Learning
with Deep Convolutional Generative
Adversarial Networks, arXiv:1511.06434,
2015
Slide 168
Slide 168 text
[Simonyan14] Simonyan and Zisserman; Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014
Slide 169
Slide 169 text
[Zeiler14] Zeiler and Fergus; Visualizing
and understanding convolutional
networks, Computer Vision - ECCV 2014