In practice this is done on a mini-batch
of examples (e.g. 128) in parallel per
pass
Compute cost for each example, then
average. Compute derivative of average
cost w.r.t. params.
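A minimal NumPy sketch of one such mini-batch update for a single linear layer with a squared-error cost (my own illustration; the names W, b and the learning rate are not from the slides):

```python
import numpy as np

def minibatch_step(W, b, x_batch, t_batch, learning_rate=0.01):
    """One gradient-descent step on a mini-batch for y = x @ W + b
    with an averaged squared-error cost."""
    n = x_batch.shape[0]                        # mini-batch size, e.g. 128
    y = x_batch @ W + b                         # predictions for all examples at once
    err = y - t_batch
    cost = np.mean(np.sum(err ** 2, axis=1))    # average cost over the batch
    dW = 2.0 * x_batch.T @ err / n              # derivative of the average cost w.r.t. W
    db = 2.0 * err.sum(axis=0) / n              # ... and w.r.t. b
    return W - learning_rate * dW, b - learning_rate * db, cost
```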
Slide 29
Slide 29 text
Using mini-batches works well when
using a GPU since computations can be
parallelised
Slide 30
Slide 30 text
Cost function
Slide 31
Slide 31 text
Regression
Final layer: no activation function /
identity.
Cost: Sum of squared differences
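Written out (notation mine), the sum-of-squared-differences cost over a mini-batch of $N$ examples with outputs $y_i$ and targets $t_i$, averaged over the batch as on the earlier slide, is:

```latex
C = \frac{1}{N} \sum_{i=1}^{N} \left\lVert y_i - t_i \right\rVert^2
```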
Slide 32
Slide 32 text
Classification
Just like logistic regression
Slide 33
Slide 33 text
Final layer: softmax as activation
function ; output vector of class
probabilities
Cost: negative-log-likelihood /
categorical cross-entropy
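An illustrative NumPy sketch (mine, not from the slides) of the softmax output and the categorical cross-entropy cost for a batch of logits and integer class labels:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the max keeps the exponentials stable."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels, eps=1e-12):
    """Negative log-likelihood of the true class, averaged over the batch."""
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + eps))
```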
Slide 34
Slide 34 text
Fully connected neural networks
Slide 35
Slide 35 text
Simplest model
Each unit in each layer is connected to all units in the previous layer
All we have considered so far
Slide 36
Slide 36 text
How well does this perform on image
classification?
Slide 37
Slide 37 text
MNIST hand-written digit dataset
28x28 images, 10 classes
60K training examples, 10K validation, 10K
test
Examples from
MNIST
Slide 38
Slide 38 text
Network: 1 hidden layer of 64 units
after 300 iterations over training set:
2.85% validation error
Hidden layer weights
visualised as 28x28 images
[Diagram: input 784 units (28x28 images) → hidden 64 units → output 10 units]
Slide 39
Slide 39 text
Network: 2 hidden layers, both 256 units
after 300 iterations over training set:
1.83% validation error
[Diagram: input 784 units (28x28 images) → hidden 256 units → hidden 256 units → output 10 units]
Slide 40
Slide 40 text
MNIST is quite a special case
Digits nicely centred within image
Scaled to approx. same size
Slide 41
Slide 41 text
The fully connected networks so far have a
weakness:
No translation invariance; learned features are
position dependent
Slide 42
Slide 42 text
For more general imagery:
requires a training set large enough to see all features in all possible positions…
requires a network with enough units to represent this…
Slide 43
Slide 43 text
Convolutional networks
Slide 44
Slide 44 text
Convolution
Slide a convolution kernel over an image
Multiply image pixels by kernel pixels
and sum
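A rough NumPy sketch of the "slide, multiply, sum" operation on a single-channel image (illustrative only; real frameworks use far faster implementations, and strictly speaking this is cross-correlation since the kernel is not flipped):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image`, multiplying and summing at each position."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out
```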
Slide 45
Slide 45 text
Convolution
Convolutions are often used for feature
detection
Slide 46
Slide 46 text
A brief detour…
Slide 47
Slide 47 text
Gabor filters
Slide 48
Slide 48 text
Used for texture classification
Bears similarity to the low levels of the cat visual system [Jones87]
Slide 49
Slide 49 text
Back on track to…
Convolutional networks
Slide 50
Slide 50 text
Recap: FC (fully-connected) layer
$y = f(Wx + b)$
$x$: input vector; $W$: weighted connections; $b$: bias; $f$: activation function / non-linearity; $y$: layer activation
Slide 51
Slide 51 text
Convolutional layer
Each unit only connected to units
in its neighbourhood
Slide 52
Slide 52 text
Convolutional layer
Weights are shared
[Diagram: the red weights all have the same value, as do the greens… and the yellows]
Slide 53
Slide 53 text
The values of the weights form a
convolution kernel
For practical computer vision, more than one kernel must be used to extract a variety of features
Slide 54
Slide 54 text
Convolutional layer
Different weight kernels: the output is a vector/image with multiple channels (one per kernel)
Slide 55
Slide 55 text
Still $y = f(Wx + b)$,
as the convolution can be expressed as multiplication by a weight matrix
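To illustrate the point, a small sketch (mine, not from the slides) that builds the weight matrix equivalent to a 1D 'valid' convolution; each row is a shifted copy of the kernel, so the weights are shared and most entries are zero:

```python
import numpy as np

def conv_as_matrix(kernel, input_len):
    """Weight matrix whose product with an input vector equals a 1D
    'valid' cross-correlation with `kernel`."""
    k, out_len = len(kernel), input_len - len(kernel) + 1
    W = np.zeros((out_len, input_len))
    for i in range(out_len):
        W[i, i:i + k] = kernel          # shared weights: same values on every row
    return W

x = np.arange(6.0)
kernel = np.array([1.0, 0.0, -1.0])
assert np.allclose(conv_as_matrix(kernel, 6) @ x,
                   np.correlate(x, kernel, mode='valid'))
```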
Slide 56
Slide 56 text
Note
In subsequent layers, each kernel
connects to pixels in ALL channels in
previous layer
Slide 57
Slide 57 text
Max-pooling ‘layer’ [Ciresan12]
Take the maximum value from each $p \times q$ pooling region
Down-samples the image by a factor of $p \times q$
Operates on each channel independently
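A hedged NumPy sketch of max-pooling a single channel with a $p \times q$ pooling region (any rows/columns that do not fill a complete region are discarded, one common convention):

```python
import numpy as np

def max_pool(channel, p=2, q=2):
    """Take the maximum over each non-overlapping p x q region,
    down-sampling the channel by a factor of (p, q)."""
    h, w = channel.shape
    blocks = channel[:h - h % p, :w - w % q].reshape(h // p, p, w // q, q)
    return blocks.max(axis=(1, 3))      # max within each pooling region
```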
Slide 58
Slide 58 text
These are the models that have been
getting excellent ImageNet results
after 300 iterations over training set:
99.21% validation accuracy
Model Error
FC64 2.85%
FC256--FC256 1.83%
20C5--MP2--50C5--MP2--FC256 0.79%
Slide 63
Slide 63 text
What about the learned kernels?
Image taken from [Krizhevsky12]; the learned kernels resemble Gabor filters
Slide 64
Slide 64 text
Neural networks – recent
developments
Slide 65
Slide 65 text
GOOD NEWS
Training neural networks is more
practical now
Slide 66
Slide 66 text
More processing power
Less of a black art
Slide 67
Slide 67 text
(most important)
IMPROVEMENT
Processing power
Slide 68
Slide 68 text
Image processing requires large networks
with perhaps millions of parameters
Lots of training examples are needed for training
Easily results in billions or even trillions of FLOPs
Slide 69
Slide 69 text
Neural networks are ‘embarrassingly
parallelisable’ therefore ideally suited to
GPUs
Use GPUs for all but the smallest of
networks
Slide 70
Slide 70 text
As of now, nVidia is the most popular
make of GPU.
Cheaper gaming cards perfectly
adequate
Only use Tesla in production
Slide 71
Slide 71 text
IMPROVEMENT
New popular activation function:
ReLU - Rectified Linear Unit
Slide 72
Slide 72 text
ReLU - Rectified Linear Unit
$f(x) = \max(x, 0)$
Slide 73
Slide 73 text
ReLU works better than tanh / sigmoid in
many cases
I don’t really understand the reasons (to be honest! :-)
See [Glorot11] [Glorot10]; written by people
who do!
Slide 74
Slide 74 text
IMPROVEMENT
Random weight initialisation
Slide 75
Slide 75 text
Previously: rules of thumb were often used, e.g. a normal distribution with $\sigma = 0.01$
Problems arise when training deep networks with > 8 layers [Simonyan14], [He15]
Slide 76
Slide 76 text
More recent approaches choose initial
weights to maintain unit variance (as
much as possible) throughout layers
Otherwise layers can reduce or magnify
magnitudes of signals exponentially
Slide 77
Slide 77 text
Recent approach by He et al. [He15]:
$\sigma = \sqrt{g \cdot \frac{1}{n_{in}}}$
where $n_{in}$ is the fan-in, i.e. the number of incoming connections, and $g$ is the gain (for the ReLU activation function use $g = 2$)
Slide 78
Slide 78 text
For an FC layer:
$y = Wx + b$
$n_{in} = P$ = size of $x$ / width of $W$ ($x$ is a $P$-element vector, $W$ is a $Q \times P$ matrix)
Slide 79
Slide 79 text
For a convolutional layer:
$n_{in}$ = product of the kernel width, kernel height and the number of channels incoming from the previous layer
Slide 80
Slide 80 text
This will ensure that $\mathrm{Var}(y) \approx \mathrm{Var}(x)$
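A minimal NumPy sketch of this initialisation (my code; it assumes the $\sigma = \sqrt{g / n_{in}}$ rule and fan-in definitions above, with $g = 2$ for ReLU):

```python
import numpy as np

def he_init_fc(q, p, gain=2.0):
    """FC weights: W is Q x P, fan-in is P, std = sqrt(gain / fan_in)."""
    return np.random.normal(0.0, np.sqrt(gain / p), size=(q, p))

def he_init_conv(out_channels, in_channels, kh, kw, gain=2.0):
    """Conv weights: fan-in = kernel height * kernel width * incoming channels."""
    fan_in = kh * kw * in_channels
    return np.random.normal(0.0, np.sqrt(gain / fan_in),
                            size=(out_channels, in_channels, kh, kw))
```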
Slide 81
Slide 81 text
Reducing Over-fitting
Slide 82
Slide 82 text
Over-fitting always a problem in ML
Model over-fits when it is very good at
matching samples in training set but not
those in validation/test
Slide 83
Slide 83 text
Neural networks are very prone to over-
fitting
Slide 84
Slide 84 text
Two techniques
DropOut
Dataset augmentation
Slide 85
Slide 85 text
DropOut [Hinton12]
During training, randomly choose units to ‘drop out’ by setting their output to 0, with probability $p$, usually around 0.5
(compensate by multiplying the remaining values by $\frac{1}{1-p}$)
Slide 86
Slide 86 text
During test/predict:
Run as normal (no DropOut)
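Putting the two slides together, a sketch of 'inverted' DropOut in NumPy (my own code): drop with probability $p$ and rescale by $\frac{1}{1-p}$ during training, pass activations through unchanged at test time:

```python
import numpy as np

def dropout(activations, p=0.5, train=True):
    """Zero each unit with probability p and rescale the survivors by 1/(1-p)
    during training; do nothing during test/predict."""
    if not train:
        return activations
    keep = np.random.rand(*activations.shape) >= p
    return activations * keep / (1.0 - p)
```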
Slide 87
Slide 87 text
Normally applied to later, fully
connected layers
Slide 88
Slide 88 text
Dropout OFF
[Diagram: input layer, hidden layer 0, output layer; all units active]
Slide 89
Slide 89 text
Dropout ON (1)
[Diagram: input layer, hidden layer 0, output layer; a random subset of units dropped]
Slide 90
Slide 90 text
Dropout ON (2)
[Diagram: input layer, hidden layer 0, output layer; a different random subset dropped]
Slide 91
Slide 91 text
Sampling a different subset of the
network for each training example
Kind of like model averaging with only one model :-)
Slide 92
Slide 92 text
What effect does it have?
(approx. replication of [Hinton12])
Slide 93
Slide 93 text
Dataset: MNIST Digits
Network: single hidden layer, fully connected, 256 units, dropout $p = 0.4$
5000 iterations over training set
Dataset augmentation
Take an existing dataset and expand it by adding transformed versions of the existing samples
Slide 97
Slide 97 text
Dataset augmentation for images
[Krizhevsky12]
Cropping and translation
Scaling
Rotation
Lighting/colour modifications
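An illustrative sketch of two of these augmentations in NumPy (random crop/translation and a horizontal flip; the crop size and flip probability are arbitrary choices for the example):

```python
import numpy as np

def augment(image, crop=24, flip_prob=0.5):
    """Return a randomly cropped, possibly mirrored copy of `image`."""
    h, w = image.shape[:2]
    y = np.random.randint(0, h - crop + 1)      # random translation via cropping
    x = np.random.randint(0, w - crop + 1)
    patch = image[y:y + crop, x:x + crop]
    if np.random.rand() < flip_prob:
        patch = patch[:, ::-1]                  # horizontal flip
    return patch
```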
Slide 98
Slide 98 text
Neural network software
Slide 99
Slide 99 text
Two categories of software:
Neural network toolkit (normally faster)
Expression compilers
Slide 100
Slide 100 text
Neural network toolkit
Most popular is CAFFE (from Berkeley)
http://caffe.berkeleyvision.org/
Slide 101
Slide 101 text
Specify network architecture in terms of
layers
Slide 102
Slide 102 text
Layers are usually described using a custom config language
CAFFE uses Google Protocol Buffers for the base syntax (YAML/JSON-like) and for data (since GPB has a binary format)
Slide 103
Slide 103 text
CAFFE can be used from:
command line
MATLAB
Python
Slide 104
Slide 104 text
Expression compilers
Theano (from University of Montreal)
Torch 7
Tensorflow (more recent)
Slide 105
Slide 105 text
Describe network architecture in terms
of mathematical expressions
Expressions compiled to CUDA code
and executed on GPU
Slide 106
Slide 106 text
Theano, Torch 7 and Tensorflow
provide automatic symbolic
differentiation:
Big win; fewer bugs and less manual work
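For instance, in Theano the gradient of a cost expression is derived symbolically rather than by hand; a rough logistic-regression sketch (my own code, variable names and sizes illustrative):

```python
import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')                        # mini-batch of inputs
t = T.ivector('t')                       # integer class labels
W = theano.shared(np.zeros((784, 10), dtype='float32'), name='W')
b = theano.shared(np.zeros(10, dtype='float32'), name='b')

p = T.nnet.softmax(T.dot(X, W) + b)      # class probabilities
cost = -T.mean(T.log(p)[T.arange(t.shape[0]), t])

gW, gb = T.grad(cost, [W, b])            # automatic symbolic differentiation
train = theano.function([X, t], cost,
                        updates=[(W, W - 0.1 * gW), (b, b - 0.1 * gb)])
```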
Slide 107
Slide 107 text
In comparison
Network toolkit (e.g. CAFFE)
Advantages:
• CAFFE is fast
• Most likely easier to get going
• Bindings for MATLAB, Python, command line access
Disadvantages:
• Less flexible; harder to extend (need to learn the architecture, manual differentiation)

Expression compiler (e.g. Theano)
Advantages:
• Extensible; new layer type or cost function: no problem
• See what goes on under the hood
• Being adventurous is easier!
Disadvantages:
• Slower (Theano)
• Debugging can be tricky (compiled expressions are a step away from your code)
• Typically only work with one language (e.g. Python for Theano)
Slide 108
Slide 108 text
Resources and tutorials to get you
going
Slide 109
Slide 109 text
http://cs.stanford.edu/people/karpathy
/convnetjs/
Neural networks running in your web
browser
Excellent demos that show how they work and what they can do
Slide 110
Slide 110 text
https://github.com/Newmu/Theano-
Tutorials
Very simple Python code examples
proceeding through logistic regression, fully
connected and convolutional models.
Shows complete mathematical expressions
and training procedures
Slide 111
Slide 111 text
http://deeplearning.net/tutorial/
More Theano tutorials
More complete; explains mathematics
behind them
Code is longer than previous examples
Slide 112
Slide 112 text
CAFFE:
http://caffe.berkeleyvision.org/
Plenty of documentation and tutorials
Slide 113
Slide 113 text
Some cool work in the field that
might be of interest
Slide 114
Slide 114 text
Visualizing and understanding
convolutional networks [Zeiler14]
Visualisations of responses of layers to
images
Slide 115
Slide 115 text
Visualizing and understanding convolutional
networks [Zeiler14]
Image taken from [Zeiler14]
Slide 116
Slide 116 text
Visualizing and understanding convolutional
networks [Zeiler14]
Image taken from [Zeiler14]
Slide 117
Slide 117 text
Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen15]
Generate images that are
unrecognizable to human eyes but are
recognized by the network
Slide 118
Slide 118 text
Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen15]
Image taken from [Nguyen15]
Slide 119
Slide 119 text
Learning to generate chairs with
convolutional neural networks
[Dosovitskiy15]
A network in reverse: orientation, design, colour, etc. parameters as input; rendered images as the output / training images
Slide 120
Slide 120 text
Learning to generate chairs with convolutional
neural networks [Dosovitskiy15]
Image taken from [Dosovitskiy15]
Slide 121
Slide 121 text
A Neural Algorithm of Artistic Style
[Gatys15]
Take an OxfordNet model [Simonyan14] and
extract texture features from one of the
convolutional layers, given a target style /
painting as input
Use gradient descent to iteratively update the photo (not the weights) so that its texture features match those of the target image.
Slide 122
Slide 122 text
A Neural Algorithm of Artistic Style
[Gatys15]
Image taken from [Gatys15]
Slide 123
Slide 123 text
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Nets [Radford15]
Train two networks: one is given random parameters and generates an image, the other discriminates between a generated image and one from the training set
Slide 124
Slide 124 text
Generative Adversarial Nets [Radford15]
Images of bedrooms generated using
neural net
Image taken from [Radford15]
Slide 125
Slide 125 text
Generative Adversarial Nets [Radford15]
Image taken from [Radford15]
Slide 126
Slide 126 text
Finishing words
Slide 127
Slide 127 text
Deep learning is a fascinating field with
lots going on
Very flexible, wide range of techniques
and applications
Slide 128
Slide 128 text
Deep neural networks have proved to be
highly effective* for computer vision,
speech recognition and other areas
*as with every other shiny new toy, see the small print!
Slide 129
Slide 129 text
SMALL-PRINT
Sufficient training data required (curse
of dimensionality)
Dataset augmentation advisable
Slide 130
Slide 130 text
SMALL-PRINT
The model will only represent the training examples; it may not (probably won’t) generalise
Slide 131
Slide 131 text
SMALL-PRINT
Choose architecture carefully
Use a GPU
Slide 132
Slide 132 text
I hope this has proved to be a good
introduction to the topic!
Slide 133
Slide 133 text
Thank you!
Slide 134
Slide 134 text
References
Slide 135
Slide 135 text
[Ciresan12] Ciresan, Meier and Schmidhuber; Multi-column deep neural networks for image classification, Computer Vision and Pattern Recognition (CVPR), 2012
Slide 136
Slide 136 text
[Dosovitskiy15] Dosovitskiy, Springenberg and Brox; Learning to generate chairs with convolutional neural networks, arXiv preprint, 2015
Slide 137
Slide 137 text
[Gatys15] Gatys, Ecker, Bethge; A Neural Algorithm of Artistic Style, arXiv:1508.06576, 2015
Slide 138
Slide 138 text
[Glorot10] Glorot, Bengio;
Understanding the difficulty of training
deep feedforward neural networks,
International conference on artificial
intelligence and statistics, 2010
[He15] He, Zhang, Ren and Sun; Delving
Deep into Rectifiers: Surpassing Human-
Level Performance on ImageNet
Classification, arXiv 2015
Slide 141
Slide 141 text
[Hinton12] G.E. Hinton, N. Srivastava, A.
Krizhevsky, I. Sutskever and R. R.
Salakhutdinov; Improving neural
networks by preventing co-adaptation of
feature detectors. arXiv preprint
arXiv:1207.0580, 2012.
Slide 142
Slide 142 text
[Jones87] Jones, J.P.; Palmer, L.A. (1987). "An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex". J. Neurophysiol 58 (6): 1233–1258
Slide 143
Slide 143 text
[Krizhevsky12] Krizhevsky, Sutskever
and Hinton; ImageNet Classification with
Deep Convolutional Neural networks,
NIPS 2012
Slide 144
Slide 144 text
[LeCun95] LeCun et al.; Comparison of learning algorithms for handwritten digit recognition, International Conference on Artificial Neural Networks, 1995
Slide 145
Slide 145 text
[Nguyen15] Nguyen, Yosinski and
Clune; Deep Neural Networks are Easily
Fooled: High Confidence Predictions for
Unrecognizable Images, Computer
Vision and Pattern Recognition (CVPR)
2015
Slide 146
Slide 146 text
[Radford15] Radford, Metz, Chintala;
Unsupervised Representation Learning
with Deep Convolutional Generative
Adversarial Networks, arXiv:1511.06434,
2015
Slide 147
Slide 147 text
[Simonyan14] Simonyan and Zisserman; Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014
Slide 148
Slide 148 text
[Zeiler14] Zeiler and Fergus; Visualizing
and understanding convolutional
networks, Computer Vision - ECCV 2014