Slide 1

Slide 1 text

Deep Learning. EuroPython 2016, Bilbao. G. French, University of East Anglia. Image montages from http://www.image-net.org

Slide 2

Slide 2 text

Focus: Mainly image processing

Slide 3

Slide 3 text

This talk is more about the principles and the maths than code. Got to fit this into 1 hour!

Slide 4

Slide 4 text

What we’ll cover

Slide 5

Slide 5 text

Theano: what it is and how it works
What is a neural network? The basic model; the multi-layer perceptron
Convolutional networks: neural networks for computer vision

Slide 6

Slide 6 text

Lasagne: the Lasagne neural network library
Notes for building neural networks: a few tips on building and training neural networks
OxfordNet / VGG and transfer learning: using a convolutional network trained by the VGG group at Oxford University and re-purposing it for your needs

Slide 7

Slide 7 text

Talk materials

Slide 8

Slide 8 text

GitHub repo (originally for PyData London): https://github.com/Britefury/deep-learning-tutorial-pydata2016. The notebooks are viewable on GitHub.

Slide 9

Slide 9 text

Intro to Theano and Lasagne slides: https://speakerdeck.com/britefury https://speakerdeck.com/britefury/intro-to-theano-and-lasagne-for-deep-learning

Slide 10

Slide 10 text

Amazon AMI (use a GPU machine). AMI ID: ami-e0048af7. AMI name: Britefury deep learning - Ubuntu-14.04 Anaconda2-4.0.0 Cuda-7.5 cuDNN-5 Theano-0.8 Lasagne Fuel

Slide 11

Slide 11 text

ImageNet

Slide 12

Slide 12 text

Image classification dataset

Slide 13

Slide 13 text

~1,000,000 images ~1,000 classes Ground truths prepared manually through Amazon Mechanical Turk

Slide 14

Slide 14 text

ImageNet Top-5 challenge: you score if the ground-truth class is one of your top 5 predictions
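
The top-5 criterion is easy to check with NumPy; a minimal sketch (the names are illustrative):

import numpy as np

def top5_correct(class_probs, true_class):
    # Indices of the 5 highest-scoring classes
    top5 = np.argsort(class_probs)[-5:]
    # You score if the ground-truth class is among them
    return true_class in top5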

Slide 15

Slide 15 text

ImageNet in 2012: the best approaches used hand-crafted features (SIFT, HOG, Fisher vectors, etc.) plus a classifier. Top-5 error rate: ~25%

Slide 16

Slide 16 text

Then the game changed.

Slide 17

Slide 17 text

Krizhevsky, Sutskever and Hinton; ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky12] Top-5 error rate of ~15%

Slide 18

Slide 18 text

In the last few years, more modern networks have achieved better results still [Simonyan14, He15a, He15b] Top-5 error rates of ~5-7%

Slide 19

Slide 19 text

I hope this talk will give you an idea of how!

Slide 20

Slide 20 text

Theano

Slide 21

Slide 21 text

Neural network software comes in two flavours: Neural network toolkits Expression compilers

Slide 22

Slide 22 text

Neural network toolkit Specify structure of neural network in terms of layers

Slide 23

Slide 23 text

Expression compilers Lower level Describe the mathematical expressions behind the layers More powerful and flexible

Slide 24

Slide 24 text

Theano An expression compiler

Slide 25

Slide 25 text

Write NumPy-style expressions Compiles to either C (CPU) or CUDA (NVIDIA GPU)

Slide 26

Slide 26 text

Intro to Theano and Lasagne slides: https://speakerdeck.com/britefury https://speakerdeck.com/britefury/intro-to-theano-and-lasagne-for-deep-learning

Slide 27

Slide 27 text

There is much more to Theano For more information: http://deeplearning.net/tutorial http://deeplearning.net/software/theano

Slide 28

Slide 28 text

There are others TensorFlow – developed by Google – is gaining popularity fast

Slide 29

Slide 29 text

What is a neural network?

Slide 30

Slide 30 text

Multiple layers Data propagates through layers Transformed by each layer

Slide 31

Slide 31 text

Neural network image classifier: inputs pass through hidden layers to outputs, a vector of class probabilities (e.g. 0.003, 0.002, 0.005, …, 0.9)

Slide 32

Slide 32 text

Neural network: input layer → hidden layer 0 → hidden layer 1 → ⋯ → output layer

Slide 33

Slide 33 text

Single layer of a neural network: input vector x, weighted connections W, bias b, activation function / non-linearity f, layer activation y

Slide 34

Slide 34 text

x = input (M-element vector); y = output (N-element vector); W = weights parameter (N x M matrix); b = bias parameter (N-element vector); f = non-linearity (a.k.a. activation function), normally ReLU but can be tanh or sigmoid. y = f(Wx + b)

Slide 35

Slide 35 text

In a nutshell: y = f(Wx + b)
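
A minimal NumPy sketch of a single layer, assuming a ReLU non-linearity (the sizes and values are illustrative):

import numpy as np

M, N = 4, 3                       # input and output sizes
x = np.random.randn(M)            # input (M-element vector)
W = np.random.randn(N, M) * 0.01  # weights parameter (N x M matrix)
b = np.zeros(N)                   # bias parameter (N-element vector)

def relu(a):
    return np.maximum(a, 0.0)     # f: the non-linearity

y = relu(np.dot(W, x) + b)        # y = f(Wx + b); an N-element vector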

Slide 36

Slide 36 text

Repeat for each layer: the input vector x goes through f(W_0 x + b_0) to give the hidden layer 0 activation, which goes through f(W_1 y_0 + b_1) to give the hidden layer 1 activation, ⋯ , up to the final layer activation (the output)

Slide 37

Slide 37 text

In mathematical notation: y_0 = f(W_0 x + b_0); y_1 = f(W_1 y_0 + b_1); ⋯ ; y_L = f(W_L y_{L-1} + b_L)

Slide 38

Slide 38 text

As a classifier: the input vector is the image pixels; hidden layer activations follow; the final layer activation uses a softmax non-linearity to produce class probabilities (e.g. 0.003, 0.002, 0.005, …, 0.9)

Slide 39

Slide 39 text

Summary: a neural network is built from layers, each of which is a matrix multiplication, then add bias, then apply a non-linearity.

Slide 40

Slide 40 text

Training a neural network

Slide 41

Slide 41 text

Learn values for the parameters W and b (for each layer) Use back-propagation

Slide 42

Slide 42 text

Initialise weights randomly (more on this later) Initialise biases to 0

Slide 43

Slide 43 text

For each example (x_train, y_train) from the training set: evaluate the network prediction y_pred given the training input x = x_train; measure the cost (error), the difference between y_pred and the ground-truth output y_train

Slide 44

Slide 44 text

Classification (which of these categories best describes this?) Final layer: softmax as the non-linearity f; the output is a vector of class probabilities Cost: negative log-likelihood / categorical cross-entropy

Slide 45

Slide 45 text

Regression (quantify something, real-valued output) Final layer: no non-linearity / identity as f Cost: sum of squared differences
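
As a sketch, here is how those two costs look as Theano expressions (the variable names are illustrative, and the predictions are assumed to come from the final layer):

import theano.tensor as T

# Classification: softmax output + categorical cross-entropy,
# averaged over the samples in the mini-batch
y_pred = T.matrix('y_pred')   # (sample, class) probabilities from the softmax
y_true = T.ivector('y_true')  # ground-truth class indices
clf_cost = T.nnet.categorical_crossentropy(y_pred, y_true).mean()

# Regression: identity output + sum of squared differences per sample
r_pred = T.matrix('r_pred')
r_true = T.matrix('r_true')
reg_cost = T.sqr(r_pred - r_true).sum(axis=1).mean()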

Slide 46

Slide 46 text

Reduce cost (also known as loss) using gradient descent

Slide 47

Slide 47 text

Compute the derivative (gradient) of the cost w.r.t. the parameters (all W and b)

Slide 48

Slide 48 text

Theano performs symbolic differentiation for you! dCdW = theano.grad(cost, W) (other toolkits – such as Torch and TensorFlow – can also do this)

Slide 49

Slide 49 text

Update parameters: W' = W − γ dC/dW; b' = b − γ dC/db; γ = learning rate
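
Putting the last few slides together: a minimal, self-contained Theano sketch of gradient descent on a single-layer softmax classifier (the layer sizes and learning rate are illustrative):

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')    # mini-batch of inputs, one sample per row
t = T.ivector('t')   # ground-truth class indices
W = theano.shared(np.zeros((20, 10), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

y = T.nnet.softmax(T.dot(x, W) + b)                  # predicted class probabilities
cost = T.nnet.categorical_crossentropy(y, t).mean()  # mini-batch cost

gamma = 0.1                   # learning rate
dCdW = theano.grad(cost, W)   # dC/dW (symbolic differentiation)
dCdb = theano.grad(cost, b)   # dC/db

# W' = W - gamma * dC/dW,  b' = b - gamma * dC/db
updates = [(W, W - gamma * dCdW), (b, b - gamma * dCdb)]
train_fn = theano.function([x, t], cost, updates=updates)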

Slide 50

Slide 50 text

Randomly split the training set into mini-batches of ~100 samples. Train on a mini-batch in a single step. The mini-batch cost is the mean of the costs of the samples in the mini-batch.

Slide 51

Slide 51 text

Training on mini-batches means that ~100 samples are processed in parallel – very good for GPUs, which run lots of operations in parallel

Slide 52

Slide 52 text

Training on all examples in the training set is called an epoch Run multiple epochs (often 200-300)

Slide 53

Slide 53 text

Summary: to train a neural network:
Take a mini-batch of training samples
Evaluate (run) the network
Measure the average error/cost across the mini-batch
Use gradient descent to modify the parameters to reduce the cost
Repeat the above until done
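
A sketch of that outer loop in Python, assuming train_fn is a compiled Theano training function like the one above and train_X / train_y are NumPy arrays (the names are illustrative):

import numpy as np

def train(train_fn, train_X, train_y, batch_size=100, num_epochs=300):
    n = train_X.shape[0]
    for epoch in range(num_epochs):
        order = np.random.permutation(n)     # randomly split into mini-batches
        costs = []
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # Train on one mini-batch in a single step
            costs.append(train_fn(train_X[batch], train_y[batch]))
        # Track the mean cost each epoch (watch for NaNs, plateaus, etc.)
        print('epoch {}: mean training cost {:.4f}'.format(epoch, float(np.mean(costs))))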

Slide 54

Slide 54 text

Multi-layer perceptron

Slide 55

Slide 55 text

Simplest network architecture Nothing we haven’t seen so far Uses only fully-connected / dense layers

Slide 56

Slide 56 text

Dense layer: each unit is connected to all units in the previous layer

Slide 57

Slide 57 text

(Obligatory) MNIST example: 2 hidden layers, both 256 units. After 300 iterations over the training set: 1.83% validation error. Architecture: input 784 (28x28 images) → hidden 256 → hidden 256 → output 10

Slide 58

Slide 58 text

MNIST is quite a special case Digits nicely centred within image Scaled to approx. same size

Slide 59

Slide 59 text

The fully connected networks so far have a weakness: No translation invariance; learned features are position dependent

Slide 60

Slide 60 text

For more general imagery: requires a training set large enough to see all features in all possible positions… and a network with enough units to represent this…

Slide 61

Slide 61 text

Convolutional networks

Slide 62

Slide 62 text

Convolution: slide a convolution kernel over an image; multiply image pixels by kernel pixels and sum
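
A naive NumPy sketch of 2D convolution ('valid' mode), just to make the sliding-window arithmetic concrete:

import numpy as np

def convolve2d_valid(image, kernel):
    # Slide the kernel over the image, multiply overlapping pixels and sum
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    flipped = kernel[::-1, ::-1]   # true convolution flips the kernel
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = (image[y:y + kh, x:x + kw] * flipped).sum()
    return out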

Slide 63

Slide 63 text

Convolution Convolutions are often used for feature detection

Slide 64

Slide 64 text

A brief detour…

Slide 65

Slide 65 text

Gabor filters

Slide 66

Slide 66 text

Back on track to… Convolutional networks

Slide 67

Slide 67 text

Recap: FC (fully-connected) layer: input vector x, weighted connections W, bias b, activation function (non-linearity) f, layer activation y

Slide 68

Slide 68 text

Convolutional layer Each unit only connected to units in its neighbourhood

Slide 69

Slide 69 text

Convolutional layer: weights are shared. Red weights have the same value, as do greens… and yellows

Slide 70

Slide 70 text

The values of the weights form a convolution kernel. For practical computer vision, more than one kernel must be used to extract a variety of features

Slide 71

Slide 71 text

Convolutional layer with different weight-kernels: the output is an image with multiple channels

Slide 72

Slide 72 text

Note: each kernel connects to pixels in ALL channels of the previous layer

Slide 73

Slide 73 text

Still y = f(Wx + b), as convolution can be expressed as multiplication by a weight matrix

Slide 74

Slide 74 text

Down-sampling In typical networks for computer vision, we need to shrink the resolution after a layer, by some constant factor Use max-pooling or striding

Slide 75

Slide 75 text

Down-sampling: max-pooling ‘layer’ [Ciresan12] Take the maximum value from each 2 x 2 pooling region (p x p in the general case) Down-samples the image by a factor of p Operates on channels independently

Slide 76

Slide 76 text

Down-sampling: striding Can also down-sample using strided convolution; generate output for 1 in every n pixels Faster, and can work as well as max-pooling
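
In Lasagne, the two down-sampling options look roughly like this (a sketch; the input shape and filter counts are illustrative):

import lasagne

# Mini-batch of 3-channel 64x64 images
lyr = lasagne.layers.InputLayer((None, 3, 64, 64))
lyr = lasagne.layers.Conv2DLayer(lyr, num_filters=32, filter_size=(3, 3))

# Option 1: max-pooling over 2x2 regions; halves the resolution
pooled = lasagne.layers.MaxPool2DLayer(lyr, pool_size=(2, 2))

# Option 2: strided convolution; produces output for 1 in every 2 pixels
strided = lasagne.layers.Conv2DLayer(lyr, num_filters=32, filter_size=(3, 3),
                                     stride=(2, 2))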

Slide 77

Slide 77 text

Example: A Simplified LeNet [LeCun95] for MNIST digits

Slide 78

Slide 78 text

Simplified LeNet for MNIST digits: input 1x28x28 → conv (20 5x5 kernels) → 20x24x24 → max-pool 2x2 → 20x12x12 → conv (50 5x5 kernels) → 50x8x8 → max-pool 2x2 → 50x4x4 → (flatten and) fully connected 256 → fully connected output 10
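
A sketch of this simplified LeNet in Lasagne (the layer sizes follow the slide; the variable name is illustrative):

import lasagne
from lasagne.layers import InputLayer, Conv2DLayer, MaxPool2DLayer, DenseLayer
from lasagne.nonlinearities import softmax

net = InputLayer((None, 1, 28, 28))                          # 1-channel 28x28 images
net = Conv2DLayer(net, num_filters=20, filter_size=(5, 5))   # 20 x 24 x 24
net = MaxPool2DLayer(net, pool_size=(2, 2))                  # 20 x 12 x 12
net = Conv2DLayer(net, num_filters=50, filter_size=(5, 5))   # 50 x 8 x 8
net = MaxPool2DLayer(net, pool_size=(2, 2))                  # 50 x 4 x 4
net = DenseLayer(net, num_units=256)                         # flatten + fully connected
net = DenseLayer(net, num_units=10, nonlinearity=softmax)    # class probabilities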

Slide 79

Slide 79 text

After 300 iterations over the training set: 99.21% validation accuracy.
Model: error
FC64: 2.85%
FC256--FC256: 1.83%
20C5--MP2--50C5--MP2--FC256: 0.79%

Slide 80

Slide 80 text

What about the learned kernels? Image taken from [Krizhevsky12] (ImageNet dataset, not MNIST); the learned kernels resemble Gabor filters

Slide 81

Slide 81 text

Image taken from [Zeiler14]

Slide 82

Slide 82 text

Image taken from [Zeiler14]

Slide 83

Slide 83 text

Lasagne

Slide 84

Slide 84 text

Specifying your network as mathematical expressions is powerful but low-level

Slide 85

Slide 85 text

Lasagne is a neural network library built on Theano Makes building networks with Theano much easier

Slide 86

Slide 86 text

Provides an API for: constructing the layers of a network; getting Theano expressions representing output, loss, etc.
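
A sketch of how that looks in practice, assuming net is the final layer of a classifier (e.g. the LeNet built earlier) and the loss/update choices are illustrative:

import theano
import theano.tensor as T
import lasagne

t = T.ivector('t')   # ground-truth class indices

train_out = lasagne.layers.get_output(net)                      # training-mode output
eval_out = lasagne.layers.get_output(net, deterministic=True)   # e.g. dropout disabled

loss = lasagne.objectives.categorical_crossentropy(train_out, t).mean()

params = lasagne.layers.get_all_params(net, trainable=True)
updates = lasagne.updates.nesterov_momentum(loss, params, learning_rate=0.01)

l_in = lasagne.layers.get_all_layers(net)[0]   # the InputLayer
train_fn = theano.function([l_in.input_var, t], loss, updates=updates)
predict_fn = theano.function([l_in.input_var], eval_out)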

Slide 87

Slide 87 text

Lasagne is quite a thin layer on top of Theano, so understanding Theano is helpful On the plus side, implementing custom layers, loss functions, etc. is quite doable.

Slide 88

Slide 88 text

Intro to Theano and Lasagne slides: https://speakerdeck.com/britefury https://speakerdeck.com/britefury/intro-to-theano-and-lasagne-for-deep-learning

Slide 89

Slide 89 text

Notes for building and training neural networks

Slide 90

Slide 90 text

Neural network architecture (OxfordNet / VGG style)

Slide 91

Slide 91 text

Input: 3 x 224 x 224 (RGB image, zero-mean). Layers: 1 64C3, 2 64C3 MP2, 3 128C3, 4 128C3 MP2. Early part: blocks consisting of a few convolutional layers, often with 3x3 kernels, followed by down-sampling (max-pooling or striding). 64C3 = 3x3 conv, 64 filters; MP2 = max-pooling, 2x2

Slide 92

Slide 92 text

Input: 3 x 224 x 224 (RGB image, zero-mean). Layers: 1 64C3, 2 64C3 MP2, 3 128C3, 4 128C3 MP2. Notation: 64C3 = convolutional layer with 64 3x3 filters; MP2 = max-pooling, 2x2

Slide 93

Slide 93 text

Input: 3 x 224 x 224 (RGB image, zero-mean). Layers: 1 64C3, 2 64C3 MP2, 3 128C3, 4 128C3 MP2. Note: after down-sampling, double the number of convolutional filters

Slide 94

Slide 94 text

Input: 3 x 224 x 224 (RGB image, zero-mean). Layers: 1 64C3, 2 64C3 MP2, 3 128C3, 4 128C3 MP2, then FC256, FC10. Later part: after the blocks of convolutional and down-sampling layers come fully-connected (a.k.a. dense) layers

Slide 95

Slide 95 text

Input: 3 x 224 x 224 (RGB image, zero-mean). Layers: 1 64C3, 2 64C3 MP2, 3 128C3, 4 128C3 MP2, then FC256, FC10. Notation: FC256 = fully-connected layer with 256 channels

Slide 96

Slide 96 text

Input: 3 x 224 x 224 (RGB image, zero-mean). Layers: 1 64C3, 2 64C3 MP2, 3 128C3, 4 128C3 MP2, then FC256, FC10. Overall: the convolutional layers detect features in various positions throughout the image

Slide 97

Slide 97 text

Input: 3 x 224 x 224 (RGB image, zero-mean). Layers: 1 64C3, 2 64C3 MP2, 3 128C3, 4 128C3 MP2, then FC256, FC10. Overall: the fully-connected / dense layers use the features detected by the convolutional layers to produce the output

Slide 98

Slide 98 text

Could also look at architectures developed by others, e.g. Inception by Google or ResNets by Microsoft, for inspiration

Slide 99

Slide 99 text

Batch normalization

Slide 100

Slide 100 text

Batch normalization [Ioffe15] is recommended in most cases Necessary for deeper networks (> 8 layers)

Slide 101

Slide 101 text

Speeds up training: the cost drops faster per epoch, although each epoch takes longer (~2x in my experience) Can also reach lower error rates

Slide 102

Slide 102 text

Layers can magnify or shrink the magnitudes of values. Multiple layers can result in an exponential increase/decrease. Batch normalisation maintains a constant scale throughout the network

Slide 103

Slide 103 text

Insert into convolutional and fully-connected layers after the matrix multiplication/convolution and before the non-linearity

Slide 104

Slide 104 text

Lasagne batch normalization inserts itself into a layer before the non-linearity, so it's nice and easy to use: lyr = lasagne.layers.batch_norm(lyr)

Slide 105

Slide 105 text

DropOut

Slide 106

Slide 106 text

Normally necessary for training (turned off at predict/test time) Reduces over-fitting

Slide 107

Slide 107 text

Over-fitting is a well-known problem in machine learning that affects neural networks particularly. A model over-fits when it is very good at correctly predicting samples in the training set but fails to generalise to samples outside it

Slide 108

Slide 108 text

DropOut [Hinton12] During training, randomly choose units to ‘drop out’ by setting their output to 0, with probability p, usually around 0.5 (compensate by multiplying the remaining values by 1/(1-p))

Slide 109

Slide 109 text

During test/predict: Run as normal (DropOut turned off)

Slide 110

Slide 110 text

Normally applied after the later, fully connected layers:
lyr = lasagne.layers.DenseLayer(lyr, num_units=256)
lyr = lasagne.layers.DropoutLayer(lyr, p=0.5)

Slide 111

Slide 111 text

Dropout OFF Input layer Hidden layer 0 Output layer

Slide 112

Slide 112 text

Dropout ON (1) Input layer Hidden layer 0 Output layer

Slide 113

Slide 113 text

Dropout ON (2) Input layer Hidden layer 0 Output layer

Slide 114

Slide 114 text

Turning on a different subset of units for each sample causes units to learn more robust features that cannot rely on the presence of other specific features to cover for flaws

Slide 115

Slide 115 text

Dataset augmentation

Slide 116

Slide 116 text

Reduce over-fitting by enlarging the training set: artificially modify existing training samples to make new ones

Slide 117

Slide 117 text

For images: Apply transformations such as move, scale, rotate, reflect, etc.
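
A sketch of simple augmentation for a batch of images using only NumPy flips and shifts (the shift amount and the use of np.roll, which wraps around the edges, are illustrative choices):

import numpy as np

def augment_batch(images, max_shift=2):
    # images: array of shape (sample, channel, height, width)
    out = images.copy()
    for i in range(len(out)):
        # Random horizontal reflection
        if np.random.rand() < 0.5:
            out[i] = out[i][:, :, ::-1]
        # Random shift of up to max_shift pixels in each direction
        dy, dx = np.random.randint(-max_shift, max_shift + 1, size=2)
        out[i] = np.roll(np.roll(out[i], dy, axis=1), dx, axis=2)
    return out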

Slide 118

Slide 118 text

Data standardisation

Slide 119

Slide 119 text

Neural networks train more effectively when the training data has: zero mean, unit variance

Slide 120

Slide 120 text

Standardise the input data In the case of regression, standardise the output data too (don’t forget to invert the standardisation of the network predictions!)

Slide 121

Slide 121 text

Standardisation Extract the samples into an array In the case of images, extract all pixels from all samples, keeping the R, G & B channels separate Compute the distribution and standardise

Slide 122

Slide 122 text

Either: zero the mean and scale the std-dev to 1, per channel (RGB for images): x' = (x − μ) / σ
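
A sketch of per-channel standardisation for an image array of shape (sample, channel, height, width); remember to apply the same mean and std-dev to validation/test data:

import numpy as np

def standardise(images):
    # Zero the mean and scale the std-dev to 1, per channel: x' = (x - mu) / sigma
    mean = images.mean(axis=(0, 2, 3), keepdims=True)
    std = images.std(axis=(0, 2, 3), keepdims=True)
    return (images - mean) / std, mean, std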

Slide 123

Slide 123 text

When training goes wrong and what to look for

Slide 124

Slide 124 text

Loss becomes NaN (ensure you track the loss after each epoch so you can watch for this!)

Slide 125

Slide 125 text

Classification error rate is equivalent to a random guess (it’s not learning)

Slide 126

Slide 126 text

Learns to predict a constant value; it optimises that constant value for the best loss A constant value is a local minimum that the network won’t get out of (neural networks ‘cheat’ like crazy!)

Slide 127

Slide 127 text

Neural networks (most) often DON’T learn what you want or expect them to

Slide 128

Slide 128 text

Local minima will be the bane of your existence

Slide 129

Slide 129 text

Designing a computer vision pipeline

Slide 130

Slide 130 text

Simple problems may be solved with just a neural network

Slide 131

Slide 131 text

Not sufficient for more complex problems (neural networks aren’t a silver bullet; don’t believe the hype)

Slide 132

Slide 132 text

Theoretically possible to use a single network for a complex problem, if you have enough training data (often an impractical amount)

Slide 133

Slide 133 text

For more complex problems, the problem should be broken down

Slide 134

Slide 134 text

Example Identifying right whales, by Felix Lau 2nd place in the Kaggle competition http://felixlaumon.github.io/2015/01/08/kaggle-right-whale.html

Slide 135

Slide 135 text

Identifying right whales, by Felix Lau The first naïve solution – training a classifier to identify individuals – did not work well

Slide 136

Slide 136 text

A region-based saliency map revealed that the network had ‘locked on’ to features in the ocean shape rather than the whales

Slide 137

Slide 137 text

Lau’s solution: Train a localiser neural network to locate the whale in the image

Slide 138

Slide 138 text

Lau’s solution: Train a keypoint finder neural network to locate two keypoints on the whale’s head to identify its orientation

Slide 139

Slide 139 text

Lau’s solution: Train classifier neural network on oriented and cropped whale head images

Slide 140

Slide 140 text

OxfordNet / VGG and transfer learning

Slide 141

Slide 141 text

Using a pre-trained network

Slide 142

Slide 142 text

Use Oxford VGG-19, the 19-layer model: a 1000-class image classifier trained on ImageNet

Slide 143

Slide 143 text

Can download CC-licensed weights (in Caffe format) from: http://www.robots.ox.ac.uk/~vgg/research/very_deep/ The GitHub repo contains code that downloads a Python version from: http://s3.amazonaws.com/lasagne/recipes/pretrained/imagenet/vgg19.pkl
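
Loading the downloaded weights into a Lasagne network looks roughly like this (a sketch: build_vgg19() is a hypothetical helper that constructs the 19-layer architecture, and the 'param values' key follows the Lasagne Recipes convention):

import pickle
import lasagne

net = build_vgg19()   # hypothetical helper returning the final layer of VGG-19

with open('vgg19.pkl', 'rb') as f:
    model = pickle.load(f)   # under Python 3 you may need encoding='latin1'

# Copy the pre-trained weights into the Lasagne layers
lasagne.layers.set_all_param_values(net, model['param values'])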

Slide 144

Slide 144 text

VGG models are simple but effective They consist of: 3x3 convolutions, 2x2 max pooling, fully-connected layers

Slide 145

Slide 145 text

Input: 3 x 224 x 224 (RGB image, zero-mean)
1: 64C3
2: 64C3, MP2
3: 128C3
4: 128C3, MP2
5: 256C3
6: 256C3
7: 256C3
8: 256C3, MP2
9: 512C3
10: 512C3
11: 512C3
12: 512C3, MP2
13: 512C3
14: 512C3
15: 512C3
16: 512C3, MP2
17: FC4096 (dropout 50%)
18: FC4096 (dropout 50%)
19: FC1000, soft-max

Slide 146

Slide 146 text

Exercise / Demo Classifying an image with VGG-19

Slide 147

Slide 147 text

Transfer learning (network re-use)

Slide 148

Slide 148 text

Training a neural network is notoriously data-hungry Preparing training data with ground truths is expensive and time-consuming

Slide 149

Slide 149 text

What if we don’t have enough training data to get good results?

Slide 150

Slide 150 text

The ImageNet dataset is huge; millions of images with ground truths What if we could somehow use it to help us with a different task?

Slide 151

Slide 151 text

Good news: we can!

Slide 152

Slide 152 text

Transfer learning Re-use part (often most) of a pre-trained network for a new task

Slide 153

Slide 153 text

Example; can re-use part of VGG-19 net for: Classifying images with classes that weren’t part of the original ImageNet dataset

Slide 154

Slide 154 text

Example; can re-use part of VGG-19 net for: Localisation (find location of object in image) Segmentation (find exact boundary around object in image)

Slide 155

Slide 155 text

Transfer learning: how to Take existing network such as VGG-19

Slide 156

Slide 156 text

Input: 3 x 224 x 224 (RGB image, zero-mean)
1: 64C3
2: 64C3, MP2
3: 128C3
4: 128C3, MP2
5: 256C3
6: 256C3
7: 256C3
8: 256C3, MP2
9: 512C3
10: 512C3
11: 512C3
12: 512C3, MP2
13: 512C3
14: 512C3
15: 512C3
16: 512C3, MP2
17: FC4096 (drop 50%)
18: FC4096 (drop 50%)
19: FC1000, soft-max

Slide 157

Slide 157 text

Layers 9-16: 9 512C3, 10 512C3, 11 512C3, 12 512C3 MP2, 13 512C3, 14 512C3, 15 512C3, 16 512C3 MP2. Remove the last layers, e.g. the fully-connected ones (just 17, 18, 19; layers 1-8 are hidden here for brevity!)

Slide 158

Slide 158 text

Layers 9-16 as before, then: 17 FC1024 (drop 50%), 18 FC21 soft-max. Build new, randomly initialised layers to replace them (the number of layers created and their sizes are only for illustration here)

Slide 159

Slide 159 text

Transfer learning: training Train the network with your training data, only learning parameters for the new layers

Slide 160

Slide 160 text

Transfer learning: fine-tuning After learning parameters for the new layers, fine-tune by learning parameters for the whole network to get better accuracy
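
A sketch of how this two-stage training might look in Lasagne (new_fc and new_out are assumed to be the freshly added layers, loss an already-defined Theano expression; the learning rates are illustrative):

import lasagne

# Stage 1: update only the parameters of the new layers;
# the pre-trained layers below them are left untouched
new_params = new_fc.get_params(trainable=True) + new_out.get_params(trainable=True)
updates = lasagne.updates.nesterov_momentum(loss, new_params, learning_rate=0.01)

# Stage 2 (fine-tuning): update every parameter in the whole network,
# typically with a smaller learning rate
all_params = lasagne.layers.get_all_params(new_out, trainable=True)
fine_tune_updates = lasagne.updates.nesterov_momentum(loss, all_params,
                                                      learning_rate=0.0001)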

Slide 161

Slide 161 text

Result A nice shiny network with good performance, trained with much less of our own training data

Slide 162

Slide 162 text

Some cool work in the field that might be of interest

Slide 163

Slide 163 text

Visualizing and understanding convolutional networks [Zeiler14] Visualisations of responses of layers to images

Slide 164

Slide 164 text

Visualizing and understanding convolutional networks [Zeiler14] Image taken from [Zeiler14]

Slide 165

Slide 165 text

Visualizing and understanding convolutional networks [Zeiler14] Image taken from [Zeiler14]

Slide 166

Slide 166 text

Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen15] Generate images that are unrecognizable to human eyes but are recognized by the network

Slide 167

Slide 167 text

Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen15] Image taken from [Nguyen15]

Slide 168

Slide 168 text

Learning to generate chairs with convolutional neural networks [Dosovitskiy15] The network runs in reverse: orientation, design, colour, etc. parameters as input; rendered images as output

Slide 169

Slide 169 text

Learning to generate chairs with convolutional neural networks [Dosovitskiy15] Image taken from [Dosovitskiy15]

Slide 170

Slide 170 text

A Neural Algorithm of Artistic Style [Gatys15] Take an OxfordNet model [Simonyan14] and extract texture features from one of the convolutional layers, given a target style / painting as input Use gradient descent to iterate the photo – not the weights – so that its texture features match those of the target image.

Slide 171

Slide 171 text

A Neural Algorithm of Artistic Style [Gatys15] Image taken from [Gatys15]

Slide 172

Slide 172 text

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Nets [Radford15] Train two networks: one is given random parameters and generates an image; the other discriminates between a generated image and one from the training set

Slide 173

Slide 173 text

Generative Adversarial Nets [Radford15] Images of bedrooms generated using neural net Image taken from [Radford15]

Slide 174

Slide 174 text

Generative Adversarial Nets [Radford15] Image taken from [Radford15]

Slide 175

Slide 175 text

Hope you’ve found it helpful!

Slide 176

Slide 176 text

Thank you!

Slide 177

Slide 177 text

References

Slide 178

Slide 178 text

[Dosovitskiy15] Dosovitskiy, Springenberg and Brox; Learning to generate chairs with convolutional neural networks, arXiv preprint, 2015

Slide 179

Slide 179 text

[Gatys15] Gatys, Ecker and Bethge; A Neural Algorithm of Artistic Style, arXiv:1508.06576, 2015

Slide 180

Slide 180 text

[He15a] He, Zhang, Ren and Sun; Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, arXiv 2015

Slide 181

Slide 181 text

[He15b] He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015).

Slide 182

Slide 182 text

[Hinton12] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov; Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Slide 183

Slide 183 text

[Ioffe15] Ioffe, S.; Szegedy C.. (2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". ICML 2015, arXiv:1502.03167

Slide 184

Slide 184 text

[Jones87] Jones, J.P.; Palmer, L.A. (1987). "An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex". J. Neurophysiol 58 (6): 1233–1258

Slide 185

Slide 185 text

[Lin13] Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).

Slide 186

Slide 186 text

[Nesterov83] Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/sqr(k)). Soviet Mathematics Doklady, 27:372–376 (1983).

Slide 187

Slide 187 text

[Radford15] Radford, Metz, Chintala; Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, arXiv:1511.06434, 2015

Slide 188

Slide 188 text

[Sutskever13] Sutskever, Ilya, et al. On the importance of initialization and momentum in deep learning. Proceedings of the 30th international conference on machine learning (ICML-13). 2013.

Slide 189

Slide 189 text

[Simonyan14] K. Simonyan and A. Zisserman; Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014

Slide 190

Slide 190 text

[Wang14] Wang, Dan, and Yi Shang. "A new active labeling method for deep learning."Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014.

Slide 191

Slide 191 text

[Zeiler14] Zeiler and Fergus; Visualizing and understanding convolutional networks, Computer Vision - ECCV 2014