Deep Learning - Advanced Techniques Tutorial - PyData Amsterdam 2017

Slides from my PyData Amsterdam 2017 tutorial on deep learning.

Britefury

April 07, 2017

Transcript

  1. Deep Learning Tutorial Advanced Techniques PyData Amsterdam 2017 G. French

    – University of East Anglia Image montages from http://www.image-net.org
  2. Theano What it is and how it works Review: Multi-layer

    perceptron The basic model Convolutional networks Neural networks for computer vision
  3. Lasagne and VGG-16 Explain Lasagne and use it with a

    convolutional network trained by the VGG group at Oxford University Transfer learning Re-using pre-trained networks Deep learning tricks of the trade tips to save you some time
  4. Amazon AMI (Use GPU machine) AMI ID: ami-5f789e32 AMI Name:

    PyData London 2016 deep learning adv tutorial - Ubuntu-14.04 Anaconda2-4.0.0 Cuda-7.5 cuDNN-5 Theano-0.8 Lasagne Fuel
  5. Neural network software exists on a spectrum (API style): Neural network

    toolkits - higher level, specify the network in terms of layers; Expression compilers - flexible and powerful, specify the network in terms of mathematical expressions
  6. Neural network software exists on a spectrum (debugging): Neural network

    toolkits - less to debug, so easier; Expression compilers - depends on the toolkit (can end up with confusing errors in compiled code, e.g. Theano)
  7. Neural network software exists on a spectrum (examples): CAFFE, Theano,

    Tensorflow, Torch, ranging from neural network toolkits to expression compilers
  8. x = input (M-element vector); y = output (N-element vector); W = weights

    parameter (N x M matrix); b = bias parameter (N-element vector); f = activation function, normally ReLU but can be tanh or sigmoid. Layer output: y = f(Wx + b)
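
To make the layer formula concrete, here is a minimal NumPy sketch of y = f(Wx + b); the sizes and random weights are illustrative only, not values from the tutorial.

    import numpy as np

    M, N = 784, 256                                       # illustrative sizes
    x = np.random.randn(M).astype(np.float32)             # input: M-element vector
    W = np.random.randn(N, M).astype(np.float32) * 0.01   # weights: N x M matrix
    b = np.zeros(N, dtype=np.float32)                     # bias: N-element vector

    def relu(a):                                          # activation function f
        return np.maximum(a, 0.0)

    y = relu(W.dot(x) + b)                                # output: N-element vector
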
  9. Repeat for each layer: Input vector x; Hidden layer 0

    activation y_0 = f(W_0 x + b_0); Hidden layer 1 activation y_1 = f(W_1 y_0 + b_1); ⋯ Final layer activation (output) y_L = f(W_L y_{L-1} + b_L)
  10. To train the network: Compute the derivative of cost w.r.t.

    parameters (W and b) (More on the cost/loss later)
  11. Update parameters: W' = W - γ ∂C/∂W, b' =

    b - γ ∂C/∂b, where γ = learning rate and C = cost
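
As a hedged illustration of the update rule, here is a minimal Theano sketch for a single-layer softmax classifier; variable names and sizes are illustrative, not the tutorial's own code.

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.matrix('x')                                  # batch of input vectors
    t = T.ivector('t')                                 # integer class targets
    W = theano.shared(np.zeros((784, 10), dtype=np.float32), name='W')
    b = theano.shared(np.zeros(10, dtype=np.float32), name='b')

    p = T.nnet.softmax(T.dot(x, W) + b)
    cost = T.nnet.categorical_crossentropy(p, t).mean()

    gamma = 0.01                                       # learning rate
    dW, db = T.grad(cost, [W, b])                      # dC/dW, dC/db
    updates = [(W, W - gamma * dW),                    # W' = W - gamma * dC/dW
               (b, b - gamma * db)]                    # b' = b - gamma * dC/db
    train_step = theano.function([x, t], cost, updates=updates)
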
  12. (Obligatory) MNIST example: 2 hidden layers, both 256 units; after

    300 iterations over training set: 1.83% validation error. Layers: Input 784 (28x28 images), Hidden 256, Hidden 256, Output 10
  13. The fully connected networks so far have a weakness: No

    translation invariance; learned features are position dependent
  14. For more general imagery: requires a training set large enough

    to see all features in all possible positions… Requires network with enough units to represent this…
  15. The final non-linearity and the corresponding loss function are important

    Their choice depends on the desired output of the network
  16. Let's take a neural network: Input vector x; Hidden

    layer 0 activation y_0 = f(W_0 x + b_0); Hidden layer 1 activation y_1 = f(W_1 y_0 + b_1); ⋯ Final layer activation (output) y_L = f(W_L y_{L-1} + b_L)
  17. Drive its activation function directly (no matrix multiplication): Final layer

    activation (output) y = f(z). We are going to ‘simulate’ the network up to that point by providing values for the logits z directly
  18. Create logit values: keep the logit for class 0

    constant at 0.0; vary the logit for class 1 from -5.0 to 5.0 in steps of 1.0
  19. Let's add the predicted probabilities generated by softmax, p_i =

    exp(z_i) / Σ_j exp(z_j):
    z_0    z_1    p_0        p_1
    0.0   -5.0   0.9933072  0.0066929
    0.0   -4.0   0.9820138  0.0179862
    0.0   -3.0   0.9525741  0.0474259
    0.0   -2.0   0.8807970  0.1192029
    0.0   -1.0   0.7310586  0.2689414
    0.0    0.0   0.5000000  0.5000000
    0.0    1.0   0.2689414  0.7310586
    0.0    2.0   0.1192029  0.8807971
    0.0    3.0   0.0474259  0.9525741
    0.0    4.0   0.0179862  0.9820138
    0.0    5.0   0.0066929  0.9933072
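
The table above can be reproduced with a few lines of NumPy; this sketch also prints the negative log-loss that the next slide plots, assuming class 1 is the target class.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())               # subtract max for numerical stability
        return e / e.sum()

    for z1 in np.arange(-5.0, 6.0, 1.0):
        p = softmax(np.array([0.0, z1]))      # logit for class 0 fixed at 0.0
        nll = -np.log(p[1])                   # negative log-loss if class 1 is the true class
        print('%5.1f  %.7f  %.7f  %.4f' % (z1, p[0], p[1], nll))
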
  20. Add negative log-loss: Note: the loss is high when the

    logit for the correct class is negative, and tends to 0 when it is positive
  21. Learning occurs via gradient descent Note: the gradient is in range [-1,

    0]; it is negative when the logit for the correct class is negative, and when that logit is positive the gradient is close to 0 (correct answer, not much learning to do)
  22. Probability regression Use for predicting single value in [0, 1]

    range Can use for binary classification Could use to generate mask for an image
  23. In general When network gives correct answer: Gradient of the

    loss will be near 0 (no more learning)
  24. In general When network gives incorrect answer: Gradient of loss

    will be of magnitude 1 pushing the network to learn
  25. A gradient of 1 The gradient of the final non-linearity

    + loss function having a magnitude of 1 is worth consideration
  26. A gradient of 1 They won’t scale up or down

    the gradient that is back-propagated throughout the earlier parts of the network
  27. A gradient of 1 Keeping the gradient magnitudes sane is

    important (*) Hence the success of weight initialisation schemes and batch normalisation * Particularly for GANs
  28. Regression with squared error loss Loss: squared error; ŷ =

    prediction, y = true value: ℓ(y, ŷ) = (y - ŷ)²
  29. Regression with squared error loss plot The gradient can have

    a large magnitude when ŷ and y differ greatly
  30. Regression with Huber loss Uses squared error when difference between

    prediction and target is small, absolute error otherwise Used in [Girshick15]
  31. Regression with Huber loss Loss: Huber loss; ŷ = prediction,

    y = true value: h(y, ŷ) = ½(ŷ - y)² if |ŷ - y| ≤ 1, otherwise |ŷ - y| - ½
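
A minimal NumPy sketch of the Huber loss above, with the threshold fixed at 1 as on the slide:

    import numpy as np

    def huber(y_true, y_pred):
        d = np.abs(y_pred - y_true)
        # quadratic near zero, linear for large errors, so the gradient
        # magnitude never exceeds 1
        return np.where(d <= 1.0, 0.5 * d ** 2, d - 0.5)
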
  32. In summary: Final non-linearity and loss function depend on: What

    you want your network to generate Effects on gradients can be worth considering
  33. Recap: FC (fully-connected) layer, y = f(Wx + b): input vector, weighted connections, bias,

    activation function (non-linearity), layer activation
  34. The values of the weights form a convolution kernel For

    practical computer vision, more than one kernel must be used to extract a variety of features
  35. Still y = f(Wx + b), as convolution can be expressed

    as multiplication by a weight matrix
  36. Another way of looking at it: A single kernel of

    an e.g. 5x5 convolutional layer is a bit like…
  37. a fully-connected layer with a 5x5 input image, repeated across the whole

    image, with a new ‘fully-connected layer’ for each filter
  38. Down-sampling: max-pooling ‘layer’ [Ciresan12] Take maximum value from each 2

    x 2 pooling region (p x p in the general case) Down-samples the image by a factor of p Operates on channels independently
  39. Down-sampling: striding Only retain 1 pixel in every p; skip

    the rest Often fast; it is built into the convolution operation of many neural network libraries
  40. Simplified LeNet for MNIST digits: Input 1 x 28 x 28 ->

    Conv: 20 5x5 kernels -> 20 x 24 x 24 -> Maxpool 2x2 -> 20 x 12 x 12 -> Conv: 50 5x5 kernels -> 50 x 8 x 8 -> Maxpool 2x2 -> 50 x 4 x 4 -> (flatten and) fully connected -> 256 -> fully connected -> Output 10
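
A sketch of the simplified LeNet above expressed with Lasagne layers (shapes in the comments follow the slide; this is an illustration, not the tutorial's exact code):

    from lasagne.layers import InputLayer, Conv2DLayer, MaxPool2DLayer, DenseLayer
    from lasagne.nonlinearities import softmax

    l_in = InputLayer((None, 1, 28, 28))                        # 1 x 28 x 28 input
    l = Conv2DLayer(l_in, num_filters=20, filter_size=(5, 5))   # -> 20 x 24 x 24
    l = MaxPool2DLayer(l, pool_size=(2, 2))                     # -> 20 x 12 x 12
    l = Conv2DLayer(l, num_filters=50, filter_size=(5, 5))      # -> 50 x 8 x 8
    l = MaxPool2DLayer(l, pool_size=(2, 2))                     # -> 50 x 4 x 4
    l = DenseLayer(l, num_units=256)                            # flatten + FC 256
    l_out = DenseLayer(l, num_units=10, nonlinearity=softmax)   # 10-way soft-max output
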
  41. After 300 iterations over training set: 99.21% validation accuracy. Model /

    Error: FC64 2.85%; FC256--FC256 1.83%; 20C5--MP2--50C5--MP2--FC256 0.79%
  42. Provides API for: constructing layers of a network getting Theano

    expressions representing output, loss, etc.
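
For example, given a network such as the LeNet sketch above (with input layer l_in and final layer l_out), Lasagne yields Theano expressions that can then be compiled; a hedged sketch, not the tutorial's exact code:

    import theano
    import theano.tensor as T
    import lasagne

    y = T.ivector('y')                                   # target classes
    prediction = lasagne.layers.get_output(l_out)        # Theano expression for the output
    loss = lasagne.objectives.categorical_crossentropy(prediction, y).mean()
    params = lasagne.layers.get_all_params(l_out, trainable=True)
    updates = lasagne.updates.adam(loss, params, learning_rate=1e-3)
    train_fn = theano.function([l_in.input_var, y], loss, updates=updates)
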
  43. Lasagne is quite a thin layer on top of Theano,

    so understanding Theano is helpful. On the plus side, implementing custom layers, loss functions, etc. is quite doable.
  44. An aside note about other libraries Definitely check out Keras

    http://keras.io Works on both Theano and Tensorflow
  45. An aside note about other libraries Keras API can be

    simpler to get to grips with than Lasagne Lots of cool examples
  46. Notation: 64C3 = convolutional layer with 64 3x3 filters; MP2 = max-pooling,

    2x2. Layers: Input 3 x 224 x 224 (RGB image, zero-mean); 1: 64C3; 2: 64C3, MP2; 3: 128C3; 4: 128C3, MP2; 5: 256C3; 6: 256C3; 7: 256C3, MP2
  47. Notation: FC4096 = fully-connected layer, 4096 channels; drop 50% = 50%

    drop-out during training. Layers: 8: 512C3; 9: 512C3; 10: 512C3, MP2; 11: 512C3; 12: 512C3; 13: 512C3, MP2; 14: FC4096 (drop 50%); 15: FC4096 (drop 50%); 16: FC1000, soft-max
  48. Layers: Input 3 x 224 x 224 (RGB image,

    zero-mean); 1: 64C3; 2: 64C3, MP2; 3: 128C3; 4: 128C3, MP2; 5: 256C3; 6: 256C3; 7: 256C3, MP2; 8: 512C3; 9: 512C3; 10: 512C3, MP2; 11: 512C3; 12: 512C3; 13: 512C3, MP2; 14: FC4096 (drop 50%); 15: FC4096 (drop 50%); 16: FC1000, soft-max
  49. These kinds of architectures tend to work well: Small convolution

    kernels (3x3) Interspersed with max-pooling
  50. For lower levels, this involves repeating many of the same

    computations, getting the same result
  51. If we could apply the first convolutional layer across the

    whole image rather than many 224x224 blocks, we could re-use those computations…
  52. Then we could also do this for the rest of

    the convolutional layers further down…
  53. In fact we can use the whole network in a

    convolutional fashion; we just need to convert the fully-connected layers to convolutional layers.
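
One way to do that conversion in Lasagne is to reshape a fully-connected layer's weight matrix into convolution kernels. The sketch below is an assumption-laden illustration (fc is a hypothetical DenseLayer with 4096 units sitting on a 512 x 7 x 7 feature map, prev_layer the layer beneath it), not code from the tutorial:

    from lasagne.layers import Conv2DLayer

    W_dense = fc.W.get_value()                          # shape (512*7*7, 4096)
    W_conv = W_dense.T.reshape((4096, 512, 7, 7))       # 4096 kernels of size 512 x 7 x 7
    l_conv = Conv2DLayer(prev_layer, num_filters=4096, filter_size=(7, 7),
                         W=W_conv, b=fc.b.get_value(),
                         nonlinearity=fc.nonlinearity,
                         flip_filters=False)            # correlation, so weights line up with the FC layer
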
  54. This is a trick used when doing image segmentation, when

    we want to determine which parts of an image belong to which class At both training time and prediction time
  55. Training a neural network is notoriously data-hungry Preparing training data

    with ground truths is expensive and time consuming
  56. The ImageNet dataset is huge; millions of images with ground

    truths What if we could somehow use it to help us with a different task?
  57. Example; can re-use part of VGG-16 net for: Classifying images

    with classes that weren’t part of the original ImageNet dataset
  58. Example; can re-use part of VGG-16 net for: Localisation (find

    location of object in image) Segmentation (find exact boundary around object in image)
  59. Layers: Input 3 x 224 x 224 (RGB image,

    zero-mean); 1: 64C3; 2: 64C3, MP2; 3: 128C3; 4: 128C3, MP2; 5: 256C3; 6: 256C3; 7: 256C3, MP2; 8: 512C3; 9: 512C3; 10: 512C3, MP2; 11: 512C3; 12: 512C3; 13: 512C3, MP2; 14: FC4096 (drop 50%); 15: FC4096 (drop 50%); 16: FC1000, soft-max
  60. Remove the last layers, e.g. the fully-connected ones (just 14, 15, 16;

    those in the left box are hidden here for brevity!) Layers: 8: 512C3; 9: 512C3; 10: 512C3, MP2; 11: 512C3; 12: 512C3; 13: 512C3, MP2
  61. Build new, randomly initialised layers to replace them (the number

    of layers created and their size are only for illustration here) Layers: 8: 512C3; 9: 512C3; 10: 512C3, MP2; 11: 512C3; 12: 512C3; 13: 512C3, MP2; FC1024 (drop 50%); FC21, soft-max
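
A sketch of building those new, randomly initialised layers on top of the pre-trained network, using the vgg16.network['pool5'] layer that appears in the code on the following slides (the 1024/21 sizes are the illustrative ones from the slide):

    from lasagne.layers import DenseLayer, DropoutLayer
    from lasagne.nonlinearities import softmax

    l = vgg16.network['pool5']                    # top pre-trained layer
    l = DenseLayer(l, num_units=1024)             # FC1024, randomly initialised
    l = DropoutLayer(l, p=0.5)                    # drop 50% during training
    final_layer = DenseLayer(l, num_units=21, nonlinearity=softmax)   # FC21 soft-max
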
  62. Let's start with some code to get the pre-trained

    and new layer parameters separately
  63. # Get all parameters in network; `get_all_params` works backward

    # from the final layer through the network
    all_params = lasagne.layers.get_all_params(final_layer, trainable=True)

    # Get parameters from pre-trained layers; give the top pre-trained layer
    pretrained_params = lasagne.layers.get_all_params(
        vgg16.network['pool5'], trainable=True)

    new_params = [p for p in all_params if p not in pretrained_params]
  64. Transfer learning: train new layers Train the network with your

    training data, only learning parameters for the new layers
  65. Transfer learning: fine-tuning Learn parameters for new layers Fine-tuning: learn

    parameters for pre-trained layers using a lower learning rate; normally 1/10th
  66. # Update new layers with standard learning rate

    new_updates = lasagne.updates.adam(
        training_loss, new_params, learning_rate=lr)

    # Update pre-trained layers with a reduced learning rate
    pretrained_updates = lasagne.updates.adam(
        training_loss, pretrained_params, learning_rate=lr * 0.1)

    # Combine updates
    updates = new_updates.copy()
    updates.update(pretrained_updates)
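
The combined update dictionary can then be handed to theano.function; a short sketch, assuming X_var and y_var are the Theano input and target variables that training_loss was built from:

    import theano

    # compile the training function using the combined update dictionary
    train_fn = theano.function([X_var, y_var], training_loss, updates=updates)
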
  67. Validation score does not decrease monotonically, so the final score

    is not necessarily the best Dogs vs cats, with transfer learning and data augmentation
  68. Sometimes validation score can start getting gradually worse as the

    network overfits Once you reach this point, there’s no point training any more
  69. Geometric patience Set the patience (index of epoch that you

    are prepared to wait until) to be the number of epochs elapsed so far times some multiple e.g. 2
  70. Geometric patience An example can be seen in the Logistic

    Regression and MLP Theano tutorials: http://deeplearning.net/tutorial/mlp.html
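
A minimal sketch of a geometric-patience training loop (not the Theano tutorial's exact code; train_one_epoch, compute_validation_error and save_network_params are assumed helper functions):

    best_val_err = float('inf')
    patience = 10            # initial number of epochs we are prepared to wait
    patience_factor = 2.0    # the multiple mentioned on the slide

    epoch = 0
    while epoch < patience:
        train_one_epoch()
        val_err = compute_validation_error()
        if val_err < best_val_err:
            best_val_err = val_err
            # extend the patience to a multiple of the epochs elapsed so far
            patience = max(patience, int((epoch + 1) * patience_factor))
            save_network_params()     # keep the best model seen so far
        epoch += 1
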
  71. Small mini-batches Maybe around ~8 Good but slower training Small

    mini-batch results in regularization (due to noise), reaching lower error rates in the end [Goodfellow16]. When using very small mini-batches, you need to compensate with a lower learning rate and more epochs. Slow due to low parallelism Does not use all cores of GPU Low memory usage Fewer neuron activations kept in RAM
  72. Large mini-batches 1000s Ineffective training Won’t reach the same error

    rate as with smaller batches and may not learn at all. Can be fast due to high parallelism Uses GPU parallelism (there are limits; gains only achievable if there are unused CUDA cores) High memory usage Lots of neuron activations kept around; can run out of RAM on large networks
  73. Happy medium (where you want to be) Maybe around 64-256,

    lots of experiments use ~100 Effective training Learns reasonably quickly – in terms of improvement per epoch – and reaches acceptable error rate or loss Medium performance Acceptable in many cases Medium memory usage Fine for modest sized networks
  74. Increasing mini-batch size will improve performance up to the point

    where all GPU units are in use Increasing it further will not improve performance; it will reduce accuracy
  75. Caveat When working in a convolutional fashion - like the

    example of using VGG-net to find the peacock - or when doing image segmentation
  76. In such cases, pushing large patches of an image through

    as a single batch along with a correspondingly large output patch re-uses data due to convolutions and results in substantial savings
  77. My experience: Use patches that are as large as possible

    Although it’s a tricky balance with accuracy of the final result
  78. Batch normalization [Ioffe15] is recommended in most cases Speeds up

    training Loss and error drop faster per-epoch
  79. Although epochs take longer (around 2x in my experience) Can

    (ultimately) reach lower error rates Lets you build deeper networks
  80. Standardise activations (zero-mean, unit variance) per-channel between network layers Solves

    problems caused by exponential growth or shrinkage of layer activations
  81. Assume that a layer (grey square in the slide figure) produces

    activations whose std-dev is twice that of the input: σ_in = 1, σ_out = 2 σ_in = 2
  82. When n such layers are stacked together, the factor compounds:

    σ_in = 1, σ_out = 2^n
  83. The magnitude of activations and therefore gradients either explode or

    vanish (if the layers reduce the magnitude of activations rather than magnify them)
  84. Can be partially addressed with careful weight initialization [He15a]. Batch

    normalization between layers keeps things sane; can train networks with hundreds of layers [He15b].
  85. Lasagne batch normalization inserts itself into a layer before the

    non-linearity, so it's nice and easy to use: l = lasagne.layers.batch_norm(l)
  86. Standardise input data In case of regression, standardise output data

    too (don’t forget to invert the standardisation of network predictions!)
  87. Standardisation Extract samples (pixels in the case of images) into

    an array Compute distribution and standardise
  88. Either: Zero the mean and scale std-dev to 1, per

    channel (RGB for images): x' = (x - μ) / σ
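
A minimal NumPy sketch of per-channel standardisation for an (N, 3, H, W) array of images; keep the mean and std-dev so predictions can be un-standardised later, as the earlier slide notes:

    import numpy as np

    def standardise(images):
        mean = images.mean(axis=(0, 2, 3), keepdims=True)    # per-channel mean
        std = images.std(axis=(0, 2, 3), keepdims=True)      # per-channel std-dev
        return (images - mean) / (std + 1e-8), mean, std

    # invert with: standardised * std + mean
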
  89. Or better still: Use PCA whitening (retain all channels –

    we don’t want to reduce dimensionality)
  90. ZCA whitening is very similar Notice that PCA whitening rotated

    the RGB distribution ZCA whitening doesn’t; it only scales along the axes found by PCA
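
A minimal NumPy sketch of PCA/ZCA whitening of RGB pixel values, where pixels is an (N, 3) array of samples extracted from the training images:

    import numpy as np

    def fit_zca(pixels, eps=1e-6):
        mean = pixels.mean(axis=0)
        cov = np.cov(pixels - mean, rowvar=False)
        evals, evecs = np.linalg.eigh(cov)
        pca = np.diag(1.0 / np.sqrt(evals + eps)).dot(evecs.T)   # PCA whitening matrix
        zca = evecs.dot(pca)                                     # rotate back: ZCA keeps the RGB axes
        return mean, zca

    def apply_zca(pixels, mean, zca):
        return (pixels - mean).dot(zca.T)
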
  91. The previous fish-based examples all used batch normalisation and

    still benefited from data standardisation, so no.
  92. For images: Transformations: move, scale, rotate, reflect, etc. Consider the

    domain: horizontally flipping characters for character recognition would be a bad idea
  93. For images: Lighting: Compute principal components of RGB pixel values

    Add normally distributed random multiples of 10% of the std-dev in each principal component
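
A hedged sketch of the PCA lighting augmentation described above (in the spirit of [Krizhevsky12]); pixels is an (N, 3) array of RGB values from the training set and image a (3, H, W) array:

    import numpy as np

    cov = np.cov(pixels, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)            # principal components of RGB values

    def augment_lighting(image, scale=0.1):
        alpha = np.random.normal(0.0, scale, size=3)       # ~10% random multiples
        shift = evecs.dot(alpha * np.sqrt(evals))          # std-dev along each component
        return image + shift[:, None, None]
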
  94. Learns to predict constant value; optimises constant value for best

    loss A constant value is a local minimum that the network won’t get out of (neural networks ‘cheat’ like crazy!)
  95. Saliency maps Determine which parts of an image the network

    is using to make its prediction Tells you what the network is ‘looking at’
  96. Theoretically possible to use a single network, with enough training

    data (where enough is an impractical amount)
  97. Example Identifying right whales, by Felix Lau 2nd place in

    Kaggle competition http://felixlaumon.github.io/2015/01/08/kaggle-right-whale.html
  98. Identifying right whales, by Felix Lau The first naïve solution

    – training a classifier to identify individuals – did not work well
  99. Region-based saliency map revealed that the network had ‘locked on’

    to features in the ocean shape rather than the whales
  100. Lau’s solution: Train a keypoint finder to locate two keypoints

    on the whale’s head to identify its orientation
  101. When training a network, we use gradient descent to iteratively

    modify weights given images and ground truths
  102. Deep Dreams: Take an image to hallucinate from Choose a

    layer, e.g. ‘pool4’ of VGG-19; choice depends on scale and level of features desired
  103. Deep Dreams: Compute gradient of the L2 norm of the layer's activations w.r.t. the image

    Use gradient ascent to increase that L2 norm
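
A hedged Theano/Lasagne sketch of that gradient-ascent loop (net is assumed to be a dict of pre-trained layers with a 'pool4' entry, and image a preprocessed (1, 3, H, W) float32 array; step size and iteration count are illustrative):

    import numpy as np
    import theano
    import theano.tensor as T
    import lasagne

    X = T.tensor4('X')
    feats = lasagne.layers.get_output(net['pool4'], X, deterministic=True)
    objective = T.sum(feats ** 2)                    # squared L2 norm of the layer activations
    grad_fn = theano.function([X], T.grad(objective, X))

    step = 1.0
    for _ in range(50):
        g = grad_fn(image)
        image += step * g / (np.abs(g).mean() + 1e-8)   # normalised gradient ascent step
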
  104. Plug & Play Generative Networks: Conditional Iterative Generation of Images

    in Latent Space [Nguyen17] Image taken from http://www.evolvingai.org/ppgn
  105. Deep Neural Networks are Easily Fooled: High Confidence Predictions in

    Recognizable Images [Nguyen15] Generate images that are unrecognizable to human eyes but are recognized by the network
  106. Deep Neural Networks are Easily Fooled: High Confidence Predictions in

    Recognizable Images [Nguyen15] Image taken from [Nguyen15]
  107. Learning to generate chairs with convolutional neural networks [Dosovitskiy15] Network

    in reverse: orientation, design, colour, etc. parameters as input, rendered images as output (training images shown in the slide figure)
  108. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [Radford15]

    Train two networks: one given random parameters to generate an image, another to discriminate between a generated image and one from the training set
  109. There has been a lot of work on GANs lately;

    mainly focused on improving their output quality
  110. A Neural Algorithm of Artistic Style [Gatys15] Take an OxfordNet

    model [Simonyan14] and extract texture features from one of the convolutional layers, given a target style / painting as input Use gradient descent to iterate photo – not weights – so that its texture features match those of the target image.
  111. Much work on style transfer has focused on either improving

    quality or performance The original algorithm used gradient descent, like Deep Dream, so it's quite slow
  112. Deep Photo Style Transfer [Luan17] (also iterative, but just awesome)

    Image taken from https://github.com/luanfujun/deep-photo-styletransfer
  113. [Berthelot17] Berthelot D., Schumm T., Metz L.; “BEGAN: Boundary Equilibrium

    Generative Adversarial Networks”, arXiv 1703.10717 (2017).
  114. [Girshick15] Girshick, Ross; “Fast R- CNN”, Proceedings of the IEEE

    International Conference on Computer Vision, 2015
  115. [He15a] He, Zhang, Ren and Sun; “Delving Deep into Rectifiers:

    Surpassing Human-Level Performance on ImageNet Classification”, arXiv 2015
  116. [He15b] He, Kaiming, et al. "Deep Residual Learning for Image

    Recognition." arXiv preprint arXiv:1512.03385 (2015).
  117. [Hinton12] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and

    R. R. Salakhutdinov; “Improving neural networks by preventing co-adaptation of feature detectors.” arXiv preprint arXiv:1207.0580, 2012.
  118. [Ioffe15] Ioffe, S.; Szegedy C.. (2015). “Batch Normalization: Accelerating Deep

    Network Training by Reducing Internal Covariate Shift". ICML 2015, arXiv:1502.03167
  119. [Jones87] Jones, J.P.; Palmer, L.A. (1987). "An evaluation of the

    two-dimensional gabor filter model of simple receptive fields in cat striate cortex". J. Neurophysiol 58 (6): 1233–1258
  120. [Lin13] Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in

    network." arXiv preprint arXiv:1312.4400 (2013).
  121. [Luan17] Luan F., Paris S., Shechtman E., Bala K. “Deep

    Photo Style Transfer" arXiv:1703.07511 (2017).
  122. [Krizhevsky12] Krizhevsky, A., Sutskever, I. and Hinton, G. E. "ImageNet

    Classification with Deep Convolutional Neural Networks." NIPS 2012.
  123. [Nesterov83] Nesterov, Y. A method of solving a convex programming

    problem with convergence rate O(1/sqr(k)). Soviet Mathematics Doklady, 27:372–376 (1983).
  124. [Nguyen17] Nguyen A., Yosinski J., Bengio Y., Dosovitskiy A., Clune

    J. “Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space”. CVPR 2017.
  125. [Sutskever13] Sutskever, Ilya, et al. On the importance of initialization

    and momentum in deep learning. Proceedings of the 30th international conference on machine learning (ICML-13). 2013.
  126. [Simonyan14] K. Simonyan and A. Zisserman; “Very deep convolutional networks for

    large-scale image recognition”, arXiv:1409.1556, 2014
  127. [Wang14] Wang, Dan, and Yi Shang. "A new active labeling

    method for deep learning." Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014.