Deep Learning for Computer Vision by Alex Conway

Slide 1

Slide 1 text

Deep Learning for Computer Vision Executive-ML 2017/09/21 Neither Proprietary nor Confidential – Please Distribute ;) Alex Conway alex @ numberboost.com @alxcnwy PyConZA’17

Slide 2

Slide 2 text

Hands up!

Slide 3

Slide 3 text

Check out the Deep Learning Indaba videos & practicals! http://www.deeplearningindaba.com/videos.html http://www.deeplearningindaba.com/practicals.html

Slide 4

Slide 4 text

Deep Learning is Sexy (for a reason!)

Slide 5

Slide 5 text

Image Classification http://yann.lecun.com/exdb/mnist/ https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py (99.25% test accuracy in 192 seconds and 70 lines of code)

Slide 6

Slide 6 text

Image Classification

Slide 7

Slide 7 text

Image Classification
ImageNet Classification with Deep Convolutional Neural Networks, Krizhevsky et al., Advances in Neural Information Processing Systems 25 (NIPS 2012)

Slide 8

Slide 8 text

https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html Object detection

Slide 9

Slide 9 text

https://www.youtube.com/watch?v=VOC3huqHrss Object detection

Slide 10

Slide 10 text

Object detection

Slide 11

Slide 11 text

Image Captioning & Visual Attention https://einstein.ai/research/knowing-when-to-look-adaptive-attention-via-a-visual-sentinel-for-image-captioning

Slide 12

Slide 12 text

Image Q&A https://arxiv.org/pdf/1612.00837.pdf

Slide 13

Slide 13 text

Video Q&A https://www.youtube.com/watch?v=UeheTiBJ0Io

Slide 14

Slide 14 text

Pix2Pix https://affinelayer.com/pix2pix/ https://github.com/affinelayer/pix2pix-tensorflow

Slide 15

Slide 15 text

Pix2Pix https://medium.com/towards-data-science/face2face-a-pix2pix-demo-that-mimics-the-facial-expression-of-the-german-chancellor-b6771d65bf66

Slide 16

Slide 16 text

Rear Window (1954)
Original input | Pix2pix output (Fully Automated) | Remastered (Painstakingly by Hand)
https://hackernoon.com/remastering-classic-films-in-tensorflow-with-pix2pix-f4d551fa0503

Slide 17

Slide 17 text

Style Transfer https://github.com/junyanz/CycleGAN

Slide 18

Slide 18 text

Style Transfer SORCERY https://github.com/junyanz/CycleGAN

Slide 19

Slide 19 text

Real Fake News https://www.youtube.com/watch?v=MVBe6_o4cMI

Slide 20

Slide 20 text

Deep learning is Magic. Deep learning is EASY!

Slide 21

Slide 21 text

1. What is a neural network?
2. What is a convolutional neural network?
3. How to use a convolutional neural network
4. More advanced methods
5. Case studies & applications

Slide 22

Slide 22 text

Big Shout Outs
Jeremy Howard & Rachel Thomas http://course.fast.ai
Andrej Karpathy http://cs231n.github.io
François Chollet (Keras lead dev) https://keras.io/

Slide 23

Slide 23 text

1. What is a neural network?

Slide 24

Slide 24 text

What is a neuron?
• 3 inputs [x1, x2, x3]
• 3 weights [w1, w2, w3]
• Element-wise multiply and sum
• Apply activation function f
• Often add a bias too (a weight on a constant input of 1), not shown
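A minimal sketch of the neuron described on this slide, in plain Python. The input and weight values are made up for illustration, and the sigmoid is just one possible choice for the activation function f.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias=0.0, activation=sigmoid):
    """Multiply inputs by weights element-wise, sum, add the bias, apply f."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

x = [1.0, 2.0, 3.0]     # hypothetical inputs [x1, x2, x3]
w = [0.5, -1.0, 0.5]    # hypothetical weights [w1, w2, w3]
print(neuron(x, w))     # weighted sum is 0.5 - 2.0 + 1.5 = 0.0, so this prints 0.5
```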

Slide 25

Slide 25 text

What is an Activation Function?
Sigmoid, Tanh, ReLU
Nonlinearities … “squashing functions” … transform neuron’s output
NB: sigmoid output in [0,1]
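The three activation functions named on this slide can be sketched in a few lines of plain Python. The sample inputs are arbitrary and only there to show how each function transforms a value.

```python
import math

def sigmoid(z):
    """Output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Output between -1 and 1."""
    return math.tanh(z)

def relu(z):
    """Zero for negative inputs, unchanged for positive inputs."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):   # arbitrary sample inputs
    print(z, sigmoid(z), tanh(z), relu(z))
```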

Slide 26

Slide 26 text

What is a (Deep) Neural Network?
Inputs → hidden layer 1 → hidden layer 2 → hidden layer 3 → outputs
Outputs of one layer are inputs into the next layer
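To make "outputs of one layer are inputs into the next layer" concrete, here is a small sketch of a forward pass. The network size, the weights and the use of ReLU in every layer are assumptions chosen to keep the arithmetic easy to follow.

```python
def relu(z):
    return max(0.0, z)

def dense_layer(inputs, weights, biases, activation=relu):
    """One fully connected layer: each row of `weights` is one neuron."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Feed the outputs of each layer into the next one."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

# Hypothetical tiny network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 2.0]], [0.5]),
]
print(forward([2.0, 1.0], layers))   # hidden layer gives [1.0, 1.5], output is [4.5]
```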

Slide 27

Slide 27 text

How does a neural network learn?
• We need labelled examples (“training data”)
• We initialize the network weights randomly, so the initial predictions are random
• For each labelled training data point, we calculate the error between the network’s prediction and the ground-truth label
• We use ‘backpropagation’ (the chain rule) to update the network parameters (weights + convolutional filters) in the direction that reduces the error, i.e. opposite to the gradient

Slide 28

Slide 28 text

How does a neural network learn?
New weight = Old weight - (Learning rate × Gradient of error with respect to weight)
Gradient: “How much the error increases when we increase this weight”
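The update rule can be tried out on a one-weight model. This is only a sketch: the model (prediction = w * x), the squared error, the single training example and the learning rate are all assumptions picked for illustration.

```python
def gradient_descent_step(weight, gradient, learning_rate):
    """New weight = old weight - learning rate * gradient."""
    return weight - learning_rate * gradient

# Hypothetical one-weight model: prediction = w * x, error = (w * x - y) ** 2.
x, y = 2.0, 6.0     # a single labelled example; the ideal weight is 3.0
w = 0.0             # initial weight
for step in range(50):
    gradient = 2.0 * (w * x - y) * x    # d(error)/d(w) via the chain rule
    w = gradient_descent_step(w, gradient, learning_rate=0.05)
print(w)            # ends up very close to 3.0
```

With a much larger learning rate the same loop would overshoot and diverge, which is why the learning rate usually needs tuning.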

Slide 29

Slide 29 text

Gradient Descent Interpretation http://scs.ryerson.ca/~aharley/neural-networks/

Slide 30

Slide 30 text

http://playground.tensorflow.org

Slide 31

Slide 31 text

What is a Neural Network? For much more detail, see:
1. Michael Nielsen’s Neural Networks & Deep Learning free online book http://neuralnetworksanddeeplearning.com/chap1.html
2. Andrej Karpathy’s CS231n Notes http://cs231n.github.io

Slide 32

Slide 32 text

2. What is a convolutional neural network?

Slide 33

Slide 33 text

What is a Convolutional Neural Network?
“like an ordinary neural network but with special types of layers that work well on images” (math works on numbers)
• Pixel = 3 colour channels (R, G, B)
• Pixel intensity ∈ [0, 255]
• Image has width w and height h
• Therefore image is w x h x 3 numbers

Slide 34

Slide 34 text

Example Architecture
This is VGGNet. Don’t panic, we’ll break it down piece by piece

Slide 35

Slide 35 text

Example Architecture
This is VGGNet. Don’t panic, we’ll break it down piece by piece

Slide 36

Slide 36 text

How does a neural network learn?
• Stop and think how remarkable it is that we can recognise all of these MNIST digits as a 3
• Different pixels!

Slide 37

Slide 37 text

Convolutions http://setosa.io/ev/image-kernels/

Slide 38

Slide 38 text

Convolutions http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html

Slide 39

Slide 39 text

New Layer Type: Convolutional Layer
• A 2-D weighted sum: multiply the kernel element-wise over a patch of pixels and add up the results
• We slide the kernel over all pixels of the image (handling borders)
• The kernel starts off with “random” values and the network updates (learns) the kernel values (using backpropagation) to try to minimize the loss
• Kernels are shared across the whole image (parameter sharing)
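A rough sketch of sliding one kernel over a greyscale image, in plain Python. It assumes no padding and a stride of 1, so the output is smaller than the input, and like most deep learning libraries it does not flip the kernel. The image and the hand-picked edge kernel are made up; in a CNN the kernel values would be learned.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the weighted sums."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# Hypothetical 4x4 greyscale image with a vertical edge down the middle.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hand-picked vertical edge detector; the same 9 weights are reused at every position.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
print(convolve2d(image, kernel))   # [[27.0, 27.0], [27.0, 27.0]]
```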

Slide 40

Slide 40 text

Many Kernels = Many “Activation Maps” = Volume http://cs231n.github.io/convolutional-networks/

Slide 41

Slide 41 text

New Layer Type: Convolutional Layer

Slide 42

Slide 42 text

Convolutions https://github.com/fchollet/keras/blob/master/examples/conv_filter_visualization.py

Slide 43

Slide 43 text

Convolutions

Slide 44

Slide 44 text

Convolutions

Slide 45

Slide 45 text

Convolutions

Slide 46

Slide 46 text

Convolutions Learn Hierarchical Features

Slide 47

Slide 47 text

Great vid https://www.youtube.com/watch?v=AgkfIQ4IGaM

Slide 48

Slide 48 text

New Layer Type: Max Pooling

Slide 49

Slide 49 text

New Layer Type: Max Pooling
• Reduces dimensionality from one layer to the next
• …by replacing each NxN sub-area with its max value
• Makes the network “look” at larger areas of the image at a time
• e.g. instead of identifying fur, identify cat
• Reduces overfitting, since losing information helps the network generalize
http://cs231n.github.io/convolutional-networks/
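A small sketch of max pooling in plain Python, assuming non-overlapping 2x2 windows (so N = 2 and the stride equals the window size). The feature map values are invented.

```python
def max_pool2d(feature_map, size=2):
    """Replace each non-overlapping size x size block with its maximum value."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            block = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(block))
        pooled.append(row)
    return pooled

# Hypothetical 4x4 activation map.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(max_pool2d(feature_map))   # [[4, 2], [2, 8]]
```

The 4x4 map shrinks to 2x2, so every value in the next layer summarises a larger area of the original image.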

Slide 50

Slide 50 text

New Layer Type: Max Pooling

Slide 51

Slide 51 text

Stack Conv + Pooling Layers and Go Deep
Convolution + max pooling + fully connected + softmax

Slide 52

Slide 52 text

Stack Conv + Pooling Layers and Go Deep
Convolution + max pooling + fully connected + softmax

Slide 53

Slide 53 text

Stack These Layers and Go Deep
Convolution + max pooling + fully connected + softmax

Slide 54

Slide 54 text

Stack These Layers and Go Deep
Convolution + max pooling + fully connected + softmax

Slide 55

Slide 55 text

Flatten the Final “Bottleneck” Layer
Convolution + max pooling + fully connected + softmax
Flatten the final 7 x 7 x 512 max pooling layer
Add a fully-connected dense layer on top
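Flattening just unrolls the 7 x 7 x 512 volume into one long vector that a dense layer can take as input. A quick sketch with a zero-filled placeholder volume (the values are not real activations):

```python
def flatten(volume):
    """Unroll a height x width x channels volume into one long list."""
    return [value for row in volume for pixel in row for value in pixel]

height, width, channels = 7, 7, 512
# Placeholder volume filled with zeros, standing in for the final max pooling output.
volume = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
flat = flatten(volume)
print(len(flat))   # 7 * 7 * 512 = 25088
```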

Slide 56

Slide 56 text

Bringing it all together
Convolution + max pooling + fully connected + softmax

Slide 57

Slide 57 text

Softmax
Convert scores ∈ ℝ to probabilities ∈ [0,1]
Final output prediction = highest probability class
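A minimal softmax sketch in plain Python. The class scores are made up, and subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(scores):
    """Turn real-valued scores into probabilities that sum to 1."""
    shift = max(scores)                          # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]                         # hypothetical class scores
probs = softmax(scores)
predicted_class = probs.index(max(probs))        # highest probability class wins
print(probs, predicted_class)
```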

Slide 58

Slide 58 text

Bringing it all together
Convolution + max pooling + fully connected + softmax

Slide 59

Slide 59 text

We need labelled training data!

Slide 60

Slide 60 text

ImageNet
http://image-net.org/explore
1000 object categories
1.2 million training images

Slide 61

Slide 61 text

ImageNet

Slide 62

Slide 62 text

ImageNet

Slide 63

Slide 63 text

ImageNet Top 5 Error Rate
[chart, bars in order] Traditional Image Processing Methods | AlexNet (8 layers) | ZFNet (8 layers) | GoogLeNet (22 layers) | ResNet (152 layers) | SENet (ensemble) | TSNet (ensemble)

Slide 64

Slide 64 text

3. How to use a convolutional neural network

Slide 65

Slide 65 text

Using a Pre-Trained ImageNet-Winning CNN
• We’ve been looking at “VGGNet”
• Oxford Visual Geometry Group (VGG)
• ImageNet 2014 Runner-up
• Network is 16 layers (deep!)
• Easy to fine-tune
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

Slide 66

Slide 66 text

Example: Classifying Product Images
Classifying products into 9 categories
https://github.com/alexcnwy/CTDL_CNN_TALK_20170620

Slide 67

Slide 67 text

Start with Pre-Trained ImageNet Model
Consumers vs Producers of Machine Learning
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

Slide 68

Slide 68 text

Slide 69

Slide 69 text

“Transfer Learning” is a game changer

Slide 70

Slide 70 text

Fine-tuning A CNN To Solve A New Problem
• Cut off the last layer of a pre-trained ImageNet-winning CNN
• Keep the learned network (convolutions) but replace the final layer
• Can learn to predict new (completely different) classes
• Fine-tuning is re-training the new final layer so that it learns the new task

Slide 71

Slide 71 text

Fine-tuning A CNN To Solve A New Problem

Slide 72

Slide 72 text

Before Fine-Tuning

Slide 73

Slide 73 text

After Fine-Tuning

Slide 74

Slide 74 text

Fine-tuning A CNN To Solve A New Problem
• Fix weights in convolutional layers (set trainable=False)
• Remove final dense layer that predicts 1000 ImageNet classes
• Replace with new dense layer to predict 9 categories
88% accuracy in under 2 minutes for classifying products into categories
Fine-tuning is awesome!
Insert obligatory brain analogy
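The idea of freezing the feature extractor and training only a new final layer can be sketched without any deep learning library. Everything here is a toy stand-in and not the talk's actual code: `frozen_features` is a hypothetical placeholder for the frozen VGG convolutions, the new head is a single sigmoid unit (2 classes instead of 9), and the "images" are three-pixel lists.

```python
import math

def frozen_features(image):
    """Stand-in for the frozen convolutional layers (never updated during training).
    Hypothetical: it simply returns the mean and the max of the pixels."""
    return [sum(image) / len(image), max(image)]

def predict(head, features):
    """New final layer: one sigmoid unit on top of the frozen features."""
    z = sum(w * f for w, f in zip(head["weights"], features)) + head["bias"]
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(images, labels, epochs=500, learning_rate=0.5):
    """Train only the new head with gradient descent."""
    head = {"weights": [0.0, 0.0], "bias": 0.0}
    cached = [frozen_features(img) for img in images]   # features never change, compute once
    for _ in range(epochs):
        for features, label in zip(cached, labels):
            error = predict(head, features) - label     # gradient of cross-entropy w.r.t. z
            head["weights"] = [
                w - learning_rate * error * f
                for w, f in zip(head["weights"], features)
            ]
            head["bias"] -= learning_rate * error
    return head

# Hypothetical toy data: dark "images" are class 0, bright ones are class 1.
images = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.3], [0.9, 0.8, 0.9], [0.7, 0.9, 0.8]]
labels = [0, 0, 1, 1]
head = fine_tune(images, labels)
predictions = [round(predict(head, frozen_features(img))) for img in images]
print(predictions)
```

Because the frozen features never change, they can be computed once and cached, which is a large part of why training only the new layer is so fast.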

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

Visual Similarity
• Chop off last 2 VGG layers
• Use dense layer with 4096 activations
• Compute nearest neighbours in the space of these activations
https://memeburn.com/2017/06/spree-image-search/
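A sketch of the nearest-neighbour step in plain Python. The product names and the 4-d vectors are invented stand-ins for the 4096-d activations, and cosine similarity is an assumed choice of metric; the slide does not say which distance was used.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, catalogue, top_k=10):
    """Return the names of the top_k catalogue items closest to `query`."""
    ranked = sorted(
        catalogue,
        key=lambda name: cosine_similarity(query, catalogue[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical 4-d activation vectors; the real ones would be 4096-d.
catalogue = {
    "red_dress": [0.9, 0.1, 0.0, 0.2],
    "pink_dress": [0.8, 0.2, 0.1, 0.2],
    "blue_jeans": [0.1, 0.9, 0.3, 0.0],
    "sneakers": [0.0, 0.2, 0.9, 0.4],
}
query = [0.9, 0.1, 0.05, 0.2]    # activations of an image the model has not seen
print(most_similar(query, catalogue, top_k=2))   # ['red_dress', 'pink_dress']
```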

Slide 77

Slide 77 text

https://github.com/alexcnwy/CTDL_CNN_TALK_20170620

Slide 78

Slide 78 text

Input: image not seen by model
Results: top 10 most “visually similar”

Slide 79

Slide 79 text

Final Convolutional Layer = Semantic Vector
• The final convolutional layer encodes everything the network needs to make predictions
• The dense layer added on top and the softmax layer both have lower dimensionality

Slide 80

Slide 80 text

4. More Advanced Methods

Slide 81

Slide 81 text

Use a Better Architecture (or all of them!)
“Ensembles win”: learn a weighted average of many models’ predictions
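A sketch of combining several models' predictions with a weighted average. The probabilities and the model weights are invented; in practice the weights would be learned, for example on held-out validation data.

```python
def ensemble_predict(model_probs, model_weights):
    """Weighted average of several models' class probabilities."""
    total = sum(model_weights)
    n_classes = len(model_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(model_weights, model_probs)) / total
        for c in range(n_classes)
    ]

# Hypothetical softmax outputs of three models for one image (3 classes).
model_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
]
model_weights = [0.5, 0.2, 0.3]    # hypothetical weights, one per model
averaged = ensemble_predict(model_probs, model_weights)
print(averaged, averaged.index(max(averaged)))   # roughly [0.49, 0.41, 0.10], class 0
```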

Slide 82

Slide 82 text

There are MANY Computer Vision Tasks
http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf

Slide 83

Slide 83 text

Semantic Segmentation
Long et al. “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al. “Learning Deconvolution Network for Semantic Segmentation”, IEEE International Conference on Computer Vision (ICCV) 2015

Slide 84

Slide 84 text

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf Object detection

Slide 85

Slide 85 text

Object detection

Slide 86

Slide 86 text

https://www.youtube.com/watch?v=VOC3huqHrss Object detection

Slide 87

Slide 87 text

Style Transfer
http://blog.romanofoti.com/style_transfer/
Johnson et al. “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, 2016

Slide 88

Slide 88 text

https://www.youtube.com/watch?v=LhF_56SxrGk

Slide 89

Slide 89 text

Pixelated | Original | Output
https://arstechnica.com/information-technology/2017/02/google-brain-super-resolution-zoom-enhance/

Slide 90

Slide 90 text

Super-Resolution
This image is 3.8 KB

Slide 91

Slide 91 text

Visual Attention
https://github.com/tdeboissiere/VGG16CAM-keras

Slide 92

Slide 92 text

Image Captioning
https://einstein.ai/research/knowing-when-to-look-adaptive-attention-via-a-visual-sentinel-for-image-captioning
Karpathy & Li. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), pp. 664–676

Slide 93

Slide 93 text

Image Q&A http://iamaaditya.github.io/2016/04/visual_question_answering_demo_notebook

Slide 94

Slide 94 text

Video Q&A https://www.youtube.com/watch?v=UeheTiBJ0Io

Slide 95

Slide 95 text

Video Q&A https://www.youtube.com/watch?v=UeheTiBJ0Io

Slide 96

Slide 96 text

CNN + Word2Vec = AWESOME
king + woman - man ≈ queen
Frome et al. (2013) ‘DeViSE: A Deep Visual-Semantic Embedding Model’, Advances in Neural Information Processing Systems, pp. 2121–2129

Slide 97

Slide 97 text

DeViSE: A Deep Visual-Semantic Embedding Model
Before: Encode labels as 1-hot vector

Slide 98

Slide 98 text

DeViSE: A Deep Visual-Semantic Embedding Model
After: Encode labels as word2vec vectors (FROM A SEPARATE MODEL)
Can look these up for all the nouns in ImageNet
300-d word2vec vectors

Slide 99

Slide 99 text

DeViSE: A Deep Visual-Semantic Embedding Model
wv* = (wv(fish) + wv(net)) / 2
…get nearest neighbours to wv*
https://www.youtube.com/watch?v=uv0gmrXSXVg
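The averaging trick can be sketched with toy vectors. The 3-d "word vectors", the vocabulary and the use of Euclidean distance are assumptions for illustration only; real word2vec vectors are 300-d and would come from a separately trained model.

```python
import math

def average(vectors):
    """Element-wise mean of equally long vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(target, vocabulary):
    """Word whose vector lies closest to `target`."""
    return min(vocabulary, key=lambda word: euclidean(target, vocabulary[word]))

# Hypothetical 3-d "word vectors".
wv = {
    "fish": [1.0, 0.0, 0.0],
    "net": [0.0, 1.0, 0.0],
    "fishing_net": [0.5, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
}
wv_star = average([wv["fish"], wv["net"]])     # (wv(fish) + wv(net)) / 2
candidates = {w: v for w, v in wv.items() if w not in ("fish", "net")}
print(nearest(wv_star, candidates))            # fishing_net
```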

Slide 100

Slide 100 text

No content

Slide 101

Slide 101 text

5. Case Studies

Slide 102

Slide 102 text

Estimating Accident Repair Cost from Photos
Prototype for large SA insurer
Detect car make & model from registration disk
Predict repair cost using learned model

Slide 103

Slide 103 text

Image & Video Moderation
Large international gay dating app with tens of millions of users uploading hundreds of thousands of photos per day

Slide 104

Slide 104 text

Segmenting Medical Images

Slide 105

Slide 105 text

Counting People
Count shoppers, segment on age & gender
Facial recognition loyalty is next