Deep Learning for Computer Vision by Alex Conway

Pycon ZA
October 06, 2017

The state-of-the-art in image classification has skyrocketed thanks to the development of deep convolutional neural networks and increases in the amount of data and computing power available to train them. The top-5 error rate in the international ImageNet competition to predict which of 1000 classes an image belongs to has plummeted from 28% error in 2010 before deep learning to just 2.25% in 2017 (human level error is around 5%).

In addition to being able to classify objects in images (including not hotdogs), deep learning can be used to automatically generate captions for images, convert photos into paintings, detect cancer in pathology slide images, and help self-driving cars ‘see’.

The talk will give an overview of the cutting edge in the field and some of the core mathematical concepts behind the models. It will also include a short code-first tutorial to show how easy it is to get started using deep learning for computer vision in Python…

Transcript

  1. Deep Learning
    for Computer Vision
    Executive-ML 2017/09/21
    Neither Proprietary nor Confidential – Please Distribute ;)
    Alex Conway
    alex @ numberboost.com
    @alxcnwy
    PyConZA’17


  2. Hands up!


  3. Check out the
    Deep Learning Indaba
    videos & practicals!
    http://www.deeplearningindaba.com/videos.html
    http://www.deeplearningindaba.com/practicals.html


  4. Deep Learning is Sexy (for a reason!)

  5. Image Classification
    http://yann.lecun.com/exdb/mnist/
    https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py
    (99.25% test accuracy in 192 seconds and 70 lines of code)


  6. Image Classification

  7. Image Classification
    ImageNet Classification with Deep Convolutional Neural Networks, Krizhevsky et al.
    Advances in Neural Information Processing Systems 25 (NIPS 2012)

  8. Object detection
    https://research.googleblog.com/2017/06/supercharge-your-computer-vision-models.html

  9. Object detection
    https://www.youtube.com/watch?v=VOC3huqHrss

  10. Object detection


  11. Image Captioning & Visual Attention
    https://einstein.ai/research/knowing-when-to-look-adaptive-attention-via-a-visual-sentinel-for-image-captioning

  12. Image Q&A
    https://arxiv.org/pdf/1612.00837.pdf


  13. Video Q&A
    https://www.youtube.com/watch?v=UeheTiBJ0Io


  14. Pix2Pix
    https://affinelayer.com/pix2pix/
    https://github.com/affinelayer/pix2pix-tensorflow

  15. Pix2Pix
    https://medium.com/towards-data-science/face2face-a-pix2pix-demo-that-mimics-the-facial-expression-of-the-german-chancellor-b6771d65bf66

  16. Rear Window (1954)
    Original input
    Pix2pix output (Fully Automated)
    Remastered (Painstakingly by Hand)
    https://hackernoon.com/remastering-classic-films-in-tensorflow-with-pix2pix-f4d551fa0503

  17. Style Transfer
    https://github.com/junyanz/CycleGAN

  18. Style Transfer SORCERY
    https://github.com/junyanz/CycleGAN

  19. Real Fake News
    https://www.youtube.com/watch?v=MVBe6_o4cMI

  20. Deep learning is Magic
    Deep learning is Magic
    Deep learning is EASY!


  21. 1. What is a neural network?
    2. What is a convolutional neural network?
    3. How to use a convolutional neural network
    4. More advanced methods
    5. Case studies & applications

  22. Big Shout Outs
    Jeremy Howard & Rachel Thomas
    http://course.fast.ai
    Andrej Karpathy
    http://cs231n.github.io
    François Chollet (Keras lead dev)
    https://keras.io/

  23. 1. What is a neural network?

  24. What is a neuron?
    • 3 inputs [x1,x2,x3]
    • 3 weights [w1,w2,w3]
    • Element-wise multiply and sum
    • Apply activation function f
    • Often add a bias too (a learned weight on a constant input of 1) – not shown
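The neuron described above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the talk: the function name `neuron`, the example inputs and weights, and the choice of sigmoid for f are all assumptions.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias=0.0, activation=sigmoid):
    """Element-wise multiply inputs and weights, sum, add the bias, apply f."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)

# Hypothetical values for [x1, x2, x3] and [w1, w2, w3]
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.5]
output = neuron(x, w)  # weighted sum is 0.5 - 2.0 + 1.5 = 0.0, so sigmoid gives 0.5
```

The bias is just one more learned number added to the weighted sum before the activation is applied.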


  25. What is an Activation Function?
    Sigmoid Tanh ReLU
    Nonlinearities … “squashing functions” … transform neuron’s output
    NB: sigmoid output in [0,1]
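As a rough sketch, the three activation functions named on this slide can be written with the standard library alone. The sample inputs are arbitrary and only there to show how each function transforms a neuron's output.

```python
import math

def sigmoid(z):
    """Output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Output lies strictly between -1 and 1."""
    return math.tanh(z)

def relu(z):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, z)

# Arbitrary sample inputs, chosen only for illustration
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), round(tanh(z), 3), relu(z))
```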


  26. What is a (Deep) Neural Network?
    Inputs → hidden layer 1 → hidden layer 2 → hidden layer 3 → outputs
    Outputs of one layer are inputs into the next layer
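To make "outputs of one layer are inputs into the next layer" concrete, here is a small forward-pass sketch in plain Python. The layer sizes, the random weights and the use of ReLU on every layer (including the last) are assumptions made for brevity, not details from the slides.

```python
import random

def relu(z):
    return max(0.0, z)

def dense_layer(inputs, weights, biases):
    """One fully connected layer: every neuron sees every input."""
    return [relu(sum(x * w for x, w in zip(inputs, neuron_w)) + b)
            for neuron_w, b in zip(weights, biases)]

def random_layer(n_in, n_out, rng):
    """Random starting weights and zero biases for a layer with n_out neurons."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

rng = random.Random(0)
layer_sizes = [3, 4, 4, 4, 2]  # inputs, three hidden layers, outputs (hypothetical sizes)
layers = [random_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

activations = [0.5, -0.2, 0.1]  # hypothetical input values
for weights, biases in layers:
    # the output of this layer becomes the input of the next one
    activations = dense_layer(activations, weights, biases)
```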


  27. How does a neural network learn?
    • We need labelled examples (“training data”)
    • We initialize the network weights randomly, so the initial predictions are random
    • For each labelled training data point, we calculate the error between the network’s predictions and the ground-truth labels
    • Use ‘backpropagation’ (chain rule) to update the network parameters (weights + convolutional filters) in the opposite direction to the gradient of the error

  28. How does a neural network learn?
    New weight = Old weight − Learning rate × (Gradient of Error with respect to weight)
    $$w_{\text{new}} = w_{\text{old}} - \eta \, \frac{\partial E}{\partial w}$$
    Gradient: “How much error increases when we increase this weight”
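The update rule on this slide can be tried out on a toy problem. The sketch below assumes a model with a single weight, y = w * x, and a squared error; the data point, the learning rate and the number of steps are made-up values.

```python
def error(w, x, y):
    """Squared error of the one-weight model y_hat = w * x."""
    return (w * x - y) ** 2

def gradient(w, x, y):
    """dE/dw for E = (w * x - y)^2: how much error increases when w increases."""
    return 2 * (w * x - y) * x

w = 0.0               # starting weight
learning_rate = 0.1   # hypothetical value
x, y = 1.0, 3.0       # one labelled example; the ideal weight is 3.0

for step in range(50):
    # new weight = old weight - learning rate * gradient
    w = w - learning_rate * gradient(w, x, y)
```

Each step moves the weight a little way against the gradient, so the error shrinks and w approaches 3.0.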


  29. Gradient Descent Interpretation
    http://scs.ryerson.ca/~aharley/neural-networks/


  30. http://playground.tensorflow.org


  31. What is a Neural Network?
    For much more detail, see:
    1. Michael Nielsen’s Neural Networks & Deep Learning free online book
    http://neuralnetworksanddeeplearning.com/chap1.html
    2. Andrej Karpathy’s CS231n Notes
    http://cs231n.github.io

  32. 2. What is a convolutional
    neural network?


  33. What is a Convolutional Neural Network?
    “like an ordinary neural network but with special types of layers that work well on images”
    (math works on numbers)
    • Pixel = 3 colour channels (R, G, B)
    • Pixel intensity ∈ [0,255]
    • Image has width w and height h
    • Therefore image is w x h x 3 numbers
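A quick sketch of "an image is w x h x 3 numbers", using nested Python lists. The 224 x 224 size is an assumption (it is the usual VGG input size); any width and height would do.

```python
width, height, channels = 224, 224, 3  # assumed size, 3 colour channels (R, G, B)

# image[row][col] is one pixel: [R, G, B], each intensity in [0, 255]
image = [[[0, 0, 0] for _ in range(width)] for _ in range(height)]
image[0][0] = [255, 0, 0]  # make the top-left pixel pure red

total_numbers = width * height * channels  # 150528 numbers describe this image
```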


  34. Example Architecture
    This is VGGNet – don’t panic, we’ll break it down piece by piece

  35. Example Architecture
    This is VGGNet – don’t panic, we’ll break it down piece by piece

  36. How does a neural network learn?
    • MNIST: stop and think how remarkable it is that we can recognise all of these MNIST digits as a 3 (change number)
    • Different pixels!

  37. Convolutions
    http://setosa.io/ev/image-kernels/


  38. Convolutions
    http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html


  39. New Layer Type: Convolutional Layer
    • 2-D weighted sum: multiply the kernel element-wise with a patch of pixels and add up the results
    • We slide the kernel over all pixels of the image (handle borders)
    • Kernel starts off with “random” values and the network updates (learns) the kernel values (using backpropagation) to try to minimize the loss
    • Kernels are shared across the whole image (parameter sharing)
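The sliding-kernel idea can be sketched in plain Python for a single-channel image. Assumptions: no border padding, a stride of 1, and a hand-picked vertical-edge kernel instead of learned values. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # multiply the kernel element-wise with the patch and sum
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# Toy 4 x 4 image: dark on the left, bright on the right
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked vertical edge detector (a real network would learn these values)
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
activation_map = convolve2d(image, kernel)  # [[3, 3], [3, 3]]
```

The same nine kernel values are reused at every position, which is the parameter sharing mentioned in the last bullet.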


  40. Many Kernels = Many “Activation Maps” = Volume
    http://cs231n.github.io/convolutional-networks/


  41. New Layer Type: Convolutional Layer

  42. Convolutions
    https://github.com/fchollet/keras/blob/master/examples/conv_filter_visualization.py


  43. Convolutions

  44. Convolutions

  45. Convolutions

  46. Convolutions Learn Hierarchical Features

  47. Great vid
    https://www.youtube.com/watch?v=AgkfIQ4IGaM


  48. New Layer Type: Max Pooling

  49. New Layer Type: Max Pooling
    • Reduces dimensionality from one layer to the next
    • …by replacing each NxN sub-area with its max value
    • Makes the network “look” at larger areas of the image at a time
    • e.g. instead of identifying fur, identify cat
    • Reduces overfitting, since losing information helps the network generalize
    http://cs231n.github.io/convolutional-networks/
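Max pooling as described in the bullets above can be sketched as follows. The function name `max_pool`, the 2 x 2 window and the toy feature map are assumptions; the blocks do not overlap.

```python
def max_pool(feature_map, size=2):
    """Replace each non-overlapping size x size block with its max value."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            block = [feature_map[i + di][j + dj]
                     for di in range(size) for dj in range(size)]
            row.append(max(block))
        pooled.append(row)
    return pooled

# Toy 4 x 4 activation map
feature_map = [
    [1, 3, 2, 1],
    [4, 2, 0, 1],
    [5, 1, 7, 2],
    [0, 2, 3, 8],
]
pooled = max_pool(feature_map)  # [[4, 2], [5, 8]]
```

A 4 x 4 map becomes 2 x 2, so each value in the next layer summarises a larger area of the original image.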


  50. New Layer Type: Max Pooling

  51. Stack Conv + Pooling Layers and Go Deep
    Convolution + max pooling + fully connected + softmax


  52. Stack Conv + Pooling Layers and Go Deep
    Convolution + max pooling + fully connected + softmax

  53. Stack These Layers and Go Deep
    Convolution + max pooling + fully connected + softmax

  54. Stack These Layers and Go Deep
    Convolution + max pooling + fully connected + softmax

  55. Flatten the Final “Bottleneck” layer
    Convolution + max pooling + fully connected + softmax
    Flatten the final 7 x 7 x 512 max pooling layer
    Add fully-connected (dense) layer on top
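Flattening simply unrolls the 7 x 7 x 512 volume into one long vector. The sketch below uses a volume of zeros as a stand-in for real activations, and assumes the dense layer on top has 4096 units, the figure quoted later in the deck.

```python
height, width, depth = 7, 7, 512

# Stand-in for the final max pooling output: height x width x depth zeros
volume = [[[0.0] * depth for _ in range(width)] for _ in range(height)]

# Flatten: walk rows, then pixels, then channels, into a single list
flat = [value for row in volume for pixel in row for value in pixel]

dense_units = 4096  # assumed size of the dense layer added on top
dense_weights = len(flat) * dense_units  # one weight per (input, unit) pair
```

The flattened vector has 25,088 entries, so a 4096-unit dense layer on top of it needs roughly 103 million weights, which is why most of VGG's parameters sit in its dense layers.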


  56. Bringing it all together
    Convolution + max pooling + fully connected + softmax

  57. Softmax
    Convert scores ∈ ℝ to probabilities ∈ [0,1]
    Final output prediction = highest probability class
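A minimal softmax sketch using only the standard library. The three raw scores are made up; subtracting the maximum score first is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(scores):
    """Convert real-valued scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # hypothetical raw scores for three classes
probabilities = softmax(scores)

# Final output prediction = highest probability class
predicted_class = probabilities.index(max(probabilities))
```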

  58. Bringing it all together
    Convolution + max pooling + fully connected + softmax


  59. We need labelled training data!


  60. ImageNet
    http://image-net.org/explore
    1000 object categories
    1.2 million training images


  61. ImageNet

  62. ImageNet

  63. ImageNet Top 5 Error Rate
    Chart labels (model and depth):
    • Traditional Image Processing Methods
    • AlexNet – 8 Layers
    • ZFNet – 8 Layers
    • GoogLeNet – 22 Layers
    • ResNet – 152 Layers
    • SENet – Ensemble
    • TSNet – Ensemble
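For reference, the metric in this slide's title can be computed as below: a prediction counts as correct if the true class is among the five highest-scoring classes. The function name and the tiny seven-class example are illustrative assumptions (ImageNet has 1000 classes).

```python
def top5_error(score_lists, true_labels):
    """Fraction of examples whose true label is not among the 5 highest scores."""
    misses = 0
    for scores, label in zip(score_lists, true_labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:5]:
            misses += 1
    return misses / len(true_labels)

scores = [
    [0.1, 0.9, 0.3, 0.2, 0.5, 0.4, 0.0],  # true class 1 is ranked first
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.1, 0.0],  # true class 6 is ranked last
]
labels = [1, 6]
error_rate = top5_error(scores, labels)  # one hit, one miss: 0.5
```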


  64. 3. How to use a
    convolutional neural
    network


  65. Using a Pre-Trained ImageNet-Winning CNN
    • We’ve been looking at “VGGNet”
    • Oxford Visual Geometry Group (VGG)
    • ImageNet 2014 Runner-up
    • Network is 16 layers (deep!)
    • Easy to fine-tune
    https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

  66. Example: Classifying Product Images
    Classifying products into 9 categories
    https://github.com/alexcnwy/CTDL_CNN_TALK_20170620

  67. Start with Pre-Trained ImageNet Model
    Consumers vs Producers of Machine Learning
    https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

  69. “Transfer Learning”
    is a game changer


  70. Fine-tuning A CNN To Solve A New Problem
    • Cut off the last layer of a pre-trained ImageNet-winning CNN
    • Keep the learned network (convolutions) but replace the final layer
    • Can learn to predict new (completely different) classes
    • Fine-tuning is re-training the new final layer to learn the new task

  71. Fine-tuning A CNN To Solve A New Problem

  72. Before Fine-Tuning

  73. After Fine-Tuning

  74. Fine-tuning A CNN To Solve A New Problem
    • Fix weights in convolutional layers (set trainable=False)
    • Remove final dense layer that predicts 1000 ImageNet classes
    • Replace with new dense layer to predict 9 categories
    88% accuracy in under 2 minutes for classifying products into categories
    Fine-tuning is awesome!
    Insert obligatory brain analogy
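The recipe above (freeze the pre-trained layers, swap in a new final layer, train only that layer) can be illustrated without any deep learning library. This is a toy stand-in, not the talk's Keras code: `frozen_features` plays the role of the frozen convolutional layers, the "images" are three-pixel lists, and there are 2 new classes instead of 9.

```python
import math

def frozen_features(image):
    """Stand-in for the frozen pre-trained layers: mean brightness plus a constant bias input."""
    return [sum(image) / len(image), 1.0]

def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(layer, features):
    """New final dense layer followed by softmax."""
    return softmax([sum(w * f for w, f in zip(row, features)) for row in layer])

# Hypothetical labelled data: dark images are class 0, bright images are class 1
data = [([0.1, 0.2, 0.1], 0), ([0.2, 0.1, 0.2], 0),
        ([0.9, 0.8, 0.9], 1), ([0.8, 0.9, 0.8], 1)]

n_classes, n_features = 2, 2
new_layer = [[0.0] * n_features for _ in range(n_classes)]  # the only trainable weights
learning_rate = 1.0

for epoch in range(200):
    grads = [[0.0] * n_features for _ in range(n_classes)]
    for image, label in data:
        features = frozen_features(image)  # frozen: never updated
        probs = predict(new_layer, features)
        for c in range(n_classes):
            # gradient of the cross-entropy loss with respect to class c's score
            err = probs[c] - (1.0 if c == label else 0.0)
            for k in range(n_features):
                grads[c][k] += err * features[k] / len(data)
    for c in range(n_classes):
        for k in range(n_features):
            new_layer[c][k] -= learning_rate * grads[c][k]

def classify(image):
    probs = predict(new_layer, frozen_features(image))
    return probs.index(max(probs))

predicted = [classify(image) for image, _ in data]
```

Only `new_layer` changes during training, which is why fine-tuning is fast: the expensive feature extractor is reused as-is.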


  76. Visual Similarity
    • Chop off last 2 VGG layers
    • Use dense layer with 4096 activations
    • Compute nearest neighbours in the space of these activations
    https://memeburn.com/2017/06/spree-image-search/
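A nearest-neighbour lookup over activation vectors might look like the sketch below. The product names and the 4-d vectors are invented stand-ins for the 4096-d dense-layer activations, and Euclidean distance is an assumed choice of similarity measure.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(query, catalogue, top_k=2):
    """Return the top_k catalogue names whose vectors are closest to the query."""
    ranked = sorted(catalogue, key=lambda name: euclidean(query, catalogue[name]))
    return ranked[:top_k]

# Hypothetical 4-d activations; the real ones would be 4096-d
catalogue = {
    "red_shoe":  [0.9, 0.1, 0.0, 0.2],
    "blue_shoe": [0.8, 0.2, 0.1, 0.3],
    "handbag":   [0.1, 0.9, 0.7, 0.0],
    "watch":     [0.0, 0.2, 0.9, 0.8],
}
query = [0.9, 0.1, 0.05, 0.2]  # activations of a new, unseen image
results = most_similar(query, catalogue)
```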


  77. https://github.com/alexcnwy/CTDL_CNN_TALK_20170620

  78. Input Image (not seen by model)
    Results: Top 10 most “visually similar”

  79. Final Convolutional Layer = Semantic Vector
    • The final convolutional layer encodes everything the network needs to make predictions
    • The dense layer added on top and the softmax layer both have lower dimensionality

  80. 4. More Advanced Methods


  81. Use a Better Architecture (or all of them!)
    “Ensembles win”
    learn a weighted average of many models’ predictions
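A weighted average of several models' predictions can be sketched as follows. The probabilities and weights are made up, and the weights are fixed by hand here, whereas the slide suggests learning them.

```python
def ensemble(predictions, weights):
    """Weighted average of several models' class probabilities."""
    total = sum(weights)
    n_classes = len(predictions[0])
    return [sum(w * p[c] for w, p in zip(weights, predictions)) / total
            for c in range(n_classes)]

# Hypothetical probabilities from three models over three classes
model_predictions = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
]
model_weights = [1.0, 1.0, 2.0]  # fixed by hand for this sketch
combined = ensemble(model_predictions, model_weights)
best_class = combined.index(max(combined))
```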


  82. There are MANY Computer Vision Tasks
    http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf

  83. Semantic Segmentation
    Long et al. “Fully Convolutional Networks for Semantic Segmentation” CVPR 2015
    Noh et al. Learning Deconvolution Network for Semantic Segmentation. IEEE on Computer Vision 2016

  84. Object detection
    http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf

  85. Object detection


  86. Object detection
    https://www.youtube.com/watch?v=VOC3huqHrss

  87. Style Transfer
    [Figure: f(content image, style image) = stylized image]
    http://blog.romanofoti.com/style_transfer/
    Johnson et al. Perceptual losses for real-time style transfer and super-resolution. 2016

  88. https://www.youtube.com/watch?v=LhF_56SxrGk


  89. Pixelated Original
    Output
    https://arstechnica.com/information-technology/2017/02/google-brain-super-resolution-zoom-enhance/

  90. Super-Resolution
    This image is 3.8 kb

  91. Visual Attention
    https://github.com/tdeboissiere/VGG16CAM-keras

  92. Image Captioning
    https://einstein.ai/research/knowing-when-to-look-adaptive-attention-via-a-visual-sentinel-for-image-captioning
    Karpathy & Li. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), pp. 664–676


  93. Image Q&A
    http://iamaaditya.github.io/2016/04/visual_question_answering_demo_notebook


  94. Video Q&A
    https://www.youtube.com/watch?v=UeheTiBJ0Io


  95. Video Q&A
    https://www.youtube.com/watch?v=UeheTiBJ0Io


  96. CNN + Word2Vec = AWESOME
    king + woman – man ≈ queen
    Frome et al. (2013) ‘DeViSE: A Deep Visual-Semantic Embedding Model’, Advances in Neural Information Processing Systems, pp. 2121–2129

  97. DeViSE: A Deep Visual-Semantic Embedding Model
    Before: Encode labels as 1-hot vector

  98. DeViSE: A Deep Visual-Semantic Embedding Model
    After: Encode labels as 300-d word2vec vectors (FROM A SEPARATE MODEL)
    Can look these up for all the nouns in ImageNet

  99. DeViSE: A Deep Visual-Semantic Embedding Model
    $$wv^{*} = \frac{wv(\text{fish}) + wv(\text{net})}{2}$$
    …get nearest neighbours to wv*
    https://www.youtube.com/watch?v=uv0gmrXSXVg
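The averaging step on this slide, wv* = (wv(fish) + wv(net)) / 2 followed by a nearest-neighbour search, can be sketched as below. The 3-d vectors and the candidate words are invented stand-ins for real 300-d word2vec vectors, and cosine similarity is an assumed choice of measure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-d stand-ins for 300-d word2vec vectors
wv = {
    "fish":    [1.0, 0.0, 0.0],
    "net":     [0.0, 1.0, 0.0],
    "fishnet": [0.7, 0.7, 0.1],
    "car":     [0.0, 0.1, 1.0],
}

# wv* is the element-wise average of the two word vectors
wv_star = [(a + b) / 2 for a, b in zip(wv["fish"], wv["net"])]

# Nearest neighbour to wv*, excluding the two words that were averaged
candidates = [word for word in wv if word not in ("fish", "net")]
nearest = max(candidates, key=lambda word: cosine(wv_star, wv[word]))
```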


  101. 5. Case Studies


  102. Estimating Accident Repair Cost from Photos
    Prototype for large SA insurer
    Detect car make & model from registration disk
    Predict repair cost using learned model

  103. Image & Video Moderation
    Large international gay dating app with tens of millions of users uploading hundreds of thousands of photos per day

  104. Segmenting Medical Images

  105. Counting People
    Count shoppers, segment on age & gender
    Facial recognition loyalty is next

  106. Counting Cars

  107. Detecting Potholes


  108. Extracting Data from Product Catalogues


  109. Real-time ATM Video Classification

  110. Optical Sorting
    https://www.youtube.com/watch?v=Xf7jaxwnyso


  111. We are hiring!
    alex @ numberboost.com


  112. GET IN TOUCH!
    Alex Conway
    alex @ numberboost.com
    @alxcnwy


  113. https://www.autoblog.com/2017/08/04/self-driving-car-sign-hack-stickers/
