
Teaching Machines to Understand What They See

Humans rely heavily on their visual perception to understand the world. Can machines do the same? In this talk we will briefly go over the basics of machine learning with biologically inspired neural networks. We will then discuss the details of convolutional neural networks and how to use them to understand images.

Semih Yağcıoğlu

April 15, 2017

Transcript

  1. What is Intelligence? “Intelligence is the computational part of the ability to achieve goals in the world. Varying kinds and degrees of intelligence occur in people, many animals and some machines.” (Credit: John McCarthy)
  2. What is Artificial Intelligence? “It is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable.” (Credit: John McCarthy)
  3. Dartmouth AI Project Proposal (1955) “We propose that a 2 month, 10 man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.” (Slide Credit: Dhruv Batra)
  4. Timeline (Credit: Peter Norvig)
     • 1940s — Interest in neurons, neural networks and their relationship to mathematics and learning
     • 1943 — McCulloch & Pitts: “A logical calculus of the ideas immanent in nervous activity”
     • 1950 — A. M. Turing’s paper “Computing Machinery and Intelligence” introduced “The Turing Test”
     • 1956 — Dartmouth conference
     • 1950s and 1960s — enthusiasm and optimism; big promises
     • Late 1960s and 1970s — realization that further progress was really hard; disillusionment
     • 1980s — expert systems, neural networks, etc.; AI now a little different; quiet successes
     • 1990s to present — intelligent agents
     • 2000s — robot pets, self-driving cars
  5. What’s Missing? This type of mechanism is very useful for modelling a simple function, but there is no learning at all.
  6. What is Learning? “the acquisition of knowledge or skills through experience, study, or by being taught.” (Slide Credit: Dhruv Batra)
  7. What is Machine Learning? “algorithms that improve their performance (P) at some task (T) with experience (E)” - Tom Mitchell
  8. Machine Learning in a Nutshell (Slide Credit: Pedro Domingos)
     • Tens of thousands of machine learning algorithms; hundreds more every year
     • Decades of ML research, oversimplified: all of machine learning is learning a mapping from input to output, f: X → Y (e.g. X: emails, Y: {spam, not spam})
  9. Machine Learning in a Nutshell (Slide Credit: Dhruv Batra)
     • Input: x (images, text, emails, …)
     • Output: y (spam or non-spam, …)
     • (Unknown) target function f: X → Y (the “true” mapping / reality)
     • Data: (x1, y1), (x2, y2), …, (xN, yN)
     • Model / hypothesis class g: X → Y, e.g. y = g(x) = sign(wᵀx)
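
      The hypothesis class above can be made concrete in a few lines of NumPy; a minimal sketch (the weights and input here are made up for illustration):

          import numpy as np

          # Hypothesis class: g(x) = sign(w^T x), with a bias folded into w
          w = np.array([0.5, -0.3, 0.1])      # illustrative weights (bias last)

          def g(x):
              x_aug = np.append(x, 1.0)       # append a constant 1 for the bias
              return np.sign(w @ x_aug)       # predicts +1 or -1

          print(g(np.array([2.0, 1.0])))      # -> 1.0 (0.5*2 - 0.3*1 + 0.1 = 0.8)
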
  10. Machine Learning in a Nutshell (Slide Credit: Pedro Domingos)
      Every machine learning algorithm has three components:
      • Representation / model class
      • Evaluation / objective function
      • Optimization
  11. Perceptron Learning Rule
      • initialize random weights wi, set learning rate a = 0.1
      • until the termination condition is satisfied:
        ⚬ for each training example (x, y):
          ⚬ calculate the output: o = f_threshold(Σ wi·xi)
          ⚬ if the perceptron does not respond correctly, update the weights: wi = wi + a(y - o)xi
      where y is the desired output, o is the output generated by the perceptron, and wi is the weight associated with the i-th connection.
  12. Perceptron Learning Rule Example. Suppose a perceptron accepts two inputs x1 = 2 and x2 = 1, with weights w1 = 0.5 and w2 = 0.3 and bias weight w0 = -1 (bias input x0 = 1). The weighted sum is: o = 2 * 0.5 + 1 * 0.3 - 1 = 0.3. Since 0.3 > 0, the thresholded output is 1. If the correct output however is 0, the weights are adjusted according to the perceptron rule (taking the learning rate as 1 here) as follows:
      w1 = 0.5 + (0 - 1) * 2 = -1.5
      w2 = 0.3 + (0 - 1) * 1 = -0.7
      w0 = -1 + (0 - 1) * 1 = -2
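
      The numbers on this slide are easy to check in code; a minimal sketch of the update (note the slide implicitly uses a learning rate of 1 rather than the 0.1 set earlier):

          import numpy as np

          x = np.array([1.0, 2.0, 1.0])    # x0 = 1 (bias input), x1 = 2, x2 = 1
          w = np.array([-1.0, 0.5, 0.3])   # w0 = -1, w1 = 0.5, w2 = 0.3
          y, a = 0, 1.0                    # desired output, learning rate

          o = 1 if w @ x > 0 else 0        # 2*0.5 + 1*0.3 - 1 = 0.3 > 0, so o = 1
          w = w + a * (y - o) * x          # perceptron rule: wi = wi + a(y - o)xi
          print(o, w)                      # -> 1 [-2.  -1.5 -0.7]
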
  13.–14. Multilayer Perceptron: a network with one hidden layer is called a “2-layer Neural Net” or “1-hidden-layer Neural Net”; with two hidden layers, a “3-layer Neural Net” or “2-hidden-layer Neural Net”. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
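
      In code, a 2-layer net in this sense is one hidden layer followed by an output layer. A minimal forward pass; the layer sizes and the ReLU nonlinearity are illustrative choices, not taken from the slides:

          import numpy as np

          rng = np.random.default_rng(0)
          W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)  # hidden layer: 3 -> 4
          W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)  # output layer: 4 -> 2

          def forward(x):
              h = np.maximum(0, W1 @ x + b1)   # hidden activations (ReLU)
              return W2 @ h + b2               # output scores

          print(forward(np.array([1.0, -2.0, 0.5])))
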
  15. Chain Rule: a worked gradient computation with x = -2, y = 5, z = -4. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
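
      The slide’s equations are lost here, but the lecture credited on this slide works this example with f(x, y, z) = (x + y) * z; assuming that function, a minimal sketch of the chain-rule computation:

          # Forward pass: split f into q = x + y, then f = q * z
          x, y, z = -2.0, 5.0, -4.0
          q = x + y                 # q = 3
          f = q * z                 # f = -12

          # Backward pass via the chain rule
          df_dq = z                 # ∂f/∂q = z = -4
          df_dz = q                 # ∂f/∂z = q = 3
          df_dx = df_dq * 1.0       # ∂f/∂x = (∂f/∂q)(∂q/∂x) = -4
          df_dy = df_dq * 1.0       # ∂f/∂y = (∂f/∂q)(∂q/∂y) = -4
          print(f, df_dx, df_dy, df_dz)   # -> -12.0 -4.0 -4.0 3.0
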
  16. Image Classification: given an image, assign it one label from a fixed set of discrete categories, e.g. {dog, cat, truck, plane, ...}; here, “cat”. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  17. Convolution Layer: the input is a 32x32x3 image (width 32, height 32, depth 3). (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  18.–19. Convolution Layer: convolve a 5x5x3 filter with the 32x32x3 image, i.e. “slide over the image spatially, computing dot products”. Filters always extend the full depth of the input volume. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  20. Convolution Layer: at each position we get 1 number, the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product, plus a bias). (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  21. Convolution Layer: convolving (sliding) the 5x5x3 filter over all spatial locations of the 32x32x3 image yields a 28x28x1 activation map. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  22. Convolution Layer: consider a second (green) filter; convolving it the same way yields a second 28x28 activation map. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  23. Convolution Layer: for example, if we had 6 5x5 filters, we’d get 6 separate 28x28 activation maps. We stack these up to get a “new image” of size 28x28x6! (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
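
      A naive NumPy version of this layer (stride 1, no padding) makes the shapes concrete; a sketch for clarity, not an efficient implementation:

          import numpy as np

          def conv_layer(image, filters, biases):
              # image: (32, 32, 3); filters: (6, 5, 5, 3); biases: (6,)
              H, W, _ = image.shape
              K, F, _, _ = filters.shape
              out = np.zeros((H - F + 1, W - F + 1, K))       # -> (28, 28, 6)
              for k in range(K):
                  for i in range(H - F + 1):
                      for j in range(W - F + 1):
                          patch = image[i:i+F, j:j+F, :]      # a 5x5x3 chunk
                          out[i, j, k] = np.sum(patch * filters[k]) + biases[k]
              return out

          maps = conv_layer(np.random.rand(32, 32, 3),
                            np.random.rand(6, 5, 5, 3), np.zeros(6))
          print(maps.shape)   # -> (28, 28, 6)
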
  24.–25. Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions, e.g. 32x32x3 → CONV + ReLU (6 5x5x3 filters) → 28x28x6 → CONV + ReLU (10 5x5x6 filters) → 24x24x10 → CONV + ReLU → …. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  26. A closer look at spatial dimensions: a 5x5x3 filter convolved (slid) over all spatial locations of a 32x32x3 image gives a 28x28 activation map. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  27.–31. A closer look at spatial dimensions: a 3x3 filter slid over a 7x7 input (spatially) one step at a time fits in 5 positions along each axis => 5x5 output. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  32.–34. A closer look at spatial dimensions: the same 3x3 filter applied to the 7x7 input with stride 2 => 3x3 output! (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  35.–36. A closer look at spatial dimensions: a 3x3 filter applied with stride 3? It doesn’t fit! We cannot apply a 3x3 filter to a 7x7 input with stride 3. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  37. Output size for an NxN input and an FxF filter: (N - F) / stride + 1. e.g. N = 7, F = 3:
      stride 1 => (7 - 3)/1 + 1 = 5
      stride 2 => (7 - 3)/2 + 1 = 3
      stride 3 => (7 - 3)/3 + 1 = 2.33 :\ (not an integer, so it doesn’t fit)
      (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
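
      The formula is easy to encode; a small helper reproducing the three cases above:

          def out_size(N, F, stride):
              # output fits only if (N - F) is divisible by the stride
              return (N - F) / stride + 1

          print(out_size(7, 3, 1))   # -> 5.0
          print(out_size(7, 3, 2))   # -> 3.0
          print(out_size(7, 3, 3))   # -> 2.33..., not an integer: doesn't fit
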
  38.–39. In practice it is common to zero-pad the border. e.g. a 7x7 input, a 3x3 filter applied with stride 1, padded with a 1-pixel border of zeros => what is the output? Recall (N - F)/stride + 1, with N now the padded size 9: (9 - 3)/1 + 1 = 7, i.e. a 7x7 output! (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  40. In practice: in general it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F - 1)/2, which preserves the spatial size. e.g. F = 3 => zero-pad with 1; F = 5 => zero-pad with 2; F = 7 => zero-pad with 3. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  41. Convolving shrinks volumes spatially! (32 → 28 → 24 …) e.g. 32x32x3 → CONV + ReLU (6 5x5x3 filters) → 28x28x6 → CONV + ReLU (10 5x5x6 filters) → 24x24x10 → …. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  42.–43. Example: input volume 32x32x3, 10 5x5 filters with stride 1, pad 2. Output volume size? (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
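
      With padding P, the formula generalizes to (N + 2P - F)/stride + 1; a quick check of this example:

          def out_size_padded(N, F, stride, P):
              return (N + 2 * P - F) // stride + 1

          print(out_size_padded(32, 5, 1, 2))   # -> 32, so the output is 32x32x10
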
  44.–45. Example: input volume 32x32x3, 10 5x5 filters with stride 1, pad 2. Number of parameters in this layer? Each filter has 5*5*3 + 1 = 76 params (+1 for the bias) => 76 * 10 = 760. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
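
      The parameter count follows directly from the filter shape; as code:

          F, depth, K = 5, 3, 10             # filter size, input depth, num filters
          params = K * (F * F * depth + 1)   # +1 per filter for its bias
          print(params)                      # -> 760
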
  46.–47. The brain/neuron view of the CONV layer: each output is 1 number, the result of taking a dot product between the 5x5x3 filter and one part of the 32x32x3 image (a 5*5*3 = 75-dimensional dot product). It’s just a neuron with local connectivity... (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  48. The brain/neuron view of the CONV layer: an activation map is a 28x28 sheet of neuron outputs where (1) each neuron is connected to a small region of the input, and (2) all of them share parameters. A “5x5 filter” is equivalently a “5x5 receptive field for each neuron”. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  49. The brain/neuron view of the CONV layer: e.g. with 5 filters, the CONV layer consists of neurons arranged in a 3D grid (28x28x5), so there are 5 different neurons all looking at the same region of the input volume. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  50. Pooling Layer: makes the representations smaller and adds translation invariance. (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
  51. Max Pooling: on a single depth slice, max pool with 2x2 filters and stride 2, e.g.
          1 1 2 4
          5 6 7 8        6 8
          3 2 1 0   =>   3 4
          1 2 3 4
      (Slide Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson)
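
      A minimal NumPy max pool (2x2 filters, stride 2) that reproduces the slide's example:

          import numpy as np

          def max_pool(x, size=2, stride=2):
              H, W = x.shape
              out = np.zeros((H // stride, W // stride))
              for i in range(0, H - size + 1, stride):
                  for j in range(0, W - size + 1, stride):
                      out[i // stride, j // stride] = x[i:i+size, j:j+size].max()
              return out

          x = np.array([[1, 1, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])
          print(max_pool(x))   # -> [[6. 8.]
                               #     [3. 4.]]
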