Slide 1

Slide 1 text

Convolutional Networks (Deep Learning §9–9.4) Satoshi Murashige Machine Learning Reading Club (July 9, 2018) Mathematical Informatics Lab., NAIST

Slide 2

Slide 2 text

Today’s agenda Introduction (9.10, 9.11) 9.1 The Convolution Operation 9.2 Motivation 9.3 Pooling 9.4 Convolution and Pooling as an Infinitely Strong Prior 1/42

Slide 3

Slide 3 text

Introduction (9.10, 9.11)

Slide 4

Slide 4 text

Convolutional Neural Networks (CNNs) won ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 image-net.org/challenges/LSVRC/2012/browse-synsets

Team name | Error | Description
SuperVision | 0.15315 | Using extra training data from ImageNet Fall 2011 release
SuperVision | 0.16422 | Using only supplied training data
ISI | 0.26172 | Weighted sum of scores from each classifier with SIFT+FV, LBP+FV, GIST+FV, and CSIFT+FV, respectively.
ISI | 0.26602 | Weighted sum of scores from classifiers using each FV.
ISI | 0.26646 | Naive sum of scores from classifiers using each FV.
ISI | 0.26952 | Naive sum of scores from each classifier with SIFT+FV, LBP+FV, GIST+FV, and CSIFT+FV, respectively.
OXFORD VGG | 0.26979 | Mixed selection from High-Level SVM scores and Baseline Scores, decision is performed by looking at the validation performance

image-net.org/challenges/LSVRC/2012/results.html 2/42

Slide 5

Slide 5 text

CNNs are designed to capture three properties of the primary visual cortex (V1) • V1 is the first area of the brain that begins advanced processing of visual input • V1 contains many simple cells and many complex cells • Simple cells: • Simple cells respond selectively to specific oriented line segments • Convolution layers are designed to emulate the properties of simple cells • Complex cells: • Complex cells are similar to simple cells, but they are invariant to small shifts in the position of the input • Pooling layers are inspired by complex cells 3/42

Slide 6

Slide 6 text

9.1 The Convolution Operation

Slide 7

Slide 7 text

Definition and Properties of Convolution • Convolution is defined as: s(t) = ∫ x(a) w(t − a) da (9.1), also written s(t) = (x ∗ w)(t) (9.2) • Convolution is commutative: (x ∗ w)(t) = (w ∗ x)(t) • In the context of CNNs, the terms in (9.1) have names: s(t) is the feature map, x(a) is the input, and w(t − a) is the kernel 4/42

Slide 8

Slide 8 text

One interpretation of convolution: flipping, multiplication, and summation s(t) = ∫ x(a) w(t − a) da (9.1) 5/42

Slide 9

Slide 9 text

One interpretation of convolution: flipping, multiplication, and summation s(t) = ∫ x(a) w(t − a) da (9.1) = ∫ x(a) w(−(a − t)) da, where w(−(a − t)) is w(a) flipped about the y-axis and shifted by t in the positive direction 5/42

Slide 10

Slide 10 text

One interpretation of convolution: flipping, multiplication, and summation s(t) = ∫ x(a) w(t − a) da (9.1) = ∫ x(a) w(−(a − t)) da, where w(−(a − t)) is w(a) flipped about the y-axis and shifted by t in the positive direction (Plots: x(a), w(a), and the overlap of x(a) with the flipped-and-shifted kernel at t = 0, t = 2, and t = 4) 5/42

Slide 11

Slide 11 text

When we work with data on a computer, functions are usually represented as arrays • Continuous convolution (t, a ∈ R): s(t) = ∫ x(a) w(t − a) da (9.1) • Discrete convolution (t, a ∈ Z): s(t) = Σ_{a=−∞}^{∞} x(a) w(t − a) (9.3) • A discrete function can be represented as an array, e.g. x(t) indexed by t = 0, 1, 2, …, N − 1 with values 14, 220, 128, …, 96, 68 • We usually assume that these signals are zero everywhere outside the stored array 6/42
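A minimal NumPy sketch of the discrete convolution (9.3) under the zero-outside-the-array assumption; the signal and kernel values below are illustrative, not taken from the slides:

```python
import numpy as np

x = np.array([14, 220, 128, 96, 68], dtype=float)  # input x(t), stored as an array
w = np.array([0.25, 0.5, 0.25], dtype=float)       # kernel w(t)

def conv1d(x, w):
    """Discrete convolution (9.3): s(t) = sum_a x(a) w(t - a),
    with both signals assumed zero outside their stored arrays."""
    n, k = len(x), len(w)
    s = np.zeros(n + k - 1)
    for t in range(len(s)):
        for a in range(n):
            if 0 <= t - a < k:
                s[t] += x[a] * w[t - a]
    return s

print(conv1d(x, w))
print(np.convolve(x, w))  # NumPy's "full" convolution gives the same result
```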

Slide 12

Slide 12 text

Convolution for image data (2-D signals) • 1-D convolution (t, a ∈ Z): s(t) = (x ∗ w)(t) = Σ_{a=−∞}^{∞} x(a) w(t − a) (9.3) • 2-D convolution (i, j, m, n ∈ Z): S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n) (9.4) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n) (9.5) • Generally, the kernel K has much smaller support than the image I • (9.5) is more straightforward to implement in an ML library, because the sums range only over the small kernel indices 7/42
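A direct, unoptimized sketch of 2-D convolution in the form of (9.5); the small image and kernel are hypothetical and only meant to make the index pattern concrete:

```python
import numpy as np

def conv2d_full(I, K):
    """2-D convolution as in (9.5): S(i, j) = sum_{m,n} I(i - m, j - n) K(m, n),
    with I assumed zero outside its stored array ("full" output)."""
    H, W = I.shape
    kh, kw = K.shape
    S = np.zeros((H + kh - 1, W + kw - 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            for m in range(kh):          # m, n range only over the small kernel
                for n in range(kw):
                    if 0 <= i - m < H and 0 <= j - n < W:
                        S[i, j] += I[i - m, j - n] * K[m, n]
    return S

I = np.arange(12, dtype=float).reshape(3, 4)  # hypothetical 3x4 "image"
K = np.array([[1.0, 0.0], [0.0, -1.0]])       # hypothetical 2x2 kernel
print(conv2d_full(I, K))
```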

Slide 13

Slide 13 text

In the context of NNs, cross-correlation is used instead of convolution • The only reason to flip the kernel is to obtain the commutative property: (I ∗ K)(i, j) = (K ∗ I)(i, j) • Instead, many NN libraries implement cross-correlation but call it convolution: S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n) (9.6) • (ML algorithm with kernel flipping) = (ML algorithm without kernel flipping) + (flipping the learned kernel) 8/42
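A short check, assuming SciPy is available (the slides do not mention a specific library), that cross-correlation (9.6) is just convolution (9.4) with a kernel flipped in both axes:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d  # assumed dependency

rng = np.random.default_rng(0)
I = rng.standard_normal((5, 6))   # hypothetical image
K = rng.standard_normal((3, 3))   # hypothetical kernel

# Cross-correlation (9.6): what most NN libraries call "convolution".
S_corr = correlate2d(I, K, mode='valid')

# True convolution (9.4) with the kernel flipped in both axes gives the same output.
S_conv = convolve2d(I, K[::-1, ::-1], mode='valid')

print(np.allclose(S_corr, S_conv))  # True: flipping the kernel converts one into the other
```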

Slide 14

Slide 14 text

In the context of NNs, cross-correlation is used instead of convolution • The only reason to flip the kernel is to obtain the commutative property: (I ∗ K)(i, j) = (K ∗ I)(i, j) • Instead, many NN libraries implement cross-correlation but call it convolution: S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n) (9.6), where the modified indices (i + m, j + n) replace the (i − m, j − n) of (9.5) • (ML algorithm with kernel flipping) = (ML algorithm without kernel flipping) + (flipping the learned kernel) 8/42

Slide 15

Slide 15 text

An example of 2-D convolution without kernel flipping Figure 9.1 (Goodfellow, 2016): Input (3 × 4): a b c d / e f g h / i j k l. Kernel (2 × 2): w x / y z. Output (2 × 3): aw + bx + ey + fz, bw + cx + fy + gz, cw + dx + gy + hz / ew + fx + iy + jz, fw + gx + jy + kz, gw + hx + ky + lz. The output is restricted to positions where the kernel lies entirely within the image, called “valid” convolution in some contexts. 9/42
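A small sketch that reproduces the symbolic output of Figure 9.1: a 3 × 4 input of letters a–l, a 2 × 2 kernel of w, x, y, z, and "valid" cross-correlation without kernel flipping:

```python
# Input and kernel written as grids of symbols, as in Figure 9.1.
inp = [["a", "b", "c", "d"],
       ["e", "f", "g", "h"],
       ["i", "j", "k", "l"]]
ker = [["w", "x"],
       ["y", "z"]]

out_h = len(inp) - len(ker) + 1        # 2: kernel must lie entirely within the image
out_w = len(inp[0]) - len(ker[0]) + 1  # 3

# Each output entry is the sum of elementwise products over one kernel placement.
out = [[" + ".join(inp[i + m][j + n] + ker[m][n]
                   for m in range(len(ker))
                   for n in range(len(ker[0])))
        for j in range(out_w)]
       for i in range(out_h)]

for row in out:
    print(row)
# First entry: 'aw + bx + ey + fz', matching the figure.
```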

Slide 16

Slide 16 text

Feed-forward neural networks can be viewed as multiplication by a matrix (Diagram: a fully connected layer with inputs x1, …, x4, units u1, u2, u3, and outputs z1, z2, z3) • wij: weight between input xj and unit ui • Outputs of the network: z = f(u), u = Wx, i.e. [u1; u2; u3] = [w11 w12 w13 w14; w21 w22 w23 w24; w31 w32 w33 w34] [x1; x2; x3; x4] 10/42

Slide 17

Slide 17 text

Discrete convolution can be viewed as multiplication by a sparse matrix: u = Wx, i.e. [u(0); u(1); u(2); u(3); u(4); u(5)] = [w0 w1 w2 0 0 0; 0 w0 w1 w2 0 0; 0 0 w0 w1 w2 0; 0 0 0 w0 w1 w2; 0 0 0 0 w0 w1; 0 0 0 0 0 w0] [x(0); x(1); x(2); x(3); x(4); x(5)] 11/42
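A sketch that builds the banded matrix above for a 3-element kernel and a length-6 input (signal assumed zero beyond the array) and checks that multiplying by it matches the sliding-window computation u(t) = Σ_k w_k x(t + k), which is the un-flipped form the matrix on the slide corresponds to; the numeric values are arbitrary:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # hypothetical input x(0..5)
w = np.array([0.5, 0.3, 0.2])                 # hypothetical kernel (w0, w1, w2)

# Build the matrix from the slide: row t holds (w0, w1, w2) starting at column t,
# truncated at the right edge because the signal is zero outside the array.
N, K = len(x), len(w)
W = np.zeros((N, N))
for t in range(N):
    for k in range(K):
        if t + k < N:
            W[t, t + k] = w[k]

u_matrix = W @ x

# The same result computed directly as a sliding window, u(t) = sum_k w_k x(t + k):
u_direct = np.array([sum(w[k] * x[t + k] for k in range(K) if t + k < N)
                     for t in range(N)])

print(np.allclose(u_matrix, u_direct))  # True
print(W)                                # mostly zeros: the sparse banded structure
```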

Slide 18

Slide 18 text

Summary (9.1) • In the context of CNNs, we do not rely on the mathematical properties of convolution (such as commutativity); we only use its output • Many machine learning libraries therefore implement cross-correlation but call it convolution • Discrete convolution can be viewed as multiplication by a sparse matrix 12/42

Slide 19

Slide 19 text

9.2 Motivation

Slide 20

Slide 20 text

Motivation: Convolution leverages three important ideas that can help improve a machine learning system • Sparse Interactions • Parameter Sharing • Equivariant Representations 13/42

Slide 21

Slide 21 text

In a fully connected feed-forward neural network, every output depends on every input (Diagram: a fully connected layer with inputs x1, …, x4, units u1, u2, u3, and outputs z1, z2, z3) • wij: weight between input xj and unit ui • Outputs of the network: z1 = f(w11x1 + w12x2 + w13x3 + w14x4), z2 = f(w21x1 + w22x2 + w23x3 + w24x4), z3 = f(w31x1 + w32x2 + w33x3 + w34x4) 14/42

Slide 22

Slide 22 text

Sparse interactions reduce the memory requirements of the model and improve its statistical efficiency Convolutional Neural Network: each output si is connected only to a few neighboring inputs xj. Fully Connected Neural Network: each output si is connected to every input xj. (Diagrams with inputs x1, …, x5 and outputs s1, …, s5) 15/42

Slide 23

Slide 23 text

Sparse interactions reduce the memory requirements of the model and improve its statistical efficiency Convolutional Neural Network: each output si is connected only to a few neighboring inputs xj. Fully Connected Neural Network: each output si is connected to every input xj. (Diagrams with inputs x1, …, x5 and outputs s1, …, s5) 15/42

Slide 24

Slide 24 text

For many practical applications, we can obtain good performance on the ML task while keeping k several orders of magnitude smaller than m • If there are m inputs and n outputs, matrix multiplication requires O(m × n) parameters and runtime • If we limit the number of connections each output may have to k, the sparsely connected approach requires only O(k × n) 16/42

Slide 25

Slide 25 text

Although convolution layers are very sparse, the deeper layers can be indirectly connected to all or most of the inputs Figure 9.4 (Goodfellow, 2016): Growing receptive fields — the receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers; this effect increases if the network includes architectural features like strided convolution (Figure 9.12) or pooling. 17/42

Slide 26

Slide 26 text

Motivation: Convolution leverages three important ideas that can help improve a machine learning system • Sparse Interactions • Parameter Sharing • Equivariant Representations 18/42

Slide 27

Slide 27 text

Parameter sharing refers to using the same parameter for more than one function in a model Figure 9.5 (Goodfellow, 2016): Black arrows indicate the connections that use a particular parameter. (Top) The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model; due to parameter sharing, this single parameter is used at all input locations. (Bottom) The single black arrow indicates the use of the central element of the weight matrix in a fully connected model, which has no parameter sharing. Convolution shares the same parameters across all spatial locations; traditional matrix multiplication does not share any parameters. 19/42

Slide 28

Slide 28 text

Sparse connectivity and parameter sharing can dramatically improve the efficiency of an operation in an image Figure 9.6 (Goodfellow, 2016): Efficiency of edge detection. The output image is formed by taking each pixel of the input and subtracting the value of its neighboring pixel on the left, which shows the strength of all the vertically oriented edges — a useful operation for object detection. Both images are 280 pixels tall; the input is 320 pixels wide and the output is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements and requires 319 × 280 × 3 = 267,960 floating point operations (two multiplications and one addition per output pixel); describing the same transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over eight billion, entries in the matrix. 20/42

Slide 29

Slide 29 text

Sparse connectivity and parameter sharing can dramatically improve the efficiency of an operation in an image • Input size: 320 × 280 • Kernel size: 2 × 1 • Output size: 319 × 280

 | Convolution | Dense matrix | Sparse matrix
Stored floats | 2 | 320 × 280 × 319 × 280 ≃ 8.0 × 10^9 | 319 × 280 × 2 ≃ 1.8 × 10^5
Float muls and adds | 319 × 280 × 3 ≃ 2.7 × 10^5 | 1.6 × 10^10 | Same as convolution (≃ 2.7 × 10^5)

21/42
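A sketch of the edge-detection example with a two-element kernel on a random image of the stated size; the image content is arbitrary, and only the shapes and operation counts matter:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((280, 320))   # hypothetical grayscale image: 280 tall, 320 wide

# Two-element kernel [1, -1]: each output pixel is a pixel minus its horizontal neighbor,
# which highlights vertically oriented edges (cf. Figure 9.6).
kernel = np.array([1.0, -1.0])
edges = kernel[0] * img[:, :-1] + kernel[1] * img[:, 1:]

print(edges.shape)  # (280, 319)
# Counting two multiplications and one addition per output pixel gives
# 319 * 280 * 3 = 267,960 (about 2.7e5) floating point operations, as in Figure 9.6,
# versus roughly 8e9 stored entries for an equivalent dense matrix.
```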

Slide 30

Slide 30 text

Motivation: Convolution leverages three important ideas that can help improve a machine learning system • Sparse Interactions • Parameter Sharing • Equivariant Representations 22/42

Slide 31

Slide 31 text

Equivariance between two functions f(x) and g(x) f(g(x)) = g(f(x)) 23/42

Slide 32

Slide 32 text

Equivariance between two functions f(x) and g(x): f(g(x)) = g(f(x)) (Commutative diagram: starting from x, applying g then f reaches f(g(x)); applying f then g reaches g(f(x)); the two results are equal.) 23/42

Slide 33

Slide 33 text

Equivariance between convolution and shifting Let I(x, y) be an image (I : R² → R), g(I) = (shift every pixel of the image I(x, y) one unit to the right) = I(x − 1, y), f(I) = (I ∗ K)(x, y) (Commutative diagram: shifting then convolving gives (g(I) ∗ K)(x, y); convolving then shifting gives (I ∗ K)(x − 1, y); the two are equal.) 24/42
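A numerical check of this equivariance with random values and a hand-written "valid" cross-correlation (the NN-style convolution from the earlier slides): shifting the image and then convolving equals convolving and then shifting, except where the zero-filled border enters:

```python
import numpy as np

def corr2d_valid(I, K):
    """"Valid" cross-correlation: S[i, j] = sum_{m,n} I[i + m, j + n] K[m, n]."""
    H, W = I.shape
    kh, kw = K.shape
    S = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

def shift_right(img):
    """g: shift every pixel one unit to the right, filling the new column with zeros."""
    out = np.zeros_like(img)
    out[:, 1:] = img[:, :-1]
    return out

rng = np.random.default_rng(0)
I = rng.standard_normal((8, 8))   # hypothetical image
K = rng.standard_normal((3, 3))   # hypothetical kernel

A = corr2d_valid(shift_right(I), K)   # shift, then convolve
B = shift_right(corr2d_valid(I, K))   # convolve, then shift

# The two agree everywhere except the first output column, where the zero border differs.
print(np.allclose(A[:, 1:], B[:, 1:]))  # True
```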

Slide 34

Slide 34 text

Convolution with a kernel generates a 2-D feature map of where certain features appear in the input (Illustration: input image ∗ kernel = feature map, shown for two different kernels) 25/42

Slide 35

Slide 35 text

Convolution is unsuited for use in some cases • We may not wish to share parameters across the entire image, e.g. when we want to look for eyebrows in one region and for a chin in another • Convolution is not equivariant to some transformations of the original image, such as scaling and rotation 26/42

Slide 36

Slide 36 text

Summary (9.2): Convolution leverages three important ideas that can help improve a machine learning system • Sparse Interactions and Parameter Sharing dramatically decrease memory requirements and computational costs • Equivariant Representations generate a 2-D feature map of where certain features appear in the input • In some cases, convolution is unsuitable 27/42

Slide 37

Slide 37 text

9.3 Pooling

Slide 38

Slide 38 text

A typical convolutional network layer consists of three stages: convolution, detector, and pooling Figure 9.7 (Goodfellow, 2016): The components of a typical convolutional neural network layer, in two commonly used sets of terminology. Complex layer terminology: input to layer → convolution stage (affine transform) → detector stage (nonlinearity, e.g. rectified linear) → pooling stage → next layer. Simple layer terminology: input to layers → convolution layer (affine transform) → detector layer (nonlinearity, e.g. rectified linear) → pooling layer → next layer. 28/42

Slide 39

Slide 39 text

Pooling replaces the output of the net at a certain location with a summary statistic of the nearby outputs Example: Max Pooling Output of Detector Stage (5 × 5): 210 196 203 195 184 / 207 213 203 183 175 / 149 235 198 177 185 / 193 210 218 200 201 / 70 134 208 222 199 Output of Pooling Stage: (not yet filled in) 29/42

Slide 40

Slide 40 text

Pooling replaces the output of the net at a certain location with a summary statistic of the nearby outputs Example: Max Pooling Output of Detector Stage (5 × 5): 210 196 203 195 184 / 207 213 203 183 175 / 149 235 198 177 185 / 193 210 218 200 201 / 70 134 208 222 199 Output of Pooling Stage: 235 29/42

Slide 41

Slide 41 text

Pooling replaces the output of the net at a certain location with a summary statistic of the nearby outputs Example: Max Pooling Output of Detector Stage (5 × 5): 210 196 203 195 184 / 207 213 203 183 175 / 149 235 198 177 185 / 193 210 218 200 201 / 70 134 208 222 199 Output of Pooling Stage: 235 235 29/42

Slide 42

Slide 42 text

Pooling replaces the output of the net at a certain location with a summary statistic of the nearby outputs Example: Max Pooling Output of Detector Stage (5 × 5): 210 196 203 195 184 / 207 213 203 183 175 / 149 235 198 177 185 / 193 210 218 200 201 / 70 134 208 222 199 Output of Pooling Stage: 235 235 203 29/42
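A max-pooling sketch over the detector output above. The slides do not state the pooling parameters, but a 3 × 3 window with stride 1 reproduces the pooled values 235, 235, 203 shown on these slides:

```python
import numpy as np

# The detector-stage output from the slide, read as a 5x5 grid.
detector = np.array([
    [210, 196, 203, 195, 184],
    [207, 213, 203, 183, 175],
    [149, 235, 198, 177, 185],
    [193, 210, 218, 200, 201],
    [ 70, 134, 208, 222, 199],
])

def max_pool2d(x, size, stride):
    """Max pooling: replace each window with the maximum of the entries inside it."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

print(max_pool2d(detector, size=3, stride=1))  # first row: 235, 235, 203
```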

Slide 43

Slide 43 text

Pooling helps to make the representation approximately invariant to small translations of the input Figure 9.8 (Goodfellow, 2016): Max pooling introduces invariance. (Top) The middle of the output of a convolutional layer: the bottom row shows outputs of the nonlinearity (detector stage), the top row the outputs of max pooling with a stride of one pixel between pooling regions and a pooling region width of three pixels. (Bottom) The same network after the input has been shifted to the right by one pixel: every value in the detector row has changed, but only half of the values in the pooling row have changed. 30/42

Slide 44

Slide 44 text

If we pool over the outputs of separately parameterized convolutions, the features can learn which transformations to become invariant to Figure 9.9 (Goodfellow, 2016): Example of learned invariances — a pooling unit that pools over multiple features learned with separate parameters can learn to be invariant to transformations of the input. Here, three learned filters (all intended to detect a hand-written 5) and a max pooling unit learn to become invariant to rotation: whichever detector unit responds strongly, the pooling unit gives a large response. 31/42

Slide 45

Slide 45 text

Pooling with a stride of k pixels improves the computational efficiency of the network Figure 9.10 (Goodfellow, 2016): Pooling with downsampling — max pooling with a pool width of three and a stride of two between pools reduces the representation size by a factor of two, which reduces the computational and statistical burden on the next layer. The rightmost pooling region has a smaller size, but must be included if we do not want to ignore some of the detector units. 32/42
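A 1-D sketch of pooling with downsampling in the spirit of Figure 9.10 (pool width 3, stride 2, with a smaller rightmost region); the detector values are read off the figure and are only illustrative:

```python
import numpy as np

def max_pool1d(x, width, stride):
    """Max pooling with downsampling: windows of the given width start every `stride`
    positions; the rightmost pooling region may be smaller, as in Figure 9.10."""
    return np.array([np.max(x[start:start + width])
                     for start in range(0, len(x), stride)])

detector = np.array([0.1, 1.0, 0.2, 0.1])        # detector-stage outputs (illustrative)
print(max_pool1d(detector, width=3, stride=2))   # [1.0, 0.2]: about half as many outputs
```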

Slide 46

Slide 46 text

For many tasks, pooling is essential for handling inputs of varying size Figure 9.11 (Goodfellow, 2016): Examples of architectures for classification with convolutional networks. (1) Input image 256x256x3 → convolution + ReLU: 256x256x64 → pooling with stride 4: 64x64x64 → convolution + ReLU: 64x64x64 → pooling with stride 4: 16x16x64 → reshape to vector: 16,384 units → matrix multiply: 1,000 units → softmax: 1,000 class probabilities. (2) Input image 256x256x3 → convolution + ReLU: 256x256x64 → pooling with stride 4: 64x64x64 → convolution + ReLU: 64x64x64 → pooling to 3x3 grid: 3x3x64 → reshape to vector: 576 units → matrix multiply: 1,000 units → softmax: 1,000 class probabilities. (3) Input image 256x256x3 → convolution + ReLU: 256x256x64 → pooling with stride 4: 64x64x64 → convolution + ReLU: 64x64x64 → pooling with stride 4: 16x16x64 → convolution: 16x16x1,000 → average pooling: 1x1x1,000 → softmax: 1,000 class probabilities. 33/42

Slide 47

Slide 47 text

For many tasks, pooling is essential for handling inputs of varying size Figure 9.11 (Goodfellow, 2016), as on the previous slide • “Output of pooling with stride 4”: the output size depends on the input size • “Output of pooling to 3x3 grid”: the output size is fixed regardless of the input size 33/42
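A sketch of "pooling to a fixed grid", the trick used by the architecture that pools to a 3x3 grid: split the spatial dimensions into a 3 × 3 set of regions whose sizes scale with the input, then max-pool each region. This is an assumed implementation, not code from the book:

```python
import numpy as np

def pool_to_grid(feature_map, grid=3):
    """Max-pool a (H, W, C) feature map down to a fixed (grid, grid, C) output
    by splitting H and W into `grid` roughly equal regions (assumes H, W >= grid)."""
    H, W, C = feature_map.shape
    rows = np.array_split(np.arange(H), grid)
    cols = np.array_split(np.arange(W), grid)
    out = np.empty((grid, grid, C))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            out[i, j] = feature_map[r[0]:r[-1] + 1, c[0]:c[-1] + 1].max(axis=(0, 1))
    return out

rng = np.random.default_rng(0)
# Two hypothetical feature maps with different spatial sizes but 64 channels each:
print(pool_to_grid(rng.random((64, 64, 64))).shape)  # (3, 3, 64)
print(pool_to_grid(rng.random((40, 52, 64))).shape)  # (3, 3, 64) as well
```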

Slide 48

Slide 48 text

Research works on pooling • [Boureau et al., 2010] • Theoretical work on pooling • It gives guidance as to which kinds of pooling one should use in various situations • [Boureau et al., 2011] • Pooling is usually performed over spatially local neighborhoods, which are not local in feature space • It is possible to dynamically pool features depending on the locations of interesting features in feature space • [Jia et al., 2012] • Commonly, pooling uses manually defined parameters • In this work, the optimal pooling parameters are learned from data 34/42

Slide 49

Slide 49 text

Summary (9.3): Pooling is awesome • Pooling replaces the output of the net with a summary statistic of nearby outputs • Pooling helps to make the representation invariant to small translations of the input • Pooling can handle inputs of varying size 35/42

Slide 50

Slide 50 text

9.4 Convolution and Pooling as an Infinitely Strong Prior

Slide 51

Slide 51 text

A prior probability distribution encodes our beliefs about which models are reasonable, before we have seen any data • W: a set of parameters • D: a set of data • p(W|D) ∝ p(D|W) × p(W), i.e. posterior ∝ likelihood × prior 36/42

Slide 52

Slide 52 text

Weak/Strong Prior Distributions • Weak prior distribution: large variance, high entropy (a broad, flat p(W)) • Strong prior distribution: small variance, low entropy (a narrow, peaked p(W)) 37/42

Slide 53

Slide 53 text

What’s an “Infinitely Strong Prior”? An infinitely strong prior places zero probability on some parameters and says that these parameter values are completely forbidden, regardless of how much support the data give to those values. 38/42

Slide 54

Slide 54 text

We can interpret a CNN as a fully connected NN with an infinitely strong prior, but implementing it that way would be extremely wasteful computationally (Diagram: a fully connected layer with inputs x1, …, x5 and outputs s1, …, s5) • The infinitely strong prior says ... 39/42

Slide 55

Slide 55 text

We can interpret a CNN as a fully connected NN with an infinitely strong prior, but implementing it that way would be extremely wasteful computationally (Diagram: a fully connected layer with inputs x1, …, x5 and outputs s1, …, s5) • The infinitely strong prior says ... • “The weights for one hidden unit must be identical to the weights of its neighbor but shifted in space.” • “The prior also says that the weights must be zero, except for in the small, spatially contiguous receptive field assigned to that hidden unit.” • “In the use of pooling, each unit should be invariant to small translations.” 39/42

Slide 56

Slide 56 text

From this reinterpretation of CNNs, we can get some insights into how CNNs work • Convolution and pooling can cause underfitting • They are only useful when the assumptions made by the prior are reasonably accurate • We should only compare convolutional models to other convolutional models in benchmarks of statistical learning performance • Convolutional models already have hard-coded knowledge of spatial structure • Models without convolution would be able to learn even if we permuted all the pixels in the image (permutation invariance) 40/42

Slide 57

Slide 57 text

Summary (9.4) • An infinitely strong prior places zero probability on some parameters and says that these parameter values are completely forbidden • We can interpret a CNN as a fully connected NN with an infinitely strong prior and get some insights into how CNNs work • Convolution and pooling can cause underfitting • Comparing convolutional models with models without convolution is unfair 41/42

Slide 58

Slide 58 text

Summary • CNNs are one of the success stories of biologically inspired AI • In the context of NNs, cross-correlation is used instead of convolution • Convolution leverages three important ideas that can help improve an ML system: • Sparse Interactions, Parameter Sharing, Equivariant Representations • Pooling summarizes outputs, provides invariance to small translations, and handles inputs of varying size • We can interpret a CNN as a fully connected NN with an infinitely strong prior and get some insights: • Convolution and pooling can cause underfitting • Comparing convolutional models with models without convolution is unfair 42/42