Challenge (ILSVRC) in 2012
image-net.org/challenges/LSVRC/2012/browse-synsets
Team name    Error    Description
SuperVision  0.15315  Using extra training data from ImageNet Fall 2011 release
SuperVision  0.16422  Using only supplied training data
ISI          0.26172  Weighted sum of scores from each classifier with SIFT+FV, LBP+FV, GIST+FV, and CSIFT+FV, respectively.
ISI          0.26602  Weighted sum of scores from classifiers using each FV.
ISI          0.26646  Naive sum of scores from classifiers using each FV.
ISI          0.26952  Naive sum of scores from each classifier with SIFT+FV, LBP+FV, GIST+FV, and CSIFT+FV, respectively.
OXFORD VGG   0.26979  Mixed selection from High-Level SVM scores and Baseline Scores, decision is performed by looking at the validation performance
image-net.org/challenges/LSVRC/2012/results.html
2/42
cortex (V1)
• V1 is the first area of the brain that begins processing visual input
• V1 contains many simple cells and many complex cells
• Simple cells:
  • Simple cells respond to line segments of a specific orientation at an exact position
  • The convolution layer is designed to emulate properties of simple cells
• Complex cells:
  • Complex cells are similar to simple cells, but they are invariant to small shifts in the position of the input
  • The pooling layer is inspired by complex cells
3/42
are assumed to be array structures
• Continuous convolution ($t, a \in \mathbb{R}$):
  $s(t) = \int x(a)\, w(t - a)\, da$  (9.1)
• Discrete convolution ($t, a \in \mathbb{Z}$):
  $s(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a)$  (9.3)
• A discrete function can be represented as an array
• Usually, we assume that these signals are zero everywhere outside the array
  t:     0    1    2    ···  N − 1
  x(t):  14   220  128  ···  96  68
6/42
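The discrete sum in (9.3) becomes a finite loop once both signals are stored as arrays that are zero outside their index range. The following is a minimal sketch under that assumption; the function name `conv1d`, the short signal (which reuses the sample values from the slide), and the kernel `w` are illustrative choices, not part of the original slides.

```python
def conv1d(x, w):
    """Discrete convolution s(t) = sum_a x(a) * w(t - a).

    Both x and w are treated as zero outside their stored indices,
    so only len(x) + len(w) - 1 output positions can be nonzero.
    """
    n_out = len(x) + len(w) - 1
    s = [0] * n_out
    for t in range(n_out):
        for a in range(len(x)):
            # w(t - a) contributes only when t - a is a stored index of w.
            if 0 <= t - a < len(w):
                s[t] += x[a] * w[t - a]
    return s


x = [14, 220, 128, 96, 68]  # illustrative signal built from the slide's sample values
w = [1, 0, -1]              # hypothetical kernel chosen for this example
s = conv1d(x, w)
```

Swapping the arguments, `conv1d(w, x)`, gives the same result, which is the commutative property of convolution.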
a ∈ Z):
  $s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a)$  (9.3)
• 2D convolution ($i, j, m, n \in \mathbb{Z}$):
  $S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n)\, K(i - m, j - n)$  (9.4)
  $= (K * I)(i, j) = \sum_m \sum_n I(i - m, j - n)\, K(m, n)$  (9.5)
• Generally, the range of indices over which K is nonzero is smaller than that of I
• (9.5) is more straightforward to implement in an ML library, because there is less variation in the range of valid values of m and n
7/42
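The step from (9.4) to (9.5) is a change of summation variables. A short sketch of that step, assuming both sums run over all integers so that re-indexing does not change the set of terms (the primed indices are introduced here only for the derivation):

```latex
\begin{align*}
(I * K)(i, j)
  &= \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n)
     && \text{(9.4)} \\
  &= \sum_{m'} \sum_{n'} I(i - m', j - n')\, K(m', n')
     && \text{substituting } m' = i - m,\ n' = j - n \\
  &= (K * I)(i, j)
     && \text{(9.5), after renaming } m' \to m,\ n' \to n
\end{align*}
```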
of the convolution
• The only reason to flip the kernel is to obtain the commutative property:
  $(I * K)(i, j) = (K * I)(i, j)$
• Instead, many NN libraries implement the cross-correlation but call it convolution:
  $S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i + m, j + n)\, K(m, n)$  (9.6)
• [ML algorithm with kernel flipping] = [ML algorithm without kernel flipping] + [flip the learned kernel]
8/42
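The relationship between flipped convolution and cross-correlation (9.6) can be checked numerically. The sketch below is an illustration rather than any library's implementation: the function names, the 3×4 test image, and the 2×2 kernel are all assumptions made for this example, and only "valid" output positions (kernel fully inside the image) are computed.

```python
def cross_correlate2d(image, kernel):
    """S(i, j) = sum_m sum_n I(i + m, j + n) * K(m, n), valid positions only."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def convolve2d(image, kernel):
    """Flipped form: S(i, j) = sum_m sum_n I(i' - m, j' - n) * K(m, n).

    The offset (i', j') = (i + kh - 1, j + kw - 1) keeps every index
    inside the image, so the output covers the same valid positions.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + kh - 1 - m][j + kw - 1 - n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def flip(kernel):
    """Reverse the kernel along both axes."""
    return [row[::-1] for row in kernel[::-1]]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # hypothetical input
kernel = [[1, 0], [0, -1]]                             # hypothetical kernel

corr = cross_correlate2d(image, kernel)
conv = convolve2d(image, kernel)
```

Here `conv` equals `cross_correlate2d(image, flip(kernel))`, so a learning algorithm that uses cross-correlation simply ends up with a kernel that is flipped relative to the one a true convolution would learn.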
Convolution
Input:
  a b c d
  e f g h
  i j k l
Kernel:
  w x
  y z
Output:
  aw + bx + ey + fz   bw + cx + fy + gz   cw + dx + gy + hz
  ew + fx + iy + jz   fw + gx + jy + kz   gw + hx + ky + lz
Figure 9.1: An example of 2-D convolution without kernel-flipping. In this case we restrict the output to only positions where the kernel lies entirely within the image, called “valid” convolution in some contexts. We draw boxes with arrows to indicate how the upper-left
9/42
interested in mathematical properties of convolution and use its result only
• Many machine learning libraries implement cross-correlation but call it convolution
• Discrete convolution can be viewed as multiplication by a sparse matrix
12/42
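The sparse-matrix view can be made concrete in one dimension. The sketch below, which uses the cross-correlation form and "valid" output positions, builds a matrix whose rows each contain the same kernel weights shifted by one column. The names `conv_matrix`, `matvec`, and `cross_correlate1d`, and the sample values, are assumptions made for illustration.

```python
def conv_matrix(w, n_in):
    """Matrix M such that M @ x equals the valid cross-correlation of x with w.

    Row r holds the kernel weights starting at column r; every other
    entry is zero, so the matrix is sparse and its weights are tied.
    """
    k = len(w)
    n_out = n_in - k + 1
    return [[w[c - r] if 0 <= c - r < k else 0 for c in range(n_in)]
            for r in range(n_out)]


def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def cross_correlate1d(x, w):
    """Direct sliding-window computation, for comparison."""
    k = len(w)
    return [sum(x[i + a] * w[a] for a in range(k))
            for i in range(len(x) - k + 1)]


x = [1, 4, 2, 5, 3]   # hypothetical input
w = [1, 2, -1]        # hypothetical kernel
M = conv_matrix(w, len(x))
nonzero = sum(1 for row in M for entry in row if entry != 0)
```

For this example `M` has 15 entries but only 9 are nonzero, and `matvec(M, x)` matches `cross_correlate1d(x, w)`.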
while keeping k several orders of magnitude smaller than m
• If there are m inputs and n outputs, matrix multiplication requires O(m × n) runtime
• If we limit the number of connections each output may have to k, the sparsely connected approach requires O(k × n) runtime
16/42
be indirectly connected to all or most of the inputs
Growing Receptive Fields
[Figure 9.4: three layers of units, inputs x1 to x5, hidden units h1 to h5, and deeper units g1 to g5]
The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers. This effect … includes architectural features like strided convolution (figure 9.12)
Figure 9.4 (Goodfellow, 2016)
17/42
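How the receptive field grows with depth can be computed with a short recurrence. This is a sketch for 1-D layers under the usual assumption that each layer's kernel spans `k` adjacent units of the layer below; the function name `receptive_field` and the example kernel widths and strides are illustrative, not taken from the slides.

```python
def receptive_field(kernel_widths, strides=None):
    """Number of input units that can affect one unit in the last layer.

    rf grows by (k - 1) * jump per layer, where jump is the spacing,
    measured in input units, between neighbouring units of the layer below.
    """
    if strides is None:
        strides = [1] * len(kernel_widths)
    rf, jump = 1, 1
    for k, s in zip(kernel_widths, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


one_layer = receptive_field([3])               # a unit in the first hidden layer
two_layers = receptive_field([3, 3])           # a unit one layer deeper
with_stride = receptive_field([3, 3], [2, 1])  # first layer uses stride 2
```

With width-3 kernels, a unit in the first hidden layer sees 3 inputs and a unit in the next layer sees 5; a stride of 2 in the first layer widens this to 7.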
than one function in a model
Parameter Sharing
[Figure 9.5: two diagrams, each connecting inputs x1 to x5 to outputs s1 to s5]
Figure 9.5: Parameter sharing: Black arrows indicate the connections that use a particular parameter in two different models. (Top) The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model. Due to parameter sharing, this single parameter is used at all input locations. (Bottom) The single black arrow indicates the use of the central element of the weight matrix in a fully connected model.
• Convolution shares the same parameters across all spatial locations
• Traditional matrix multiplication does not share any parameters
Figure 9.5 (Goodfellow, 2016)
19/42
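The effect of sparse connectivity and parameter sharing on storage and computation can be tallied directly. The sketch below is a rough count that ignores biases and edge effects; the function name `layer_costs` and the dictionary layout are assumptions made for this example, with m inputs, n outputs, and k connections per output.

```python
def layer_costs(m, n, k):
    """Approximate parameter and multiplication counts for one layer.

    fully_connected: every output has its own weight for every input.
    sparse_unshared: every output has its own k weights.
    convolution:     all outputs reuse the same k weights.
    """
    return {
        "fully_connected": {"parameters": m * n, "multiplications": m * n},
        "sparse_unshared": {"parameters": k * n, "multiplications": k * n},
        "convolution": {"parameters": k, "multiplications": k * n},
    }


# Sizes matching the 5-input, 5-output, 3-element-kernel diagram.
costs = layer_costs(m=5, n=5, k=3)
```

Parameter sharing leaves the number of multiplications at roughly k × n but reduces the stored parameters from k × n to k.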
of an operation in an image
Edge Detection by Convolution
[Figure 9.6 panels: Input, Kernel, Output]
Figure 9.6: Efficiency of edge detection. The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all of the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall. The input image is 320 pixels wide while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319 × 280 × 3 = 267,960 floating point operations (two multiplications and one addition per output pixel) to compute using convolution. To describe the same transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over eight billion, entries in the matrix, making convolution four billion times more efficient for representing this transformation. The straightforward matrix multiplication algorithm performs over sixteen billion floating point operations, making convolution roughly 60,000 times more efficient computationally. Of course, most of the entries of the matrix would be zero. If we stored only the nonzero entries of the matrix, then both matrix multiplication
Figure 9.6 (Goodfellow, 2016)
20/42
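The figures quoted in the caption can be reproduced with a few lines of arithmetic, together with a tiny version of the edge detector itself. This is a sketch: the function name `horizontal_difference` and the 2×4 test image are illustrative, and the dense-matrix cost assumes one multiplication and one addition per matrix entry.

```python
def horizontal_difference(image):
    """Each output pixel is a pixel minus its left neighbour.

    This corresponds to sliding a two-element kernel along each row,
    so every output row is one pixel narrower than the input row.
    """
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)]
            for row in image]


height, in_width = 280, 320
out_width = in_width - 1

# Two multiplications and one addition per output pixel.
conv_flops = out_width * height * 3

# A dense matrix maps all input pixels to all output pixels.
matrix_entries = (in_width * height) * (out_width * height)
dense_flops = 2 * matrix_entries  # assumed: one multiply and one add per entry

storage_ratio = matrix_entries / 2        # matrix entries vs. two kernel elements
compute_ratio = dense_flops / conv_flops  # dense matrix vs. convolution

edges = horizontal_difference([[1, 1, 5, 5], [2, 2, 2, 9]])
```

The counts come out to 267,960 operations for convolution and 8,003,072,000 matrix entries, so the storage ratio is about four billion and the compute ratio is a little under 60,000.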
may not wish to share parameters across the entire image
[Figure annotations: “We want to look for eyebrows”, “We want to look for a chin”]
• Convolution is not equivariant to some transformations such as:
  [Figure panels: Original, Scaling, Rotation]
26/42
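The contrast between translation and rotation can be checked on small arrays. The sketch below uses valid cross-correlation with a hypothetical kernel that responds to left-to-right intensity increases; all function names and test images are assumptions made for this example. The translated image has a zero border column so that no content falls off the edge when it is shifted.

```python
def cross_correlate2d(image, kernel):
    """Valid cross-correlation of two 2-D lists."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]


def shift_right(grid):
    """Translate every row one step to the right, filling with zeros."""
    return [[0] + row[:-1] for row in grid]


def rotate90(grid):
    """Rotate a 2-D list by 90 degrees clockwise."""
    return [list(col) for col in zip(*grid[::-1])]


kernel = [[-1, 1], [-1, 1]]  # hypothetical vertical-edge detector

# Translation: shifting the input shifts the output in the same way.
image = [[0, 1, 2, 0], [0, 3, 4, 0], [0, 5, 6, 0]]
translation_equivariant = (
    cross_correlate2d(shift_right(image), kernel)
    == shift_right(cross_correlate2d(image, kernel))
)

# Rotation: rotating the input does not simply rotate the output.
square = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
rotation_equivariant = (
    cross_correlate2d(rotate90(square), kernel)
    == rotate90(cross_correlate2d(square, kernel))
)
```

In this example the translation check holds while the rotation check fails: the rotated image turns the left-to-right gradient into a top-to-bottom one, which this kernel responds to differently.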
improve a machine learning system
• Sparse Interactions and Parameter Sharing dramatically decrease memory requirements and computational costs
• Equivariant Representations generate a 2-D feature map showing where certain features appear in the input
• In some cases, convolution is unsuitable
27/42
detector and pooling
Convolutional Network Components
Complex layer terminology:
  Input to layer → Convolutional layer [Convolution stage: affine transform → Detector stage: nonlinearity, e.g., rectified linear → Pooling stage] → Next layer
Simple layer terminology:
  Input to layers → Convolution layer: affine transform → Detector layer: nonlinearity, e.g., rectified linear → Pooling layer → Next layer
Figure 9.7: The components of a typical convolutional neural network layer. There are two commonly used sets of terminology for describing these layers. (Left) In this terminology, the convolutional net is viewed as a small number of relatively complex layers, with
Figure 9.7 (Goodfellow …)
28/42
translations of the input
Max Pooling and Invariance to Translation
(Top)
  POOLING STAGE:   ...  1.   1.   1.   0.2  ...
  DETECTOR STAGE:  ...  0.1  1.   0.2  0.1  ...
(Bottom)
  POOLING STAGE:   ...  0.3  1.   1.   1.   ...
  DETECTOR STAGE:  ...  0.3  0.1  1.   0.2  ...
Figure 9.8: Max pooling introduces invariance. (Top) A view of the middle of the output of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top row shows the outputs of max pooling, with a stride of one pixel between pooling regions and a pooling region width of three pixels. (Bottom) A view of the same network, after the input has been shifted to the right by one pixel. Every value in the bottom row has changed, but only half of the values in the top row have changed, because the max pooling
Figure 9.8 (Goodfellow, 2016)
30/42
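The numbers in Figure 9.8 can be reproduced with a small max-pooling routine of width three and stride one. This is a sketch: the function name `max_pool_same` is hypothetical, and clipping the pooling window at the array edges is an assumption made here because the figure shows only the middle of a longer row.

```python
def max_pool_same(values, width=3):
    """Max pooling with stride 1; windows are centred on each position
    and clipped at the array boundaries (an assumption of this sketch)."""
    half = width // 2
    return [max(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


detector = [0.1, 1.0, 0.2, 0.1]  # detector-stage values before the shift
shifted = [0.3, 0.1, 1.0, 0.2]   # same row after shifting right by one pixel

pooled = max_pool_same(detector)
pooled_shifted = max_pool_same(shifted)

detector_changes = sum(a != b for a, b in zip(detector, shifted))
pooled_changes = sum(a != b for a, b in zip(pooled, pooled_shifted))
```

All four detector values change after the shift, but only two of the four pooled values do, which is the partial invariance the caption describes.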
lutions, the features can learn which transformations to become invariant to
Cross-Channel Pooling and Invariance to Learned Transformations
[Figure 9.9 panels: “Large response in pooling unit” with “Large response in detector unit 1”; “Large response in pooling unit” with “Large response in detector unit 3”]
Figure 9.9: Example of learned invariances: A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input. Here we show how a set of three learned filters and a max pooling unit can learn to become invariant to rotation. All three filters are intended to detect a hand-written 5.
Figure 9.9 (Goodfellow, 2016)
31/42
the network
Pooling with Downsampling
This principle is leveraged by maxout networks (Goodfellow …) and other convolutional networks. Max pooling over spatial positions is naturally invariant to translation; this multi-channel approach is only necessary for learning other transformations.
[Figure 9.10 values]
  Pooled outputs:  1.   0.2  0.1
  Detector units:  0.1  1.   0.2  0.1  0.0  0.1
Figure 9.10: Pooling with downsampling. Here we use max-pooling with a pool width of … and a stride between pools of two. This reduces the representation size by a factor of two, which reduces the computational and statistical burden on the next layer. … the rightmost pooling region has a smaller size, but must be included if we do not want to ignore some of the detector units.
Figure 9.10 (Goodfellow, 2016)
32/42
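Pooling with a stride larger than one can be sketched in a few lines. The stride of two comes from the caption; the pool width of three is an assumption made here (the caption's value is cut off in this transcript, but three is consistent with the figure's numbers and with the rightmost region being smaller). The function name `max_pool` is hypothetical.

```python
def max_pool(values, width, stride):
    """Max over windows of the given width, starting every `stride` units.

    The last window is allowed to be shorter than `width`, so that no
    detector unit at the right edge is ignored.
    """
    return [max(values[start:start + width])
            for start in range(0, len(values), stride)]


detector = [0.1, 1.0, 0.2, 0.1, 0.0, 0.1]       # detector units from the figure
pooled = max_pool(detector, width=3, stride=2)  # width 3 is an assumption
```

Six detector units are summarized by three pooled values, so the next layer receives half as many inputs.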
varying size
Example Classification Architectures
Architecture 1:
  Input image: 256x256x3
  Output of convolution + ReLU: 256x256x64
  Output of pooling with stride 4: 64x64x64
  Output of convolution + ReLU: 64x64x64
  Output of pooling with stride 4: 16x16x64
  Output of reshape to vector: 16,384 units
  Output of matrix multiply: 1,000 units
  Output of softmax: 1,000 class probabilities
Architecture 2:
  Input image: 256x256x3
  Output of convolution + ReLU: 256x256x64
  Output of pooling with stride 4: 64x64x64
  Output of convolution + ReLU: 64x64x64
  Output of pooling to 3x3 grid: 3x3x64
  Output of reshape to vector: 576 units
  Output of matrix multiply: 1,000 units
  Output of softmax: 1,000 class probabilities
Architecture 3:
  Input image: 256x256x3
  Output of convolution + ReLU: 256x256x64
  Output of pooling with stride 4: 64x64x64
  Output of convolution + ReLU: 64x64x64
  Output of pooling with stride 4: 16x16x64
  Output of convolution: 16x16x1,000
  Output of average pooling: 1x1x1,000
  Output of softmax: 1,000 class probabilities
Figure 9.11: Examples of architectures for classification with convolutional networks. The
Figure 9.11 (Goodfellow, 2016)
33/42
• “Output of pooling with stride 4”:
  • The output size depends on the input size
• “Output of pooling to 3x3 grid”:
  • The output size is fixed
Figure 9.11 (Goodfellow, 2016)
33/42
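The difference between the two pooling styles shows up when the layer shapes are traced for inputs of different sizes. The sketch below assumes that convolution + ReLU preserves the spatial size (as in the figure) and that stride-4 pooling divides height and width by 4 with floor division; all function names are hypothetical.

```python
def pool_with_stride(shape, stride):
    """Pooling with a fixed stride: output size scales with input size."""
    height, width, channels = shape
    return (height // stride, width // stride, channels)


def pool_to_grid(shape, grid):
    """Pooling to a fixed grid: output size is independent of input size."""
    _, _, channels = shape
    return (grid, grid, channels)


def flattened_size(shape):
    height, width, channels = shape
    return height * width * channels


def stride_features(height, width, channels=64):
    """Vector length after two stride-4 poolings."""
    shape = pool_with_stride((height, width, channels), 4)
    shape = pool_with_stride(shape, 4)
    return flattened_size(shape)


def grid_features(height, width, channels=64):
    """Vector length after one stride-4 pooling and pooling to a 3x3 grid."""
    shape = pool_with_stride((height, width, channels), 4)
    shape = pool_to_grid(shape, 3)
    return flattened_size(shape)
```

For a 256x256 input these give 16,384 and 576 units, matching the figure. For a 512x512 input, `stride_features` grows to 65,536 units, which would no longer fit a fixed-size matrix multiply, while `grid_features` stays at 576.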
theoretical work on pooling
• It gives guidance as to which kinds of pooling one should use in various situations
• [Boureau et al., 2011]
  • Standard pooling regions are spatially local neighborhoods, but they are not local in the feature space
  • It is possible to pool features dynamically, based on the locations of interesting features in the feature space
• [Jia et al., 2012]
  • Commonly, pooling uses manually defined parameters
  • In this work, the pooling parameters are learned from data
34/42
of the net as a summary statistic
• Pooling helps make the representation invariant to small translations of the input
• Pooling handles inputs of varying size
35/42
probability on some parameters and says that these parameter values are completely forbidden, regardless of how much support the data give to those values. 38/42
infinitely strong prior but implementation of it is extremely wasteful computationally
[Figure: inputs x1 to x5 connected to outputs s1 to s5]
• The infinitely strong prior says ...
  • “The weights for one hidden unit must be identical to the weights of its neighbor but shifted in space.”
  • “The prior also says that the weights must be zero, except for in the small, spatially contiguous receptive field assigned to that hidden unit.”
  • “In the use of pooling, each unit should be invariant to small translations.”
39/42
into how CNNs work
• Convolution and pooling can cause underfitting
  • They are only useful when the assumptions made by the prior are reasonably accurate
• We should only compare convolutional models to other convolutional models in benchmarks of statistical learning performance
  • Convolutional models already have hard-coded spatial knowledge
  • Models without convolution would be able to learn even if we permuted all the pixels in the image (permutation invariant)
40/42
on some parameters and says that these parameter values are completely forbidden.
• We can interpret a CNN as a fully connected NN with an infinitely strong prior and get some insights into how CNNs work
  • Convolution and pooling can cause underfitting
  • Comparing convolutional models with models without convolution is unfair
41/42
biologically inspired AI
• In the context of NNs, cross-correlation is used instead of convolution
• Convolution leverages three important ideas that can help improve an ML system:
  • Sparse Interactions, Parameter Sharing, Equivariant Representations
• Pooling summarizes the output, introduces invariance to small translations, and handles inputs of varying size
• We can interpret a CNN as a fully connected NN with an infinitely strong prior and get some insights:
  • Convolution and pooling can cause underfitting
  • Comparing convolutional models with models without convolution is unfair
42/42