Convolutional Networks (Deep Learning, 9.1-9.4)

Convolutional Networks (Deep Learning §9–9.4) Satoshi Murashige Machine Learning Reading
Club (July 9, 2018) Mathematical Informatics Lab., NAIST

Today’s agenda
• Introduction (9.10, 9.11)
• 9.1 The Convolution Operation
• 9.2 Motivation
• 9.3 Pooling
• 9.4 Convolution and Pooling as an Infinitely Strong Prior
1/42

Introduction (9.10, 9.11)

Convolutional Neural Networks (CNNs) won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012
image-net.org/challenges/LSVRC/2012/browse-synsets
Team name | Error | Description
SuperVision | 0.15315 | Using extra training data from ImageNet Fall 2011 release
SuperVision | 0.16422 | Using only supplied training data
ISI | 0.26172 | Weighted sum of scores from each classifier with SIFT+FV, LBP+FV, GIST+FV, and CSIFT+FV, respectively.
ISI | 0.26602 | Weighted sum of scores from classifiers using each FV.
ISI | 0.26646 | Naive sum of scores from classifiers using each FV.
ISI | 0.26952 | Naive sum of scores from each classifier with SIFT+FV, LBP+FV, GIST+FV, and CSIFT+FV, respectively.
OXFORD VGG | 0.26979 | Mixed selection from High-Level SVM scores and Baseline Scores, decision is performed by looking at the validation performance
Source: image-net.org/challenges/LSVRC/2012/results.html
2/42

CNNs are designed to capture three properties of the primary visual cortex (V1)
• V1 is the first area of the brain that begins to process visual input
• V1 contains many simple cells and many complex cells
• Simple cells:
  • Simple cells respond to line segments with a specific orientation at a specific position
  • The convolution layer is designed to emulate the properties of simple cells
• Complex cells:
  • Complex cells are similar to simple cells, but they are invariant to small shifts in the position of the input
  • The pooling layer is inspired by complex cells
3/42

9.1 The Convolution Operation

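Definition and Property of Convolution
• Convolution is defined as:
$$ s(t) = \int x(a)\, w(t - a)\, da \quad (9.1) $$
$$ s(t) = (x * w)(t) \quad (9.2) $$
• Convolution is commutative: $(x * w)(t) = (w * x)(t)$
• In the context of CNNs, $x$ is the input, $w$ is the kernel, and $s$ is the feature map:
$$ \underbrace{s(t)}_{\text{feature map}} = \int \underbrace{x(a)}_{\text{input}}\, \underbrace{w(t - a)}_{\text{kernel}}\, da \quad (9.1) $$
4/42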

One Interpretation of Convolution: flipping, multiplication and summation
$$ s(t) = \int x(a)\, w(t - a)\, da \quad (9.1) $$
$$ s(t) = \int x(a)\, w(-(a - t))\, da $$
• $w(-(a - t))$ is $w(a)$ flipped about the vertical axis and then shifted by $t$ in the positive direction
[Figure: plots of x(a) and w(a), followed by x(a) overlaid with w(−a) at t = 0, with w(−(a − 2)) at t = 2, and with w(−(a − 4)) at t = 4]
5/42

When we work with data on a computer, functions are usually represented as arrays
• Continuous convolution ($t, a \in \mathbb{R}$):
$$ s(t) = \int x(a)\, w(t - a)\, da \quad (9.1) $$
• Discrete convolution ($t, a \in \mathbb{Z}$):
$$ s(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a) \quad (9.3) $$
• A discrete function can be represented as an array
• We usually assume that the signal is zero everywhere outside the stored array
• Example array x(t) for t = 0, 1, 2, ..., N − 1: 14, 220, 128, ..., 96, 68
6/42
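As a minimal sketch of equation (9.3) on finite arrays, the function below treats both signals as zero outside their stored range. The input uses only the visible entries of the slide's example array (the elided middle is simply left out), and the kernel values are hypothetical.

```python
def conv1d(x, w):
    """Full discrete convolution s(t) = sum_a x(a) * w(t - a).

    x and w are finite lists, treated as zero outside their index range.
    """
    n_out = len(x) + len(w) - 1
    s = [0] * n_out
    for t in range(n_out):
        for a in range(len(x)):
            k = t - a  # index into w
            if 0 <= k < len(w):
                s[t] += x[a] * w[k]
    return s


x = [14, 220, 128, 96, 68]  # visible entries of the slide's array
w = [1, 0, -1]              # hypothetical kernel

s_xw = conv1d(x, w)
s_wx = conv1d(w, x)         # swapping the arguments gives the same result
print(s_xw)                 # [14, 220, 114, -124, -60, -96, -68]
```

Swapping the arguments leaves the result unchanged, which is the commutative property stated for (9.2).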

Convolution for image data (2D signal)
• 1D convolution ($t, a \in \mathbb{Z}$):
$$ s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a) \quad (9.3) $$
• 2D convolution ($i, j, m, n \in \mathbb{Z}$):
$$ S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n)\, K(i - m, j - n) \quad (9.4) $$
$$ S(i, j) = (K * I)(i, j) = \sum_m \sum_n I(i - m, j - n)\, K(m, n) \quad (9.5) $$
• The kernel K is generally much smaller than the image I, so m and n range over fewer valid values in (9.5)
• (9.5) is therefore more straightforward to implement in an ML library
7/42

In the context of NNs, cross-correlation is used instead of convolution
• The only reason to flip the kernel is to obtain the commutative property: $(I * K)(i, j) = (K * I)(i, j)$
• Instead, many NN libraries implement cross-correlation but call it convolution:
$$ S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i + m, j + n)\, K(m, n) \quad (9.6) $$
• An ML algorithm with kernel flipping is equivalent to an ML algorithm without kernel flipping followed by flipping the learned kernel
8/42

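The relationship between (9.5) and (9.6) can be sketched in a few lines. This is an illustrative pure-Python version restricted to "valid" output positions, not how any particular library implements it; the image and kernel values are hypothetical.

```python
def cross_correlate2d(image, kernel):
    """'Valid' cross-correlation, eq. (9.6): S(i, j) = sum_m sum_n I(i+m, j+n) K(m, n)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            row.append(sum(image[i + m][j + n] * kernel[m][n]
                           for m in range(kh) for n in range(kw)))
        out.append(row)
    return out


def flip2d(kernel):
    """Flip a kernel along both axes."""
    return [row[::-1] for row in kernel[::-1]]


def convolve2d(image, kernel):
    """'Valid' true convolution: cross-correlation with the flipped kernel."""
    return cross_correlate2d(image, flip2d(kernel))


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]   # hypothetical 3 x 4 input
kernel = [[1, 0],
          [0, -1]]          # hypothetical 2 x 2 kernel

corr = cross_correlate2d(image, kernel)  # [[-5, -5, -5], [-5, -5, -5]]
conv = convolve2d(image, kernel)         # [[5, 5, 5], [5, 5, 5]]
```

The two operations differ only in whether the kernel is flipped, so a network trained with one simply learns the flipped version of the kernel it would learn with the other.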

An example of 2-D convolution without kernel flipping (Goodfellow, 2016)
Input (3 × 4):
a b c d
e f g h
i j k l
Kernel (2 × 2):
w x
y z
Output (2 × 3):
aw + bx + ey + fz | bw + cx + fy + gz | cw + dx + gy + hz
ew + fx + iy + jz | fw + gx + jy + kz | gw + hx + ky + lz
Figure 9.1: An example of 2-D convolution without kernel-flipping. In this case we restrict the output to only positions where the kernel lies entirely within the image, called “valid” convolution in some contexts. We draw boxes with arrows to indicate how the upper-left ...
9/42

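Feed-forward neural networks can be viewed as multiplication by a matrix
[Figure: a network with inputs x1, ..., x4 and units u1, u2, u3 producing outputs z1, z2, z3]
• $w_{ji}$: weight between input $x_i$ and unit $u_j$
• Outputs of the network: $z = f(u)$, $u = Wx$
$$ \begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \\ w_{31} & w_{32} & w_{33} & w_{34} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} $$
10/42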

Discrete convolution can be viewed as multiplication by a sparse matrix
$$ u = Wx $$
$$ \begin{pmatrix} u(0) \\ u(1) \\ u(2) \\ u(3) \\ u(4) \\ u(5) \end{pmatrix} = \begin{pmatrix} w_0 & w_1 & w_2 & 0 & 0 & 0 \\ 0 & w_0 & w_1 & w_2 & 0 & 0 \\ 0 & 0 & w_0 & w_1 & w_2 & 0 \\ 0 & 0 & 0 & w_0 & w_1 & w_2 \\ 0 & 0 & 0 & 0 & w_0 & w_1 \\ 0 & 0 & 0 & 0 & 0 & w_0 \end{pmatrix} \begin{pmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \end{pmatrix} $$
11/42
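To make the matrix view concrete, the sketch below builds the banded matrix from the slide for a 3-element kernel and checks that multiplying by it matches a direct sliding sum. The kernel and input values are hypothetical, and the helper names are chosen only for this example.

```python
def conv_matrix(w, n):
    """Build the n x n banded matrix W with W[i][i + k] = w[k]."""
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        for k, wk in enumerate(w):
            if i + k < n:
                W[i][i + k] = wk
    return W


def matvec(W, x):
    """Plain matrix-vector product."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def sliding_sum(w, x):
    """u(i) = sum_k w[k] * x(i + k), with x treated as zero past its end."""
    n = len(x)
    return [sum(wk * x[i + k] for k, wk in enumerate(w) if i + k < n)
            for i in range(n)]


w = [1, 2, 3]            # hypothetical values for w0, w1, w2
x = [1, 0, 2, 0, 3, 0]   # hypothetical input x(0), ..., x(5)

W = conv_matrix(w, len(x))
u_matrix = matvec(W, x)
u_direct = sliding_sum(w, x)
nonzeros = sum(1 for row in W for v in row if v != 0)  # 15 of 36 entries
```

Only 15 of the 36 entries are nonzero, and they are copies of just three distinct values, which is what the sparse-interaction and parameter-sharing arguments build on.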

Summary (9.1)
• In the context of CNNs, we use only the result of convolution and are not concerned with its mathematical properties
• Many machine learning libraries implement cross-correlation but call it convolution
• Discrete convolution can be viewed as multiplication by a sparse matrix
12/42

9.2 Motivation

Motivation: Convolution leverages three important ideas that can help improve a machine learning system
• Sparse Interactions
• Parameter Sharing
• Equivariant Representations
13/42

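In a feed-forward neural network, each output depends on every input
[Figure: a network with inputs x1, ..., x4 and units u1, u2, u3 producing outputs z1, z2, z3]
• $w_{ji}$: weight between input $x_i$ and unit $u_j$
• Outputs of the network:
$$ z_1 = f(w_{11} x_1 + w_{12} x_2 + w_{13} x_3 + w_{14} x_4) $$
$$ z_2 = f(w_{21} x_1 + w_{22} x_2 + w_{23} x_3 + w_{24} x_4) $$
$$ z_3 = f(w_{31} x_1 + w_{32} x_2 + w_{33} x_3 + w_{34} x_4) $$
14/42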

Sparse interactions reduce the memory requirements of the model and improve its statistical efficiency
[Figure: two networks with inputs x1, ..., x5 and outputs s1, ..., s5, one labeled "Convolutional Neural Network" and one labeled "Fully Connected Neural Network"]
15/42

In many practical applications, an ML task can obtain good performance while keeping k several orders of magnitude smaller than m
• If there are m inputs and n outputs, matrix multiplication requires m × n parameters and O(m × n) runtime
• If we limit the number of connections each output may have to k, the algorithm requires only k × n parameters and O(k × n) runtime
16/42

Although convolution layers are very sparse, the deeper layers can be indirectly connected to all or most of the inputs (Goodfellow 2016)
[Figure 9.4: Growing receptive fields. Three layers of units x1, ..., x5, h1, ..., h5 and g1, ..., g5. The receptive field of the units in the deeper layers is larger than the receptive field of the units in the shallow layers.]
Figure 9.4 (Goodfellow, 2016)
17/42
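The growth of the receptive field can be traced by following connections backwards through the layers. The sketch below assumes stride 1, an odd kernel width, zero-based unit indices and five units per layer (matching the x, h, g layers in the figure); the function name is made up for this example.

```python
def receptive_field(unit, depth, kernel_width=3, n_units=5):
    """Return the input indices that can influence `unit` after `depth` conv layers.

    Assumes stride 1, an odd kernel width and zero-based indices.
    """
    half = kernel_width // 2
    field = {unit}
    for _ in range(depth):
        # Each unit reads from its neighbors within `half` positions in the layer below.
        field = {j
                 for i in field
                 for j in range(i - half, i + half + 1)
                 if 0 <= j < n_units}
    return sorted(field)


one_layer = receptive_field(unit=2, depth=1)   # [1, 2, 3]
two_layers = receptive_field(unit=2, depth=2)  # [0, 1, 2, 3, 4]
```

With a kernel of width 3, the middle unit sees three inputs after one layer and all five after two, even though every individual layer is sparsely connected.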

Parameter sharing refers to using the same parameter for more than one function in a model (Goodfellow 2016)
[Figure 9.5: Parameter sharing. Two networks with inputs x1, ..., x5 and outputs s1, ..., s5. (Top) Convolutional model: the black arrows mark the uses of one element of a 3-element kernel. (Bottom) Fully connected model: a single black arrow marks the use of the central element of the weight matrix.]
• Convolution shares the same parameters across all spatial locations
• Traditional matrix multiplication does not share any parameters
Figure 9.5 (Goodfellow, 2016)
19/42

Sparse connectivity and parameter sharing can dramatically improve the efficiency of an operation on an image (Goodfellow 2016)
[Figure 9.6: Edge detection by convolution, with Input, Kernel and Output panels]
Figure 9.6: Efficiency of edge detection. The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all of the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall. The input image is 320 pixels wide while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319 × 280 × 3 = 267,960 floating point operations (two multiplications and one addition per output pixel) to compute using convolution. To describe the same transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over eight billion, entries in the matrix, making convolution four billion times more efficient for representing this transformation. The straightforward matrix multiplication algorithm performs over sixteen billion floating point operations, making convolution roughly 60,000 times more efficient computationally. Of course, most of the entries of the matrix would be zero. If we stored only the nonzero entries of the matrix, then both matrix multiplication ...
Figure 9.6 (Goodfellow, 2016)
20/42

Sparse connectivity and parameter sharing can dramatically improve the efficiency of an operation on an image
• Input size: 320 × 280
• Kernel size: 2 × 1
• Output size: 319 × 280
Quantity | Convolution | Dense matrix | Sparse matrix
Stored floats | 2 | 320 × 280 × 319 × 280 ≃ 8.0 × 10^9 | 319 × 280 × 2 ≃ 1.8 × 10^5
Float muls and adds | 319 × 280 × 3 ≃ 2.7 × 10^5 | ≃ 1.6 × 10^10 | Same as convolution (≃ 2.7 × 10^5)
21/42
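The numbers for the edge-detection example can be checked with plain arithmetic. The sizes come from the slide; the dense operation count assumes one multiplication and one addition per matrix entry, which is an assumption made here to match the "over sixteen billion" figure.

```python
in_w, in_h = 320, 280    # input image size (width x height)
out_w, out_h = 319, 280  # output image size
kernel_elems = 2         # two-element edge-detection kernel

# Stored floats
conv_params = kernel_elems
dense_entries = (in_w * in_h) * (out_w * out_h)  # 8,003,072,000
sparse_entries = out_w * out_h * kernel_elems    # 178,640

# Floating point operations
conv_flops = out_w * out_h * 3    # two multiplications and one addition per output pixel
dense_flops = 2 * dense_entries   # assumption: one multiply and one add per entry

storage_ratio = dense_entries / conv_params  # about 4.0e9
compute_ratio = dense_flops / conv_flops     # about 6.0e4

print(conv_flops)            # 267960
print(round(compute_ratio))  # 59733
```

The ratios reproduce the "four billion times" and "roughly 60,000 times" statements in the caption of Figure 9.6.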

Equivariance between two functions f(x) and g(x)
• f is equivariant to g if $f(g(x)) = g(f(x))$
[Figure: commutative diagram. Starting from x, applying g(·) and then f(·) gives f(g(x)); applying f(·) and then g(·) gives g(f(x)); the two results are equal.]
23/42

Equivariance between convolution and shifting
• Let $I(x, y)$ be an image ($I : \mathbb{R}^2 \to \mathbb{R}$)
• $g$ shifts every pixel of the image one unit to the right: $g(I)(x, y) = I(x - 1, y)$
• $f$ convolves the image with a kernel $K$: $f(I)(x, y) = (I * K)(x, y)$
• Shifting and then convolving gives the same result as convolving and then shifting: $(g(I) * K)(x, y) = (I * K)(x - 1, y)$
24/42
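Equivariance to shifts can be checked numerically. The slide works with images on an unbounded domain; the sketch below instead uses a 1-D signal with circular (wrap-around) boundaries, an assumption made so that the equality holds exactly on a finite array. The signal and kernel values are hypothetical.

```python
def shift_right(x, steps=1):
    """g: circularly shift a 1-D signal to the right by `steps` positions."""
    steps %= len(x)
    return x[-steps:] + x[:-steps]


def circular_conv(x, w):
    """f: circular convolution s(t) = sum_k w(k) * x((t - k) mod N)."""
    n = len(x)
    return [sum(w[k] * x[(t - k) % n] for k in range(len(w)))
            for t in range(n)]


x = [0, 1, 3, 2, 0, 0]  # hypothetical signal
w = [1, -1]             # hypothetical kernel: difference with the left neighbor

conv_then_shift = shift_right(circular_conv(x, w))  # g(f(x))
shift_then_conv = circular_conv(shift_right(x), w)  # f(g(x))
print(conv_then_shift)  # [0, 0, 1, 2, -1, -2]
```

Both orders give the same feature map, shifted by one position, which is the property f(g(x)) = g(f(x)).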

Convolution with a kernel generates a 2-D feature map of where certain features appear in the input
[Figure: two examples of an image convolved (∗) with a kernel, each producing (=) a feature map]
25/42

Convolution is unsuited for use in some cases
• We may not wish to share parameters across the entire image
  • Example: in one part of the image we want to look for eyebrows, in another we want to look for a chin
• Convolution is not equivariant to some transformations, such as scaling and rotation
[Figure: original, scaled and rotated versions of an image]
26/42

Summary (9.2): Convolution leverages three important ideas that can help improve a machine learning system
• Sparse Interactions and Parameter Sharing dramatically decrease memory requirements and computational costs
• Equivariant Representations generate a 2-D feature map of where certain features appear in the input
• In some cases, convolution is unsuited
27/42

9.3 Pooling

A typical layer of a CNN consists of three stages: convolution, detector and pooling (Goodfellow 2016)
• Complex layer terminology (one convolutional layer containing three stages): Input to layer → Convolution stage: affine transform → Detector stage: nonlinearity, e.g., rectified linear → Pooling stage → Next layer
• Simple layer terminology (each step is its own layer): Input to layers → Convolution layer: affine transform → Detector layer: nonlinearity, e.g., rectified linear → Pooling layer → Next layer
Figure 9.7: The components of a typical convolutional neural network layer. There are two commonly used sets of terminology for describing these layers. (Left) In this terminology, the convolutional net is viewed as a small number of relatively complex layers, with ...
28/42

Pooling replaces the output of the net at a certain location with a summary statistic of the nearby outputs
Example: Max Pooling
Output of Detector Stage:
210 196 203 195 184
207 213 203 183 175
149 235 198 177 185
193 210 218 200 201
 70 134 208 222 199
Output of Pooling Stage (first three values): 235 235 203
29/42
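The max pooling example can be reproduced with a short function. The 3 × 3 window and stride of 1 are assumptions inferred from the first three pooled values on the slide (235, 235, 203); the slide itself does not state them.

```python
def max_pool2d(feature_map, size=3, stride=1):
    """Max pooling over size x size windows that lie entirely inside the map."""
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[i + m][j + n]
                 for m in range(size) for n in range(size))
             for j in range(0, w - size + 1, stride)]
            for i in range(0, h - size + 1, stride)]


detector_output = [
    [210, 196, 203, 195, 184],
    [207, 213, 203, 183, 175],
    [149, 235, 198, 177, 185],
    [193, 210, 218, 200, 201],
    [70, 134, 208, 222, 199],
]

pooled = max_pool2d(detector_output)
for row in pooled:
    print(row)
# [235, 235, 203]
# [235, 235, 218]
# [235, 235, 222]
```

The first row matches the values shown on the slide; the remaining rows follow from sliding the same window down the detector output.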

Pooling helps to make the representation approximately invariant to small translations of the input (Goodfellow 2016)
[Figure 9.8: Max pooling and invariance to translation. Two panels, each showing a detector stage row and a pooling stage row.]
Figure 9.8: Max pooling introduces invariance. (Top) A view of the middle of the output of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top row shows the outputs of max pooling, with a stride of one pixel between pooling regions and a pooling region width of three pixels. (Bottom) A view of the same network, after the input has been shifted to the right by one pixel. Every value in the bottom row has changed, but only half of the values in the top row have changed, because the max pooling ...
Figure 9.8 (Goodfellow, 2016)
30/42

If we pool over the outputs of separately parameterized convolutions, the features can learn which transformations to become invariant to (Goodfellow 2016)
[Figure 9.9: Cross-channel pooling and invariance to learned transformations. Two cases: a large response in detector unit 1 and a large response in detector unit 3 each produce a large response in the pooling unit.]
Figure 9.9: Example of learned invariances: A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input. Here we show how a set of three learned filters and a max pooling unit can learn to become invariant to rotation. All three filters are intended to detect a hand-written 5.
Figure 9.9 (Goodfellow, 2016)
31/42

Pooling with a step of k pixels improves the computational efficiency of the network (Goodfellow 2016)
[Figure 9.10: Pooling with downsampling. Max pooling with a stride of two between pools reduces the representation size, which reduces the computational and statistical burden on the next layer. The rightmost pooling region has a smaller size but must be included if we do not want to ignore some of the detector units.]
Figure 9.10 (Goodfellow, 2016)
32/42
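Pooling with a stride larger than one can be sketched in one dimension. The stride of two follows the figure caption; the pool width of three and the detector values are assumptions chosen for illustration, and the function expects a non-empty input with stride no larger than width so that every unit is covered.

```python
def max_pool1d(x, width=3, stride=2):
    """1-D max pooling; the last region may be shorter than `width`.

    Assumes x is non-empty and stride <= width, so no detector unit is skipped.
    """
    pooled = []
    start = 0
    while True:
        pooled.append(max(x[start:start + width]))
        if start + width >= len(x):  # this region reached the end of the input
            break
        start += stride
    return pooled


detector = [0.1, 1.0, 0.2, 0.1, 0.0, 0.1]  # illustrative detector outputs
pooled = max_pool1d(detector)
print(pooled)  # [1.0, 0.2, 0.1]
```

Six detector units are summarized by three pooled units, so the next layer has roughly half as many inputs to process. The final region covers only the last two units, which is the smaller rightmost region mentioned in the caption.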

For many tasks, pooling is essential for handling inputs of varying size (Goodfellow 2016)
[Figure 9.11: Example classification architectures]
• (Left) Input image: 256x256x3 → Output of convolution + ReLU: 256x256x64 → Output of pooling with stride 4: 64x64x64 → Output of convolution + ReLU: 64x64x64 → Output of pooling with stride 4: 16x16x64 → Output of reshape to vector: 16,384 units → Output of matrix multiply: 1,000 units → Output of softmax: 1,000 class probabilities
• (Center) Input image: 256x256x3 → Output of convolution + ReLU: 256x256x64 → Output of pooling with stride 4: 64x64x64 → Output of convolution + ReLU: 64x64x64 → Output of pooling to 3x3 grid: 3x3x64 → Output of reshape to vector: 576 units → Output of matrix multiply: 1,000 units → Output of softmax: 1,000 class probabilities
• (Right) Input image: 256x256x3 → Output of convolution + ReLU: 256x256x64 → Output of pooling with stride 4: 64x64x64 → Output of convolution + ReLU: 64x64x64 → Output of pooling with stride 4: 16x16x64 → Output of convolution: 16x16x1,000 → Output of average pooling: 1x1x1,000 → Output of softmax: 1,000 class probabilities
Figure 9.11: Examples of architectures for classification with convolutional networks. The ...
• “Output of pooling with stride 4”: the output size depends on the input size
• “Output of pooling to 3x3 grid”: the output size is fixed regardless of the input size
Figure 9.11 (Goodfellow, 2016)
33/42
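The difference between the two kinds of pooling can be sketched as follows. Splitting each axis into regions with floor and ceiling boundaries is one possible way to pool to a fixed grid and is an assumption of this example, as are the helper names and the synthetic feature maps.

```python
def strided_pool_output(size, stride=4):
    """Spatial size after pooling with a fixed stride (size assumed divisible by stride)."""
    return size // stride


def grid_regions(size, grid=3):
    """Split `size` positions into `grid` contiguous regions covering every position."""
    regions = []
    for g in range(grid):
        start = (g * size) // grid                 # floor of g * size / grid
        end = ((g + 1) * size + grid - 1) // grid  # ceiling of (g + 1) * size / grid
        regions.append((start, end))
    return regions


def pool_to_grid(feature_map, grid=3):
    """Max pool a 2-D map of any size (at least grid x grid) down to grid x grid."""
    rows = grid_regions(len(feature_map), grid)
    cols = grid_regions(len(feature_map[0]), grid)
    return [[max(feature_map[i][j] for i in range(r0, r1) for j in range(c0, c1))
             for (c0, c1) in cols]
            for (r0, r1) in rows]


def make_map(h, w):
    """Synthetic h x w feature map with increasing values."""
    return [[r * w + c for c in range(w)] for r in range(h)]


small = pool_to_grid(make_map(64, 64))  # 3 x 3 output
large = pool_to_grid(make_map(48, 80))  # also 3 x 3 output

print(strided_pool_output(64), strided_pool_output(80))  # 16 20
print(len(small), len(small[0]), len(large), len(large[0]))  # 3 3 3 3
```

With a fixed stride the output grows with the input, whereas pooling to a fixed grid always yields the same number of units, so a fully connected layer of fixed size can follow it.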

Research on pooling
• [Boureau et al., 2010]
  • Theoretical analysis of pooling
  • It gives guidance as to which kinds of pooling one should use in various situations
• [Boureau et al., 2011]
  • Pooling usually summarizes spatially local neighborhoods, which are not necessarily local in the feature space
  • It is possible to pool features dynamically, based on the locations of interesting features in the feature space
• [Jia et al., 2012]
  • Pooling commonly uses manually defined parameters
  • In this work, the pooling parameters are learned from data
34/42

Summary (9.3): Pooling is awesome
• Pooling summarizes the output of the net with a summary statistic of nearby outputs
• Pooling helps to make the representation approximately invariant to small translations of the input
• Pooling handles inputs of varying size
35/42

9.4 Convolution and Pooling as an Infinitely Strong Prior

A prior probability distribution encodes our beliefs about which models are reasonable, before we have seen any data
• W: a set of parameters
• D: a set of data
$$ \underbrace{p(W \mid D)}_{\text{posterior}} \propto \underbrace{p(D \mid W)}_{\text{likelihood}} \times \underbrace{p(W)}_{\text{prior}} $$
36/42

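Weak/Strong Prior Distributions
• Weak Prior Distribution (Large Variance, High Entropy)
• Strong Prior Distribution (Small Variance, Low Entropy)
[Figure: two plots of p(W) against W, a wide distribution for the weak prior and a narrow one for the strong prior]
37/42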

What’s an “Infinitely Strong Prior”?
An infinitely strong prior places zero probability on some parameters and says that these parameter values are completely forbidden, regardless of how much support the data give to those values.
38/42
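The effect of placing zero prior probability on a parameter value can be seen in a toy Bayesian update. Everything in this sketch is hypothetical: W is a coin's heads probability restricted to three candidate values, and the data are nine heads and one tail.

```python
def posterior(prior, likelihood):
    """Normalize prior * likelihood over a discrete set of parameter values."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


candidates = [0.2, 0.5, 0.8]  # hypothetical values of W (probability of heads)
heads, tails = 9, 1           # hypothetical data D
likelihood = [w ** heads * (1 - w) ** tails for w in candidates]

weak_prior = [1 / 3, 1 / 3, 1 / 3]
infinitely_strong_prior = [0.5, 0.5, 0.0]  # W = 0.8 is forbidden

post_weak = posterior(weak_prior, likelihood)
post_strong = posterior(infinitely_strong_prior, likelihood)

best_weak = candidates[post_weak.index(max(post_weak))]        # 0.8
best_strong = candidates[post_strong.index(max(post_strong))]  # 0.5
```

The data strongly favor W = 0.8, and the weak prior follows them. Under the infinitely strong prior the posterior for W = 0.8 stays exactly zero no matter how much data is collected.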

We can interpret a CNN as a fully connected NN with an infinitely strong prior, but implementing it that way is extremely wasteful computationally
[Figure: network with inputs x1, ..., x5 and outputs s1, ..., s5]
• The infinitely strong prior says ...
  • “The weights for one hidden unit must be identical to the weights of its neighbor but shifted in space.”
  • “The prior also says that the weights must be zero, except for in the small, spatially contiguous receptive field assigned to that hidden unit.”
  • “In the use of pooling, each unit should be invariant to small translations.”
39/42

From this reinterpretation of CNNs, we can get some insights into how CNNs work
• Convolution and pooling can cause underfitting
  • They are only useful when the assumptions made by the prior are reasonably accurate
• We should only compare convolutional models to other convolutional models in benchmarks of statistical learning performance
  • Convolutional models already have hard-coded knowledge of spatial relationships
  • Models without convolution would be able to learn even if we permuted all the pixels in the image (permutation invariant)
40/42
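The permutation argument can be illustrated with a tiny example. A dense layer gives identical outputs when the inputs and the weight columns are permuted in the same way, so an equivalent solution exists for permuted pixels; a small sliding kernel has no such freedom. All values and helper names below are hypothetical.

```python
def dense(W, x):
    """Fully connected layer without bias or nonlinearity: u = W x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def cross_correlate1d(x, k):
    """'Valid' 1-D cross-correlation with a small kernel."""
    return [sum(k[m] * x[i + m] for m in range(len(k)))
            for i in range(len(x) - len(k) + 1)]


perm = [2, 0, 3, 1]  # a fixed permutation of the four "pixels"
x = [1, 2, 3, 4]
W = [[1, 2, 3, 4],
     [0, -1, 0, 1]]
kernel = [1, -1]

x_perm = [x[p] for p in perm]
W_perm = [[row[p] for p in perm] for row in W]  # permute weight columns the same way

dense_original = dense(W, x)             # [30, 2]
dense_permuted = dense(W_perm, x_perm)   # [30, 2]

conv_original = cross_correlate1d(x, kernel)       # [-1, -1, -1]
conv_permuted = cross_correlate1d(x_perm, kernel)  # [2, -3, 2]
```

The dense layer is unaffected once its weights are permuted accordingly, while the convolutional feature map changes because the kernel relies on neighboring positions being meaningful.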

Summary (9.4)
• An infinitely strong prior places zero probability on some parameters and says that these parameter values are completely forbidden.
• We can interpret a CNN as a fully connected NN with an infinitely strong prior and get some insights into how CNNs work
  • Convolution and pooling can cause underfitting
  • Comparison of convolutional models and models without convolution is unfair
41/42

Summary
• CNNs are one of the success stories of biologically inspired AI
• In the context of NNs, cross-correlation is used instead of convolution
• Convolution leverages three important ideas that can help improve an ML system:
  • Sparse Interactions, Parameter Sharing, Equivariant Representations
• Pooling summarizes an output, makes the representation approximately invariant to small translations, and handles inputs of varying size
• We can interpret a CNN as a fully connected NN with an infinitely strong prior and get some insights:
  • Convolution and pooling can cause underfitting
  • Comparison of convolutional models and models without convolution is unfair
42/42
