Slide 1

Convolutional Neural Network
Naoaki Okazaki
School of Computing, Tokyo Institute of Technology
okazaki@c.titech.ac.jp
PowerPoint template designed by https://ppt.design4u.jp/template/

Slide 2

Foundations of Convolutional Neural Networks

Slide 3

Classifying an image using ResNet-50 (CNN)
https://github.com/chokkan/deeplearning/blob/master/notebook/resnet.ipynb

Slide 4

Recap: Image classification on MNIST

Image (28 × 28) → input vector x ∈ ℝ^785 → output vector ŷ ∈ ℝ^10 (10 dims, e.g., .000 .001 .003 .986 … .002)

Single-layer neural network:
ŷ = W x, W ∈ ℝ^(10×785)

Multi-layer neural network (with ReLU at the 1st layer):
ŷ = W₂ max(0, W₁ x), W₁ ∈ ℝ^(h×785), W₂ ∈ ℝ^(10×h) (h: dimension of the hidden layer)
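
As a concrete reference, here is a minimal PyTorch sketch of the two networks above; the hidden size h = 100 is an illustrative assumption, and the 785th input dimension stands for the bias term implied by ℝ^785:

```python
import torch
import torch.nn as nn

x = torch.randn(785)  # flattened 28 x 28 image plus one bias dimension (assumed)

# Single-layer network: y = W x
single = nn.Linear(785, 10, bias=False)

# Multi-layer network with ReLU at the 1st layer: y = W2 max(0, W1 x)
h = 100  # hidden dimension (illustrative)
multi = nn.Sequential(
    nn.Linear(785, h, bias=False),  # W1
    nn.ReLU(),                      # max(0, .)
    nn.Linear(h, 10, bias=False),   # W2
)

print(single(x).shape, multi(x).shape)  # torch.Size([10]) torch.Size([10])
```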

Slide 5

A fully-connected layer on an image is inefficient

 Consider a two-layer neural network with:
 Input: 40,000 dimensions (an input image of 200 × 200 pixels)
 Hidden layer: 20,000 dimensions
 Output: 1,000 dimensions (1,000 categories of objects)
 The number of parameters is huge, approximately 0.82 billion (1.6 GB with float16)
 1st layer: 40,000 × 20,000 = 800,000,000
 2nd layer: 20,000 × 1,000 = 20,000,000
 The number of parameters depends on the size of the input images
 This treatment also ignores stationarity in images
 Patterns appearing at different positions
 Positional shifts

Slide 6

2D Convolution (1/4)

Compute a dot product between a submatrix of the input matrix and a weight matrix (the filter):

2 × (−1) + 1 × (−1) + 0 × (−1) + 3 × (−1) + 2 × 8 + 1 × (−1) + 4 × (−1) + 3 × (−1) + 2 × (−1)
= −2 − 1 − 3 + 16 − 1 − 4 − 3 − 2
= 0

Slide 7

2D Convolution (2/4)

Compute a dot product between a submatrix of the input matrix and the weight matrix, changing the position of the submatrix (moving it to the right, in this example):

1 × (−1) + 0 × (−1) + 0 × (−1) + 2 × (−1) + 1 × 8 + 1 × (−1) + 3 × (−1) + 2 × (−1) + 1 × (−1)
= −1 − 2 + 8 − 1 − 3 − 2 − 1
= −2

Slide 8

2D Convolution (3/4)

3 × (−1) + 2 × (−1) + 1 × (−1) + 4 × (−1) + 3 × 8 + 2 × (−1) + 2 × (−1) + 2 × (−1) + 1 × (−1)
= −3 − 2 − 1 − 4 + 24 − 2 − 2 − 2 − 1
= 7

Slide 9

2D Convolution (4/4) (convolve: to wind around; to fold in)

2 × (−1) + 1 × (−1) + 1 × (−1) + 3 × (−1) + 2 × 8 + 1 × (−1) + 2 × (−1) + 1 × (−1) + 0 × (−1)
= −2 − 1 − 1 − 3 + 16 − 1 − 2 − 1
= 5

Slide 10

Size of 2D convolution results

(4 × 4 input) * (3 × 3 filter) = (2 × 2 output)

Filter:
-1 -1 -1
-1  8 -1
-1 -1 -1

Width of output: (width of input matrix) − (width of weight matrix) + 1
Height of output: (height of input matrix) − (height of weight matrix) + 1
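
These worked examples can be reproduced with a short numpy sketch (naive, for illustration only); the 4 × 4 input is the one used on the preceding slides:

```python
import numpy as np

x = np.array([[2, 1, 0, 0],
              [3, 2, 1, 1],
              [4, 3, 2, 1],
              [2, 2, 1, 0]])   # the 4 x 4 input of the running example
w = np.array([[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]])   # the 3 x 3 edge-detection filter

def conv2d(x, w):
    """Naive 2D convolution (cross-correlation, as computed on the slides)."""
    H, W = x.shape
    h, k = w.shape
    out = np.empty((H - h + 1, W - k + 1))  # the size formula above
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+h, j:j+k] * w)
    return out

print(conv2d(x, w))  # [[ 0. -2.]
                     #  [ 7.  5.]] -- the four values computed on the previous slides
```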

Slide 11

Padding

Extend the input matrix (here with zeros) to adjust the size of the output:

Padded input (the 4 × 4 input surrounded by zeros):
0 0 0 0 0 0
0 2 1 0 0 0
0 3 2 1 1 0
0 4 3 2 1 0
0 2 2 1 0 0
0 0 0 0 0 0

Filter (3 × 3):
-1 -1 -1
-1  8 -1
-1 -1 -1

Output (4 × 4, the same size as the input excluding padding):
10  0 -5 -2
12  0 -2  4
20  7  5  3
 7  4  0 -4
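
Zero padding is then a one-line extension of the sketch above (x, w, and conv2d as defined there; np.pad fills with zeros by default):

```python
x_pad = np.pad(x, 1)     # add one ring of zeros around the 4 x 4 input
print(conv2d(x_pad, w))  # 4 x 4 output, the same size as the unpadded input:
# [[10.  0. -5. -2.]
#  [12.  0. -2.  4.]
#  [20.  7.  5.  3.]
#  [ 7.  4.  0. -4.]]
```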

Slide 12

Stride

Slide the filter by a fixed distance, the stride (stride = 2 in this example):

Input (5 × 5):
2 2 1 1 0
1 2 1 0 0
2 3 2 1 1
3 4 3 2 1
1 2 2 1 0

Filter (3 × 3):
-1 -1 -1
-1  8 -1
-1 -1 -1

Output so far (filter at the top-left corner): 4

Slide 13

Stride (stride = 2)

The filter moves two columns to the right (same input and filter as above).

Output so far:
4 -7

Slide 14

Stride (stride = 2)

The filter moves two rows down and returns to the left edge.

Output so far:
 4 -7
14

Slide 15

Stride (stride = 2)

The filter reaches its last position.

Output (2 × 2):
 4 -7
14 -5
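
With a stride s, the output size per dimension becomes ⌊(n − k)/s⌋ + 1 for input size n and filter size k. A minimal extension of the earlier numpy sketch (stride support is added here for illustration):

```python
import numpy as np

def conv2d_strided(x, w, stride=1):
    """Naive 2D convolution (cross-correlation) with a stride."""
    H, W = x.shape
    h, k = w.shape
    out = np.empty(((H - h) // stride + 1, (W - k) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(x[r:r+h, c:c+k] * w)
    return out

x = np.array([[2, 2, 1, 1, 0],
              [1, 2, 1, 0, 0],
              [2, 3, 2, 1, 1],
              [3, 4, 3, 2, 1],
              [1, 2, 2, 1, 0]])  # the 5 x 5 input of this example
w = np.array([[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]])
print(conv2d_strided(x, w, stride=2).shape)  # (2, 2): floor((5-3)/2)+1 per side
```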

Slide 16

2D Convolution with multiple channels

 A color image is represented with matrices for multiple channels: for example, brightness for red (R), green (G), and blue (B)
 We extend the filter to multiple channels and compute the sum over all channels to obtain an output matrix with a single channel

Input (3 channels, 4 × 4 each):
3 4 2 1    2 3 4 2    2 1 0 0
3 2 1 1    3 2 1 1    3 2 1 1
4 3 2 1    4 3 2 1    4 3 2 1
2 2 1 4    2 2 1 1    2 2 1 0

Filter (3 channels, 3 × 3 each):
-2 -2  0    0  0 -2   -1 -1 -1
-1  8 -5   -1  8  0   -1  8 -1
-1 -1 -3   -1 -1 -3   -1 -1 -1

Output (1 channel, 2 × 2):
 2 -8
19 22
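
In code, a multi-channel convolution is just the single-channel one summed over channels; a minimal numpy sketch with random data (only the shapes mirror the slide):

```python
import numpy as np

def conv2d_multichannel(x, w):
    """x: (C, H, W) input, w: (C, h, k) filter -> (H-h+1, W-k+1) output."""
    C, H, W = x.shape
    _, h, k = w.shape
    out = np.zeros((H - h + 1, W - k + 1))
    for c in range(C):  # sum the per-channel convolutions
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] += np.sum(x[c, i:i+h, j:j+k] * w[c])
    return out

x = np.random.randint(0, 5, size=(3, 4, 4))   # 3-channel 4 x 4 input
w = np.random.randint(-2, 9, size=(3, 3, 3))  # 3-channel 3 x 3 filter
print(conv2d_multichannel(x, w).shape)        # (2, 2), a single output channel
```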

Slide 17

2D Convolution with multiple filters

Applying several multi-channel filters to the same input yields one output channel per filter.

Input: the same 3-channel 4 × 4 image as on the previous slide.

Filter 1 (the filter of the previous slide):
-2 -2  0    0  0 -2   -1 -1 -1
-1  8 -5   -1  8  0   -1  8 -1
-1 -1 -3   -1 -1 -3   -1 -1 -1

Filter 2 (identical except for the third channel):
-2 -2  0    0  0 -2    0 -1  0
-1  8 -5   -1  8  0   -1  4 -1
-1 -1 -3   -1 -1 -3    0 -1  0

Output (2 channels, 2 × 2 each):
 2 -8      5 -9
19 22     19 26
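
PyTorch's torch.nn.functional.conv2d implements this stacking of multi-channel filters directly; a minimal shape check with random data:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 4, 4)  # (batch, in_channels, height, width)
w = torch.randn(2, 3, 3, 3)  # (out_channels = 2 filters, in_channels, h, k)
y = F.conv2d(x, w)           # each filter sums over the 3 input channels
print(y.shape)               # torch.Size([1, 2, 2, 2]): one 2 x 2 map per filter
```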

Slide 18

The weight matrix of a 2D convolution is also known as a kernel, mask, or image filter

Identity:
0 0 0
0 1 0
0 0 0

https://github.com/chokkan/deeplearning/blob/master/notebook/convolution.ipynb

Slide 19

Brighten and Darken

Darken:
0  0  0
0 0.5 0
0  0  0

Brighten:
0  0  0
0 1.5 0
0  0  0

https://github.com/chokkan/deeplearning/blob/master/notebook/convolution.ipynb

Slide 20

Blur

Gaussian blur (× 1/16):
1 2 1
2 4 2
1 2 1

Box blur (× 1/9):
1 1 1
1 1 1
1 1 1

https://github.com/chokkan/deeplearning/blob/master/notebook/convolution.ipynb

Slide 21

Sharpen and edge detection

Edge detection:
-1 -1 -1
-1  8 -1
-1 -1 -1

Sharpen:
 0 -1  0
-1  5 -1
 0 -1  0

https://github.com/chokkan/deeplearning/blob/master/notebook/convolution.ipynb

Slide 22

Prewitt edge detection

Horizontal edge detection:
-1 -1 -1
 0  0  0
 1  1  1

Vertical edge detection:
-1 0 1
-1 0 1
-1 0 1

https://github.com/chokkan/deeplearning/blob/master/notebook/convolution.ipynb
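
Any of these kernels can be applied to a grayscale image with scipy, as in the linked notebook; a minimal sketch (the file name sample.jpg is a placeholder):

```python
import numpy as np
from PIL import Image
from scipy.signal import correlate2d

# Load a grayscale image (file name is a placeholder; any image works).
img = np.asarray(Image.open('sample.jpg').convert('L'), dtype=float)

kernels = {
    'box_blur': np.ones((3, 3)) / 9.0,
    'sharpen': np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
    'edge': np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float),
}
for name, k in kernels.items():
    # correlate2d slides the kernel without flipping it, matching the slides.
    out = correlate2d(img, k, mode='same', boundary='fill')
    Image.fromarray(np.clip(out, 0, 255).astype(np.uint8)).save(f'{name}.png')
```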

Slide 23

2D Convolution represented by dot products of vectors (1/4)

Flatten the filter and the current 3 × 3 block of the input into 9-dimensional vectors; their dot product gives one output element:

(−1 −1 −1 −1 8 −1 −1 −1 −1) ⋅ (2 1 0 3 2 1 4 3 2) = 0

Output so far: 0

Slide 24

2D Convolution represented by dot products of vectors (2/4)

(−1 −1 −1 −1 8 −1 −1 −1 −1) ⋅ (1 0 0 2 1 1 3 2 1) = −2

Output so far: 0 −2

Slide 25

2D Convolution represented by dot products of vectors (3/4)

(−1 −1 −1 −1 8 −1 −1 −1 −1) ⋅ (3 2 1 4 3 2 2 2 1) = 7

Output so far: 0 −2 7

Slide 26

2D Convolution represented by dot products of vectors (4/4)

(−1 −1 −1 −1 8 −1 −1 −1 −1) ⋅ (2 1 1 3 2 1 2 1 0) = 5

Output: 0 −2 7 5

Slide 27

2D Convolution represented by matrix-vector product

Stacking the four flattened blocks as the rows of a matrix turns the whole convolution into a single matrix-vector product:

2 1 0 3 2 1 4 3 2
1 0 0 2 1 1 3 2 1
3 2 1 4 3 2 2 2 1
2 1 1 3 2 1 2 1 0

multiplied by the flattened filter (−1 −1 −1 −1 8 −1 −1 −1 −1)ᵀ gives (0 −2 7 5)ᵀ, i.e., the 2 × 2 output:
 0 -2
 7  5
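
This "flatten the blocks and stack them" construction is commonly called im2col; a minimal numpy sketch reproducing the numbers above:

```python
import numpy as np

x = np.array([[2, 1, 0, 0],
              [3, 2, 1, 1],
              [4, 3, 2, 1],
              [2, 2, 1, 0]])
w = np.array([-1, -1, -1, -1, 8, -1, -1, -1, -1])  # flattened 3 x 3 filter

# One row per sliding 3 x 3 block, each flattened to 9 dimensions (im2col).
X = np.array([x[i:i+3, j:j+3].ravel() for i in range(2) for j in range(2)])
print(X @ w)                  # [ 0 -2  7  5]
print((X @ w).reshape(2, 2))  # the 2 x 2 convolution output
```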

Slide 28

2D Convolution in math (single channel output)

Suppose we have B sliding blocks in the input:
x_b ∈ ℝ^d: flattened vector of the b-th block
w ∈ ℝ^d: flattened weight vector of the filter
h_b ∈ ℝ: output from the filter for the b-th block
d: dimension of the vectors (e.g., width × height of the filter)

h_1 = w ⋅ x_1,  h_2 = w ⋅ x_2,  h_3 = w ⋅ x_3,  …,  h_B = w ⋅ x_B

With X = (x_1 … x_B) ∈ ℝ^(d×B) and h = (h_1 … h_B)ᵀ ∈ ℝ^B, this is simply h = Xᵀw.

Slide 29

2D Convolution in math (multiple channel output)

Suppose that we have K filters:
w_k ∈ ℝ^d: flattened weight vector of the k-th filter
W = (w_1 … w_K) ∈ ℝ^(d×K): the filters represented as a matrix

h_1 = Xᵀw_1,  h_2 = Xᵀw_2,  …,  h_K = Xᵀw_K

H = XᵀW = [h_1 … h_K] ∈ ℝ^(B×K)
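
In numpy, the multi-filter case is then a single matrix product H = XᵀW; a minimal sketch with random data (the sizes d, B, K are illustrative):

```python
import numpy as np

d, B, K = 9, 4, 2          # block dimension, number of blocks, number of filters
X = np.random.randn(d, B)  # columns are flattened input blocks
W = np.random.randn(d, K)  # columns are flattened filters
H = X.T @ W                # H = X^T W, one output column per filter
print(H.shape)             # (4, 2) = (B, K)
```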

Slide 30

Convolution layer

 Applies multiple filters to small blocks of the input image
 Filter parameters are shared/reused across different blocks
 Filter parameters are learned from the supervision data
 Uses far fewer parameters than a fully-connected layer
 1,000 filters on a 10 × 10 window: only 100,000 parameters (much smaller than a fully-connected layer, e.g., the 800,000,000 parameters seen earlier)

Slide 31

Translation invariance

 Assume that three filters are trained to detect a nose, an eye, and a mouth
 However, we cannot presume where these objects are located in an image
 How can we incorporate invariance to different positions?

Input image → (Nose filter) (Mouth filter) (Eye filter)

Slide 32

Pooling

 Down-samples the outputs from a filter (shrinking a feature map)
 Discards exact positions, and focuses on rough positions
 Popular method: max pooling (taking the max within a partition)

Input image → (Nose filter) (Mouth filter) (Eye filter)

Slide 33

Max pooling

Extract the maximum value in each partition:

Input (4 × 4):
2 1 0 0
3 2 1 1
4 3 2 1
2 2 1 0

Max pooling with stride 2 × 2:
3 1
4 2

Other pooling operations (e.g., average pooling, ℓ2-norm pooling) are also used, but they are less popular for performance reasons.
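
A minimal numpy sketch of 2 × 2 max pooling on the matrix above:

```python
import numpy as np

x = np.array([[2, 1, 0, 0],
              [3, 2, 1, 1],
              [4, 3, 2, 1],
              [2, 2, 1, 0]])

# Split into non-overlapping 2 x 2 partitions and take the max of each.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[3 1]
               #  [4 2]]
```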

Slide 34

Integrating primitive detection results

 How can we integrate the outputs from multiple filters?
 We apply convolutions to the outputs of the filters, expecting that the filtering results are integrated in the upper layer

Input image → First layer → Second layer

Slide 35

Convolutional Neural Network (CNN)

Input → Convolution → Non-linear → Pooling → … (multiple layers) … → Fully-connected → Output (classification)

A stack of convolution, non-linear, and pooling layers:
 Convolution layer
 Non-linear transformation (e.g., ReLU)
 Pooling layer (e.g., max pooling)
Followed by fully-connected layer(s) to make predictions.
Parameters are trained by backpropagation (in an end-to-end fashion).
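
A minimal PyTorch sketch of this overall architecture (the layer sizes and input resolution are illustrative assumptions, not taken from the slide):

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution
    nn.ReLU(),                                    # non-linear
    nn.MaxPool2d(2),                              # pooling
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # convolution
    nn.ReLU(),                                    # non-linear
    nn.MaxPool2d(2),                              # pooling
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # fully-connected prediction
)
x = torch.randn(1, 3, 32, 32)  # e.g., a 32 x 32 RGB image
print(cnn(x).shape)            # torch.Size([1, 10])
```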

Slide 36

Convolutional Neural Network

Slide 37

ImageNet Large-scale Visual Recognition Challenge (ILSVRC)
http://www.image-net.org/challenges/LSVRC/

 An evaluation workshop for object detection and image classification
 Allows researchers to compare algorithms for these tasks
 Held annually from 2010 to 2017
 Based on a large-scale dataset (ImageNet)
 ILSVRC uses a subset of ImageNet
 For example, the training set of the classification task includes about 1.2M images associated with 1,000 categories
 A driving force for research on deep learning
 Convolutional Neural Networks made a remarkable improvement in ILSVRC 2012
 Several innovative methods appeared along with the challenges

Slide 38

Performance improvements on the classification task
http://www.image-net.org/challenges/LSVRC/

Error rate [%] of the winning systems by year:
NEC-UIUC (2010): 28.19
XRCE (2011): 25.77
SuperVision (AlexNet) (2012): 16.42
Clarifai (ZFNet) (2013): 11.74
GoogLeNet (2014): 6.66
MSRA (ResNet) (2015): 3.57
Trimps-Soushen (ResNet) (2016): 2.99
WMW (SENet) (2017): 2.25

Improvements from Convolutional Neural Networks (2012 onward)

Slide 39

Classification task

"Algorithms produce a list of object categories present in the image" (Russakovsky et al., 2015)

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211-252.
http://www.image-net.org/challenges/LSVRC/2013/slides/ILSVRC2013_12_7_13_clsloc.pdf

Slide 40

Single-object localization

"Algorithms produce a list of object categories present in the image, along with an axis-aligned bounding box indicating the position and scale of one instance of each object category." (Russakovsky et al., 2015)

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211-252.
http://www.image-net.org/challenges/LSVRC/2013/slides/ILSVRC2013_12_7_13_clsloc.pdf

Slide 41

Detection task

"Algorithms produce a list of object categories (out of 200 categories) present in the image, along with an axis-aligned bounding box indicating the position and scale of every instance of each object category" (Russakovsky et al., 2015)

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211-252.
http://www.image-net.org/challenges/LSVRC/2013/slides/ILSVRC2013_12_7_13_clsloc.pdf

Slide 42

ImageNet includes 14M+ images annotated with 20k+ categories
http://image-net.org/explore_popular.php

Slide 43

ImageNet is a knowledge ontology

 Categories of ImageNet are defined by WordNet
 WordNet provides a hierarchy between concepts (an ontology)

This slide is from: http://www.image-net.org/papers/ImageNet_2010.pdf

Slide 44

AlexNet (Krizhevsky et al., 2012)

 The winner of ILSVRC 2012
 The error rate was drastically reduced (from 25.77% to 16.42%)
 Consists of 5 convolution layers and 3 fully-connected layers
 The architecture used cutting-edge methods (e.g., ReLU, dropout)
 Designed to run on two GPUs (to fit the model into the small GPU memory)

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proc. of NIPS, pp. 1097-1105.

Slide 45

Implementation of AlexNet in PyTorch
https://pytorch.org/docs/stable/_modules/torchvision/models/alexnet.html

5 convolution layers followed by 3 fully-connected layers.
The number of channels at each layer differs from that described in the original paper, because this implementation is based on an older torch7 implementation that fit into a single GPU. See: https://github.com/pytorch/vision/pull/463
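
For reference, an abridged sketch of the torchvision definition linked above (paraphrased from the source; details may differ across torchvision versions):

```python
import torch.nn as nn

# Abridged from torchvision.models.alexnet (may differ slightly by version).
features = nn.Sequential(  # 5 convolution layers
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
classifier = nn.Sequential(  # 3 fully-connected layers
    nn.Dropout(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Dropout(),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)
```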

Slide 46

Visualizing internal layers (Zeiler and Fergus, 2014)

 For each layer and convolution filter, find the top-9 highest outputs
 Reconstruct the original input using a 'deconvnet'
 Inverse transformation: map the outputs back to the input space
 It is impossible to reconstruct the original image completely, but the pixels contributing to the high outputs are highlighted
 The visualization also shows the original image patch

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Proc. of ECCV, pp. 818-833.

Slide 47

Visualizing internal layers (Zeiler and Fergus, 2014)

 Observations
 Strong grouping within each filter (feature map)
 Lower layers tend to focus on primitive shapes and patterns
 Higher layers seem to recognize the objects to be classified
 Exaggeration of discriminative parts of the image
 Eyes and noses of dogs (layer 4, row 1, col 1, next page)
 Grass in the background, not the foreground objects (layer 5, row 1, col 2)

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Proc. of ECCV, pp. 818-833.

Slide 48

Visualizing internal layers (Zeiler and Fergus, 2014)

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Proc. of ECCV, pp. 818-833.

Slide 49

Neocognitron (Fukushima and Miyake, 1982)

The idea of integrating local features in a hierarchical network was proposed in 1982 by Kunihiko Fukushima (Fukushima and Miyake, 1982).

Kunihiko Fukushima and Sei Miyake. 1982. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15(6):455-459.

Slide 50

LeNet (LeCun et al., 1998)

The first architecture that is very close to recent CNNs:
 Proposed for handwritten character recognition
 The model is trained by backpropagation
Some differences:
 Sigmoid activation function (instead of ReLU)
 Subsampling pooling (instead of max pooling)

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278-2324.

Slide 51

VGGNet (Simonyan and Zisserman, 2015)

 A simple and popular CNN architecture
 Explores a deeper CNN
 Mostly uses filters with a small receptive field: 3 × 3 (the smallest size to capture the notion of left/right, up/down, and center)
 Max-pooling is performed over a 2 × 2 pixel window (i.e., down-sampling to half)
 The number of channels is increased by a factor of 2 after each pooling layer
 Ranked second in ILSVRC 2014

Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. Proc. of ICLR.
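
A sketch of one VGG-style stage following these rules (the channel counts are illustrative):

```python
import torch.nn as nn

def vgg_stage(in_ch, out_ch):
    # Two 3 x 3 convolutions followed by 2 x 2 max pooling (halves the resolution).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

# Channels double after each pooling stage: 64 -> 128 -> 256 ...
stages = nn.Sequential(vgg_stage(3, 64), vgg_stage(64, 128), vgg_stage(128, 256))
```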

Slide 52

ResNet (He et al., 2016)

 Explores an even deeper architecture (152 layers)
 However, deeper networks are difficult to train because of the vanishing gradient problem
 Proposed a residual learning framework to ease the training of deep networks
 Residual connection:
 Suppose that we want to learn a function h(x)
 We consider another mapping: F(x) = h(x) − x
 Then, the original mapping is F(x) + x
 We can view F(x) + x as a feedforward neural network with shortcut connections
 Training F(x) is easier than h(x)
 Batch normalization
 The winner of ILSVRC 2015

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. Proc. of CVPR.
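
A minimal PyTorch sketch of a residual connection (simplified; actual ResNet blocks also use batch normalization and, at some stages, projection shortcuts):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes F(x) + x, where F is a small stack of convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.relu(self.f(x) + x)  # shortcut connection adds x back

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 8, 8)).shape)  # torch.Size([1, 64, 8, 8])
```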

Slide 53

Summary

 A CNN is a stack of convolution, non-linear, and pooling layers
 The convolution layer applies filters to the input (resembling classic image filters)
 Non-linear transformation (e.g., ReLU)
 The pooling layer down-samples the outputs (e.g., max pooling)
 After a stack of convolutions, fully-connected layer(s) make predictions
 Parameters (e.g., filter weights) are trained by backpropagation
 Many innovative ideas, in addition to advances in computational power and big data, have improved the performance of image classification