Error %                                                    Val Top-1   Val Top-5   Test Top-5
(Krizhevsky et al., 2012), 1 convnet                         40.7        18.2          -
(Krizhevsky et al., 2012), 5 convnets                        38.1        16.4        16.4
(Krizhevsky et al., 2012), 1 convnet*                        39.0        16.6          -
(Krizhevsky et al., 2012), 7 convnets*                       36.7        15.4        15.3
Our replication of (Krizhevsky et al., 2012), 1 convnet      40.5        18.1          -
1 convnet as per Fig. 2                                      38.3        16.4        16.5
5 convnets as per Fig. 2                                     36.6        15.3        15.3

Table 2. ImageNet 2012 classification error rates. The * indicates models that were trained on both ImageNet 2011 and 2012 training sets with an additional convolution layer.

When we combine multiple models, we obtain a test error of 15.3%, which matches the absolute best performance on this dataset, despite only using the much smaller 2012 training set. We note that this error is almost half that of the top non-convnet entry in the ImageNet 2012 classification challenge, which obtained 26.1% error.

3.1. Training Details

The models were trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center, with and without horizontal flips).

...change in the image from which the strongest activation originates. Due to space constraints, only a randomly selected subset of feature maps is visualized and zooming is needed to see the details clearly. As expected, the first layer filters consist of Gabors and low-frequency color. The 2nd layer features are more complex, corresponding to conjunctions of edges and color patterns. The 3rd layer features show large image parts. Within a given feature projection, significant variations in contrast can be seen, showing which parts of the image contribute most to the activation and thus are most discriminative, e.g. the lips and eyes on the person's face (Row 12). The visualizations from the 4th and 5th layers show activations that respond to complex objects. Note that little of the scene background is reconstructed, since it is irrelevant to predicting the class.

4.2. Feature Invariance

Fig. 4 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations.

Error %                                                            Train Top-1   Val Top-1   Val Top-5
Our replication of (Krizhevsky et al., 2012), 1 convnet               35.1         40.5        18.1
Removed layers 3,4                                                    41.8         45.4        22.1
Removed layer 7                                                       27.4         40.0        18.4
Removed layers 6,7                                                    27.4         44.8        22.4
Removed layers 3,4,6,7                                                71.1         71.3        50.1
Adjust layers 6,7: 2048 units                                         40.3         41.7        18.8
Adjust layers 6,7: 8192 units                                         26.8         40.0        18.1
Our Model (as per Fig. 3)                                             33.1         38.4        16.6
Adjust layers 6,7: 2048 units                                         38.2         40.2        17.6
Adjust layers 6,7: 8192 units                                         22.0         38.8        17.0
Adjust layers 3,4,5: 512,1024,512 maps                                18.8         37.5        16.0
Adjust layers 6,7: 8192 units and layers 3,4,5: 512,1024,512 maps     10.0         38.3        16.9

Table 3. ImageNet 2012 classification error rates with various architectural changes to the model of (Krizhevsky et al., 2012) and our ImageNet model.
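As a rough illustration of the model combination reported after Table 2, the sketch below averages class probabilities over the 10 sub-crops of an image and then over several convnets before taking the top-5 prediction. The excerpt does not spell out the exact combination rule, so the averaging scheme, the `models` callables and the `ensemble_top5_error` helper are assumptions for illustration only, not the authors' code.

```python
import numpy as np

def ensemble_top5_error(models, crop_batches, labels):
    """Hypothetical evaluation loop: each model maps a (10, 224, 224, 3) batch of
    sub-crops to (10, 1000) class probabilities; predictions are averaged over
    crops, then over models, before taking the five most probable classes."""
    errors = 0
    for crops, label in zip(crop_batches, labels):
        per_model = [m(crops).mean(axis=0) for m in models]   # (1000,) per model
        probs = np.mean(per_model, axis=0)                    # ensemble average
        top5 = np.argsort(probs)[-5:]                         # five most probable classes
        errors += int(label not in top5)
    return errors / len(labels)
```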
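The preprocessing of Section 3.1 can be made concrete with the following sketch using NumPy and Pillow. It assumes the per-pixel mean image has already been computed over the whole training set, and the names (`ten_crops`, `per_pixel_mean`) are illustrative rather than taken from the paper.

```python
import numpy as np
from PIL import Image

def ten_crops(path, per_pixel_mean, resize_to=256, crop=224):
    """Return the 10 mean-subtracted 224x224 sub-crops of one training image."""
    img = Image.open(path).convert("RGB")

    # Resize so the smaller dimension becomes 256, then keep the central 256x256 region.
    w, h = img.size
    scale = resize_to / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - resize_to) // 2, (h - resize_to) // 2
    img = img.crop((left, top, left + resize_to, top + resize_to))

    # Subtract the per-pixel mean computed across all training images.
    x = np.asarray(img, dtype=np.float32) - per_pixel_mean    # shape (256, 256, 3)

    # Ten 224x224 sub-crops: four corners + center, each with and without a horizontal flip.
    off = resize_to - crop
    offsets = [(0, 0), (0, off), (off, 0), (off, off), (off // 2, off // 2)]
    crops = []
    for dy, dx in offsets:
        sub = x[dy:dy + crop, dx:dx + crop]
        crops.append(sub)
        crops.append(sub[:, ::-1])                            # horizontal flip
    return np.stack(crops)                                    # shape (10, 224, 224, 3)
```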
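The "top 9 activations" per feature map mentioned in Section 4.2 can be collected with a single pass over a set of images, keeping for each feature map the nine images that excite it most strongly; the subsequent deconvnet projection back to pixel space is not shown here. This is a minimal sketch, not the authors' implementation: `forward_to_layer` is a hypothetical function returning one layer's feature maps as an array of shape (num_maps, H, W).

```python
import heapq

def top9_per_feature_map(images, forward_to_layer):
    """For each feature map of the chosen layer, keep the 9 images that excite it most."""
    top = {}                                       # map index -> [(activation, image index)]
    for idx, img in enumerate(images):
        fmaps = forward_to_layer(img)              # hypothetical; shape (num_maps, H, W)
        for m in range(fmaps.shape[0]):
            strongest = float(fmaps[m].max())      # strongest activation in map m
            heap = top.setdefault(m, [])
            if len(heap) < 9:
                heapq.heappush(heap, (strongest, idx))
            else:                                  # replace the weakest of the current nine
                heapq.heappushpop(heap, (strongest, idx))
    # Strongest first; each retained image would then be projected down to pixel space.
    return {m: sorted(h, reverse=True) for m, h in top.items()}
```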