
Visualization and Understanding Convolutional Neural Networks

By Matthew Zeiler. Video here:

Hakka Labs

January 30, 2015

Transcript

  1. Overview
     •  Visualization technique based on Deconvolutional Networks
     •  Applied to convolutional neural networks
        –  determine what each layer learns
        –  provides insight for architecture selection
  2. Convolutional Networks (LeCun et al. ’89)
     •  Supervised & feed-forward
     •  Each layer:
        –  Convolve input with filters
        –  Non-linearity (rectified linear)
        –  Pooling (local max)
     •  Train convolutional filters by back-propagating classification error
     [Diagram: Input Image → Convolution (learned) → Non-linearity → Pooling → Feature maps]
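A minimal sketch of one such layer in PyTorch (an addition to this transcript, not the speaker's code; the channel count, kernel size, and strides are illustrative placeholders):

```python
import torch
import torch.nn as nn

# One convnet layer as described above: convolution with learned filters,
# a rectified-linear non-linearity, then local max pooling.
conv = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=7, stride=2, padding=1)
relu = nn.ReLU()
pool = nn.MaxPool2d(kernel_size=3, stride=2)

image = torch.randn(1, 3, 224, 224)       # a single RGB input image
feature_maps = pool(relu(conv(image)))    # convolve -> rectify -> pool
print(feature_maps.shape)                 # torch.Size([1, 96, 54, 54])
```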
  3. Krizhevsky et al. [NIPS2012]
     •  7 hidden layers, 650,000 neurons, 60,000,000 parameters
     •  Trained on 2 GPUs for a week
     •  Essentially the same model as LeCun ’98 but:
        –  bigger model
        –  more data
        –  GPU implementation (similar to Ciresan et al. 2011)
  4. Convnets Show Huge Gains
     [Bar chart: top-5 error rate (%) of ImageNet 2012 classification competition entries (SuperVision, ISI, Oxford, INRIA, Amsterdam)]
  5. Deep Nets vs Monkey vs Human
     C.F. Cadieu, H. Hong, D. Yamins, N. Pinto, E.A. Solomon, N.J. Majaj, and J.J. DiCarlo. Deep Neural Networks Rival the Object Recognition Performance of the Primate Visual System. (in submission, 2013)
     [Figure: linear-SVM kernel analysis comparing pixel, DNN, and IT cortex representations; example images (cars, fruits, planes, chairs, tables, faces, animals) used to measure object category recognition performance]
  6. Overview
     •  What are the models learning?
     •  Which part of the model is key to performance?
     •  Do the features generalize?
  7. Deconvolutional Networks [Zeiler et al. CVPR’10, ICCV’11]
     •  Provides a way to map activations at high layers back to the input
     •  Same operations as Convnet, but in reverse:
        –  Unpool feature maps
        –  Convolve unpooled maps
     •  Filters copied from Convnet
     •  Used here purely as a probe
        –  Originally proposed as an unsupervised learning method
        –  No inference, no learning
     [Diagram: deconvnet pipeline mapping Feature maps back to the Input Image via Unpooling, Non-linearity, and Convolution (learned)]
  8. Reuse Feedforward Switches
     [Diagram: feedforward path (Layer Below Pooled Maps → Convolutional Filtering {F} → Feature Maps → Rectified Linear Function → Rectified Feature Maps → Max Pooling → Pooled Maps), with the pooling Switches passed to the deconvnet path (Layer Above Reconstruction → Max Unpooling → Unpooled Maps → Rectified Linear Function → Rectified Unpooled Maps → Convolutional Filtering {F^T} → Reconstruction)]
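A small PyTorch illustration of the switch reuse (sizes here are arbitrary): max pooling records where each maximum came from, and unpooling writes values back to exactly those locations, zero-filling everywhere else.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=3, stride=2, return_indices=True)  # keep the switches
unpool = nn.MaxUnpool2d(kernel_size=3, stride=2)

feature_maps = torch.randn(1, 96, 110, 110)
pooled, switches = pool(feature_maps)                # switches = argmax locations
unpooled = unpool(pooled, switches, output_size=feature_maps.shape)
print(pooled.shape, unpooled.shape)                  # (1, 96, 54, 54) and (1, 96, 110, 110)
```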
  9. Projecting back from Higher Layers
     [Diagram: Convnet path (Input Image → Filters → Layer 1 feature maps → Filters → Layer 2 feature maps) alongside the Deconvnet path (selected Layer 2 feature map with all others set to 0 → Layer 2 reconstruction → Layer 1 reconstruction → Visualization)]
  10. Visualizations of Higher Layers
      •  Use ImageNet 2012 validation set
      •  Push each image through network
      •  Take max activation from feature map associated with each filter
      •  Use Deconvnet to project back to pixel space
      •  Use pooling “switches” peculiar to that activation
      [Diagram: validation images → lower layers → filters → feature map of interest]
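Putting the pieces together, a hedged single-layer sketch of this probing procedure (the layer sizes and the chosen filter index are illustrative; a real run would use the trained filters and project back through all layers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1)   # stand-in for learned filters
pool = nn.MaxPool2d(3, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(3, stride=2)

image = torch.randn(1, 3, 224, 224)        # stand-in for a validation image
rectified = F.relu(conv(image))
pooled, switches = pool(rectified)         # feedforward pass, recording the switches

f = 17                                     # filter to visualize (arbitrary choice)
probe = torch.zeros_like(pooled)
flat = pooled[0, f].argmax()               # location of this filter's strongest activation
r, c = divmod(flat.item(), pooled.shape[-1])
probe[0, f, r, c] = pooled[0, f, r, c]     # zero out everything but that one activation

# Deconvnet: unpool with this image's switches, rectify, filter with F^T.
unpooled = unpool(probe, switches, output_size=rectified.shape)
visualization = F.conv_transpose2d(F.relu(unpooled), conv.weight,
                                   stride=2, padding=1, output_padding=1)
print(visualization.shape)                 # back at input-image resolution
```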
  11. Translation (Horizontal)
      [Plots, three panels (Layer 1, Layer 7, Output): canonical distance and P(true class) vs. horizontal translation (pixels, −60 to 60) for Lawn Mower, Shih-Tzu, African Crocodile, African Grey, Entertainment Center]
  12. Translation (Vertical)
      [Plots, three panels (Layer 1, Layer 7, Output): canonical distance and P(true class) vs. vertical translation (pixels, −60 to 60) for Lawn Mower, Shih-Tzu, African Crocodile, African Grey, Entertainment Center]
  13. Scale Invariance
      [Plots, three panels (Layer 1, Layer 7, Output): canonical distance and P(true class) vs. scale ratio (1 to 1.8) for Lawn Mower, Shih-Tzu, African Crocodile, African Grey, Entertainment Center]
  14. Rotation Invariance
      [Plots, three panels (Layer 1, Layer 7, Output): canonical distance and P(true class) vs. rotation (0 to 350 degrees) for Lawn Mower, Shih-Tzu, African Crocodile, African Grey, Entertainment Center]
  15. Occlusion Experiment
      •  Mask parts of the input with an occluding square
      •  Monitor the output
      •  Is the network perhaps just using scene context?
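A hedged sketch of this experiment (the patch size, stride, and gray fill value are illustrative assumptions; `model` stands for any classifier returning class scores):

```python
import torch

def occlusion_map(model, image, true_class, patch=50, stride=25, fill=0.5):
    """Slide a gray square over `image` (C x H x W) and record the
    true-class probability at each occluder position."""
    _, H, W = image.shape
    heatmap = []
    for top in range(0, H - patch + 1, stride):
        row = []
        for left in range(0, W - patch + 1, stride):
            occluded = image.clone()
            occluded[:, top:top + patch, left:left + patch] = fill   # gray square
            with torch.no_grad():
                scores = model(occluded.unsqueeze(0))
                prob = scores.softmax(dim=1)[0, true_class]
            row.append(prob.item())
        heatmap.append(row)
    return torch.tensor(heatmap)   # low values: the occluder covered the key evidence
```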
  16. [Image panels: input image; total activation in most active 5th-layer feature map; other activations from the same feature map]
  17. [Image panels: input image; total activation in most active 5th-layer feature map; other activations from the same feature map]
  18. [Image panels: input image; total activation in most active 5th-layer feature map; other activations from the same feature map]
  19. Lack of Understanding
      •  What are the models learning?
      •  Which part of the model is key to performance?
      •  Do the features generalize?
  20. Visualizations Help – 2% Boost
      Problems revealed by the visualizations: dead filters, block artifacts, too-specific low-level features, too-simple mid-level features, filters needing renormalization
      Fixes: constrain filter RMS, smaller strides (4 to 2), smaller filters (11x11 to 7x7)
      [Panels (a)–(e): filter and feature visualizations illustrating these issues]
  21. Improved Architecture
      •  Input image: 224x224x3
      •  Layer 1: 96 filters, 7x7, stride 2 → 110x110; 3x3 max pool, stride 2 → 55x55; contrast norm.
      •  Layer 2: 256 filters, 5x5, stride 2 → 26x26; 3x3 max pool, stride 2 → 13x13; contrast norm.
      •  Layer 3: 384 filters, 3x3, stride 1 → 13x13
      •  Layer 4: 384 filters, 3x3, stride 1 → 13x13
      •  Layer 5: 256 filters, 3x3, stride 1 → 13x13; 3x3 max pool, stride 2 → 6x6
      •  Layer 6: 4096 units (fully connected)
      •  Layer 7: 4096 units (fully connected)
      •  Output: C-class softmax
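A hedged PyTorch rendering of this architecture (padding values and the use of LocalResponseNorm as a stand-in for the contrast normalization are assumptions chosen so the spatial sizes roughly match 224 → 110 → 55 → 26 → 13 → 6; not the authors' exact code):

```python
import torch.nn as nn

def improved_net(num_classes):
    return nn.Sequential(
        nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1),    # Layer 1: 110x110
        nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2, padding=1),                     # 55x55
        nn.LocalResponseNorm(5),                                  # contrast norm.
        nn.Conv2d(96, 256, kernel_size=5, stride=2),              # Layer 2: 26x26
        nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2, padding=1),                     # 13x13
        nn.LocalResponseNorm(5),                                  # contrast norm.
        nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),  # Layer 3: 13x13
        nn.ReLU(inplace=True),
        nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),  # Layer 4: 13x13
        nn.ReLU(inplace=True),
        nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),  # Layer 5: 13x13
        nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2),                                # 6x6
        nn.Flatten(),
        nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),      # Layer 6
        nn.Linear(4096, 4096), nn.ReLU(inplace=True),             # Layer 7
        nn.Linear(4096, num_classes),                             # C-class logits (softmax applied in the loss)
    )
```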
  22. Architecture of Krizhevsky et al.
      •  8 layers total
      •  Trained on ImageNet dataset [Deng et al. CVPR’09]
      •  18.2% top-5 error
      •  Our reimplementation: 18.1% top-5 error
      [Diagram: Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Layer 6: Full → Layer 7: Full → Softmax Output]
  23. Architecture of Krizhevsky et al.
      •  Remove top fully connected layer – Layer 7
      •  Drop 16 million parameters
      •  Only 1.1% drop in performance!
      [Diagram: Input Image → Layers 1–6 → Softmax Output]
  24. Architecture of Krizhevsky et al.
      •  Remove both fully connected layers – Layers 6 & 7
      •  Drop ~50 million parameters
      •  5.7% drop in performance
      [Diagram: Input Image → Layers 1–5 → Softmax Output]
  25. Architecture of Krizhevsky et al.
      •  Now try removing upper feature extractor layers – Layers 3 & 4
      •  Drop ~1 million parameters
      •  3.0% drop in performance
      [Diagram: Input Image → Layers 1, 2, 5, 6, 7 → Softmax Output]
  26. Architecture of Krizhevsky et al.
      •  Now try removing both upper feature extractor and fully connected layers – Layers 3, 4, 6, 7
      •  Now only 4 layers
      •  33.5% drop in performance
      ⇒ Depth of network is key
      [Diagram: Input Image → Layers 1, 2, 5 → Softmax Output]
  27. Ablation Study
      [Table 3 from the paper: ImageNet 2012 classification error rates with various architectural changes to the model of (Krizhevsky et al., 2012) and our ImageNet model; columns: Train Top-1 / Val Top-1 / Val Top-5]
      Changes to (Krizhevsky et al., 2012):
      •  Our replication, 1 convnet: 35.1 / 40.5 / 18.1
      •  Removed layers 3,4: 41.8 / 45.4 / 22.1
      •  Removed layer 7: 27.4 / 40.0 / 18.4
      •  Removed layers 6,7: 27.4 / 44.8 / 22.4
      •  Removed layers 3,4,6,7: 71.1 / 71.3 / 50.1
      •  Adjust layers 6,7: 2048 units: 40.3 / 41.7 / 18.8
      •  Adjust layers 6,7: 8192 units: 26.8 / 40.0 / 18.1
      Changes to our model:
      •  Our Model (as per Fig. 3): 33.1 / 38.4 / 16.6
      •  Adjust layers 6,7: 2048 units: 38.2 / 40.2 / 17.6
      •  Adjust layers 6,7: 8192 units: 22.0 / 38.8 / 17.0
      •  Adjust layers 3,4,5: 512,1024,512 maps: 18.8 / 37.5 / 16.0
      •  Adjust layers 6,7: 8192 units and layers 3,4,5: 512,1024,512 maps: 10.0 / 38.3 / 16.9
  28. ImageNet Classification 2012
      [Table 2 from the paper: ImageNet 2012 classification error rates; columns: Val Top-1 / Val Top-5 / Test Top-5; * = trained using ImageNet 2011 and 2012 training sets]
      •  (Gunji et al., 2012): – / – / 26.2
      •  (Krizhevsky et al., 2012), 1 convnet: 40.7 / 18.2 / –
      •  (Krizhevsky et al., 2012), 5 convnets: 38.1 / 16.4 / 16.4
      •  (Krizhevsky et al., 2012)*, 1 convnet: 39.0 / 16.6 / –
      •  (Krizhevsky et al., 2012)*, 7 convnets: 36.7 / 15.4 / 15.3
      •  Our replication of (Krizhevsky et al., 2012), 1 convnet: 40.5 / 18.1 / –
      •  1 convnet as per Fig. 3: 38.4 / 16.5 / –
      •  5 convnets as per Fig. 3 – (a): 36.7 / 15.3 / 15.3
      •  1 convnet as per Fig. 3 but with layers 3,4,5: 512,1024,512 maps – (b): 37.5 / 16.0 / 16.1
      •  6 convnets, (a) & (b) combined: 36.0 / 14.7 / 14.8
      This error is almost half that of the top non-convnet entry in the ImageNet 2012 classification challenge, which obtained 26.2% (Gunji et al., 2012).
  29. ImageNet Classification 2013 Results
      [Bar chart: top-5 test error rates of the 2013 entries, ranging from 11.2% to 16.4% (lower is better)]
      http://www.image-net.org/challenges/LSVRC/2013/results.php
  30. Recent Success
      •  Using smaller strides:
         –  Very Deep Convolutional Networks for Large-Scale Image Recognition, ILSVRC 2014 2nd Classification, 1st Localization, Simonyan and Zisserman, arXiv 2014
         –  Some Improvements on Deep Convolutional Neural Networks, Howard, arXiv 2013
      •  Using visualizations for saliency:
         –  Deep Inside Convolutional Networks: Visualizing Image Classification Models and Saliency Maps, Simonyan, Vedaldi, and Zisserman, arXiv 2014
  31. We didn’t stop there
      [Bar chart: ImageNet 2013 top-5 error rate (lower is better): Clarifai Latest 10.7%, Clarifai ImageNet 2013 11.1%, ImageNet 2012 Winner 15.3%, Traditional Computer Vision 26.2%]
  32. A Less Forgiving Metric
      [Bar chart: ImageNet top-1 validation error (lower is better): Clarifai Latest (10x faster) 28.2%, Clarifai (multiple) 30.6%, Clarifai (fast) 32.5%, Krizhevsky et al. (multiple) 36.7%, Krizhevsky et al. 40.7%]
  33. Overview
      •  What are the models learning?
      •  Which part of the model is key to performance?
      •  Do the features generalize?
  34. Correspondence Measure
      •  Feature vector of image i at a given layer
      •  Feature vector of occluded image i at the same layer
      •  Perturbation vector: [equation on slide]
      •  Change similarity measure: [equation on slide]
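The two equations on this slide are images that did not survive the transcript; a hedged reconstruction, following the definitions in the accompanying paper (Zeiler and Fergus, "Visualizing and Understanding Convolutional Networks"), would be:

```latex
% x_i^l          : feature vector at layer l for original image i
% \tilde{x}_i^l  : feature vector at layer l for the occluded version of image i
\epsilon_i^l = x_i^l - \tilde{x}_i^l            % perturbation vector

% Change similarity: consistency of the perturbation across all pairs of
% related images (H = Hamming distance, sign taken element-wise)
\Delta_l = \sum_{i \neq j} \mathcal{H}\!\left(\operatorname{sign}(\epsilon_i^l),\, \operatorname{sign}(\epsilon_j^l)\right)
```

A lower Delta_l at a given layer means occluding the same object part changes the features in a more consistent way across different images, i.e. the layer implicitly establishes correspondence between those parts.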
  35. Using Features on Other Datasets
      •  Train model on ImageNet 2012 training set
      •  Re-train classifier on new dataset
         –  Just the softmax layer
      •  Classify test set of new dataset
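A hedged sketch of this setup in PyTorch (assuming a `Sequential` model like the `improved_net` sketch earlier, whose last module is the original classifier layer):

```python
import torch.nn as nn

def adapt_to_new_dataset(pretrained_net, num_new_classes):
    """Freeze the ImageNet-trained feature layers and replace only the
    final softmax classifier for the new dataset."""
    for p in pretrained_net.parameters():
        p.requires_grad = False                      # keep the learned features fixed
    in_features = pretrained_net[-1].in_features     # width of the layer-7 output
    pretrained_net[-1] = nn.Linear(in_features, num_new_classes)   # new classifier
    return pretrained_net                            # only the new layer gets trained
```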
  36. Caltech 256
      [Plot: accuracy (%) vs. training images per class (0 to 60) for Bo et al. and Sohn et al.]
  37. Caltech 256
      [Plot: accuracy (%) vs. training images per class (0 to 60) for Our Model, Bo et al., and Sohn et al.; annotation marks the 6-training-examples-per-class point for Our Model]
  38. Caltech 256
      [3] L. Bo, X. Ren, and D. Fox. Multipath sparse coding using hierarchical matching pursuit. In CVPR, 2013.
      [16] K. Sohn, D. Jung, H. Lee, and A. Hero III. Efficient learning of sparse, distributed, convolutional feature representations for object recognition. In ICCV, 2011.
  39. Summary
      •  Visualization technique based on Deconvolutional Networks
      •  Applied to convolutional neural networks
         –  better understanding of what is learned
         –  gives insight into model selection