Practical Image Classification & Object Detection

Practical Image Classification & Object Detection (or What I Wish I Knew Before Practicing Computer Vision)
Jaidev Deshpande (@jaidevd)

Types of Computer Vision Problems
[Figure: example images mapped to model outputs, e.g. a classifier output of “cat 60%, dog 40%”.]
Source: ImageNet & PASCAL VOC Datasets

Types of Computer Vision Problems
Classification
• Whole image - binary, multiclass or multilabel
• Cats vs Dogs, MNIST, CIFAR, ImageNet
• LeNet, AlexNet, VGG, etc.
• Conv layers are feature extractors
• Metrics have whole images as data points
Detection, Segmentation, Tracking
• Multiple ML problems within a single image
• PASCAL VOC, COCO, LSUN
• FCNs, UNets, R-CNNs, YOLO
• Conv layers are more like predictors
• Metrics have image subsets / pixels as data points

Why This Matters – The Role of Convolutions Source: http://yann.lecun.com/exdb/mnist/

ImageNet Benchmarks Source: “Deep Residual Learning for Image Recognition”, He et al., 2015

The Big Idea – Bayes Error Analysis
[Figure annotations: “Most time spent here” and “ConvNets help here”.]

Classification ConvNet Architectures
LeNet
AlexNet (Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts ...)

GoogLeNet*
[Figure: layer-by-layer diagram of GoogLeNet. It runs from the input through Conv 7x7, MaxPool 3x3 and LocalRespNorm layers, then a stack of blocks with parallel Conv 1x1, Conv 3x3, Conv 5x5 and MaxPool 3x3 branches joined by DepthConcat, and ends in AveragePool, FC and three softmax outputs (softmax0, softmax1, softmax2).]

Semantic Segmentation
• Label each pixel in the image with a category label
• Don’t differentiate instances, only care about pixels
[Figure: two example images with per-pixel labels, one labelled Cow, Grass, Sky, Trees and one labelled Cat, Grass, Sky, Trees. This image is CC0 public domain.]

Semantic Segmentation Architecture
Idea: Sliding Window
• Start from the full image
• Extract patch
• Classify center pixel with CNN
[Figure: example patches classified as Cow, Cow, Grass.]
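The sliding-window idea can be sketched in NumPy roughly as follows. This is only an illustration under stated assumptions: `classify_patch` is a hypothetical stand-in for the CNN (it merely thresholds the patch mean), and the 3 x 3 patch size and the toy image are made up for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def classify_patch(patch):
    # Hypothetical stand-in for a CNN: label 1 if the patch is mostly bright.
    return int(patch.mean() > 0.5)


def sliding_window_segment(image, patch_size=3):
    # Pad the border so that every pixel can be the center of a patch.
    pad = patch_size // 2
    padded = np.pad(image, pad, mode="edge")
    # patches[i, j] is the patch centered on image[i, j].
    patches = sliding_window_view(padded, (patch_size, patch_size))
    labels = np.zeros(image.shape, dtype=int)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            labels[i, j] = classify_patch(patches[i, j])
    return labels, patches


# Toy image: dark left half, bright right half.
image = np.zeros((4, 6))
image[:, 3:] = 1.0
labels, patches = sliding_window_segment(image)
```

Note that the classifier is called once per pixel and neighbouring patches overlap almost entirely, which is what makes this approach expensive with a real CNN.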

Semantic Segmentation – Fully Convolutional Networks
Idea: Fully Convolutional
Design a network as a bunch of convolutional layers to make predictions for pixels all at once!
• Input: 3 x H x W
• Convolutions (Conv, Conv, Conv, Conv): D x H x W
• Scores: C x H x W
• argmax
• Predictions: H x W
Source: Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture 11, May 10, 2017
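The last step of this slide, going from C x H x W scores to H x W predictions, is a per-pixel argmax over the class axis. A minimal NumPy sketch, where the class names and score values are invented for illustration:

```python
import numpy as np

# Hypothetical class list; index in this list is the predicted label.
class_names = ["grass", "cat", "sky"]

# Scores with shape C x H x W = 3 x 2 x 2 (values are made up).
scores = np.array([
    [[2.0, 0.1], [0.3, 0.2]],  # grass
    [[0.5, 3.0], [0.1, 0.4]],  # cat
    [[0.1, 0.2], [1.5, 2.5]],  # sky
])

# Pick the highest-scoring class at every pixel: result has shape H x W.
predictions = scores.argmax(axis=0)
```

Each entry of `predictions` indexes into `class_names`, so the network labels every pixel in a single forward pass.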

Efficient Fully Convolutional Layers
[Excerpt from a paper page; visible author details: Shelhamer, Trevor Darrell, UC Berkeley.]
[Figure: network diagram with forward/inference and backward/learning passes, ending in a pixelwise prediction compared against the segmentation ground truth.]
Figure 1. Fully convolutional networks can efficiently learn to make dense predictions for per-pixel tasks like semantic segmentation.
We show that a fully convolutional network (FCN) trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-...

The Big Idea – Learnable Upsampling
Also called un-pooling, transpose convolution, deconvolution*
Recall: normal 3 x 3 convolution, stride 2, pad 1
• Input: 4 x 4, Output: 2 x 2
• Dot product between filter and input
• Filter moves 2 pixels in the input for every one pixel in the output
• Stride gives ratio between movement in input and output
Learnable Upsampling: 3 x 3 transpose convolution, stride 2, pad 1
• Input: 2 x 2, Output: 4 x 4
• Input gives weight for filter
• Sum where output overlaps
• Filter moves 2 pixels in the output for every one pixel in the input
• Stride gives ratio between movement in output and input
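To make "input gives weight for filter" and "sum where output overlaps" concrete, here is a small one-dimensional sketch in NumPy. It is an illustration only: it uses no padding, and the function name `transpose_conv1d` and the numbers are assumptions made for the example, not code from the talk.

```python
import numpy as np


def transpose_conv1d(x, w, stride=2):
    """1-D transpose convolution without padding."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    out = np.zeros((len(x) - 1) * stride + len(w))
    for i, value in enumerate(x):
        # Each input value scales a copy of the filter ...
        start = i * stride
        # ... which is pasted `stride` steps further along the output;
        # overlapping positions are summed.
        out[start:start + len(w)] += value * w
    return out


upsampled = transpose_conv1d([1.0, 2.0], [1.0, 10.0, 100.0], stride=2)
# upsampled is [1, 10, 102, 20, 200]: the middle entry receives
# 1 * 100 from the first copy of the filter and 2 * 1 from the second.
```

In a network the filter `w` is learned, which is what makes this upsampling "learnable" rather than a fixed interpolation.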

Putting it all together – the UNet

From Semantic to Instance Segmentation

Metric for Instance Segmentation – mAP of IoU
[Figure: two example predictions.]
• Accuracy = 0.99, mAP of IoU = 0
• Accuracy = 0.91, mAP of IoU = 0.7

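Understanding the Metric
$$M = \begin{pmatrix} m_{11} & \cdots & m_{1N} \\ \vdots & \ddots & \vdots \\ m_{K1} & \cdots & m_{KN} \end{pmatrix}, \qquad M_{i,j} = \frac{|A_i \cap B_j|}{|A_i \cup B_j|}$$
$$T = \{0.5, 0.55, \ldots, 0.95\}, \qquad M_t = (M \geq t)$$
$$\mathrm{mAP} = \frac{1}{|T|} \sum_{t \in T} \frac{TP(t)}{TP(t) + FP(t) + FN(t)}$$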
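One way to compute the metric on this slide is sketched below in NumPy. The assumptions are mine: rows of the IoU matrix are ground-truth masks and columns are predicted masks, a ground-truth object counts as a true positive at threshold t if some prediction reaches IoU ≥ t with it, and at least one mask is present. The function names are hypothetical.

```python
import numpy as np


def iou_matrix(true_masks, pred_masks):
    """IoU of every ground-truth mask (rows) with every predicted mask (columns)."""
    m = np.zeros((len(true_masks), len(pred_masks)))
    for i, a in enumerate(true_masks):
        for j, b in enumerate(pred_masks):
            union = np.logical_or(a, b).sum()
            m[i, j] = np.logical_and(a, b).sum() / union if union else 0.0
    return m


def map_of_iou(m, thresholds=np.linspace(0.5, 0.95, 10)):
    """Average TP / (TP + FP + FN) over the IoU thresholds."""
    scores = []
    for t in thresholds:
        hits = m >= t
        tp = hits.any(axis=1).sum()      # true objects matched by a prediction
        fn = m.shape[0] - tp             # true objects left unmatched
        fp = (~hits.any(axis=0)).sum()   # predictions matching no true object
        scores.append(tp / (tp + fp + fn))
    return float(np.mean(scores))


# Toy 4 x 4 example with two objects.
true_a = np.zeros((4, 4), dtype=bool); true_a[:2, :2] = True
true_b = np.zeros((4, 4), dtype=bool); true_b[2:, 2:] = True
pred_a = true_a.copy()                                        # perfect match
pred_b = np.zeros((4, 4), dtype=bool); pred_b[2:, 1:] = True  # IoU = 4 / 6

score = map_of_iou(iou_matrix([true_a, true_b], [pred_a, pred_b]))
```

Here the second prediction only counts as a match for thresholds up to 0.65, so the score is (4 × 1 + 6 × 1/3) / 10 = 0.6 even though most pixels are labelled correctly.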

Learning Morphological Erosion*
$$w(x) = w_c(x) + w_0 \cdot \exp\left(-\frac{\left(d_1(x) + d_2(x)\right)^2}{2\sigma^2}\right)$$
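The formula on this slide has the form of the pixel weight map from the U-Net paper, where d1 and d2 are the distances from a pixel to the nearest and second-nearest object and w_c is a class-balancing weight. A rough sketch of how such a map could be computed, assuming SciPy is available; the function name `border_weight_map` and the defaults `w0=10` and `sigma=5` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def border_weight_map(instance_masks, class_weights, w0=10.0, sigma=5.0):
    """w(x) = w_c(x) + w0 * exp(-(d1(x) + d2(x))^2 / (2 * sigma^2))."""
    masks = [np.asarray(m, dtype=bool) for m in instance_masks]
    if len(masks) < 2:
        raise ValueError("need at least two instances to define d2")
    # Distance from every pixel to each instance (0 inside the instance).
    distances = np.stack([distance_transform_edt(~m) for m in masks])
    distances.sort(axis=0)
    d1, d2 = distances[0], distances[1]  # nearest and second-nearest instance
    return class_weights + w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))


# Two toy instances separated by a three-pixel gap.
left = np.zeros((5, 9), dtype=bool); left[:, :3] = True
right = np.zeros((5, 9), dtype=bool); right[:, 6:] = True
weights = border_weight_map([left, right], class_weights=np.ones((5, 9)))
```

Pixels in the narrow gap between the two instances get the largest weights, so errors there cost the most during training.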

Hacking Keras Models
Sequential API
>>> from keras.models import Sequential
>>> from keras.layers import Dense
>>> model = Sequential()
>>> model.add(Dense(2, input_dim=1))
>>> model.add(Dense(1))
Functional API
>>> from keras.models import Model
>>> from keras.layers import Input
>>> from keras.layers import Dense
>>> visible = Input(shape=(2,))
>>> hidden = Dense(2)(visible)
>>> model = Model(inputs=visible,
...               outputs=hidden)
• Functional API is closer to the underlying computational graph
• Do not ignore naming of layers
• Useful callbacks: keras.callbacks.{History, ModelCheckpoint, EarlyStopping}

Gotchas with the TensorFlow Backend
• A “batch” is always a set of images – getting pixelwise loss is not obvious.
• Weighted classes and losses are allowed in Keras only at the batch level, not at the pixel level.
• Pixelwise weights should be part of the computational graph - Functional API for the win!
• TensorFlow losses expect logits, not activated outputs.
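The arithmetic behind a pixelwise weighted loss computed from logits can be sketched in plain NumPy. This is not the Keras or TensorFlow graph code from the talk, only an illustration of what such a loss evaluates; the function name and the toy inputs are assumptions.

```python
import numpy as np


def pixelwise_weighted_cross_entropy(logits, labels, weights):
    """logits: (H, W, C) raw scores; labels: (H, W) ints; weights: (H, W)."""
    # Log-softmax computed directly from logits (numerically stable),
    # i.e. the inputs are NOT already passed through a softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the true class at every pixel.
    rows, cols = np.indices(labels.shape)
    per_pixel = -log_probs[rows, cols, labels]
    # Weight each pixel individually, then normalise by the total weight.
    return float((weights * per_pixel).sum() / weights.sum())


logits = np.zeros((2, 2, 2))           # every pixel is undecided
labels = np.zeros((2, 2), dtype=int)
uniform = pixelwise_weighted_cross_entropy(logits, labels, np.ones((2, 2)))

logits[0, 0] = [10.0, 0.0]             # one pixel is confidently correct
focused = pixelwise_weighted_cross_entropy(
    logits, labels, np.array([[100.0, 1.0], [1.0, 1.0]])
)
```

With uniform weights and undecided logits the loss is log 2; concentrating the weight on the confidently correct pixel pulls the loss toward zero, which is the per-pixel control that batch-level class weights cannot give.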

Make your life easy with
• NumPy’s vectorization and array broadcasting (refer to @jakevdp’s excellent tutorial)
• Parallelization with joblib for preprocessing
• Keras’ inbuilt image streaming and on-the-fly augmentation
• TensorBoard
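As a small illustration of the first point, per-channel normalisation of a whole batch of images needs no Python loops once broadcasting is used. The batch shape and the random data below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical batch: 8 RGB images of 32 x 32 pixels, channels last.
batch = rng.random((8, 32, 32, 3))

# Reduce over the batch, height and width axes: one value per channel.
channel_mean = batch.mean(axis=(0, 1, 2))  # shape (3,)
channel_std = batch.std(axis=(0, 1, 2))    # shape (3,)

# Shapes (8, 32, 32, 3) and (3,) broadcast along the last axis, so every
# pixel of every image is normalised with its own channel's statistics.
normalized = (batch - channel_mean) / channel_std
```

The same expression works unchanged for any batch size or image size, because only the trailing channel axis has to match.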

Acknowledgements
• Aditi Dani for the endless typing of dictated code and slides, and enduring so many IPython kernel restarts.
• Juxt SmartMandate Pvt Ltd for generously sponsoring my Coursera Deep Learning Specialization.
• Farhat Habib and Prabhu Ramachandran for their valuable inputs on everything SciPy.

Questions?
• A verbose blog post on introducing pixelwise weighted losses into a UNet will be available soon.
• For link to slides, blog posts and other resources, watch https://twitter.com/jaidevd and #pydatadelhi
