Detection. Unlike image classification, detection requires localizing objects within an image.
Problem one: deep CNNs have very large receptive fields, which makes precise localization very challenging.
Problem two: deep networks need large datasets to train.
Architecture: instead of 1000 ImageNet classes, we want N object classes + background (N + 1 classes), so we need to reinitialize the softmax layer.
Fine-tuning using region proposals: keep training the model using positive / negative regions from detection images, this time using detection datasets (VOC, ILSVRC, COCO, etc.).
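As a concrete illustration, here is a minimal PyTorch sketch (an assumption on my part; the original R-CNN code used Caffe) of swapping the 1000-way ImageNet classifier for an (N + 1)-way one before fine-tuning on region crops:

```python
import torch.nn as nn
import torchvision.models as models

N = 20  # e.g. the 20 PASCAL VOC object classes (assumed value)

# Start from an ImageNet-pretrained backbone.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Reinitialize only the final classifier layer: 1000 ImageNet classes -> N + 1
# detection classes (N objects + background). Everything else keeps its weights
# and is fine-tuned on positive / negative region crops.
in_features = model.classifier[-1].in_features
model.classifier[-1] = nn.Linear(in_features, N + 1)
```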
Let us test it!
Step 1: extract region proposals for all images.
Step 2: for each region, run it through the CNN and save the pool5 features.
Step 3: use per-class binary SVMs to classify the region features (why not just use the softmax?).
Step 4: bounding-box regression: for each class, train a linear regression model to compensate for “slightly wrong” proposals.
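For Step 4, the regression targets are the standard proposal-to-ground-truth offsets (this parameterization follows the R-CNN paper; $P$ is the proposal box and $G$ the matched ground-truth box):

$$
t_x = \frac{G_x - P_x}{P_w}, \qquad
t_y = \frac{G_y - P_y}{P_h}, \qquad
t_w = \log\frac{G_w}{P_w}, \qquad
t_h = \log\frac{G_h}{P_h}
$$

A per-class linear model is fit from the pool5 features to $(t_x, t_y, t_w, t_h)$; at test time the predicted offsets shift and rescale each proposal.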
Problems with R-CNN:
Training is a multi-stage pipeline: CNN ⟶ SVMs ⟶ bounding-box regression.
Training is expensive in space and time: CNN features are stored for training the SVMs and regressors, ~200 GB of disk space for the PASCAL dataset!
Object detection is slow: features are extracted from each object proposal, 47 s / image on a GPU!
RoI head outputs:
1. Softmax probabilities for classification.
2. Per-class bounding-box regression offsets.
1st term: the standard cross-entropy loss for the softmax.
2nd term: the error between the predicted and true bounding box.
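Written out, the two terms form the multi-task loss from the Fast R-CNN paper, with ground-truth class $u$, ground-truth box $v$, predicted class probabilities $p$, and predicted offsets $t^u$ for class $u$:

$$
L(p, u, t^u, v) = L_{\text{cls}}(p, u) + \lambda\,[u \ge 1]\,L_{\text{loc}}(t^u, v),
\qquad L_{\text{cls}}(p, u) = -\log p_u
$$

where $L_{\text{loc}}$ is a smooth-$L_1$ error between the predicted and true box offsets, the indicator $[u \ge 1]$ disables the box term for background RoIs, and $\lambda$ balances the two terms.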
R-CNN problem #1: training is a multi-stage pipeline.
R-CNN problem #2: training is expensive in space and time.
Fast R-CNN: end-to-end training; no need to store features.
R-CNN problem #3: object detection is slow because features are extracted from each object proposal.
Fast R-CNN: just run the whole image through the CNN once; region features are extracted from the shared feature map.
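A minimal sketch of extracting region features from the shared feature map, using torchvision's RoI pooling op (the shapes and scale below are illustrative assumptions, not the lecture's exact setup):

```python
import torch
from torchvision.ops import roi_pool

# Conv feature map for one image (toy shape: 1 image, 256 channels, 32x32 cells).
features = torch.randn(1, 256, 32, 32)

# Two proposals in input-image coordinates: [batch_idx, x1, y1, x2, y2].
proposals = torch.tensor([[0, 4.0, 4.0, 20.0, 24.0],
                          [0, 10.0, 8.0, 30.0, 28.0]])

# spatial_scale maps image coordinates onto the feature map (e.g. 1/16 for a
# typical conv5 stride). Each proposal is pooled to a fixed 7x7 feature,
# without re-running the CNN per region.
pooled = roi_pool(features, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- one fixed-size feature per proposal
```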
YOLO: divide the image into an S × S grid (7 × 7 in the paper).
Within each grid cell, predict B boxes (4 coordinates + confidence) and C class scores.
Regression from the image to a 7 × 7 × (5·B + C) tensor, predicted directly by a single CNN.
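A toy sketch of the output shape (S = 7 comes from the slide; B = 2 and C = 20 are assumptions matching the YOLO paper's PASCAL VOC setting, and the tiny network below is a stand-in, not the paper's architecture):

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20            # grid size, boxes per cell, class scores per cell
depth = 5 * B + C             # 5*2 + 20 = 30 output channels per grid cell

# Stand-in for the backbone + detection head: one CNN that maps the image
# directly to an S x S x (5*B + C) prediction tensor.
head = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.AdaptiveAvgPool2d((S, S)),
    nn.Conv2d(64, depth, kernel_size=1),
)

out = head(torch.randn(1, 3, 448, 448))  # 448x448 input, as in the YOLO paper
print(out.shape)                         # torch.Size([1, 30, 7, 7]) -> S x S x (5*B + C)
```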