
Mask-RCNN for Instance Segmentation

Dat Nguyen
January 17, 2019


This presentation explains several visual perception tasks in Computer Vision, including the Instance Segmentation task. It then introduces Mask-RCNN, a state-of-the-art algorithm for this task, and gives an overview of the architecture of the neural network used by Mask-RCNN. The main part of the presentation explains each building block of the network in detail.
This presentation was given at the Tokyo Machine Learning Kitchen meetup on January 17th, 2019.


Transcript

  1. Who am I?
     • AI Research Engineer at BizReach, Inc.
     • Vietnam Japan AI Community (VJAI) organizer
     • Master's degree in computer science (text mining, NLP, before the DL era)
     • Kaggle Expert
     • Interests: Natural Language Understanding, Computer Vision, AI startups
  2. Visual perception tasks
     1. Image Classification
     2. Object Detection
     3. Semantic Segmentation
     4. Instance Segmentation
  3. Introduction to Mask-RCNN
     • Mask-RCNN stands for Mask-Regions with Convolutional Neural Network
     • State-of-the-art algorithm for Instance Segmentation
  4. Region Proposals: the approach
     1st stage:
     • Propose regions in which objects are likely located
     2nd stage:
     • Classify the regions
     • Refine the regions with regression
     • Predict masks for each class in each region
  5. Mask-RCNN architecture
     Pipeline: ResNet → FPN → RPN → RoIAlign → conv heads → Class / Box / Mask
     • FPN: Feature Pyramid Network; RPN: Region Proposal Network; RoI: Region of Interest
     • ResNet + FPN: for feature extraction
     • RPN: predicts FG/BG probabilities and box regression for each region → adjusts the size & position of regions
     • Class output: K+1 classes (background + number of classes)
     • Box output: bounding box regression for box adjustment
  6. Feature Pyramid Network (FPN)
     (Diagram: ResNet stages C2-C5 feed pyramid levels P2-P5.)
     • Each ResNet stage output Ci passes through a 1x1 conv (lateral connection)
     • The level above is upsampled 2x and added element-wise to produce Pi
     • Down the pyramid, resolution increases; up the pyramid, semantic level increases
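A minimal PyTorch sketch of the top-down pathway described above, assuming the ResNet stage outputs c2..c5 are given as tensors (the class name FPNTopDown, the channel counts, and the nearest-neighbor upsampling mode are illustrative assumptions, not from the slides):

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Build P2..P5 from ResNet stages C2..C5 via 1x1 lateral convs, 2x upsampling, and addition."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # One 1x1 lateral conv per ResNet stage, reducing each Ci to a common channel count
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        # Each lower level adds the 2x-upsampled level above to its lateral connection
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        return p2, p3, p4, p5
```

Note that the published FPN also applies a 3x3 conv after each merge to reduce aliasing; this sketch keeps only the lateral connections and additions shown on the slide.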
  7. Bounding box regression
     Representation of a bounding box:
     • A bounding box can be represented by its corners (x1, y1, x2, y2)
     • It can also be represented by the center point (cx, cy), width w, and height h
     • cx, cy, w, h are calculated from (x1, y1, x2, y2) as:
       w = x2 - x1    cx = x1 + 0.5*w
       h = y2 - y1    cy = y1 + 0.5*h
     • (x1, y1, x2, y2) are recovered from (cx, cy, w, h) as:
       x1 = cx - 0.5*w    x2 = x1 + w
       y1 = cy - 0.5*h    y2 = y1 + h
     Bounding box regression (anchor or predicted bounding box P, ground-truth box G):
     • Px, Py, Pw, Ph: center x, center y, width, and height of the predicted box P
     • Gx, Gy, Gw, Gh: center x, center y, width, and height of the ground-truth box G
     • Define:
       dx = (Gx - Px)/Pw    dy = (Gy - Py)/Ph
       dw = log(Gw/Pw)      dh = log(Gh/Ph)
     • dx, dy specify a scale-invariant translation of the center of P
     • dw, dh specify log-space translations of the width & height of P
     • Box-regression branches predict the 4 regression values (dx, dy, dw, dh) for each box
     • At inference, after a bounding box P is predicted as positive along with its regression values, the adjusted bounding box is obtained by inverting the transform:
       Gx = Px + dx*Pw    Gy = Py + dy*Ph
       Gw = Pw*exp(dw)    Gh = Ph*exp(dh)
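To illustrate the inverse transform at inference time, here is a hypothetical NumPy helper (the name apply_box_deltas and the array layout are assumptions):

```python
import numpy as np

def apply_box_deltas(boxes, deltas):
    """Apply predicted regressions (dx, dy, dw, dh) to boxes given as rows of (x1, y1, x2, y2)."""
    # Convert corner representation to center/size representation
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    cx = boxes[:, 0] + 0.5 * w
    cy = boxes[:, 1] + 0.5 * h
    # Invert the regression transform: shift the center, scale the size
    cx = cx + deltas[:, 0] * w
    cy = cy + deltas[:, 1] * h
    w = w * np.exp(deltas[:, 2])
    h = h * np.exp(deltas[:, 3])
    # Convert back to corners
    x1 = cx - 0.5 * w
    y1 = cy - 0.5 * h
    return np.stack([x1, y1, x1 + w, y1 + h], axis=1)
```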
  8. Region Proposal Network
     Pipeline: feature maps → RPN head → Rpn_probs, Rpn_bbox; Anchor generator → Anchors; Proposal layer → Rpn_RoIs
     • The anchors depend only on the feature maps' size
     • A massive number of anchors is generated
     • Negative anchors are filtered out with Rpn_probs and Non-Max Suppression
  9. RPN head network
     • Conv 3x3 (512 filters, padding=same) over the feature map
     • Conv 1x1 (anchors_per_location x 2 filters) → Softmax → Rpn_probs (FG/BG): [Anchors, 2]
     • Conv 1x1 (anchors_per_location x 4 filters) → Rpn_bbox: [Anchors, 4]
     • 4 values per anchor for bounding box regression: dy, dx, dh, dw
     [Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks", NIPS 2015]
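A minimal PyTorch sketch of this head, following the layer sizes on the slide (the in_channels and anchors_per_location defaults are assumptions):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Shared 3x3 conv followed by two sibling 1x1 convs: FG/BG scores and box deltas."""
    def __init__(self, in_channels=256, anchors_per_location=3):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)  # padding=same
        self.cls = nn.Conv2d(512, anchors_per_location * 2, kernel_size=1)   # FG/BG logits
        self.bbox = nn.Conv2d(512, anchors_per_location * 4, kernel_size=1)  # dy, dx, dh, dw

    def forward(self, feature_map):
        x = torch.relu(self.shared(feature_map))
        n = feature_map.shape[0]
        # Flatten the spatial grid so outputs are [batch, Anchors, 2] and [batch, Anchors, 4]
        probs = torch.softmax(self.cls(x).permute(0, 2, 3, 1).reshape(n, -1, 2), dim=-1)
        deltas = self.bbox(x).permute(0, 2, 3, 1).reshape(n, -1, 4)
        return probs, deltas
```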
  10. Anchor generator
      (Diagram: anchors of one scale and several aspect ratios ratio1..ratio4, centered on a grid laid over the feature map with a fixed anchor stride; anchor_w / anchor_h = ratio.)
      [https://tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection]
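A simplified NumPy sketch of anchor generation for one scale (the function name, parameters, and cell-centering convention are illustrative assumptions):

```python
import numpy as np

def generate_anchors(scale, ratios, feature_shape, feature_stride):
    """Generate (x1, y1, x2, y2) anchors of one scale and several aspect ratios,
    centered on every cell of a feature map of shape (H, W)."""
    # Widths/heights such that w/h = ratio while the area stays ~scale^2
    ratios = np.asarray(ratios, dtype=np.float64)
    heights = scale / np.sqrt(ratios)
    widths = scale * np.sqrt(ratios)
    # Anchor centers in image coordinates, one per feature-map cell
    ys = (np.arange(feature_shape[0]) + 0.5) * feature_stride
    xs = (np.arange(feature_shape[1]) + 0.5) * feature_stride
    cx, cy = np.meshgrid(xs, ys)
    # Pair every center with every (w, h): repeat centers, tile sizes
    cx = np.repeat(cx.ravel(), len(ratios))
    cy = np.repeat(cy.ravel(), len(ratios))
    w = np.tile(widths, feature_shape[0] * feature_shape[1])
    h = np.tile(heights, feature_shape[0] * feature_shape[1])
    return np.stack([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], axis=1)
```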
  11. Proposal layer
      • Sort all anchors by rpn_probs (how likely an anchor contains an object)
      • Choose the top N anchors and discard the rest (e.g., N ~ 6000)
      • Apply Non-Maximum Suppression (NMS) to eliminate duplicate boxes; keep up to M anchors (e.g., M ~ 2000)
      • How to choose anchors as RoIs? There may be multiple bboxes around one object, and there may be multiple objects → how to choose exactly 1 box per object? (See IoU and NMS on the next slides.)
  12. IoU - Intersection over Union
      • IoU: a method to quantify the overlap between 2 areas
      • It can be used to evaluate an object detector
      • The IoU of 2 areas A & B is calculated as the common area between them divided by the total area they cover:
        IoU(A, B) = area(A ∩ B) / area(A ∪ B)
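This definition translates directly into Python for two axis-aligned boxes (the (x1, y1, x2, y2) layout is assumed):

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2): intersection area / union area."""
    # Corners of the intersection rectangle (empty if the boxes don't overlap)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```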
  13. Non-Maximum Suppression (NMS)
      • There may be multiple bboxes around one object, and there may be multiple objects → how to choose exactly 1 box per object?
      Non-Maximum Suppression algorithm
      Input:
      - List of boxes
      - Scores for each box
      - IoU threshold T (e.g., T = 0.5)
      - M: max number of chosen boxes
      Output: list of chosen boxes
      Algorithm:
      STEP 1: Sort the boxes by score
      STEP 2: Loop until there is no remaining box:
      - Choose the box with the highest score; call it A
      - Eliminate remaining boxes b with IoU(b, A) >= T
  14. Non-Maximum Suppression (NMS)
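A straightforward NumPy sketch of the algorithm above, reusing the iou helper from the IoU slide (this O(n^2) greedy form is written for clarity, not speed):

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5, max_boxes=2000):
    """Greedy NMS: repeatedly keep the highest-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    chosen = []
    while order.size > 0 and len(chosen) < max_boxes:
        best = order[0]
        chosen.append(best)
        # Keep only remaining boxes whose IoU with the chosen box is below the threshold
        remaining = [i for i in order[1:] if iou(boxes[best], boxes[i]) < iou_threshold]
        order = np.array(remaining, dtype=np.int64)
    return chosen
```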
  15. Identify the Feature Pyramid level for RoIs
      • RoIs are cropped from their feature map and resized to 7x7
      • w, h: width & height of an RoI; 224: canonical ImageNet pre-training size
      • k0: target level for an RoI with w*h = 224^2 (here, k0 = 5)
      • The target level k of an RoI is identified by:
        k = floor(k0 + log2(sqrt(w*h) / 224))
      • Intuition: features of large RoIs come from a smaller feature map (high semantic level); features of small RoIs come from a larger feature map (low semantic level)
      [Lin et al., "Feature pyramid networks for object detection", CVPR 2017]
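The formula maps directly to code; in this sketch the clamping of k to the available levels P2-P5 is an assumption, since the slide states only the formula:

```python
import math

def fpn_level(w, h, k0=5, k_min=2, k_max=5):
    """Assign an RoI of size (w, h) to a pyramid level; clamp to the available levels."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224.0))
    return max(k_min, min(k_max, k))
```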
  16. RoIAlign
      Worked example:
      • Input image: 1024x1024; the feature map is 16x smaller: 1024/16 = 64
      • An object RoI of 540x540 maps to 540/16 = 33.75 on the feature map
      • Dividing the RoI into a 7x7 grid gives 33.75/7 = 4.82 per bin
      • No quantization: bilinear interpolation computes the exact value at each bin
      • The resulting small 7x7 feature map (one per RoI) feeds the head network (FCN)
      [He et al., "Mask R-CNN", ICCV 2017]
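A minimal single-channel sketch of the bilinear sampling step RoIAlign performs at each bin's sampling points (boundary handling and the averaging of multiple sample points per bin are omitted):

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Sample a 2-D feature map at a real-valued (y, x) location with bilinear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    # Weights are the fractional distances to the 4 surrounding integer grid points
    wy1, wx1 = y - y0, x - x0
    wy0, wx0 = 1.0 - wy1, 1.0 - wx1
    return (feature_map[y0, x0] * wy0 * wx0 + feature_map[y0, x1] * wy0 * wx1 +
            feature_map[y1, x0] * wy1 * wx0 + feature_map[y1, x1] * wy1 * wx1)
```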
  17. Mask-RCNN head network
      Box/class branch (fully connected layers implemented by CNN; weights shared over multiple RoIs):
      • RoIAlign → 7x7x256 small feature map (for each RoI)
      • Conv1 7x7 (1024 filters) → Conv2 1x1 (1024 filters) → 1024-d vector
      • Dense → Class: Softmax over K+1 (BG + num classes)
      • Dense → Box: (K+1) x 4 box regression values: dy, dx, dh, dw
      Mask branch:
      • RoIAlign → 14x14x256
      • 4 conv layers, 3x3 (256 filters): Conv1 ... Conv4 → 14x14x256
      • Conv Transpose (upsampling), 2x2 (256 filters, stride 2) → 28x28x256
      • Conv 1x1 (K+1 filters) + Sigmoid activation → 28x28x(K+1): predicts a mask per class
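A minimal PyTorch sketch of the mask branch as described above (the box/class branch is omitted; the class name MaskHead is illustrative):

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Mask branch: 4 convs on 14x14 RoI features, 2x transpose-conv upsampling,
    then a 1x1 conv with sigmoid producing one 28x28 mask per class."""
    def __init__(self, num_classes):
        super().__init__()
        layers = []
        for _ in range(4):  # Conv1 ... Conv4: 3x3, 256 filters each
            layers += [nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU()]
        self.convs = nn.Sequential(*layers)
        self.upsample = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)  # 14x14 -> 28x28
        self.mask = nn.Conv2d(256, num_classes + 1, kernel_size=1)  # K+1 masks (BG + classes)

    def forward(self, roi_features):  # roi_features: [num_rois, 256, 14, 14]
        x = torch.relu(self.upsample(self.convs(roi_features)))
        return torch.sigmoid(self.mask(x))  # [num_rois, K+1, 28, 28]
```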
  18. Result of Mask-RCNN on COCO data
      [He et al., "Mask R-CNN", ICCV 2017]
  19. Some popular DL-based algorithms for visual perception tasks
      Visual perception task   Algorithms
      Image Classification     AlexNet, GoogLeNet/Inception v1, VGGNet, ResNet
      Object Detection         Fast/Faster R-CNN, SSD, YOLO
      Semantic Segmentation    Fully Convolutional Network (FCN), U-Net
      Instance Segmentation    Mask R-CNN
  20. Summary
      • Mask-RCNN for Instance Segmentation:
        • Region Proposals for candidate bounding boxes
        • Predicts class, box offsets, and a binary mask for each box
        • CNN-based
      • Object Detection topics:
        • For accuracy: R-CNN family algorithms
        • For speed: YOLO, SSD