
Mask-RCNN for Instance Segmentation

Dat Nguyen
January 17, 2019


This presentation explains several visual perception tasks in Computer Vision, including the Instance Segmentation task. It then introduces Mask-RCNN, a state-of-the-art algorithm for instance segmentation, and gives an overview of the architecture of the neural network Mask-RCNN uses. The main part of the presentation explains each building block of the network in detail.
This presentation was given at Tokyo Machine Learning Kitchen on January 17th, 2019.

Transcript

  1. Who am I?
     • AI Research Engineer at BizReach, Inc.
     • Vietnam Japan AI Community (VJAI) organizer
     • Master's degree in computer science (text mining, NLP, before the DL era)
     • Kaggle Expert
     • Interests: Natural Language Understanding, Computer Vision, AI startups
  2. Visual perception tasks
     1. Image Classification
     2. Object Detection
     3. Semantic Segmentation
     4. Instance Segmentation
  3. Introduction to MaskRCNN
     • Mask-RCNN stands for Mask-Regions with Convolutional Neural Network.
     • It is a state-of-the-art algorithm for Instance Segmentation.
  4. Region Proposals
     The two-stage approach:
     • 1st stage: propose regions in which objects are likely located.
     • 2nd stage: classify the regions, refine them with bounding box regression, and predict masks for each class in each region.
  5. MaskRCNN architecture
     Pipeline: ResNet → FPN → RPN → RoIAlign → head networks (FPN: Feature Pyramid Network; RPN: Region Proposal Network; RoI: Region of Interest).
     • ResNet + FPN: for feature extraction.
     • RPN: predicts FG/BG probabilities and box regressions for each region → adjusts the size & position of regions.
     • Head networks (conv layers on each RoI): class prediction over K+1 classes (background + number of classes), bounding box regression for box adjustment, and a mask branch.
  6. Feature Pyramid Network (FPN)
     ResNet stages C2–C5 feed the pyramid levels P2–P5: each level applies a 1x1 conv to the corresponding ResNet stage and adds the 2x up-sampled output of the level above. Going down the pyramid, resolution increases while semantic strength decreases.
  7. Bounding box regression
     Representation of a bounding box:
     • A bounding box can be represented by its corners (x1, y1, x2, y2).
     • It can also be represented by the center point (cx, cy), width w, and height h.
     • cx, cy, w, h are calculated from (x1, y1, x2, y2) as:
       w = x2 – x1, h = y2 – y1, cx = x1 + 0.5*w, cy = y1 + 0.5*h
     • Conversely, (x1, y1, x2, y2) are recovered from (cx, cy, w, h) as:
       x1 = cx – 0.5*w, x2 = x1 + w, y1 = cy – 0.5*h, y2 = y1 + h
     Bounding box regression:
     • Px, Py, Pw, Ph: center x, center y, width, and height of the anchor or predicted box P
     • Gx, Gy, Gw, Gh: center x, center y, width, and height of the ground truth box G
     • Define: dx = (Gx – Px)/Pw, dy = (Gy – Py)/Ph, dw = log(Gw/Pw), dh = log(Gh/Ph)
     • dx, dy specify a scale-invariant translation of the center of P.
     • dw, dh specify log-space translations of the width & height of P.
     • Box-regression branches predict the 4 regression values (dx, dy, dw, dh) for each box.
     • At inference, once a bounding box P is predicted as positive along with its regression values, the adjusted bounding box is obtained by inverting the transform: Gx = Px + dx*Pw, Gy = Py + dy*Ph, Gw = Pw*exp(dw), Gh = Ph*exp(dh).
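The two parameterizations above translate directly into code. A minimal sketch (illustrative, not from the deck; boxes are assumed to be (cx, cy, w, h) tuples):

```python
import math

def box_to_deltas(p, g):
    """Regression targets (dx, dy, dw, dh) that map box P onto ground truth G.
    dx, dy are scale-invariant center shifts; dw, dh are log-space size ratios."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def apply_deltas(p, d):
    """Inverse transform: adjust box P by predicted deltas (inference step)."""
    px, py, pw, ph = p
    dx, dy, dw, dh = d
    return (px + dx * pw, py + dy * ph,
            pw * math.exp(dw), ph * math.exp(dh))
```

Round-tripping a box through `box_to_deltas` and `apply_deltas` recovers the ground-truth box exactly, which is a quick sanity check on the formulas.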
  8. Region Proposal Network
     • Anchor generator → anchors: depends only on the feature maps' size; produces a massive number of anchors.
     • RPN head → Rpn_probs, Rpn_bbox.
     • Proposal layer → Rpn_RoIs: filters out negative anchors with Rpn_probs and Non-Max Suppression.
  9. RPN head network
     A 3x3 conv (512 filters, padding=same) feeds two sibling 1x1 convs:
     • 1x1 conv with anchors_per_location x 2 filters + softmax → Rpn_probs (FG/BG), shape [Anchors, 2]
     • 1x1 conv with anchors_per_location x 4 filters → Rpn_bbox, shape [Anchors, 4]: 4 values per anchor for bounding box regression (dy, dx, dh, dw)
     [Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks", NIPS 2015]
  10. Anchor generator
     Anchors are generated on a grid over the feature map: an anchor center at every anchor stride, and at each center one anchor per (scale, ratio) combination, where anchor_w / anchor_h = ratio. [https://tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection]
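A minimal sketch of anchor generation under the conventions above. The half-cell center offset and the exact scale/ratio parameterization are assumptions; real implementations differ in these details:

```python
import math

def generate_anchors(feature_shape, stride, scales, ratios):
    """Anchors as (cx, cy, w, h) in image coordinates, one per
    (cell, scale, ratio). scale fixes the anchor area (scale**2) and
    ratio = anchor_w / anchor_h, so w = scale*sqrt(ratio), h = scale/sqrt(ratio)."""
    rows, cols = feature_shape
    anchors = []
    for y in range(rows):
        for x in range(cols):
            # Center of the receptive-field cell (half-cell offset is an assumption).
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors
```

Even a small feature map yields many anchors: a 2x3 map with 1 scale and 3 ratios already produces 18, which illustrates the "massive number of anchors" noted on the RPN slide.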
  11. Proposal layer
     • Sort all anchors by rpn_probs (how likely an anchor contains an object).
     • Choose the top N anchors and throw away the rest (e.g., N ≈ 6000).
     • Apply Non-Maximum Suppression (NMS) to eliminate duplicated boxes; keep up to M anchors (e.g., M ≈ 2000).
     How to choose anchors as RoIs? There may be multiple bboxes around one object, and multiple objects in the image → how to choose exactly 1 box per object?
  12. IoU – Intersection over Union
     • IoU is a method to quantify the overlap between 2 areas.
     • It can be used to evaluate an object detector.
     • The IoU of 2 areas A & B is calculated as the common (intersection) area between them divided by the total (union) area:
       IoU(A, B) = area(A ∩ B) / area(A ∪ B)
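The definition above fits in a few lines of Python (an illustrative sketch, not from the deck; boxes are assumed to be (x1, y1, x2, y2) tuples):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to 0 when the boxes do not overlap.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give IoU 1.0, disjoint boxes give 0.0, and partial overlaps fall in between.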
  13. Non-Maximum Suppression (NMS)
     Motivation: multiple bboxes around each object, and possibly multiple objects → how to choose exactly 1 box per object?
     Input: a list of boxes, a score for each box, an IoU threshold T (e.g., T = 0.5), and M, the maximum number of chosen boxes.
     Output: the list of chosen boxes.
     Algorithm:
       STEP 1: Sort the boxes by score.
       STEP 2: Loop until there is no remaining box:
         - Choose the box with the highest score. Call it A.
         - Eliminate every remaining box b with IoU(b, A) >= T.
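The algorithm above can be sketched directly in Python (illustrative, not the deck's code; boxes are (x1, y1, x2, y2) tuples and the returned values are indices into the input list):

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5, max_boxes=None):
    """Greedy NMS: repeatedly keep the highest-scoring box and eliminate
    every remaining box overlapping it by IoU >= iou_threshold."""
    # STEP 1: sort box indices by score, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    # STEP 2: loop until no box remains (or M boxes are chosen).
    while order and (max_boxes is None or len(keep) < max_boxes):
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if _iou(boxes[i], boxes[best]) < iou_threshold]
    return keep
```

A near-duplicate of the top-scoring box is suppressed while a distant box survives, which is exactly the "one box per object" behavior the slide motivates.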
  14. Non-Maximum Suppression (NMS) (the algorithm repeated from the previous slide)
  15. Identify the Feature Pyramid level for RoIs
     RoIs are cropped from their feature map and resized to 7x7. The target level k of a RoI with width w and height h is identified by:
       k = floor(k0 + log2(sqrt(w*h) / 224))
     where 224 is the canonical ImageNet pre-training size and k0 is the target level of a RoI whose w*h = 224² (here, k0 = 5).
     Intuition: features of large RoIs come from a smaller feature map (high semantic level); features of small RoIs come from a larger feature map (low semantic level). [Lin et al., "Feature pyramid networks for object detection", CVPR 2017]
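The level-assignment rule is a one-liner plus clipping to the levels that actually exist. A sketch using the deck's k0 = 5 as the default (the FPN paper itself uses k0 = 4; only the indexing convention differs):

```python
import math

def roi_fpn_level(w, h, k0=5, k_min=2, k_max=5):
    """FPN level for a RoI of width w and height h (Lin et al., CVPR 2017):
    k = floor(k0 + log2(sqrt(w*h) / 224)), clipped to the available levels.
    k0 is the level assigned to a 224x224 RoI (deck uses 5; the paper uses 4)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))
```

So a 224x224 RoI maps to the top level, halving the RoI side drops one level, and very small RoIs are clipped to the highest-resolution level.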
  16. RoIAlign
     Example: a 1024x1024 input image contains a 540x540 object. The feature map is 16x smaller, so the image becomes 1024/16 = 64 and the RoI becomes 540/16 = 33.75, a fractional size. Dividing it into a 7x7 grid gives 33.75/7 ≈ 4.82 per bin. RoIAlign performs no quantization: it uses bilinear interpolation to calculate the exact value at each bin, producing a small 7x7 feature map for each RoI that is fed to the head network (FCN). [He et al., "Mask R-CNN", ICCV 2017]
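The sampling primitive behind RoIAlign is plain bilinear interpolation at fractional coordinates. A minimal sketch for a single sample point (assumes non-negative, in-bounds coordinates on a nested-list feature map):

```python
def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map at fractional (y, x).
    RoIAlign evaluates points like this inside each of the 7x7 bins
    instead of rounding (quantizing) coordinates to the nearest cell."""
    y0, x0 = int(y), int(x)                      # top-left neighbor
    y1 = min(y0 + 1, len(feature) - 1)           # clamp at the border
    x1 = min(x0 + 1, len(feature[0]) - 1)
    wy, wx = y - y0, x - x0                      # fractional weights
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bot = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bot * wy
```

Sampling at the center of four cells returns their average, and sampling exactly on a cell returns that cell's value, which is why no information is lost to rounding.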
  17. Mask-RCNN head network
     • Class/box branch: from the 7x7x256 small feature map of each RoI (RoIAlign output), two 1024-unit fully connected layers implemented by CNN (Conv1: 7x7, 1024 filters; Conv2: 1x1, 1024 filters; weights shared over multiple RoIs) feed two dense heads: a softmax over K+1 classes (BG + num classes) and (K+1) x 4 box outputs, the 4 box-regression values (dy, dx, dh, dw) per class.
     • Mask branch: from a 14x14x256 feature map per RoI, 4 conv layers (Conv1–Conv4: 3x3, 256 filters, keeping 14x14x256), then a conv transpose (up-sampling) layer (2x2, 256 filters, stride 2) to 28x28x256, then a 1x1 conv with K+1 filters and sigmoid activation → a 28x28 mask predicted per class.
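Because the mask branch predicts one sigmoid mask per class, inference keeps only the mask for the class the box head predicted and binarizes it. A hedged sketch (the 0.5 threshold and the [H][W][K+1] nested-list layout are assumptions for illustration):

```python
def select_instance_mask(mask_probs, class_id, threshold=0.5):
    """mask_probs: per-RoI sigmoid outputs laid out as [H][W][K+1].
    Mask R-CNN decouples mask and class prediction: take the slice for
    the box head's predicted class and binarize it into a 0/1 mask."""
    return [[1 if cell[class_id] >= threshold else 0 for cell in row]
            for row in mask_probs]
```

This decoupling (sigmoid per class rather than a softmax across classes) is what lets the mask branch avoid competition between classes.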
  18. Result of Mask-RCNN on COCO data [He et al., "Mask R-CNN", ICCV 2017]
  19. Some popular DL-based algorithms for visual perception tasks
     • Image Classification: AlexNet, GoogLeNet/Inception v1, ResNet, VGGNet
     • Object Detection: Fast/Faster R-CNN, SSD, YOLO
     • Semantic Segmentation: Fully Convolutional Network (FCN), U-Net
     • Instance Segmentation: Mask R-CNN
  20. Summary
     • Mask-RCNN for Instance Segmentation: CNN-based; uses Region Proposals for candidate bounding boxes; predicts a class, box offsets, and a binary mask for each box.
     • Object Detection topics: for accuracy, the R-CNN family of algorithms; for speed, YOLO and SSD.