Slide 1

Mask-RCNN for Instance Segmentation Nguyen Phuoc Tat Dat BizReach, Inc. @Tokyo Machine Learning Kitchen Jan 17, 2019

Slide 2

Who am I?
• AI Research Engineer at BizReach, Inc.
• Vietnam Japan AI Community (VJAI) organizer
• Master's degree in computer science (text mining, NLP, before the DL era)
• Kaggle Expert
• Interests:
  • Natural Language Understanding
  • Computer Vision
  • AI startups

Slide 3

Visual perception tasks
1. Image Classification
2. Object Detection
3. Semantic Segmentation
4. Instance Segmentation

Slide 4

Agenda
• Demo
• Visual perception tasks
• Mask-RCNN
• Summary
• Q&A

Slide 5

Introduction to Mask-RCNN
• Mask-RCNN stands for Mask Regions with Convolutional Neural Network
• It is a state-of-the-art algorithm for Instance Segmentation

Slide 6

Region Proposals: the two-stage approach
1st stage:
• Propose regions in which objects are likely located
2nd stage:
• Classify the regions
• Refine the regions with regression
• Predict masks for each class in each region

Slide 7

Mask-RCNN architecture

[Architecture diagram: input → ResNet + FPN (for feature extraction) → RPN → RoIAlign → conv heads producing Class, Box, and Mask outputs]
• FPN: Feature Pyramid Network
• RPN: Region Proposal Network; predicts FG/BG probabilities for each region and a box regression for each region → adjusts the size & position of regions
• RoI: Region of Interest
• Class output: K+1 classes (background + number of classes)
• Box output: bounding box regression for box adjustment

Slide 8

Feature Pyramid Network (FPN)

[Diagram: ResNet stages C2-C5 feed FPN levels P2-P5. Each Ci passes through a 1x1 conv (lateral connection) and is added to the 2x-upsampled coarser level; resolution increases toward P2 while semantic strength increases toward P5.]
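
Below is a minimal PyTorch sketch of the top-down pathway in the diagram: lateral 1x1 convs on the ResNet stages C2-C5, 2x upsampling, and element-wise addition. The channel counts are assumptions (the usual ResNet/FPN values), and the FPN paper additionally applies a 3x3 conv to each merged map, which is omitted here to match the diagram.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPN(nn.Module):
    def __init__(self, c_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # Lateral 1x1 convs project each ResNet stage Ci to a common width
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in c_channels])

    def forward(self, c2, c3, c4, c5):
        # Start from the semantically strongest (lowest-resolution) level
        p5 = self.laterals[3](c5)
        # Each finer level = lateral(Ci) + 2x-upsampled coarser level
        p4 = self.laterals[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.laterals[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.laterals[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        return p2, p3, p4, p5  # resolution grows toward P2, semantics toward P5
```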

Slide 9

Bounding box regression

Representation of a bounding box:
• A bounding box can be represented by its corners (x1, y1, x2, y2).
• It can also be represented by its center point (cx, cy), width w, and height h.
• cx, cy, w, h are calculated from (x1, y1, x2, y2) as:
  w = x2 - x1, h = y2 - y1, cx = x1 + 0.5*w, cy = y1 + 0.5*h
• (x1, y1, x2, y2) are reversely calculated from (cx, cy, w, h) as:
  x1 = cx - 0.5*w, x2 = x1 + w, y1 = cy - 0.5*h, y2 = y1 + h

Bounding box regression:
• Px, Py, Pw, Ph: center x, center y, width, and height of the anchor or predicted box P
• Gx, Gy, Gw, Gh: center x, center y, width, and height of the ground-truth box G
• Define:
  dx = (Gx - Px)/Pw, dy = (Gy - Py)/Ph
  dw = log(Gw/Pw), dh = log(Gh/Ph)
• dx, dy specify a scale-invariant translation of the center of P
• dw, dh specify log-space translations of the width & height of P
• The box-regression branches predict the 4 regression values (dx, dy, dw, dh) for each box.
• At inference, once a bounding box P is predicted as positive along with its regression values, the adjusted box is identified by inverting the definitions above:
  Gx = Px + Pw*dx, Gy = Py + Ph*dy, Gw = Pw*exp(dw), Gh = Ph*exp(dh)
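
A minimal NumPy sketch of the conversions and the regression encoding/decoding defined on this slide:

```python
import numpy as np

def corners_to_center(box):
    """(x1, y1, x2, y2) -> (cx, cy, w, h), as defined on the slide."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return np.array([x1 + 0.5 * w, y1 + 0.5 * h, w, h])

def encode(P, G):
    """Regression targets (dx, dy, dw, dh) for box P vs ground truth G,
    both given as (cx, cy, w, h)."""
    dx = (G[0] - P[0]) / P[2]   # scale-invariant center shift
    dy = (G[1] - P[1]) / P[3]
    dw = np.log(G[2] / P[2])    # log-space width/height change
    dh = np.log(G[3] / P[3])
    return np.array([dx, dy, dw, dh])

def decode(P, d):
    """Invert encode(): apply predicted regression d to box P at inference."""
    cx = P[0] + P[2] * d[0]
    cy = P[1] + P[3] * d[1]
    w = P[2] * np.exp(d[2])
    h = P[3] * np.exp(d[3])
    return np.array([cx, cy, w, h])
```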

Slide 10

Region Proposal Network (RPN)

Slide 11

Region Proposal Network

[Diagram: three components in sequence]
• Anchor generator → anchors: produces a massive number of anchors; depends only on the feature maps' size
• RPN head → rpn_probs, rpn_bbox: scores and box regressions for every anchor
• Proposal layer → rpn_RoIs: filters out negative anchors using rpn_probs and Non-Max Suppression

Slide 12

RPN head network

• Shared layer: Conv 3x3 (512 filters, padding=same)
• Classification branch: Conv 1x1 (anchors_per_location x 2 filters) → Softmax → rpn_probs (FG/BG), shape [anchors, 2]
• Regression branch: Conv 1x1 (anchors_per_location x 4 filters) → rpn_bbox, shape [anchors, 4]: 4 values per anchor for bounding box regression (dy, dx, dh, dw)

[Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks", NIPS 2015]
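
A minimal PyTorch sketch of this head; the module and variable names are illustrative, and the channel count of the incoming feature map (256) is an assumption:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=256, anchors_per_location=3):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, anchors_per_location * 2, kernel_size=1)
        self.bbox = nn.Conv2d(512, anchors_per_location * 4, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.shared(feature_map))
        # [B, A*2, H, W] -> [B, H*W*A, 2]: FG/BG logits per anchor -> softmax
        logits = self.cls(x).permute(0, 2, 3, 1).reshape(x.shape[0], -1, 2)
        probs = torch.softmax(logits, dim=-1)                  # rpn_probs
        # [B, A*4, H, W] -> [B, H*W*A, 4]: (dy, dx, dh, dw) per anchor
        deltas = self.bbox(x).permute(0, 2, 3, 1).reshape(x.shape[0], -1, 4)
        return probs, deltas                                   # rpn_probs, rpn_bbox
```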

Slide 13

Anchor generator

[Diagram: anchors of several scales and aspect ratios (anchor_w / anchor_h = ratio) are generated around each anchor center; centers are placed over the feature map with a fixed anchor stride.]

[https://tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection]
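
A minimal NumPy sketch of anchor generation; the stride, scales, and ratios are illustrative defaults, not values from the talk:

```python
import numpy as np

def generate_anchors(feature_shape, stride=16, scales=(128,), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (y1, x1, y2, x2) in image coordinates."""
    scales_g, ratios_g = np.meshgrid(scales, ratios)
    scales_g, ratios_g = scales_g.flatten(), ratios_g.flatten()
    # Width/height per (scale, ratio) pair, keeping w/h = ratio and w*h = scale^2
    widths = scales_g * np.sqrt(ratios_g)
    heights = scales_g / np.sqrt(ratios_g)
    # Anchor centers: one per feature-map cell, spaced by the anchor stride
    cy, cx = np.meshgrid(np.arange(feature_shape[0]) * stride,
                         np.arange(feature_shape[1]) * stride, indexing="ij")
    cy, cx = cy.flatten(), cx.flatten()
    # Combine every center with every (w, h): massive number of anchors,
    # depending only on the feature map's size
    boxes = []
    for w, h in zip(widths, heights):
        boxes.append(np.stack([cy - h / 2, cx - w / 2,
                               cy + h / 2, cx + w / 2], axis=1))
    return np.concatenate(boxes, axis=0)
```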

Slide 14

Proposal layer
• How to choose anchors as RoIs? There may be multiple bboxes around one object, and there may be multiple objects → how to choose exactly 1 box per object?
• Sort all anchors by rpn_probs (how likely an anchor contains an object)
• Choose the top N anchors and discard the rest (e.g., N ≈ 6000)
• Apply Non-Maximum Suppression (NMS) to eliminate duplicated boxes; keep up to M anchors (e.g., M ≈ 2000)
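
A minimal sketch of this filtering, assuming the anchors have already been refined with the RPN box regression and using torchvision.ops.nms for the suppression step:

```python
from torchvision.ops import nms

def proposal_layer(boxes, scores, pre_nms_top_n=6000,
                   post_nms_top_n=2000, iou_threshold=0.7):
    """boxes: float tensor [A, 4] as (x1, y1, x2, y2); scores: [A] FG probs."""
    # Step 1: keep the top-N anchors by objectness score, discard the rest
    order = scores.argsort(descending=True)[:pre_nms_top_n]
    boxes, scores = boxes[order], scores[order]
    # Step 2: NMS removes duplicated boxes around the same object;
    # keep up to M proposals as RoIs
    keep = nms(boxes, scores, iou_threshold)[:post_nms_top_n]
    return boxes[keep]
```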

Slide 15

IoU – Intersection over Union
• IoU is a method to quantify the overlap between 2 areas
• It can be used to evaluate an object detector
• The IoU of 2 areas A & B is calculated as the common area between them divided by their total area:
  IoU(A, B) = area(A ∩ B) / area(A ∪ B)

[Figure: pairs of predicted and ground-truth boxes illustrating low, medium, and high IoU scores]
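
A minimal plain-Python sketch of IoU for two axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    # Intersection rectangle: the overlap of the two boxes (empty if disjoint)
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # IoU = intersection area / union area
    return inter / (area_a + area_b - inter)
```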

Slide 16

Non-Maximum Suppression (NMS)
• Multiple bboxes around the object
• There may be multiple objects → How to choose exactly 1 box per object?

Non-Maximum Suppression algorithm
Input:
- List of boxes
- A score for each box
- IoU threshold T (e.g., T = 0.5)
- M: maximum number of chosen boxes
Output: list of chosen boxes
Algorithm:
STEP 1: Sort the boxes by score.
STEP 2: Loop until there is no remaining box (or M boxes are chosen):
- Choose the box with the highest score; call it A.
- Eliminate every remaining box b with IoU(b, A) >= T.
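
A minimal NumPy sketch of this algorithm, with the IoU computed inline and vectorized over the remaining boxes:

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5, max_boxes=100):
    """boxes: [N, 4] as (x1, y1, x2, y2); scores: [N]. Returns kept indices."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]           # STEP 1: sort boxes by score
    keep = []
    while order.size > 0 and len(keep) < max_boxes:
        a, rest = order[0], order[1:]          # A = highest-scoring box left
        keep.append(a)
        # IoU(b, A) for every remaining box b
        ix1 = np.maximum(x1[a], x1[rest]); iy1 = np.maximum(y1[a], y1[rest])
        ix2 = np.minimum(x2[a], x2[rest]); iy2 = np.minimum(y2[a], y2[rest])
        inter = np.maximum(0, ix2 - ix1) * np.maximum(0, iy2 - iy1)
        ious = inter / (areas[a] + areas[rest] - inter)
        order = rest[ious < iou_threshold]     # eliminate boxes with IoU >= T
    return keep
```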

Slide 17

Non-Maximum Suppression (NMS)

[Step-by-step illustration of the NMS algorithm from the previous slide]

Slide 18

RoIAlign

Slide 19

Identify the Feature Pyramid level for RoIs

• w, h: width & height of a RoI
• 224: canonical ImageNet pre-training size
• k0: target level of the RoI whose w*h = 224² (here, k0 = 5)
• The target level k of a RoI is identified by: k = floor(k0 + log2(sqrt(w*h) / 224))
• Crop the RoIs on their feature map and resize them to 7x7
• Intuition: features of large RoIs come from a smaller feature map (high semantic level); features of small RoIs come from a larger feature map (low semantic level)

[Lin et al., "Feature pyramid networks for object detection", CVPR 2017]
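
A minimal sketch of this level-assignment rule, using k0 = 5 as on the slide and clamping to the available levels P2-P5:

```python
import math

def fpn_level(w, h, k0=5, k_min=2, k_max=5):
    """Map a RoI of size (w, h) to a pyramid level k."""
    k = int(math.floor(k0 + math.log2(math.sqrt(w * h) / 224.0)))
    return max(k_min, min(k, k_max))  # clamp so every RoI gets a valid level

# e.g. a 224x224 RoI maps to P5, while a 112x112 RoI maps to P4
```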

Slide 20

RoIAlign

[Diagram: a 1024x1024 input image contains a 540x540 RoI around the object. On the 16x-smaller 64x64 feature map (1024/16 = 64), the RoI spans 540/16 = 33.75 cells, so each bin of a 7x7 grid covers 33.75/7 ≈ 4.82 cells.]
• Use bilinear interpolation to calculate the exact value at each bin: no quantization
• The resulting 7x7 small feature map (one per RoI) is fed to the head network (FCN)

[He et al., "Mask R-CNN", ICCV 2017]
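
A minimal NumPy sketch of bilinear sampling, the core operation of RoIAlign: reading the feature map at fractional coordinates instead of rounding them:

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """feature: [H, W] array; (y, x): fractional coordinates inside it."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature.shape[0] - 1)
    x1 = min(x0 + 1, feature.shape[1] - 1)
    wy, wx = y - y0, x - x0  # fractional parts act as interpolation weights
    return ((1 - wy) * (1 - wx) * feature[y0, x0] +
            (1 - wy) * wx       * feature[y0, x1] +
            wy       * (1 - wx) * feature[y1, x0] +
            wy       * wx       * feature[y1, x1])

# RoIAlign evaluates a few such samples inside each of the 7x7 bins
# (e.g., the 4.82-cell-wide bins above) and averages or maxes them.
```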

Slide 21

Mask-RCNN head network

Slide 22

Mask-RCNN head network

[Diagram: per-RoI feature maps from RoIAlign feed two branches; weights are shared over multiple RoIs.]

Box/class branch (input: 7x7x256 per RoI):
• Conv1 7x7 (1024 filters) → Conv2 1x1 (1024 filters): fully connected layers implemented by CNN → 1024 features
• Dense → Class: softmax over K+1 classes (BG + num classes)
• Dense → Box: (K+1) x 4 values, i.e., 4 box regression values (dy, dx, dh, dw) per class

Mask branch (input: 14x14x256 per RoI):
• Conv1 ... Conv4: 4 conv layers, 3x3 (256 filters) → 14x14x256
• Conv Transpose 2x2 (256 filters, stride 2) for upsampling → 28x28x256
• Conv 1x1 (K+1 filters) with sigmoid activation → 28x28x(K+1): predict one mask per class
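
A minimal PyTorch sketch of the two branches as drawn on the slide; the dimensions follow the diagram, and K denotes the number of foreground classes:

```python
import torch
import torch.nn as nn

class BoxClassHead(nn.Module):
    def __init__(self, k, in_channels=256):
        super().__init__()
        # FC layers implemented as convs: the 7x7 conv collapses the RoI map
        self.fc = nn.Sequential(
            nn.Conv2d(in_channels, 1024, kernel_size=7), nn.ReLU(),
            nn.Conv2d(1024, 1024, kernel_size=1), nn.ReLU())
        self.cls = nn.Linear(1024, k + 1)        # K+1 class scores (softmax later)
        self.box = nn.Linear(1024, (k + 1) * 4)  # (dy, dx, dh, dw) per class

    def forward(self, roi_feat):                 # roi_feat: [R, 256, 7, 7]
        x = self.fc(roi_feat).flatten(1)         # [R, 1024]
        return self.cls(x), self.box(x)

class MaskHead(nn.Module):
    def __init__(self, k, in_channels=256):
        super().__init__()
        convs = []
        for _ in range(4):                       # 4 conv layers, 3x3, 256 filters
            convs += [nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU()]
            in_channels = 256
        self.convs = nn.Sequential(*convs)
        self.up = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)  # 14 -> 28
        self.mask = nn.Conv2d(256, k + 1, kernel_size=1)  # one mask per class

    def forward(self, roi_feat):                 # roi_feat: [R, 256, 14, 14]
        x = torch.relu(self.up(self.convs(roi_feat)))
        return torch.sigmoid(self.mask(x))       # [R, K+1, 28, 28]
```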

Slide 23

Results of Mask-RCNN on COCO data

[He et al., "Mask R-CNN", ICCV 2017]

Slide 24

Some popular DL-based algorithms for visual perception tasks
• Image Classification: AlexNet, VGGNet, GoogLeNet/Inception v1, ResNet
• Object Detection: Fast/Faster R-CNN, SSD, YOLO
• Semantic Segmentation: Fully Convolutional Network (FCN), U-Net
• Instance Segmentation: Mask R-CNN

Slide 25

Summary
• Mask-RCNN for Instance Segmentation:
  • Region Proposals for candidate bounding boxes
  • Predict class, box offsets, and a binary mask for each box
  • CNN-based
• Object Detection topics:
  • For accuracy: R-CNN family algorithms
  • For speed: YOLO, SSD

Slide 26

Q & A!