Introduction of Mask R-CNN

Multi-task learning of object detection and segmentation with Mask R-CNN
Masatoshi Itagaki (板垣正敏), 2019/8/24 @ Python機械学習勉強会in新潟 Restart#8

Main deep learning tasks for images (excluding generative tasks)

Image recognition
• Takes the whole image as input and labels what is shown in it.
• This is the field where deep neural networks, and convolutional neural networks in particular, first demonstrated their power.
Figure 4 (left) of the cited paper: eight ILSVRC-2010 test images, each with the correct label written under it and the model's top 5 predictions.
https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

The foundation is the CNN (convolutional neural network)
Layer                          Output shape
input                          150x150x3
conv3x3, 32, stride (1, 1)     148x148x32
maxpool2x2, stride (2, 2)      74x74x32
conv3x3, 64, stride (1, 1)     72x72x64
maxpool2x2, stride (2, 2)      36x36x64
conv3x3, 128, stride (1, 1)    34x34x128
maxpool2x2, stride (2, 2)      17x17x128
conv3x3, 128, stride (1, 1)    15x15x128
maxpool2x2, stride (2, 2)      7x7x128
flatten                        6272
dense                          512
dense                          1
• Kernel (filter): a multiply-accumulate (sum of products) over each window of the input.
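
The shape sequence on this slide can be checked with a little arithmetic. The sketch below assumes unpadded 3x3 convolutions with stride 1 and 2x2 max pooling with stride 2, which is what the listed sizes imply; the function names are illustrative and not taken from the slides.

```python
def conv_out(size, kernel=3, stride=1):
    """Output width/height of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output width/height of max pooling without padding."""
    return (size - window) // stride + 1


def trace_shapes(size=150, filters=(32, 64, 128, 128)):
    """Return (height, width, channels) after the input and each conv/pool layer."""
    shapes = [(size, size, 3)]
    for channels in filters:
        size = conv_out(size)
        shapes.append((size, size, channels))
        size = pool_out(size)
        shapes.append((size, size, channels))
    return shapes


shapes = trace_shapes()
height, width, channels = shapes[-1]
flattened = height * width * channels  # length of the vector fed to the dense layers
print(shapes)
print(flattened)
```

The last pooled map is 7x7x128, so flattening gives 7 * 7 * 128 = 6272 values, matching the size listed before the dense layers.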

Object detection
• Determines the position of each object in the image (usually as a rectangle) and the class of the object.
https://arxiv.org/abs/1506.01497

Segmentation
• Determines, pixel by pixel, which object each pixel of the image belongs to.
• In other words, it fills in or masks the objects.
http://jamie.shotton.org/work/research.html

The lineage of object detection models

A representative pre-deep-learning method: HOG
• Features are extracted with a conventional (hand-crafted) method from a sliding window over the image.
• A Support Vector Machine (SVM) classifies each window based on the extracted features.
Figure 6 of the cited paper: the HOG detectors cue mainly on silhouette contours (especially the head, shoulders and feet). Panels: (a) the average gradient image over the training examples; (b) the maximum positive SVM weight in each block; (c) likewise for the negative SVM weights; (d) a test image; (e) its computed R-HOG descriptor; (f, g) the R-HOG descriptor weighted by the positive and the negative SVM weights respectively.
https://hal.inria.fr/file/index/docid/548512/filename/hog_cvpr2005.pdf

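R-CNN (Region-based CNN)
• Uses a CNN in place of the HOG feature extractor.
• Running the CNN exhaustively over every sliding window is impractical.
• Instead, an existing method for finding objectness (Selective Search) is used to find region proposals in the image (about 2,000).
• Every region proposal is resized to a fixed size and passed through the CNN to extract features.
• The extracted features are used to train multiple SVMs for category classification, and a regressor estimates the precise position of the bounding box.
https://arxiv.org/abs/1311.2524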

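IoU and NMS
• IoU (Intersection over Union): a metric for how much two regions overlap.
• NMS (Non-Maximum Suppression): when several region proposals overlap the same object (ground truth), their overlap is measured with IoU, and only the best-scoring proposal is kept while the others that overlap it heavily are discarded. This removes duplicate detections and reduces computation.
https://qiita.com/mshinoda88/items/9770ee671ea27f2c81a9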
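
A minimal sketch of IoU and of the common greedy form of NMS, which keeps the highest-scoring box and drops any remaining box whose IoU with an already kept box exceeds a threshold. The box format (x1, y1, x2, y2), the threshold of 0.5, and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In this example the first two boxes overlap with an IoU of 81/119 (about 0.68), so the lower-scoring one is suppressed, while the third box does not overlap anything and is kept.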

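SPP-net (spatial pyramid pooling)
• R-CNN warps each of the 2,000 region proposals to the same image size and then passes every one of them through the CNN for feature extraction, so its computational cost is high.
• SPP-net generates a feature map for the whole image with the CNN, then applies hierarchical (pyramid) pooling to that feature map. This lets it handle input images of arbitrary size while reducing computation.
https://arxiv.org/abs/1406.4729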

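Fast R-CNN
• Uses an RoI pooling layer: a simple pooling with a variable window size, obtained by removing the pyramid structure from SPP.
• A multi-task loss trains classification and bounding box regression at the same time, so training is done in a single stage.
• With VGG16: 9x faster training and 213x faster inference than R-CNN.
• 3x faster training and 10x faster inference than SPPnet.
https://arxiv.org/abs/1504.08083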

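RoI (Region of Interest) Pooling
A simple and efficient pooling method:
• Divide the region proposal into equal-sized sections (the number of sections equals the output dimension).
• Find the maximum value in each section.
• Copy these maximum values to the output buffer.
https://qiita.com/mshinoda88/items/9770ee671ea27f2c81a9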
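
The three steps of RoI pooling can be sketched in plain Python as follows. This is an illustration only: it works on a single-channel feature map stored as a list of rows, takes the RoI as (row_start, col_start, row_end, col_end) with exclusive ends, and assumes the RoI is at least as large as the output grid. The function name and argument layout are hypothetical.

```python
def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one RoI of a 2-D feature map down to a fixed output size."""
    r0, c0, r1, c1 = roi
    out_h, out_w = output_size
    height, width = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Integer section boundaries; sections may differ by one cell
        # when the RoI does not divide evenly.
        rs = r0 + (i * height) // out_h
        re = r0 + ((i + 1) * height) // out_h
        row = []
        for j in range(out_w):
            cs = c0 + (j * width) // out_w
            ce = c0 + ((j + 1) * width) // out_w
            # Maximum of the section is copied to the output.
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled


feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled = roi_max_pool(feature_map, roi=(0, 0, 4, 6), output_size=(2, 2))
print(pooled)
```

Whatever the size of the RoI, the output always has the shape given by output_size, which is what allows the following fully connected layers to have a fixed input size.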

Faster R-CNN
• Introduces the RPN (Region Proposal Network) and end-to-end training.
• Proposes a sub-network that estimates the regions likely to contain objects, starting from anchor boxes.
https://arxiv.org/abs/1506.01497

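RPN (Region Proposal Network)
Extracts the regions likely to contain objects, starting from anchor boxes.
• Define anchors on the feature map (think of it as graph paper, with an anchor at the centre of each square).
• For each anchor, define k anchor boxes (combinations of scale and aspect ratio).
• For each anchor box, train the network to predict an objectness score and corrections to its position and size.
Faster R-CNN: http://arxiv.org/abs/1506.01497
Diagram labels: image (H x W x 3), CNN (feature extraction), feature map (H/16 x W/16 x 3); k boxes per anchor from scale × aspect ratio (e.g. k = 3 × 3); 2k scores (object or background); 4k coordinates (corrections to x, y, w, h).
https://www.slideshare.net/ToshinoriHanya/ohs3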
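
A minimal sketch of how anchor boxes might be laid out on the feature map. The stride of 16 follows the H/16 x W/16 feature map on the slide; the three scales and three aspect ratios (giving k = 3 × 3 = 9) are the values commonly cited for Faster R-CNN and are assumptions here, as are the function name and the (cx, cy, w, h) output format.

```python
def make_anchor_boxes(feat_h, feat_w, stride=16,
                      scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchor boxes as (cx, cy, w, h) in image pixels.

    One anchor sits at the centre of each feature-map cell, and
    k = len(scales) * len(ratios) boxes are generated per anchor.
    """
    boxes = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = height / width
                    # Keep the area at scale**2 while changing the shape.
                    w = scale / ratio ** 0.5
                    h = scale * ratio ** 0.5
                    boxes.append((cx, cy, w, h))
    return boxes


anchors = make_anchor_boxes(feat_h=2, feat_w=3)
k = 3 * 3
print(len(anchors))   # feat_h * feat_w * k boxes in total
print(2 * k, 4 * k)   # per anchor: objectness scores and box corrections
```

The RPN itself does not output these boxes. It outputs, for every anchor, 2k scores and 4k correction terms that are applied to the fixed boxes generated above.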

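YOLO
• Divides the image into a predetermined grid.
• For each grid cell, estimates the likelihood that an object is present, together with the position and size of its region.
https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf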

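FPN (Feature Pyramid Net)
A technique for detecting objects at a variety of scales.
• Bottom-up pathway: takes the output of each stage of the CNN as features.
• Top-down pathway: upsamples the low-resolution but semantically strong feature maps and merges them with the higher-resolution feature maps, enabling detection at higher resolution.
https://arxiv.org/abs/1612.03144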
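
The merging step, in which a low-resolution map is upsampled and mixed with a higher-resolution one, can be sketched as follows. This is a simplified illustration on single-channel maps stored as lists of rows: it uses nearest-neighbour 2x upsampling and element-wise addition, and leaves out the convolutions that an actual FPN applies before and after the merge. The function names are hypothetical.

```python
def upsample2x(feature_map):
    """Nearest-neighbour 2x upsampling of a 2-D map (list of rows)."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge_top_down(coarse, fine):
    """Add the upsampled coarse map to the finer map element by element."""
    up = upsample2x(coarse)
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


coarse = [[1, 2],
          [3, 4]]                       # low resolution, semantically strong
fine = [[0] * 4 for _ in range(4)]      # twice the resolution
fine[0][0] = 10
merged = merge_top_down(coarse, fine)
print(merged)
```

Repeating this merge from the coarsest level downward yields a pyramid in which every level carries the semantic information of the deeper layers.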

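YOLO v2
Improves performance by adopting ideas from Faster R-CNN and other models.
• Introduction of Batch Normalization
• Higher input resolution
• Introduction of anchor boxes
• Clustering of bounding boxes with k-means using an IoU-based distance
• Direct prediction of the box position
https://pjreddie.com/darknet/yolov2/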

SSD
• Uses the feature maps from multiple CNN layers to estimate regions and object classes at the same time.
https://arxiv.org/abs/1512.02325

RetinaNet
• Proposed by Facebook AI Research.
• Introduces Focal Loss in the classifier (to counter the class imbalance between foreground and background objects).
• Uses the feature maps from multiple ResNet layers to train a class classifier and a box estimator.
Figure 3 of the cited paper: the one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) backbone on top of a feedforward ResNet architecture (a) to generate a rich, multi-scale convolutional feature pyramid (b). To this backbone RetinaNet attaches two subnetworks, one for classifying anchor boxes (c) and one for regressing from anchor boxes to ground-truth object boxes (d).
The slide also shows an excerpt of the paper's sections on the classification subnet, the box regression subnet, and the focal loss.
https://arxiv.org/abs/1708.02002
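
As a rough illustration of why Focal Loss helps with the foreground/background imbalance, the sketch below implements the binary form FL(p_t) = -α_t (1 - p_t)^γ log(p_t) from the RetinaNet paper for a single prediction. The default γ = 2 is the value the paper reports as working well; α = 0.25 and the function name are assumptions for this example.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one binary prediction.

    p: predicted probability of the foreground class (0 < p < 1)
    y: 1 for foreground, 0 for background
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.01, 0)   # background, confidently classified as background
hard = focal_loss(0.9, 0)    # background, wrongly classified as foreground
print(easy, hard)
```

With γ = 0 and α = 0.5 the expression reduces to half of the ordinary cross-entropy. Raising γ makes the many easy background anchors contribute almost nothing, so the rare hard examples dominate the total loss.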

YOLO v3
• Uses the sum of squared errors as the loss for bounding boxes.
• Uses logistic regression to score objectness.
• Uses multi-label logistic classifiers instead of softmax for class prediction.
• Predicts bounding boxes at multiple scales.
• Uses a new feature extraction network for speed.
Paper shown on the slide: "YOLOv3: An Incremental Improvement", Joseph Redmon, Ali Farhadi, University of Washington. From the abstract: at 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster.
Figure 1 of the paper (adapted from the Focal Loss paper): COCO AP against inference time (ms). Times are from either an M40 or a Titan X.
Method                  mAP    time (ms)
[B] SSD321              28.0   61
[C] DSSD321             28.0   85
[D] R-FCN               29.9   85
[E] SSD513              31.2   125
[F] DSSD513             33.2   156
[G] FPN FRCN            36.2   172
RetinaNet-50-500        32.5   73
RetinaNet-101-500       34.4   90
RetinaNet-101-800       37.8   198
YOLOv3-320              28.2   22
YOLOv3-416              31.0   29
YOLOv3-608              33.0   51
Section 2.1, Bounding Box Prediction: the network predicts 4 coordinates for each bounding box, t_x, t_y, t_w, t_h. If the cell is offset from the top left corner of the image by (c_x, c_y) and the bounding box prior has width and height p_w, p_h, then the predictions correspond to:
b_x = \sigma(t_x) + c_x
b_y = \sigma(t_y) + c_y
b_w = p_w e^{t_w}
b_h = p_h e^{t_h}
https://arxiv.org/abs/1804.02767
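
The bounding box prediction in section 2.1 of the YOLOv3 paper can be written out as a small function. In this sketch the function applied to t_x and t_y is taken to be the logistic sigmoid, the results are in grid-cell units, and the function and argument names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Decode raw outputs (tx, ty, tw, th) into a box (bx, by, bw, bh).

    cell:  (cx, cy), offset of the grid cell from the top left corner
    prior: (pw, ph), width and height of the bounding box prior
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    bx = sigmoid(tx) + cx      # centre stays inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)     # prior size scaled by a positive factor
    bh = ph * math.exp(th)
    return bx, by, bw, bh


box = decode_box(t=(0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(2.0, 5.0))
print(box)
```

With all raw outputs at zero, the box centre lands in the middle of its cell and the box has exactly the size of the prior.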

Mask R-CNN and its implementations

Mask R-CNN ≒ Faster R-CNN + Mask Branch https://arxiv.org/abs/1703.06870
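
The multi-task aspect can be stated compactly. As a sketch based on the Mask R-CNN paper linked above, the training loss on each sampled RoI adds a mask term to the classification and box regression terms inherited from Faster R-CNN:

```latex
L = L_{cls} + L_{box} + L_{mask}
```

Here L_cls and L_box are the classification and bounding box regression losses, and L_mask is the loss of the added mask branch, computed only on the mask of the ground-truth class.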

Achieves higher performance than single-task models

Table 3 of the paper. Object detection single-model results (bounding box AP), vs. state-of-the-art on test-dev.
Method                        Backbone                   APbb   APbb50  APbb75  APbbS  APbbM  APbbL
Faster R-CNN+++ [19]          ResNet-101-C4              34.9   55.7    37.4    15.6   38.7   50.9
Faster R-CNN w FPN [27]       ResNet-101-FPN             36.2   59.1    39.0    18.2   39.0   48.2
Faster R-CNN by G-RMI [21]    Inception-ResNet-v2 [41]   34.7   55.5    36.7    13.5   38.1   52.0
Faster R-CNN w TDM [39]       Inception-ResNet-v2-TDM    36.8   57.7    39.2    16.2   39.8   52.1
Faster R-CNN, RoIAlign        ResNet-101-FPN             37.3   59.6    40.3    19.8   40.2   48.8
Mask R-CNN                    ResNet-101-FPN             38.2   60.3    41.7    20.1   41.1   50.2
Mask R-CNN                    ResNeXt-101-FPN            39.8   62.3    43.4    22.1   43.2   51.2
Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models (the mask output is ignored in these experiments). The gains of Mask R-CNN over [27] come from using RoIAlign (+1.1 APbb), multitask training (+0.9 APbb), and ResNeXt-101 (+1.6 APbb).

Table 1 of the paper. Instance segmentation mask AP on COCO test-dev.
Method                  Backbone                AP     AP50   AP75   APS    APM    APL
MNC [10]                ResNet-101-C4           24.6   44.3   24.8   4.7    25.9   43.6
FCIS [26] +OHEM         ResNet-101-C5-dilated   29.2   49.5   -      7.1    31.3   50.0
FCIS+++ [26] +OHEM      ResNet-101-C5-dilated   33.6   54.5   -      -      -      -
Mask R-CNN              ResNet-101-C4           33.1   54.9   34.8   12.1   35.6   51.1
Mask R-CNN              ResNet-101-FPN          35.7   58.0   37.8   15.5   38.1   52.4
Mask R-CNN              ResNeXt-101-FPN         37.1   60.0   39.4   16.9   39.9   53.5
MNC [10] and FCIS [26] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip test, and OHEM [38]. All entries are single-model results.

Figure 5 of the paper: more results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1).

https://github.com/matterport/Mask_RCNN

Various implementations
• matterport/Mask_RCNN: implemented with Keras
• tensorflow/models: implemented as a feature of the TensorFlow Object Detection API
• facebookresearch/detectron: implemented as part of Facebook's Object Detection System (Caffe2-based)
• facebookresearch/maskrcnn-benchmark: implemented with PyTorch

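Training datasets
• PASCAL VOC http://host.robots.ox.ac.uk/pascal/VOC/
  The datasets of a competition held from 2005 to 2012.
• MS COCO http://cocodataset.org/
  A dataset provided by Microsoft, with releases in 2014, 2015, and 2017. In 2018 and 2019, annotations for keypoint detection and pose estimation were also added.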

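Custom training and points to note
• The annotation format of the training data differs from dataset to dataset and from model to model: XML, JSON, CSV, and so on.
• Position information for object detection also varies: some formats assume the top-left and bottom-right coordinates, others assume the centre coordinates plus height and width.
• Mask annotations vary as well: bitmaps (binary or integer), RLE-compressed masks, and others.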
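
To make these format differences concrete, the sketch below converts between the two bounding box conventions and run-length encodes a flattened binary mask. It is illustrative only: the function names are hypothetical, and the RLE layout used here (alternating run lengths, starting with the number of zeros) is just one possible variant. Actual datasets define their own pixel order and indexing, so each dataset's specification should be checked.

```python
def corners_to_center(box):
    """(x_min, y_min, x_max, y_max) -> (cx, cy, w, h)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0)


def center_to_corners(box):
    """(cx, cy, w, h) -> (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def rle_encode(bits):
    """Encode a flat 0/1 mask as alternating run lengths, zeros first."""
    runs = []
    current, count = 0, 0
    for bit in bits:
        if bit == current:
            count += 1
        else:
            runs.append(count)
            current, count = bit, 1
    runs.append(count)
    return runs


def rle_decode(runs):
    """Inverse of rle_encode."""
    bits, value = [], 0
    for run in runs:
        bits.extend([value] * run)
        value = 1 - value
    return bits


mask = [0, 0, 1, 1, 1, 0, 1]
runs = rle_encode(mask)
print(corners_to_center((10, 20, 50, 80)))
print(runs)
```

Converting annotations into whatever convention the chosen implementation expects, and checking the round trip on a few samples, is a cheap way to catch format mismatches before training.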

Annotation tools
• CVAT (Computer Vision Annotation Tool) https://github.com/opencv/cvat
  An image annotation tool provided by OpenCV. Supports image classification, object detection, and segmentation. Can be launched with Docker.
• Abeja Platform Annotation (paid service) https://abejainc.com/platform/ja/

Competitions
• SIIM-ACR Pneumothorax Segmentation: Identify Pneumothorax disease in chest x-rays
  https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation
  Detect the location of pneumothorax in X-ray images. The competition is about to close.
• 2018 Data Science Bowl: Find the nuclei in divergent images to advance medical discovery
  https://www.kaggle.com/c/data-science-bowl-2018
  Detect cell nuclei in microscope images. The competition has ended.

Summary

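Where to use Mask R-CNN
• Mask R-CNN is a model that can perform object detection, segmentation, and keypoint detection at the same time.
• Higher accuracy can be expected than from models that perform only object detection or only segmentation.
• It appears applicable to fields such as diagnostic imaging and the detection of faulty parts.
• Transfer learning and fine-tuning are possible, using a pre-trained backbone CNN or a network already trained on an object detection or segmentation dataset.
• However, its inference speed for object detection is not as fast as that of YOLO or SSD, so this needs to be considered depending on the purpose.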

References

Articles consulted
• Understanding object detection and segmentation through Mask R-CNN (for beginners) https://qiita.com/shtamura/items/4283c851bc3d9721ed96
• A summary of the history of object detection https://qiita.com/mshinoda88/items/9770ee671ea27f2c81a9
• Progress in object detection models, Part 3: FPN and RetinaNet https://qiita.com/TaigaHasegawa/items/653abc81ac4ee1f0d7b8
• Annotation tools (ground-truth input tools) are evolving. 2 https://qiita.com/nonbiri15/items/819efb0d42b1541c29c0
• A roundup of datasets for machine learning on images https://qiita.com/leetmikeal/items/7c0d23e39bf38ab8be23
