Paper Review: Mask2Former

Slide 1

Slide 1 text

Paper Review: Masked-attention Mask Transformer for Universal Image Segmentation
Takehiro Matsuda

Slide 2

Slide 2 text

Paper information
• Title: Masked-attention Mask Transformer for Universal Image Segmentation
• Paper: https://arxiv.org/abs/2112.01527
• Code: https://github.com/facebookresearch/Mask2Former
• Venue: CVPR 2022
• Authors: Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar
• Affiliations: Facebook AI Research (FAIR), University of Illinois at Urbana-Champaign (UIUC)

Why this paper was chosen:
• It proposes a universal Transformer-based architecture that can be used for segmentation regardless of whether the task is semantic, instance, or panoptic.
• It surpasses the previous SOTA on each of semantic, instance, and panoptic segmentation.

Slide 3

Slide 3 text

Paper summary
• Introduces masked attention into the Transformer decoder.
• Makes the Transformer decoder multi-scale. Masked attention restricted to the mask regions obtained during training captures local features accurately.
• Achieves SOTA on:
  Panoptic: COCO panoptic val2017
  Instance: COCO val2017
  Semantic: ADE20K
[Figure: ground truth and prediction examples for panoptic, instance, and semantic segmentation]

Slide 4

Slide 4 text

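Differences between segmentation tasks
• Semantic: recognizes a class for every pixel.
• Instance: recognizes where the specified classes are located and separates different individuals of the same class. (Some pixels are left unlabeled unless classes such as sky are included as targets.)
• Panoptic: recognizes things as instances and also recognizes stuff (sky, road, and so on).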

Slide 5

Slide 5 text

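Related papers
• DETR (Detection Transformer): introduced the Transformer to object detection.
• MaskFormer: uses a Transformer to generate masks for segmentation and classify them.
These belong to a series of papers from FAIR (Facebook AI Research) on Transformer-based image recognition: DETR → MaskFormer.
Transformers can extract global features and relationships, but they were somewhat weak at recognizing small objects and required large computational resources. Mask2Former improves on these points.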

Slide 6

Slide 6 text

Transformer overview
https://www.slideshare.net/SSII_Slides/ssii2022-ts1-transformer (from Ushiku's slides)

Slide 7

Slide 7 text

Transformer overview

Slide 8

Slide 8 text

Transformer overview

Slide 9

Slide 9 text

Transformer overview

Slide 10

Slide 10 text

Transformer overview

Slide 11

Slide 11 text

Transformer overview

Slide 12

Slide 12 text

Transformer overview

Slide 13

Slide 13 text

Transformer overview

Slide 14

Slide 14 text

DETR
Requires neither anchor configuration nor NMS (Non-Maximum Suppression).

Slide 15

Slide 15 text

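DETR
• CNN: encodes the features of neighboring pixels (regions) of the high-resolution image (W and H are 1/32 of the input; C is 2048).
• Encoder-decoder network: uses attention to convert the image features extracted by the CNN into information about the position and type of each object. It predicts a predetermined number N of objects, and each prediction takes the other predictions into account.
• Decoding network: decodes the Transformer output into object position coordinates and class labels.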

Slide 16

Slide 16 text

MaskFormer
Performs semantic segmentation and panoptic segmentation with a Transformer.
[Figure: ground truth and prediction examples]

Slide 17

Slide 17 text

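MaskFormer
Per-Pixel Classification is Not All You Need for Semantic Segmentation
The segmentation task is treated as mask classification:
(1) Generate N binary mask regions from the image.
(2) For each mask region, output the probability that it belongs to each of the K recognition categories.
Pipeline:
• The transformer decoder produces N class predictions and mask embeddings.
• Binary mask predictions are obtained from the mask embeddings.
• A per-pixel mask loss is computed on each binary mask.
• A classification loss is computed for each mask.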
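As a rough illustration of mask classification, the sketch below combines N class predictions with N mask predictions into a per-pixel category map. It weights each mask by its class probability and takes the best category at every pixel, which is how semantic inference is commonly described for MaskFormer. The function names, the nested-list data layout, and the extra "no object" column in the class logits are assumptions made for this example, not code from the paper's repository.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def semantic_inference(class_logits, mask_logits):
    """Combine N mask predictions into an H x W map of category indices.

    class_logits: N x (K + 1) list; the last column is an assumed
                  "no object" class and is dropped after the softmax.
    mask_logits:  N x H x W list of per-pixel mask logits.
    """
    probs = [softmax(row)[:-1] for row in class_logits]
    n = len(mask_logits)
    h, w = len(mask_logits[0]), len(mask_logits[0][0])
    k = len(probs[0])
    result = []
    for y in range(h):
        row = []
        for x in range(w):
            # Score of category c = sum over masks of p_i(c) * mask_i(y, x).
            scores = [
                sum(probs[i][c] * sigmoid(mask_logits[i][y][x]) for i in range(n))
                for c in range(k)
            ]
            row.append(max(range(k), key=scores.__getitem__))
        result.append(row)
    return result


# Toy case: N = 2 masks, K = 2 categories, a 1 x 2 image.
toy_classes = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
toy_masks = [[[4.0, -4.0]], [[-4.0, 4.0]]]
toy_result = semantic_inference(toy_classes, toy_masks)
```

In the toy case, the first mask covers the left pixel and votes for category 0, and the second mask covers the right pixel and votes for category 1.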

Slide 18

Slide 18 text

Mask2Former
Improves on the weaknesses of MaskFormer:
• Poor accuracy on small objects
• Large computational resource requirements
• Long training time
Achieves SOTA:
• Panoptic segmentation: 57.8 PQ on COCO
• Instance segmentation: 50.1 AP on COCO
• Semantic segmentation: 57.7 mIoU on ADE20K

Slide 19

Slide 19 text

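Masked Attention
Instead of the standard cross-attention, which is learned over the whole image, masked attention uses a mask generated from the object query's prediction and attends only within that region.
[Figure: standard cross-attention vs. masked attention]
The expectation is that this improves recognition of fine details such as small objects and object boundaries.
"We hypothesize that local features are enough to update query features and context information can be gathered through self-attention."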
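The idea can be sketched for a single query as follows. Locations outside the predicted mask receive a logit of minus infinity before the softmax, so they get zero attention weight. The function name, the 0.5 threshold, the omission of scaling and residual connections, and the fallback to full attention when the mask is empty are all assumptions chosen for this minimal example.

```python
import math


def masked_cross_attention(query, keys, values, mask_prob, threshold=0.5):
    """Attention of one query over image locations, restricted to a mask.

    query:     list of d floats.
    keys:      one list of d floats per image location.
    values:    one list of floats per image location.
    mask_prob: predicted mask probability per location (from the previous layer).
    Returns (output vector, attention weights).
    """
    allowed = [p > threshold for p in mask_prob]
    if not any(allowed):
        # Guard chosen for this sketch: an empty mask falls back to full attention.
        allowed = [True] * len(keys)

    # Dot-product logits; masked-out locations are set to -inf.
    logits = [
        sum(q * k for q, k in zip(query, key)) if ok else -math.inf
        for key, ok in zip(keys, allowed)
    ]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


# Three locations; the third lies outside the predicted mask.
out, attn = masked_cross_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    mask_prob=[0.9, 0.8, 0.1],
)
```

Here the third location contributes nothing to the output even though its value vector is large, because its attention weight is exactly zero.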

Slide 20

Slide 20 text

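Multi-scale high-resolution features
The pixel decoder builds a feature pyramid at 1/32, 1/16, and 1/8 of the original image resolution, and the Transformer decoder has layers corresponding to each scale (Transformer decoder: 3 x L layers).
This adopts the resolution pyramid structure widely used in vision models and improves recognition of small objects.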
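A small sketch of how the 3 x L decoder layers could be paired with the three pyramid levels. The helper names are hypothetical, and the coarse-to-fine order within each round and the value L = 3 in the example are assumptions for illustration.

```python
def feature_map_sizes(height, width, strides=(32, 16, 8)):
    """Spatial size of each pyramid level for an input of height x width."""
    return [(height // s, width // s) for s in strides]


def decoder_schedule(num_rounds, strides=(32, 16, 8)):
    """Stride of the feature map that each decoder layer attends to.

    Each round visits every scale once, giving 3 x num_rounds layers in total.
    """
    return [s for _ in range(num_rounds) for s in strides]


sizes = feature_map_sizes(512, 512)   # [(16, 16), (32, 32), (64, 64)]
schedule = decoder_schedule(3)        # 9 layers: 32, 16, 8, 32, 16, 8, ...
```

Under these assumptions, a 512 x 512 input gives 16 x 16, 32 x 32, and 64 x 64 feature maps, and nine decoder layers cycle through them.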

Slide 21

Slide 21 text

Optimization improvements
• A standard Transformer decoder layer produces query features by applying a self-attention module, cross-attention, and a feed-forward network in that order. Mask2Former swaps the order of self-attention and masked (cross-) attention, and makes the query features learnable.
• Dropout is removed. (Previously it was applied to residual connections and attention maps.)

Slide 22

Slide 22 text

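Computer resource reduction
Because of its high-resolution mask predictions, MaskFormer needed a GPU with 32 GB of memory for a single image.
Inspired by PointRend and Implicit PointRend, Mask2Former computes the mask loss on K (= 12544 = 112 x 112) randomly sampled points instead of on the whole mask. The final loss between predictions and ground truth uses a separate set of K points chosen by importance sampling.
As a result, Mask2Former reduces the memory required per image from 18 GB to 6 GB.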
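The effect of point sampling can be sketched with a binary cross-entropy loss evaluated on K random pixels and compared with the same loss over the full mask. This example uses uniform sampling at integer pixel positions only. The importance-sampling step is not implemented, and the function names and the synthetic circular mask are assumptions for illustration.

```python
import math
import random


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one point; prob is clamped for stability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def full_mask_loss(pred, target):
    """Mean BCE over every pixel of an H x W mask."""
    losses = [bce(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(losses) / len(losses)


def sampled_mask_loss(pred, target, num_points, rng):
    """Mean BCE over num_points uniformly sampled pixels (with replacement)."""
    h, w = len(pred), len(pred[0])
    points = [(rng.randrange(h), rng.randrange(w)) for _ in range(num_points)]
    return sum(bce(pred[y][x], target[y][x]) for y, x in points) / num_points


# Synthetic 224 x 224 example: a filled circle as ground truth,
# and a prediction of 0.9 inside the circle and 0.2 outside.
size = 224
target = [
    [1 if (x - 112) ** 2 + (y - 112) ** 2 < 60 ** 2 else 0 for x in range(size)]
    for y in range(size)
]
pred = [[0.9 if t else 0.2 for t in row] for row in target]

k = 112 * 112  # 12544 points, as on the slide
full = full_mask_loss(pred, target)
approx = sampled_mask_loss(pred, target, k, random.Random(0))
```

The sampled loss touches only K points per mask regardless of mask resolution, which is where the memory saving comes from.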

Slide 23

Slide 23 text

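PQ metric
The performance metric for panoptic segmentation. PQ combines two factors:
• The average IoU of the matched segments.
• The fraction of segments recognized correctly (similar to an F1 score).
A predicted segment counts as a TP when its IoU with a ground-truth segment is greater than 0.5.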
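A minimal sketch of the PQ computation for one class, following the definition in the Panoptic Segmentation paper listed in the references: PQ = (sum of IoUs over TPs) / (|TP| + 0.5 |FP| + 0.5 |FN|), which factors into segmentation quality (average IoU) times recognition quality (F1-like). The function name and the input layout are assumptions. In the standard benchmark PQ is computed per class and then averaged over classes, which is omitted here.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoU of each true-positive match (each must exceed 0.5).
    num_fp:       number of unmatched predicted segments.
    num_fn:       number of unmatched ground-truth segments.
    """
    if any(iou <= 0.5 for iou in matched_ious):
        raise ValueError("a match requires IoU > 0.5")
    num_tp = len(matched_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / num_tp if num_tp else 0.0  # average IoU of matches
    rq = num_tp / denom                                 # F1-like recognition term
    return sq * rq, sq, rq


# Two matches (IoU 0.8 and 0.6), one false positive, one false negative.
pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
```

For this example SQ is 0.7 and RQ is 2/3, so PQ is about 0.467.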

Slide 24

Slide 24 text

Experiment – Panoptic Segmentation
COCO panoptic val 2017 with 133 categories

Slide 25

Slide 25 text

Panoptic Segmentation Visualization
[Figure: ground truth (GT) and predicted segmentations]

Slide 26

Slide 26 text

Experiment – Instance Segmentation
COCO val 2017 with 80 categories

Slide 27

Slide 27 text

Instance Segmentation Visualization
[Figure: ground truth (GT) and predicted segmentations]

Slide 28

Slide 28 text

Experiment – Semantic Segmentation
ADE20K val with 150 categories (single-scale and multi-scale results)

Slide 29

Slide 29 text

Semantic Segmentation Visualization
[Figure: ground truth (GT) and predicted segmentations]

Slide 30

Slide 30 text

References
• DETR: https://arxiv.org/abs/2005.12872 , https://github.com/facebookresearch/detr
• MaskFormer: https://arxiv.org/abs/2107.06278 , https://github.com/facebookresearch/MaskFormer
• Panoptic Segmentation: https://arxiv.org/abs/1801.00868
• Frontiers of Transformers (Ushiku, OMRON SINIC X): https://www.slideshare.net/SSII_Slides/ssii2022-ts1-transformer