
Paper summary: Segment Anything

A personal summary of the paper.

Hidenori Itaya

July 17, 2023

Transcript

  1. Hidenori Itaya (Chubu University). Paper summary: Segment Anything.
     Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick [Meta AI Research]
  2. Segment Anything Model (SAM) [Kirillov+, arXiv2023]
     • A zero-shot segmentation technique developed by Meta
       - Can segment (partition into regions) arbitrary objects without training for a specific task (zero-shot)
       - Achieves accuracy comparable to, or better than, task-specific supervised learning
     [Slide shows figures and tables from the paper: Fig. 3 (three valid masks generated by SAM from a single ambiguous point prompt), Table 5 (instance segmentation on COCO/LVIS with SAM prompted by ViTDet boxes), Fig. 11 (human mask-quality ratings), Fig. 12 (zero-shot text-to-mask), Fig. 10 (zero-shot edge prediction on BSDS500), and Table 4 (object proposal generation on LVIS v1)]
  3. The three components of SA
     • Task: promptable segmentation
     • Model: Segment Anything Model (SAM), consisting of an image encoder, a prompt encoder, and a lightweight mask decoder
     • Data: a data engine and the resulting dataset, Segment Anything 1B (SA-1B): 1+ billion masks, 11 million images, privacy respecting, licensed images
     [Slide shows Figure 1 from the paper: the three interconnected components (task, model, data)]
  4. The three components of SA: Task
     Promptable segmentation task
     • Background: foundation models in NLP and CV can handle new datasets and tasks zero-shot or few-shot by using "prompting" techniques
     • Goal of the task: return a valid segmentation mask for any segmentation prompt
       - A prompt is information specifying what in the image should be segmented
       - Because prompts can be ambiguous, there is not necessarily a single correct mask
     • Promptable segmentation serves as the pre-training objective and is then transferred zero-shot to downstream segmentation tasks
  5. The three components of SA: Model
     Segment Anything Model (SAM)
     • Structure
       - Encoder: image encoder and prompt encoder
       - Decoder: mask decoder
     • Process flow (see the sketch below)
       1. The image and the prompt are embedded separately
       2. A Transformer-based decoder generates masks from the embeddings
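A minimal structural sketch of this two-step flow in PyTorch. The module names, stand-in layers, and dimensions below are assumptions for illustration only, not the official implementation; they just mirror the flow "embed image, embed prompt, decode masks".

```python
import torch
import torch.nn as nn

class TinySAM(nn.Module):
    """Illustrative skeleton of SAM's three modules (not the official code)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)  # stand-in for the ViT
        self.prompt_encoder = nn.Linear(2, embed_dim)  # stand-in: embeds normalized (x, y) point prompts
        self.mask_decoder = nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True)

    def forward(self, image, points):
        img_emb = self.image_encoder(image)             # (B, C, H/16, W/16): computed once per image
        tokens = self.prompt_encoder(points)            # (B, N, C): cheap, re-run for every new prompt
        mem = img_emb.flatten(2).transpose(1, 2)        # (B, HW, C)
        dec = self.mask_decoder(tokens, mem)            # prompt tokens attend to the image embedding
        # dot-product the decoded tokens against image features to get low-res mask logits
        masks = torch.einsum("bnc,bkc->bnk", dec, mem)
        return masks                                    # (B, N, HW) mask logits, one map per prompt token

model = TinySAM()
image = torch.randn(1, 3, 224, 224)
points = torch.tensor([[[0.3, 0.6]]])                   # one normalized (x, y) point prompt
print(model(image, points).shape)                       # torch.Size([1, 1, 196])
```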
  6. The three components of SA: Model
     Image encoder
     • Embeds the image features
     • The network architecture is a ViT
     • Uses a ViT pre-trained with Masked Autoencoder (MAE) [He+, CVPR2022]
     • This is the most computationally expensive part, but at inference time the prompt can be changed in real time as long as the image embedding is cached (see the sketch below)
     [Slide shows Figure 4 from the paper: SAM overview with image encoder, prompt encoder, and mask decoder]
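A sketch of this caching behavior with the official `segment-anything` package (https://github.com/facebookresearch/segment-anything). It assumes the ViT-H checkpoint and a local `example.jpg` are available; `set_image` runs the heavy image encoder once, and each subsequent prompt only re-runs the lightweight prompt encoder and mask decoder.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# assumes the ViT-H checkpoint from the official repo has been downloaded locally
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the heavy ViT image encoder once and caches the embedding

# each new prompt only re-runs the lightweight prompt encoder + mask decoder
for point in [(100, 200), (300, 150)]:
    masks, scores, _ = predictor.predict(
        point_coords=np.array([point]),
        point_labels=np.array([1]),      # 1 = foreground point
        multimask_output=True,           # return 3 candidate masks for an ambiguous prompt
    )
    print(masks.shape, scores)           # (3, H, W) boolean masks and their predicted IoU scores
```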
  7. The three components of SA: Model
     Prompt encoder (point, box)
     • Embeds the prompt
     • Points and boxes are represented with positional encodings and summed with a learnable embedding (parameter) for each prompt type (see the sketch below)
     [Slide shows Figure 4 from the paper: SAM overview]
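A sketch of that idea, assuming a random-Fourier style positional encoding of the point coordinates plus a learned per-type embedding. Class name, the number of prompt types, and the type ids are illustrative assumptions, not the official module.

```python
import torch
import torch.nn as nn

class PointPromptEncoder(nn.Module):
    """Illustrative sketch: positional encoding of a point plus a learned
    embedding per prompt type (e.g. foreground point, background point, box corners)."""
    def __init__(self, embed_dim=256, num_types=4, scale=1.0):
        super().__init__()
        # random Gaussian projection used to build sin/cos positional features
        self.register_buffer("pos_matrix", scale * torch.randn(2, embed_dim // 2))
        self.type_embed = nn.Embedding(num_types, embed_dim)  # learnable per-prompt-type vector

    def forward(self, coords, types):
        # coords: (N, 2) point coordinates normalized to [0, 1]; types: (N,) prompt-type ids
        proj = 2 * torch.pi * coords @ self.pos_matrix          # (N, embed_dim / 2)
        pos = torch.cat([proj.sin(), proj.cos()], dim=-1)       # (N, embed_dim)
        return pos + self.type_embed(types)                     # positional encoding + learned type embedding

enc = PointPromptEncoder()
coords = torch.tensor([[0.25, 0.50]])   # one point
types = torch.tensor([0])               # 0 = foreground point (assumed id)
print(enc(coords, types).shape)         # torch.Size([1, 256])
```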
  8. The three components of SA: Model
     Prompt encoder (text)
     • Embeds the prompt
     • Uses the text encoder of CLIP [Radford+, ICML2021] (see the sketch below)
     [Slide shows Figure 4 from the paper: SAM overview]
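A sketch of obtaining such a text embedding with OpenAI's `clip` package. Note that the public SAM release does not ship the text-prompt path, so this only shows how a CLIP text embedding of a prompt would be produced; the model variant ("ViT-L/14") is an assumption.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)

# example prompts from the paper's qualitative results (Fig. 12)
tokens = clip.tokenize(["a wheel", "beaver tooth grille"]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(tokens)   # (2, 768) text embeddings, usable as prompt tokens
print(text_emb.shape)
```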
  9. The three components of SA: Model
     Prompt encoder (mask)
     • Embeds the prompt
     • A feature map produced by convolution layers is added element-wise to the image embedding (see the sketch below)
     [Slide shows Figure 4 from the paper: SAM overview]
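A sketch of this dense (mask) prompt path: a small convolutional stack downscales the input mask to the image-embedding resolution and the result is added element-wise. The channel sizes and the 256x256 to 64x64 downscaling factor are assumptions for illustration.

```python
import torch
import torch.nn as nn

# illustrative downscaler for the mask prompt (not the official layer configuration)
mask_downscaler = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=2, stride=2), nn.GELU(),
    nn.Conv2d(4, 16, kernel_size=2, stride=2), nn.GELU(),
    nn.Conv2d(16, 256, kernel_size=1),
)

image_embedding = torch.randn(1, 256, 64, 64)   # output of the image encoder
mask_prompt = torch.randn(1, 1, 256, 256)       # low-resolution mask supplied as a prompt

dense_embedding = mask_downscaler(mask_prompt)  # (1, 256, 64, 64)
fused = image_embedding + dense_embedding       # element-wise addition, as described on this slide
print(fused.shape)
```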
  10. The three components of SA: Model
     Mask decoder
     • Outputs mask candidates
     • Outputs three candidate masks to account for prompt ambiguity
     • The network architecture is a Transformer decoder
     • Uses self-attention over the prompt tokens and cross-attention in both directions
     • Decoder layer (4 steps, sketched below):
       1. Self-attention over the tokens
       2. Cross-attention from the tokens to the image embedding
       3. Update of each token by a point-wise MLP
       4. Cross-attention from the image embedding to the tokens
     [Slide shows Figure 4 (SAM overview) and Figure 14 from the paper: details of the lightweight mask decoder; a two-layer decoder updates both the image embedding and the prompt tokens via cross-attention, the image embedding is then upscaled, and the updated output tokens dynamically predict the masks and IoU scores]
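A sketch of one decoder layer implementing the four steps listed above. It is a simplified stand-in (no layer norms, no positional re-injection) under assumed dimensions, not the official block.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Illustrative sketch of one mask-decoder layer: the four steps above."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.token_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.image_to_token = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, image):
        # tokens: (B, N, C) prompt + output tokens; image: (B, HW, C) flattened image embedding
        tokens = tokens + self.self_attn(tokens, tokens, tokens)[0]     # 1. self-attention over tokens
        tokens = tokens + self.token_to_image(tokens, image, image)[0]  # 2. cross-attn: tokens -> image
        tokens = tokens + self.mlp(tokens)                              # 3. point-wise MLP token update
        image = image + self.image_to_token(image, tokens, tokens)[0]   # 4. cross-attn: image -> tokens
        return tokens, image

block = DecoderBlock()
tokens = torch.randn(1, 8, 256)       # e.g. prompt tokens plus output/IoU tokens
image = torch.randn(1, 64 * 64, 256)
tokens, image = block(tokens, image)
print(tokens.shape, image.shape)
```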
  11. The three components of SA: Model
     Training
     • The mask loss is a combination of focal loss [Lin+, ICCV2017] and dice loss [Milletari+, 3DV2016] (see the sketch below)
     • Prompts are sampled randomly during training
     [Slide shows Figure 4 from the paper: SAM overview]
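A minimal sketch of that combined mask loss on per-pixel logits. The 20:1 focal-to-dice weighting follows the ratio reported in the paper's appendix; the loss hyperparameters (alpha, gamma, eps) are common defaults, not values taken from this deck.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Focal loss [Lin+, ICCV2017] on per-pixel mask logits."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    """Dice loss [Milletari+, 3DV2016] on per-pixel mask logits."""
    p = torch.sigmoid(logits).flatten(1)
    t = target.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def mask_loss(logits, target, focal_weight=20.0, dice_weight=1.0):
    # linear combination of the two terms (20:1 ratio, as in the paper's appendix)
    return focal_weight * focal_loss(logits, target) + dice_weight * dice_loss(logits, target)

pred = torch.randn(2, 1, 64, 64)                # predicted low-resolution mask logits
gt = (torch.rand(2, 1, 64, 64) > 0.5).float()   # toy ground-truth masks
print(mask_loss(pred, gt))
```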
  12. The three components of SA: Data
     Data engine: how the dataset was built
     Model-in-the-loop: SAM itself is used for annotation to build the dataset
     1. Assisted-manual stage: model-assisted manual annotation
        - Annotators manually correct masks predicted by SAM (initially a SAM trained on other public datasets)
        - Once enough corrected data has accumulated, SAM is retrained on it
        - Rule: annotate as much as can be labeled within 30 seconds per image
     2. Semi-automatic stage: manual annotation plus automatic annotation by the model
        - Annotators additionally label objects beyond those the model already predicted
        - SAM is retrained with the newly added data
        - About 10.2 million masks had been collected by this point
     3. Fully-automatic stage: fully automatic annotation by the model, without annotators
        - The dataset is built from SAM's predictions alone (see the sketch below)
        - High-confidence predictions are kept and duplicates are removed with NMS
        - By the end of stage 2, SAM is accurate enough that its predictions can be used as-is
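A sketch of the fully-automatic style of annotation using the released `SamAutomaticMaskGenerator`: a regular grid of point prompts, confidence filtering, and NMS over duplicate masks. The checkpoint filename, image path, and threshold values shown are the repository defaults used here as illustrative assumptions, not the exact data-engine settings.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,            # regular grid of point prompts over the image
    pred_iou_thresh=0.88,          # keep only masks with a high predicted IoU (confidence)
    stability_score_thresh=0.95,   # keep only masks stable under threshold perturbation
    box_nms_thresh=0.7,            # remove duplicate masks with NMS on their bounding boxes
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts: 'segmentation', 'predicted_iou', 'stability_score', ...
print(len(masks), "masks generated automatically")
```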
  13. The three components of SA: Data
     New dataset: Segment Anything 1B (SA-1B)
     • Contains 11 million images and 1.1 billion masks
     • The number of masks per image is far larger than in existing datasets (about 100 masks per image on average); SA-1B has 11x more images and 400x more masks than the largest existing segmentation dataset, Open Images
     • Mask locations are also less biased; existing datasets are biased toward the image center
     • Loading the released annotations is sketched below
     [Slide shows Figures 2, 5, 6, and 7 and the geographic table from the paper: example images grouped by number of masks per image, image-size normalized mask center distributions, dataset mask properties, and the estimated geographic distribution of SA-1B images]
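A sketch of reading one SA-1B annotation file. It assumes the released per-image JSON layout (an "image" record plus an "annotations" list) with masks stored in COCO RLE format decodable via pycocotools; the filename and field names are assumptions based on that format description.

```python
import json
from pycocotools import mask as mask_utils  # pip install pycocotools

with open("sa_000000.json") as f:
    record = json.load(f)

print(record["image"]["width"], record["image"]["height"])
print(len(record["annotations"]), "masks in this image")   # ~100 masks per image on average

for ann in record["annotations"][:3]:
    m = mask_utils.decode(ann["segmentation"])  # (H, W) uint8 binary mask decoded from the RLE
    print(m.shape, m.sum(), ann.get("predicted_iou"))
```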
  14. Evaluation: mask prediction from a single point
     • Comparison of SAM and RITM [Sofiiuk+, ICIP2022] on 23 diverse datasets (mean IoU)
       - The circles show the "oracle" result: the most relevant of SAM's three predicted masks
       - SAM achieves higher IoU than RITM on 16 of the 23 datasets
     • Mask quality ratings by human annotators
       - Annotators rate SAM's mask quality substantially higher than RITM's
     → Applied zero-shot, SAM matches or outperforms prior models on many benchmarks (an oracle-IoU sketch follows below)
     [Slide shows Figures 8 and 9 from the paper: samples from the 23 evaluation datasets; (a) per-dataset IoU delta of SAM vs. RITM at one center point, (b) human mask-quality ratings, (c, d) mIoU with varying numbers of center/random points]
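A sketch of the "oracle" scoring idea used in this comparison: among SAM's (typically three) candidate masks for a point, keep the one with the highest IoU against ground truth. The toy masks below are random placeholders; in practice the candidates would come from `predictor.predict(..., multimask_output=True)`.

```python
import numpy as np

def iou(pred, gt):
    """IoU between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def oracle_iou(candidate_masks, gt):
    """Oracle score: the best IoU among SAM's candidate masks for one prompt."""
    return max(iou(m, gt) for m in candidate_masks)

# toy example with random masks standing in for SAM outputs and ground truth
rng = np.random.default_rng(0)
gt = rng.random((64, 64)) > 0.5
candidates = [rng.random((64, 64)) > 0.5 for _ in range(3)]
print(oracle_iou(candidates, gt))
```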
  15. Various works derived from SA
     • Analysis and extensions of SA: CLIP_Surgery [Li+, Apr.2023], Segment Anything Is Not Always Perfect [Ji+, Apr.2023], SAM Fails to Segment Anything? [Chen+, Apr.2023], PerSAM [Zhang+, May2023], Matcher [Liu+, May2023], Detect Any Shadow [Wang+, May2023], Segment Anything in High Quality [Ke+, Jun2023], Fast Segment Anything [Zhao+, Jun.2023], MobileSAM [Zhang+, July2023]
     • Methods derived from SA, by application:
       - Image inpainting: Inpaint Anything [Yu+, Apr2023]
       - Super-resolution: Segment Anything in Video Super-resolution [Lu+, May2023]
       - Remote sensing: RSPrompter [Chen+, Jun2023]
       - Medical image segmentation: SAM for Digital Pathology Surgery [Deng+, Apr.2023], Segment Anything in Medical Images [Ma+, Apr.2023], SAM for Medical Image Analysis [Mazurowski+, May2023]
       - Image matting: Matte Anything [Yao+, Jun2023], Matting Anything [Li+, Jun2023]
       - 3D data: Seal [Liu+, Jun2023], TomoSAM [Semerato+, Jun2023]
       - Robotics: Instruct2Act [Huang+, May2023]
       - Image quality assessment: SAM-IQA [Li+, Jul2023]
       - Bioinformatics: IAMSAM [Lee+, May2023]
       - Camouflaged object detection: SAMCOD [Tang+, Apr2023]
     ※ Reference: https://github.com/Hedlen/awesome-segment-anything
  16. References
     • Official site: https://segment-anything.com/
     • Paper: https://scontent.fhnd2-3.fna.fbcdn.net/v/t39.2365-6/10000000_900554171201033_1602411987825904100_n.pdf?_nc_cat=100&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=iMsE1fjDr4EAX__pmB2&_nc_ht=scontent.fhnd2-3.fna&oh=00_AfDUvhlaLmAdep94YXoayUUE9T_A1lAcXDNM8Si7T5M-jA&oe=648220A7
     • Code: https://github.com/facebookresearch/segment-anything
     • Dataset: https://segment-anything.com/dataset/index.html
     • Demo: https://segment-anything.com/demo