[IROS23] Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions

Transcript

  1. Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions

    Yui Iioka, Yu Yoshida, Yuiga Wada, Shumpei Hatanaka, and Komei Sugiura (Keio University, Japan)
  2. Background: Aiming for enhancement in human-robot interaction

    • Motivation
      – Understand human instructions ➢ Manipulate target objects
    • Useful
      – Pixel-wise segmentation masks ➢ Identify their shapes & locations
  3. Target Task: Object Segmentation from Manipulation Instructions (OSMI)

    • Difference from Referring Expression Segmentation
      – Need to identify the target object from long instruction sentences
    • "Go to the living room and fetch the pillow closest to the radio art on the wall."
    [Figure: Inputs, Output]

    Dataset                 Sentence length (avg.)
    G-Ref [Mao+, CVPR16]    8.4
    SHIMRIE (Ours)          18.8
  4. Proposed Method: Multimodal Diffusion Segmentation Model (MDSM)

    • Main Novelty
      – Refine coarse masks by the extended diffusion model
    • Generate more appropriate masks
      – w/o under- or over-segmentation
  5. Proposed Method: The structure of MDSM

    • 1st-stage: Coarse Mask Generation
    • 2nd-stage: Mask Refinement
    • Visual features extracted by denoising with DDPM [Ho+, NeurIPS20]
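
The slide cites DDPM [Ho+, NeurIPS20] without spelling out the diffusion process. As background only, here is a minimal sketch of the standard DDPM forward (noising) step, x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε, which a denoising network is trained to invert. The linear schedule, the number of steps, the 0-based step index, and all function names are assumptions chosen for illustration; they are not MDSM's settings or code.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * v + sigma * e for v, e in zip(x0, noise)]
    return x_t, noise

alpha_bar = cumulative_alphas(linear_beta_schedule())
rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # toy flattened "image"
x_early, _ = q_sample(x0, 10, alpha_bar, rng)     # small t: mostly signal
x_late, _ = q_sample(x0, 999, alpha_bar, rng)     # large t: mostly noise
```

A denoiser trained to predict `noise` from `x_t` and `t` has to encode the image content internally, which is one common reason intermediate activations of such networks are reused as visual features.
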
  6. Quantitative results: Outperformed the existing method

    Method                        mIoU [%]      oIoU [%]      P@0.5 [%]
    (i)   LAVT [Yang+, CVPR22]    24.27±3.15    22.25±2.85    21.27±5.66
    (ii)  Ours (w/o 2nd-stage)    33.03±5.51    30.25±4.91    32.76±5.28
    (iii) Ours                    36.15±5.95    33.18±5.12    36.63±6.92

    • +11.88 points in mIoU for (iii) over (i)
    • Train models on our SHIMRIE dataset (train : valid : test = 10153 : 856 : 362)
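
The slide reports mean IoU, overall IoU, and precision at an IoU threshold (P@k) without defining them. The sketch below uses the definitions common in referring expression segmentation work: mIoU averages per-sample IoU, oIoU divides the summed intersection by the summed union over all samples, and P@k is the share of samples whose IoU reaches k. Scoring an empty union as 0 and using `>=` for the threshold are assumptions, and the function names are hypothetical.

```python
def intersection_and_union(pred, target):
    """Count overlapping and combined foreground pixels of two binary masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union

def evaluate(preds, targets, thresholds=(0.5,)):
    """Return mIoU, oIoU and P@k in percent for lists of binary masks."""
    ious, total_inter, total_union = [], 0, 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        ious.append(inter / union if union else 0.0)  # assumption: empty union scores 0
        total_inter += inter
        total_union += union
    n = len(ious)
    results = {
        "mIoU": 100.0 * sum(ious) / n,
        "oIoU": 100.0 * total_inter / total_union if total_union else 0.0,
    }
    for k in thresholds:
        # assumption: a sample counts as correct when IoU >= k
        results[f"P@{k}"] = 100.0 * sum(iou >= k for iou in ious) / n
    return results

# Two toy 2x2 samples with IoU 1/2 and 3/4.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
targets = [[[1, 0], [0, 0]], [[1, 1], [1, 0]]]
scores = evaluate(preds, targets, thresholds=(0.5, 0.7))
```

Because oIoU pools pixels across samples, large objects weigh more heavily in it than in mIoU, which is why the two can differ on the same predictions.
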
  7. Qualitative results: Generate appropriate masks from complex sentences

    "Go to the laundry room and straighten the picture closest to the light switch."
    [Figure panel: Baseline]
  8. Qualitative results: Generate appropriate masks from complex sentences

    "Go to the lounge and remove the small brown chair facing the counter."
    [Figure panel: Baseline]
  9. Summary: MDSM

    • Background
      – Object manipulation from a human instruction
    • Our method: MDSM
      – 1st-stage: Introduce Multimodal Encoder
      – 2nd-stage: Refine the mask by the diffusion model
    • Result
      – Outperformed the existing method on the SHIMRIE dataset
      – Generate more appropriate masks from long sentences
  10. Appendix: Difference between OSMI and RES

    • Sentence
      – RES: "the candle on the right"
      – OSMI: "Go to the dining table. Then pick up the candle on the right."
    • Even the latest segmentation model, SEEM [Zou+, 23], has difficulty addressing OSMI tasks
      – e.g. "Pick up the plant in front of the mirror."
  11. Appendix: SHIMRIE dataset

    • REVERIE [Qi+, CVPR20]: Instructions, bboxes & real-world images
    • Matterport3D [Chang+, 3DV17]: Voxel-wise segmentation & simulator images
    • Some instructions
  12. Appendix: Ablation Study

                mIoU          oIoU          P@0.5         P@0.6         P@0.7         P@0.8        P@0.9
    ✔ ✔ ✔ ✔     34.09±4.14    31.57±3.07    35.52±6.04    27.29±4.93    16.35±3.16    6.35±1.22    1.82±1.33
    ✔ ✔ ✔       34.40±3.79    31.59±3.03    36.63±6.14    27.79±5.28    16.30±2.98    6.41±1.19    0.66±0.62
    ✔ ✔ ✔       33.68±4.37    30.61±4.17    35.80±6.73    26.46±4.66    15.69±3.88    6.08±1.58    0.50±0.41
    ✔ ✔ ✔       33.44±4.52    30.51±3.89    35.03±6.36    26.85±6.02    15.91±4.36    5.25±0.82    1.66±1.48
    ✔ ✔ ✔       32.54±4.97    29.97±4.14    35.30±6.72    26.24±6.26    13.15±3.77    3.26±1.95    0.50±0.41
  13. Appendix: Failure cases

    • "Empty the tissue box in the bathroom on level one."
    • "Visit the bathroom and bring me the picture nearest the toilet."
  14. Appendix: Error Analysis

    Errors   Description                                                                   #Error
    SC       Serious comprehension errors for handling visual and language information     11
    RE       Reference/Exophora resolution errors for linguistic information               31
    SEO      Segmentation of extra objects                                                 19
    OUS      Over- or under-segmentation                                                   16
    NSG      No segmentation in any region of the image                                    11
    SNI      Segmentation of non-target objects in the instruction                         6
    AE       Annotation errors in the ground-truth mask or instruction                     6
    Total    -                                                                             100
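
As a quick check on the error analysis, the counts from the slide can be tallied to confirm the total and to rank the categories by share. The variable names are hypothetical; the numbers are copied from the slide.

```python
# Error counts as listed on the slide (category code -> number of failure cases).
error_counts = {"SC": 11, "RE": 31, "SEO": 19, "OUS": 16, "NSG": 11, "SNI": 6, "AE": 6}

total = sum(error_counts.values())
shares = {name: 100.0 * count / total for name, count in error_counts.items()}
ranked = sorted(error_counts, key=error_counts.get, reverse=True)

for name in ranked:
    print(f"{name:<4}{error_counts[name]:>4}{shares[name]:>7.1f}%")
```

The tally matches the slide's total of 100, with reference/exophora resolution (RE) as the largest category.
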