[IROS23] Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions

Yui Iioka, Yu Yoshida, Yuiga Wada, Shumpei Hatanaka and Komei Sugiura
Keio University, Japan

Background: Aiming to enhance human-robot interaction
• Motivation
 – Understand human instructions ➢ Manipulate target objects
• Useful
 – Pixel-wise segmentation masks ➢ Identify their shapes & locations

Target Task: Object Segmentation from Manipulation Instructions (OSMI)
• Difference from Referring Expression Segmentation
 – Need to identify the target object from long instruction sentences
• Example instruction: "Go to the living room and fetch the pillow closest to the radio art on the wall."
• Figure labels: Inputs, Output

Dataset                 Sentence length (avg.)
G-Ref [Mao+, CVPR16]    8.4
SHIMRIE (Ours)          18.8
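To make the sentence-length comparison concrete, here is a minimal sketch of how an average instruction length could be computed. It assumes length is counted in whitespace-separated words; the slides do not state how the reported averages were measured, and the helper name `average_sentence_length` is hypothetical.

```python
def average_sentence_length(instructions):
    """Mean number of whitespace-separated words per instruction.

    Assumption: length is measured in words split on whitespace;
    the slides do not specify the tokenization behind their averages.
    """
    lengths = [len(text.split()) for text in instructions]
    return sum(lengths) / len(lengths)


# The example instruction quoted on the slide.
example = ("Go to the living room and fetch the pillow closest "
           "to the radio art on the wall.")
example_length = len(example.split())
print(example_length)  # 17 words under this counting rule
```

Under this counting rule, the quoted instruction alone is about twice the average G-Ref sentence length listed above.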

Proposed Method: Multimodal Diffusion Segmentation Model (MDSM)
• Main Novelty
 – Refine coarse masks by the extended diffusion model
• Generate more appropriate masks
 – w/o under- or over-segmentation

Proposed Method: The structure of MDSM
• 1st-stage: Coarse Mask Generation
• 2nd-stage: Mask Refinement

1st-stage: Coarse Mask Generation
• Visual features extracted by denoising with DDPM [Ho+, NeurIPS20]
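As background for the DDPM reference, the sketch below shows the closed-form forward (noising) step from DDPM, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, which a denoising network is trained to invert. This is not MDSM's implementation: the linear schedule values follow the DDPM paper's defaults, while the 8×8 toy mask, the timestep, and the function names are illustrative assumptions.

```python
import numpy as np


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (defaults follow the DDPM paper, not MDSM)."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(x0, t, alphas_cumprod, rng):
    """Draw x_t ~ q(x_t | x_0) using the closed-form DDPM forward process."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return x_t, noise


betas = linear_beta_schedule()
alphas_cumprod = np.cumprod(1.0 - betas)  # cumulative product: alpha-bar_t

# Hypothetical toy input: an 8x8 binary mask with a filled square.
rng = np.random.default_rng(0)
toy_mask = np.zeros((8, 8))
toy_mask[2:6, 2:6] = 1.0

t = 500  # illustrative timestep
x_t, noise = q_sample(toy_mask, t, alphas_cumprod, rng)

# If the noise were known exactly, x_0 could be recovered in closed form;
# a trained denoiser predicts this noise instead.
a_bar = alphas_cumprod[t]
x0_recovered = (x_t - np.sqrt(1.0 - a_bar) * noise) / np.sqrt(a_bar)
```

The last two lines illustrate why predicting the noise is enough to estimate the clean signal at a given timestep.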

Quantitative results: Outperformed the existing method
• Train models on our SHIMRIE dataset (train : valid : test = 10153 : 856 : 362)

[%]    Method                   mIoU          oIoU          P@0.5
(i)    LAVT [Yang+, CVPR22]     24.27±3.15    22.25±2.85    21.27±5.66
(ii)   Ours (w/o 2nd-stage)     33.03±5.51    30.25±4.91    32.76±5.28
(iii)  Ours                     36.15±5.95    33.18±5.12    36.63±6.92

• +11.88 points in mIoU: (iii) Ours vs. (i) LAVT
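The slides do not spell out how the three metrics are defined. The sketch below uses the definitions common in referring expression segmentation work, which is an assumption here: mIoU averages per-sample IoU, oIoU divides total intersection by total union over all samples, and P@0.5 is the share of samples whose IoU exceeds 0.5. The function name and the toy masks are hypothetical.

```python
import numpy as np


def segmentation_metrics(preds, targets, threshold=0.5):
    """Return mIoU, oIoU and precision@threshold in percent.

    Assumed definitions (common in the RES literature, not confirmed by the slides):
      mIoU: mean of per-sample IoU
      oIoU: sum of intersections / sum of unions over all samples
      P@k:  fraction of samples with IoU strictly greater than k
    """
    ious = []
    total_inter, total_union = 0, 0
    for pred, target in zip(preds, targets):
        pred = np.asarray(pred, dtype=bool)
        target = np.asarray(target, dtype=bool)
        inter = int(np.logical_and(pred, target).sum())
        union = int(np.logical_or(pred, target).sum())
        ious.append(inter / union if union > 0 else 0.0)
        total_inter += inter
        total_union += union
    ious = np.array(ious)
    return {
        "mIoU": 100.0 * float(ious.mean()),
        "oIoU": 100.0 * total_inter / total_union if total_union > 0 else 0.0,
        f"P@{threshold}": 100.0 * float((ious > threshold).mean()),
    }


# Toy 2x2 masks: per-sample IoU is 1/2 and 3/4.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
targets = [[[1, 0], [0, 0]], [[1, 1], [1, 0]]]
metrics = segmentation_metrics(preds, targets)
print(metrics)
```

On the toy masks, mIoU is 62.5 while oIoU is about 66.7, which shows how oIoU weights samples with larger regions more heavily than mIoU does.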

Qualitative results: Generate appropriate masks from complex sentences
"Go to the laundry room and straighten the picture closest to the light switch."
Figure label: Baseline

Qualitative results: Generate appropriate masks from complex sentences
"Go to the lounge and remove the small brown chair facing the counter."
Figure label: Baseline

Summary: MDSM
• Background
 – Object manipulation from a human instruction
• Our method: MDSM
 – 1st-stage: Introduce Multimodal Encoder
 – 2nd-stage: Refine the mask by the diffusion model
• Result
 – Outperformed the existing method on the SHIMRIE dataset
 – Generate more appropriate masks from long sentences

Appendix: Difference between OSMI and RES
• Sentence
 – RES: "the candle on the right"
 – OSMI: "Go to the dining table. Then pick up the candle on the right."
• Even the latest segmentation model, SEEM [Zou+, 23], has difficulty addressing OSMI tasks
 – e.g. "Pick up the plant in front of the mirror."

Appendix: SHIMRIE dataset
• REVERIE [Qi+, CVPR20]: Instructions, bboxes & real-world images
• Matterport3D [Chang+, 3DV17]: Voxel-wise segmentation & simulator images
• Some instructions

Appendix: PWAM [Yang+, CVPR22]

Appendix: Ablation Study

(✔)        mIoU          oIoU          P@0.5         P@0.6         P@0.7         P@0.8        P@0.9
✔ ✔ ✔ ✔    34.09±4.14    31.57±3.07    35.52±6.04    27.29±4.93    16.35±3.16    6.35±1.22    1.82±1.33
✔ ✔ ✔      34.40±3.79    31.59±3.03    36.63±6.14    27.79±5.28    16.30±2.98    6.41±1.19    0.66±0.62
✔ ✔ ✔      33.68±4.37    30.61±4.17    35.80±6.73    26.46±4.66    15.69±3.88    6.08±1.58    0.50±0.41
✔ ✔ ✔      33.44±4.52    30.51±3.89    35.03±6.36    26.85±6.02    15.91±4.36    5.25±0.82    1.66±1.48
✔ ✔ ✔      32.54±4.97    29.97±4.14    35.30±6.72    26.24±6.26    13.15±3.77    3.26±1.95    0.50±0.41

Appendix: Failure cases
• "Empty the tissue box in the bathroom on level one."
• "Visit the bathroom and bring me the picture nearest the toilet."

Appendix: Error Analysis

Errors   Description                                                                  #Error
SC       Serious comprehension errors for handling visual and language information    11
RE       Reference/Exophora resolution errors for linguistic information              31
SEO      Segmentation of extra objects                                                19
OUS      Over- or under-segmentation                                                  16
NSG      No segmentation in any region of the image                                   11
SNI      Segmentation of non-target objects in the instruction                        6
AE       Annotation errors in the ground-truth mask or instruction                    6
Total    -                                                                            100

Appendix: Comparison with the model w/o the diffusion step

Semantic Machine Intelligence Lab., Keio Univ.
