[IROS23] Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions

Slide 1

Slide 1 text

Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions
Keio University, Japan
Yui Iioka, Yu Yoshida, Yuiga Wada, Shumpei Hatanaka and Komei Sugiura

Slide 2

Slide 2 text

Background: Aiming to enhance human-robot interaction
• Motivation
  – Understand human instructions
    ➢ Manipulate target objects
• Useful
  – Pixel-wise segmentation masks
    ➢ Identify their shapes & locations

Slide 3

Slide 3 text

Target Task: Object Segmentation from Manipulation Instructions (OSMI)
Difference from Referring Expression Segmentation
  – Need to identify the target object from long instruction sentences
Example instruction: "Go to the living room and fetch the pillow closest to the radio art on the wall."
[Figure labels: Inputs, Output]

Dataset                 Sentence length (avg.)
G-Ref [Mao+, CVPR16]    8.4
SHIMRIE (Ours)          18.8
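The sentence-length comparison above can be illustrated with a minimal sketch. It assumes whitespace tokenization, which may differ from how the averages on the slide were computed, and the helper names `word_count` and `average_word_count` are hypothetical. It is not meant to reproduce the reported 8.4 or 18.8 figures.

```python
def word_count(sentence):
    """Count whitespace-separated tokens (assumed tokenization)."""
    return len(sentence.split())


def average_word_count(sentences):
    """Average number of whitespace-separated tokens per sentence."""
    if not sentences:
        raise ValueError("need at least one sentence")
    return sum(word_count(s) for s in sentences) / len(sentences)


# Example instruction quoted on the slide.
osmi_example = (
    "Go to the living room and fetch the pillow closest "
    "to the radio art on the wall."
)

if __name__ == "__main__":
    print(word_count(osmi_example))
    print(average_word_count([osmi_example]))
```

Running the same helper over a full set of instructions gives a per-dataset average under this tokenization assumption.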

Slide 4

Slide 4 text

Proposed Method: Multimodal Diffusion Segmentation Model (MDSM)
• Main Novelty
  – Refine coarse masks by the extended diffusion model
• Generate more appropriate masks
  – w/o under- or over-segmentation

Slide 5

Slide 5 text

Proposed Method: The structure of MDSM
1st-stage: Coarse Mask Generation
2nd-stage: Mask Refinement

Slide 6

Slide 6 text

Proposed Method: The structure of MDSM
1st-stage: Coarse Mask Generation
2nd-stage: Mask Refinement

Slide 7

Slide 7 text

Proposed Method: The structure of MDSM
1st-stage: Coarse Mask Generation
2nd-stage: Mask Refinement
Visual features extracted by denoising with DDPM [Ho+, NeurIPS20]
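For readers unfamiliar with DDPM, here is a minimal sketch of the standard forward (noising) process from [Ho+, NeurIPS20], applied to a toy mask. It is not the MDSM implementation: the function names, the tiny 2x2 mask, and the choice to noise a mask directly are illustrative assumptions. The linear schedule with 1000 steps and betas from 1e-4 to 0.02 follows the DDPM paper's defaults.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(mask, alpha_bar_t, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [
        [signal * value + noise * rng.gauss(0.0, 1.0) for value in row]
        for row in mask
    ]


if __name__ == "__main__":
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    toy_mask = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2x2 coarse mask
    noisy = add_noise(toy_mask, alpha_bars[499], random.Random(0))
    print(noisy)
```

A denoising network is trained to undo this corruption step by step. How MDSM conditions that network and extracts visual features from it is not specified in the slide text and is not modeled here.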

Slide 8

Slide 8 text

Quantitative results: Outperformed the existing method
• Train models on our SHIMRIE dataset (train : valid : test = 10153 : 856 : 362)

[%]
Method                       mIoU          oIoU          P@0.5
(i) LAVT [Yang+, CVPR22]     24.27±3.15    22.25±2.85    21.27±5.66
(ii) Ours (w/o 2nd-stage)    33.03±5.51    30.25±4.91    32.76±5.28
(iii) Ours                   36.15±5.95    33.18±5.12    36.63±6.92

+11.88 points in mIoU for (iii) over (i)
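The slide does not define the metrics, so the following is a minimal sketch of how mIoU, oIoU, and precision at an IoU threshold are commonly computed for binary masks in referring expression segmentation. Three conventions are assumptions here: masks are nested lists of 0/1 values, an empty union counts as IoU 0, and a sample counts as correct when its IoU is at or above the threshold. The function names are hypothetical.

```python
def iou_parts(pred, target):
    """Return (intersection, union) pixel counts for two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def evaluate(preds, targets, thresholds=(0.5,)):
    """Compute mIoU, oIoU and P@k over a list of mask pairs."""
    ious = []
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = iou_parts(pred, target)
        # Assumed convention: empty prediction and empty target give IoU 0.
        ious.append(inter / union if union else 0.0)
        total_inter += inter
        total_union += union
    n = len(ious)
    result = {
        # mIoU: mean of the per-sample IoU values.
        "mIoU": sum(ious) / n,
        # oIoU: pooled intersection over pooled union across all samples.
        "oIoU": total_inter / total_union if total_union else 0.0,
    }
    for k in thresholds:
        # P@k: fraction of samples whose IoU reaches the threshold k.
        result[f"P@{k}"] = sum(1 for iou in ious if iou >= k) / n
    return result


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    targets = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
    print(evaluate(preds, targets))
```

Because oIoU pools pixels across samples, large objects weigh more heavily in it than in mIoU, which is one reason the two numbers can differ on the same predictions.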

Slide 9

Slide 9 text

Qualitative results: Generate appropriate masks from complex sentences
"Go to the laundry room and straighten the picture closest to the light switch."
[Figure panel: Baseline]

Slide 10

Slide 10 text

Qualitative results: Generate appropriate masks from complex sentences
"Go to the lounge and remove the small brown chair facing the counter."
[Figure panel: Baseline]

Slide 11

Slide 11 text

Summary: MDSM
• Background
  – Object manipulation from a human instruction
• Our method: MDSM
  – 1st-stage: Introduce Multimodal Encoder
  – 2nd-stage: Refine the mask by the diffusion model
• Result
  – Outperformed the existing method on the SHIMRIE dataset
  – Generate more appropriate masks from long sentences

Slide 12

Slide 12 text

Appendix: Difference between OSMI and RES
• Sentence
  – RES: "the candle on the right"
  – OSMI: "Go to the dining table. Then pick up the candle on the right."
Even the latest segmentation model, SEEM [Zou+, 23], has difficulty with OSMI tasks
e.g. "Pick up the plant in front of the mirror."

Slide 13

Slide 13 text

Appendix: SHIMRIE dataset
REVERIE [Qi+, CVPR20]: Instructions, bboxes & real-world images
Matterport3D [Chang+, 3DV17]: Voxel-wise segmentation & simulator images
Some instructions

Slide 14

Slide 14 text

Appendix: PWAM [Yang+, CVPR22]

Slide 15

Slide 15 text

Appendix: Ablation Study

            mIoU          oIoU          P@0.5         P@0.6         P@0.7         P@0.8        P@0.9
✔ ✔ ✔ ✔     34.09±4.14    31.57±3.07    35.52±6.04    27.29±4.93    16.35±3.16    6.35±1.22    1.82±1.33
✔ ✔ ✔       34.40±3.79    31.59±3.03    36.63±6.14    27.79±5.28    16.30±2.98    6.41±1.19    0.66±0.62
✔ ✔ ✔       33.68±4.37    30.61±4.17    35.80±6.73    26.46±4.66    15.69±3.88    6.08±1.58    0.50±0.41
✔ ✔ ✔       33.44±4.52    30.51±3.89    35.03±6.36    26.85±6.02    15.91±4.36    5.25±0.82    1.66±1.48
✔ ✔ ✔       32.54±4.97    29.97±4.14    35.30±6.72    26.24±6.26    13.15±3.77    3.26±1.95    0.50±0.41

Slide 16

Slide 16 text

Appendix: Failure cases
• "Empty the tissue box in the bathroom on level one."
• "Visit the bathroom and bring me the picture nearest the toilet."

Slide 17

Slide 17 text

Appendix: Error Analysis

Errors   Description                                                                  #Error
SC       Serious comprehension errors for handling visual and language information    11
RE       Reference/Exophora resolution errors for linguistic information              31
SEO      Segmentation of extra objects                                                19
OUS      Over- or under-segmentation                                                  16
NSG      No segmentation in any region of the image                                   11
SNI      Segmentation of non-target objects in the instruction                        6
AE       Annotation errors in the ground-truth mask or instruction                    6
Total    -                                                                            100

Slide 18

Slide 18 text

Appendix: Compared to w/o diffusion step