[IROS23] Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions

Transcript

  1. Multimodal Diffusion Segmentation Model for
    Object Segmentation from Manipulation Instructions
    Keio University, Japan
    Yui Iioka, Yu Yoshida, Yuiga Wada, Shumpei Hatanaka and Komei Sugiura

  2. Background: Aiming for enhancement in human-robot interaction
    • Motivation
    – Understand human instructions
    ➢ Manipulate target objects
    • Useful
    – Pixel-wise segmentation masks
    ➢ Identify their shapes & locations

  3. Target Task: Object Segmentation from Manipulation Instructions (OSMI)
    Difference from Referring Expression Segmentation
    – Need to identify the target object from long instruction sentences
    "Go to the living room and fetch the pillow closest to the radio art on the wall."
    Inputs Output
    Dataset                 Sentence length (avg.)
    G-Ref [Mao+, CVPR16]    8.4
    SHIMRIE (Ours)          18.8
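
An average sentence length like the one in the table can be computed with a simple word count. This is a minimal sketch, assuming length is measured in whitespace-separated words (the slides do not state the tokenization). The function name `average_sentence_length` is hypothetical; the two sample sentences are quoted from these slides.

```python
def average_sentence_length(sentences):
    """Mean number of whitespace-separated tokens per sentence."""
    if not sentences:
        raise ValueError("need at least one sentence")
    return sum(len(s.split()) for s in sentences) / len(sentences)


# Instruction quoted on this slide, and a short RES-style phrase from the appendix.
osmi_example = ("Go to the living room and fetch the pillow closest to "
                "the radio art on the wall.")
res_example = "the candle on the right"

osmi_len = average_sentence_length([osmi_example])
res_len = average_sentence_length([res_example])
print(osmi_len, res_len)
```

Passing a full list of instructions instead of a single sentence gives the dataset-level average.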

  4. Proposed Method: Multimodal Diffusion Segmentation Model (MDSM)
    • Main Novelty
    – Refine coarse masks by the
    extended diffusion model
    • Generate more appropriate masks
    – w/o under- or over-segmentation
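
Under- and over-segmentation can be made concrete by counting pixels. The following is a minimal sketch, assuming binary masks stored as nested lists of 0/1. The helper name `segmentation_errors` and the toy masks are hypothetical and are not taken from the paper.

```python
def segmentation_errors(pred, target):
    """Return (over, under) pixel fractions relative to the target area.

    over:  predicted pixels outside the target (over-segmentation)
    under: target pixels the prediction missed (under-segmentation)
    """
    false_pos = false_neg = target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            target_area += t
            false_pos += 1 if p and not t else 0
            false_neg += 1 if t and not p else 0
    if target_area == 0:
        raise ValueError("target mask is empty")
    return false_pos / target_area, false_neg / target_area


target = [[0, 1, 1, 0],
          [0, 1, 1, 0]]
coarse = [[1, 1, 1, 0],   # spills one pixel to the left
          [0, 1, 0, 0]]   # misses one target pixel
refined = [[0, 1, 1, 0],
           [0, 1, 1, 0]]

coarse_over, coarse_under = segmentation_errors(coarse, target)
refined_over, refined_under = segmentation_errors(refined, target)
```

In this toy case the coarse mask has both kinds of error, while the refined mask has neither; a real refinement stage would be expected to reduce these fractions, not necessarily eliminate them.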

  5. Proposed Method: The structure of MDSM
    1st-stage: Coarse Mask Generation
    2nd-stage: Mask Refinement

  6. Proposed Method: The structure of MDSM
    1st-stage: Coarse Mask Generation
    2nd-stage: Mask Refinement

  7. Proposed Method: The structure of MDSM
    1st-stage: Coarse Mask Generation
    2nd-stage: Mask Refinement
    Visual features extracted by denoising with DDPM [Ho+, NeurIPS20]
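
For reference, DDPM [Ho+, NeurIPS20] defines a forward process that gradually adds Gaussian noise, and a network is trained to reverse it. The sketch below shows only the closed-form forward step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, applied to a flattened toy mask. It assumes the linear beta schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps); the slides do not state which schedule or step count MDSM uses, and all function names here are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) for a flat list of values x0 (t is 0-indexed)."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
mask = [1.0, 1.0, -1.0, -1.0]          # toy mask scaled to [-1, 1]
slightly_noisy = q_sample(mask, 0, alpha_bars, rng)
almost_noise = q_sample(mask, 999, alpha_bars, rng)
```

At the first step the sample stays close to the input, while at the last step almost no signal remains. The learned reverse (denoising) network is not shown here.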

  8. Quantitative results: Outperformed the existing method
    Method                      mIoU [%]      oIoU [%]      P@0.5 [%]
    (i) LAVT [Yang+, CVPR22]    24.27±3.15    22.25±2.85    21.27±5.66
    (ii) Ours (w/o 2nd-stage)   33.03±5.51    30.25±4.91    32.76±5.28
    (iii) Ours                  36.15±5.95    33.18±5.12    36.63±6.92
    mIoU, (iii) vs. (i): +11.88
    • Train models on our SHIMRIE dataset (train : valid : test = 10153:856:362)
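
In referring-segmentation work, the three metrics are commonly defined as follows: mIoU averages the per-sample IoU, oIoU divides the total intersection by the total union over all samples, and precision at an IoU threshold of 0.5 (P@0.5) is the share of samples whose IoU exceeds 0.5. The code below is a minimal sketch under those assumed definitions, which the slides do not spell out. The function names and toy masks are hypothetical.

```python
def intersection_and_union(pred, target):
    """Pixel counts of intersection and union for two flat 0/1 masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def segmentation_metrics(preds, targets, threshold=0.5):
    """Return (mIoU, oIoU, P@threshold) as fractions in [0, 1]."""
    ious, total_inter, total_union = [], 0, 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Two empty masks agree perfectly; treat their IoU as 1.
        ious.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    if not ious:
        raise ValueError("need at least one sample")
    miou = sum(ious) / len(ious)
    oiou = total_inter / total_union if total_union else 1.0
    precision = sum(iou > threshold for iou in ious) / len(ious)
    return miou, oiou, precision


targets = [[1, 1, 0, 0], [1, 1, 1, 1]]
preds = [[1, 0, 0, 0], [1, 1, 1, 0]]
miou, oiou, p_at_05 = segmentation_metrics(preds, targets)
```

mIoU weights every sample equally, whereas oIoU is dominated by large objects, which is why the two values can differ on the same predictions.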

  9. Qualitative results: Generate appropriate masks from complex sentences
    "Go to the laundry room and straighten the picture
    closest to the light switch."
    Baseline

  10. Qualitative results: Generate appropriate masks from complex sentences
    "Go to the lounge and remove the small brown
    chair facing the counter."
    Baseline

  11. Summary: MDSM
    • Background
    – Object manipulation from a human instruction
    • Our method: MDSM
    – 1st-stage: Introduce Multimodal Encoder
    – 2nd-stage: Refine the mask by the diffusion model
    • Result
    – Outperformed the existing method on the SHIMRIE dataset
    • Generate more appropriate masks from long sentences

  12. Appendix: Difference between OSMI and RES
    • Sentence
    – RES: "the candle on the right"
    – OSMI: "Go to the dining table. Then pick up the candle on the right."
    Even the latest segmentation model, SEEM [Zou+, 23],
    has difficulty addressing OSMI tasks,
    e.g. "Pick up the plant in front of the mirror."

  13. Appendix: SHIMRIE dataset
    REVERIE [Qi+, CVPR20]:
    Instructions, bboxes & real-world images
    Matterport3D [Chang+, 3DV17]:
    Voxel-wise segmentation & simulator images
    Some instructions

  14. Appendix: PWAM
    [Yang+, CVPR22]

  15. Appendix: Ablation Study
                 mIoU        oIoU        P@0.5       P@0.6       P@0.7       P@0.8      P@0.9
    ✔ ✔ ✔ ✔      34.09±4.14  31.57±3.07  35.52±6.04  27.29±4.93  16.35±3.16  6.35±1.22  1.82±1.33
    ✔ ✔ ✔        34.40±3.79  31.59±3.03  36.63±6.14  27.79±5.28  16.30±2.98  6.41±1.19  0.66±0.62
    ✔ ✔ ✔        33.68±4.37  30.61±4.17  35.80±6.73  26.46±4.66  15.69±3.88  6.08±1.58  0.50±0.41
    ✔ ✔ ✔        33.44±4.52  30.51±3.89  35.03±6.36  26.85±6.02  15.91±4.36  5.25±0.82  1.66±1.48
    ✔ ✔ ✔        32.54±4.97  29.97±4.14  35.30±6.72  26.24±6.26  13.15±3.77  3.26±1.95  0.50±0.41

  16. Appendix: Failure cases
    • "Empty the tissue box in the bathroom on level one."
    • "Visit the bathroom and bring me the picture nearest the toilet."

  17. Appendix: Error Analysis
    Errors  Description                                                                #Error
    SC      Serious comprehension errors for handling visual and language information  11
    RE      Reference/Exophora resolution errors for linguistic information            31
    SEO     Segmentation of extra objects                                              19
    OUS     Over- or under-segmentation                                                16
    NSG     No segmentation in any region of the image                                 11
    SNI     Segmentation of non-target objects in the instruction                      6
    AE      Annotation errors in the ground-truth mask or instruction                  6
    Total   -                                                                          100
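
The counts in the table can be checked and turned into shares with a few lines. This sketch only reuses the numbers from the table; the variable names are hypothetical.

```python
error_counts = {
    "SC": 11, "RE": 31, "SEO": 19, "OUS": 16, "NSG": 11, "SNI": 6, "AE": 6,
}

total = sum(error_counts.values())
shares = {name: count / total for name, count in error_counts.items()}
most_common = max(error_counts, key=error_counts.get)

# Print categories from most to least frequent.
for name, count in sorted(error_counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:>3}: {count:3d} ({shares[name]:.0%})")
```

The category counts sum to the stated total of 100, and reference/exophora resolution (RE) is the largest single category.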

  18. Appendix: Compared to w/o diffusion step