Slide 1

Slide 1 text

Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models
Takayuki Nishimura, Katsuyuki Kuyo, Motonari Kambara and Komei Sugiura
Keio University, Japan

Slide 2

Slide 2 text

Object Segmentation from Manipulation Instructions
Instruction: “Go to the living room and pick up the pillow closest to the radio art on the wall.”
[Figure labels: ×8, image, point clouds, instruction, segmentation mask]

Slide 3

Slide 3 text

Even SOTA foundation models struggle with our task
Instruction: “Walk to the living room and fetch me the leftmost pillow on the smaller white sofa, the pillow closest to the plant on the small table.”
[Figure: segmentation results, Ours vs. EVF-SAM-2 [Zhang+, 24]]

Slide 4

Slide 4 text

Proposed method: Polygon-based mask generation based on optimal transport
Main novelty: Polygon Matching Loss based on Optimal Transport
Existing methods: Polygon’s vertex order must be the same
[Figure: Predicted Mask vs. Ground Truth Mask, shown for Our method and for Existing methods]
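To make the idea of an order-free polygon loss concrete, the following is a minimal sketch, not the authors' implementation. It assumes polygons are given as arrays of normalized (x, y) vertices, uses squared Euclidean distance as the transport cost, and solves entropic optimal transport with a plain Sinkhorn loop. The function names, the regularization strength `eps`, and the uniform vertex weights are all illustrative assumptions.

```python
import numpy as np

def pairwise_sq_dist(a, b):
    """Squared Euclidean distance between every vertex of a and every vertex of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def sinkhorn_plan(cost, eps=0.05, n_iters=200):
    """Entropic OT plan between uniform weights (assumed) via Sinkhorn iterations."""
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)   # uniform mass on predicted vertices
    nu = np.full(m, 1.0 / m)   # uniform mass on ground-truth vertices
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = mu / (kernel @ v)
        v = nu / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

def ot_polygon_loss(pred, gt, eps=0.05, n_iters=200):
    """Transport cost between two vertex sets; does not depend on vertex order."""
    cost = pairwise_sq_dist(pred, gt)
    plan = sinkhorn_plan(cost, eps, n_iters)
    return float(np.sum(plan * cost))

def ordered_l2_loss(pred, gt):
    """Index-wise loss that assumes both polygons list vertices in the same order."""
    return float(np.mean(np.sum((pred - gt) ** 2, axis=-1)))

# Same square, but the predicted vertex list starts at a different corner.
gt = np.array([[0.2, 0.2], [0.8, 0.2], [0.8, 0.8], [0.2, 0.8]])
pred_shifted = np.roll(gt, 1, axis=0)

loss_ot = ot_polygon_loss(pred_shifted, gt)
loss_ordered = ordered_l2_loss(pred_shifted, gt)
print(f"OT loss: {loss_ot:.5f}, ordered loss: {loss_ordered:.5f}")
```

In this toy case the two polygons are geometrically identical, so the transport-based loss stays close to zero while the index-wise loss penalizes the shifted starting vertex. A training setup would need a differentiable version of the same computation in a deep learning framework.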

Slide 5

Slide 5 text

Quantitative results: Our method outperformed baselines in all metrics

Model                   mIoU↑ [%]     Prec@0.5↑ [%]   Prec@0.7↑ [%]
LAVT [Yang+, CVPR22]    28.16±2.85    26.46±4.01      18.75±3.29
SeqTR [Zhu+, ECCV22]    21.84±2.28    17.87±7.00      5.16±5.26
MDSM [Iioka+, IROS23]   24.36±3.87    22.49±5.46      13.71±3.34
Ours                    38.16±2.46    48.85±2.70      22.29±3.32

Gain over the best baseline (LAVT): +10.00 in mIoU, +22.39 in Prec@0.5
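For readers unfamiliar with the metrics, the sketch below shows one common way to compute them from binary masks. It is an illustration under stated assumptions, not the evaluation code behind the numbers above: mIoU is taken as the mean of per-sample IoU, Prec@k as the percentage of samples whose IoU exceeds k (strict inequality), and the helper names `mask_iou` and `evaluate` are hypothetical.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union of two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumption)
    return float(np.logical_and(pred, gt).sum() / union)

def evaluate(preds, gts, thresholds=(0.5, 0.7)):
    """Return mIoU and Prec@k in percent over a list of mask pairs."""
    ious = np.array([mask_iou(p, g) for p, g in zip(preds, gts)])
    report = {"mIoU": 100.0 * float(ious.mean())}
    for t in thresholds:
        report[f"Prec@{t}"] = 100.0 * float(np.mean(ious > t))
    return report

# Toy data: one ground-truth mask compared against three predictions.
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, :] = True                 # 8 target pixels

exact = gt.copy()                 # IoU = 8 / 8
partial = np.zeros_like(gt)
partial[0:2, 0:3] = True          # 6 pixels inside the target
partial[2, 0:2] = True            # 2 pixels outside, so IoU = 6 / 10
empty = np.zeros_like(gt)         # IoU = 0 / 8

report = evaluate([exact, partial, empty], [gt, gt, gt])
print(report)
```

With per-sample IoUs of 1.0, 0.6 and 0.0, two of the three samples count toward Prec@0.5 and only one toward Prec@0.7, which shows why Prec@0.7 is the stricter metric.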

Slide 6

Slide 6 text

Qualitative results: Our method could identify the target object and generate an appropriate mask
[Figure: Ours, LAVT, Ground Truth]
☺ Understood the target object.
☺ Generated an appropriate mask.

Slide 7

Slide 7 text

Please come to poster ThPI5T5
Segment target object from manipulation instructions by polygon matching using optimal transport