1. Ask robots to bring objects at home, e.g., “Bring me the towel from the sink”
2. Search for objects in unknown places, e.g., shopping malls, exhibition halls, etc.
Domestic service robots (DSRs)
https://www.toyota.com/usa/toyota-effect/romy-robot.html
User: “Where is… ?”
Indoor env.: images taken in advance
Instruction: “Go to the… and bring me the towel directly across from the…”
Model
Output
▪ Ranked image list of the target object (Rank: 1, Rank: 2, Rank: 3, ・・・, Rank: N)
▪ Images containing the desired object should be ranked highly
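The retrieval setting above can be pictured with a small sketch: score every pre-collected image against the instruction and sort by score. The embeddings, image ids, and the use of cosine similarity are all assumptions made for illustration; the slides do not state how the model computes its scores.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_images(instruction_vec, image_vecs):
    """Return image ids sorted from most to least similar to the instruction."""
    scores = {img_id: cosine(instruction_vec, vec)
              for img_id, vec in image_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy, made-up embeddings (hypothetical; real features would come from encoders).
instruction_vec = [0.9, 0.1, 0.0]
image_vecs = {
    "img_sink_towel": [0.8, 0.2, 0.1],
    "img_hallway": [0.1, 0.9, 0.0],
    "img_bedroom": [0.0, 0.2, 0.9],
}

ranking = rank_images(instruction_vec, image_vecs)
print(ranking)
```

In this toy case the image whose embedding points in nearly the same direction as the instruction comes first, which is the behavior the output slide asks for.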
▪ ICRA23]: Mobile manipulation based on image retrieval settings. Only focuses on a small-scale environment.
▪ RREx-BoT [Sigurdsson+, IROS23]: Vision-and-Language Navigation based on pre-exploration. Cannot handle object manipulation tasks.
▪ OVMM [Yenamandra+, CoRL23]: Open-vocabulary mobile manipulation task. SOTA method achieved a success rate (SR) ≈ 10%.
NLMap
vase underneath a black chandelier in the dining room”
Challenge: Complex instructions with referring expressions
▪ Avg. sentence length: 18.78 words > G-Ref: 8.4 words [Mao+, CVPR16]
[Figure: Ground Truth image; CLIP [Radford+, ICML21] Rank: 1 image (Incorrect)]
(1/2): Target Phrase Extractor & Crossmodal Noun Phrase Encoder
▪ Models the relationships between the objects and the phrases extracted from instructions that contain referring expressions
◼ Obtain the noun phrase of the target object using LLMs
◼ Align text and regional visual features
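As a rough illustration of aligning a target noun phrase with regional visual features, the sketch below scores one phrase feature against several region features and picks the best match. Dot-product matching, the function name `best_region`, and all vectors are assumptions; the slides only state that text and regional features are aligned, not how the encoder does it.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def best_region(phrase_vec, region_vecs):
    """Return (index, score) of the region feature that best matches the phrase."""
    scores = [dot(phrase_vec, r) for r in region_vecs]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]

# Hypothetical feature for a target phrase such as "the towel".
phrase_vec = [1.0, 0.0]
# Hypothetical features for three detected regions in one image.
region_vecs = [[0.2, 0.9], [0.7, 0.1], [0.4, 0.4]]

idx, score = best_region(phrase_vec, region_vecs)
print(idx, score)
```

A learned encoder would replace the fixed dot product with trained attention between phrase and region features; the sketch only shows the matching step in isolation.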
(2/2): Crossmodal Region Feature Encoder
▪ Models the relationships between the objects and their out-of-view context
◼ Visual features from the CLIP image encoder
◼ Panoramic images of the target object
Instruction: “Go to the hallway on level 1 that is lined with wine bottles and pull out the high chair closest to the wine bottles at the second table from the door”
[Figure: ranked retrieval results (Ground Truth, Rank: 2, Rank: 3, Rank: 4, Rank: 5, Rank: 6)]
Instruction: “Go to the bathroom with a picture of a wagon and bring me the towel directly across from the sink”
[Figure: ranked retrieval results (Ground Truth, Rank: 2, Rank: 3, Rank: 4, Rank: 5, Rank: 6)]
(e.g., DSRs)
Proposed Method: MultiRankIt
▪ Identify the target object via a learning-to-rank approach
Results
▪ Achieved a success rate of 80% in a standardized domestic environment
Our code is available!
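To make "learning-to-rank" concrete, here is a minimal sketch of one common listwise objective: a softmax cross-entropy over the candidate scores, with the ground-truth image as the positive. This is an assumed, generic formulation, not the loss used by MultiRankIt (the slides do not specify it); `listwise_loss` and the scores are hypothetical.

```python
import math

def listwise_loss(scores, positive_index):
    """Negative log-probability of the positive item under a softmax over scores."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[positive_index]

# The ground-truth image is candidate 0 in both cases.
good = listwise_loss([3.0, 0.5, 0.1], positive_index=0)  # positive scored highest
bad = listwise_loss([0.1, 0.5, 3.0], positive_index=0)   # positive scored lowest

print(good < bad)
```

Minimizing such a loss pushes the score of the correct image above the other candidates, which is what moves it toward Rank: 1 in the output list.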