Slide 1

Learning-To-Rank Approach for Identifying Everyday Objects Using a Physical-World Search Engine
Kanta Kaneda, Shunya Nagashima, Ryosuke Korekata, Motonari Kambara, and Komei Sugiura
Keio University

Slide 2

Background: Language-guided physical-world search engine
Use cases:
1. Asking robots to bring objects at home, e.g., “Bring me the towel from the sink”
2. Searching for objects in unknown places, e.g., shopping malls, exhibition halls, etc.
Domestic service robots (DSRs): https://www.toyota.com/usa/toyota-effect/romy-robot.html
User: “Where is… ?”

Slide 3

Problem Statement: Learning-to-rank physical objects task
Input:
■ Instruction, e.g., “Go to the… and bring me the towel directly across from the…”
■ Images taken in advance in an indoor environment
Output:
■ Ranked image list of the target object (Rank 1, 2, 3, …, N)
Images containing the desired object should be ranked highly
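
The task above reduces to scoring and sorting: a model assigns each pre-collected image a relevance score for the instruction and returns the images in descending score order. A minimal sketch using cosine similarity over placeholder embeddings (the toy vectors stand in for real instruction/image features and are not the paper's model):

```python
import numpy as np

def rank_images(instruction_emb: np.ndarray, image_embs: np.ndarray) -> list[int]:
    """Return image indices sorted by descending cosine similarity to the instruction."""
    a = instruction_emb / np.linalg.norm(instruction_emb)
    b = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = b @ a                      # one relevance score per image
    return list(np.argsort(-scores))   # Rank 1 first

# toy example: 3 candidate images in a 4-d embedding space
instruction = np.array([1.0, 0.0, 0.0, 0.0])
images = np.array([
    [0.1, 0.9, 0.0, 0.0],   # unrelated to the instruction
    [0.9, 0.1, 0.0, 0.0],   # closest to the instruction
    [0.5, 0.5, 0.0, 0.0],
])
print(rank_images(instruction, images))  # → [1, 2, 0]
```

The learning-to-rank formulation trains the scoring function itself, but the output interface (a ranked image list) is exactly this.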

Slide 4

Related Work: Most existing methods are non-search-based
■ NLMap [Chen+, ICRA23]: mobile manipulation based on an image-retrieval setting; focuses only on small-scale environments
■ RREx-BoT [Sigurdsson+, IROS23]: vision-and-language navigation based on pre-exploration; cannot handle object manipulation tasks
■ OVMM [Yenamandra+, CoRL23]: open-vocabulary mobile manipulation task; the SOTA method achieves SR ≈ 10%

Slide 5

Challenge: Complex instructions with referring expressions
■ E.g., “Please polish the black round table with a vase underneath a black chandelier in the dining room”
■ Average sentence length: 18.78 words, vs. 8.4 words in G-Ref [Mao+, CVPR16]
■ CLIP [Radford+, ICML21] ranks an incorrect image first for such instructions (vs. the ground truth)

Slide 6

Proposed Method: MultiRankIt
■ Identifies target objects via a learning-to-rank approach, producing a ranked image list (Rank 1, 2, 3, …, N)

Slide 7

Proposed Method (1/2): Target Phrase Extractor & Crossmodal Noun Phrase Encoder
■ Models the relationships between objects and phrases extracted from instructions that contain referring expressions
◼ Obtains the noun phrase of the target object using LLMs
◼ Aligns text and regional visual features
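
The slide says the target noun phrase is obtained with LLMs but does not show the prompt; the shape of such an extraction step might look like the following, where the prompt wording is a hypothetical illustration, not the paper's actual prompt:

```python
def build_phrase_extraction_prompt(instruction: str) -> str:
    """Hypothetical prompt template for extracting the target noun phrase;
    the paper's actual prompt is not shown on the slide."""
    return (
        "Extract the noun phrase that describes the object to be found or "
        "manipulated in the following instruction. "
        "Answer with the noun phrase only.\n"
        f"Instruction: {instruction}"
    )

prompt = build_phrase_extraction_prompt(
    "Go to the bathroom with a picture of a wagon and bring me the towel "
    "directly across from the sink"
)
# an LLM given this prompt would ideally answer with a phrase such as
# "the towel directly across from the sink"
```

The extracted phrase is then what gets aligned with regional visual features in the crossmodal noun phrase encoder.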

Slide 8

Proposed Method (2/2): Crossmodal Region Feature Encoder
■ Models the relationships between objects and their out-of-view context
◼ Visual features from the CLIP image encoder
◼ Panoramic images of the target object
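
One simple way to combine a region feature with its panoramic (out-of-view) context is attention-weighted pooling over the panorama features, concatenated with the region feature. This is a sketch of the idea only, not the paper's actual encoder architecture:

```python
import numpy as np

def fuse_region_and_context(region_feat: np.ndarray,
                            panorama_feats: np.ndarray) -> np.ndarray:
    """Concatenate a region feature with an attention-weighted summary of
    panoramic context features (illustrative sketch, not the paper's encoder)."""
    # attention logits from dot-product similarity between region and context
    logits = panorama_feats @ region_feat
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax over panorama views
    context = weights @ panorama_feats       # weighted context summary
    return np.concatenate([region_feat, context])

region = np.array([1.0, 0.0])                       # toy 2-d region feature
panorama = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])  # 3 panorama views
fused = fuse_region_and_context(region, panorama)
print(fused.shape)  # (4,)
```

The fused vector carries both what the object looks like up close and what surrounds it, which is what lets the model resolve expressions like “directly across from the sink”.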

Slide 9

Experimental Settings: Constructed a large-scale dataset of instructions and real-world images
■ Based on REVERIE [Qi+, CVPR20] and MP3D [Chang+, 3DV17]: building-scale environments
https://yuankaiqi.github.io/REVERIE_Challenge/static/img/demo.gif

# of environments: 58
# of objects: 4,352
Vocabulary size: 53,118
# of instructions: 5,501
Average sentence length: 18.78 words

Slide 10

Quantitative Results: Outperformed the baseline method on standard metrics
■ Evaluation metrics:
1. Mean reciprocal rank (MRR)
2. Recall@K (K = 1, 5, 10, 20) [%]

Method | MRR↑ | Recall@5↑ | Recall@10↑ | Recall@20↑
CLIP (ext.) [Radford+, ICML21] | 41.5±0.9 | 45.3±1.7 | 63.8±2.5 | 80.8±2.0
Ours | 50.1±0.8 | 52.2±1.4 | 69.8±1.5 | 83.8±0.6
Ours (ext.) | 56.3±1.3 | 58.7±1.1 | 77.7±1.1 | 90.0±0.5
Gain over CLIP (ext.): +14.8 | +13.4 | +13.9 | +9.2
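
Both metrics in the table are standard and easy to state in code. Given the 1-indexed rank of the ground-truth image for each query, they can be computed as follows (toy ranks, not the paper's data):

```python
def mean_reciprocal_rank(ranks: list[int]) -> float:
    """ranks: 1-indexed rank of the ground-truth item for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks: list[int], k: int) -> float:
    """Fraction of queries whose ground-truth item appears in the top K."""
    return sum(r <= k for r in ranks) / len(ranks)

ranks = [1, 3, 2, 8]                 # toy example: 4 queries
print(mean_reciprocal_rank(ranks))   # (1 + 1/3 + 1/2 + 1/8) / 4 ≈ 0.490
print(recall_at_k(ranks, 5))         # 3 of 4 queries in the top 5 → 0.75
```

MRR rewards placing the correct image near the top of the list, while Recall@K measures how often a user scanning the first K results would find it.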

Slide 11

Qualitative Results (1/2): Comprehended complex referring expressions
■ Instruction: “Go to the hallway on level 1 that is lined with wine bottles and pull out the high chair closest to the wine bottles at the second table from the door”
■ Top-ranked images (Rank 1–6) shown alongside the ground truth

Slide 13

Qualitative Results (2/2): Considered out-of-view context
■ Instruction: “Go to the bathroom with a picture of a wagon and bring me the towel directly across from the sink”
■ Top-ranked images (Rank 1–6) shown alongside the ground truth

Slide 15

Physical Experimental Settings: Zero-shot transfer
■ Replicated the standardized environment of the WRS 2020 Partner Robot Challenge
■ DSR: HSR by Toyota
■ Objects: YCB objects
■ Metric: SR [%]
(Video at 8× speed)

Slide 16

Procedure: Object retrieval based on open-vocabulary instructions
Step 1: Pre-exploration
Step 2: Input the instruction, e.g., “Could you bring me a green cup?”
Step 3: Select the image
Step 4: Grasping
(Videos at 8× speed)

Slide 17

Quantitative Results: Outperformed the baseline methods in terms of SR
☺ Achieved an SR of 80% despite the zero-shot transfer setting

Method | SR↑ [%]
CLIP (ext.) [Radford+, ICML21] | 60 (12/20)
NLMap (reprod.) [Chen+, ICRA23] | 70 (14/20)
Ours (ext.) | 80 (16/20)
Gain over NLMap: +10
(Video at 4× speed)

Slide 18

Conclusion
Background
■ Language-guided object search system for various applications (e.g., DSRs)
Proposed Method: MultiRankIt
■ Identifies the target object via a learning-to-rank approach
Results
■ Achieved a success rate of 80% in a standardized domestic environment
Our code is available!