[IROS24] Learning-To-Rank Approach for Identifying Everyday Objects Using a Physical-World Search Engine

Transcript

  1. Learning-To-Rank Approach for Identifying Everyday Objects Using a Physical-World Search Engine
     Kanta Kaneda, Shunya Nagashima, Ryosuke Korekata, Motonari Kambara, and Komei Sugiura
     Keio University
  2. Background: Language-guided physical-world search engine
     Use cases:
     1. Ask robots to bring objects at home, e.g., “Bring me the towel from the sink”
     2. Search for objects in unknown places, e.g., shopping malls, exhibition halls, etc.
     Domestic service robots (DSRs): https://www.toyota.com/usa/toyota-effect/romy-robot.html
     User: “Where is… ?”
  3. Problem Statement: Learning-to-rank physical objects task
     Input:
     ▪ Instruction, e.g., “Go to the… and bring me the towel directly across from the…”
     ▪ Images taken in advance in the indoor environment
     Output:
     ▪ Ranked image list of the target object (Rank 1, Rank 2, Rank 3, …, Rank N); images containing the desired object should be ranked highly
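     The task thus reduces to scoring each pre-collected image against the instruction and sorting. A minimal sketch of that input/output contract, assuming a hypothetical `score_fn(instruction, image_id)` callable (the slide specifies only the interface, not the model internals):

     ```python
     from dataclasses import dataclass

     @dataclass
     class RankedImage:
         image_id: str   # identifier of a pre-collected image
         score: float    # higher = more likely to contain the target object

     def rank_images(score_fn, instruction, image_ids):
         """Score every image taken in advance against the instruction and
         return them sorted so images containing the desired object come first."""
         ranked = [RankedImage(i, score_fn(instruction, i)) for i in image_ids]
         return sorted(ranked, key=lambda r: r.score, reverse=True)
     ```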
  4. Related Work: Most existing methods are non-search-based
     ▪ NLMap [Chen+, ICRA23]: mobile manipulation based on image retrieval settings; limitation: focuses only on a small-scale environment
     ▪ RREx-BoT [Sigurdsson+, IROS23]: vision-and-language navigation based on pre-exploration; limitation: cannot handle object manipulation tasks
     ▪ OVMM [Yenamandra+, CoRL23]: open-vocabulary mobile manipulation task; limitation: the SOTA method achieved SR ≈ 10%
  5. Challenge: Complex instructions with referring expressions
     ▪ E.g., “Please polish the black round table with a vase underneath a black chandelier in the dining room”
     ▪ Avg. sentence length: 18.78 words > G-Ref: 8.4 words [Mao+, CVPR16]
     ▪ Figure: CLIP [Radford+, ICML21] places an incorrect image at Rank 1 for this instruction (ground truth shown for comparison)
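     For context, the CLIP baseline on this slide amounts to ranking candidate images by a single global image-text similarity. A minimal sketch using the openai/CLIP package (the image file names are placeholders); one global embedding per image is exactly what struggles to resolve multi-clause referring expressions like the one above:

     ```python
     import clip          # https://github.com/openai/CLIP
     import torch
     from PIL import Image

     device = "cuda" if torch.cuda.is_available() else "cpu"
     model, preprocess = clip.load("ViT-B/32", device=device)

     instruction = ("Please polish the black round table with a vase "
                    "underneath a black chandelier in the dining room")
     image_paths = ["img_001.jpg", "img_002.jpg"]   # placeholder candidates

     with torch.no_grad():
         text_feat = model.encode_text(clip.tokenize([instruction]).to(device))
         text_feat /= text_feat.norm(dim=-1, keepdim=True)
         sims = []
         for path in image_paths:
             img = preprocess(Image.open(path)).unsqueeze(0).to(device)
             img_feat = model.encode_image(img)
             img_feat /= img_feat.norm(dim=-1, keepdim=True)
             sims.append((path, (text_feat @ img_feat.T).item()))

     # Rank candidates by a single global similarity score.
     ranking = sorted(sims, key=lambda s: s[1], reverse=True)
     ```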
  6. Proposed Method (1/2): Target Phrase Extractor & Crossmodal Noun Phrase Encoder
     ▪ Models the relationships between the objects and the phrases extracted from instructions that contain referring expressions (see the sketch below)
     ◼ Obtains the noun phrase of the target object using LLMs
     ◼ Aligns text and regional visual features
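     A minimal sketch of the target phrase extraction step, assuming `llm` is any text-completion callable; the slide says only that an LLM extracts the noun phrase, so the prompt wording below is an assumption, not the authors' prompt:

     ```python
     def extract_target_phrase(llm, instruction: str) -> str:
         """Ask an LLM for the noun phrase of the target object.
         The prompt here is an illustrative assumption."""
         prompt = (
             "Extract the noun phrase of the target object from the instruction.\n"
             f'Instruction: "{instruction}"\n'
             "Noun phrase:"
         )
         return llm(prompt).strip()

     # e.g., extract_target_phrase(llm, "Bring me the towel from the sink")
     # would ideally return "the towel".
     ```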
  7. Proposed Method (2/2): Crossmodal Region Feature Encoder
     ▪ Models the relationships between the objects and their out-of-view context (see the sketch below)
     ◼ Visual features from the CLIP image encoder
     ◼ Panoramic images of the target object
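     A minimal PyTorch sketch of the idea: fuse the CLIP embedding of the object region with the CLIP embedding of the surrounding panoramic image so that out-of-view context can inform the score. The concatenation-based fusion and layer sizes are illustrative assumptions, not the paper's exact architecture:

     ```python
     import torch
     import torch.nn as nn

     class RegionContextEncoder(nn.Module):
         """Illustrative fusion of region and panoramic CLIP features."""
         def __init__(self, clip_dim: int = 512, hidden: int = 256):
             super().__init__()
             self.fuse = nn.Sequential(
                 nn.Linear(2 * clip_dim, hidden),
                 nn.ReLU(),
                 nn.Linear(hidden, hidden),
             )

         def forward(self, region_feat: torch.Tensor, pano_feat: torch.Tensor) -> torch.Tensor:
             # region_feat, pano_feat: (batch, clip_dim) CLIP image embeddings
             return self.fuse(torch.cat([region_feat, pano_feat], dim=-1))
     ```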
  8. Experimental Settings: Constructed a large-scale dataset including instructions and real-world images
     ▪ Based on REVERIE [Qi+, CVPR20] and MP3D [Chang+, 3DV17] (building-scale environments)
       https://yuankaiqi.github.io/REVERIE_Challenge/static/img/demo.gif
     ▪ Dataset statistics:
       # of environments: 58
       # of objects: 4,352
       Vocabulary size: 53,118
       # of instructions: 5,501
       Average sentence length: 18.78 words
  9. Quantitative Results: Outperformed the baseline method on standard metrics
     ▪ Evaluation metrics (reference implementations below):
       1. Mean reciprocal rank (MRR)
       2. Recall@K (K = 1, 5, 10, 20) [%]

     Method                           MRR↑       Recall@5↑   Recall@10↑   Recall@20↑
     CLIP (ext.) [Radford+, ICML21]   41.5±0.9   45.3±1.7    63.8±2.5     80.8±2.0
     Ours                             50.1±0.8   52.2±1.4    69.8±1.5     83.8±0.6
     Ours (ext.)                      56.3±1.3   58.7±1.1    77.7±1.1     90.0±0.5
     Gains of Ours (ext.) over CLIP (ext.): +14.8 MRR, +13.4 Recall@5, +13.9 Recall@10, +9.2 Recall@20
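     Reference implementations of the two metrics, assuming each query comes with a ranked list of image IDs and a set of ground-truth IDs; Recall@K is reported in %, as in the table:

     ```python
     def mean_reciprocal_rank(rankings, relevant):
         """rankings: ranked image-id lists per query; relevant: ground-truth id sets."""
         rr = []
         for ranked, gold in zip(rankings, relevant):
             rank = next((i + 1 for i, img in enumerate(ranked) if img in gold), None)
             rr.append(1.0 / rank if rank else 0.0)
         return sum(rr) / len(rr)

     def recall_at_k(rankings, relevant, k):
         """Percentage of queries whose top-k list contains a ground-truth image."""
         hits = sum(any(img in gold for img in ranked[:k])
                    for ranked, gold in zip(rankings, relevant))
         return 100.0 * hits / len(rankings)
     ```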
  10. Qualitative Results (1/2): Comprehended complex referring expressions
      Instruction: “Go to the hallway on level 1 that is lined with wine bottles and pull out the high chair closest to the wine bottles at the second table from the door”
      Figure: top-ranked images (Rank 1 through Rank 6) with the ground truth indicated
  11. Qualitative Results (1/2), continued: same example as the previous slide
  12. Qualitative Results (2/2): Considered out-of-view context
      Instruction: “Go to the bathroom with a picture of a wagon and bring me the towel directly across from the sink”
      Figure: top-ranked images (Rank 1 through Rank 6) with the ground truth indicated
  13. Qualitative Results (2/2), continued: same example as the previous slide
  14. Physical Experimental Settings: Zero-shot transfer
      ▪ Replicated the standardized environment of the WRS 2020 Partner Robot Challenge
      ▪ DSR: HSR by Toyota
      ▪ Objects: YCB objects
      ▪ Metric: SR [%]
      (Demo video shown at 8x speed)
  15. Procedure: Object retrieval based on open-vocabulary instructions (see the sketch below)
      Step 1: Pre-exploration
      Step 2: Input the instruction, e.g., “Could you bring me a green cup?”
      Step 3: Select the image
      Step 4: Grasping
      (Demo videos shown at 8x speed)
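      A high-level sketch of how the four steps compose; every name here (`robot`, `ranker`, and their methods) is hypothetical glue around the robot stack, not the authors' API:

      ```python
      def object_retrieval(robot, ranker, instruction):
          # Step 1: pre-exploration collects images of the environment
          images = robot.pre_explore()
          # Steps 2-3: the instruction is given and the model selects the image
          ranking = ranker.rank(instruction, images)
          best = ranking[0]
          # Step 4: navigate to the selected image's location and grasp
          robot.navigate_to(best.pose)
          robot.grasp(best.object_region)

      # e.g., object_retrieval(hsr, multirankit, "Could you bring me a green cup?")
      ```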
  16. Quantitative Results: Outperformed the baseline methods in terms of SR
      ☺ Achieved an SR of 80% despite the zero-shot transfer setting

      Method                            SR↑ [%]
      CLIP (ext.) [Radford+, ICML21]    60 (12/20)
      NLMap (reprod.) [Chen+, ICRA23]   70 (14/20)
      Ours (ext.)                       80 (16/20)
      (+10 points over the strongest baseline; demo video shown at 4x speed)
  17. Conclusion
      Background
      ▪ Language-guided object search system for various applications (e.g., DSRs)
      Proposed Method: MultiRankIt
      ▪ Identifies the target object via a learning-to-rank approach
      Results
      ▪ Achieved a success rate of 80% in a standardized domestic environment
      Our code is available!