

[IRC24] Real-world Instance-specific Image Goal Navigation: Bridging Domain Gaps via Contrastive Learning

Improving instance-specific image goal navigation (InstanceImageNav), which involves locating an object in the real world that is identical to a query image, is essential for enabling robots to help users find desired objects. The challenge lies in the domain gap between the low-quality images observed by the moving robot, characterized by motion blur and low resolution, and the high-quality query images provided by the user. These domain gaps can significantly reduce the task success rate, yet previous work has not adequately addressed them. To tackle this issue, we propose a novel method: few-shot cross-quality instance-aware adaptation (CrossIA). This approach employs contrastive learning with an instance classifier to align features between a large set of low-quality images and a small set of high-quality images. We fine-tuned the SimSiam model, pre-trained on ImageNet, using CrossIA with instance labels based on a 3D semantic map. Additionally, our system integrates object image collection with a pre-trained deblurring model to enhance the quality of the observed images. Evaluated on an InstanceImageNav task with 20 different instance types, our method improved the task success rate by up to three-fold compared to a baseline based on SuperGlue. These findings highlight the potential of contrastive learning and image enhancement techniques in improving object localization in robotic applications. The project website is
CrossIA.

Shoichi Hasegawa

December 12, 2024

Transcript

  1. Real-world Instance-specific Image Goal Navigation: Bridging Domain Gaps via Contrastive

     Learning. Taichi Sakaguchi1, Akira Taniguchi1, Yoshinobu Hagiwara1,2, Lotfi El Hafi1, Shoichi Hasegawa1, Tadahiro Taniguchi1,3. Ritsumeikan Univ.1, Soka Univ.2, Kyoto Univ.3. Project Page. 2024 IEEE International Conference on Robotic Computing (IRC2024).
  2. Motivation In an environment, there are multiple instances of the same

     class. In such environments, the robot must distinguish between instances of the same class in order to find the same instance the user indicated. Same class but different instances.
  3. Challenge of the target task – Domain Gap • There is a domain

     gap between the given query image and the robot's observations because of motion blur. Bridging this gap for each instance is important to enhance task precision. • The robot needs to distinguish instances of the same class to find a specific instance. Idea: leverage contrastive learning to learn an instance-aware, domain-invariant feature representation.
  4. (Typical) Similar Navigation Tasks

     Instance-Specific Image Goal Navigation [1] — Input: an instance image. Task: finding the specific instance identical to the given query image.
     Object Goal Navigation [2] — Input: an object class name ("bed", "tv"). Task: finding any instance of the class. No specific object instance is handled.
     Vision-and-Language Navigation [3] — Input: a language instruction. Task: finding an object described by the instruction. The user knows the location of the object in advance.
     [1] J. Krantz, et al. "Instance-Specific Image Goal Navigation: Training Embodied Agents to Find Object Instances." arXiv preprint arXiv:2211.15876, 2022.
     [2] D. Batra, et al. "ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects." arXiv preprint arXiv:2006.13171, 2020.
     [3] P. Anderson, et al. "Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments." IEEE/CVF CVPR, 2018.
  5. Related work – Instance-Specific Image Goal Navigation

     Methods for InstanceImageNav:
     • Deep learning-based keypoint matching (Mod-IIN) [4] using SuperGlue [5]
     • Image-based contrastive learning (SimView) [6]
     Limitation of prior works: the target objects are limited to large objects such as "chair" and "bed". Small objects are avoided in image retrieval because their detected images are of low quality.
     [4] J. Krantz, et al. "Navigating to Objects Specified by Images." IEEE/CVF ICCV, 2023.
     [5] P. E. Sarlin, et al. "SuperGlue: Learning Feature Matching with Graph Neural Networks." IEEE/CVF CVPR, 2020.
     [6] T. Sakaguchi, et al. "Object Instance Retrieval in Assistive Robotics: Leveraging Fine-Tuned SimSiam with Multi-View Images Based on 3D Semantic Map." IEEE/RSJ IROS, 2024.
  6. Task Definition in this work • Searching for an object identical

     to a given query image (same as InstanceImageNav). • We focus on small everyday objects such as "book" and "toy". Input: a query image captured by the user's mobile phone. Output: the position of the object identical to the given query image.
  7. Proposal: Few-shot Cross-quality Instance-aware Adaptation (CrossIA). Latent space before fine-tuning

     via CrossIA vs. latent space after fine-tuning via CrossIA. Main idea: learning an instance-aware, domain-invariant feature representation by leveraging contrastive learning and a few high-quality images.
  8. Proposed system - Data Collection Module. Purpose of this module:

     • Removing motion blur
     • Building a 3D semantic map [7]
     • Building a database of object images
     Process for building the semantic map (figure).
     [7] A. Kanechika, et al. "Interactive Learning System for 3D Semantic Segmentation with Autonomous Robots." IEEE/SICE SII, 2024.
  9. Proposed system - Fine-tuning Module. Architecture overview of CrossIA. Purpose

     of this module: learning a domain-invariant feature representation from robot observations and few-shot high-quality images.
     Loss function of CrossIA: ℒ = Σ_{m=1}^{M} ℒ_m^robot + Σ_{n=1}^{N} ℒ_n^cross
     ℒ_n^cross is a term relating to the similarity between different-quality images of the same instance. ℒ_m^robot is a term relating to the similarity between different views of the same instance observed by the robot.
     Why does CrossIA learn a domain-invariant feature representation? ℒ_n^cross allows the model to learn similarities between the same instance across different domains.
     Linear classifiers are added to SimSiam to reduce the variation in feature vectors between images belonging to the same instance.
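As a rough illustration of the loss above (a minimal numpy sketch, not the authors' implementation; it ignores SimSiam's predictor head and stop-gradient, and the pairing scheme is assumed):

```python
import numpy as np

def neg_cos_sim(p, z):
    # SimSiam-style negative cosine similarity between two embeddings.
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(np.dot(p, z))

def crossia_loss(robot_pairs, cross_pairs):
    # L = sum_m L_m^robot + sum_n L_n^cross
    # robot_pairs: embedding pairs of two robot views of the same instance
    # cross_pairs: embedding pairs of a robot image and a high-quality image
    l_robot = sum(neg_cos_sim(p, z) for p, z in robot_pairs)
    l_cross = sum(neg_cos_sim(p, z) for p, z in cross_pairs)
    return l_robot + l_cross
```

Minimizing the cross term pulls a robot observation toward the high-quality image of the same instance, which is the mechanism the slide attributes to ℒ_n^cross.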
  10. Proposed system - Navigation Module. Purpose of this module: •

     Identifying the same instance as the given query image • Navigating to the target object. Third-person view / robot egocentric view. How to identify the same instance as the given query image: we use the cosine similarity between the query-image embedding and the embeddings of the object images the robot observed.
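The cosine-similarity retrieval step could be sketched as follows (a hypothetical helper, assuming embeddings are stacked row-wise in a numpy array):

```python
import numpy as np

def retrieve_instance(query_emb, db_embs):
    # Rank the robot's object-image embeddings by cosine similarity
    # to the query-image embedding; return the best-matching index.
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q
    return int(np.argmax(sims)), sims
```

The robot then navigates to the map position associated with the top-ranked object image.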
  11. Experiment - Overview. Purpose of the experiment: • To evaluate whether

     CrossIA can improve task precision • To evaluate the impact of few-shot learning with high-quality images for training. Task: identifying the same instance as the given query image. Target objects: 7 classes and 20 instances. Objects were captured from multiple angles at the same distance (32 images were collected for each instance) → 640 tasks executed (32 × 20).
  12. Experiment – Conditions. Conditions (instance identifier): • SuperGlue [5]

     (used in prior work [4]) • SimSiam (contrastive learning method [8], pre-trained with ScanNet [9] and ImageNet [10]) • SimView (contrastive learning method [6], fine-tuned with low-quality observation images) • CrossIA (ours, fine-tuned with low-quality images and a few high-quality images). We also applied a deblurring method [11] during environment exploration.
     [4] J. Krantz, et al. "Navigating to Objects Specified by Images." IEEE/CVF ICCV, 2023.
     [5] P. E. Sarlin, et al. "SuperGlue: Learning Feature Matching with Graph Neural Networks." IEEE/CVF CVPR, 2020.
     [6] T. Sakaguchi, et al. "Object Instance Retrieval in Assistive Robotics: Leveraging Fine-Tuned SimSiam with Multi-View Images Based on 3D Semantic Map." IEEE/RSJ IROS, 2024.
     [8] X. Chen, et al. "Exploring Simple Siamese Representation Learning." IEEE/CVF CVPR, 2021.
     [9] A. Dai, et al. "ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes." IEEE/CVF CVPR, 2017.
     [10] J. Deng, et al. "ImageNet: A Large-Scale Hierarchical Image Database." IEEE/CVF CVPR, 2009.
     [11] K. Kim, et al. "MSSNet: Multi-Scale-Stage Network for Single Image Deblurring." ECCV, 2022.
  13. Experiment – Metrics

     Success Rate (SR): the rate of retrieving the same object as the query image at Top-1.
     SR = (1/N) Σ_{n=1}^{N} s_n, where s_n = 1 if the same object is identified at Top-1, and 0 otherwise.
     Mean Rank (MR): the average rank at which the object identical to the query image is retrieved.
     MR = (1/N) Σ_{n=1}^{N} r_n, where r_n is the similarity ranking between the query image and the same object.
     Mean Reciprocal Rank (MRR): the rate at which objects matching the query image are identified with the highest rank.
     MRR = (1/N) Σ_{n=1}^{N} 1/k_n, where k_n is the rank at which the same object as the query image is retrieved.
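The three metrics above can be computed from the per-query retrieval ranks, for example (a small sketch assuming 1-indexed ranks):

```python
def sr_mr_mrr(ranks):
    # ranks[n] = rank (1-indexed) at which the correct instance was
    # retrieved for query n.
    n = len(ranks)
    sr = sum(1 for r in ranks if r == 1) / n   # Success Rate
    mr = sum(ranks) / n                        # Mean Rank
    mrr = sum(1.0 / r for r in ranks) / n      # Mean Reciprocal Rank
    return sr, mr, mrr
```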
  14. Experiment – Quantitative Result (Effectiveness of CrossIA)

     Methods          Deblurring  SR↑    MR↓   MRR↑
     SuperGlue [5]    --          0.275  2.73  0.365
     SuperGlue [5]    ✓           0.281  2.73  0.370
     SimSiam [8]      --          0.293  2.41  0.413
     SimSiam [8]      ✓           0.290  2.40  0.416
     SimView [6]      --          0.066  8.64  0.115
     SimView [6]      ✓           0.034  10.3  0.097
     CrossIA (ours)   ✓           0.751  1.24  0.820

     Suggestion from the result: CrossIA can improve task precision; this suggests that contrastive learning is more effective than deblurring.
     • ×3.00: CrossIA improves the success rate about 3x compared to SuperGlue, which is used in prior work [4].
     • ×2.70: even against SuperGlue with the deblurring module, CrossIA shows a 2.7x improvement in success rate.
     [4] J. Krantz, et al. "Navigating to Objects Specified by Images." IEEE/CVF ICCV, 2023.
     [5] P. E. Sarlin, et al. "SuperGlue: Learning Feature Matching with Graph Neural Networks." IEEE/CVF CVPR, 2020.
     [6] T. Sakaguchi, et al. "Object Instance Retrieval in Assistive Robotics: Leveraging Fine-Tuned SimSiam with Multi-View Images Based on 3D Semantic Map." IEEE/RSJ IROS, 2024.
     [8] X. Chen, et al. "Exploring Simple Siamese Representation Learning." IEEE/CVF CVPR, 2021.
  15. Experiment – Quantitative Result (Effect of Few-shot Learning)

     Purpose of the ablation study: to evaluate the impact of few-shot learning with high-quality images for training.
     Domain adaptation method combined with CrossIA: adversarial learning [12], which is often used in domain adaptation studies.
     Experiment: compared training conditions with fewer high-quality images. Values in parentheses are the results with adversarial learning added.

     Condition    SR↑            MR↓          MRR↑
     One-shot     0.421 (0.587)  1.82 (1.43)  0.548 (0.696)
     Three-shot   0.646 (0.690)  1.32 (1.31)  0.752 (0.763)
     Five-shot    0.751 (0.753)  1.24 (1.21)  0.812 (0.820)

     Influence of few-shot learning and adversarial learning on task precision: task precision degrades as high-quality images become scarce; combining CrossIA with adversarial learning can suppress this degradation.
     [12] Y. Ganin, et al. "Domain-Adversarial Training of Neural Networks." Journal of Machine Learning Research, Vol. 17, No. 59, pp. 1-35, 2016.
  16. Experiment – Qualitative Result. Changes in the latent space due to fine-tuning

     (visualization by principal component analysis): with SimSiam, images from different domains (robot observations vs. query images) remain separate; with CrossIA, images from different domains move closer to each other. Suggestion from the result: CrossIA can learn a domain-invariant feature representation, which may enhance task precision.
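The 2D latent-space plots can be produced with plain PCA; a minimal sketch via SVD on centered embeddings (assuming the embeddings are rows of a numpy array; the authors' exact plotting pipeline is not specified):

```python
import numpy as np

def pca_2d(embeddings):
    # Project feature vectors onto their first two principal
    # components (SVD of the centered data matrix).
    x = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T
```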
  17. Summary of this research. Task: InstanceImageNav, focusing on

     small everyday objects such as "book" and "toy". Challenge of this task: the domain gap between the given query image and the robot's observations. Proposal: CrossIA, which allows the model to learn an instance-aware, domain-invariant feature representation. Experimental result: CrossIA can improve task precision, likely because it learns instance-aware, domain-invariant feature representations.
  18. Future Works. Limitation: the method relies on the user

     providing several high-quality images for each object. As the number of search objects increases, so does the need for high-quality images, and it is impractical to obtain images of many objects. Solution: use a pre-trained diffusion model [13] to automatically generate high-quality images from the low-quality images captured by the robot.
     [13] X. Lin, et al. "DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior." ECCV, 2024.
     Project Page