

[IRC24] Real-world Instance-specific Image Goal Navigation: Bridging Domain Gaps via Contrastive Learning

Improving instance-specific image goal navigation (InstanceImageNav), which involves locating an object in the real world that is identical to a query image, is essential for enabling robots to help users find desired objects. The challenge lies in the domain gap between the low-quality images observed by the moving robot, characterized by motion blur and low resolution, and the high-quality query images provided by the user. These domain gaps can significantly reduce the task success rate, yet previous work has not adequately addressed them. To tackle this issue, we propose a novel method: few-shot cross-quality instance-aware adaptation (CrossIA). This approach employs contrastive learning with an instance classifier to align features between a large set of low-quality images and a small set of high-quality images. We fine-tuned the SimSiam model, pre-trained on ImageNet, using CrossIA with instance labels based on a 3D semantic map. Additionally, our system integrates object image collection with a pre-trained deblurring model to enhance the quality of the observed images. Evaluated on an InstanceImageNav task with 20 different instance types, our method improved the task success rate by up to three-fold compared to a baseline based on SuperGlue. These findings highlight the potential of contrastive learning and image enhancement techniques in improving object localization in robotic applications. The project website is
CrossIA.

Shoichi Hasegawa

December 12, 2024

Transcript

  1. Real-world Instance-specific Image Goal Navigation: Bridging Domain Gaps via Contrastive

     Learning. Taichi Sakaguchi1, Akira Taniguchi1, Yoshinobu Hagiwara1,2, Lotfi El Hafi1, Shoichi Hasegawa1, Tadahiro Taniguchi1,3. Ritsumeikan Univ.1, Soka Univ.2, Kyoto Univ.3. Project Page. 2024 IEEE International Conference on Robotic Computing (IRC2024).
  2. Motivation In an environment, there are multiple instances of the same

     class. In such environments, the robot must distinguish between instances of the same class in order to find the same instance the user indicated. Same class but different instances.
  3. Challenge of the target task – Domain Gap • There is a domain

     gap between the given query image and the robot's observations because of motion blur. Bridging this gap for each instance is important to enhance task precision. • The robot needs to distinguish instances of the same class to find a specific instance. Idea: leverage contrastive learning to learn an instance-aware, domain-invariant feature representation.
  4. (Typical) Similar Navigation Tasks

     Instance-Specific Image Goal Navigation [1] — Input: an instance image. Task: finding the specific instance identical to the given query image.
     Object Goal Navigation [2] — Input: an object class name ("bed", "tv"). Task: finding any instance of the class. No specific object instance is handled.
     Vision-and-Language Navigation [3] — Input: a language instruction. Task: finding an object described by the instruction. The user knows the location of the object in advance.
     [1] J. Krantz, et al. "Instance-Specific Image Goal Navigation: Training Embodied Agents to Find Object Instances." arXiv preprint arXiv:2211.15876, 2022.
     [2] D. Batra, et al. "ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects." arXiv preprint arXiv:2006.13171, 2020.
     [3] P. Anderson, et al. "Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments." IEEE/CVF CVPR, 2018.
  5. Related work – Instance-Specific Image Goal Navigation

     Methods for InstanceImageNav:
     • Deep learning-based keypoint matching (Mod-IIN) [4] using SuperGlue [5]
     • Image-based contrastive learning (SimView) [6]
     Limitation of prior works: the target objects are limited to large objects such as "chair" and "bed". Small objects are avoided in image retrieval because their detected images are of low quality.
     [4] J. Krantz, et al. "Navigating to Objects Specified by Images." IEEE/CVF ICCV, 2023.
     [5] P. E. Sarlin, et al. "SuperGlue: Learning Feature Matching with Graph Neural Networks." IEEE/CVF CVPR, 2020.
     [6] T. Sakaguchi, et al. "Object Instance Retrieval in Assistive Robotics: Leveraging Fine-Tuned SimSiam with Multi-View Images Based on 3D Semantic Map." IEEE/RSJ IROS, 2024.
  6. Task Definition in this work • Searching for an object identical

     to a given query image (same as InstanceImageNav). • We focus on small everyday objects such as "book" and "toy". Input: a query image captured by the user's mobile phone. Output: the position of the object identical to the given query image.
  7. Proposal: Few-shot Cross-quality Instance-aware Adaptation (CrossIA). Latent space before fine-tuning

     via CrossIA vs. latent space after fine-tuning via CrossIA. Main idea: learning an instance-aware, domain-invariant feature representation by leveraging contrastive learning and a few high-quality images.
  8. Proposed system - Data Collection Module. Purpose of this module:

     • Removing motion blur
     • Building a 3D semantic map [7]
     • Building a database of object images
     Process for building the semantic map (figure).
     [7] A. Kanechika, et al. "Interactive Learning System for 3D Semantic Segmentation with Autonomous Robots." IEEE/SICE SII, 2024.
  9. Proposed system - Fine-tuning Module. Architecture overview of CrossIA. Purpose

     of this module: learning a domain-invariant feature representation from robot observations and few-shot high-quality images.
     Loss function of CrossIA: ℒ = Σ_{m=1}^{M} ℒ_m^robot + Σ_{n=1}^{N} ℒ_n^cross
     ℒ_n^cross is a term relating to the similarity between different-quality images of the same instance. ℒ_m^robot is a term relating to the similarity between different views of the same instance observed by the robot.
     Why does CrossIA learn a domain-invariant feature representation? ℒ_n^cross allows the model to learn similarities between the same instance across different domains.
     Linear classifiers are added to SimSiam to reduce the variation in feature vectors between images belonging to the same instance.
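As a rough illustration of the loss above (a minimal numpy sketch, not the authors' implementation; it ignores SimSiam's predictor head and stop-gradient, and the pairing scheme is assumed):

```python
import numpy as np

def neg_cos_sim(p, z):
    # SimSiam-style negative cosine similarity between two embeddings.
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(np.dot(p, z))

def crossia_loss(robot_pairs, cross_pairs):
    # L = sum_m L_m^robot + sum_n L_n^cross
    # robot_pairs: embedding pairs of two robot views of the same instance
    # cross_pairs: embedding pairs of a robot image and a high-quality image
    l_robot = sum(neg_cos_sim(p, z) for p, z in robot_pairs)
    l_cross = sum(neg_cos_sim(p, z) for p, z in cross_pairs)
    return l_robot + l_cross
```

Minimizing the cross term pulls a robot observation toward the high-quality image of the same instance, which is the mechanism the slide attributes to ℒ_n^cross.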
  10. Proposed system - Navigation Module. Purpose of this module: •

     Identifying the same instance as the given query image • Navigating to the target object. Third-person view / robot egocentric view. How to identify the same instance as the given query image: we use the cosine similarity between the query-image embedding and the embeddings of the object images the robot observed.
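The cosine-similarity retrieval step could be sketched as follows (a hypothetical helper, assuming embeddings are stacked row-wise in a numpy array):

```python
import numpy as np

def retrieve_instance(query_emb, db_embs):
    # Rank the robot's object-image embeddings by cosine similarity
    # to the query-image embedding; return the best-matching index.
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q
    return int(np.argmax(sims)), sims
```

The robot then navigates to the map position associated with the top-ranked object image.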
  11. Experiment - Overview. Purpose of the experiment: • To evaluate whether

     CrossIA can improve task precision • To evaluate the impact of few-shot learning with high-quality images for training. Task: identifying the same instance as the given query image. Target objects: 7 classes and 20 instances. Objects were captured from multiple angles at the same distance (32 images were collected for each instance) → 640 tasks executed (32 × 20).
  12. Experiment – Conditions. Conditions (instance identifier): • SuperGlue [5]

     (used in prior work [4]) • SimSiam (contrastive learning method [8], pre-trained with ScanNet [9] and ImageNet [10]) • SimView (contrastive learning method [6], fine-tuned with low-quality observation images) • CrossIA (ours, fine-tuned with low-quality images and a few high-quality images). We also applied a deblurring method [11] during environment exploration.
     [4] J. Krantz, et al. "Navigating to Objects Specified by Images." IEEE/CVF ICCV, 2023.
     [5] P. E. Sarlin, et al. "SuperGlue: Learning Feature Matching with Graph Neural Networks." IEEE/CVF CVPR, 2020.
     [6] T. Sakaguchi, et al. "Object Instance Retrieval in Assistive Robotics: Leveraging Fine-Tuned SimSiam with Multi-View Images Based on 3D Semantic Map." IEEE/RSJ IROS, 2024.
     [8] X. Chen, et al. "Exploring Simple Siamese Representation Learning." IEEE/CVF CVPR, 2021.
     [9] A. Dai, et al. "ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes." IEEE/CVF CVPR, 2017.
     [10] J. Deng, et al. "ImageNet: A Large-Scale Hierarchical Image Database." IEEE/CVF CVPR, 2009.
     [11] K. Kim, et al. "MSSNet: Multi-Scale-Stage Network for Single Image Deblurring." ECCV, 2022.
  13. Experiment – Metrics

     Success Rate (SR): the rate of retrieving the same object as the query image at Top-1.
     SR = (1/N) Σ_{n=1}^{N} s_n, where s_n = 1 if the same object is identified at Top-1, and 0 otherwise.
     Mean Rank (MR): the average rank at which the object identical to the query image is retrieved.
     MR = (1/N) Σ_{n=1}^{N} r_n, where r_n is the similarity ranking between the query image and the same object.
     Mean Reciprocal Rank (MRR): the rate at which objects matching the query image are identified with the highest rank.
     MRR = (1/N) Σ_{n=1}^{N} 1/k_n, where k_n is the rank at which the same object as the query image is retrieved.
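The three metrics above can be computed from the per-query retrieval ranks, for example (a small sketch assuming 1-indexed ranks):

```python
def sr_mr_mrr(ranks):
    # ranks[n] = rank (1-indexed) at which the correct instance was
    # retrieved for query n.
    n = len(ranks)
    sr = sum(1 for r in ranks if r == 1) / n   # Success Rate
    mr = sum(ranks) / n                        # Mean Rank
    mrr = sum(1.0 / r for r in ranks) / n      # Mean Reciprocal Rank
    return sr, mr, mrr
```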
  14. Experiment – Quantitative Result (Effectiveness of CrossIA)

     Methods          Deblurring  SR↑    MR↓   MRR↑
     SuperGlue [5]    --          0.275  2.73  0.365
     SuperGlue [5]    ✓           0.281  2.73  0.370
     SimSiam [8]      --          0.293  2.41  0.413
     SimSiam [8]      ✓           0.290  2.40  0.416
     SimView [6]      --          0.066  8.64  0.115
     SimView [6]      ✓           0.034  10.3  0.097
     CrossIA (ours)   ✓           0.751  1.24  0.820

     Suggestion from the result: CrossIA can improve task precision; this suggests that contrastive learning is more effective than deblurring.
     • ×3.00: CrossIA improves the success rate about 3x compared to SuperGlue, which is used in prior work [4].
     • ×2.70: even against SuperGlue with the deblurring module, CrossIA shows a 2.7x improvement in success rate.
     [4] J. Krantz, et al. "Navigating to Objects Specified by Images." IEEE/CVF ICCV, 2023.
     [5] P. E. Sarlin, et al. "SuperGlue: Learning Feature Matching with Graph Neural Networks." IEEE/CVF CVPR, 2020.
     [6] T. Sakaguchi, et al. "Object Instance Retrieval in Assistive Robotics: Leveraging Fine-Tuned SimSiam with Multi-View Images Based on 3D Semantic Map." IEEE/RSJ IROS, 2024.
     [8] X. Chen, et al. "Exploring Simple Siamese Representation Learning." IEEE/CVF CVPR, 2021.
  15. Experiment – Quantitative Result (Effect of Few-shot Learning)

     Purpose of the ablation study: to evaluate the impact of few-shot learning with high-quality images for training.
     Domain adaptation method combined with CrossIA: adversarial learning [12], which is often used in domain adaptation studies.
     Experiment: compared training conditions with fewer high-quality images. Values in parentheses are the results with adversarial learning added.

     Condition    SR↑            MR↓          MRR↑
     One-shot     0.421 (0.587)  1.82 (1.43)  0.548 (0.696)
     Three-shot   0.646 (0.690)  1.32 (1.31)  0.752 (0.763)
     Five-shot    0.751 (0.753)  1.24 (1.21)  0.812 (0.820)

     Influence of few-shot learning and adversarial learning on task precision: task precision degrades as high-quality images become scarce; combining CrossIA with adversarial learning can suppress this degradation.
     [12] Y. Ganin, et al. "Domain-Adversarial Training of Neural Networks." Journal of Machine Learning Research, Vol. 17, No. 59, pp. 1-35, 2016.
  16. Experiment – Qualitative Result. Changes in the latent space due to fine-tuning

     (visualization by principal component analysis): with SimSiam, images from different domains (robot observations vs. query images) remain separate; with CrossIA, images from different domains move closer to each other. Suggestion from the result: CrossIA can learn a domain-invariant feature representation, which may enhance task precision.
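The 2D latent-space plots can be produced with plain PCA; a minimal sketch via SVD on centered embeddings (assuming the embeddings are rows of a numpy array; the authors' exact plotting pipeline is not specified):

```python
import numpy as np

def pca_2d(embeddings):
    # Project feature vectors onto their first two principal
    # components (SVD of the centered data matrix).
    x = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T
```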
  17. Summary of this research. Task: InstanceImageNav, focusing on

     small everyday objects such as "book" and "toy". Challenge of this task: the domain gap between the given query image and the robot's observations. Proposal: CrossIA, which allows the model to learn an instance-aware, domain-invariant feature representation. Experimental result: CrossIA can improve task precision, likely because it learns instance-aware, domain-invariant feature representations.
  18. Future Works. Limitation: the method relies on the user

     providing several high-quality images for each object. As the number of search objects increases, so does the need for high-quality images, and it is impractical to obtain images of many objects. Solution: use a pre-trained diffusion model [13] to automatically generate high-quality images from the low-quality images captured by the robot.
     [13] X. Lin, et al. "DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior." ECCV, 2024.
     Project Page