This is the presentation material for our 10th place solution to the Image Matching Challenge 2025 (IMC2025), presented at the Challenge Talks of the CVPR 2025 Workshop "Image Matching: Local Features & Beyond".
10th Place Solution in Image Matching Challenge 2025
- They All Look the Same... So How Do We Cluster Them? -

Image Matching: Local Features & Beyond (CVPR 2025 Workshop), June 11, 2025

Team Sony Matching:
Ryosuke Saito [email protected], Tomoya Okazaki [email protected], Takeru Endo [email protected], Suguru Kobayashi [email protected]
* Please note that this work is NOT related to the Sony Corporation's official activities.

Our solution is here: https://www.kaggle.com/competitions/image-matching-challenge-2025/discussion/582898

[Teaser figure: Can you find the correct match? One candidate is the correct pair (adjacent view); the other is from a different cluster.]
Task: A mixed dataset (e.g. ETs) blends images from multiple scenes (Scene1, Scene2, ...) and outliers into a single dataset.
• Estimate camera poses scene by scene
• Filter out outlier images
-> Accurate image pairing is the key challenge.
Visually Distinct vs. Visually Ambiguous (IMC2025)
• Visually distinct (e.g. ETs): easy to identify with global descriptor similarity (e.g. DINOv2)
• Visually ambiguous (e.g. fbk_vineyard): difficult to identify with global descriptor similarity
However, focusing on the details, some images fit together like puzzle pieces: the same grapevine shape, the same ground texture.
-> We use keypoint-based similarity to focus on fine geometric consistency.
Pipeline Overview
Key blocks for pair selection (focus of this talk):
1. Determine top-k for selecting image pairs (Dynamic Top-k)
2. Get the top-k image pairs ranked by keypoint-based similarity -> list of image pairs
Basic blocks for SfM (not covered in this talk):
3. Extract features with ALIKED & match with LightGlue (see the sketch below)
4. Geometric verification & camera pose estimation -> estimated camera poses (scene by scene)
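As a minimal sketch of step 3, assuming the public `lightglue` package (https://github.com/cvg/LightGlue); the keypoint budget, file names, and device handling are illustrative assumptions, not our exact competition settings:

```python
# Minimal sketch of step 3 (ALIKED features + LightGlue matching), using the
# public `lightglue` package. Keypoint count and file names are illustrative.
import torch
from lightglue import ALIKED, LightGlue
from lightglue.utils import load_image, rbd

device = "cuda" if torch.cuda.is_available() else "cpu"
extractor = ALIKED(max_num_keypoints=2048).eval().to(device)
matcher = LightGlue(features="aliked").eval().to(device)

image0 = load_image("pair_a.jpg").to(device)  # hypothetical file names
image1 = load_image("pair_b.jpg").to(device)

with torch.no_grad():
    feats0 = extractor.extract(image0)
    feats1 = extractor.extract(image1)
    matches01 = matcher({"image0": feats0, "image1": feats1})

# Remove the batch dimension and gather matched keypoint coordinates.
feats0, feats1, matches01 = rbd(feats0), rbd(feats1), rbd(matches01)
matches = matches01["matches"]                 # (K, 2) indices into kpts0/kpts1
kpts0 = feats0["keypoints"][matches[..., 0]]
kpts1 = feats1["keypoints"][matches[..., 1]]
print(f"{len(matches)} ALIKED+LightGlue matches")
```

The matched keypoints then feed the geometric verification and camera pose estimation of step 4.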
Dynamic Top-k: adjust top-k per dataset using visual similarity scores (based on DINOv2 patch features)
• Mean similarity score high (0.85~): quite hard to distinguish visually (e.g. fbk_vineyards) -> low top-k (e.g. 3), precision-oriented
• Mean similarity score mid-low (~0.75): relatively easy to distinguish visually (e.g. et & another_et) -> normal top-k (e.g. 30), recall-oriented
• top-k is capped at n_images // 3 and computed over all image pairs (with multiple scenes)
Points:
• Dynamically adjust top-k according to the characteristics of each dataset.
• Visually ambiguous scenes like "fbk_vineyard" are easy to mix up into wrong clusters, so we reduce top-k (e.g. k=3) to prioritize precision.
• In other scenes, we increase top-k to keep recall.
• DINOv2 patch features are used only to judge dataset characteristics (i.e., to adjust the top-k number), not for the actual pair selection.
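A minimal sketch of the dynamic top-k rule, assuming DINOv2 loaded via torch.hub; the thresholds (0.85), k values (3 / 30), and the n_images // 3 cap follow the slide, while the preprocessing and mean-pooling of patch tokens are our assumptions:

```python
# Minimal sketch of Dynamic Top-k: score a dataset's visual ambiguity with
# DINOv2 patch features, then pick k. Thresholds/k values follow the slide;
# preprocessing and pooling details are illustrative assumptions.
import torch
import torch.nn.functional as F

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def global_feature(image: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W), ImageNet-normalized, H and W multiples of 14."""
    tokens = model.forward_features(image[None])["x_norm_patchtokens"]  # (1, N, D)
    return F.normalize(tokens.mean(dim=1).squeeze(0), dim=-1)  # mean-pooled patch feature

def dynamic_top_k(images: list) -> int:
    feats = torch.stack([global_feature(im) for im in images])  # (n, D)
    sim = feats @ feats.T                                       # cosine similarities
    n = len(images)
    mean_sim = sim[~torch.eye(n, dtype=torch.bool)].mean().item()
    if mean_sim >= 0.85:   # visually ambiguous (e.g. fbk_vineyard)
        k = 3              # precision-oriented
    else:                  # visually distinct (e.g. et & another_et)
        k = 30             # recall-oriented
    return min(k, n // 3)  # cap at n_images // 3
```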
(Comparison) Examining the top-4 images ranked by global descriptor similarity (DINOv2) on fbk_vineyard
Global descriptor similarity can lead to incorrect matches: among the retrieved Top-1 to Top-4 images, some overlap with the reference while others have no overlap at all.
(Ours) Examining the top-4 images ranked by keypoint-based similarity (KeyNet-AdaLAM)
"fbk_vineyard" is a mix of three different scenes -> vineyard_split1 / vineyard_split2 / vineyard_split3
Results of camera pose estimation (◆: GT, ◆: Prediction): the dataset is split into three correct 3D models (Scene1/split1, Scene2/split2, Scene3/split3), though opposite-view pairs are missed.
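A minimal sketch of a keypoint-based pair similarity in the spirit of the slide's KeyNet-AdaLAM ranking, assuming kornia's KeyNetAffNetHardNet detector and match_adalam; using the raw AdaLAM match count as the pair score is our assumption, and the exact competition scoring may differ:

```python
# Minimal sketch of a keypoint-based pair similarity (KeyNet + AdaLAM) with
# kornia. Scoring a pair by its AdaLAM match count is an illustrative choice:
# geometrically consistent local matches rank "puzzle-piece" pairs highly
# even when global descriptors look identical.
import torch
import kornia.feature as KF

detector = KF.KeyNetAffNetHardNet(num_features=1024).eval()

@torch.no_grad()
def keypoint_similarity(img0: torch.Tensor, img1: torch.Tensor) -> int:
    """img*: (1, 1, H, W) grayscale tensors in [0, 1]. Returns #AdaLAM matches."""
    lafs0, _, descs0 = detector(img0)
    lafs1, _, descs1 = detector(img1)
    dists, idxs = KF.match_adalam(
        descs0[0], descs1[0], lafs0, lafs1,
        hw1=torch.tensor(img0.shape[2:]), hw2=torch.tensor(img1.shape[2:]),
    )
    return idxs.shape[0]  # more consistent matches -> higher pair similarity

# Rank each image's candidate pairs by this score and keep the dynamic top-k.
```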
Final Pipeline: our pipeline w/ VGGT tracker [1]
The VGGT tracker provides strong matching capabilities even for opposite-view image pairs (e.g. grapevines seen from both sides).
Result on fbk_vineyard split3 (w/ opposite-view pairs), ◆: GT, ◆: Prediction:
• w/o VGGT tracker: opposite-view pairs are misclassified as different scenes -> incorrectly divided into two models
• w/ VGGT tracker: opposite-view pairs are recognized as the same scene -> integrated into a single model
[1] J. Wang et al., VGGT: Visual Geometry Grounded Transformer, CVPR, 2025.
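A minimal sketch of querying the VGGT tracker [1] on an opposite-view pair, following the interface published in the VGGT repository (https://github.com/facebookresearch/vggt); the file names and query points are hypothetical, and how tracks are folded into our pair selection is our own step:

```python
# Minimal sketch of VGGT point tracking across an image pair, following the
# public VGGT repo's interface (an assumption; check the repo for the current
# API). File names and query points below are hypothetical.
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device).eval()

# An opposite-view pair that 2-view matchers and global descriptors tend to miss.
images = load_and_preprocess_images(["view_front.jpg", "view_back.jpg"]).to(device)

with torch.no_grad():
    images = images[None]  # add batch dimension
    aggregated_tokens_list, ps_idx = model.aggregator(images)
    # Track hypothetical query points from the first frame into the second.
    query_points = torch.tensor([[100.0, 200.0], [350.5, 120.0]], device=device)
    track_list, vis_score, conf_score = model.track_head(
        aggregated_tokens_list, images, ps_idx, query_points=query_points[None]
    )
# High-confidence tracks give correspondences usable as matches for SfM,
# linking opposite views into the same scene.
```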
Takeaways
• Global descriptors alone struggle to handle visually ambiguous datasets correctly.
• To focus on fine geometric consistency, we use a keypoint-based similarity score.
• Careful pair selection improves reconstruction results on ambiguous datasets.
• VGGT offers promising results for future pipelines, even with complex cases like opposite views.