
[CVPR 2025 Workshop] IMC2025 10th place solution

Muku
June 13, 2025


This is the presentation material for our 10th place solution in the IMC2025 Challenge Talks, held at the CVPR 2025 Workshop "Image Matching: Local Features & Beyond".

Workshop: https://image-matching-workshop.github.io/


Transcript

  1. 10th Place Solution in Image Matching Challenge 2025
     - They All Look the Same... So How Do We Cluster Them? -
     Image Matching: Local Features & Beyond (CVPR 2025 Workshop)
     Team Sony Matching: Ryosuke Saito [email protected], Tomoya Okazaki [email protected], Takeru Endo [email protected], Suguru Kobayashi [email protected]
     * Please note that this work is NOT related to the Sony Corporation's official activities.
     June 11, 2025
     Our solution is here: https://www.kaggle.com/competitions/image-matching-challenge-2025/discussion/582898
     Can you find the correct match? ->
  2. (Title slide, answer revealed) Can you find the correct match? -> The correct pair is the adjacent one; the other candidate image comes from a different cluster.
  3. Main Challenge in IMC2025
     • Handle multiple scenes mixed in a single dataset (e.g. ETs): Scene1, Scene2, ..., plus outliers
     • Estimate camera poses scene by scene
     • Filter out outlier images
     Accurate image pairing is the key challenge.
  4. New Dataset Characteristics & Key Ideas
     New dataset characteristics in IMC2025:
     • ETs: visually distinct, so easy to identify with global descriptor similarity (e.g. DINOv2)
     • fbk_vineyard: visually ambiguous, so difficult to identify with global descriptor similarity
     However, focusing on the details (same grapevine shape, same ground texture), some images fit together like puzzle pieces!
     We use keypoint-based similarity to focus on fine geometric consistency.
  5. Pipeline Overview
     Input: all image pairs (from a dataset with multiple scenes)
     Key blocks for pair selection (focus of this talk):
     1. Determine top-k for selecting image pairs (Dynamic Top-k)
     2. Get top-k image pairs ranked by keypoint-based similarity
     Basic blocks for SfM (not covered in this talk):
     3. Extract features with ALIKED & match with LightGlue
     4. Geometric verification & camera pose estimation
     Output: estimated camera poses (scene by scene)
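The four stages above can be sketched as a simple orchestrator. This is a hypothetical outline for illustration only: all function names and the toy logic inside them are placeholders, not the authors' actual implementation.

```python
# Hypothetical sketch of the four-stage pipeline; placeholder logic only.

def determine_top_k(images):
    # Step 1: dynamic top-k (detailed on the next slide); here a toy rule.
    return max(3, len(images) // 3)

def select_top_k_pairs(images, k):
    # Step 2: keypoint-based ranking (detailed on slide 7); here we just
    # enumerate candidate pairs and truncate to roughly k per image.
    pairs = [(i, j) for i in range(len(images)) for j in range(i + 1, len(images))]
    return pairs[: k * len(images)]

def match_and_reconstruct(pairs):
    # Steps 3-4: ALIKED + LightGlue matching, geometric verification, and
    # scene-by-scene SfM (all stubbed out in this sketch).
    return {"num_pairs": len(pairs)}

def run_pipeline(images):
    k = determine_top_k(images)
    pairs = select_top_k_pairs(images, k)
    return match_and_reconstruct(pairs)
```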
  6. 1. Determine Top-k for Selecting Image Pairs (Dynamic Top-k)
     Calculate visual similarity scores over all image pairs (based on DINOv2 patch features) and take the mean similarity score per dataset:
     • Quite hard to distinguish visually (e.g. fbk_vineyard): sim. score high (0.85~) -> low top-k (e.g. 3), precision-oriented
     • Relatively easy to distinguish visually (e.g. et & another_et): sim. score mid-low (~0.75) -> normal top-k (e.g. 30), recall-oriented
     (top-k max: n_images // 3)
     Points:
     • Dynamically adjust top-k according to the characteristics of each dataset.
     • Visually ambiguous scenes like fbk_vineyard easily mix up clusters, so we reduce top-k (e.g., k=3) to prioritize precision.
     • In other scenes, we increase top-k to keep recall.
     • DINOv2 patch features are used only to judge dataset characteristics (i.e., to adjust the top-k number), not for the actual pair selection.
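The dynamic top-k rule above can be sketched as follows. This is a minimal illustration, not the authors' code: the DINOv2 feature extraction is assumed to have already produced L2-normalized global descriptors, the function name is hypothetical, and the thresholds and k values simply follow the numbers on the slide.

```python
import numpy as np

def dynamic_top_k(global_descs, ambiguous_thresh=0.85, low_k=3, normal_k=30):
    """global_descs: (n_images, d) L2-normalized descriptors (e.g. DINOv2)."""
    n = len(global_descs)
    sims = global_descs @ global_descs.T              # cosine similarity matrix
    mean_sim = sims[np.triu_indices(n, k=1)].mean()   # mean over all pairs
    if mean_sim >= ambiguous_thresh:
        k = low_k        # visually ambiguous -> precision-oriented
    else:
        k = normal_k     # visually distinct -> recall-oriented
    return min(k, n // 3)  # cap at n_images // 3, as on the slide
```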
  7. 2. Get Top-k Image Pairs Ranked by Keypoint-based Similarity
     Flow: all image pairs -> Key.Net (w/ AffNet, OriNet) detects keypoints -> HardNet8v2 describes the detected keypoints -> NN matching + AdaLAM (match keypoints & filter outliers) -> calculate the keypoint-based similarity score from num_inlier and mean_distances:
         sim_score = 1.0 - mean_distances / num_inlier
     (Higher score -> more similar.) Then keep the top-k pairs (k from Step 1) ranked by sim_score.
     Points:
     • Use KeyNet-AdaLAM to define a keypoint-based similarity score, designed to reward a high inlier count and close descriptor distances.
     • Why KeyNet-AdaLAM?
       • Robust to scale & rotation (AffNet + OriNet)
       • Fast matching over all image pairs
       • Higher precision than alternatives in our custom scoring tests
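The scoring and ranking step can be written down directly from the formula on the slide. This sketch assumes the KeyNet-AdaLAM matching has already produced `num_inlier` and `mean_distances` for each pair; the function names and the -inf handling for pairs with zero inliers are illustrative assumptions, not part of the original solution.

```python
def keypoint_similarity(num_inlier, mean_distances):
    """sim_score = 1.0 - mean_distances / num_inlier (higher -> more similar)."""
    if num_inlier == 0:
        return float("-inf")  # no geometric support at all (assumed handling)
    return 1.0 - mean_distances / num_inlier

def top_k_pairs(pair_stats, k):
    """pair_stats: {(i, j): (num_inlier, mean_distances)} -> k best pairs."""
    scored = sorted(
        pair_stats.items(),
        key=lambda item: keypoint_similarity(*item[1]),
        reverse=True,
    )
    return [pair for pair, _ in scored[:k]]
```

A pair with many inliers and small descriptor distances scores close to 1.0, while a weakly supported pair is pushed down the ranking, which is exactly the precision-oriented behavior the dynamic top-k step relies on.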
  8. Results | Selecting Image Pairs (Comparison)
     Examining the top-4 images ranked by global descriptor similarity (DINOv2) against a reference image from fbk_vineyard: only some of the retrieved images actually overlap the reference; the rest have no overlap.
     -> Selecting image pairs based on global descriptor similarity can lead to incorrect matches.
  9. Results | Selecting Image Pairs (Ours)
     Examining the top-4 images ranked by keypoint-based similarity (KeyNet-AdaLAM) against the same fbk_vineyard reference: all four retrieved images overlap the reference.
     -> By using keypoint-based similarity, we can identify adjacent image pairs more correctly.
  10. Results | Reconstruction
     "fbk_vineyard" is a mix of three different scenes: vineyard_split1, vineyard_split2, vineyard_split3.
     Results of camera pose estimation (GT vs. prediction for each split): by carefully selecting image pairs, we can separate the dataset into three correct 3D models, though opposite-view pairs are missed.
  11. Exploration of VGGT: Visual Geometry Grounded Transformer [1] (* not in the final pipeline)
     [1] J. Wang et al., "VGGT: Visual Geometry Grounded Transformer," CVPR, 2025.
     Result on fbk_vineyard split3 (with opposite-view pairs), GT vs. prediction:
     • Without the VGGT tracker: opposite-view pairs are misclassified as different scenes, so the model is incorrectly divided into two.
     • With the VGGT tracker integrated into our pipeline: opposite-view pairs are recognized as the same scene.
     The VGGT tracker provides strong matching capabilities even for opposite-view image pairs.
  12. Takeaways
     • Global descriptors alone struggle to handle visually ambiguous datasets correctly.
     • To focus on fine geometric consistency, we use a keypoint-based similarity score.
     • Careful pair selection improves reconstruction results on ambiguous datasets.
     • VGGT offers promising results for future pipelines, even for complex cases like opposite views.
     Thank you for your attention! Our solution is here: https://www.kaggle.com/competitions/image-matching-challenge-2025/discussion/582898