
[Journal club] Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Transcript

  1. Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces (CVPR25 Oral)
     Presenter: Hairong Shi, Komei Sugiura Laboratory, Keio University
     Authors: Jihan Yang1*, Shusheng Yang1*, Anjali W. Gupta1*, Rilyn Han2*, Li Fei-Fei3, Saining Xie1 (1: New York University, 2: Yale University, 3: Stanford University)
  2. Abstract
     ❖ Background
       ▪ Gap: VLMs' visual-spatial intelligence from videos remains unexplored (e.g., robotics, autonomous driving, …)
       ▪ Humans have visual-spatial intelligence that lets them remember spaces from sequential visual inputs
       ▪ Natural question: Can VLMs also "think in space" from videos?
     ❖ Method: VSI-Bench
       ▪ A novel video-based visual-spatial intelligence benchmark with over 5,000 QA pairs
       ▪ Various reasoning techniques are tested, both linguistic and visual
     ❖ Result
       ▪ Spatial reasoning remains the primary bottleneck
       ▪ Linguistic reasoning fails, while visual reasoning improves performance
  3. BG: VLMs' visual-spatial intelligence from videos remains unexplored
     ▪ Previous advances in VLMs have primarily concentrated on content understanding or linguistic reasoning
     ▪ Visual-spatial intelligence is important due to its relevance to robotics, autonomous driving, and AR/VR
     ▪ Humans possess visual-spatial intelligence that lets them remember spaces from sequential visual observations
     References: Qwen2-VL [Wang+, arXiv24], EmbodiedBench [Yang+, ICML25 Oral], Frames of Mind: The Theory of Multiple Intelligences [Gardner, 1983]
  4. Related Work: Cognitive Psychology
     ▪ Prior work in cognitive psychology points out that visual object processing and spatial processing are neurally distinct in visual-spatial intelligence
     ▪ Spatial reasoning is broken down into two capabilities: relational reasoning and egocentric-allocentric transformation
       ▪ Egocentric: sparse views from the given video (the camera's perspective)
       ▪ Allocentric: the overall layout of the room (a bird's-eye perspective)
  5. Related Work: Previous Benchmarks
     ▪ Video-MME [Fu+, CVPR25]: recognition & perception; comprehensive evaluation across various video-related tasks
     ▪ EgoSchema [Mangalam+, NeurIPS23]: understanding abilities; content-level understanding from a first-person perspective
     ▪ OpenEQA [Majumdar+, CVPR24]: understanding abilities; episodic question answering in embodied settings
     ▪ Most prior work focuses on the content level, which primarily serves as a temporal extension of 2D image understanding without 3D spatial consideration
  6. Method: VSI-Bench (Visual-Spatial Intelligence Benchmark)
     ❖ Construction
       • Data collection and unification
         • Three 3D indoor scene understanding & reconstruction datasets: ScanNet, ScanNet++ [Yeshwanth+, ICCV23], and ARKitScenes
         • Videos, object-level 3D annotations, and 3D reconstruction annotations
         • Unified meta-information format: object categories, bounding boxes, and video specifications (resolution and frame rate); a hypothetical sketch of such a record is shown below
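The paper does not publish its unified schema verbatim, so the following is only a rough sketch of what one scene's meta-information record might look like; every name here (SceneMeta, ObjectAnnotation, room_size_m2, etc.) is an assumption for illustration, not the authors' format.

```python
# Hypothetical sketch of a unified meta-information record for one scene.
# Field names and structure are assumptions; the slide only states that object
# categories, bounding boxes, and video specifications (resolution, frame rate)
# are unified across ScanNet, ScanNet++, and ARKitScenes.
from dataclasses import dataclass, field


@dataclass
class ObjectAnnotation:
    category: str          # e.g. "chair"
    bbox_3d: tuple         # (cx, cy, cz, dx, dy, dz): axis-aligned box in meters


@dataclass
class SceneMeta:
    scene_id: str          # e.g. "scannet_scene0011_00"
    source_dataset: str    # "ScanNet" | "ScanNet++" | "ARKitScenes"
    video_path: str
    resolution: tuple      # (width, height) in pixels
    frame_rate: float      # frames per second
    room_size_m2: float    # floor area, usable for room-size questions
    objects: list = field(default_factory=list)   # list[ObjectAnnotation]
```

A record like this is enough to instantiate question templates automatically (object counts, sizes, relative distances) without touching the raw video.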
  7. Method: VSI-Bench (Visual-Spatial Intelligence Benchmark)
     ❖ Construction
       • Question-answer generation
         • First pass: auto-annotated using the meta-information and question templates
         • Second pass: human-in-the-loop quality review
       • Metric design
         • Accuracy for the multiple-choice subset
         • Mean Relative Accuracy (MRA) for the numerical subset, averaged over a range of confidence thresholds (a sketch of MRA follows below)
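A minimal sketch of the MRA described on this slide, assuming the threshold set C = {0.50, 0.55, ..., 0.95}: a numerical prediction counts as correct at threshold theta when its relative error is below 1 - theta, and MRA averages this indicator over C.

```python
# Minimal sketch of Mean Relative Accuracy (MRA) for the numerical subset.
# Assumes thresholds C = {0.50, 0.55, ..., 0.95}; a prediction is correct at
# threshold theta when its relative error |y_hat - y| / y is below 1 - theta.
import numpy as np


def mean_relative_accuracy(y_hat: float, y: float) -> float:
    thresholds = np.arange(0.50, 1.00, 0.05)      # C = {0.50, 0.55, ..., 0.95}
    relative_error = abs(y_hat - y) / abs(y)
    return float(np.mean(relative_error < (1.0 - thresholds)))


# Example: predicting 4.2 m for a true distance of 5.0 m gives a relative error
# of 0.16, which is correct for thresholds up to 0.80 -> MRA = 0.7.
print(mean_relative_accuracy(4.2, 5.0))
```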
  8. Method: VSI-Bench (Visual-Spatial Intelligence Benchmark)
     ❖ Benchmark statistics
       • Over 5,000 QA pairs derived from 288 real videos
       • Eight tasks of three types: configurational, measurement estimation, and spatiotemporal
  9. Evaluation on VSI-Bench: Main Results
     ❖ Human-level performance
       • Unsurprisingly, human evaluators achieve 79% average accuracy on VSI-Bench, outperforming the SOTA model by 33%
       • The gap is much narrower on the three measurement tasks
     ❖ MLLMs
       • There are clear gaps between open-source and closed-source models, and between 72B-level and 7B-level models
       • Most open-source models (7/12) perform below the chance-level baseline, indicating significant limitations in their visual-spatial intelligence
  10. Evaluation on VSI-Bench: How MLLMs Think in Space Linguistically
     • Linguistic prompting techniques, although effective in language reasoning and general visual tasks, are harmful for spatial reasoning
     • All three CoT-style techniques lead to performance degradation on VSI-Bench
     • Further analysis points out that errors mainly come from relational reasoning and perspective transformation, i.e., spatial reasoning
  12. Evaluation on VSI-Bench: How MLLMs Think in Space Visually
     • Based on the input video, Gemini-1.5 Pro is prompted to predict a cognitive map of the scene
     • The predicted maps are accurate only for close object pairs, suggesting the MLLM forms a series of local world models rather than a unified global model (a rough locality-analysis sketch follows below)
     • Prompting the MLLM with either a ground-truth or a self-generated cognitive map benefits relative distance reasoning
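This is not the paper's exact protocol, just a rough sketch of the kind of locality analysis the slide describes: compare predicted and ground-truth pairwise object distances on a cognitive map and bucket the results by ground-truth distance. The map format, distance bins, and tolerance below are all assumptions.

```python
# Hedged sketch (not the paper's protocol): how cognitive-map accuracy could be
# measured as a function of ground-truth object-pair distance. Maps are assumed
# to be dicts of object name -> (x, y) position on a shared grid; the bins and
# the correctness tolerance are illustrative choices.
from itertools import combinations
import numpy as np


def pairwise_accuracy_by_distance(pred_map, gt_map,
                                  bins=(0, 2, 4, 8, np.inf), tol=1.0):
    """For each ground-truth distance bin, return the fraction of object pairs
    whose predicted pairwise distance is within `tol` grid units of the truth."""
    hits = {k: [] for k in range(len(bins) - 1)}
    shared = sorted(set(pred_map) & set(gt_map))
    for a, b in combinations(shared, 2):
        gt_d = np.linalg.norm(np.subtract(gt_map[a], gt_map[b]))
        pred_d = np.linalg.norm(np.subtract(pred_map[a], pred_map[b]))
        k = int(np.digitize(gt_d, bins)) - 1          # which distance bucket
        hits[k].append(abs(pred_d - gt_d) <= tol)
    return {k: float(np.mean(v)) if v else None for k, v in hits.items()}
```

Under this sketch, a "series of local world models" would show up as high accuracy in the nearest bin and a steady drop in the farther bins.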
  13. Conclusion
     ❖ Background
       ▪ VLMs' visual-spatial intelligence from videos remains unexplored (e.g., robotics, autonomous driving, …)
       ▪ Humans have visual-spatial intelligence that lets them remember spaces from sequential visual inputs
       ▪ Can VLMs also "think in space" from videos?
     ❖ Method: VSI-Bench
       ▪ A novel video-based visual-spatial intelligence benchmark with over 5,000 QA pairs
       ▪ Various reasoning techniques are tested, both linguistic and visual
     ❖ Result
       ▪ Spatial reasoning remains the primary bottleneck
       ▪ Linguistic reasoning fails, while visual reasoning improves performance
  14. Appendix: Number of Frames
     • Frame sampling strategies are a model design choice, separate from the benchmark design, and are outside the scope of this paper
     • The number of sampled frames only marginally affects performance, so frame count is not the main bottleneck (a generic uniform-sampling sketch follows below)
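The slide treats frame sampling as a model-side choice; purely for illustration, here is a generic uniform-sampling sketch (not a scheme taken from the paper) that picks evenly spaced frame indices from a video.

```python
# Generic uniform frame-sampling sketch (a common design choice, not a method
# from the paper): pick `num_frames` evenly spaced indices, including the first
# and last frame, from a video with `total_frames` frames.
import numpy as np


def uniform_frame_indices(total_frames: int, num_frames: int) -> list:
    num_frames = min(num_frames, total_frames)
    return np.linspace(0, total_frames - 1, num_frames).round().astype(int).tolist()


print(uniform_frame_indices(total_frames=900, num_frames=8))
# [0, 128, 257, 385, 514, 642, 771, 899]
```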
  15. Comment
     ❖ Strengths
       ▪ Paper writing strategy: the complex experimental settings are organized around a series of questions, giving the paper a progressive logical flow
       ▪ The paper has started a trend of work focusing on perspective shifting (CVPR25, ICCV25, …)
     ❖ Weaknesses
       ▪ Even though the benchmark was auto-annotated, the authors did not scale it up to create a training set
       ▪ Training-side strategies were not explored, although later works have, e.g., MINDCUBE [Yin+, arXiv25, currently under review at ICLR26]