Video-MME [Fu+, CVPR25]: Recognition & perception. Comprehensive evaluation across various video-related tasks.
EgoSchema [Mangalam+, NeurIPS23]: Understanding abilities. Content-level understanding from a first-person perspective.
OpenEQA [Majumdar+, CVPR24]: Understanding abilities. Episodic question answering in embodied settings.

Most prior work focuses on content-level understanding, which primarily serves as a temporal extension of 2D image understanding without 3D spatial consideration.