Slide 108
References (6/6)
[Kamikubo+, CHI'25] Kamikubo, R., Kayukawa, S., Kaniwa, Y., Wang, A., Kacorri, H., Takagi, H., & Asakawa, C. (2025, April). Beyond Omakase: Designing Shared Control for Navigation Robots with Blind People. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (pp. 1-17).
[Stanescu+, ISMAR'23] Stanescu, A., Mohr, P., Kozinski, M., Mori, S., Schmalstieg, D., & Kalkofen, D. (2023, October). State-Aware Configuration Detection for Augmented Reality Step-by-Step Tutorials. In 2023 IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (pp. 157-166). IEEE.
[Yagi+, IJCV'25] Yagi, T., Ohashi, M., Huang, Y., Furuta, R., Adachi, S., Mitsuyama, T., & Sato, Y. (2025). FineBio: A Fine-grained Video Dataset of Biological Experiments with Hierarchical Annotation. International Journal of Computer Vision, 1-16.
[Nair+, CoRL'22] Nair, S., Rajeswaran, A., Kumar, V., Finn, C., & Gupta, A. (2022, August). R3M: A Universal Visual Representation for Robot Manipulation. In 6th Annual Conference on Robot Learning.
[Kareer+, ArXiv'24] Kareer, S., Patel, D., Punamiya, R., Mathur, P., Cheng, S., Wang, C., ... & Xu, D. (2024). EgoMimic: Scaling Imitation Learning via Egocentric Video. arXiv preprint arXiv:2410.24221.
[Shi+, ICRA'25] Shi, J., Zhao, Z., Wang, T., Pedroza, I., Luo, A., Wang, J., ... & Jayaraman, D. (2025). ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos. In 2025 IEEE International Conference on Robotics and Automation (ICRA).
[Yang+, ArXiv'25] Yang, R., Yu, Q., Wu, Y., Yan, R., Li, B., Cheng, A. C., ... & Wang, X. (2025). EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos. arXiv preprint arXiv:2507.12440.
[Luo+, ArXiv'25] Luo, H., Feng, Y., Zhang, W., Zheng, S., Wang, Y., Yuan, H., ... & Lu, Z. (2025). Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos. arXiv preprint arXiv:2507.15597.
[Bahl+, CVPR'23] Bahl, S., Mendonca, R., Chen, L., Jain, U., & Pathak, D. (2023). Affordances from Human Videos as a Versatile Representation for Robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13778-13790).
[Hoque+, ArXiv'25] Hoque, R., Huang, P., Yoon, D. J., Sivapurapu, M., & Zhang, J. (2025). EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video. arXiv preprint arXiv:2505.11709.
[Singh+, WACV'16] Singh, K. K., Fatahalian, K., & Efros, A. A. (2016, March). KrishnaCam: Using a Longitudinal, Single-Person, Egocentric Dataset for Scene Understanding Tasks. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 1-9). IEEE.
[Yang+, CVPR'25] Yang, J., Liu, S., Guo, H., Dong, Y., Zhang, X., Zhang, S., ... & Liu, Z. (2025). EgoLife: Towards Egocentric Life Assistant. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 28885-28900).
[Chatterjee+, ICCV'25] Chatterjee, D., Remelli, E., Song, Y., Tekin, B., Mittal, A., Bhatnagar, B., ... & Sener, F. (2025). Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding. In IEEE/CVF International Conference on Computer Vision (ICCV).