
From Vision to Embodied AI: 3D Understanding of Dexterous Motion

Take Ohkawa
December 04, 2025


Slides from a talk at Google Developer Groups AI for Science - Japan, held on December 4, 2025.

Abstract: Human hand "dexterity" comprises advanced abilities such as precise manipulation, multi-finger coordination, and force control, and understanding it with computer vision matters for AR/VR, robotics, and the development of embodied AI. Conventional 2D image processing, however, suffers from occlusion and depth ambiguity, so 3D understanding that respects geometric and physical consistency is essential. This talk organizes the 3D representation foundations for human motion (SMPL/MANO, etc.) and the sensing foundations (RGB/Depth/IMU, etc.), and discusses the current state and limits of sensing technology. It also covers recent foundation models for monocular hand reconstruction, the use of large-scale real-world video, and modeling techniques that complete and predict motion with generative methods such as diffusion models. Finally, it argues that fine-grained 3D understanding of dexterity is a key to AI that understands and manipulates the world with a body.

Link: https://gdg.community.dev/events/details/google-gdg-ai-for-science-japan-presents-shi-jue-karashen-ti-xing-wochi-tsuaihe-qiao-zhi-nadong-zuo-no3ci-yuan-li-jie/

Transcript

  1. Understanding Dexterity
     • The ability to use the hands and fingertips skillfully to perform delicate tasks accurately:
       1. Precision / Accuracy: handling small objects without error
       2. Coordination: multi-finger coordination and timing
       3. Force Control: fine modulation of applied force
       4. Speed / Fluency: smooth motion with little wasted movement
       5. Visuomotor Integration: control that reflects visual information in action
     • Goal: recognize and model this kind of complex manual skill from video
     Example: the Purdue Pegboard Test, a standard task used in medicine, neuroscience, and developmental research (1948).
     Figure adapted from J. Sawyer+, Comparing the Level of Dexterity offered by Latex and Nitrile SafeSkin Gloves. Annals of Occupational Hygiene, 2005.

  2. Why Recognize in 3D?
     • 2D images carry many ambiguities:
     • Occlusion: hands and objects occlude one another
     • Depth ambiguity: depth information is underdetermined
     • Contact: judging contact is also difficult
     P. Banerjee+, HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos. In CVPR, 2025.

  3. Industrial Applications
     • Deployment to human behavior analysis, AR/VR, accessibility, and robotics:
     • Video and skill understanding [K. Grauman+, CVPR’22]
     • AR glasses [J. Engel+, arXiv’23]
     • VR gaming [S. Han+, SIGGRAPH Asia’22]
     • VR telepresence [J. Lee+, CVPR’25]
     • Sign language analysis [Z. Yu+, ECCV’24]
     • Robot control [Z. Yu+, ECCV’24]

  4. Toward Physical AI
     • Recent robot foundation models (VLAs) learn manipulation tasks by adding a robot-action output head to a conventional vision-language model (VLM)
     • Could VLAs also be trained from 3D tracking of human motion?
     Example: the robot foundation model π0.
     K. Black et al., π0: A Vision-Language-Action Flow Model for General Robot Control, 2024. https://www.physicalintelligence.company/blog/pi0

  5. Agenda
     1. Introduction: toward 3D understanding of dexterity
     2. 3D representation foundations for human motion
     3. Sensing foundations for human motion
     4. Modeling for capturing dexterous motion
        A. Building 3D training datasets
        B. Monocular hand reconstruction models
        C. Exploiting large-scale data
        D. Completion and prediction with generative models
     5. Summary

  6. Overview of 3D Representations
     • Skeleton/pose representation (21x3; joint coordinates / joint angles): lightweight, but hard to handle surface geometry
     • Mesh representation (778x3; fixed topology): uses a 3D model; convenient for monocular estimation
     • NeRF / Gaussian Splatting (3DGS): flexible shape representation, but requires many viewpoints
     NeRF-type: HOLD [Z. Fan+, CVPR’24]. 3DGS-type: MANUS [C. Pokhariya+, CVPR’24].
     HOLD: https://zc-alexfan.github.io/hold, MANUS: https://ivl.cs.brown.edu/research/manus.html

  7. Preliminary: Using 3D Models (SMPL)
     • Input: 2D image. Output: 3D mesh model. For simplicity, we often factorize the problem into two sub-problems: 1. shape estimation and 2. pose estimation
     • A parametric body model handles shape and pose jointly
     • The reconstruction task estimates these parameters from an image
     Adapted from [1], [2]. SMPL made Simple – Introduction, https://youtu.be/rzpiSYTrRU0?si=SJUDnd56n3lHNPJV, SMPL made Simple Tutorial at CVPR’21.
     M. Loper+, SMPL: A Skinned Multi-Person Linear Model. ACM TOG, 2015. (ICCV 2025 test-of-time award!) J. Lu+, DPoser: Diffusion Model as Robust 3D Human Pose Prior. arXiv, 2023.

  8. Shape Representation via Principal Component Analysis (PCA)
     • 2,000 male and 2,000 female subjects scanned in 3D in the same standard pose (CAESAR dataset), decomposed by PCA into principal shapes
     • The shape parameters can be chosen from the top 10–300 components, depending on the expressiveness required
     K. M. Robinette+, Civilian American and European Surface Anthropometry Resource (CAESAR) Final Report, 2002.

  9. Shape Representation via PCA (cont.)
     • The final shape is a linear combination of the shape components: final shape = average shape + β1 × S1 + β2 × S2 + …, with shape parameters β as the coefficients (e.g., +1.8, +2.2, …, −3.2 in the figure)
     SMPL made Simple – Introduction, https://youtu.be/rzpiSYTrRU0, SMPL made Simple Tutorial at CVPR 2021.

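
The linear blend above can be sketched in a few lines of NumPy. The sizes and values below are toy stand-ins (a real SMPL mesh has ~6,890 vertices, MANO 778), not the actual model basis:

```python
import numpy as np

# Toy PCA shape space in the SMPL/MANO style; all data is synthetic.
rng = np.random.default_rng(0)
n_vertices = 5                  # toy mesh size (real SMPL: 6890 vertices)
n_components = 3                # shape parameters beta_1..beta_3

mean_shape = rng.normal(size=(n_vertices, 3))                       # average template
principal_shapes = rng.normal(size=(n_components, n_vertices, 3))   # PCA basis S_k

def blend_shape(betas):
    """Final shape = mean shape + sum_k beta_k * principal shape S_k."""
    return mean_shape + np.tensordot(betas, principal_shapes, axes=1)

betas = np.array([+1.8, +2.2, -3.2])    # example coefficients as on the slide
shape = blend_shape(betas)
assert shape.shape == (n_vertices, 3)

# Setting all betas to zero recovers the average shape exactly.
assert np.allclose(blend_shape(np.zeros(n_components)), mean_shape)
```

The real model additionally applies pose-dependent deformation and skinning on top of this shape blend.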
  10. Extension to the Face, Hands, and Full Body (SMPL-X)
     • The same construction as SMPL extends to the face (FLAME), the hands (MANO), and the full body (SMPL-X)
     • FLAME: a face shape prior PCA-parameterized from 33,000 face scans; MANO: a hand shape prior PCA-parameterized from 1,000 hand scans
     G. Pavlakos+, Expressive Body Capture: 3D Hands, Face, and Body from a Single Image. In CVPR, 2019. FLAME [T. Li+, ACM TOG, 2017]. MANO [J. Romero+, ACM TOG, 2017].

  11. Accounting for Bones, Muscles, and Appearance
     • The NIMBLE model is built from MRI data annotated with bones and muscles
     Y. Li+, NIMBLE: A Non-rigid Hand Model with Bones and Muscles. ACM TOG, 2022.

  12. Overview of Sensors
     • RGB camera: hands are best captured at high resolution, e.g., by placing the camera near the hands
     • Depth camera: measures depth and can build point clouds, but always contains some noise
     • AR/VR glasses: usable just by wearing them on the head, but physically burdensome
     • IMU (inertial): unaffected by occlusion and well suited to estimating joint rotations, but accumulates drift
     • EMG (muscle signals): unaffected by occlusion and can classify motions, but has low spatial resolution
     Examples: Apple Vision Pro; EMG & IMU device [Y. Xiao+, CHI EA’25]; RGB image, depth image, point cloud.

  13. World-Scale Egocentric Video Collection
     • Egocentric cameras such as Aria glasses are used to record video in 74 cities worldwide
     • Ego4D: 3,670 hours of egocentric video
     • Ego-Exo4D: 1,286 hours of egocentric plus multiple fixed-view video
     K. Grauman+, Ego4D: Around the World in 3,000 Hours of Egocentric Video. In CVPR, 2022.
     K. Grauman+, Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives. In CVPR, 2024.

  14. Multi-Camera Facilities | Large Camera Domes
     • Mugsy@Meta Pittsburgh: 150 4K RGB camera views
     • Lightstage@MPI-INF: 40 6K RGB camera views and 331 LED lights, allowing diverse lighting conditions to be reproduced
     C. Wuu+, Multiface: A Dataset for Neural Face Rendering. In CVPRW, 2023.
     T. Teufel+, HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis. In ICCV, 2025.

  15. Examples of Dome-Captured Data
     • InterHand2.6M: two-hand interaction [G. Moon+, ECCV’20]
     • Goliath-SC: full-body motion with self-contact [T. Ohkawa+, ICCV’25]
     • URHand: relightable hand model [Z. Chen+, CVPR’24]
     InterHand2.6M: https://mks0601.github.io/InterHand2.6M/, Goliath-SC: https://tkhkaeio.github.io/projects/25-scgen/index.html, URHand: https://frozenburning.github.io/projects/urhand/

  16. Multi-Camera Facilities | Desktop Rigs (1/2)
     • DexYCB@NVIDIA: 8 RGB-D camera views with 3D hand-object annotations
     Y.W. Chao+, DexYCB: A Benchmark for Capturing Hand Grasping of Objects. In CVPR, 2021. https://dex-ycb.github.io/

  17. Multi-Camera Facilities | Desktop Rigs (2/2)
     • Assembly101@Meta: 8 RGB camera views plus egocentric cameras; proposes a variety of action recognition tasks (EgoVis Paper Award@CVPR’24)
     • AssemblyHands@Meta: provides 3D hand annotations; extended to pose-based action recognition (EgoVis Paper Award@CVPR’25)
     F. Sener+, Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities. In CVPR, 2022. https://assembly-101.github.io/
     T. Ohkawa+, AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation. In CVPR, 2023. https://assemblyhands.github.io/

  18. The Reality of Annotation
     • 2D keypoints can be annotated by humans, but hand annotation is costly and much of it is infeasible by hand; 3D in particular is highly ambiguous
     → This also constrains how data can be collected: a dilemma between fidelity and diversity.
     Datasets span a spectrum from diversity to fidelity:
     • Internet videos (100DOH) [D. Shan+, CVPR’20]
     • Wild ego videos (Ego4D) [K. Grauman+, CVPR’22]
     • RGB-D camera (Dexter+Object / FPHA) [S. Sridhar+, ECCV’16, G. Garcia-Hernando+, CVPR’18]
     • Multi-camera desktop (DexYCB / HO3D) [Y-W. Chao+, CVPR’21, S. Hampali+, CVPR’20]
     • Multi-camera dome (InterHand2.6M) [G. Moon+, ECCV’20, C. Wuu+, arXiv’22]

  19. A. Building 3D Training Datasets
     • Manual annotation
     • Synthetic data
     • Marker-based methods
     • Computational methods
       • Model fitting
       • Triangulation
     T. Ohkawa+, Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey. IJCV, 2023.

  20. Manual Annotation
     • Annotate 2D keypoints on depth images to obtain 3D poses
     • Pros: accurate annotation of visible joints
     • Cons: joints cannot be judged under occlusion; labor-intensive
     S. Sridhar+, Real-time joint tracking of a hand manipulating an object from RGB-D input. In ECCV, 2016. (Dexter+Object)
     F. Muller+, Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In ICCV, 2017. (EgoDexter)

  21. Synthetic Data
     • Uses simulators and CG engines
     • Pros: easy to generate diverse data; perfect ground truth can be produced
     • Cons: sim-to-real gap; modeling motion, contact, and physical properties remains challenging
     C. Zimmermann+, Learning to estimate 3D hand pose from single RGB images. In ICCV, 2017. (RHD)
     Y. Hasson+, Learning joint reconstruction of hands and manipulated objects. In CVPR, 2019. (ObMan)

  22. Marker-Based Methods
     • Track positions and orientations with motion capture and similar systems
     • Pros: accurate annotation that is robust to occlusion
     • Cons: markers affect appearance; high setup cost
     Z. Fan+, ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation. In CVPR, 2023.
     P. Banerjee+, HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos. In CVPR, 2025.

  23. Computational Methods | Model Fitting
     • Fit a 3D hand-object model to markerless observations
     • Observations: poses, masks, point clouds
     • Optimization variables: the 3D model's position and orientation (6DoF), and the hand pose and shape parameters
     C. Zimmermann+, FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images. In ICCV, 2019. (FreiHAND)
     Y.W. Chao+, DexYCB: A Benchmark for Capturing Hand Grasping of Objects. In CVPR, 2021. (DexYCB)

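
One sub-problem of such fitting, aligning the 6-DoF rigid pose of a model to observed 3D points, has a closed-form solution (the Kabsch/Procrustes algorithm). This is only a sketch of that one step with synthetic data; real pipelines optimize pose and shape parameters on top of it:

```python
import numpy as np

# Rigidly align a model's 3D keypoints to "observed" points (the 6-DoF part of
# model fitting). All data here is synthetic; 21 points mimic hand joints.
rng = np.random.default_rng(1)
model_pts = rng.normal(size=(21, 3))

# Ground-truth rigid transform to recover (rotation about z, plus translation).
angle = 0.5
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.3])
observed = model_pts @ R_true.T + t_true

def kabsch(src, dst):
    """Best-fit rotation R and translation t with dst ~= src @ R.T + t."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(0) - src.mean(0) @ R.T
    return R, t

R_est, t_est = kabsch(model_pts, observed)
assert np.allclose(R_est, R_true, atol=1e-6)
assert np.allclose(t_est, t_true, atol=1e-6)
```

With noisy masks or point clouds, the same alignment appears as one term inside an iterative optimization rather than a one-shot solve.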
  24. Computational Methods | Triangulation
     • Reconstruct 3D points from 2D keypoints across multiple cameras via geometric constraints
     • Use RANSAC to remove outliers
     • With M+L views, annotating M views determines the remaining L views automatically
     • AssemblyHands proposes a triangulation network based on 3D volumetric features
     T. Simon+, Hand Keypoint Detection in Single Images Using Multiview Bootstrapping. In CVPR, 2017.
     T. Ohkawa+, AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation. In CVPR, 2023.

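
The classical core of this step is direct-linear-transform (DLT) triangulation: stack the linear constraints from each view's projection matrix and solve by SVD. A minimal sketch with synthetic cameras (RANSAC would wrap this, triangulating from view subsets and keeping the hypothesis with the most low-reprojection-error inliers):

```python
import numpy as np

def triangulate(Ps, uvs):
    """DLT triangulation. Ps: list of 3x4 projection matrices;
    uvs: list of (u, v) pixel observations, one per view."""
    rows = []
    for P, (u, v) in zip(Ps, uvs):
        rows.append(u * P[2] - P[0])   # u * (p3 . X) = p1 . X
        rows.append(v * P[2] - P[1])   # v * (p3 . X) = p2 . X
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    X = Vt[-1]                          # null-space vector
    return X[:3] / X[3]                 # dehomogenize

# Synthetic check: three laterally shifted cameras observing a known 3D point.
X_true = np.array([0.2, -0.1, 2.5])
Ps = [np.hstack([np.eye(3), np.array([[tx], [0.0], [0.0]])])
      for tx in (-0.5, 0.0, 0.5)]
uvs = []
for P in Ps:
    x = P @ np.append(X_true, 1.0)
    uvs.append(x[:2] / x[2])            # perspective projection to pixels

assert np.allclose(triangulate(Ps, uvs), X_true, atol=1e-8)
```

With noisy keypoints the SVD gives the algebraic least-squares point, which is then typically refined by minimizing reprojection error.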
  25. Computational Methods | Example: AssemblyHands-X
     • 3D hand-body annotation that combines triangulated 3D poses with a fitted SMPL-X model
     T. Banno, T. Ohkawa+, AssemblyHands-X: Modeling 3D Hand-Body Coordination for Understanding Bimanual Human Activities. In ICCVW, 2025.

  26. B. Monocular Hand Reconstruction Models
     • The basic designs fall into three patterns, each starting from a hand-crop RGB or depth input and a backbone (e.g., CNN, Transformer):
       I. 2D heatmaps (J x L x L) plus depth regression
       II. 2.5D heatmaps (e.g., J x L x L x L)
       III. Direct regression of the 3D pose (21 x (x, y, z)) or mesh
     • Foundation models have appeared since around 2023 (ViT + model-parameter regression)
     T. Ohkawa+, Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey. IJCV, 2023.

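
The heatmap-based designs (patterns I and II) share one decoding step: turning a predicted heatmap into coordinates. A common differentiable choice is the soft-argmax, sketched here on a synthetic 2D heatmap (the 2.5D case applies the same expectation over a volumetric grid):

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """heatmap: (L, L) raw logits -> expected (x, y) in pixel units,
    computed as the expectation over a softmax-normalized heatmap."""
    L = heatmap.shape[0]
    p = np.exp(heatmap - heatmap.max())   # stable softmax
    p /= p.sum()
    ys, xs = np.mgrid[0:L, 0:L]
    return (p * xs).sum(), (p * ys).sum()

# Synthetic Gaussian peak at (x=12, y=5) on a 32x32 grid.
L = 32
ys, xs = np.mgrid[0:L, 0:L]
logits = -((xs - 12) ** 2 + (ys - 5) ** 2) / 2.0

x, y = soft_argmax_2d(logits)
assert abs(x - 12) < 1e-3 and abs(y - 5) < 1e-3
```

Because the expectation is differentiable, the coordinate loss can be backpropagated through the heatmap, unlike a hard argmax.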
  27. ViT + Model-Parameter Regression
     • Simple ViT architectures scaled with diverse 3D/2D data
     • HaMeR: MANO estimation. SMPLest-X/SMPLer-X: SMPL-X estimation
     G. Pavlakos+, Reconstructing Hands in 3D with Transformers. In CVPR, 2024.
     W. Yin+, SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation. TPAMI, 2025.

  28. WiLoR: A Focus on Detection
     • Existing models assume hand-cropped input, but solving detection jointly is important
     • Extends a foundation model with a detection architecture
     • Spatially selects features from an initial MANO prediction, further learns multi-scale features, then estimates a MANO refinement
     R.A. Potamias+, WiLoR: End-to-end 3D hand localization and reconstruction in-the-wild. In CVPR, 2025.

  29. Motion Reconstruction in World Coordinates: HaWoR
     • Combines WiLoR, a prediction-completion model, a depth foundation model, and SLAM (with unknown scale)
     J. Zhang+, HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos. In CVPR, 2025.

  30. Motion Reconstruction in World Coordinates: HaWoR (cont.)
     J. Zhang+, HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos. In CVPR, 2025. https://hawor-project.github.io/

  31. Faster Inference
     • ViT-based foundation models are slowed by attention computation and run slower than CNNs
     • DF-Mamba explores a fast alternative architecture: CNN feature extraction in the shallow layers, and global-feature representation learning with Mamba (a state space model) in the deep layers
     • Redefines the state (local-feature) scan in a deformable manner
     • At a model size comparable to ResNet50, it is both more accurate and faster than CNN, CNN+Transformer, and ViT baselines
     Y. Zhou*, T. Ohkawa*+, DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions. In WACV, 2026.

  32. C. Exploiting Large-Scale Data
     • Training sets are dominated by studio data and lack diversity
     → Can we instead learn from diverse real-world data?
     • 100DOH: work videos collected from YouTube. Ego4D: world-scale egocentric videos
     D. Shan+, Understanding Human Hands in Contact at Internet Scale. In CVPR, 2020.
     K. Grauman+, Ego4D: Around the World in 3,000 Hours of Egocentric Video. In CVPR, 2022.

  33. Contrastive Learning over Similar Hand Images: SiMHand
     • Builds a 2M-image pre-training set from Ego4D + 100DOH
     • Retrieves images with similar poses from the pre-training set and uses them as positives (contrastive learning with pair-wise similarity: similar hands attract, others repel)
     • Yields pre-trained models robust to differences in background, hand-object manipulation, and user
     N. Lin*, T. Ohkawa*+, SiMHand: Mining Similar Hands for Large-Scale 3D Hand Pose Pre-training. In ICLR, 2025.

  34. Sapiens: The Largest Human Pre-Training in Both Data and Resolution
     • Released models for pose estimation, segmentation, depth estimation, and surface-normal estimation
     R. Khirodkar+, Sapiens: Foundation for Human Vision Models. In ECCV, 2024. https://www.meta.com/emerging-tech/codec-avatars/sapiens/

  35. MDM: A Human Motion Diffusion Model
     • Generates motion from noise through a diffusion process
     • Proposes a text-conditioned Transformer denoising model and sampling scheme
     • Example prompt: “A person walks forward, bends down to pick something up off the ground.”
     G. Tevet+, MDM: Human Motion Diffusion Model. In ICLR, 2023. https://guytevet.github.io/mdm-page/

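
The forward/reverse processes that such motion diffusion models build on can be sketched on a toy 1-D "motion" signal. The denoiser below is an oracle returning the true clean signal, standing in for the learned text-conditioned Transformer that predicts x0; the noise schedule and sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 50
betas = np.linspace(1e-4, 0.05, T)       # toy linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)

x0 = np.sin(np.linspace(0, 2 * np.pi, 64))   # toy motion trajectory

def q_sample(x0, t, eps):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

def denoiser(x_t, t):
    # Oracle x0-prediction; a trained model would infer this from (x_t, t, text).
    return x0

# Reverse (ancestral) sampling with the x0 parameterization: start from noise,
# then repeatedly re-noise the predicted clean signal to the previous timestep.
x = rng.normal(size=x0.shape)
for t in range(T - 1, 0, -1):
    x = q_sample(denoiser(x, t), t - 1, rng.normal(size=x0.shape))
x = denoiser(x, 0)

assert np.allclose(x, x0)   # an oracle denoiser recovers the clean motion
```

Conditioning on head motion, images, or text (as in the following slides) only changes what the denoiser sees, not this sampling skeleton.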
  36. Motion Completion
     • Diffusion models trained to recover full-body motion from head movement or images
     • EgoEgo: Best Paper Award Candidate@CVPR’23
     • UniEgoMotion: extended to take images into account; supports multiple tasks
     J. Li+, Ego-Body Pose Estimation via Ego-Head Pose Estimation. In CVPR, 2023. https://lijiaman.github.io/projects/egoego/
     C. Patel+, UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation. In ICCV, 2025.

  37. AffHandGen: An Affordance-Aware Hand Pose Diffusion Model
     • Generates text descriptions of object grasps and uses them to guide a diffusion model
     • A VLM and a grasp classifier produce affordance descriptions; a text-conditioned model denoises hand poses
     N. Suzuki, T. Ohkawa+, Affordance-guided diffusion prior for 3D hand reconstruction. In ICCVW, 2025.

  38. An Affordance-Aware Hand Pose Diffusion Model (cont.)
     • The text-guided diffusion model corrects predictions (ours vs. HaMeR [G. Pavlakos+, CVPR’24])
     • Description: “A right hand, with an adducted thumb grasp, holds a small flat spatula while cooking, with the palm facing the fabric.”
     N. Suzuki, T. Ohkawa+, Affordance-guided diffusion prior for 3D hand reconstruction. In ICCVW, 2025.