Slide 1

Slide 1 text

PhD thesis defense: Visual Understanding of Human Hands in Interactions July 14, 2025 Takehiko Ohkawa (Advisor: Prof. Yoichi Sato) / The University of Tokyo

Slide 2

Slide 2 text

2 Explore Discover Love Emphasize Express Ubiquitous role of hands alongside life stage changes Dexterity Tool use Profession Recall, we've grown with our Hands...

Slide 3

Slide 3 text

3 Video understanding AR glasses VR game Photorealistic telepresence Sign language understanding Robot imitation This centrality of hands opens up broad applications

Slide 4

Slide 4 text

Spectrum of understanding • Location, pixel mask, keypoints, 3D pose, and shape • Richer is better due to the inclusion relation (e.g., 2D keypoints ⊂ 3D keypoints) • Data always limit the level of understanding 4 Fine & 3D Hand-object detection
 [D. Shan+, CVPR’20] Hand segmentation
 [G. Serra+, ACMMM’13] 2D pose (keypoints)
 [T. Simon+, CVPR’17] 3D pose
 [G. Moon+, ECCV’20] 3D shape
 [C.Zimmermann+, ICCV’19]

Slide 5

Slide 5 text

Spectrum of data captures • Data vary from studio to in-the-wild settings (fidelity vs. diversity) 5 [1] T. Ohkawa, R. Furuta, and Y. Sato. Efficient annotation and learning for 3D hand pose estimation: A survey. IJCV, 2023 Internet videos
 (100DOH)
 [D. Shan+, CVPR’20] Multi-camera dome
 (InterHand2.6M)
 [G. Moon+, ECCV’20, 
 C. Wuu+, arXiv’22] Multi-camera desktop
 (DexYCB / HO3D)
 [Y-W. Chao+, CVPR’21,
 S. Hampali+, CVPR’20] Wild ego videos
 (Ego4D)
 [K. Grauman+, CVPR’22] Diversity Fidelity RGB-D camera
 (Dexter+Object / FPHA)
 [S. Sridhar+, ECCV’16,
 G. Garcia-Hernando+, CVPR’18]

Slide 6

Slide 6 text

Research questions • What are challenging scenarios in data capture and perception under the status quo? • What are learning issues under the tradeoff between data fidelity and diversity? • How are tracked hand states linked to semantics? 6

Slide 7

Slide 7 text

Q1: Challenging scenarios • Contact involves heavy occlusion and ambiguity 7 Object contact
 (chapter 3) Self-contact
 (chapter 4)

Slide 8

Slide 8 text

Q1: Challenging scenarios • Contact involves heavy occlusion and ambiguity • Strong inductive cues of semantics 8 Object contact
 (chapter 3) Self-contact
 (chapter 4) Action intent Object affordance Emotion Psychological states

Slide 9

Slide 9 text

Q2: Learning issues • Learning needs to accommodate two data sources • Limited size and variety of labeled data • Diverse in-the-wild data with limited labels and different views 9 Training data (studio) Support data or test data (in-the-wild) Prior modeling from large data (chapter 4,5) Adaptation to test domain (chapter 6)

Slide 10

Slide 10 text

Q3: Semantic mapping • A geometric state can be interpreted in various ways (i.e., one-to-many) 10 “Grasp something” “Drink coffee” “Crack an egg” “Tighten bolt” Learn to map hand states to semantics dependent on the context (chapter 7)

Slide 11

Slide 11 text

11 Visual Understanding of Human Hands in Interactions Data foundation Robust modeling for fine details Connecting geometry with semantics ––Precise tracking and interpretation of fine-grained hand interactions from real-world visual data––

Slide 12

Slide 12 text

12 Robust modeling for fine details Data foundation Connecting geometry 
 with semantics Chapter 2: Survey for 3D hand datasets and learning methods (IJCV’23) Chapter 5: Pre-training from diverse images (ICLR’25) Chapter 6: Adaptation 
 in the wild (ECCV’22) Chapter 7: Video language description from hand tracklets (WACV’25) AssemblyHands dataset 
 with object contact (CVPR’23) Egocentric 3D HPE benchmarks in object interactions (ECCV’24) Chapter 3 Chapter 4: Self-contact (ICCV’25)
 (i) Goliath-SC dataset (ii) Generative prior 
 of contact pose

Slide 13

Slide 13 text

Table of contents I. Introduction II. Survey for 3D hand capture, annotation, and learning methods (IJCV’23) III. Egocentric hand pose estimation under object interactions (ECCV’24) IV. Hand self-contact benchmark and generative pose modeling (ICCV’25) V. 3D hand pose pre-training from in-the-wild images (ICLR’25) VI. Domain adaptive hand state estimation in the wild (ECCV'22) VII. Dense video captioning for egocentric hand activities (WACV’25) VIII. Conclusions and Future Work 13

Slide 14

Slide 14 text

14 Robust modeling for fine details Data foundation Connecting geometry 
 with semantics Chapter 2: Survey for 3D hand datasets and learning methods (IJCV’23) AssemblyHands dataset 
 with object contact (CVPR’23) Egocentric 3D HPE benchmarks in object interactions (ECCV’24) Chapter 3

Slide 15

Slide 15 text

Benchmark for egocentric hand pose • Facilitate analysis on natural and intricate object interactions 1. Propose a 3D hand pose benchmark using a multi-view egocentric headset 2. Extensive analysis of recent estimation methods at the ICCV 2023 competition 3. Identify insights and findings by aggregating community knowledge 15 [2] Z. Fan*, T. Ohkawa*+, Benchmarks and challenges in pose estimation for egocentric hand interactions with objects. In ECCV, 2024 Exo views Ego views

Slide 16

Slide 16 text

Progress of 3D hand pose estimation 16 HANDS’17 
 [S. Yuan+, CVPR’18] HANDS’19
 [A. Armagan+, ECCV’20] • Depth images • Marker-based annotation
 →Bias in RGB images • Static RGB images • Object interactions • Markerless annotation
 →Less accurate in single-view setup • Simple and scripted actions ? HANDS’23 • Address object interactions from RGB images • Can we dive into more dynamic settings? • Egocentric cameras? • Unscripted actions? • Intricate contact? • How do we annotate accurately?

Slide 17

Slide 17 text

AssemblyHands 17 [A] T. Ohkawa+, AssemblyHands: Towards egocentric activity understanding via 3D hand pose estimation. In CVPR, 2023 • Unscripted object assembly captures with ego-exo cameras • Large and accurate 3D pose GTs with markerless annotation (3M images) • Egocentric capture with a multi-view headset (aligned with commercial AR/VR devices)

Slide 18

Slide 18 text

Multi-view annotation • Multi-view triangulation with volumetric features for 3D pose annotation • Collaborated with Meta for data capture 18 [A] T. Ohkawa+, AssemblyHands: Towards egocentric activity understanding via 3D hand pose estimation. In CVPR, 2023
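For intuition only, here is a minimal NumPy sketch of the classical DLT triangulation that underlies multi-view 3D keypoint annotation; the volumetric-feature refinement used for AssemblyHands is not reproduced here, and the function names are illustrative.

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """DLT triangulation of one 3D point from >=2 calibrated views.

    proj_mats: list of 3x4 camera projection matrices P_i = K_i [R_i | t_i]
    points_2d: list of (u, v) pixel observations, one per view
    returns:   (3,) 3D point in world coordinates
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous point X
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)                     # (2 * num_views, 4)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                             # null-space solution of A X = 0
    return X[:3] / X[3]                    # dehomogenize
```

In practice, each joint's 2D detections from all ego/exo views would be triangulated like this, typically with per-view confidence weighting and outlier rejection before any learned refinement.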

Slide 19

Slide 19 text

Task overview 19 Train: 3D hand pose estimation from egocentric views 3D hand joints (GT) SVEgoNet Test: Multi-view inference Multi-view fusion Exo images Ego images Multi-view annotation

Slide 20

Slide 20 text

Competition results at ICCV 2023 20 Task 1 - Egocentric 3D Hand Pose Estimation - HANDS@ICCV2023, Link: https://sites.google.com/view/hands2023/challenges/assemblyhands
Method | Learning design | Architecture | Preprocessing / Augmentation | Multi-view fusion | MPJPE↓
Base | 2D heatmap and 3D location map | ResNet50 | - | Simple average | 20.69
JHands | Regression | Hiera (ViT-based) | Warp perspective, color jitter, random mask | Adaptive view selection and average | 12.21
PICO-AI | 2.5D heatmap, heatmap voting | ResNety320 | Scale, rotate, flip, translate | Adaptive view selection, FTL in training | 12.46
FRDC | Regression, 2D heatmap | HandOccNet with ConvNeXt | Scale, rotate, color jitter | Weighted average | 16.48
Phi-AI | 2D heatmap and 3D location map | ResNet50 | Scale, rotate, translate, color jitter, Gaussian blur | Weighted average | 17.26

Slide 21

Slide 21 text

Finding (i): Distortion from egocentric camera • Stretched hands near the image edge due to the fisheye distortion • Proposal: Perspective transformation to define less stretched hand images (JHands) 21 Set a virtual camera on the hand
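A minimal sketch of the underlying idea (not necessarily the JHands implementation): place a virtual pinhole camera that shares the original optical center but looks straight at the hand, so the crop is re-rendered with far less stretching; for a rotation-only virtual camera the warp is the homography H = K_virt · R · K⁻¹. The simple up-vector choice and all names are assumptions.

```python
import cv2
import numpy as np

def look_at_rotation(direction):
    """Rotation whose rows are the axes of a camera looking along `direction`."""
    z = direction / np.linalg.norm(direction)
    x = np.cross(np.array([0.0, 1.0, 0.0]), z)   # assumes the ray is not parallel to the up vector
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    return np.stack([x, y, z])

def virtual_hand_crop(image, K, hand_center_px, K_virt, out_size=(224, 224)):
    """Warp the image as seen by a virtual camera rotated toward the hand."""
    # Back-project the hand center pixel to a viewing ray in camera coordinates
    ray = np.linalg.inv(K) @ np.array([*hand_center_px, 1.0])
    R = look_at_rotation(ray)                    # rotate the optical axis onto the ray
    H = K_virt @ R @ np.linalg.inv(K)            # image-to-image homography (pure rotation)
    return cv2.warpPerspective(image, H, out_size)
```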

Slide 22

Slide 22 text

Finding (ii): Bias of detected hands per view 22 • Proposal: Adaptive multi-view fusion • Weighted sum of views, computed from the validation set (FRDC & Phi-AI) • Confident view selection by filtering outlier views (JHands & PICO-AI) Frequent detection of hands Cam1 Cam2 Cam3 Cam4
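A schematic NumPy sketch of the two fusion strategies mentioned on the slide: a weighted average with per-view weights (e.g., estimated on a validation split), or confident-view selection that drops views disagreeing too much with the consensus. This is a reconstruction from the slide, not the winners' code; the threshold is an arbitrary placeholder.

```python
import numpy as np

def fuse_views(preds, view_weights=None, outlier_thresh_mm=30.0):
    """Fuse per-view 3D joint predictions of shape (num_views, num_joints, 3)."""
    preds = np.asarray(preds)
    if view_weights is not None:
        # Weighted average, e.g., with weights tuned on a validation split
        w = np.asarray(view_weights, dtype=float)
        return (w[:, None, None] * preds).sum(0) / w.sum()
    # Otherwise: confident-view selection by filtering outlier views
    consensus = preds.mean(0)
    err = np.linalg.norm(preds - consensus, axis=-1).mean(-1)   # mean joint error per view
    keep = err < outlier_thresh_mm
    return preds[keep].mean(0) if keep.any() else consensus
```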

Slide 23

Slide 23 text

Summary • Accurate multi-view annotation in AssemblyHands (with Meta at Redmond) • Benchmark egocentric hand pose estimation from AR/VR headset • Analyze winning methods at the ICCV 2023 competition • Address egocentric-specific challenges • Fisheye distortion • Adaptive multi-view fusion • Organize HANDS workshop with global consortium 
 (ICCV’23,25 & ECCV’24) • AssemblyHands-related works won 
 EgoVis Distinguished Paper Award (CVPR’25) 23

Slide 24

Slide 24 text

24 Robust modeling for fine details Data foundation Connecting geometry 
 with semantics Chapter 2: Survey for 3D hand datasets and learning methods (IJCV’23) AssemblyHands dataset 
 with object contact (CVPR’23) Egocentric 3D HPE benchmarks in object interactions (ECCV’24) Chapter 3 Chapter 4: Self-contact (ICCV’25)
 (i) Goliath-SC dataset (ii) Generative prior 
 of contact pose

Slide 25

Slide 25 text

Self-contact data and generative modeling 25 [3] T. Ohkawa+, Generative modeling of shape-dependent self-contact human poses. In ICCV, 2025 Multi-camera dome capture Prior fitting to novel subjects Generative pose prior • Address complex self-contact pose modeling 1. Create a new hand self-contact dataset with varying poses and shapes 2. Learn the self-contact pose distribution with generative modeling 3. Apply the generative prior to single-view pose estimation

Slide 26

Slide 26 text

Difficulty of estimating self-contact poses 26 Common failures in a single view • Contact ambiguity • Depth ambiguity A contact prior is needed to correct poses in such failures

Slide 27

Slide 27 text

Limitations of existing self-contact datasets 27 Human3DSC
 [M. Fieraru+, AAAI’21] MTP
 [L. Muller+, CVPR’21] • Mocap-based capture • Manual contact part annotation • Small contact data (1.0K) • Mocap-based capture & mimic the pose • Pseudo-GTs lacking accuracy • Small contact data (1.6K)

Slide 28

Slide 28 text

Goliath-SC dataset • Scaling up the capture with a 200+ RGB-camera dome • Various self-contact interactions across face, body, and hands (383K poses) • Collaborated with Meta for data capture 28 Multi-camera dome
 (InterHand2.6M)
 [G. Moon+, ECCV’20, 
 C. Wuu+, arXiv’22]

Slide 29

Slide 29 text

Self-contact pose modeling • Pose and body shape are closely correlated! → Model body-shape-dependent pose! • Formulation of generative contact prior 29 [Figure: the prior P(ω|I) relates pose, shape, image, mesh, and contact; subjects are instructed to perform poses such as "rubbing belly"; being generative, the prior is agnostic to environmental conditions vs. regression]
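Reading the slide's P(ω|I) as the posterior over the pose ω given an image, one hedged way to write how a shape-dependent generative prior enters single-view fitting is sketched below; β denoting body shape is my notation, and the thesis may factorize the model differently.

```latex
% Shape-dependent pose prior used as a regularizer for single-view estimation (schematic):
\hat{\omega} \;=\; \arg\max_{\omega}\; p(\omega \mid I, \beta)
  \;\propto\; \underbrace{p(I \mid \omega, \beta)}_{\text{image evidence (e.g., 2D keypoints)}}
  \;\; \underbrace{p(\omega \mid \beta)}_{\text{generative self-contact pose prior}}
```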

Slide 30

Slide 30 text

Denoising diffusion network: Part-aware Pose Diffusion (PAPoseDiff) 30 [Figure: noised part-wise poses (face, right hand, left hand, body) are denoised by a part-aware self-attention transformer conditioned on the shape (with perturbation) and the time step; latent embeddings are decoded through an SMPL-X layer, linking pose space, latent space, and mesh space] Training loss: L_D = λ_pose L_pose + λ_v L_v + λ_col L_col (pose loss, vertex loss, collision loss)
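As a rough PyTorch-style illustration of the objective above: a denoiser predicts the clean part-wise pose from its noised version, an SMPL-X-like layer decodes both poses to meshes, and the three terms are combined. The noise schedule, the `denoiser`/`body_model` callables, and the zero-valued `collision_penalty` stub are placeholders, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def collision_penalty(verts):
    # Placeholder: a real implementation would penalize self-intersecting body parts.
    return verts.new_zeros(())

def diffusion_loss(denoiser, body_model, pose_gt, shape, t, alphas_bar,
                   w_pose=1.0, w_vert=1.0, w_col=0.1):
    """Schematic L_D = w_pose * L_pose + w_vert * L_vert + w_col * L_col."""
    # Forward diffusion q(x_t | x_0): mix the clean pose with Gaussian noise
    a = alphas_bar[t].view(-1, 1)                       # cumulative noise schedule, per sample
    noise = torch.randn_like(pose_gt)
    pose_noisy = a.sqrt() * pose_gt + (1.0 - a).sqrt() * noise
    # The part-aware transformer denoises, conditioned on body shape and time step
    pose_pred = denoiser(pose_noisy, shape, t)
    # Decode both poses to meshes with an SMPL-X-like layer
    verts_pred, verts_gt = body_model(pose_pred, shape), body_model(pose_gt, shape)
    l_pose = F.mse_loss(pose_pred, pose_gt)             # pose loss
    l_vert = F.mse_loss(verts_pred, verts_gt)           # vertex loss
    l_col = collision_penalty(verts_pred)               # collision loss
    return w_pose * l_pose + w_vert * l_vert + w_col * l_col
```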

Slide 31

Slide 31 text

Prior fitting to single-view pose estimation • Given estimated 2D keypoints and 3D pose, refine the 3D pose with denoising 31

Slide 32

Slide 32 text

Results of single-view pose estimation • Pose metric: MPJPE↓ 32 Goliath-SC eval set
Method | Avg. | Hands | Body | Face
SMPLer-X [Z. Cai+, NeurIPS’23] | 58.0 | 98.7 | 41.6 | 38.9
(Fine-tuned) | 42.0 | 56.7 | 31.9 | 34.1
+ 2D fitting | 41.7 | 65.7 | 30.6 | 31.6
+ BUDDI [L. Muller+, CVPR’24] | 71.7 | 99.9 | 36.3 | 66.4
+ Ours | 31.8 | 54.6 | 24.7 | 19.2
[Qualitative comparison: SMPLer-X vs. +Ours]

Slide 33

Slide 33 text

Summary • Construct Goliath-SC dataset for self-contact analysis (with Meta at Pittsburgh) • Generative contact prior for self-contact poses • Part-aware pose diffusion with latent embedding • Prior adaptation to single-view pose estimation • Outperform the fine-tuned SOTA regressor 33

Slide 34

Slide 34 text

34 Robust modeling for fine details Data foundation Connecting geometry 
 with semantics Chapter 2: Survey for 3D hand datasets and learning methods (IJCV’23) Chapter 5: Pre-training from diverse images (ICLR’25) AssemblyHands dataset 
 with object contact (CVPR’23) Egocentric 3D HPE benchmarks in object interactions (ECCV’24) Chapter 3 Chapter 4: Self-contact (ICCV’25)
 (i) Goliath-SC dataset (ii) Generative prior 
 of contact pose

Slide 35

Slide 35 text

3D hand pose pre-training in the wild • Leverage large-scale in-the-wild videos to build a prior for hand perception 1. Propose similar hands mining to create informative positive pairs 2. Propose contrastive learning with adaptive weighting based on similarity 3. Validate the effectiveness of the pre-trained prior on 3D HPE 35 [4] N. Lin*, T. Ohkawa*+, SiMHand: Mining similar hands for 3D hand pose pre-training. In ICLR, 2025

Slide 36

Slide 36 text

Learning from in-the-wild videos • We have access to massive in-the-wild human videos but they are unlabeled • Hand perception prior with contrastive learning from diverse images 36 100DOH from YouTube
 [D. Shan+, CVPR’20] Ego4D: Worldwide first-person footage 
 [K. Grauman+, CVPR’22]

Slide 37

Slide 37 text

How to assign positive pairs? • SimCLR / PeCLR: Positive samples originate from the same instance 
 [T. Chen+, ICML’20, A. Spurr+, ICCV’21] 37

Slide 38

Slide 38 text

How to assign positive pairs? • SimCLR / PeCLR: Positive samples originate from the same instance 
 [T. Chen+, ICML’20, A. Spurr+, ICCV’21] • Time contrastive learning assigns temporally neighboring frames as positives
 [A. Ziani+, 3DV’22] 38 → Limited supervision related to interactions, background, and user identity

Slide 39

Slide 39 text

• Can samples with similar pose provide a key to a generalizable prior? Ideal positives 39 1) Different hand-held objects 2) Different backgrounds 3) Different user ID and appearance

Slide 40

Slide 40 text

Visualization of assigned positives 40 Query Assigned positive • Sample mining with similar 2D pose
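A rough sketch of how such positives could be mined by nearest-neighbor search over normalized 2D poses; the root joint index, the normalization, and the brute-force O(N²) search are simplifications (mining over SiMHand's ~2M images would need approximate search), and all names are illustrative.

```python
import numpy as np

def normalize_pose2d(kpts):
    """Center each 2D pose on its root joint (index 0, by assumption) and scale to unit size."""
    kpts = kpts - kpts[:, :1]                                    # root at the origin
    scale = np.linalg.norm(kpts, axis=(1, 2), keepdims=True) + 1e-8
    return kpts / scale

def mine_similar_hands(poses2d, k=1):
    """For each sample, return the indices of the k most similar 2D poses (excluding itself)."""
    flat = normalize_pose2d(poses2d).reshape(len(poses2d), -1)
    dist = np.linalg.norm(flat[:, None] - flat[None], axis=-1)   # pairwise pose distances
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k], dist
```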

Slide 41

Slide 41 text

Adaptive weighting based on similarity 41 • Contrastive learning loss is weighted by the similarity distance
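A minimal sketch of a similarity-weighted InfoNCE-style loss in PyTorch: each mined (query, positive) pair sits on the diagonal of the similarity matrix, and its contribution is scaled by a weight derived from the 2D-pose distance so that closer pairs count more. The exact weighting function in the paper may differ; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(z_query, z_positive, pose_dist, temperature=0.1):
    """InfoNCE over a batch where pair (i, i) is a mined similar-hand pair.

    z_query, z_positive: (B, D) embeddings of each query and its mined positive
    pose_dist:           (B,) 2D-pose distance of each pair (smaller = more similar)
    """
    z_q = F.normalize(z_query, dim=1)
    z_p = F.normalize(z_positive, dim=1)
    logits = z_q @ z_p.t() / temperature                 # (B, B): diagonal entries are positives
    labels = torch.arange(len(z_q), device=z_q.device)
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    # Adaptive weighting: down-weight pairs whose 2D poses are farther apart
    w = torch.softmax(-pose_dist, dim=0) * len(pose_dist)
    return (w * per_sample).mean()
```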

Slide 42

Slide 42 text

Experimental results 42
• Pre-training set: Ego4D + 100DOH (2M images) with processed 2D poses
• Fine-tune a pre-trained model on 3D HPE datasets (Metric: MPJPE↓)
Method | FreiHand | DexYCB | AssemblyHands
w/o pre-train | 19.21 | 19.36 | 19.17
SimCLR [T. Chen+, ICML’20] | 20.07 | 21.09 | 21.24
PeCLR [A. Spurr+, ICCV’21] | 18.19 | 18.06 | 18.88
Ours (Similar hands + weighting) | 15.79 | 16.71 | 18.23

Slide 43

Slide 43 text

Qualitative results 43

Slide 44

Slide 44 text

Summary • Learning a generalizable prior from in-the-wild similar hands • Leverage weak guidance from 2D pose in pre-training • Similar hands mining in the pre-training set construction • Propose contrastive learning from similar hands with adaptive weighting • Demonstrate its effectiveness across different 3D HPE datasets 44

Slide 45

Slide 45 text

45 Robust modeling for fine details Data foundation Connecting geometry 
 with semantics Chapter 2: Survey for 3D hand datasets and learning methods (IJCV’23) Chapter 5: Pre-training from diverse images (ICLR’25) Chapter 6: Adaptation 
 in the wild (ECCV’22) AssemblyHands dataset 
 with object contact (CVPR’23) Egocentric 3D HPE benchmarks in object interactions (ECCV’24) Chapter 3 Chapter 4: Self-contact (ICCV’25)
 (i) Goliath-SC dataset (ii) Generative prior 
 of contact pose

Slide 46

Slide 46 text

Domain adaptive hand state estimation • Bridge the domain gap to label-scarce domains 1. Propose a self-training method to adapt to unlabeled target data 2. Integrate confidence estimation into self-training for reliable modeling 3. Improve hand pose and mask estimation in novel domains 46 [5] T. Ohkawa+, Domain adaptive hand keypoint and pixel localization in the wild. In ECCV, 2022 DexYCB (studio)
 [Y-W. Chao+, CVPR’21] Ego4D (in-the-wild)
 [K. Grauman+, CVPR’22] Before After

Slide 47

Slide 47 text

Transfer learning setup 47 Source data (w/ GTs) Target data (w/o GTs) • Large training images • High-fidelity annotation • Limited diversity (e.g., DexYCB [Y-W. Chao+, CVPR’21]) • Diverse images • Different captured environments • Lack annotation of pose and mask (e.g., Ego4D [K. Grauman+, CVPR’22])

Slide 48

Slide 48 text

Mean teacher: Stable self-supervised learning • Consistency training (cf. pseudo-labeling) by a separate model design [A. Tarvainen+, ICLR’17] • Teacher: averaged (EMA) from the student / Student: trained via consistency loss 48 [Figure: target data → Teacher → prediction p_t, re-aligned by the label transformation T_y; augmented target data (via data augmentation T_x) → Student → prediction p_st; the consistency loss ||p_st − T_y(p_t)||² is backpropagated to the student, and the teacher is updated by EMA]
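A compact PyTorch-style sketch of the mean-teacher loop on this slide: the student matches the label-transformed teacher prediction under data augmentation, and the teacher tracks an exponential moving average (EMA) of the student. The `augment` (T_x) and `align` (T_y) callables are abstractions; the momentum value and loss shape are illustrative.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher parameters follow an exponential moving average of the student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)

def consistency_step(student, teacher, x_target, augment, align, optimizer):
    x_aug, transform = augment(x_target)            # T_x: geometric/photometric augmentation
    with torch.no_grad():
        p_teacher = teacher(x_target)               # prediction on the clean target image
    p_student = student(x_aug)                      # prediction on the augmented image
    # Consistency loss || p_student - T_y(p_teacher) ||^2
    loss = ((p_student - align(p_teacher, transform)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
    return loss.item()
```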

Slide 49

Slide 49 text

Uncertainty of consistency training • The quality of the supervision affects the training progress 49 [Figure: the same mean-teacher pipeline; the consistency loss ||p_st − T_y(p_t)||² is sensitive to the quality of the teacher prediction p_t]

Slide 50

Slide 50 text

Con fi dence-aware consistency training • Evaluate con fi dence and weight the consistency loss 50 AAACBHicbVDLSsNAFJ3UV62vqMtuBovgqiQi6rLgRsFFBfuANoTJdNIOnZmEmYlQQhZu/BU3LhRx60e482+cpFlo64GBM+fcy733BDGjSjvOt1VZWV1b36hu1ra2d3b37P2DrooSiUkHRyyS/QApwqggHU01I/1YEsQDRnrB9Cr3ew9EKhqJez2LicfRWNCQYqSN5Nv1IUd6ghFLbzM/LT6Sp3iMcJb5dsNpOgXgMnFL0gAl2r79NRxFOOFEaMyQUgPXibWXIqkpZiSrDRNFYoSnaEwGhgrEifLS4ogMHhtlBMNImic0LNTfHSniSs14YCrzLdWil4v/eYNEh5deSkWcaCLwfFCYMKgjmCcCR1QSrNnMEIQlNbtCPEESYW1yq5kQ3MWTl0n3tOmeN8/uzhqtmzKOKqiDI3ACXHABWuAatEEHYPAInsEreLOerBfr3fqYl1assucQ/IH1+QMH7ZkF Lcgac Label transformation Teacher Target data Augmented target data Student Data augmentation AAAB+HicbVDLSsNAFJ3UV62PRl26GSyCq5KIqMuCG91VsA9oQ5hMJ+3QmSTM3Ag15EvcuFDErZ/izr9x0mahrQcGDufcyz1zgkRwDY7zbVXW1jc2t6rbtZ3dvf26fXDY1XGqKOvQWMSqHxDNBI9YBzgI1k8UIzIQrBdMbwq/98iU5nH0ALOEeZKMIx5ySsBIvl1P/GwoCUyUzDTkuW83nKYzB14lbkkaqETbt7+Go5imkkVABdF64DoJeBlRwKlgeW2YapYQOiVjNjA0IpJpL5sHz/GpUUY4jJV5EeC5+nsjI1LrmQzMZJFRL3uF+J83SCG89jIeJSmwiC4OhanAEOOiBTziilEQM0MIVdxkxXRCFKFguqqZEtzlL6+S7nnTvWxe3F80WndlHVV0jE7QGXLRFWqhW9RGHURRip7RK3qznqwX6936WIxWrHLnCP2B9fkDyeqT3A== pst Backward Consistency training EMA AAACWXicbVFNT9wwEHVCgW1aYClHLlZXSPTQVYJo6RGpF8QJJBaQNqvIcSa7Fs6H7Al05fWf5ICE+ld6wAl74KMjWX568+Z5ZpzWUmgMw0fPX/mwurbe+xh8+ryxudXf/nKpq0ZxGPFKVuo6ZRqkKGGEAiVc1wpYkUq4Sm9+t/mrW1BaVOUFzmuYFGxailxwho5K+rUx5sLaxMytDfZojPAHO1ejILPmLkFLF4s4rWSm54W7TJ10IsOaqbX0O31hEEvIcd+8UtvWIVZiOsNvi0VykPQH4TDsgr4H0RIMyDLOkv59nFW8KaBELpnW4yiscWKYQsEl2CBuNNSM37ApjB0sWQF6YroZLN1zTEbzSrlTIu3YlxWGFbrt1CkLhjP9NteS/8uNG8x/TYwo6wah5M8P5Y2kWNF2zTQTCjjKuQOMK+F6pXzGFOPoPiNwS4jejvweXB4Mo5/DH+eHg+PT5Tp6ZJd8JfskIkfkmJyQMzIinDyQf96qt+b99T2/5wfPUt9b1uyQV+HvPAE1t7i1 Ty AAACWXicbVFLT9wwEHZSHkvoY1uOvViskODQVYIocETqBfVEJRaQNqvIcSa7Fs5D9qRl5fWf7KES4q9wqBMWiddIlj99883nmXFaS6ExDG89/93K6tp6byPYfP/h46f+5y8XumoUhxGvZKWuUqZBihJGKFDCVa2AFamEy/T6R5u//A1Ki6o8x3kNk4JNS5ELztBRSb82xpxbm5gba4MdGiPcYOdqFGTW/EnQ0sUiTiuZ6XnhLlMnnciwZmot/UYfDebWxhJy3DXP1LZ1iJWYznBvsUj2k/4gHIZd0NcgWoIBWcZZ0v8bZxVvCiiRS6b1OAprnBimUHAJNogbDTXj12wKYwdLVoCemG4GS3cck9G8Uu6USDv2aYVhhW47dcqC4Uy/zLXkW7lxg/nxxIiybhBK/vBQ3kiKFW3XTDOhgKOcO8C4Eq5XymdMMY7uMwK3hOjlyK/Bxf4wOhx+/3UwOPm5XEePfCXbZJdE5IickFNyRkaEk3/k3lv11rw73/N7fvAg9b1lzRZ5Fv7WfzPMuLQ= Tx AAAB9XicbVDLSgMxFL1TX7W+qi7dBIvgqsyIqMuCG91VsA9ox5JJ0zY0mRmSO0oZ+h9uXCji1n9x59+YaWehrQcCh3Pu5Z6cIJbCoOt+O4WV1bX1jeJmaWt7Z3evvH/QNFGiGW+wSEa6HVDDpQh5AwVK3o41pyqQvBWMrzO/9ci1EVF4j5OY+4oOQzEQjKKVHuJeV1EcaZWiNy31yhW36s5AlomXkwrkqPfKX91+xBLFQ2SSGtPx3Bj9lGoUTPJpqZsYHlM2pkPesTSkihs/naWekhOr9Mkg0vaFSGbq742UKmMmKrCTWUaz6GXif14nwcGVn4owTpCHbH5okEiCEckqIH2hOUM5sYQyLWxWwkZUU4a2qKwEb/HLy6R5VvUuqud355XabV5HEY7gGE7Bg0uowQ3UoQEMNDzDK7w5T86L8+58zEcLTr5zCH/gfP4AWUOScQ== pt1 Con fi dence-aware 
 consistency loss Con fi dence weight wt AAACYnicbVFNT9wwEHUCtDRtYSlHOFiskODAKkFt6RGpl4oTlVhA2qwix5nsWjgfsictkdd/kltPXPghdcIe+BrJ8tN7b8Yz47SWQmMY/vP8ldW1d+/XPwQfP33e2BxsfbnUVaM4jHklK3WdMg1SlDBGgRKuawWsSCVcpTc/O/3qDygtqvIC2xqmBZuVIhecoaOSQRsj3GJfxyjIrPmboA326Vs0XSzitJKZbgt3mTrpTYY1M2vpETXGXFibmNbaWEKOB+aZ23YVYiVmczxcLJLjZDAMR2Ef9DWIlmBIlnGeDO7irOJNASVyybSeRGGNU8MUCi7BBnGjoWb8hs1g4mDJCtBT089g6b5jMppXyp0Sac8+zTCs0F2nzlkwnOuXWke+pU0azH9MjSjrBqHkjw/ljaRY0W7fNBMKOMrWAcaVcL1SPmeKcXS/ErglRC9Hfg0uj0fR99G331+Hp2fLdayTHbJHDkhETsgp+UXOyZhwcu+teRvepvfgB/6Wv/1o9b1lzjZ5Fv7uf5Sku7k= wt AAAC7XicbVFNb9NAEN2YjxbzlcKRy4oIqRwSxRFQDghVtKo4oSKatlIcWev1JFl117Z2x7TRZn8GN8SVn8Jv4EdwhStrJwfSMJLlp5k3szPvpaUUBvv9n63gxs1bt7e274R3791/8LC98+jUFJXmMOSFLPR5ygxIkcMQBUo4LzUwlUo4Sy8O6vrZZ9BGFPkJzksYKzbNxURwhj6VtFmMcIXNHKshc/YyQUcXizgtZGbmyv9smTQka9A52qXW2hPnEjt3LpYwwV27Rnb1gFiL6QyfLxbJIGl3+r1+E3QTRCvQIas4TnZaGGcFrxTkyCUzZhT1SxxbplFwCS6MKwMl4xdsCiMPc6bAjG1zg6PPfCajk0L7L0faZP/tsEyZelXPVAxn5nqtTv6vNqpw8npsRV5WCDlfPjSpJMWC1sLSTGjgKOceMK6F35XyGdOMo5c/DOND8MdoOPJbHTEl5NwOnZdUXA3S1Fm3Rvg0YyWs1ZWzuaP2TfctbXLdesFlYxh/gMvDlWAHhVIsz2ws8qw22UtinV36V19Ti7I5OHIu9D5F113ZBKeDXvSq9/Lji87+u5Vj2+QJeUp2SUT2yD55T47JkHDyg/wiv8mfoAi+BF+Db0tq0Fr1PCZrEXz/C3GT9jU= wt ||pst → Ty (pt ) ||2

Slide 51

Slide 51 text

How to estimate the confidence weight? • Use the disagreement of two networks with different parameters 51 [Figure: Teacher1 and Teacher2 predict p_t1 and p_t2 on the target data; the confidence weight is w_t = 2 (1 − sigmoid(||p_t1 − p_t2||²))]

Slide 52

Slide 52 text

Method overview • Student update from the dual-teacher ensemble and estimated confidence 52 [Figure: Teacher1 and Teacher2 predict on the target data; the ensemble prediction p_ens = (p_t1 + p_t2) / 2 is re-aligned by T_y; the student is trained on augmented target data (T_x) with the confidence-aware consistency loss w_t ||p_st − T_y(p_ens)||²]
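Putting slides 50–52 together, a hedged sketch of the confidence-aware update: two teachers predict on the clean target image, their disagreement yields w_t = 2(1 − sigmoid(||p_t1 − p_t2||²)), and the student matches the transformed teacher ensemble under that weight. Reductions and shapes are simplified relative to the paper; `augment` and `align` are the same abstractions as in the mean-teacher sketch.

```python
import torch

def confidence_aware_consistency(student, teacher1, teacher2, x_target, augment, align):
    x_aug, transform = augment(x_target)               # T_x
    with torch.no_grad():
        p_t1 = teacher1(x_target)
        p_t2 = teacher2(x_target)
        # Confidence from dual-teacher disagreement: w_t = 2 (1 - sigmoid(||p_t1 - p_t2||^2))
        disagree = ((p_t1 - p_t2) ** 2).flatten(1).sum(dim=1)
        w_t = 2.0 * (1.0 - torch.sigmoid(disagree))
        p_ens = 0.5 * (p_t1 + p_t2)                    # teacher ensemble
    p_student = student(x_aug)
    residual = ((p_student - align(p_ens, transform)) ** 2).flatten(1).mean(dim=1)
    return (w_t * residual).mean()                     # confidence-weighted consistency loss
```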

Slide 53

Slide 53 text

Qualitative results in the wild 53 Before After Before After Before After

Slide 54

Slide 54 text

Quantitative results 54 • Pose metric: PCK↑, Segmentation metric: IoU↑ (target domains: HO3D, HanCo, FPHA)
Method | DexYCB→HO3D (PCK / IoU / Avg.) | DexYCB→HanCo (PCK / IoU / Avg.) | DexYCB→FPHA (PCK / IoU / Avg.)
Source only | 33.5 / 49.1 / 41.3 | 27.3 / 41.4 / 34.3 | 14.0 / 24.8 / 19.4
DANN [Y. Ganin+, ICML’15] | 46.8 / 54.7 / 50.7 | 33.0 / 56.9 / 45.0 | 24.4 / 28.4 / 26.4
RegDA [J. Jiang+, CVPR’21] | 48.2 / 55.3 / 51.7 | 33.6 / 58.4 / 46.0 | 23.7 / 41.7 / 32.7
GAC | 47.4 / 56.9 / 52.1 | 37.1 / 58.8 / 47.9 | 37.2 / 33.3 / 35.3
UMA [M. Cai+, CVPR’20] | 45.3 / 55.0 / 50.2 | 35.6 / 57.7 / 46.6 | 36.8 / 39.3 / 38.0
Mean-Teacher [A. Tarvainen+, ICLR’17] | 44.4 / 52.3 / 48.3 | 33.8 / 55.1 / 44.4 | 31.3 / 38.4 / 34.9
Ours (CGAC) | 51.1 / 60.3 / 55.7 | 39.9 / 58.6 / 49.2 | 37.2 / 37.7 / 37.4

Slide 55

Slide 55 text

Summary • Propose a self-training domain adaptation method for estimating hand poses and pixel masks • Consistency training under data augmentation • Confidence estimation by using two networks • Teacher-student update • Inspired by my master's thesis on consensus pseudo-labeling [T. Ohkawa+, IEEE Access'21] • Patent submitted! • Work done during a research stay at CMU! 55

Slide 56

Slide 56 text

56 Robust modeling for fine details Data foundation Connecting geometry 
 with semantics Chapter 2: Survey for 3D hand datasets and learning methods (IJCV’23) Chapter 5: Pre-training from diverse images (ICLR’25) Chapter 6: Adaptation 
 in the wild (ECCV’22) Chapter 7: Video language description from hand tracklets (WACV’25) AssemblyHands dataset 
 with object contact (CVPR’23) Egocentric 3D HPE benchmarks in object interactions (ECCV’24) Chapter 3 Chapter 4: Self-contact (ICCV’25)
 (i) Goliath-SC dataset (ii) Generative prior 
 of contact pose

Slide 57

Slide 57 text

Video captioning for hand activities • Describe the procedure of egocentric activities with natural language 1. Create a new dataset for egocentric dense video captioning 2. Video captioning model from hand-object tracklets 3. Cross-view transfer learning from web instructional videos 57 [6] T. Ohkawa+, Exo2EgoDVC: Dense video captioning of egocentric procedural activities using web instructional videos. In WACV, 2025

Slide 58

Slide 58 text

Data collection • New EgoYC2 dataset paired with YouCook2 (YC2) • Re-record 11% of YC2 recipes from 44 users (~43h) • Follow YC2's vocabulary list and caption granularity 58 YouCook2 (176h of web videos) 
 [L. Zhou+, AAAI’18] • Define video captioning tasks to enhance the expressiveness of actions
 (vs EPIC-KITCHENS [D. Damen+, IJCV’22] / FPHA [G. Garcia-Hernando+, CVPR’18])

Slide 59

Slide 59 text

Challenges • Egocentric view is always shifting 59 • Web videos are rich sources but involve a view gap with various cuts → Identify interaction areas with hand-object tracking → Cross-view transfer learning to ego videos while finding ego-like views

Slide 60

Slide 60 text

Hand-object tracklets for video representation 60 Hand detection
 (Coarse but temporally coherent) Hand-object segmentation
 (Fine but frame-by-frame)

Slide 61

Slide 61 text

Exo-to-Ego transfer learning • Separate source views into exo and ego-like views • Gradual adaptation from exo to ego-like and finally to the ego view 61 Ego view (EgoYC2) Exo view (YC2) Ego-like view (YC2) Decompose

Slide 62

Slide 62 text

Feature processing 62 [Figure: source view decomposition with face detection — web videos with scene cuts (YC2) are split into exo shots (face detected: positive) and ego-like shots (negative), alongside ego shots from EgoYC2; hand-object (HO) tracking of hands + objects represents the ego videos as video features]

Slide 63

Slide 63 text

Feature processing 63 [Figure: same pipeline as the previous slide; the hand-object tracklets cover the hands, a 1st object, and a 2nd object to represent ego videos]

Slide 64

Slide 64 text

View-invariant learning 64 • Unified view-invariant learning with adversarial adaptation (exo→ego-like→ego) [Figure: video features from hand-object tracking pass through a feature converter to yield view-invariant video features; a view classifier attached via a gradient reversal layer (GRL) is trained on view labels (exo / ego-like / ego), while a one-stage captioning head outputs segments and captions]
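A minimal sketch of the gradient reversal layer (GRL) behind the adversarial view-invariant learning on this slide: the view classifier learns to predict the view label (exo / ego-like / ego), while the reversed gradient pushes the feature converter to make the views indistinguishable. This is the standard DANN-style construction, not the thesis code; all names are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def view_adversarial_loss(features, view_labels, view_classifier, lam=1.0):
    """features: (B, D) converted video features; view_labels: exo=0, ego-like=1, ego=2."""
    reversed_feat = GradReverse.apply(features, lam)
    logits = view_classifier(reversed_feat)
    return nn.functional.cross_entropy(logits, view_labels)
```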

Slide 65

Slide 65 text

Qualitative results • Base (w/o VI): Pre-trained on YC2 and fine-tuned on EgoYC2 without view-invariant learning • Ours (w/ VI): Base with view-invariant learning in both pre-training and fine-tuning stages 65 Less irrelevant ingredients

Slide 66

Slide 66 text

Quantitative results 66 EgoYC2 eval set
• Metrics: BLEU, METEOR (text eval), CIDEr (visual-text eval), tIoU (segment eval); reported under the dvc_eval and SODA protocols
Method | Ego video feat. | dvc_eval BLEU4↑ | dvc_eval METEOR↑ | dvc_eval CIDEr↑ | SODA METEOR↑ | SODA CIDEr↑ | SODA tIoU↑
Source only | Raw | 0.00 | 0.77 | 3.60 | 0.89 | 1.47 | 17.9
Base (w/o VI) | Raw | 1.54 | 7.03 | 38.1 | 7.03 | 25.2 | 50.5
Base (w/o VI) | Det | 1.97 | 8.20 | 46.3 | 8.04 | 32.3 | 55.0
Base (w/o VI) | Det+Seg | 1.68 | 8.91 | 52.5 | 8.91 | 37.3 | 59.0
+MMD [E. Tzeng+, arXiv’14] | Det+Seg | 1.74 | 8.86 | 50.9 | 8.86 | 37.5 | 58.8
+DANN [Y. Ganin+, ICML’15] | Det+Seg | 2.05 | 9.01 | 53.1 | 8.97 | 39.1 | 58.6
Ours (w/ VI) | Det+Seg | 2.66 | 9.19 | 59.0 | 9.27 | 45.2 | 58.1
(Ego video feat. = ego input representation; +MMD/+DANN align source → target features; Ours uses the proposed exo→ego-like→ego adaptation)

Slide 67

Slide 67 text

Summary • Describe egocentric activities in natural language • Egocentric video captioning dataset (EgoYC2) aligned to web videos YC2 • Hybrid video representation from hand detection and segmentation • Cross-view knowledge transfer with gradual adaptation • Collaborated with OMRON SINIC X 67

Slide 68

Slide 68 text

Thesis recap • The thesis establishes foundational understanding of hand interactions with precise tracking and interpretation • Highlight contact scenarios (i.e., object & self-contact) with dataset construction • Propose generalizable and adaptable methods that capture fine details
 (i.e., contrastive learning, generative prior, and self-training for adaptation) • Link geometry cues to semantic understanding with video captioning 68

Slide 69

Slide 69 text

Remaining issues A. Data: What’s the next frontier beyond current data capture scenarios? B. Modeling: What properties are lacking in current inference models? C. External Prior: Can we leverage strong cues from recent large models? D. Applications: What new capabilities can emerging models unlock beyond current tasks? 69

Slide 70

Slide 70 text

Future work i. Expanding data acquisition, sensors, and captured scenarios ii. Modeling for temporal context, human modalities, and real-time inference iii. Leveraging generative, foundation, and world models iv. Integrating with common-sense knowledge and reasoning v. Towards social and collaborative interactions vi. Physics-based simulation 70 (items map onto the remaining issues: A. Data, B. Modeling, C. External Prior, D. Applications)

Slide 71

Slide 71 text

Publications covered by the thesis (*co-first authors) 1. T. Ohkawa, R. Furuta, and Y. Sato. 
 Efficient annotation and learning for 3D hand pose estimation: A survey. IJCV, 2023
 Benchmarks and challenges in pose estimation for egocentric hand interactions with objects. In ECCV, 2024 3. T. Ohkawa, J. Lee, S. Saito, J. Saragih, F. Prada, Y. Xu, S. Yu, R. Furuta, Y. Sato, and T. Shiratori. 
 Generative modeling of shape-dependent self-contact human poses. In ICCV, 2025 4. N. Lin*, T. Ohkawa*, M. Zhang, Y. Huang, R. Furuta, and Y. Sato. 
 SiMHand: Mining of similar hands for large-scale 3D hand pose pre-training. In ICLR, 2025 5. T. Ohkawa, Y.-J. Li, Q. Fu, R. Furuta, K. M. Kitani, and Y. Sato. 
 Domain adaptive hand keypoint and pixel localization in the wild. In ECCV, 2022 6. T. Ohkawa, T. Yagi, T. Nishimura, R. Furuta, A. Hashimoto, Y. Ushiku, and Y. Sato. 
 Exo2EgoDVC: Dense video captioning of egocentric procedural activities using web instructional videos. In WACV, 2025
 Related publications not covered by the thesis A. T. Ohkawa, K. He, F. Sener, T. Hodan, L. Tran, and C. Keskin. 
 AssemblyHands: Towards egocentric activity understanding via 3D hand pose estimation. In CVPR, 2023 B. T. Banno, T. Ohkawa, R. Liu, R. Furuta, and Y. Sato. 
 AssemblyHands-X: 3D hand-body co-registration for understanding bi-manual human activities. In MIRU, 2025 C. R. Liu, T. Ohkawa, M. Zhang, and Y. Sato. 
 Single-to-dual-view adaptation for egocentric 3D hand pose estimation. In CVPR, 2024 D. T. Ohkawa, T. Yagi, A. Hashimoto, Y. Ushiku, and Y. Sato. 
 Foreground-aware stylization and consensus pseudo-labeling for domain adaptation of first-person hand segmentation. IEEE Access, 2021
 Awards / Fellowships • CVPR EgoVis Distinguished Paper Award'25, Google PhD Fellowship'24, ETH Zurich Leading House Asia'23, 
 MSRA D-CORE’23, JSPS DC1’22, JST ACT-X (’20-’22, Accel.’23) 71