IEEE AIxVR 2026 Keynote Talk: "Beyond Visibility: Understanding Scenes and Humans under Challenging Conditions with Diverse Sensing"
Slides of the IEEE AIxVR 2026 Keynote Talk entitled "Beyond Visibility: Understanding Scenes and Humans under Challenging Conditions with Diverse Sensing"
XR requires accurate capture of • user gaze and pose, • surrounding environment structure, and • the presence and motion of other people. [Google Developers, MediaPipe] [Meta, Segment Anything 3]
• Can we compensate for it using alternative forms of observation? • If complete measurement is difficult, can we estimate or reconstruct the whole from partial information? • Can we extract only the information we need, without compromising privacy?
[Kato+ MICCAI2023] Kato, Isogawa, Mori, Saito, Kajita, Takatsume, “High-Quality Virtual Single-Viewpoint Surgical Video: Geometric Autocalibration of Multiple Cameras in Surgical Lights”, MICCAI2023. Kato, Mori, Saito, Takatsume, Kajita, Isogawa, “Disturbance-Free Surgical Video Generation from Multi-Camera Shadowless Lamps for Open Surgery”, arXiv 2025. Select the camera with the least occlusion, then align the images (figure: w/o alignment vs. w/ alignment (Ours)). The direction of the surgical target keeps changing.
How can we capture the surgical field without occlusion? → We use Multi-camera Shadowless Lamps (McSL). Challenges: • Surgeon’s head/hands cause occlusion • Each camera view is fixed. Input: videos from five cameras. Output: occlusion-free free-viewpoint video of the 4D scene. Occlusion-free 4D Gaussians for Open Surgery Videos Using Multi-Camera Shadowless Lamps [Kato+ MICCAI2025 Spotlight]. Kato, Mori, Saito, Takatsume, Kajita, Isogawa, “Occlusion-free 4D Gaussians for Open Surgery Videos Using Multi-Camera Shadowless Lamps”, MICCAI2025. [ChatGPT]
Human pose estimation with a wrist-mounted camera • The camera sees only one side of the body, while the other side is hidden by the user’s own body (blocked region vs. observable region).
Even with a wide-angle or omnidirectional camera, half of the body remains self-occluded • Can we still recover human pose? When the camera is on the right wrist, the left side of the body remains occluded (blocked region vs. observable region within the camera’s FOV).
3D human pose estimation with a single wrist-mounted omnidirectional (360°) camera. Input: 360° camera images. Output: 3D human poses (red: ground truth, white: estimated). Hori, Hachiuma, Saito, Isogawa, Mikami, “Silhouette-based Synthetic Data Generation for 3D Human Pose Estimation with a Single Wrist-mounted 360 Camera”, IEEE ICIP2021. Hori, Hachiuma, Isogawa, Mikami, Saito, “Silhouette-based 3D Human Pose Estimation Using a Single Wrist-mounted 360 Camera”, IEEE Access, 2022.
Model training requires substantial data, but real data acquisition is time- and cost-intensive • VR makes large-scale data generation possible! • Any motion capture data can be imported • Virtual cameras can be placed freely. Pipeline: MoCap data → import into Unity → virtual 360° camera capture → synthesized 360° camera images (a projection sketch follows below). Hori, Hachiuma, Saito, Isogawa, Mikami, “Silhouette-based Synthetic Data Generation for 3D Human Pose Estimation with a Single Wrist-mounted 360 Camera”, IEEE ICIP2021. Hori, Hachiuma, Isogawa, Mikami, Saito, “Silhouette-based 3D Human Pose Estimation Using a Single Wrist-mounted 360 Camera”, IEEE Access, 2022.
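A minimal sketch (not from the slides; image resolution and coordinate conventions are assumptions) of how a virtual 360° camera can project 3D joint positions into an equirectangular image when synthesizing training data:

import numpy as np

def project_equirectangular(points_cam, width=1024, height=512):
    """Project 3D points (in the 360 camera frame) onto an equirectangular image.

    points_cam: (N, 3) array of x, y, z coordinates relative to the camera.
    Returns (N, 2) pixel coordinates (u, v).
    """
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    lon = np.arctan2(x, z)                                     # azimuth in [-pi, pi]
    lat = np.arcsin(y / np.linalg.norm(points_cam, axis=1))    # elevation in [-pi/2, pi/2]
    u = (lon / np.pi + 1.0) * 0.5 * width                      # map azimuth to [0, width)
    v = (0.5 - lat / np.pi) * height                           # map elevation to [0, height)
    return np.stack([u, v], axis=1)

# Example: a joint 1 m in front of and slightly above the wrist camera
joints = np.array([[0.0, 0.2, 1.0]])
print(project_equirectangular(joints))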
Synthetic and real images differ in appearance (domain gap) • Consequently, models trained on VR data do not generalize well to real-world data. Hori, Hachiuma, Saito, Isogawa, Mikami, “Silhouette-based Synthetic Data Generation for 3D Human Pose Estimation with a Single Wrist-mounted 360 Camera”, IEEE ICIP2021. Hori, Hachiuma, Isogawa, Mikami, Saito, “Silhouette-based 3D Human Pose Estimation Using a Single Wrist-mounted 360 Camera”, IEEE Access, 2022.
Silhouettes look similar between VR and real data → they can be used to bridge the domain gap! Silhouetting. Hori, Hachiuma, Saito, Isogawa, Mikami, “Silhouette-based Synthetic Data Generation for 3D Human Pose Estimation with a Single Wrist-mounted 360 Camera”, IEEE ICIP2021. Hori, Hachiuma, Isogawa, Mikami, Saito, “Silhouette-based 3D Human Pose Estimation Using a Single Wrist-mounted 360 Camera”, IEEE Access, 2022.
Hori, Hachiuma, Saito, Isogawa, Mikami, “Silhouette-based Synthetic Data Generation for 3D Human Pose Estimation with a Single Wrist-mounted 360 Camera”, IEEE ICIP2021. Hori, Hachiuma, Isogawa, Mikami, Saito, “Silhouette-based 3D Human Pose Estimation Using a Single Wrist-mounted 360 Camera”, IEEE Access, 2022. • To bridge the domain gap, silhouetting is applied to the input image sequence • A BiLSTM is introduced to maintain temporal consistency across frames (a sketch follows below). Output: 3D human poses (red: ground truth, white: estimated).
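A minimal PyTorch sketch (an assumption, not the authors’ released code; feature dimension and joint count are illustrative) of a BiLSTM that regresses per-frame 3D joints from a sequence of silhouette features:

import torch
import torch.nn as nn

class SilhouettePoseBiLSTM(nn.Module):
    """Regress per-frame 3D joints from per-frame silhouette feature vectors."""
    def __init__(self, feat_dim=512, hidden=256, num_joints=21):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_joints * 3)

    def forward(self, feats):            # feats: (B, T, feat_dim)
        h, _ = self.lstm(feats)          # (B, T, 2*hidden), temporally smoothed
        joints = self.head(h)            # (B, T, num_joints*3)
        return joints.view(feats.shape[0], feats.shape[1], -1, 3)

model = SilhouettePoseBiLSTM()
dummy = torch.randn(2, 30, 512)          # 2 clips, 30 frames of silhouette features
print(model(dummy).shape)                # torch.Size([2, 30, 21, 3])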
Some scenes cannot be captured with ordinary videos, but the information still exists in the form of other physical quantities • Other sensing modalities may be able to observe them! Examples: low-light scenes, a person behind a wall (prior work: [Mingmin+ 2018], WiFi), event data, audio.
• Single-photon avalanche diode (SPAD) sensors for sensitive photon detection, and • a time-to-digital converter (TDC) with high temporal resolution allow us to measure light intensity and photon arrival time as a transient histogram → this can be leveraged to “see around” occluded regions! Actively emitted light + SPAD & TDC: the arrival time is the time it takes for light to travel to the object and back, and the photon count ≒ light intensity (a histogramming sketch follows below).
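A minimal sketch (bin width, bin count, and the record format are assumptions) of turning raw photon time stamps from a SPAD + TDC into a transient histogram:

import numpy as np

def transient_histogram(arrival_times_s, bin_width_s=16e-12, num_bins=2048):
    """Bin photon arrival times (seconds after the laser pulse) into a transient histogram."""
    bins = np.arange(num_bins + 1) * bin_width_s
    hist, _ = np.histogram(arrival_times_s, bins=bins)
    return hist  # hist[i] ~ photon count, i.e., light intensity at time bin i

# Example: photons returning from an object ~1.5 m away (round trip ~10 ns)
c = 3e8
arrivals = np.random.normal(loc=2 * 1.5 / c, scale=50e-12, size=1000)
h = transient_histogram(arrivals)
print(h.argmax() * 16e-12, "s peak arrival time")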
Observing Scenes Behind a Wall • The target scene is behind the wall: non-line-of-sight (NLOS) for the sensor • A visible wall is illuminated with laser light to obtain a transient image. What information does this transient response contain?
Training again requires substantial data, but acquiring real transient measurements is costly and impractical at scale • To overcome this, we trained our models on virtual data by synthesizing pseudo-transient images from depth images (depth image with MoCap data → pseudo-transient image; a sketch follows below). Isogawa, Yuan, O’Toole, Kitani, “Optical Non-Line-of-Sight Physics-Based 3D Human Pose Estimation”, CVPR2020.
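A minimal sketch (a simplification of the paper’s rendering model; grid size, bin width, and the radiometric falloff term are assumptions) of converting a depth map into a pseudo-transient volume by binning round-trip travel times:

import numpy as np

def depth_to_pseudo_transient(depth, spatial=32, num_bins=256, bin_width_s=32e-12):
    """Convert a depth map (meters) into a (spatial, spatial, num_bins) pseudo-transient volume."""
    c = 3e8
    H, W = depth.shape
    volume = np.zeros((spatial, spatial, num_bins))
    ys, xs = np.nonzero(depth > 0)                           # only pixels on the subject
    for y, x in zip(ys, xs):
        t = 2.0 * depth[y, x] / c                            # round-trip travel time
        b = int(t / bin_width_s)
        if b < num_bins:
            sy, sx = y * spatial // H, x * spatial // W      # down-sample to the sensor grid
            volume[sy, sx, b] += 1.0 / (depth[y, x] ** 2 + 1e-6)  # simple intensity falloff
    return volume

depth = np.zeros((256, 256)); depth[100:180, 120:160] = 1.2  # a toy person at 1.2 m
print(depth_to_pseudo_transient(depth).sum())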
Even with such data representations, the task is difficult owing to low spatial resolution (32×32) • The reconstructed image (an intermediate representation before pose estimation) is heavily blurred. Pipeline components: transient image, P2PSF Net, inverse PSF volume, inverse PSF optimization (part of conventional NLOS imaging), transient feature, max pooling, ResNet18. The presence of a person is only barely visible, making pose estimation highly challenging (reference: actual scene vs. reconstructed scene). Isogawa, Yuan, O’Toole, Kitani, “Optical Non-Line-of-Sight Physics-Based 3D Human Pose Estimation”, CVPR2020.
Radio-frequency signals (Wi-Fi, mmWave): can penetrate materials such as paper, wood, and plastic. Acoustic signals: can propagate over a wide area (e.g., across rooms) through transmission and diffraction.
Both have low spatial resolution without dense arrays → privacy-preserving, as they cannot capture fine details (e.g., facial geometry). Examples: acoustic signal (amplitude over time), millimeter-wave point cloud (x, y, z).
Because the wavelengths are long, many objects can become specular reflection sources • Diffraction (a phenomenon wherein the signal bends and propagates behind obstacles) occurs readily → Signal paths arising from these reflections/diffractions make analysis difficult. Specular reflection: when the surface roughness is smaller than the signal wavelength. Diffuse reflection: when the surface roughness is larger than the signal wavelength. [Scl Chua/Wikimedia Commons]
What are Millimeter Waves? • Radio-frequency signals in the 30–300 GHz band • Can penetrate paper/fabric/plastic, enabling sensing even in dark environments. Our Method: • Input: mmWave point cloud data (locations of strong reflections and their intensities) • Output: Skinned Multi-Person Linear (SMPL) model parameters (pose and shape). Input: mmWave point cloud → Output: human mesh. [DeepX] Amaya, Isogawa, “Adaptive and Robust Mmwave-Based 3D Human Mesh Estimation for Diverse Poses”, IEEE ICIP2021.
Robust human mesh estimation from noisy mmWave point clouds. Sphere-based denoising • Approximate human motion using six spherical regions (two arms, two legs, head, and torso) • Estimate the person’s center and keep only points within these regions (see the sketch below). Separated anchor point extraction • Use separate cuboids for the upper/lower body to aggregate points around grid-based anchor points. Pipeline: sphere-based denoising → separated anchor point extraction → feature extraction. Input: mmWave point cloud → Output: human mesh. Amaya, Isogawa, “Adaptive and Robust Mmwave-Based 3D Human Mesh Estimation for Diverse Poses”, IEEE ICIP2021.
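A minimal sketch (sphere offsets and radii are assumptions, not the paper’s exact values) of sphere-based denoising: keep only the mmWave points that fall inside body-part spheres placed around an estimated body center:

import numpy as np

def sphere_denoise(points, center, offsets, radii):
    """Keep points lying inside any of the body-part spheres.

    points:  (N, 3) mmWave point cloud
    center:  (3,) estimated body center (e.g., median of the raw points)
    offsets: (K, 3) sphere centers relative to the body center
    radii:   (K,) sphere radii in meters
    """
    centers = center + offsets                                        # (K, 3) absolute sphere centers
    d = np.linalg.norm(points[:, None, :] - centers[None], axis=-1)   # (N, K) point-to-sphere distances
    keep = (d < radii[None]).any(axis=1)
    return points[keep]

points = np.random.uniform(-2, 2, size=(500, 3))
center = np.median(points, axis=0)
offsets = np.array([[0, 0, 0], [0, 0.6, 0],            # torso, head
                    [-0.4, 0.3, 0], [0.4, 0.3, 0],      # left/right arm
                    [-0.2, -0.7, 0], [0.2, -0.7, 0]])   # left/right leg
radii = np.array([0.45, 0.25, 0.35, 0.35, 0.45, 0.45])
print(sphere_denoise(points, center, offsets, radii).shape)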
Figure: green marks point clouds inside the spheres; panels show the raw point cloud, denoised points, anchor points, and mesh results (existing method vs. ours vs. ground truth, fast and slow motions). Millimeter-wave-based Human Mesh Estimation [Amaya and Isogawa ICIP2023, oral presentation]. Amaya, Isogawa, “Adaptive and Robust Mmwave-Based 3D Human Mesh Estimation for Diverse Poses”, IEEE ICIP2021.
Kawashima, Isogawa, Irie, Kimura, Aoki, “Listening Human Behavior: 3D Human Pose Estimation With Acoustic Signals”, CVPR2023. A method for “listening to a person’s pose”: only acoustic signals are used for pose estimation, instead of visible light or wireless signals. Setup: an acoustic field is formed between the loudspeakers and an ambisonics microphone.
TSP (Time Stretched Pulse) signals: the frequency increases (or decreases) over time (spectrogram: frequency [Hz] vs. time [s]). Why TSP signals? → Because they serve as time-stretched approximations of an impulse! • We reinterpret environment state estimation (including human pose) as an analysis of spatial reverberation characteristics • Ideally, the impulse response should be measured to obtain frequency characteristics, but true impulse responses are difficult to measure owing to hardware limitations (a sweep-generation sketch follows below).
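A minimal sketch (a linear swept sine as a stand-in for the TSP actually used; sample rate and duration are assumptions) of generating a frequency sweep to probe room reverberation:

import numpy as np
from scipy.signal import chirp

fs = 48000                     # sample rate [Hz], assumed
duration = 2.0                 # sweep length [s], assumed
t = np.arange(int(fs * duration)) / fs
# Linear sweep from 20 Hz to 20 kHz; its autocorrelation approximates an impulse,
# so deconvolving the microphone recording with it yields an impulse-response estimate.
sweep = chirp(t, f0=20, f1=20000, t1=duration, method="linear")
print(sweep.shape)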
We compute the intensity vector, which represents phase delays (where sound energy is coming from in 3D space). W, X, Y, Z: components of the captured first-order ambisonics signal; $\mathbb{R}\{\cdot\}$: real part. $I(f,t) = \mathbb{R}\left\{ W^{*}(f,t) \cdot \big(X(f,t),\, Y(f,t),\, Z(f,t)\big)^{\top} \right\}$, $I_{\mathrm{norm,mel}}(k,t) = H_{\mathrm{mel}}(k,f)\,\dfrac{I(f,t)}{\lVert I(f,t) \rVert}$, followed by mel-scale conversion and standardization. The resulting features have channel, frequency, and time dimensions.
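A minimal sketch (STFT parameters are assumptions; the mel-filterbank conversion and standardization are omitted for brevity) of computing the normalized intensity-vector feature from first-order ambisonics channels W, X, Y, Z, following the formulas above:

import numpy as np
from scipy.signal import stft

def intensity_features(w, x, y, z, fs=48000, nperseg=1024):
    """Return a (3, F, T) normalized acoustic intensity vector from B-format audio."""
    spec = [stft(ch, fs=fs, nperseg=nperseg)[2] for ch in (w, x, y, z)]
    W, X, Y, Z = spec
    I = np.real(np.conj(W)[None] * np.stack([X, Y, Z]))        # (3, F, T) intensity vector
    norm = np.linalg.norm(I, axis=0, keepdims=True) + 1e-12    # avoid divide-by-zero
    return I / norm                                            # unit direction per (f, t)

fs = 48000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 440 * t)
feats = intensity_features(sig, 0.5 * sig, 0.3 * sig, 0.1 * sig, fs)
print(feats.shape)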
Kawashima, Isogawa, Irie, Kimura, Aoki, “Listening Human Behavior: 3D Human Pose Estimation With Acoustic Signals”, CVPR2023. Pipeline figure: transmitted TSP signal from the loudspeakers → received signal at the ambisonics microphone → acoustic signal features (intensity vector and log-mel spectrum; 1D/2D CNNs) → 3D human pose estimation network with a subject discriminator module → sequence of 3D human poses. Results figure: reference video, existing method, ours, ground truth.
Pose Estimation of a Person Off the Speaker–Microphone Line [Oumi+ BMVC2024] • Features vary significantly with position, and off-line features are close to those observed when no person is present! • This strong position dependence of the feature distribution complicates model training. (Plot: green = person away from the line, blue = no person present.) Oumi, Shibata, Irie, Kimura, Aoki, Isogawa, “Acoustic-based 3D Human Pose Estimation Robust to Human Position”, BMVC2024.
Extended Idea: Pose Estimation of a Person Off the Speaker–Microphone Line [Oumi+ BMVC2024]. A human pose estimation method that is robust to the person’s position • A position discriminator module is trained to estimate the person’s position • Simultaneously, the model learns feature representations that • improve pose estimation accuracy, while • intentionally failing to predict the person’s position (adversarial learning: trained to succeed at pose estimation, trained to fail at position estimation; a gradient-reversal sketch follows below). Pipeline: acoustic features (log-mel spectrum, intensity vector) with data augmentation → 1D/2D CNNs → pose estimation module and position discriminator module. Oumi, Shibata, Irie, Kimura, Aoki, Isogawa, “Acoustic-based 3D Human Pose Estimation Robust to Human Position”, BMVC2024.
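A minimal PyTorch sketch (a generic gradient-reversal formulation, not the authors’ exact implementation; layer sizes and the number of position bins are assumptions) of adversarial training that keeps features useful for pose estimation but uninformative about position:

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

feat_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU())   # toy acoustic feature encoder
pose_head = nn.Linear(64, 21 * 3)                         # pose regression head
pos_head = nn.Linear(64, 5)                               # position classifier (5 bins, assumed)

x = torch.randn(8, 128)
pose_gt = torch.randn(8, 21 * 3)
pos_gt = torch.randint(0, 5, (8,))

f = feat_net(x)
loss_pose = nn.functional.mse_loss(pose_head(f), pose_gt)
loss_pos = nn.functional.cross_entropy(pos_head(GradReverse.apply(f, 1.0)), pos_gt)
(loss_pose + loss_pos).backward()       # the encoder receives a reversed gradient from the position loss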
→ Why not use background music as the sensing signal? Acoustic Signal-based Human Pose Estimation with Background Music [Shibata+ CVPRW2025]. Shibata, Oumi, Irie, Kimura, Aoki, Isogawa, “BGM2Pose: Active 3D Human Pose Estimation with Non-Stationary Sounds”, CVPRW2025.
So far we have focused on spatial dimensions… What about the temporal dimension? → We can use an event-based camera! It detects per-pixel “brightness changes” of the scene asynchronously. Difference in the capturing process between standard and event-based cameras [Mueggler+ IROS2014].
• High temporal resolution (on the order of microseconds) • Reduces latency and motion blur in fast-moving scenes • High dynamic range • Minimizes white and black clipping • Suitable for sensing in low-light environments • Energy-efficient and memory-saving • Ideal for building edge devices • Supports privacy protection • Harder for non-experts to extract identifiable information compared with standard images. Scaramuzza, “Tutorial on Event-based Cameras” [Prophesee]
Can the high-speed sensing capability of event cameras be leveraged to capture fast human motion? Event camera data consist of • image coordinates (x, y) of the event location • a timestamp (t) of the event • a polarity, indicating the direction of the brightness change (a voxelization sketch follows below).
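A minimal sketch (sensor resolution and the number of time bins are assumptions) of accumulating a raw (x, y, t, polarity) event stream into a voxel grid that a network can consume:

import numpy as np

def events_to_voxel(x, y, t, p, H=260, W=346, num_bins=5):
    """Accumulate events into a (num_bins, H, W) voxel grid over their time span."""
    t = (t - t.min()) / max(t.max() - t.min(), 1e-9)          # normalize time to [0, 1]
    b = np.clip((t * num_bins).astype(int), 0, num_bins - 1)  # temporal bin per event
    voxel = np.zeros((num_bins, H, W), dtype=np.float32)
    np.add.at(voxel, (b, y, x), np.where(p > 0, 1.0, -1.0))   # signed accumulation by polarity
    return voxel

n = 10000
x = np.random.randint(0, 346, n); y = np.random.randint(0, 260, n)
t = np.sort(np.random.rand(n)); p = np.random.randint(0, 2, n)
print(events_to_voxel(x, y, t, p).shape)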
Hori, Isogawa, Mikami, Saito, “EventPointMesh: Human Mesh Recovery Solely From Event Point Clouds”, IEEE TVCG, 2024. Mesh estimation first addresses the simpler task of 2D joint position estimation and reuses those weights, rather than trying to solve the difficult task all at once! Input: event point cloud → 2D joint keypoints → Output: human mesh (3D pose + shape).
Hori, Isogawa, Mikami, Saito, “EventPointMesh: Human Mesh Recovery Solely From Event Point Clouds”, IEEE TVCG, 2024. Human Mesh Recovery Only with Event Data [Hori+ TVCG 2024]: dancing in a dark room.
Estimation results using our method, applied to 3D characters. Hori, Isogawa, Mikami, Saito, “EventPointMesh: Human Mesh Recovery Solely From Event Point Clouds”, IEEE TVCG, 2024.
Synthetic event data; reference video (not used for training). Events are triggered by both the user’s hand motion and the camera’s ego motion, making it difficult to estimate hand pose → We introduce a segmentation task to extract hand regions as an intermediate task (a masking sketch follows below). Hara, Ikeda, Hatano, Isogawa, “EventEgoHands: Event-based Egocentric 3D Hand Mesh Reconstruction”, IEEE ICIP2025.
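A minimal sketch (the hand mask is assumed to come from some segmentation network; resolution and shapes are illustrative) of using a per-pixel hand mask to discard events caused by ego motion before hand-pose estimation:

import numpy as np

def filter_events_by_mask(x, y, hand_mask):
    """Keep only the events whose pixel falls inside the predicted hand mask.

    x, y:      event pixel coordinates
    hand_mask: (H, W) boolean map from a segmentation network (assumed given)
    """
    keep = hand_mask[y, x]
    return x[keep], y[keep]

H, W = 260, 346
hand_mask = np.zeros((H, W), dtype=bool)
hand_mask[100:180, 150:250] = True                     # toy hand region
x = np.random.randint(0, W, 5000); y = np.random.randint(0, H, 5000)
xf, yf = filter_events_by_mask(x, y, hand_mask)
print(len(xf), "of", len(x), "events kept")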
Hara, Ikeda, Hatano, Isogawa, “EventEgoHands: Event-based Egocentric 3D Hand Mesh Reconstruction”, IEEE ICIP2025. Egocentric Event-based Human Hand Pose Estimation [Hara+ IEEE ICIP2025, Oral].
Events arise from both scene motion and camera motion (the human’s own motion) → We introduce a segmentation step that removes events corresponding to scene movement. Egocentric Event-based Human 3D Pose Estimation [Ikeda+ IEEE ICIP2025]. Ikeda, Hatano, Hara, Isogawa, “Event-based Egocentric Human Pose Estimation in Dynamic Environment”, IEEE ICIP2025. (Figure: original event voxel frames vs. with our motion segmentation.)
Egocentric Event-based Human 3D Pose Estimation [Ikeda+ IEEE ICIP2025]: results shown alongside a reference third-person-view video. Ikeda, Hatano, Hara, Isogawa, “Event-based Egocentric Human Pose Estimation in Dynamic Environment”, IEEE ICIP2025.
Much information is beyond the reach of a single viewpoint, human vision, or conventional cameras, but it exists in other forms. A deeper fusion of AI and XR can be achieved by • capturing what cannot be seen using alternative sensing modalities • compensating for incomplete observations through estimation and reconstruction.
Open questions: • What is the optimal sensor set? • Where should sensors be placed? • How should cross-sensor calibration be performed? • How can we address data scarcity? • How can we ensure stability across environments and relaxed sensing conditions?
[email protected] X: @m_isogawa The projects in this talk were partially supported by JST PRESTO, JSPS Grant-in-Aid for Challenging Research (Exploratory), JSPS KAKENHI (A), and Keio Academic Development Funds.