
IEEE AIxVR 2026 Keynote Talk: "Beyond Visibility: Understanding Scenes and Humans under Challenging Conditions with Diverse Sensing"

Slides of the IEEE AIxVR 2026 Keynote Talk entitled "Beyond Visibility: Understanding Scenes and Humans under Challenging Conditions with Diverse Sensing"


Mariko Isogawa

January 25, 2026

Transcript

  1. Beyond Visibility: Understanding Scenes and Humans under Challenging Conditions with

    Diverse Sensing IEEE AIxVR2026 Keynote Talk Mariko ISOGAWA Associate Professor, Keio University 2026.1.26
  2. A bit about Me Mariko Isogawa Associate Professor, Department of

    Information and Computer Science 2007 - 2013 Bachelor’s and Master’s Courses @ Osaka Univ. 2013 - 2022 NTT Laboratories 2016 - 2019 Ph.D. @ Osaka Univ. 2019 - 2020 Visiting Scholar @ Carnegie Mellon Univ. 2022 - 2023 Assistant Professor @ Keio University 2023 - current Associate Professor @ Keio University Research Interests: Computer Vision, Machine Learning, Sensing, XR 2
  3. Recent Advances in AI and XR The Sword of Damocles

    (1968) HTC Vive (2016) Oculus Rift (2016) Apple Vision Pro (2024) Microsoft HoloLens (2016) Google Glass (2013) Oculus Quest (2019) 3
  4. Recent Advances in AI and XR First Perceptron (1957) AlexNet

    (Krizhevsky+ 2012) OpenAI ChatGPT (2022) OpenAI DALL-E (2020) OpenAI Sora (2024) Stable Diffusion (2022) Transformer (Vaswani+ 2017) 4
  5. What can AI x XR Create? • Accurately understanding the

    world, and • Reproducing or extending it make a wide range of applications possible! 5
  6. Sensing and Recognition/Reconstruction Still Crucial • Regardless of how advanced

    AI and XR become, they rely on measured real-world data • Measurement, recognition, and reconstruction of environments and humans remain challenging 6 Sensing Experience Image © Adobe Stock/ dragonstock Recognition/ Reconstruction [ChatGPT]
  7. Role of Sensing in XR Natural and safe interaction in

    XR requires accurate capture of • user gaze and pose, • surrounding environment structure, and • presence and motion of other people, etc… 7 [Google Developers, Media pipe] [Meta, Segment Anything 3]
  8. Challenge: Occlusion, Low-Light Condition, Outside the FOV Dark environment Occluded

    by the desk or other people Outside the camera’s FOV Image © Adobe Stock / TommyStockProject, THANANIT, zef art 8
  9. How Can We Build AI x XR Applications Under Real-World

    Constraints? Instead of assuming full visibility and complete measurements, what remains possible with invisibility and constraints? 11
  10. Technical Questions • If something cannot be seen, can we

    compensate for it using alternative forms of observation? • If complete measurement is difficult, can we estimate or reconstruct the whole from partial information? • Can we extract only the information we need, without compromising privacy? 12 ?
  11. AI x XR in Surgical Training AI–XR integration in surgical

    scenes, focusing on reconstructing the scene and allowing users to review expert skills via XR 13 Images generated by ChatGPT Review expert skills
  12. Note: The next few slides contain real surgical images and

    videos. Please proceed with caution if you are sensitive to such content.
  13. 22

  14. Virtual Single-Viewpoint Surgical Video Synthesis from Multiple Cameras in Surgical

    Lights [Kato+ MICCAI2023] Kato, Isogawa, Mori, Saito, Kajita, Takatsume, “High-Quality Virtual Single-Viewpoint Surgical Video: Geometric Autocalibration of Multiple Cameras in Surgical Lights”, MICCAI2023. Kato, Mori, Saito, Takatsume, Kajita, Isogawa, “Disturbance-Free Surgical Video Generation from Multi-Camera Shadowless Lamps for Open Surgery”, arXiv 2025. Select the camera with the least occlusion Align the images w/o alignment w/ alignment (Ours) The direction of the surgical target keeps changing… 24
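As an aside on the camera-selection step above, here is a minimal sketch (not the authors' implementation; the function name and the occluder-mask source are assumptions) of picking the least-occluded view among the shadowless-lamp cameras:

```python
import numpy as np

def least_occluded_view(frames, occluder_masks):
    """Pick the camera view whose occluder mask covers the fewest pixels.

    frames: list of HxWx3 images from the cameras in the surgical light.
    occluder_masks: list of HxW boolean masks (True where the surgeon's
        head/hands occlude the field), e.g. from a segmentation model.
    Returns (best_index, best_frame).
    """
    coverage = [mask.mean() for mask in occluder_masks]  # fraction occluded per view
    best = int(np.argmin(coverage))
    return best, frames[best]
```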
  15. How can we record surgery close to the surgeon’s perspective

    without occlusion?? → We use Multi-camera Shadowless Lamps (McSL) Challenges: • Surgeon’s head/hands cause occlusion • Each camera view is fixed Cameras Input: Videos from five cameras 4D Scene Output: Occlusion-less free-viewpoint video Occlusion-free 4D Gaussians for Open Surgery Videos Using Multi-Camera Shadowless Lamps [Kato+ MICCAI2025 Spotlight] Kato, Mori, Saito, Takatsume, Kajita, Isogawa, “Occlusion-free 4D Gaussians for Open Surgery Videos Using Multi-Camera Shadowless Lamps”, MICCAI2025. 25 [ChatGPT]
  16. Occlusion-free 4D Gaussians for Open Surgery Videos Using Multi-Camera Shadowless Lamps [Kato+ MICCAI2025 Spotlight]

    [Pipeline figure (project page linked on slide): videos from the McSL cameras → SfM point initialization → 4D Gaussians → Occlusion Removal Module (occlusion masking via image segmentation scores, distance thresholding with d_threshold) → occlusion-free output] Input: Videos from five cameras. Output: Occlusion-less free-viewpoint video. Kato, Mori, Saito, Takatsume, Kajita, Isogawa, “Occlusion-free 4D Gaussians for Open Surgery Videos Using Multi-Camera Shadowless Lamps”, MICCAI2025. 26
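The distance-thresholding part of the Occlusion Removal Module can be illustrated with a minimal sketch; the array layout, variable names, and the interpretation of d_threshold are assumptions, and the actual MICCAI 2025 method combines this with occlusion masking from image segmentation:

```python
import numpy as np

def distance_threshold_gaussians(centers, cam_position, d_threshold):
    """Drop Gaussians lying between the camera and the surgical field.

    centers: (N, 3) Gaussian center positions.
    cam_position: (3,) shadowless-lamp camera position in the same frame.
    d_threshold: distance below which a Gaussian is treated as an occluder
        (surgeon's head/hands) and removed.
    Returns a boolean keep-mask over the N Gaussians.
    """
    dists = np.linalg.norm(centers - cam_position, axis=1)
    return dists >= d_threshold
```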
  17. Self-Occlusion in Wearable Sensing • Consider human pose estimation with

    a wrist-mounted camera • Camera sees only one side of the body, while the other side is hidden by the user’s own body Blocked region Observable region 28
  18. A Wide FOV is not Enough • Even with a

    wide or omnidirectional camera, half of the body remains self-occluded • Can we still recover human pose? Blocked region Camera’s FOV The left side remains occluded when the camera is on the right wrist… Observable region 29
  19. Our Approach Deep learning-based human 3D pose estimation from a

    single wrist-mounted omnidirectional camera 360° camera Input: 360° camera images Output: Human 3D Poses Red: Ground truth White: Estimated Camera’s FOV Hori, Hachiuma, Saito, Isogawa, Mikami, “Silhouette-based Synthetic Data Generation for 3D Human Pose Estimation with a Single Wrist-mounted 360° Camera”, IEEE ICIP2021. Hori, Hachiuma, Isogawa, Mikami, Saito, “Silhouette-based 3D Human Pose Estimation Using a Single Wrist-mounted 360° Camera”, IEEE Access, 2022. 30
  20. We propose a novel deep learning–based method that solves… Parameter

    tuning Dataset collection Computational cost Memory-hungry 32
  21. Our Challenge: Solving Data Scarcity with VR • AI model

    training requires substantial data, but data acquisition is time- and cost-intensive • VR makes large-scale data generation possible! • Any motion capture data can be imported • Virtual cameras can be installed Hori, Hachiuma, Saito, Isogawa, Mikami, “Silhouette-based Synthetic Data Generation for 3D Human Pose Estimation with a Single Wrist-mounted 360° Camera”, IEEE ICIP2021. Hori, Hachiuma, Isogawa, Mikami, Saito, “Silhouette-based 3D Human Pose Estimation Using a Single Wrist-mounted 360° Camera”, IEEE Access, 2022. MoCap Data Import into Unity Virtual 360° camera Capture Synthesized 360° camera images 33
  22. Domain Gap • Virtual and real data follow different distributions

    (domain gap) • Consequently, models trained on VR data do not generalize well to real-world data Domain gap Hori, Hachiuma, Saito, Isogawa, Mikami, “Silhouette-based Synthetic Data Generation for 3D Human Pose Estimation with a Single Wrist-mounted 360° Camera”, IEEE ICIP2021. Hori, Hachiuma, Isogawa, Mikami, Saito, “Silhouette-based 3D Human Pose Estimation Using a Single Wrist-mounted 360° Camera”, IEEE Access, 2022. 34
  23. Reducing the Domain Gap with Silhouettes RGB: clear visual differences

    between VR and real data Silhouettes: similar appearance across domains → can be used for bridging the gap! Silhouetting Hori, Hachiuma, Saito, Isogawa, Mikami, “Silhouette-based Synthetic Data Generation for 3D Human Pose Estimation with a Single Wrist-mounted 360° Camera”, IEEE ICIP2021. Hori, Hachiuma, Isogawa, Mikami, Saito, “Silhouette-based 3D Human Pose Estimation Using a Single Wrist-mounted 360° Camera”, IEEE Access, 2022. 35
  24. Silhouette-Based Human 3D Pose Estimation BiLSTM Hori, Hachiuma, Saito, Isogawa,

    Mikami, “Silhouette-based Synthetic Data Generation for 3D Human Pose Estimation with a Single Wrist-mounted 360° Camera”, IEEE ICIP2021. Hori, Hachiuma, Isogawa, Mikami, Saito, “Silhouette-based 3D Human Pose Estimation Using a Single Wrist-mounted 360° Camera”, IEEE Access, 2022. • To bridge the domain gap, silhouetting is applied to the input image sequence • A BiLSTM is introduced to maintain temporal consistency across frames Red: Ground truth White: Estimated Output: Human 3D Poses 36
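A minimal PyTorch sketch of the silhouette-sequence-to-pose idea: a small per-frame CNN on binary silhouettes followed by a BiLSTM that regresses 3D joints. Layer sizes, the joint count, and the overall architecture are illustrative assumptions, not the exact network of Hori et al.:

```python
import torch
import torch.nn as nn

class SilhouettePoseNet(nn.Module):
    """Silhouette sequence (B, T, 1, H, W) -> per-frame 3D joints (B, T, J, 3)."""
    def __init__(self, num_joints=21, feat_dim=256, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(            # per-frame CNN on 1-channel silhouettes
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_joints * 3)

    def forward(self, silhouettes):
        B, T = silhouettes.shape[:2]
        feats = self.encoder(silhouettes.flatten(0, 1)).view(B, T, -1)
        temporal, _ = self.bilstm(feats)          # temporal consistency across frames
        return self.head(temporal).view(B, T, -1, 3)

# poses = SilhouettePoseNet()(torch.rand(2, 8, 1, 128, 128))  # -> (2, 8, 21, 3)
```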
  25. Invisible Does Not Mean Nonexistent • Occlusion hides things from images and

    videos, but information still exists in the form of other physical quantities • Other sensing modalities may be able to observe them! 37 Low-light scene Prior work: [Mingmin+ 2018] WiFi Person behind the wall Event data Audio
  26. High-Temporal-Resolution Light Sensing Prior work [O’Toole+ CVPR2017] has shown that combining

    • Single-photon avalanche diode (SPAD) sensors for sensitive photon detection, and • Time-to-digital converter (TDC) with high temporal resolution allows us to measure light intensity and photon arrival time in a transient histogram form → this can be leveraged to “see around” occluded regions! Actively emitted light + SPAD&TDC The time it takes for light to travel to the object and back Photon amount ≒ light strength 38
  27. What is Optical Non-Line-of-Sight (NLOS) Imaging? Visible Wall Hidden Object Occluder Laser/SPAD sensor

    Observing Scenes Behind a Wall • Target scene is behind the wall: non-line-of-sight (NLOS) for the sensor • The visible wall is illuminated with laser light to obtain a transient image What information does this transient response contain? 39
  28. 1st response at 2.7 ns (visible wall) Sensor Output: 2nd

    response at 4.3 ns (hidden object) Visible Wall Occluder Laser & Sensor Hidden Object 40
  29. 1st response at 2.7 ns (visible wall) Sensor Output: Visible

    Wall Occluder 40 cm (2.7 ns) 24 cm (4.3 ns – 2.7 ns) 2nd response at 4.3 ns (hidden object) Laser & Sensor Hidden Object 41
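The distances on these two slides follow from the round-trip time of flight; a quick check, assuming c ≈ 3×10⁸ m/s:

```python
C = 3.0e8  # speed of light (m/s)

def one_way_distance(round_trip_seconds):
    """Round-trip photon travel time -> one-way distance in metres."""
    return C * round_trip_seconds / 2

print(one_way_distance(2.7e-9))           # ~0.405 m: laser/sensor -> visible wall
print(one_way_distance(4.3e-9 - 2.7e-9))  # ~0.24 m: visible wall -> hidden object
```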
  30. Non-Line-of-Sight Human Pose Estimation [Isogawa+CVPR2020] Isogawa, Yuan, O’Toole, Kitani, “Optical

    Non-Line-of-Sight Physics-Based 3D Human Pose Estimation”, CVPR2020. 42
  31. Challenge: Physically Valid Pose Estimation Reinforcement Learning is used for

    physically valid human pose estimation [Architecture figure: pseudo-transient image (input) → P2PSF Net with inverse PSF optimization (3D convolution, Fourier/inverse Fourier, max pooling, ResNet18) → transient features → LSTM policy π(a_t | s_t) driving a humanoid in a physics simulation environment → estimated 3D pose (output)] Isogawa, Yuan, O’Toole, Kitani, “Optical Non-Line-of-Sight Physics-Based 3D Human Pose Estimation”, CVPR2020. 43
  32. Challenge: Physically Valid Pose Estimation Reinforcement Learning is used for

    physically valid human pose estimation [Architecture figure: pseudo-transient image (input) → P2PSF Net with inverse PSF optimization (3D convolution, Fourier/inverse Fourier, max pooling, ResNet18) → transient features → LSTM policy π(a_t | s_t) driving a humanoid in a physics simulation environment → estimated 3D pose (output)] Isogawa, Yuan, O’Toole, Kitani, “Optical Non-Line-of-Sight Physics-Based 3D Human Pose Estimation”, CVPR2020. The humanoid kept falling over day after day… 44
  33. Challenge: Physically Valid Pose Estimation Reinforcement Learning is used for

    physically valid human pose estimation [Architecture figure: pseudo-transient image (input) → P2PSF Net with inverse PSF optimization (3D convolution, Fourier/inverse Fourier, max pooling, ResNet18) → transient features → LSTM policy π(a_t | s_t) driving a humanoid in a physics simulation environment → estimated 3D pose (output)] Isogawa, Yuan, O’Toole, Kitani, “Optical Non-Line-of-Sight Physics-Based 3D Human Pose Estimation”, CVPR2020. But it finally stood up on its own and began to walk! 45
  34. Learning from Pseudo Transient Images • Deep learning models require

    substantial data, but acquiring real transient measurements is costly and impractical at scale • To overcome this, we trained our models on virtual data by synthesizing pseudo-transient images from depth images Depth image w/ MoCap data Pseudo transient image Isogawa, Yuan, O’Toole, Kitani, “Optical Non-Line-of-Sight Physics-Based 3D Human Pose Estimation”, CVPR2020. 46
  35. Challenge: Domain Gap in Transient Image Domain gap exists between

    • real transient measurements, and • synthetic pseudo-transient data Depth image w/ MoCap data Pseudo transient image Isogawa, Yuan, O’Toole, Kitani, “Optical Non-Line-of-Sight Physics-Based 3D Human Pose Estimation”, CVPR2020. Domain gap • Noise & blur augmentation • Temporal shift • Resampling 47
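A minimal sketch of the three augmentations listed above, applied to a pseudo-transient volume; the (H, W, time-bin) layout and the parameter ranges are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def augment_pseudo_transient(volume, rng):
    """Augment a pseudo-transient volume (H, W, T time bins) so that it
    better resembles noisy real SPAD measurements."""
    v = volume.astype(np.float32)
    v = gaussian_filter(v, sigma=rng.uniform(0.0, 1.0))        # blur
    v = v + rng.normal(0.0, 0.01 * (v.max() + 1e-8), v.shape)  # sensor noise
    v = np.roll(v, shift=int(rng.integers(-3, 4)), axis=-1)    # temporal shift
    step = int(rng.choice([1, 2]))                             # temporal resampling
    v = np.repeat(v[..., ::step], step, axis=-1)[..., : volume.shape[-1]]
    return np.clip(v, 0.0, None)

# aug = augment_pseudo_transient(np.random.rand(64, 64, 128), np.random.default_rng(0))
```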
  36. Challenge: Reconstruction from Sparse Data • Even with carefully designed

    data representations, the task is difficult owing to low spatial resolution (32×32 spatial res.) • Reconstructed image (intermediate representation before pose estimation) is heavily blurred Transient Image P2PSF Net Inverse PSF Optimization Part Max pooling ResNet18 Inverse PSF Volume Transient Feature: Part of conventional NLOS imaging Presence of a person is only barely visible, making pose estimation highly challenging… Reference: actual scene Reconstructed scene Isogawa, Yuan, O’Toole, Kitani, “Optical Non-Line-of-Sight Physics-Based 3D Human Pose Estimation”, CVPR2020. 48
  37. Limitations of Visible Light • Cannot physically pass through solid

    obstacles • Light-based methods are ineffective in dark environments • Capturing of faces or clothing raises privacy concerns Person behind partition Dark environment Image © Adobe Stock / zef art 50
  38. Beyond Visibility: Looking at the Spectrum Many measurable signals exist

    beyond what humans can see! [Electromagnetic spectrum figure, wavelength (λ) axis: Philip Ronan/Wikimedia Commons] 51
  39. Beyond Visibility: Looking at the Spectrum [Philip Ronan/Wikimedia Commons] Radio-frequency

    signals (Wi-Fi, mmWave): Can penetrate materials such as paper, wood, and plastic Acoustic signals: Can propagate over a wide area (e.g., across rooms) through transmission and diffraction 52
  40. Wireless signals propagate / sounds remain audible, even in dark

    environments → Effective for sensing in low-light or dark environments Beyond Visibility: Looking at the Spectrum Image © Adobe Stock / Viesturs Image © Adobe Stock / A Stockphoto 53
  41. Beyond Visibility: Looking at the Spectrum Wireless/acoustic signals typically have

    low spatial resolution without dense arrays → Privacy-preserving as they cannot capture fine details (e.g., facial geometry) [Figures: acoustic signal waveform (amplitude vs. time), millimeter-wave point cloud (x, y, z)] 54
  42. Limited Spatial Resolution Motions smaller than a single wavelength may

    be undetectable. Visible signal: nm order; Acoustic signal: cm–m order; Wireless signal: mm–cm order 56
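These orders of magnitude follow from λ = v/f; a quick check with representative (assumed) frequencies and propagation speeds:

```python
def wavelength(speed_m_per_s, freq_hz):
    return speed_m_per_s / freq_hz

print(wavelength(3.0e8, 60e9))     # 60 GHz mmWave -> 0.005 m (mm order)
print(wavelength(343.0, 1e3))      # 1 kHz sound in air -> ~0.34 m (cm-m order)
print(wavelength(3.0e8, 5.45e14))  # ~550 THz visible light -> ~5.5e-7 m (nm order)
```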
  43. Specular Reflection and Diffraction • Because radio and acoustic wavelengths

    are long, many objects can become specular reflection sources • Diffraction (a phenomenon wherein the signal bends and propagates behind obstacles) occurs readily → Signal paths ascribed to these reflections/diffractions make analysis difficult Specular reflection Specular Reflection: When the surface roughness is smaller than signal wavelength Signal diffraction Diffuse Reflection: When the surface roughness is larger than signal wavelength [Scl Chua/Wikimedia Commons] 57
  44. Millimeter-wave-based Human Mesh Estimation [Amaya and Isogawa ICIP2023, Oral presentation]

    What are Millimeter Waves? • Radio-frequency signals in the 30–300 GHz band • Can penetrate paper/fabric/plastic, enabling sensing even in dark environments Our Method: • Input: mmWave point cloud data (locations of strong reflections and their intensities) • Output: Skinned Multi-Person Linear (SMPL) model parameters (pose and shape) [DeepX] Amplitude Time Mesh Estimation Input: mmWave point cloud Output: human mesh Amaya, Isogawa, “Adaptive and Robust Mmwave-Based 3D Human Mesh Estimation for Diverse Poses”, IEEE ICIP2023. 58
  45. Millimeter-wave-based Human Mesh Estimation [Amaya and Isogawa ICIP2023, oral presentation]

    Robust human mesh estimation from noisy mmWave point clouds Sphere-based denoising • Approximate human motion using six spherical regions (two arms, two legs, head, torso) • Estimate the person’s center and keep points within these regions Separated anchor points extraction • Use separate cuboids for upper/lower body to aggregate points around grid-based anchor points [Pipeline figure: sphere-based denoising → feature extraction → separated anchor points extraction → feature extraction] Input: mmWave point cloud Output: human mesh Amaya, Isogawa, “Adaptive and Robust Mmwave-Based 3D Human Mesh Estimation for Diverse Poses”, IEEE ICIP2023. 59
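A minimal sketch of the sphere-based denoising step: keep only mmWave points that fall inside body-part spheres placed around the estimated person center. Sphere placement and radii here are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

def sphere_denoise(points, sphere_centers, radii):
    """Keep mmWave points that fall inside any of the body-part spheres.

    points: (N, 3) point cloud of strong radar reflections.
    sphere_centers: (S, 3) centers for head/torso/arms/legs, e.g. offsets
        from the estimated person center (illustrative placement).
    radii: (S,) sphere radii in metres.
    """
    d = np.linalg.norm(points[:, None, :] - sphere_centers[None, :, :], axis=-1)
    inside_any = (d <= radii[None, :]).any(axis=1)
    return points[inside_any]

# Example: estimate the person center from the raw cloud, then place six spheres.
# center = np.median(points, axis=0)
```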
  46. Color coding represents velocity Red : Denoised point cloud outside

    the sphere; Green: point clouds inside the sphere [Results figure panels: raw point cloud, denoised, anchor points, existing method, ours, ground truth; color bar from slow to fast] Millimeter-wave-based Human Mesh Estimation [Amaya and Isogawa ICIP2023, oral presentation] Amaya, Isogawa, “Adaptive and Robust Mmwave-Based 3D Human Mesh Estimation for Diverse Poses”, IEEE ICIP2023. 60
  47. Image © Adobe Stock / Janos Echolocation: Sensing with Sound

    Dolphins emit ultrasonic waves to estimate object position/size from the echo return time 61
  48. Image © Adobe Stock / vaclav Echolocation: Sensing with Sound

    Bats compare emitted sounds with echoes to perceive their surroundings even in complete darkness 62
  49. Image © Adobe Stock / FedBul Humans use sonar for

    locating schools of fish in the ocean 63
  50. Human Pose Estimation with Active Acoustic Sensing [Shibata+ CVPR2023] Shibata,

    Kawashima, Isogawa, Irie, Kimura, Aoki, “Listening Human Behavior: 3D Human Pose Estimation With Acoustic Signals”, CVPR2023. Method for “listening to person's poses” using only acoustic signals for pose estimation, instead of visible light or wireless signals Speakers Ambisonics microphone Acoustic field is formed between the speakers and microphone 64
  51. Signals for Active Acoustic Sensing Time-Stretched Pulse (TSP) signals, whose

    frequency increases (or decreases) over time Why TSP signals? → Because they serve as time-stretched approximations of an impulse! • We reinterpret environment state estimation (including human pose) as an analysis of spatial reverberation characteristics • Ideally, the impulse response should be measured to obtain frequency characteristics, but true impulse responses are difficult to measure owing to hardware limitations [Spectrogram figure: frequency (Hz) vs. time (s)] 65
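To illustrate the “frequency increases over time” property, a linear swept sine (chirp) can stand in for the TSP signal; the actual TSP is defined through a quadratic phase in the frequency domain, and the parameters below are assumptions:

```python
import numpy as np

def swept_sine(fs=48000, duration=1.0, f_start=20.0, f_end=20000.0):
    """Linear swept sine: instantaneous frequency rises from f_start to f_end.
    Like a TSP, it spreads an impulse out over time so it can be played at
    moderate levels and later deconvolved into an impulse-response estimate."""
    t = np.arange(int(fs * duration)) / fs
    k = (f_end - f_start) / duration
    return np.sin(2 * np.pi * (f_start * t + 0.5 * k * t**2))

signal = swept_sine()  # play through the loudspeakers, record with the microphone
```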
  52. Microphone for Active Acoustic Sensing Four-channel ambisonics microphones to obtain

    the intensity vector, which represents phase delays (where sound energy is coming from in 3D space). W, X, Y, Z: each component of the captured signal; ℜ: real part.
\[
\mathbf{I}(f,t) = \Re\!\left\{ W^{*}(f,t)\cdot
\begin{pmatrix} X(f,t) \\ Y(f,t) \\ Z(f,t) \end{pmatrix} \right\},
\qquad
I_{\mathrm{norm,mel}}(k,t) = H_{\mathrm{mel}}(k,f)\,\frac{\mathbf{I}(f,t)}{\lVert \mathbf{I}(f,t)\rVert}
\]
    Mel scale conversion and standardization yield phase-delay features over time, frequency, and channel. 66
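A minimal sketch of computing the intensity-vector features from the four ambisonics channels with an STFT, following the formulas above; the STFT parameters are assumptions, and applying the mel filter bank H_mel is left as a comment:

```python
import numpy as np
from scipy.signal import stft

def intensity_vector(w, x, y, z, fs=48000, nperseg=1024):
    """I(f, t) = Re{ W*(f, t) * [X, Y, Z](f, t) }, normalized per bin.
    w, x, y, z: time-domain ambisonics channels of equal length."""
    _, _, W = stft(w, fs=fs, nperseg=nperseg)
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    _, _, Y = stft(y, fs=fs, nperseg=nperseg)
    _, _, Z = stft(z, fs=fs, nperseg=nperseg)
    I = np.real(np.conj(W)[None] * np.stack([X, Y, Z]))      # shape (3, F, T)
    norm = np.linalg.norm(I, axis=0, keepdims=True) + 1e-8   # avoid divide-by-zero
    return I / norm  # a mel filter bank H_mel(k, f) would then be applied along F
```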
  53. Human Pose Estimation with Active Acoustic Sensing [Shibata+CVPR2023] Shibata, Kawashima,

    Isogawa, Irie, Kimura, Aoki, “Listening Human Behavior: 3D Human Pose Estimation With Acoustic Signals”, CVPR2023. [Results: reference video, existing method, ours, ground truth. Pipeline figure: transmitted TSP signal from loudspeakers → received signal at ambisonics microphone → intensity vector and log-mel spectrum features (1D CNN / 2D CNN) → subject discriminator module → 3D human pose estimation network → sequence of 3D human poses] 67
  54. Red: person on the speaker–microphone line Extended Idea: Pose Estimation

    of Person Off the Speaker–Microphone Line [Oumi+ BMVC2024] • Features vary significantly with position, and the features for a person off the line are close to those when no person is present! • This strong position-dependent feature distribution complicates model training Oumi, Shibata, Irie, Kimura, Aoki, Isogawa, “Acoustic-based 3D Human Pose Estimation Robust to Human Position”, BMVC2024. Green: person away from the line Blue: no person present 68
  55. Extended Idea: Pose Estimation of Person Off the Speaker–Microphone Line [Oumi+ BMVC2024]

    [Architecture figure: acoustic features (log-mel spectrum, intensity vector) with data augmentation → 1D CNN / 2D CNN → pose estimation module (trained to succeed in pose estimation) and position discriminator module (features trained to fail at position estimation) via adversarial learning] Human pose estimation method that is robust to the person’s position • A position classifier is trained to estimate the person’s position • Simultaneously, the model generates feature representations that • Improve pose estimation accuracy, while • Intentionally fail to predict the person’s position Oumi, Shibata, Irie, Kimura, Aoki, Isogawa, “Acoustic-based 3D Human Pose Estimation Robust to Human Position”, BMVC2024. 69
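One common way to realize “trained to succeed in pose estimation, trained to fail at position estimation” is a gradient-reversal layer in front of the position discriminator; a minimal sketch under that assumption (the paper may implement the adversarial objective differently):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

def adversarial_step(features, poses_gt, pos_labels, pose_head, pos_disc, lam=0.1):
    """Pose loss on the shared features + position loss on gradient-reversed features."""
    pose_loss = nn.functional.mse_loss(pose_head(features), poses_gt)
    pos_logits = pos_disc(GradReverse.apply(features, lam))
    pos_loss = nn.functional.cross_entropy(pos_logits, pos_labels)
    # Minimizing the sum trains the discriminator on positions while pushing the
    # shared features to become position-invariant.
    return pose_loss + pos_loss
```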
  56. Limitation of Our Previous Acoustic Sensing The environment was extremely

    loud, as the active acoustic signal is in the audible range… Me during the initial pilot test Japanese meditation (Zazen) Complete opposite!!! 70
  57. Loud audible measurement signals are not practical for real-world use

    → Why not use background music as the sensing signal? Shibata, Oumi, Irie, Kimura, Aoki, Isogawa, “BGM2Pose: Active 3D Human Pose Estimation with Non-Stationary Sounds”, CVPRW2025. Acoustic Signal-based Human Pose Estimation with Background Music [Shibata+ CVPRW2025] 71
  58. Beyond Visibility in Time The beyond-visibility paradigm is not limited

    to spatial dimensions… What about the temporal dimension? → We can use an event-based camera! Detects “brightness changes” of scenes asynchronously on a per-pixel basis Difference in the capturing process of standard and event-based cameras [Mueggler+ IROS2014] 72
  59. Event Camera Output: With No Motion Without motion, only background

    noise is output Standard camera Event camera (ON, OFF events) 73
  60. When apparent movement is present in the scene, many events

    are triggered Event camera output with relative motion Standard camera Event camera (ON, OFF events) 74
  61. Event camera output with relative motion Ego motion of the

    camera also triggers events Standard camera Event camera (ON, OFF events) 75
  62. Key advantages of event cameras • High temporal resolution (of

    the order of microseconds) • Reduces latency and motion blur in fast- moving scenes • High dynamic range • Minimizes white and black clipping • Suitable for sensing in low-light environments • Energy-efficient and memory-saving • Ideal for building edge devices • Supports privacy protection • Harder for non-experts to extract identifiable information compared with standard images Scaramuzza, “Tutorial on Event-based Cameras” [Prophesee] 76
  63. Event Camera for Fast Human Motion in XR Can the

    high-speed sensing capability of event cameras be leveraged for capturing fast human motion? Event camera data consists of • Image coordinates (x, y) of the event location • Timestamp (t) of the event • Polarity, indicating brightness change 77
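A minimal sketch of handling such an (x, y, t, polarity) stream by accumulating events into a signed frame; the array layout is an assumption:

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate an event stream into a signed 2D frame.

    events: (N, 4) array with columns (x, y, t, polarity), polarity in {-1, +1}
        (layout assumed here for illustration).
    """
    frame = np.zeros((height, width), dtype=np.float32)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    pol = events[:, 3].astype(np.float32)
    np.add.at(frame, (ys, xs), pol)  # ON events add +1, OFF events add -1
    return frame
```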
  64. Human Mesh Recovery only with Event Data [Hori+ TVCG 2024]

    Hori, Isogawa, Mikami, Saito, “EventPointMesh: Human Mesh Recovery Solely From Event Point Clouds”, IEEE TVCG, 2024. 78
  65. Human Mesh Recovery only with Event Data [Hori+ TVCG 2024]

    Hori, Isogawa, Mikami, Saito, “EventPointMesh: Human Mesh Recovery Solely From Event Point Clouds”, IEEE TVCG, 2024. Mesh estimation first addresses the simpler task of 2D joint position estimation, using the weights of another model, rather than trying to solve the difficult task all at once! Input: Event point cloud Output: human mesh (3D pose + shape) 2D joint keypoint 79
  66. Ours (… fps) Ours (matched to the grayscale image’s fps, … fps to … fps) Hori, Isogawa, Mikami, Saito, “EventPointMesh:

    Human Mesh Recovery Solely From Event Point Clouds”, IEEE TVCG, 2024. Human Mesh Recovery only with Event Data [Hori+ TVCG 2024] 80
  67. Intensity image Accumulated event Estimated human mesh Hori, Isogawa, Mikami,

    Saito, “EventPointMesh: Human Mesh Recovery Solely From Event Point Clouds”, IEEE TVCG, 2024. Human Mesh Recovery only with Event Data [Hori+ TVCG 2024] Dancing in a dark room 81
  68. Potential Use Case Examples of motions estimated from event data

    using our method, applied to 3D characters Hori, Isogawa, Mikami, Saito, “EventPointMesh: Human Mesh Recovery Solely From Event Point Clouds”, IEEE TVCG, 2024. 82
  69. Egocentric Event-based Human Hand Pose Estimation [Hara+ IEEE ICIP2025, Oral]

    Synthetic event data Reference video (not used for training) Events are triggered by both the user’s hand motion and camera’s ego motion, making it difficult to estimate hand pose → Introduce segmentation task to extract hand regions as an intermediate task Hara, Ikeda, Hatano, Isogawa, “EventEgoHands: Event-based Egocentric 3D Hand Mesh Reconstruction”, IEEE ICIP2025. 84
  70. Existing methods Ours Ground truth Events Hara, Ikeda, Hatano, Isogawa,

    “EventEgoHands: Event-based Egocentric 3D Hand Mesh Reconstruction”, IEEE ICIP2025. Egocentric Event-based Human Hand Pose Estimation [Hara+ IEEE ICIP2025, Oral] 85
  71. Moving scene removal Events are triggered by both scene motion

    and camera motion (human’s motion) → Introduce segmentation step that removes events corresponding to scene movement Egocentric Event-based Human 3D Pose Estimation [Ikeda+ IEEE ICIP2025] Ikeda, Hatano, Hara, Isogawa, “Event-based Egocentric Human Pose Estimation in Dynamic Environment”, IEEE ICIP2025. w/ our motion segmentation Original event voxel frames 86
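A minimal sketch of the masking idea, assuming the segmentation step yields a per-pixel moving-scene mask (mask source and array layout are assumptions):

```python
import numpy as np

def remove_scene_motion_events(events, moving_scene_mask):
    """Discard events that fall on independently moving scene regions.

    events: (N, 4) array of (x, y, t, polarity).
    moving_scene_mask: HxW boolean array, True where the segmentation step
        marks scene movement. Remaining events are mostly triggered by the
        wearer's ego motion.
    """
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    keep = ~moving_scene_mask[ys, xs]
    return events[keep]
```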
  72. Ours Ground truth Existing method (EgoEgo) Event Event w/o scene

    movement Egocentric Event-based Human 3D Pose Estimation [Ikeda+ IEEE ICIP2025] Reference 3rd person view video Ikeda, Hatano, Hara, Isogawa, “Event-based Egocentric Human Pose Estimation in Dynamic Environment”, IEEE ICIP2025. 87
  73. Even when something is “invisible,” information may be available Often,

    information is beyond the reach of a single viewpoint, human vision, or conventional cameras, but it exists in other forms Deeper fusion of AI and XR can be achieved by • capturing what cannot be seen using alternative sensing modalities • compensating for incomplete observations through estimation and reconstruction 88
  74. What Comes Next? • What is the optimal and minimal

    sensor set? • Where should sensors be placed? • How should cross-sensor calibration be performed? • How can we address data scarcity? • How can we ensure stability across environments and relaxed sensing conditions? 89
  75. Two quotes that I often recall “The best way to

    predict the future is to invent it.” – Alan Kay “Any sufficiently advanced technology is indistinguishable from magic.” – Arthur C. Clarke 90
  76. Thank you for your attention! 91 Contact: HP: https://isogawa.ics.keio.ac.jp/ Email:

    [email protected] X: @m_isogawa Projects in this talk were partially supported by JST Presto, JSPS Grant-in-Aid for Challenging Research (Exploratory), JSPS KAKENHI(A), and Keio Academic Development Funds.