Analysis of wearable device data using functional data models

Analysis of “big N” wearable device data using functional data models
Julia Wrobel, PhD
Department of Biostatistics and Bioinformatics

Biostatistics, Epidemiology, & Research Design Forum: Advances and Challenges in Wearables Research
Keynote speaker: Julia Wrobel, PhD
Friday, November 3, 10:00 AM to 3:00 PM
In-person: Morehouse School of Medicine, Building A, 4th Floor
Virtual: Zoom
Register: bit.ly/BERD2023

Wearable devices

Accelerometers
• Physical activity is key to many health-related questions
  • Active individuals tend to live longer and healthier lives
• Traditionally, physical activity has been measured using retrospective questionnaires
• Accelerometers have become hugely popular
  • Objective
  • Collection “in the wild”
  • High resolution

Accelerometer data processing pipeline

Accelerometer data processing pipeline
• Physical activity (PA) measures: total steps / counts, MVPA minutes
• Sedentary measures: sedentary time, number of sedentary bouts

Reproducibility and rigor
• Much of the processing pipeline is still up for debate
• Consider moderate-to-vigorous physical activity (MVPA)
  • How are “activity counts” generated?
  • How are cut points formed (no PA / light PA / MVPA)?
  • Are these consistent across devices? Age groups? Placements?
• Some general recommendations
  • Keep data in the rawest form possible
  • Process using non-proprietary software

Functional data analysis (FDA)
• Wearable devices record signals over 24-hour periods: the exact focus of FDA!
• In FDA, the outcome is a curve or function $Y_i(t)$
• For accelerometer data, $Y_i(t)$ is a 24-hour activity profile
[Figure: example 24-hour activity profile; x-axis $t$ (hour), y-axis $Y_i(t)$]

Uses for FDA in wearables
• Less pre-processing of the raw data
  • Less information is discarded
• Better ways of imputing data
  • Missing data is a big problem in wearables
• Time-dependent interpretations
  • Timing and consistency
  • Does it matter when and how regularly someone moves?

FDA tools for massive accelerometer studies
• Function-on-scalar regression (FoSR)
  • Functional outcome, scalar predictors (e.g. age)
  • UK Biobank Accelerometry Study
  • 80,000+ participants
• Generalized functional principal components analysis (gFPCA)
  • National Health and Nutrition Examination Survey (NHANES)
  • 4,000+ participants (2011-2014 wave)
• Registration
  • How does the timing of wake/sleep and PA differ across people?
  • Baltimore Longitudinal Study of Aging (BLSA)
  • 500+ participants

Function-on-scalar regression
Patterns in physical activity across ages in the UK Biobank study

Function-on-scalar regression
$Y_i(t) = \beta_0(t) + \sum_{p=1}^{P} \beta_p(t) X_{ip} + b_i(t) + \epsilon_i(t)$
• $Y_i(t)$: magnitude of physical activity at time $t$
• $X_{ip}$: scalar covariate $p$ (e.g. age) for subject $i$
• $\beta_p(t)$: coefficient function for covariate $p$
• $b_i(t) \sim GP(0, \Sigma_b)$; $\epsilon_i(t) \overset{iid}{\sim} N(0, \sigma_\epsilon^2)$
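To make the role of the coefficient functions concrete, here is a minimal sketch that estimates $\beta_0(t)$ and a single $\beta_1(t)$ by ordinary least squares separately at each time point. This is a simplification for illustration, not the estimation method used in the talk: it ignores the subject-level process $b_i(t)$ and applies no smoothing across $t$. The function name `pointwise_fosr` and the simulated data are hypothetical.

```python
import math
import random


def pointwise_fosr(Y, x):
    """Fit Y_i(t) = beta0(t) + beta1(t) * x_i by OLS at each time point.

    Y: list of n curves, each a list of T values on a shared time grid.
    x: list of n scalar covariate values (one per subject).
    Returns (beta0, beta1), each a list of length T.
    """
    n, T = len(Y), len(Y[0])
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("covariate must vary across subjects")
    beta0, beta1 = [], []
    for t in range(T):
        y_t = [Y[i][t] for i in range(n)]
        y_bar = sum(y_t) / n
        sxy = sum((x[i] - x_bar) * (y_t[i] - y_bar) for i in range(n))
        slope = sxy / sxx
        beta1.append(slope)
        beta0.append(y_bar - slope * x_bar)
    return beta0, beta1


# Simulated toy data (assumption): 200 subjects, hourly grid, age effect
# that varies over the day.
random.seed(1)
grid = list(range(24))
true_beta0 = [2.0 + math.sin(2 * math.pi * t / 24) for t in grid]
true_beta1 = [-0.02 * math.cos(2 * math.pi * t / 24) for t in grid]
ages = [random.uniform(40, 80) for _ in range(200)]
Y = [
    [true_beta0[t] + true_beta1[t] * a + random.gauss(0, 0.1) for t in grid]
    for a in ages
]
beta0_hat, beta1_hat = pointwise_fosr(Y, ages)
```

Each entry of `beta1_hat` is the estimated change in activity at that hour per one-year increase in age; plotting it against `grid` gives a rough picture of a coefficient function.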

FDA of 88,693 subjects from UK Biobank study
• Average daily activity patterns across ages from functional regression
• Left panel: males; right panel: females
J. Wrobel, J. Muschelli, and A. Leroux (2021). Sensors.

Fast generalized functional principal components analysis for ultra-high dimensional non-Gaussian wearable device data

Exponential family functional data
• Functional data methods typically assume $Y_i(t)$ is Gaussian
• Wearable device data is often non-Gaussian
  • Poisson: $Y_i(t) \in \{0, 1, 2, \ldots\}$ (activity counts)
  • Binary: $Y_i(t) \in \{0, 1\}$ (sedentary/active minutes)
• Instead assume $Y_i(t)$ follows an exponential family distribution
  • Assumes a smooth latent subject-specific mean $\mu_i(t) = E[Y_i(t)]$
  • Leads to a GLM-like framework: $g(E[Y_i(t)]) = \eta_i(t)$
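For the two data types listed above, the GLM-like framework can be written out with the canonical link functions. This is standard GLM notation added for illustration; the slides do not state which links are used.

```latex
\begin{align*}
\text{Binary:}\quad & Y_i(t) \sim \text{Bernoulli}\{\mu_i(t)\},
  & g\{\mu_i(t)\} &= \log\frac{\mu_i(t)}{1 - \mu_i(t)} = \eta_i(t) \\
\text{Poisson:}\quad & Y_i(t) \sim \text{Poisson}\{\mu_i(t)\},
  & g\{\mu_i(t)\} &= \log \mu_i(t) = \eta_i(t)
\end{align*}
```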

Example binary “curve” or “binary activity profile”
• Subject shown below is from BLSA data
• Active: $Y_i(t) = 1$ vs. inactive: $Y_i(t) = 0$

Binary activity profiles for studying sedentary behavior
• Raw counts at each minute are dichotomized at a low threshold to separate activity from inactivity
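The dichotomization step can be sketched in a few lines. The threshold of 10 counts per minute and the function name `binary_activity_profile` are assumptions chosen for illustration; the slides do not give the cut point that was used.

```python
def binary_activity_profile(counts, threshold=10):
    """Convert minute-level activity counts to a 0/1 activity profile.

    A minute is coded 1 (active) when its count is at or above `threshold`
    and 0 (inactive) otherwise. The default threshold is hypothetical.
    """
    return [1 if c >= threshold else 0 for c in counts]


# Eight minutes of made-up counts for one subject
minute_counts = [0, 3, 0, 25, 140, 87, 9, 0]
profile = binary_activity_profile(minute_counts)
# profile is [0, 0, 0, 1, 1, 1, 0, 0]
```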

Generalized functional principal components analysis
• Generalized FPCA and generalized regression model exponential family functional data using a generalized linear model (GLM)-like framework
$g(E[Y_i(s)]) = \eta_i(s) = \beta_0(s) + b_i(s) = \beta_0(s) + \sum_{k=1}^{K} \xi_{ik} \phi_k(s)$
• $Y_i \sim$ exponential family; $g(\cdot)$ is a link function
• $\beta_0(s)$ is a population mean function
• $\phi_k(s)$ are population-level eigenfunctions
• $\xi_{ik}$ are subject-specific scores

The NHANES 2011-2014 accelerometer study
• National Health and Nutrition Examination Survey
  • Accelerometer data from 2011-2014 wave released in 2021
  • Accelerometer data over multiple days from > 4000 subjects
  • 1440 minutes per day of PA measurement
• Goal is to understand population patterns in sedentary behavior
• Existing FDA methods cannot handle data of this size
• We proposed a fast, general-purpose algorithm for generalized FPCA

Four-step fast GFPCA algorithm
$g(E[Y_i(s)]) = \eta_i(s) = \beta_0(s) + b_i(s) = \beta_0(s) + \sum_{k=1}^{K} \xi_{ik} \phi_k(s)$
1. Bin the data along the functional domain $s$ into $L$ bins
2. Estimate separate local GLMMs in each bin to obtain $\eta_i(s_{m_l})$ at each bin midpoint
3. Estimate FPCA on local latent estimates $\eta_i(s_{m_l})$ to obtain eigenfunctions $\boldsymbol{\phi}(s)$
4. Estimate global model conditioning on eigenfunctions $\boldsymbol{\phi}(s)$ by re-estimating subject-specific scores $\xi_{ik}$
A. Leroux, C. Crainiceanu, and J. Wrobel (2023+). Fast generalized functional principal components analysis. Under review.

fastGFPCA simulation results
• Compared with two existing methods
  • Variational Bayes binary FPCA (Wrobel, 2019), bfpca
    • Cannot handle Poisson or other non-binary distributions
  • Two-step conditional model (Gertheiss, 2017), tsGFPCA
    • Breaks for N > 100
• fastGFPCA is
  • More accurate than tsGFPCA for binary and Poisson data
    • An order of magnitude faster
  • As accurate as or more accurate than bfpca for binary data
    • Comparable computation time

GFPCA results for NHANES data
• 4286 participants with 1440 observations each
• 3-4 hours of computation time (step 4 is the slow step)
• A subsampled version of step 4 reduced computation time to ~22 minutes

Curve registration for exponential family functional data

Misalignment in accelerometer data
• Time variation: subjects start and end the day at different times
• Activity level variation: people have higher or lower levels of activity

Misalignment in accelerometer data
• Same subjects, but probabilities of activity are shown below

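Registration methods align functional data by warping the domain
• Most methods are computationally inefficient and handle only continuous data
[Figure panels: unregistered curve $\mu_i(t_i^*)$; warping function $h_i^{-1}(t_i^*) = t$; registered curve $\mu_i\{h_i^{-1}(t_i^*)\} = \mu_i(t)$]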
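As a rough illustration of what a warping function does, the sketch below builds a strictly increasing, piecewise-linear map from observed clock time $t_i^*$ to registered time $t$ on a 24-hour domain, with both endpoints held fixed. The piecewise-linear form with a single interior knot and the name `make_warp` are assumptions chosen for simplicity; they are not the warping basis used in the registration method described in the talk.

```python
def make_warp(knot_observed, knot_registered, t_max=24.0):
    """Build a piecewise-linear warping function h_inv on [0, t_max].

    h_inv maps observed time t* to registered time t, is strictly increasing,
    fixes the endpoints (h_inv(0) = 0, h_inv(t_max) = t_max), and sends
    `knot_observed` to `knot_registered`.
    """
    if not (0 < knot_observed < t_max and 0 < knot_registered < t_max):
        raise ValueError("knots must lie strictly inside (0, t_max)")

    def h_inv(t_star):
        if t_star <= knot_observed:
            # First segment: from (0, 0) to (knot_observed, knot_registered)
            return t_star * knot_registered / knot_observed
        # Second segment: from the knot to (t_max, t_max)
        slope = (t_max - knot_registered) / (t_max - knot_observed)
        return knot_registered + (t_star - knot_observed) * slope

    return h_inv


# Hypothetical late riser: 10:00 on this subject's clock is aligned to 8:00
# on the shared (registered) time scale.
h_inv = make_warp(knot_observed=10.0, knot_registered=8.0)
warped = [h_inv(t) for t in (0.0, 5.0, 10.0, 17.0, 24.0)]
# warped is approximately [0, 4, 8, 16, 24]
```

Evaluating a subject's curve at `h_inv(t_star)` instead of `t_star` shifts their morning earlier while leaving the start and end of the day in place.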

Two-step exponential family registration algorithm
• Computationally efficient and geared towards binary data
• Step 1: estimate template
• Step 2: estimate warping
[Figure: unregistered curves $Y_i(t_i^*)$ and registered curves $Y_i(t)$]

Algorithm and software optimized for computational efficiency
• Step 1: Estimates template to which curves are registered
  • Uses fast, novel variational EM algorithm for binary functional data
• Step 2: Estimates warping function for each subject
  • Uses constrained maximum likelihood estimation
• Implemented in R package registr
  • Implemented in C++
References
• Wrobel, Goldsmith (2019). Registration for exponential family functional data. Biometrics.
• Wrobel (2018). registr: Registration for exponential family functional data. Journal of Open Source Software. 3.

Activity profiles pre-registration

Activity profiles post-registration

Future methods work in these areas
• Fast GFPCA
  • Multilevel data (Monday-Sunday)
    • Xinkai Zhou
  • Sparse and irregular data
• Fast generalized function-on-scalar regression
  • Dustin Rogers
• Registration
  • Multilevel registration

Acknowledgements
Colorado SPH Biostatistics
• Andrew Leroux
• Dustin Rogers
Columbia Biostatistics Functional Data Analysis Working Group
• Jeff Goldsmith
Johns Hopkins School of Public Health, WIT: Wearable and Implantable Technology
• Vadim Zipunnikov
• Jennifer Schrack
• John Muschelli
• Ciprian Crainiceanu
• Xinkai Zhou

Thanks!
Contact info
• juliawrobel.com
• github.com/julia-wrobel

Step 1: bin the data
Choose $L$ bins, where $m_l$ is the midpoint of bin $l \in \{1, \ldots, L\}$
Considerations
• Bin width: for simplicity, bins are equal-width and non-overlapping
• Number of bins
  • Too many bins: bin width is too small, identifiability issues
  • Too few bins: bin width is too big, bins don’t capture the shape of the underlying function
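A minimal sketch of equal-width, non-overlapping binning, assuming a minute-level grid on the domain [0, 1440) and 24 one-hour bins. The function name `bin_domain`, the choice of 24 bins, and the half-open domain convention are illustrative assumptions, not values taken from the talk.

```python
def bin_domain(grid, n_bins, lo, hi):
    """Assign grid points on [lo, hi) to equal-width, non-overlapping bins.

    Returns a list of dicts, one per bin, each with the bin `midpoint`
    and the `indices` of the grid points that fall in that bin.
    """
    width = (hi - lo) / n_bins
    bins = [
        {"midpoint": lo + (l + 0.5) * width, "indices": []}
        for l in range(n_bins)
    ]
    for j, s in enumerate(grid):
        # Clamp so a point exactly at `hi` lands in the last bin
        l = min(int((s - lo) / width), n_bins - 1)
        bins[l]["indices"].append(j)
    return bins


minutes = list(range(1440))  # one observation per minute of the day
bins = bin_domain(minutes, n_bins=24, lo=0, hi=1440)
# Each bin holds 60 minutes; the first midpoint is 30.0, the last is 1410.0
```

Changing `n_bins` shows the trade-off from the slide directly: with many bins each one holds only a few observations per subject, while with very few bins a single midpoint has to stand in for a long stretch of the day.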

Step 2: fit generalized linear mixed model (GLMM) in each bin
Fit a separate GLMM in each bin to get latent estimates
• $g(E[Y_i(s_{m_l})]) = \beta_0(s_{m_l}) + b_i(s_{m_l}) = \eta_i(s_{m_l})$
  • $s_{m_l}$: time $s$ at the midpoint of bin $l$
  • $\beta_0(s_{m_l})$: fixed effect mean
  • $b_i(s_{m_l})$: subject-specific random effect
  • $\eta_i(s_{m_l})$: linear predictor, local latent estimates
• Estimates are not on the original domain
  • On domain defined by bin midpoints
  • Model assumes constant effect for $\beta_0$, $b_i$ across each bin
• Used for estimating covariance matrix and eigenfunctions
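For binary data, the local model in bin $l$ could be written as a random-intercept logistic regression, with $s_{m_l}$ denoting the midpoint of bin $l$. The logit link, the normal distribution for the random intercept, and the bin-specific variance $\sigma_l^2$ are assumptions for illustration; the slide specifies only a fixed-effect mean and a subject-specific random effect.

```latex
\begin{align*}
\operatorname{logit} P\{Y_i(s_j) = 1\}
  &= \beta_0(s_{m_l}) + b_i(s_{m_l}), \qquad s_j \in \text{bin } l, \\
b_i(s_{m_l}) &\sim N(0, \sigma_l^2).
\end{align*}
```

All observations $s_j$ within a bin share the same intercept and random effect, which is the constant-effect assumption noted above.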

Step 3: estimate eigenfunctions using FPCA
Estimate FPCA using the linear predictor from Step 2
• $\hat{\eta}_i(s_{m_l}) = \hat{\beta}_0(s_{m_l}) + \sum_{k=1}^{K} \hat{\xi}_{ik} \hat{\phi}_k(s_{m_l})$
• Estimated using refund::fpca.face()
• Eigenfunctions $\hat{\boldsymbol{\phi}}$ characterize covariance
  • $K$: chosen by percent variance explained
  • Evaluated at bin midpoints rather than original domain
• Project eigenfunctions onto original domain
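Choosing $K$ by percent variance explained can be sketched as follows: keep the smallest number of components whose eigenvalues account for at least a target share of the total. The function name `choose_k`, the 90% target, and the eigenvalues are made up for illustration.

```python
def choose_k(eigenvalues, pve=0.90):
    """Return the smallest K whose leading eigenvalues explain >= pve of the total.

    `eigenvalues` must be non-negative and sorted in decreasing order.
    """
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, lam in enumerate(eigenvalues, start=1):
        running += lam
        if running / total >= pve:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues: cumulative shares are 0.50, 0.80, 0.95, 0.99, 1.00
eigenvalues = [5.0, 3.0, 1.5, 0.4, 0.1]
k = choose_k(eigenvalues, pve=0.90)
# k is 3
```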

Step 4: estimate GFPCA
Estimate GFPCA conditional on eigenfunctions from Step 3
• $g(E[Y_i(s) \mid \hat{\boldsymbol{\phi}}]) = \beta_0(s) + \sum_{k=1}^{K} \xi_{ik} \hat{\phi}_k(s)$
• Eigenfunctions are orthogonal basis functions
  • Reduces number of covariance parameters that need to be estimated for random effects
• Simple implementation
  • mgcv::bam()
