Slide 1

Slide 1 text

Analysis of “big N” wearable device data using functional data models
Julia Wrobel, PhD
Department of Biostatistics and Bioinformatics

Slide 2

Slide 2 text

Biostatistics, Epidemiology, & Research Design Forum: Advances and Challenges in Wearables Research
Keynote speaker: Julia Wrobel, PhD
Friday, November 3, 10:00 AM to 3:00 PM
In-person: Morehouse School of Medicine, Building A, 4th Floor
Virtual: Zoom
Register: bit.ly/BERD2023

Slide 3

Slide 3 text

Wearable devices

Slide 4

Slide 4 text

Wearable devices

Slide 5

Slide 5 text

Wearable devices

Slide 6

Slide 6 text

Wearable devices

Slide 7

Slide 7 text

Accelerometers
• Physical activity is key to many health-related questions
  • Active individuals tend to live longer and healthier lives
• Traditionally, physical activity has been measured using retrospective questionnaires
• Accelerometers have become hugely popular
  • Objective
  • Collection “in the wild”
  • High resolution

Slide 8

Slide 8 text

Accelerometer data processing pipeline

Slide 9

Slide 9 text

Accelerometer data processing pipeline

Slide 10

Slide 10 text

Accelerometer data processing pipeline
• Physical activity (PA) measures: total steps or counts, minutes of moderate-to-vigorous physical activity (MVPA)
• Sedentary measures: sedentary time, number of sedentary bouts
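As a rough illustration of how such summary measures come out of minute-level data, the following Python sketch computes total counts, MVPA minutes, sedentary time, and the number of sedentary bouts for one day. The cut points `SEDENTARY_MAX` and `MVPA_MIN` and the function `summarize_day` are hypothetical and are not taken from the talk; real thresholds depend on device, placement, and population.

```python
# Hypothetical cut points (counts per minute). These values are assumptions
# for illustration only and are NOT taken from the slides.
SEDENTARY_MAX = 100   # below this: sedentary (assumed)
MVPA_MIN = 2000       # at or above this: MVPA (assumed)


def summarize_day(counts):
    """Summarize one day of minute-level activity counts."""
    total_counts = sum(counts)
    mvpa_minutes = sum(1 for c in counts if c >= MVPA_MIN)
    sedentary_minutes = sum(1 for c in counts if c < SEDENTARY_MAX)

    # A sedentary bout is a maximal run of consecutive sedentary minutes.
    bouts = 0
    in_bout = False
    for c in counts:
        sedentary = c < SEDENTARY_MAX
        if sedentary and not in_bout:
            bouts += 1
        in_bout = sedentary

    return {
        "total_counts": total_counts,
        "mvpa_minutes": mvpa_minutes,
        "sedentary_minutes": sedentary_minutes,
        "sedentary_bouts": bouts,
    }


# Toy 8-minute record (a real day would have 1440 minutes).
example = [0, 50, 2500, 3000, 10, 0, 500, 20]
summary = summarize_day(example)
```

Every number here depends on the chosen cut points, which is one reason scalar summaries can be hard to compare across studies.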

Slide 11

Slide 11 text

Reproducibility and rigor
• Much of the processing pipeline is still up for debate
• Consider moderate-to-vigorous physical activity (MVPA)
  • How are “activity counts” generated?
  • How are cut points formed (no PA / light PA / MVPA)?
  • Are these consistent across devices? Age groups? Placements?
• Some general recommendations
  • Keep data in the rawest form possible
  • Process using non-proprietary software

Slide 12

Slide 12 text

Functional data analysis (FDA)
• Wearable devices record a signal over 24-hour periods, which is exactly the kind of data FDA targets
• In FDA, the outcome is a curve or function $Y_i(t)$
• For accelerometer data, $Y_i(t)$ is a 24-hour activity profile
[Figure: activity profile $Y_i(t)$ plotted against $t$ (hour)]

Slide 13

Slide 13 text

Uses for FDA in wearables
• Less pre-processing of the raw data
  • Less information is discarded
• Better ways of imputing data
  • Missing data is a big problem in wearables
• Time-dependent interpretations
  • Timing and consistency
  • Does it matter when and how regularly someone moves?

Slide 14

Slide 14 text

FDA tools for massive accelerometer studies
• Function-on-scalar regression (FoSR)
  • Functional outcome, scalar predictors (e.g. age)
  • UK Biobank Accelerometry Study
  • 80,000+ participants
• Generalized functional principal components analysis (gFPCA)
  • National Health and Nutrition Examination Survey (NHANES)
  • 4,000+ participants (2011-2014 wave)
• Registration
  • How does the timing of wake/sleep and PA differ across people?
  • Baltimore Longitudinal Study of Aging (BLSA)
  • 500+ participants

Slide 15

Slide 15 text

Function-on-scalar regression
Patterns in physical activity across ages in the UK Biobank study

Slide 16

Slide 16 text

Function-on-scalar regression
$$Y_i(t) = \beta_0(t) + \sum_{p=1}^{P} \beta_p(t) X_{ip} + b_i(t) + \epsilon_i(t)$$
• $Y_i(t)$: magnitude of physical activity at time $t$ for subject $i$
• $X_{ip}$: scalar covariate $p$ (e.g. age) for subject $i$
• $\beta_p(t)$: coefficient function for covariate $p$
• $b_i(t) \sim GP(0, \Sigma_b)$; $\epsilon_i(t) \overset{iid}{\sim} N(0, \sigma_\epsilon^2)$
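To make the idea of a coefficient function concrete, here is a minimal Python sketch of a naive function-on-scalar fit with a single covariate: ordinary least squares is run separately at each time point, which traces out an intercept curve and a slope curve. This is a simplification assumed for illustration. It ignores the subject-specific term b_i(t), applies no smoothing across time, and is not the estimation procedure used in the cited UK Biobank analysis. The function name `pointwise_fosr` and the toy data are hypothetical.

```python
def pointwise_fosr(Y, x):
    """Naive function-on-scalar regression with one scalar covariate.

    Y: list of N curves, each a list of T values observed on a shared grid.
    x: list of N scalar covariate values (e.g. age).
    Returns (beta0, beta1): intercept and slope curves, each of length T.
    """
    n = len(x)
    T = len(Y[0])
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)

    beta0, beta1 = [], []
    for t in range(T):
        y_t = [Y[i][t] for i in range(n)]
        y_bar = sum(y_t) / n
        sxy = sum((x[i] - x_bar) * (y_t[i] - y_bar) for i in range(n))
        slope = sxy / sxx          # OLS slope at time t
        beta1.append(slope)
        beta0.append(y_bar - slope * x_bar)  # OLS intercept at time t
    return beta0, beta1


# Toy noiseless data: Y_i(t) = beta0(t) + beta1(t) * age_i on a 3-point grid.
ages = [40.0, 50.0, 60.0, 70.0]
true_b0 = [10.0, 20.0, 30.0]
true_b1 = [-0.1, -0.2, 0.0]
curves = [[true_b0[t] + true_b1[t] * a for t in range(3)] for a in ages]

b0_hat, b1_hat = pointwise_fosr(curves, ages)
```

With noiseless toy data the fitted curves match the generating ones up to rounding; with real data, smoothing the coefficient curves and modeling within-subject correlation would matter.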

Slide 17

Slide 17 text

FDA of 88,693 subjects from UK Biobank study
• Average daily activity patterns across ages from functional regression
• Left panel: males; right panel: females
J. Wrobel, J. Muschelli, and A. Leroux (2021). Sensors.

Slide 18

Slide 18 text

Fast generalized functional principal components analysis for ultra-high dimensional non-Gaussian wearable device data

Slide 19

Slide 19 text

Exponential family functional data
• Most functional data methods assume $Y_i(t)$ is Gaussian
• Wearable device data is often non-Gaussian
  • Poisson: $Y_i(t) \in \{0, 1, 2, \ldots\}$ (activity counts)
  • Binary: $Y_i(t) \in \{0, 1\}$ (sedentary/active minutes)
• Instead, assume $Y_i(t)$ follows an exponential family distribution
  • Assumes a smooth latent subject-specific mean $\mu_i(t) = E[Y_i(t)]$
  • Leads to a GLM-like framework: $g(E[Y_i(t)]) = \eta_i(t)$
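The role of the link function g can be sketched in a few lines of Python. The example assumes the canonical links (logit for binary data, log for Poisson data), which are the usual GLM defaults rather than something stated on the slide, and the latent values in `eta` are made up.

```python
import math


def logit(p):
    """Canonical link for binary data: maps a probability to the real line."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse logit: maps a latent value back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def log_link(mu):
    """Canonical link for Poisson data: maps a positive mean to the real line."""
    return math.log(mu)


def inv_log_link(eta):
    """Inverse log link: maps a latent value back to a positive mean."""
    return math.exp(eta)


# A made-up latent curve eta_i(t) evaluated at five time points.
eta = [-2.0, -1.0, 0.0, 1.0, 2.0]

# The same latent curve read as a probability of being active (binary case)
# or as an expected activity count (Poisson case).
prob_active = [inv_logit(e) for e in eta]
expected_counts = [inv_log_link(e) for e in eta]
```

The latent curve is unconstrained and smooth, while the observed data stay binary or count-valued; the link function is what connects the two.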

Slide 20

Slide 20 text

Example binary “curve” or “binary activity profile”
• Subject shown below is from BLSA data
• Active ($Y_i(t) = 1$) vs. inactive ($Y_i(t) = 0$)

Slide 21

Slide 21 text

Example binary “curve” or “binary activity profile”
• Subject shown below is from BLSA data
• Active ($Y_i(t) = 1$) vs. inactive ($Y_i(t) = 0$)

Slide 22

Slide 22 text

Binary activity profiles for studying sedentary behavior
• Raw counts at each minute are dichotomized at a low threshold to distinguish activity from inactivity
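The dichotomization step is simple enough to sketch directly in Python. The threshold value and the choice of "at or above" as the rule for activity are assumptions for illustration; the talk does not specify them.

```python
# Hypothetical threshold (counts per minute); the value used in the talk
# is not stated, so this is an assumption.
ACTIVE_THRESHOLD = 10


def binary_profile(counts, threshold=ACTIVE_THRESHOLD):
    """Convert minute-level counts to a binary activity profile.

    Returns 1 (active) where the count is at or above the threshold,
    and 0 (inactive) otherwise.
    """
    return [1 if c >= threshold else 0 for c in counts]


# Toy 7-minute record.
counts = [0, 3, 25, 140, 9, 10, 0]
profile = binary_profile(counts)
```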

Slide 23

Slide 23 text

Generalized functional principal components analysis
• Generalized FPCA and generalized regression model exponential family functional data using a GLM-like framework
$$g(E[Y_i(s)]) = \eta_i(s) = \beta_0(s) + b_i(s) = \beta_0(s) + \sum_{k=1}^{K} \xi_{ik} \phi_k(s)$$
• $Y_i(s) \sim$ exponential family; $g(\cdot)$ is a link function
• $\beta_0(s)$ is a population mean function
• $\phi_k(s)$ are population-level eigenfunctions
• $\xi_{ik}$ are subject-specific scores

Slide 24

Slide 24 text

The NHANES 2011-2014 accelerometer study
• National Health and Nutrition Examination Survey
  • Accelerometer data from 2011-2014 wave released in 2021
  • Accelerometer data over multiple days from > 4000 subjects
  • 1440 minutes per day of PA measurement
• Goal is to understand population patterns in sedentary behavior
• Existing FDA methods cannot handle data of this size
• We proposed a fast, general-purpose algorithm for generalized FPCA

Slide 25

Slide 25 text

Four-step fast GFPCA algorithm
$$g(E[Y_i(s)]) = \eta_i(s) = \beta_0(s) + b_i(s) = \beta_0(s) + \sum_{k=1}^{K} \xi_{ik} \phi_k(s)$$
1. Bin the data along the functional domain $s$ into $L$ bins
2. Estimate separate local GLMMs in each bin to obtain $\eta_i(s_{m_l})$ at each bin midpoint
3. Estimate FPCA on local latent estimates $\eta_i(s_{m_l})$ to obtain eigenfunctions $\boldsymbol{\phi}(s)$
4. Estimate global model conditioning on eigenfunctions $\boldsymbol{\phi}(s)$ by re-estimating subject-specific scores $\xi_{ik}$
A. Leroux, C. Crainiceanu, and J. Wrobel (2023+). Fast generalized functional principal components analysis. Under review.

Slide 26

Slide 26 text

fastGFPCA simulation results
• Compared with two existing methods
  • Variational Bayes binary FPCA (Wrobel, 2019), bfpca
    • Can’t estimate Poisson or other distributions
  • Two-step conditional model (Gertheiss, 2017), tsGFPCA
    • Breaks for N > 100
• fastGFPCA is
  • More accurate than tsGFPCA for binary and Poisson data
    • An order of magnitude faster
  • As accurate as or more accurate than bfpca for binary data
    • Comparable computation time

Slide 27

Slide 27 text

GFPCA results for NHANES data
• 4286 participants with 1440 observations each
• 3-4 hours of computation time (step 4 is the slow step)
• Subsampled version of step 4 led to ~22 minutes of computation time

Slide 28

Slide 28 text

Curve registration for exponential family functional data

Slide 29

Slide 29 text

Misalignment in accelerometer data
• Time variation: subjects start and end the day at different times
• Activity level variation: people have higher or lower levels of activity

Slide 30

Slide 30 text

Misalignment in accelerometer data
• Same subjects, but probabilities of activity are shown below

Slide 31

Slide 31 text

Misalignment in accelerometer data
• Same subjects, but probabilities of activity are shown below

Slide 32

Slide 32 text

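Registration methods align functional data by warping the domain
• Most methods are computationally inefficient and handle only continuous data
[Figure: unregistered curve $\mu_i(t_i^*)$; inverse warping function $h_i^{-1}(t_i^*) = t$; registered curve $\mu_i(h_i^{-1}(t_i^*)) = \mu_i(t)$]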
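A small Python sketch can show what warping the domain means in practice. It assumes a known inverse warping function on [0, 1], here the hypothetical choice `h_inv(t_star) = t_star ** 2`, and a made-up observed curve whose peak occurs late in observed time. Registration methods estimate the warping function from data; this sketch only applies one.

```python
def interp(x, xs, ys):
    """Linearly interpolate the points (xs, ys) at x; xs must be increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for j in range(1, len(xs)):
        if x <= xs[j]:
            w = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
            return (1.0 - w) * ys[j - 1] + w * ys[j]


def h_inv(t_star):
    """Hypothetical inverse warping on [0, 1]: monotone, fixes both endpoints."""
    return t_star ** 2


# Observed (unregistered) time grid and a made-up curve peaking at t* = 0.7.
t_star = [j / 10 for j in range(11)]
mu_obs = [0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.6, 1.0, 0.6, 0.2, 0.0]

# Map each observed time to internal (registered) time.
warped = [h_inv(ts) for ts in t_star]

# Evaluate the registered curve on the regular grid by interpolation.
registered = [interp(t, warped, mu_obs) for t in t_star]
```

In this toy case the peak moves from observed time 0.7 to roughly 0.5 in registered time, while the height of the curve is unchanged: only the time axis is deformed.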

Slide 33

Slide 33 text

Two-step exponential family registration algorithm
• Computationally efficient and geared towards binary data
• Step 1: estimate template
• Step 2: estimate warping
[Figure: unregistered curves $Y_i(t_i^*)$ and registered curves $Y_i(t)$]

Slide 34

Slide 34 text

Algorithm and software optimized for computational efficiency
• Step 1: Estimates template to which curves are registered
  • Uses fast, novel variational EM algorithm for binary functional data
• Step 2: Estimates warping function for each subject
  • Uses constrained maximum likelihood estimation
• Implemented in R package registr
  • Implemented in C++
• Wrobel, Goldsmith (2019). Registration for exponential family functional data. Biometrics.
• Wrobel (2018). registr: Registration for exponential family functional data. Journal of Open Source Software. 3.

Slide 35

Slide 35 text

Activity profiles pre-registration

Slide 36

Slide 36 text

Activity profiles post-registration

Slide 37

Slide 37 text

Future methods work in these areas
• Fast GFPCA
  • Multilevel data (Monday-Sunday)
    • Xinkai Zhou
  • Sparse and irregular data
• Fast generalized function-on-scalar regression
  • Dustin Rogers
• Registration
  • Multilevel registration

Slide 38

Slide 38 text

Acknowledgements
Colorado SPH Biostatistics
• Andrew Leroux
• Dustin Rogers
Columbia Biostatistics Functional Data Analysis Working Group
• Jeff Goldsmith
Johns Hopkins School of Public Health WIT: Wearable and Implantable Technology
• Vadim Zipunnikov
• Jennifer Schrack
• John Muschelli
• Ciprian Crainiceanu
• Xinkai Zhou

Slide 39

Slide 39 text

Thanks!
Contact Info
[email protected]
juliawrobel.com
github.com/julia-wrobel

Slide 40

Slide 40 text

Step 1: bin the data
Choose $L$ bins, where $m_l$ is the midpoint of bin $l \in \{1, \ldots, L\}$
Considerations
• Bin width: for simplicity, bins are equal-width and non-overlapping
• Number of bins

Slide 41

Slide 41 text

Step 1: bin the data
Choose $L$ bins, where $m_l$ is the midpoint of bin $l \in \{1, \ldots, L\}$
Considerations
• Bin width: for simplicity, bins are equal-width and non-overlapping
• Number of bins
  • Too many bins: bin width is too small, leading to identifiability issues

Slide 42

Slide 42 text

Step 1: bin the data
Choose $L$ bins, where $m_l$ is the midpoint of bin $l \in \{1, \ldots, L\}$
Considerations
• Bin width: for simplicity, bins are equal-width and non-overlapping
• Number of bins
  • Too many bins: bin width is too small, leading to identifiability issues
  • Too few bins: bin width is too large to capture the shape of the underlying function

Slide 43

Slide 43 text

Step 1: bin the data
Choose $L$ bins, where $m_l$ is the midpoint of bin $l \in \{1, \ldots, L\}$
Considerations
• Bin width: for simplicity, bins are equal-width and non-overlapping
• Number of bins
  • Too many bins: bin width is too small, leading to identifiability issues
  • Too few bins: bin width is too large to capture the shape of the underlying function
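The binning step can be sketched in Python as follows. The helper `make_bins` is hypothetical and, for simplicity, assumes the number of grid points divides evenly by the number of bins; the choice of 24 hourly bins for a 1440-minute day is only an example, not the setting used in the talk.

```python
def make_bins(n_points, n_bins):
    """Split grid indices 0..n_points-1 into equal-width, non-overlapping bins.

    Returns (bins, midpoints), where bins[l] lists the indices in bin l and
    midpoints[l] is the midpoint of bin l on the index scale.
    Assumes n_points is divisible by n_bins (a simplification).
    """
    if n_points % n_bins != 0:
        raise ValueError("n_points must be divisible by n_bins in this sketch")
    width = n_points // n_bins
    bins = [list(range(l * width, (l + 1) * width)) for l in range(n_bins)]
    midpoints = [(b[0] + b[-1]) / 2 for b in bins]
    return bins, midpoints


# Example: 1440 minutes in a day, grouped into 24 hourly bins.
bins, midpoints = make_bins(1440, 24)
```

Fewer, wider bins give each local model more data per subject but a coarser view of the curve; more, narrower bins do the opposite, which is the trade-off listed above.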

Slide 44

Slide 44 text

Step 2: fit a generalized linear mixed model (GLMM) in each bin
Fit a separate GLMM in each bin to get latent estimates
• $g(E[Y_i(s_{m_l})]) = \beta_0(s_{m_l}) + b_i(s_{m_l}) = \eta_i(s_{m_l})$
  • $s_{m_l}$: time $s$ at the midpoint of bin $l$
  • $\beta_0(s_{m_l})$: fixed effect mean
  • $b_i(s_{m_l})$: subject-specific random effect
  • $\eta_i(s_{m_l})$: linear predictor, local latent estimates
• Estimates are not on the original domain
  • On domain defined by bin midpoints
• Model assumes constant effect for $\beta_0$, $b_i$ across each bin
• Used for estimating covariance matrix and eigenfunctions
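As a worked instance, the local model can be written out for binary data. This assumes a logit link and a normally distributed random intercept with bin-specific variance $\sigma_l^2$; both are common GLMM defaults and the symbol $\sigma_l^2$ is introduced here for illustration, not taken from the slide. For every observation time $s_j$ falling in bin $l$:

```latex
\operatorname{logit} P\{Y_i(s_j) = 1\}
  = \beta_0(s_{m_l}) + b_i(s_{m_l}),
\qquad
b_i(s_{m_l}) \sim N(0, \sigma_l^2),
\qquad
s_j \in \text{bin } l .
```

Under these assumptions each bin is an ordinary random-intercept logistic regression, which is why the local fits are cheap and can be run independently across bins.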

Slide 45

Slide 45 text

Step 3: estimate eigenfunctions using FPCA
Estimate FPCA using linear predictor from Step 2
• $\hat{\eta}_i(s_{m_l}) = \hat{\beta}_0(s_{m_l}) + \sum_{k=1}^{K} \hat{\xi}_{ik} \hat{\phi}_k(s_{m_l})$
• Estimated using refund::fpca.face()
• Eigenfunctions $\hat{\boldsymbol{\phi}}$ characterize covariance
  • $K$: chosen by percent variance explained
  • Evaluated at bin midpoints rather than original domain
  • Project eigenfunctions onto original domain
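To illustrate what the FPCA step extracts, the Python sketch below finds the leading eigenvector of the sample covariance of latent estimates on the bin-midpoint grid using power iteration. This is a deliberately simplified stand-in and not what refund::fpca.face() does: there is no smoothing, only the first component is computed, and quadrature weights are ignored. The function name `first_eigenfunction` and the toy data are hypothetical.

```python
import math


def first_eigenfunction(eta, n_iter=200):
    """Leading principal component of curves on a common grid.

    eta: N x L list of lists (N subjects, L bin midpoints).
    Returns (mean, phi, scores): the mean curve, the unit-norm leading
    eigenvector of the sample covariance, and the per-subject scores.
    """
    n, L = len(eta), len(eta[0])
    mean = [sum(row[l] for row in eta) / n for l in range(L)]
    centered = [[row[l] - mean[l] for l in range(L)] for row in eta]

    # Sample covariance matrix across grid points (L x L).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(L)]
           for a in range(L)]

    # Power iteration from a constant start vector. This fails if the
    # leading eigenvector is orthogonal to the start vector.
    phi = [1.0 / math.sqrt(L)] * L
    for _ in range(n_iter):
        new = [sum(cov[a][b] * phi[b] for b in range(L)) for a in range(L)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0.0:
            raise ValueError("power iteration collapsed; try another start")
        phi = [v / norm for v in new]

    scores = [sum(c[l] * phi[l] for l in range(L)) for c in centered]
    return mean, phi, scores


# Toy latent curves: eta_i = mean + xi_i * phi_true (one component, no noise).
mean_true = [0.0, 1.0, 0.0, -1.0]
phi_true = [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0, 0.0]   # unit norm
xi_true = [-2.0, -1.0, 0.0, 1.0, 2.0]
eta = [[mean_true[l] + xi * phi_true[l] for l in range(4)] for xi in xi_true]

mean_hat, phi_hat, scores_hat = first_eigenfunction(eta)
```

In the noiseless one-component toy example the mean, eigenvector, and scores are recovered up to rounding; with real latent estimates, smoothing and a rule for choosing the number of components would be needed.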

Slide 46

Slide 46 text

Step 4: estimate GFPCA
Estimate GFPCA conditional on eigenfunctions from Step 3
• $g(E[Y_i(s) \mid \hat{\boldsymbol{\phi}}(s)]) = \beta_0(s) + \sum_{k=1}^{K} \xi_{ik} \hat{\phi}_k(s)$
• Eigenfunctions are orthogonal basis functions
  • Reduces number of covariance parameters that need to be estimated for random effects
• Simple implementation
  • mgcv::bam()