Quantifying and Controlling for Sources of Technical Variation and Bias in Longitudinal Microbiome Surveys

TECHNICAL VARIATION AND LONGITUDINAL MICROBIOME DATA
JUSTIN D. SILVERMAN
MEDICAL SCIENTIST TRAINING PROGRAM
COMPUTATIONAL BIOLOGY AND BIOINFORMATICS
DUKE UNIVERSITY
StatsAtHome.com
inschool4life

FRAMING CHALLENGES

FRAMING SEQUENCE COUNT DATA: DATA COLLECTION AND SAMPLE PROCESSING
Adapted from Hamady et al., Nature Methods, 2008
Pipeline: Sample Collection and Storage → DNA Extraction → PCR Amplification → Sequencing
Labels along the pipeline: BIOLOGICAL VARIATION AND SIGNAL; TECHNICAL VARIATION AND BIAS; RANDOM SAMPLING

FRAMING SEQUENCE COUNT DATA: SAMPLE POOLING
Performed after PCR amplification, before sequencing.
[Diagram: barcoded DNA after PCR from Sample 1, Sample 2, and Sample 3 goes through DNA Quantification, Subsampling, and Pooling.]

FRAMING SEQUENCE COUNT DATA: IMPACT OF MULTIVARIATE RANDOM SAMPLING
[Figure: Counts against Time for Group 1, Group 2, and Group 3. Panel A: True Abundance. Panel B: Abundance after Random Sampling.]
NO PROPORTIONS, NOT CLASSICALLY COMPOSITIONAL, BUT SAMPLING CAUSES COMPOSITIONAL-LIKE EFFECTS
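The compositional-like effect of random sampling can be sketched with a small simulation. This is a minimal illustration, not the data behind the figure: the three abundance trajectories, the depth of 500 reads per time point, and the names `true_abundance` and `sample_counts` are all assumptions made for this example.

```python
import numpy as np

def sample_counts(true_abundance, depth, rng):
    """Draw `depth` reads per time point, multinomially, from the true abundances."""
    proportions = true_abundance / true_abundance.sum(axis=1, keepdims=True)
    return np.vstack([rng.multinomial(depth, p) for p in proportions])

rng = np.random.default_rng(0)
days = np.arange(1, 21)

# Hypothetical true abundances: Group 1 grows, Groups 2 and 3 stay constant.
true_abundance = np.column_stack([
    250.0 * days,                 # Group 1: increasing
    np.full(days.size, 400.0),    # Group 2: constant
    np.full(days.size, 150.0),    # Group 3: constant
])

counts = sample_counts(true_abundance, depth=500, rng=rng)

# Group 2 is constant in truth, yet its sampled counts fall as Group 1 grows,
# because every time point is forced to the same total number of reads.
print("Group 2 true (first, last):   ", true_abundance[[0, -1], 1])
print("Group 2 sampled (first, last):", counts[[0, -1], 1])
```

Nothing here was converted to proportions, but fixing the total number of reads is enough to make a constant group appear to decline.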

GENERAL MODELING APPROACH

CENTRAL THEME: MULTINOMIAL LOGISTIC-NORMAL
• Handles sampling zeros and ≈ biological zeros
• Allows positive and negative covariation between taxa
• Models multiplicative errors
ILR = "Isometric Log-Ratio" transform
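A draw from a multinomial logistic-normal model can be sketched as follows. This is a minimal example under stated assumptions: the Helmert-style basis in `ilr_basis` is one of many valid ILR bases, and the dimension, mean, covariance, and read depth are arbitrary illustrative values.

```python
import numpy as np

def ilr_basis(d):
    """Orthonormal basis (d-1 rows, d columns) of the zero-sum subspace."""
    basis = np.zeros((d - 1, d))
    for i in range(1, d):
        basis[i - 1, :i] = 1.0 / i
        basis[i - 1, i] = -1.0
        basis[i - 1] *= np.sqrt(i / (i + 1.0))
    return basis

def ilr(p, basis):
    """Map proportions p (summing to 1) to d-1 unconstrained coordinates."""
    log_p = np.log(p)
    clr = log_p - log_p.mean(axis=-1, keepdims=True)
    return clr @ basis.T

def ilr_inv(eta, basis):
    """Map ILR coordinates back to proportions."""
    clr = eta @ basis
    e = np.exp(clr - clr.max(axis=-1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d = 4                                       # number of taxa (illustrative)
basis = ilr_basis(d)

mu = np.zeros(d - 1)                        # assumed mean in ILR space
sigma = 0.5 * np.eye(d - 1)                 # assumed covariance in ILR space
eta = rng.multivariate_normal(mu, sigma)    # logistic-normal layer
pi = ilr_inv(eta, basis)                    # proportions on the simplex
y = rng.multinomial(1000, pi)               # multinomial counting layer
print(pi.round(3), y)
```

Because the normal layer lives in unconstrained ILR coordinates, an ordinary covariance matrix there can express both positive and negative covariation between taxa, and the multinomial layer can produce zero counts for taxa whose proportion is small but nonzero.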

TIME-SERIES MODELING

BUILDING A FRAMEWORK: MODELING TIME-EVOLUTION (Silverman et al., bioRxiv 2018)
Model layers: True State with Biological Noise; Addition of Technical Noise; Observed Counts; Priors
[Diagram, built up layer by layer:
• True State with Biological Noise: θ0 → θ1 → θ2 → ... → θT, with W1, W2, ..., WT
• Addition of Technical Noise: η1, η2, ..., ηT, with V1, V2, ..., VT
• Observed Counts: Y1, Y2, ..., YT, linked to η1, η2, ..., ηT through the ILR transform]

BUILDING A FRAMEWORK: MODELING TIME-EVOLUTION
Define:
• Composition with Technical Variation
• System State
• Covariance of Technical Variation
• Covariance of Temporal Evolution ("Biological Variation")
• Observed Counts and Covariates
Inference Goal: the posterior over the first four quantities, given the observed counts and covariates.
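One common way to write a state-space model with this structure is sketched below. It is a hedged reconstruction from the labels in the diagram, not a transcription of the slide: the covariate matrices F_t, the state-transition matrices G_t, the read depths n_t, and the initial moments m_0 and C_0 are symbols assumed here for illustration.

```latex
\begin{align*}
Y_t &\sim \mathrm{Multinomial}(n_t, \pi_t), & \pi_t &= \mathrm{ILR}^{-1}(\eta_t)
  && \text{(observed counts)} \\
\eta_t &= F_t^{\top}\theta_t + v_t, & v_t &\sim \mathcal{N}(0, V_t)
  && \text{(addition of technical noise)} \\
\theta_t &= G_t\,\theta_{t-1} + w_t, & w_t &\sim \mathcal{N}(0, W_t)
  && \text{(true state with biological noise)} \\
\theta_0 &\sim \mathcal{N}(m_0, C_0), & &\; p(W, V)
  && \text{(priors)}
\end{align*}
```

Read against the list of definitions, η_t is the composition with technical variation, θ_t is the system state, V_t is the covariance of technical variation, and W_t is the covariance of temporal evolution.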

BUILDING A FRAMEWORK: COLLAPSED SAMPLING USING THE KALMAN FILTER
Goal: split the posterior into two terms, using Θ ⊥ Y | H.
Term 1: can be sampled from directly using the Backwards Sampling algorithm (aka Kalman Smoother).
Term 2: using 1st-order Markov structure, combines
• Arbitrary prior
• 1-step-ahead predictive densities, calculable by the Kalman filter
• Product of multinomial densities
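The two terms can be written out as follows. This is a sketch consistent with the slide's labels rather than a transcription of its equations: H denotes the collection η_1, ..., η_T, Θ the collection θ_0, ..., θ_T, and W and V stand for all the temporal and technical covariances; the exact conditioning sets are assumptions.

```latex
\begin{align*}
p(\Theta, H, W, V \mid Y)
  &= \underbrace{p(\Theta \mid H, W, V)}_{\text{Term 1}}
     \;\underbrace{p(H, W, V \mid Y)}_{\text{Term 2}}
  && \text{since } \Theta \perp Y \mid H \\[4pt]
p(H, W, V \mid Y)
  &\propto \underbrace{\prod_{t=1}^{T} p(Y_t \mid \eta_t)}_{\text{multinomial densities}}
     \;\underbrace{\prod_{t=1}^{T} p(\eta_t \mid \eta_{1:t-1}, W, V)}_{\text{1-step-ahead predictive densities}}
     \;\underbrace{p(W, V)}_{\text{arbitrary prior}}
\end{align*}
```

Each predictive density p(η_t | η_{1:t-1}, W, V) is Gaussian with mean and covariance given by the Kalman filter recursions, which is how Θ is integrated out of Term 2 in closed form.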

Sampling Algorithm:
Step 1: Sample the quantities other than Θ using adaptive HMCMC, marginalizing over Θ using the Kalman Filter
Step 2: Sample Θ using Backwards sampling (aka Kalman Smoother)
Summary:
• By inverting the "classic" Metropolis-within-Gibbs sampler we can take advantage of adaptive Hamiltonian MCMC
• Θ (typically very high dimensional) can be removed from HMCMC and instead sampled directly using the recurrence relationships of the Kalman Smoother (very fast)
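The two ingredients of this scheme can be sketched for the simplest possible case. This is a minimal illustration, not the MALLARD implementation: it assumes a scalar local-level model (η_t = θ_t + v_t, θ_t = θ_{t-1} + w_t) with fixed variances `V` and `W`, and it treats η as given, whereas the full sampler would draw η, W, and V by HMC using the log marginal likelihood computed here. The function names are hypothetical.

```python
import numpy as np

def kalman_filter(eta, V, W, m0, C0):
    """Filter a local-level DLM and return (m, C, loglik).

    m, C: filtered means and variances of theta_t given eta_{1:t}.
    loglik: log p(eta | V, W), the sum of 1-step-ahead predictive log densities,
    i.e. the quantity with theta marginalized out.
    """
    T = len(eta)
    m, C = np.empty(T), np.empty(T)
    loglik = 0.0
    m_prev, C_prev = m0, C0
    for t in range(T):
        a, R = m_prev, C_prev + W      # prior for theta_t given eta_{1:t-1}
        f, Q = a, R + V                # 1-step-ahead predictive for eta_t
        loglik += -0.5 * (np.log(2 * np.pi * Q) + (eta[t] - f) ** 2 / Q)
        K = R / Q                      # Kalman gain
        m[t] = a + K * (eta[t] - f)
        C[t] = R - K * R
        m_prev, C_prev = m[t], C[t]
    return m, C, loglik

def backward_sample(m, C, W, rng):
    """Draw theta_{1:T} jointly from p(theta | eta, V, W) by backward sampling."""
    T = len(m)
    theta = np.empty(T)
    theta[-1] = rng.normal(m[-1], np.sqrt(C[-1]))
    for t in range(T - 2, -1, -1):
        R_next = C[t] + W              # variance of theta_{t+1} given eta_{1:t}
        B = C[t] / R_next
        mean = m[t] + B * (theta[t + 1] - m[t])
        var = C[t] - B * C[t]
        theta[t] = rng.normal(mean, np.sqrt(var))
    return theta

# Simulated example with assumed values.
rng = np.random.default_rng(2)
T, V, W = 50, 0.2, 0.05
theta_true = np.cumsum(rng.normal(0.0, np.sqrt(W), T))
eta = theta_true + rng.normal(0.0, np.sqrt(V), T)

m, C, loglik = kalman_filter(eta, V, W, m0=0.0, C0=1.0)   # used in Step 1
theta_draw = backward_sample(m, C, W, rng)                # Step 2
```

The forward pass costs one recursion per time point and never touches θ directly, so the HMC step only has to explore the lower-dimensional remainder; θ is then filled in by a single backward pass.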

BUILDING A FRAMEWORK: MALLARD
Multinomial Logistic-Normal Dynamic Linear Models
(image: www.audubon.org)

PARTITIONING BIOLOGICAL AND TECHNICAL VARIATION

SIMULATED AND REAL DATA: AN EXAMPLE STUDY DESIGN (Silverman et al., bioRxiv 2018)
Model labels: STANDARD LONGITUDINAL MODEL; CONDITION TO HANDLE REPLICATES
Sampling design (4x):
• 28 daily samples
• 120 hourly samples
• 20 replicate samples
Rationale:
• Mixed frequency to address potential signal aliasing
• Replicate samples to identify and partition technical vs. biological variation
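Why replicates make the partition identifiable can be sketched with a simple moment calculation. This is not the inference used in the study: it is a swapped-in method-of-moments illustration on a simulated scalar random walk, with many more time points than the real design, and the values of `W_true`, `V_true`, and the number of replicates are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n_rep = 5000, 2                 # illustrative sizes, not the study design
W_true, V_true = 0.1, 0.3          # assumed biological and technical variances

# Biological signal: random walk with step variance W_true.
theta = np.cumsum(rng.normal(0.0, np.sqrt(W_true), T))
# Each time point is measured n_rep times with technical variance V_true.
eta = theta[:, None] + rng.normal(0.0, np.sqrt(V_true), (T, n_rep))

# Replicates share the same theta_t, so their spread reflects only technical noise.
V_hat = eta.var(axis=1, ddof=1).mean()

# Without replicates, successive differences mix both sources:
# Var(eta_t - eta_{t-1}) = W + 2V, so W is only recoverable once V is known.
W_hat = np.diff(eta[:, 0]).var(ddof=1) - 2.0 * V_hat

print(f"V_hat = {V_hat:.3f}, W_hat = {W_hat:.3f}")
```

A single unreplicated series only constrains the sum W + 2V; the replicate samples are what pin down V and therefore W.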

SIMULATED DATA (Silverman et al., bioRxiv 2018; Silverman et al., eLife 2017)

SIMULATED DATA Silverman et al., bioRxiv 2018

REAL DATA Silverman et al., bioRxiv 2018

REAL DATA (Silverman et al., bioRxiv 2018)
[Figure: Total Variation (p50) against Sampling Interval (Hours), split into Biological and Technical components (W+V); Lachnospiraceae labeled.]

OTHER (ACADEMIC) THINGS I LIKE TALKING ABOUT: AN APPEAL TO THE COMMUNITY
We need more studies that quantify technical variation and bias in a manner that can be used, modeled, and corrected.
(Almost) all interesting measurements have error.

ACKNOWLEDGEMENTS
Duke University: Lawrence David, Sayan Mukherjee, Rachael Bloom, Heather Durand
University de Girona: Juan José Egozcue, Vera Pawlowsky-Glahn
Wife and Collaborator: Rachel Silverman
Funding: Duke Collaborative Quantitative Approaches to Problems in the Basic and Clinical Sciences; Duke MSTP; NIH T32
xkcd.com
StatsAtHome.com
inschool4life

OTHER (ACADEMIC) THINGS I LIKE TALKING ABOUT: SELECTED PROJECTS
• Faster inference for MALLARD models
• Quantifying and removing batch variation and bias using calibration curves and cross-batch standard samples
• There are different types of zero values in sequence count data
• Total Relative Augmentation Models (TRAMs) to introduce uncertainty in "total" measurements into compositional models

BUILDING A FRAMEWORK: A NOTE ON PRIOR CHOICE
All Log-Ratio Transforms:
Not so realistic prior:
More realistic prior:

REAL DATA: THINKING IN TERMS OF VARIATION (Silverman et al., bioRxiv 2018)
[Figure, panels A to D. Grid of median ρ values with 95% credible intervals across bacterial families (Rikenellaceae, Synergistaceae, Fusobacteriaceae, Enterobacteriaceae, Lachnospiraceae, Bacteroidaceae, Ruminococcaceae, Porphyromonadaceae, Acidaminococcaceae, Desulfovibrionaceae), colored by p50; legend example for Lachnospiraceae and Bacteroidaceae: 0.0032 (0.0021 to 0.0048). Balance Value (e.i.) from Day 02 to Day 30 for Vessels 1 to 4, with posterior 95% credible intervals, for Fusobacteriaceae and Synergistaceae, Enterobacteriaceae, and Rikenellaceae; the feed disruption of Vessels 1 and 2 is marked.]

REAL DATA: THERE ARE SUB-DAILY DYNAMICS (Silverman et al., bioRxiv 2018)
[Figure. Panel A: Balance Value (e.i.) from Day 21 to Day 25 for Vessels 1 to 4. Panel B: the balance, with minus and plus sides, among Bacteroidetes, Proteobacteria, and Fusobacteria.]

WRAPPING UP: CURRENT LIMITATIONS OF MALLARD
• Computational Cost
▸ Depends on the assumptions users are willing to make.
▸ Without simplifying assumptions: 10 bacterial families, 700 samples, ~ 2 hours
▸ With simplifying assumptions: 50 families, 700 samples, ~ 10 minutes
▸ With approximate inference: 100 families, 700 samples, ~ 1 minute
• User Interface
