psb_vandekar

Julia Wrobel
December 24, 2023

Transcript

  1. Batch correction
    • Batch correction makes the same tissue appear similar across slides
    • Includes “normalization” and “harmonization”
    • Challenging because slides differ in their composition
      • Due to differences in tissue composition and type
    [Figure: Unnormalized vs. Normalized panels; slides MAP06025, MAP00083, MAP03361; labels: Bottom of crypt, Interior of crypt, Stroma, Top of crypt]
  2. Review of batch correction methods
    • Transformation: transforms data to make it more amenable to downstream analysis
    • Normalization: adjusts the data distribution to make it more similar across batches
    • Harmonization
      • Adjusts the data distribution, conserving covariate effects, to make it more similar across batches
      • Uses information across slides
  3. Common transformations
    • Need transformations that respect the range of the data (non-negative or strictly positive)
    • Ideally, yields reasonable visual separation of expressed and unexpressed cells

    Type  | Formula                                   | c | Range
    log   | $y_{ij} = \log_{10}(x_{ij} + c)$          | 1 | [0, ∞)
    tanh  | $y_{ij} = \tanh(x_{ij}/c)$                | 2 | (−∞, ∞)
    asinh | $y_{ij} = \operatorname{asinh}(x_{ij}/c)$ | 5 | (−∞, ∞)

    For slide i and cell/pixel j.
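
The three transformations in the table can be sketched in a few lines of Python. This is an illustration only, using the default values of c from the table; the function names and the sample intensities are hypothetical and not taken from any package mentioned in the talk.

```python
import math

def log10_transform(x, c=1.0):
    """y = log10(x + c); with c = 1 a zero intensity maps to zero."""
    return math.log10(x + c)

def tanh_transform(x, c=2.0):
    """y = tanh(x / c); compresses large intensities toward 1."""
    return math.tanh(x / c)

def asinh_transform(x, c=5.0):
    """y = asinh(x / c); roughly linear near zero, logarithmic for large x."""
    return math.asinh(x / c)

# Hypothetical non-negative marker intensities for one slide.
intensities = [0.0, 0.5, 2.0, 10.0, 100.0]

for x in intensities:
    print(
        x,
        round(log10_transform(x), 3),
        round(tanh_transform(x), 3),
        round(asinh_transform(x), 3),
    )
```

All three map zero to zero and are monotone increasing, so the ordering of cells by intensity is preserved.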
  4. Common normalizations

    Type            | Formula                                                                    | Parameters                                                                           | Range
    Mean division   | $y_{ij} = \log_{10}(x_{ij}/\mu_i + c)$; $y_{ij} = \tanh(x_{ij}/(c\mu_i))$  | c = 1, 2; $\mu_i$ = mean                                                             | [0, ∞)
    Min/max scaling | $y_{ij} = (x_{ij} - \min_j x_{ij}) / (\max_j x_{ij} - \min_j x_{ij})$      |                                                                                      | [0, ∞)
    Quantile        | $y_{ij} = x_{ij} \times q(x_{\cdot\cdot}) / q(x_{i\cdot})$                 | $q(x_{\cdot\cdot})$: quantile over all images; $q(x_{i\cdot})$: quantile of image i  | (−∞, ∞)

    For slide i and cell/pixel j.
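
A minimal sketch of the three normalizations, assuming each slide is a plain list of non-negative intensities. The function names and the toy data are hypothetical, and the quantile level is not stated on the slide, so the median is used here purely as an assumption.

```python
import math
from statistics import mean, median

def mean_division_log10(slides, c=1.0):
    """y_ij = log10(x_ij / mu_i + c), with mu_i the mean of slide i."""
    out = {}
    for slide_id, xs in slides.items():
        mu = mean(xs)
        out[slide_id] = [math.log10(x / mu + c) for x in xs]
    return out

def min_max(slides):
    """y_ij = (x_ij - min_j x_ij) / (max_j x_ij - min_j x_ij), per slide."""
    out = {}
    for slide_id, xs in slides.items():
        lo, hi = min(xs), max(xs)
        out[slide_id] = [(x - lo) / (hi - lo) for x in xs]
    return out

def quantile_scale(slides, q=median):
    """y_ij = x_ij * q(all slides) / q(slide i); q defaults to the median (assumption)."""
    pooled = [x for xs in slides.values() for x in xs]
    q_all = q(pooled)
    return {sid: [x * q_all / q(xs) for x in xs] for sid, xs in slides.items()}

# Toy data: slide_b is slide_a with every intensity doubled (a pure scale batch effect).
slides = {
    "slide_a": [1.0, 2.0, 3.0, 6.0],
    "slide_b": [2.0, 4.0, 6.0, 12.0],
}

print(mean_division_log10(slides))
print(min_max(slides))
print(quantile_scale(slides))
```

In this toy case the batch effect is a pure rescaling, so all three methods make the two slides agree; real slide effects are rarely this simple.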
  5. Comments about normalization
    • Mean division assumes non-negative data (based on Poisson noise)
    • Min-max scaling is bad
      • Pointless if images are bounded by the precision of the image (e.g., 8-bit precision)
      • Maximum is usually a noisy parameter and technically unbounded, so the normalization is highly volatile
    • Normalizations that result in normally distributed data do not get good separation of expressed and unexpressed cells (anecdotally)
    • Z-normalization does not respect boundedness of the data
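
Two of these comments are easy to see on toy numbers. The sketch below uses made-up intensities and hypothetical helper names: it shows how a single extreme value changes the min-max score of every other cell, and how z-scoring non-negative data produces negative values.

```python
from statistics import mean, pstdev

def min_max(xs):
    """Rescale one slide to [0, 1] using its own minimum and maximum."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Center and scale one slide by its mean and (population) standard deviation."""
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd for x in xs]

clean = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical intensities
with_outlier = clean + [100.0]      # same cells plus one extreme pixel

# The cell with raw intensity 2.0 gets a very different score once the maximum moves.
print(min_max(clean)[2])
print(min_max(with_outlier)[2])

# Non-negative input, but the z-scores fall below zero.
print(min(z_score(clean)))
```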
  6. Common harmonization
    • None have been developed for mIF data
    • Closest option is ComBat-seq for negative binomial (count) data (Zhang et al., NAR Gen and Bioinf, 2020)
  7. Investigation of normalization in mIF
    • We explored combinations of batch correction and transformation
    (Johnson et al., Biostatistics, 2007; Fortin et al., Neuroimage, 2017; Wrobel et al., Neuroimage, 2020; Harris et al., Bioinformatics, 2022)
    [Figure: harmonization method by transformation/normalization]
  8. New batch correction evaluation framework
    • Most spatial-omics data lack ground truth for evaluating slide effects
    • Cole Harris established a framework to evaluate batch correction
      1. Alignment of marker densities
      2. Cell phenotyping discordance
      3. Proportion of variance due to slide
    [Figure: alignment of marker densities]
  9. Cell phenotyping discordance
    Logic of the procedure:
    • There are no batch effects within a slide, so phenotypes assigned within a slide are unaffected by batch
    • Apply phenotyping within slide and across slides for each batch correction method and measure discordance

                          | Across-slide negative | Across-slide positive
    Within-slide negative | (−, −)                | (−, +)
    Within-slide positive | (+, −)                | (+, +)
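
The 2×2 table can be tallied directly from two label vectors. In the sketch below, discordance is taken to be the fraction of cells in the off-diagonal cells, (−, +) and (+, −); the slide does not spell out the exact score, so that definition, the function names, and the example labels are all assumptions.

```python
from collections import Counter

def discordance_table(within, across):
    """Count cells by (within-slide label, across-slide label); True means marker positive."""
    if len(within) != len(across):
        raise ValueError("label vectors must have the same length")
    counts = Counter(zip(within, across))
    return {
        (w, a): counts.get((w, a), 0)
        for w in (False, True)
        for a in (False, True)
    }

def discordance(within, across):
    """Fraction of cells whose within-slide and across-slide labels disagree."""
    table = discordance_table(within, across)
    n = sum(table.values())
    return (table[(False, True)] + table[(True, False)]) / n

# Hypothetical labels for five cells on one slide.
within_slide = [True, True, False, False, True]
across_slide = [True, False, False, True, True]

print(discordance_table(within_slide, across_slide))
print(discordance(within_slide, across_slide))
```

A batch correction method that fully removed slide effects would make the across-slide labels match the within-slide labels, giving a discordance of zero.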
  10. Proportion of variance due to slide
    • Random effects model quantifying slide effect: $y_{ij} = \beta + \gamma_i + \epsilon_{ij}$
    • Intraclass correlation (ICC) is the proportion of variance due to slide: $\mathrm{ICC} = \sigma_\gamma^2 / (\sigma_\gamma^2 + \sigma_\epsilon^2)$
    • Lower is better
    [Figure: proportion of variance due to slide]
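
One simple way to estimate the ICC is the one-way ANOVA method of moments for balanced data, sketched below. The slide does not say how the random effects model was fit, so this estimator, the function name, and the toy data are assumptions for illustration rather than the procedure used in the talk.

```python
from statistics import mean

def icc_balanced(groups):
    """Method-of-moments ICC for y_ij = beta + gamma_i + eps_ij.

    groups: one list of values per slide, all of the same length n >= 2.
    """
    k = len(groups)
    n = len(groups[0])
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need at least 2 slides with equal sizes of at least 2")

    grand = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]

    # Between-slide and within-slide mean squares.
    msb = n * sum((m - grand) ** 2 for m in group_means) / (k - 1)
    msw = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (k * (n - 1))

    var_slide = max((msb - msw) / n, 0.0)  # sigma_gamma^2, truncated at zero
    var_resid = msw                        # sigma_epsilon^2
    total = var_slide + var_resid
    return var_slide / total if total > 0 else 0.0

# Hypothetical data: two slides with a large offset, then two slides with none.
print(icc_balanced([[0.0, 1.0], [10.0, 11.0]]))
print(icc_balanced([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]))
```

The first call returns a value close to 1 (almost all variance is between slides); the second returns 0 because the slide means coincide.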
  11. mxnorm: an R package for normalization of spatial-omics data
    • Users can easily evaluate normalization procedures in their own data
    • Default normalization options (from Harris et al. 2022) or user specified
    • mxnorm can be used to evaluate new normalization methods in future papers
    • Available on CRAN
    (Harris, Wrobel, Vandekar, JOSS, 2022)
  12. Summary of batch correction methods
    • Transformations and normalization are widely used and adapted from other fields
    • Harmonization-based batch correction methods have not been well developed or evaluated
    • It is likely that normalization is insufficient
    • Room for new methods development
  13. Cell phenotyping: defining cell types
    • Marker gating: cell types defined by individual markers
      • Subjective
    [Figure: CD8+ cells in two panels, P(CD8>0)=0.04 and P(CD8>0)=0.4]
  14. Cell phenotyping: defining cell types
    • Marker gating: cell types defined by individual markers
      • Subjective
    • Clustering: cell types defined via clustering
      • Manually intensive
    • Recently developed semiautomated methods
      • ASTIR (Geuenich et al., Cell Systems, 2022)
      • HALO Inform phenotyping
    [Figure: Leiden clustering, UMAP 1 vs. UMAP 2]
  15. Semiautomated methods overview
    • Inform: proprietary, semiautomated (requires manually annotated training data)
    • ASTIR: unsupervised clustering, but takes a yaml file defining cell types
      • A mixture model assuming transformed data are log-normal
      • Uses variational Bayes for estimation with neural networks as variational distribution
    • GammaGateR: semi-supervised marker gating
  16. GammaGateR: semiautomated marker gating in R
    • GammaGateR fits gamma mixture models separately to each channel and slide to perform automated marker gating
    • It takes user-specified constraints to obtain a consistent model fit across slides
    • Returns the probability of being marker positive for each channel
  17. Advantages of GammaGateR
    • Limitations of existing methods
      • Designed for less noisy data
      • Don’t consider underlying distribution
    • Benefits of Gamma mixture model
      • Can accommodate skewed distributions
      • Incorporates zero-inflation
      • Model incorporates biological information as a prior
    [Figure: CD3D cell expression]
  18. Advantages of GammaGateR
    • Limitations of existing methods
      • Designed for less noisy data
      • Don’t consider underlying distribution
    • Benefits of Gamma mixture model
      • Can accommodate skewed distributions
      • Incorporates zero-inflation
      • Model incorporates biological information as a prior
    [Figure: NAKATPase cell expression]
  19. GammaGateR performs competitively
    [Figure: GammaGateR Posterior and GammaGateR Marginal; panels: biological prediction error (C-index) for ovarian cancer stage, and comparison to silver-standard manual labels]
  20. GammaGateR model fit
    • Component zero is proportion of zeros
    • Component 1 is unexpressed cells
  21. GammaGateR model fit
    • Component zero is proportion of zeros
    • Component 1 is unexpressed cells
    • Component 2 is expressed cells
  22. GammaGateR model fit
    • Component zero is proportion of zeros
    • Component 1 is unexpressed cells
    • Component 2 is expressed cells
    • Mode is location of peak
  23. GammaGateR model fit
    • Component zero is proportion of zeros
    • Component 1 is unexpressed cells
    • Component 2 is expressed cells
    • Mode is location of peak
    • Lambda is proportion in each cell type
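
The pieces named on these slides (a zero component, two gamma components, their modes, and the mixing proportions lambda) fit together as in the sketch below. It is not the GammaGateR API: the function names, the shape/scale parameterization, and every parameter value are hypothetical, and assigning exact zeros a posterior of 0 is an assumption.

```python
import math

def gamma_pdf(x, shape, scale):
    """Gamma density at x > 0, computed on the log scale for stability."""
    if x <= 0:
        return 0.0
    log_pdf = (
        (shape - 1) * math.log(x)
        - x / scale
        - math.lgamma(shape)
        - shape * math.log(scale)
    )
    return math.exp(log_pdf)

def gamma_mode(shape, scale):
    """Location of the peak of a gamma density (zero when shape <= 1)."""
    return (shape - 1) * scale if shape > 1 else 0.0

def posterior_expressed(x, lam, params):
    """P(component 2 | x) for a zero-inflated two-component gamma mixture.

    lam    = (lambda_0, lambda_1, lambda_2): zeros, unexpressed, expressed
    params = ((shape_1, scale_1), (shape_2, scale_2))
    """
    if x == 0:
        return 0.0  # assumption: exact zeros belong to component zero
    f1 = lam[1] * gamma_pdf(x, *params[0])
    f2 = lam[2] * gamma_pdf(x, *params[1])
    return f2 / (f1 + f2)

# Hypothetical fitted values for one marker channel on one slide.
lam = (0.1, 0.7, 0.2)
params = ((2.0, 0.1), (8.0, 0.1))  # modes near 0.1 (unexpressed) and 0.7 (expressed)

print(gamma_mode(*params[0]), gamma_mode(*params[1]))
for x in (0.0, 0.1, 0.4, 1.0):
    print(x, round(posterior_expressed(x, lam, params), 4))
```

Cells near the unexpressed mode get a posterior close to 0 and cells well above the expressed mode get a posterior close to 1.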
  24. GammaGateR constraints
    • User-specified constraints improve model fit across slides
    [Figure: boundary for expressed cell mode; boundary for unexpressed cell mode]
  25. GammaGateR in subsequent analyses
    • GammaGateR yields posterior probabilities of cells being in the “expressed” population
    • These can be used to compute cell proportions, thresholded to define cell types, or used directly in downstream analysis
    • QR code links to tutorial using Julia’s ovarian cancer dataset
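
As a rough illustration of the first two uses, the sketch below takes a list of posterior probabilities and computes an expected proportion of expressed cells and thresholded labels. The probabilities are made up, the function names are hypothetical, and the 0.5 cutoff is an assumption rather than a recommendation from the talk.

```python
def expected_proportion(posteriors):
    """Average posterior probability: an estimate of the expressed-cell proportion."""
    return sum(posteriors) / len(posteriors)

def threshold_labels(posteriors, cutoff=0.5):
    """Hard marker-positive calls; the cutoff of 0.5 is an assumed default."""
    return [p > cutoff for p in posteriors]

# Hypothetical posterior probabilities for five cells and one marker.
posteriors = [0.95, 0.10, 0.60, 0.02, 0.80]

labels = threshold_labels(posteriors)
print(expected_proportion(posteriors))  # soft estimate
print(labels)                           # hard calls
print(sum(labels) / len(labels))        # proportion after thresholding
```

The soft and thresholded proportions generally differ, because thresholding discards how confident each call is.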
  26. Thank you
    • Colleagues: Robert Coffey, Ken Lau, Martha Shrubsole, Eliot McKinley, Joe Roland, Qi Liu, Coleman Harris, Ruby Xiong
    • Funding: Vanderbilt Ingram Cancer Center GI SPORE: P50CA236733, NLM 5T32LM012412-05, R01MH123563
    Simon Vandekar