psb_vandekar

Normalization and Cell Phenotyping for mIF Data
Simon Vandekar

Batch correction: Normalization and Harmonization

Batch correction
• Batch correction makes the same tissue appear similar across slides
• Includes “normalization” and “harmonization”
• Challenging because slides differ in their composition
  • Due to differences in tissue composition and type
[Figure: unnormalized vs. normalized images for slides MAP06025, MAP00083, and MAP03361; tissue classes: bottom of crypt, interior of crypt, stroma, top of crypt]

Systematic differences in image intensities
• Image/cell histograms reveal differences in image intensities and cell populations

Review of batch correction methods
• Transformation – transforms data to make it more amenable to downstream analysis
• Normalization – adjusts the data distribution to make it more similar across batches
• Harmonization – adjusts the data distribution to make it more similar across batches while conserving covariate effects
  • Uses information across slides

Common transformations
• Need transformations that respect the range of the data (non-negative or strictly positive)
• Ideally, yields reasonable visual separation of expressed and unexpressed cells

Type  | Formula                                   | c | Range
log   | $y_{ij} = \log_{10}(x_{ij} + c)$          | 1 | [0, ∞)
tanh  | $y_{ij} = \tanh(x_{ij}/c)$                | 2 | (−∞, ∞)
asinh | $y_{ij} = \operatorname{asinh}(x_{ij}/c)$ | 5 | (−∞, ∞)

For slide i and cell/pixel j.
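
The three transformations in the table can be sketched in a few lines of Python. This is only an illustration: the function names and the sample intensities are made up, and the defaults for c are the values listed in the table.

```python
import math

def log_transform(x, c=1.0):
    """y = log10(x + c); defined for x >= 0 when c > 0."""
    return math.log10(x + c)

def tanh_transform(x, c=2.0):
    """y = tanh(x / c)."""
    return math.tanh(x / c)

def asinh_transform(x, c=5.0):
    """y = asinh(x / c)."""
    return math.asinh(x / c)

# Made-up, non-negative intensities for illustration only.
intensities = [0.0, 0.5, 2.0, 10.0, 50.0]
for x in intensities:
    print(x,
          round(log_transform(x), 3),
          round(tanh_transform(x), 3),
          round(asinh_transform(x), 3))
```

All three map an intensity of 0 to 0 and are increasing in x, so the ordering of cells by intensity is preserved.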

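Common normalizations

Type            | Formula                                                                          | Parameters                                                                           | Range
Mean division   | $y_{ij} = \log_{10}(x_{ij}/\mu_i + c)$ or $y_{ij} = \tanh(x_{ij}/(c\,\mu_i))$    | c = 1, 2; $\mu_i$ = slide mean                                                       | [0, ∞)
Min/max scaling | $y_{ij} = \dfrac{x_{ij} - \min_j x_{ij}}{\max_j x_{ij} - \min_j x_{ij}}$         |                                                                                      | [0, ∞)
Quantile        | $y_{ij} = x_{ij} \times q(x_{\cdot\cdot}) / q(x_{i\cdot})$                       | $q(x_{\cdot\cdot})$: quantile over all images; $q(x_{i\cdot})$: quantile of image i  | (−∞, ∞)

For slide i and cell/pixel j.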
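
A minimal sketch of the three normalizations, assuming each slide is a plain list of intensities. The helper names, the choice of the median as the quantile, and the two toy slides are assumptions for illustration; they are not taken from the talk or from any package.

```python
import math
import statistics

def quantile(values, prob):
    """Linear-interpolation quantile of a non-empty list, 0 <= prob <= 1."""
    ordered = sorted(values)
    pos = prob * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def mean_division_log(values, c=1.0):
    """Divide by the slide mean, then apply log10(. + c)."""
    mu = statistics.fmean(values)
    return [math.log10(x / mu + c) for x in values]

def min_max_scale(values):
    """Rescale one slide by its own minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def quantile_scale(values, pooled, prob=0.5):
    """Multiply by (quantile of all images) / (quantile of this image)."""
    factor = quantile(pooled, prob) / quantile(values, prob)
    return [x * factor for x in values]

# Toy data: slide_b is slide_a with every intensity doubled.
slide_a = [1.0, 2.0, 3.0, 4.0]
slide_b = [2.0, 4.0, 6.0, 8.0]
pooled = slide_a + slide_b

print(mean_division_log(slide_a), mean_division_log(slide_b))
print(quantile_scale(slide_a, pooled), quantile_scale(slide_b, pooled))
print(min_max_scale(slide_a))
```

In this toy case the slide effect is a pure multiplicative factor, so mean division and quantile scaling both bring the two slides onto the same values.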

Comments about normalization
• Mean division assumes non-negative data – based on Poisson noise
• Min-max scaling performs poorly
  • Pointless if images are bounded by the precision of the image (e.g., 8-bit precision)
  • The maximum is usually a noisy parameter and technically unbounded – a highly volatile normalization
• Normalizations that result in normally distributed data do not get good separation of expressed and unexpressed cells (anecdotally)
• Z-normalization does not respect the boundedness of the data
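
The volatility of the maximum can be seen in a toy example. The sketch below uses made-up intensities and hypothetical helper names: it adds a single extreme cell to a slide and compares how much the rescaled value of one ordinary cell changes under min-max scaling versus mean division.

```python
import statistics

def min_max_scale(values):
    """Rescale one slide by its own minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def mean_divide(values):
    """Divide every intensity by the slide mean."""
    mu = statistics.fmean(values)
    return [x / mu for x in values]

# Made-up intensities: 1000 cells cycling through the values 1..10.
cells = [float(1 + k % 10) for k in range(1000)]
with_outlier = cells + [1000.0]  # one extreme cell added

# Follow the same cell (intensity 5.0, index 4) in both versions of the slide.
idx = 4
minmax_before = min_max_scale(cells)[idx]
minmax_after = min_max_scale(with_outlier)[idx]
meandiv_before = mean_divide(cells)[idx]
meandiv_after = mean_divide(with_outlier)[idx]

print("min-max:", minmax_before, "->", minmax_after)
print("mean division:", meandiv_before, "->", meandiv_after)
```

With these numbers the min-max value of the cell shrinks by more than a factor of 100, while the mean-divided value changes by less than 20%.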

Common harmonization
• None have been developed for mIF data
• Closest option is ComBat-seq for negative binomial (count) data
Zhang et al., NAR Genomics and Bioinformatics, 2020

Investigation of normalization in mIF
• We explored combinations of batch correction and transformation
[Table: harmonization method by transformation/normalization]
Johnson et al., Biostatistics, 2007; Fortin et al., NeuroImage, 2017; Wrobel et al., NeuroImage, 2020; Harris et al., Bioinformatics, 2022

New batch correction evaluation framework
• Most spatial-omics data lack ground truth for evaluating slide effects
• Cole Harris established a framework to evaluate batch correction
  1. Alignment of marker densities
  2. Cell phenotyping discordance
  3. Proportion of variance due to slide
[Figure: Alignment of marker densities]

Cell phenotyping discordance
Logic of the procedure:
• There are no batch effects within a slide – phenotypes assigned within a slide are unaffected by batch
• Apply phenotyping within slide and across slides for each batch correction method and measure discordance

                      | Across-slide negative | Across-slide positive
Within-slide negative | (-, -)                | (-, +)
Within-slide positive | (+, -)                | (+, +)
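
The 2×2 table and a discordance rate can be computed as follows. This is a sketch under stated assumptions: the positive/negative labels are taken as given (how they are obtained, for example by thresholding, is left open here), the function names are hypothetical, and the labels are made up.

```python
from collections import Counter

def discordance_table(within, across):
    """Count cells in each (within-slide, across-slide) label combination."""
    if len(within) != len(across):
        raise ValueError("label lists must have the same length")
    return Counter(zip(within, across))

def discordance_rate(within, across):
    """Fraction of cells whose two labels disagree: (-, +) or (+, -)."""
    table = discordance_table(within, across)
    discordant = table[(False, True)] + table[(True, False)]
    return discordant / len(within)

# Made-up labels for six cells: True = marker positive, False = negative.
within_slide = [True, True, False, False, True, False]
across_slide = [True, False, False, False, True, True]

table = discordance_table(within_slide, across_slide)
rate = discordance_rate(within_slide, across_slide)
print(dict(table))
print(rate)
```

A batch correction method that works well should give across-slide labels that agree with the within-slide labels, so a lower rate is better.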

Cell phenotyping discordance results
[Figure: Cell phenotyping discordance]

Proportion of variance due to slide
• Random effects model quantifying slide effect
  $y_{ij} = \beta + \gamma_i + \epsilon_{ij}$
• Intraclass correlation (ICC) is the proportion of variance due to slide
  $\mathrm{ICC} = \dfrac{\sigma_\gamma^2}{\sigma_\gamma^2 + \sigma_\epsilon^2}$
• Lower is better
[Figure: Proportion of variance due to slide]
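
One simple way to estimate the ICC is the one-way ANOVA (method-of-moments) estimator for a balanced design, sketched below. This is a stand-in chosen for brevity, not necessarily how the random effects model was fitted in the talk; the function name and the toy data are assumptions.

```python
import statistics

def icc_oneway(slides):
    """ANOVA estimate of sigma_gamma^2 / (sigma_gamma^2 + sigma_eps^2).

    `slides` is a list of equal-length lists, one list of values per slide.
    """
    n = len(slides[0])
    if any(len(s) != n for s in slides):
        raise ValueError("this sketch assumes the same number of cells per slide")
    slide_means = [statistics.fmean(s) for s in slides]
    # Between-slide and within-slide mean squares.
    msb = n * statistics.variance(slide_means)
    msw = statistics.fmean([statistics.variance(s) for s in slides])
    # Method-of-moments slide variance, truncated at zero.
    var_slide = max((msb - msw) / n, 0.0)
    return var_slide / (var_slide + msw)

# Toy data: three slides with a strong slide-level shift.
uncorrected = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
# The same slides after centering each one (an idealized correction).
corrected = [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]

icc_uncorrected = icc_oneway(uncorrected)
icc_corrected = icc_oneway(corrected)
print(icc_uncorrected, icc_corrected)
```

In the toy data the shift between slides accounts for most of the variance before correction and none of it afterwards.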

mxnorm: an R package for normalization of spatial-omics data
• Users can easily evaluate normalization procedures in their own data
• Default normalization options (from Harris et al. 2022) or user specified
• mxnorm can be used to evaluate new normalization methods in future papers
• Available on CRAN
Harris, Wrobel, Vandekar, JOSS, 2022

Summary of batch correction methods
• Transformations and normalization are widely used and adapted from other fields
• Harmonization-based batch correction methods have not been well developed or evaluated
• It is likely that normalization is insufficient
• Room for new methods development

Cell phenotyping

Cell phenotyping: defining cell types
• Marker gating – cell types defined by individual markers
  • Subjective
[Figure: two panels of CD8+ cells, P(CD8>0)=0.04 and P(CD8>0)=0.4]

Cell phenotyping: defining cell types
• Marker gating – cell types defined by individual markers
  • Subjective
• Clustering – cell types defined via clustering
  • Manually intensive
• Recently developed semiautomated methods
  • ASTIR (Geuenich et al., Cell Systems, 2022)
  • HALO Inform phenotyping
[Figure: Leiden clustering, UMAP 1 vs. UMAP 2]

Semiautomated methods overview
• Inform – proprietary, semiautomated (requires manually annotated training data)
• ASTIR – unsupervised clustering, but takes a YAML file defining cell types
  • A mixture model assuming transformed data are log-normal
  • Uses variational Bayes for estimation with neural networks as variational distribution
• GammaGateR – semi-supervised marker gating

GammaGateR – semiautomated marker gating in R
• GammaGateR fits gamma mixture models to perform automated marker gating separately for each channel and slide
• It takes user-specified constraints to obtain a consistent model fit on each slide
• Returns the probability of being marker positive for each channel
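
To make "probability of being marker positive" concrete, the sketch below evaluates the posterior probability of the expressed component in a zero-inflated two-component gamma mixture. It is not GammaGateR code: the fitting step is omitted, the parameters are fixed made-up values, the function names are hypothetical, and treating exact zeros as not expressed is an assumption of this sketch.

```python
import math

def gamma_pdf(x, shape, scale):
    """Gamma density at x > 0, computed on the log scale for stability."""
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)

def posterior_expressed(x, weights, unexpressed, expressed):
    """P(expressed | x) for one cell.

    weights     = (zero proportion, unexpressed proportion, expressed proportion)
    unexpressed = (shape, scale) of the unexpressed gamma component
    expressed   = (shape, scale) of the expressed gamma component
    """
    if x <= 0.0:
        return 0.0  # assumption: zeros belong to the zero component
    _, w_unexpressed, w_expressed = weights
    d_unexpressed = w_unexpressed * gamma_pdf(x, *unexpressed)
    d_expressed = w_expressed * gamma_pdf(x, *expressed)
    return d_expressed / (d_unexpressed + d_expressed)

# Made-up parameters for one channel on one slide.
weights = (0.1, 0.7, 0.2)
unexpressed = (2.0, 0.5)   # mode at (2 - 1) * 0.5 = 0.5
expressed = (10.0, 0.5)    # mode at (10 - 1) * 0.5 = 4.5

for x in [0.0, 0.5, 2.5, 5.0]:
    print(x, posterior_expressed(x, weights, unexpressed, expressed))
```

Cells near the unexpressed mode get a posterior close to 0, and cells near the expressed mode get a posterior close to 1.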

Advantages of GammaGateR
• Limitations of existing methods
  • Designed for less noisy data
  • Don’t consider underlying distribution
• Benefits of Gamma mixture model
  • Can accommodate skewed distributions
  • Incorporates zero-inflation
  • Model incorporates biological information as a prior
[Figure: CD3D cell expression]

Advantages of GammaGateR
[Figure: NAKATPase cell expression]

GammaGateR performs competitively
[Figure: comparison to silver-standard manual labels, and biological prediction error (C-index) for ovarian cancer stage; series include GammaGateR Posterior and GammaGateR Marginal]

GammaGateR model fit
• Component zero is the proportion of zeros
• Component 1 is unexpressed cells
• Component 2 is expressed cells
• Mode is the location of the peak
• Lambda is the proportion in each cell type

GammaGateR constraints
• User-specified constraints improve model fit across slides
[Figure: boundary for expressed cell mode and boundary for unexpressed cell mode]
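
The boundaries on this slide apply to the modes of the two gamma components. As a rough illustration, the sketch below computes the mode of a gamma distribution from its shape and scale and checks it against a user-chosen interval. This is a post hoc check with hypothetical names and made-up numbers; it does not show how GammaGateR enforces constraints during fitting.

```python
def gamma_mode(shape, scale):
    """Mode of a gamma distribution: (shape - 1) * scale when shape > 1, else 0."""
    if shape <= 1.0:
        return 0.0
    return (shape - 1.0) * scale

def mode_within(shape, scale, lower, upper):
    """True if the component mode lies inside the interval [lower, upper]."""
    return lower <= gamma_mode(shape, scale) <= upper

# Made-up boundaries for one marker.
unexpressed_bounds = (0.0, 1.0)
expressed_bounds = (3.0, 8.0)

# A fit whose components respect the boundaries.
print(mode_within(2.0, 0.5, *unexpressed_bounds))
print(mode_within(10.0, 0.5, *expressed_bounds))

# A fit whose "expressed" component has drifted toward the unexpressed cells.
print(mode_within(3.0, 0.5, *expressed_bounds))
```

The last check fails because the mode (1.0) falls below the lower boundary for the expressed component, which is the kind of fit the constraints are meant to rule out.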

Constraints improve the model fit
• Green – unconstrained
• Red – constrained

GammaGateR in subsequent analyses
• GammaGateR yields posterior probabilities of cells being in the “expressed” population
• These can be used to compute cell proportions, thresholded to define cell types, or used directly in downstream analysis
• QR code links to tutorial using Julia’s ovarian cancer dataset
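
Two of the downstream uses listed above can be sketched directly from a vector of posterior probabilities. The probabilities and the 0.5 cutoff below are made up for illustration, and the function names are hypothetical.

```python
import statistics

def expected_proportion(posteriors):
    """Estimated proportion of expressed cells: the mean posterior probability."""
    return statistics.fmean(posteriors)

def threshold_labels(posteriors, cutoff=0.5):
    """Hard cell-type calls: True where the posterior exceeds the cutoff."""
    return [p > cutoff for p in posteriors]

# Made-up posterior probabilities for eight cells in one channel.
posteriors = [0.02, 0.10, 0.95, 0.60, 0.01, 0.99, 0.40, 0.05]

soft_proportion = expected_proportion(posteriors)
labels = threshold_labels(posteriors)
hard_proportion = sum(labels) / len(labels)

print(soft_proportion)
print(labels)
print(hard_proportion)
```

The two proportions differ slightly because thresholding discards the uncertainty of cells with intermediate posteriors.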

Thank you
• Colleagues: Robert Coffey, Ken Lau, Martha Shrubsole, Eliot McKinley, Joe Roland, Qi Liu, Coleman Harris, Ruby Xiong
• Funding: Vanderbilt Ingram Cancer Center GI SPORE: P50CA236733, NLM 5T32LM012412-05, R01MH123563

