
Don’t Normalize: the GLM-PCA Approach to Normalization

Will Townes
November 19, 2019

Transcript

  1. Don’t Normalize: The GLM-PCA approach to normalization

    Will Townes, Department of Computer Science, Princeton University. 19 November 2019.
  2. RNA-seq measures relative abundance (Batson et al. 2019, bioRxiv)

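  3. Normalization as estimation of relative abundance

    Model: $y_{ij} \sim \mathrm{Multinomial}(n_i, \pi_{ij})$
    Maximum likelihood estimate: $\hat{\pi}_{ij} = \dfrac{y_{ij}}{n_i}$
    Estimate with pseudocount $\alpha_i$: $\tilde{\pi}_{ij} = \dfrac{y_{ij} + \alpha_i}{n_i + J\alpha_i}$
    Log-CPM: $\log_2(1 + \mathrm{CPM}) = \log_2(\tilde{\pi}_{ij}) + C$
    The Poisson approximation holds when $n_i$ is large and $\pi_{ij}$ is small.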
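The log-CPM identity on slide 3 can be checked numerically. This is a minimal sketch, assuming the pseudocount is α_i = n_i / 10^6 (the slide does not state its value). Under that assumption the constant works out to C = log2(10^6 + J), where J is the number of genes. The function names and example values are illustrative only.

```python
import math

def log2_one_plus_cpm(y, n):
    """log2(1 + CPM) for a count y in a cell with total count n."""
    return math.log2(1.0 + 1e6 * y / n)

def log2_pi_tilde(y, n, n_genes):
    """log2 of the pseudocount estimate (y + alpha) / (n + J * alpha)."""
    alpha = n / 1e6  # assumption: alpha_i = n_i / 10^6 (not stated on the slide)
    return math.log2((y + alpha) / (n + n_genes * alpha))

n_genes = 20000                   # J, a hypothetical number of genes
C = math.log2(1e6 + n_genes)      # constant implied by the assumed alpha_i

# (count, total count) pairs chosen only for illustration
examples = [(0, 1500), (1, 1500), (7, 40000)]
gaps = [log2_one_plus_cpm(y, n) - (log2_pi_tilde(y, n, n_genes) + C)
        for y, n in examples]
print(gaps)  # each entry should be zero up to floating-point error
```

With a different choice of α_i the two sides would no longer differ by a cell-independent constant, so the identity depends on that assumption.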
  4. Problems with normalization

    Small counts limit MLE accuracy.
    Justin Silverman: http://www.statsathome.com/2017/09/14/visualizing-the-multinomial-in-the-simplex/
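A small simulation can illustrate the point that small counts limit the accuracy of the MLE y_ij / n_i. The sketch below draws one shallow and one deep multinomial sample from the same made-up relative abundances and compares the total absolute error of the MLE. The gene count, the sequencing depths, and the way the abundances are generated are all assumptions for illustration, not values from the talk.

```python
import random
from collections import Counter

def mle_l1_error(pi, n, rng):
    """Draw n counts from Multinomial(n, pi); return sum_j |y_j / n - pi_j|."""
    draws = rng.choices(range(len(pi)), weights=pi, k=n)
    counts = Counter(draws)  # genes that were never drawn count as zero
    return sum(abs(counts[j] / n - p) for j, p in enumerate(pi))

rng = random.Random(0)
n_genes = 500                                  # hypothetical number of genes
raw = [rng.expovariate(1.0) for _ in range(n_genes)]
total = sum(raw)
pi = [x / total for x in raw]                  # made-up true relative abundances

err_small = mle_l1_error(pi, 100, rng)         # shallow cell: n_i = 100
err_large = mle_l1_error(pi, 100_000, rng)     # deep cell: n_i = 100,000
print(err_small, err_large)
```

In the shallow sample most genes receive zero counts, so their estimated relative abundance is exactly zero regardless of the true value.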
  5. Problems with normalization

    Artificial zero inflation from log-transform.
    [Figure: three histograms with x-axes counts, CPM, and log2(1+CPM)]
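A short worked example may help show where the artificial zero inflation comes from. Assume a hypothetical cell with 1,000 total counts: raw counts of 0, 1, and 2 become CPM values of 0, 1000, and 2000. After log2(1 + CPM), zero stays at 0 while a count of one jumps to about 10, so the gap between zero and one is roughly ten times the gap between one and two.

```python
import math

def log2_one_plus_cpm(y, n):
    """log2(1 + CPM) for a count y in a cell with total count n."""
    return math.log2(1.0 + 1e6 * y / n)

n = 1000  # hypothetical total count for one cell
values = [log2_one_plus_cpm(y, n) for y in (0, 1, 2)]
gap_zero_to_one = values[1] - values[0]
gap_one_to_two = values[2] - values[1]
print(values, gap_zero_to_one, gap_one_to_two)
```

On the count scale the two gaps are equal, which is why the separation of zeros from everything else is an artifact of the transform rather than a feature of the data.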
  6. Variance stabilizing transformations

    [Figure: variance versus mean, in panels log_cpm, counts_vst, rel_abund_vst, counts, rel_abund, cpm]
  7. GLM-PCA: avoid normalization by using models

    $y_{ij} \sim \mathrm{Poi}(n_i \pi_{ij}) \approx \mathrm{Mult}(n_i, \pi_{ij})$
    $\pi_{ij} = f_j(u_i) = \exp(v_j^\top u_i)$
    Improve estimation of $\pi_{ij}$ by sharing information across cells.
    Variance stabilization is not necessary with an explicit noise model.
    ZINB-WaVE, scVI, and the linear decoded VAE also do this.
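The Poisson model on slide 7 can be sketched in a few lines. The code below is an illustration under stated assumptions, not the optimizer used by any GLM-PCA software: it uses one latent factor plus a gene intercept (equivalent to fixing one component of u_i at 1), plain gradient descent, and step halving so the loss never increases. The function names and the toy count matrix are hypothetical.

```python
import math
import random

def poisson_loss(Y, log_n, a, u, v):
    """Negative Poisson log-likelihood, dropping the log(y!) constant."""
    total = 0.0
    for i, row in enumerate(Y):
        for j, y in enumerate(row):
            eta = log_n[i] + a[j] + u[i] * v[j]  # log of n_i * pi_ij
            try:
                total += math.exp(eta) - y * eta
            except OverflowError:
                return math.inf  # treat overflow as a rejected step
    return total

def fit_rank1(Y, n_iter=200, seed=0):
    """Fit pi_ij = exp(a_j + u_i * v_j); assumes nonzero cell and gene totals."""
    rng = random.Random(seed)
    n_cells, n_genes = len(Y), len(Y[0])
    n = [sum(row) for row in Y]
    log_n = [math.log(x) for x in n]
    # Gene intercepts start at the pooled log relative abundance.
    a = [math.log(sum(row[j] for row in Y) / sum(n)) for j in range(n_genes)]
    u = [rng.gauss(0.0, 0.1) for _ in range(n_cells)]
    v = [rng.gauss(0.0, 0.1) for _ in range(n_genes)]
    losses = [poisson_loss(Y, log_n, a, u, v)]
    step = 1e-2
    for _ in range(n_iter):
        g_a, g_u, g_v = [0.0] * n_genes, [0.0] * n_cells, [0.0] * n_genes
        for i in range(n_cells):
            for j in range(n_genes):
                # d(loss)/d(eta_ij) = mu_ij - y_ij
                r = math.exp(log_n[i] + a[j] + u[i] * v[j]) - Y[i][j]
                g_a[j] += r
                g_u[i] += r * v[j]
                g_v[j] += r * u[i]
        while step > 1e-12:
            a_try = [x - step * g for x, g in zip(a, g_a)]
            u_try = [x - step * g for x, g in zip(u, g_u)]
            v_try = [x - step * g for x, g in zip(v, g_v)]
            loss_try = poisson_loss(Y, log_n, a_try, u_try, v_try)
            if loss_try <= losses[-1]:
                a, u, v = a_try, u_try, v_try
                losses.append(loss_try)
                break
            step *= 0.5  # backtrack: halve the step and retry
    return a, u, v, losses

# Toy counts: rows are cells, columns are genes (made up for illustration).
Y = [
    [30, 2, 8, 10],
    [28, 3, 9, 12],
    [33, 1, 7, 9],
    [5, 25, 10, 11],
    [4, 27, 9, 10],
    [6, 24, 12, 13],
]
a, u, v, losses = fit_rank1(Y)
print(losses[0], losses[-1])
```

No transformation of the counts is needed: the raw matrix enters the likelihood directly and the total count n_i acts as an offset. The objective is not convex in (u, v) jointly, so a different seed may reach a different solution.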
  8. GLM-PCA failure modes

    Nonconvex optimization problem
    Numerical divergences
    Local optima
    Slow computation
    Too many factors?
  9. Maybe normalization is not so bad

    Linear-Gaussian models (PCA): fast, convenient, interpretable.
    PCA requires normally distributed errors.
    Transform data to match the Gaussian assumption.
    Idea: GLM residuals are asymptotically normal.
    Fit a multinomial null model and use deviance residuals:
    $$D_j = 2 \sum_i \left[ y_{ij} \log\frac{y_{ij}}{n_i \hat{\pi}_j} + (n_i - y_{ij}) \log\frac{n_i - y_{ij}}{n_i (1 - \hat{\pi}_j)} \right]$$
    (or Pearson residuals):
    $$\frac{y_{ij} - n_i \hat{\pi}_{ij}}{\sqrt{n_i \hat{\pi}_{ij} (1 - \hat{\pi}_{ij})}}$$
    where $\hat{\pi}_{ij} = \hat{\pi}_j$ under the null model.
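The null residuals on slide 9 can be computed directly from a count matrix. The sketch below assumes the null estimate is the pooled relative abundance of each gene, assigns each deviance residual the sign of y_ij minus its fitted mean, and assumes every cell and gene has a nonzero total. The function names and the tiny example matrix are hypothetical.

```python
import math

def _xlog_ratio(a, b):
    """Return a * log(a / b), using the convention 0 * log(0) = 0."""
    return a * math.log(a / b) if a > 0 else 0.0

def null_residuals(Y):
    """Pearson and deviance residuals under a null model with constant pi_j.

    Assumes every cell and every gene has a nonzero total and that no
    single gene accounts for all counts.
    """
    n = [sum(row) for row in Y]  # n_i: total count per cell
    total = sum(n)
    # Assumed null estimate: pooled relative abundance of gene j.
    pi_hat = [sum(row[j] for row in Y) / total for j in range(len(Y[0]))]
    pearson, deviance = [], []
    for n_i, row in zip(n, Y):
        p_row, d_row = [], []
        for y, p in zip(row, pi_hat):
            mu = n_i * p
            p_row.append((y - mu) / math.sqrt(mu * (1.0 - p)))
            d = 2.0 * (_xlog_ratio(y, mu) + _xlog_ratio(n_i - y, n_i - mu))
            # Signed square root; max() guards against tiny negative rounding.
            d_row.append(math.copysign(math.sqrt(max(d, 0.0)), y - mu))
        pearson.append(p_row)
        deviance.append(d_row)
    return pearson, deviance

Y = [[0, 4], [4, 0]]  # two cells, two genes, made up for illustration
pearson, deviance = null_residuals(Y)
# The gene-level deviance D_j is the sum of squared deviance residuals.
gene_deviance = [sum(row[j] ** 2 for row in deviance) for j in range(len(Y[0]))]
print(pearson, deviance, gene_deviance)
```

Either residual matrix could then be passed to ordinary PCA in place of log-transformed counts.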
  10. Normalization via null residuals

    [Figure: scatter plots of dim1 versus dim2 in panels pca_rp, pca_rd, glmpca, pca; points marked by batch (1, 2) and clust (1, 2, 3)]
  11. Quantile normalization of read counts

    [Figure: histograms of ENSG00000114391 UMI counts and ENSG00000114391 read counts; y-axis: number of droplets in bin]
    Sometimes the generative process is too complex for modeling.
    UMI target distribution easier than Gaussian: “quasi-UMIs”.
    Quasi-UMI only changes nonzero values.
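The quantile-mapping idea on slide 11 can be sketched as follows. Zeros are left untouched, and the nonzero read counts in a cell are replaced by quantiles of a discrete target distribution at matching ranks. The slide does not specify the target, so this sketch substitutes a geometric distribution on {1, 2, ...} purely as a stand-in. The function name and the parameter `p` are hypothetical.

```python
import math
from collections import Counter

def quasi_umi_sketch(reads, p=0.3):
    """Map nonzero read counts of one cell onto quantiles of a target law.

    Stand-in target (an assumption): geometric on {1, 2, ...} with success
    probability p, whose quantile function is ceil(log(1-u) / log(1-p)).
    """
    nonzero = [r for r in reads if r > 0]
    if not nonzero:
        return [0] * len(reads)
    m = len(nonzero)
    counts = Counter(nonzero)
    mapping, seen = {}, 0
    for value in sorted(counts):
        c = counts[value]
        u = (seen + 0.5 * c) / m  # mid-rank quantile level, strictly in (0, 1)
        q = math.ceil(math.log1p(-u) / math.log1p(-p))
        mapping[value] = max(1, q)  # nonzero reads stay nonzero
        seen += c
    # Zeros are copied through unchanged; ties share one target value.
    return [mapping[r] if r > 0 else 0 for r in reads]

reads = [0, 5, 0, 12, 5, 300, 41]  # made-up read counts for one cell
quasi = quasi_umi_sketch(reads)
print(quasi)
```

Because the mapping depends only on ranks, the ordering of genes within a cell is preserved while the heavy right tail of the read counts is compressed.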
  12. Quasi-UMI normalization accuracy

    [Figure: distance from UMI counts by method (qumi2, qumi1, census, reads) in panels macosko_2015, tung_2016, zheng_2017_monocytes]
  13. When does normalization work?

    Large total UMI counts
    Better capture efficiency and reverse transcriptase
    Consistently processed samples
    No amplification noise (PCR)
    The future of normalization is bright thanks to wet lab innovation!
  14. How to demonstrate success

    Ground-truth negative controls: no biology, verify removal of technical noise and batch effects
    Ground-truth positive controls: known biology, verify preservation of signal
    Denoising/molecular cross-validation
    Simulations: how to know if correct generative model?
    Posterior predictive checks for Bayesian models
  15. Ideas for tomorrow

    Read counts vs UMI counts: assess separately
    Learn from ecology & metagenomics, e.g. distance metrics
    Denoiser concept (Batson) for comparing implicit normalization of models
    Negative controls: Tung 2017, 10x purified cells, Sarkar 2019
    Positive controls: assessments will depend on downstream feature selection, dimension reduction, clustering, etc.
    Speed, memory consumption matter
    Sun et al 2019: comprehensive assessment of dimension reduction
    Duo et al 2018: preprocessing, clustering assessments