Slide 1

Slide 1 text

Benchmarking cell type deconvolution methods with human brain data
Leonardo Collado Torres, LIBD Investigator + Asst. Prof. Johns Hopkins Biostatistics
Single-cell genomics webinar LA speaker at WCS, Sept 26 2024
@lcolladotor lcolladotor.github.io lcolladotor.github.io/bioc_team_ds
Slides available at speakerdeck.com/lcolladotor

Slide 2

Slide 2 text

What defines me
• Bioinformatics
• R and Bioconductor
• Reproducibility and best practices
• Outreach and community building
• Back in 2005 at @LCGUNAM: I like math and coding; biology provides the challenging problems

Slide 3

Slide 3 text

History 2005-2009 Undergrad in Genomic Sciences 2009-2011 2011-2016 August 2016+ Data Science Division Leader PIs: ● Jeff Leek: 2012+ ● Andrew Jaffe: 2013+ Ph.D. Biostatistics Staff Scientist I → II → Research Scientist → Investigator Data Science Team I PIs: ● Andrew Jaffe 2016-2020 ● Myself 2020+ Division Leader: Keri Martinowich 2024+

Slide 4

Slide 4 text

2008+ • BioC 2008-2011, 2014, 2017, 2019-2023 • useR!2013, 2021 • rOpenSci unconf 2018 • RStudio::conf 2019-2021 @lcolladotor 2010+ @LIBDrstats 2018+ @CDSBMexico 2018+ Defunct: BmoreBiostats, Biostats Cultural Mixers Guest @RLadiesBmore #RLadiesMx Blog: http://lcolladotor.github.io 2011+ FB: 75k, Tw: 66k weekly Interests

Slide 5

Slide 5 text

doi.org/10.1016/j.biopsych.2020.06.005 Michael Gandal @mikejg84 Transcriptomic Insight Into the Polygenic Mechanisms Underlying Psychiatric Disorders

Slide 6

Slide 6 text

Background: Cell Types in the Brain
● The brain is made of complex tissues consisting of different types of cells
● Some diagnoses are associated with changes in cell type specific expression
○ Ex. Pitt-Hopkins syndrome and oligodendrocytes (Phan et al, Nature Neuroscience, 2020)
Louise Huuki-Myers @lahuuki
speakerdeck.com/lahuuki/benchmarking-deconvolution-methods-in-the-human-brain

Slide 7

Slide 7 text

How can we connect bulk RNA-seq to cell type information?
[Diagram labels: Tissue, Bulk RNA-seq, snRNA-seq, Deconvolution, Estimated proportions; cost labels: $$$, $, Free!]

Slide 8

Slide 8 text

What is Deconvolution?
Computational method that...
● Infers the composition of different cell types in bulk RNA-seq data
● Utilizes single cell data to obtain cell type gene expression profiles

Slide 9

Slide 9 text

Why is Deconvolution Important?
● Tissue is heterogeneous
○ Different cell types express genes at different levels
● Samples can differ in cell type composition due to biology or dissection
○ Check for differences in case vs. control
● Controlling for cell fractions between samples can make case vs. control analysis cleaner
○ Quality control
○ Confounding factor in differential expression analysis - prevents false-positives and false-negatives

Slide 10

Slide 10 text

How do you run deconvolution?
deconvolution(Y, Z) = Proportion of Cell Types
● Y: gene expression of the bulk RNA-seq samples
● Z: gene expression of the scRNA-seq cell type populations
● deconvolution: the computational algorithm
● Output: proportion of each cell type in each bulk sample
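To make the idea of deconvolution(Y, Z) concrete, here is a minimal Python sketch that estimates cell type proportions for one bulk sample with a non-negative least squares fit (multiplicative updates). It is a toy and not the algorithm of any method benchmarked in this talk; the function name `estimate_proportions`, the toy reference matrix, and the iteration count are assumptions made for illustration.

```python
def estimate_proportions(y, Z, n_iter=500):
    """Toy deconvolution: find p >= 0 minimizing ||y - Z p||^2, then rescale p to sum to 1.

    y: bulk expression per gene (non-negative floats)
    Z: reference profiles, one row per gene and one column per cell type (non-negative)
    """
    n_genes, n_types = len(Z), len(Z[0])
    # Precompute Z^T y and Z^T Z once; they do not change between iterations.
    zty = [sum(Z[g][k] * y[g] for g in range(n_genes)) for k in range(n_types)]
    ztz = [[sum(Z[g][i] * Z[g][j] for g in range(n_genes)) for j in range(n_types)]
           for i in range(n_types)]
    p = [1.0 / n_types] * n_types  # start from equal proportions
    for _ in range(n_iter):
        ztzp = [sum(ztz[i][j] * p[j] for j in range(n_types)) for i in range(n_types)]
        # Multiplicative update keeps every entry of p non-negative.
        p = [p[i] * zty[i] / ztzp[i] if ztzp[i] > 0 else 0.0 for i in range(n_types)]
    total = sum(p)
    return [v / total for v in p]


# Hypothetical reference: 4 genes x 2 cell types.
Z = [[10.0, 1.0], [8.0, 2.0], [1.0, 9.0], [2.0, 12.0]]
true_p = [0.3, 0.7]
# Simulated noiseless bulk sample built from the known mixture.
y = [sum(Z[g][k] * true_p[k] for k in range(2)) for g in range(4)]
props = estimate_proportions(y, Z)
```

On noiseless simulated data like this the estimate recovers the mixing proportions; real bulk RNA-seq adds noise and assay differences, which is why the methods compared later differ in accuracy.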

Slide 11

Slide 11 text

Which Method Should We Use?
● There are 20+ published single cell reference-based methods
deconvolution(Y, Z) = Proportion of Cell Types

Slide 12

Slide 12 text

Which Method is the Most Accurate?
● Benchmarking shows that different methods perform best on different data sets (Cobos et al, Nature Communications, 2020)
● Benchmarking results from different papers on “real” data
○ MuSiC paper: MuSiC > NNLS > BSEQ-sc > CIBERSORT
■ Pancreatic Islet: Beta cells vs. HbA1c (Fig 2a)
○ Bisque paper: Bisque > MuSiC > CIBERSORT
■ DLPFC: Microglia vs. Braak stage, Neuron vs. Cognitive diagnostic category (Fig 4)
○ Cobos et al. benchmark: DWLS > MuSiC > Bisque > deconvSeq
■ Human PBMC flow sorted (Fig 7)
○ Jin et al. benchmark: CIBERSORT, MuSiC > EPIC*, TIMER, DeconRNASeq
■ Human Whole Blood, simulations
○ Dai et al. benchmark: dtangle > Bisque > Other Methods
■ Human brain IHC & scRNA-seq data

Slide 13

Slide 13 text

Goals of Deconvolution Benchmark
● Build multi-assay dataset with orthogonal cell type measurements
● Test top deconvolution methods that employ different strategies
● Assess impact of other factors in deconvolution
○ Bulk RNA-seq data types
○ snRNA-seq features
○ Marker genes

Slide 14

Slide 14 text

How can we build on previous benchmarks?
Previous Strategies to Assess Accuracy
● Use pseudobulk samples
○ Known or simulated composition
○ May not reflect real bulk RNA-seq data
● Compare with Immunofluorescence Data
● Cell flow sorting
○ Difficult to label nuclei by cell type
Our Strategy
● Use paired orthogonal imaging data to measure cell type proportions & evaluate method accuracy
● Focus on brain tissue

Slide 15

Slide 15 text

Orthogonal Data
● Alternative measurement of the same thing (cell type proportions)
○ Multiple independent measurements build confidence
● “Gold standard”
○ *All methods have biases

Slide 16

Slide 16 text

Multi-modal dataset From Human DLPFC

Slide 17

Slide 17 text

Spatial DLPFC Dataset
Kelsey Montgomery, Louise Huuki-Myers

Slide 18

Slide 18 text

Experimental Design

Slide 19

Slide 19 text

Single Nucleus RNA-seq References
Huuki-Myers et al, Science, 2024 10.1126/science.adh1938
● 10 Donors (n=19)
● Seven broad cell types
● 56k nuclei

Slide 20

Slide 20 text

Bulk RNA-seq
n = 110
6 library type + RNA extraction combinations
[Figure axes: Library Type, RNA Extraction]

Slide 21

Slide 21 text

RNAScope/IF Experiment Design
● Measure the abundance of 6 broad cell types
● Filtered for high quality images
Kelsey Montgomery

Slide 22

Slide 22 text

RNAScope/IF Estimated Cell Type Proportions

Slide 23

Slide 23 text

RNAScope Cell Type Annotations Make Sense Spatially

Slide 24

Slide 24 text

RNAScope vs. snRNA-seq Proportions
Comparing Cell Type Proportions
● Pearson’s correlation (cor)
● Root Mean Squared Error (rmse)
● Relative rmse (rrmse)
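The three comparison metrics can be sketched in a few lines of Python. This is an illustration only: the slide does not define rrmse, so dividing rmse by the mean of the reference proportions is an assumption here, and the function names and toy proportions are hypothetical.

```python
import math


def pearson_cor(x, y):
    """Pearson's correlation between two equal-length lists of proportions."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(estimated, reference):
    """Root mean squared error between estimated and reference proportions."""
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))


def rrmse(estimated, reference):
    """Relative rmse. Assumption: rmse scaled by the mean reference proportion."""
    return rmse(estimated, reference) / (sum(reference) / len(reference))


# Hypothetical proportions for three cell types in one sample.
estimated = [0.1, 0.2, 0.7]
reference = [0.2, 0.2, 0.6]
cor_value = pearson_cor(estimated, reference)
rmse_value = rmse(estimated, reference)
rrmse_value = rrmse(estimated, reference)
```

A high cor with a high rmse can happen when a method ranks cell types correctly but is off in scale, which is why the slides report both.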

Slide 25

Slide 25 text

Experimental Design Connection to Benchmark
deconvolution(Y, Z) = Proportion of Cell Types
Six Methods
1. DWLS
2. Bisque
3. MuSiC
4. BayesPrism
5. hspe
6. CIBERSORTx

Slide 26

Slide 26 text

Deconvolution Benchmark
[Diagram sections: Method, Marker Genes, Dataset Features]

Slide 27

Slide 27 text

Evaluate Deconvolution Methods

Slide 28

Slide 28 text

Method: Evaluate Deconvolution Methods
1. What is the most accurate deconvolution method for brain tissue?
2. Is accuracy impacted by type of bulk RNA-seq?
a. Library type?
b. RNA extraction?

Slide 29

Slide 29 text

Run Deconvolution
deconvolution(Y, Z) = Proportion of Cell Types
● 110 bulk samples
● Paired snRNA-seq
● 7 cell types

Slide 30

Slide 30 text

Methods return a wide range of proportion estimates
[Figure: example tissue block B2720_post]
Each Tissue Block has 6 Bulk RNA-seq samples

Slide 31

Slide 31 text

All 19 Tissue Blocks (110 bulk RNA-seq samples)

Slide 32

Slide 32 text

Bisque and hspe are Most Accurate Methods Compared to RNAScope/IF
Accurate Methods have:
● High Pearson’s correlation (cor)
● Low Root Mean Squared Error (rmse)

Slide 33

Slide 33 text

Bisque and hspe are Most Accurate Methods Compared to snRNA-seq

Slide 34

Slide 34 text

Library Type Impacts Method Performance Compared to RNAScope/IF

Slide 35

Slide 35 text

Method: Evaluate Six Deconvolution Methods
1. What is the most accurate deconvolution method for brain tissue? hspe & Bisque
2. Is accuracy impacted by type of bulk RNA-seq? Yes
a. Library type? Bisque more accurate in polyA, hspe in RiboZeroGold
b. RNA extraction? Some impact but inconsistent

Slide 36

Slide 36 text

Marker Genes

Slide 37

Slide 37 text

Marker Genes: Select Effective Marker Genes
1. Does selecting marker genes improve deconvolution?
2. How to best select good sets of marker genes?

Slide 38

Slide 38 text

Marker Gene Selection
● Filter for genes expressed in snRNA-seq and bulk data
● Look for genes expressed in only one cell type
○ Test for specificity of each gene for each cell type
● Observe expression of selected marker genes
○ Heat maps of pseudobulked data
[Figure: The Ideal Heatmap; snRNA-seq data, pseudobulked by cell type]
Stephanie C Hicks

Slide 39

Slide 39 text

1 vs. All Marker Gene Selection
scran::findMarkers()

Slide 40

Slide 40 text

Mean Ratio Gene Selection DeconvoBuddies::get_mean_ratio()
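As a rough illustration of the Mean Ratio idea, the Python sketch below divides a gene's mean expression in the target cell type by its mean expression in the highest non-target cell type. This reflects my understanding of the statistic and is not the DeconvoBuddies implementation (which is an R function working on single cell objects); the function name `mean_ratio`, the cell type labels, and the toy values are hypothetical.

```python
def mean_ratio(mean_expr, target):
    """Ratio of the target cell type's mean expression to the highest non-target mean.

    mean_expr: dict mapping cell type -> mean expression of one gene
    target: the cell type the gene is being evaluated as a marker for
    """
    highest_other = max(v for ct, v in mean_expr.items() if ct != target)
    if highest_other == 0:
        # Gene is not detected in any other cell type.
        return float("inf")
    return mean_expr[target] / highest_other


# Hypothetical mean expression of one gene across four cell types.
gene_means = {"Astro": 0.2, "Oligo": 8.0, "Micro": 0.5, "Excit": 1.0}
oligo_ratio = mean_ratio(gene_means, "Oligo")  # 8.0 / 1.0
excit_ratio = mean_ratio(gene_means, "Excit")  # 1.0 / 8.0
```

Because the denominator is the single highest non-target cell type rather than the average of all others, a gene that is also high in one other cell type gets a low ratio, which is the contrast with the 1 vs. All approach on the previous slide.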

Slide 41

Slide 41 text

Mean Ratio selects a subset of genes with high 1vAll fold change

Slide 42

Slide 42 text

Marker Gene Sets Tested
1. Full (17,804 genes)
a. Set of genes common between the bulk and snRNA-seq datasets
2. 1vALL top25 (145 genes)
a. Top 25 genes ranked by fold change for each cell type, then filtered to common genes
3. MeanRatio top25 (151 genes)
a. Top 25 genes ranked by MeanRatio for each cell type, then filtered to common genes
4. MeanRatio over2 (557 genes)
a. All genes for each cell type with MeanRatio > 2
5. MeanRatio MAD3 (520 genes)
a. All genes for each cell type with a MeanRatio more than 3 median absolute deviations (MADs) above the median of all MeanRatios > 1
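The three MeanRatio-based selection rules can be sketched for a single cell type as below. This is a Python illustration under assumptions: the function names and toy ratios are hypothetical, and the MAD is left unscaled (R's `mad()` applies a scaling constant by default, and the slide does not say which convention was used).

```python
import statistics


def top_n(ratios, n=25):
    """Genes with the n largest MeanRatio values, best first."""
    ranked = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)
    return [gene for gene, _ in ranked[:n]]


def over_threshold(ratios, threshold=2.0):
    """All genes with MeanRatio strictly above a fixed threshold."""
    return sorted(gene for gene, r in ratios.items() if r > threshold)


def mad_cutoff(ratios, n_mads=3.0):
    """Cutoff = median + n_mads * MAD, computed only over MeanRatios > 1.

    Assumption: unscaled median absolute deviation.
    """
    candidates = [r for r in ratios.values() if r > 1]
    med = statistics.median(candidates)
    mad = statistics.median(abs(r - med) for r in candidates)
    return med + n_mads * mad


# Hypothetical MeanRatio values for one cell type.
ratios = {"g1": 0.5, "g2": 1.5, "g3": 2.0, "g4": 2.5, "g5": 10.0}
top2 = top_n(ratios, n=2)
over2 = over_threshold(ratios, 2.0)
cutoff = mad_cutoff(ratios)
mad3 = over_threshold(ratios, cutoff)
```

The fixed top-n rule gives every cell type the same number of markers, while the threshold rules let cell types with many specific genes contribute more, which is one reason the set sizes on this slide differ so much.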

Slide 43

Slide 43 text

Method Performance Varied Over Different Marker Gene Sets
[Figure annotations: Method’s highest cor; Lowest rmse; Mean Ratio top 25]

Slide 44

Slide 44 text

Method Performance Over Different Marker Gene Sets
Mean Ratio Top25 balances rmse and cor in top methods

Slide 45

Slide 45 text

Marker Genes: Select Effective Marker Genes
1. Does selecting marker genes improve deconvolution? Depends on the method
○ hspe more sensitive than Bisque
2. How to best select good sets of marker genes? Mean Ratio top25
○ Mean Ratio top25 balanced rmse and correlation in Bisque & hspe

Slide 46

Slide 46 text

Other Datasets & Challenges

Slide 47

Slide 47 text

Dataset Features: Other Factors Can Impact Method Performance
1. What features of the snRNA-seq reference dataset can impact deconvolution accuracy?
a. Number of donors?
b. Donor diversity?
c. Existing proportion of cell types?

Slide 48

Slide 48 text

Features of Other DLPFC snRNA-seq Datasets
● Paired snRNA-seq
● Tran, Maynard et al., Neuron, 2021
● Mathys et al., Nature, 2019

Slide 49

Slide 49 text

Method Performance with different snRNA-seq Reference

Slide 50

Slide 50 text

Changing Cell Type Proportions
x 1000 subsamples
Nick Eagles
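The slide does not describe how the subsamples were drawn, so the following Python sketch only illustrates one plausible way to build a reference with altered cell type proportions: sample nuclei without replacement within each cell type until target proportions are met. The function name `subsample_to_proportions`, the two cell type labels, and the target values are all hypothetical.

```python
import random
from collections import Counter


def subsample_to_proportions(labels, target_props, n_total, seed=0):
    """Pick nucleus indices so the subsample matches the target cell type proportions.

    labels: cell type label of each nucleus in the reference
    target_props: dict mapping cell type -> desired proportion (should sum to 1)
    n_total: number of nuclei to draw
    """
    rng = random.Random(seed)
    by_type = {}
    for idx, cell_type in enumerate(labels):
        by_type.setdefault(cell_type, []).append(idx)
    chosen = []
    for cell_type, prop in target_props.items():
        k = round(prop * n_total)
        # random.sample draws without replacement and raises ValueError
        # if the reference has fewer than k nuclei of this cell type.
        chosen.extend(rng.sample(by_type[cell_type], k))
    return chosen


# Hypothetical reference with 60% neurons and 40% glia.
labels = ["Neuron"] * 600 + ["Glia"] * 400
picked = subsample_to_proportions(labels, {"Neuron": 0.25, "Glia": 0.75}, n_total=200)
counts = Counter(labels[i] for i in picked)
```

Repeating the draw with 1000 different seeds, and rerunning deconvolution with each subsample as the reference, would show how much a method's estimates follow the reference proportions.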

Slide 51

Slide 51 text

Changing Cell Type Proportions
Nick Eagles

Slide 52

Slide 52 text

Dataset Features: Other Factors Can Impact Method Performance
1. What features of the snRNA-seq reference dataset can impact deconvolution accuracy?
a. Number of donors? Bisque performs poorly with <4 donors
b. Donor diversity? Bisque and hspe were unaffected by inclusion of AD cases
c. Existing proportion of cell types? Bisque is biased to snRNA-seq proportions

Slide 53

Slide 53 text

Conclusions

Slide 54

Slide 54 text

Benchmark Conclusions
Method: hspe & Bisque are top performing methods
● hspe better for RiboZeroGold
Marker Genes: Mean Ratio effectively selects cell type specific genes
● MR Top 25 improves performance of top methods
Dataset Features: Many factors impact deconvolution accuracy
● Bisque is sensitive to low donor numbers and input cell proportions

Slide 55

Slide 55 text

How do our conclusions compare to other benchmarks?
Benchmark | Strategy | Tissue | Top Methods
Cobos et al. | Pseudobulk, Flow sorting | Blood, pancreas, kidney | DWLS
Jin et al. | Flow sorting | Blood | CIBERSORT, MuSiC
Dai et al. | Immunohistochemistry, scRNA-seq pseudobulk | Brain 🧠 | dtangle (hspe), Bisque

Slide 56

Slide 56 text

How do our conclusions compare to other benchmarks?
Benchmark | Strategy | Tissue | Top Methods
Cobos et al. | Pseudobulk, Flow sorting | Blood, pancreas, kidney | DWLS
Jin et al. | Flow sorting | Blood | CIBERSORT, MuSiC
Dai et al. | Immunohistochemistry, scRNA-seq pseudobulk | Brain 🧠 | dtangle (hspe), Bisque
LIBD (new! ✅) | RNAScope/IF | Brain 🧠 | hspe, Bisque

Slide 57

Slide 57 text

Resources
● DeconvoBuddies R package
○ R/Bioconductor package with tools for marker finding & plotting
○ https://research.libd.org/DeconvoBuddies/
○ Access paired dataset
■ Bulk RNA-seq
■ snRNA-seq data
■ RNAScope Proportions
● Deconvolution code tutorial + video
○ updated version at LIBD Rstats club on May 3rd

Slide 58

Slide 58 text

Benchmark Paper now in Pre-print! 🎉

Slide 59

Slide 59 text

Acknowledgements
Kristen Maynard, Stephanie C Hicks, Kelsey Montgomery, Sang Ho Kwon, Sean Maden, Nick Eagles, Sophia Cinquemani, Daianna Gonzalez-Padilla, Louise Huuki-Myers
NIMH Grant: R01 MH123183 & R01 MH111721
Thank you! Any Questions?
Download these slides: speakerdeck.com/lahuuki @lahuuki

Slide 60

Slide 60 text

Selected Six Deconvolution Methods
Method | Citation | Approach | Marker Gene Selection | Availability | Top Benchmark Performance
DWLS (Dampened weighted least-squares) | Tsoucas et al, Nature Comm, 2019 [5] | Weighted least squares | - | R package on CRAN | Cobos et al. [18]
Bisque | Jew et al, Nature Comm, 2020 [6] | Bias correction: Assay | - | R package on GitHub | Dai et al. [17]
MuSiC (Multi-subject Single-cell) | Wang et al, Nature Communications, 2019 [7] | Bias correction: Source | Weights genes | R package on GitHub | Jin et al. [20]
BayesPrism | Chu et al., Nature Cancer, 2022 [8] | Bayesian | Pairwise t-test | Webtool, R package on GitHub | Hippen et al. [22]
hspe (dtangle) (hybrid-scale proportion estimation) | Hunt and Gagnon-Bartsch, Ann. Appl. Stat. 2021 [9, 45] | High collinearity adjustment | Multiple options; default “ratio”: 1vALL mean expression ratio | R package on GitHub | Dai et al. [17]
CIBERSORTx | Newman et al., Nat Biotech, 2019 [11] | Machine learning | Differential gene expression | Webtool, Docker image | Jin et al. [20]

Slide 61

Slide 61 text

Comparing Estimates
● Bisque and hspe predict similar proportions
○ Cor = 0.938
● Bisque has highest cor with snRNA-seq
○ Cor = 0.743

Slide 62

Slide 62 text

Evaluate by Library Type + RNA Extraction Combination

Slide 63

Slide 63 text

Method Predictions over 13 Brain Regions
GTEx v8 Brain dataset
Expected patterns
● Cerebellum contains more inhibitory neurons (Inhib)
● Caudate has an increased proportion of inhibitory neurons compared to frontal cortex

Slide 64

Slide 64 text

Considering Cell Size

Slide 65

Slide 65 text

Considering Cell Size
Nick Eagles

Slide 66

Slide 66 text

Dai et al. benchmark
● Top deconvolution methods: dtangle (hspe) and Bisque
● Cell type specific expression methods: bMIND
[Figures shown: Figure 2, Figure 3]