Introduction to Deconvolution - Seminar at UCL ICH

Introduction to Cell Type Deconvolution
Louise Huuki-Myers
Staff Scientist 1
@lahuuki
lahuuki.github.io
Download these slides: speakerdeck.com/lahuuki

About Lieber Institute for Brain Development
• Non-profit research institute in Baltimore, MD
• Study the genetics of neuropsychiatric disorders 🧬
• 139 multidisciplinary scientists
• Affiliated with the Johns Hopkins Medical School

Our R/Bioconductor Powered Data Science Team
• Led by Leonardo Collado-Torres
• Computational lab specializing in:
  ◦ RNA-seq analysis
    ▪ Bulk, single cell, spatial
  ◦ Open source software development
  ◦ Knowledge sharing
    ▪ Data Science Guidance Sessions
    ▪ Rstat Club: videos available at www.youtube.com/@lcolladotor
• Team website
  ◦ lcolladotor.github.io/

About Me
• Staff Scientist at LIBD
  ◦ Joined in 2020
  ◦ Working on bulk RNA-seq, single cell RNA-seq, spatial transcriptomics
• Masters in Bioinformatics from Temple University, Philadelphia, PA
  ◦ Previously worked on evolutionary time trees
• Other interests:
  ◦ Running, rowing, baking
@lahuuki

Studying Gene Expression in the Human Brain
[Figure panels: Bulk RNA-seq; Single nucleus RNA-seq]

Background: Cell Types in the Brain
• The brain is made of complex tissues consisting of different types of cells
• Some diagnoses (Dx) are associated with changes in cell type specific expression
  ◦ Ex. Pitt-Hopkins syndrome and oligodendrocytes (Phan et al, Nature Neuroscience, 2020)

What is Deconvolution?
[Figure: Tissue, Bulk RNA-seq, snRNA-seq, Deconvolution, Estimated proportions; cost labels: $$$, $, Free!]

What is Deconvolution?
• Inferring the composition of different cell types in bulk RNA-seq data
• Utilize single cell data to obtain cell type gene expression profiles

Why is Deconvolution Important?
• Tissue is heterogeneous
  ◦ Different cell types express genes at different levels
• Samples can differ in cell type composition due to biology or dissection
  ◦ Check for differences in case vs. control
• Controlling for cell fractions between samples can make case vs. control analysis cleaner
  ◦ Quality control
  ◦ Confounding factor in differential expression analysis: prevents false positives and false negatives

How do you run deconvolution?
deconvolution(Y, Z) = Proportion of Cell Types
• Y: gene expression, bulk RNA-seq samples
• Z: gene expression, scRNA-seq cell type populations
• deconvolution: computational algorithm
[Figure axes: Bulk Samples; Proportion]
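
A minimal sketch of what deconvolution(Y, Z) can look like in the simplest case, assuming plain non-negative least squares (NNLS) rather than any specific published method. It assumes Y is one bulk sample's expression over a set of marker genes and Z is a genes × cell types reference matrix. The function name `nnls_projected_gradient`, the toy matrix, and the three cell types are made up for illustration, and the solver is a simple projected gradient loop, not a production NNLS routine.

```python
def nnls_projected_gradient(Z, y, n_iter=2000):
    """Minimise ||Z p - y||^2 subject to p >= 0 with projected gradient descent."""
    n_genes, n_types = len(Z), len(Z[0])
    # 1 / trace(Z^T Z) is a conservative step size (trace >= largest eigenvalue).
    step = 1.0 / sum(value ** 2 for row in Z for value in row)
    p = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [sum(Z[g][k] * p[k] for k in range(n_types)) - y[g]
                    for g in range(n_genes)]
        gradient = [sum(Z[g][k] * residual[g] for g in range(n_genes))
                    for k in range(n_types)]
        # Gradient step, then clip negatives to zero (the "projection").
        p = [max(0.0, p[k] - step * gradient[k]) for k in range(n_types)]
    return p


# Toy reference Z: rows are marker genes, columns are three hypothetical cell types.
Z = [
    [10.0, 1.0, 0.0],
    [8.0, 0.0, 1.0],
    [1.0, 12.0, 0.0],
    [0.0, 9.0, 1.0],
    [0.0, 1.0, 7.0],
    [1.0, 0.0, 11.0],
]
true_props = [0.5, 0.3, 0.2]

# Simulated bulk sample Y: a noise-free mixture of the reference profiles.
y = [sum(z * p for z, p in zip(row, true_props)) for row in Z]

coefficients = nnls_projected_gradient(Z, y)
total = sum(coefficients)
estimated_props = [c / total for c in coefficients]  # rescale so proportions sum to 1
print([round(p, 3) for p in estimated_props])
```

Because the toy bulk sample is an exact mixture of the reference columns, the estimate should land very close to `true_props`; real data adds noise and technical differences between bulk and single cell assays.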

Methods
deconvolution(Y, Z) = Proportion of Cell Types

Method Summary
MuSiC (Wang et al, Nature Communications, 2019)
• Regression: W-NNLS regression (weighted non-negative least squares)
• Correction for technical variation: None
• Other features: Tree guided deconvolution, good for closely related cell types
Bisque (Jew et al, Nature Communications, 2020)
• Regression: NNLS regression
• Correction for technical variation: Gene specific transformation of bulk data
• Other features: Leverage overlapping bulk & sc data
SCDC (Dong et al, Briefings in Bioinformatics, 2020)
• Regression: W-NNLS framework proposed by MuSiC
• Correction for technical variation: Option for gene specific transformation of bulk data (from Bisque)
• Other features: Multiple reference datasets can be used, results combined with ENSEMBLE weights
DWLS (Tsoucas et al, Nature Communications, 2019)
• Regression: Dampened weighted least squares
• Correction for technical variation: None

Which Method is the Most Accurate?
• Benchmarking shows that different methods perform best on different data sets (Cobos et al, Nature Communications, 2020)
• Benchmarking results from different papers on “real” data
  ◦ MuSiC paper: MuSiC > NNLS > BSEQ-sc > CIBERSORT
    ▪ Pancreatic islet: Beta cells vs. HbA1c (Fig 2a)
  ◦ Bisque paper: Bisque > MuSiC > CIBERSORT
    ▪ DLPFC: Microglia vs. Braak stage, Neuron vs. cognitive diagnostic category (Fig 4)
  ◦ SCDC paper: SCDC > MuSiC > Bisque > DWLS > CIBERSORT
    ▪ Pancreatic islet: Beta cells vs. HbA1c (Fig 4b)
  ◦ Cobos benchmark: DWLS > MuSiC > Bisque > deconvSeq
    ▪ Human PBMC flow sorted (Fig 7)

Why we like Bisque
• Benchmarked with a DLPFC dataset
• Robust to marker set
• Robust to library prep
• More reasonable estimates on GTEx dataset
Stay tuned: methods benchmark in the works!

Reference Single Cell Data
deconvolution(Y, Z) = Proportion of Cell Types

Important Factors
• Number and diversity of donors (4+)
• Resolution of cell types
• Does it match the bulk data?
  ◦ Same tissue or region? Same experimental conditions?
• Same cellular fraction?
  ◦ Brain tissue is often limited to single nucleus

Single Nucleus RNA-seq References
Tran, Maynard et al, Neuron, 2021 (10.1016/j.neuron.2021.09.001)
• 5 brain regions + 8 donors
  ◦ Amygdala, sACC, Hippocampus, NAc, DLPFC
• Utilize “pan brain” annotation to maximize donors
Matthew N Tran, Kelsey Montgomery

Single Nucleus RNA-seq References
Huuki-Myers et al, bioRxiv, 2023 (10.1101/2023.02.15.528722)
• DLPFC + 10 donors (n=19)
• Layer level cell type annotation
• Access with spatialLIBD

Marker Finding
deconvolution(Y, Z) = Proportion of Cell Types

What are Marker Genes?
• “Define” cell types
  ◦ Differentially expressed between cell types
• Historically
  ◦ Known markers associated with key cell types
  ◦ Ex. MBP: major constituent of the myelin sheath, marker for oligodendrocytes
• What does the data tell us?
  ◦ Human vs. model organisms
  ◦ Regional
  ◦ Technical differences

Marker Gene Selection
• Filter for genes expressed in snRNA-seq and bulk data
• Looking for genes expressed in only one cell type
  ◦ Test for specificity of each gene for each cell type
• Observe expression of selected marker genes
  ◦ Heat maps of pseudobulked data
    ▪ Summation of counts from nuclei from one donor + cell type
  ◦ Violin plots by cell type
[Figure: Marker genes shared by sn & bulk]
[Figure: The ideal heatmap. snRNA-seq data, pseudobulked by cell type + donor]
Stephanie C Hicks
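
The pseudobulking step described above (summing counts from all nuclei that share a donor and a cell type) can be sketched as follows. This is an illustration in plain Python, not the code used for the talk; the donor IDs, counts, and the function name `pseudobulk` are hypothetical.

```python
from collections import defaultdict


def pseudobulk(nucleus_counts, donors, cell_types):
    """Sum per-nucleus gene counts within each (donor, cell type) group."""
    summed = defaultdict(lambda: defaultdict(int))
    for counts, donor, cell_type in zip(nucleus_counts, donors, cell_types):
        for gene, count in counts.items():
            summed[(donor, cell_type)][gene] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {group: dict(genes) for group, genes in summed.items()}


# Four toy nuclei: one dict of gene -> count per nucleus (values are made up).
nucleus_counts = [
    {"MBP": 5, "GAD1": 0},
    {"MBP": 7, "GAD1": 1},
    {"MBP": 0, "GAD1": 4},
    {"MBP": 6, "GAD1": 0},
]
donors = ["Br1", "Br1", "Br1", "Br2"]            # hypothetical donor IDs
cell_types = ["Oligo", "Oligo", "Inhib", "Oligo"]

pb = pseudobulk(nucleus_counts, donors, cell_types)
for group, genes in sorted(pb.items()):
    print(group, genes)
```

Each resulting (donor, cell type) profile becomes one column of the heatmap, so a good marker shows signal only in the columns of its target cell type.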

Exploring Marker Expression
• T-test between two groups
• Fold change between groups

Exploring Marker Expression
Where does this noise come from?
• Outliers in one or more non-target cell types
  ◦ Here OPCs are expressing MBP

Our Solution: Mean Ratio
Mean Ratio = (mean expression, target cell type) / (mean expression, highest non-target cell type)
[Figure labels: Target; Highest non-target]
The higher the mean ratio:
• the more specific that gene is to the target cell type
• the better a marker gene it is
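
A small worked example of the mean ratio, written as a sketch rather than the DeconvoBuddies implementation. It assumes expression values are already on whatever scale is used for marker finding; the values and the function name `mean_ratio` are made up for illustration.

```python
from statistics import mean


def mean_ratio(expression_by_type, target):
    """Return (mean ratio, highest non-target cell type) for one gene."""
    target_mean = mean(expression_by_type[target])
    non_target_means = {
        cell_type: mean(values)
        for cell_type, values in expression_by_type.items()
        if cell_type != target
    }
    # The denominator is the single highest non-target mean, not the pooled mean.
    highest_type = max(non_target_means, key=non_target_means.get)
    highest_mean = non_target_means[highest_type]
    if highest_mean == 0:
        return float("inf"), highest_type
    return target_mean / highest_mean, highest_type


# Toy expression of one gene across nuclei, grouped by cell type.
expression = {
    "Oligo": [8.0, 10.0, 12.0],
    "OPC": [1.0, 2.0, 3.0],
    "Astro": [0.0, 1.0, 0.5],
}
ratio, runner_up = mean_ratio(expression, target="Oligo")
print(ratio, runner_up)
```

Dividing by the highest non-target mean is what penalizes a gene that is also expressed in one other cell type, which a one-vs-all fold change can average away.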

Mean Ratio vs. Fold Change
• Genes with high mean ratio also have high fold changes
• Not all genes with high fold changes have high mean ratios
• Selecting marker genes by mean ratio helps avoid “noisy” genes

1vAll Markers vs. Mean Ratio Markers

How Many Markers?
• As many as look like outliers in the “worst” cell type
  ◦ Least amount of signal
  ◦ Balance overfitting vs. adding noise
  ◦ Looking at Inhib: we chose 25 markers
• Same number for each cell type

How Many Markers?
• This becomes more difficult with more specific cell types
• We are looking for genes with big differences between cell types

Tran, Maynard, et al. Top 25 Markers

Huuki-Myers, et al. Top 25 Markers
* Only plotted 10/25 genes in this heatmap

Results + Validation
deconvolution(Y, Z) = Proportion of Cell Types

Current LIBD Pipeline
• Method: Bisque
• Reference data
  ◦ Pan-brain (Tran, Maynard et al., Neuron, 2021)
  ◦ Broad cell types
• Marker genes
  ◦ Top 25 ranked with mean ratio (150 total)
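
Picking the top markers per cell type by mean ratio can be sketched as below. With 25 markers per cell type, 150 total works out to six cell types. The gene names, ratios, and the function name `top_n_markers` are placeholders, and the toy call uses n=2 to keep the example short.

```python
def top_n_markers(mean_ratios, n=25):
    """Return the n genes with the highest mean ratio for each cell type."""
    return {
        cell_type: sorted(ratios, key=ratios.get, reverse=True)[:n]
        for cell_type, ratios in mean_ratios.items()
    }


# Toy mean ratios: cell type -> {gene: mean ratio}. All values are made up.
toy_ratios = {
    "Oligo": {"gene_a": 5.0, "gene_b": 1.2, "gene_c": 3.1},
    "Inhib": {"gene_d": 2.4, "gene_e": 7.9, "gene_f": 0.8},
}

markers = top_n_markers(toy_ratios, n=2)
# Flatten into one marker set, as would be passed to a deconvolution method.
marker_set = [gene for genes in markers.values() for gene in genes]
print(markers)
print(len(marker_set))
```

Using the same n for every cell type keeps any one cell type from dominating the marker set.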

MDDSeq Data

Application in Differential Expression Analysis
• High correlation between gene t-stats for models with and without deconvolution terms
• Many of the significant genes stay significant
• Deconvolution models are more exclusive
• Which model would you choose?
Model without deconvolution terms:
~Dx * BrainRegion + Age + Sex + snpPC + qc metrics + qSVs
Model with deconvolution terms:
~Dx * BrainRegion + Age + Sex + snpPC + qc metrics + qSVs + proportions

Validation Strategies
How do we know we are right?
• Region trends: does it make sense?
• RNAscope: use cell type markers to check composition of tissue

Region Trends
• Expect different patterns of composition across brain regions
• Ruzicka et al, bioRxiv, 2021 (DOI: 10.1101/2021.01.21.426000)
  ◦ Perform deconvolution on 3k bulk RNA-seq samples from 15 regions
    ▪ GTEx, MAYO, ROSMAP data
    ▪ SPLITR method
    ▪ 48 donor reference scRNA-seq (10X)
    ▪ Method and reference data are not available
  ◦ Validate method using region composition

RNAscope
• Multiplex single-molecule fluorescent in situ hybridization (smFISH)
• Visualize cell type specific markers in tissue
• What we can observe:
  ◦ Cell type proportions in the tissue
  ◦ Individual cell sizes
  ◦ Total RNA content in different cell types using “total RNA expression genes”
Maynard et al, Nucleic Acids Research, 2020, Fig. 5
Future Work
Kristen Maynard
[Figure labels: Neurons, Excit, Inhib, Oligo]

Looking Ahead

Considering Variation in Cell Size & Transcription
• Sosina et al, bioRxiv, 2020: Is deconvolution predicting the amount of RNA from a cell type, or the cellular fraction?
  ◦ RNA fraction vs. cellular fraction
  ◦ Neurons are more transcriptionally active: more RNA
  ◦ Cell sizes differ across cell types
• Most current methods don't account for cell size
• Future work!
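
To make the RNA fraction vs. cellular fraction distinction concrete, here is a hedged sketch of the arithmetic. If cell type k contributes RNA in proportion to (cell fraction × RNA per cell), then dividing each RNA fraction by that cell type's relative RNA content and renormalizing recovers the cell fractions. The per-cell RNA factors below are invented for illustration, not measured values, and `rna_to_cell_fraction` is a hypothetical helper.

```python
def rna_to_cell_fraction(rna_fraction, rna_per_cell):
    """Convert RNA fractions to cell fractions given relative RNA content per cell."""
    # Undo the per-cell RNA weighting, then renormalize so fractions sum to 1.
    scaled = {ct: rna_fraction[ct] / rna_per_cell[ct] for ct in rna_fraction}
    total = sum(scaled.values())
    return {ct: value / total for ct, value in scaled.items()}


# Toy case: neurons assumed to carry 3x the RNA of glia (made-up factor).
rna_fraction = {"Neuron": 0.6, "Glia": 0.4}
rna_per_cell = {"Neuron": 3.0, "Glia": 1.0}

cell_fraction = rna_to_cell_fraction(rna_fraction, rna_per_cell)
print({ct: round(f, 3) for ct, f in cell_fraction.items()})
```

In this toy case neurons supply most of the RNA but are the minority of cells, which is the kind of gap a method that ignores cell size would miss.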

New Commentary Preprint on arXiv!
Sean Maden
https://doi.org/10.48550/arXiv.2305.06501

Benchmark Experiment
• Linked bulk, snRNA-seq, and RNAscope experiment
• Check deconvolution prediction accuracy with RNAscope orthogonal measurement
• Impact of cellular fraction in bulk tissue
  ◦ Is snRNA-seq good enough?
• Marker gene selection and more!
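
One way to score predicted proportions against an orthogonal measurement such as RNAscope is with root mean squared error and Pearson correlation. This is a generic sketch, not the metrics or code of the benchmark itself; the proportions are invented and the function names are illustrative.

```python
from math import sqrt


def rmse(estimated, observed):
    """Root mean squared error between two equal-length lists."""
    squared = [(e - o) ** 2 for e, o in zip(estimated, observed)]
    return sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy proportions for three cell types in one sample (made-up numbers).
deconvolution_props = [0.5, 0.3, 0.2]
rnascope_props = [0.4, 0.4, 0.2]

error = rmse(deconvolution_props, rnascope_props)
correlation = pearson(deconvolution_props, rnascope_props)
print(round(error, 4), round(correlation, 4))
```

RMSE reports how far off the proportions are in absolute terms, while correlation only checks that the ranking and trend agree, so the two can disagree.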

Resources
• DeconvoBuddies
  ◦ R package with tools for marker finding & plotting
  ◦ github.com/LieberInstitute/DeconvoBuddies
• Coming soon: deconvolution code tutorial + video
• DLPFC snRNA-seq data available through spatialLIBD

Acknowledgements
• Leonardo Collado-Torres
• Kristen Maynard
• Stephanie C Hicks
• Kelsey Montgomery
• Sang Ho Kwon
• Sean Maden
• Nick Eagles
• Sophia Cinquemani
Thank you! Any questions?
@lahuuki
