Deconvolution of RNA-seq data in the postmortem human brain: the present and preparing the future Psychgenomics: Ichan School of Medicine Leonardo Collado Torres @lcolladotor 2022-03-03 https://lcolladotor.github.io/bioc_team_ds/
LIEBER INSTITUTE for BRAIN DEVELOPMENT https://research.libd.org/DeconvoBuddies/ http://research.libd.org/TREG/ @lahuuki Louise A Huuki-Myers @kr_maynard Kristen R Maynard @stephaniehicks Stephanie C Hicks @martinowk Keri Martinowich NIH grant R01MH123183 @CerceoPage Stephanie C Page
Background: Cell Types in the Brain ● The brain is made of complex tissues consisting of different types of cells ● Some Dx associated with changes in cell type specific expression ○ Ex. Pitt-Hopkins syndrome and oligodendrocytes (Phan et al, Nature Neuroscience, 2020) 3
● Inferring the composition of different cell types in a bulk RNA-seq data What is Deconvolution? Tissue Bulk RNA-seq snRNA-seq Estimated proportions 4 Deconvolution Get single cell like data from bulk RNA-seq $$$ $ Free! https://twitter.com/BoXia7/status/1261464021322137600
Why is Deconvolution Important? ● Tissue is heterogeneous ● Samples can differ in cell type composition due to biology or dissection ● Different cell types express genes at different levels ● Check for differences between groups (i.e. case vs. control) ● Controlling for cell fractions between samples can make differential expression analysis cleaner ○ Confounding factor in differential expression analysis ○ Deconvoluting data can prevent false-positives and false-negatives 5
What Challenges Exist? In Pipeline Development ● Method Selection ● Data selection ○ Region specific or all available data ○ Cell type resolution: Broad vs. specific ○ NeuN + non-Neun sorted ● Marker Gene Selection ○ How to find best markers? ○ How many markers? ● Validating Results ○ What do we expect to find? In the Field ● Differences between snRNA-seq and bulk RNA-seq ○ Nuclear vs. cytoplasmic compartments ○ Missing cell types in reference ● Cell sizes ● RNA-fraction vs. Cell fraction Sosina et al, F1000Research, 2021 Future Work: Maynard + Hicks R01MH123183 6
What Methods Exist? ● Regression based ○ Define bulk cell-type specific expression profiles, regress out optimal cell proportions from bulk data ○ Examples: CIBERSORT, OLS, nnls, RLR, deconvoSeq ● Single Cell RNA-seq reference based ○ Use a annotated single cell expression data as reference ○ Examples: DWLS, MuSiC, Bisque, SCDC, SPLITR 8
Method Summary Method Regression Correction for Technical Variation Other Features MuSiC Wang et al, Nature Communications, 2019 W-NNLS regression (Weighted - Non-negative least squares) None Tree guided deconvolution, good for closely related cell types Bisque Jew et al, Nature Communications, 2020 NNLS regresion Gene specific transformation of bulk data Leverage overlapping bulk & sc data SCDC Dong et al, Briefings in Bioinformatics, 2020 W-NNLS framework proposed by MuSiC Option for Gene specific transformation of bulk data (from Bisque) Multiple reference datasets can be used, results combined with ENSEMBL weights DWLS Tsoucas, Nature Communications, 2019 Dampened Weighted least squares None 9
Method Regression Method Run Time Marker Evaluation Adjust for snRNA-seq vs. Bulk Tissues Tested Consider Cell Size Reference Set MuSiC W-NNLS Min. Internal Weighting No Pancreatic Islet, Rat & Mouse Kidney Yes Bisque NNLS Min. No Yes Adipose, DLPFC Recommend 3+ donors SCDC W-NNLS Min. Internal Weighting Yes Pancreatic Islet, mouse mammary Can input multiple references DWLS DWLS Hours Internal Selection No Mouse kidney, lung, liver, small intestine 10
Which Method is the Most Accurate? ● Benchmarking shows that different methods perform best on different data sets (Cobos et al, Nature Communications, 2020) ● Benchmarking results from different papers on “real” data ○ MuSiC paper: MuSiC > NNLS > BSEQ-sx > CIBERSORT ■ Pancreatic Islet: Beta cells vs. HbA1c (Fig 2a) ○ Bisque paper: Bisque > MuSiC > CIBERSORT ■ DLPFC: Microglia vs. Braak stage, Neuron vs. Cognitive diagnostic category (Fig 4) ○ SCDC paper: SCDC > MuSiC > Bisque > DWLS > CIBERSORT ■ Pancreatic Islet: Beta cells vs. HbA1c (Fig 4b) ○ Cobos benchmark: DWLS > MuSiC > Bisque > deconvoSeq ■ Human PMBC flow sorted (Fig 7) 11
What Methods are We Exploring? ● MuSiC & Bisque What Methods are we Ignoring? ● SCDC is redundant: we only have 1 reference set ● DWLS seems basic and slow 12
What Data are We Using? LIBD Bulk RNA-seq ● MDDSeq: MDD + control in sACC + AMY (Goes et al. DOI: 10.7303/syn4921369) ● BipSeq: BPD + control in sACC + AMY (Zandi et al. DOI: 10.7303/syn5844980) ● BrainSEQ phase I + II (DOI: 10.1038/s41593-018-0197-y, 10.1016/j.neuron.2019.05.013) ○ polyA (DLPFC) vs. RiboZero (DLPFC + HPC) LIBD Single Nucleus ● Tran, Maynard et al 10x snRNA-seq (DOI: 10.1016/j.neuron.2021.09.001) ○ 5 Brain Regions, 70,615 nuclei, 8 donors Publicly Available ● GTEx version 8 via recount3 (10.1186/s13059-021-02533-6) ○ 13 Brain Regions, 3,387 samples, 1,127 individuals 14
What are Marker Genes? ● “Define” cell types ○ Differentially expressed between cell types ● Historically ○ Know markers associated with key cell types ○ Ex. MBP: major constituent of the myelin sheath, marker for oligodendrocytes ● What does the Data tell us? ○ Human vs. model organisms ○ Regional ○ Technical differences 18
Marker Gene Selection ● Filter for genes expressed in snRNA-seq and bulk data ● Looking for genes expressed in only one cell type ○ Test for specificity of each gene for each cell type ● Observe expression of selected marker genes ○ Heat maps of pseudo-bulked data ■ Summation of counts from nuclei from one donor + cell type ○ Violin plots by cell type Marker Genes shared by sn & bulk RNA-seq The Ideal Heatmap snRNA-seq data, Pseudo-bulked by cell type + donor 19 Stephanie C Hicks
Differential Expression : One vs. All ● scran::findMarkers() ● Target cell type vs. all other cells ● Select genes with highest fold change ● Observed lots of noise between different cell types snRNA-seq sACC Pseudo-bulked by cell type + donor The Actual Heatmap 20
Our Solution: Mean Ratio Target Highest non-target Mean Expression target cell type Mean Expression highest non-target cell type = Mean Ratio Higher mean ratio: ● the more specific that gene is to the target cell type ● the better a marker gene it is 23
Mean Ratio vs. Fold Change ● Genes with high mean ratio also have high fold changes ● Not all genes with high fold changes have high mean ratios ● Selecting marker genes by mean ratio helps avoid “noisy” genes 24
How Many Markers? 27 ● As many look like outliers in the “worst” cell type ○ Least amount of signal ○ Balance overfitting vs. adding noise ○ Looking at Inhib: we chose 25 markers ● Same number for each cell type ● Typical methods [ findMarkers() ] are useful for finding cell identity markers, not necessarily good cell decomposition markers
How Many Markers? ● This becomes more difficult with more specific cell types ● We are looking for genes with big differences between cell types: hence why it’s easier to work at the broad cell type resolution 28
Final QC Reference set Top 25 for 10 cell types ● Drop 2 worst macrophage genes ● Drop two macrophage-like microglia clusters ● 248 marker genes genes in total ● 70,527 nuclei post-QC
Why we like Bisque ● Benchmarked with a DLPFC dataset ● Robust to marker set ● Robust to library prep ● More reasonable estimates on GTEx dataset ● With final reference dataset (8 donors) we have more than the 4 donors recommended
Deconvolution of GTEX Brain Data Park et al., bioRxiv, 2021 ● Deconvolute three large human brain datasets ○ GTEx, MAYO, ROSMAP ● Introduce new deconvolution algorithm: SPLITR ● 48 donor reference set ● Validate with previous composition estimates + immunostaining ● Variation across brain regions Figure 2
● Bisque has more similar pattern of composition over regions vs. SPLITR ● MuSiC predicts large proportions of Endo + Mural (Peric) ● Both estimate lower proportions of Excit ○ MuSiC is more extreme and also predicts low portion Inhib Bisque & MuSiC vs SPLITR Different deconvolution methods, bulk RNA-seq data source, marker genes, and reference snRNA-seq data
● Run with set of 20 & 25 marker genes per cell type ● Bisque is more robust to changes in the marker set than MuSiC Method Sensitivity to Marker Set 25 vs. 20 Genes
Background ● Most deconvolution algorithms don't consider cell types have different cell sizes and transcriptional activity ○ Ex. Neurons are more active than Glia ○ Can only predict fraction of RNA not fraction of cells in tissue ○ Sosina et al., F1000Research, 2021 MDDseq Bisque
Bulk RNA-seq Goals for RNAscope Experiment ● Deconvolution R01 MH123183 ○ Kristen Maynard, Stephanie C Hicks ● Use six slices of DLPFC to generate corresponding RNA-seq & RNAscope data ● This information will be useful to evaluate and design deconvolution algorithms DLPFC Bulk RNA-seq snRNA-seq Spatial RNAscope RNAscope 47 polyA RiboZero
Motivation ● Improve Deconvolution algorithms by considering differences in size and RNA content between cell types ● Use smFISH with RNAscope to establish data set of: ○ Cellular composition ○ Nuclei sizes of major cell types ○ Average nuclei RNA content of major cell types How do we measure total RNA content of a cell if we can only observe a few genes at a time? Use a TREG
What is a TREG? ● Total RNA Expression Gene ● Expression is proportional to the overall RNA expression in a cell ● In smFISH the count of TREG puncta in a cell can estimate the RNA content ○ Linking RNA content to nucleus size
Housekeeping Genes ● Unrelated to protein function ● Tissue specific ● Expression tracks with total expression (should vary) ● Involved in basic cellular function ● Consistent across tissues ● Expressed evenly across tissue ● expressed in every cell TREGs We will compare candidate TREGs to housekeeping gene POLR2A to examine differences
Quality Wish List TREG ● Non-zero expression in most cells across tissues and cell types ● Expressed at a constant level in respect to other genes across different cell types RNAscope ● Expressed in top 50% of genes for easy detection ● Show a dynamic range of puncta* ● Individual puncta are countable* * Will need to evaluated experimentally
Data Driven TREG Discovery Data ● Human Brain postmortem 10x snRNA-seq ○ Tran, Maynard et al., Neuron, 2021 ○ Eight donors, 5 Brain Regions ○ Nine broad cell types ● smFISH with RNAscope in DLPFC tissue ○ 3 tissue sections from one donor Method Overview 1. Filter to top 50% expressed genes 2. Filter out genes with high proportion of zero expression 3. Select genes with high Rank Invariance as candidate TREGs 4. Evaluate candidate gene performance in RNAscope experiment @mattntran Matthew N Tran Kelsey D Montgomery Sang Ho Kwon
Gene Filtering ● First filtered to top 50% of 23k genes in dataset ● Compute Proportion Zero ● Used distribution to pick 0.75 as max Proportion Zero cutoff ○ For a gene to pass the maximum proportion of zeros among groups must be less than the cutoff ● Only 877 (3.8%) of genes pass ○ POLR2A doesn’t pass
Evaluate for Rank Invariance ● How variable is the Expression Rank of a gene within a cell type? Between cell types? ● Expressed at a constant level in respect to other genes = High Rank Invariance ○ Evaluate within and between cell types Expression Rank
Candidate TREGs ● Selected AKT3, ARID1B, and MALAT1 as our candidate TREGs ○ In top 10 RI values, and has available probes ● TREGs show consistent Expression Ranks within all cells and with in most cell types Expression Rank Expression Rank Expression Rank
Validate in snRNA-seq data ● TREGs showed strong correlation between gene expression and total reads per cell ● log2(counts + 1) ~ log2(sum) + cellType Total RNA per nucleus in snRNA-seq Expression of TREG per nucleus
HALO Analysis ● HALO identified 944k cells across the 9 slides ○ 90k-100k per sample ● Some QC was required to remove out of focus regions ● HALO data tracks really well with what we see visually in the slides and expected anatomy of the tissue ○ Neurons are larger ○ Neurons are transcriptionally more active ○ Excit, Inhib and Oligo are where you would expect them to be Typo** GAD1
Patterns of Observed Puncta ● TREGs were expressed in most cells ● AKT3 tracks really well with pattern of expression seen in snRNA-seq (ARID1B is also pretty good) snRNA-seq RNAscope Gene Mean Prop. Cells with Expression Prop. non-zero in DLPFC snRNA Standardized β (95% CI) AKT3 0.948 0.92 -1.38 (-1.39,-1.37) ARID1B 0.908 0.94 -0.62 (-0.62,-0.61) MALAT1 0.910 1.00 -0.11 (-0.12,-0.11) POLR2A 0.853 0.30 -0.98 (-0.99,-0.98) snRNA-seq NA NA -1.33 (-1.35,-1.31) Remember: MALAT1’s puncta data is unreliable
Conclusions ● Proportion Zero filtering + Rank Invariance allow selection of candidate TREGs in snRNA-seq ● AKT3 appears to be a TREG compatible with RNAscope in the human brain ● TREGs would allow the observation of total RNA expression via RNAscope ○ Allow us to collect data that can be used to improve deconvolution algorithms
Suggestions for the present and future ● We recommend Bisque over MuSiC for postmortem human brain data ● You might want to use the mean ratio method for selecting snRNA-seq cell type marker genes ● You’ll benefit from having more than 3 donors (we had 8) in your snRNA-seq data ● Plan to generate RNAscope data for some major cell types using adjacent tissue dissections: could be a gold standard ● Use a TREG (AKT3) in your RNAscope to estimate nuclei RNA and size: could be useful input parameters in future deconvolution algorithms https://research.libd.org/DeconvoBuddies/ http://research.libd.org/TREG/
@MadhaviTippani Madhavi Tippani @HeenaDivecha Heena R Divecha @lmwebr Lukas M Weber @stephaniehicks Stephanie C Hicks @abspangler Abby Spangler @martinowk Keri Martinowich @CerceoPage Stephanie C Page @kr_maynard Kristen R Maynard @lcolladotor Leonardo Collado-Torres @Nick-Eagles (GH) Nicholas J Eagles Kelsey D Montgomery Sang Ho Kwon Image Analysis Expression Analysis Data Generation Thomas M Hyde @lahuuki Louise A Huuki-Myers @JoshStolz2 Joshua M Stolz Moods Workgroup Fernando Goes Patricia Braun Peter Zandi Thomas Hyde Joel Kleinman Christopher Ross Shizhong Han @mattntran Matthew N Tran @sowmyapartybun Sowmya Parthiban https://www.libd.org/careers/