psychgenomics-2022

Deconvolution of RNA-seq data in the postmortem human brain: the
present and preparing the future Psychgenomics: Ichan School of Medicine Leonardo Collado Torres @lcolladotor 2022-03-03 https://lcolladotor.github.io/bioc_team_ds/

LIEBER INSTITUTE for BRAIN DEVELOPMENT https://research.libd.org/DeconvoBuddies/ http://research.libd.org/TREG/ @lahuuki Louise A
Huuki-Myers @kr_maynard Kristen R Maynard @stephaniehicks Stephanie C Hicks @martinowk Keri Martinowich NIH grant R01MH123183 @CerceoPage Stephanie C Page

Background: Cell Types in the Brain • The brain is
made of complex tissues consisting of different types of cells • Some Dx associated with changes in cell type speciﬁc expression ◦ Ex. Pitt-Hopkins syndrome and oligodendrocytes (Phan et al, Nature Neuroscience, 2020) 3

• Inferring the composition of different cell types in a
bulk RNA-seq data What is Deconvolution? Tissue Bulk RNA-seq snRNA-seq Estimated proportions 4 Deconvolution Get single cell like data from bulk RNA-seq $$$ $ Free! https://twitter.com/BoXia7/status/1261464021322137600

Why is Deconvolution Important? • Tissue is heterogeneous • Samples
can differ in cell type composition due to biology or dissection • Different cell types express genes at different levels • Check for differences between groups (i.e. case vs. control) • Controlling for cell fractions between samples can make differential expression analysis cleaner ◦ Confounding factor in differential expression analysis ◦ Deconvoluting data can prevent false-positives and false-negatives 5

What Challenges Exist? In Pipeline Development • Method Selection •
Data selection ◦ Region specific or all available data ◦ Cell type resolution: Broad vs. specific ◦ NeuN + non-Neun sorted • Marker Gene Selection ◦ How to find best markers? ◦ How many markers? • Validating Results ◦ What do we expect to find? In the Field • Differences between snRNA-seq and bulk RNA-seq ◦ Nuclear vs. cytoplasmic compartments ◦ Missing cell types in reference • Cell sizes • RNA-fraction vs. Cell fraction Sosina et al, F1000Research, 2021 Future Work: Maynard + Hicks R01MH123183 6

Deconvolution Methods 7

What Methods Exist? • Regression based ◦ Define bulk cell-type
specific expression profiles, regress out optimal cell proportions from bulk data ◦ Examples: CIBERSORT, OLS, nnls, RLR, deconvoSeq • Single Cell RNA-seq reference based ◦ Use a annotated single cell expression data as reference ◦ Examples: DWLS, MuSiC, Bisque, SCDC, SPLITR 8

Method Summary Method Regression Correction for Technical Variation Other Features
MuSiC Wang et al, Nature Communications, 2019 W-NNLS regression (Weighted - Non-negative least squares) None Tree guided deconvolution, good for closely related cell types Bisque Jew et al, Nature Communications, 2020 NNLS regresion Gene specific transformation of bulk data Leverage overlapping bulk & sc data SCDC Dong et al, Briefings in Bioinformatics, 2020 W-NNLS framework proposed by MuSiC Option for Gene specific transformation of bulk data (from Bisque) Multiple reference datasets can be used, results combined with ENSEMBL weights DWLS Tsoucas, Nature Communications, 2019 Dampened Weighted least squares None 9

Method Regression Method Run Time Marker Evaluation Adjust for snRNA-seq
vs. Bulk Tissues Tested Consider Cell Size Reference Set MuSiC W-NNLS Min. Internal Weighting No Pancreatic Islet, Rat & Mouse Kidney Yes Bisque NNLS Min. No Yes Adipose, DLPFC Recommend 3+ donors SCDC W-NNLS Min. Internal Weighting Yes Pancreatic Islet, mouse mammary Can input multiple references DWLS DWLS Hours Internal Selection No Mouse kidney, lung, liver, small intestine 10

Which Method is the Most Accurate? • Benchmarking shows that
different methods perform best on different data sets (Cobos et al, Nature Communications, 2020) • Benchmarking results from different papers on “real” data ◦ MuSiC paper: MuSiC > NNLS > BSEQ-sx > CIBERSORT ▪ Pancreatic Islet: Beta cells vs. HbA1c (Fig 2a) ◦ Bisque paper: Bisque > MuSiC > CIBERSORT ▪ DLPFC: Microglia vs. Braak stage, Neuron vs. Cognitive diagnostic category (Fig 4) ◦ SCDC paper: SCDC > MuSiC > Bisque > DWLS > CIBERSORT ▪ Pancreatic Islet: Beta cells vs. HbA1c (Fig 4b) ◦ Cobos benchmark: DWLS > MuSiC > Bisque > deconvoSeq ▪ Human PMBC ﬂow sorted (Fig 7) 11

What Methods are We Exploring? • MuSiC & Bisque What
Methods are we Ignoring? • SCDC is redundant: we only have 1 reference set • DWLS seems basic and slow 12

Datasets 13

What Data are We Using? LIBD Bulk RNA-seq • MDDSeq:
MDD + control in sACC + AMY (Goes et al. DOI: 10.7303/syn4921369) • BipSeq: BPD + control in sACC + AMY (Zandi et al. DOI: 10.7303/syn5844980) • BrainSEQ phase I + II (DOI: 10.1038/s41593-018-0197-y, 10.1016/j.neuron.2019.05.013) ◦ polyA (DLPFC) vs. RiboZero (DLPFC + HPC) LIBD Single Nucleus • Tran, Maynard et al 10x snRNA-seq (DOI: 10.1016/j.neuron.2021.09.001) ◦ 5 Brain Regions, 70,615 nuclei, 8 donors Publicly Available • GTEx version 8 via recount3 (10.1186/s13059-021-02533-6) ◦ 13 Brain Regions, 3,387 samples, 1,127 individuals 14

10x snRNA-seq Reference Data Tran, Maynard et al., Neuron, 2021
AMY DLPFC HPC NAc sACC Astro 1638 782 1170 1099 907 Endo 31 0 0 0 0 Macro 0 10 0 22 0 Micro 1168 388 1126 492 784 Mural 39 18 43 0 0 Oligo 6080 5455 5912 6134 4584 OPC 1459 572 838 669 911 Tcell 31 9 26 0 0 Excit 443 2388 623 0 4163 Inhib 3117 1580 366 11476 3974 @mattntran Matthew N Tran

Dataset Regions Samples Case Control Project Status BipSeq sACC +
AMY 511 247 BPD 264 Revisions Suicide Genomics DLPFC 329 226 103 Revisions BrainSeq Phase III Caudate 464 298 SCZD 266 Revisions MDDseq sACC + AMY 1091 704 MDD/BPD 387 Active AANRI DG, Caudate, Hippo, DLPFC 1647 (263, 464, 447, 453) - - Active BrainSeq Phase I DLPFC 727 395 SCZD 332 Exploratory BrainSeq Phase II DLPFC 453 153 SCZD 300 Exploratory GTEx 13 Regions 2670 - - Exploratory Degradation AMY, Caudate, DLPFC, HIPPO, mPFC, sACC 119 - - Exploratory

Marker Finding 17

What are Marker Genes? • “Deﬁne” cell types ◦ Differentially
expressed between cell types • Historically ◦ Know markers associated with key cell types ◦ Ex. MBP: major constituent of the myelin sheath, marker for oligodendrocytes • What does the Data tell us? ◦ Human vs. model organisms ◦ Regional ◦ Technical differences 18

Marker Gene Selection • Filter for genes expressed in snRNA-seq
and bulk data • Looking for genes expressed in only one cell type ◦ Test for speciﬁcity of each gene for each cell type • Observe expression of selected marker genes ◦ Heat maps of pseudo-bulked data ▪ Summation of counts from nuclei from one donor + cell type ◦ Violin plots by cell type Marker Genes shared by sn & bulk RNA-seq The Ideal Heatmap snRNA-seq data, Pseudo-bulked by cell type + donor 19 Stephanie C Hicks

Differential Expression : One vs. All • scran::findMarkers() • Target
cell type vs. all other cells • Select genes with highest fold change • Observed lots of noise between different cell types snRNA-seq sACC Pseudo-bulked by cell type + donor The Actual Heatmap 20

Exploring Marker Expression • T-test between two groups • Fold
change between groups 21

Exploring Marker Expression Where does this noise come from? •
Outliers in one of more non-target cell type ◦ Here OPCs are expressing MBP 22

Our Solution: Mean Ratio Target Highest non-target Mean Expression target
cell type Mean Expression highest non-target cell type = Mean Ratio Higher mean ratio: • the more speciﬁc that gene is to the target cell type • the better a marker gene it is 23

Mean Ratio vs. Fold Change • Genes with high mean
ratio also have high fold changes • Not all genes with high fold changes have high mean ratios • Selecting marker genes by mean ratio helps avoid “noisy” genes 24

1vAll Markers vs. Mean Ratio Markers 25

1vAll Markers vs. Mean Ratio Markers 26

How Many Markers? 27 • As many look like outliers
in the “worst” cell type ◦ Least amount of signal ◦ Balance overfitting vs. adding noise ◦ Looking at Inhib: we chose 25 markers • Same number for each cell type • Typical methods [ findMarkers() ] are useful for finding cell identity markers, not necessarily good cell decomposition markers

How Many Markers? • This becomes more diﬃcult with more
speciﬁc cell types • We are looking for genes with big differences between cell types: hence why it’s easier to work at the broad cell type resolution 28

Final QC Reference set Top 25 for 10 cell types
• Drop 2 worst macrophage genes • Drop two macrophage-like microglia clusters • 248 marker genes genes in total • 70,527 nuclei post-QC

Why we like Bisque • Benchmarked with a DLPFC dataset
• Robust to marker set • Robust to library prep • More reasonable estimates on GTEx dataset • With ﬁnal reference dataset (8 donors) we have more than the 4 donors recommended

GTEx tests

Deconvolution of GTEX Brain Data Park et al., bioRxiv, 2021
• Deconvolute three large human brain datasets ◦ GTEx, MAYO, ROSMAP • Introduce new deconvolution algorithm: SPLITR • 48 donor reference set • Validate with previous composition estimates + immunostaining • Variation across brain regions Figure 2

Mean Proportions By Region: Tran et al, bioRxiv, 2020 (5
donors, 6 cell types)

Peric = Mural + Endo Mean Proportions By Region: Tran
et al, Neuron, 2021 (8 donors, 10 cell types)

• Bisque has more similar pattern of composition over regions
vs. SPLITR • MuSiC predicts large proportions of Endo + Mural (Peric) • Both estimate lower proportions of Excit ◦ MuSiC is more extreme and also predicts low portion Inhib Bisque & MuSiC vs SPLITR Different deconvolution methods, bulk RNA-seq data source, marker genes, and reference snRNA-seq data

• Run with set of 20 & 25 marker genes
per cell type • Bisque is more robust to changes in the marker set than MuSiC Method Sensitivity to Marker Set 25 vs. 20 Genes

• Different cell types change in different ways - Excit
& Mural are the most dramatic

MDDseq Results

Average Compositions

Compare Proportion of Cell Types over Dx

5/30 t-test are signiﬁcant (p.bonf < 0.05) Amygdala

9/30 t-test are signiﬁcant (p.bonf < 0.05) sACC

0/20 t-test are signiﬁcant (p.bonf < 0.05) Compare Sex

Correlate Proportion Cell Type vs. qSVs

Data Driven Identiﬁcation of Total RNA Expression Genes “TREGs” for
estimation of RNA abundance in heterogeneous cell types

Background • Most deconvolution algorithms don't consider cell types have
different cell sizes and transcriptional activity ◦ Ex. Neurons are more active than Glia ◦ Can only predict fraction of RNA not fraction of cells in tissue ◦ Sosina et al., F1000Research, 2021 MDDseq Bisque

Bulk RNA-seq Goals for RNAscope Experiment • Deconvolution R01 MH123183
◦ Kristen Maynard, Stephanie C Hicks • Use six slices of DLPFC to generate corresponding RNA-seq & RNAscope data • This information will be useful to evaluate and design deconvolution algorithms DLPFC Bulk RNA-seq snRNA-seq Spatial RNAscope RNAscope 47 polyA RiboZero

Motivation • Improve Deconvolution algorithms by considering differences in size
and RNA content between cell types • Use smFISH with RNAscope to establish data set of: ◦ Cellular composition ◦ Nuclei sizes of major cell types ◦ Average nuclei RNA content of major cell types How do we measure total RNA content of a cell if we can only observe a few genes at a time? Use a TREG

What is a TREG? • Total RNA Expression Gene •
Expression is proportional to the overall RNA expression in a cell • In smFISH the count of TREG puncta in a cell can estimate the RNA content ◦ Linking RNA content to nucleus size

Housekeeping Genes • Unrelated to protein function • Tissue speciﬁc
• Expression tracks with total expression (should vary) • Involved in basic cellular function • Consistent across tissues • Expressed evenly across tissue • expressed in every cell TREGs We will compare candidate TREGs to housekeeping gene POLR2A to examine diﬀerences

Quality Wish List TREG • Non-zero expression in most cells
across tissues and cell types • Expressed at a constant level in respect to other genes across different cell types RNAscope • Expressed in top 50% of genes for easy detection • Show a dynamic range of puncta* • Individual puncta are countable* * Will need to evaluated experimentally

Data Driven TREG Discovery Data • Human Brain postmortem 10x
snRNA-seq ◦ Tran, Maynard et al., Neuron, 2021 ◦ Eight donors, 5 Brain Regions ◦ Nine broad cell types • smFISH with RNAscope in DLPFC tissue ◦ 3 tissue sections from one donor Method Overview 1. Filter to top 50% expressed genes 2. Filter out genes with high proportion of zero expression 3. Select genes with high Rank Invariance as candidate TREGs 4. Evaluate candidate gene performance in RNAscope experiment @mattntran Matthew N Tran Kelsey D Montgomery Sang Ho Kwon

Data Driven TREG Discovery - Method rank(x) gives you high
ranks for high values in x, with average ranks for ties

Gene Filtering • First ﬁltered to top 50% of 23k
genes in dataset • Compute Proportion Zero • Used distribution to pick 0.75 as max Proportion Zero cutoff ◦ For a gene to pass the maximum proportion of zeros among groups must be less than the cutoff • Only 877 (3.8%) of genes pass ◦ POLR2A doesn’t pass

Evaluate for Rank Invariance • How variable is the Expression
Rank of a gene within a cell type? Between cell types? • Expressed at a constant level in respect to other genes = High Rank Invariance ◦ Evaluate within and between cell types Expression Rank

Rank Invariance Calculation C1 C2 G1 G2 G3 G4 G1
4 G2 2 G3 3 G4 1 C1 C2 G1 3.5 4 G2 3.5 3 G3 2 1 G4 1 2 Rank cell (gxc) Rank rowMean (gx1) Filtered genes, for one cell type (gxc) C1 C2 G1 .5 0 G2 .5 1 G3 -1 2 G4 0 -1 Difference of ranks (gxc) G1 0.25 G2 0.75 G3 1.5 G4 0.5 Rank rowMean (gx1) G1 1 G2 3 G3 4 G4 2 rowMean of absolute differences (gx1) How diﬀerent is the Expression Rank of the gene in one nuclei vs the average nuclei for that cell type? Low variation = low diﬀerence = low rank

C1 C2 C3 C4 C5 G1 G2 G3 G4 R
B Y G1 1 1 1 G2 3 2 2 G3 4 3 3.5 G4 2 4 3.5 G1 4 G2 3 G3 1 G4 2 Calculate “rank-difference rowMeans” for each cell type Reverse-Rank of rowSums “Rank Invariance” (gx1) Filtered genes, grouped by cell type (gxc) Rank Invariance Calculation Low variation across groups = low rank sum = high rank invariance

TREG R Package Functions 58 • Gene ﬁltering ◦ get_prop_zero()
◦ filter_prop_zero() • Rank Invariance step-wise ◦ rank_cells() ◦ rank_group() ◦ rank_invariance() • rank_invariance_express() ◦ Completes full RI calculations https://github.com/LieberInstitute/TREG

Candidate TREGs • Selected AKT3, ARID1B, and MALAT1 as our
candidate TREGs ◦ In top 10 RI values, and has available probes • TREGs show consistent Expression Ranks within all cells and with in most cell types Expression Rank Expression Rank Expression Rank

Validate in snRNA-seq data • TREGs showed strong correlation between
gene expression and total reads per cell • log2(counts + 1) ~ log2(sum) + cellType Total RNA per nucleus in snRNA-seq Expression of TREG per nucleus

Validate with RNAscope • DLPFC from control, sectioned at 10μm
• 3 slides with 3 sections each ◦ TREG candidate + cell type marker genes • Images analyzed with HALO TREG Gene AKT3 ARID1B MALAT1/ POLR2A Cell Type Markers GAD1, SLC17A7, MBP GAD1, SLC17A7, MBP SLC17A7, MBP Cell Type Marker Excit SLC17A7 Inhib GAD1 Oligo MBP Kelsey D Montgomery Sang Ho Kwon

HALO Analysis • HALO identiﬁed 944k cells across the 9
slides ◦ 90k-100k per sample • Some QC was required to remove out of focus regions • HALO data tracks really well with what we see visually in the slides and expected anatomy of the tissue ◦ Neurons are larger ◦ Neurons are transcriptionally more active ◦ Excit, Inhib and Oligo are where you would expect them to be Typo** GAD1

Patterns of Observed Puncta • TREGs were expressed in most
cells • AKT3 tracks really well with pattern of expression seen in snRNA-seq (ARID1B is also pretty good) snRNA-seq RNAscope Gene Mean Prop. Cells with Expression Prop. non-zero in DLPFC snRNA Standardized β (95% CI) AKT3 0.948 0.92 -1.38 (-1.39,-1.37) ARID1B 0.908 0.94 -0.62 (-0.62,-0.61) MALAT1 0.910 1.00 -0.11 (-0.12,-0.11) POLR2A 0.853 0.30 -0.98 (-0.99,-0.98) snRNA-seq NA NA -1.33 (-1.35,-1.31) Remember: MALAT1’s puncta data is unreliable

Conclusions • Proportion Zero ﬁltering + Rank Invariance allow selection
of candidate TREGs in snRNA-seq • AKT3 appears to be a TREG compatible with RNAscope in the human brain • TREGs would allow the observation of total RNA expression via RNAscope ◦ Allow us to collect data that can be used to improve deconvolution algorithms

Suggestions for the present and future • We recommend Bisque
over MuSiC for postmortem human brain data • You might want to use the mean ratio method for selecting snRNA-seq cell type marker genes • You’ll beneﬁt from having more than 3 donors (we had 8) in your snRNA-seq data • Plan to generate RNAscope data for some major cell types using adjacent tissue dissections: could be a gold standard • Use a TREG (AKT3) in your RNAscope to estimate nuclei RNA and size: could be useful input parameters in future deconvolution algorithms https://research.libd.org/DeconvoBuddies/ http://research.libd.org/TREG/

@MadhaviTippani Madhavi Tippani @HeenaDivecha Heena R Divecha @lmwebr Lukas M
Weber @stephaniehicks Stephanie C Hicks @abspangler Abby Spangler @martinowk Keri Martinowich @CerceoPage Stephanie C Page @kr_maynard Kristen R Maynard @lcolladotor Leonardo Collado-Torres @Nick-Eagles (GH) Nicholas J Eagles Kelsey D Montgomery Sang Ho Kwon Image Analysis Expression Analysis Data Generation Thomas M Hyde @lahuuki Louise A Huuki-Myers @JoshStolz2 Joshua M Stolz Moods Workgroup Fernando Goes Patricia Braun Peter Zandi Thomas Hyde Joel Kleinman Christopher Ross Shizhong Han @mattntran Matthew N Tran @sowmyapartybun Sowmya Parthiban https://www.libd.org/careers/

psychgenomics-2022

psychgenomics-2022

More Decks by Leonardo Collado-Torres

Other Decks in Science

Featured

Transcript