psychgenomics-2022 - Speaker Deck

Slide 1

Slide 1 text

Deconvolution of RNA-seq data in the postmortem human brain: the present and preparing the future Psychgenomics: Ichan School of Medicine Leonardo Collado Torres @lcolladotor 2022-03-03 https://lcolladotor.github.io/bioc_team_ds/

Slide 2

Slide 2 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT https://research.libd.org/DeconvoBuddies/ http://research.libd.org/TREG/ @lahuuki Louise A Huuki-Myers @kr_maynard Kristen R Maynard @stephaniehicks Stephanie C Hicks @martinowk Keri Martinowich NIH grant R01MH123183 @CerceoPage Stephanie C Page

Slide 3

Slide 3 text

Background: Cell Types in the Brain ● The brain is made of complex tissues consisting of different types of cells ● Some Dx associated with changes in cell type speciﬁc expression ○ Ex. Pitt-Hopkins syndrome and oligodendrocytes (Phan et al, Nature Neuroscience, 2020) 3

Slide 4

Slide 4 text

● Inferring the composition of different cell types in a bulk RNA-seq data What is Deconvolution? Tissue Bulk RNA-seq snRNA-seq Estimated proportions 4 Deconvolution Get single cell like data from bulk RNA-seq $$$ $ Free! https://twitter.com/BoXia7/status/1261464021322137600

Slide 5

Slide 5 text

Why is Deconvolution Important? ● Tissue is heterogeneous ● Samples can differ in cell type composition due to biology or dissection ● Different cell types express genes at different levels ● Check for differences between groups (i.e. case vs. control) ● Controlling for cell fractions between samples can make differential expression analysis cleaner ○ Confounding factor in differential expression analysis ○ Deconvoluting data can prevent false-positives and false-negatives 5

Slide 6

Slide 6 text

What Challenges Exist? In Pipeline Development ● Method Selection ● Data selection ○ Region specific or all available data ○ Cell type resolution: Broad vs. specific ○ NeuN + non-Neun sorted ● Marker Gene Selection ○ How to find best markers? ○ How many markers? ● Validating Results ○ What do we expect to find? In the Field ● Differences between snRNA-seq and bulk RNA-seq ○ Nuclear vs. cytoplasmic compartments ○ Missing cell types in reference ● Cell sizes ● RNA-fraction vs. Cell fraction Sosina et al, F1000Research, 2021 Future Work: Maynard + Hicks R01MH123183 6

Slide 7

Slide 7 text

Deconvolution Methods 7

Slide 8

Slide 8 text

What Methods Exist? ● Regression based ○ Define bulk cell-type specific expression profiles, regress out optimal cell proportions from bulk data ○ Examples: CIBERSORT, OLS, nnls, RLR, deconvoSeq ● Single Cell RNA-seq reference based ○ Use a annotated single cell expression data as reference ○ Examples: DWLS, MuSiC, Bisque, SCDC, SPLITR 8

Slide 9

Slide 9 text

Method Summary Method Regression Correction for Technical Variation Other Features MuSiC Wang et al, Nature Communications, 2019 W-NNLS regression (Weighted - Non-negative least squares) None Tree guided deconvolution, good for closely related cell types Bisque Jew et al, Nature Communications, 2020 NNLS regresion Gene specific transformation of bulk data Leverage overlapping bulk & sc data SCDC Dong et al, Briefings in Bioinformatics, 2020 W-NNLS framework proposed by MuSiC Option for Gene specific transformation of bulk data (from Bisque) Multiple reference datasets can be used, results combined with ENSEMBL weights DWLS Tsoucas, Nature Communications, 2019 Dampened Weighted least squares None 9

Slide 10

Slide 10 text

Method Regression Method Run Time Marker Evaluation Adjust for snRNA-seq vs. Bulk Tissues Tested Consider Cell Size Reference Set MuSiC W-NNLS Min. Internal Weighting No Pancreatic Islet, Rat & Mouse Kidney Yes Bisque NNLS Min. No Yes Adipose, DLPFC Recommend 3+ donors SCDC W-NNLS Min. Internal Weighting Yes Pancreatic Islet, mouse mammary Can input multiple references DWLS DWLS Hours Internal Selection No Mouse kidney, lung, liver, small intestine 10

Slide 11

Slide 11 text

Which Method is the Most Accurate? ● Benchmarking shows that different methods perform best on different data sets (Cobos et al, Nature Communications, 2020) ● Benchmarking results from different papers on “real” data ○ MuSiC paper: MuSiC > NNLS > BSEQ-sx > CIBERSORT ■ Pancreatic Islet: Beta cells vs. HbA1c (Fig 2a) ○ Bisque paper: Bisque > MuSiC > CIBERSORT ■ DLPFC: Microglia vs. Braak stage, Neuron vs. Cognitive diagnostic category (Fig 4) ○ SCDC paper: SCDC > MuSiC > Bisque > DWLS > CIBERSORT ■ Pancreatic Islet: Beta cells vs. HbA1c (Fig 4b) ○ Cobos benchmark: DWLS > MuSiC > Bisque > deconvoSeq ■ Human PMBC ﬂow sorted (Fig 7) 11

Slide 12

Slide 12 text

What Methods are We Exploring? ● MuSiC & Bisque What Methods are we Ignoring? ● SCDC is redundant: we only have 1 reference set ● DWLS seems basic and slow 12

Slide 13

Slide 13 text

Datasets 13

Slide 14

Slide 14 text

What Data are We Using? LIBD Bulk RNA-seq ● MDDSeq: MDD + control in sACC + AMY (Goes et al. DOI: 10.7303/syn4921369) ● BipSeq: BPD + control in sACC + AMY (Zandi et al. DOI: 10.7303/syn5844980) ● BrainSEQ phase I + II (DOI: 10.1038/s41593-018-0197-y, 10.1016/j.neuron.2019.05.013) ○ polyA (DLPFC) vs. RiboZero (DLPFC + HPC) LIBD Single Nucleus ● Tran, Maynard et al 10x snRNA-seq (DOI: 10.1016/j.neuron.2021.09.001) ○ 5 Brain Regions, 70,615 nuclei, 8 donors Publicly Available ● GTEx version 8 via recount3 (10.1186/s13059-021-02533-6) ○ 13 Brain Regions, 3,387 samples, 1,127 individuals 14

Slide 15

Slide 15 text

10x snRNA-seq Reference Data Tran, Maynard et al., Neuron, 2021 AMY DLPFC HPC NAc sACC Astro 1638 782 1170 1099 907 Endo 31 0 0 0 0 Macro 0 10 0 22 0 Micro 1168 388 1126 492 784 Mural 39 18 43 0 0 Oligo 6080 5455 5912 6134 4584 OPC 1459 572 838 669 911 Tcell 31 9 26 0 0 Excit 443 2388 623 0 4163 Inhib 3117 1580 366 11476 3974 @mattntran Matthew N Tran

Slide 16

Slide 16 text

Dataset Regions Samples Case Control Project Status BipSeq sACC + AMY 511 247 BPD 264 Revisions Suicide Genomics DLPFC 329 226 103 Revisions BrainSeq Phase III Caudate 464 298 SCZD 266 Revisions MDDseq sACC + AMY 1091 704 MDD/BPD 387 Active AANRI DG, Caudate, Hippo, DLPFC 1647 (263, 464, 447, 453) - - Active BrainSeq Phase I DLPFC 727 395 SCZD 332 Exploratory BrainSeq Phase II DLPFC 453 153 SCZD 300 Exploratory GTEx 13 Regions 2670 - - Exploratory Degradation AMY, Caudate, DLPFC, HIPPO, mPFC, sACC 119 - - Exploratory

Slide 17

Slide 17 text

Marker Finding 17

Slide 18

Slide 18 text

What are Marker Genes? ● “Deﬁne” cell types ○ Differentially expressed between cell types ● Historically ○ Know markers associated with key cell types ○ Ex. MBP: major constituent of the myelin sheath, marker for oligodendrocytes ● What does the Data tell us? ○ Human vs. model organisms ○ Regional ○ Technical differences 18

Slide 19

Slide 19 text

Marker Gene Selection ● Filter for genes expressed in snRNA-seq and bulk data ● Looking for genes expressed in only one cell type ○ Test for speciﬁcity of each gene for each cell type ● Observe expression of selected marker genes ○ Heat maps of pseudo-bulked data ■ Summation of counts from nuclei from one donor + cell type ○ Violin plots by cell type Marker Genes shared by sn & bulk RNA-seq The Ideal Heatmap snRNA-seq data, Pseudo-bulked by cell type + donor 19 Stephanie C Hicks

Slide 20

Slide 20 text

Differential Expression : One vs. All ● scran::findMarkers() ● Target cell type vs. all other cells ● Select genes with highest fold change ● Observed lots of noise between different cell types snRNA-seq sACC Pseudo-bulked by cell type + donor The Actual Heatmap 20

Slide 21

Slide 21 text

Exploring Marker Expression ● T-test between two groups ● Fold change between groups 21

Slide 22

Slide 22 text

Exploring Marker Expression Where does this noise come from? ● Outliers in one of more non-target cell type ○ Here OPCs are expressing MBP 22

Slide 23

Slide 23 text

Our Solution: Mean Ratio Target Highest non-target Mean Expression target cell type Mean Expression highest non-target cell type = Mean Ratio Higher mean ratio: ● the more speciﬁc that gene is to the target cell type ● the better a marker gene it is 23

Slide 24

Slide 24 text

Mean Ratio vs. Fold Change ● Genes with high mean ratio also have high fold changes ● Not all genes with high fold changes have high mean ratios ● Selecting marker genes by mean ratio helps avoid “noisy” genes 24

Slide 25

Slide 25 text

1vAll Markers vs. Mean Ratio Markers 25

Slide 26

Slide 26 text

1vAll Markers vs. Mean Ratio Markers 26

Slide 27

Slide 27 text

How Many Markers? 27 ● As many look like outliers in the “worst” cell type ○ Least amount of signal ○ Balance overfitting vs. adding noise ○ Looking at Inhib: we chose 25 markers ● Same number for each cell type ● Typical methods [ findMarkers() ] are useful for finding cell identity markers, not necessarily good cell decomposition markers

Slide 28

Slide 28 text

How Many Markers? ● This becomes more diﬃcult with more speciﬁc cell types ● We are looking for genes with big differences between cell types: hence why it’s easier to work at the broad cell type resolution 28

Slide 29

Slide 29 text

Final QC Reference set Top 25 for 10 cell types ● Drop 2 worst macrophage genes ● Drop two macrophage-like microglia clusters ● 248 marker genes genes in total ● 70,527 nuclei post-QC

Slide 30

Slide 30 text

Why we like Bisque ● Benchmarked with a DLPFC dataset ● Robust to marker set ● Robust to library prep ● More reasonable estimates on GTEx dataset ● With ﬁnal reference dataset (8 donors) we have more than the 4 donors recommended

Slide 31

Slide 31 text

GTEx tests

Slide 32

Slide 32 text

Deconvolution of GTEX Brain Data Park et al., bioRxiv, 2021 ● Deconvolute three large human brain datasets ○ GTEx, MAYO, ROSMAP ● Introduce new deconvolution algorithm: SPLITR ● 48 donor reference set ● Validate with previous composition estimates + immunostaining ● Variation across brain regions Figure 2

Slide 33

Slide 33 text

Mean Proportions By Region: Tran et al, bioRxiv, 2020 (5 donors, 6 cell types)

Slide 34

Slide 34 text

Peric = Mural + Endo Mean Proportions By Region: Tran et al, Neuron, 2021 (8 donors, 10 cell types)

Slide 35

Slide 35 text

● Bisque has more similar pattern of composition over regions vs. SPLITR ● MuSiC predicts large proportions of Endo + Mural (Peric) ● Both estimate lower proportions of Excit ○ MuSiC is more extreme and also predicts low portion Inhib Bisque & MuSiC vs SPLITR Different deconvolution methods, bulk RNA-seq data source, marker genes, and reference snRNA-seq data

Slide 36

Slide 36 text

● Run with set of 20 & 25 marker genes per cell type ● Bisque is more robust to changes in the marker set than MuSiC Method Sensitivity to Marker Set 25 vs. 20 Genes

Slide 37

Slide 37 text

● Different cell types change in different ways - Excit & Mural are the most dramatic

Slide 38

Slide 38 text

MDDseq Results

Slide 39

Slide 39 text

Average Compositions

Slide 40

Slide 40 text

Compare Proportion of Cell Types over Dx

Slide 41

Slide 41 text

5/30 t-test are signiﬁcant (p.bonf < 0.05) Amygdala

Slide 42

Slide 42 text

9/30 t-test are signiﬁcant (p.bonf < 0.05) sACC

Slide 43

Slide 43 text

0/20 t-test are signiﬁcant (p.bonf < 0.05) Compare Sex

Slide 44

Slide 44 text

Correlate Proportion Cell Type vs. qSVs

Slide 45

Slide 45 text

Data Driven Identiﬁcation of Total RNA Expression Genes “TREGs” for estimation of RNA abundance in heterogeneous cell types

Slide 46

Slide 46 text

Background ● Most deconvolution algorithms don't consider cell types have different cell sizes and transcriptional activity ○ Ex. Neurons are more active than Glia ○ Can only predict fraction of RNA not fraction of cells in tissue ○ Sosina et al., F1000Research, 2021 MDDseq Bisque

Slide 47

Slide 47 text

Bulk RNA-seq Goals for RNAscope Experiment ● Deconvolution R01 MH123183 ○ Kristen Maynard, Stephanie C Hicks ● Use six slices of DLPFC to generate corresponding RNA-seq & RNAscope data ● This information will be useful to evaluate and design deconvolution algorithms DLPFC Bulk RNA-seq snRNA-seq Spatial RNAscope RNAscope 47 polyA RiboZero

Slide 48

Slide 48 text

Motivation ● Improve Deconvolution algorithms by considering differences in size and RNA content between cell types ● Use smFISH with RNAscope to establish data set of: ○ Cellular composition ○ Nuclei sizes of major cell types ○ Average nuclei RNA content of major cell types How do we measure total RNA content of a cell if we can only observe a few genes at a time? Use a TREG

Slide 49

Slide 49 text

What is a TREG? ● Total RNA Expression Gene ● Expression is proportional to the overall RNA expression in a cell ● In smFISH the count of TREG puncta in a cell can estimate the RNA content ○ Linking RNA content to nucleus size

Slide 50

Slide 50 text

Housekeeping Genes ● Unrelated to protein function ● Tissue speciﬁc ● Expression tracks with total expression (should vary) ● Involved in basic cellular function ● Consistent across tissues ● Expressed evenly across tissue ● expressed in every cell TREGs We will compare candidate TREGs to housekeeping gene POLR2A to examine diﬀerences

Slide 51

Slide 51 text

Quality Wish List TREG ● Non-zero expression in most cells across tissues and cell types ● Expressed at a constant level in respect to other genes across different cell types RNAscope ● Expressed in top 50% of genes for easy detection ● Show a dynamic range of puncta* ● Individual puncta are countable* * Will need to evaluated experimentally

Slide 52

Slide 52 text

Data Driven TREG Discovery Data ● Human Brain postmortem 10x snRNA-seq ○ Tran, Maynard et al., Neuron, 2021 ○ Eight donors, 5 Brain Regions ○ Nine broad cell types ● smFISH with RNAscope in DLPFC tissue ○ 3 tissue sections from one donor Method Overview 1. Filter to top 50% expressed genes 2. Filter out genes with high proportion of zero expression 3. Select genes with high Rank Invariance as candidate TREGs 4. Evaluate candidate gene performance in RNAscope experiment @mattntran Matthew N Tran Kelsey D Montgomery Sang Ho Kwon

Slide 53

Slide 53 text

Data Driven TREG Discovery - Method rank(x) gives you high ranks for high values in x, with average ranks for ties

Slide 54

Slide 54 text

Gene Filtering ● First ﬁltered to top 50% of 23k genes in dataset ● Compute Proportion Zero ● Used distribution to pick 0.75 as max Proportion Zero cutoff ○ For a gene to pass the maximum proportion of zeros among groups must be less than the cutoff ● Only 877 (3.8%) of genes pass ○ POLR2A doesn’t pass

Slide 55

Slide 55 text

Evaluate for Rank Invariance ● How variable is the Expression Rank of a gene within a cell type? Between cell types? ● Expressed at a constant level in respect to other genes = High Rank Invariance ○ Evaluate within and between cell types Expression Rank

Slide 56

Slide 56 text

Rank Invariance Calculation C1 C2 G1 G2 G3 G4 G1 4 G2 2 G3 3 G4 1 C1 C2 G1 3.5 4 G2 3.5 3 G3 2 1 G4 1 2 Rank cell (gxc) Rank rowMean (gx1) Filtered genes, for one cell type (gxc) C1 C2 G1 .5 0 G2 .5 1 G3 -1 2 G4 0 -1 Difference of ranks (gxc) G1 0.25 G2 0.75 G3 1.5 G4 0.5 Rank rowMean (gx1) G1 1 G2 3 G3 4 G4 2 rowMean of absolute differences (gx1) How diﬀerent is the Expression Rank of the gene in one nuclei vs the average nuclei for that cell type? Low variation = low diﬀerence = low rank

Slide 57

Slide 57 text

C1 C2 C3 C4 C5 G1 G2 G3 G4 R B Y G1 1 1 1 G2 3 2 2 G3 4 3 3.5 G4 2 4 3.5 G1 4 G2 3 G3 1 G4 2 Calculate “rank-difference rowMeans” for each cell type Reverse-Rank of rowSums “Rank Invariance” (gx1) Filtered genes, grouped by cell type (gxc) Rank Invariance Calculation Low variation across groups = low rank sum = high rank invariance

Slide 58

Slide 58 text

TREG R Package Functions 58 ● Gene ﬁltering ○ get_prop_zero() ○ filter_prop_zero() ● Rank Invariance step-wise ○ rank_cells() ○ rank_group() ○ rank_invariance() ● rank_invariance_express() ○ Completes full RI calculations https://github.com/LieberInstitute/TREG

Slide 59

Slide 59 text

Candidate TREGs ● Selected AKT3, ARID1B, and MALAT1 as our candidate TREGs ○ In top 10 RI values, and has available probes ● TREGs show consistent Expression Ranks within all cells and with in most cell types Expression Rank Expression Rank Expression Rank

Slide 60

Slide 60 text

Validate in snRNA-seq data ● TREGs showed strong correlation between gene expression and total reads per cell ● log2(counts + 1) ~ log2(sum) + cellType Total RNA per nucleus in snRNA-seq Expression of TREG per nucleus

Slide 61

Slide 61 text

Validate with RNAscope ● DLPFC from control, sectioned at 10μm ● 3 slides with 3 sections each ○ TREG candidate + cell type marker genes ● Images analyzed with HALO TREG Gene AKT3 ARID1B MALAT1/ POLR2A Cell Type Markers GAD1, SLC17A7, MBP GAD1, SLC17A7, MBP SLC17A7, MBP Cell Type Marker Excit SLC17A7 Inhib GAD1 Oligo MBP Kelsey D Montgomery Sang Ho Kwon

Slide 62

Slide 62 text

HALO Analysis ● HALO identiﬁed 944k cells across the 9 slides ○ 90k-100k per sample ● Some QC was required to remove out of focus regions ● HALO data tracks really well with what we see visually in the slides and expected anatomy of the tissue ○ Neurons are larger ○ Neurons are transcriptionally more active ○ Excit, Inhib and Oligo are where you would expect them to be Typo** GAD1

Slide 63

Slide 63 text

Patterns of Observed Puncta ● TREGs were expressed in most cells ● AKT3 tracks really well with pattern of expression seen in snRNA-seq (ARID1B is also pretty good) snRNA-seq RNAscope Gene Mean Prop. Cells with Expression Prop. non-zero in DLPFC snRNA Standardized β (95% CI) AKT3 0.948 0.92 -1.38 (-1.39,-1.37) ARID1B 0.908 0.94 -0.62 (-0.62,-0.61) MALAT1 0.910 1.00 -0.11 (-0.12,-0.11) POLR2A 0.853 0.30 -0.98 (-0.99,-0.98) snRNA-seq NA NA -1.33 (-1.35,-1.31) Remember: MALAT1’s puncta data is unreliable

Slide 64

Slide 64 text

Conclusions ● Proportion Zero ﬁltering + Rank Invariance allow selection of candidate TREGs in snRNA-seq ● AKT3 appears to be a TREG compatible with RNAscope in the human brain ● TREGs would allow the observation of total RNA expression via RNAscope ○ Allow us to collect data that can be used to improve deconvolution algorithms

Slide 65

Slide 65 text

Suggestions for the present and future ● We recommend Bisque over MuSiC for postmortem human brain data ● You might want to use the mean ratio method for selecting snRNA-seq cell type marker genes ● You’ll beneﬁt from having more than 3 donors (we had 8) in your snRNA-seq data ● Plan to generate RNAscope data for some major cell types using adjacent tissue dissections: could be a gold standard ● Use a TREG (AKT3) in your RNAscope to estimate nuclei RNA and size: could be useful input parameters in future deconvolution algorithms https://research.libd.org/DeconvoBuddies/ http://research.libd.org/TREG/

Slide 66

Slide 66 text

@MadhaviTippani Madhavi Tippani @HeenaDivecha Heena R Divecha @lmwebr Lukas M Weber @stephaniehicks Stephanie C Hicks @abspangler Abby Spangler @martinowk Keri Martinowich @CerceoPage Stephanie C Page @kr_maynard Kristen R Maynard @lcolladotor Leonardo Collado-Torres @Nick-Eagles (GH) Nicholas J Eagles Kelsey D Montgomery Sang Ho Kwon Image Analysis Expression Analysis Data Generation Thomas M Hyde @lahuuki Louise A Huuki-Myers @JoshStolz2 Joshua M Stolz Moods Workgroup Fernando Goes Patricia Braun Peter Zandi Thomas Hyde Joel Kleinman Christopher Ross Shizhong Han @mattntran Matthew N Tran @sowmyapartybun Sowmya Parthiban https://www.libd.org/careers/