Upgrade to Pro — share decks privately, control downloads, hide ads and more …



Talk for Psychgenomics at Ichan School of Medicine, NY. Invited by Kiran Girdhar.


Leonardo Collado-Torres

March 03, 2022

More Decks by Leonardo Collado-Torres

Other Decks in Science


  1. Deconvolution of RNA-seq data in the postmortem human brain: the

    present and preparing the future Psychgenomics: Ichan School of Medicine Leonardo Collado Torres @lcolladotor 2022-03-03 https://lcolladotor.github.io/bioc_team_ds/
  2. LIEBER INSTITUTE for BRAIN DEVELOPMENT https://research.libd.org/DeconvoBuddies/ http://research.libd.org/TREG/ @lahuuki Louise A

    Huuki-Myers @kr_maynard Kristen R Maynard @stephaniehicks Stephanie C Hicks @martinowk Keri Martinowich NIH grant R01MH123183 @CerceoPage Stephanie C Page
  3. Background: Cell Types in the Brain • The brain is

    made of complex tissues consisting of different types of cells • Some Dx associated with changes in cell type specific expression ◦ Ex. Pitt-Hopkins syndrome and oligodendrocytes (Phan et al, Nature Neuroscience, 2020) 3
  4. • Inferring the composition of different cell types in a

    bulk RNA-seq data What is Deconvolution? Tissue Bulk RNA-seq snRNA-seq Estimated proportions 4 Deconvolution Get single cell like data from bulk RNA-seq $$$ $ Free! https://twitter.com/BoXia7/status/1261464021322137600
  5. Why is Deconvolution Important? • Tissue is heterogeneous • Samples

    can differ in cell type composition due to biology or dissection • Different cell types express genes at different levels • Check for differences between groups (i.e. case vs. control) • Controlling for cell fractions between samples can make differential expression analysis cleaner ◦ Confounding factor in differential expression analysis ◦ Deconvoluting data can prevent false-positives and false-negatives 5
  6. What Challenges Exist? In Pipeline Development • Method Selection •

    Data selection ◦ Region specific or all available data ◦ Cell type resolution: Broad vs. specific ◦ NeuN + non-Neun sorted • Marker Gene Selection ◦ How to find best markers? ◦ How many markers? • Validating Results ◦ What do we expect to find? In the Field • Differences between snRNA-seq and bulk RNA-seq ◦ Nuclear vs. cytoplasmic compartments ◦ Missing cell types in reference • Cell sizes • RNA-fraction vs. Cell fraction Sosina et al, F1000Research, 2021 Future Work: Maynard + Hicks R01MH123183 6
  7. Deconvolution Methods 7

  8. What Methods Exist? • Regression based ◦ Define bulk cell-type

    specific expression profiles, regress out optimal cell proportions from bulk data ◦ Examples: CIBERSORT, OLS, nnls, RLR, deconvoSeq • Single Cell RNA-seq reference based ◦ Use a annotated single cell expression data as reference ◦ Examples: DWLS, MuSiC, Bisque, SCDC, SPLITR 8
  9. Method Summary Method Regression Correction for Technical Variation Other Features

    MuSiC Wang et al, Nature Communications, 2019 W-NNLS regression (Weighted - Non-negative least squares) None Tree guided deconvolution, good for closely related cell types Bisque Jew et al, Nature Communications, 2020 NNLS regresion Gene specific transformation of bulk data Leverage overlapping bulk & sc data SCDC Dong et al, Briefings in Bioinformatics, 2020 W-NNLS framework proposed by MuSiC Option for Gene specific transformation of bulk data (from Bisque) Multiple reference datasets can be used, results combined with ENSEMBL weights DWLS Tsoucas, Nature Communications, 2019 Dampened Weighted least squares None 9
  10. Method Regression Method Run Time Marker Evaluation Adjust for snRNA-seq

    vs. Bulk Tissues Tested Consider Cell Size Reference Set MuSiC W-NNLS Min. Internal Weighting No Pancreatic Islet, Rat & Mouse Kidney Yes Bisque NNLS Min. No Yes Adipose, DLPFC Recommend 3+ donors SCDC W-NNLS Min. Internal Weighting Yes Pancreatic Islet, mouse mammary Can input multiple references DWLS DWLS Hours Internal Selection No Mouse kidney, lung, liver, small intestine 10
  11. Which Method is the Most Accurate? • Benchmarking shows that

    different methods perform best on different data sets (Cobos et al, Nature Communications, 2020) • Benchmarking results from different papers on “real” data ◦ MuSiC paper: MuSiC > NNLS > BSEQ-sx > CIBERSORT ▪ Pancreatic Islet: Beta cells vs. HbA1c (Fig 2a) ◦ Bisque paper: Bisque > MuSiC > CIBERSORT ▪ DLPFC: Microglia vs. Braak stage, Neuron vs. Cognitive diagnostic category (Fig 4) ◦ SCDC paper: SCDC > MuSiC > Bisque > DWLS > CIBERSORT ▪ Pancreatic Islet: Beta cells vs. HbA1c (Fig 4b) ◦ Cobos benchmark: DWLS > MuSiC > Bisque > deconvoSeq ▪ Human PMBC flow sorted (Fig 7) 11
  12. What Methods are We Exploring? • MuSiC & Bisque What

    Methods are we Ignoring? • SCDC is redundant: we only have 1 reference set • DWLS seems basic and slow 12
  13. Datasets 13

  14. What Data are We Using? LIBD Bulk RNA-seq • MDDSeq:

    MDD + control in sACC + AMY (Goes et al. DOI: 10.7303/syn4921369) • BipSeq: BPD + control in sACC + AMY (Zandi et al. DOI: 10.7303/syn5844980) • BrainSEQ phase I + II (DOI: 10.1038/s41593-018-0197-y, 10.1016/j.neuron.2019.05.013) ◦ polyA (DLPFC) vs. RiboZero (DLPFC + HPC) LIBD Single Nucleus • Tran, Maynard et al 10x snRNA-seq (DOI: 10.1016/j.neuron.2021.09.001) ◦ 5 Brain Regions, 70,615 nuclei, 8 donors Publicly Available • GTEx version 8 via recount3 (10.1186/s13059-021-02533-6) ◦ 13 Brain Regions, 3,387 samples, 1,127 individuals 14
  15. 10x snRNA-seq Reference Data Tran, Maynard et al., Neuron, 2021

    AMY DLPFC HPC NAc sACC Astro 1638 782 1170 1099 907 Endo 31 0 0 0 0 Macro 0 10 0 22 0 Micro 1168 388 1126 492 784 Mural 39 18 43 0 0 Oligo 6080 5455 5912 6134 4584 OPC 1459 572 838 669 911 Tcell 31 9 26 0 0 Excit 443 2388 623 0 4163 Inhib 3117 1580 366 11476 3974 @mattntran Matthew N Tran
  16. Dataset Regions Samples Case Control Project Status BipSeq sACC +

    AMY 511 247 BPD 264 Revisions Suicide Genomics DLPFC 329 226 103 Revisions BrainSeq Phase III Caudate 464 298 SCZD 266 Revisions MDDseq sACC + AMY 1091 704 MDD/BPD 387 Active AANRI DG, Caudate, Hippo, DLPFC 1647 (263, 464, 447, 453) - - Active BrainSeq Phase I DLPFC 727 395 SCZD 332 Exploratory BrainSeq Phase II DLPFC 453 153 SCZD 300 Exploratory GTEx 13 Regions 2670 - - Exploratory Degradation AMY, Caudate, DLPFC, HIPPO, mPFC, sACC 119 - - Exploratory
  17. Marker Finding 17

  18. What are Marker Genes? • “Define” cell types ◦ Differentially

    expressed between cell types • Historically ◦ Know markers associated with key cell types ◦ Ex. MBP: major constituent of the myelin sheath, marker for oligodendrocytes • What does the Data tell us? ◦ Human vs. model organisms ◦ Regional ◦ Technical differences 18
  19. Marker Gene Selection • Filter for genes expressed in snRNA-seq

    and bulk data • Looking for genes expressed in only one cell type ◦ Test for specificity of each gene for each cell type • Observe expression of selected marker genes ◦ Heat maps of pseudo-bulked data ▪ Summation of counts from nuclei from one donor + cell type ◦ Violin plots by cell type Marker Genes shared by sn & bulk RNA-seq The Ideal Heatmap snRNA-seq data, Pseudo-bulked by cell type + donor 19 Stephanie C Hicks
  20. Differential Expression : One vs. All • scran::findMarkers() • Target

    cell type vs. all other cells • Select genes with highest fold change • Observed lots of noise between different cell types snRNA-seq sACC Pseudo-bulked by cell type + donor The Actual Heatmap 20
  21. Exploring Marker Expression • T-test between two groups • Fold

    change between groups 21
  22. Exploring Marker Expression Where does this noise come from? •

    Outliers in one of more non-target cell type ◦ Here OPCs are expressing MBP 22
  23. Our Solution: Mean Ratio Target Highest non-target Mean Expression target

    cell type Mean Expression highest non-target cell type = Mean Ratio Higher mean ratio: • the more specific that gene is to the target cell type • the better a marker gene it is 23
  24. Mean Ratio vs. Fold Change • Genes with high mean

    ratio also have high fold changes • Not all genes with high fold changes have high mean ratios • Selecting marker genes by mean ratio helps avoid “noisy” genes 24
  25. 1vAll Markers vs. Mean Ratio Markers 25

  26. 1vAll Markers vs. Mean Ratio Markers 26

  27. How Many Markers? 27 • As many look like outliers

    in the “worst” cell type ◦ Least amount of signal ◦ Balance overfitting vs. adding noise ◦ Looking at Inhib: we chose 25 markers • Same number for each cell type • Typical methods [ findMarkers() ] are useful for finding cell identity markers, not necessarily good cell decomposition markers
  28. How Many Markers? • This becomes more difficult with more

    specific cell types • We are looking for genes with big differences between cell types: hence why it’s easier to work at the broad cell type resolution 28
  29. Final QC Reference set Top 25 for 10 cell types

    • Drop 2 worst macrophage genes • Drop two macrophage-like microglia clusters • 248 marker genes genes in total • 70,527 nuclei post-QC
  30. Why we like Bisque • Benchmarked with a DLPFC dataset

    • Robust to marker set • Robust to library prep • More reasonable estimates on GTEx dataset • With final reference dataset (8 donors) we have more than the 4 donors recommended
  31. GTEx tests

  32. Deconvolution of GTEX Brain Data Park et al., bioRxiv, 2021

    • Deconvolute three large human brain datasets ◦ GTEx, MAYO, ROSMAP • Introduce new deconvolution algorithm: SPLITR • 48 donor reference set • Validate with previous composition estimates + immunostaining • Variation across brain regions Figure 2
  33. Mean Proportions By Region: Tran et al, bioRxiv, 2020 (5

    donors, 6 cell types)
  34. Peric = Mural + Endo Mean Proportions By Region: Tran

    et al, Neuron, 2021 (8 donors, 10 cell types)
  35. • Bisque has more similar pattern of composition over regions

    vs. SPLITR • MuSiC predicts large proportions of Endo + Mural (Peric) • Both estimate lower proportions of Excit ◦ MuSiC is more extreme and also predicts low portion Inhib Bisque & MuSiC vs SPLITR Different deconvolution methods, bulk RNA-seq data source, marker genes, and reference snRNA-seq data
  36. • Run with set of 20 & 25 marker genes

    per cell type • Bisque is more robust to changes in the marker set than MuSiC Method Sensitivity to Marker Set 25 vs. 20 Genes
  37. • Different cell types change in different ways - Excit

    & Mural are the most dramatic
  38. MDDseq Results

  39. Average Compositions

  40. Compare Proportion of Cell Types over Dx

  41. 5/30 t-test are significant (p.bonf < 0.05) Amygdala

  42. 9/30 t-test are significant (p.bonf < 0.05) sACC

  43. 0/20 t-test are significant (p.bonf < 0.05) Compare Sex

  44. Correlate Proportion Cell Type vs. qSVs

  45. Data Driven Identification of Total RNA Expression Genes “TREGs” for

    estimation of RNA abundance in heterogeneous cell types
  46. Background • Most deconvolution algorithms don't consider cell types have

    different cell sizes and transcriptional activity ◦ Ex. Neurons are more active than Glia ◦ Can only predict fraction of RNA not fraction of cells in tissue ◦ Sosina et al., F1000Research, 2021 MDDseq Bisque
  47. Bulk RNA-seq Goals for RNAscope Experiment • Deconvolution R01 MH123183

    ◦ Kristen Maynard, Stephanie C Hicks • Use six slices of DLPFC to generate corresponding RNA-seq & RNAscope data • This information will be useful to evaluate and design deconvolution algorithms DLPFC Bulk RNA-seq snRNA-seq Spatial RNAscope RNAscope 47 polyA RiboZero
  48. Motivation • Improve Deconvolution algorithms by considering differences in size

    and RNA content between cell types • Use smFISH with RNAscope to establish data set of: ◦ Cellular composition ◦ Nuclei sizes of major cell types ◦ Average nuclei RNA content of major cell types How do we measure total RNA content of a cell if we can only observe a few genes at a time? Use a TREG
  49. What is a TREG? • Total RNA Expression Gene •

    Expression is proportional to the overall RNA expression in a cell • In smFISH the count of TREG puncta in a cell can estimate the RNA content ◦ Linking RNA content to nucleus size
  50. Housekeeping Genes • Unrelated to protein function • Tissue specific

    • Expression tracks with total expression (should vary) • Involved in basic cellular function • Consistent across tissues • Expressed evenly across tissue • expressed in every cell TREGs We will compare candidate TREGs to housekeeping gene POLR2A to examine differences
  51. Quality Wish List TREG • Non-zero expression in most cells

    across tissues and cell types • Expressed at a constant level in respect to other genes across different cell types RNAscope • Expressed in top 50% of genes for easy detection • Show a dynamic range of puncta* • Individual puncta are countable* * Will need to evaluated experimentally
  52. Data Driven TREG Discovery Data • Human Brain postmortem 10x

    snRNA-seq ◦ Tran, Maynard et al., Neuron, 2021 ◦ Eight donors, 5 Brain Regions ◦ Nine broad cell types • smFISH with RNAscope in DLPFC tissue ◦ 3 tissue sections from one donor Method Overview 1. Filter to top 50% expressed genes 2. Filter out genes with high proportion of zero expression 3. Select genes with high Rank Invariance as candidate TREGs 4. Evaluate candidate gene performance in RNAscope experiment @mattntran Matthew N Tran Kelsey D Montgomery Sang Ho Kwon
  53. Data Driven TREG Discovery - Method rank(x) gives you high

    ranks for high values in x, with average ranks for ties
  54. Gene Filtering • First filtered to top 50% of 23k

    genes in dataset • Compute Proportion Zero • Used distribution to pick 0.75 as max Proportion Zero cutoff ◦ For a gene to pass the maximum proportion of zeros among groups must be less than the cutoff • Only 877 (3.8%) of genes pass ◦ POLR2A doesn’t pass
  55. Evaluate for Rank Invariance • How variable is the Expression

    Rank of a gene within a cell type? Between cell types? • Expressed at a constant level in respect to other genes = High Rank Invariance ◦ Evaluate within and between cell types Expression Rank
  56. Rank Invariance Calculation C1 C2 G1 G2 G3 G4 G1

    4 G2 2 G3 3 G4 1 C1 C2 G1 3.5 4 G2 3.5 3 G3 2 1 G4 1 2 Rank cell (gxc) Rank rowMean (gx1) Filtered genes, for one cell type (gxc) C1 C2 G1 .5 0 G2 .5 1 G3 -1 2 G4 0 -1 Difference of ranks (gxc) G1 0.25 G2 0.75 G3 1.5 G4 0.5 Rank rowMean (gx1) G1 1 G2 3 G3 4 G4 2 rowMean of absolute differences (gx1) How different is the Expression Rank of the gene in one nuclei vs the average nuclei for that cell type? Low variation = low difference = low rank
  57. C1 C2 C3 C4 C5 G1 G2 G3 G4 R

    B Y G1 1 1 1 G2 3 2 2 G3 4 3 3.5 G4 2 4 3.5 G1 4 G2 3 G3 1 G4 2 Calculate “rank-difference rowMeans” for each cell type Reverse-Rank of rowSums “Rank Invariance” (gx1) Filtered genes, grouped by cell type (gxc) Rank Invariance Calculation Low variation across groups = low rank sum = high rank invariance
  58. TREG R Package Functions 58 • Gene filtering ◦ get_prop_zero()

    ◦ filter_prop_zero() • Rank Invariance step-wise ◦ rank_cells() ◦ rank_group() ◦ rank_invariance() • rank_invariance_express() ◦ Completes full RI calculations https://github.com/LieberInstitute/TREG
  59. Candidate TREGs • Selected AKT3, ARID1B, and MALAT1 as our

    candidate TREGs ◦ In top 10 RI values, and has available probes • TREGs show consistent Expression Ranks within all cells and with in most cell types Expression Rank Expression Rank Expression Rank
  60. Validate in snRNA-seq data • TREGs showed strong correlation between

    gene expression and total reads per cell • log2(counts + 1) ~ log2(sum) + cellType Total RNA per nucleus in snRNA-seq Expression of TREG per nucleus
  61. Validate with RNAscope • DLPFC from control, sectioned at 10μm

    • 3 slides with 3 sections each ◦ TREG candidate + cell type marker genes • Images analyzed with HALO TREG Gene AKT3 ARID1B MALAT1/ POLR2A Cell Type Markers GAD1, SLC17A7, MBP GAD1, SLC17A7, MBP SLC17A7, MBP Cell Type Marker Excit SLC17A7 Inhib GAD1 Oligo MBP Kelsey D Montgomery Sang Ho Kwon
  62. HALO Analysis • HALO identified 944k cells across the 9

    slides ◦ 90k-100k per sample • Some QC was required to remove out of focus regions • HALO data tracks really well with what we see visually in the slides and expected anatomy of the tissue ◦ Neurons are larger ◦ Neurons are transcriptionally more active ◦ Excit, Inhib and Oligo are where you would expect them to be Typo** GAD1
  63. Patterns of Observed Puncta • TREGs were expressed in most

    cells • AKT3 tracks really well with pattern of expression seen in snRNA-seq (ARID1B is also pretty good) snRNA-seq RNAscope Gene Mean Prop. Cells with Expression Prop. non-zero in DLPFC snRNA Standardized β (95% CI) AKT3 0.948 0.92 -1.38 (-1.39,-1.37) ARID1B 0.908 0.94 -0.62 (-0.62,-0.61) MALAT1 0.910 1.00 -0.11 (-0.12,-0.11) POLR2A 0.853 0.30 -0.98 (-0.99,-0.98) snRNA-seq NA NA -1.33 (-1.35,-1.31) Remember: MALAT1’s puncta data is unreliable
  64. Conclusions • Proportion Zero filtering + Rank Invariance allow selection

    of candidate TREGs in snRNA-seq • AKT3 appears to be a TREG compatible with RNAscope in the human brain • TREGs would allow the observation of total RNA expression via RNAscope ◦ Allow us to collect data that can be used to improve deconvolution algorithms
  65. Suggestions for the present and future • We recommend Bisque

    over MuSiC for postmortem human brain data • You might want to use the mean ratio method for selecting snRNA-seq cell type marker genes • You’ll benefit from having more than 3 donors (we had 8) in your snRNA-seq data • Plan to generate RNAscope data for some major cell types using adjacent tissue dissections: could be a gold standard • Use a TREG (AKT3) in your RNAscope to estimate nuclei RNA and size: could be useful input parameters in future deconvolution algorithms https://research.libd.org/DeconvoBuddies/ http://research.libd.org/TREG/
  66. @MadhaviTippani Madhavi Tippani @HeenaDivecha Heena R Divecha @lmwebr Lukas M

    Weber @stephaniehicks Stephanie C Hicks @abspangler Abby Spangler @martinowk Keri Martinowich @CerceoPage Stephanie C Page @kr_maynard Kristen R Maynard @lcolladotor Leonardo Collado-Torres @Nick-Eagles (GH) Nicholas J Eagles Kelsey D Montgomery Sang Ho Kwon Image Analysis Expression Analysis Data Generation Thomas M Hyde @lahuuki Louise A Huuki-Myers @JoshStolz2 Joshua M Stolz Moods Workgroup Fernando Goes Patricia Braun Peter Zandi Thomas Hyde Joel Kleinman Christopher Ross Shizhong Han @mattntran Matthew N Tran @sowmyapartybun Sowmya Parthiban https://www.libd.org/careers/