$30 off During Our Annual Pro Sale. View Details »

psychgenomics-2022

 psychgenomics-2022

Talk for Psychgenomics at Ichan School of Medicine, NY. Invited by Kiran Girdhar.

Leonardo Collado-Torres

March 03, 2022
Tweet

More Decks by Leonardo Collado-Torres

Other Decks in Science

Transcript

  1. Deconvolution of RNA-seq data in the
    postmortem human brain: the
    present and preparing the future
    Psychgenomics:
    Ichan School of Medicine
    Leonardo Collado Torres
    @lcolladotor 2022-03-03
    https://lcolladotor.github.io/bioc_team_ds/

    View Slide

  2. LIEBER INSTITUTE for BRAIN DEVELOPMENT
    https://research.libd.org/DeconvoBuddies/
    http://research.libd.org/TREG/
    @lahuuki
    Louise A Huuki-Myers
    @kr_maynard
    Kristen R Maynard
    @stephaniehicks
    Stephanie C Hicks
    @martinowk
    Keri Martinowich
    NIH grant R01MH123183
    @CerceoPage
    Stephanie C Page

    View Slide

  3. Background: Cell Types in the Brain
    ● The brain is made of complex tissues consisting
    of different types of cells
    ● Some Dx associated with changes in cell type
    specific expression
    ○ Ex. Pitt-Hopkins syndrome and oligodendrocytes (Phan
    et al, Nature Neuroscience, 2020)
    3

    View Slide

  4. ● Inferring the composition of different cell types in a bulk RNA-seq data
    What is Deconvolution?
    Tissue
    Bulk RNA-seq
    snRNA-seq
    Estimated proportions
    4
    Deconvolution
    Get single cell like data
    from bulk RNA-seq
    $$$
    $
    Free!
    https://twitter.com/BoXia7/status/1261464021322137600

    View Slide

  5. Why is Deconvolution Important?
    ● Tissue is heterogeneous
    ● Samples can differ in cell type composition due to biology or dissection
    ● Different cell types express genes at different levels
    ● Check for differences between groups (i.e. case vs. control)
    ● Controlling for cell fractions between samples can make differential
    expression analysis cleaner
    ○ Confounding factor in differential expression analysis
    ○ Deconvoluting data can prevent false-positives and false-negatives
    5

    View Slide

  6. What Challenges Exist?
    In Pipeline Development
    ● Method Selection
    ● Data selection
    ○ Region specific or all available data
    ○ Cell type resolution: Broad vs. specific
    ○ NeuN + non-Neun sorted
    ● Marker Gene Selection
    ○ How to find best markers?
    ○ How many markers?
    ● Validating Results
    ○ What do we expect to find?
    In the Field
    ● Differences between snRNA-seq and bulk
    RNA-seq
    ○ Nuclear vs. cytoplasmic compartments
    ○ Missing cell types in reference
    ● Cell sizes
    ● RNA-fraction vs. Cell fraction
    Sosina et al, F1000Research, 2021
    Future Work: Maynard + Hicks R01MH123183
    6

    View Slide

  7. Deconvolution
    Methods
    7

    View Slide

  8. What Methods Exist?
    ● Regression based
    ○ Define bulk cell-type specific expression profiles, regress out optimal cell proportions from
    bulk data
    ○ Examples: CIBERSORT, OLS, nnls, RLR, deconvoSeq
    ● Single Cell RNA-seq reference based
    ○ Use a annotated single cell expression data as reference
    ○ Examples: DWLS, MuSiC, Bisque, SCDC, SPLITR
    8

    View Slide

  9. Method Summary
    Method Regression
    Correction for
    Technical
    Variation
    Other Features
    MuSiC
    Wang et al, Nature
    Communications, 2019
    W-NNLS regression
    (Weighted -
    Non-negative least
    squares)
    None
    Tree guided deconvolution,
    good for closely related cell
    types
    Bisque
    Jew et al, Nature
    Communications, 2020
    NNLS regresion
    Gene specific
    transformation of
    bulk data
    Leverage overlapping bulk &
    sc data
    SCDC
    Dong et al, Briefings in
    Bioinformatics, 2020
    W-NNLS framework
    proposed by MuSiC
    Option for Gene
    specific
    transformation of
    bulk data (from
    Bisque)
    Multiple reference datasets
    can be used, results
    combined with ENSEMBL
    weights
    DWLS
    Tsoucas, Nature
    Communications, 2019
    Dampened Weighted
    least squares
    None
    9

    View Slide

  10. Method
    Regression
    Method
    Run
    Time
    Marker
    Evaluation
    Adjust for
    snRNA-seq
    vs. Bulk
    Tissues
    Tested
    Consider Cell
    Size
    Reference
    Set
    MuSiC W-NNLS Min.
    Internal
    Weighting
    No
    Pancreatic
    Islet, Rat &
    Mouse Kidney
    Yes
    Bisque NNLS Min. No Yes
    Adipose,
    DLPFC
    Recommend
    3+ donors
    SCDC
    W-NNLS
    Min.
    Internal
    Weighting
    Yes
    Pancreatic
    Islet, mouse
    mammary
    Can input
    multiple
    references
    DWLS DWLS Hours
    Internal
    Selection
    No
    Mouse kidney,
    lung, liver,
    small intestine
    10

    View Slide

  11. Which Method is the Most Accurate?
    ● Benchmarking shows that different methods perform best on
    different data sets (Cobos et al, Nature Communications, 2020)
    ● Benchmarking results from different papers on “real” data
    ○ MuSiC paper: MuSiC > NNLS > BSEQ-sx > CIBERSORT
    ■ Pancreatic Islet: Beta cells vs. HbA1c (Fig 2a)
    ○ Bisque paper: Bisque > MuSiC > CIBERSORT
    ■ DLPFC: Microglia vs. Braak stage, Neuron vs. Cognitive diagnostic category
    (Fig 4)
    ○ SCDC paper: SCDC > MuSiC > Bisque > DWLS > CIBERSORT
    ■ Pancreatic Islet: Beta cells vs. HbA1c (Fig 4b)
    ○ Cobos benchmark: DWLS > MuSiC > Bisque > deconvoSeq
    ■ Human PMBC flow sorted (Fig 7)
    11

    View Slide

  12. What Methods are We Exploring?
    ● MuSiC & Bisque
    What Methods are we Ignoring?
    ● SCDC is redundant: we only have 1 reference set
    ● DWLS seems basic and slow
    12

    View Slide

  13. Datasets
    13

    View Slide

  14. What Data are We Using?
    LIBD Bulk RNA-seq
    ● MDDSeq: MDD + control in sACC + AMY (Goes et al. DOI: 10.7303/syn4921369)
    ● BipSeq: BPD + control in sACC + AMY (Zandi et al. DOI: 10.7303/syn5844980)
    ● BrainSEQ phase I + II (DOI: 10.1038/s41593-018-0197-y, 10.1016/j.neuron.2019.05.013)
    ○ polyA (DLPFC) vs. RiboZero (DLPFC + HPC)
    LIBD Single Nucleus
    ● Tran, Maynard et al 10x snRNA-seq (DOI: 10.1016/j.neuron.2021.09.001)
    ○ 5 Brain Regions, 70,615 nuclei, 8 donors
    Publicly Available
    ● GTEx version 8 via recount3 (10.1186/s13059-021-02533-6)
    ○ 13 Brain Regions, 3,387 samples, 1,127 individuals
    14

    View Slide

  15. 10x snRNA-seq Reference Data
    Tran, Maynard et al., Neuron, 2021
    AMY DLPFC HPC NAc sACC
    Astro 1638 782 1170 1099 907
    Endo 31 0 0 0 0
    Macro 0 10 0 22 0
    Micro 1168 388 1126 492 784
    Mural 39 18 43 0 0
    Oligo 6080 5455 5912 6134 4584
    OPC 1459 572 838 669 911
    Tcell 31 9 26 0 0
    Excit 443 2388 623 0 4163
    Inhib 3117 1580 366 11476 3974
    @mattntran
    Matthew N Tran

    View Slide

  16. Dataset Regions Samples Case Control Project Status
    BipSeq sACC + AMY 511 247 BPD 264 Revisions
    Suicide Genomics DLPFC 329 226 103 Revisions
    BrainSeq Phase III Caudate 464 298 SCZD 266 Revisions
    MDDseq sACC + AMY 1091
    704
    MDD/BPD
    387 Active
    AANRI
    DG, Caudate,
    Hippo, DLPFC
    1647
    (263, 464,
    447, 453)
    - - Active
    BrainSeq Phase I DLPFC 727 395 SCZD 332 Exploratory
    BrainSeq Phase II DLPFC 453 153 SCZD 300 Exploratory
    GTEx 13 Regions 2670 - - Exploratory
    Degradation
    AMY, Caudate,
    DLPFC, HIPPO,
    mPFC, sACC
    119 - - Exploratory

    View Slide

  17. Marker Finding
    17

    View Slide

  18. What are Marker Genes?
    ● “Define” cell types
    ○ Differentially expressed between cell types
    ● Historically
    ○ Know markers associated with key cell types
    ○ Ex. MBP: major constituent of the myelin sheath, marker for oligodendrocytes
    ● What does the Data tell us?
    ○ Human vs. model organisms
    ○ Regional
    ○ Technical differences
    18

    View Slide

  19. Marker Gene Selection
    ● Filter for genes expressed in snRNA-seq and
    bulk data
    ● Looking for genes expressed in only one cell
    type
    ○ Test for specificity of each gene for each cell
    type
    ● Observe expression of selected marker genes
    ○ Heat maps of pseudo-bulked data
    ■ Summation of counts from nuclei from
    one donor + cell type
    ○ Violin plots by cell type
    Marker Genes shared by sn & bulk
    RNA-seq
    The Ideal Heatmap
    snRNA-seq data,
    Pseudo-bulked by cell type + donor
    19
    Stephanie C Hicks

    View Slide

  20. Differential Expression :
    One vs. All
    ● scran::findMarkers()
    ● Target cell type vs. all other cells
    ● Select genes with highest fold change
    ● Observed lots of noise between different
    cell types
    snRNA-seq sACC
    Pseudo-bulked by cell type + donor
    The Actual Heatmap
    20

    View Slide

  21. Exploring Marker Expression
    ● T-test between two groups
    ● Fold change between groups
    21

    View Slide

  22. Exploring Marker Expression
    Where does this noise come from?
    ● Outliers in one of more non-target
    cell type
    ○ Here OPCs are expressing MBP
    22

    View Slide

  23. Our Solution: Mean Ratio
    Target
    Highest non-target
    Mean Expression
    target cell type
    Mean Expression
    highest non-target
    cell type
    = Mean Ratio
    Higher mean ratio:
    ● the more specific that gene is to
    the target cell type
    ● the better a marker gene it is
    23

    View Slide

  24. Mean Ratio vs. Fold Change
    ● Genes with high mean ratio also have high
    fold changes
    ● Not all genes with high fold changes have
    high mean ratios
    ● Selecting marker genes by mean ratio
    helps avoid “noisy” genes
    24

    View Slide

  25. 1vAll Markers vs. Mean Ratio Markers
    25

    View Slide

  26. 1vAll Markers vs. Mean Ratio Markers
    26

    View Slide

  27. How Many Markers?
    27
    ● As many look like outliers in the “worst”
    cell type
    ○ Least amount of signal
    ○ Balance overfitting vs. adding noise
    ○ Looking at Inhib: we chose 25 markers
    ● Same number for each cell type
    ● Typical methods [ findMarkers() ] are
    useful for finding cell identity markers, not
    necessarily good cell decomposition
    markers

    View Slide

  28. How Many Markers?
    ● This becomes more difficult with more
    specific cell types
    ● We are looking for genes with big
    differences between cell types: hence why
    it’s easier to work at the broad cell type
    resolution
    28

    View Slide

  29. Final QC Reference set
    Top 25 for 10 cell types
    ● Drop 2 worst macrophage genes
    ● Drop two macrophage-like
    microglia clusters
    ● 248 marker genes genes in total
    ● 70,527 nuclei post-QC

    View Slide

  30. Why we like Bisque
    ● Benchmarked with a DLPFC dataset
    ● Robust to marker set
    ● Robust to library prep
    ● More reasonable estimates on GTEx dataset
    ● With final reference dataset (8 donors) we have more than the 4 donors
    recommended

    View Slide

  31. GTEx tests

    View Slide

  32. Deconvolution of GTEX Brain Data
    Park et al., bioRxiv, 2021
    ● Deconvolute three large human
    brain datasets
    ○ GTEx, MAYO, ROSMAP
    ● Introduce new deconvolution
    algorithm: SPLITR
    ● 48 donor reference set
    ● Validate with previous composition
    estimates + immunostaining
    ● Variation across brain regions
    Figure 2

    View Slide

  33. Mean Proportions By Region: Tran et al, bioRxiv, 2020 (5 donors, 6 cell types)

    View Slide

  34. Peric =
    Mural + Endo
    Mean Proportions By Region: Tran et al, Neuron, 2021 (8 donors, 10 cell types)

    View Slide

  35. ● Bisque has more similar
    pattern of composition over
    regions vs. SPLITR
    ● MuSiC predicts large
    proportions of Endo + Mural
    (Peric)
    ● Both estimate lower
    proportions of Excit
    ○ MuSiC is more extreme
    and also predicts low
    portion Inhib
    Bisque & MuSiC vs SPLITR
    Different deconvolution methods, bulk RNA-seq data source,
    marker genes, and reference snRNA-seq data

    View Slide

  36. ● Run with set of 20
    & 25 marker genes
    per cell type
    ● Bisque is more
    robust to changes
    in the marker set
    than MuSiC
    Method Sensitivity to Marker Set
    25 vs. 20 Genes

    View Slide

  37. ● Different cell types change in different ways - Excit & Mural
    are the most dramatic

    View Slide

  38. MDDseq Results

    View Slide

  39. Average Compositions

    View Slide

  40. Compare Proportion of Cell Types over Dx

    View Slide

  41. 5/30 t-test are significant (p.bonf < 0.05)
    Amygdala

    View Slide

  42. 9/30 t-test are significant (p.bonf < 0.05)
    sACC

    View Slide

  43. 0/20 t-test are significant
    (p.bonf < 0.05)
    Compare Sex

    View Slide

  44. Correlate Proportion Cell Type vs. qSVs

    View Slide

  45. Data Driven Identification of Total
    RNA Expression Genes “TREGs” for
    estimation of RNA abundance in
    heterogeneous cell types

    View Slide

  46. Background
    ● Most deconvolution algorithms don't consider cell types
    have different cell sizes and transcriptional activity
    ○ Ex. Neurons are more active than Glia
    ○ Can only predict fraction of RNA not fraction of cells in tissue
    ○ Sosina et al., F1000Research, 2021
    MDDseq Bisque

    View Slide

  47. Bulk RNA-seq
    Goals for RNAscope Experiment
    ● Deconvolution R01 MH123183
    ○ Kristen Maynard, Stephanie C Hicks
    ● Use six slices of DLPFC to generate
    corresponding RNA-seq & RNAscope
    data
    ● This information will be useful to
    evaluate and design deconvolution
    algorithms
    DLPFC
    Bulk RNA-seq
    snRNA-seq
    Spatial
    RNAscope
    RNAscope
    47
    polyA
    RiboZero

    View Slide

  48. Motivation
    ● Improve Deconvolution algorithms by considering differences in size and RNA
    content between cell types
    ● Use smFISH with RNAscope to establish data set of:
    ○ Cellular composition
    ○ Nuclei sizes of major cell types
    ○ Average nuclei RNA content of major cell types
    How do we measure total RNA content of a cell if we can only observe a few
    genes at a time? Use a TREG

    View Slide

  49. What is a TREG?
    ● Total RNA Expression Gene
    ● Expression is proportional to the
    overall RNA expression in a cell
    ● In smFISH the count of TREG
    puncta in a cell can estimate the
    RNA content
    ○ Linking RNA content to nucleus size

    View Slide

  50. Housekeeping
    Genes
    ● Unrelated to protein
    function
    ● Tissue specific
    ● Expression tracks
    with total expression
    (should vary)
    ● Involved in basic
    cellular function
    ● Consistent across
    tissues
    ● Expressed evenly
    across tissue
    ● expressed in
    every cell
    TREGs
    We will compare candidate TREGs to housekeeping
    gene POLR2A to examine differences

    View Slide

  51. Quality Wish List
    TREG
    ● Non-zero expression in most cells across
    tissues and cell types
    ● Expressed at a constant level in respect to
    other genes across different cell types
    RNAscope
    ● Expressed in top 50% of genes for easy
    detection
    ● Show a dynamic range of puncta*
    ● Individual puncta are countable*
    * Will need to evaluated experimentally

    View Slide

  52. Data Driven TREG Discovery
    Data
    ● Human Brain postmortem 10x snRNA-seq
    ○ Tran, Maynard et al., Neuron, 2021
    ○ Eight donors, 5 Brain Regions
    ○ Nine broad cell types
    ● smFISH with RNAscope in DLPFC tissue
    ○ 3 tissue sections from one donor
    Method Overview
    1. Filter to top 50% expressed genes
    2. Filter out genes with high proportion of
    zero expression
    3. Select genes with high Rank Invariance as
    candidate TREGs
    4. Evaluate candidate gene performance in
    RNAscope experiment
    @mattntran
    Matthew N Tran
    Kelsey D Montgomery Sang Ho Kwon

    View Slide

  53. Data Driven TREG Discovery - Method
    rank(x) gives you high
    ranks for high values in x,
    with average ranks for ties

    View Slide

  54. Gene Filtering
    ● First filtered to top 50% of 23k genes in
    dataset
    ● Compute Proportion Zero
    ● Used distribution to pick 0.75 as max
    Proportion Zero cutoff
    ○ For a gene to pass the maximum
    proportion of zeros among groups must
    be less than the cutoff
    ● Only 877 (3.8%) of genes pass
    ○ POLR2A doesn’t pass

    View Slide

  55. Evaluate for Rank Invariance
    ● How variable is the Expression Rank of a
    gene within a cell type? Between cell
    types?
    ● Expressed at a constant level in respect to
    other genes = High Rank Invariance
    ○ Evaluate within and between cell types
    Expression Rank

    View Slide

  56. Rank Invariance Calculation
    C1 C2
    G1
    G2
    G3
    G4
    G1 4
    G2 2
    G3 3
    G4 1
    C1 C2
    G1 3.5 4
    G2 3.5 3
    G3 2 1
    G4 1 2
    Rank cell
    (gxc)
    Rank rowMean
    (gx1)
    Filtered genes,
    for one cell type
    (gxc)
    C1 C2
    G1 .5 0
    G2 .5 1
    G3 -1 2
    G4 0 -1
    Difference of
    ranks (gxc)
    G1 0.25
    G2 0.75
    G3 1.5
    G4 0.5
    Rank rowMean
    (gx1)
    G1 1
    G2 3
    G3 4
    G4 2
    rowMean of
    absolute
    differences
    (gx1)
    How different is the Expression
    Rank of the gene in one nuclei vs
    the average nuclei for that cell
    type?
    Low variation = low difference = low rank

    View Slide

  57. C1 C2 C3 C4 C5
    G1
    G2
    G3
    G4
    R B Y
    G1 1 1 1
    G2 3 2 2
    G3 4 3 3.5
    G4 2 4 3.5
    G1 4
    G2 3
    G3 1
    G4 2
    Calculate
    “rank-difference
    rowMeans”
    for each cell type
    Reverse-Rank of
    rowSums
    “Rank Invariance”
    (gx1)
    Filtered genes,
    grouped by
    cell type (gxc)
    Rank Invariance Calculation
    Low variation across groups = low rank sum
    = high rank invariance

    View Slide

  58. TREG R Package
    Functions
    58
    ● Gene filtering
    ○ get_prop_zero()
    ○ filter_prop_zero()
    ● Rank Invariance step-wise
    ○ rank_cells()
    ○ rank_group()
    ○ rank_invariance()
    ● rank_invariance_express()
    ○ Completes full RI calculations
    https://github.com/LieberInstitute/TREG

    View Slide

  59. Candidate TREGs
    ● Selected AKT3, ARID1B, and
    MALAT1 as our candidate
    TREGs
    ○ In top 10 RI values, and
    has available probes
    ● TREGs show consistent
    Expression Ranks within all
    cells and with in most cell
    types
    Expression Rank
    Expression Rank
    Expression Rank

    View Slide

  60. Validate in snRNA-seq data
    ● TREGs showed strong correlation between
    gene expression and total reads per cell
    ● log2(counts + 1) ~ log2(sum) +
    cellType
    Total RNA per nucleus in snRNA-seq
    Expression of TREG per nucleus

    View Slide

  61. Validate with RNAscope
    ● DLPFC from control, sectioned at 10μm
    ● 3 slides with 3 sections each
    ○ TREG candidate + cell type marker genes
    ● Images analyzed with HALO
    TREG Gene AKT3 ARID1B
    MALAT1/
    POLR2A
    Cell Type
    Markers
    GAD1, SLC17A7,
    MBP
    GAD1, SLC17A7,
    MBP
    SLC17A7, MBP
    Cell Type Marker
    Excit SLC17A7
    Inhib GAD1
    Oligo MBP
    Kelsey D Montgomery Sang Ho Kwon

    View Slide

  62. HALO Analysis
    ● HALO identified 944k cells
    across the 9 slides
    ○ 90k-100k per sample
    ● Some QC was required to
    remove out of focus regions
    ● HALO data tracks really well
    with what we see visually in the
    slides and expected anatomy of
    the tissue
    ○ Neurons are larger
    ○ Neurons are transcriptionally
    more active
    ○ Excit, Inhib and Oligo are
    where you would expect them
    to be
    Typo**
    GAD1

    View Slide

  63. Patterns of Observed Puncta
    ● TREGs were expressed in most
    cells
    ● AKT3 tracks really well with
    pattern of expression seen in
    snRNA-seq (ARID1B is also
    pretty good)
    snRNA-seq RNAscope
    Gene
    Mean Prop.
    Cells with
    Expression
    Prop. non-zero
    in DLPFC snRNA
    Standardized β
    (95% CI)
    AKT3 0.948 0.92 -1.38 (-1.39,-1.37)
    ARID1B 0.908 0.94 -0.62 (-0.62,-0.61)
    MALAT1 0.910 1.00 -0.11 (-0.12,-0.11)
    POLR2A 0.853 0.30 -0.98 (-0.99,-0.98)
    snRNA-seq NA NA -1.33 (-1.35,-1.31)
    Remember: MALAT1’s puncta data is unreliable

    View Slide

  64. Conclusions
    ● Proportion Zero filtering + Rank Invariance allow selection of candidate
    TREGs in snRNA-seq
    ● AKT3 appears to be a TREG compatible with RNAscope in the human brain
    ● TREGs would allow the observation of total RNA expression via RNAscope
    ○ Allow us to collect data that can be used to improve deconvolution algorithms

    View Slide

  65. Suggestions for the present and future
    ● We recommend Bisque over MuSiC for postmortem human brain data
    ● You might want to use the mean ratio method for selecting snRNA-seq cell
    type marker genes
    ● You’ll benefit from having more than 3 donors (we had 8) in your snRNA-seq
    data
    ● Plan to generate RNAscope data for some major cell types using adjacent
    tissue dissections: could be a gold standard
    ● Use a TREG (AKT3) in your RNAscope to estimate nuclei RNA and size: could
    be useful input parameters in future deconvolution algorithms
    https://research.libd.org/DeconvoBuddies/
    http://research.libd.org/TREG/

    View Slide

  66. @MadhaviTippani
    Madhavi Tippani
    @HeenaDivecha
    Heena R Divecha
    @lmwebr
    Lukas M Weber
    @stephaniehicks
    Stephanie C Hicks
    @abspangler
    Abby Spangler
    @martinowk
    Keri Martinowich
    @CerceoPage
    Stephanie C Page
    @kr_maynard
    Kristen R Maynard
    @lcolladotor
    Leonardo Collado-Torres
    @Nick-Eagles (GH)
    Nicholas J Eagles
    Kelsey D Montgomery
    Sang Ho Kwon
    Image Analysis
    Expression Analysis
    Data Generation
    Thomas M Hyde
    @lahuuki
    Louise A Huuki-Myers
    @JoshStolz2
    Joshua M Stolz
    Moods Workgroup
    Fernando Goes
    Patricia Braun
    Peter Zandi
    Thomas Hyde
    Joel Kleinman
    Christopher Ross
    Shizhong Han
    @mattntran
    Matthew N Tran
    @sowmyapartybun
    Sowmya Parthiban
    https://www.libd.org/careers/

    View Slide