2025 LIBD lcolladotor data science team TL;DR

Slide 1

Slide 1 text

R/Bioconductor-powered Team Data Science
Data Science Team 1, Translational Neuroscience Division
Leonardo Collado Torres, LIBD Investigator + Asst. Prof. Johns Hopkins Biostatistics
January 29th, 2025
Orchestrated by Louise Huuki-Myers
Slides available at speakerdeck.com/lcolladotor
@lcolladotor.bsky.social
lcolladotor.github.io
lcolladotor.github.io/bioc_team_ds

Slide 2

Slide 2 text

Spatially Resolved Transcriptomics with Visium Presented By Manisha Barse

Slide 3

Slide 3 text

Visium Technologies
Standard Visium (Fresh Frozen)
Visium CytAssist (HD or non HD) (Fresh Frozen, Fixed Frozen, FFPE)
● Sequencing-based spatial profiling technology developed by 10x Genomics.
● Visualize cells and their location in a tissue sample.
● Measure and map gene activity throughout the tissue.

Slide 4

Slide 4 text

Wet Bench + Preprocessing w/ SpaceRanger
Visium CytAssist
LIBD Repository and Dissections
Kristen Maynard, Stephanie Page, Sarah Maguire, Ruth Zhang, Ryan Miller
Data Analysis
LIBD Data Science Team
https://speakerdeck.com/manishabarse/exploring-spaceranger-summary-metrics
https://www.youtube.com/watch?v=cJqsbDh0ZtI

Slide 5

Slide 5 text

Quality Control
Levels of QC in spatial transcriptomics: Artifact Level, Spot Level, Sample Level
● scran
● SpotSweeper
https://pmc.ncbi.nlm.nih.gov/articles/PMC11185656/

Slide 6

Slide 6 text

Feature Selection
● Identify highly variable genes (HVGs)
● Identify spatially variable genes (SVGs)
Dimension Reduction
● GLM-PCA
● UMAP
● t-SNE
Batch Correction
● Harmony
● ComBat
● scVI: single-cell Variational Inference

Slide 7

Slide 7 text

Clustering and Annotating
● Use spatial clustering algorithms like BayesSpace and PRECAST.
● Different resolutions of BayesSpace clustering:
  1. k = 2: separates gray and white matter
  2. k = 9: best recapitulates histological layers
More clusters = More complexity
https://doi.org/10.1126/science.adh1938

Slide 8

Slide 8 text

Spatial registration
● Adds anatomical context
● Map SpDs (spatial domains) to the manually annotated layers from Maynard et al.
● Correlate enrichment t-statistics across the top marker genes of the reference.
● Highlight the most strongly associated histological layer to add biological context.
https://doi.org/10.1126/science.adh1938
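The registration idea on this slide can be sketched in a few lines. This is only an illustration with made-up t-statistics and hypothetical names (`pearson`, `register_domains`, `SpD1`, `SpD2`); it is not the spatialLIBD implementation, which works on full model outputs and selected marker genes.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def register_domains(domain_t, reference_t):
    """For each spatial domain, find the reference layer whose enrichment
    t-statistics correlate most strongly with the domain's t-statistics.

    Both inputs map a label to a list of t-statistics over the same genes,
    in the same order.
    """
    annotation = {}
    for domain, t_stats in domain_t.items():
        cors = {layer: pearson(t_stats, ref) for layer, ref in reference_t.items()}
        best_layer = max(cors, key=cors.get)
        annotation[domain] = (best_layer, cors[best_layer])
    return annotation

# Toy enrichment t-statistics (made-up numbers) for the same four genes
reference_t = {
    "L1": [5.0, -1.0, 0.5, -2.0],
    "WM": [-3.0, 4.0, -0.5, 2.5],
}
domain_t = {
    "SpD1": [4.0, -0.5, 0.2, -1.5],
    "SpD2": [-2.0, 3.5, 0.0, 2.0],
}

annotation = register_domains(domain_t, reference_t)
for domain, (layer, r) in annotation.items():
    print(domain, "best matches", layer, "with r =", round(r, 2))
```

In practice one would also look at the full correlation matrix as a heatmap, since a domain can correlate comparably with two neighboring layers.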

Slide 9

Slide 9 text

spatialLIBD apps
● http://spatial.libd.org/spatialLIBD/
● https://research.libd.org/spatialDLPFC/#interactive-websites
● https://research.libd.org/Visium_SPG_AD/#interactive-websites
● …
https://bioconductor.org/packages/release/data/experiment/html/spatialLIBD.html

Slide 10

Slide 10 text

Summary: Visium
● Using both Standard and CytAssist workflows, we analyze diverse tissue types to advance spatial transcriptomics and developmental brain disorder research.

Slide 11

Slide 11 text

Visium HD data analysis Presented By Nick Eagles

Slide 12

Slide 12 text

Visium HD Overview
- Visium HD is composed of square bins, not spots, at 3 different resolutions
- 8μm bin size is recommended for analysis by 10x Genomics
- Compare to 100μm distance between spots!
- ~700k bins, compared to ~5k spots!
- 2μm bin size is subcellular, and can be combined with segmentation to form cells
“Bin level”: 8μm bins
“Cell level”: Individual cells from combining 2μm bins
[Figure labels: 16μm bin border, 8μm bin border, 2μm bin border, cell]
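The relationship between the three bin sizes is plain grid arithmetic: an 8μm bin covers a 4 x 4 block of 2μm bins and a 16μm bin covers an 8 x 8 block. The sketch below illustrates that aggregation on a toy grid. The function name `aggregate_bins` and the dictionary layout are assumptions made for illustration, not the format Space Ranger writes.

```python
from collections import defaultdict

def aggregate_bins(counts_2um, factor=4):
    """Sum counts from 2 um bins into larger square bins.

    counts_2um maps the (row, col) index of a 2 um bin to a count.
    factor=4 yields 8 um bins; factor=8 yields 16 um bins.
    """
    coarse = defaultdict(int)
    for (row, col), count in counts_2um.items():
        coarse[(row // factor, col // factor)] += count
    return dict(coarse)

# Toy 8 x 8 grid of 2 um bins (16 um x 16 um of tissue), one count per bin
toy = {(r, c): 1 for r in range(8) for c in range(8)}

bins_8um = aggregate_bins(toy, factor=4)
bins_16um = aggregate_bins(toy, factor=8)
print(len(bins_8um), "bins at 8 um;", len(bins_16um), "bin at 16 um")
```

Forming "cell level" data differs from this fixed-grid sum: the 2μm bins are grouped by a segmentation mask instead of by integer division of their coordinates.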

Slide 13

Slide 13 text

Visium HD: Goals and Challenges
Goals:
- Explore the current landscape of Visium HD software tools
- What’s useful and feasible to run on our data?
- Develop novel code where there are holes
- Gain interesting insights from LIBD Visium HD datasets (e.g. habenula, DLPFC, HPC samples)
- What questions can we answer with spatially resolved cells?
Challenges:
- Big data: 1 HD sample is larger than 100 standard samples
- Uncharted territory; limited existing solutions

Slide 14

Slide 14 text

Top SVGs (bin level)
1. Lower resolution with SEraster
2. Compute SVGs with nnSVG
[Figure panels labeled 1, 5, 9]

Slide 15

Slide 15 text

FICTURE: habenula * Note: mirrored and rotated data

Slide 16

Slide 16 text

Banksy cell-level clusters k = 2 k = 4 k = 9

Slide 17

Slide 17 text

Transferring cell-type annotation

Slide 18

Slide 18 text

Software Exploration
Software tool | Runs on our data | Run time (per sample) | Required memory (per sample, GB) | Integrates well with SpatialExperiment | Operates on multiple samples simultaneously | Uses GPU
bin2cell | Yes | 30 min | < 32 | Yes | No | No
FICTURE | Yes | 3 hours | < 64 | No | No | No
SEraster | Yes | 20 min | < 64 | Yes | No | No
MERINGUE | Sort of | 1 day | > 400 | Yes | No | No
Giotto | Yes | Varied | Varied | Yes | Yes | No
Banksy | Yes | 1 hour | < 16 | Yes | Yes | No
Rows with blank cells on the slide (values listed in slide order):
HERGAST | No | Yes
ENACT | No | No | No
CellNEST | No | No | Yes

Slide 19

Slide 19 text

Visium HD Summary
- Promising existing software includes FICTURE, bin2cell, Banksy, and SEraster
- Novel code requires careful implementation to control memory and runtime
- bin2cell produces spatially resolved individual cells with many downstream possibilities

Slide 20

Slide 20 text

sc/snRNA-seq data analysis Presented By Melissa Mayén Quiroz

Slide 21

Slide 21 text

Single-Cell RNA-seq - 10x Genomics
Droplet Microfluidics: Single cells encapsulated with barcoded gel beads.
Barcoding: Unique barcodes assign reads to individual cells.
Unique Molecular Identifiers (UMIs): Distinguish real transcripts from PCR artifacts.
Library Preparation: cDNA synthesis and amplification.
NGS Sequencing: Captures transcript data for each cell.
Data Analysis

Slide 22

Slide 22 text

Obtaining samples and preprocessing
Ryan Miller, Staff Scientist
1. Sample Preparation
2. Sequencing
3. Cell Ranger: analysis software suite provided by 10x Genomics for processing and analyzing raw sequencing data from their single-cell RNA sequencing (scRNA-seq) platform
● Demultiplexing
● Read Alignment
● Barcode and UMI Processing
● Gene Quantification

Slide 23

Slide 23 text

Quality Control
● Empty Droplets
● Doublet detection
● Quality Control
  ➔ Low counts
  ➔ Low detected genes
  ➔ High mitochondrial percentage
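The three per-cell QC criteria listed on this slide can be sketched as simple flags. The cutoffs below (500 counts, 250 genes, 10% mitochondrial) and the field names are hypothetical placeholders chosen for the toy data. The slide does not state thresholds, and real analyses often derive them adaptively per sample.

```python
def qc_flags(cell, min_counts=500, min_genes=250, max_mito_pct=10.0):
    """Return the three QC flags for one cell.

    cell is a dict with total_counts, detected_genes, and mito_counts.
    All thresholds are illustrative defaults, not recommended values.
    """
    total = cell["total_counts"]
    # A cell with no counts at all is treated as 100% mitochondrial so it fails QC
    mito_pct = 100.0 * cell["mito_counts"] / total if total > 0 else 100.0
    return {
        "low_counts": total < min_counts,
        "low_genes": cell["detected_genes"] < min_genes,
        "high_mito": mito_pct > max_mito_pct,
    }

# Toy cells with made-up numbers
cells = [
    {"id": "cell_a", "total_counts": 5000, "detected_genes": 2000, "mito_counts": 100},
    {"id": "cell_b", "total_counts": 300, "detected_genes": 150, "mito_counts": 6},
    {"id": "cell_c", "total_counts": 4000, "detected_genes": 1800, "mito_counts": 1200},
]

# Keep only cells that trigger none of the flags
kept = [c["id"] for c in cells if not any(qc_flags(c).values())]
print(kept)
```

Keeping the flags separate, instead of returning a single pass/fail value, makes it easy to tabulate why cells were dropped.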

Slide 24

Slide 24 text

Dimensionality reduction and batch correction
GLM-PCA
GLM-PCA incorporates a generalized linear model (GLM) framework to model the distribution of the data, which makes it more appropriate for count data.
● Accounts for Data Distribution
● Captures Variance
Harmony
Harmony is designed for batch correction and works on data already reduced to a lower-dimensional space, such as PCA embeddings.
● Prevents Overcorrection
● Iterative Refinement

Slide 25

Slide 25 text

Clustering
Shared Nearest Neighbors (SNN) Graph:
● Build an SNN graph by identifying the k-nearest neighbors for each cell in a reduced dimensional space (Harmony-corrected PCA)
● Nodes represent cells, and edges connect cells with overlapping nearest neighbors.
Walktrap Community Detection:
● Apply the Walktrap algorithm to the SNN graph.
● This algorithm identifies clusters by simulating random walks on the graph, grouping nodes that are frequently visited together.
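A minimal sketch of the SNN graph construction described above, on toy 2D coordinates standing in for a Harmony-corrected PCA embedding. The edge weight used here (the number of shared neighbors, counting each cell among its own neighbors) is one simple variant chosen for illustration. Library implementations offer other weighting schemes, and the Walktrap step itself is not implemented here.

```python
import math
from itertools import combinations

def knn(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def snn_graph(points, k):
    """Build a shared nearest neighbor graph as {(i, j): weight}.

    Two cells are connected when their neighbor sets overlap. Each cell is
    included in its own set so that direct neighbors also count as shared.
    """
    nn = knn(points, k)
    sets = [nn[i] | {i} for i in range(len(points))]
    edges = {}
    for i, j in combinations(range(len(points)), 2):
        shared = len(sets[i] & sets[j])
        if shared > 0:
            edges[(i, j)] = shared
    return edges

# Toy embedding: two well-separated groups of three cells each
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = snn_graph(points, k=2)
for (i, j), weight in sorted(edges.items()):
    print(i, j, weight)
```

A community detection algorithm such as Walktrap would then be run on these weighted edges. In this toy case the graph already splits into two disconnected components, one per group.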

Slide 26

Slide 26 text

Downstream analysis
● Obtain marker genes
● Annotation of clusters
● Differential expression analysis (pseudo bulk)

Slide 27

Slide 27 text

Summary: sc/snRNAseq
● Heterogeneity: Distinguishes individual cell types and states
● Allows us to explore differences between Hb-VTA projection neurons and other brain cells

Slide 28

Slide 28 text

Multi-ome data analysis (snATAC-seq + snRNA-seq) Presented By Cynthia Soto Cardinault

Slide 29

Slide 29 text

Multiome scRNAseq / scATACseq strategy
Cell Ranger ARC analyzes Chromium Single Cell Multiome data, connecting gene expression and chromatin accessibility for enhanced genomic understanding.

Slide 30

Slide 30 text

Data preprocessing Raw data Ryan M Kristen M Sarah M Kelsey M Leonardo C. Customized approach Cellranger count M GEX ATAC Cellranger atac count Lisa K. Johnson Abby Primack Cell Ranger ARC Image modified from 10x Genomics

Slide 31

Slide 31 text

Downstream analysis
Single-cell chromatin analysis workflow with Signac (Hao Y et al., 2021; Stuart T et al., 2021)

Slide 32

Slide 32 text

Additional controls: quality on GEX and ATAC Standard Quality Controls for GEX Specific Quality Controls for ATAC Normalize and scale data and perform dimensionality reduction Batch correction method Clustering

Slide 33

Slide 33 text

Clustering and Data exploration
● Clustering evaluation and optimization
● Identify and uncover cell types and regulatory elements.
SNN (RNA)
LSI (ATAC)

Slide 34

Slide 34 text

Integration with Weighted Nearest Neighbor (WNN)
WNN = GEX + ATAC
Multimodal clustering on the PCA/Harmony reduction for RNA and the LSI reduction for ATAC
Visualize chromatin tracks to identify accessible regions and boundaries over genes or regulatory elements

Slide 35

Slide 35 text

Summary: Multiome
● Multiome approach links gene expression to regulation, revealing disease mechanisms.
● Habenula project explores regulatory pathways and unique gene expression patterns in habenula subdomains.

Slide 36

Slide 36 text

Bulk RNA-seq deconvolution Presented By Louise Huuki-Myers

Slide 37

Slide 37 text

Studying Gene Expression in the Human Brain
Bulk RNA-seq
● Mixture of cell types
Single nucleus RNA-seq
● Profile cell type populations

Slide 38

Slide 38 text

What is Deconvolution?
Computational method that...
● Infers the composition of different cell types in bulk RNA-seq data
● Utilizes single cell data to obtain cell type gene expression profiles

Slide 39

Slide 39 text

Mean Ratio Gene Selection DeconvoBuddies::get_mean_ratio2()

Slide 40

Slide 40 text

Deconvolution Benchmark
● Paired dataset of bulk RNA-seq, snRNA-seq, and RNAscope
  ○ Used RNAscope/IF to build an orthogonal measure of cell type proportions in DLPFC
● Compared proportion estimates from six deconvolution methods
  ○ DWLS, Bisque, MuSiC, BayesPrism, hspe, and CIBERSORTx
● hspe & Bisque are the top performing methods

Slide 41

Slide 41 text

How to use deconvolution?
● Quality Control
  ○ Ex. Remove samples with high proportion of neighboring region
● Control for cell type in downstream analysis
  ○ Differential Expression
  ○ eQTL cell type interactions

Slide 42

Slide 42 text

Suggested Deconvolution Pipeline
1. Select single nucleus reference dataset
   a. Same brain region, multiple donors (4+)
   b. Determine cell type resolution of interest
2. Select Marker Genes
   a. Mean Ratio marker genes DeconvoBuddies::get_mean_ratio()
   b. Observe marker gene selection with heatmaps and violin plots to gauge quality
3. Run Deconvolution
   a. Subset to marker gene set
   b. Run Bisque or hspe
4. Check estimated proportions and apply to downstream analysis
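Step 2 relies on Mean Ratio marker genes. As we understand the statistic, for a given gene and target cell type it is the mean expression in the target cell type divided by the highest mean expression among the non-target cell types, so values well above 1 indicate a specific marker. The sketch below illustrates that calculation on made-up data. The function `mean_ratio` and the gene names are hypothetical, and this is not the DeconvoBuddies code.

```python
def mean_ratio(expr_by_type, target):
    """Mean expression in the target cell type divided by the highest
    mean expression among all other cell types, for a single gene.

    expr_by_type maps a cell type to a list of per-nucleus expression values.
    """
    means = {ct: sum(values) / len(values) for ct, values in expr_by_type.items()}
    highest_other = max(m for ct, m in means.items() if ct != target)
    if highest_other == 0:
        # Gene is not expressed in any non-target cell type
        return float("inf")
    return means[target] / highest_other

# Toy expression values (made-up) for two genes across three cell types
genes = {
    "GENE_A": {"Astro": [4.0, 6.0], "Oligo": [1.0, 1.0], "Excit": [2.0, 2.0]},
    "GENE_B": {"Astro": [3.0, 3.0], "Oligo": [2.0, 4.0], "Excit": [1.0, 1.0]},
}

ratios = {gene: mean_ratio(expr, "Astro") for gene, expr in genes.items()}
# Rank candidate astrocyte markers from most to least specific
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked, ratios)
```

Comparing against the single highest non-target cell type, instead of the average of all others, penalizes genes that are also strongly expressed in just one other cell type.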

Slide 43

Slide 43 text

DeconvoBuddies
Find Marker Genes
● Implements Mean Ratio marker gene selection
  ○ get_mean_ratio()
● Implements 1 vs. All marker gene selection
  ○ findMarkers_1vAll() wrapper function for scran::findMarkers()
Plotting tools
● Quickly plot gene expression over cell types (or other category)
  ○ plot_gene_express()
● Plot top marker genes with annotated statistics
  ○ plot_marker_express()
● Plot composition bar plots of deconvolution outputs
  ○ plot_composition_bar()
Access Data
● Access paired data from consecutive slices of human DLPFC, used in deconvolution benchmark
  ○ RNAscope
  ○ snRNA-seq
  ○ bulk RNA-seq

Slide 44

Slide 44 text

Summary: Deconvolution
● Deconvolution predicts cell type composition in bulk RNA-seq data
● Our deconvolution benchmark determined Bisque & hspe as top methods for human brain data
  ○ Now in preprint https://doi.org/10.1101/2024.02.09.579665

Slide 45

Slide 45 text

Bulk RNA-seq data analysis

Slide 46

Slide 46 text

eQTL Analysis
Presented By Louise Huuki-Myers
● Expression quantitative trait loci
● Test whether a SNP (locus) explains variation in gene expression
● We use tensorQTL
Image credit: 10.1038/s41588-021-00913-z

Slide 47

Slide 47 text

Colocalization Analysis
Presented By Nick Eagles
- Do eQTLs and PGC3 SCZD GWAS risk SNPs share common causal variants?
- Used coloc::coloc.abf(), one eQTL gene at a time, considering all SNPs in a neighborhood of the gene
16 colocalized genes
2 sig. colocalized eQTLs
123 colocalized PGC3 risk SNPs

Slide 48

Slide 48 text

qsvaR Presented By Nick Eagles

Slide 49

Slide 49 text

qsvaR Background
- Post-mortem studies involve a post-mortem interval where tissue degrades before being frozen
- Different transcripts degrade at different rates
- Degradation confounds biological signal in differential expression analysis and can produce false positives

Slide 50

Slide 50 text

qsvaR Overview
- R package to reduce degradation signal in DEG analysis
- Degradation signal is summarized from degradation-associated transcripts and included as covariates in the gene-level DE model
- Expansion of the Jaffe et al. approach, now using transcripts instead of expressed regions

Slide 51

Slide 51 text

qsvaR: Results
qsvaR effectively removes degradation signal in DE
Degradation and case-control signals are correlated for other models, even when including RIN!

Slide 52

Slide 52 text

Best Number of Transcripts?
Assumption: correctly adjusting for degradation should improve concordance of DE results across datasets (replication)
Replication: proportion of significant genes in the discovery dataset that are significant at p = 0.05 in the target dataset and match direction
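The replication metric defined on this slide can be written out directly. In the sketch below the function name, the input layout, and the toy values are all assumptions made for illustration. The slide gives only the target cutoff of p = 0.05, so the discovery significance cutoff is left as a parameter.

```python
def replication_rate(discovery, target, discovery_cutoff=0.05, target_p=0.05):
    """Proportion of genes significant in discovery that are also significant
    in target (p < target_p) with the same direction of effect.

    discovery and target map a gene to (effect, p_value). The discovery
    cutoff is an assumption; it could be applied to FDR-adjusted values.
    """
    significant = [g for g, (_, p) in discovery.items() if p < discovery_cutoff]
    if not significant:
        return float("nan")
    replicated = 0
    for gene in significant:
        if gene not in target:
            continue  # genes missing from target count as not replicated
        target_effect, p = target[gene]
        same_direction = (target_effect > 0) == (discovery[gene][0] > 0)
        if p < target_p and same_direction:
            replicated += 1
    return replicated / len(significant)

# Toy DE results (made-up): gene -> (log fold change, p-value)
discovery = {"A": (1.2, 0.001), "B": (-0.8, 0.01), "C": (0.5, 0.2), "D": (0.9, 0.03)}
target = {"A": (0.9, 0.02), "B": (0.4, 0.01), "C": (0.6, 0.001), "D": (1.0, 0.3)}

rate = replication_rate(discovery, target)
print(round(rate, 3))
```

Here A replicates, B is significant in both but flips direction, D is not significant in the target, and C is ignored because it was not significant in discovery.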

Slide 53

Slide 53 text

Summary: qsvaR
● qsvaR is an R package reducing the confounding effect of degradation on DE
● Requires transcript-level data instead of expressed regions
● Upcoming manuscript

Slide 54

Slide 54 text

CHESS-Brain Presented By Geo Pertea

Slide 55

Slide 55 text

Comprehensive Human Expression Sequences (CHESS)
Center for Computational Biology - JHU
● CHESS: building a comprehensive catalog of RNA transcripts expressed in RNAseq data
● Based on transcript assembly of nearly 10,000 GTEx RNAseq samples with StringTie
StringTie uses splicing graphs to rebuild transcript structures from read alignments
[Figure labels: exon1, exon2, exon3, exon4]

Slide 56

Slide 56 text

Comprehensive Human Expression Sequences: Brain edition
Collaboration work with Dr. Mihaela Pertea and Ida Shinder at CCB JHU
● Original CHESS used GTEx PolyA-selection samples
● LIBD and CMC RNAseq prep uses rRNA-depletion (RiboZero)
● StringTie 3 is aware of co-transcriptional splicing as captured by RiboZero data
Splicing graph takes into account unprocessed introns
[Figure labels: exon1, exon2, exon3, exon4]
https://github.com/gpertea/stringtie

Slide 57

Slide 57 text

CHESS-Brain: polyA vs. RiboZero transcriptomics
3 DLPFC datasets sequenced with both PolyA and RiboZero
A. BrainSeq (Phase 1 and 2): includes SCZD cases for differential expression analysis
B. Cell Fractionation: nuclear, cytoplasmic, and total RNA fractions compared
C. Degradation experiment: time course showing degradation effects on RNA capture

Slide 58

Slide 58 text

CHESS-Brain: polyA vs. RiboZero transcriptomics
Transcript assembly performance for the degradation samples across 4 time points (minutes).
polyA vs. RiboZero DEGs by gene biotype

Slide 59

Slide 59 text

CHESS-Brain: polyA vs. RiboZero transcriptomics
SCZD DGE for PolyA and RiboZero on the BrainSeq datasets before and after qSVA correction
● Volcano plots (diagonal): differential expression and significant DEG counts
● Correlation plots (lower triangle): t-statistic concordance between methods, significant DEGs highlighted
● CAT plots (upper triangle): cumulative Concordance-At-the-Top agreement of top 3000 DEGs

Slide 60

Slide 60 text

Summary: CHESS-Brain
● Data-driven expansion of the catalog of RNA transcripts expressed in brain tissue
● Improved transcript-level analyses of LIBD RNAseq RiboZero data
● Upcoming manuscripts

Slide 61

Slide 61 text

Open source software development Presented By Nick Eagles

Slide 62

Slide 62 text

SPEAQeasy and BiocMAP
SPEAQeasy: Bulk RNA-seq preprocessing pipeline
BiocMAP: WGBS preprocessing pipeline
Documentation sites:
https://research.libd.org/SPEAQeasy/
https://research.libd.org/BiocMAP/
SPEAQeasy updates:
- Minor bug fixes
- Quantify transcripts for rat

Slide 63

Slide 63 text

visiumStitched
- R package for spatially integrating multiple Visium samples
- Large brain regions can be accurately analyzed in one piece
- Successfully applied in NAc, hippocampus
[Figure labels: Agreement at overlaps; PRECAST clustering (k = 4); Manuscript]

Slide 64

Slide 64 text

slurmjobs
- R package for working with SLURM (e.g. JHPCE) from R
- Simplifies repetitive tasks
- Enables creation of parallel workflows for efficient data processing
- Enables monitoring memory usage
[Figure labels: Job submission; Monitoring; job_report() example output]

Slide 65

Slide 65 text

Packages Maintained by the Team
● spatialLIBD
● DeconvoBuddies
● qsvaR
● SPEAQeasy
● visiumStitched
● slurmjobs

Slide 66

Slide 66 text

LIBD JHPCE Presented By Nick Eagles

Slide 67

Slide 67 text

Modules
- Modules: units of software we maintain for LIBD and collaborators at JHU
- Users can immediately load and use popular software
- Behavior of software is identical across users
- Installation for each user not necessary
- Result: reproducible, straightforward data analysis at JHPCE
Recent modules: bin2cell/0.3.0, cellpose/2.0, cellranger/8.0.1, ficture/0.0.3.1, hergast/0.0.1, leafcutter, nda-tools/0.3.0, plink2, qctool/2.2.5, spaceranger/3.1.2, spatula, visium_hd/1.0
module load bin2cell

Slide 68

Slide 68 text

LIBD rstats club + Journal club videos

Slide 69

Slide 69 text

R Stats Club Overview
- Topics include genomics software, computational methods, R packages, and tips for working as data scientists at JHPCE
- Promotes sharing of knowledge and building skills outside of explicit project requirements

Slide 70

Slide 70 text

A Growing Resource for LIBD and Collaborators
- Topics are recorded, freely accessible, and searchable by keyword
- Complements our efforts to share knowledge through DSGSs

Slide 71

Slide 71 text

lcolladotor.github.io @lcolladotor.bsky.social Slides: