Pseudobulk Analysis using pseudoBulkDGE()
RStats presentation
Presented By: Manisha Barse
May 2, 2025
Objectives
Understand what pseudobulk analysis is and why it’s used
Learn the steps of pseudoBulkDGE() following OSCA guidelines
Compare registration_pseudobulk() vs aggregateAcrossCells() +
pseudoBulkDGE() workflows
Understand input parameters, normalization, filtering, and output interpretation
What is Pseudobulk Analysis?
Aggregates single-cell or spatial counts into group-level profiles
Treats groups as “bulk samples” → improves statistical power
Suitable for differential expression analysis (DEA)
Reduces false positives compared to cell-level models, which treat cells from the same sample as independent replicates
Maynard, K.R., Collado-Torres, L., Weber, L.M. et al.
Transcriptome-scale spatial gene expression in the
human dorsolateral prefrontal cortex. Nat Neurosci 24,
425–436 (2021).
Overview of pseudoBulkDGE() Workflow
1. Input: SummarizedExperiment with counts
2. Labeling: Define groups (e.g., BayesSpace domains)
3. Aggregation: Sum counts within groups → pseudobulks
4. Normalization: calcNormFactors()
5. Filtering: Remove lowly expressed genes (filterByExpr)
6. Modeling: Fit linear model, estimate dispersion
7. Testing: Compute DE statistics
8. Output: Table of DE genes, logFC, p-values
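The aggregation in step 3 can be illustrated outside of R. The following is a minimal sketch in plain Python, not the aggregateAcrossCells() implementation: it sums per-spot counts into one pseudobulk profile per sample-and-domain combination. The gene names, sample IDs, and domain labels are made up for illustration.

```python
from collections import defaultdict

def aggregate_counts(counts, groups):
    """Sum per-spot count vectors within each group.

    counts: list of per-spot count lists (one value per gene)
    groups: list of group keys, one per spot, e.g. (sample, domain)
    Returns a dict mapping each group key to its summed count vector.
    """
    if len(counts) != len(groups):
        raise ValueError("counts and groups must have the same length")
    n_genes = len(counts[0]) if counts else 0
    pseudobulk = defaultdict(lambda: [0] * n_genes)
    for spot_counts, key in zip(counts, groups):
        profile = pseudobulk[key]
        for g, value in enumerate(spot_counts):
            profile[g] += value
    return dict(pseudobulk)

# Hypothetical toy data: 5 spots x 3 genes (GeneA, GeneB, GeneC)
counts = [
    [1, 0, 3],
    [2, 1, 0],
    [0, 4, 1],
    [5, 0, 0],
    [1, 1, 1],
]
# Hypothetical (sample, domain) label for each spot
groups = [("S1", "D1"), ("S1", "D1"), ("S1", "D2"), ("S2", "D1"), ("S2", "D1")]

pb = aggregate_counts(counts, groups)
for key in sorted(pb):
    print(key, pb[key])
```

Each resulting profile plays the role of one "bulk sample" in the later normalization, filtering, and modeling steps.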
Normalization and Filtering
Normalization: edgeR::calcNormFactors()
Filtering: edgeR::filterByExpr()
Logcounts: After normalization, generates logCPM/logcounts matrix.
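To make the logCPM idea concrete, here is a simplified sketch in plain Python. It is an assumption-laden illustration rather than what edgeR does: edgeR additionally rescales library sizes with the factors from calcNormFactors() and uses a prior count that depends on library size, and filterByExpr() derives its thresholds from the design. The fixed thresholds and toy counts below are hypothetical.

```python
import math

def cpm(counts, lib_size):
    """Counts per million for one sample, given its library size."""
    return [c * 1e6 / lib_size for c in counts]

def log_cpm(counts, lib_size, pseudo=1.0):
    """log2(CPM + pseudo): a simplified stand-in for edgeR's logCPM."""
    return [math.log2(v + pseudo) for v in cpm(counts, lib_size)]

def keep_genes(cpm_by_sample, min_cpm, min_samples):
    """Keep a gene if its CPM reaches min_cpm in at least min_samples samples."""
    n_genes = len(next(iter(cpm_by_sample.values())))
    keep = []
    for g in range(n_genes):
        n_pass = sum(1 for values in cpm_by_sample.values() if values[g] >= min_cpm)
        keep.append(n_pass >= min_samples)
    return keep

# Hypothetical pseudobulk counts: sample -> counts for 3 genes
pseudobulk = {"S1": [10, 0, 990], "S2": [20, 5, 1975]}

# Library size here is simply the total count per sample (no TMM adjustment)
cpm_by_sample = {s: cpm(c, sum(c)) for s, c in pseudobulk.items()}
logcpm_by_sample = {s: log_cpm(c, sum(c)) for s, c in pseudobulk.items()}

# Hypothetical thresholds, chosen only to make the toy example informative
keep = keep_genes(cpm_by_sample, min_cpm=5000, min_samples=2)
print(keep)
```

In this toy data the second gene falls below the threshold in both samples and is dropped, while the other two are retained.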
Comparison to existing method
● registration_pseudobulk() applies edgeR::filterByExpr() across all spots, which might
remove genes with low overall expression that are nonetheless expressed in
specific spatial domains.
● So, when sample size is limited or domain-wise expression is
important: pseudobulk the samples using aggregateAcrossCells() and then run
pseudoBulkDGE().
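The difference between filtering globally and filtering within each domain can be shown with a toy example. This is a sketch in plain Python under a made-up rule (a gene passes if at least half of the samples considered have 10 or more counts); it is not filterByExpr()'s actual criterion, and the counts and domain names are hypothetical.

```python
def passes(counts, min_count=10, min_fraction=0.5):
    """True if at least min_fraction of the samples have >= min_count counts."""
    expressed = sum(1 for c in counts if c >= min_count)
    return expressed >= min_fraction * len(counts)

# Hypothetical pseudobulk counts for a single gene: domain -> counts per sample.
# The gene is expressed only in domain D3.
gene_counts = {"D1": [0, 1], "D2": [2, 0], "D3": [40, 55]}

# Global filter: pool all pseudobulk samples from every domain
all_counts = [c for counts in gene_counts.values() for c in counts]
kept_globally = passes(all_counts)

# Domain-wise filter: evaluate the same rule inside each domain separately
kept_per_domain = {domain: passes(counts) for domain, counts in gene_counts.items()}

print("kept globally:", kept_globally)
print("kept per domain:", kept_per_domain)
```

Under this rule the gene is discarded by the global filter because only 2 of 6 samples express it, yet it is retained when D3 is filtered on its own.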
pseudoBulkDGE()
● Main function: Performs DEA on pseudobulked data
● Important arguments:
○ x: SummarizedExperiment of pseudobulk counts
○ label: grouping variable (e.g., BayesSpace)
○ design: model formula (~ diagnosis + covariates)
○ coef: coefficient of interest (e.g., “Case”)
○ condition: primary condition (diagnosis)
○ row.data: gene-level metadata
○ method: "edgeR" or "voom" (edgeR handles low counts better via its count-based model, but
voom supports variable sample precision when qualities=TRUE)
R script
Demo script
https://github.com/manishabarse/LIBD_presentation/blob/main/pseudobulk_demo.R