LIBD_DS_TLDR

R/Bioconductor-powered Team Data Science
Leonardo Collado-Torres, Louise A. Huuki-Myers, Joshua M. Stolz, Nicholas J. Eagles + Geo Pertea
https://lcolladotor.github.io/bioc_team_ds/
April 20, 2022
https://speakerdeck.com/lcolladotor/libd-ds-tldr

DNA Genotyping: PopTop Joshua M. Stolz

Benefit of TopMed
• TopMed offers a reference panel with far more SNPs (~300 million)
• The Rsq value for lower MAF is preserved in populations of African ancestry
• Can this increased power be used to include lower MAFs?
• Should we just filter by Rsq?
MAF: minor allele frequency. Rsq: R squared (imputation quality).
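
The two filtering questions on this slide can be made concrete with a small sketch. It computes MAF from alternate-allele counts and keeps a variant only if it passes both a MAF and an Rsq cutoff. The function names, the toy genotypes, and the thresholds are assumptions for illustration, not PopTop settings.

```python
def minor_allele_frequency(genotypes):
    """MAF from per-sample alternate-allele counts (0, 1, or 2); None = missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def keep_variant(genotypes, rsq, min_maf=0.01, min_rsq=0.3):
    """Keep a variant only if it passes both (hypothetical) thresholds."""
    return minor_allele_frequency(genotypes) >= min_maf and rsq >= min_rsq


# Toy data: variant name -> (genotypes, imputation Rsq). All values are made up.
variants = {
    "snp_common": ([0, 1, 2, 1, 0, 1], 0.95),
    "snp_rare_well_imputed": ([0, 0, 0, 0, 0, 1], 0.90),
    "snp_rare_poorly_imputed": ([0, 0, 0, 0, 0, 1], 0.20),
}
kept = [name for name, (gt, rsq) in variants.items()
        if keep_variant(gt, rsq, min_maf=0.05)]
print(kept)
```

Filtering on Rsq alone would correspond to setting `min_maf=0.0`, which is one way to read the last question on the slide.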

PopTop: Nextflow Pipeline
• Parallelizes computationally expensive tasks
• Allows for the automation of large jobs
• Documentation is available at:
  ◦ https://research.libd.org/Topmed-Imputation-Pipeline/

Topmed Output: VCF Format

Future Work
• Working to make this modular with the upcoming LIBD Data Portal.
• Writing scripts so that delivering subsets of samples takes a feasible amount of time.
• Continuing to maintain and update the documentation website to make it more robust and user friendly.
@geo_pertea Geo Pertea, @Nick-Eagles (GH) Nicholas J Eagles

Bulk RNA-seq Processing: SPEAQeasy Nicholas J. Eagles

Bulk RNA-seq Processing: Motivation and Challenges
- Data processing should be uniform across time/datasets, documented, and reproducible
  - What aligner was used for this dataset?
  - Did we use hg19 or hg38? What GENCODE release was used?
  - How did we decide which samples to trim, if any?
  - Which version of FastQC was used?
- While many computational steps are involved before analyses (e.g. DE) are possible, data pre-processing should ideally not require technical expertise to apply

SPEAQeasy Workflow
https://github.com/LieberInstitute/SPEAQeasy
Raw sequencing reads → R objects, ready for statistical analysis

Manuscript and Documentation https://doi.org/10.1186/s12859-021-04142-3 http://research.libd.org/SPEAQeasy/

Example Analysis
- Demonstrate how to run SPEAQeasy on real data and work with its outputs
- Use variant calling results to find and resolve identity issues originating from labelling mistakes
- Perform a differential analysis after attaching experiment-specific sample metadata
http://research.libd.org/SPEAQeasy-example
@lahuuki Louise A Huuki-Myers, @JoshStolz2 Joshua M Stolz

Configuration
- Each user is recommended to install SPEAQeasy separately
  - git clone git@github.com:LieberInstitute/SPEAQeasy.git
  - cd SPEAQeasy
  - bash install_software.sh "jhpce"
- Control:
  - exact annotation files used (GENCODE/Ensembl versions)
  - command-line arguments to software
  - how to trim samples, if at all
  - aligner (HISAT2/STAR) and pseudo-aligner (kallisto/salmon)
/dcs04/lieber/ds2a/Data/CMC/Data/RNAseq/Raw/SPEAQeasy

Recent Improvements and Future Work
Post-publication improvements:
- Alignment-related optimizations resulting in a reduction in disk space and computational time
- Support for Singularity, leading to new Cardiff users and greatly expanding possible users (collaboration with Nick Clifton)
- Added raw counts for transcripts in addition to TPM
Future improvements:
- Want to allow a user to only perform alignment, or only quantify transcripts
- Reduce required memory when counting junctions to produce R objects
@geo_pertea Geo Pertea

WGBS Processing: BiocMAP Nicholas J. Eagles

WGBS: Motivation and Challenges
- Profile DNA methylation, a critical epigenetic modification, across the entire human genome
  - Differentially methylated regions (DMRs), e.g. between schizophrenia and controls
  - Methylation quantitative trait loci (MeQTLs)
- Large data size (> 1B cytosines measured, ~2TB disk space per sample)
  - Careful choices must be made to fit files on JHPCE and to generate/load results with available memory
- Several steps are required before raw sequenced reads yield methylation proportions ready for analysis
  - Trimming, alignment to reference genome, extracting methylation proportions, import to R

TL;DR
Raw sequencing reads → R objects, ready for statistical analysis
https://github.com/LieberInstitute/BiocMAP

Manuscript and Documentation http://research.libd.org/BiocMAP/

Example Vignette
https://github.com/LieberInstitute/BiocMAP/blob/master/documentation/example_analysis/age_neun_analysis.pdf
https://github.com/LieberInstitute/BiocMAP/blob/master/documentation/example_analysis/example_analysis.pdf
Price et al., Genome Biology, 2019 https://doi.org/10.1186/s13059-019-1805-1

Workflow in Practice
- Datasets where we've applied BiocMAP:
  - 664 PsychENCODE Schizophrenia/control samples (Hippocampus, DLPFC, Caudate)
  - 20 PsychENCODE fetal samples
  - 2597 VA PTSD samples (year 1 and 2)
- How long does it take to run?
  - ~3 months for 648 samples (second module)
  - ~2 weeks for 43 samples (both modules at JHPCE)
- How much disk space is required?
  - ~2TB disk space per sample while generating; 1TB outputs

LIBD Data Portal Geo Pertea

LIBD Data integration
• relational database tracking data assets at LIBD
• linking LIMS to processed data
• flexible database back-end & indexed file storage
• unified web interface for data queries

Experiment data flow (diagram)
Flow labels: brains, histological samples, extraction, sequencing, sequencing samples, processing
Tables shown in the diagram:
• subjects: id, brnum, brint, age, sex, race, dx_id
• samples: id, name, subj_id, region, sdate
• exp_metadata: id, dataset_id, s_id, s_name, sample_id, protocol, restricted, numReads, numMapped, totalMapped, overallMapRate, ..., mitoRate, rRNA_rate, totalAssignedGene
• exp_data: exp_id, dtype, ftype (G / E / J / T), f_set_id, f_data real[], version
Storage labels: PostgreSQL database, H5, Parquet, filesystem, assay data

PostgreSQL relational database
• demographic data
• experiment metadata
• histological sample metadata
• genomic features (annotations)
• assay data
Genotype table (diagram): id, subj_id, dnum, sample_id, panel_id, batch_id, call_rate, p10gc, p50gc, nPennCNV[], SUM16, SUM20, imputation, data_path
data_path: location of data files on file system storage
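
To illustrate the kind of relational query such a schema supports, here is a minimal sketch that links samples to subjects through `subj_id`. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, keeps only a few of the columns named in the diagram, and fills the tables with made-up rows, so every value below is an assumption.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subjects (id INTEGER PRIMARY KEY, brnum TEXT, age REAL, sex TEXT);
CREATE TABLE samples (
    id INTEGER PRIMARY KEY,
    name TEXT,
    subj_id INTEGER REFERENCES subjects(id),
    region TEXT
);
""")

# Hypothetical rows for illustration only.
conn.executemany("INSERT INTO subjects VALUES (?, ?, ?, ?)", [
    (1, "Br0001", 45.2, "F"),
    (2, "Br0002", 61.0, "M"),
])
conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", [
    (10, "S10", 1, "DLPFC"),
    (11, "S11", 1, "Caudate"),
    (12, "S12", 2, "DLPFC"),
])

# Join sample metadata to subject demographics for one brain region.
rows = conn.execute("""
    SELECT s.name, s.region, j.brnum, j.age
    FROM samples AS s
    JOIN subjects AS j ON j.id = s.subj_id
    WHERE s.region = ?
    ORDER BY s.name
""", ("DLPFC",)).fetchall()
print(rows)
```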

Integration of PostgreSQL and R from back-end to front-end
Leveraging R's data processing and visualization capabilities
Diagram labels:
• Front-end (web application): client selects dataset (sample metadata only); client receives results & plot data
• Middleware (nodejs)
• Back-end (PostgreSQL server, srv16; SQL + R code): retrieve sample data, process large data, output results; server returns results & baked plot data (plotly JSON)

sc/snRNA-seq
Louise A. Huuki-Myers, Joshua M. Stolz
With help from: @mattntran Matthew N Tran

https://bioconductor.org/packages/3.14/SingleCellExperiment

https://doi.org/10.1038/s41592-019-0654-x https://bioconductor.org/books/release/OSCA @stephaniehicks Stephanie C Hicks

Quality control + normalization
• emptyDrops() from DropletUtils
  ◦ Determine the empty droplets
• isOutlier() from scuttle
  ◦ Identify outlier cells/nuclei based on mitochondrial expression and other metrics
• devianceFeatureSelection() + nullResiduals() from scry
  ◦ GLM-PCA approximation by Townes, Hicks, Aryee, and Irizarry https://doi.org/10.1186/s13059-019-1861-6
• reducedMNN() from batchelor
  ◦ Batch correction since sc/snRNA-seq has strong sample effects
• + much more before you get to annotated clusters of cells
@mattntran Matthew N Tran, @Erik-D-Nelson (GH) Erik D Nelson
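
As a rough illustration of the outlier step, the sketch below flags values that sit more than a chosen number of median absolute deviations (MADs) above the median, which is the general idea behind MAD-based QC filters such as isOutlier(). This is plain Python rather than the Bioconductor function, and the mitochondrial percentages are made up.

```python
from statistics import median


def is_outlier_high(values, nmads=3.0):
    """Flag values more than `nmads` scaled MADs above the median."""
    med = median(values)
    # 1.4826 scales the MAD so it estimates the standard deviation for normal data.
    mad = 1.4826 * median(abs(v - med) for v in values)
    threshold = med + nmads * mad
    return [v > threshold for v in values]


# Hypothetical per-nucleus mitochondrial read percentages.
mito_percent = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 9.5]
flags = is_outlier_high(mito_percent)
print(flags)
```

Using the median and MAD instead of the mean and standard deviation keeps the threshold from being dragged upward by the very outliers it is meant to catch.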

1vAll Markers vs. Mean Ratio Markers
https://research.libd.org/DeconvoBuddies/
@lahuuki Louise A Huuki-Myers
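
A minimal sketch of the mean ratio idea, assuming it is defined as the mean expression of a gene in the target cell type divided by the highest mean expression among the other cell types. The function and the toy values are hypothetical and are not the DeconvoBuddies implementation.

```python
def mean(xs):
    return sum(xs) / len(xs)


def mean_ratio(expr_by_cell_type, target):
    """Target-cell-type mean divided by the highest non-target mean for one gene."""
    target_mean = mean(expr_by_cell_type[target])
    highest_other = max(
        mean(values) for cell_type, values in expr_by_cell_type.items()
        if cell_type != target
    )
    return target_mean / highest_other  # assumes highest_other > 0


# Hypothetical normalized expression of one gene across nuclei, by cell type.
gene_expr = {
    "Oligo": [8.0, 10.0, 12.0],
    "Astro": [1.0, 2.0, 3.0],
    "Excit": [4.0, 5.0, 6.0],
}
ratio = mean_ratio(gene_expr, "Oligo")
print(ratio)
```

A 1vAll comparison pools all non-target cells together, so a gene that is also high in one other cell type can still look like a marker; comparing against the single highest non-target mean penalizes that case.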

Deconvolution Louise A. Huuki-Myers

What is Deconvolution?
• Inferring the composition of different cell types in bulk RNA-seq data
Deconvolution: get single-cell-like information from bulk RNA-seq
Diagram labels: Tissue, Bulk RNA-seq, snRNA-seq, Estimated proportions; cost labels $$$, $, Free!
https://twitter.com/BoXia7/status/1261464021322137600
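
The core idea can be sketched with a toy model in which bulk expression is a weighted mix of two cell-type signatures, and the weight is recovered by least squares. This is a simplified illustration with made-up numbers, not how Bisque or MuSiC work internally.

```python
def estimate_two_type_proportion(bulk, sig_a, sig_b):
    """Least-squares estimate of p in: bulk = p * sig_a + (1 - p) * sig_b."""
    diff = [a - b for a, b in zip(sig_a, sig_b)]
    numerator = sum((x - b) * d for x, b, d in zip(bulk, sig_b, diff))
    denominator = sum(d * d for d in diff)
    p = numerator / denominator
    return min(1.0, max(0.0, p))  # clamp to a valid proportion


# Hypothetical per-gene signatures (e.g. mean expression from snRNA-seq).
neuron = [10.0, 1.0, 5.0]
glia = [2.0, 8.0, 5.0]

# Simulated bulk sample that is 70% neuron and 30% glia.
bulk = [0.7 * n + 0.3 * g for n, g in zip(neuron, glia)]
p_neuron = estimate_two_type_proportion(bulk, neuron, glia)
print(round(p_neuron, 3))
```

Note that the third gene has the same signature in both cell types and so contributes nothing to the estimate, which hints at why marker gene selection matters.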

Mean Proportions By Region: Tran et al, bioRxiv, 2020 (5
donors, 6 cell types)

Mean Proportions By Region: Tran et al, Neuron, 2021 (8 donors, 10 cell types)
Peric = Mural + Endo

Bisque & MuSiC vs SPLITR
• Bisque has a more similar pattern of composition over regions vs. SPLITR
• MuSiC predicts large proportions of Endo + Mural (Peric)
• Both estimate lower proportions of Excit
  ◦ MuSiC is more extreme and also predicts a low proportion of Inhib
Differences between the comparisons: deconvolution methods, bulk RNA-seq data source, marker genes, and reference snRNA-seq data

Method Sensitivity to Marker Set: 25 vs. 20 Genes
• Run with sets of 20 & 25 marker genes per cell type
• Bisque is more robust to changes in the marker set than MuSiC
Currently Bisque is our method of choice

Dataset | Regions | Samples | Case | Control | Analysis | Publication
BipSeq | sACC + AMY | 511 | 247 BPD | 264 | Revisions | Zandi et al., Nat. Neurosci, 2022
Suicide Genomics | DLPFC | 329 | 226 | 103 | Revisions | Punzi et al., American Journal of Psychiatry, 2022
BrainSeq Phase III | Caudate | 464 | 298 SCZD | 266 | Revisions | Benjamin et al., Nature Neuroscience, 2022
MDDseq | sACC + AMY | 1091 | 704 MDD/BPD | 387 | Main | In Progress
AANRI | DG, Caudate, Hippo, DLPFC | 1647 (263, 464, 447, 453) | - | - | Main | In Progress
Astellas AD | | | | | Main | In Progress
BrainSeq Phase I | DLPFC | 727 | 395 SCZD | 332 | Exploratory | -
BrainSeq Phase II | DLPFC | 453 | 153 SCZD | 300 | Exploratory | -
GTEx | 13 Regions | 2670 | - | - | Exploratory | -
Degradation | AMY, Caudate, DLPFC, HIPPO, mPFC, sACC | 119 | - | - | Exploratory | -

Upcoming: Deconvolution Methods Benchmark
• Goal: determine the most accurate deconvolution method for brain bulk RNA-seq data
  ◦ Test available software (Bisque, MuSiC, and others) over a variety of conditions
    ▪ Reference set qualities
    ▪ Marker gene selection
    ▪ Preparation of the bulk data
• Requires: a "gold standard" cell type composition reference to measure performance
  ◦ snRNA-seq can be enriched for certain cell types
  ◦ smFISH + RNAscope allows "direct" measurement from intact tissue and will be used to establish true composition

Goals for RNAscope Experiment
• Deconvolution R01 MH123183
  ◦ Kristen Maynard, Stephanie C Hicks
• Use six slices of DLPFC to generate corresponding RNA-seq & RNAscope data
• This information will be useful to evaluate and design deconvolution algorithms
Diagram labels: DLPFC, Bulk RNA-seq (polyA, RiboZero), snRNA-seq, Spatial, RNAscope
@kr_maynard Kristen R Maynard, @stephaniehicks Stephanie C Hicks, Kelsey D Montgomery

What is a TREG?
• Total RNA Expression Gene
• Expression is proportional to the overall RNA expression in a cell
• In smFISH the count of TREG puncta in a cell can estimate the RNA content
  ◦ Linking RNA content to nucleus size
http://research.libd.org/TREG/
http://bioconductor.org/packages/TREG/

eQTLs Louise A. Huuki-Myers

Key inputs
• Genotype Data
  ◦ Consider minor allele frequency
  ◦ Full TopMed imputed SNP data set
  ◦ Risk SNP subset
• Expression Data
  ◦ Gene, exon, junction, transcript
  ◦ Position of the feature
• Covariates Data
  ◦ Phenotype data: Dx, Age, Sex
  ◦ Feature PCs
• Interaction Data
  ◦ Example: cell fractions from deconvolution
• Parameters
  ◦ Window size
  ◦ Minor allele frequency
Diagram labels: PopTop; Genotype data; Plink files containing SNPs of interest; SPEAQeasy-generated SummarizedExperiment; Feature position + expression matrix; Covariate data as matrix; Deconvolution or other analysis; Interaction vector (only for interaction analysis); tensorQTL + parameters; eQTL results

MatrixEQTL vs tensorQTL (fastQTL)
MatrixEQTL
• R package
• Many Andrew E Jaffe analyses:
  ◦ BrainSEQ Phase II
  ◦ Burke et al stem cell
  ◦ …
  ◦ BipSeq by Zandi et al
tensorQTL
• Python, GPU enabled
• Currently utilized in MDDseq project
• Recommended upgrade by Andrew Jaffe, utilized by other LIBD researchers
• github.com/broadinstitute/tensorqtl
https://youtu.be/zOMUXYHtVJM

Genome-wide eQTLs: several flavors
• Nominal: evaluate all pairs
• Cis: find most significant pair per feature
• Independent: conditionally independent cis-QTLs using stepwise regression
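
The difference between the first two flavors can be sketched in a few lines: a nominal run keeps every feature-variant pair, while a cis-style summary keeps the most significant pair per feature. The pairs and p-values below are invented, and real tools typically also correct the top p-value for the number of variants tested, which this sketch omits.

```python
# Hypothetical nominal results: (feature, variant, nominal p-value).
nominal = [
    ("gene1", "rs1", 0.04),
    ("gene1", "rs2", 0.001),
    ("gene2", "rs2", 0.20),
    ("gene2", "rs3", 0.03),
]


def top_pair_per_feature(pairs):
    """Keep the variant with the smallest p-value for each feature."""
    best = {}
    for feature, variant, pval in pairs:
        if feature not in best or pval < best[feature][1]:
            best[feature] = (variant, pval)
    return best


cis_like = top_pair_per_feature(nominal)
print(cis_like)
```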

tensorQTL at JHPCE (GPU-powered)
Data Formatting
• Genotype Data
  ◦ Needs .bed/.bim/.fam files
• Expression Data
  ◦ Gene, exon, junction, transcript
  ◦ As .bed.gz
• Covariates Data
  ◦ Phenotype data: Dx, Age, Sex, Feature PCs
    ▪ Categorical variables must be converted to numeric
  ◦ File type flexible, need to read in as pandas.DataFrame
How to Run on GPU
• Can be used as a function in a Python script or as a command line tool
  ◦ Requires conversion to correct data formats
• Fast when run on GPU
  ◦ Completed MDDseq Amygdala Gene analysis in 2.52 min vs 51.21 min on CPU (about 20x faster), vs. 288 min with MatrixEQTL
    ▪ 540 samples x 53.6M pairs
• Use GPU queue when submitting job
  ◦ Example sh file: #$ -l gpu,mem_free=50G,h_vmem=50G,h_fsize=100G
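
The requirement that categorical covariates become numeric can be sketched as follows: each categorical variable is expanded into 0/1 indicator columns against a reference level, and numeric variables are passed through as floats. The helper name, sample IDs, and levels are hypothetical, and the sketch uses plain dictionaries so that it runs without pandas.

```python
def encode_covariates(samples, categorical_levels):
    """Turn categorical covariates into 0/1 indicators; the first level is the reference."""
    encoded = {}
    for sample_id, covariates in samples.items():
        row = {}
        for name, value in covariates.items():
            if name in categorical_levels:
                _reference, *other_levels = categorical_levels[name]
                for level in other_levels:
                    row[f"{name}_{level}"] = 1.0 if value == level else 0.0
            else:
                row[name] = float(value)
        encoded[sample_id] = row
    return encoded


# Hypothetical phenotype data.
samples = {
    "S1": {"Dx": "Control", "Sex": "F", "Age": 45},
    "S2": {"Dx": "MDD", "Sex": "M", "Age": 52},
}
levels = {"Dx": ["Control", "MDD"], "Sex": ["F", "M"]}
encoded = encode_covariates(samples, levels)
print(encoded)
```

If pandas is available, a dictionary shaped like `encoded` can be turned into a samples-by-covariates table with `pandas.DataFrame.from_dict(encoded, orient="index")`.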

GWAS-loci eQTL analysis
• Subset genotype dataset to SNPs identified as risk loci by GWAS
• Check for association with cellular fractions predicted by deconvolution
  ◦ Run nominal analysis w/ addition of interaction vector
  ◦ Adds interaction term to the model
    ▪ p ~ g + i + gi
Figure labels: PGC Major Depressive Disorder GWAS (Wray et al., Nature Genetics, 2018); Deconvolution Results
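
Written out, the formula p ~ g + i + gi corresponds to a linear model of the following form. The symbols are a reading of the slide's shorthand (p as the expression phenotype, g as the genotype dosage, i as the interaction variable such as a cell fraction); the coefficient names are introduced here for illustration.

```latex
p = \beta_0 + \beta_g\, g + \beta_i\, i + \beta_{gi}\,(g \cdot i) + \varepsilon
```

Under this reading, an interaction eQTL is a pair for which the interaction coefficient \beta_{gi} differs from zero, meaning the genotype effect on expression changes with the cell fraction.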

Interaction eQTLs with cell type proportions https://github.com/LieberInstitute/goesHyde_mdd_rnaseq/tree/master/eqtl/code

Quality Surrogate Variable Analysis (qSVA) Joshua M. Stolz

Differential expression is confounded by degradation
The t-statistics for SCZ vs Control DE and for degradation time DE are correlated. Traditional methods (like RIN) fail to remove this effect.
Jaffe AE, Tao R, Norris AL, Kealhofer M, Nellore A, Shin JH, et al. qSVA framework for RNA quality correction in differential expression analysis. Proc Natl Acad Sci U S A. 2017;114:7130–5.

qSVA Original Process
Each sample was allowed to degrade on a bench for 0, 15, 30, or 60 minutes. From this we get the top 1000 expressed regions associated with degradation.
Peterson, Amy. "Quality Surrogate Variable Analysis." LIBD Rstats Club, 11 Dec. 2018, research.libd.org/rstatsclub/2018/12/11/quality-surrogate-variable-analysis/

Updated pipeline 2000

Degradation is confounded by Region

Deconvolution @lahuuki Louise A Huuki-Myers

Deconvolution in Degradation Matrix
• Identify 2,976 degradation-associated transcripts with cell proportion terms in model (vs. 1,792)
• Controlling expression for qSVs predicted with this set of transcripts shows lower correlations between DE results and degradation statistic (desired result)
Figure labels: Cor = -0.091, Cor = -0.051
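
The "Cor" values on this slide are correlations between two vectors of per-feature statistics. A minimal sketch of that check, using a hand-written Pearson correlation and invented t-statistics, shows what "lower correlation after adjustment" looks like numerically.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical t-statistics for five features (not values from the study).
degradation_t = [2.0, -1.0, 0.5, 3.0, -2.5]
de_t_unadjusted = [1.8, -0.7, 0.6, 2.4, -2.0]   # DE model without qSVs
de_t_adjusted = [0.3, 0.4, -0.5, 0.1, 0.2]      # DE model with qSVs

cor_unadjusted = pearson(degradation_t, de_t_unadjusted)
cor_adjusted = pearson(degradation_t, de_t_adjusted)
print(round(cor_unadjusted, 3), round(cor_adjusted, 3))
```

A correlation close to zero after adjustment suggests that the DE signal is no longer tracking the degradation signal.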

http://research.libd.org/qsvaR
http://bioconductor.org/packages/qsvaR/
With ongoing feedback on the documentation from: @HeenaDivecha Heena R Divecha

Differential Gene Expression Louise A. Huuki-Myers

Key inputs
• Quality Controlled Expression Data
• Model & corresponding data
  ◦ Primary Dx (explanatory variable)
  ◦ Phenotype data (AgeDeath, Sex)
  ◦ Quality control metrics
    ▪ mitoRate, rRNA_rate, totalAssignedGene, RIN, ERCC
  ◦ snpPCs
    ▪ from DNA Genotyping
  ◦ qSVs
    ▪ from qSVA v1 or v2 (qsvaR)
  ◦ Other Analysis
    ▪ E.g. Deconvolution cell fractions
Model: ~Dx + pd + QC + snpPC + qSVs + ?
Diagram labels: PopTop; Genotype data; qsvaR; Degradation matrix; Deconvolution or other analysis; SPEAQeasy-generated SummarizedExperiment; Model Matrix; Normalized counts; DE Results
limma + voom process: calcNormFactors(), model.matrix(), lmFit(), eBayes(), topTable()

Modeling with limma: quick overview
• calcNormFactors() from edgeR
  ◦ For normalization of the bulk RNA-seq counts
• model.matrix() from stats
  ◦ Define how you want to model gene expression
  ◦ Covariates like qSVs, ancestry PCs, SPEAQeasy QC metrics, sex, age, diagnosis, …
• lmFit()
  ◦ Fit the linear regression model for all genes
• eBayes()
  ◦ Use empirical Bayes to compute the statistics
• topTable()
  ◦ Extract results for downstream analyses
More details at http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html
edgeR, DESeq2, dream are good alternatives
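
To make the "fit the linear regression model for all genes" step concrete, here is a minimal per-gene ordinary least squares sketch in plain Python. It is not limma: there are no voom precision weights and no empirical Bayes moderation, and the design matrix and expression values are simulated for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(design, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Design columns: intercept, Dx (0 = control, 1 = case), age. All values are made up.
design = [[1, 0, 30], [1, 0, 50], [1, 1, 40], [1, 1, 60], [1, 0, 70], [1, 1, 20]]

# Simulated log-expression: geneA has a Dx effect of 2, geneB has none.
log_expr = {
    "geneA": [5 + 2 * dx + 0.01 * age for _, dx, age in design],
    "geneB": [3 + 0 * dx + 0.02 * age for _, dx, age in design],
}

# Fit the same model to every gene and pull out the Dx coefficient.
dx_effect = {gene: ols(design, y)[1] for gene, y in log_expr.items()}
print({gene: round(coef, 6) for gene, coef in dx_effect.items()})
```

The same design matrix is reused for every gene, which is the pattern that model.matrix() and lmFit() follow; the moderated statistics from eBayes() are a further step that this sketch leaves out.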

Why limma + voom?
• limma fits linear regression models (on voom-transformed counts), whereas DESeq2 fits negative binomial models
  ◦ Comparable results
  ◦ limma is less computationally expensive (faster)
• Other methods:
  ◦ dream: linear mixed effect model (Hoffman et al. Bioinformatics, 2021)
    ▪ More precise but computationally expensive
    ▪ May be better for analysis with small sample sizes

Example DE code with limma
https://github.com/LieberInstitute/goesHyde_mdd_rnaseq/blob/master/differential_expression/code/run_DE.R

Interactively explore your model.matrix
http://bioconductor.org/packages/ExploreModelMatrix
https://doi.org/10.12688/f1000research.24187.2
• Important for exploring the DE model

Adding Cell Fraction to DE Model
• Including Deconvolution can result in a more conservative model
  ◦ For the most part similar t-statistics
  ◦ Fewer significant DE genes

Spatially-resolved transcriptomics Leonardo Collado-Torres With help from: @abspangler Abby Spangler
@lmwebr Lukas M Weber @stephaniehicks Stephanie C Hicks @MadhaviTippani Madhavi Tippani @sowmyapartybun Sowmya Parthiban @HeenaDivecha Heena R Divecha @PardoBree Brenda Pardo @Nick-Eagles (GH) Nicholas J Eagles @martinowk Keri Martinowich @kr_maynard Kristen R Maynard @CerceoPage Stephanie C Page

SpatialExperiment: infrastructure for spatially resolved transcriptomics data in R using Bioconductor
Righelli, Weber, Crowell, et al, bioRxiv, 2021
Accepted at Oxford Bioinformatics on 04/19/2022
DOI 10.1101/2021.01.27.428431
@drighelli Dario Righelli, @CrowellHL Helena L Crowell, @lmwebr Lukas M Weber

bioconductor.org/packages/spatialLIBD Pardo et al, bioRxiv, 2021 DOI 10.1101/2021.04.29.440149 Accepted at
BMC Genomics on 04/20/2022 Maynard, Collado-Torres, Nat Neuro, 2021 Brenda Pardo Abby Spangler @PardoBree @abspangler

http://research.libd.org/spatialLIBD/articles/TenX_data_download.html

OSTA: https://lmweber.org/OSTA-book/ @lmwebr Lukas M Weber @lcolladotor Leonardo Collado-Torres @abspangler
Abby Spangler @HeenaDivecha Heena R Divecha @MadhaviTippani Madhavi Tippani @stephaniehicks Stephanie C Hicks

Data-driven clustering: BayesSpace
Zhao et al, Nature Biotechnology, 2021 https://doi.org/10.1038/s41587-021-00935-2

Spatial registration of your sc/snRNA-seq data
Figure labels: Your sc/snRNA-seq data (Hodge et al, Nature, 2019); Our spatial data (Maynard, Collado-Torres, Nat Neuro, 2021)

Spot deconvolution: Tangram
https://www.nature.com/articles/s41592-021-01264-7/figures/1
@Nick-Eagles (GH) Nicholas J Eagles

Our Philosophy + Getting Help

Share knowledge openly
• I am an independent researcher, and my team and I are not a data science core, yet we share our knowledge openly so others can get up to speed if needed
  ◦ Share research results early through pre-prints (bioRxiv)
  ◦ Share code on GitHub with others
    ▪ GitHub is widely used as a social coding platform
  ◦ Share code snippets that might be useful to others
  ◦ Share our experiences
  ◦ Maintain and share information on several Slack communication channels
  ◦ People are free to adapt what we have done and we would love to learn about what others have come up with, since we might need to update/change our own work
    ▪ We do not impose solutions or make decisions for others

Different types of help
◦ Things we can do
  ▪ Guidance, feasibility, and/or brainstorming
  ▪ Data processing, such as bulk RNA-seq with SPEAQeasy, DNA genotyping with PopTop, WGBS with BiocMAP, …
    • We would strongly prefer that others learn how to run these tools
      ◦ Aka, please use the documentation we wrote =)
  ▪ Sharing data with external collaborators
    • After internal LIBD approval by Rujuta Narurkar
◦ Things that are beyond what we can typically do
  ▪ Lead analysis
  ▪ Develop and/or maintain custom solutions
  ▪ Write papers

Data Science guidance sessions (DSgs)
• https://lcolladotor.github.io/bioc_team_ds/data-science-guidance-sessions.html
  ◦ JHPCE
  ◦ R
  ◦ Bioconductor
  ◦ Understanding code we wrote
  ◦ Training others on how they can more effectively get help from us or others
    ▪ Providing reproducible examples: reprex in R https://reprex.tidyverse.org/ which provides a solution to "help me help you"
    ▪ Framing questions and software bug reports
• The DSgs system works best over the long term
  ◦ It's based on my 3 years of experience as a JHBSPH MPH capstone teaching assistant

https://jhpce.jhu.edu/knowledge-base/knowledge-base-articles-from-lieber-institute/ Join us Fridays at 9 AM (check the code
of conduct please!)

https://www.youtube.com/c/LeonardoColladoTorres/playlists
Videos allow us to multiply ourselves
We can make you custom selections of videos for a specific problem on DSgs sessions

https://github.com/

https://github.com/LieberInstitute Email Bill Ulrich your GitHub username to get added
@ckbehemoth (GH) William S Ulrich

https://github.com/search?q=org%3ALieberInstitute Example question: How do you use aggregateAcrossCells() ?

https://github.com/search?q=org%3ALieberInstitute+aggregateAcrossCells&type=code
GitHub is our library / encyclopedia
It could be yours / LIBD's too!
Code is the ultimate documentation
Git commit messages remind you of what you were thinking when you made a change
Bill Ulrich or Leo can give you access

Spatial registration code example
Code is constantly adapted and improved, both within and across projects
Project 1
• https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/Layer_Guesses/layer_specificity.R
• https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/Layer_Guesses/asd_snRNAseq_recast.R
Project 2
• https://github.com/LieberInstitute/spatialDLPFC/blob/main/code/analysis/07_spatial_registration/07_spatial_registration.R
Project 3
• https://github.com/LieberInstitute/Visium_IF_AD/blob/master/code/10_spatial_registration/01_spatial_registration.R
On the horizon: a new function at https://github.com/LieberInstitute/spatialLIBD
@sowmyapartybun Sowmya Parthiban, @abspangler Abby Spangler
It's hard to keep track of code evolution
• Try to include comments linking back to where you adapted it from
Basically: divide and conquer ^^

@lcolladotor Leonardo Collado-Torres @lahuuki Louise A Huuki-Myers @JoshStolz2 Joshua M
Stolz @Nick-Eagles (GH) Nicholas J Eagles @geo_pertea Geo Pertea @abspangler Abby Spangler @mattntran Matthew N Tran @lmwebr Lukas M Weber @stephaniehicks Stephanie C Hicks @MadhaviTippani Madhavi Tippani @sowmyapartybun Sowmya Parthiban + Many more LIBD, JHU, and external collaborators @PardoBree Brenda Pardo @HeenaDivecha Heena R Divecha @ckbehemoth (GH) William S Ulrich @martinowk Keri Martinowich @kr_maynard Kristen R Maynard @CerceoPage Stephanie C Page
