far more snps (~300 million) • The Rsq value for lower MAF is preserved in populations of african ancestry. • Can this increased power be used to include lower MAFs? • Should we just filter by Rsq? MAF: minor allele frequency Rsq: R squared
upcoming LIBD Data Portal. • Writing scripts to make delivering subsets of samples more time feasible. • Continuing to maintain and update the documentation website to make it more robust and user friendly. @geo_pertea Geo Pertea @Nick-Eagles (GH) Nicholas J Eagles
be uniform across time/ datasets, documented, and reproducible - What aligner was used for this dataset? - Did we use hg19 or hg38? What GENCODE release was used? - How did we decide which samples to trim, if any? - Which version of FastQC was used? - While many computational steps are involved before analyses (e.g. DE) are possible, data pre-processing should ideally not require technical expertise to apply
data and work with its outputs - Use variant calling results to find and resolve identity issues originating from labelling mistakes - Perform a differential analysis after attaching experiment-specific sample metadata http://research.libd.org/SPEAQeasy-example @lahuuki Louise A Huuki-Myers @JoshStolz2 Joshua M Stolz
git clone git@github.com :LieberInstitute/SPEAQeasy.git - cd SPEAQeasy - bash install_software.sh "jhpce" - Control: - exact annotation files used (GENCODE/ Ensembl versions) - command-line arguments to software - how to trim samples, if at all - aligner (HISAT2/ STAR) and pseudo-aligner (kallisto/ salmon) /dcs04/lieber/ds2a/Data/CMC/Data/RNAseq/Raw/SPEAQeasy
resulting in reduction in disk space and computational time - Support for singularity leading to new Cardiff users and greatly expanding possible users (collaboration with Nick Clifton) - Added raw counts for transcripts in addition to TPM Future improvements: - Want to allow a user to only perform alignment, or only quantify transcripts - Reduce required memory when counting junctions to produce R objects @geo_pertea Geo Pertea
epigenetic modification, across the entire human genome - Differentially methylated regions (DMRs), e.g. between schizophrenia and controls - Methylation quantitative trait loci (MeQTLs) - Large data size (> 1B cytosines measured, ~2TB disk space per sample) - Careful choices must be made to fit files on JHPCE, generate/load results with available memory - Several steps are required before raw sequenced reads yield methylation proportions ready for analysis - Trimming, alignment to reference genome, extract methylation proportions, import to R
664 PsychENCODE Schizophrenia/control samples (Hippocampus, DLPFC, Caudate) - 20 PsychENCODE fetal samples - 2597 VA PTSD samples (year 1 and 2) - How long does it take to run? - ~3 months for 648 samples (second module) - ~2 weeks for 43 samples (both modules at JHPCE) - How much disk space is required? - ~2TB disk space per sample while generating; 1TB outputs
brint age sex race dx_id subjects id name subj_id region sdate samples id dataset_id s_id s_name sample_id protocol restricted numReads numMapped totalMapped overallMapRate ... mitoRate rRNA_rate totalAssignedGene exp_metadata exp_id dtype ftype G / E / J / T f_set_id f_data real [ ] version exp_data Experiment data flow H5 filesystem PostgreSQL database assay data Parquet
genomic features (annotations) assay data id subj_id dnum sample_id panel_id batch_id call_rate p10gc, p50gc nPennCNV [ ] SUM16,SUM20 imputation data_path genotype location of data files on file system storage
R’s data processing and visualization capabilities SQL + R code: Front-end (web application) Back-end PostgreSQL server client selects dataset (sample metadata only) client receives results & plot data middleware (nodejs) retrieve sample data process large data output results SQL / R server returns results & baked plot data (plotly JSON ) srv16
the empty droplets • isOutlier() from scran ◦ Identify outlier cells/nuclei based on mitochondrial expression and other metrics • devianceFeatureSelection()+ nullResiduals() from scry ◦ GLM-PCA approximation by Townes, Hicks, Ayree, and Irizarry https://doi.org/10.1186/s13059-019-1861-6 • reduceMNN() from batchelor ◦ Batch correction since sc/snRNA-seq has strong sample effects • + much more before you get to annotated clusters of cells @mattntran Matthew N Tran @Erik-D-Nelson (GH) Erik D Nelson
bulk RNA-seq data What is Deconvolution? Tissue Bulk RNA-seq snRNA-seq Estimated proportions 31 Deconvolution Get single cell like information from bulk RNA-seq $$$ $ Free! https://twitter.com/BoXia7/status/1261464021322137600
vs. SPLITR • MuSiC predicts large proportions of Endo + Mural (Peric) • Both estimate lower proportions of Excit ◦ MuSiC is more extreme and also predicts low portion Inhib Bisque & MuSiC vs SPLITR Different deconvolution methods, bulk RNA-seq data source, marker genes, and reference snRNA-seq data
per cell type • Bisque is more robust to changes in the marker set than MuSiC Method Sensitivity to Marker Set 25 vs. 20 Genes Currently Bisque is our method of choice
deconvolution method for brain bulk RNA-seq data ◦ Test available softwares (Bisque, MuSiC, and others) over a variety of conditions ▪ Reference set qualities ▪ Marker Genes selection ▪ Preparation of the bulk data • Requires: A “gold standard” cell type composition reference to measure performance ◦ snRNA-seq can be enriched for certain cell types ◦ smFISH + RNAscope allows “direct” measurement from intact tissue, will be used to establish true composition
◦ Kristen Maynard, Stephanie C Hicks • Use six slices of DLPFC to generate corresponding RNA-seq & RNAscope data • This information will be useful to evaluate and design deconvolution algorithms DLPFC Bulk RNA-seq snRNA-seq Spatial RNAscope RNAscope 38 polyA RiboZero @kr_maynard Kristen R Maynard @stephaniehicks Stephanie C Hicks Kelsey D Montgomery
Expression is proportional to the overall RNA expression in a cell • In smFISH the count of TREG puncta in a cell can estimate the RNA content ◦ Linking RNA content to nucleus size http://research.libd.org/TREG/ http://bioconductor.org/packages/TREG/
◦ Full topMed imputed SNP data set ◦ Risk SNP subset • Expression Data ◦ Gene, exon, junction, transcript ◦ Position of the feature • Covariates Data ◦ Phenotype data: Dx, Age, Sex ◦ Feature PCs • Interaction Data ◦ Example: cell fractions from deconvolution • Parameters ◦ Window size ◦ Minor allele frequency PopTop Genotype data SPEAQeasy generated Summarizedexperiment TensorQTL + parameters Deconvolution or other analysis Covariate Data as matrix Plink files containing SNPs of interest Interaction vector Feature position + expression matrix Only for interaction analysis eQTL results
Andrew E Jaffe analyses: ◦ BrainSEQ Phase II ◦ Burke et al stem cell ◦ … ◦ BipSeq by Zandi et al tensorQTL • Python, GPU enabled • Currently utilized in MDDseq project • Recommended upgrade by Andrew Jaffe, utilized by other LIBD researchers • github.com/broadinstitute/tensorqtl https://youtu.be/zOMU XYHtVJM
Needs .bed/.bim/.bam files • Expression Data ◦ Gene, exon, junction, transcript ◦ As .bed.gz • Covariates Data ◦ Phenotype data: Dx, Age, Sex, Feature PCs ▪ Categorical variables must be converted to numeric ◦ File type flexible, need to read in as pandas.DataFrame How to Run on GPU • Can be used as a function in python script or as command line tool ◦ Requires conversion to correct data formats • Fast when run on GPU ◦ Completed MDDseq Amygdala Gene analysis in 2.52 min vs 51.21 min on CPU (vs. 288 min matrixEQTL) ▪ 540 samples x 53.6M pairs • Use GPU queue when submitting job ◦ Example sh file #$ -l gpu,mem_free=50G,h_vmem=50G,h_fsize=100G
as risk loci by GWAS • Check for association with cellular fractions predicted by deconvolution ◦ Run nominal analysis w/ addition of interaction vector ◦ Adds interaction term to the model ▪ p ~ g + i + gi PGC Major Depressive Disorder GWAS Wray et al. Nature Genetics, 2018 Deconvolution Results
vs Control and degradation time DE are correlated. Traditional methods (like RIN) fail to remove this affect. Jaffe AE, Tao R, Norris AL, Kealhofer M, Nellore A, Shin JH, et al. qSVA framework for RNA quality correction in differential expression analysis. Proc Natl Acad Sci U S A. 2017;114:7130–5.
a bench for 0,15,30,60 minutes. From this we get the top 1000 expressed regions associated with degradation. Peterson, Amy. “Quality Surrogate Variable Analysis.” LIBD Rstats Club, LIBD Rstats Club, 11 Dec. 2018, research.libd.org/rstatsclub/2018/12/11/quality-surrogate-variable-analysis/
with cell proportion terms in model (vs. 1,792) • Controlling expression for qSVs predicted with this set of transcripts shows lower correlations between DE results and degradation statistic (desired result) Cor = -0.091 Cor = -0.051
For normalization of the bulk RNA-seq counts • model.matrix() from stats ◦ Define how you want to model gene expression ◦ Covariates like qSVs, ancestry PCs, SPEAQeasy QC metrics, sex, age, diagnosis, … • lmFit() ◦ Fit the linear regression model for all genes • eBayes() ◦ Use empirical Bayes to compute the statistics • topTable() ◦ Extract results for downstream analyses More details at http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html edgeR, DESeq2, dream are good alternatives
DESeq2 utilizes negative binomial distribution ◦ Comparable results ◦ Limma is less computationally expensive (faster) • Other methods: ◦ DREAM: linear mixed effect model (Hoffman et al. Bioinformatics, 2021) ▪ More precise but computationally expensive ▪ May be better for analysis with small sample sizes
@lmwebr Lukas M Weber @stephaniehicks Stephanie C Hicks @MadhaviTippani Madhavi Tippani @sowmyapartybun Sowmya Parthiban @HeenaDivecha Heena R Divecha @PardoBree Brenda Pardo @Nick-Eagles (GH) Nicholas J Eagles @martinowk Keri Martinowich @kr_maynard Kristen R Maynard @CerceoPage Stephanie C Page
using Bioconductor Righelli, Weber, Crowell, et al, bioRxiv, 2021 Accepted at Oxford Bioinformatics on 04/19/2022 DOI 10.1101/2021.01.27.428431 Dario Righellli Helena L Crowell @drighelli @CrowellHL Lukas M Weber @lmwebr
and I are not a data science core, yet we share our knowledge openly so others can get up to speed if needed ◦ Share research results early through pre-prints (bioRxiv) ◦ Share code on GitHub with others ▪ GitHub is widely used as a social coding platform ◦ Share code snippets that might be useful to others ◦ Share our experiences ◦ Maintain and share information on several Slack communication channels ◦ People are free to adapt what we have done and we would love to learn about what others have come up with, since we might need to update/change our own work ▪ We do not impose solutions or make decisions for others
Guidance, feasibility, and/or brainstorming ▪ Data processing like with bulk RNA-seq with SPEAQeasy, DNA genotyping with PopTop, WGBS with BiocMAP, … • We would strongly prefer that others learn how to run these tools ◦ Aka, please use the documentation we wrote =) ▪ Sharing data with external collaborators • After internal LIBD approval by Rujuta Narurkar ◦ Things that are beyond what we can typically do ▪ Lead analysis ▪ Develop and/or maintain custom solutions ▪ Write papers
R ◦ Bioconductor ◦ Understanding code we wrote ◦ Training others on how they can more effectively get help from us or others ▪ Providing reproducible examples: reprex in R https://reprex.tidyverse.org/ which provides a solution to “help me help you” ▪ Framing questions and software bug reports • The DSgs system works best over the long term ◦ It’s based on my 3 yr experience as an JHBSPH MpH capstone teaching assistant
yours / LIBD’s too! Code is the ultimate documentation Git commit messages remind you of what you were thinking when you made a change Bill Ulrich or Leo can give you access
2 • https://github.com/LieberInstitute/spatialDLPFC/blo b/main/code/analysis/07_spatial_registration/07_sp atial_registration.R Project 3 • https://github.com/LieberInstitute/Visium_IF_AD/blo b/master/code/10_spatial_registration/01_spatial_r egistration.R On the horizon: A new function at https://github.com/LieberInstitute/spatialLIBD @sowmyapartybun Sowmya Parthiban @abspangler Abby Spangler Code is constantly adapted and improved, both within and across projects Spatial registration code example It’s hard to keep track of code evolution • Try to include comments linking back to where you adapted it from Basically: divide and conquer ^^
Stolz @Nick-Eagles (GH) Nicholas J Eagles @geo_pertea Geo Pertea @abspangler Abby Spangler @mattntran Matthew N Tran @lmwebr Lukas M Weber @stephaniehicks Stephanie C Hicks @MadhaviTippani Madhavi Tippani @sowmyapartybun Sowmya Parthiban + Many more LIBD, JHU, and external collaborators @PardoBree Brenda Pardo @HeenaDivecha Heena R Divecha @ckbehemoth (GH) William S Ulrich @martinowk Keri Martinowich @kr_maynard Kristen R Maynard @CerceoPage Stephanie C Page