$30 off During Our Annual Pro Sale. View Details »

LIBD_DS_TLDR

 LIBD_DS_TLDR

Leonardo Collado-Torres

April 20, 2022
Tweet

More Decks by Leonardo Collado-Torres

Other Decks in Science

Transcript

  1. R/Bioconductor-powered Team Data Science Leonardo Collado-Torres Louise A. Huuki-Myers Joshua

    M. Stolz Nicholas J. Eagles + Geo Pertea https://lcolladotor.github.io/bioc_team_ds/ April 20, 2022 https://speakerdeck.com/lcolladotor/libd-ds-tldr
  2. Benefit of TopMed • TopMed offers a reference panel for

    far more snps (~300 million) • The Rsq value for lower MAF is preserved in populations of african ancestry. • Can this increased power be used to include lower MAFs? • Should we just filter by Rsq? MAF: minor allele frequency Rsq: R squared
  3. PopTop:NextFlow Pipeline • Parallelizes computationally expensive tasks • Allows for

    the automation of large jobs • Documentation is available at: ◦ https://research.libd.org/Topmed-Imputation-Pipeline/
  4. Future Works • Working to make this modular with the

    upcoming LIBD Data Portal. • Writing scripts to make delivering subsets of samples more time feasible. • Continuing to maintain and update the documentation website to make it more robust and user friendly. @geo_pertea Geo Pertea @Nick-Eagles (GH) Nicholas J Eagles
  5. Bulk RNA-seq Processing: Motivation and Challenges - Data processing should

    be uniform across time/ datasets, documented, and reproducible - What aligner was used for this dataset? - Did we use hg19 or hg38? What GENCODE release was used? - How did we decide which samples to trim, if any? - Which version of FastQC was used? - While many computational steps are involved before analyses (e.g. DE) are possible, data pre-processing should ideally not require technical expertise to apply
  6. Example Analysis - Demonstrate how to run SPEAQeasy on real

    data and work with its outputs - Use variant calling results to find and resolve identity issues originating from labelling mistakes - Perform a differential analysis after attaching experiment-specific sample metadata http://research.libd.org/SPEAQeasy-example @lahuuki Louise A Huuki-Myers @JoshStolz2 Joshua M Stolz
  7. Configuration - Each user recommended to install SPEAQeasy separately -

    git clone [email protected] :LieberInstitute/SPEAQeasy.git - cd SPEAQeasy - bash install_software.sh "jhpce" - Control: - exact annotation files used (GENCODE/ Ensembl versions) - command-line arguments to software - how to trim samples, if at all - aligner (HISAT2/ STAR) and pseudo-aligner (kallisto/ salmon) /dcs04/lieber/ds2a/Data/CMC/Data/RNAseq/Raw/SPEAQeasy
  8. Recent Improvements and Future Work Post-publication improvements: - Alignment-related optimizations

    resulting in reduction in disk space and computational time - Support for singularity leading to new Cardiff users and greatly expanding possible users (collaboration with Nick Clifton) - Added raw counts for transcripts in addition to TPM Future improvements: - Want to allow a user to only perform alignment, or only quantify transcripts - Reduce required memory when counting junctions to produce R objects @geo_pertea Geo Pertea
  9. WGBS: Motivation and Challenges - Profile DNA methylation, a critical

    epigenetic modification, across the entire human genome - Differentially methylated regions (DMRs), e.g. between schizophrenia and controls - Methylation quantitative trait loci (MeQTLs) - Large data size (> 1B cytosines measured, ~2TB disk space per sample) - Careful choices must be made to fit files on JHPCE, generate/load results with available memory - Several steps are required before raw sequenced reads yield methylation proportions ready for analysis - Trimming, alignment to reference genome, extract methylation proportions, import to R
  10. Workflow in Practice - Datasets where we’ve applied BiocMAP: -

    664 PsychENCODE Schizophrenia/control samples (Hippocampus, DLPFC, Caudate) - 20 PsychENCODE fetal samples - 2597 VA PTSD samples (year 1 and 2) - How long does it take to run? - ~3 months for 648 samples (second module) - ~2 weeks for 43 samples (both modules at JHPCE) - How much disk space is required? - ~2TB disk space per sample while generating; 1TB outputs
  11. LIBD Data integration • relational database tracking data assets at

    LIBD • linking LIMS to processed data • flexible database back-end & indexed file storage • unified web interface for data queries
  12. brains histological samples extraction sequencing sequencing samples processing id brnum

    brint age sex race dx_id subjects id name subj_id region sdate samples id dataset_id s_id s_name sample_id protocol restricted numReads numMapped totalMapped overallMapRate ... mitoRate rRNA_rate totalAssignedGene exp_metadata exp_id dtype ftype G / E / J / T f_set_id f_data real [ ] version exp_data Experiment data flow H5 filesystem PostgreSQL database assay data Parquet
  13. PostgreSQL relational database demographic data experiment metadata histological sample metadata

    genomic features (annotations) assay data id subj_id dnum sample_id panel_id batch_id call_rate p10gc, p50gc nPennCNV [ ] SUM16,SUM20 imputation data_path genotype location of data files on file system storage
  14. Integration of PostgreSQL and R from back-end to front-end Leveraging

    R’s data processing and visualization capabilities SQL + R code: Front-end (web application) Back-end PostgreSQL server client selects dataset (sample metadata only) client receives results & plot data middleware (nodejs) retrieve sample data process large data output results SQL / R server returns results & baked plot data (plotly JSON ) srv16
  15. Quality control + normalization • emptyDrops() from DropletUtils ◦ Determine

    the empty droplets • isOutlier() from scran ◦ Identify outlier cells/nuclei based on mitochondrial expression and other metrics • devianceFeatureSelection()+ nullResiduals() from scry ◦ GLM-PCA approximation by Townes, Hicks, Ayree, and Irizarry https://doi.org/10.1186/s13059-019-1861-6 • reduceMNN() from batchelor ◦ Batch correction since sc/snRNA-seq has strong sample effects • + much more before you get to annotated clusters of cells @mattntran Matthew N Tran @Erik-D-Nelson (GH) Erik D Nelson
  16. • Inferring the composition of different cell types in a

    bulk RNA-seq data What is Deconvolution? Tissue Bulk RNA-seq snRNA-seq Estimated proportions 31 Deconvolution Get single cell like information from bulk RNA-seq $$$ $ Free! https://twitter.com/BoXia7/status/1261464021322137600
  17. Peric = Mural + Endo Mean Proportions By Region: Tran

    et al, Neuron, 2021 (8 donors, 10 cell types)
  18. • Bisque has more similar pattern of composition over regions

    vs. SPLITR • MuSiC predicts large proportions of Endo + Mural (Peric) • Both estimate lower proportions of Excit ◦ MuSiC is more extreme and also predicts low portion Inhib Bisque & MuSiC vs SPLITR Different deconvolution methods, bulk RNA-seq data source, marker genes, and reference snRNA-seq data
  19. • Run with set of 20 & 25 marker genes

    per cell type • Bisque is more robust to changes in the marker set than MuSiC Method Sensitivity to Marker Set 25 vs. 20 Genes Currently Bisque is our method of choice
  20. Dataset Regions Samples Case Control Analysis Publication BipSeq sACC +

    AMY 511 247 BPD 264 Revisions Zandi et al., Nat. Neurosci, 2022 Suicide Genomics DLPFC 329 226 103 Revisions Punzi et al., American Journal of Psychiatry, 2022 BrainSeq Phase III Caudate 464 298 SCZD 266 Revisions Benjamin et al., Nature Neuroscience, 2022 MDDseq sACC + AMY 1091 704 MDD/BPD 387 Main In Progress AANRI DG, Caudate, Hippo, DLPFC 1647 (263, 464, 447, 453) - - Main In Progress Astellas AD Main In Progress BrainSeq Phase I DLPFC 727 395 SCZD 332 Exploratory - BrainSeq Phase II DLPFC 453 153 SCZD 300 Exploratory - GTEx 13 Regions 2670 - - Exploratory - Degradation AMY, Caudate, DLPFC, HIPPO, mPFC, sACC 119 - - Exploratory -
  21. Upcoming: Deconvolution Methods Benchmark • Goal: determine the most accurate

    deconvolution method for brain bulk RNA-seq data ◦ Test available softwares (Bisque, MuSiC, and others) over a variety of conditions ▪ Reference set qualities ▪ Marker Genes selection ▪ Preparation of the bulk data • Requires: A “gold standard” cell type composition reference to measure performance ◦ snRNA-seq can be enriched for certain cell types ◦ smFISH + RNAscope allows “direct” measurement from intact tissue, will be used to establish true composition
  22. Bulk RNA-seq Goals for RNAscope Experiment • Deconvolution R01 MH123183

    ◦ Kristen Maynard, Stephanie C Hicks • Use six slices of DLPFC to generate corresponding RNA-seq & RNAscope data • This information will be useful to evaluate and design deconvolution algorithms DLPFC Bulk RNA-seq snRNA-seq Spatial RNAscope RNAscope 38 polyA RiboZero @kr_maynard Kristen R Maynard @stephaniehicks Stephanie C Hicks Kelsey D Montgomery
  23. What is a TREG? • Total RNA Expression Gene •

    Expression is proportional to the overall RNA expression in a cell • In smFISH the count of TREG puncta in a cell can estimate the RNA content ◦ Linking RNA content to nucleus size http://research.libd.org/TREG/ http://bioconductor.org/packages/TREG/
  24. Key inputs • Genotype Data ◦ Consider minor allele frequency

    ◦ Full topMed imputed SNP data set ◦ Risk SNP subset • Expression Data ◦ Gene, exon, junction, transcript ◦ Position of the feature • Covariates Data ◦ Phenotype data: Dx, Age, Sex ◦ Feature PCs • Interaction Data ◦ Example: cell fractions from deconvolution • Parameters ◦ Window size ◦ Minor allele frequency PopTop Genotype data SPEAQeasy generated Summarizedexperiment TensorQTL + parameters Deconvolution or other analysis Covariate Data as matrix Plink files containing SNPs of interest Interaction vector Feature position + expression matrix Only for interaction analysis eQTL results
  25. MatrixEQTL vs tensorQTL (fastQTL) MatrixEQTL • R package • Many

    Andrew E Jaffe analyses: ◦ BrainSEQ Phase II ◦ Burke et al stem cell ◦ … ◦ BipSeq by Zandi et al tensorQTL • Python, GPU enabled • Currently utilized in MDDseq project • Recommended upgrade by Andrew Jaffe, utilized by other LIBD researchers • github.com/broadinstitute/tensorqtl https://youtu.be/zOMU XYHtVJM
  26. Genome-wide eQTLs: several flavors • Nominal: evaluate all pairs •

    Cis: find most significant pair per feature • Independent: conditionally independent cis-QTLs using stepwise regression
  27. tensorQTL at JHPCE (GPU-powered) Data Formatting • Genotype Data ◦

    Needs .bed/.bim/.bam files • Expression Data ◦ Gene, exon, junction, transcript ◦ As .bed.gz • Covariates Data ◦ Phenotype data: Dx, Age, Sex, Feature PCs ▪ Categorical variables must be converted to numeric ◦ File type flexible, need to read in as pandas.DataFrame How to Run on GPU • Can be used as a function in python script or as command line tool ◦ Requires conversion to correct data formats • Fast when run on GPU ◦ Completed MDDseq Amygdala Gene analysis in 2.52 min vs 51.21 min on CPU (vs. 288 min matrixEQTL) ▪ 540 samples x 53.6M pairs • Use GPU queue when submitting job ◦ Example sh file #$ -l gpu,mem_free=50G,h_vmem=50G,h_fsize=100G
  28. GWAS-loci eQTL analysis • Subset genotype dataset to SNPs identified

    as risk loci by GWAS • Check for association with cellular fractions predicted by deconvolution ◦ Run nominal analysis w/ addition of interaction vector ◦ Adds interaction term to the model ▪ p ~ g + i + gi PGC Major Depressive Disorder GWAS Wray et al. Nature Genetics, 2018 Deconvolution Results
  29. Differential expression is confounded by degradation The t-statistics between SCZ

    vs Control and degradation time DE are correlated. Traditional methods (like RIN) fail to remove this affect. Jaffe AE, Tao R, Norris AL, Kealhofer M, Nellore A, Shin JH, et al. qSVA framework for RNA quality correction in differential expression analysis. Proc Natl Acad Sci U S A. 2017;114:7130–5.
  30. qSVA Original Process Each sample was allowed to degrade on

    a bench for 0,15,30,60 minutes. From this we get the top 1000 expressed regions associated with degradation. Peterson, Amy. “Quality Surrogate Variable Analysis.” LIBD Rstats Club, LIBD Rstats Club, 11 Dec. 2018, research.libd.org/rstatsclub/2018/12/11/quality-surrogate-variable-analysis/
  31. Deconvolution in Degradation Matrix • Identify 2,976 degradation associated transcripts

    with cell proportion terms in model (vs. 1,792) • Controlling expression for qSVs predicted with this set of transcripts shows lower correlations between DE results and degradation statistic (desired result) Cor = -0.091 Cor = -0.051
  32. Key inputs • Quality Controlled Expression Data • Model &

    corresponding data ◦ Primary Dx (explanatory variable) ◦ Phenotype data (AgeDeath, Sex) ◦ Quality control metrics ▪ mitoRate, rRNA_rate, totalAssignedGene, RIN, ERCC ◦ snpPCs ▪ from DNA Genotyping ◦ qSVs ▪ from qSVA v1 or v2 (qsvaR) ◦ Other Analysis ▪ E.x. Deconvolution cell fractions ~Dx + pd + QC + snpPC + qSVs + ? Model Deconvolution or other analysis qsvaR Degradation matrix PopTop Genotype data SPEAQeasy generated Summarizedexperiment Model Matrix Normalized counts limma + voom process lmFit() eBayes() topTable() DE Results calcNormFactors() model.matrix()
  33. Modeling with limma: quick overview • calcNormFactors() from edgeR ◦

    For normalization of the bulk RNA-seq counts • model.matrix() from stats ◦ Define how you want to model gene expression ◦ Covariates like qSVs, ancestry PCs, SPEAQeasy QC metrics, sex, age, diagnosis, … • lmFit() ◦ Fit the linear regression model for all genes • eBayes() ◦ Use empirical Bayes to compute the statistics • topTable() ◦ Extract results for downstream analyses More details at http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html edgeR, DESeq2, dream are good alternatives
  34. Why limma + voom? • Limma utilizes linear regression vs.

    DESeq2 utilizes negative binomial distribution ◦ Comparable results ◦ Limma is less computationally expensive (faster) • Other methods: ◦ DREAM: linear mixed effect model (Hoffman et al. Bioinformatics, 2021) ▪ More precise but computationally expensive ▪ May be better for analysis with small sample sizes
  35. Adding Cell Fraction to DE Model • Including Deconvolution can

    result in a more conservative model ◦ For the most part similar t-statistics ◦ Fever significant DE genes
  36. Spatially-resolved transcriptomics Leonardo Collado-Torres With help from: @abspangler Abby Spangler

    @lmwebr Lukas M Weber @stephaniehicks Stephanie C Hicks @MadhaviTippani Madhavi Tippani @sowmyapartybun Sowmya Parthiban @HeenaDivecha Heena R Divecha @PardoBree Brenda Pardo @Nick-Eagles (GH) Nicholas J Eagles @martinowk Keri Martinowich @kr_maynard Kristen R Maynard @CerceoPage Stephanie C Page
  37. 63 SpatialExperiment: infrastructure for spatially resolved transcriptomics data in R

    using Bioconductor Righelli, Weber, Crowell, et al, bioRxiv, 2021 Accepted at Oxford Bioinformatics on 04/19/2022 DOI 10.1101/2021.01.27.428431 Dario Righellli Helena L Crowell @drighelli @CrowellHL Lukas M Weber @lmwebr
  38. bioconductor.org/packages/spatialLIBD Pardo et al, bioRxiv, 2021 DOI 10.1101/2021.04.29.440149 Accepted at

    BMC Genomics on 04/20/2022 Maynard, Collado-Torres, Nat Neuro, 2021 Brenda Pardo Abby Spangler @PardoBree @abspangler
  39. OSTA: https://lmweber.org/OSTA-book/ @lmwebr Lukas M Weber @lcolladotor Leonardo Collado-Torres @abspangler

    Abby Spangler @HeenaDivecha Heena R Divecha @MadhaviTippani Madhavi Tippani @stephaniehicks Stephanie C Hicks
  40. Spatial registration of your sc/snRNA-seq data Your sc/snRNA-seq data Our

    spatial data Hodge et al, Nature, 2019 Maynard, Collado-Torres, Nat Neuro, 2021
  41. Share knowledge openly • As an independent researcher, my team

    and I are not a data science core, yet we share our knowledge openly so others can get up to speed if needed ◦ Share research results early through pre-prints (bioRxiv) ◦ Share code on GitHub with others ▪ GitHub is widely used as a social coding platform ◦ Share code snippets that might be useful to others ◦ Share our experiences ◦ Maintain and share information on several Slack communication channels ◦ People are free to adapt what we have done and we would love to learn about what others have come up with, since we might need to update/change our own work ▪ We do not impose solutions or make decisions for others
  42. Different types of help ◦ Things we can do ▪

    Guidance, feasibility, and/or brainstorming ▪ Data processing like with bulk RNA-seq with SPEAQeasy, DNA genotyping with PopTop, WGBS with BiocMAP, … • We would strongly prefer that others learn how to run these tools ◦ Aka, please use the documentation we wrote =) ▪ Sharing data with external collaborators • After internal LIBD approval by Rujuta Narurkar ◦ Things that are beyond what we can typically do ▪ Lead analysis ▪ Develop and/or maintain custom solutions ▪ Write papers
  43. Data Science guidance sessions (DSgs) • https://lcolladotor.github.io/bioc_team_ds/data-science-guidance-sessions.html ◦ JHPCE ◦

    R ◦ Bioconductor ◦ Understanding code we wrote ◦ Training others on how they can more effectively get help from us or others ▪ Providing reproducible examples: reprex in R https://reprex.tidyverse.org/ which provides a solution to “help me help you” ▪ Framing questions and software bug reports • The DSgs system works best over the long term ◦ It’s based on my 3 yr experience as an JHBSPH MpH capstone teaching assistant
  44. https://www.youtube.com/c/LeonardoColladoTorres/playlists Videos allow us to multiply ourselves We can make

    you custom selections of videos for a specific problem on DSgs sessions
  45. https://github.com/search?q=org%3ALieberInstitute+aggregateAcrossCells&type=code GitHub is our library / encyclopedia It could be

    yours / LIBD’s too! Code is the ultimate documentation Git commit messages remind you of what you were thinking when you made a change Bill Ulrich or Leo can give you access
  46. Project 1 • https://github.com/LieberInstitute/HumanPilot/blob/ master/Analysis/Layer_Guesses/layer_specificity.R • https://github.com/LieberInstitute/HumanPilot/blob/ master/Analysis/Layer_Guesses/asd_snRNAseq_re cast.R Project

    2 • https://github.com/LieberInstitute/spatialDLPFC/blo b/main/code/analysis/07_spatial_registration/07_sp atial_registration.R Project 3 • https://github.com/LieberInstitute/Visium_IF_AD/blo b/master/code/10_spatial_registration/01_spatial_r egistration.R On the horizon: A new function at https://github.com/LieberInstitute/spatialLIBD @sowmyapartybun Sowmya Parthiban @abspangler Abby Spangler Code is constantly adapted and improved, both within and across projects Spatial registration code example It’s hard to keep track of code evolution • Try to include comments linking back to where you adapted it from Basically: divide and conquer ^^
  47. @lcolladotor Leonardo Collado-Torres @lahuuki Louise A Huuki-Myers @JoshStolz2 Joshua M

    Stolz @Nick-Eagles (GH) Nicholas J Eagles @geo_pertea Geo Pertea @abspangler Abby Spangler @mattntran Matthew N Tran @lmwebr Lukas M Weber @stephaniehicks Stephanie C Hicks @MadhaviTippani Madhavi Tippani @sowmyapartybun Sowmya Parthiban + Many more LIBD, JHU, and external collaborators @PardoBree Brenda Pardo @HeenaDivecha Heena R Divecha @ckbehemoth (GH) William S Ulrich @martinowk Keri Martinowich @kr_maynard Kristen R Maynard @CerceoPage Stephanie C Page