11 Analyzing BrainSEQ Phase II and generating the recount-brain resource Leonardo Collado-Torres Staff Scientist II @fellgernon Data Science I with Andrew Jaffe
What defines me • Bioinformatics • R and Bioconductor • Reproducibility and best practices • Outreach and community building • Back in 2005 at @LCGUNAM: I like math and coding; biology provides the challenging problems
History 2005-2009 Undergrad in Genomic Sciences 2009-2011 2011-2016 August 2016+ Data Science Division Leader ! ! PIs: • Jeff Leek: 2012+ • Andrew Jaffe: 2013+ Ph.D. Biostatistics Staff Scientist I → II Data Science Team I PI: Andrew Jaffe
My path to LIBD 2005-2009 Undergrad in Genomic Sciences 2009-2011 2011-2016 August 2016+ Data Science Division Leader ! ! PIs: • Jeff Leek: 2012+ • Andrew Jaffe: 2013+ Ph.D. Biostatistics Staff Scientist Data Science Team I Microarrays, little RNA-seq N = 12 (bacteria) N = 59 N = 72 potential for 1,000 BSP2: N = 900 + many more
Ph.D. overview The goal was to develop statistical methods and software [...] in RNA-seq […]. We applied these methods to further our understanding of neuropsychiatric disorders using the Lieber Institute for Brain Development human brains collection (> 1000 samples).
Lieber Institute for Brain Development • Role: Staff Scientist II • PI: Andrew Jaffe Team: • Staff Scientist: Emily Burke • Research Associate: Madhavi Tippani • Grad students: Matt Nguyen, Brianna Barry, Kira Perzel Mandell • Research Assistant: Nick Eagles • Recent Alumni: Amanda Price, Stephen Semick • Close collaborators: Carrie Wright, Nina Rajpurohit • Role details: like a postdoc (major role in some projects) with some support projects Lab at Amy Peterson’s MPH capstone presentation amy-peterson.github.io
Lieber Institute for Brain Development • DNAm on WGBS across development & cell types biorxiv.org/content/early/2018/09/29/428391 • RNA-seq from stem cells biorxiv.org/content/early/2018/07/31/380758 Peer reviewed: • RNA-seq DE in Schizophrenia disorder & 2 brain regions doi.org/10.1016/j.neuron.2019.05.013 • miRNA kit prep comparison doi.org/10.1186/s12864 • DNAm and gene DE in Alzheimer’s disease doi.org/10.1007/s00401-019-01966-5 • RNA-seq smoking during pregnancy doi.org/10.1038/s41380-018-0223-1 • RNA-seq DE in Schizophrenia disorder on DLPFC doi.org/10.1038/s41593-018-0197-y • Histamine signaling in autism spectrum disorder doi.org/10.1038/tp.2017.87 Pre-prints:
DATA 17 BrainSeq Phase II RNA-seq samples: by case status DLPFC HIPPO total adult 222 238 460 prenatal 29 28 57 0 <= age < 18 49 48 97 total 300 314 614 DLPFC HIPPO total adult 152 132 284 prenatal 0 0 0 0 <= age < 18 1 1 2 total 153 133 286 Control Schizophrenia cases
DATA ANALYSIS 18 Focus on being conservative 1.Use well established processing methods 2.Apply strict expression cutoffs 3.Use replication when possible 4.Adjust for RNA quality degradation confounding • Using the qSVA method 5.Avoid potential batch effects • Drop problematic samples 6.Take into account correlation at the individual level
RNA-SEQ APPROACH 19 BrainSeq Phase II Pre-natal Adult ACTG Birth Unaffected Controls Patients with Schizophrenia RNA Sequencing Genotyping + + + Gene Exons Expressed Regions Transcripts Junctions Age CC CA AA SZ CONT DLPFC HIPPO + region differences For public data, check recount2
DATA ANALYSIS 20 Main processing steps 1. Quality check (QC) on raw reads (FastQC) 2. Failed QC? Then trim reads (Trimmomatic) 3. Align reads to the genome (HISAT2) 4. Count features (featureCounts + others) 5. Calculate coverage (bam2wig) 6. Quantify transcripts (Salmon) 7. Create count tables (R) 8. Genotype samples (samtools + vcftools) Emily E. Burke Nextflow version in preparation with Winter Genomics and Nick Eagles
DATA ANALYSIS 22 Differential expression models • Region-specific for adult or fetal ages • Using only adult samples or only prenatal samples • Test for differences between DLFPC and HIPPO • Development • Linear age splines with breakpoints at developmental stages • Test for interaction between age and brain region at these splines • Case-control • By brain region • Test for differences between non-psychiatric controls and individuals with schizophrenia • For the first two models, we account for the fact that an individual can have two correlated samples: one for each brain region limma::duplicateCorrelation()
DATA ANALYSIS 25 Region-specific by age results • For adults: 1,612 DE genes with Bonferroni < 0.01 that replicate in BrainSpan • For prenatal samples: 32 DE genes 23 6 2 0 3 0 3 gene exon jxn DE features grouped by gene id (prenatal) 328 422 778 23 839 647 388 gene exon jxn DE features grouped by gene id (adult)
DATA ANALYSIS 26 Region-specific by age results: adult enriched biological processes by region G: gene, E: exon, J: exon-exon junction D: DLPFC, H: HIPPO cell−matrix adhesion extracellular matrix organization ameboidal−type cell migration microtubule bundle formation axoneme assembly cilium movement establishment of synaptic vesicle localization synaptic vesicle transport synaptic vesicle localization signal release from synapse neurotransmitter secretion presynaptic process involved in chemical synaptic transmission positive regulation of GTPase activity regulation of GTPase activity synaptic vesicle exocytosis neurotransmitter transport regulation of calcium ion−dependent exocytosis synaptic vesicle cycle regulation of synapse organization positive regulation of synaptic transmission regulation of neuron projection development regulation of small GTPase mediated signal transduction extracellular structure organization positive regulation of nervous system development positive regulation of neurogenesis positive regulation of cell development axon development axonogenesis regulation of hormone levels cardiac chamber development cardiac septum development kidney epithelium development nephron development urogenital system development cardiac chamber morphogenesis cardiac septum morphogenesis renal system development kidney development regulation of transmembrane transport actomyosin structure organization regulation of trans−synaptic signaling modulation of chemical synaptic transmission synapse organization regulation of cell morphogenesis synaptic transmission, glutamatergic regulation of synaptic transmission, glutamatergic potassium ion transmembrane transport cellular potassium ion transport potassium ion transport G.D (609) G.H (573) E.D (1101) E.H (1087) J.D (850) J.H (823) GeneRatio 0.02 0.03 0.04 0.05 0.06 0.01 0.02 0.03 0.04 p.adjust ontology: BP
DATA ANALYSIS 29 Development model results • 5,982 (~55%) genes contain differentially expressed exons and splice junctions that replicated in BrainSpan (Bonferroni < 1%) 2354 2260 1762 243 5982 8501 1558 gene exon jxn DE features grouped by gene id
qSVA WORKFLOW 32 Slide adapted from Amy Peterson Jaffe et al., PNAS, 2017 Model 1 (6429 genes) Log2 FC Dx Log2 FC Degradation http://research.libd.org/rstatsclub/2018/12/11/quality-surrogate-variable-analysis/
DEqual HIPPO 33 Model 1 (6429 genes) Model 1. Naïve model E / = 10 + 11 45 DEqual plots demonstrate effectiveness of statistical correction r = 0.412 Slide adapted from Amy Peterson Log2 FC Dx Log2 FC Degradation HIPPO 333 samples
axon development axonogenesis regulation of axonogenesis regulation of neuron projection development neutrophil chemotaxis regulation of B cell proliferation regulation of B cell activation positive regulation of B cell activation B cell receptor signaling pathway B cell activation lymphocyte activation regulation of leukocyte cell−cell adhesion regulation of cell−cell adhesion positive regulation of alpha−beta T cell activation regulation of leukocyte activation positive regulation of lymphocyte activation positive regulation of cell activation positive regulation of leukocyte activation leukocyte migration positive regulation of leukocyte cell−cell adhesion positive regulation of T cell activation leukocyte chemotaxis response to organophosphorus ATF6−mediated unfolded protein response positive regulation of cell adhesion cell chemotaxis positive regulation of cell−cell adhesion protein folding protein folding in endoplasmic reticulum HgC (137) HeC (306) DgC (297) DeS (250) GeneRatio 0.025 0.050 0.075 0.100 0.01 0.02 0.03 0.04 p.adjust ontology: BP DATA ANALYSIS 38 • BP enrichment in Control > SCZD at gene level • Immune processes g: gene, e: exon, j: exon-exon junction D: DLPFC, H: HIPPO, C: control, S: SCZD Case-control: by region