Slide 1

Slide 1 text

11 Analyzing BrainSeq Phase II and generating the recount-brain resource Leonardo Collado-Torres @fellgernon Data Science I with Andrew Jaffe

Slide 2

Slide 2 text

Questions I’ll answer today • Who am I? • What is the BrainSeq Phase II project? • What is the recount-brain resource?

Slide 3

Slide 3 text

What defines me • Bioinformatics • R and Bioconductor • Reproducibility and best practices • Outreach and community building • Back in 2005: I like math and coding; biology provides the challenging problems

Slide 4

Slide 4 text

History 2005-2009 Undergrad in Genomic Sciences 2009-2011 2011-2016 August 2016+ Data Science Division Leader ! ! PIs: • Jeff Leek: 2012+ • Andrew Jaffe: 2013+ Ph.D. Biostatistics Staff Scientist Data Science Team I

Slide 5

Slide 5 text

2008+ • BioC 2008-2011, 2014, 2017 • useR!2013 • rOpenSci unconf 2018 • RStudio::conf 2019 @fellgernon 2010+ Interests ! @LIBDrstats 2018+ @CDSBMexico 2018+ Defunct: BmoreBiostats, Biostats Cultural Mixers Guest @RLadiesBmore Blog: http://lcolladotor.github.io 2011+ FB: 75k, Tw: 66k weekly

Slide 6

Slide 6 text

Software tools I use every day Mercurial/git user 2009+ https://github.com/lcolladotor From: Feb 21, 2019 Communication tool for work and communities

Slide 7

Slide 7 text

My path to LIBD 2005-2009 Undergrad in Genomic Sciences 2009-2011 2011-2016 August 2016+ Data Science Division Leader ! ! PIs: • Jeff Leek: 2012+ • Andrew Jaffe: 2013+ Ph.D. Biostatistics Staff Scientist Data Science Team I Microarrays, little RNA-seq N = 12 (bacteria) N = 59 N = 72 potential for 1,000 BSP2: N = 900 + many more

Slide 8

Slide 8 text

Ph.D. overview The goal was to develop statistical methods and software [...] in RNA-seq […]. We applied these methods to further our understanding of neuropsychiatric disorders using the Lieber Institute for Brain Development human brains collection (> 1000 samples).

Slide 9

Slide 9 text

Lieber Institute for Brain Development • Role: Staff Scientist • PI: Andrew Jaffe Team: • Staff Scientist: Emily Burke • Research Associate: Madhavi Tippani • Postdocs: Amanda Price • Grad students: Matt Nguyen, Brianna Barry, Kira Perzel Mandell • Research Assistant: Nick Eagles, Stephen Semick* • Close collaborators: Carrie Wright, Nina Rajpurohit • Role details: like a postdoc (major role in some projects) with some support projects * Now at University of Maryland in med school Lab at Amy Peterson’s MPH capstone presentation

Slide 10

Slide 10 text

Papers since 2016

Slide 11

Slide 11 text

Papers since 2016 RNA-seq WGBS Biostatistics RNA-seq > 70,000 samples

Slide 12

Slide 12 text

Lieber Institute for Brain Development • miRNA kit prep comparison biorxiv.org/content/early/2018/10/30/445437 • DNAm on WGBS across development & cell types biorxiv.org/content/early/2018/09/29/428391 • RNA-seq DE in Schizophrenia disorder & 2 brain regions biorxiv.org/content/early/2018/09/26/426213 • RNA-seq from stem cells biorxiv.org/content/early/2018/07/31/380758 Peer reviewed: • DNAm and gene DE in Alzheimer’s disease doi.org/10.1007/s00401-019-01966-5 • RNA-seq smoking during pregnancy doi.org/10.1038/s41380-018-0223-1 • RNA-seq DE in Schizophrenia disorder on DLPFC doi.org/10.1038/s41593-018-0197-y • Histamine signaling in autism spectrum disorder doi.org/10.1038/tp.2017.87 Pre-prints:

Slide 13

Slide 13 text

Today BrainSeq Phase II brain

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

BrainSeq: A Human Brain Genomics Consortium DLPFC 495 samples BrainSeq Phase I polyA+ Jaffe et al., Nature Neuroscience, 2018 DLPFC 453 samples HIPPO 447 samples BrainSeq Phase II RiboZero THE SCIENTIFIC FRONTIER Neuron, 2015 Collado-Torres et al, bioRxiv, 2018 Manuscript status: under revision at Neuron

Slide 16

Slide 16 text

DATA 16 BrainSeq Phase II RNA-seq samples DLPFC HIPPO total adult (age >= 18) 374 370 744 prenatal 29 28 57 0 <= age < 18 50 49 99 total 453 447 900 • Non-psychiatric control and schizophrenia affected individuals • Two brain regions: dorsolateral prefrontal cortex and hippocampus All samples

Slide 17

Slide 17 text

DATA 17 BrainSeq Phase II RNA-seq samples: by case status DLPFC HIPPO total adult 222 238 460 prenatal 29 28 57 0 <= age < 18 49 48 97 total 300 314 614 DLPFC HIPPO total adult 152 132 284 prenatal 0 0 0 0 <= age < 18 1 1 2 total 153 133 286 Control Schizophrenia cases

Slide 18

Slide 18 text

DATA ANALYSIS 18 Focus on being conservative 1.Use well established processing methods 2.Apply strict expression cutoffs 3.Use replication when possible 4.Adjust for RNA quality degradation confounding • Using the qSVA method 5.Avoid potential batch effects • Drop problematic samples 6.Take into account correlation at the individual level

Slide 19

Slide 19 text

RNA-SEQ APPROACH 19 BrainSeq Phase II Pre-natal Adult ACTG Birth Unaffected Controls Patients with Schizophrenia RNA Sequencing Genotyping + + + Gene Exons Expressed Regions Transcripts Junctions Age CC CA AA SZ CONT DLPFC HIPPO + region differences For public data, check recount2

Slide 20

Slide 20 text

DATA ANALYSIS 20 Main processing steps 1. Quality check (QC) on raw reads (FastQC) 2. Failed QC? Then trim reads (Trimmomatic) 3. Align reads to the genome (HISAT2) 4. Count features (featureCounts + others) 5. Calculate coverage (bam2wig) 6. Quantify transcripts (Salmon) 7. Create count tables (R) 8. Genotype samples (samtools + vcftools) Emily E. Burke Nextflow version in preparation with Winter Genomics and Nick Eagles

Slide 21

Slide 21 text

DATA ANALYSIS 21 Filter features with low expression ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 20000 30000 40000 50000 mean expression cutoff Number of features > cutoff Feature Cutoff Unit Gene 0.25 RPKM Exon 0.30 RPKM Jxn 0.46 RP10M Tx 0.38 TPM Gene • Used mean across all 900 samples • All analyses with filtered data Mean RPKM # genes > cutoff Based on the segmented R package: break-point estimation jaffelab::expression_cutoff()

Slide 22

Slide 22 text

DATA ANALYSIS 22 Differential expression models • Region-specific for adult or fetal ages • Using only adult samples or only prenatal samples • Test for differences between DLFPC and HIPPO • Development • Linear age splines with breakpoints at developmental stages • Test for interaction between age and brain region at these splines • Case-control • By brain region • Test for differences between non-psychiatric controls and individuals with schizophrenia • For the first two models, we account for the fact that an individual can have two correlated samples: one for each brain region limma::duplicateCorrelation()

Slide 23

Slide 23 text

DATA ANALYSIS 23 Differential expression models • Region-specific for adult or prenatal ages • Alternative: !"#$ = &' + )*+ + ,+" + ∑./0 1 23#45. + 6789:)8+ + 898);<227*3+=>+3+ + :?@ + :+*793 • Development • Alternative: !"#$ = &' + )*+ ∗ :+*793 + B+8); ∗ :+*793 + C7$8ℎ ∗ :+*793 + 73B)38 ∗ :+*793 + Eℎ7;= ∗ :+*793 + 8++3 ∗ :+*793 + )=F;8 ∗ :+*793 + ,+" + ∑./0 1 23#45. + 6789:)8+ + 898);<227*3+=>+3+ + :?@ + :+*793 • Case-control • Alternative: !"#$ = &' + )*+ + ,+" + 6789:)8+ + $:@+3+ + :?@ + $+*793,#+E7B7EK,L2 + M7)*39272

Slide 24

Slide 24 text

DATA ANALYSIS 24 Using BrainSpan for replication: region-specific model • P-value < 0.05 in BrainSpan, consistent direction Similar results for the development model prenatal adult gene exon jxn p<0.05 p<0.01 p<0.001 p<1e−04 p<1e−05 p<1e−06 p<0.05 p<0.01 p<0.001 p<1e−04 p<1e−05 p<1e−06 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 p−value threshold Replication rate

Slide 25

Slide 25 text

DATA ANALYSIS 25 Region-specific by age results • For adults: 1,612 DE genes with Bonferroni < 0.01 that replicate in BrainSpan • For prenatal samples: 32 DE genes 23 6 2 0 3 0 3 gene exon jxn DE features grouped by gene id (prenatal) 328 422 778 23 839 647 388 gene exon jxn DE features grouped by gene id (adult)

Slide 26

Slide 26 text

DATA ANALYSIS 26 Region-specific by age results: adult enriched biological processes by region G: gene, E: exon, J: exon-exon junction D: DLPFC, H: HIPPO cell−matrix adhesion extracellular matrix organization ameboidal−type cell migration microtubule bundle formation axoneme assembly cilium movement establishment of synaptic vesicle localization synaptic vesicle transport synaptic vesicle localization signal release from synapse neurotransmitter secretion presynaptic process involved in chemical synaptic transmission positive regulation of GTPase activity regulation of GTPase activity synaptic vesicle exocytosis neurotransmitter transport regulation of calcium ion−dependent exocytosis synaptic vesicle cycle regulation of synapse organization positive regulation of synaptic transmission regulation of neuron projection development regulation of small GTPase mediated signal transduction extracellular structure organization positive regulation of nervous system development positive regulation of neurogenesis positive regulation of cell development axon development axonogenesis regulation of hormone levels cardiac chamber development cardiac septum development kidney epithelium development nephron development urogenital system development cardiac chamber morphogenesis cardiac septum morphogenesis renal system development kidney development regulation of transmembrane transport actomyosin structure organization regulation of trans−synaptic signaling modulation of chemical synaptic transmission synapse organization regulation of cell morphogenesis synaptic transmission, glutamatergic regulation of synaptic transmission, glutamatergic potassium ion transmembrane transport cellular potassium ion transport potassium ion transport G.D (609) G.H (573) E.D (1101) E.H (1087) J.D (850) J.H (823) GeneRatio 0.02 0.03 0.04 0.05 0.06 0.01 0.02 0.03 0.04 p.adjust ontology: BP

Slide 27

Slide 27 text

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Pluripotency Glial Fetal Infant C hild Teens Adult 50+ Fetal Infant C hild Teens Adult 50+ 0.4 0.6 0.8 0.0 0.1 0.2 0.3 Age Group Cell Type Proportion Region DLPFC Hippo DATA ANALYSIS 27 Development model: similar composition in prenatal across regions by DNAm Stephen A. Semick

Slide 28

Slide 28 text

DATA ANALYSIS 28 Development model results • 5,982 (~55%) genes contain differentially expressed exons and splice junctions that replicated in BrainSpan (Bonferroni < 1%) 2354 2260 1762 243 5982 8501 1558 gene exon jxn DE features grouped by gene id

Slide 29

Slide 29 text

DATA ANALYSIS 29 Development model results • Normalized expression over age for GABRD −2 0 2 4 6 log2(CPM + 0.5) 14 16 18 20 22 PCW 0.0 0.2 0.4 2 4 6 8 12 14 16 18 2020 30 40 50 50 55 60 65 70 75 80 85 ENSG00000187730.7 GABRD p−bonf 0 Age DLPFC HIPPO

Slide 30

Slide 30 text

DATA ANALYSIS 30 Development model results • Normalized expression over age after removing effect of terms from the null model jaffelab::cleaningY() and jaffelab::agePlotter() −2 −1 0 1 2 3 log2(CPM + 0.5) − covariate effects removed 14 16 18 20 22 PCW 0.0 0.2 0.4 2 4 6 8 12 14 16 18 2020 30 40 50 50 55 60 65 70 75 80 85 ENSG00000187730.7 GABRD p−bonf 0 Age DLPFC HIPPO

Slide 31

Slide 31 text

qSVA WORKFLOW 31 Slide adapted from Amy Peterson Jaffe et al., PNAS, 2017 Model 1 (6429 genes) Log2 FC Dx Log2 FC Degradation http://research.libd.org/rstatsclub/2018/12/11/quality-surrogate-variable-analysis/

Slide 32

Slide 32 text

DEqual HIPPO 32 Model 1 (6429 genes) Model 1. Naïve model E / = 10 + 11 45 DEqual plots demonstrate effectiveness of statistical correction r = 0.412 Slide adapted from Amy Peterson Log2 FC Dx Log2 FC Degradation HIPPO 333 samples

Slide 33

Slide 33 text

DEqual HIPPO 33 Model 1 (6429 genes) Model 2 (63 genes) Model 1. Naïve model E / = 10 + 1145 Model 2. Added RNA-quality and demographic covariates E / = 10 + 1145 + 12FGH + 13JH5+ 14LMNOPFNH+ 15RPSTRFte + 16GHVHPFNH + 17PXS + ∑Z[\ ] γMJV_`aM DEqual plots demonstrate effectiveness of statistical correction r = 0.412 r = 0.0712 Slide adapted from Amy Peterson Log2 FC Dx Log2 FC Dx, adj. cov Log2 FC Degradation Log2 FC Degradation HIPPO 333 samples

Slide 34

Slide 34 text

DEqual HIPPO 34 Model 1 (6429 genes) Model 2 (63 genes) Model 3 (48 genes) Model 1. Naïve model E / = 10 + 1145 Model 2. Added RNA-quality and demographic covariates E / = 10 + 1145 + 12FGH + 13JH5+ 14LMNOPFNH+ 15RPSTRFte + 16GHVHPFNH + 17PXS + ∑Z[\ ] γMJV_`aM Model 3. Added qSVs E / = 10 + 1145 + 12FGH + 13JH5+ 14LMNOPFNH+ 15RPSTRFte + 16GHVHPFNH + 17PXS + ∑Z[\ ] γMJV_`aM + ∑Z[\ d eMfghM DEqual plots demonstrate effectiveness of statistical correction HIPPO 333 samples r = 0.412 r = 0.0712 r = -0.00173 Slide adapted from Amy Peterson Log2 FC Dx Log2 FC Dx, adj. cov Log2 FC Dx, adj qSVs Log2 FC Degradation Log2 FC Degradation Log2 FC Degradation

Slide 35

Slide 35 text

−6 −4 −2 2 4 6 −6 −4 −2 0 2 4 6 0 t−statistic DLPFC t−statistic BSP1 r = 0.809 DATA ANALYSIS 35 Case-control: by region • 48 DE genes at FDR <5% in hippocampus, 243 in DLPFC (FDR <5%) • DLPFC results agree with BrainSeq Phase I Amy Peterson −6 −4 −2 2 4 −4 −2 0 2 4 0 t−statistic HIPPO t−statistic DLPFC r = 0.644 Control > SCZD SCZD > Control DLPFC FDR<10%, HIPPO FDR<20%

Slide 36

Slide 36 text

axon development axonogenesis regulation of axonogenesis regulation of neuron projection development neutrophil chemotaxis regulation of B cell proliferation regulation of B cell activation positive regulation of B cell activation B cell receptor signaling pathway B cell activation lymphocyte activation regulation of leukocyte cell−cell adhesion regulation of cell−cell adhesion positive regulation of alpha−beta T cell activation regulation of leukocyte activation positive regulation of lymphocyte activation positive regulation of cell activation positive regulation of leukocyte activation leukocyte migration positive regulation of leukocyte cell−cell adhesion positive regulation of T cell activation leukocyte chemotaxis response to organophosphorus ATF6−mediated unfolded protein response positive regulation of cell adhesion cell chemotaxis positive regulation of cell−cell adhesion protein folding protein folding in endoplasmic reticulum HgC (137) HeC (306) DgC (297) DeS (250) GeneRatio 0.025 0.050 0.075 0.100 0.01 0.02 0.03 0.04 p.adjust ontology: BP DATA ANALYSIS 36 • BP enrichment in Control > SCZD at gene level • Immune processes g: gene, e: exon, j: exon-exon junction D: DLPFC, H: HIPPO, C: control, S: SCZD Case-control: by region

Slide 37

Slide 37 text

Gene Individual 1 Individual 2 1 10 15 2 5 22 … DLPFC Gene Individual 1 Individual 2 1 10 17 2 6 16 … HIPPO correlation( (10, 5, …), (10, 6, …) ) =~ 0.9 correlation( (15, 22, …), (17, 16, …) ) =~ 0.7 Individual 1 Individual 2 Control vs SCZD

Slide 38

Slide 38 text

A B C D Control SCZD 0.76 0.77 0.78 0.79 cleaned expr (keeping Dx) − geneRpkm p−value: 0.0164 SCZD diagnosis Correlation Control SCZD 0.71 0.72 0.73 0.74 0.75 0.76 cleaned expr (keeping Dx) − exonRpkm p−value: 0.0499 SCZD diagnosis Correlation Control SCZD 0.54 0.55 0.56 0.57 cleaned expr (keeping Dx) − jxnRp10m p−value: 1.72e−05 Correlation Control SCZD 0.47 0.48 0.49 0.50 0.51 0.52 cleaned expr (keeping Dx) − txTpm p−value: 0.00992 Correlation Correlation Correlation Correlation Correlation Gene: 0.0164 Exon: 0.0499 Jxn: 1.72 * 10-5 Tx: 0.00992

Slide 39

Slide 39 text

DATA ANALYSIS 39 HIPPO eQTLs • 11,237,357 eQTL associations (FDR <1%) across genes, exons and junctions corresponding to 17,719 genes Emily E. Burke 2061 3183 2163 274 5945 2915 1178 gene exon jxn eQTLs grouped by gene id Includes 60 risk SNPs from PGC2 8 +,332 E 0 1 2 −1 0 1 2 3 4 chr2:5822204358229ï-[Q rs74563533:58250433:G:A r2:58 250 433 Residualized Expression ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 1 2 −1 0 1 2 3 4 chr2:5822204358229ï-[Q rs75575209:58138192:A:T r2:58 138 192 Residualized Expression ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● p=5.75e−34 p=3.82e−81 F

Slide 40

Slide 40 text

DATA ANALYSIS 40 Region dependent eQTLs • 205,618 region-dependent eQTLs (FDR <1%) corresponding to 1,484 genes • Includes 3 PCG2 schizophrenia risk loci Emily E. Burke 19394 10734 6493 2622 3937 4398 34259 gene exon jxn eQTLs grouped by SNP id 488 246 319 18 99 63 251 gene exon jxn eQTLs grouped by gene id

Slide 41

Slide 41 text

A B 0.D 1.D 2.D 0.H 1.H 2.H 4.0 4.4 4.8 5.2 LRP1(16( (Exon) rs324015 Residualized Expression p=3.91e−05 0.D 1.D 2.D 0.H 1.H 2.H −1 0 1 2 3 2YHUODSVJHQH15*1 FKUï-[Q rs12293670:124612932:A:G Residualized Expression p=5.27e−26 C D 0.D 1.D 2.D 0.H 1.H 2.H −1 0 1 2 3 4 2YHUODSVJHQH1*() FKUïï-[Q rs4144797:233562197:T:C Residualized Expression p=3.5e−14 0.D 1.D 2.D 0.H 1.H 2.H 6.8 7.2 7.6 8.0 LRP1 FKUï-[Q rs324015 Residualized Expression p=2.02e−06 D: DLPFC, H: HIPPO DATA ANALYSIS Region dependent eQTLs These 3 SNPs are also meQTLs (Jaffe et al, 2016)

Slide 42

Slide 42 text

Overall eQTLs at FDR < 1% - rAggr eQTL analysis with PCG2 risk SNPs Hippocampus DLPFC # unique SNPs (unique index SNPs) 5510 (103) 6780 (116) # unique features 1731 2525 # Unique genes 123 171 # Unique transcripts 244 332 # Unique exons 857 1363 # Unique junctions 507 659 Emily E. Burke 21 8 95 DLPFC HIPPO In BSP2: 163/179 (91.1%) FDR <1%: 124/163 (76.6%)

Slide 43

Slide 43 text

• We will soon update our eQTL browser at http://eqtl.brainseq.org/ Bill Ulrich

Slide 44

Slide 44 text

TWAS Gusev et al, Nature Genetics, 2016

Slide 45

Slide 45 text

TWAS SNP ID chromosome position rs101 12 1 rs102 13 10000 … SNP ID chromosome position rs101:A:T 12 1 rs102 13 10005 … SNP ID chromosome position rs101 12 1 rs102 13 10000 … LD reference LIBD BSP2 SNPs GWAS summary SNPs https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/README.md 1,190,321 82,393,211 6,500,475 CLOZUK+PGC2

Slide 46

Slide 46 text

TWAS SNP ID chromosome position rs101 12 1 rs102 13 10000 … SNP ID chromosome position rs101:A:T 12 1 rs102 13 10005 … SNP ID chromosome position rs101 12 1 rs102 13 10000 … LD reference LIBD BSP2 SNPs GWAS summary SNPs https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/README.md 1,190,321 82,393,211 6,500,475 CLOZUK+PGC2

Slide 47

Slide 47 text

TWAS SNP ID chromosome position rs101:A:T 12 1 rs102 13 10005 … SNP ID chromosome position rs101:A:T 12 1 rs102 13 10005 … SNP ID chromosome position rs101:A:T 12 1 rs102 13 10005 … LD reference LIBD BSP2 SNPs GWAS summary SNPs https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/README.md 1,022,527 1,022,546 4,316,013 CLOZUK+PGC2

Slide 48

Slide 48 text

TWAS DLPFC HIPPO gene 7,683 (32.8%) 5,866 (25.1%) exon 61,678 (16.2%) 47,583 (12.5%) jxn 33,754 (12.5%) 27,088 (10.1%) tx 14,118 (15.9%) 11,567 (13%) Features with TWAS weights Removed sex, snp PCs 1-5, expr PCs Kept diagnosis jaffelab::cleaningY()

Slide 49

Slide 49 text

TWAS Among loci considered? DLPFC HIPPO Yes 99 (85.3%) 87 (84.5 %) No 17 16 116 103 PGC2 loci with eQTLs (FDR<1%): considered in TWAS? ? DLPFC HIPPO Yes 48 (48.5%) 45 (51.7%) No 51 42 99 87 Locus has a TWAS p-value among all features? ? DLPFC HIPPO Yes 26 (54.2%) 25 (55.5%) No 22 20 48 45 TWAS p-value <1x10-8?

Slide 50

Slide 50 text

TWAS ENSG00000188511.12 ENSG00000280224.1 ENSG00000279597.1 ENSG00000264139.1 ENSG00000212939.2 ENSG00000213279.2 ENSG00000235111.1 ENSG00000226954.1 ENSG00000223142.1 ENSG00000214727.2 ENSG00000100425.18 ENSG00000279345.1 ENSG00000236867.1 ENSG00000260613.1 ENSG00000229409.1 ENSG00000277771.1 ENSG00000279494.1 ENSG00000100426.6 ENSG00000182858.13 ENSG00000273192.1 ENSG00000184164.14 ENSG00000278869.1 ENSG00000198355.4 ENSG00000276753.1 ENSG00000188263.10 ENSG00000138892.11 ENSG00000100427.15 ENSG00000073146.15 ENSG00000280395.1 ENSG00000073150.13 ENSG00000170638.9 ENSG00000273188.1 ENSG00000273253.2 ENSG00000073169.13 ENSG00000272836.1 ENSG00000273137.1 ENSG00000128159.11 ENSG00000100429.17 ENSG00000188130.13 ENSG00000185386.14 ENSG00000196576.14 ENSG00000205593.11 ENSG00000227484.1 ENSG00000279182.1 ENSG00000100239.15 ENSG00000100241.20 ENSG00000128165.8 ENSG00000100253.12 ENSG00000100258.17 ENSG00000025770.18 ENSG00000130489.13 ENSG00000272821.1 ENSG00000025708.13 ENSG00000177989.13 ENSG00000272666.1 ENSG00000273272.1 ENSG00000226738.1 ENSG00000130487.5 ENSG00000217442.3 ENSG00000205560.12 ENSG00000254413.8 ENSG00000100288.19 ENSG00000205559.3 ENSG00000272940.1 ENSG00000008735.13 ENSG00000100299.17 ENSG00000212569.1 ENSG00000251322.7 ENSG00000206841.1 ENSG00000225929.1 ENSG00000100312.10 ENSG00000254499.1 ENSG00000213683.4 ENSG00000184319.15 ENSG00000079974.17 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 49.4 49.6 49.8 50.0 50.2 50.4 50.6 0 1 2 3 4 5 chr 22 physical position (MB) −log10(P−value) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● Can also identify conditionally independent features GWAS: grey TWAS: blue CRELD2

Slide 51

Slide 51 text

TWAS ENSG00000188511.12 ENSG00000280224.1 ENSG00000279597.1 ENSG00000264139.1 ENSG00000212939.2 ENSG00000213279.2 ENSG00000235111.1 ENSG00000226954.1 ENSG00000223142.1 ENSG00000214727.2 ENSG00000100425.18 ENSG00000279345.1 ENSG00000236867.1 ENSG00000260613.1 ENSG00000229409.1 ENSG00000277771.1 ENSG00000279494.1 ENSG00000100426.6 ENSG00000182858.13 ENSG00000273192.1 ENSG00000184164.14 ENSG00000278869.1 ENSG00000198355.4 ENSG00000276753.1 ENSG00000188263.10 ENSG00000138892.11 ENSG00000100427.15 ENSG00000073146.15 ENSG00000280395.1 ENSG00000073150.13 ENSG00000170638.9 ENSG00000273188.1 ENSG00000273253.2 ENSG00000073169.13 ENSG00000272836.1 ENSG00000273137.1 ENSG00000128159.11 ENSG00000100429.17 ENSG00000188130.13 ENSG00000185386.14 ENSG00000196576.14 ENSG00000205593.11 ENSG00000227484.1 ENSG00000279182.1 ENSG00000100239.15 ENSG00000100241.20 ENSG00000128165.8 ENSG00000100253.12 ENSG00000100258.17 ENSG00000025770.18 ENSG00000130489.13 ENSG00000272821.1 ENSG00000025708.13 ENSG00000177989.13 ENSG00000272666.1 ENSG00000273272.1 ENSG00000226738.1 ENSG00000130487.5 ENSG00000217442.3 ENSG00000205560.12 ENSG00000254413.8 ENSG00000100288.19 ENSG00000205559.3 ENSG00000272940.1 ENSG00000008735.13 ENSG00000100299.17 ENSG00000212569.1 ENSG00000251322.7 ENSG00000206841.1 ENSG00000225929.1 ENSG00000100312.10 ENSG00000254499.1 ENSG00000213683.4 ENSG00000184319.15 ENSG00000079974.17 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 49.4 49.6 49.8 50.0 50.2 50.4 50.6 0 1 2 3 4 5 −log10(P−value) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 49.4 49.6 49.8 50.0 50.2 50.4 50.6 0 2 4 6 8 chr 22 physical position (MB) −log10(P−value) GWAS signal eQTL signal CRELD2

Slide 52

Slide 52 text

TWAS Feature DLFPC HIPPO TWAS conditionally independent? Gene 29 (43.3%) 20 (40 %) Yes 38 30 No Exon 42 (48.3%) 31 (40.3) Yes 45 46 No Jxn 50 (51%) 40 (40.8%) Yes 48 48 No Tx 35 (46.1%) 26 (40%) Yes 41 39 No

Slide 53

Slide 53 text

Authors • Leonardo Collado-Torres o leo.collado@libd.org o @fellgernon • Emily E. Burke • Amy Peterson • Joo Heon Shin • Richard E. Straub • Anandita Rajpurohit • Stephen A. Semick • William S. Ulrich • Cristian Valencia • Ran Tao • Amy Deep-Soboslay • Thomas M. Hyde • Joel E. Kleinman • Daniel R. Weinberger+ • Andrew E. Jaffe+ o andrew.jaffe@libd.org o @andrewejaffe 53 • BrainSeq Consortium • LIBD @lieberinstitute Funding github.com/LieberInstitute/brainseq_phase2

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

https://jhubiostatistics.shinyapps.io/recount/

Slide 56

Slide 56 text

http://rail.bio/ Slide adapted from Ben Langmead by Abhinav Nellore

Slide 57

Slide 57 text

http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

Slide 58

Slide 58 text

GTEx TCGA slide adapted from Shannon Ellis

Slide 59

Slide 59 text

SRA

Slide 60

Slide 60 text

jx 1 jx 2 jx 3 jx 4 jx 5 jx 6 Coverage Reads Gene Isoform 1 Isoform 2 Potential isoform 3 exon 1 exon 2 exon 3 exon 4 Expressed region 1: potential exon 5 doi.org/10.12688/f1000research.12223.1

Slide 61

Slide 61 text

doi.org/10.12688/f1000research.12223.1

Slide 62

Slide 62 text

> library('recount') > download_study( 'ERP001942', type='rse-gene') > load(file.path('ERP001942 ', 'rse_gene.Rdata')) > rse <- scale_counts(rse_gene) https://github.com/leekgroup/recount-analyses/

Slide 63

Slide 63 text

slide adapted from Jeff Leek

Slide 64

Slide 64 text

related projects • Bioconductor recountWorkflow: doi.org/10.12688/f1000research.12223.1 • Shannon Ellis & Leek: phenotype prediction doi.org/10.1093/nar/gky102 • Jack Fu & Taub: transcript estimations doi.org/10.1101/247346 • Madugundu & Pandey (JHU): proteomics • Luidi-Imada & Marchionni (JHU): cancer and FANTOM • Kuri-Magaña & Martínez-Barnetche (INSP Mexico): immune expression doi.org/10.3389/fimmu.2018.02679 • Ryten (UCL): Guelfi: validating expressed regions (ERs) eQTLs Zhang: improving the detection of ERs doi.org/10.1101/499103

Slide 65

Slide 65 text

expression data for ~70,000 human samples samples phenotypes ? GTEx N=9,962 TCGA N=11,284 SRA N=49,848 samples expression estimates gene exon junctions ERs Answer meaningful questions about human biology and expression slide adapted from Shannon Ellis

Slide 66

Slide 66 text

Category Frequency F 95 female 2036 Female 51 M 77 male 1240 Male 141 Total 3640 Even when information is provided, it’s not always clear… sra_meta$Sex “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female” “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown” slide adapted from Shannon Ellis

Slide 67

Slide 67 text

Goal : to accurately predict critical phenotype information for all samples in recount gene, exon, exon-exon junction and expressed region RNA-Seq data SRA Sequence Read Archive N=49,848 TCGA The Cancer Genome Atlas N=11,284 GTEx Genotype Tissue Expression Project N=9,662 slide adapted from Shannon Ellis

Slide 68

Slide 68 text

Goal : to accurately predict critical phenotype information for all samples in recount gene, exon, exon-exon junction and expressed region RNA-Seq data SRA Sequence Read Archive N=49,848 GTEx Genotype Tissue Expression Project N=9,662 divide samples build and optimize phenotype predictor training set test accuracy of predictor test set TCGA The Cancer Genome Atlas N=11,284 slide adapted from Shannon Ellis

Slide 69

Slide 69 text

Goal : to accurately predict critical phenotype information for all samples in recount gene, exon, exon-exon junction and expressed region RNA-Seq data SRA Sequence Read Archive N=49,848 GTEx Genotype Tissue Expression Project N=9,662 divide samples build and optimize phenotype predictor training set test accuracy of predictor predict phenotypes across samples in TCGA test set TCGA The Cancer Genome Atlas N=11,284 slide adapted from Shannon Ellis

Slide 70

Slide 70 text

Goal : to accurately predict critical phenotype information for all samples in recount gene, exon, exon-exon junction and expressed region RNA-Seq data SRA Sequence Read Archive N=49,848 GTEx Genotype Tissue Expression Project N=9,662 divide samples build and optimize phenotype predictor training set predict phenotypes across SRA samples test accuracy of predictor predict phenotypes across samples in TCGA test set TCGA The Cancer Genome Atlas N=11,284 slide adapted from Shannon Ellis

Slide 71

Slide 71 text

Sex prediction is accurate across data sets Number of Regions 20 20 20 20 Number of Samples (N) 4,769 4,769 11,245 3,640 99.8% 99.6% 99.4% 88.5% slide adapted from Shannon Ellis

Slide 72

Slide 72 text

Number of Regions 589 589 589 589 589 Number of Samples (N) 4,769 4,769 613 6,579 8,951 97.3% 96.5% 91.0% 70.2% Prediction is more accurate in healthy tissue 50.6% slide adapted from Shannon Ellis

Slide 73

Slide 73 text

> library('recount') > download_study( 'ERP001942', type='rse-gene') > load(file.path('ERP001942 ', 'rse_gene.Rdata')) > rse <- scale_counts(rse_gene) > rse_with_pred <- add_predictions(rse_gene) https://github.com/leekgroup/recount-analyses/

Slide 74

Slide 74 text

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● adipose tissue adrenal gland bladder blood blood vessel bone bone marrow brain breast cervix cervix uteri colon epithelium esophagus fallopian tube heart intestine kidney liver lung melanoma monocytes muscle nerve ovary pancreas penis pituitary placenta prostate salivary gland skin small intestine spinal cord spleen stem cell stomach testis thyroid tonsil umbilical cord urinary bladder uterus vagina 0 3000 6000 9000 12000 0 1000 2000 3000 reported predicted

Slide 75

Slide 75 text

• 62 SRA studies • 4,431 rows by 48 columns Ashkaun Razmara, in prep.

Slide 76

Slide 76 text

Sex Female Male Age/Development Fetus Child Adolescent Adult Race/Ethnicity Asian Black Hispanic White Tissue Site 1 Cerebral cortex Hippocampus Brainstem Cerebellum Tissue Site 2 Frontal lobe Temporal lobe Midbrain Basal ganglia Tissue Site 3 Dorsolateral prefrontal cortex Superior temporal gyrus Substantia nigra Caudate Hemisphere Left Right Brodmann Area 1-52 Disease Status Disease Neurological control Disease Brain tumor Alzheimer’s disease Parkinson’s disease Bipolar disorder Tumor Type Glioblastoma Astrocytoma Oligodendroglioma Ependymoma Clinical Stage 1 Grade I Grade II Grade III Grade IV Clinical Stage 2 Primary Secondary Recurrent Viability Postmortem Biopsy Preparation Frozen Thawed Ashkaun Razmara, in prep.

Slide 77

Slide 77 text

https://github.com/LieberInstitute/recount-brain/tree/master/metadata_reproducibility • Overall curation steps: starts by downloading SRA Run Table info, then info from the publications • Details for each SRA study Reproducibility document Ashkaun Razmara, in prep.

Slide 78

Slide 78 text

Ashkaun Razmara, in prep.

Slide 79

Slide 79 text

Replicates part of the GTEx PMI paper by Ferreira et al. doi.org/10.1038/s41467-017- 02772-x

Slide 80

Slide 80 text

Code Example: research.libd.org/recount-brain/example_PMI/example_PMI.html research.libd.org/recount-brain/example_PMI/example_PMI.Rmd Replicates part of the GTEx PMI paper by Ferreira et al. doi.org/10.1038/s41467-017-02772-x

Slide 81

Slide 81 text

* Jeff Leek presented Shannon Ellis’ prediction work in Toronto (around April 2018) https://docs.google.com/presentation/d/1FgUZZU6pW91J7zH0OqrEgxfnV1Py_ZGL3ZKHfbOZskY/edit#slide=id.g2f831fd4ae_0_306 * Dustin J. Sokolowski from Michael D. Wilson’s lab is using recount2 * Dustin joins the project and merges recount-brain with GTEx and TCGA recount_brain_v2 The SRA samples in recount-brain are complemented by 1,409 GTEx (GTEx Consortium 2015) and 707 TCGA (Brennan et al. 2013; Cancer Genome Atlas Research Network et al. 2015) samples covering 13 healthy regions of the brain and 2 tumor types, respectively. In total, there are 6,547 samples with metadata in recount-brain with 5,330 (81.4%) present in recount2

Slide 82

Slide 82 text

* Discussed with Sean Davis (NIH) at Biological Data Science #biodata18 * Potential for contributing recount-brain to SRAdbV2 github.com/seandavi/SRAdbV2 * Have mapped variables to ontologies What about maintaining/growing it?

Slide 83

Slide 83 text

The recount-brain team Hopkins Ashkaun Razmara Shannon E. Ellis Jeff T. Leek University of Toronto Dustin J. Sokolowski Michael D. Wilson NIH Sean Davis LIBD Andrew E. Jaffe Funding NIH R01 GM105705 NIH 1R21MH109956 NIH R01 GM121459 CIHR, NSERC Ontario Ministry of Research IDIES SciServer Hosting recount2 github.com/LieberInstitute/recount-brain