Upgrade to Pro — share decks privately, control downloads, hide ads and more …

libd2019

 libd2019

Leonardo Collado-Torres

February 25, 2019
Tweet

More Decks by Leonardo Collado-Torres

Other Decks in Science

Transcript

  1. 11 Analyzing BrainSeq Phase II and generating the recount-brain resource

    Leonardo Collado-Torres @fellgernon Data Science I with Andrew Jaffe
  2. Questions I’ll answer today • Who am I? • What

    is the BrainSeq Phase II project? • What is the recount-brain resource?
  3. What defines me • Bioinformatics • R and Bioconductor •

    Reproducibility and best practices • Outreach and community building • Back in 2005: I like math and coding; biology provides the challenging problems
  4. History 2005-2009 Undergrad in Genomic Sciences 2009-2011 2011-2016 August 2016+

    Data Science Division Leader ! ! PIs: • Jeff Leek: 2012+ • Andrew Jaffe: 2013+ Ph.D. Biostatistics Staff Scientist Data Science Team I
  5. 2008+ • BioC 2008-2011, 2014, 2017 • useR!2013 • rOpenSci

    unconf 2018 • RStudio::conf 2019 @fellgernon 2010+ Interests ! @LIBDrstats 2018+ @CDSBMexico 2018+ Defunct: BmoreBiostats, Biostats Cultural Mixers Guest @RLadiesBmore Blog: http://lcolladotor.github.io 2011+ FB: 75k, Tw: 66k weekly
  6. Software tools I use every day Mercurial/git user 2009+ https://github.com/lcolladotor

    From: Feb 21, 2019 Communication tool for work and communities
  7. My path to LIBD 2005-2009 Undergrad in Genomic Sciences 2009-2011

    2011-2016 August 2016+ Data Science Division Leader ! ! PIs: • Jeff Leek: 2012+ • Andrew Jaffe: 2013+ Ph.D. Biostatistics Staff Scientist Data Science Team I Microarrays, little RNA-seq N = 12 (bacteria) N = 59 N = 72 potential for 1,000 BSP2: N = 900 + many more
  8. Ph.D. overview The goal was to develop statistical methods and

    software [...] in RNA-seq […]. We applied these methods to further our understanding of neuropsychiatric disorders using the Lieber Institute for Brain Development human brains collection (> 1000 samples).
  9. Lieber Institute for Brain Development • Role: Staff Scientist •

    PI: Andrew Jaffe Team: • Staff Scientist: Emily Burke • Research Associate: Madhavi Tippani • Postdocs: Amanda Price • Grad students: Matt Nguyen, Brianna Barry, Kira Perzel Mandell • Research Assistant: Nick Eagles, Stephen Semick* • Close collaborators: Carrie Wright, Nina Rajpurohit • Role details: like a postdoc (major role in some projects) with some support projects * Now at University of Maryland in med school Lab at Amy Peterson’s MPH capstone presentation
  10. Lieber Institute for Brain Development • miRNA kit prep comparison

    biorxiv.org/content/early/2018/10/30/445437 • DNAm on WGBS across development & cell types biorxiv.org/content/early/2018/09/29/428391 • RNA-seq DE in Schizophrenia disorder & 2 brain regions biorxiv.org/content/early/2018/09/26/426213 • RNA-seq from stem cells biorxiv.org/content/early/2018/07/31/380758 Peer reviewed: • DNAm and gene DE in Alzheimer’s disease doi.org/10.1007/s00401-019-01966-5 • RNA-seq smoking during pregnancy doi.org/10.1038/s41380-018-0223-1 • RNA-seq DE in Schizophrenia disorder on DLPFC doi.org/10.1038/s41593-018-0197-y • Histamine signaling in autism spectrum disorder doi.org/10.1038/tp.2017.87 Pre-prints:
  11. BrainSeq: A Human Brain Genomics Consortium DLPFC 495 samples BrainSeq

    Phase I polyA+ Jaffe et al., Nature Neuroscience, 2018 DLPFC 453 samples HIPPO 447 samples BrainSeq Phase II RiboZero THE SCIENTIFIC FRONTIER Neuron, 2015 Collado-Torres et al, bioRxiv, 2018 Manuscript status: under revision at Neuron
  12. DATA 16 BrainSeq Phase II RNA-seq samples DLPFC HIPPO total

    adult (age >= 18) 374 370 744 prenatal 29 28 57 0 <= age < 18 50 49 99 total 453 447 900 • Non-psychiatric control and schizophrenia affected individuals • Two brain regions: dorsolateral prefrontal cortex and hippocampus All samples
  13. DATA 17 BrainSeq Phase II RNA-seq samples: by case status

    DLPFC HIPPO total adult 222 238 460 prenatal 29 28 57 0 <= age < 18 49 48 97 total 300 314 614 DLPFC HIPPO total adult 152 132 284 prenatal 0 0 0 0 <= age < 18 1 1 2 total 153 133 286 Control Schizophrenia cases
  14. DATA ANALYSIS 18 Focus on being conservative 1.Use well established

    processing methods 2.Apply strict expression cutoffs 3.Use replication when possible 4.Adjust for RNA quality degradation confounding • Using the qSVA method 5.Avoid potential batch effects • Drop problematic samples 6.Take into account correlation at the individual level
  15. RNA-SEQ APPROACH 19 BrainSeq Phase II Pre-natal Adult ACTG Birth

    Unaffected Controls Patients with Schizophrenia RNA Sequencing Genotyping + + + Gene Exons Expressed Regions Transcripts Junctions Age CC CA AA SZ CONT DLPFC HIPPO + region differences For public data, check recount2
  16. DATA ANALYSIS 20 Main processing steps 1. Quality check (QC)

    on raw reads (FastQC) 2. Failed QC? Then trim reads (Trimmomatic) 3. Align reads to the genome (HISAT2) 4. Count features (featureCounts + others) 5. Calculate coverage (bam2wig) 6. Quantify transcripts (Salmon) 7. Create count tables (R) 8. Genotype samples (samtools + vcftools) Emily E. Burke Nextflow version in preparation with Winter Genomics and Nick Eagles
  17. DATA ANALYSIS 21 Filter features with low expression • •

    • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • 0.0 0.2 0.4 0.6 0.8 1.0 20000 30000 40000 50000 mean expression cutoff Number of features > cutoff Feature Cutoff Unit Gene 0.25 RPKM Exon 0.30 RPKM Jxn 0.46 RP10M Tx 0.38 TPM Gene • Used mean across all 900 samples • All analyses with filtered data Mean RPKM # genes > cutoff Based on the segmented R package: break-point estimation jaffelab::expression_cutoff()
  18. DATA ANALYSIS 22 Differential expression models • Region-specific for adult

    or fetal ages • Using only adult samples or only prenatal samples • Test for differences between DLFPC and HIPPO • Development • Linear age splines with breakpoints at developmental stages • Test for interaction between age and brain region at these splines • Case-control • By brain region • Test for differences between non-psychiatric controls and individuals with schizophrenia • For the first two models, we account for the fact that an individual can have two correlated samples: one for each brain region limma::duplicateCorrelation()
  19. DATA ANALYSIS 23 Differential expression models • Region-specific for adult

    or prenatal ages • Alternative: !"#$ = &' + )*+ + ,+" + ∑./0 1 23#45. + 6789:)8+ + 898);<227*3+=>+3+ + :?@ + :+*793 • Development • Alternative: !"#$ = &' + )*+ ∗ :+*793 + B+8); ∗ :+*793 + C7$8ℎ ∗ :+*793 + 73B)38 ∗ :+*793 + Eℎ7;= ∗ :+*793 + 8++3 ∗ :+*793 + )=F;8 ∗ :+*793 + ,+" + ∑./0 1 23#45. + 6789:)8+ + 898);<227*3+=>+3+ + :?@ + :+*793 • Case-control • Alternative: !"#$ = &' + )*+ + ,+" + 6789:)8+ + $:@<GHIJ + ∑./0 1 23#45. + 898);<227*3+=>+3+ + :?@ + $+*793,#+E7B7EK,L2 + M7)*39272
  20. DATA ANALYSIS 24 Using BrainSpan for replication: region-specific model •

    P-value < 0.05 in BrainSpan, consistent direction Similar results for the development model prenatal adult gene exon jxn p<0.05 p<0.01 p<0.001 p<1e−04 p<1e−05 p<1e−06 p<0.05 p<0.01 p<0.001 p<1e−04 p<1e−05 p<1e−06 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 p−value threshold Replication rate
  21. DATA ANALYSIS 25 Region-specific by age results • For adults:

    1,612 DE genes with Bonferroni < 0.01 that replicate in BrainSpan • For prenatal samples: 32 DE genes 23 6 2 0 3 0 3 gene exon jxn DE features grouped by gene id (prenatal) 328 422 778 23 839 647 388 gene exon jxn DE features grouped by gene id (adult)
  22. DATA ANALYSIS 26 Region-specific by age results: adult enriched biological

    processes by region G: gene, E: exon, J: exon-exon junction D: DLPFC, H: HIPPO cell−matrix adhesion extracellular matrix organization ameboidal−type cell migration microtubule bundle formation axoneme assembly cilium movement establishment of synaptic vesicle localization synaptic vesicle transport synaptic vesicle localization signal release from synapse neurotransmitter secretion presynaptic process involved in chemical synaptic transmission positive regulation of GTPase activity regulation of GTPase activity synaptic vesicle exocytosis neurotransmitter transport regulation of calcium ion−dependent exocytosis synaptic vesicle cycle regulation of synapse organization positive regulation of synaptic transmission regulation of neuron projection development regulation of small GTPase mediated signal transduction extracellular structure organization positive regulation of nervous system development positive regulation of neurogenesis positive regulation of cell development axon development axonogenesis regulation of hormone levels cardiac chamber development cardiac septum development kidney epithelium development nephron development urogenital system development cardiac chamber morphogenesis cardiac septum morphogenesis renal system development kidney development regulation of transmembrane transport actomyosin structure organization regulation of trans−synaptic signaling modulation of chemical synaptic transmission synapse organization regulation of cell morphogenesis synaptic transmission, glutamatergic regulation of synaptic transmission, glutamatergic potassium ion transmembrane transport cellular potassium ion transport potassium ion transport G.D (609) G.H (573) E.D (1101) E.H (1087) J.D (850) J.H (823) GeneRatio 0.02 0.03 0.04 0.05 0.06 0.01 0.02 0.03 0.04 p.adjust ontology: BP
  23. • • • • • • • • • •

    • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • Pluripotency Glial Fetal Infant C hild Teens Adult 50+ Fetal Infant C hild Teens Adult 50+ 0.4 0.6 0.8 0.0 0.1 0.2 0.3 Age Group Cell Type Proportion Region DLPFC Hippo DATA ANALYSIS 27 Development model: similar composition in prenatal across regions by DNAm Stephen A. Semick
  24. DATA ANALYSIS 28 Development model results • 5,982 (~55%) genes

    contain differentially expressed exons and splice junctions that replicated in BrainSpan (Bonferroni < 1%) 2354 2260 1762 243 5982 8501 1558 gene exon jxn DE features grouped by gene id
  25. DATA ANALYSIS 29 Development model results • Normalized expression over

    age for GABRD −2 0 2 4 6 log2(CPM + 0.5) 14 16 18 20 22 PCW 0.0 0.2 0.4 2 4 6 8 12 14 16 18 2020 30 40 50 50 55 60 65 70 75 80 85 ENSG00000187730.7 GABRD p−bonf 0 Age DLPFC HIPPO
  26. DATA ANALYSIS 30 Development model results • Normalized expression over

    age after removing effect of terms from the null model jaffelab::cleaningY() and jaffelab::agePlotter() −2 −1 0 1 2 3 log2(CPM + 0.5) − covariate effects removed 14 16 18 20 22 PCW 0.0 0.2 0.4 2 4 6 8 12 14 16 18 2020 30 40 50 50 55 60 65 70 75 80 85 ENSG00000187730.7 GABRD p−bonf 0 Age DLPFC HIPPO
  27. qSVA WORKFLOW 31 Slide adapted from Amy Peterson Jaffe et

    al., PNAS, 2017 Model 1 (6429 genes) Log2 FC Dx Log2 FC Degradation http://research.libd.org/rstatsclub/2018/12/11/quality-surrogate-variable-analysis/
  28. DEqual HIPPO 32 Model 1 (6429 genes) Model 1. Naïve

    model E / = 10 + 11 45 DEqual plots demonstrate effectiveness of statistical correction r = 0.412 Slide adapted from Amy Peterson Log2 FC Dx Log2 FC Degradation HIPPO 333 samples
  29. DEqual HIPPO 33 Model 1 (6429 genes) Model 2 (63

    genes) Model 1. Naïve model E / = 10 + 1145 Model 2. Added RNA-quality and demographic covariates E / = 10 + 1145 + 12FGH + 13JH5+ 14LMNOPFNH+ 15RPSTRFte + 16GHVHPFNH + 17PXS + ∑Z[\ ] γMJV_`aM DEqual plots demonstrate effectiveness of statistical correction r = 0.412 r = 0.0712 Slide adapted from Amy Peterson Log2 FC Dx Log2 FC Dx, adj. cov Log2 FC Degradation Log2 FC Degradation HIPPO 333 samples
  30. DEqual HIPPO 34 Model 1 (6429 genes) Model 2 (63

    genes) Model 3 (48 genes) Model 1. Naïve model E / = 10 + 1145 Model 2. Added RNA-quality and demographic covariates E / = 10 + 1145 + 12FGH + 13JH5+ 14LMNOPFNH+ 15RPSTRFte + 16GHVHPFNH + 17PXS + ∑Z[\ ] γMJV_`aM Model 3. Added qSVs E / = 10 + 1145 + 12FGH + 13JH5+ 14LMNOPFNH+ 15RPSTRFte + 16GHVHPFNH + 17PXS + ∑Z[\ ] γMJV_`aM + ∑Z[\ d eMfghM DEqual plots demonstrate effectiveness of statistical correction HIPPO 333 samples r = 0.412 r = 0.0712 r = -0.00173 Slide adapted from Amy Peterson Log2 FC Dx Log2 FC Dx, adj. cov Log2 FC Dx, adj qSVs Log2 FC Degradation Log2 FC Degradation Log2 FC Degradation
  31. −6 −4 −2 2 4 6 −6 −4 −2 0

    2 4 6 0 t−statistic DLPFC t−statistic BSP1 r = 0.809 DATA ANALYSIS 35 Case-control: by region • 48 DE genes at FDR <5% in hippocampus, 243 in DLPFC (FDR <5%) • DLPFC results agree with BrainSeq Phase I Amy Peterson −6 −4 −2 2 4 −4 −2 0 2 4 0 t−statistic HIPPO t−statistic DLPFC r = 0.644 Control > SCZD SCZD > Control DLPFC FDR<10%, HIPPO FDR<20%
  32. axon development axonogenesis regulation of axonogenesis regulation of neuron projection

    development neutrophil chemotaxis regulation of B cell proliferation regulation of B cell activation positive regulation of B cell activation B cell receptor signaling pathway B cell activation lymphocyte activation regulation of leukocyte cell−cell adhesion regulation of cell−cell adhesion positive regulation of alpha−beta T cell activation regulation of leukocyte activation positive regulation of lymphocyte activation positive regulation of cell activation positive regulation of leukocyte activation leukocyte migration positive regulation of leukocyte cell−cell adhesion positive regulation of T cell activation leukocyte chemotaxis response to organophosphorus ATF6−mediated unfolded protein response positive regulation of cell adhesion cell chemotaxis positive regulation of cell−cell adhesion protein folding protein folding in endoplasmic reticulum HgC (137) HeC (306) DgC (297) DeS (250) GeneRatio 0.025 0.050 0.075 0.100 0.01 0.02 0.03 0.04 p.adjust ontology: BP DATA ANALYSIS 36 • BP enrichment in Control > SCZD at gene level • Immune processes g: gene, e: exon, j: exon-exon junction D: DLPFC, H: HIPPO, C: control, S: SCZD Case-control: by region
  33. Gene Individual 1 Individual 2 1 10 15 2 5

    22 … DLPFC Gene Individual 1 Individual 2 1 10 17 2 6 16 … HIPPO correlation( (10, 5, …), (10, 6, …) ) =~ 0.9 correlation( (15, 22, …), (17, 16, …) ) =~ 0.7 Individual 1 Individual 2 Control vs SCZD
  34. A B C D Control SCZD 0.76 0.77 0.78 0.79

    cleaned expr (keeping Dx) − geneRpkm p−value: 0.0164 SCZD diagnosis Correlation Control SCZD 0.71 0.72 0.73 0.74 0.75 0.76 cleaned expr (keeping Dx) − exonRpkm p−value: 0.0499 SCZD diagnosis Correlation Control SCZD 0.54 0.55 0.56 0.57 cleaned expr (keeping Dx) − jxnRp10m p−value: 1.72e−05 Correlation Control SCZD 0.47 0.48 0.49 0.50 0.51 0.52 cleaned expr (keeping Dx) − txTpm p−value: 0.00992 Correlation Correlation Correlation Correlation Correlation Gene: 0.0164 Exon: 0.0499 Jxn: 1.72 * 10-5 Tx: 0.00992
  35. DATA ANALYSIS 39 HIPPO eQTLs • 11,237,357 eQTL associations (FDR

    <1%) across genes, exons and junctions corresponding to 17,719 genes Emily E. Burke 2061 3183 2163 274 5945 2915 1178 gene exon jxn eQTLs grouped by gene id Includes 60 risk SNPs from PGC2 8 +,332 E 0 1 2 −1 0 1 2 3 4 chr2:5822204358229 ï  -[Q rs74563533:58250433:G:A r2:58 250 433 Residualized Expression • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • 0 1 2 −1 0 1 2 3 4 chr2:5822204358229 ï  -[Q rs75575209:58138192:A:T r2:58 138 192 Residualized Expression • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • p=5.75e−34 p=3.82e−81 F
  36. DATA ANALYSIS 40 Region dependent eQTLs • 205,618 region-dependent eQTLs

    (FDR <1%) corresponding to 1,484 genes • Includes 3 PCG2 schizophrenia risk loci Emily E. Burke 19394 10734 6493 2622 3937 4398 34259 gene exon jxn eQTLs grouped by SNP id 488 246 319 18 99 63 251 gene exon jxn eQTLs grouped by gene id
  37. A B 0.D 1.D 2.D 0.H 1.H 2.H 4.0 4.4

    4.8 5.2 LRP1(16( (Exon) rs324015 Residualized Expression p=3.91e−05 0.D 1.D 2.D 0.H 1.H 2.H −1 0 1 2 3 2YHUODSVJHQH15*1 FKUï   -[Q rs12293670:124612932:A:G Residualized Expression p=5.27e−26 C D 0.D 1.D 2.D 0.H 1.H 2.H −1 0 1 2 3 4 2YHUODSVJHQH1*() FKUï ï  -[Q rs4144797:233562197:T:C Residualized Expression p=3.5e−14 0.D 1.D 2.D 0.H 1.H 2.H 6.8 7.2 7.6 8.0 LRP1 FKUï   -[Q rs324015 Residualized Expression p=2.02e−06 D: DLPFC, H: HIPPO DATA ANALYSIS Region dependent eQTLs These 3 SNPs are also meQTLs (Jaffe et al, 2016)
  38. Overall eQTLs at FDR < 1% - rAggr eQTL analysis

    with PCG2 risk SNPs Hippocampus DLPFC # unique SNPs (unique index SNPs) 5510 (103) 6780 (116) # unique features 1731 2525 # Unique genes 123 171 # Unique transcripts 244 332 # Unique exons 857 1363 # Unique junctions 507 659 Emily E. Burke 21 8 95 DLPFC HIPPO In BSP2: 163/179 (91.1%) FDR <1%: 124/163 (76.6%)
  39. TWAS SNP ID chromosome position rs101 12 1 rs102 13

    10000 … SNP ID chromosome position rs101:A:T 12 1 rs102 13 10005 … SNP ID chromosome position rs101 12 1 rs102 13 10000 … LD reference LIBD BSP2 SNPs GWAS summary SNPs https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/README.md 1,190,321 82,393,211 6,500,475 CLOZUK+PGC2
  40. TWAS SNP ID chromosome position rs101 12 1 rs102 13

    10000 … SNP ID chromosome position rs101:A:T 12 1 rs102 13 10005 … SNP ID chromosome position rs101 12 1 rs102 13 10000 … LD reference LIBD BSP2 SNPs GWAS summary SNPs https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/README.md 1,190,321 82,393,211 6,500,475 CLOZUK+PGC2
  41. TWAS SNP ID chromosome position rs101:A:T 12 1 rs102 13

    10005 … SNP ID chromosome position rs101:A:T 12 1 rs102 13 10005 … SNP ID chromosome position rs101:A:T 12 1 rs102 13 10005 … LD reference LIBD BSP2 SNPs GWAS summary SNPs https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/README.md 1,022,527 1,022,546 4,316,013 CLOZUK+PGC2
  42. TWAS DLPFC HIPPO gene 7,683 (32.8%) 5,866 (25.1%) exon 61,678

    (16.2%) 47,583 (12.5%) jxn 33,754 (12.5%) 27,088 (10.1%) tx 14,118 (15.9%) 11,567 (13%) Features with TWAS weights Removed sex, snp PCs 1-5, expr PCs Kept diagnosis jaffelab::cleaningY()
  43. TWAS Among loci considered? DLPFC HIPPO Yes 99 (85.3%) 87

    (84.5 %) No 17 16 116 103 PGC2 loci with eQTLs (FDR<1%): considered in TWAS? ? DLPFC HIPPO Yes 48 (48.5%) 45 (51.7%) No 51 42 99 87 Locus has a TWAS p-value among all features? ? DLPFC HIPPO Yes 26 (54.2%) 25 (55.5%) No 22 20 48 45 TWAS p-value <1x10-8?
  44. TWAS ENSG00000188511.12 ENSG00000280224.1 ENSG00000279597.1 ENSG00000264139.1 ENSG00000212939.2 ENSG00000213279.2 ENSG00000235111.1 ENSG00000226954.1 ENSG00000223142.1

    ENSG00000214727.2 ENSG00000100425.18 ENSG00000279345.1 ENSG00000236867.1 ENSG00000260613.1 ENSG00000229409.1 ENSG00000277771.1 ENSG00000279494.1 ENSG00000100426.6 ENSG00000182858.13 ENSG00000273192.1 ENSG00000184164.14 ENSG00000278869.1 ENSG00000198355.4 ENSG00000276753.1 ENSG00000188263.10 ENSG00000138892.11 ENSG00000100427.15 ENSG00000073146.15 ENSG00000280395.1 ENSG00000073150.13 ENSG00000170638.9 ENSG00000273188.1 ENSG00000273253.2 ENSG00000073169.13 ENSG00000272836.1 ENSG00000273137.1 ENSG00000128159.11 ENSG00000100429.17 ENSG00000188130.13 ENSG00000185386.14 ENSG00000196576.14 ENSG00000205593.11 ENSG00000227484.1 ENSG00000279182.1 ENSG00000100239.15 ENSG00000100241.20 ENSG00000128165.8 ENSG00000100253.12 ENSG00000100258.17 ENSG00000025770.18 ENSG00000130489.13 ENSG00000272821.1 ENSG00000025708.13 ENSG00000177989.13 ENSG00000272666.1 ENSG00000273272.1 ENSG00000226738.1 ENSG00000130487.5 ENSG00000217442.3 ENSG00000205560.12 ENSG00000254413.8 ENSG00000100288.19 ENSG00000205559.3 ENSG00000272940.1 ENSG00000008735.13 ENSG00000100299.17 ENSG00000212569.1 ENSG00000251322.7 ENSG00000206841.1 ENSG00000225929.1 ENSG00000100312.10 ENSG00000254499.1 ENSG00000213683.4 ENSG00000184319.15 ENSG00000079974.17 • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • •• • • • •• • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • 49.4 49.6 49.8 50.0 50.2 50.4 50.6 0 1 2 3 4 5 chr 22 physical position (MB) −log10(P−value) • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • ••• • • • • • • • • • • • •• • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • Can also identify conditionally independent features GWAS: grey TWAS: blue CRELD2
  45. TWAS ENSG00000188511.12 ENSG00000280224.1 ENSG00000279597.1 ENSG00000264139.1 ENSG00000212939.2 ENSG00000213279.2 ENSG00000235111.1 ENSG00000226954.1 ENSG00000223142.1

    ENSG00000214727.2 ENSG00000100425.18 ENSG00000279345.1 ENSG00000236867.1 ENSG00000260613.1 ENSG00000229409.1 ENSG00000277771.1 ENSG00000279494.1 ENSG00000100426.6 ENSG00000182858.13 ENSG00000273192.1 ENSG00000184164.14 ENSG00000278869.1 ENSG00000198355.4 ENSG00000276753.1 ENSG00000188263.10 ENSG00000138892.11 ENSG00000100427.15 ENSG00000073146.15 ENSG00000280395.1 ENSG00000073150.13 ENSG00000170638.9 ENSG00000273188.1 ENSG00000273253.2 ENSG00000073169.13 ENSG00000272836.1 ENSG00000273137.1 ENSG00000128159.11 ENSG00000100429.17 ENSG00000188130.13 ENSG00000185386.14 ENSG00000196576.14 ENSG00000205593.11 ENSG00000227484.1 ENSG00000279182.1 ENSG00000100239.15 ENSG00000100241.20 ENSG00000128165.8 ENSG00000100253.12 ENSG00000100258.17 ENSG00000025770.18 ENSG00000130489.13 ENSG00000272821.1 ENSG00000025708.13 ENSG00000177989.13 ENSG00000272666.1 ENSG00000273272.1 ENSG00000226738.1 ENSG00000130487.5 ENSG00000217442.3 ENSG00000205560.12 ENSG00000254413.8 ENSG00000100288.19 ENSG00000205559.3 ENSG00000272940.1 ENSG00000008735.13 ENSG00000100299.17 ENSG00000212569.1 ENSG00000251322.7 ENSG00000206841.1 ENSG00000225929.1 ENSG00000100312.10 ENSG00000254499.1 ENSG00000213683.4 ENSG00000184319.15 ENSG00000079974.17 • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • •• • • • •• • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • 49.4 49.6 49.8 50.0 50.2 50.4 50.6 0 1 2 3 4 5 −log10(P−value) • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • ••• • • • • • • •• • • • •• • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • ••• • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • •• • • • • •• • • • •• • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • •• • • • • • • • •• • • • • • • • • • • • • • • • • • • • • • • • • • • • • 49.4 49.6 49.8 50.0 50.2 50.4 50.6 0 2 4 6 8 chr 22 physical position (MB) −log10(P−value) GWAS signal eQTL signal CRELD2
  46. TWAS Feature DLFPC HIPPO TWAS conditionally independent? Gene 29 (43.3%)

    20 (40 %) Yes 38 30 No Exon 42 (48.3%) 31 (40.3) Yes 45 46 No Jxn 50 (51%) 40 (40.8%) Yes 48 48 No Tx 35 (46.1%) 26 (40%) Yes 41 39 No
  47. Authors • Leonardo Collado-Torres o [email protected] o @fellgernon • Emily

    E. Burke • Amy Peterson • Joo Heon Shin • Richard E. Straub • Anandita Rajpurohit • Stephen A. Semick • William S. Ulrich • Cristian Valencia • Ran Tao • Amy Deep-Soboslay • Thomas M. Hyde • Joel E. Kleinman • Daniel R. Weinberger+ • Andrew E. Jaffe+ o [email protected] o @andrewejaffe 53 • BrainSeq Consortium • LIBD @lieberinstitute Funding github.com/LieberInstitute/brainseq_phase2
  48. SRA

  49. jx 1 jx 2 jx 3 jx 4 jx 5

    jx 6 Coverage Reads Gene Isoform 1 Isoform 2 Potential isoform 3 exon 1 exon 2 exon 3 exon 4 Expressed region 1: potential exon 5 doi.org/10.12688/f1000research.12223.1
  50. > library('recount') > download_study( 'ERP001942', type='rse-gene') > load(file.path('ERP001942 ', 'rse_gene.Rdata'))

    > rse <- scale_counts(rse_gene) https://github.com/leekgroup/recount-analyses/
  51. related projects • Bioconductor recountWorkflow: doi.org/10.12688/f1000research.12223.1 • Shannon Ellis &

    Leek: phenotype prediction doi.org/10.1093/nar/gky102 • Jack Fu & Taub: transcript estimations doi.org/10.1101/247346 • Madugundu & Pandey (JHU): proteomics • Luidi-Imada & Marchionni (JHU): cancer and FANTOM • Kuri-Magaña & Martínez-Barnetche (INSP Mexico): immune expression doi.org/10.3389/fimmu.2018.02679 • Ryten (UCL): Guelfi: validating expressed regions (ERs) eQTLs Zhang: improving the detection of ERs doi.org/10.1101/499103
  52. expression data for ~70,000 human samples samples phenotypes ? GTEx

    N=9,962 TCGA N=11,284 SRA N=49,848 samples expression estimates gene exon junctions ERs Answer meaningful questions about human biology and expression slide adapted from Shannon Ellis
  53. Category Frequency F 95 female 2036 Female 51 M 77

    male 1240 Male 141 Total 3640 Even when information is provided, it’s not always clear… sra_meta$Sex “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female” “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown” slide adapted from Shannon Ellis
  54. Goal : to accurately predict critical phenotype information for all

    samples in recount gene, exon, exon-exon junction and expressed region RNA-Seq data SRA Sequence Read Archive N=49,848 TCGA The Cancer Genome Atlas N=11,284 GTEx Genotype Tissue Expression Project N=9,662 slide adapted from Shannon Ellis
  55. Goal : to accurately predict critical phenotype information for all

    samples in recount gene, exon, exon-exon junction and expressed region RNA-Seq data SRA Sequence Read Archive N=49,848 GTEx Genotype Tissue Expression Project N=9,662 divide samples build and optimize phenotype predictor training set test accuracy of predictor test set TCGA The Cancer Genome Atlas N=11,284 slide adapted from Shannon Ellis
  56. Goal : to accurately predict critical phenotype information for all

    samples in recount gene, exon, exon-exon junction and expressed region RNA-Seq data SRA Sequence Read Archive N=49,848 GTEx Genotype Tissue Expression Project N=9,662 divide samples build and optimize phenotype predictor training set test accuracy of predictor predict phenotypes across samples in TCGA test set TCGA The Cancer Genome Atlas N=11,284 slide adapted from Shannon Ellis
  57. Goal : to accurately predict critical phenotype information for all

    samples in recount gene, exon, exon-exon junction and expressed region RNA-Seq data SRA Sequence Read Archive N=49,848 GTEx Genotype Tissue Expression Project N=9,662 divide samples build and optimize phenotype predictor training set predict phenotypes across SRA samples test accuracy of predictor predict phenotypes across samples in TCGA test set TCGA The Cancer Genome Atlas N=11,284 slide adapted from Shannon Ellis
  58. Sex prediction is accurate across data sets Number of Regions

    20 20 20 20 Number of Samples (N) 4,769 4,769 11,245 3,640 99.8% 99.6% 99.4% 88.5% slide adapted from Shannon Ellis
  59. Number of Regions 589 589 589 589 589 Number of

    Samples (N) 4,769 4,769 613 6,579 8,951 97.3% 96.5% 91.0% 70.2% Prediction is more accurate in healthy tissue 50.6% slide adapted from Shannon Ellis
  60. > library('recount') > download_study( 'ERP001942', type='rse-gene') > load(file.path('ERP001942 ', 'rse_gene.Rdata'))

    > rse <- scale_counts(rse_gene) > rse_with_pred <- add_predictions(rse_gene) https://github.com/leekgroup/recount-analyses/
  61. • • • • • • • • • •

    • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • adipose tissue adrenal gland bladder blood blood vessel bone bone marrow brain breast cervix cervix uteri colon epithelium esophagus fallopian tube heart intestine kidney liver lung melanoma monocytes muscle nerve ovary pancreas penis pituitary placenta prostate salivary gland skin small intestine spinal cord spleen stem cell stomach testis thyroid tonsil umbilical cord urinary bladder uterus vagina 0 3000 6000 9000 12000 0 1000 2000 3000 reported predicted
  62. Sex Female Male Age/Development Fetus Child Adolescent Adult Race/Ethnicity Asian

    Black Hispanic White Tissue Site 1 Cerebral cortex Hippocampus Brainstem Cerebellum Tissue Site 2 Frontal lobe Temporal lobe Midbrain Basal ganglia Tissue Site 3 Dorsolateral prefrontal cortex Superior temporal gyrus Substantia nigra Caudate Hemisphere Left Right Brodmann Area 1-52 Disease Status Disease Neurological control Disease Brain tumor Alzheimer’s disease Parkinson’s disease Bipolar disorder Tumor Type Glioblastoma Astrocytoma Oligodendroglioma Ependymoma Clinical Stage 1 Grade I Grade II Grade III Grade IV Clinical Stage 2 Primary Secondary Recurrent Viability Postmortem Biopsy Preparation Frozen Thawed Ashkaun Razmara, in prep.
  63. https://github.com/LieberInstitute/recount-brain/tree/master/metadata_reproducibility • Overall curation steps: starts by downloading SRA Run

    Table info, then info from the publications • Details for each SRA study Reproducibility document Ashkaun Razmara, in prep.
  64. Replicates part of the GTEx PMI paper by Ferreira et

    al. doi.org/10.1038/s41467-017- 02772-x
  65. * Jeff Leek presented Shannon Ellis’ prediction work in Toronto

    (around April 2018) https://docs.google.com/presentation/d/1FgUZZU6pW91J7zH0OqrEgxfnV1Py_ZGL3ZKHfbOZskY/edit#slide=id.g2f831fd4ae_0_306 * Dustin J. Sokolowski from Michael D. Wilson’s lab is using recount2 * Dustin joins the project and merges recount-brain with GTEx and TCGA recount_brain_v2 The SRA samples in recount-brain are complemented by 1,409 GTEx (GTEx Consortium 2015) and 707 TCGA (Brennan et al. 2013; Cancer Genome Atlas Research Network et al. 2015) samples covering 13 healthy regions of the brain and 2 tumor types, respectively. In total, there are 6,547 samples with metadata in recount-brain with 5,330 (81.4%) present in recount2
  66. * Discussed with Sean Davis (NIH) at Biological Data Science

    #biodata18 * Potential for contributing recount-brain to SRAdbV2 github.com/seandavi/SRAdbV2 * Have mapped variables to ontologies What about maintaining/growing it?
  67. The recount-brain team Hopkins Ashkaun Razmara Shannon E. Ellis Jeff

    T. Leek University of Toronto Dustin J. Sokolowski Michael D. Wilson NIH Sean Davis LIBD Andrew E. Jaffe Funding NIH R01 GM105705 NIH 1R21MH109956 NIH R01 GM121459 CIHR, NSERC Ontario Ministry of Research IDIES SciServer Hosting recount2 github.com/LieberInstitute/recount-brain