$30 off During Our Annual Pro Sale. View Details »

libd2019

 libd2019

Leonardo Collado-Torres

February 25, 2019
Tweet

More Decks by Leonardo Collado-Torres

Other Decks in Science

Transcript

  1. 11
    Analyzing BrainSeq Phase II and generating the
    recount-brain resource
    Leonardo Collado-Torres
    @fellgernon
    Data Science I with Andrew Jaffe

    View Slide

  2. Questions I’ll answer today
    • Who am I?
    • What is the BrainSeq Phase II project?
    • What is the recount-brain resource?

    View Slide

  3. What defines me
    • Bioinformatics
    • R and Bioconductor
    • Reproducibility and best practices
    • Outreach and community building
    • Back in 2005:
    I like math and coding; biology provides the challenging problems

    View Slide

  4. History
    2005-2009
    Undergrad in
    Genomic Sciences
    2009-2011 2011-2016 August 2016+
    Data Science
    Division Leader
    !
    !
    PIs:
    • Jeff Leek: 2012+
    • Andrew Jaffe: 2013+
    Ph.D. Biostatistics Staff Scientist
    Data Science Team I

    View Slide

  5. 2008+
    • BioC 2008-2011, 2014, 2017
    • useR!2013
    • rOpenSci unconf 2018
    • RStudio::conf 2019
    @fellgernon
    2010+
    Interests
    !
    @LIBDrstats
    2018+
    @CDSBMexico
    2018+
    Defunct: BmoreBiostats, Biostats Cultural Mixers
    Guest
    @RLadiesBmore
    Blog: http://lcolladotor.github.io
    2011+
    FB: 75k, Tw: 66k
    weekly

    View Slide

  6. Software tools I use every day
    Mercurial/git user 2009+
    https://github.com/lcolladotor
    From: Feb 21, 2019
    Communication tool for work and communities

    View Slide

  7. My path to LIBD
    2005-2009
    Undergrad in
    Genomic Sciences
    2009-2011 2011-2016 August 2016+
    Data Science
    Division Leader
    !
    !
    PIs:
    • Jeff Leek: 2012+
    • Andrew Jaffe: 2013+
    Ph.D. Biostatistics Staff Scientist
    Data Science Team I
    Microarrays,
    little RNA-seq
    N = 12
    (bacteria)
    N = 59
    N = 72
    potential for 1,000
    BSP2: N = 900
    + many more

    View Slide

  8. Ph.D. overview
    The goal was to develop statistical methods and software [...] in
    RNA-seq […]. We applied these methods to further our
    understanding of neuropsychiatric disorders using the Lieber Institute
    for Brain Development human brains collection (> 1000 samples).

    View Slide

  9. Lieber Institute for Brain Development
    • Role: Staff Scientist
    • PI: Andrew Jaffe
    Team:
    • Staff Scientist: Emily Burke
    • Research Associate: Madhavi Tippani
    • Postdocs: Amanda Price
    • Grad students: Matt Nguyen, Brianna Barry,
    Kira Perzel Mandell
    • Research Assistant: Nick Eagles, Stephen Semick*
    • Close collaborators: Carrie Wright, Nina Rajpurohit
    • Role details: like a postdoc (major role in some projects) with some support projects
    * Now at University of Maryland in med school
    Lab at Amy Peterson’s MPH capstone presentation

    View Slide

  10. Papers since 2016

    View Slide

  11. Papers since 2016
    RNA-seq
    WGBS
    Biostatistics
    RNA-seq
    > 70,000 samples

    View Slide

  12. Lieber Institute for Brain Development
    • miRNA kit prep comparison biorxiv.org/content/early/2018/10/30/445437
    • DNAm on WGBS across development & cell types biorxiv.org/content/early/2018/09/29/428391
    • RNA-seq DE in Schizophrenia disorder & 2 brain regions biorxiv.org/content/early/2018/09/26/426213
    • RNA-seq from stem cells biorxiv.org/content/early/2018/07/31/380758
    Peer reviewed:
    • DNAm and gene DE in Alzheimer’s disease doi.org/10.1007/s00401-019-01966-5
    • RNA-seq smoking during pregnancy doi.org/10.1038/s41380-018-0223-1
    • RNA-seq DE in Schizophrenia disorder on DLPFC doi.org/10.1038/s41593-018-0197-y
    • Histamine signaling in autism spectrum disorder doi.org/10.1038/tp.2017.87
    Pre-prints:

    View Slide

  13. Today
    BrainSeq Phase II
    brain

    View Slide

  14. View Slide

  15. BrainSeq: A Human Brain Genomics Consortium
    DLPFC
    495
    samples
    BrainSeq Phase I
    polyA+
    Jaffe et al., Nature Neuroscience, 2018
    DLPFC
    453
    samples
    HIPPO
    447
    samples
    BrainSeq Phase II
    RiboZero
    THE SCIENTIFIC FRONTIER
    Neuron, 2015
    Collado-Torres et al, bioRxiv, 2018
    Manuscript status: under revision at Neuron

    View Slide

  16. DATA
    16
    BrainSeq Phase II RNA-seq samples
    DLPFC HIPPO total
    adult (age >= 18) 374 370 744
    prenatal 29 28 57
    0 <= age < 18 50 49 99
    total 453 447 900
    • Non-psychiatric control and schizophrenia affected individuals
    • Two brain regions: dorsolateral prefrontal cortex and hippocampus
    All
    samples

    View Slide

  17. DATA
    17
    BrainSeq Phase II RNA-seq samples: by case status
    DLPFC HIPPO total
    adult 222 238 460
    prenatal 29 28 57
    0 <= age < 18 49 48 97
    total 300 314 614
    DLPFC HIPPO total
    adult 152 132 284
    prenatal 0 0 0
    0 <= age < 18 1 1 2
    total 153 133 286
    Control
    Schizophrenia
    cases

    View Slide

  18. DATA ANALYSIS
    18
    Focus on being conservative
    1.Use well established processing methods
    2.Apply strict expression cutoffs
    3.Use replication when possible
    4.Adjust for RNA quality degradation confounding
    • Using the qSVA method
    5.Avoid potential batch effects
    • Drop problematic samples
    6.Take into account correlation at the individual level

    View Slide

  19. RNA-SEQ APPROACH
    19
    BrainSeq Phase II
    Pre-natal
    Adult
    ACTG
    Birth
    Unaffected
    Controls
    Patients with
    Schizophrenia
    RNA Sequencing
    Genotyping
    + +
    +
    Gene Exons Expressed Regions
    Transcripts Junctions
    Age
    CC CA AA
    SZ CONT
    DLPFC
    HIPPO
    + region differences
    For public data, check recount2

    View Slide

  20. DATA ANALYSIS
    20
    Main processing steps
    1. Quality check (QC) on raw reads (FastQC)
    2. Failed QC? Then trim reads (Trimmomatic)
    3. Align reads to the genome (HISAT2)
    4. Count features (featureCounts + others)
    5. Calculate coverage (bam2wig)
    6. Quantify transcripts (Salmon)
    7. Create count tables (R)
    8. Genotype samples (samtools + vcftools)
    Emily E. Burke
    Nextflow version in
    preparation with
    Winter Genomics
    and Nick Eagles

    View Slide

  21. DATA ANALYSIS
    21
    Filter features with low expression


















    ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
    0.0 0.2 0.4 0.6 0.8 1.0
    20000 30000 40000 50000
    mean expression cutoff
    Number of features > cutoff
    Feature Cutoff Unit
    Gene 0.25 RPKM
    Exon 0.30 RPKM
    Jxn 0.46 RP10M
    Tx 0.38 TPM
    Gene
    • Used mean across all 900 samples
    • All analyses with filtered data
    Mean
    RPKM
    # genes >
    cutoff
    Based on the segmented R package:
    break-point estimation
    jaffelab::expression_cutoff()

    View Slide

  22. DATA ANALYSIS
    22
    Differential expression models
    • Region-specific for adult or fetal ages
    • Using only adult samples or only prenatal samples
    • Test for differences between DLFPC and HIPPO
    • Development
    • Linear age splines with breakpoints at developmental stages
    • Test for interaction between age and brain region at these splines
    • Case-control
    • By brain region
    • Test for differences between non-psychiatric controls and individuals with
    schizophrenia
    • For the first two models, we account for the fact that an individual can have two
    correlated samples: one for each brain region
    limma::duplicateCorrelation()

    View Slide

  23. DATA ANALYSIS
    23
    Differential expression models
    • Region-specific for adult or prenatal ages
    • Alternative: !"#$ = &' + )*+ + ,+" +
    ∑./0
    1 23#45. + 6789:)8+ + 898);<227*3+=>+3+ + :?@ + :+*793
    • Development
    • Alternative: !"#$ = &' + )*+ ∗ :+*793 + B+8); ∗ :+*793 +
    C7$8ℎ ∗ :+*793 + 73B)38 ∗ :+*793 + Eℎ7;= ∗ :+*793 +
    8++3 ∗ :+*793 + )=F;8 ∗ :+*793 + ,+" +
    ∑./0
    1 23#45. + 6789:)8+ + 898);<227*3+=>+3+ + :?@ + :+*793
    • Case-control
    • Alternative: !"#$ = &' + )*+ + ,+" + 6789:)8+ +
    $:@∑./0
    1 23#45. + 898);<227*3+=>+3+ + :?@ + $+*793,#+E7B7EK,L2 + M7)*39272

    View Slide

  24. DATA ANALYSIS
    24
    Using BrainSpan for replication: region-specific model
    • P-value < 0.05 in BrainSpan, consistent direction
    Similar results for the development model
    prenatal adult
    gene exon jxn
    p<0.05
    p<0.01
    p<0.001
    p<1e−04
    p<1e−05
    p<1e−06
    p<0.05
    p<0.01
    p<0.001
    p<1e−04
    p<1e−05
    p<1e−06
    0.00
    0.25
    0.50
    0.75
    1.00
    0.00
    0.25
    0.50
    0.75
    1.00
    0.00
    0.25
    0.50
    0.75
    1.00
    p−value threshold
    Replication rate

    View Slide

  25. DATA ANALYSIS
    25
    Region-specific by age results
    • For adults: 1,612 DE genes with Bonferroni < 0.01 that replicate in BrainSpan
    • For prenatal samples: 32 DE genes
    23
    6
    2
    0
    3
    0
    3
    gene exon
    jxn
    DE features grouped by gene id (prenatal)
    328
    422
    778
    23
    839
    647
    388
    gene exon
    jxn
    DE features grouped by gene id (adult)

    View Slide

  26. DATA ANALYSIS
    26
    Region-specific by age results: adult enriched biological processes by region
    G: gene, E: exon, J: exon-exon junction
    D: DLPFC, H: HIPPO
    cell−matrix adhesion
    extracellular matrix organization
    ameboidal−type cell migration
    microtubule bundle formation
    axoneme assembly
    cilium movement
    establishment of synaptic vesicle localization
    synaptic vesicle transport
    synaptic vesicle localization
    signal release from synapse
    neurotransmitter secretion
    presynaptic process involved in chemical synaptic transmission
    positive regulation of GTPase activity
    regulation of GTPase activity
    synaptic vesicle exocytosis
    neurotransmitter transport
    regulation of calcium ion−dependent exocytosis
    synaptic vesicle cycle
    regulation of synapse organization
    positive regulation of synaptic transmission
    regulation of neuron projection development
    regulation of small GTPase mediated signal transduction
    extracellular structure organization
    positive regulation of nervous system development
    positive regulation of neurogenesis
    positive regulation of cell development
    axon development
    axonogenesis
    regulation of hormone levels
    cardiac chamber development
    cardiac septum development
    kidney epithelium development
    nephron development
    urogenital system development
    cardiac chamber morphogenesis
    cardiac septum morphogenesis
    renal system development
    kidney development
    regulation of transmembrane transport
    actomyosin structure organization
    regulation of trans−synaptic signaling
    modulation of chemical synaptic transmission
    synapse organization
    regulation of cell morphogenesis
    synaptic transmission, glutamatergic
    regulation of synaptic transmission, glutamatergic
    potassium ion transmembrane transport
    cellular potassium ion transport
    potassium ion transport
    G.D
    (609)
    G.H
    (573)
    E.D
    (1101)
    E.H
    (1087)
    J.D
    (850)
    J.H
    (823)
    GeneRatio
    0.02
    0.03
    0.04
    0.05
    0.06
    0.01
    0.02
    0.03
    0.04
    p.adjust
    ontology: BP

    View Slide

















  27. ● ●











































    Pluripotency Glial
    Fetal
    Infant
    C
    hild
    Teens
    Adult
    50+
    Fetal
    Infant
    C
    hild
    Teens
    Adult
    50+
    0.4
    0.6
    0.8
    0.0
    0.1
    0.2
    0.3
    Age Group
    Cell Type Proportion
    Region
    DLPFC
    Hippo
    DATA ANALYSIS
    27
    Development model: similar composition in prenatal across regions by DNAm
    Stephen A. Semick

    View Slide

  28. DATA ANALYSIS
    28
    Development model results
    • 5,982 (~55%) genes contain differentially expressed exons and splice junctions that
    replicated in BrainSpan (Bonferroni < 1%)
    2354
    2260
    1762
    243
    5982
    8501
    1558
    gene exon
    jxn
    DE features grouped by gene id

    View Slide

  29. DATA ANALYSIS
    29
    Development model results
    • Normalized expression over age for GABRD
    −2 0 2 4 6
    log2(CPM + 0.5)
    14 16 18 20 22
    PCW
    0.0 0.2 0.4 2 4 6 8 12 14 16 18 2020 30 40 50 50 55 60 65 70 75 80 85
    ENSG00000187730.7 GABRD p−bonf 0
    Age
    DLPFC HIPPO

    View Slide

  30. DATA ANALYSIS
    30
    Development model results
    • Normalized expression over age after removing effect of terms from the null model
    jaffelab::cleaningY() and jaffelab::agePlotter()
    −2 −1 0 1 2 3
    log2(CPM + 0.5) − covariate effects removed
    14 16 18 20 22
    PCW
    0.0 0.2 0.4 2 4 6 8 12 14 16 18 2020 30 40 50 50 55 60 65 70 75 80 85
    ENSG00000187730.7 GABRD p−bonf 0
    Age
    DLPFC HIPPO

    View Slide

  31. qSVA WORKFLOW
    31 Slide adapted from Amy Peterson
    Jaffe et al., PNAS, 2017
    Model 1 (6429 genes)
    Log2 FC Dx
    Log2 FC Degradation
    http://research.libd.org/rstatsclub/2018/12/11/quality-surrogate-variable-analysis/

    View Slide

  32. DEqual HIPPO
    32
    Model 1 (6429 genes)
    Model 1. Naïve model
    E / = 10
    + 11
    45
    DEqual plots demonstrate effectiveness of statistical correction
    r = 0.412
    Slide adapted from Amy Peterson
    Log2 FC Dx
    Log2 FC Degradation
    HIPPO
    333
    samples

    View Slide

  33. DEqual HIPPO
    33
    Model 1 (6429 genes) Model 2 (63 genes)
    Model 1. Naïve model
    E / = 10 + 1145
    Model 2. Added RNA-quality and demographic covariates
    E / = 10 + 1145 + 12FGH + 13JH5+ 14LMNOPFNH+ 15RPSTRFte + 16GHVHPFNH
    + 17PXS + ∑Z[\
    ] γMJV_`aM
    DEqual plots demonstrate effectiveness of statistical correction
    r = 0.412 r = 0.0712
    Slide adapted from Amy Peterson
    Log2 FC Dx Log2 FC Dx, adj. cov
    Log2 FC Degradation
    Log2 FC Degradation
    HIPPO
    333
    samples

    View Slide

  34. DEqual HIPPO
    34
    Model 1 (6429 genes) Model 2 (63 genes) Model 3 (48 genes)
    Model 1. Naïve model
    E / = 10 + 1145
    Model 2. Added RNA-quality and demographic covariates
    E / = 10 + 1145 + 12FGH + 13JH5+ 14LMNOPFNH+ 15RPSTRFte + 16GHVHPFNH
    + 17PXS + ∑Z[\
    ] γMJV_`aM
    Model 3. Added qSVs
    E / = 10 + 1145 + 12FGH + 13JH5+ 14LMNOPFNH+ 15RPSTRFte + 16GHVHPFNH
    + 17PXS + ∑Z[\
    ] γMJV_`aM
    + ∑Z[\
    d eMfghM
    DEqual plots demonstrate effectiveness of statistical correction
    HIPPO
    333
    samples
    r = 0.412 r = 0.0712 r = -0.00173
    Slide adapted from Amy Peterson
    Log2 FC Dx Log2 FC Dx, adj. cov Log2 FC Dx, adj
    qSVs
    Log2 FC Degradation
    Log2 FC Degradation
    Log2 FC Degradation

    View Slide

  35. −6 −4 −2 2 4 6
    −6 −4 −2 0 2 4 6
    0
    t−statistic DLPFC
    t−statistic BSP1
    r = 0.809
    DATA ANALYSIS
    35
    Case-control: by region
    • 48 DE genes at FDR <5% in hippocampus, 243 in DLPFC (FDR <5%)
    • DLPFC results agree with BrainSeq Phase I
    Amy Peterson
    −6 −4 −2 2 4
    −4 −2 0 2 4
    0
    t−statistic HIPPO
    t−statistic DLPFC
    r = 0.644
    Control > SCZD
    SCZD > Control
    DLPFC FDR<10%, HIPPO FDR<20%

    View Slide

  36. axon development
    axonogenesis
    regulation of axonogenesis
    regulation of neuron projection development
    neutrophil chemotaxis
    regulation of B cell proliferation
    regulation of B cell activation
    positive regulation of B cell activation
    B cell receptor signaling pathway
    B cell activation
    lymphocyte activation
    regulation of leukocyte cell−cell adhesion
    regulation of cell−cell adhesion
    positive regulation of alpha−beta T cell activation
    regulation of leukocyte activation
    positive regulation of lymphocyte activation
    positive regulation of cell activation
    positive regulation of leukocyte activation
    leukocyte migration
    positive regulation of leukocyte cell−cell adhesion
    positive regulation of T cell activation
    leukocyte chemotaxis
    response to organophosphorus
    ATF6−mediated unfolded protein response
    positive regulation of cell adhesion
    cell chemotaxis
    positive regulation of cell−cell adhesion
    protein folding
    protein folding in endoplasmic reticulum
    HgC
    (137)
    HeC
    (306)
    DgC
    (297)
    DeS
    (250)
    GeneRatio
    0.025
    0.050
    0.075
    0.100
    0.01
    0.02
    0.03
    0.04
    p.adjust
    ontology: BP
    DATA ANALYSIS
    36
    • BP enrichment in Control > SCZD at gene level
    • Immune processes
    g: gene, e: exon, j: exon-exon junction
    D: DLPFC, H: HIPPO, C: control, S: SCZD
    Case-control: by region

    View Slide

  37. Gene Individual 1 Individual 2
    1 10 15
    2 5 22

    DLPFC
    Gene Individual 1 Individual 2
    1 10 17
    2 6 16

    HIPPO
    correlation( (10, 5, …), (10, 6, …) ) =~ 0.9
    correlation( (15, 22, …), (17, 16, …) ) =~ 0.7
    Individual 1
    Individual 2
    Control vs SCZD

    View Slide

  38. A B
    C D
    Control SCZD
    0.76 0.77 0.78 0.79
    cleaned expr (keeping Dx) − geneRpkm
    p−value: 0.0164
    SCZD diagnosis
    Correlation
    Control SCZD
    0.71 0.72 0.73 0.74 0.75 0.76
    cleaned expr (keeping Dx) − exonRpkm
    p−value: 0.0499
    SCZD diagnosis
    Correlation
    Control SCZD
    0.54 0.55 0.56 0.57
    cleaned expr (keeping Dx) − jxnRp10m
    p−value: 1.72e−05
    Correlation
    Control SCZD
    0.47 0.48 0.49 0.50 0.51 0.52
    cleaned expr (keeping Dx) − txTpm
    p−value: 0.00992
    Correlation
    Correlation
    Correlation
    Correlation
    Correlation
    Gene: 0.0164
    Exon: 0.0499
    Jxn: 1.72 * 10-5
    Tx: 0.00992

    View Slide

  39. DATA ANALYSIS
    39
    HIPPO eQTLs
    • 11,237,357 eQTL associations (FDR <1%) across genes, exons and junctions
    corresponding to 17,719 genes
    Emily E. Burke
    2061
    3183
    2163
    274
    5945
    2915
    1178
    gene exon
    jxn
    eQTLs grouped by gene id
    Includes 60 risk SNPs from PGC2
    8
    +,332
    E
    0 1 2
    −1 0 1 2 3 4
    chr2:5822204358229ï-[Q
    rs74563533:58250433:G:A
    r2:58 250 433
    Residualized Expression







    ● ●










































































































































    ●●







































    ● ●
























































    ● ●



    ● ●





    ● ●
















    ● ●


    ●●




































    ●●



























    ● ●











    ● ●




























    ● ●


    0 1 2
    −1 0 1 2 3 4
    chr2:5822204358229ï-[Q
    rs75575209:58138192:A:T
    r2:58 138 192
    Residualized Expression



    ● ●
































    ●●
    ● ●






    ● ●









    ●●





























    ●●























































    ● ●

    ●●




































    ● ●
























































    ●●


    ● ●


    ● ●























    ● ●







    ● ●

















    ●●






    ● ●
    ● ●



























    ● ●























    ● ● ●








    ●●




    ● ● ●


    p=5.75e−34
    p=3.82e−81
    F

    View Slide

  40. DATA ANALYSIS
    40
    Region dependent eQTLs
    • 205,618 region-dependent eQTLs (FDR <1%) corresponding to 1,484 genes
    • Includes 3 PCG2 schizophrenia risk loci
    Emily E. Burke
    19394
    10734
    6493
    2622
    3937
    4398
    34259
    gene exon
    jxn
    eQTLs grouped by SNP id
    488
    246
    319
    18
    99
    63
    251
    gene exon
    jxn
    eQTLs grouped by gene id

    View Slide

  41. A B
    0.D 1.D 2.D 0.H 1.H 2.H
    4.0 4.4 4.8 5.2
    LRP1(16(
    (Exon)
    rs324015
    Residualized Expression
    p=3.91e−05
    0.D 1.D 2.D 0.H 1.H 2.H
    −1 0 1 2 3
    2YHUODSVJHQH15*1
    FKUï-[Q
    rs12293670:124612932:A:G
    Residualized Expression
    p=5.27e−26
    C D
    0.D 1.D 2.D 0.H 1.H 2.H
    −1 0 1 2 3 4
    2YHUODSVJHQH1*()
    FKUïï-[Q
    rs4144797:233562197:T:C
    Residualized Expression
    p=3.5e−14
    0.D 1.D 2.D 0.H 1.H 2.H
    6.8 7.2 7.6 8.0
    LRP1
    FKUï-[Q
    rs324015
    Residualized Expression
    p=2.02e−06
    D: DLPFC, H: HIPPO
    DATA ANALYSIS
    Region dependent eQTLs
    These 3 SNPs are also
    meQTLs (Jaffe et al, 2016)

    View Slide

  42. Overall eQTLs at FDR < 1% - rAggr eQTL analysis with PCG2 risk SNPs
    Hippocampus DLPFC
    # unique SNPs (unique index SNPs) 5510 (103) 6780 (116)
    # unique features 1731 2525
    # Unique genes 123 171
    # Unique transcripts 244 332
    # Unique exons 857 1363
    # Unique junctions 507 659
    Emily E. Burke
    21 8
    95
    DLPFC HIPPO
    In BSP2: 163/179 (91.1%)
    FDR <1%: 124/163 (76.6%)

    View Slide

  43. • We will soon update our eQTL browser at http://eqtl.brainseq.org/ Bill Ulrich

    View Slide

  44. TWAS
    Gusev et al, Nature Genetics, 2016

    View Slide

  45. TWAS
    SNP ID chromosome position
    rs101 12 1
    rs102 13 10000

    SNP ID chromosome position
    rs101:A:T 12 1
    rs102 13 10005

    SNP ID chromosome position
    rs101 12 1
    rs102 13 10000

    LD reference LIBD BSP2 SNPs
    GWAS summary SNPs
    https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/README.md
    1,190,321 82,393,211
    6,500,475
    CLOZUK+PGC2

    View Slide

  46. TWAS
    SNP ID chromosome position
    rs101 12 1
    rs102 13 10000

    SNP ID chromosome position
    rs101:A:T 12 1
    rs102 13 10005

    SNP ID chromosome position
    rs101 12 1
    rs102 13 10000

    LD reference LIBD BSP2 SNPs
    GWAS summary SNPs
    https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/README.md
    1,190,321 82,393,211
    6,500,475
    CLOZUK+PGC2

    View Slide

  47. TWAS
    SNP ID chromosome position
    rs101:A:T 12 1
    rs102 13 10005

    SNP ID chromosome position
    rs101:A:T 12 1
    rs102 13 10005

    SNP ID chromosome position
    rs101:A:T 12 1
    rs102 13 10005

    LD reference LIBD BSP2 SNPs
    GWAS summary SNPs
    https://github.com/LieberInstitute/brainseq_phase2/blob/master/twas/README.md
    1,022,527
    1,022,546
    4,316,013
    CLOZUK+PGC2

    View Slide

  48. TWAS
    DLPFC HIPPO
    gene 7,683 (32.8%) 5,866 (25.1%)
    exon 61,678 (16.2%) 47,583 (12.5%)
    jxn 33,754 (12.5%) 27,088 (10.1%)
    tx 14,118 (15.9%) 11,567 (13%)
    Features with TWAS weights
    Removed sex, snp PCs 1-5, expr PCs
    Kept diagnosis
    jaffelab::cleaningY()

    View Slide

  49. TWAS
    Among loci
    considered?
    DLPFC HIPPO
    Yes 99 (85.3%) 87 (84.5 %)
    No 17 16
    116 103
    PGC2 loci with eQTLs (FDR<1%): considered in TWAS?
    ? DLPFC HIPPO
    Yes 48 (48.5%) 45 (51.7%)
    No 51 42
    99 87
    Locus has a TWAS p-value among all features?
    ? DLPFC HIPPO
    Yes 26 (54.2%) 25 (55.5%)
    No 22 20
    48 45
    TWAS p-value <1x10-8?

    View Slide

  50. TWAS
    ENSG00000188511.12
    ENSG00000280224.1
    ENSG00000279597.1
    ENSG00000264139.1
    ENSG00000212939.2
    ENSG00000213279.2
    ENSG00000235111.1
    ENSG00000226954.1
    ENSG00000223142.1
    ENSG00000214727.2
    ENSG00000100425.18
    ENSG00000279345.1
    ENSG00000236867.1
    ENSG00000260613.1
    ENSG00000229409.1
    ENSG00000277771.1
    ENSG00000279494.1
    ENSG00000100426.6
    ENSG00000182858.13
    ENSG00000273192.1
    ENSG00000184164.14
    ENSG00000278869.1
    ENSG00000198355.4
    ENSG00000276753.1
    ENSG00000188263.10
    ENSG00000138892.11
    ENSG00000100427.15
    ENSG00000073146.15
    ENSG00000280395.1
    ENSG00000073150.13
    ENSG00000170638.9
    ENSG00000273188.1
    ENSG00000273253.2
    ENSG00000073169.13
    ENSG00000272836.1
    ENSG00000273137.1
    ENSG00000128159.11
    ENSG00000100429.17
    ENSG00000188130.13
    ENSG00000185386.14
    ENSG00000196576.14
    ENSG00000205593.11
    ENSG00000227484.1
    ENSG00000279182.1
    ENSG00000100239.15
    ENSG00000100241.20
    ENSG00000128165.8
    ENSG00000100253.12
    ENSG00000100258.17
    ENSG00000025770.18
    ENSG00000130489.13
    ENSG00000272821.1
    ENSG00000025708.13
    ENSG00000177989.13
    ENSG00000272666.1
    ENSG00000273272.1
    ENSG00000226738.1
    ENSG00000130487.5
    ENSG00000217442.3
    ENSG00000205560.12
    ENSG00000254413.8
    ENSG00000100288.19
    ENSG00000205559.3
    ENSG00000272940.1
    ENSG00000008735.13
    ENSG00000100299.17
    ENSG00000212569.1
    ENSG00000251322.7
    ENSG00000206841.1
    ENSG00000225929.1
    ENSG00000100312.10
    ENSG00000254499.1
    ENSG00000213683.4
    ENSG00000184319.15
    ENSG00000079974.17























































    ●●























































    ●●


    ●●



















































    ●●






    ●●



    ●●






    ●●




























































    ● ●






















































































    ●●















































































    ● ●





















































    ●●




































































    49.4 49.6 49.8 50.0 50.2 50.4 50.6
    0
    1
    2
    3
    4
    5
    chr 22 physical position (MB)
    −log10(P−value)
















































































































    ●●


    ●●



    ● ●













































    ●●●











    ●●






    ●●



































    ●●

    ● ●




























































































































































































































    ● ●






    ●●














    ●●
































































    ●●


    Can also identify conditionally independent features
    GWAS: grey
    TWAS: blue
    CRELD2

    View Slide

  51. TWAS
    ENSG00000188511.12
    ENSG00000280224.1
    ENSG00000279597.1
    ENSG00000264139.1
    ENSG00000212939.2
    ENSG00000213279.2
    ENSG00000235111.1
    ENSG00000226954.1
    ENSG00000223142.1
    ENSG00000214727.2
    ENSG00000100425.18
    ENSG00000279345.1
    ENSG00000236867.1
    ENSG00000260613.1
    ENSG00000229409.1
    ENSG00000277771.1
    ENSG00000279494.1
    ENSG00000100426.6
    ENSG00000182858.13
    ENSG00000273192.1
    ENSG00000184164.14
    ENSG00000278869.1
    ENSG00000198355.4
    ENSG00000276753.1
    ENSG00000188263.10
    ENSG00000138892.11
    ENSG00000100427.15
    ENSG00000073146.15
    ENSG00000280395.1
    ENSG00000073150.13
    ENSG00000170638.9
    ENSG00000273188.1
    ENSG00000273253.2
    ENSG00000073169.13
    ENSG00000272836.1
    ENSG00000273137.1
    ENSG00000128159.11
    ENSG00000100429.17
    ENSG00000188130.13
    ENSG00000185386.14
    ENSG00000196576.14
    ENSG00000205593.11
    ENSG00000227484.1
    ENSG00000279182.1
    ENSG00000100239.15
    ENSG00000100241.20
    ENSG00000128165.8
    ENSG00000100253.12
    ENSG00000100258.17
    ENSG00000025770.18
    ENSG00000130489.13
    ENSG00000272821.1
    ENSG00000025708.13
    ENSG00000177989.13
    ENSG00000272666.1
    ENSG00000273272.1
    ENSG00000226738.1
    ENSG00000130487.5
    ENSG00000217442.3
    ENSG00000205560.12
    ENSG00000254413.8
    ENSG00000100288.19
    ENSG00000205559.3
    ENSG00000272940.1
    ENSG00000008735.13
    ENSG00000100299.17
    ENSG00000212569.1
    ENSG00000251322.7
    ENSG00000206841.1
    ENSG00000225929.1
    ENSG00000100312.10
    ENSG00000254499.1
    ENSG00000213683.4
    ENSG00000184319.15
    ENSG00000079974.17























































    ●●























































    ●●


    ●●



















































    ●●






    ●●



    ●●






    ●●




























































    ● ●






















































































    ●●















































































    ● ●





































    ●●














    ●●




































































    49.4 49.6 49.8 50.0 50.2 50.4 50.6
    0
    1
    2
    3
    4
    5
    −log10(P−value)























































    ●●























































    ●●


    ●●



    ● ●













































    ●●●






    ●●



    ●●





    ● ●●



































    ●●

    ● ●












































































































    ●●














































































































    ● ●






    ●●














    ●●
































































    ●●































    ●●●


    ●●










    ● ●





































    ●●





    ●●




    ●● ●


    ●●





    ●●



    ● ●























    ● ●



















































    ● ●















    ●●






































    ●●





    ●●




























    ●●


    ● ●



    ●●




























    49.4 49.6 49.8 50.0 50.2 50.4 50.6
    0
    2
    4
    6
    8
    chr 22 physical position (MB)
    −log10(P−value)
    GWAS signal
    eQTL signal
    CRELD2

    View Slide

  52. TWAS
    Feature DLFPC HIPPO TWAS conditionally
    independent?
    Gene 29 (43.3%) 20 (40 %) Yes
    38 30 No
    Exon 42 (48.3%) 31 (40.3) Yes
    45 46 No
    Jxn 50 (51%) 40 (40.8%) Yes
    48 48 No
    Tx 35 (46.1%) 26 (40%) Yes
    41 39 No

    View Slide

  53. Authors
    • Leonardo Collado-Torres
    o [email protected]
    o @fellgernon
    • Emily E. Burke
    • Amy Peterson
    • Joo Heon Shin
    • Richard E. Straub
    • Anandita Rajpurohit
    • Stephen A. Semick
    • William S. Ulrich
    • Cristian Valencia
    • Ran Tao
    • Amy Deep-Soboslay
    • Thomas M. Hyde
    • Joel E. Kleinman
    • Daniel R. Weinberger+
    • Andrew E. Jaffe+
    o [email protected]
    o @andrewejaffe
    53
    • BrainSeq Consortium
    • LIBD @lieberinstitute
    Funding
    github.com/LieberInstitute/brainseq_phase2

    View Slide

  54. View Slide

  55. https://jhubiostatistics.shinyapps.io/recount/

    View Slide

  56. http://rail.bio/
    Slide adapted from Ben Langmead
    by Abhinav Nellore

    View Slide

  57. http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

    View Slide

  58. GTEx TCGA
    slide adapted from Shannon Ellis

    View Slide

  59. SRA

    View Slide

  60. jx 1 jx 2 jx 3 jx 4
    jx 5
    jx 6
    Coverage
    Reads
    Gene
    Isoform 1
    Isoform 2
    Potential
    isoform 3
    exon 1 exon 2 exon 3 exon 4
    Expressed region 1:
    potential exon 5
    doi.org/10.12688/f1000research.12223.1

    View Slide

  61. doi.org/10.12688/f1000research.12223.1

    View Slide

  62. > library('recount')
    > download_study( 'ERP001942', type='rse-gene')
    > load(file.path('ERP001942 ', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    https://github.com/leekgroup/recount-analyses/

    View Slide

  63. slide adapted from Jeff Leek

    View Slide

  64. related projects
    • Bioconductor recountWorkflow: doi.org/10.12688/f1000research.12223.1
    • Shannon Ellis & Leek: phenotype prediction doi.org/10.1093/nar/gky102
    • Jack Fu & Taub: transcript estimations doi.org/10.1101/247346
    • Madugundu & Pandey (JHU):
    proteomics
    • Luidi-Imada & Marchionni (JHU):
    cancer and FANTOM
    • Kuri-Magaña & Martínez-Barnetche (INSP Mexico):
    immune expression doi.org/10.3389/fimmu.2018.02679
    • Ryten (UCL):
    Guelfi: validating expressed regions (ERs) eQTLs
    Zhang: improving the detection of ERs doi.org/10.1101/499103

    View Slide

  65. expression data for ~70,000 human samples
    samples
    phenotypes
    ?
    GTEx
    N=9,962
    TCGA
    N=11,284
    SRA
    N=49,848
    samples
    expression
    estimates
    gene
    exon
    junctions
    ERs
    Answer meaningful
    questions about
    human biology and
    expression
    slide adapted from Shannon Ellis

    View Slide

  66. Category Frequency
    F 95
    female 2036
    Female 51
    M 77
    male 1240
    Male 141
    Total 3640
    Even when information is provided, it’s not always
    clear…
    sra_meta$Sex
    “1 Male, 2 Female”, “2 Male, 1 Female”,
    “3 Female”, “DK”, “male and female”
    “Male (note: ….)”, “missing”, “mixed”,
    “mixture”, “N/A”, “Not available”, “not
    applicable”, “not collected”, “not
    determined”, “pooled male and female”,
    “U”, “unknown”, “Unknown”
    slide adapted from Shannon Ellis

    View Slide

  67. Goal :
    to accurately
    predict critical
    phenotype
    information for
    all samples in
    recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA
    Sequence Read Archive
    N=49,848
    TCGA
    The Cancer Genome
    Atlas
    N=11,284
    GTEx
    Genotype Tissue Expression
    Project
    N=9,662
    slide adapted from Shannon Ellis

    View Slide

  68. Goal :
    to accurately
    predict critical
    phenotype
    information for
    all samples in
    recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA
    Sequence Read Archive
    N=49,848
    GTEx
    Genotype Tissue Expression
    Project
    N=9,662
    divide
    samples
    build and
    optimize
    phenotype
    predictor
    training
    set
    test
    accuracy
    of
    predictor
    test
    set
    TCGA
    The Cancer Genome
    Atlas
    N=11,284
    slide adapted from Shannon Ellis

    View Slide

  69. Goal :
    to accurately
    predict critical
    phenotype
    information for
    all samples in
    recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA
    Sequence Read Archive
    N=49,848
    GTEx
    Genotype Tissue Expression
    Project
    N=9,662
    divide
    samples
    build and
    optimize
    phenotype
    predictor
    training
    set
    test
    accuracy
    of
    predictor
    predict
    phenotypes
    across
    samples in
    TCGA
    test
    set
    TCGA
    The Cancer Genome
    Atlas
    N=11,284
    slide adapted from Shannon Ellis

    View Slide

  70. Goal :
    to accurately
    predict critical
    phenotype
    information for
    all samples in
    recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA
    Sequence Read Archive
    N=49,848
    GTEx
    Genotype Tissue Expression
    Project
    N=9,662
    divide
    samples
    build and
    optimize
    phenotype
    predictor
    training
    set
    predict
    phenotypes
    across SRA
    samples
    test
    accuracy
    of
    predictor
    predict
    phenotypes
    across
    samples in
    TCGA
    test
    set
    TCGA
    The Cancer Genome
    Atlas
    N=11,284
    slide adapted from Shannon Ellis

    View Slide

  71. Sex
    prediction is
    accurate
    across data
    sets
    Number of Regions 20 20 20 20
    Number of
    Samples (N)
    4,769 4,769 11,245 3,640
    99.8% 99.6% 99.4%
    88.5%
    slide adapted from Shannon Ellis

    View Slide

  72. Number of Regions 589 589 589 589 589
    Number of Samples
    (N)
    4,769 4,769 613 6,579 8,951
    97.3% 96.5% 91.0%
    70.2%
    Prediction
    is more
    accurate in
    healthy
    tissue
    50.6%
    slide adapted from Shannon Ellis

    View Slide

  73. > library('recount')
    > download_study( 'ERP001942', type='rse-gene')
    > load(file.path('ERP001942 ', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    > rse_with_pred <- add_predictions(rse_gene)
    https://github.com/leekgroup/recount-analyses/

    View Slide








































  74. ● ●



    adipose tissue
    adrenal gland
    bladder
    blood
    blood vessel
    bone
    bone marrow
    brain
    breast
    cervix
    cervix uteri
    colon
    epithelium
    esophagus
    fallopian tube
    heart
    intestine
    kidney
    liver
    lung
    melanoma
    monocytes
    muscle
    nerve
    ovary
    pancreas
    penis
    pituitary
    placenta
    prostate
    salivary gland
    skin
    small intestine
    spinal cord
    spleen
    stem cell
    stomach
    testis
    thyroid
    tonsil
    umbilical cord
    urinary bladder
    uterus
    vagina
    0
    3000
    6000
    9000
    12000
    0 1000 2000 3000
    reported
    predicted

    View Slide

  75. • 62 SRA studies
    • 4,431 rows by 48 columns
    Ashkaun Razmara, in prep.

    View Slide

  76. Sex Female Male
    Age/Development Fetus Child Adolescent Adult
    Race/Ethnicity Asian Black Hispanic White
    Tissue Site 1 Cerebral cortex Hippocampus Brainstem Cerebellum
    Tissue Site 2 Frontal lobe Temporal lobe Midbrain Basal ganglia
    Tissue Site 3 Dorsolateral
    prefrontal cortex
    Superior temporal
    gyrus
    Substantia nigra Caudate
    Hemisphere Left Right
    Brodmann Area 1-52
    Disease Status Disease Neurological control
    Disease Brain tumor Alzheimer’s disease Parkinson’s disease Bipolar disorder
    Tumor Type Glioblastoma Astrocytoma Oligodendroglioma Ependymoma
    Clinical Stage 1 Grade I Grade II Grade III Grade IV
    Clinical Stage 2 Primary Secondary Recurrent
    Viability Postmortem Biopsy
    Preparation Frozen Thawed
    Ashkaun Razmara, in prep.

    View Slide

  77. https://github.com/LieberInstitute/recount-brain/tree/master/metadata_reproducibility
    • Overall curation steps: starts by downloading SRA Run Table info, then info from the publications
    • Details for each SRA study
    Reproducibility document
    Ashkaun Razmara, in prep.

    View Slide

  78. Ashkaun Razmara, in prep.

    View Slide

  79. Replicates part of the GTEx PMI paper by
    Ferreira et al. doi.org/10.1038/s41467-017-
    02772-x

    View Slide

  80. Code Example:
    research.libd.org/recount-brain/example_PMI/example_PMI.html
    research.libd.org/recount-brain/example_PMI/example_PMI.Rmd
    Replicates part of the GTEx PMI paper by Ferreira et al.
    doi.org/10.1038/s41467-017-02772-x

    View Slide

  81. * Jeff Leek presented Shannon Ellis’ prediction work in Toronto (around April 2018)
    https://docs.google.com/presentation/d/1FgUZZU6pW91J7zH0OqrEgxfnV1Py_ZGL3ZKHfbOZskY/edit#slide=id.g2f831fd4ae_0_306
    * Dustin J. Sokolowski from Michael D. Wilson’s lab is using recount2
    * Dustin joins the project and merges recount-brain with GTEx and TCGA
    recount_brain_v2
    The SRA samples in recount-brain are complemented by 1,409 GTEx (GTEx Consortium
    2015) and 707 TCGA (Brennan et al. 2013; Cancer Genome Atlas Research Network et
    al. 2015) samples covering 13 healthy regions of the brain and 2 tumor types, respectively.
    In total, there are 6,547 samples with metadata in recount-brain with 5,330 (81.4%)
    present in recount2

    View Slide

  82. * Discussed with Sean Davis (NIH) at Biological Data Science #biodata18
    * Potential for contributing recount-brain to SRAdbV2 github.com/seandavi/SRAdbV2
    * Have mapped variables to ontologies
    What about maintaining/growing it?

    View Slide

  83. The recount-brain team
    Hopkins
    Ashkaun Razmara
    Shannon E. Ellis
    Jeff T. Leek
    University of
    Toronto
    Dustin J. Sokolowski
    Michael D. Wilson
    NIH
    Sean Davis
    LIBD
    Andrew E. Jaffe
    Funding
    NIH R01 GM105705
    NIH 1R21MH109956
    NIH R01 GM121459
    CIHR, NSERC
    Ontario Ministry of Research
    IDIES SciServer
    Hosting recount2
    github.com/LieberInstitute/recount-brain

    View Slide