Reproducible RNA-seq analysis with recount and recount-brain

Remote class/talk for LCG-UNAM 2018 on recount and recount-brain.

Transcript

  1. Reproducible RNA-seq analysis with recount and recount-brain
    Leonardo Collado-Torres
    @fellgernon
    #LCG2018

  2. Reference genome
    Reads

  3. GTEx TCGA
    slide adapted from Shannon Ellis

  4. Slide adapted from Ben Langmead

  5. http://rail.bio/
    Slide adapted from Ben Langmead

  6. http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

  7. https://jhubiostatistics.shinyapps.io/recount/

  8. Schematic of reads and coverage over a gene
    Exon-exon junctions: jx 1, jx 2, jx 3, jx 4, jx 5, jx 6
    Gene: exon 1, exon 2, exon 3, exon 4
    Isoform 1, Isoform 2, Potential isoform 3
    Expressed region 1: potential exon 5

  9. exon 1 exon 2
    exon 3

  10. disjoint exon 1
    disjoint exon 2 disjoint exon 3

  11. Coverage along the genome (x-axis: genome position, ticks at 5, 10, 15; y-axis: coverage, 0 to 5)
    Per-base coverage: 3 3 5 4 4 2 2 3 1 3 3 1 4 4 2 1
    AUC = area under coverage = 45
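
    The AUC on this slide can be checked with a few lines of R. This is a minimal sketch that assumes the 16 numbers on the slide are the per-base coverage values, one per genome position; the variable names are illustrative and not part of recount.

```r
# Per-base coverage values transcribed from the slide (16 genome positions).
coverage <- c(3, 3, 5, 4, 4, 2, 2, 3, 1, 3, 3, 1, 4, 4, 2, 1)

# With one coverage value per base, the area under the coverage curve
# is simply the sum of the per-base values.
auc <- sum(coverage)
print(auc)  # 45, matching the slide
```

    Because the AUC grows with sequencing depth, it can serve as a per-sample library size when comparing coverage-based counts across samples.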

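  12. > library('recount')
    > ## Download the gene-level RangedSummarizedExperiment for study ERP001942
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > ## Scale the coverage-based counts so samples are comparable
    > rse <- scale_counts(rse_gene)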
    https://github.com/leekgroup/recount-analyses/

  13. slide adapted from Jeff Leek

  14. > library('recount')
    > ## Download both studies and collect their gene-level objects in a list
    > studies <- c('SRP029880', 'SRP059039')
    > dat <- lapply(studies, function(study) {
    +     download_study(study, type = 'rse-gene')
    +     load(file.path(study, 'rse_gene.Rdata'))
    +     rse_gene
    + })
    > ## Merge the two studies by sample (column)
    > mdat <- do.call(cbind, dat)
    https://github.com/leekgroup/recount-analyses/

  15. Collado-Torres et al, Nat. Biotech, 2017

  16. Schematic of reads and coverage over a gene
    Exon-exon junctions: jx 1, jx 2, jx 3, jx 4, jx 5, jx 6
    Gene: exon 1, exon 2, exon 3, exon 4
    Isoform 1, Isoform 2, Potential isoform 3
    Expressed region 1: potential exon 5

  17. Collado-Torres et al, NAR, 2017

  18. Postmortem Human Brain Samples
    Discovery data: Fetal, Infant, Child, Teen, Adult, 50+ (6 / group, N = 36)
    Replication data: Fetal, Infant, Child, Teen, Adult, 50+ (6 / group, N = 36)
    Jaffe et al, Nat. Neuroscience, 2015

  19. Jaffe et al, Nat. Neuroscience, 2015

  20. BrainSpan data
    Jaffe et al, Nat. Neuroscience, 2015

  21. expression data for ~70,000 human samples
    GTEx N=9,962
    TCGA N=11,284
    SRA N=49,848
    samples
    expression estimates: gene, exon, junctions, ERs
    slide adapted from Shannon Ellis

  22. expression data for ~70,000 human samples
    Answer meaningful questions about human biology and expression
    GTEx N=9,962
    TCGA N=11,284
    SRA N=49,848
    samples
    expression estimates: gene, exon, junctions, ERs
    slide adapted from Shannon Ellis

  23. expression data for ~70,000 human samples
    samples
    phenotypes: ?
    GTEx N=9,962
    TCGA N=11,284
    SRA N=49,848
    samples
    expression estimates: gene, exon, junctions, ERs
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis

  24. Category   Frequency
    F          95
    female     2036
    Female     51
    M          77
    male       1240
    Male       141
    Total      3640
    Even when information is provided, it’s not always clear…
    sra_meta$Sex
    “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”,
    “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”,
    “not applicable”, “not collected”, “not determined”, “pooled male and female”,
    “U”, “unknown”, “Unknown”
    slide adapted from Shannon Ellis
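
    The six spellings in the frequency table collapse to two labels once case and abbreviations are harmonized. The sketch below shows one way this could be done in base R; `normalize_sex()` is a hypothetical helper written for this illustration (it is not a recount function), and the counts are the ones from the slide.

```r
# Frequencies of the reported sex labels, as listed on the slide.
sex_counts <- c(F = 95, female = 2036, Female = 51,
                M = 77, male = 1240, Male = 141)

# Hypothetical helper: map free-text labels to "female", "male" or NA.
# Ambiguous entries such as "mixed" or "unknown" are deliberately left as NA.
normalize_sex <- function(x) {
  x <- tolower(trimws(x))
  ifelse(x %in% c("f", "female"), "female",
         ifelse(x %in% c("m", "male"), "male", NA_character_))
}

# Sum the frequencies within each harmonized label.
harmonized <- tapply(sex_counts, normalize_sex(names(sex_counts)), sum)
print(harmonized)  # female: 2182, male: 1458 (total 3640)
```

    Labels like “pooled male and female” or “DK” cannot be resolved this way, which is part of the motivation for predicting sex from the expression data itself.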

  25. SRA phenotype information is far from complete
          SubjectID  Sex     Tissue  Race  Age
    6620  NA         female  liver   NA    NA
    6621  NA         female  liver   NA    NA
    6622  NA         female  liver   NA    NA
    6623  NA         female  liver   NA    NA
    6624  NA         female  liver   NA    NA
    6625  NA         male    liver   NA    NA
    6626  NA         male    liver   NA    NA
    6627  NA         male    liver   NA    NA
    6628  NA         male    liver   NA    NA
    slide adapted from Shannon Ellis

  26. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive), N=49,848
    TCGA (The Cancer Genome Atlas), N=11,284
    GTEx (Genotype Tissue Expression Project), N=9,662
    slide adapted from Shannon Ellis

  27. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive), N=49,848
    GTEx (Genotype Tissue Expression Project), N=9,662
    divide samples
    training set: build and optimize phenotype predictor
    test set: test accuracy of predictor
    TCGA (The Cancer Genome Atlas), N=11,284
    slide adapted from Shannon Ellis

  28. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive), N=49,848
    GTEx (Genotype Tissue Expression Project), N=9,662
    divide samples
    training set: build and optimize phenotype predictor
    test set: test accuracy of predictor
    predict phenotypes across samples in TCGA
    TCGA (The Cancer Genome Atlas), N=11,284
    slide adapted from Shannon Ellis

  29. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive), N=49,848
    GTEx (Genotype Tissue Expression Project), N=9,662
    divide samples
    training set: build and optimize phenotype predictor
    test set: test accuracy of predictor
    predict phenotypes across samples in TCGA
    predict phenotypes across SRA samples
    TCGA (The Cancer Genome Atlas), N=11,284
    slide adapted from Shannon Ellis
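
    The divide, train and test steps above can be sketched on toy data. This is not the predictor from the talk: the data are simulated, the single "region" and the midpoint-threshold classifier are assumptions made for illustration, and every name below is hypothetical.

```r
set.seed(20180101)

# Simulated data (an assumption): 100 samples with a known sex and the
# coverage of one hypothetical Y-linked region, higher in males.
n <- 100
sex <- rep(c("female", "male"), each = n / 2)
region_cov <- ifelse(sex == "male", 5, 0) + rnorm(n)

# Divide samples: half for training, the rest held out for testing.
train_idx <- sample(seq_len(n), size = n / 2)
test_idx <- setdiff(seq_len(n), train_idx)

# Build the predictor on the training set only: a threshold halfway
# between the mean coverage of the two classes.
class_means <- tapply(region_cov[train_idx], sex[train_idx], mean)
threshold <- mean(class_means)
predict_sex <- function(x) ifelse(x > threshold, "male", "female")

# Test the accuracy of the predictor on the held-out test set.
accuracy <- mean(predict_sex(region_cov[test_idx]) == sex[test_idx])
print(accuracy)
```

    Keeping the test set out of the threshold estimate is what makes the reported accuracy an honest estimate of how the predictor would behave on new samples, such as those from another data set.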

  30. select_regions()
    Output:
    Coverage matrix (data.frame)
    Region information
    slide adapted from Shannon Ellis

  31. Sex prediction is accurate across data sets
    Number of Regions:      20     20     20      20
    Number of Samples (N):  4,769  4,769  11,245  3,640
    Accuracy:               99.8%  99.6%  99.4%   88.5%
    slide adapted from Shannon Ellis

  32. Sex prediction is accurate across data sets
    Number of Regions:      20     20     20      20
    Number of Samples (N):  4,769  4,769  11,245  3,640
    Accuracy:               99.8%  99.6%  99.4%   88.5%
    slide adapted from Shannon Ellis

  33. http://www.rna-seqblog.com/
    Can we use expression data to predict tissue?
    slide adapted from Shannon Ellis

  34. Tissue prediction is accurate across data sets
    Number of Regions:      589    589    589    589
    Number of Samples (N):  4,769  4,769  7,193  8,951
    Accuracy:               97.3%  96.5%  71.9%  50.6%
    slide adapted from Shannon Ellis

  35. Prediction is more accurate in healthy tissue
    Number of Regions:      589    589    589    589    589
    Number of Samples (N):  4,769  4,769  613    6,579  8,951
    Accuracy:               97.3%  96.5%  91.0%  70.2%  50.6%
    slide adapted from Shannon Ellis

  36. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    > ## Append the predicted phenotypes to the sample metadata
    > rse_with_pred <- add_predictions(rse_gene)
    https://github.com/leekgroup/recount-analyses/

  37. expression data for ~70,000 human samples
    samples
    phenotypes: ?
    GTEx N=9,962
    TCGA N=11,284
    SRA N=49,848
    samples
    expression estimates: gene, exon, junctions, ERs
    Answer meaningful questions about human biology and expression
    sex  tissue
    M    Blood
    F    Heart
    F    Liver
    slide adapted from Shannon Ellis

  38. slide adapted from Kai Kammers
    Can combine with genotype data to identify eQTLs

  39. biorxiv.org/content/early/2018/01/12/247346

  40. expression data for ~70,000 human samples
    samples
    phenotypes: ?
    GTEx N=9,962
    TCGA N=11,284
    SRA N=49,848
    samples
    expression estimates: gene, exon, junctions, ERs
    Answer meaningful questions about human biology and expression
    sex  tissue
    M    Blood
    F    Heart
    F    Liver
    slide adapted from Shannon Ellis

  41. Sex: Female, Male
    Age/Development: Fetus, Child, Adolescent, Adult
    Race/Ethnicity: Asian, Black, Hispanic, White
    Tissue Site 1: Cerebral cortex, Hippocampus, Brainstem, Cerebellum
    Tissue Site 2: Frontal lobe, Temporal lobe, Midbrain, Basal ganglia
    Tissue Site 3: Dorsolateral prefrontal cortex, Superior temporal gyrus, Substantia nigra, Caudate
    Hemisphere: Left, Right
    Brodmann Area: 1-52
    Disease Status: Disease, Neurological control
    Disease: Brain tumor, Alzheimer’s disease, Parkinson’s disease, Bipolar disorder
    Tumor Type: Glioblastoma, Astrocytoma, Oligodendroglioma, Ependymoma
    Clinical Stage 1: Grade I, Grade II, Grade III, Grade IV
    Clinical Stage 2: Primary, Secondary, Recurrent
    Viability: Postmortem, Biopsy
    Preparation: Frozen, Thawed

  42. Ashkaun Razmara, in prep.

  43. Code Example:
    research.libd.org/recount-brain/example_PMI/example_PMI.html
    research.libd.org/recount-brain/example_PMI/example_PMI.Rmd
    Replicates part of the GTEx PMI paper by Ferreira et al.
    doi.org/10.1038/s41467-017-02772-x
    Ashkaun Razmara, in prep.

  44. The recount2 team
    Hopkins
    Kai Kammers
    Shannon Ellis
    Margaret Taub
    Kasper Hansen
    Jeff Leek
    Ben Langmead
    OHSU
    Abhinav Nellore
    LIBD
    Leonardo Collado-Torres
    Andrew Jaffe
    recount-brain
    Ashkaun Razmara
    Funding and hosting
    NIH R01 GM105705
    NIH 1R21MH109956
    CONACyT 351535
    AWS in Education
    Seven Bridges
    IDIES SciServer

  45. expression data for ~70,000 human samples
    (Multiple) Postdoc positions available to
    - develop methods to process and analyze data from recount2
    - use recount2 to address specific biological questions
    This project involves the Hansen, Leek, Langmead and Battle labs at JHU
    Contact: Kasper D. Hansen ([email protected] | www.hansenlab.org)
