ICSA 2017

Reproducible RNA-seq analysis with recount2

Leonardo Collado-Torres

June 27, 2017

Transcript

  1. Reproducible RNA-seq analysis with recount2
    Leonardo Collado-Torres
    @fellgernon
    #ICSA2017

  2. Reference genome
    Reads

  3. (no text)

  4. GTEx TCGA
    slide adapted from Shannon Ellis

  5. SRA

  6. Slide adapted from Ben Langmead

  7. http://rail.bio/
    Slide adapted from Ben Langmead

  8. http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

  9. Obstacle: our research moves (spot) markets
    Spike in market price due to preprocessing job flows
    slide adapted from Jeff Leek

  10. Obstacle: our research moves (spot) markets
    Weekday market volatility
    Weekend EC2 inactivity
    slide adapted from Jeff Leek

  11. https://jhubiostatistics.shinyapps.io/recount/

  12. exon 1 exon 2
    exon 3

  13. disjoint exon 1
    disjoint exon 2 disjoint exon 3
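The step from overlapping exons (slide 12) to disjoint exons (slide 13) can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the code behind recount2: the `disjoin` helper and the exon coordinates are hypothetical, and the slides do not say how the disjoint exons were actually computed.

```python
def disjoin(intervals):
    """Split closed integer intervals into non-overlapping pieces.

    Each returned piece is covered by at least one input interval and
    never crosses the start or end of any input interval.
    """
    # Breakpoints: every start, plus the position just past every end.
    breaks = sorted({s for s, _ in intervals} | {e + 1 for _, e in intervals})
    pieces = []
    for left, right in zip(breaks, breaks[1:]):
        piece = (left, right - 1)
        # Keep the piece only if some input interval fully covers it.
        if any(s <= piece[0] and piece[1] <= e for s, e in intervals):
            pieces.append(piece)
    return pieces


# Hypothetical exon coordinates: the first two exons overlap.
exons = [(1, 10), (5, 15), (20, 30)]
disjoint_exons = disjoin(exons)
print(disjoint_exons)
```

With disjoint pieces, each base belongs to exactly one piece, so coverage summed per piece does not count a base twice.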

  14. (no text)

  15. (no text)

  16. (no text)

  17. [Figure: per-base coverage along the genome (x-axis: Genome, y-axis: Coverage)]
    Coverage by position: 3 3 5 4 4 2 2 3 1 3 3 1 4 4 2 1
    AUC = area under coverage = 45
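The arithmetic on this slide can be checked directly: the AUC is the sum of the per-base coverage values. A minimal Python sketch follows; the coverage values are the ones on the slide, while `region_sum` and the region boundaries used with it are hypothetical and only illustrate how a region's share of the AUC could be computed.

```python
# Per-base coverage values as printed on the slide (16 positions).
coverage = [3, 3, 5, 4, 4, 2, 2, 3, 1, 3, 3, 1, 4, 4, 2, 1]

# Area under coverage: the sum of coverage over all positions.
auc = sum(coverage)


def region_sum(cov, start, end):
    """Sum coverage over a 1-based, inclusive region (hypothetical helper)."""
    return sum(cov[start - 1:end])


# Hypothetical region spanning positions 3 to 5.
example_region = region_sum(coverage, 3, 5)
print(auc, example_region)
```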

  18. (no text)

  19. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- ...
    https://github.com/leekgroup/recount-analyses/

  20. slide adapted from Jeff Leek

  21. > library('recount')
    > download_study('SRP029880', type = 'rse-gene')
    > download_study('SRP059039', type = 'rse-gene')
    > load(file.path('SRP029880', 'rse_gene.Rdata'))
    > load(file.path('SRP059039', 'rse_gene.Rdata'))
    > mdat <- ...
    https://github.com/leekgroup/recount-analyses/

  22. Collado-Torres et al., Nat. Biotech, 2017

  23. (no text)

  24. (no text)

  25. (no text)

  26. [Figure: gene model with junctions, coverage and reads]
    Tracks: Coverage, Reads, Gene
    Junctions: jx 1, jx 2, jx 3, jx 4, jx 5, jx 6
    Exons: exon 1, exon 2, exon 3, exon 4
    Isoforms: Isoform 1, Isoform 2, Potential isoform 3
    Expressed region 1: potential exon 5

  27. Collado-Torres et al, NAR, 2017

  28. Postmortem Human Brain Samples
    Jaffe et al, Nat. Neuroscience, 2015
    Discovery data: Fetal, Infant, Child, Teen, Adult, 50+ (6 / group, N = 36)
    Replication data: Fetal, Infant, Child, Teen, Adult, 50+ (6 / group, N = 36)

  29. Jaffe et al, Nat. Neuroscience, 2015

  30. DERs outside of “known genes”
    Jaffe et al, Nat. Neuroscience, 2015

  31. BrainSpan data
    Total N samples: 487
    CBC: 28, MD: 24, STR: 28, AMY: 31, HIP: 32, DFC: 34, VFC: 30, MFC: 32,
    OFC: 30, M1C: 25, S1C: 26, IPC: 33, A1C: 30, STC: 35, ITC: 33, V1C: 33
    Coverage Data from BrainSpan:
    http://download.alleninstitute.org/brainspan/MRF_BigWig_Gencode_v10/

  32. BrainSpan data
    Jaffe et al, Nat. Neuroscience, 2015

  33. [Figure; axis labels: Percent Expressed, Mean reads across GTEx]

  34. > library('recount')
    > regions_list <- ...
          regs <- ...
          return(regs)
      }, BPPARAM = bp)
    > names(regions_list) <- ...
    > regions <- ...
    https://github.com/leekgroup/recount-analyses/

  35. > library('recount')
    > covMat <- ...
          coverageMatrix <- ... regions_list[[chr]])
          return(coverageMatrix)
      }, BPPARAM = bp)
    > covMat <- ...
    https://github.com/leekgroup/recount-analyses/

  36. expression data for ~70,000 human samples
    GTEx N=9,662; TCGA N=11,284; SRA N=49,848
    samples x expression estimates: gene, exon, junctions, ERs
    slide adapted from Shannon Ellis

  37. expression data for ~70,000 human samples
    GTEx N=9,662; TCGA N=11,284; SRA N=49,848
    samples x expression estimates: gene, exon, junctions, ERs
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis

  38. expression data for ~70,000 human samples
    GTEx N=9,662; TCGA N=11,284; SRA N=49,848
    samples x expression estimates: gene, exon, junctions, ERs
    samples x phenotypes: ?
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis

  39. Even when information is provided, it’s not always clear…
    sra_meta$Sex
    Category  Frequency
    F                95
    female         2036
    Female           51
    M                77
    male           1240
    Male            141
    Total          3640
    Other values: “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”,
    “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”,
    “N/A”, “Not available”, “not applicable”, “not collected”,
    “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”
    slide adapted from Shannon Ellis
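One way to see how much of the reported sex metadata is usable is to collapse the spelling variants in the table and leave anything ambiguous unset. The sketch below is a minimal Python illustration, assuming only the frequencies shown on the slide; the `CANONICAL` mapping and `normalize_sex` helper are hypothetical and are not the approach taken in the talk, which predicts sex from expression data instead.

```python
from collections import Counter

# Frequencies of the reported labels, as shown on the slide.
reported = {"F": 95, "female": 2036, "Female": 51,
            "M": 77, "male": 1240, "Male": 141}

# Hypothetical mapping from cleaned-up labels to canonical values.
CANONICAL = {"f": "female", "female": "female", "m": "male", "male": "male"}


def normalize_sex(label):
    """Return 'female', 'male', or None when the label is ambiguous."""
    return CANONICAL.get(label.strip().lower())


totals = Counter()
for label, n in reported.items():
    totals[normalize_sex(label)] += n

print(dict(totals))
```

Values such as “pooled male and female” or “DK” map to `None` here on purpose: guessing a value for them would add noise rather than information.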

  40. SRA phenotype information is far from complete
          SubjectID  Sex     Tissue  Race  Age
    6620  NA         female  liver   NA    NA
    6621  NA         female  liver   NA    NA
    6622  NA         female  liver   NA    NA
    6623  NA         female  liver   NA    NA
    6624  NA         female  liver   NA    NA
    6625  NA         male    liver   NA    NA
    6626  NA         male    liver   NA    NA
    6627  NA         male    liver   NA    NA
    6628  NA         male    liver   NA    NA
    6629  NA         male    liver   NA    NA
    6630  NA         male    liver   NA    NA
    6631  NA         NA      blood   NA    NA
    6632  NA         NA      blood   NA    NA
    6633  NA         NA      blood   NA    NA
    6634  NA         NA      blood   NA    NA
    6635  NA         NA      blood   NA    NA
    6636  NA         NA      blood   NA    NA
    slide adapted from Shannon Ellis

  41. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive): N=49,848
    TCGA (The Cancer Genome Atlas): N=11,284
    GTEx (Genotype Tissue Expression Project): N=9,662
    slide adapted from Shannon Ellis

  42. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive): N=49,848
    GTEx (Genotype Tissue Expression Project): N=9,662
      divide samples
      training set: build and optimize phenotype predictor
      test set: test accuracy of predictor
    TCGA (The Cancer Genome Atlas): N=11,284
    slide adapted from Shannon Ellis

  43. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive): N=49,848
    GTEx (Genotype Tissue Expression Project): N=9,662
      divide samples
      training set: build and optimize phenotype predictor
      test set: test accuracy of predictor
    TCGA (The Cancer Genome Atlas): N=11,284
      predict phenotypes across samples in TCGA
    slide adapted from Shannon Ellis

  44. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive): N=49,848
      predict phenotypes across SRA samples
    GTEx (Genotype Tissue Expression Project): N=9,662
      divide samples
      training set: build and optimize phenotype predictor
      test set: test accuracy of predictor
    TCGA (The Cancer Genome Atlas): N=11,284
      predict phenotypes across samples in TCGA
    slide adapted from Shannon Ellis
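The workflow built up on slides 42 to 44 (divide samples, build the predictor on a training set, test its accuracy on a test set) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the slides do not say how the samples were divided, so the random half split, the seed, and the `split_samples` and `accuracy` helpers are all hypothetical.

```python
import random


def split_samples(sample_ids, train_fraction=0.5, seed=0):
    """Randomly divide sample IDs into a training set and a test set."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def accuracy(predicted, reported):
    """Fraction of predictions matching the reported phenotype.

    Samples whose reported value is None (missing) are skipped.
    """
    pairs = [(p, r) for p, r in zip(predicted, reported) if r is not None]
    if not pairs:
        raise ValueError("no samples with a reported phenotype")
    return sum(p == r for p, r in pairs) / len(pairs)


# Hypothetical sample IDs standing in for samples with known phenotypes.
sample_ids = [f"sample_{i}" for i in range(10)]
train_ids, test_ids = split_samples(sample_ids)
print(len(train_ids), len(test_ids))
```

Keeping the test set out of predictor building is what makes the reported accuracy a fair estimate before the predictor is applied to other data sets.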

  45. select_regions()
    Output:
    Coverage matrix (data.frame)
    Region information (GRanges)
    slide adapted from Shannon Ellis

  46. Sex prediction is accurate across data sets
    Number of Regions        20     20     20      20
    Number of Samples (N)    4,769  4,769  11,245  3,640
    Accuracy                 99.8%  99.6%  99.4%   88.5%
    slide adapted from Shannon Ellis

  47. Sex prediction is accurate across data sets
    Number of Regions        20     20     20      20
    Number of Samples (N)    4,769  4,769  11,245  3,640
    Accuracy                 99.8%  99.6%  99.4%   88.5%
    slide adapted from Shannon Ellis

  48. http://www.rna-seqblog.com/
    Can we use expression data to predict tissue?
    slide adapted from Shannon Ellis

  49. Tissue prediction is accurate across data sets
    Number of Regions        589    589    589    589
    Number of Samples (N)    4,769  4,769  7,193  8,951
    Accuracy                 97.3%  96.5%  71.9%  50.6%
    slide adapted from Shannon Ellis

  50. Prediction is more accurate in healthy tissue
    Number of Regions        589    589    589    589    589
    Number of Samples (N)    4,769  4,769  613    6,579  8,951
    Accuracy                 97.3%  96.5%  91.0%  70.2%  50.6%
    slide adapted from Shannon Ellis

  51. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- ...
    > rse_with_pred <- ...
    https://github.com/leekgroup/recount-analyses/

  52. expression data for ~70,000 human samples
    GTEx N=9,662; TCGA N=11,284; SRA N=49,848
    samples x expression estimates: gene, exon, junctions, ERs
    samples x phenotypes: ?
    sex, tissue: M, Blood; F, Heart; F, Liver
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis

  53. (no text)

  54. bioconductor.org/packages/derfinder
    bioconductor.org/packages/recount
    > biocLite("derfinder")
    > biocLite("recount")
    http://rail.bio
    $ ./install-rail-rna-V

  55. https://github.com/leekgroup/recount-contributions

  56. Collaborators
    The Leek Group: Jeff Leek, Shannon Ellis
    Hopkins: Ben Langmead, Chris Wilks, Kai Kammers, Kasper Hansen, Margaret Taub
    OHSU: Abhinav Nellore
    LIBD: Andrew Jaffe, Emily Burke, Stephen Semick, Carrie Wright, Amanda Price, Nina Rajpurohit
    Funding: NIH R01 GM105705, NIH 1R21MH109956, CONACyT 351535, AWS in Education, Seven Bridges, IDIES SciServer
