Bioconductor 2017 recount workshop

Introductory presentation on recount for the workshop at http://research.libd.org/recountWorkshop/

Leonardo Collado-Torres

July 27, 2017

Transcript

  1. Reproducible RNA-seq analysis with recount2
    Leonardo Collado-Torres
    @fellgernon
    #bioc2017

  2. Reference genome
    Reads

  4. GTEx TCGA
    slide adapted from Shannon Ellis

  5. SRA

  6. Slide adapted from Ben Langmead

  7. http://rail.bio/
    Slide adapted from Ben Langmead

  8. http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

  9. https://jhubiostatistics.shinyapps.io/recount/

  10. [Diagram of one gene: reads and their coverage; exon-exon junctions jx 1 to jx 6; isoform 1, isoform 2 and potential isoform 3; exons 1 to 4; expressed region 1 (potential exon 5)]

  12. exon 1 exon 2
    exon 3

  13. disjoint exon 1
    disjoint exon 2 disjoint exon 3
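
Disjoint exons are the non-overlapping pieces obtained by cutting overlapping exon annotations at every exon boundary, so that no base belongs to more than one piece. The base R sketch below illustrates the idea under stated assumptions: the coordinates and the `disjoin_exons()` helper are hypothetical and do not come from the slides. For real annotations, the `disjoin()` function in Bioconductor's GenomicRanges package performs this operation on GRanges objects.

```r
# Hypothetical exon coordinates (1-based, inclusive); not taken from the slides
exons <- data.frame(
    start = c(100, 100, 300),
    end   = c(200, 250, 400)
)

# Hypothetical helper: split exons into disjoint pieces
disjoin_exons <- function(ex) {
    # Every exon start, and every position just past an exon end, is a breakpoint
    breaks <- sort(unique(c(ex$start, ex$end + 1)))
    pieces <- data.frame(
        start = breaks[-length(breaks)],
        end   = breaks[-1] - 1
    )
    # Keep only the pieces that lie entirely inside at least one exon
    covered <- vapply(seq_len(nrow(pieces)), function(i) {
        any(ex$start <= pieces$start[i] & ex$end >= pieces$end[i])
    }, logical(1))
    pieces[covered, , drop = FALSE]
}

disjoint <- disjoin_exons(exons)
print(disjoint)
```

With these toy coordinates, the two exons that share a start are split into a common piece and an extension, and the gap between annotated exons is dropped.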

  17. [Figure: base-pair coverage along the genome]
    Coverage at 16 consecutive bases: 3 3 5 4 4 2 2 3 1 3 3 1 4 4 2 1
    AUC = area under coverage = 45
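
The AUC on this slide is the sum of the per-base coverage values. A minimal base R check of that arithmetic follows; the feature spanning bases 3 to 5 is a hypothetical example added for illustration.

```r
# Per-base coverage values as listed on the slide
coverage <- c(3, 3, 5, 4, 4, 2, 2, 3, 1, 3, 3, 1, 4, 4, 2, 1)

# Area under coverage: the sum of the per-base coverage
auc <- sum(coverage)
print(auc)
#> [1] 45

# Coverage-based count for a hypothetical feature spanning bases 3 to 5
feature_count <- sum(coverage[3:5])
print(feature_count)
#> [1] 13
```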

  19. > library('recount')
    > download_study('ERP001942', type='rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse
    https://github.com/leekgroup/recount-analyses/
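
The last command on this slide is cut off in the transcript. One common step after loading a recount object is to scale its coverage-based counts to a common library size, which the recount package supports through `scale_counts()`. The base R sketch below only illustrates the arithmetic of AUC-based scaling, assuming scaled count = coverage count × target size / AUC. All numbers, gene names and sample names are made up, and this is not the package's implementation.

```r
# Hypothetical coverage-based counts (rows: genes, columns: samples)
coverage_counts <- matrix(
    c(1000, 4000,
      2500,  500),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("geneA", "geneB"), c("sample1", "sample2"))
)

# Hypothetical total coverage (AUC) per sample and target library size
auc <- c(sample1 = 2e6, sample2 = 4e6)
target_size <- 4e7

# sweep() multiplies each column by that sample's scaling factor
scaled <- round(sweep(coverage_counts, 2, target_size / auc, `*`))
print(scaled)
```

After scaling, samples with different sequencing depth are on a comparable scale, which is why the AUC is stored alongside the counts.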

  20. slide adapted from Jeff Leek

  21. > library('recount')
    > download_study('SRP029880', type='rse-gene')
    > download_study('SRP059039', type='rse-gene')
    > load(file.path('SRP029880', 'rse_gene.Rdata'))
    > load(file.path('SRP059039', 'rse_gene.Rdata'))
    > mdat
    https://github.com/leekgroup/recount-analyses/

  22. Collado-Torres et al, Nat. Biotech, 2017

  26. [Diagram of one gene: reads and their coverage; exon-exon junctions jx 1 to jx 6; isoform 1, isoform 2 and potential isoform 3; exons 1 to 4; expressed region 1 (potential exon 5)]

  27. Collado-Torres et al, NAR, 2017

  28. Postmortem Human Brain Samples
    Discovery data: Fetal, Infant, Child, Teen, Adult, 50+ (6 / group, N = 36)
    Replication data: Fetal, Infant, Child, Teen, Adult, 50+ (6 / group, N = 36)
    Jaffe et al, Nat. Neuroscience, 2015

  29. Jaffe et al, Nat. Neuroscience, 2015

  30. BrainSpan data
    Jaffe et al, Nat. Neuroscience, 2015

  31. expression data for ~70,000 human samples
    GTEx (N=9,662), TCGA (N=11,284), SRA (N=49,848)
    Expression estimates per sample: gene, exon, junctions, ERs
    slide adapted from Shannon Ellis

  32. expression data for ~70,000 human samples
    GTEx (N=9,662), TCGA (N=11,284), SRA (N=49,848)
    Expression estimates per sample: gene, exon, junctions, ERs
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis

  33. expression data for ~70,000 human samples
    GTEx (N=9,662), TCGA (N=11,284), SRA (N=49,848)
    Expression estimates per sample: gene, exon, junctions, ERs
    Phenotypes per sample: ?
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis

  34. Even when information is provided, it’s not always clear…
    sra_meta$Sex
    Category   Frequency
    F                 95
    female          2036
    Female            51
    M                 77
    male            1240
    Male             141
    Total           3640
    Other reported values: “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”
    slide adapted from Shannon Ellis
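
The frequency table on this slide shows the same two categories spelled six different ways. A minimal base R sketch of collapsing such labels follows; the `harmonize_sex()` helper and its mapping rules are hypothetical, and anything that is not clearly one of the two labels is left as `NA` rather than guessed.

```r
# Hypothetical helper: map free-text sex labels to "female", "male" or NA
harmonize_sex <- function(x) {
    x <- tolower(trimws(x))
    out <- rep(NA_character_, length(x))
    out[x %in% c("f", "female")] <- "female"
    out[x %in% c("m", "male")] <- "male"
    out
}

# Frequencies as reported on the slide
freq <- c(F = 95, female = 2036, Female = 51, M = 77, male = 1240, Male = 141)

# Sum the frequencies within each harmonized label
collapsed <- vapply(split(freq, harmonize_sex(names(freq))), sum, numeric(1))
print(collapsed)

# Ambiguous entries such as these stay missing
print(harmonize_sex(c("mixed", "DK", "pooled male and female")))
```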

  35. SRA phenotype information is far from complete
          SubjectID  Sex     Tissue  Race  Age
    6620  NA         female  liver   NA    NA
    6621  NA         female  liver   NA    NA
    6622  NA         female  liver   NA    NA
    6623  NA         female  liver   NA    NA
    6624  NA         female  liver   NA    NA
    6625  NA         male    liver   NA    NA
    6626  NA         male    liver   NA    NA
    6627  NA         male    liver   NA    NA
    6628  NA         male    liver   NA    NA
    6629  NA         male    liver   NA    NA
    6630  NA         male    liver   NA    NA
    6631  NA         NA      blood   NA    NA
    6632  NA         NA      blood   NA    NA
    6633  NA         NA      blood   NA    NA
    6634  NA         NA      blood   NA    NA
    6635  NA         NA      blood   NA    NA
    6636  NA         NA      blood   NA    NA
    slide adapted from Shannon Ellis

  36. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive), N=49,848
    TCGA (The Cancer Genome Atlas), N=11,284
    GTEx (Genotype Tissue Expression Project), N=9,662
    slide adapted from Shannon Ellis

  37. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive), N=49,848
    GTEx (Genotype Tissue Expression Project), N=9,662: divide samples
      training set: build and optimize phenotype predictor
      test set: test accuracy of predictor
    TCGA (The Cancer Genome Atlas), N=11,284
    slide adapted from Shannon Ellis

  38. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive), N=49,848
    GTEx (Genotype Tissue Expression Project), N=9,662: divide samples
      training set: build and optimize phenotype predictor
      test set: test accuracy of predictor
    TCGA (The Cancer Genome Atlas), N=11,284: predict phenotypes across samples in TCGA
    slide adapted from Shannon Ellis

  39. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive), N=49,848: predict phenotypes across SRA samples
    GTEx (Genotype Tissue Expression Project), N=9,662: divide samples
      training set: build and optimize phenotype predictor
      test set: test accuracy of predictor
    TCGA (The Cancer Genome Atlas), N=11,284: predict phenotypes across samples in TCGA
    slide adapted from Shannon Ellis
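
The divide, build and test steps on this slide can be sketched on simulated data. The base R example below is only an illustration under stated assumptions: the data are simulated, the single "Y-chromosome region" is hypothetical, and the threshold rule is a stand-in chosen for brevity, not the predictor used in the work presented here.

```r
set.seed(20170727)

# Simulated samples: true sex and coverage at one hypothetical Y-chromosome region
n <- 200
sex <- sample(c("female", "male"), n, replace = TRUE)
y_cov <- ifelse(
    sex == "male",
    rnorm(n, mean = 10, sd = 1),    # males: clearly expressed
    rnorm(n, mean = 0.5, sd = 0.5)  # females: close to zero
)

# Divide samples into a training set and a test set of equal size
train_idx <- sample(seq_len(n), size = n / 2)
train <- data.frame(sex = sex[train_idx], y_cov = y_cov[train_idx])
test <- data.frame(sex = sex[-train_idx], y_cov = y_cov[-train_idx])

# Build the predictor on the training set only:
# a cutoff halfway between the two group means
cutoff <- mean(tapply(train$y_cov, train$sex, mean))
predict_sex <- function(cov) ifelse(cov > cutoff, "male", "female")

# Test accuracy on the held-out test set
accuracy <- mean(predict_sex(test$y_cov) == test$sex)
print(accuracy)
```

Keeping the test set out of the cutoff calculation is what makes the reported accuracy an honest estimate before the predictor is applied to other data sets.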

  40. select_regions()
    Output:
    Coverage matrix (data.frame)
    Region information (GRanges)
    slide adapted from Shannon Ellis

  41. Sex prediction is accurate across data sets
    Number of Regions       20     20     20      20
    Number of Samples (N)   4,769  4,769  11,245  3,640
    Accuracy                99.8%  99.6%  99.4%   88.5%
    slide adapted from Shannon Ellis

  43. Can we use expression data to predict tissue?
    http://www.rna-seqblog.com/
    slide adapted from Shannon Ellis

  44. Tissue prediction is accurate across data sets
    Number of Regions       589    589    589    589
    Number of Samples (N)   4,769  4,769  7,193  8,951
    Accuracy                97.3%  96.5%  71.9%  50.6%
    slide adapted from Shannon Ellis

  45. Prediction is more accurate in healthy tissue
    Number of Regions       589    589    589    589    589
    Number of Samples (N)   4,769  4,769  613    6,579  8,951
    Accuracy                97.3%  96.5%  91.0%  70.2%  50.6%
    slide adapted from Shannon Ellis

  46. > library('recount')
    > download_study('ERP001942', type='rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse
    > rse_with_pred
    https://github.com/leekgroup/recount-analyses/

  47. expression data for ~70,000 human samples
    GTEx (N=9,662), TCGA (N=11,284), SRA (N=49,848)
    Expression estimates per sample: gene, exon, junctions, ERs
    Phenotypes per sample:
      sex  tissue
      M    Blood
      F    Heart
      F    Liver
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis

  49. Collaborators
    The Leek Group: Jeff Leek, Shannon Ellis
    Hopkins: Ben Langmead, Chris Wilks, Kai Kammers, Kasper Hansen, Margaret Taub
    OHSU: Abhinav Nellore
    LIBD: Andrew Jaffe, Emily Burke, Stephen Semick, Carrie Wright, Amanda Price, Nina Rajpurohit
    Funding: NIH R01 GM105705, NIH 1R21MH109956, CONACyT 351535, AWS in Education, Seven Bridges, IDIES SciServer

  50. http://research.libd.org/recountWorkshop/
    help(package = recountWorkshop)
    file.edit(
        system.file('doc/recount-workshop.Rmd', package = 'recountWorkshop')
    )
    Leonardo Collado-Torres
    @fellgernon
    #bioc2017

  51. expression data for ~70,000 human samples
    (Multiple) postdoc positions available to
    - develop methods to process and analyze data from recount2
    - use recount2 to address specific biological questions
    This project involves the Hansen, Leek, Langmead and Battle labs at JHU
    Contact: Kasper D. Hansen (www.hansenlab.org)