SOBP 2017

RNA-seq samples beyond the known transcriptome with derfinder available via recount

Transcript

  1. RNA-seq samples beyond the known
    transcriptome with derfinder
    available via recount
    Leonardo Collado-Torres
    @fellgernon
    #SOBP2017

  2. Genome
    Transcripts
    Reads
    slide adapted from Jeff Leek

  3. Genome
    slide adapted from Jeff Leek

  4. GTEx TCGA
    slide adapted from Shannon Ellis

  5. SRA

  6. Slide adapted from Ben Langmead

  7. http://rail.bio/
    Slide adapted from Ben Langmead

  8. http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

  9. Obstacle: our research moves (spot) markets
    Spike in market price due to preprocessing job flows
    slide adapted from Jeff Leek

  10. Obstacle: our research moves (spot) markets
    Weekday market volatility
    Weekend EC2
    inactivity
    slide adapted from Jeff Leek

  11. https://jhubiostatistics.shinyapps.io/recount/

  12. slide adapted from Andrew Jaffe

  13. slide adapted from Andrew Jaffe

  14. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse
    https://github.com/leekgroup/recount-analyses/

  15. slide adapted from Jeff Leek

  16. > library('recount')
    > download_study('SRP029880', type = 'rse-gene')
    > download_study('SRP059039', type = 'rse-gene')
    > load(file.path('SRP029880', 'rse_gene.Rdata'))
    > load(file.path('SRP059039', 'rse_gene.Rdata'))
    > mdat
    https://github.com/leekgroup/recount-analyses/

  17. Collado-Torres et al, Nat. Biotech, 2017

  18. slide adapted from Andrew Jaffe

  19. slide adapted from Andrew Jaffe

  20. coverage
    vector
    2 6 0 11 6
    Genome
    (DNA)
    RNA-Sequencing: Alignment using Rail-RNA
    Nellore et al. (2016) Bioinformatics
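
    A coverage vector counts, for each genome position, how many aligned reads overlap it. The sketch below is a minimal base R illustration of that idea, not Rail-RNA's or derfinder's implementation. The read coordinates and the `coverage_vector()` helper are invented for illustration, chosen so that the result matches the 2 6 0 11 6 vector on this slide.

```r
# Toy aligned reads as 1-based, inclusive [start, end] intervals on a
# 5-base genome. These coordinates are assumptions made for illustration.
reads <- data.frame(
  start = c(rep(1, 2), rep(2, 4), rep(4, 6), rep(4, 5)),
  end   = c(rep(2, 2), rep(2, 4), rep(5, 6), rep(4, 5))
)
genome_length <- 5L

# Hypothetical helper: add 1 to every position covered by each read.
coverage_vector <- function(starts, ends, genome_length) {
  coverage <- integer(genome_length)
  for (i in seq_along(starts)) {
    idx <- starts[i]:ends[i]
    coverage[idx] <- coverage[idx] + 1L
  }
  coverage
}

coverage <- coverage_vector(reads$start, reads$end, genome_length)
print(coverage)
```

    Position 3 has no overlapping reads, so its coverage is 0, while position 4 is covered by 11 of the 17 toy reads.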

  21. Collado-Torres et al, NAR, 2017

  22. Fetal Infant
    Child Teen
    Adult 50+
    6 / group, N = 36
    Discovery data
    Jaffe et al, Nat. Neuroscience, 2015
    Postmortem Human Brain Samples
    Fetal Infant
    Child Teen
    Adult 50+
    6 / group, N = 36
    Replication data

  23. Jaffe et al, Nat. Neuroscience, 2015

  24. DERs outside of “known genes”
    Jaffe et al, Nat. Neuroscience, 2015

  25. CBC: 28
    MD: 24
    STR: 28
    AMY: 31
    HIP: 32
    DFC: 34
    Total N samples: 487
    BrainSpan data
    Coverage Data from BrainSpan:
    http://download.alleninstitute.org/brainspan/MRF_BigWig_Gencode_v10/
    VFC: 30 MFC: 32 OFC: 30 M1C: 25
    S1C: 26 IPC: 33 A1C: 30 STC: 35 ITC: 33
    V1C: 33

  26. BrainSpan data
    Jaffe et al, Nat. Neuroscience, 2015

  27. Percent
    Expressed
    Mean reads across GTEx

  28. > library('recount')
    > regions_list regs return(regs)
    }, BPPARAM = bp)
    > names(regions_list)
    > regions
    https://github.com/leekgroup/recount-analyses/

  29. > library('recount')
    > covMat coverageMatrix regions_list[[chr]])
    return(coverageMatrix)
    }, BPPARAM = bp)
    > covMat
    https://github.com/leekgroup/recount-analyses/

  30. expression data for ~70,000 human samples
    GTEx
    N=9,662
    TCGA
    N=11,284
    SRA
    N=49,848
    samples
    expression
    estimates
    gene
    exon
    junctions
    ERs
    slide adapted from Shannon Ellis

  31. expression data for ~70,000 human samples
    Answer meaningful
    questions about
    human biology and
    expression
    GTEx
    N=9,662
    TCGA
    N=11,284
    SRA
    N=49,848
    samples
    expression
    estimates
    gene
    exon
    junctions
    ERs
    slide adapted from Shannon Ellis

  32. expression data for ~70,000 human samples
    samples
    phenotypes
    ?
    GTEx
    N=9,662
    TCGA
    N=11,284
    SRA
    N=49,848
    samples
    expression
    estimates
    gene
    exon
    junctions
    ERs
    Answer meaningful
    questions about
    human biology and
    expression
    slide adapted from Shannon Ellis

  33. Category Frequency
    F 95
    female 2036
    Female 51
    M 77
    male 1240
    Male 141
    Total 3640
    Even when information is provided, it’s not always clear…
    sra_meta$Sex
    “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”,
    “DK”, “male and female”, “Male (note: ….)”,
    “missing”, “mixed”, “mixture”, “N/A”,
    “Not available”, “not applicable”, “not collected”,
    “not determined”, “pooled male and female”,
    “U”, “unknown”, “Unknown”
    slide adapted from Shannon Ellis
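
    The frequency table above shows the same two categories reported under six different spellings. As a minimal sketch of how such labels could be harmonized before analysis, the base R code below collapses them and re-tallies the counts. The `harmonize_sex()` helper and its matching rules are assumptions made for illustration and are not part of recount; ambiguous free-text entries are deliberately left as `NA` rather than guessed.

```r
# Counts copied from the frequency table on this slide.
sex_counts <- c(F = 95, female = 2036, Female = 51,
                M = 77, male = 1240, Male = 141)

# Hypothetical helper: map a free-text label to "female", "male", or NA.
harmonize_sex <- function(label) {
  key <- tolower(trimws(label))
  out <- rep(NA_character_, length(key))
  out[key %in% c("f", "female")] <- "female"
  out[key %in% c("m", "male")] <- "male"
  out
}

# Re-tally the counts under the harmonized labels.
harmonized <- harmonize_sex(names(sex_counts))
totals <- vapply(split(sex_counts, harmonized), sum, numeric(1))
print(totals)

# Ambiguous entries stay NA instead of being assigned a sex.
print(harmonize_sex(c("mixed", "pooled male and female", "Unknown")))
```

    The harmonized totals still add up to the 3,640 samples in the table, so no sample is dropped by the relabeling.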

  34. SRA phenotype information is far from complete
    SubjectID Sex Tissue Race Age
    6620 NA female liver NA NA
    6621 NA female liver NA NA
    6622 NA female liver NA NA
    6623 NA female liver NA NA
    6624 NA female liver NA NA
    6625 NA male liver NA NA
    6626 NA male liver NA NA
    6627 NA male liver NA NA
    6628 NA male liver NA NA
    6629 NA male liver NA NA
    6630 NA male liver NA NA
    6631 NA NA blood NA NA
    6632 NA NA blood NA NA
    6633 NA NA blood NA NA
    6634 NA NA blood NA NA
    6635 NA NA blood NA NA
    6636 NA NA blood NA NA
    slide adapted from Shannon Ellis

  35. slide adapted from Jeff Leek

  36. Goal:
    to accurately
    predict critical
    phenotype
    information for
    all samples in
    recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA
    Sequence Read Archive
    N=49,848
    TCGA
    The Cancer Genome Atlas
    N=11,284
    GTEx
    Genotype Tissue Expression Project
    N=9,662
    slide adapted from Shannon Ellis

  37. Goal:
    to accurately
    predict critical
    phenotype
    information for
    all samples in
    recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA
    Sequence Read Archive
    N=49,848
    GTEx
    Genotype Tissue Expression Project
    N=9,662
    divide
    samples
    build and
    optimize
    phenotype
    predictor
    training
    set
    test
    accuracy
    of
    predictor
    test set
    TCGA
    The Cancer Genome Atlas
    N=11,284
    slide adapted from Shannon Ellis
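
    The workflow on this slide (divide samples, build the predictor on a training set, then measure accuracy on a held-out test set) can be sketched in base R. Everything below is an assumption made for illustration: the data are simulated, the single "region" stands in for the selected regions, and the midpoint cutoff is a deliberately simple stand-in for the actual predictor, which this sketch does not reproduce.

```r
set.seed(2017)

# Simulated data (assumption): one coverage value per sample from a region
# that is much more highly expressed in males than in females.
n <- 200
sex <- rep(c("female", "male"), each = n / 2)
coverage <- ifelse(sex == "male",
                   rnorm(n, mean = 10, sd = 1),
                   rnorm(n, mean = 1, sd = 1))

# Divide samples: half for training, half held out for testing.
train_idx <- sample(seq_len(n), size = n / 2)
test_idx <- setdiff(seq_len(n), train_idx)

# Build the predictor using the training set only: a cutoff halfway
# between the mean coverage of the two groups.
group_means <- tapply(coverage[train_idx], sex[train_idx], mean)
cutoff <- mean(group_means)

# Hypothetical helper: call "male" when coverage exceeds the cutoff.
predict_sex <- function(x, cutoff) ifelse(x > cutoff, "male", "female")

# Test accuracy on samples the predictor has never seen.
predicted <- predict_sex(coverage[test_idx], cutoff)
accuracy <- mean(predicted == sex[test_idx])
print(accuracy)
```

    Keeping the test samples out of the cutoff calculation is the point of the split: accuracy measured on the training samples themselves would be optimistic.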

  38. Goal:
    to accurately
    predict critical
    phenotype
    information for
    all samples in
    recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA
    Sequence Read Archive
    N=49,848
    GTEx
    Genotype Tissue Expression Project
    N=9,662
    divide
    samples
    build and
    optimize
    phenotype
    predictor
    training
    set
    test
    accuracy
    of
    predictor
    predict
    phenotypes
    across samples
    in TCGA
    test set
    TCGA
    The Cancer Genome Atlas
    N=11,284
    slide adapted from Shannon Ellis

  39. Goal:
    to accurately
    predict critical
    phenotype
    information for
    all samples in
    recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA
    Sequence Read Archive
    N=49,848
    GTEx
    Genotype Tissue Expression Project
    N=9,662
    divide
    samples
    build and
    optimize
    phenotype
    predictor
    training
    set
    predict
    phenotypes
    across SRA
    samples
    test
    accuracy
    of
    predictor
    predict
    phenotypes
    across samples
    in TCGA
    test set
    TCGA
    The Cancer Genome Atlas
    N=11,284
    slide adapted from Shannon Ellis

  40. phenopredict
    Expression
    Data
    Covariate
    Information
    Genomic
    Region
    Information
    Pheno
    of
    Interest
    n p
    regions x individuals
    Input Data
    select_regions()
    build_predictor()
    test_predictor()
    extract_data()
    predict_pheno()
    functions
    slide adapted from Shannon Ellis

  41. select_regions()
    Output:
    Coverage matrix (data.frame)
    Region information (GRanges)
    slide adapted from Shannon Ellis

  42. Sex
    prediction is
    accurate
    across data
    sets
    Number of Regions 20 20 20 20
    Number of Samples
    (N)
    4,769 4,769 11,245 3,640
    99.8% 99.6% 99.4%
    88.5%
    slide adapted from Shannon Ellis

  43. Sex
    prediction is
    accurate
    across data
    sets
    Number of Regions 20 20 20 20
    Number of Samples
    (N)
    4,769 4,769 11,245 3,640
    99.8% 99.6% 99.4%
    88.5%
    slide adapted from Shannon Ellis

  44. http://www.rna-seqblog.com/
    Can we use
    expression data
    to predict
    tissue?
    slide adapted from Shannon Ellis

  45. Number of Regions 589 589 589 589
    Number of Samples
    (N)
    4,769 4,769 7,193 8,951
    97.3% 96.5%
    71.9%
    50.6%
    Tissue
    prediction is
    accurate
    across data
    sets
    slide adapted from Shannon Ellis

  46. Number of Regions 589 589 589 589 589
    Number of Samples
    (N)
    4,769 4,769 613 6,579 8,951
    97.3% 96.5% 91.0%
    70.2%
    Prediction is
    more
    accurate in
    healthy
    tissue
    50.6%
    slide adapted from Shannon Ellis

  47. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse
    > rse_with_pred
    https://github.com/leekgroup/recount-analyses/

  48. expression data for ~70,000 human samples
    samples
    phenotypes
    ?
    GTEx
    N=9,662
    TCGA
    N=11,284
    SRA
    N=49,848
    samples
    expression
    estimates
    gene
    exon
    junctions
    ERs
    Answer meaningful
    questions about
    human biology and
    expression
    sex tissue
    M Blood
    F Heart
    F Liver
    slide adapted from Shannon Ellis

  49. [Figure: tissue categories (adipose tissue through vagina) shown in "reported" and "predicted" panels]
  50. bioconductor.org/packages/derfinder
    bioconductor.org/packages/recount
    > biocLite("derfinder")
    > biocLite("recount")
    http://rail.bio
    $ ./install-rail-rna-V

  51. https://github.com/leekgroup/recount-contributions

  52. STEPS
    LIBD RNA-seq pipeline
    1. Quality check (QC) on raw reads
    2. Failed QC? Then trim reads
    3. Align reads to the genome
    4. Count features
    5. Calculate coverage
    6. Transcript level quantification
    7. Create count tables
    8. Call variants for identifying swaps
    Work with Emily Burke

  53. Collaborators
    The Leek Group
    Jeff Leek
    Shannon Ellis
    Hopkins
    Ben Langmead
    Chris Wilks
    Kai Kammers
    Kasper Hansen
    Margaret Taub
    OHSU
    Abhinav Nellore
    LIBD
    Andrew Jaffe
    Emily Burke
    Stephen Semick
    Carrie Wright
    Badoi Phan
    Amanda Price
    Nina Rajpurohit
    Funding
    NIH R01 GM105705
    NIH 1R21MH109956
    CONACyT 351535
    AWS in Education
    Seven Bridges
    IDIES SciServer
