Navigating tens of thousands of RNA-seq datasets with recount, SciServer & Jupyter

Ben Langmead
October 21, 2016

RNA sequencing is a ubiquitous tool for assaying gene expression. Public sequencing data archives such as the Sequence Read Archive now hold more than 50,000 human RNA-seq samples, and the size of the archive doubles approximately every 18 months. Many of these archived studies are valuable to biological researchers and methods developers. However, samples are available only as compressed collections of raw data. Processing the raw data into a form suitable for various downstream analyses is challenging. Care is required to craft summaries that are both concise — convenient for researchers to download and interact with — and useful in a variety of downstream scenarios.
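The growth claim above can be made concrete with a short calculation. The Python sketch below assumes idealized exponential growth with a fixed 18-month doubling time, starting from 50,000 samples. The function name and the projection horizons are illustrative and do not come from the talk.

```python
def projected_samples(current: float, months_ahead: float,
                      doubling_months: float = 18.0) -> float:
    """Project archive size assuming a constant doubling time (an idealization)."""
    return current * 2.0 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    # Horizons chosen only for illustration: now, then one, two and three doublings.
    for months in (0, 18, 36, 54):
        size = projected_samples(50_000, months)
        print(f"{months:>2} months: ~{size:,.0f} samples")
```

Under these assumptions the archive would grow eightfold in four and a half years, which is why per-sample processing cost and concise summaries matter.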

The recount resource addresses this issue by summarizing a huge amount of public data, about 50,000 human RNA-seq samples from the Sequence Read Archive (SRA) and almost 10,000 samples from the GTEx project, into a form that is easy to query. recount is hosted on SciServer and takes advantage of R and Jupyter notebooks to make it easy for anyone to query the summarized data. Here we present these data summaries, compiled at the level of genes, exons, exon-exon junctions and base-level coverage, and show how to use the SciServer Jupyter notebook interface to perform sophisticated analyses.
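To illustrate how summaries at different levels relate to one another, here is a toy Python sketch. It assumes that an exon-level summary is the sum of per-base coverage across the exon, and that a gene-level summary is the sum over that gene's exons. The coverage vector, exon coordinates, and all names are made up for illustration; this is not the recount pipeline or its API. Junction counts are left out because they come from spliced alignments rather than from a coverage vector.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based coordinates


def exon_summaries(coverage: List[int],
                   exons: Dict[str, Interval]) -> Dict[str, int]:
    """Sum per-base coverage over each exon interval."""
    return {name: sum(coverage[start:end]) for name, (start, end) in exons.items()}


def gene_summaries(exon_counts: Dict[str, int],
                   gene_to_exons: Dict[str, List[str]]) -> Dict[str, int]:
    """Sum exon-level values over the exons assigned to each gene."""
    return {gene: sum(exon_counts[e] for e in exon_names)
            for gene, exon_names in gene_to_exons.items()}


if __name__ == "__main__":
    # Hypothetical base-level coverage for a 10-base region.
    coverage = [0, 2, 3, 3, 0, 0, 5, 5, 1, 0]
    exons = {"e1": (1, 4), "e2": (6, 9)}      # hypothetical annotation
    gene_to_exons = {"geneA": ["e1", "e2"]}   # hypothetical gene model

    exon_counts = exon_summaries(coverage, exons)
    print(exon_counts)                                  # {'e1': 8, 'e2': 11}
    print(gene_summaries(exon_counts, gene_to_exons))   # {'geneA': 19}
```

Keeping the base-level coverage around means exon and gene summaries can be recomputed against a different annotation without realigning reads.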

Transcript

  1. Ben Langmead
    Assistant Professor, Computer Science
    [email protected]
    IDIES Symposium, October 21, 2016
    Navigating tens of thousands of RNA-seq datasets with recount, SciServer & Jupyter

  5. Rail-RNA and recount teams
    Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Alyssa Frazee, Leo Collado-Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub

  6. Sequence Read Archive (SRA) growth
    [Plot: SRA size in terabases, open-access and total series; annotations: 1 Pbp; 3 -> 6 Pbp in ~18 months]
    https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement

  7. Elastic MapReduce

  8. Abhinav Nellore, Jeff Leek
    Website: http://rail.bio, Paper: http://bit.ly/rail-aa
    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
    Thank you: IDIES Seed grant
    https://en.wikipedia.org/wiki/RNA-Seq

  9. Rail-RNA
    • Analyzed ~50,000 human RNA-seq samples with Rail-RNA; about 150 Tbp
    • Rapid: input to results in 2 weeks
    • Repeatable: http://github.com/nellore/runs (exact commands we used to run on AWS)
    • Inexpensive: ~$1.40 / sample (compare to sequencing costs)
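
The figures on this slide imply a few back-of-the-envelope totals. The Python sketch below works them out, assuming the rounded values quoted on the slide (50,000 samples, 150 Tbp, 2 weeks, $1.40 per sample) and treating "2 weeks" as exactly 14 days. The constant and function names are illustrative only.

```python
# Rounded figures from the slide; treated as exact only for this estimate.
N_SAMPLES = 50_000
TOTAL_BASES = 150e12        # 150 Tbp
COST_PER_SAMPLE_USD = 1.40
RUN_DAYS = 14               # "2 weeks", assumed to mean 14 days


def summarize_run() -> dict:
    """Derive total cost, mean sample size and daily throughput."""
    return {
        "total_cost_usd": N_SAMPLES * COST_PER_SAMPLE_USD,
        "mean_gbp_per_sample": TOTAL_BASES / N_SAMPLES / 1e9,
        "tbp_per_day": TOTAL_BASES / 1e12 / RUN_DAYS,
    }


if __name__ == "__main__":
    stats = summarize_run()
    print(f"Total compute cost: ~${stats['total_cost_usd']:,.0f}")
    print(f"Mean sample size:   ~{stats['mean_gbp_per_sample']:.1f} Gbp")
    print(f"Throughput:         ~{stats['tbp_per_day']:.1f} Tbp/day")
```

With these inputs the run comes to roughly $70,000 in total, about 3 Gbp per sample on average, and a little under 11 Tbp processed per day.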

  10. recount
    • Provides expression summaries at levels of genes, junctions, exons and coverage vectors
    [Figure panels: Junctions, Genes, Coverage, Exons]
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.

  11. recount
    • Shiny-app front-end: https://jhubiostatistics.shinyapps.io/recount/
    • Over 6 TB of data hosted at SciServer
    • SciServer Compute lets users work with locally hosted data in Jupyter notebooks: http://compute.sciserver.org/dashboard/
    • Preprint & Bioconductor 3.4 package available

  12. recount
    • Discovery of novel splicing events has leveled off

  13. recount
    • Distinct summaries tell complementary stories about differential expression

  14. recount
    • Some differential expression is outside of any known-transcribed area

  15. Brief demo

  16. Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Alyssa Frazee, Leo Collado-Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub
    • NIH R01GM118568
    • NSF CAREER IIS-1349906
    • Sloan Research Fellowship
    • IDIES Seed Funding program
    • Amazon Web Services
    langmead-lab.org, @BenLangmead
    Thank you:
    IDIES Seed funding
    SciServer
    SciServer Compute
