Ben Langmead
October 21, 2016

Navigating tens of thousands of RNA-seq datasets with recount, SciServer & Jupyter

RNA sequencing is a ubiquitous tool for assaying gene expression. Public sequencing data archives such as the Sequence Read Archive now hold more than 50,000 human RNA-seq samples, and the size of the archive doubles approximately every 18 months. Many of these archived studies are valuable to biological researchers and methods developers. However, samples are available only as compressed collections of raw data. Processing the raw data into a form suitable for various downstream analyses is challenging. Care is required to craft summaries that are both concise — convenient for researchers to download and interact with — and useful in a variety of downstream scenarios.
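
The 18-month doubling time quoted above can be turned into a rough growth factor. The sketch below is only an illustration under the assumption of steady exponential growth; the function name `growth_factor` and the chosen time points are hypothetical, and nothing here is a forecast from the talk.

```python
# Rough illustration of what "doubles approximately every 18 months" implies,
# ASSUMING steady exponential growth (an assumption, not a claim from the talk).

def growth_factor(months, doubling_months=18.0):
    """Multiplicative growth after `months`, given a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


# Implied month-over-month growth rate under the same assumption.
monthly_rate = growth_factor(1) - 1.0

if __name__ == "__main__":
    for months in (0, 18, 36, 54):
        print(f"after {months:>2} months: x{growth_factor(months):.1f}")
    print(f"implied monthly growth: {monthly_rate:.1%}")
```

Under this assumption the archive grows by a little under 4% per month, which is why summaries that are cheap to recompute and download matter as the archive grows.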

The recount resource addresses this issue by summarizing a large body of public data, about 50,000 human RNA-seq samples from the Sequence Read Archive (SRA) and almost 10,000 samples from the GTEx project, into a form that is easy to query. recount is hosted on SciServer and uses R and Jupyter notebooks so that anyone can query the summarized data. Here we present these data summaries, compiled at the level of genes, exons, exon-exon junctions and base-level coverage, and show how to use the SciServer Jupyter notebook interface to perform sophisticated analyses.
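
To make the four summary levels concrete, here is a toy sketch of how gene, exon, junction and base-level coverage summaries can relate to one another. All data, gene names and function names are hypothetical, and summing coverage over an interval is just one simple way to summarize; this is not presented as the recount pipeline or its API.

```python
from collections import Counter

# Toy base-level coverage for a 20-base region (hypothetical values).
coverage = [0, 0, 3, 3, 3, 3, 0, 0, 0, 0, 5, 5, 5, 5, 5, 0, 0, 2, 2, 0]

# Hypothetical annotation: half-open [start, end) exon intervals per gene.
genes = {
    "geneA": [(2, 6), (10, 15)],
    "geneB": [(17, 19)],
}

# Hypothetical spliced alignments, one (intron_start, intron_end) per read.
spliced_reads = [(6, 10), (6, 10), (6, 10)]


def exon_summary(cov, exon):
    """Sum base-level coverage over one exon interval."""
    start, end = exon
    return sum(cov[start:end])


def gene_summary(cov, exons):
    """Sum the exon-level summaries of all exons assigned to a gene."""
    return sum(exon_summary(cov, exon) for exon in exons)


# Junction-level summary: number of reads supporting each exon-exon junction.
junction_counts = Counter(spliced_reads)

exon_counts = {
    gene: [exon_summary(coverage, exon) for exon in exons]
    for gene, exons in genes.items()
}
gene_counts = {gene: gene_summary(coverage, exons) for gene, exons in genes.items()}

if __name__ == "__main__":
    print("exons:", exon_counts)
    print("genes:", gene_counts)
    print("junctions:", dict(junction_counts))
```

The point of the sketch is that the coverage vector is the most detailed summary, while exon and gene values are progressively more compact aggregates of it; junction counts carry splicing evidence that coverage alone does not.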

Transcript

  1. Navigating tens of thousands of RNA-seq datasets with recount, SciServer & Jupyter
    Ben Langmead, Assistant Professor, Computer Science, [email protected]
    IDIES Symposium, October 21, 2016
  2. +

  3. Rail-RNA and recount teams
    Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Alyssa Frazee, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub
  4. Sequence Read Archive (SRA) growth
    [Plot: archive size in terabases; series: Open access, Total; 1 Pbp level marked]
    • 3 -> 6 Pbp in ~18 months
    https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement
  5. Abhinav Nellore, Jeff Leek
    Website: http://rail.bio, Paper: http://bit.ly/rail-aa
    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
    Thank you: IDIES Seed grant
    https://en.wikipedia.org/wiki/RNA-Seq
  6. Rail-RNA
    • Analyzed ~50,000 human RNA-seq samples with Rail-RNA; about 150 Tbp
    • Rapid: input to results in 2 weeks
    • Repeatable: http://github.com/nellore/runs (exact commands we used to run on AWS)
    • Inexpensive: ~$1.40 / sample (compare to sequencing costs)
  7. recount
    [Figure panels: Junctions, Genes, Coverage, Exons]
    • Provides expression summaries at levels of genes, junctions, exons and coverage vectors
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.
  8. recount
    • Shiny-app front-end: https://jhubiostatistics.shinyapps.io/recount/
    • Over 6 TB of data hosted at SciServer
    • SciServer Compute lets users work with locally hosted data in Jupyter notebooks: http://compute.sciserver.org/dashboard/
    • Preprint & Bioconductor 3.4 package available
  9. recount
    • Discovery of novel splicing events has leveled off
  10. recount
    • Distinct summaries tell complementary stories about differential expression
  11. recount
    • Some differential expression is outside of any known-transcribed area
  12. Thank you
    Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Alyssa Frazee, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub
    • NIH R01GM118568
    • NSF CAREER IIS-1349906
    • Sloan Research Fellowship
    • IDIES Seed Funding program
    • Amazon Web Services
    langmead-lab.org, @BenLangmead
    Thank you: IDIES Seed funding, SciServer, SciServer Compute
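
The Rail-RNA figures on slide 6 (~50,000 samples, about 150 Tbp, ~$1.40 per sample) imply some back-of-the-envelope totals. The sketch below simply multiplies and divides those rounded figures; the variable names are hypothetical and the results are order-of-magnitude estimates, not numbers reported in the talk.

```python
# Back-of-the-envelope totals from the rounded figures on slide 6.
# All three inputs are approximate, so the outputs are rough estimates only.

samples = 50_000          # ~50,000 human RNA-seq samples
total_bases = 150e12      # about 150 Tbp
cost_per_sample = 1.40    # ~$1.40 per sample (USD)

bases_per_sample = total_bases / samples
total_cost = samples * cost_per_sample

if __name__ == "__main__":
    print(f"average bases per sample: {bases_per_sample / 1e9:.1f} Gbp")
    print(f"approximate total compute cost: ${total_cost:,.0f}")
```

With these inputs the average works out to about 3 Gbp per sample and a total on the order of $70,000 for the whole run.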