
PSU 2017: Unlocking sequence data archives with scalable software and resources


The Sequence Read Archive contains data for over 450K RNA-seq samples, including over 140K from humans. Large-scale projects like GTEx and ICGC are generating RNA-seq data on many thousands of samples. Such huge and carefully designed datasets are valuable, but unwieldy for typical researchers, especially when access to computational resources is limited.

I will describe work toward the goal of making it easy for biological researchers to use the archived RNA-seq data available today. I will highlight the Rail-RNA software (http://rail.bio), its dbGaP-protected version (http://docs.rail.bio/dbgap/), as well as the recount (https://jhubiostatistics.shinyapps.io/recount/) and Snaptron (http://snaptron.cs.jhu.edu) resources. The Rail-RNA software uses the Amazon Web Services commercial cloud to analyze many samples at once. We used Rail-RNA to study tens of thousands of public RNA-seq accessions, yielding new insights about the completeness of existing gene annotations and about how our knowledge of human splicing diversity has evolved over time. I will demonstrate how the recount resource can be used to answer questions about expression and differential expression across tens of thousands of RNA-seq samples, and how the Snaptron API can be used to rapidly answer sophisticated queries against the splicing patterns in recount.

Ben Langmead

March 01, 2017


Transcript

  1. Unlocking sequence data archives with scalable software and resources

    Ben Langmead, Assistant Professor, JHU Computer Science. langmead-lab.org, @BenLangmead. Penn State, March 1, 2017.
  2. Themes

    • Public data is available & valuable but hard to use
    • Scalable software benefits from big resources & data
    • Strategically ignoring gene annotations can yield clearer results
  3. Sequence Read Archive (SRA) growth

    [Figure: SRA growth, with curves for total and open-access bases; axis ticks at 10 Tbp, 100 Tbp, 1 Pbp and 10 Pbp.] 4.5 -> 9 Pbp in ~10 months. Source: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement
  4. Abhinav Nellore, OHSU; Jeff Leek, JHU. Website: http://rail.bio

    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
  5. Spliced RNA-seq aligner for analyzing many samples at once

    • Aggregate across samples to borrow strength and eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    • Concise outputs: junctions, junction evidence, coverage vectors; alignments only when asked
    • Runs easily on commercial AWS cloud, other clusters

    Website: http://rail.bio

    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
  6. Pass 1: align to genome, make junction calls

    Pass 2: re-align to genome with putative junctions. [Figure labels: Reads, Ref, Readlets]
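Slide 6 mentions breaking reads into readlets before aligning them. The following is a minimal sketch of that idea only, assuming fixed-length overlapping windows; the function name, `readlet_len` and `stride` are hypothetical choices for illustration, and Rail-RNA's actual readlet scheme may differ.

```python
def split_into_readlets(read, readlet_len=25, stride=12):
    """Return (offset, readlet) pairs covering the read with overlapping windows.

    readlet_len and stride are illustrative values, not Rail-RNA defaults.
    """
    if len(read) <= readlet_len:
        # A short read is its own single readlet.
        return [(0, read)]
    offsets = list(range(0, len(read) - readlet_len + 1, stride))
    # Add a final window so the last bases of the read are always covered.
    last = len(read) - readlet_len
    if offsets[-1] != last:
        offsets.append(last)
    return [(off, read[off:off + readlet_len]) for off in offsets]
```

Each readlet can then be aligned independently; readlets from one read that land on either side of a gap in the reference are evidence for a candidate junction.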
  7. Aggregating across samples adds a dimension to junction call confidence

    [Figure: grid of samples 1-5 against candidate junctions 1-3.]
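One way to picture the cross-sample dimension from slide 7 is a filter that keeps a candidate junction if it recurs in enough samples or is strongly supported in a single one. This is a sketch under assumed rules: the function name and both thresholds are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

def filter_junctions(calls, n_samples, min_sample_frac=0.05, min_reads_single=5):
    """Keep junctions supported across samples.

    calls: iterable of (sample_id, junction, read_count) tuples.
    A junction is kept if it appears in at least min_sample_frac of all
    samples, or has at least min_reads_single reads in any one sample.
    Both thresholds are illustrative assumptions.
    """
    samples_with = defaultdict(set)
    max_reads = defaultdict(int)
    for sample, jx, count in calls:
        if count > 0:
            samples_with[jx].add(sample)
            max_reads[jx] = max(max_reads[jx], count)
    kept = set()
    for jx, samples in samples_with.items():
        frac = len(samples) / n_samples
        if frac >= min_sample_frac or max_reads[jx] >= min_reads_single:
            kept.add(jx)
    return kept
```

With five samples, a junction seen once in a single sample would be dropped, while one seen in four samples (or with many reads in one sample) would survive.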
  8. Rail-RNA design

    [Workflow diagram.] Steps shown: preprocess; aggregate duplicate reads; split into readlets; aggregate duplicate readlets; correlation clustering for readlet alignments; call splice junction; merge exon differentials; compile sample coverages; write bigWigs; write normalization factors; write spliced alignment BAMs; write junction & indel BEDs. Alignment steps shown: align reads end-to-end to genome; align readlets to genome; align readlets to junction co-occurrence index. Aligner labels: Bowtie 2, Bowtie, Bowtie.
  9. Better-than-linear scaling

    Marginal cost of analyzing 1 additional sample decreases as we add more samples.

    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
  10. Website: http://rail.bio • Paper: http://bit.ly/rail-aa • Repo: https://github.com/nellore/rail • Chat: https://gitter.im/nellore/rail

    dbGaP: Website: http://docs.rail.bio/dbgap/ • Paper: http://bit.ly/rail_dbgap

    Abhinav Nellore, OHSU; Jeff Leek, JHU
  11. RNA-seq & annotation Spliced alignment Isoform assembly Isoform quantification Count

    overlaps w/ annotated features Differential gene / exon expression (quasi-, pseudo-) Reads
  12. Compiling Intropolis • Analyzed ~21,500 human RNA-seq samples with Rail-RNA;

    about 62 Tbp • Repeatable: http://github.com/nellore/runs (Exact commands we used to run on AWS)
  13. Compiling Intropolis • Analyzed ~21,500 human RNA-seq samples with Rail-RNA;

    about 62 Tbp • Repeatable: http://github.com/nellore/runs • ~ $0.72 / sample (Compare to sequencing costs) (Exact commands we used to run on AWS) jxs samples
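As a rough sense of scale, the two rounded figures on this slide can be multiplied to estimate the total compute cost. This is back-of-the-envelope arithmetic from the slide's approximate values, not a total reported in the talk.

```latex
21{,}500 \ \text{samples} \times \$0.72 \ \text{per sample} \approx \$15{,}480
```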
  15. [Figure, panel a: junction (jx) count J versus minimum number S of samples in which a jx is called, with S from 0 to 14,000 and J from 0 to 700,000; inset panels b and c.]

    Junction categories: novel, alternative donor/acceptor, exon skip, fully annotated. Values labeled on the plot: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%. Annotation includes: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega.

    Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
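The curve on slide 15 relates a sample-count threshold S to the number of junctions J called in at least S samples. A minimal sketch of how such a curve, together with the annotated fraction at each threshold, could be tabulated is shown here; the function name and input layout are assumptions for illustration, not the authors' analysis code.

```python
def junction_curve(samples_per_jx, annotated, thresholds):
    """Tabulate J(S) and the annotated fraction for each threshold S.

    samples_per_jx: dict mapping junction -> number of samples calling it.
    annotated: set of junctions present in some annotation.
    Returns {S: (J, fraction_annotated)}; the fraction is None when J == 0.
    """
    curve = {}
    for s in thresholds:
        kept = [jx for jx, n in samples_per_jx.items() if n >= s]
        n_annotated = sum(1 for jx in kept if jx in annotated)
        frac = n_annotated / len(kept) if kept else None
        curve[s] = (len(kept), frac)
    return curve
```

Raising S discards junctions seen in few samples, so J can only shrink as S grows; how the annotated fraction changes depends on the data.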
  16. [Figure: % of called junctions that are annotated (0 to 80+) across samples (0 to 10,000); annotation: GENCODE v19.]

    For ~2.5% of samples, <50% of junction calls are annotated. Median fraction of junction calls that are annotated: ~80%.
  17. RNA-seq & annotation Spliced alignment Isoform assembly Isoform quantitation Count

    overlaps w/ annotated features Differential gene / exon expression (often with annotation) (quasi-, pseudo-)
  18. A third way Spliced alignment Rail-RNA: accurate, annotation-agnostic Differentially expressed

    region finder derfinder: region-based, annotation-agnostic Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead B, Irizarry RA, Leek JT, Jaffe AE. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2017 Jan 25;45(2):e9.
  19. A third way

    Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead B, Irizarry RA, Leek JT, Jaffe AE. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2017 Jan 25;45(2):e9.
  20. A third way Spliced alignment Rail-RNA: accurate, annotation-agnostic Differentially expressed

    region finder derfinder: region-based, annotation-agnostic bigWigs Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead B, Irizarry RA, Leek JT, Jaffe AE. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2017 Jan 25;45(2):e9.
  21. Boiler: RNA-seq alignment compression

    • As big as bigWigs & 1-2 orders of magnitude smaller than sorted BAMs
    • Usable with Cufflinks, StringTie

    [Figure labels: F, R1, R2; Coverage; Length tallies; Co-occurrence patterns.] Jacob Pritt

    Pritt J, Langmead B. Boiler: lossy compression of RNA-seq alignments using coverage vectors. Nucleic Acids Res. 2016 Sep 19;44(16):e133.
  22. Snaptron & recount2

    • Best places to start if you’re interested in our summaries of public human RNA-seq; both include:
      • ~50K SRA samples
      • ~10K GTEx samples
      • ~10K TCGA samples
    • Snaptron: pose sophisticated queries re: splicing, with quick responses, no need to download data
    • recount2: reanalyze starting from our gene-, exon-, junction- or coverage-level summaries
  23. Snaptron

    Query planner delegates to appropriate systems (sqlite, tabix, lucene) and indexes (R-tree, B-tree, inverted full text). Chris Wilks

    [Diagram: a query passes through the Snaptron query planner to three data stores/indexes and on to the output. Tabix/R-tree index: region-limited junction records. SQLite/B-tree index: filtered junction records. Lucene/inverted document index: sample metadata records used as a sample filter. Example inverted index of sample metadata terms: "Brain" -> samples 1, 2, 3, 6; "Liver" -> samples 4, 6, 9, 11.]

    Wilks C, Gaddipati P, Nellore A, Langmead B. "Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples." bioRxiv (2017): 097881.
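The division of labor on slide 23 can be illustrated with a toy in-memory version: an inverted index maps metadata terms to sample IDs, and the resulting sample set filters junction records already limited to a region. Plain Python dictionaries and lists stand in for Lucene, tabix and SQLite here, and the junction records are made up; only the "Brain" and "Liver" postings come from the slide.

```python
# Inverted index of sample metadata terms (postings taken from the slide).
INVERTED_INDEX = {"brain": {1, 2, 3, 6}, "liver": {4, 6, 9, 11}}

# Hypothetical junction records: (chrom, start, end, {sample_id: read_count}).
JUNCTIONS = [
    ("chr2", 100, 200, {1: 5, 4: 2}),
    ("chr2", 150, 400, {9: 7}),
    ("chr2", 900, 990, {2: 1, 6: 3}),
]

def samples_matching(terms):
    """Samples whose metadata contains every term (set intersection)."""
    sets = [INVERTED_INDEX.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def query(chrom, lo, hi, terms=None):
    """Region-limit the junctions, then optionally filter by sample terms."""
    in_region = [j for j in JUNCTIONS if j[0] == chrom and j[1] >= lo and j[2] <= hi]
    if not terms:
        return in_region
    keep = samples_matching(terms)
    out = []
    for c, start, end, cov in in_region:
        filtered = {sid: n for sid, n in cov.items() if sid in keep}
        if filtered:  # drop junctions with no support in the selected samples
            out.append((c, start, end, filtered))
    return out
```

For example, `query("chr2", 0, 1000, ["brain", "liver"])` keeps only the junction supported by sample 6, the one sample listed under both terms.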
  24. Snaptron

    Example queries:

    • Junctions in ALK gene
    • # times junction occurs in each of 50,000 SRA samples
    • Tissue specificity of junction in GTEx data
    • Samples ranked according to how overrepresented one splicing pattern is relative to another

    http://snaptron.cs.jhu.edu

    Wilks C, Gaddipati P, Nellore A, Langmead B. "Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples." bioRxiv (2017): 097881.
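A query such as "junctions in the ALK gene" would typically be sent to Snaptron as an HTTP request. The sketch below only builds a request URL and makes no network call. The base URL is from the slide; the path layout (`/<compilation>/snaptron`), the compilation name `srav1`, and the parameter names `regions`, `rfilter` and `sfilter` are assumptions that should be checked against the Snaptron documentation before use.

```python
from urllib.parse import urlencode

SNAPTRON_BASE = "http://snaptron.cs.jhu.edu"  # base URL from the slide

def snaptron_url(compilation, regions, rfilter=None, sfilter=None):
    """Build a query URL; path and parameter names are assumed, not verified."""
    params = [("regions", regions)]
    if rfilter:
        params.append(("rfilter", rfilter))  # assumed: filter on junction fields
    if sfilter:
        params.append(("sfilter", sfilter))  # assumed: filter on sample metadata
    return f"{SNAPTRON_BASE}/{compilation}/snaptron?{urlencode(params)}"

# Hypothetical usage: junctions overlapping the ALK gene.
print(snaptron_url("srav1", "ALK"))
```

Keeping URL construction in one small function makes it easy to correct the assumed parameter names in a single place.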
  25. recount2

    • Summaries over the 70K samples at levels of genes, junctions, exons and coverage vectors

    [Figure labels: Junctions, Genes, Coverage, Exons]

    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.
  26. recount2

    • Shiny-app front-end: https://jhubiostatistics.shinyapps.io/recount/
    • Bioconductor 3.4 package: https://www.bioconductor.org/packages/recount/
    • Preprint: http://bit.ly/recount_pre

    Abhinav Nellore, Leo Collado Torres

    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.
  27. Intropolis

    • Discovery of novel splicing events has leveled off
    • Good time to put effort into a more complete annotation

    Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
  28. Intropolis

    [Figure: scatter plot of PC1 versus PC2, with groups labeled ABRF, SEQC and GEU.]

    Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
  29. Thank you

    Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Alyssa Frazee, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub. Also for DERfinder: Rafa Irizarry, Sarven Sabunciyan, Mike Love.

    • NIH R01GM118568
    • NSF CAREER IIS-1349906
    • Sloan Research Fellowship
    • IDIES Seed Funding program
    • Amazon Web Services

    Also acknowledged: SciServer, SciServer Compute.

    langmead-lab.org, @BenLangmead