Summarizing tens of thousands of RNA-seq samples: themes and lessons

Abstract: The Sequence Read Archive contains RNA-seq data for over 450K samples, including over 140K from humans. Large-scale projects like GTEx and ICGC are generating RNA-seq data on many thousands of samples. Such huge datasets are valuable, but unwieldy for typical researchers. I will describe work toward the goal of making it easy for researchers to use the archived RNA-seq data available today. I will highlight Rail-RNA (http://rail.bio), its dbGaP-protected version (http://docs.rail.bio/dbgap/), as well as the recount resource (https://jhubiostatistics.shinyapps.io/recount/) and the Snaptron service/API (http://snaptron.cs.jhu.edu). Besides showcasing these tools and resources, I'll expound three themes: (a) public data is valuable but not easy to use, and computationalists should attack this; (b) scalability is not just about scaling software to be distributed and multi-threaded, but is also about making the best use of many datasets at once; (c) "strategically unplugging" from gene annotations can lead to clearer statements about splicing and differential expression.

This is joint work with Abhinav Nellore, Jeff Leek, Kasper Hansen, Andrew Jaffe and others.

Ben Langmead

March 30, 2017

Transcript

  1. Ben Langmead, Assistant Professor, JHU Computer Science
    langmea@cs.jhu.edu, langmead-lab.org, @BenLangmead
    Banff, March 30, 2017
    Summarizing tens of thousands of RNA-seq samples: themes and lessons
  2. Themes
    • Public data is available & valuable but hard to use
  3. Themes
    • Public data is available & valuable but hard to use
    • Scalable software benefits from big resources & data
  4. Themes
    • Public data is available & valuable but hard to use
    • Scalable software benefits from big resources & data
    • Strategically ignoring gene annotations can yield clearer results
  5. Sequence Read Archive (SRA) growth
    [Plot: SRA growth, series "Total" and "Open access"; y-axis ticks: 10 Tbp, 100 Tbp, 1 Pbp, 10 Pbp]
    4.5 -> 9 Pbp in ~10 months
    https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement
  6. Abhinav Nellore, OHSU; Jeff Leek, JHU
    Website: http://rail.bio
    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
  7. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and eliminate redundant alignment work
  8. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
  9. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    • Concise outputs: junctions, junction evidence, coverage vectors; no alignments, unless asked for
  10. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    • Concise outputs: junctions, junction evidence, coverage vectors; no alignments, unless asked for
    • Runs easily on commercial AWS cloud, other clusters
  12. Aggregating across samples adds a dimension to junction call confidence
    [Diagram labels: Samples 1-5; Candidate Junctions 1-3]
  13. Aggregating across samples adds a dimension to junction call confidence
    [Diagram labels: Samples 1-5; Candidate Junctions 1-3]
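The idea on this slide, that seeing a candidate junction in several samples is itself evidence for the call, can be sketched in a few lines of Python. This is a minimal illustration and not Rail-RNA's actual filter: the function name `filter_junctions`, the thresholds `min_samples` and `min_reads_single`, and the toy calls are all assumptions made for the example.

```python
from collections import defaultdict


def filter_junctions(calls, min_samples=2, min_reads_single=5):
    """Keep junctions supported across samples or strongly within one sample.

    calls: iterable of (sample_id, junction_id, read_count) tuples.
    Thresholds are hypothetical, chosen only for illustration.
    """
    samples_with = defaultdict(set)   # junction -> samples where it was called
    max_reads = defaultdict(int)      # junction -> best single-sample support
    for sample, jx, reads in calls:
        if reads <= 0:
            continue
        samples_with[jx].add(sample)
        max_reads[jx] = max(max_reads[jx], reads)
    return {
        jx
        for jx in samples_with
        if len(samples_with[jx]) >= min_samples
        or max_reads[jx] >= min_reads_single
    }


# Toy data: CJ1 is weak per sample but recurrent, CJ2 is a one-off,
# CJ3 appears in one sample but with many reads.
calls = [
    ("S1", "CJ1", 3), ("S2", "CJ1", 1), ("S3", "CJ1", 2),
    ("S4", "CJ2", 1),
    ("S5", "CJ3", 9),
]
kept = filter_junctions(calls)
print(sorted(kept))
```

With these toy inputs CJ1 survives because of its cross-sample support, which a per-sample filter alone would not see.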
  14. Rail-RNA design
    [Workflow diagram; box labels:]
    • Preprocess
    • Aggregate duplicate reads
    • Split into readlets
    • Aggregate duplicate readlets
    • Correlation clustering for readlet alignments
    • Call splice junction
    • Merge exon differentials
    • Compile sample coverages
    • Write bigWigs
    • Write normalization factors
    • Write spliced alignment BAMs
    • Write junction & indel BEDs
    • Align reads end-to-end to genome
    • Align readlets to genome
    • Align readlets to junction co-occurrence index
    • Aligner labels: Bowtie 2, Bowtie, Bowtie
  15. [Plot: log coverage for Sample 1, Sample 2, Sample 3]

  16. Better-than-linear scaling
    Marginal cost of analyzing 1 additional sample decreases as we add more samples
    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
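One way a decreasing marginal cost can arise is the "aggregate duplicate reads" idea from the design slide: if identical read sequences are pooled across samples, each distinct sequence needs to be aligned only once. The Python sketch below is a toy illustration of that single effect under assumed inputs; the function name `alignment_work` and the example reads are hypothetical, and it does not measure Rail-RNA itself.

```python
def alignment_work(samples):
    """Track total reads vs. distinct read sequences as samples are added.

    samples: list of lists of read sequences (strings).
    Returns a list of (total_reads_so_far, distinct_sequences_so_far).
    """
    seen = set()
    total = 0
    curve = []
    for reads in samples:
        total += len(reads)
        seen.update(reads)          # a sequence already seen costs nothing new
        curve.append((total, len(seen)))
    return curve


# Toy reads: later samples mostly repeat sequences seen earlier.
samples = [
    ["ACGT", "ACGT", "TTGA"],
    ["ACGT", "TTGA", "GGCC"],
    ["ACGT", "GGCC", "GGCC"],
]
curve = alignment_work(samples)
for total, distinct in curve:
    print(total, distinct)
```

In this toy case the three samples contribute 2, 1, and 0 new distinct sequences, so the alignment work added by each additional sample shrinks.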
  17. Intropolis
    Website: http://intropolis.rail.bio
    Paper: http://bit.ly/intropolis
    Abhi Nellore, OHSU; Jeff Leek, JHU
  18. RNA-seq & annotation
    [Workflow diagram labels: Reads; Spliced alignment; Count overlaps w/ annotated features; Differential gene / exon expression]
  19. RNA-seq & annotation
    [Workflow diagram labels: Reads; Spliced alignment; Isoform assembly; Isoform quantification; Count overlaps w/ annotated features; Differential gene / exon expression]
  20. RNA-seq & annotation
    [Workflow diagram labels: Reads; Spliced alignment; (quasi-, pseudo-); Isoform assembly; Isoform quantification; Count overlaps w/ annotated features; Differential gene / exon expression]
  21. Good proxy?
    [Diagram labels: Reads; Isoform quantification; Transcripts in sample; Transcripts in annotation]
  22. Compiling Intropolis
    • Analyzed ~21,500 human RNA-seq samples with Rail-RNA; about 62 Tbp
    • Repeatable: http://github.com/nellore/runs (exact commands we used to run on AWS)
    • ~$0.72 / sample (compare to sequencing costs)
    [Diagram labels: jxs, samples]
  23. [Figure, panels a-c: junction (jx) count J vs. minimum number S of samples in which jx is called, by junction class: novel, alternative donor/acceptor, exon skip, fully annotated. Values marked on the plot: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%]
    Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
    Annotation includes: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
  24. [Plot axes: Samples; % called junctions that are annotated (GENCODE v19)]
    For ~2.5% of samples, <50% of junction calls are annotated
    Median fraction of junction calls that are annotated: ~80%
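The two summary statistics on this slide (per-sample fraction of called junctions that are annotated, and its median across samples) can be computed as in the following Python sketch. The junction coordinates, sample names, and the helper `annotated_fraction` are made up for illustration; they are not taken from the Intropolis data.

```python
from statistics import median


def annotated_fraction(called, annotated):
    """Fraction of a sample's called junctions found in the annotation.

    Junctions are (chrom, start, end) tuples. Returns None if the sample
    has no calls.
    """
    called = set(called)
    if not called:
        return None
    return len(called & annotated) / len(called)


# Hypothetical annotation and per-sample junction calls.
annotated = {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 90)}
calls_by_sample = {
    "S1": [("chr1", 100, 200), ("chr1", 300, 400)],
    "S2": [("chr1", 100, 200), ("chr1", 150, 250)],
    "S3": [("chr2", 50, 90), ("chr2", 60, 95), ("chr2", 70, 99), ("chr2", 80, 120)],
}

fractions = {
    sample: annotated_fraction(calls, annotated)
    for sample, calls in calls_by_sample.items()
}
median_fraction = median(fractions.values())
# Samples where fewer than half of the junction calls are annotated.
mostly_unannotated = sorted(s for s, f in fractions.items() if f < 0.5)
print(fractions, median_fraction, mostly_unannotated)
```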
  25. A third way
    Spliced alignment: Rail-RNA (accurate, annotation-agnostic)
    Differentially expressed region finder: derfinder (region-based, annotation-agnostic)
    Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead B, Irizarry RA, Leek JT, Jaffe AE. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2017 Jan 25;45(2):e9.
  26. A third way
    Spliced alignment: Rail-RNA (accurate, annotation-agnostic)
    bigWigs
    Differentially expressed region finder: derfinder (region-based, annotation-agnostic)
  27. Boiler: RNA-seq alignment compression
    • As big as bigWigs & 1-2 orders of magnitude smaller than sorted BAMs
    • Usable with Cufflinks, StringTie
    [Diagram labels: F, R1, R2; Coverage; Length tallies; Co-occurrence patterns]
    Jacob Pritt
    Pritt J, Langmead B. Boiler: lossy compression of RNA-seq alignments using coverage vectors. Nucleic Acids Res. 2016 Sep 19;44(16):e133.
  28. Snaptron & recount2
    • Where to start for our summaries of public human RNA-seq; both include:
      • ~50K SRA samples
      • ~10K GTEx samples
      • ~10K TCGA samples
    • Snaptron: pose sophisticated queries re: splicing, quick responses, no downloading
  29. Snaptron
    Query planner delegates to appropriate systems (sqlite, tabix, lucene) and indexes (R-tree, B-tree, inverted full text)
    Chris Wilks
    [Diagram: Snaptron Query Planner, with columns Query, Data Store/Index, Output. Indexes: Tabix/R-tree, Lucene/Inverted Document, SQLite/B-tree. Records: Junction Records, Sample Metadata Records. Query stages: Region Limited, Sample Filter, Region Limited & Filtered, Filtered Region, Filtered Samples. Example sample metadata terms: "Brain" -> samples 1,2,3,6; "Liver" -> samples 4,6,9,11]
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. bioRxiv. 2017:097881.
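The division of labor on this slide (a region lookup over junction records, combined with an inverted index from metadata terms to samples) can be mimicked with plain Python data structures. This is a toy sketch, not Snaptron's implementation: a linear scan stands in for the Tabix/R-tree lookup, a dict stands in for the Lucene inverted index, and the junction records and the `query` function are hypothetical. Only the "Brain" and "Liver" sample lists come from the slide.

```python
# Inverted index: metadata term -> sample IDs (values taken from the slide).
term_index = {
    "Brain": {1, 2, 3, 6},
    "Liver": {4, 6, 9, 11},
}

# Hypothetical junction records, each listing the samples it was seen in.
junctions = [
    {"id": "jxA", "chrom": "chr2", "start": 100, "end": 500, "samples": {1, 4, 6}},
    {"id": "jxB", "chrom": "chr2", "start": 900, "end": 1500, "samples": {2, 3}},
    {"id": "jxC", "chrom": "chr3", "start": 100, "end": 400, "samples": {9, 11}},
]


def query(chrom, start, end, term=None):
    """Return (junction_id, matching_samples) for junctions inside a region.

    If term is given, keep only junctions seen in at least one sample
    whose metadata matches that term, and report just those samples.
    """
    # Region filter (stand-in for an interval index lookup).
    hits = [
        j for j in junctions
        if j["chrom"] == chrom and j["start"] >= start and j["end"] <= end
    ]
    if term is None:
        return [(j["id"], sorted(j["samples"])) for j in hits]
    # Sample filter via the inverted index.
    wanted = term_index.get(term, set())
    results = []
    for j in hits:
        shared = j["samples"] & wanted
        if shared:
            results.append((j["id"], sorted(shared)))
    return results


brain_hits = query("chr2", 1, 2000, term="Brain")
liver_hits = query("chr2", 1, 2000, term="Liver")
print(brain_hits)
print(liver_hits)
```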
  30. Snaptron
    Example queries:
    • Junctions in ALK gene
    • # times junction occurs in each of 50,000 SRA samples
    • Tissue specificity of junction in GTEx data
    • Samples ranked according to how overrepresented one splicing pattern is relative to another
    http://snaptron.cs.jhu.edu
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. bioRxiv. 2017:097881.
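The last example query, ranking samples by how overrepresented one splicing pattern is relative to another, can be sketched with a simple per-sample score. The score below is a normalized difference of read counts with a +1 in the denominator to avoid division by zero; it is an assumption chosen for illustration and is not claimed to be the exact score Snaptron computes. The sample names and counts are invented.

```python
def inclusion_ratio(count_a, count_b):
    """Score in (-1, 1): positive when pattern B dominates, negative when A does.

    count_a, count_b: reads supporting splicing pattern A and pattern B
    in one sample. The +1 keeps the score defined when both are zero.
    """
    return (count_b - count_a) / (count_a + count_b + 1)


# Hypothetical per-sample read counts: sample -> (pattern A, pattern B).
counts = {
    "S1": (10, 0),
    "S2": (0, 9),
    "S3": (5, 5),
    "S4": (1, 3),
}

# Rank samples from most B-dominated to most A-dominated.
ranked = sorted(counts, key=lambda s: inclusion_ratio(*counts[s]), reverse=True)
for sample in ranked:
    print(sample, round(inclusion_ratio(*counts[sample]), 3))
```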
  31. Thank you:
    Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Alyssa Frazee, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub
    Also for DERfinder: Rafa Irizarry, Sarven Sabunciyan, Mike Love
    • NIH R01GM118568
    • NSF CAREER IIS-1349906
    • Sloan Research Fellowship
    • IDIES Seed Funding program
    • Amazon Web Services
    • SciServer, SciServer Compute
    langmead-lab.org, @BenLangmead