Scalable analysis of many sequencing datasets at once

Ben Langmead
November 12, 2016

The Sequence Read Archive contains data for over 300K RNA-seq samples, including over 80K from human-derived samples. Large-scale projects like GTEx and TCGA are generating RNA-seq data on many thousands of samples. Such huge and carefully designed datasets are valuable, but unwieldy for typical biological researchers, especially when access to computational resources is limited.
I will describe our work toward making it easy for typical biological researchers to leverage the huge amount of public RNA-seq data available today. I will highlight the Rail-RNA software (http://rail.bio), its dbGaP-protected version (http://docs.rail.bio/dbgap/), as well as the Intropolis (http://intropolis.rail.bio/) and ReCount (https://jhubiostatistics.shinyapps.io/recount/) resources. I will describe how the Rail-RNA software uses the Amazon Web Services commercial cloud to analyze many samples at once. I will also describe how we used Rail-RNA to study tens of thousands of public RNA-seq accessions, and what those studies tell us about how our knowledge of human splicing diversity has evolved over time. Finally, I will demonstrate how the Intropolis and ReCount resources can be used to pose questions about expression and differential expression across tens of thousands of RNA-seq samples from the Sequence Read Archive and the GTEx project.

Transcript

  1. Scalable analysis of many sequencing datasets at once

    Ben Langmead, Assistant Professor, JHU Computer Science
    langmea@cs.jhu.edu, langmead-lab.org, @BenLangmead
    University of Utah, November 11, 2016
  2. Langmead lab

    Efficiency: Bowtie, Bowtie 2, Lighter, Arioc, HISAT
    Scalability: Rail-RNA, Boiler, Rail-dbGaP
    Resources: Intropolis, recount, Snaptron
  3. Sequence Read Archive (SRA) growth

    [Plot: y-axis in terabases; series: Open access, Total; 1 Pbp marked]
    3 -> 6 Pbp in ~18 months
    https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement
  4. Sequence Read Archive (SRA) growth

    [Plot: y-axis in terabases; series: Open access, Total; 1 Pbp marked]
    3 -> 6 Pbp in ~18 months
    6.15 -> 8.72 Pbp in October
    https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement
  5. MapReduce, Elastic MapReduce, Spot Marketplace

  6. Abhinav Nellore, JHU & OHSU; Jeff Leek, JHU

    Website: http://rail.bio, Paper: http://bit.ly/rail-aa
  7. Spliced RNA-seq aligner for analyzing many samples at once

    • From reads to alignments, coverage vectors & junctions
    • Aggregate across samples to borrow strength and eliminate redundant work
    • Annotation agnostic: let data, not annotation, prune the junction space
    Website: http://rail.bio, Paper: http://bit.ly/rail-aa
  8. Pass 1: align to genome, make junction calls

    Pass 2: re-align to genome with putative junctions
    [Diagram labels: Reads, Ref, Readlets]
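The readlets named on slide 8 are short substrings of a read that can be aligned separately. The sketch below is a minimal illustration, assuming fixed-length overlapping windows. The function name `split_into_readlets` and the `readlet_len` and `stride` values are hypothetical and are not taken from the slides or from Rail-RNA's implementation.

```python
def split_into_readlets(read, readlet_len=25, stride=4):
    """Return (offset, readlet) pairs that tile a read with overlapping windows.

    readlet_len and stride are illustrative values, not Rail-RNA's settings.
    """
    if len(read) <= readlet_len:
        # A short read is its own single readlet.
        return [(0, read)]
    offsets = list(range(0, len(read) - readlet_len + 1, stride))
    last = len(read) - readlet_len
    if offsets[-1] != last:
        # Add a final window so the last bases of the read are covered.
        offsets.append(last)
    return [(off, read[off:off + readlet_len]) for off in offsets]


read = "ACGT" * 10  # toy 40-base read
readlets = split_into_readlets(read)
```

Keeping the offset with each readlet lets a later step place readlet alignments back onto the original read.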
  9. Aggregating across samples adds a dimension to junction call confidence

    [Diagram: Samples 1-5 against Candidate Junctions 1-3]
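One way to picture the cross-sample dimension on slide 9 is to count how many samples support each candidate junction and keep only those seen often enough. This is a sketch under assumptions: the `min_samples` threshold rule, the function names, and the toy coordinates are all hypothetical, and the slide does not state the filter Rail-RNA actually applies.

```python
from collections import defaultdict


def junction_support(calls_by_sample):
    """Map each candidate junction to the set of samples in which it was called."""
    support = defaultdict(set)
    for sample, junctions in calls_by_sample.items():
        for jx in junctions:
            support[jx].add(sample)
    return support


def filter_junctions(calls_by_sample, min_samples=2):
    """Keep junctions called in at least min_samples samples (hypothetical rule)."""
    support = junction_support(calls_by_sample)
    return {jx for jx, samples in support.items() if len(samples) >= min_samples}


# Toy candidate junctions as (chromosome, start, end); the values are made up.
calls = {
    "S1": {("chr1", 100, 200), ("chr1", 300, 400)},
    "S2": {("chr1", 100, 200)},
    "S3": {("chr1", 100, 200), ("chr1", 500, 600)},
}
kept = filter_junctions(calls, min_samples=2)
```

In this toy input only the junction shared by all three samples survives; the two single-sample calls are dropped.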
  10. Rail-RNA: design

    Workflow steps: Preprocess; Aggregate duplicate reads; Split into readlets; Aggregate duplicate readlets; Correlation clustering for readlet alignments; Call splice junction; Merge exon differentials; Compile sample coverages; Write bigWigs; Write normalization factors; Write spliced alignment BAMs; Write junction & indel BEDs
    Alignment steps: Align reads end-to-end to genome; Align readlets to genome; Align readlets to junction co-occurrence index
    Aligner labels, in slide order: Bowtie 2, Bowtie, Bowtie
  11. [Plot: log coverage for Sample 1, Sample 2, Sample 3]

  12. Marginal cost of analyzing 1 additional sample decreases as we add more samples

    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
  13. Website: http://rail.bio, Paper: http://bit.ly/rail-aa, Repo: https://github.com/nellore/rail, Chat: https://gitter.im/nellore/rail

    dbGaP: Website: http://docs.rail.bio/dbgap/, Paper: http://bit.ly/rail_dbgap
    Abhinav Nellore, JHU & OHSU; Jeff Leek, JHU
  14. What is the annotation tradeoff?

    • Some new tools start with annotation; no attempt to discover junctions / isoforms
    • Major projects (GTEx, GEUVADIS, TCGA) quantitate directly from annotated transcripts
    • What is the nature of this tradeoff? How complete are the annotations?
  15. What is the annotation tradeoff?

    • Analyzed ~50,000 human RNA-seq samples with Rail-RNA; about 150 Tbp
    • Repeatable: http://github.com/nellore/runs (exact commands we used to run on AWS)
    • ~$1.40 / sample (compare to sequencing costs)
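The rounded figures on slide 15 imply a few derived quantities. The sketch below works through the arithmetic; the variable names are illustrative, and the results are only as precise as the slide's approximate inputs. Under these figures the total comes to roughly $70,000 and about 3 Gbp per sample.

```python
# Rounded figures quoted on slide 15.
n_samples = 50_000        # ~50,000 human RNA-seq samples
total_bases = 150e12      # about 150 Tbp
cost_per_sample = 1.40    # ~$1.40 per sample

# Derived quantities (approximate, since the inputs are rounded).
total_cost = n_samples * cost_per_sample
bases_per_sample = total_bases / n_samples
cost_per_gbp = total_cost / (total_bases / 1e9)

print(f"Total compute cost: about ${total_cost:,.0f}")
print(f"Mean data per sample: about {bases_per_sample / 1e9:.1f} Gbp")
print(f"Cost per Gbp analyzed: about ${cost_per_gbp:.2f}")
```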
  16. [Figure, panels a-c: junction (jx) count J versus minimum number S of samples in which jx is called; categories: Novel, Alternative donor/acceptor, Exon skip, Fully annotated; values labeled on the plot: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%]

    Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.
    Annotation includes: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
  17. [Plot: x-axis Samples; y-axis % called junctions that are annotated; label: GENCODE v19]

    • For ~2.5% of samples, <50% of junction calls are annotated
    • Median fraction of junction calls that are annotated: ~80%
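The per-sample statistic on slide 17 can be sketched as a set-membership calculation. This is an illustration with made-up data: the `annotated_fraction` helper, the junction coordinates, and the sample names are hypothetical, and a real analysis would load annotated junctions from a source such as GENCODE v19.

```python
from statistics import median


def annotated_fraction(junctions, annotated):
    """Fraction of a sample's junction calls that appear in the annotation."""
    if not junctions:
        return None  # undefined for a sample with no junction calls
    return sum(jx in annotated for jx in junctions) / len(junctions)


# Toy annotation and junction calls as (chromosome, start, end); values are made up.
ANNOTATED = {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 150), ("chr2", 500, 900)}
CALLS = {
    "S1": set(ANNOTATED),
    "S2": {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 150), ("chr3", 10, 20)},
    "S3": {("chr1", 100, 200), ("chr3", 10, 20), ("chr3", 30, 40), ("chr3", 50, 60)},
}

fractions = {s: annotated_fraction(jxs, ANNOTATED) for s, jxs in CALLS.items()}
median_fraction = median(f for f in fractions.values() if f is not None)
# Samples where fewer than half of the junction calls are annotated.
mostly_unannotated = sorted(s for s, f in fractions.items() if f is not None and f < 0.5)
```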
  18. Djebali, Sarah, et al. "Landscape of transcription in human cells." Nature 489.7414 (2012): 101-108.
  19. RNA-seq & annotation

    [Workflow diagram: Spliced alignment; Isoform assembly; Isoform quantitation; Count overlaps w/ annotated features; Differential gene / exon expression. Callouts: (often with annotation), (quasi-, pseudo-)]
  20. RNA-seq: a third way

    Spliced alignment: Rail-RNA (accurate, annotation-agnostic)
    Differentially expressed region finder: derfinder (region-based, annotation-agnostic)
    See also: GEUVADIS analysis in sec 2.4: http://bit.ly/rail-aa
    Collado-Torres L et al. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2016 Sep 29.
  21. RNA-seq: a third way

    See also: GEUVADIS analysis in sec 2.4: http://bit.ly/rail-aa
    Collado-Torres L et al. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2016 Sep 29.
  22. RNA-seq: a third way

    Spliced alignment: Rail-RNA (accurate, annotation-agnostic)
    [Diagram label: bigWigs]
    Differentially expressed region finder: derfinder (region-based, annotation-agnostic)
    See also: GEUVADIS analysis in sec 2.4: http://bit.ly/rail-aa
    Collado-Torres L et al. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res. 2016 Sep 29.
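The region-based, annotation-agnostic idea on slides 20-22 can be illustrated by scanning a coverage vector for contiguous stretches above a cutoff. This is a simplified sketch of the general idea only. derfinder is an R/Bioconductor package and this is not its API; the function name, `cutoff`, and `min_len` are hypothetical.

```python
def expressed_regions(coverage, cutoff, min_len=1):
    """Return half-open (start, end) intervals where coverage exceeds cutoff.

    coverage is a per-base list of numbers, such as values read from a bigWig.
    """
    regions = []
    start = None
    for pos, cov in enumerate(coverage):
        if cov > cutoff:
            if start is None:
                start = pos  # a new region opens here
        elif start is not None:
            if pos - start >= min_len:
                regions.append((start, pos))
            start = None  # the region closed at this base
    # Close a region that runs to the end of the vector.
    if start is not None and len(coverage) - start >= min_len:
        regions.append((start, len(coverage)))
    return regions


coverage = [0, 0, 5, 6, 7, 0, 1, 8, 9]  # toy per-base coverage
regions = expressed_regions(coverage, cutoff=2)
```

No gene annotation enters the calculation: the regions are defined by the data alone and can be compared against annotation afterwards.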
  23. Resources

    • Intropolis (in press, Genome Biology)
      • Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.
    • Snaptron (in preparation)
    • recount (in revision)
      • Collado-Torres L, et al. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv (2016): 068478.
  24. Intropolis

    Site: http://intropolis.rail.bio, preprint: http://j.mp/rail-sra-pre
    Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.
    Abhinav Nellore, JHU & OHSU
  25. Intropolis

    • Discovery of novel splicing events has leveled off
    Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.
  26. Intropolis

    [Plot: PC1 versus PC2; labeled groups: ABRF, SEQC, GEU]
    Nellore A, et al. Human splicing diversity across the Sequence Read Archive. bioRxiv (2016): 038224.
  27. Snaptron

    • Users pose flexible queries about splicing
    • Query planner delegates to appropriate systems (sqlite, tabix, lucene) and indexes (R-tree, B-tree, inverted full text)
    [Architecture diagram, columns Query, Data Store/Index, Output: the Snaptron Query Planner routes the sample filter and region limits to the Lucene/Document Inverted Index, SQLite/B-tree Index and Tabix/R-tree Index; outputs are sample metadata records and junction records, limited by region and filtered by sample. Inverted index example: sample metadata term "Brain" -> samples 1,2,3,6; "Liver" -> samples 4,6,9,11]
    Chris Wilks
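The inverted index example on slide 27 ("Brain" -> 1,2,3,6; "Liver" -> 4,6,9,11) can be combined with a region filter in a few lines. This is a toy sketch, not Snaptron's implementation or API: plain Python structures stand in for Lucene, Tabix and SQLite, and the junction records and the `query` function are hypothetical.

```python
# Inverted index from the slide's example: metadata term -> sample IDs.
INVERTED_INDEX = {"brain": {1, 2, 3, 6}, "liver": {4, 6, 9, 11}}

# Hypothetical junction records: (chrom, start, end, samples containing the junction).
JUNCTIONS = [
    ("chr1", 100, 200, {1, 2, 4}),
    ("chr1", 150, 900, {9, 11}),
    ("chr2", 100, 200, {3, 6}),
]


def samples_matching(terms):
    """Samples whose metadata contains every term; None means no sample filter."""
    sets = [INVERTED_INDEX.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else None


def query(chrom, start, end, terms=()):
    """Return junctions inside [start, end] on chrom, restricted to matching samples."""
    allowed = samples_matching(terms)
    hits = []
    for jchrom, jstart, jend, jsamples in JUNCTIONS:
        if jchrom != chrom or jstart < start or jend > end:
            continue  # region filter: the junction must lie within the interval
        kept = jsamples if allowed is None else jsamples & allowed
        if kept:
            hits.append((jchrom, jstart, jend, sorted(kept)))
    return hits


brain_hits = query("chr1", 0, 1000, terms=["Brain"])
```

Resolving the sample filter through the inverted index first means the region scan only has to intersect each junction's sample set with a precomputed set of allowed samples.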
  28. Snaptron

    • Web service and UI currently available, preprint soon
    • Example: two unannotated junctions on either side of an exonized repetitive element discovered by colleague Sarven Sabunciyan
    [Figure with panels B and C]
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.
  29. recount

    • Provides expression summaries at levels of genes, junctions, exons and coverage vectors
    [Diagram labels: Junctions, Genes, Coverage, Exons]
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.
  30. recount: expression data for ~70,000 human samples

    [Diagram: samples by expression estimates (gene, exon, junctions, ERs); samples by phenotypes (?)]
    GTEx N=9,962; TCGA N=10,327; SRA N=49,848
    Questions:
      1. Have 'novel' isoforms ever been seen previously?
      2. What regions of the human genome are transcribed in humans?
    Expanded questions:
      1. Have 'novel' isoforms ever been seen previously? In what tissue? At what levels?
      2. What regions of the human genome are transcribed in humans and in what tissues?
      3. Do the same genes escape X inactivation across all tissues?
      4. What expression changes occur as we age?
      5. …
    Biological phenotypes: Sex, Age, Tissue
    recount: A large-scale resource of analysis-ready RNA-seq expression data. Leonardo Collado-Torres, Abhinav Nellore, Kai Kammers, Shannon E Ellis, Margaret A Taub, Kasper D Hansen, Andrew E Jaffe, Ben Langmead, Jeffrey Leek
    Slide courtesy of Shannon Ellis
    Abhinav Nellore, Leo Collado Torres
  31. recount studies

    • Tissue meta-analysis: compare colon & blood tissues from SRA, do the same from GTEx, compare differential expression rankings
    • Compare our gene expression measurements with those from the GTEx project; high concordance
    • Compare differential expression results when analysis is performed at the level of gene, exon, junction or DER
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.
  32. recount studies https://rpubs.com/crazyhottommy/heatmap_demystified

  33. recount

    • Shiny-app front-end: https://jhubiostatistics.shinyapps.io/recount/
    • SciServer Compute lets users work with locally-hosted data in a Jupyter notebook: http://compute.sciserver.org/dashboard/
    • Bioconductor 3.4 package: https://www.bioconductor.org/packages/recount/
    • Preprint: http://bit.ly/recount_pre
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv doi: 10.1101/068478.
  34. Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Alyssa Frazee, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub

    • NIH R01GM118568
    • NSF CAREER IIS-1349906
    • Sloan Research Fellowship
    • IDIES Seed Funding program
    • Amazon Web Services
    langmead-lab.org, @BenLangmead
    Thank you: IDIES Seed funding, SciServer, SciServer Compute
    Also for DERfinder: Rafa Irizarry, Sarven Sabunciyan, Mike Love