Tales of scale

The Sequence Read Archive now contains over a million accessions, including over 200K RNA-seq runs for mouse and over 160K for human. Large-scale projects like GTEx, ICGC and TOPMed are major contributors, and huge projects on the horizon, such as the All of Us and Million Veteran programs, will further accelerate this growth. These archives are potential gold mines for researchers, but they are not organized for everyday use by scientists. The situation resembles the early days of the World Wide Web, before search engines made the web easy to use.

Using the archive as a motivation, I will convey some insights -- gleaned from both successes and failures -- about how we as computational researchers can work toward the goal of making large public datasets easy to use. I will discuss some challenges that come with (a) working at scale, (b) using commercial (and non-commercial) cloud computing as a platform for this work, (c) pooling and borrowing strength across datasets, and (d) making public data available for use by everyday researchers. I will highlight ways in which we have learned how to make our tools better suited to how scientists work. This is joint work with Abhinav Nellore, Chris Wilks, Jonathan Ling, Luigi Marchionni, Jeff Leek, Kasper Hansen, Andrew Jaffe and others.

Ben Langmead

June 20, 2018

Transcript

  1. Outline • Trends in biotech vs. trends in computing •

    Public data • Summarizing, indexing, searching • Case study: rod photoreceptors • Wild speculation, pompous pontification, humiliating mea culpas
  2. "Between 2008 and 2013, the performance of a single DNA

    sequencer increased about three- to fivefold per year. Using Moore’s Law as a benchmark... Sequencers are improving at a faster rate than computers. Something must be done now, or else we’ll need to put vital research on hold while the necessary computational techniques catch up—or are invented."
  3. Moore's law & sequencing cost ½ every 24 months ½

    every 18 months Source: https://www.genome.gov/27541954/dna-sequencing-costs-data/
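To set the two halving rates on this slide beside the "three- to fivefold per year" figure in the quote, it helps to convert each halving period into a per-year factor. A minimal sketch: the function name is illustrative, and because the slide text does not make clear which halving period belongs to which curve, both are printed without a label.

```python
def annual_factor(halving_months: float) -> float:
    """Per-year improvement factor implied by halving cost every `halving_months`."""
    return 2 ** (12 / halving_months)


for months in (18, 24):
    # Halving every 18 or 24 months corresponds to roughly 1.4x to 1.6x per year.
    print(f"halving every {months} months -> {annual_factor(months):.2f}x per year")
```

Either rate is well below threefold per year, which is the gap the quote is pointing at.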
  4. Who said that? "Between 2008 and 2013, the performance of

    a single DNA sequencer increased about three- to fivefold per year. Using Moore’s Law as a benchmark... Sequencers are improving at a faster rate than computers. Something must be done now, or else we’ll need to put vital research on hold while the necessary computational techniques catch up—or are invented." Illustration: Carl DeTorres
  5. Pontification • Within 2nd gen era, there's no great disparity

    between sequencing tech & Moore's law • Computing trends just as worthy of attention and study as sequencing trends • Must import more computational expertise, e.g. in HPC and CPU architecture, into genomics • 2nd gen era has proceeded largely without cloud computing "in the loop," but clouds fill other roles nicely (more later)
  6. Sequence Read Archive (SRA) growth

    [Figure: SRA size in terabases over time, total and open access, with the 1 Pbp and 10 Pbp levels marked. Annotations: 8 -> 16 Pbp in ~18 months; 4 -> 8 Pbp in ~12 months]
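The growth annotations on this slide (8 -> 16 Pbp in ~18 months, 4 -> 8 Pbp in ~12 months) can be turned into a doubling time and a rough projection. This is a sketch under the assumption that growth stays exponential at the observed rate; the function names and the 36-month horizon are illustrative, not taken from the talk.

```python
import math


def doubling_time_months(start: float, end: float, months: float) -> float:
    """Doubling time implied by growing from `start` to `end` in `months`."""
    return months * math.log(2) / math.log(end / start)


def project(size: float, months_ahead: float, doubling_months: float) -> float:
    """Size after `months_ahead`, assuming a constant doubling time."""
    return size * 2 ** (months_ahead / doubling_months)


t = doubling_time_months(8, 16, 18)  # the 8 -> 16 Pbp annotation
print(f"doubling time: {t:.1f} months")
print(f"16 Pbp projected 36 months ahead: {project(16, 36, t):.0f} Pbp")
```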
  7. "An index is a great leveler" (GB Shaw)

    "Even a summary would be an improvement" (not GB Shaw)
  8. Public summaries of sequencing data Langmead B, Nellore A. Cloud

    computing for genomic data analysis and collaboration. Nat Rev Genet. 2018 Apr;19(4):208-219. doi: 10.1038/nrg.2017.113.
  9. Indexing raw sequencing data

    • Mantis: Ferdman, M., Johnson, R., & Patro, R. Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index. In Research in Computational Molecular Biology (p. 271). Springer. (Image from Mantis paper) • BIGSI: Bradley, P., den Bakker, H., Rocha, E., McVean, G., & Iqbal, Z. (2017). Real-time search of all bacterial and viral genomic data. bioRxiv, 234955. • Sequence Bloom Trees: Solomon B, Kingsford C. Fast search of thousands of short-read sequencing experiments. Nat Biotechnol. 2016 Mar;34(3):300-2. Solomon B, Kingsford C. Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees. J Comput Biol. 2018 Mar 12. (Image from Split SBT paper) Sun C, Harris RS, Chikhi R, Medvedev P. AllSome Sequence Bloom Trees. J Comput Biol. 2018 May;25(5):467-479. • 1000 Genomes FM Index: Dolle DD, Liu Z, Cotten M, Simpson JT, Iqbal Z, Durbin R, McCarthy SA, Keane TM. Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes. Genome Res. 2017 Feb;27(2):300-309.
  10. Past work

    • Crossbow: Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL: Searching for SNPs with cloud computing. Genome Biol 2009, 10(11):R134. http://j.mp/crossbow_proj, http://j.mp/crossbow_repo • Myrna: Langmead B, Hansen KD, Leek JT. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 2010;11(8):R83. http://j.mp/myrna_proj, http://j.mp/myrna_repo • ReCount: Frazee AC, Langmead B, Leek JT. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics. 2011 Nov 16;12:449. http://j.mp/recount_proj
  11. Today: a search engine for RNA-seq

    • Snaptron: index & query engine w/ REST API; snaptron.cs.jhu.edu; doi:10.1093/bioinformatics/btx547 • Clean summaries of data, metadata, packaged as R objects; jhubiostatistics.shinyapps.io/recount/; doi:10.1038/nbt.3838 • Scalable, cloud-based spliced alignment of archived RNA-seq datasets; rail.bio; doi:10.1093/bioinformatics/btw575
  12. Abhinav Nellore OHSU Jeff Leek, JHU http://rail.bio Nellore A, Collado-Torres

    L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4. Image by Rgocs
  13. Spliced RNA-seq aligner for analyzing many samples at once •

    Aggregate across samples to borrow strength and eliminate redundant alignment work http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
  14. Spliced RNA-seq aligner for analyzing many samples at once •

    Aggregate across samples to borrow strength and eliminate redundant alignment work • Let data prune false junction calls, not annotation http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
  15. Spliced RNA-seq aligner for analyzing many samples at once •

    Aggregate across samples to borrow strength and eliminate redundant alignment work • Let data prune false junction calls, not annotation • Concise outputs: junctions, junction evidence, coverage vectors; no alignments, unless asked for http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
  16. Spliced RNA-seq aligner for analyzing many samples at once •

    Aggregate across samples to borrow strength and eliminate redundant alignment work • Let data prune false junction calls, not annotation • Concise outputs: junctions, junction evidence, coverage vectors; no alignments, unless asked for • Runs easily on commercial AWS cloud, other clusters http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
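The bullet about eliminating redundant alignment work can be illustrated with a toy: pool the reads from all samples, align each distinct sequence once, and hand the result back to every sample that contained it. This is only a sketch of the idea. The `align` function is a hypothetical stand-in for an expensive aligner call, and nothing here reflects Rail-RNA's actual code or interface.

```python
from collections import defaultdict


def align(seq):
    """Hypothetical stand-in for an expensive aligner call."""
    return f"aln({seq})"


def align_pooled(samples):
    """Align each distinct read sequence once across all samples."""
    # Map each distinct sequence to the samples containing it, with read counts.
    owners = defaultdict(lambda: defaultdict(int))
    for sample, reads in samples.items():
        for seq in reads:
            owners[seq][sample] += 1

    results = {sample: [] for sample in samples}
    calls = 0
    for seq, by_sample in owners.items():
        alignment = align(seq)  # one call per distinct sequence
        calls += 1
        for sample, count in by_sample.items():
            results[sample].append((alignment, count))
    return results, calls


# Made-up reads: five reads in total, three distinct sequences.
samples = {"s1": ["ACGT", "ACGT", "TTGA"], "s2": ["ACGT", "GGCC"]}
results, calls = align_pooled(samples)
print(calls, "aligner calls for", sum(len(r) for r in samples.values()), "reads")
```

The saving grows with the number of samples, since highly expressed transcripts produce the same read sequences again and again.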
  17. dbGaP http://docs.rail.bio/dbgap/ Nellore A, Wilks C, Hansen KD, Leek JT,

    Langmead B. Rail-dbGaP: analyzing dbGaP-protected data in the cloud with Amazon Elastic MapReduce. Bioinformatics. 2016 Aug 15;32(16):2551-3.
  18. Working toward recount2

    • Analyzed ~21,500 human RNA-seq samples with Rail-RNA; about 62 Tbp • Commands we used to run on AWS: http://github.com/nellore/runs • ~$0.72 / sample (compare to sequencing costs) • http://intropolis.rail.bio Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
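The per-sample figure on this slide implies a total bill and a cost per terabase. A back-of-the-envelope sketch using only the approximate numbers stated on the slide (~21,500 samples, ~$0.72 per sample, about 62 Tbp); the derived totals are estimates and do not appear in the talk.

```python
samples = 21_500        # approximate sample count from the slide
cost_per_sample = 0.72  # approximate USD per sample from the slide
terabases = 62          # approximate Tbp analyzed, from the slide

total_cost = samples * cost_per_sample
cost_per_tbp = total_cost / terabases

print(f"total: about ${total_cost:,.0f}")
print(f"per terabase: about ${cost_per_tbp:,.0f}")
```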
  19. [Figure, panels a to c: junction (jx) count J versus minimum number S of samples in which the jx is called; junctions are classed as fully annotated, exon skip, alternative donor/acceptor, or novel. Labeled values: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%]

    Annotation includes: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega http://intropolis.rail.bio Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
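The figure on this slide plots the junction count J against the minimum number S of samples in which a junction is called. A curve of that kind can be computed as below. This is a sketch with made-up junction names and sample counts; it is not the authors' analysis code.

```python
def junctions_at_least(sample_counts, thresholds):
    """For each threshold S, count junctions called in at least S samples."""
    return {
        s: sum(1 for count in sample_counts.values() if count >= s)
        for s in thresholds
    }


# Hypothetical junction -> number of samples in which it was called.
sample_counts = {"jx1": 1, "jx2": 3, "jx3": 1200, "jx4": 15000, "jx5": 40}

curve = junctions_at_least(sample_counts, [1, 10, 1000, 10000])
for s, j in curve.items():
    print(f"S >= {s}: J = {j}")
```

Raising S removes junctions seen in only a few samples, so J can only stay flat or fall as S grows.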
  20. recount2 • >50K human RNA-seq samples from SRA (open) •

    >10K human RNA-seq samples spanning cancer types in The Cancer Genome Atlas (dbGaP) Image: https://www.sevenbridges.com/welcome-to-the-cancer-genomics-cloud-2/ • >10K human RNA-seq samples from the Genotype-Tissue Expression (GTEx) project (dbGaP) • In total, ~4.4 trillion reads, 100s of terabases Image: doi:10.1038/ng.2653 Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.
  21. recount2 Junctions Genes Coverage Exons Summarized at levels of genes,

    exons, junctions, and coverage vectors Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.
  22. Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA,

    Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321. https://jhubiostatistics.shinyapps.io/recount/ recount2
  23. Snaptron Query planner delegates query components to appropriate systems (sqlite,

    tabix, lucene) and indexes (R-tree, B-tree, Lucene inverted text index) Chris Wilks [Figure: a query enters the Snaptron query planner, which consults the data stores/indexes (Tabix/R-tree index, Lucene/inverted document index, SQLite/B-tree index) and produces output. Region limits and sample filters yield junction records and sample metadata records. In the example inverted index, sample metadata term "Brain" maps to samples 1,2,3,6 and "Liver" to samples 4,6,9,11] Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
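The delegation idea on this slide can be mimicked in a few lines: one component answers "which junctions overlap this region?", another answers "which samples carry this metadata term?", and the planner combines the two. A toy sketch only. Plain Python lists and dicts stand in for the Tabix/R-tree and Lucene indexes, the junction records are made up, and only the "Brain" and "Liver" postings mirror the slide.

```python
# Hypothetical junction records with the samples that support each one.
JUNCTIONS = [
    {"chrom": "chr1", "start": 100, "end": 200, "samples": {1, 2}},
    {"chrom": "chr1", "start": 150, "end": 400, "samples": {4, 6}},
    {"chrom": "chr1", "start": 900, "end": 950, "samples": {3}},
]

# Inverted index from metadata term to sample IDs (values as on the slide).
TERM_INDEX = {"Brain": {1, 2, 3, 6}, "Liver": {4, 6, 9, 11}}


def region_query(chrom, start, end):
    """Stand-in for the interval index: junctions overlapping the region."""
    return [
        j for j in JUNCTIONS
        if j["chrom"] == chrom and j["start"] <= end and j["end"] >= start
    ]


def sample_query(term):
    """Stand-in for the inverted text index: samples annotated with the term."""
    return TERM_INDEX.get(term, set())


def plan(chrom, start, end, term=None):
    """Send each query component to its own index, then combine the results."""
    hits = region_query(chrom, start, end)
    if term is None:
        return hits
    keep = sample_query(term)
    filtered = []
    for j in hits:
        shared = j["samples"] & keep
        if shared:  # drop junctions with no support in the selected samples
            filtered.append({**j, "samples": shared})
    return filtered


for record in plan("chr1", 1, 500, term="Brain"):
    print(record["start"], record["end"], sorted(record["samples"]))
```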
  24. Snaptron Provides command-line tool and REST API for querying junctions

    (& more summaries coming soon) Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
  25. Snaptron • For each junction in gene ABCD3, how many

    reads supported it in each of the 50K SRA samples? • What is a particular junction's tissue specificity in the GTEx dataset? • In which samples is splicing pattern A overrepresented relative to splicing pattern B? • (A/B might relate to alt splicing, fusions, etc) Examples: http://snaptron.cs.jhu.edu Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
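The first question on this slide could be posed to the REST API as a region query. The sketch below only builds the request URL and makes no network call. The host comes from the slide, but the `srav2` compilation name, the `/snaptron` endpoint, and the `regions` and `rfilter` parameters are written from memory of the Snaptron user guide, so treat them as assumptions to check against the current documentation.

```python
from urllib.parse import urlencode

BASE = "http://snaptron.cs.jhu.edu"  # host named on the slide


def snaptron_url(compilation, region, min_samples=None):
    """Build a junction query URL (endpoint and parameter names are assumed)."""
    params = {"regions": region}
    if min_samples is not None:
        # Assumed filter syntax: keep junctions found in at least `min_samples` samples.
        params["rfilter"] = f"samples_count>:{min_samples}"
    return f"{BASE}/{compilation}/snaptron?{urlencode(params)}"


print(snaptron_url("srav2", "ABCD3"))
print(snaptron_url("srav2", "ABCD3", min_samples=100))
```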
  26. Mini Snaptron case study • Goldstein et al searched for

    novel cassette exons in Illumina BodyMap 2.0 • Identified 249 cassette exons within known genes but not overlapping any annotated exon • Validated 216 out of 249 in independent sample via paired-end RNA-seq (2 x 250 bp) Goldstein LD, Cao Y, Pau G, Lawrence M, Wu TD, Seshagiri S, Gentleman R. Prediction and Quantification of Splice Events from RNA-Seq Data. PLoS One. 2016 May 24;11(5):e0156132.
  27. Mini Snaptron case study Wilks C, Gaddipati P, Nellore A,

    Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547. A. ABCD3 B. KMT2E 3 1 2 1 2 3 C. ALKATI 1 2 3 4 • Snaptron immediately recapitulates ABCD3 exon (above) • Of the 249 novel exons, 236 (94.8%) occurred in GTEx • Used shared sample count (SSC) query to measure # samples the novel exons occurred in...
  28. Mini Snaptron case study

    [Figure: shared sample count (SSC), 0 to 20,000, by data compilation (GTEx, SRAv2) and validation status (failed, passed)] • Exons validated by Goldstein et al had higher SSC versus exons failing validation • SSC (prevalence) is related to how "real" they are Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
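The shared sample count can be read as a set intersection. The assumption here is that a cassette exon is represented by its two flanking inclusion junctions and that its SSC is the number of samples in which both are observed; the slides do not spell out the definition, and the sample IDs below are made up.

```python
def shared_sample_count(jx_samples, jx_a, jx_b):
    """Number of samples in which both junctions were observed."""
    return len(jx_samples[jx_a] & jx_samples[jx_b])


# Hypothetical sample IDs supporting each junction that flanks a novel exon.
jx_samples = {
    "upstream": {1, 2, 3, 5, 8},
    "downstream": {2, 3, 8, 13},
}

ssc = shared_sample_count(jx_samples, "upstream", "downstream")
print(ssc)
```

Requiring both junctions in the same sample is stricter than counting either junction alone, which is one reason a high SSC is suggestive of a real exon.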
  29. Snaptron case study: rod photoreceptors Collaborator Jonathan Ling studies how

    splicing factors affect splicing of certain cryptic cassette exons • cryptic: usually unannotated, usually unconserved Past work of Jonathan's showed that splicing factor protein TDP-43 suppresses splicing of non-conserved cryptic exons Implicated in ALS, frontotemporal dementia (FTD), Alzheimer’s Jonathan Ling Can we rapidly screen for regulatory relationships like those between TDP-43 and its cryptic-exon targets?
  30. Rod photoreceptors Ling J, Wilks C, Charles R, Blackshaw S,

    & Langmead, B. "Exploratory analysis of alternative splicing in tens of thousands of bulk and single-cell samples" in preparation "Supermouse" Rods have characteristic pattern of PSI levels
  31. Rod photoreceptors Ling J, Wilks C, Charles R, Blackshaw S,

    & Langmead, B. "Exploratory analysis of alternative splicing in tens of thousands of bulk and single-cell samples" in preparation PSIs can reveal specific signatures for cell types that are not visible at the gene level
  32. Rod photoreceptors Ling J, Wilks C, Charles R, Blackshaw S,

    & Langmead, B. "Exploratory analysis of alternative splicing in tens of thousands of bulk and single-cell samples" in preparation Certain cassettes have high PSI only in rods
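PSI (percent spliced in) for a cassette exon is commonly estimated from junction read counts. The sketch below uses a widespread convention: average the two inclusion junctions, then divide by inclusion plus exclusion. The slides do not state which formula the study uses, so this is an assumed definition, and the read counts are made up.

```python
def psi(inc_upstream, inc_downstream, exclusion):
    """Percent spliced in from junction read counts, or None with no reads."""
    inclusion = (inc_upstream + inc_downstream) / 2
    total = inclusion + exclusion
    if total == 0:
        return None  # no evidence either way
    return inclusion / total


# Hypothetical counts: 30 and 50 reads on the inclusion junctions, 10 skipping.
print(psi(30, 50, 10))
```

Because PSI is a ratio within one gene, it can change between cell types even when the gene's overall expression level does not.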
  33. Rod photoreceptors Ling J, Wilks C, Charles R, Blackshaw S,

    & Langmead, B. "Exploratory analysis of alternative splicing in tens of thousands of bulk and single-cell samples" in preparation Certain splicing factors are expressed specifically in rods -- could they drive rod-specific exon splicing?
  34. Rod photoreceptors Ling J, Wilks C, Charles R, Blackshaw S,

    & Langmead, B. "Exploratory analysis of alternative splicing in tens of thousands of bulk and single-cell samples" in preparation Most of these are unannotated!
  35. Future: cloud computing Cloud computing may not be "in the

    loop" for most data-generating labs, but it's a natural fit for reanalyzing public data and for far-flung collaborations [Image: first page of the review cited at the end of this slide; legible text follows] Cloud computing for genomic data analysis and collaboration. Ben Langmead and Abhinav Nellore. Abstract | Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data. Excerpt from the body text: Next-generation sequencing (NGS) technologies have been improving rapidly and have become the workhorse technology for studying nucleic acids. NGS platforms work by collecting information on a large array of polymerase reactions working in parallel, up to billions at a time inside a single sequencer. The speed and decreasing cost of NGS have led to the rapid accumulation of raw sequencing data (sequencing reads), used in published studies, in public archives such as [...] gnomAD now spans over 120,000 exomes and over 15,000 whole genomes. ICGC encompasses over 70 subprojects targeting distinct cancer types, which are conducted in more than a dozen countries and have already collected samples from more than 20,000 donors. Aligned sequencing reads for ICGC require over 1 petabyte (PB; that is, a million GB) of storage. The TOPMed programme, which plans to sequence more than 120,000 genomes [...] Langmead B, Nellore A. Cloud computing for genomic data analysis and collaboration. Nature Reviews Genetics. 2018 Apr;19(4):208-219.
  36. Future: public data "Queryability" means different things for different assays,

    scientific questions • Beyond targeted queries, users want bulk screens • Boiling cauldron of 10,000s of samples aside, users want subsets with trustworthy metadata and particular properties • E.g. knocked-down splicing factor, carefully purified tissue, disease X
  37. Future: public data Rod photoreceptor study involved >90K public run

    accessions • 3 out of the 4 figures I showed used only public data • Desire: for querying and using public data to be everyday activity in bio research [Image: WORLD VIEW column (A personal take on events), "Don’t let useful data go to waste. Researchers must seek out others’ deposited biological sequences in community databases, urges Franziska Denk." MEGHNA ABRAHAM. Legible excerpts follow] "One of the best ways for a neuroscientist like me to keep up to date with what colleagues are working on is to attend conferences. But on recent trips I have noticed a problem. Too few researchers are consulting and using publicly available data — my own included. What is going on? Massive amounts of biological information are being accumulated using high-throughput sequencing techniques. Many scientists [...] discrepancy, and propose a biologically valid reason for it. Why are so many bench biologists overlooking this wealth of cell-type-specific expression data? My hunch is there are two reasons. First, researchers underestimate how many of these data have been published over the past few years because they are being generated across so many different fields. Second, they are wary of the data. Because you need bioinformatics [...]"
  38. Future: data science

    [Spectrum: single accession or study <-> all of SRA] • With public data we are quickly confronted by issues like technical confounding and missing/incorrect metadata • What kinds of questions can be answered robustly at what points on this spectrum? • Can we "fix" metadata? Ellis SE, Collado-Torres L, Jaffe A, Leek JT. Improving the value of public RNA-seq expression data by phenotype prediction. Nucleic Acids Res. 2018 May 18;46(9):e54.
  39. Thank you: Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Leo Collado-Torres,

    Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub, Jonathan Ling, Seth Blackshaw • NIH R01GM118568 • NSF CAREER IIS-1349906 • Sloan Research Fellowship • IDIES Seed Funding program • Amazon Web Services • NIH R01GM105705 (Leek) • SciServer, SciServer Compute • langmead-lab.org, @BenLangmead