Using huge public sequencing datasets to answer scientific questions

The Sequence Read Archive now contains over a million accessions, including over 200K RNA-seq runs for mouse and over 160K for human. Large-scale projects like GTEx, ICGC and TOPMed are major contributors, and huge projects on the horizon, such as the All of Us and Million Veterans programs, will further accelerate this growth. These archives are potential gold mines for researchers, but they are not organized for everyday use by scientists. The situation resembles the early days of the World Wide Web, before search engines made the web easy to use.

I will describe our progress toward the goal of making it easy for researchers to ask scientific questions about public RNA-seq datasets. I will highlight the Rail/recount2 system for ingesting and summarizing public data and the Snaptron service that makes it queryable. Finally, I will discuss scientific collaborations with neuroscientists and cancer researchers where we applied these tools to perform both targeted queries and large-scale screens. I will highlight ways in which we are learning to make our tools better suited to how scientists work. This is joint work with Abhinav Nellore, Chris Wilks, Jonathan Ling, Luigi Marchionni, Jeff Leek, Kasper Hansen, Andrew Jaffe and others.

Ben Langmead

May 25, 2018

Transcript

  1. Ben Langmead, Assistant Professor, JHU Computer Science. langmead-lab.org, @BenLangmead

    BME seminar, Oregon Health & Science University, May 25, 2018. Using huge public sequencing datasets to answer scientific questions
  2. Lab goals: to make high-throughput life science data as usable as possible for scientific labs, especially small ones

    • Efficient. Software: Bowtie 1&2, Arioc, Flash-dans. Topics: applied algorithms, text indexing, sketching, thread scaling • Scalable. Software: Myrna, Rail-RNA, recount2, Snaptron. Topics: parallel and high-performance computing, cloud computing, indexing • Interpretable. Software: Qtip, FORGe. Topics: modeling mapping quality, modeling graph-genome variants, addressing biases
  3. Sequence Read Archive (SRA) growth

    [Figure: SRA size in terabases over time, open-access and total, with 1 Pbp and 10 Pbp marked.] 4 -> 8 Pbp in ~12 months; 8 -> 16 Pbp in ~18 months
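To make the doubling times on this slide concrete, here is a minimal sketch that turns a doubling time into a projected archive size. The function name and the assumption of smooth exponential growth are illustrative and do not come from the slide; the last call is a purely hypothetical extrapolation.

```python
def projected_size(start_pbp, doubling_months, elapsed_months):
    """Projected size in Pbp, assuming smooth exponential growth (an assumption)."""
    return start_pbp * 2 ** (elapsed_months / doubling_months)

# Doubling times quoted on the slide: ~12 months (4 -> 8 Pbp), ~18 months (8 -> 16 Pbp).
print(projected_size(4, 12, 12))    # one doubling at the faster rate
print(projected_size(8, 18, 18))    # one doubling at the slower rate
print(projected_size(16, 18, 36))   # hypothetical: two further doublings at the slower rate
```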
  4. Search engine for RNA-seq Snaptron Index & query engine w/

    REST API snaptron.cs.jhu.edu doi:10.1093/bioinformatics/btx547 Clean summaries of data, metadata, packaged as R objects jhubiostatistics.shinyapps.io/recount/ doi:10.1038/nbt.3838 Scalable, cloud-based spliced alignment of archived RNA-seq datasets rail.bio doi:10.1093/bioinformatics/btw575
  5. Themes • Cloud computing is a natural fit for public

    data • Scalable software benefits from big resources & many samples • Strategically ignoring gene annotations can yield clearer results • Queryability is in the eye of the beholder
  6. Search engine for RNA-seq Snaptron Index & query engine w/

    REST API snaptron.cs.jhu.edu doi:10.1093/bioinformatics/btx547 Clean summaries of data, metadata, packaged as R objects jhubiostatistics.shinyapps.io/recount/ doi:10.1038/nbt.3838 Scalable, cloud-based spliced alignment of archived RNA-seq datasets rail.bio doi:10.1093/bioinformatics/btw575
  7. Abhinav Nellore OHSU Jeff Leek, JHU http://rail.bio Nellore A, Collado-Torres

    L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4. Image by Rgocs
  8. Spliced RNA-seq aligner for analyzing many samples at once •

    Aggregate across samples to borrow strength and eliminate redundant alignment work http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
  9. Spliced RNA-seq aligner for analyzing many samples at once •

    Aggregate across samples to borrow strength and eliminate redundant alignment work • Let data prune false junction calls, not annotation http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
  10. Spliced RNA-seq aligner for analyzing many samples at once •

    Aggregate across samples to borrow strength and eliminate redundant alignment work • Let data prune false junction calls, not annotation • Concise outputs: junctions, junction evidence, coverage vectors; no alignments, unless asked for http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
  11. Spliced RNA-seq aligner for analyzing many samples at once •

    Aggregate across samples to borrow strength and eliminate redundant alignment work • Let data prune false junction calls, not annotation • Concise outputs: junctions, junction evidence, coverage vectors; no alignments, unless asked for • Runs easily on commercial AWS cloud, other clusters http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
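The idea of letting data, rather than annotation, prune false junction calls can be sketched as a cross-sample filter. This is a toy illustration only: the function name, the thresholds and the input layout are assumptions, not Rail-RNA's actual code or parameters.

```python
def filter_junctions(counts, min_sample_frac=0.05, min_reads_any=5):
    """Keep a junction if it is seen in enough samples, or is strongly
    supported in at least one sample.

    counts maps junction -> {sample: read count}.
    Both thresholds are hypothetical values chosen for illustration.
    """
    samples = {s for per_sample in counts.values() for s in per_sample}
    n_samples = len(samples)
    kept = []
    for jx, per_sample in counts.items():
        n_with = sum(1 for reads in per_sample.values() if reads > 0)
        frac = n_with / n_samples if n_samples else 0.0
        best = max(per_sample.values(), default=0)
        if frac >= min_sample_frac or best >= min_reads_any:
            kept.append(jx)
    return sorted(kept)

# Toy data: three samples, three candidate junctions.
toy = {
    "chr1:100-200": {"s1": 3, "s2": 2, "s3": 4},  # seen in every sample
    "chr1:300-400": {"s1": 9},                    # one sample, strong support
    "chr1:500-600": {"s2": 1},                    # one sample, weak support
}
print(filter_junctions(toy, min_sample_frac=0.5, min_reads_any=5))
```

No annotation is consulted anywhere in the filter; a junction survives purely on the evidence pooled across samples.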
  12. dbGaP http://docs.rail.bio/dbgap/ Nellore A, Wilks C, Hansen KD, Leek JT,

    Langmead B. Rail-dbGaP: analyzing dbGaP-protected data in the cloud with Amazon Elastic MapReduce. Bioinformatics. 2016 Aug 15;32(16):2551-3.
  13. Working toward recount2 • Analyzed ~21,500 human RNA-seq samples with

    Rail-RNA; about 62 Tbp • Repeatable: http://github.com/nellore/runs (exact commands we used to run on AWS) • ~$0.72 / sample (compare to sequencing costs) [Figure: junctions (jxs) by samples.] http://intropolis.rail.bio Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
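A back-of-the-envelope check of the figures on this slide, using only the approximate numbers quoted there (~21,500 samples, ~62 Tbp, ~$0.72 per sample). The helper names are illustrative.

```python
def total_cost(n_samples, cost_per_sample):
    """Total compute cost in dollars."""
    return n_samples * cost_per_sample

def bases_per_sample(total_bases, n_samples):
    """Average number of bases per sample."""
    return total_bases / n_samples

n = 21_500
print(f"Total cost: ${total_cost(n, 0.72):,.0f}")
print(f"Average per sample: {bases_per_sample(62e12, n) / 1e9:.1f} Gbp")
```

Because the inputs are rounded, the results are order-of-magnitude estimates rather than exact totals.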
  14. [Figure, panels a-c: junction (jx) count J versus minimum number S of samples in which a jx is called, split into four categories: fully annotated, exon skip, alternative donor/acceptor, novel.]

    Annotation includes: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega http://intropolis.rail.bio Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
  15. recount2 • >50K human RNA-seq samples from SRA (open) •

    >10K human RNA-seq samples spanning cancer types in The Cancer Genome Atlas (dbGaP) Image: https://www.sevenbridges.com/welcome-to-the-cancer-genomics-cloud-2/ • >10K human RNA-seq samples from the Genotype-Tissue Expression (GTEx) project (dbGaP) • In total, ~4.4 trillion reads, 100s of terabases Image: doi:10.1038/ng.2653 Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.
  16. recount2 Junctions Genes Coverage Exons Summarized at levels of genes,

    exons, junctions, and coverage vectors Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.
  17. Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA,

    Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321. https://jhubiostatistics.shinyapps.io/recount/ recount2
  18. recount2 Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub

    MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321. http://bit.ly/recount_sciserver
  19. Snaptron Query planner delegates query components to appropriate systems (sqlite,

    tabix, lucene) and indexes (R-tree, B-tree, Lucene inverted text index) Chris Wilks [Figure: a query passes through the Snaptron query planner to three data stores/indexes (Tabix/R-tree index, SQLite/B-tree index, Lucene/inverted document index); junction records are limited by region and filter, sample metadata terms map to sample IDs (e.g. "Brain": 1,2,3,6; "Liver": 4,6,9,11), and a sample filter produces the output.] Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
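The delegation idea on this slide can be sketched with in-memory stand-ins: a region lookup over junction records, an inverted index over sample metadata terms, and a sample filter that joins the two. Everything here (data, names, the linear region scan) is a toy assumption; the real system uses Tabix, SQLite and Lucene, as the slide says.

```python
from collections import defaultdict

# Hypothetical junction records: (chrom, start, end, {sample_id: reads}).
JUNCTIONS = [
    ("chr1", 100, 200, {1: 5, 4: 2}),
    ("chr1", 150, 400, {2: 7, 6: 1}),
    ("chr2", 100, 200, {3: 9}),
]
# Hypothetical free-text sample metadata.
SAMPLE_TEXT = {1: "brain cortex", 2: "brain", 3: "brain", 4: "liver", 6: "brain liver mix"}

def build_inverted_index(sample_text):
    """Map each metadata term to the set of sample IDs that mention it."""
    index = defaultdict(set)
    for sample_id, text in sample_text.items():
        for term in text.lower().split():
            index[term].add(sample_id)
    return index

def query(chrom, start, end, term, index):
    """Junctions contained in the region, restricted to samples matching term."""
    wanted = index.get(term.lower(), set())          # metadata lookup
    hits = []
    for c, s, e, per_sample in JUNCTIONS:            # region lookup (linear scan here)
        if c == chrom and s >= start and e <= end:
            kept = {sid: n for sid, n in per_sample.items() if sid in wanted}
            if kept:                                 # sample filter
                hits.append((c, s, e, kept))
    return hits

index = build_inverted_index(SAMPLE_TEXT)
print(sorted(index["brain"]))
print(query("chr1", 1, 500, "liver", index))
```

Each part of the query is answered by the structure best suited to it, and only the final join touches both junction and sample data.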
  20. Snaptron • For each junction in gene ABCD3, how many

    reads supported it in each of the 50K SRA samples? • What is a particular junction's tissue specificity in the GTEx dataset? • In which samples is splicing pattern A overrepresented relative to splicing pattern B? • (A/B might relate to alt splicing, fusions, etc) Examples: http://snaptron.cs.jhu.edu Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
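The first question above could be posed to the REST API with a URL along these lines. This sketch only builds the URL and makes no network call. The path layout (`/<compilation>/snaptron`), the compilation name `srav2`, and the `regions` and `rfilter` parameter names are assumptions recalled from the Snaptron documentation and should be checked against snaptron.cs.jhu.edu before use.

```python
from urllib.parse import urlencode

BASE = "http://snaptron.cs.jhu.edu"

def snaptron_url(compilation, region, rfilter=None):
    """Build a junction-query URL (endpoint layout is an assumption)."""
    params = {"regions": region}
    if rfilter:
        params["rfilter"] = rfilter
    return f"{BASE}/{compilation}/snaptron?" + urlencode(params)

# All junctions overlapping gene ABCD3 in the (assumed) SRAv2 compilation.
print(snaptron_url("srav2", "ABCD3"))
# Same query, restricted by an (assumed) filter on the number of supporting samples.
print(snaptron_url("srav2", "ABCD3", rfilter="samples_count>:100"))
```

`urlencode` percent-encodes characters such as `>` and `:`, so the filter appears in the URL as `samples_count%3E%3A100`.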
  21. Snaptron vignette 1 • Goldstein et al searched for novel

    cassette exons in Illumina BodyMap 2.0 • Identified 249 cassette exons within known genes but not overlapping any annotated exon • Validated 216 out of 249 in independent sample via paired-end RNA-seq (2 x 250 bp) Goldstein LD, Cao Y, Pau G, Lawrence M, Wu TD, Seshagiri S, Gentleman R. Prediction and Quantification of Splice Events from RNA-Seq Data. PLoS One. 2016 May 24;11(5):e0156132.
  22. Snaptron vignette 1 Wilks C, Gaddipati P, Nellore A, Langmead

    B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547. [Figure: junction diagrams for A. ABCD3, B. KMT2E, C. ALKATI, with numbered arrows.] • Snaptron immediately recapitulates ABCD3 exon (above) • Of the 249 novel exons, 236 (94.8%) occurred in GTEx • Used shared sample count (SSC) query to measure # samples the novel exons occurred in...
  23. Snaptron vignette 1 [Figure: shared sample count (SSC), 0 to 20,000, by data compilation (GTEx, SRAv2) and validation status (failed, passed).]

    • Exons validated by Goldstein et al had higher SSC versus exons failing validation • SSC (prevalence) is related to how "real" they are Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
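A minimal sketch of a shared sample count for a cassette exon. The slides do not define SSC precisely, so this assumes it is the number of samples with at least one read supporting both junctions that flank the exon; the function name and data are illustrative.

```python
def shared_sample_count(upstream_jx, downstream_jx):
    """Count samples with read support for BOTH flanking junctions.

    Each argument maps sample ID -> read count for one junction.
    """
    up = {s for s, reads in upstream_jx.items() if reads > 0}
    down = {s for s, reads in downstream_jx.items() if reads > 0}
    return len(up & down)

# Toy data: read counts per sample for the two junctions flanking one exon.
upstream = {"s1": 4, "s2": 1, "s3": 0, "s5": 2}
downstream = {"s1": 2, "s3": 6, "s5": 1}
print(shared_sample_count(upstream, downstream))  # s1 and s5 support both junctions
```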
  24. Snaptron vignette 2 Darby MM, Leek JT, Langmead B, Yolken

    RH, Sabunciyan S. Widespread splicing of repetitive element loci into coding regions of gene transcripts. Hum Mol Genet. 2016 Nov 15;25(22):4962-4982. • Darby et al studied prevalence of repeat element (RE) expression in the human orbitofrontal cortex • Used RNA-seq to find junctions linking annotated exons to REs in annotated introns, indicating exonization • They supplied us 5 events where RE exon was unannotated; Snaptron SSC query confirmed all 5 occurred at least 35 times in SRAv2 & GTEx Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
  25. Snaptron vignette 2 Wilks C, Gaddipati P, Nellore A, Langmead

    B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547. • Tissue specificity query showed all 5 events were expressed in a tissue-specific pattern in GTEx (Kruskal-Wallis P < 0.01) [Figure: junction diagrams for A. ABCD3, B. KMT2E, C. ALKATI, with numbered arrows.] • One of the 5 shown here (arrow 2)
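To show what a Kruskal-Wallis tissue-specificity test computes, here is a small pure-Python sketch on made-up junction read counts for three tissues. It omits the tie correction (the toy values are all distinct) and uses the chi-square approximation, whose survival function with 2 degrees of freedom is exp(-H/2). The data and function names are hypothetical, and this is not Snaptron's implementation.

```python
import math

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic, no tie correction (values must be distinct)."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    n = len(pooled)
    rank_term = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * rank_term - 3 * (n + 1)

def p_value_three_groups(h):
    """Chi-square survival function for 2 degrees of freedom (three groups only)."""
    return math.exp(-h / 2.0)

# Hypothetical read counts for one junction in three tissues.
brain = [30, 28, 35, 31]
liver = [2, 1, 3, 4]
heart = [10, 12, 9, 11]

h = kruskal_wallis_h([brain, liver, heart])
p = p_value_three_groups(h)
print(f"H = {h:.3f}, approximate P = {p:.4f}")
```

With groups this small the chi-square approximation is rough; in practice a library routine such as `scipy.stats.kruskal`, which also handles ties, is the safer choice.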
  26. Snaptron vignette 3 Collaborator Jonathan Ling studies how splicing factors

    affect splicing of certain cryptic cassette exons • cryptic: usually unannotated, usually unconserved • Past work of Jonathan's showed that splicing factor protein TDP-43 suppresses splicing of non-conserved cryptic exons • Implicated in ALS, frontotemporal dementia (FTD), Alzheimer’s • Can we rapidly screen for regulatory relationships like those between TDP-43 and its cryptic-exon targets?
  27. Snaptron vignette 3 Jonathan Ling: • Let's look at mouse datasets

    because that's where you can get the really nice purified tissues • Let's look at cassette-exon percent-spliced-in (PSI) as a summary measure • Let's look at what patterns seem to define rod photoreceptors as a cell type
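For readers unfamiliar with PSI, one common junction-based definition for a cassette exon is: the average of the two inclusion-junction read counts, divided by that average plus the exclusion (skipping) junction read count. The slides do not say which formula was used, so treat the sketch below as an assumption with illustrative names.

```python
def percent_spliced_in(inc_upstream, inc_downstream, exclusion):
    """Junction-based PSI for a cassette exon, as a fraction in [0, 1].

    inc_upstream / inc_downstream: reads on the two junctions joining the exon
    to its neighbours; exclusion: reads on the junction that skips the exon.
    Returns None when there is no read support at all.
    """
    inclusion = (inc_upstream + inc_downstream) / 2.0
    total = inclusion + exclusion
    if total == 0:
        return None
    return inclusion / total

print(percent_spliced_in(30, 10, 20))  # equal inclusion and exclusion evidence
print(percent_spliced_in(8, 12, 0))    # exon always included
print(percent_spliced_in(0, 0, 0))     # no evidence either way
```

Computing PSI per sample from junction counts alone is what lets the same measure be compared across thousands of archived samples.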
  28. Snaptron vignette 3 Ling J, Wilks C, Charles R, Blackshaw

    S, & Langmead, B. "Exploratory analysis of alternative splicing in tens of thousands of bulk and single-cell samples" in preparation "Supermouse" Rods have characteristic pattern of PSI levels
  29. Snaptron vignette 3 Ling J, Wilks C, Charles R, Blackshaw

    S, & Langmead, B. "Exploratory analysis of alternative splicing in tens of thousands of bulk and single-cell samples" in preparation PSIs can reveal specific signatures for cell types that are not visible at the gene level
  30. Snaptron vignette 3 Ling J, Wilks C, Charles R, Blackshaw

    S, & Langmead, B. "Exploratory analysis of alternative splicing in tens of thousands of bulk and single-cell samples" in preparation Certain alternative exons seem to have high PSI only in rods
  31. Snaptron vignette 3 Ling J, Wilks C, Charles R, Blackshaw

    S, & Langmead, B. "Exploratory analysis of alternative splicing in tens of thousands of bulk and single-cell samples" in preparation Certain splicing factors are expressed specifically in rods -- could they drive rod-specific exon splicing?
  32. Snaptron vignette 3 Ling J, Wilks C, Charles R, Blackshaw

    S, & Langmead, B. "Exploratory analysis of alternative splicing in tens of thousands of bulk and single-cell samples" in preparation Several of these exons involve at least one unannotated junction!
  33. Themes (redux) Cloud computing is a natural fit for public

    data [Image: first page of the review cited at the end of this slide; excerpt follows.] Next-generation sequencing (NGS) technologies have been improving rapidly and have become the workhorse technology for studying nucleic acids. NGS platforms work by collecting information on a large array of polymerase reactions working in parallel, up to billions at a time inside a single sequencer [1]. The speed and decreasing cost of NGS have led to the rapid accumulation of raw sequencing data (sequencing reads), used in published studies, in public archives [2] such as the Sequence Read Archive (SRA) [3,4], which is hosted by the US National Center for Biotechnology Information (NCBI), and the European Nucleotide Archive (ENA) [5], which is hosted by the European Molecular Biology Laboratory at the European Bioinformatics Institute (EMBL–EBI). The SRA now holds about 14 petabases (millions of billions of bases) and has been doubling in size every 10–20 months (FIG. 1). Genomics researchers [...] programme [17], among others (TABLE 1). gnomAD now spans over 120,000 exomes and over 15,000 whole genomes. ICGC encompasses over 70 subprojects targeting distinct cancer types, which are conducted in more than a dozen countries and have already collected samples from more than 20,000 donors. Aligned sequencing reads for ICGC require over 1 petabyte (PB; that is, a million GB) of storage. The TOPMed programme, which plans to sequence more than 120,000 genomes [17], has already deposited more than 18,000 human whole-genome sequencing data sets in the SRA, comprising approximately 2.3 petabases or about 16.5% of the entire archive. Large observational studies currently in progress, such as the Precision Medicine Initiative [18] and Million Veterans Project [19], will drive up the totals yet more rapidly. While advances in NGS have increased opportunities [...] Cloud computing for genomic data analysis and collaboration. Ben Langmead and Abhinav Nellore. Abstract | Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data. Langmead B, Nellore A. Cloud computing for genomic data analysis and collaboration. Nature Reviews Genetics. 2018 Apr;19(4):208-219.
  34. Themes (redux) Scalable software benefits from big resources & many

    samples Strategically ignoring gene annotations can yield clearer results • Many of the splicing patterns described in the vignettes were unannotated. Crucial not to be biased against these during analysis.
  35. Themes (redux) Queryability is in the eye of the beholder

    • Beyond targeted queries, users want bulk screens • Beyond the boiling cauldron of 10,000s of samples, users want specific subsets with key properties • Knocked-down splicing factor • Carefully purified tissue • Disease X
  36. Thank you: Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Leo Collado-Torres,

    Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub, Jonathan Ling, Seth Blackshaw • NIH R01GM118568 • NSF CAREER IIS-1349906 • Sloan Research Fellowship • IDIES Seed Funding program • Amazon Web Services • NIH R01GM105705 (Leek) • SciServer Compute langmead-lab.org, @BenLangmead