Using huge public sequencing datasets to answer scientific questions

Ben Langmead
Assistant Professor, JHU Computer Science
langmead-lab.org, @BenLangmead
BME seminar, Oregon Health & Science University
May 25, 2018

Lab goals
To make high-throughput life science data as usable as possible for scientific labs, especially small ones
• Efficient
  Software: Bowtie 1&2, Arioc, Flash-dans
  Topics: applied algorithms, text indexing, sketching, thread scaling
• Scalable
  Software: Myrna, Rail-RNA, recount2, Snaptron
  Topics: parallel and high-performance computing, cloud computing, indexing
• Interpretable
  Software: Qtip, FORGe
  Topics: modeling mapping quality, modeling graph-genome variants, addressing biases

Sequence Read Archive (SRA) growth
[Figure: SRA size in terabases over time, open access and total, with 1 Pbp and 10 Pbp marked]
• 4 -> 8 Pbp in ~12 months
• 8 -> 16 Pbp in ~18 months

• Snaptron: search engine for RNA-seq. Index & query engine w/ REST API. snaptron.cs.jhu.edu doi:10.1093/bioinformatics/btx547
• recount2: clean summaries of data, metadata, packaged as R objects. jhubiostatistics.shinyapps.io/recount/ doi:10.1038/nbt.3838
• Rail-RNA: scalable, cloud-based spliced alignment of archived RNA-seq datasets. rail.bio doi:10.1093/bioinformatics/btw575

Themes
• Cloud computing is a natural fit for public data
• Scalable software benefits from big resources & many samples
• Strategically ignoring gene annotations can yield clearer results
• Queryability is in the eye of the beholder

Abhinav Nellore, OHSU
Jeff Leek, JHU
http://rail.bio
Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
Image by Rgocs

Spliced RNA-seq aligner for analyzing many samples at once
• Aggregate across samples to borrow strength and eliminate redundant alignment work
• Let data prune false junction calls, not annotation
• Concise outputs: junctions, junction evidence, coverage vectors; no alignments, unless asked for
• Runs easily on commercial AWS cloud, other clusters
http://rail.bio
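One way to picture "eliminate redundant alignment work" is to collapse identical read sequences across samples, align each distinct sequence once, and then fan the result back out to every sample that contained it. The sketch below is only an illustration under that assumption and is not Rail-RNA's implementation; `toy_align` is a hypothetical stand-in for a real spliced aligner (it does exact substring matching), and the sample names and sequences are made up.

```python
from collections import defaultdict


def collapse_reads(samples):
    """Map each distinct read sequence to {sample_id: copies in that sample}."""
    seq_to_samples = defaultdict(lambda: defaultdict(int))
    for sample_id, reads in samples.items():
        for seq in reads:
            seq_to_samples[seq][sample_id] += 1
    return seq_to_samples


def toy_align(seq, reference):
    """Hypothetical aligner stand-in: exact-match offset in reference, or None."""
    pos = reference.find(seq)
    return pos if pos >= 0 else None


def align_once_per_sequence(samples, reference):
    """Align each distinct sequence once, then report results per sample."""
    collapsed = collapse_reads(samples)
    # One aligner call per distinct sequence, however many samples share it.
    alignments = {seq: toy_align(seq, reference) for seq in collapsed}
    per_sample = defaultdict(list)
    for seq, by_sample in collapsed.items():
        for sample_id, copies in by_sample.items():
            per_sample[sample_id].append((seq, alignments[seq], copies))
    return dict(per_sample), len(alignments)


reference = "ACGTACGTTTGACCA"
samples = {
    "s1": ["ACGTAC", "TTGACC", "ACGTAC"],
    "s2": ["ACGTAC", "GGGGGG"],
}
per_sample, n_aligned = align_once_per_sequence(samples, reference)
total_reads = sum(len(reads) for reads in samples.values())
print(f"{n_aligned} aligner calls for {total_reads} reads")
```

In this toy input, five reads across two samples need only three aligner calls; the saving grows with the number of samples that share sequences.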

dbGaP
http://docs.rail.bio/dbgap/
Nellore A, Wilks C, Hansen KD, Leek JT, Langmead B. Rail-dbGaP: analyzing dbGaP-protected data in the cloud with Amazon Elastic MapReduce. Bioinformatics. 2016 Aug 15;32(16):2551-3.

Working toward recount2
• Analyzed ~21,500 human RNA-seq samples with Rail-RNA; about 62 Tbp
• Repeatable: http://github.com/nellore/runs (exact commands we used to run on AWS)
• ~$0.72 / sample (compare to sequencing costs)
[Figure labels: jxs, samples]
http://intropolis.rail.bio
Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.

[Figure, panels a-c: junction (jx) count J versus minimum number S of samples in which the jx is called. Junction categories: fully annotated, exon skip, alternative donor/acceptor, novel. Labeled values: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%]
Annotation includes: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
http://intropolis.rail.bio
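The quantity plotted on this slide, the junction count J as a function of the minimum number S of samples in which a junction is called, can be computed from a junction-to-samples table. The following is a minimal sketch with made-up data; the function names, junction IDs, and the `annotated` set are hypothetical and do not come from the intropolis resource.

```python
def junction_count_by_min_samples(junction_samples, thresholds):
    """Return {S: number of junctions called in at least S samples}."""
    sizes = [len(samples) for samples in junction_samples.values()]
    return {S: sum(1 for n in sizes if n >= S) for S in thresholds}


def annotated_fraction(junction_samples, annotated, S):
    """Fraction of junctions seen in >= S samples that are annotated (None if no junction passes)."""
    kept = [jx for jx, samples in junction_samples.items() if len(samples) >= S]
    if not kept:
        return None
    return sum(1 for jx in kept if jx in annotated) / len(kept)


# Toy data: junction ID -> set of samples in which it was called.
junction_samples = {
    "jx1": {"a", "b", "c"},
    "jx2": {"a"},
    "jx3": {"a", "b"},
    "jx4": {"c"},
}
annotated = {"jx1", "jx3"}

print(junction_count_by_min_samples(junction_samples, [1, 2, 3]))
print(annotated_fraction(junction_samples, annotated, 1))
print(annotated_fraction(junction_samples, annotated, 2))
```

Raising S keeps only widely observed junctions, so in this toy data the annotated fraction climbs from 0.5 at S = 1 to 1.0 at S = 2.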

recount2
• >50K human RNA-seq samples from SRA (open)
• >10K human RNA-seq samples spanning cancer types in The Cancer Genome Atlas (dbGaP)
• >10K human RNA-seq samples from the Genotype-Tissue Expression (GTEx) project (dbGaP)
• In total, ~4.4 trillion reads, 100s of terabases
Image: https://www.sevenbridges.com/welcome-to-the-cancer-genomics-cloud-2/
Image: doi:10.1038/ng.2653
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.

recount2
Summarized at levels of genes, exons, junctions, and coverage vectors
[Figure panels: Junctions, Genes, Coverage, Exons]

recount2
https://jhubiostatistics.shinyapps.io/recount/

recount2
http://bit.ly/recount_sciserver

Search engine for RNA-seq Snaptron

Snaptron
Chris Wilks
Query planner delegates query components to appropriate systems (sqlite, tabix, lucene) and indexes (R-tree, B-tree, Lucene inverted text index)
[Diagram: Query -> Snaptron Query Planner -> Data Store/Index -> Output. Data stores/indexes: Tabix/R-tree index, Lucene/inverted document index, SQLite/B-tree index. Outputs: junction records (region limited, filtered, or both) and sample metadata records. Example inverted-index entries (sample metadata term: samples): "Brain": 1,2,3,6; "Liver": 4,6,9,11]
Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
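The delegation idea can be sketched with two toy indexes: a region lookup over junction records and an inverted index from metadata terms to sample IDs, combined by a small planner. This is only an illustration of the pattern, not Snaptron's code; `ToyPlanner`, `Junction`, and the junction coordinates are hypothetical, plain Python structures stand in for Tabix, Lucene, and SQLite, and only the "Brain" and "Liver" posting lists are taken from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Junction:
    chrom: str
    start: int
    end: int
    samples: frozenset  # IDs of samples in which the junction was observed


class ToyPlanner:
    def __init__(self, junctions, term_to_samples):
        # Stand-in for an interval index over junction records.
        self.junctions = sorted(junctions, key=lambda j: (j.chrom, j.start))
        # Stand-in for an inverted index: metadata term -> sample IDs.
        self.term_to_samples = term_to_samples

    def region_query(self, chrom, start, end):
        """Junctions overlapping [start, end] on chrom (linear scan for simplicity)."""
        return [j for j in self.junctions
                if j.chrom == chrom and j.start <= end and j.end >= start]

    def sample_filter(self, term):
        """Sample IDs whose metadata contains the term."""
        return set(self.term_to_samples.get(term, ()))

    def query(self, chrom, start, end, term=None):
        """Send each query component to its index, then intersect the results."""
        hits = self.region_query(chrom, start, end)
        if term is None:
            return [(j, set(j.samples)) for j in hits]
        wanted = self.sample_filter(term)
        results = []
        for j in hits:
            shared = j.samples & wanted
            if shared:  # keep only junctions seen in a matching sample
                results.append((j, shared))
        return results


junctions = [
    Junction("chr1", 100, 200, frozenset({1, 2, 4})),
    Junction("chr1", 150, 400, frozenset({4, 9})),
    Junction("chr1", 1000, 1200, frozenset({3, 6})),
]
term_to_samples = {"Brain": [1, 2, 3, 6], "Liver": [4, 6, 9, 11]}
planner = ToyPlanner(junctions, term_to_samples)

brain_hits = planner.query("chr1", 50, 500, term="Brain")
liver_hits = planner.query("chr1", 50, 500, term="Liver")
print(len(brain_hits), len(liver_hits))
```

Keeping the region lookup and the metadata lookup separate lets each use the index suited to it, with the planner doing only the final intersection.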

Snaptron
Examples:
• For each junction in gene ABCD3, how many reads supported it in each of the 50K SRA samples?
• What is a particular junction's tissue specificity in the GTEx dataset?
• In which samples is splicing pattern A overrepresented relative to splicing pattern B?
  • (A/B might relate to alt splicing, fusions, etc.)
http://snaptron.cs.jhu.edu

Snaptron vignette 1
• Goldstein et al. searched for novel cassette exons in Illumina BodyMap 2.0
• Identified 249 cassette exons within known genes but not overlapping any annotated exon
• Validated 216 out of 249 in independent sample via paired-end RNA-seq (2 x 250 bp)
Goldstein LD, Cao Y, Pau G, Lawrence M, Wu TD, Seshagiri S, Gentleman R. Prediction and Quantification of Splice Events from RNA-Seq Data. PLoS One. 2016 May 24;11(5):e0156132.

Snaptron vignette 1
[Figure panels: A. ABCD3, B. KMT2E, C. ALKATI]
• Snaptron immediately recapitulates ABCD3 exon (above)
• Of the 249 novel exons, 236 (94.8%) occurred in GTEx
• Used shared sample count (SSC) query to measure # samples the novel exons occurred in...

Snaptron vignette 1
[Figure: shared sample count (SSC), 0 to 20,000, by data compilation (GTEx, SRAv2) and validation status (Failed, Passed)]
• Exons validated by Goldstein et al. had higher SSC than exons that failed validation
• An exon's SSC (prevalence) is related to how "real" it is
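A shared sample count can be read as the size of an intersection: the number of samples in which all junctions that define an event were observed together. The sketch below assumes that a cassette exon is represented by its two flanking inclusion junctions; that representation, the function name, and the junction IDs and sample sets are illustrative assumptions, not Snaptron's actual query implementation.

```python
def shared_sample_count(junction_samples, junction_ids):
    """Number of samples in which every listed junction was observed."""
    sample_sets = [set(junction_samples.get(jx, ())) for jx in junction_ids]
    if not sample_sets:
        return 0
    return len(set.intersection(*sample_sets))


# Toy data: junction ID -> IDs of samples with at least one supporting read.
junction_samples = {
    "exonA_upstream": {1, 2, 3, 5},
    "exonA_downstream": {2, 3, 5, 8},
    "exonB_upstream": {4},
    "exonB_downstream": {7},
}

ssc_a = shared_sample_count(junction_samples, ["exonA_upstream", "exonA_downstream"])
ssc_b = shared_sample_count(junction_samples, ["exonB_upstream", "exonB_downstream"])
print(ssc_a, ssc_b)
```

Here exon A's two junctions co-occur in three samples, while exon B's junctions never co-occur, so exon B would look less convincing by this measure.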

Snaptron vignette 2
• Darby et al. studied prevalence of repeat element (RE) expression in the human orbitofrontal cortex
• Used RNA-seq to find junctions linking annotated exons to REs in annotated introns, indicating exonization
• They supplied us 5 events where RE exon was unannotated; Snaptron SSC query confirmed all 5 occurred at least 35 times in SRAv2 & GTEx
Darby MM, Leek JT, Langmead B, Yolken RH, Sabunciyan S. Widespread splicing of repetitive element loci into coding regions of gene transcripts. Hum Mol Genet. 2016 Nov 15;25(22):4962-4982.

Snaptron vignette 2
• Tissue specificity query showed all 5 events were expressed in a tissue-specific pattern in GTEx (Kruskal-Wallis P < 0.01)
• One of the 5 shown here (arrow 2)
[Figure panels: A. ABCD3, B. KMT2E, C. ALKATI]

Snaptron vignette 3
• Collaborator Jonathan Ling studies how splicing factors affect splicing of certain cryptic cassette exons
  • cryptic: usually unannotated, usually unconserved
• Past work of Jonathan's showed that splicing factor protein TDP-43 suppresses splicing of non-conserved cryptic exons
• Implicated in ALS, frontotemporal dementia (FTD), Alzheimer’s
• Can we rapidly screen for regulatory relationships like those between TDP-43 and its cryptic-exon targets?

Snaptron vignette 3
Jonathan Ling
• Let's look at mouse datasets because that's where you can get the really nice purified tissues
• Let's look at cassette-exon percent-spliced-in (PSI) as a summary measure
• Let's look at what patterns seem to define rod photoreceptors as a cell type
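The talk does not spell out how PSI is computed. A commonly used definition for a cassette exon takes the mean read support of the two inclusion junctions and divides it by that mean plus the support of the exon-skipping junction. The sketch below assumes this definition; the function name and the read counts are made up for illustration.

```python
def percent_spliced_in(incl_upstream, incl_downstream, exclusion):
    """PSI for a cassette exon from junction read counts.

    Inclusion support is the mean of the two junctions flanking the exon;
    exclusion support is the count for the junction that skips it.
    Returns None when no reads support either isoform.
    """
    inclusion = (incl_upstream + incl_downstream) / 2
    total = inclusion + exclusion
    if total == 0:
        return None
    return inclusion / total


# Toy counts: 30 and 50 reads on the inclusion junctions, 10 on the skipping junction.
psi = percent_spliced_in(30, 50, 10)
print(f"PSI = {psi:.2f}")
```

Returning None for events with no supporting reads keeps uncovered samples from being mistaken for samples where the exon is fully skipped, which matters when screening tens of thousands of samples.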

Snaptron vignette 3
"Supermouse"
Rods have characteristic pattern of PSI levels
Ling J, Wilks C, Charles R, Blackshaw S, Langmead B. "Exploratory analysis of alternative splicing in tens of thousands of bulk and single-cell samples." In preparation.

Snaptron vignette 3
PSIs can reveal specific signatures for cell types that are not visible at the gene level

Snaptron vignette 3
Certain alternative exons seem to have high PSI only in rods

Snaptron vignette 3
Certain splicing factors are expressed specifically in rods -- could they drive rod-specific exon splicing?

Snaptron vignette 3
Several of these exons involve at least one unannotated junction!

Themes (redux)
• Cloud computing is a natural fit for public data

Langmead B, Nellore A. Cloud computing for genomic data analysis and collaboration. Nature Reviews Genetics. 2018 Apr;19(4):208-219.

Abstract | Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.

Next-generation sequencing (NGS) technologies have been improving rapidly and have become the workhorse technology for studying nucleic acids. NGS platforms work by collecting information on a large array of polymerase reactions working in parallel, up to billions at a time inside a single sequencer [1]. The speed and decreasing cost of NGS have led to the rapid accumulation of raw sequencing data (sequencing reads), used in published studies, in public archives [2] such as the Sequence Read Archive (SRA) [3,4], which is hosted by the US National Center for Biotechnology Information (NCBI), and the European Nucleotide Archive (ENA) [5], which is hosted by the European Molecular Biology Laboratory at the European Bioinformatics Institute (EMBL–EBI). The SRA now holds about 14 petabases (millions of billions of bases) and has been doubling in size every 10–20 months (FIG. 1). Genomics researchers [...]

[...] programme [17], among others (TABLE 1). gnomAD now spans over 120,000 exomes and over 15,000 whole genomes. ICGC encompasses over 70 subprojects targeting distinct cancer types, which are conducted in more than a dozen countries and have already collected samples from more than 20,000 donors. Aligned sequencing reads for ICGC require over 1 petabyte (PB; that is, a million GB) of storage. The TOPMed programme, which plans to sequence more than 120,000 genomes [17], has already deposited more than 18,000 human whole-genome sequencing data sets in the SRA, comprising approximately 2.3 petabases or about 16.5% of the entire archive. Large observational studies currently in progress, such as the Precision Medicine Initiative [18] and Million Veterans Project [19], will drive up the totals yet more rapidly. While advances in NGS have increased opportunities [...]

Themes (redux)
• Scalable software benefits from big resources & many samples
• Strategically ignoring gene annotations can yield clearer results
  • Many of the splicing patterns described in the vignettes were unannotated. Crucial not to be biased against these during analysis.

Themes (redux)
• Queryability is in the eye of the beholder
  • Beyond targeted queries, users want bulk screens
  • Beyond the boiling cauldron of 10,000s of samples, users want specific subsets with key properties
    • Knocked-down splicing factor
    • Carefully purified tissue
    • Disease X

Thank you:
Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Leo Collado-Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub, Jonathan Ling, Seth Blackshaw
• NIH R01GM118568
• NSF CAREER IIS-1349906
• Sloan Research Fellowship
• IDIES Seed Funding program
• Amazon Web Services
• NIH R01GM105705 (Leek)
• SciServer Compute
langmead-lab.org, @BenLangmead
