Using huge public sequencing datasets to answer scientific questions

Ben Langmead, Assistant Professor, JHU Computer Science
langmead-lab.org, @BenLangmead
UCLA CGSI Tutorial, July 30, 2018

Links
Code & links: https://github.com/BenLangmead/cgsi18
Slides: http://bit.ly/langmead-cgsi18

Related readings
• Langmead B, Nellore A. Cloud computing for genomic data analysis and collaboration. Nat Rev Genet. 2018 Apr;19(4):208-219. https://doi.org/10.1038/nrg.2017.113
• Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying splicing patterns across tens of thousands of RNA-seq samples. Bioinformatics. 2018 Jan 1;34(1):114-116. https://doi.org/10.1093/bioinformatics/btx547
• Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nat Biotechnol. 2017 Apr 11;35(4):319-321. https://doi.org/10.1038/nbt.3838
• Nellore A, Jaffe AE, Fortin JP, Alquicira-Hernández J, Collado-Torres L, Wang S, Phillips RA III, Karbhari N, Hansen KD, Langmead B, Leek JT. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266. https://doi.org/10.1186/s13059-016-1118-6
• Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2017 Dec 15;33(24):4033-4040. https://doi.org/10.1093/bioinformatics/btw575

Sequence Read Archive (SRA) growth
[Figure: terabases over time; series: Open access, Total; reference levels at 1 Pbp and 10 Pbp]
Annotations: 8 -> 16 Pbp in ~18 months; 4 -> 8 Pbp in ~12 months

"An index is a great leveler" (GB Shaw)
"Even a summary would be an improvement" (Not GB Shaw)

Aside: Indexing raw sequencing data
• Mantis: Ferdman, M., Johnson, R., & Patro, R. Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index. In Research in Computational Molecular Biology (p. 271). Springer.
• BIGSI: Bradley, P., den Bakker, H., Rocha, E., McVean, G., & Iqbal, Z. (2017). Real-time search of all bacterial and viral genomic data. bioRxiv, 234955.
• Sequence Bloom Trees:
  Solomon B, Kingsford C. Fast search of thousands of short-read sequencing experiments. Nat Biotechnol. 2016 Mar;34(3):300-2.
  Solomon B, Kingsford C. Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees. J Comput Biol. 2018 Mar 12.
  Sun C, Harris RS, Chikhi R, Medvedev P. AllSome Sequence Bloom Trees. J Comput Biol. 2018 May;25(5):467-479.
• 1000 Genomes FM Index: Dolle DD, Liu Z, Cotten M, Simpson JT, Iqbal Z, Durbin R, McCarthy SA, Keane TM. Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes. Genome Res. 2017 Feb;27(2):300-309.
[Images: from Mantis paper; from Split SBT paper]

Public summaries of sequencing data

Name          | Website                            | Notes
ExAC / gnomAD | http://gnomad.broadinstitute.org   | Non-REF alleles in aligned exomes/genomes
Cistrome      | http://cistrome.org/db/#/          | Summarized ChIP-seq and DNase-seq; human & mouse
SRAdb         | https://github.com/seandavi/SRAdb  | Queryable SRA metadata, updated regularly

Langmead B, Nellore A. Cloud computing for genomic data analysis and collaboration. Nat Rev Genet. 2018 Apr;19(4):208-219. doi: 10.1038/nrg.2017.113.

• Snaptron: search engine for RNA-seq; index & query engine w/ REST API. snaptron.cs.jhu.edu, doi:10.1093/bioinformatics/btx547
• recount2: clean summaries of data, metadata, packaged as R objects. jhubiostatistics.shinyapps.io/recount/, doi:10.1038/nbt.3838
• Rail-RNA: scalable, cloud-based spliced alignment of archived RNA-seq datasets. rail.bio, doi:10.1093/bioinformatics/btw575

Abhinav Nellore, OHSU; Jeff Leek, JHU
http://rail.bio
Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
[Image by Rgocs]

Junction-level summaries
• Analyzed >50K human RNA-seq samples from SRA; trillions of reads, 100s of terabases
[Figure: junctions 1-4 (J1-J4) observed across samples 1-3 (S1-S3)]

      S1  S2  S3
J1     0   0   2
J2     0   1   1
J3     0   1   1
J4     8   1   1
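A junction-by-sample matrix like the one on this slide is simple to work with once it is loaded. The following is a minimal sketch using the toy counts from the slide, assuming each entry is the number of reads supporting a junction in a sample. The names `counts` and `summarize` are illustrative and do not come from Rail-RNA or Snaptron.

```python
# Toy junction-by-sample matrix from the slide: rows are junctions,
# columns are samples S1..S3, entries are assumed to be supporting-read counts.
samples = ["S1", "S2", "S3"]
counts = {
    "J1": [0, 0, 2],
    "J2": [0, 1, 1],
    "J3": [0, 1, 1],
    "J4": [8, 1, 1],
}

def summarize(counts):
    """Return {junction: (samples_with_support, total_reads)}."""
    return {
        jx: (sum(1 for c in row if c > 0), sum(row))
        for jx, row in counts.items()
    }

summary = summarize(counts)
for jx, (n_samples, n_reads) in summary.items():
    print(f"{jx}: {n_samples} samples, {n_reads} reads")
```

The per-junction sample count is the quantity that the later slides threshold on when asking how widely a junction is observed.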

Exercise 1
• How often are the exon-exon junctions we detect also present in annotations like GENCODE?

              S1  S2  S3  ...  S50,000
J1             0   0   2  ...        0
J2             0   1   1  ...        3
J3             0   1   1  ...       40
J4             8   1   1  ...        2
...          ...
J81,066,376    0  10   0  ...        0

[Figure, panels a-c: junction (jx) count J versus minimum number S of samples in which jx is called. Junction categories: Novel, Alternative donor/acceptor, Exon skip, Fully annotated. Labeled values: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%]
Annotation includes: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
http://intropolis.rail.bio
Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
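The question in Exercise 1 amounts to filtering junctions by how many samples they are called in and then measuring what fraction of the survivors is annotated. The sketch below shows that computation on made-up data: the junction records, their sample counts and their annotation flags are all hypothetical and are not taken from intropolis.

```python
# Hypothetical toy input: for each junction, the number of samples in which
# it was called and whether it appears in an annotation (flags are made up).
junctions = [
    {"id": "J1", "samples": 1,    "annotated": False},
    {"id": "J2", "samples": 2,    "annotated": False},
    {"id": "J3", "samples": 40,   "annotated": True},
    {"id": "J4", "samples": 1200, "annotated": True},
    {"id": "J5", "samples": 900,  "annotated": False},
]

def annotated_fraction(junctions, min_samples):
    """Fraction of junctions called in >= min_samples samples that are annotated.

    Returns None when no junction passes the threshold.
    """
    kept = [j for j in junctions if j["samples"] >= min_samples]
    if not kept:
        return None
    return sum(j["annotated"] for j in kept) / len(kept)

# Sweep the threshold S, as on the figure's x-axis.
for s in (1, 10, 1000):
    print(s, annotated_fraction(junctions, s))
```

On real data the same sweep would run over the full junction table, with the annotation flag derived from the annotation sources listed on the slide.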

recount2
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.
https://jhubiostatistics.shinyapps.io/recount/

recount2
Summarized at levels of genes, exons, junctions, and coverage vectors
[Figure panels: Junctions, Genes, Coverage, Exons]

Exercise 2
Adapted from the recount2 quick-start guide by Leo Collado-Torres
http://bioconductor.org/packages/devel/bioc/vignettes/recount/inst/doc/recount-quickstart.html

Search engine for RNA-seq Snaptron

Snaptron
Query planner delegates query components to appropriate systems (sqlite, tabix, lucene) and indexes (R-tree, B-tree, Lucene inverted text index)
Chris Wilks
[Diagram: Query -> Snaptron Query Planner -> Data Store/Index -> Output. Labels: Sample Filter; Region Limited; Region Limited & Filtered; Filtered Region; Filtered Samples; Junction Records; Sample Metadata Records; Tabix/R-tree Index; Lucene/Inverted Document Index; SQLite/B-tree Index]
Example inverted index (Sample Metadata Terms -> Samples): "Brain" -> 1,2,3,6; "Liver" -> 4,6,9,11
Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
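The delegation idea can be illustrated with a toy planner: a region component goes to one lookup, a sample-metadata component goes to another, and the results are combined. This is only a sketch and not Snaptron's implementation. It substitutes a linear scan for the Tabix/R-tree index and a Python dict for the Lucene inverted index, and every record and function name is hypothetical. It also assumes a junction matches a region only when it lies entirely inside it.

```python
# Toy stand-ins for the lookups named on the slide.
# Junction records: (junction_id, chrom, start, end, {sample_id: read_count}).
JUNCTIONS = [
    ("J1", "chr1", 100, 200, {1: 5, 4: 2}),
    ("J2", "chr1", 150, 400, {2: 1}),
    ("J3", "chr2", 100, 200, {1: 7, 6: 3}),
]

# Inverted index from metadata term to sample ids, as in the slide's example.
TERM_TO_SAMPLES = {
    "brain": {1, 2, 3, 6},
    "liver": {4, 6, 9, 11},
}

def junctions_in_region(chrom, start, end):
    """Region lookup: a linear scan here; a real system would use an interval index."""
    return [j for j in JUNCTIONS if j[1] == chrom and j[2] >= start and j[3] <= end]

def samples_matching(term):
    """Sample filter: look the term up in the inverted index."""
    return TERM_TO_SAMPLES.get(term.lower(), set())

def plan_and_run(region=None, term=None):
    """Send each query component to its lookup, then combine the results."""
    records = junctions_in_region(*region) if region else list(JUNCTIONS)
    if term is None:
        return records
    wanted = samples_matching(term)
    out = []
    for jid, chrom, start, end, per_sample in records:
        kept = {s: n for s, n in per_sample.items() if s in wanted}
        if kept:  # drop junctions with no reads in the selected samples
            out.append((jid, chrom, start, end, kept))
    return out

print(plan_and_run(region=("chr1", 1, 500), term="Brain"))
```

Keeping each lookup behind its own function is what lets a planner swap in a store suited to that query type without changing the combining step.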

Snaptron
Provides command-line tool and REST API for querying junctions (& more summaries coming soon)

Snaptron
• For each junction in gene ABCD3, how many reads supported it in each of the 50K SRA samples?
• What is a particular junction's tissue specificity in the GTEx dataset?
• In which samples is splicing pattern A overrepresented relative to splicing pattern B?
  • (A/B might relate to alt splicing, fusions, etc)
Examples: http://snaptron.cs.jhu.edu
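A query like the first one above could be expressed as a REST URL. The sketch below only builds the URL and makes no network call. The endpoint layout (`/<compilation>/snaptron`), the parameter names (`regions`, `rfilter`, `sfilter`), the compilation name `srav2` and the filter string are assumptions recalled from the Snaptron user guide, so check them against the documentation at http://snaptron.cs.jhu.edu before relying on them. `snaptron_query_url` is a hypothetical helper, not part of any Snaptron client.

```python
from urllib.parse import urlencode

# Assumed base URL; the path layout below is also an assumption.
BASE = "http://snaptron.cs.jhu.edu"

def snaptron_query_url(compilation, region, rfilters=(), sfilters=()):
    """Build a junction query URL for one data compilation.

    region   -- a gene name or a chr:start-end interval (assumed syntax)
    rfilters -- filters on junction fields (assumed parameter 'rfilter')
    sfilters -- filters on sample metadata (assumed parameter 'sfilter')
    """
    params = [("regions", region)]
    params += [("rfilter", f) for f in rfilters]
    params += [("sfilter", f) for f in sfilters]
    # Keep ':' unescaped so the filter strings stay readable in the URL.
    return f"{BASE}/{compilation}/snaptron?{urlencode(params, safe=':')}"

url = snaptron_query_url("srav2", "ABCD3", rfilters=["samples_count:100"])
print(url)
```

The returned string can be passed to any HTTP client; the response format is not modeled here.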

Mini Snaptron case study
• Goldstein et al searched for novel cassette exons in Illumina BodyMap 2.0
• Identified 249 cassette exons within known genes but not overlapping any annotated exon
• Validated 216 out of 249 in independent sample via paired-end RNA-seq (2 x 250 bp)
Goldstein LD, Cao Y, Pau G, Lawrence M, Wu TD, Seshagiri S, Gentleman R. Prediction and Quantification of Splice Events from RNA-Seq Data. PLoS One. 2016 May 24;11(5):e0156132.

Exercise 3

Mini Snaptron case study
[Figure: shared sample count (SSC), axis 0 to 20000, by data compilation (GTEx, SRAv2) and validation status (Failed, Passed)]
• Exons validated by Goldstein et al had higher SSC than exons that failed validation
• An exon's SSC (prevalence) is related to how "real" it is
Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
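A shared sample count can be computed with a set intersection. The sketch below assumes that the SSC of a cassette exon is the number of samples in which both junctions flanking the exon are observed; that definition is an assumption here and should be checked against the Snaptron paper. The sample ids are hypothetical.

```python
def shared_sample_count(left_jx_samples, right_jx_samples):
    """Number of samples that contain both junctions flanking an exon."""
    return len(set(left_jx_samples) & set(right_jx_samples))

# Hypothetical sample ids in which each flanking junction has >= 1 read.
upstream = {1, 2, 3, 6, 9}
downstream = {2, 3, 4, 9, 11}

ssc = shared_sample_count(upstream, downstream)
print(ssc)  # samples 2, 3 and 9 carry both junctions
```

Requiring both junctions in the same sample is stricter than counting either junction alone, which is why it can serve as a prevalence signal for the exon as a whole.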

Future: public data
Desire: for querying and using public data to be everyday activity in bio research
Is training keeping up? If not, can we fix?

[Article excerpt shown on slide]
WORLD VIEW: A personal take on events
Don’t let useful data go to waste
Researchers must seek out others’ deposited biological sequences in community databases, urges Franziska Denk.
(Image credit: Meghna Abraham)

One of the best ways for a neuroscientist like me to keep up to date with what colleagues are working on is to attend conferences. But on recent trips I have noticed a problem. Too few researchers are consulting and using publicly available data, my own included. What is going on?

Massive amounts of biological information are being accumulated using high-throughput sequencing techniques. Many scientists have used some of those resources, such as the Encyclopedia of DNA Elements (ENCODE) launched by the US National Human Genome Research Institute. But many more laboratories in neuroscience and other subdisciplines of cell and molecular biology generate their own data sets. These data are piling up in community databases and offer information on gene expression and regulation. Unless this information is used, it is wasted.

For instance, I study brain cells thought to be important for the maintenance of chronic pain. Called microglia, these cells are also investigated by immunologists interested in the cells’ role in, say, multiple sclerosis. Together, these results provide a full profile of which genes these cells express.

[...] discrepancy, and propose a biologically valid reason for it.

Why are so many bench biologists overlooking this wealth of cell-type-specific expression data? My hunch is there are two reasons. First, researchers underestimate how many of these data have been published over the past few years because they are being generated across so many different fields. Second, they are wary of the data. Because you need bioinformatics knowledge to generate and analyse sequencing results, people assume that they also need such expertise to locate and interpret them.

Not so. In the past five years, improvements in technology, together with stricter deposition guidelines, mean that simple Excel files commonly accompany papers. These can be downloaded in minutes from the Supplementary Information of a relevant paper, or from the ‘GEO Datasets’ tab on the NCBI website using search terms. It is like PubMed for spreadsheets. They require minimal knowledge to browse.

It is often difficult to share big data in science. Sequencing data are fairly unusual, in that it is easy to standardize, display and judge them from the outside. This is not the case for many other kinds of scientific output. For instance, [...]

Pull quote: TAKING NO NOTICE OF DEPOSITED DATA IS AKIN TO [...]

Future: public data
[Spectrum: single accession or study at one end, all of SRA at the other]
With public data we are quickly confronted by issues like technical confounding and missing/incorrect metadata
How do we know what questions can be answered robustly at what points on the spectrum?
Can we "fix" metadata?
Ellis SE, Collado-Torres L, Jaffe A, Leek JT. Improving the value of public RNA-seq expression data by phenotype prediction. Nucleic Acids Res. 2018 May 18;46(9):e54.

Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Leo Collado-Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub
• NIH R01GM118568
• NSF CAREER IIS-1349906
• Sloan Research Fellowship
• IDIES Seed Funding program
• Amazon Web Services
• NIH R01GM105705 (Leek)
langmead-lab.org, @BenLangmead
Thank you: IDIES Seed funding, SciServer, SciServer Compute, Jonathan Ling, Seth Blackshaw
