Marshaling public data for lean and powerful splicing studies

The Sequence Read Archive now contains over a million accessions, including over 500K RNA-seq run accessions for mouse and over 300K for human. Large-scale projects like GTEx, ICGC and TOPMed are major contributors, and huge projects on the horizon, such as the All of Us and Million Veteran programs, will throw on more fuel. Such archives are potential gold mines for researchers, but they are not organized for everyday use by scientists. The situation resembles the early days of the World Wide Web, before search engines made the web easy to use.

I will describe our work on making large public RNA sequencing datasets easy to use. I will describe our multi-layered design, with one layer for scalable, uniform and secure analysis (Rail-RNA), another for forming easy-to-use summaries (recount2), and a third for indexing the summaries and making them queryable (Snaptron). The overall result is a system where scientists can pose questions that are scientific in nature, rather than simply about data retrieval. Finally, I will describe collaborations where these tools were applied to (a) evaluate hypotheses about prevalence or specificity of splicing patterns, (b) characterize completeness of the gene annotations we use to understand splicing patterns, and (c) reveal patterns in public data that ultimately changed the study design and allowed more targeted hypotheses to be tested with less new data generation. This is joint work with Abhinav Nellore, Chris Wilks, Jonathan Ling, Jeff Leek, Kasper Hansen, Andrew Jaffe and others.
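To make the idea of a queryable index layer concrete, here is a minimal Python sketch of a query that combines a region limit with a sample-metadata filter. It is a toy illustration, not Snaptron's implementation: the `Junction` class, the junction records and the function names are hypothetical, and a linear scan stands in for a real interval index. The term-to-sample lists reuse the "Brain"/"Liver" toy values from the Snaptron query planner slide in the transcript.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Junction:
    chrom: str
    start: int
    end: int
    coverage: Dict[int, int]  # sample id -> supporting read count


# Inverted index: metadata term -> sample ids (toy values from the slide).
TERM_INDEX: Dict[str, Set[int]] = {
    "brain": {1, 2, 3, 6},
    "liver": {4, 6, 9, 11},
}

# Hypothetical junction records, invented for illustration only.
JUNCTIONS: List[Junction] = [
    Junction("chr1", 100, 200, {1: 5, 4: 2}),
    Junction("chr1", 150, 400, {2: 1, 3: 7, 9: 3}),
    Junction("chr1", 900, 1200, {6: 4}),
    Junction("chr2", 100, 200, {1: 9}),
]


def samples_matching(term: str) -> Set[int]:
    """Look up the sample ids annotated with a metadata term."""
    return set(TERM_INDEX.get(term.lower(), set()))


def junctions_in_region(chrom: str, start: int, end: int) -> List[Junction]:
    """Linear scan standing in for an interval index over junctions."""
    return [j for j in JUNCTIONS
            if j.chrom == chrom and j.start <= end and j.end >= start]


def query(chrom: str, start: int, end: int, term: str,
          min_samples: int = 1) -> List[Junction]:
    """Region-limited junctions, restricted to samples matching `term`."""
    keep = samples_matching(term)
    results = []
    for jx in junctions_in_region(chrom, start, end):
        filtered = {s: c for s, c in jx.coverage.items() if s in keep}
        if len(filtered) >= min_samples:
            results.append(Junction(jx.chrom, jx.start, jx.end, filtered))
    return results
```

For example, `query("chr1", 1, 500, "Brain")` returns the two chr1 junctions that overlap the region, each with its coverage reduced to brain-labelled samples. The point of the design is that the region lookup and the metadata lookup are answered by different indexes and then intersected.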

Ben Langmead

March 07, 2019

Transcript

  1. Ben Langmead Assistant Professor, JHU Computer Science langmea@cs.jhu.edu, langmead-lab.org, @BenLangmead

    Vanderbilt Genetics Institute March 7, 2019 Marshaling public data for lean and powerful splicing studies
  2. None
  3. None
  4. Lab goals: Efficient, Scalable, Interpretable
    To make high-throughput life science data as usable as possible for scientific labs, especially small ones
    • Software: Bowtie 1&2, Dashing, Arioc. Topics: applied algorithms, text indexing, sketching, thread scaling
    • Software: Rail-RNA, recount2, Snaptron, Boiler. Topics: parallel and high-performance computing, cloud computing, indexing
    • Software: Qtip, FORGe, r-index, ref. relaxation. Topics: modeling mapping quality, graph-genome variants, addressing biases
  5. Sequence Read Archive Langmead B, Nellore A. Cloud computing for

    genomic data analysis and collaboration. Nat Rev Genet. 2018 May;19(5):325. Currently ~ 26 petabases
  6. "An index is a great leveler" (GB Shaw)
    "Summaries are good too" (not GB Shaw)
  7. Public summaries of sequencing data Langmead B, Nellore A. Cloud

    computing for genomic data analysis and collaboration. Nat Rev Genet. 2018 May;19(5):325.
  8. • Snaptron: search engine for RNA-seq; index & query engine w/ REST API. snaptron.cs.jhu.edu, doi:10.1093/bioinformatics/btx547
    • recount2: clean summaries of data, metadata, packaged as R objects. jhubiostatistics.shinyapps.io/recount/, doi:10.1038/nbt.3838
    • Rail-RNA: scalable, cloud-based spliced alignment of archived RNA-seq datasets. rail.bio, doi:10.1093/bioinformatics/btw575
  9. Themes • Cloud computing is a natural fit for public

    data • Think outside the gene annotation • Much of the work is in the "last mile"
  10. Abhinav Nellore OHSU Jeff Leek, JHU http://rail.bio Nellore A, Collado-Torres

    L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4. Image by Rgocs
  11. Spliced RNA-seq aligner for analyzing many samples at once •

    Group across samples to borrow strength and eliminate redundant alignment work • Let data prune false junction calls, not annotation • Concise outputs: junctions & coverage vectors; no alignments, unless asked for • Runs easily on commercial AWS cloud http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
  12. First foray: Intropolis
    • Analyzed ~21,500 human RNA-seq samples with Rail-RNA; about 62 Tbp
    [Figure: counts matrix of exon-exon junctions (10s of millions) by samples (21.5K)]
    http://intropolis.rail.bio
    Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
  13. [Figure, panels a-c: junction (jx) count J versus minimum number S of samples in which the jx is called. Junction categories: novel, alternative donor/acceptor, exon skip, fully annotated. Callouts: 18.6%; 56,861 jx; 100%; 96.5%; 81.4%; 85.8%]
    Annotations: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
    http://intropolis.rail.bio
    Nellore A, et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
  14. recount2
    • >50K human RNA-seq samples from SRA (open)
    • >10K human RNA-seq samples from The Cancer Genome Atlas (dbGaP)
    • >10K human RNA-seq from the Genotype-Tissue Expression (GTEx) project (dbGaP)
    • Total: ~4.4 trillion reads, 100s of terabases
    Images: https://www.sevenbridges.com/welcome-to-the-cancer-genomics-cloud-2/, doi:10.1038/ng.2653
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.
  15. Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA,

    Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321. bit.ly/recount2 (jhubiostatistics.shinyapps.io/recount/) recount2 Enter search -> Study list is instantly filtered Links to data objects
  16. Search engine for RNA-seq Snaptron

  17. Snaptron (Chris Wilks)
    Query planner breaks down queries, delegates to appropriate systems (sqlite, tabix, Lucene) and indexes (R-tree, B-tree, inverted index)
    [Diagram: a query goes to the Snaptron Query Planner, which consults three data stores/indexes: a Tabix/R-tree index (region-limited junction records), a SQLite/B-tree index (filtered junction records), and a Lucene/inverted document index (sample metadata records, used as a sample filter). Example inverted index entries: "Brain" maps to samples 1,2,3,6; "Liver" maps to samples 4,6,9,11. Output: region-limited and filtered junction records for the filtered samples]
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
  18. Snaptron Provides command-line tool and REST API for querying junctions

    (& more summaries coming soon) Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
  19. Snaptron • For each junction in a gene, what is

    its read support in each of 50K SRA samples? • What is a junction's tissue specificity in GTEx? • In which samples is splicing pattern A overrepresented relative to pattern B? Example queries: http://snaptron.cs.jhu.edu Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
  20. Case study • Goldstein et al searched for novel cassette

    exons in Illumina BodyMap 2.0 RNA-seq • Identified 249 within known genes, not overlapping a RefSeq-annotated exon • Validated 216 out of 249 in independent sample via RNA-seq
  21. Case study
    [Figure panels: A. ABCD3, B. KMT2E, C. ALKATI]
    • Of the 249 novel exons, 236 (94.8%) occurred in GTEx (one shown above)
    • Shared sample count predicts how likely novel exons were to validate (right)
    [Plot: shared sample count (SSC), 0 to 20000, by data compilation (GTEx, SRAv2) and validation outcome (Failed, Passed)]
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
  22. Case study
    [Diagram labels: RNA-seq dataset, Discovery, Validation, Independent dataset, Snaptron, Snaptron for hypothesis generation]
    • Snaptron for validation: what discoveries have support in public data?
    • Snaptron for discovery: what exists? what's prevalent? what's specific?
    • Snaptron for prioritization of potential discoveries: what discoveries are best supported?
  23. Rod photoreceptors Jonathan Ling Seth Blackshaw • Detect light &

    transduce signal to brain • Degeneration is main cause of hereditary blindness; treatments are few • Can we find rod-specific patterns and splicing factors, with the aim of creating a rod-like model from a human cell line?
  24. Rod photoreceptors
    (1) Rods and retinal cells have characteristic exon-usage patterns. Data: purified tissue (FACS/affinity)
    (2) Certain exons are utilized only in rods. Data: purified tissue
    (3) Certain splicing factors work specifically in rods. Data: GTEx, purified tissue, ENCODE
    (4) Up-regulating those factors induces rod-like splicing in a human cell line. Data: new data, HepG2 cell line
  25. Rod photoreceptors
    Rods have characteristic patterns of exon usage
    Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B, Venkataraman A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT identifies key regulators of photoreceptor-specific splicing. bioRxiv doi:10.1101/501882.
  26. Rod photoreceptors Exon usage is a useful cell-type signature; sometimes

    invisible at gene level Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B, Venkataraman A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT identifies key regulators of photoreceptor-specific splicing. bioRxiv doi:10.1101/501882. Cochlear Hair Cells Pyramidal Neurons
  27. Certain exons are used only in rods Rod photoreceptors Ling

    JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B, Venkataraman A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT identifies key regulators of photoreceptor-specific splicing. bioRxiv doi:10.1101/501882.
  28. Certain splicing factors are specific to rods -- could they

    drive rod-specific splicing? Rod photoreceptors Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B, Venkataraman A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT identifies key regulators of photoreceptor-specific splicing. bioRxiv doi:10.1101/501882.
  29. Rod photoreceptors Up-regulating those splicing factors yields rod-like splicing in

    HepG2 cells Unannotated Unannotated Unannotated Unannotated Unannotated Unannotated Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B, Venkataraman A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT identifies key regulators of photoreceptor-specific splicing. bioRxiv doi:10.1101/501882.
  30. ASCOT • Visually explore alternative splicing events in the same

    datasets we used http://ascot.cs.jhu.edu Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B, Venkataraman A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT identifies key regulators of photoreceptor-specific splicing. bioRxiv doi:10.1101/501882.
  31. Future: public data
    Rod photoreceptor study involved >90K public datasets
    Used public data only up to final HepG2 experiment
    Desire: querying public data as an everyday activity in bio research
    • "Leveler" in a field of haves & have nots
    [Clipping: World View column, "Don't let useful data go to waste. Researchers must seek out others' deposited biological sequences in community databases, urges Franziska Denk." Legible excerpt: "Too few researchers are consulting and using publicly available data [...] Why are so many bench biologists overlooking this wealth of cell-type-specific expression data?"]
  32. Future: data science
    [Spectrum: from "One dataset" to "All of SRA"]
    Public data quickly confronts us with technical confounders & missing/incorrect metadata
    What questions can we answer robustly? At what points on the spectrum? Is metadata fixable?
    Ellis SE, Collado-Torres L, Jaffe A, Leek JT. Improving the value of public RNA-seq expression data by phenotype prediction. Nucleic Acids Res. 2018 May 18;46(9):e54.
  33. Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub, Jonathan Ling, Seth Blackshaw, Rone Charles
    • NIH R01GM118568 (Langmead)
    • NSF CAREER IIS-1349906 (Langmead)
    • NIH R01GM105705 (Leek)
    • NIH R01GM121459 (Hansen)
    • NIH Cloud Credits Model Pilot, CCREQ-2017-03-00086 (Langmead)
    • NSF XSEDE projects (TG-CIE170020, TG-DEB180021)
    IDIES Seed funding, SciServer, SciServer Compute
    langmead-lab.org, @BenLangmead
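
The shared sample count (SSC) used in the case-study slide can be sketched in a few lines of Python. This sketch assumes that SSC means the number of samples in which both junctions flanking a candidate cassette exon are observed; the slides do not spell out the definition, so treat that as an assumption. The function name, the `min_reads` parameter and the per-sample read counts below are hypothetical.

```python
from typing import Dict


def shared_sample_count(upstream: Dict[str, int],
                        downstream: Dict[str, int],
                        min_reads: int = 1) -> int:
    """Count samples supporting both junctions that flank an exon.

    `upstream` and `downstream` map sample id -> read count for the
    junction entering and the junction leaving the candidate exon.
    Assumed definition: a sample is "shared" when both junctions have
    at least `min_reads` supporting reads in that sample.
    """
    up = {s for s, c in upstream.items() if c >= min_reads}
    down = {s for s, c in downstream.items() if c >= min_reads}
    return len(up & down)


# Made-up read counts for one candidate exon.
upstream_jx = {"S1": 3, "S2": 1, "S3": 0}
downstream_jx = {"S2": 4, "S3": 2, "S4": 1}
ssc = shared_sample_count(upstream_jx, downstream_jx)
```

Here only sample S2 supports both junctions, so `ssc` is 1. Raising `min_reads` makes the count stricter, which is one way to trade sensitivity for confidence when ranking candidate exons.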