The Sequence Read Archive now contains over a million accessions, including over 500K RNA-seq run accessions for mouse and over 300K for human. Large-scale projects like GTEx, ICGC and TOPmed are major contributors and huge projects on the horizon, such as the All of Us and Million Veterans programs, will throw on more fuel. Such archives are potential gold mines for researchers but they are not organized for everyday use by scientists. The situation resembles the early days of the World Wide Web, before search engines made the web easy to use.
I will describe our work on making making large public RNA sequencing datasets easy to use. I will describe our multi-layered design, with one layer for scalable and uniform (and secure) analysis (Rail-RNA), another for forming easy-to-use summarized (recount2), and a third for indexing the summaries and making them queryable (Snaptron). The overall result is a system where scientists can pose questions that are scientific in nature, and aren't simply about data retrieval. Finally, I will describe collaborations where these tools were applied to (a) evaluate hypotheses about prevalence or specificity of splicing patterns, (b) characterize completeness of the gene annotations we use to understand splicing patterns, and (c) reveal patterns in public data that ultimately changed the study design and allowed more targeted hypotheses to be tested with less new data generation. This is joint work with Abhinav Nellore, Chris Wilks, Jonathan Ling, Jeff Leek, Kasper Hansen, Andrew Jaffe and others.