The Sequence Read Archive now contains over a million accessions, including over 200K RNA-seq runs for mouse and over 160K for human. Large-scale projects like GTEx, ICGC, and TOPMed are major contributors, and huge projects on the horizon, such as the All of Us and Million Veteran programs, will further accelerate this growth. These archives are potential gold mines for researchers, but they are not organized for everyday use by scientists. The situation resembles the early days of the World Wide Web, before search engines made the web easy to use.
Using the archive as motivation, I will convey some insights -- gleaned from both successes and failures -- about how we as computational researchers can work toward the goal of making large public datasets easy to use. I will discuss some challenges that come with (a) working at scale, (b) using commercial (and non-commercial) cloud computing as a platform for this work, (c) pooling and borrowing strength across datasets, and (d) making public data available for use by everyday researchers. I will highlight ways we have learned to make our tools better suited to how scientists work. This is joint work with Abhinav Nellore, Chris Wilks, Jonathan Ling, Luigi Marchionni, Jeff Leek, Kasper Hansen, Andrew Jaffe, and others.