PSU 2017: Unlocking sequence data archives with scalable software and resources

The Sequence Read Archive contains data for over 450K RNA-seq samples, including over 140K from humans. Large-scale projects like GTEx and ICGC are generating RNA-seq data on many thousands of samples. Such huge and carefully designed datasets are valuable but unwieldy for typical researchers, especially when access to computational resources is limited.

I will describe work toward the goal of making it easy for biological researchers to use the archived RNA-seq data available today. I will highlight the Rail-RNA software, its dbGaP-protected version, and the recount and Snaptron resources. The Rail-RNA software uses the Amazon Web Services commercial cloud to analyze many samples at once. We used Rail-RNA to study tens of thousands of public RNA-seq accessions, yielding new insights about the completeness of existing gene annotations and about how our knowledge of human splicing diversity has evolved over time. I will demonstrate how the recount resource can be used to answer questions about expression and differential expression across tens of thousands of RNA-seq samples, and how the Snaptron API can be used to rapidly answer sophisticated queries against the splicing patterns in recount.


Ben Langmead

March 01, 2017