ISMB 2017: Supporting highly scalable scientific data analysis with Galaxy
Technology Talk for ISMB 2017 on 1) Galaxy scalability to thousands of samples, 2) Practical reproducibility with #bioconda, #biocontainers, and virtualization, and 3) [didn't get to this] working with Galaxy entirely from the command line.
Data pipeline, inspired by Leek and Peng, Nature 2015. Stages: Idea, Experiment, Raw data, Tidy data, Summarized data, Results. Steps: experimental design, data collection, data cleaning, data analysis, inference. Annotations: the part we are considering here; the part that ends up in the publication.
Goals. Accessibility: eliminate barriers for researchers wanting to use complex methods, and make these methods available to everyone. Transparency: facilitate communication of analyses and results in ways that are easy to understand while providing all details. Reproducibility: ensure that analysis performed in the system can be reproduced precisely and practically.
A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data, and permanent storage. Open source software that makes integrating your own tools and data and customizing for your own site simple. An open extensible platform for sharing tools, datatypes, workflows, ...
Describe analysis tool behavior abstractly. Analysis environment automatically and transparently tracks details. Workflow system for complex analysis, constructed explicitly or automatically. Pervasive sharing, and publication of documents with integrated analysis.
Nestorowa et al. (GSE81682): single-cell RNA-seq analysis of 7,248 cells (432 LT-HSCs, 1,704 HSC-MPPs, and 1,704 HPCs). Sequenced ~1-2 million reads per cell: 3.4 TB raw data.
Critical points the framework needs to address, with the feature that addresses each: keeping the naming traceable (collections); collapsing single-cell data to single tables (collection collapse, “reduce”); operating on an unknown number of columns (melt and cast tools); visualizing hundreds of samples easily (new visualization tools).
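As a rough illustration of the collapse and melt/cast ideas on this slide, here is a minimal Python sketch. It is not Galaxy's implementation: the function names `collapse` and `cast`, the dictionary layout, and the toy gene counts are all assumptions made for this example. It shows how per-cell tables can be reduced to one long table that keeps the element identifier on every row, and then cast to a wide table whose columns are discovered from the data.

```python
from collections import defaultdict


def collapse(collection):
    """Collapse {cell_id: {gene: count}} into long ("melted") rows.

    Every row carries the element identifier, so each value stays
    traceable to the cell it came from.
    """
    rows = []
    for cell_id in sorted(collection):
        for gene, count in sorted(collection[cell_id].items()):
            rows.append((cell_id, gene, count))
    return rows


def cast(rows, fill=0):
    """Cast long rows into a wide gene-by-cell table.

    The set of columns is discovered from the rows, so the number of
    cells does not need to be known in advance.
    """
    cells = sorted({cell for cell, _, _ in rows})
    by_gene = defaultdict(dict)
    for cell, gene, count in rows:
        by_gene[gene][cell] = count
    header = ["gene"] + cells
    table = [
        [gene] + [by_gene[gene].get(cell, fill) for cell in cells]
        for gene in sorted(by_gene)
    ]
    return header, table


# Hypothetical toy collection: two cells with made-up counts.
collection = {
    "cell_A": {"Gata1": 5, "Spi1": 0},
    "cell_B": {"Gata1": 2, "Cd34": 7},
}
long_rows = collapse(collection)
header, table = cast(long_rows)
```

Keeping the long form as the intermediate means downstream steps never depend on how many cells the collection holds.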
Workflow steps and outputs: import a list of dataset pairs from SRA; read QC; mapping; quantification; comprehensive expression table; collection collapse; cell-based metrics; expression table of cells passing filters; expression table of cells and genes passing filters; table of z-scores per gene per cell; report of experimental metrics. Mo Heydarian
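The "table of z-scores per gene per cell" step can be sketched as follows. This is only an illustration under stated assumptions: the slide does not say how the z-scores were computed, so the choice of the population standard deviation, the handling of zero-variance genes, the function name `zscores`, and the toy values are all hypothetical.

```python
from statistics import mean, pstdev


def zscores(expression):
    """Map {gene: {cell: value}} to {gene: {cell: z}}.

    Assumption: z is computed per gene across cells using the population
    standard deviation; genes with zero variance get z = 0.0.
    """
    result = {}
    for gene, by_cell in expression.items():
        values = list(by_cell.values())
        mu = mean(values)
        sigma = pstdev(values)
        result[gene] = {
            cell: (value - mu) / sigma if sigma > 0 else 0.0
            for cell, value in by_cell.items()
        }
    return result


# Hypothetical toy expression table with made-up values.
expression = {
    "Gata1": {"cell_A": 2.0, "cell_B": 4.0, "cell_C": 6.0},
    "Spi1": {"cell_A": 1.0, "cell_B": 1.0, "cell_C": 1.0},
}
z = zscores(expression)
```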
Big Fella taking big strides: processing all 3,840 cells took 108 hours and generated 100,149 history items. Zero errors! Mo Heydarian
2. Novel lncRNAs follow the “jackpot model”: my lncRNAs are expressed in real cells and in a jackpot model across the population.
What about the backend? Extensive improvements to the Galaxy workflow system to support analysis at this scale. Robustness: pausing, partial restarts, better recovery, better throughput (but nothing you can see).
Galaxy’s workflow system is robust, flexible, and integrates with nearly any environment. Install locally with many compute environments. Deploy on a cloud using Cloudman. Atmosphere.
For example, the single-cell RNA-seq analysis was run on: Galaxy version 16.10; head node: 16 cores, 122 GB (r4.4xlarge); worker nodes: 2 x 16 cores, 122 GB (r4.4xlarge); 10 TB EBS volume.
State of the Galaxy ToolShed: the ToolShed now contains thousands of tools. Community response has been phenomenal. However, packaging is challenging, and it never ends! Need to move to a model that pulls in and integrates with a broader community.
It is now reasonable to support one major server platform — Linux (this is great for portability and reproducibility, but scary for other reasons — monoculture leads to fragility)
Bioconda builds on the Conda packaging system, designed “for installing multiple versions of software packages and their dependencies and switching easily between them”. ~2200 recipes for software packages (as of yesterday). All packages are automatically built in a minimal environment to ensure isolation and portability.
Submit recipe to GitHub. Travis CI pulls recipes and builds in a minimal Docker container. Successful binary builds from the main repo are uploaded to Anaconda, to be installed anywhere.
Containerization: builds on Linux kernel features enabling complete isolation from the kernel level up. Containers: lightweight environments with isolation enforced at the OS level, complete control over all software. Adds a complete ecosystem for sharing, versioning, and managing containers, e.g. Docker Hub, quay.io.
Galaxy + Containers: run every analysis in a clean container, so analyses are isolated and the environment is the same every time. Archive that container (containers are lightweight thanks to layers) and the analysis can always be recreated.
Bioconda + Containers: given a set of packages and versions in Conda/Bioconda, we can build a container with just that software on a minimal base image. If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions). With automation, these containers can be built automatically for every package with no manual modification or intervention (e.g. mulled).
Travis CI pulls recipes and builds in a minimal Docker container. Successful builds from the main repo are uploaded to Anaconda, to be installed anywhere. The same binary from Bioconda is installed into a minimal container for each provider: rkt, Singularity.
Bioconda + Containers + Virtualization: if we run our containers inside a specific (ideally minimal) known VM, we can control the kernel environment as well. Atmosphere, funded by the National Science Foundation.
Increasingly precise environment control: tool and dependency binaries, built in a minimal environment with controlled libs; the container defines the minimum environment; the virtual machine (KVM, Xen, …) controls the kernel and apparent hardware environment.
…and it all just works in Galaxy. Depending on how Galaxy is configured, a tool's software requirements can be resolved with Conda, with BioContainers, or with environment modules, brew, guix, … (Resolvers are completely pluggable.)
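To make the idea of pluggable resolvers concrete, here is a small Python sketch of a resolver chain. It is a hypothetical illustration only, not Galaxy's actual resolver API: the class names, the `resolve` method, the returned strings, and the `example.org` image name are all invented for this example. Each resolver either satisfies a (package, version) requirement or declines, and the first one that succeeds wins.

```python
class Resolver:
    """Hypothetical base class: map (package, version) to a way to run it."""

    name = "base"

    def resolve(self, package, version):
        return None  # None means "cannot satisfy this requirement"


class CondaResolver(Resolver):
    name = "conda"

    def __init__(self, available):
        self.available = set(available)

    def resolve(self, package, version):
        if (package, version) in self.available:
            return f"conda:{package}={version}"
        return None


class ContainerResolver(Resolver):
    name = "container"

    def __init__(self, images):
        self.images = dict(images)

    def resolve(self, package, version):
        return self.images.get((package, version))


def resolve_requirement(package, version, resolvers):
    """Try each configured resolver in order; return the first match."""
    for resolver in resolvers:
        found = resolver.resolve(package, version)
        if found is not None:
            return resolver.name, found
    raise LookupError(f"no resolver could satisfy {package}={version}")


# Hypothetical configuration: the order of the list is the site's preference.
resolvers = [
    CondaResolver({("samtools", "1.3.1")}),
    ContainerResolver({("bwa", "0.7.15"): "example.org/images/bwa:0.7.15"}),
]
```

Because the tool only states what it needs, swapping or reordering the resolver list changes how a requirement is satisfied without touching the tool description.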
What about multiple packages? Generate containers based on a reproducible hash of package name and version. Walk the ToolShed and archive containers for every combination of tools used.
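One way such a reproducible hash could work is sketched below. This is an assumption-laden illustration and not the actual naming scheme used by mulled or BioContainers: the function name `container_name`, the `mulled-sketch` prefix, the `name=version` spec format, and the choice of SHA-1 are all hypothetical. The point is only that sorting the package list before hashing makes the name independent of the order in which requirements are listed.

```python
import hashlib


def container_name(packages, prefix="mulled-sketch"):
    """Derive a deterministic name from (name, version) pairs.

    Assumption: pairs are normalized and sorted before hashing, so the
    same combination always yields the same name regardless of order.
    """
    normalized = sorted((name.lower(), version) for name, version in packages)
    spec = ",".join(f"{name}={version}" for name, version in normalized)
    digest = hashlib.sha1(spec.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest}"


# Hypothetical package combinations with illustrative versions.
first = container_name([("samtools", "1.3.1"), ("bwa", "0.7.15")])
second = container_name([("bwa", "0.7.15"), ("samtools", "1.3.1")])
other = container_name([("bwa", "0.7.15"), ("samtools", "1.4")])
```

Here `first` and `second` are identical, while `other` differs because one version changed.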
Not just for Galaxy. Docker requirement: tightly coupled. Software requirement: can be resolved in an environment-specific way. Implemented in “galaxy-lib”, which is integrated in the CWL reference implementation, …
This is the best stack for complete reproducibility we have ever had in bioinformatics. With the right technologies, reproducibility is possible and practical.
Acknowledgements. Galaxy Team: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Björn Grüning, Sam Guerler, Mo Heydarian, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche, Nicola Soranzo, Marius van den Beek. BioConda and Biocontainers: Johannes Köster, Ryan Dale, Björn Grüning, … All contributors to and users of all of the projects I’ve talked about. NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620), NIDDK (DK065806), and NSF (DBI 0543285, DBI 0850103).