analysis is described/captured in suﬃcient detail that it can be precisely reproduced Reproducibility is not provenance, reusability/ generalizability, or correctness A minimum standard for evaluating analyses
tools, compute resources, terabytes of reference data and permanent storage Open source software that makes integrating your own tools and data and customizing for your own site simple An open extensible platform for sharing tools, datatypes, workflows, ...
complex methods, make these methods available to everyone Transparency: Facilitate communication of analyses and results in ways that are easy to understand while providing all details Reproducibility: Ensure that analysis performed in the system can be reproduced precisely and practically
Blacklight Bridges Dedicated resources Shared XSEDE resources TACC Austin Galaxy Cluster (Rodeo) • 256 cores • 2 TB memory Corral/Stockyard • 20 PB disk funded by the National Science Foundation Award #ACI-1445604 PTI IU Bloomington Leveraging National Cyberinfrastructure: Galaxy/XSEDE Gateway
service management, auto-scaling Cloudbridge: New abstraction library for working with multiple cloud APIs Genomics Virtual Lab: CloudMan + Galaxy + many other common bioinformatics tools and frameworks Galaxy Cloud
capture of parameters for every tool invocation Complete provenance for data relationships (user defined and system wide) Usefulness of such a system relies on having large numbers of tools integrated, how do we facilitate this?
and nurturing community Provide infrastructure to host all tools, make it easy to build tools, install tools into Galaxy, … Quality oversight by a group of volunteers from the community Version and store every dependency of every tool to ensure that we can reconstruct environments exactly
be mapped across the entire collection Existing tools that support multiple inputs and one output act as reducers Many existing tools just work; but “structured” collections like “paired” need explicit support in tools
can be nested to arbitrary depth, structure is preserved through mapping More complex reductions, other collection operations in progress Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming)
(e.g. RStudio) Problems with the notebook model: history can be edited! Only reproducible when all cells are rerun Goal: keep complete history (provenence graph) for every dataset generated from a notebook — preserve Galaxy’s provenance guarantees
of software packages and their dependencies and switching easily between them” ~936 recipes for software packages (as of yesterday) All packages are built in a minimal environment to ensure isolation and portability
the kernel level up Containers — lightweight environments with isolation enforced at the OS level, complete control over all software Adds a complete ecosystem for sharing, versioning, managing containers — Docker hub
in Conda/ Bioconda, we can build a container with just that software on a minimal base image If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions) And we can even host on a specific VM image…
Martin Cěch, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Björn Grüning, Aysam Guerler, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche, Nicola Soranzo, Nitesh Turaga, Marius van den Beek JHU Data Science: Jeﬀ Leek, Roger Peng, … BioConda: Johannes Köster, Björn Grüning, Ryan Dale, Andreas Sjödin, Adam Caprez, Chris Tomkins-Tinch, Brad Chapman, Alexey Strokach, … CWL: Peter Amstutz, Robin Andeer, Brad Chapman, John Chilton, Michael R. Crusoe, Roman Valls Guimerà, Guillermo Carrasco Hernandez, Sinisa Ivkovic, Andrey Kartashov, John Kern, Dan Leehr, Hervé Ménager, Maxim Mikheev, Tim Pierce, Josh Randall, Stian Soiland-Reyes, Luka Stojanovic, Nebojša Tijanić Everyone I forgot…