Galaxy: a platform for accessible and reproducible scientific research
• Initially built for genomics, but intended to support any compute- and data-intensive discipline
• Provided both as a free public SaaS application (usegalaxy.org) and as open-source software
Galaxy tools:
• Defined in terms of an abstract interface (inputs and outputs)
• In practice, mostly command-line tools: a declarative XML description of the interface and of how to generate a command line (see the sketch below)
• Designed to be as easy as possible for tool authors, while still allowing rigorous reasoning
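A sketch of such a declarative description, loosely following the seqtk example that appears later in this talk (the id, name, and version here are illustrative, not the real IUC wrapper):

    <tool id="seqtk_seq" name="Convert to FASTA (seqtk)" version="0.1.0">
        <!-- How to generate the command line from the bound inputs/outputs -->
        <command><![CDATA[seqtk seq -a '$input1' > '$output1']]></command>
        <inputs>
            <param name="input1" type="data" format="fastq" label="Input FASTQ"/>
        </inputs>
        <outputs>
            <data name="output1" format="fasta"/>
        </outputs>
    </tool>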
Workflows capture the relationships between a set of tools (the steps of the workflow)
• With some extras:
  • Map and reduce data flows, enabled by dataset collections
  • Sub-workflows
  • Pause points, decision points, re-planning
A workflow defines an abstract interface to the underlying software
• Given valid data and parameters, we can realize this as an ordered graph of command lines to execute
• But we still need to ensure that the appropriate software is available
The old way: install everything on the PATH Galaxy is using. I know. Look, it was 2005…
• Biggest problem: versioning
• We soon had workflows where different steps required different versions of some underlying software (hello, samtools…)
• For reproducibility, we wanted to be able to run workflows with older versions of the underlying software
Dependency resolvers:
• Allow a command line to be augmented based on a tool's requirements (with a plugin interface, of course)
• The default implementation looks for a directory based on the tool name/version and runs a shell script, "env.sh", which adds to the environment (see the sketch below)
• Alternative implementations for Modules, brew, … soon followed
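A sketch of the layout the default resolver expects, with an illustrative base directory and package (the exact paths are deployment-specific assumptions):

    # <tool_dependency_dir>/samtools/0.1.19/env.sh
    # Sourced before the tool's command line runs;
    # prepends this specific version's binaries to the PATH.
    export PATH="/opt/galaxy/deps/samtools/0.1.19/bin:$PATH"

A sibling directory such as samtools/1.2/env.sh can coexist with it, which is what makes per-step versioning possible.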
ToolShed enters the scene
• Uses a similar structure, separating dependencies/versions into different directories
• But includes installation recipes, so the Galaxy maintainer no longer needs to install each tool manually
• Made sense at the time, but packaging is hard, and it was basically a nightmare
Conda: "…of software packages and their dependencies and switching easily between them"
• More than 4,000 recipes for software packages
• All packages are automatically built in a minimal environment to ensure isolation and portability
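In practice (a minimal sketch; the package and versions are illustrative), installing and switching between versions looks like:

    # Build one environment per tool version, pulling from the bioconda channel
    conda create -n samtools-0.1.19 -c bioconda samtools=0.1.19
    conda create -n samtools-1.2 -c bioconda samtools=1.2
    # Switch between versions by activating the corresponding environment
    source activate samtools-1.2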
• No compilation at installation time: binaries ship with their dependencies, libraries…
• Support for all operating systems Galaxy targets
• Easy to manage multiple versions of the same recipe
• HPC-ready: no root privileges needed
• Easy-to-write YAML recipes
• Community: not restricted to Galaxy
Channels searched to resolve packages:
• iuc
• bioconda
• defaults
• conda-forge
Galaxy now automatically installs Conda when first launched, and will use Bioconda and other channels for package resolution (configuration sketched below).
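The relevant knobs in Galaxy's configuration look roughly like this (option names as in Galaxy's sample configuration; the values, and the channel order, are illustrative and may differ between releases):

    # galaxy.yml (galaxy.ini in older releases)
    conda_auto_init: true        # download and install Conda on first launch
    conda_auto_install: true     # install missing packages when a tool needs them
    conda_ensure_channels: iuc,bioconda,defaults,conda-forge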
Conda is now the best practice for tool dependency management in Galaxy
• All tools in the "devteam" and "iuc" repositories now use requirement specifications that can be resolved by Conda (see the snippet below)
• ToolShed packages are still supported, but deprecated
• Result: completely automatic installation of all the software needed to run a Galaxy workflow
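Concretely, a tool declares what it needs with requirement tags, which the Conda resolver maps to packages and versions (the version here matches the seqtk container that appears later):

    <requirements>
        <requirement type="package" version="1.2">seqtk</requirement>
    </requirements>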
Why run tools in containers?
• Ensure the tool is running in a known (and minimal) environment, limiting side effects
• Better reproducibility
• Packaging and distribution: leverage an existing ecosystem for deploying and running software
• Security? It would be nice to be able to count on that…
For instance, transform the cluster destination:

    <destination id="short_fast" runner="slurm">
        <param id="nativeSpecification">--time=00:05:00</param>
    </destination>

as follows:

    <destination id="short_fast" runner="slurm">
        <param id="nativeSpecification">--time=00:05:00</param>
        <param id="docker_enabled">true</param>
        <param id="docker_sudo">false</param>
    </destination>

But how do we find the right container for a tool?
Containers are registered for best-practice tools. Let's lint that tool config with Planemo:

    $ planemo lint --biocontainers seqtk_seq.xml
    ...
    Applying linter biocontainer_registered... CHECK
    .. INFO: BioContainer best-practice container found [quay.io/biocontainers/seqtk:1.2--0].
For every package in Conda/Bioconda, we can build a container with just that software on a minimal base image
• If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions)
• With automation, these containers can be built for every package with no manual modification or intervention (e.g. mulled)
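As a sketch, the image found by Planemo above can be pulled and run directly (the seqtk invocation itself is illustrative):

    # Pull the automatically built BioContainers image for seqtk 1.2
    docker pull quay.io/biocontainers/seqtk:1.2--0
    # Run the tool inside the minimal container
    docker run --rm quay.io/biocontainers/seqtk:1.2--0 seqtk seq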
Successful builds from the main repo are uploaded to Anaconda, to be installed anywhere. The same binary from Bioconda is installed into a minimal container for each provider: Docker, rkt, Singularity.
Galaxy Cloud:
• CloudMan: cluster and service management, auto-scaling
• CloudBridge: a new abstraction library for working with multiple cloud APIs
• Genomics Virtual Lab: CloudMan + Galaxy + many other common bioinformatics tools and frameworks
• Container-based job execution uses the existing job runners, as long as the target supports the container engine (Docker or Singularity); see the sketch below
• Containers are resolved using the existing requirement tags, allowing flexibility in how dependencies are resolved in different environments
• It doesn't matter whether Galaxy itself is running in a container or not.
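A minimal sketch of a corresponding destination for a Singularity-capable cluster (the parameter names follow the docker_* pattern shown earlier; the default image is an assumption for illustration):

    <destination id="singularity_slurm" runner="slurm">
        <param id="singularity_enabled">true</param>
        <!-- Used only if no container can be resolved from the tool's requirements -->
        <param id="singularity_default_container_id">busybox:ubuntu-14.04</param>
    </destination>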
Galaxy server architecture: proxy server (nginx), database (Postgres…), job runner(s), object store
• Any number of object stores: files, S3, Azure, iRODS, …
• Hierarchical, distributed…
There is real complexity in maintaining a Galaxy instance
• Galaxy is deployed in a wide variety of environments, from appliances to institutional HPC to all different sorts of clouds
• In recent years we've introduced lots of automation to make things easier, primarily through Ansible (sketched below)
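A minimal sketch of that automation, assuming the galaxyproject.galaxy role from Ansible Galaxy (the variables shown are a tiny, illustrative subset of what a real playbook sets):

    # playbook.yml - install and configure a Galaxy server
    - hosts: galaxyservers
      vars:
        galaxy_root: /srv/galaxy
        galaxy_commit_id: release_17.09
      roles:
        - galaxyproject.galaxy

    # run with: ansible-playbook -i hosts playbook.yml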
• Kubernetes: "…automating deployment, scaling, and management of containerized applications."
• Helm: "The package manager for Kubernetes."
• Rancher: "Enterprise management for Kubernetes. Every distro. Every cluster. Every cloud."
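A sketch of what deploying Galaxy this way can look like, using Helm 2-era syntax (the repository URL and chart name are hypothetical placeholders, not a specific endorsed source):

    # Add a repository containing a Galaxy chart and deploy it to the cluster
    helm repo add galaxy-charts https://example.org/galaxy-helm-charts
    helm install galaxy-charts/galaxy --name my-galaxy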
Acknowledgments
Galaxy team: Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy Goecks, Björn Grüning, Sam Guerler, Jennifer Hillman-Jackson, Anton Nekrutenko, Helena Rasche, Nicola Soranzo, Marius van den Beek
+CloudLaunch+Kubernetes: Nuwan Goonasekera, Pablo Moreno
Other lab members: Boris Brenerman, Min Hyung Cho, Peter DeFord, Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy
Collaborators: Craig Stewart and group; Ross Hardison and the VISION group; Victor Corces (Emory), Karen Reddy (JHU); Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology); Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
Funding: NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620), NIDDK (DK065806), NSF (DBI 0543285, DBI 0850103); funded in part by National Science Foundation Award #ACI-1445604