Containers and Workflows in Galaxy

Containers and workflows in Galaxy
@jxtx / #usegalaxy
https://speakerdeck.com/jxtx

Acknowledgements
Team: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy Goecks, Björn Grüning, Sam Guerler, Jennifer Hillman-Jackson, Anton Nekrutenko, Helena Rasche, Nicola Soranzo, Marius van den Beek
+CloudLaunch+Kubernetes: Nuwan Goonasekera, Pablo Moreno
Other lab members: Boris Brenerman, Min Hyung Cho, Peter DeFord, Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy
Collaborators:
• Craig Stewart and the group
• Ross Hardison and the VISION group
• Victor Corces (Emory), Karen Reddy (JHU)
• Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology)
• Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620), NIDDK (DK065806) and NSF (DBI 0543285, DBI 0850103)
funded by the National Science Foundation Award #ACI-1445604

What is Galaxy?

Galaxy is a web-based analysis environment for accessible, transparent, and reproducible scientific research
• Initially built for genomics, but intended to support any compute- and data-intensive discipline
• Provided both as a free public SaaS application (usegalaxy.org) and as open-source software

Tools and workflows in Galaxy
(Today we’re going to focus on tools and workflows, but Galaxy is much more, including visualization, interactive environments, sharing and publishing integrated analyses…)

The core unit of analysis in Galaxy is the Tool
• Defined in terms of an abstract interface (inputs and outputs)
• In practice, mostly command-line tools, each wrapped by a declarative XML description of the interface and of how to generate a command line
• Designed to be as easy as possible for tool authors, while still allowing rigorous reasoning

User interface generated from abstract parameter description

Template for generating command line from parameter values

Functional tests to be run with the “full stack” in
place

Complex interfaces can be described

Repeating groups of parameters

Conditional groups, grouping constructs can be nested

Template language for building complex command lines

Or additional configuration files, scripts, ...
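
As a rough illustration of the idea on the preceding slides (parameter values in, command line out), here is a minimal sketch. It is not how Galaxy does it: Galaxy tools use the Cheetah template language, while this uses Python's `string.Template` as a stand-in. The parameter names (`reverse_complement`, `min_length`) and the choice of seqtk flags are assumptions made for the example.

```python
import shlex
from string import Template

# Stand-in template; Galaxy tools use Cheetah, not string.Template.
COMMAND = Template("seqtk seq ${flags}${input} > ${output}")


def build_command(params):
    """Render a command line from a dict of parameter values."""
    flags = ""
    # Conditional parameter: only emit the flag when it is switched on.
    if params.get("reverse_complement"):
        flags += "-r "
    # Optional parameter: only emit the flag when a value was given.
    if params.get("min_length") is not None:
        flags += f"-L {int(params['min_length'])} "
    return COMMAND.substitute(
        flags=flags,
        input=shlex.quote(params["input"]),    # quote user-supplied paths
        output=shlex.quote(params["output"]),
    )


print(build_command({"input": "reads.fq", "output": "out.fa",
                     "reverse_complement": True, "min_length": 50}))
```

Keeping the flag logic separate from the template is the point: the interface description says which parameters exist, and the template only decides how they appear on the command line.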

Workflows combine tools into more complex analyses

Simple data flows
Output of bwa mapping (a SAM file)… flows into MACS for peak calling

Parallelized data flows
• Input is a “dataset collection” with any number of paired-read datasets
• HiSat2 alignments are run in parallel across each collection
• StringTie transcript assembly is also run in parallel across all datasets
• Transcripts from all datasets are combined to produce a consensus assembly

Galaxy Workflows: Summary
• A directed graph capturing data flow relationships between a set of tools (the steps of the workflow)
• With some extras:
• Map and reduce data flows enabled by dataset collections
• Sub-workflows
• Pause points, decision points, re-planning
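
The map and reduce pattern over a dataset collection can be sketched in a few lines. This is only an illustration under assumptions: the step functions below are stand-ins that return strings instead of running HiSat2 or StringTie, and a thread pool stands in for whatever job runner actually executes the steps.

```python
from concurrent.futures import ThreadPoolExecutor


def map_over_collection(step, collection):
    """Run one workflow step independently on every element of a collection."""
    with ThreadPoolExecutor() as pool:
        # pool.map preserves the input order of the collection.
        return list(pool.map(step, collection))


def run_workflow(collection, align, assemble, merge):
    alignments = map_over_collection(align, collection)       # map
    transcripts = map_over_collection(assemble, alignments)   # map
    return merge(transcripts)                                 # reduce


# Stand-in steps: each returns a label for what the real tool would produce.
def align(pair):
    return f"aligned({pair[0]},{pair[1]})"


def assemble(alignment):
    return f"transcripts({alignment})"


def merge(all_transcripts):
    return "consensus[" + ";".join(all_transcripts) + "]"


print(run_workflow([("a_1", "a_2"), ("b_1", "b_2")], align, assemble, merge))
```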

Tooling to develop tools and workflows: “A scientific workflow SDK”
• The way to develop Galaxy tools
• Linting, testing, …
• Support every aspect of the tool development lifecycle

Run a workflow in a dynamically created or existing Galaxy
template

Build / edit your workflows in a text editor

Testing workflows

Realizing tool and workflow execution

From “tools” to “dependencies”
• Tools and workflows describe the interface to the underlying software
• Given valid data and parameters, we can realize this to an ordered graph of command lines to execute
• But, we still need to ensure that the appropriate software is available
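
One way to picture "an ordered graph of command lines" is a topological sort over the workflow steps. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The step names and command strings are illustrative assumptions, not output from Galaxy.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def order_commands(commands, depends_on):
    """Return command lines so that each step runs after the steps it reads from.

    commands:   step name -> command line
    depends_on: step name -> set of upstream step names
    """
    graph = {step: depends_on.get(step, set()) for step in commands}
    # static_order() yields every step after all of its predecessors.
    return [commands[step] for step in TopologicalSorter(graph).static_order()]


# Illustrative two-step flow: mapping feeds peak calling.
commands = {
    "peaks": "macs2 callpeak -t aln.sam",
    "map": "bwa mem ref.fa reads.fq > aln.sam",
}
for line in order_commands(commands, {"peaks": {"map"}}):
    print(line)
```

A cycle in `depends_on` raises `graphlib.CycleError`, which matches the requirement that a workflow be a directed acyclic graph.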

History of Galaxy tool dependencies…
• In the beginning, just install everything on the PATH Galaxy is using. I know. Look, it was 2005…
• Biggest problem: versioning.
• We soon had workflows where different steps required different versions of some underlying software (hello samtools…)
• For reproducibility, we wanted to be able to run workflows with older versions of the underlying software

Requirement metadata
<requirements>
    <requirement type="package" version="1.2">seqtk</requirement>
</requirements>
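
To make the requirement metadata concrete, here is a minimal sketch of reading it back out of a tool file with Python's standard `xml.etree.ElementTree`. The enclosing `<tool>` element and its `id` are assumptions added so the snippet parses on its own. This is not Galaxy's own parser.

```python
import xml.etree.ElementTree as ET

# Minimal tool file; the <tool> wrapper and its id are assumed for the example.
TOOL_XML = """
<tool id="seqtk_seq">
  <requirements>
    <requirement type="package" version="1.2">seqtk</requirement>
  </requirements>
</tool>
"""


def package_requirements(tool_xml):
    """Return (name, version) pairs for every package requirement."""
    root = ET.fromstring(tool_xml)
    return [
        (req.text.strip(), req.get("version"))
        for req in root.iter("requirement")
        if req.get("type") == "package"
    ]


print(package_requirements(TOOL_XML))
```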

History of Galaxy tool dependencies…
• Take 2: Dependency Resolvers
• Allow a command line to be augmented based on a tool’s requirements (with a plugin interface, of course)
• Default implementation looks for a directory based on the tool name/version and runs a shell script “env.sh” which adds to the environment
• Alternative implementations for modules, brew, … soon followed
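
The default resolver described above can be approximated as follows. This is a sketch under assumptions: the `<base_dir>/<name>/<version>/env.sh` layout and both function names are hypothetical, chosen only to show how a command line gets a prefix that sources the environment script.

```python
import shlex
from pathlib import Path


def resolve_env_script(base_dir, name, version):
    """Return a shell prefix that sources env.sh for name/version, or None."""
    # Assumed layout: <base_dir>/<name>/<version>/env.sh
    script = Path(base_dir) / name / version / "env.sh"
    if script.is_file():
        return f". {shlex.quote(str(script))}; "
    return None


def augment_command(command, base_dir, requirements):
    """Prefix a command line with one env.sh per (name, version) requirement."""
    prefix = ""
    for name, version in requirements:
        snippet = resolve_env_script(base_dir, name, version)
        if snippet is None:
            raise LookupError(f"no env.sh found for {name} {version}")
        prefix += snippet
    return prefix + command
```

Other resolvers (modules, brew, …) would plug in at `resolve_env_script`, returning a different prefix for the same requirement.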

History of Galaxy tool dependencies…
• Take 3: The Galaxy ToolShed enters the scene
• Uses a similar structure, separating dependencies/versions into different directories
• But, includes installation recipes so that the Galaxy maintainer no longer needs to install each tool manually
• Made sense at the time, but packaging is hard, and was basically a nightmare

History of Galaxy tool dependencies…
• Take 4: The packaging ecosystem is getting better, and community-developed projects are realizing the importance of OS-independent, versioned package management. Let’s join in!

https://bioconda.github.io

It is now reasonable to support one major server platform
— Linux (this is great for portability and reproducibility, but scary for other reasons — monoculture leads to fragility)

• Builds on the Conda packaging system, designed “for installing multiple versions of software packages and their dependencies and switching easily between them”
• More than 4000 recipes for software packages
• All packages are automatically built in a minimal environment to ensure isolation and portability

• Submit recipe to GitHub
• Travis CI pulls recipes and builds in minimal Docker container
• Successful binary builds from main repo uploaded to Anaconda to be installed anywhere

Conda: Key Features for Galaxy
• No compilation at install time: binaries with their dependencies, libraries...
• Support for all operating systems Galaxy targets
• Easy to manage multiple versions of the same recipe
• HPC-ready: no root privileges needed
• Easy-to-write YAML recipes
• Community: not restricted to Galaxy

Best practice channels
• Conda channels searched by Galaxy for packages: iuc, bioconda, defaults, conda-forge
• Galaxy now automatically installs Conda when first launched and will use Bioconda and other channels for package resolution
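
For a sense of what resolving one requirement against these channels amounts to, here is a sketch that assembles a `conda create` invocation with the channels in the order listed above. The environment name is a hypothetical choice for the example, and the function only builds the argument list without running Conda.

```python
# Channel order as listed on the slide; earlier channels take priority.
CHANNELS = ("iuc", "bioconda", "defaults", "conda-forge")


def conda_create_args(name, version, channels=CHANNELS):
    """Build the argument list for creating a one-package Conda environment."""
    env_name = f"__{name}@{version}"  # hypothetical environment name
    args = ["conda", "create", "--yes", "--name", env_name,
            "--override-channels"]
    for channel in channels:
        args += ["--channel", channel]
    args.append(f"{name}={version}")  # pin the exact package version
    return args


print(" ".join(conda_create_args("seqtk", "1.2")))
```

One environment per name and version is what lets different workflow steps use different versions of the same software side by side.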

Installing tools with Conda

Managing Tool Dependencies

Tool Dependencies 2018
• Conda and Bioconda are the best practice for tool dependency management in Galaxy
• All tools in the “devteam” and “iuc” repositories now use requirement specifications that can be resolved by Conda
• ToolShed packages still supported, but deprecated
• Result: completely automatic installation of all the software needed to run a Galaxy workflow

But what about Containerization?

Why we like containers
• Isolation: can ensure that software is running in a known (and minimal) environment, limit side effects
• Better reproducibility
• Packaging and distribution, leverage existing ecosystem for deploying and running software
• Security? Would be nice to be able to count on that…

2015: Galaxy in Docker

Galaxy and Containers
• We can run Galaxy itself in a container. Great. Super easy distribution.
• Next level: run tools in containers, so that every step of a workflow runs in an isolated environment

Configuring a Galaxy instance to use Docker
Configure the destination. For instance, transform the cluster destination:
<destination id="short_fast" runner="slurm">
    <param id="nativeSpecification">--time=00:05:00</param>
</destination>
as follows:
<destination id="short_fast" runner="slurm">
    <param id="nativeSpecification">--time=00:05:00</param>
    <param id="docker_enabled">true</param>
    <param id="docker_sudo">false</param>
</destination>
But, how do we find the right container for a tool?
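
The destination change on this slide is small enough to apply by script. The sketch below uses Python's standard `xml.etree.ElementTree` on a single `<destination>` element. The helper name `enable_docker` is hypothetical, and a real job configuration file contains more than one destination.

```python
import xml.etree.ElementTree as ET


def enable_docker(destination_xml):
    """Return the destination XML with the two Docker params set."""
    dest = ET.fromstring(destination_xml)
    for param_id, value in (("docker_enabled", "true"),
                            ("docker_sudo", "false")):
        param = dest.find(f"param[@id='{param_id}']")
        if param is None:
            # Add the param only if the destination does not define it yet.
            param = ET.SubElement(dest, "param", {"id": param_id})
        param.text = value
    return ET.tostring(dest, encoding="unicode")


BEFORE = (
    '<destination id="short_fast" runner="slurm">'
    '<param id="nativeSpecification">--time=00:05:00</param>'
    "</destination>"
)
print(enable_docker(BEFORE))
```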

Remember that requirement metadata?
<requirements>
    <requirement type="package" version="1.2">seqtk</requirement>
</requirements>

Automatic container resolution
Galaxy will automatically find or build containers for best practice tools. Let’s lint that tool config with Planemo:
$ planemo lint --biocontainers seqtk_seq.xml
...
Applying linter biocontainer_registered... CHECK
.. INFO: BioContainer best-practice container found [quay.io/biocontainers/seqtk:1.2--0].
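
The image name in the lint output follows a visible pattern: registry, package name, version, then a build suffix. A minimal sketch of that mapping for a single requirement follows. The function name and the default build value are assumptions; in practice the build suffix has to be looked up rather than guessed.

```python
def biocontainer_image(name, version, build="0",
                       registry="quay.io/biocontainers"):
    """Compose an image name of the form <registry>/<name>:<version>--<build>."""
    # The build suffix is an assumption here; a real resolver would query
    # the registry for the builds that actually exist.
    return f"{registry}/{name}:{version}--{build}"


print(biocontainer_image("seqtk", "1.2"))
```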

BioContainers

Bioconda + Containers
• Given a set of packages and versions in Conda/Bioconda, we can build a container with just that software on a minimal base image
• If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions)
• With automation, these containers can be built automatically for every package with no manual modification or intervention (e.g. mulled)

• Travis CI pulls recipes and builds in minimal Docker container
• Successful builds from main repo uploaded to Anaconda to be installed anywhere
• Same binary from Bioconda installed into minimal container for each provider (rkt, Singularity)

Bioconda + Containers + Virtualization
If we run our containers inside a specific (ideally minimal) known VM we can control the kernel environment as well
Atmosphere funded by the National Science Foundation

Increasingly precise environment control
• Tool and dependency binaries, built in minimal environment with controlled libs
• Container defines minimum environment
• Virtual machine controls kernel and apparent hardware environment (KVM, Xen, …)

Galaxy Cloud
• CloudMan: General purpose deployment manager for any cloud. Cluster and service management, auto-scaling
• Cloudbridge: New abstraction library for working with multiple cloud APIs
• Genomics Virtual Lab: CloudMan + Galaxy + many other common bioinformatics tools and frameworks

What about multiple dependencies? Generate containers based on a reproducible
hash of package name and version
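
The idea of a reproducible hash can be sketched as below. This is a simplified illustration only and does not reproduce the naming scheme of the mulled tooling: the `mulled-sketch-` prefix, the canonical string format, and the package versions in the example are all assumptions.

```python
import hashlib


def multi_package_image_name(requirements):
    """Derive a deterministic image name from (name, version) pairs."""
    # Sort first so the same set of packages always yields the same name,
    # regardless of the order in which a tool lists its requirements.
    canonical = ",".join(f"{name}={version}"
                         for name, version in sorted(requirements))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    return f"mulled-sketch-{digest}"


# Illustrative versions, not taken from any particular tool.
print(multi_package_image_name([("samtools", "1.3.1"), ("bwa", "0.7.15")]))
```

Because the name depends only on the package set, any tool with the same requirements resolves to the same container, and the container can be rebuilt later under the same name.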

Not just for Galaxy
• Docker requirement, tightly coupled
• Software requirement, can be resolved in an environment-specific way
• Implemented in “galaxy-lib”, integrated in CWL reference implementation, …

Galaxy Containerization
• Galaxy can run all jobs in containers
• Uses existing job runners as long as the target supports the container engine (Docker or Singularity)
• Resolve containers using existing requirement tags, allows flexibility in how dependencies are resolved in different environments
• Doesn’t matter if Galaxy itself is running in a container or not.

Anatomy of a Galaxy instance

Anatomy of a Galaxy instance (“out of the box”)
• Web Server (Paste or uwsgi)
• Database (sqlite)
• Job Runner (local)
• Object Store (local files)

Anatomy of a Galaxy instance (“production”)
• Web Server (uwsgi)
• Database (postgres…)
• Job Runner(s)
• Object Store
• Proxy Server (nginx)

Anatomy of a Galaxy instance: Running Jobs
• Any number of Job Runners: Slurm, PBS, Grid Engine, Condor, Kubernetes…

Anatomy of a Galaxy instance: Storing Data
• Any number of object stores: Files, S3, Azure, iRods, …
• Hierarchical, distributed…

Anatomy of a Galaxy instance: More services…
• Messaging (RabbitMQ)
• File Transfers (ProFTPd)

It can get pretty complicated (e.g. usegalaxy.org)
[Figure: dedicated resources and shared XSEDE resources used by usegalaxy.org]
• TACC, Austin: Galaxy Cluster (Rodeo), 256 cores, 2 TB memory; Corral/Stockyard, 20 PB disk; Stampede, 462,462 cores, 205 TB memory
• PSC, Pittsburgh: Blacklight, Bridges
• PTI, IU Bloomington
funded by the National Science Foundation Award #ACI-1445604

[Figure: Main Compute Architecture. Labels shown: VMWare; web-01, web-02, web-03, web-04; db-01; slurm; rabbitmq; pulsar; PostgreSQL; Slurm; NFS; Corral (2 PB dataset storage); dedicated cluster (roundup49 ... roundup64); Pulsar/AMQP; Pulsar/HTTP; Stampede (TACC); Bridges (PSC); Jetstream (TACC) and Jetstream (IU) with slurm/pulsar instances; Swarm @ JS instances]

Stats for Galaxy Main (usegalaxy.org) in May 2018
• 125,000 registered users
• 2 PB user data
• 19M jobs run
• 100 training events (2017 & 2018)

Orchestrating all this is hard
• Lots of pieces involved in maintaining a Galaxy instance
• Galaxy is deployed in a wide variety of environments, from appliances to institutional HPC to all different sorts of clouds
• In recent years we’ve introduced lots of automation to make things easier, primarily through Ansible.

Containers all the way down

“Production-Grade Container Orchestration”

Galaxy + Kubernetes Stack
• Kubernetes: “an open-source system for automating deployment, scaling, and management of containerized applications.”
• Helm: “The package manager for Kubernetes.”
• Rancher: “Enterprise management for Kubernetes. Every distro. Every cluster. Every cloud.”

Demo? http://launch.usegalaxy.org
