Containers and Workflows in Galaxy

Presented at the Rocky Mountain Genomics HackCon workshop on Containers and Workflows in Genomics #rmghc18.

James Taylor

June 18, 2018

Transcript

  1. Acknowledgements
    Team: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy Goecks, Björn Grüning, Sam Guerler, Jennifer Hillman-Jackson, Anton Nekrutenko, Helena Rasche, Nicola Soranzo, Marius van den Beek
    +CloudLaunch+Kubernetes: Nuwan Goonasekera, Pablo Moreno
    Other lab members: Boris Brenerman, Min Hyung Cho, Peter DeFord, Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy
    Collaborators:
    • Craig Stewart and the group
    • Ross Hardison and the VISION group
    • Victor Corces (Emory), Karen Reddy (JHU)
    • Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology)
    • Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
    NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620), NIDDK (DK065806) and NSF (DBI 0543285, DBI 0850103)
    Funded by the National Science Foundation Award #ACI-1445604
  2. Galaxy is a web-based analysis environment for accessible, transparent, and reproducible scientific research
    Initially built for genomics, but intended to support any compute- and data-intensive discipline
    Provided both as a free public SaaS application (usegalaxy.org) and as open-source software
  3. Tools and workflows in Galaxy
    (Today we’re going to focus on tools and workflows, but Galaxy is much more, including visualization, interactive environments, sharing and publishing integrated analyses…)
  4. The core unit of analysis in Galaxy is the Tool
    • Defined in terms of an abstract interface (inputs and outputs)
    • In practice, mostly command line tools, with a declarative XML description of the interface and of how to generate a command line
    • Designed to be as easy as possible for tool authors, while still allowing rigorous reasoning
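The idea on this slide can be sketched in Python: a declarative description of inputs and outputs, plus a command template, is enough to generate a command line. The XML layout and the `build_command` helper below are hypothetical simplifications for illustration, not Galaxy's actual tool schema or templating engine, and the `seqtk seq -r` invocation is an assumed example.

```python
import shlex
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical, simplified tool description (NOT Galaxy's real tool schema).
TOOL_XML = """
<tool id="seqtk_seq" version="1.2">
  <command>seqtk seq -r $input1 &gt; $output1</command>
  <inputs>
    <param name="input1" type="data"/>
  </inputs>
  <outputs>
    <data name="output1"/>
  </outputs>
</tool>
"""


def build_command(tool_xml: str, values: dict) -> str:
    """Fill the tool's command template with shell-quoted parameter values."""
    root = ET.fromstring(tool_xml)
    # The declared interface: every input and output name must get a value.
    declared = [p.get("name") for p in root.find("inputs")]
    declared += [d.get("name") for d in root.find("outputs")]
    missing = [name for name in declared if name not in values]
    if missing:
        raise ValueError(f"missing values for: {missing}")
    quoted = {name: shlex.quote(str(values[name])) for name in declared}
    return Template(root.find("command").text.strip()).substitute(quoted)


print(build_command(TOOL_XML, {"input1": "reads.fq", "output1": "out.fq"}))
```

Because the interface is declared separately from the command, the same description can be checked (are all inputs supplied?) before anything is executed.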
  5. Galaxy Workflows: Summary
    • A directed graph capturing data flow relationships between a set of tools (the steps of the workflow)
    • With some extras
      • Map and reduce data flows enabled by dataset collections
      • Sub-workflows
      • Pause points, decision points, re-planning
  6. Tooling to develop tools and workflows: “A scientific workflow SDK”
    The way to develop Galaxy tools
    Linting, testing, …
    Support every aspect of the tool development lifecycle
  7. From “tools” to “dependencies”
    • Tools and workflows describe the interface to the underlying software
    • Given valid data and parameters, we can realize this to an ordered graph of command lines to execute
    • But, we still need to ensure that the appropriate software is available
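The "ordered graph of command lines" can be illustrated with a topological sort. This is a minimal sketch, not Galaxy's workflow scheduler: the step names and command lines are hypothetical, and it uses the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps whose output it consumes.
workflow = {
    "step_a": set(),
    "step_b": {"step_a"},
    "step_c": {"step_b"},
}

# Hypothetical command lines, one per step.
commands = {
    "step_a": "tool_a input.txt > a.txt",
    "step_b": "tool_b a.txt > b.txt",
    "step_c": "tool_c b.txt > c.txt",
}


def ordered_commands(graph: dict, commands: dict) -> list:
    """Return command lines in an order that respects data-flow dependencies."""
    return [commands[step] for step in TopologicalSorter(graph).static_order()]


for line in ordered_commands(workflow, commands):
    print(line)
```

A cycle in the graph raises `graphlib.CycleError`, which matches the requirement that a workflow be a directed graph that can actually be ordered.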
  8. History of Galaxy tool dependencies…
    • In the beginning, just install everything on the PATH Galaxy is using. I know. Look, it was 2005…
    • Biggest problem: versioning.
    • We soon had workflows where different steps required different versions of some underlying software (hello samtools…)
    • For reproducibility, we wanted to be able to run workflows with older versions of the underlying software
  9. History of Galaxy tool dependencies…
    • Take 2: Dependency Resolvers
    • Allow a command line to be augmented based on a tool’s requirements (with a plugin interface, of course)
    • Default implementation looks for a directory based on the tool name/version and runs a shell script “env.sh” which adds to the environment
    • Alternative implementations for modules, brew, … soon followed
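A rough Python sketch of the default resolver described above. The directory layout `<base>/<name>/<version>/env.sh` and the function names are assumptions made for illustration; this is not Galaxy's resolver code.

```python
import shlex
import tempfile
from pathlib import Path
from typing import Optional


def resolve_env_script(base_dir: Path, name: str, version: str) -> Optional[Path]:
    """Look for <base_dir>/<name>/<version>/env.sh (assumed layout); None if absent."""
    script = base_dir / name / version / "env.sh"
    return script if script.is_file() else None


def augment_command(command: str, base_dir: Path, requirements: dict) -> str:
    """Prefix the command with `. env.sh` for every requirement that resolves."""
    prefix = []
    for name, version in requirements.items():
        script = resolve_env_script(base_dir, name, version)
        if script is not None:
            prefix.append(f". {shlex.quote(str(script))}")
    return "; ".join(prefix + [command])


# Demo with a throwaway dependency directory.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    env_dir = base / "samtools" / "0.1.19"
    env_dir.mkdir(parents=True)
    (env_dir / "env.sh").write_text('export PATH="$PWD/bin:$PATH"\n')
    print(augment_command("samtools view in.bam", base, {"samtools": "0.1.19"}))
```

Keeping each version in its own directory is what lets different workflow steps use different versions of the same software.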
  10. History of Galaxy tool dependencies…
    • Take 3: The Galaxy ToolShed enters the scene
    • Uses a similar structure, separating dependencies/versions into different directories
    • But, includes installation recipes so that the Galaxy maintainer no longer needs to install each tool manually
    • Made sense at the time, but packaging is hard, and was basically a nightmare
  11. History of Galaxy tool dependencies…
    • Take 4: The packaging ecosystem is getting better, and community-developed projects are realizing the importance of OS-independent, versioned package management. Let’s join in!
  12. It is now reasonable to support one major server platform: Linux
    (this is great for portability and reproducibility, but scary for other reasons: monoculture leads to fragility)
  13. Builds on the Conda packaging system, designed “for installing multiple versions of software packages and their dependencies and switching easily between them”
    More than 4000 recipes for software packages
    All packages are automatically built in a minimal environment to ensure isolation and portability
  14. Submit recipe to GitHub
    Travis CI pulls recipes and builds in minimal Docker container
    Successful binary builds from main repo uploaded to Anaconda to be installed anywhere
  15. Conda: Key Features for Galaxy
    • No compilation at install time: binaries with their dependencies, libraries...
    • Support for all operating systems Galaxy targets
    • Easy to manage multiple versions of the same recipe
    • HPC-ready: no root privileges needed
    • Easy-to-write YAML recipes
    • Community: not restricted to Galaxy
  16. Best practice channels
    • Conda channels searched by Galaxy for packages
      • iuc
      • bioconda
      • defaults
      • conda-forge
    • Galaxy now automatically installs Conda when first launched and will use Bioconda and other channels for package resolution
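As a rough illustration of what resolving a pinned requirement against these channels looks like, the Python sketch below builds a `conda create` command line with the channels in the order listed on the slide. The environment name and the helper function are hypothetical; this is not how Galaxy itself invokes Conda, only an example using standard `conda create` options.

```python
# Channel order as listed on the slide.
CHANNELS = ["iuc", "bioconda", "defaults", "conda-forge"]


def conda_create_command(env_name: str, package: str, version: str,
                         channels=CHANNELS) -> list:
    """Build an argv that creates an environment holding one pinned package."""
    cmd = ["conda", "create", "--yes", "--name", env_name]
    for channel in channels:
        # Channels passed on the command line are searched in the order given.
        cmd += ["--channel", channel]
    cmd.append(f"{package}={version}")
    return cmd


# "seqtk-1.2" is a hypothetical environment name chosen for this example.
print(" ".join(conda_create_command("seqtk-1.2", "seqtk", "1.2")))
```

One environment per package and version is what makes it easy to keep several versions of the same software installed side by side.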
  17. Tool Dependencies 2018
    • Conda and Bioconda are the best practice for tool dependency management in Galaxy
    • All tools in the “devteam” and “iuc” repositories now use requirement specifications that can be resolved by Conda
    • ToolShed packages still supported, but deprecated
    • Result: completely automatic installation of all the software needed to run a Galaxy workflow
  18. Why we like containers
    • Isolation: can ensure that software is running in a known (and minimal) environment, limit side effects
    • Better reproducibility
    • Packaging and distribution, leverage existing ecosystem for deploying and running software
    • Security? Would be nice to be able to count on that…
  19. Galaxy and Containers
    • We can run Galaxy itself in a container. Great. Super easy distribution.
    • Next level: run tools in containers, so that every step of a workflow runs in an isolated environment
  20. Configuring a Galaxy instance to use Docker
    Configure the destination. For instance, transform the cluster destination:
        <destination id="short_fast" runner="slurm">
            <param id="nativeSpecification">--time=00:05:00</param>
        </destination>
    as follows:
        <destination id="short_fast" runner="slurm">
            <param id="nativeSpecification">--time=00:05:00</param>
            <param id="docker_enabled">true</param>
            <param id="docker_sudo">false</param>
        </destination>
    But, how do we find the right container for a tool?
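The transformation above can also be expressed programmatically. The Python sketch below applies it to the destination element using only the standard library; the `enable_docker` helper is hypothetical and is shown only to make the before/after difference explicit, not as part of Galaxy.

```python
import xml.etree.ElementTree as ET

# The cluster destination from the slide, before Docker is enabled.
DESTINATION = """
<destination id="short_fast" runner="slurm">
    <param id="nativeSpecification">--time=00:05:00</param>
</destination>
"""


def enable_docker(destination_xml: str) -> ET.Element:
    """Return the destination element with docker_enabled/docker_sudo set."""
    dest = ET.fromstring(destination_xml)
    for param_id, value in (("docker_enabled", "true"), ("docker_sudo", "false")):
        # Reuse an existing param if present so the function is idempotent.
        param = dest.find(f"param[@id='{param_id}']")
        if param is None:
            param = ET.SubElement(dest, "param", id=param_id)
        param.text = value
    return dest


print(ET.tostring(enable_docker(DESTINATION), encoding="unicode"))
```

Note that the runner stays the same (`slurm`): only the destination's parameters change.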
  21. Automatic container resolution
    Galaxy will automatically find or build containers for best practice tools. Let’s lint that tool config with Planemo:
        $ planemo lint --biocontainers seqtk_seq.xml
        ...
        Applying linter biocontainer_registered... CHECK
        .. INFO: BioContainer best-practice container found [quay.io/biocontainers/seqtk:1.2--0].
  22. Bioconda + Containers
    Given a set of packages and versions in Conda/Bioconda, we can build a container with just that software on a minimal base image
    If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions)
    With automation, these containers can be built automatically for every package with no manual modification or intervention (e.g. mulled)
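A small Python sketch of the naming idea behind this. `single_package_image` follows the pattern visible in the Planemo lint output on the previous slide (`quay.io/biocontainers/seqtk:1.2--0`). `environment_key` is a hypothetical scheme, not mulled's actual algorithm, and only illustrates that the same set of pinned packages can always map to the same identifier.

```python
import hashlib


def single_package_image(name: str, version: str, build: str) -> str:
    """Image reference in the pattern seen in the lint output (name:version--build)."""
    return f"quay.io/biocontainers/{name}:{version}--{build}"


def environment_key(packages: dict) -> str:
    """Hypothetical order-independent key for a set of pinned packages."""
    # Sorting makes the key independent of the order the packages were listed in.
    canonical = ",".join(f"{name}={version}" for name, version in sorted(packages.items()))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


print(single_package_image("seqtk", "1.2", "0"))
print(environment_key({"samtools": "1.3.1", "bwa": "0.7.15"}))
```

A deterministic identifier is what allows a container to be looked up or rebuilt without any manual bookkeeping.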
  23. Travis CI pulls recipes and builds in minimal Docker container
    Successful builds from main repo uploaded to Anaconda to be installed anywhere
    Same binary from Bioconda installed into minimal container for each provider: rkt, Singularity
  24. Bioconda + Containers + Virtualization
    If we run our containers inside a specific (ideally minimal) known VM, we can control the kernel environment as well
    [Logos: Atmosphere; funded by the National Science Foundation]
  25. Increasingly precise environment control
    • Tool and dependency binaries, built in minimal environment with controlled libs
    • Container defines minimum environment
    • Virtual machine controls kernel and apparent hardware environment (KVM, Xen, …)
  26. Galaxy Cloud
    • CloudMan: General purpose deployment manager for any cloud. Cluster and service management, auto-scaling
    • Cloudbridge: New abstraction library for working with multiple cloud APIs
    • Genomics Virtual Lab: CloudMan + Galaxy + many other common bioinformatics tools and frameworks
  27. Not just for Galaxy
    • Docker requirement, tightly coupled
    • Software requirement, can be resolved in an environment-specific way
    • Implemented in “galaxy-lib”; integrated in CWL reference implementation, …
  28. Galaxy Containerization
    • Galaxy can run all jobs in containers
    • Uses existing job runners as long as the target supports the container engine (Docker or Singularity)
    • Resolve containers using existing requirement tags, allows flexibility in how dependencies are resolved in different environments
    • Doesn’t matter if Galaxy itself is running in a container or not.
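A minimal Python sketch of why existing job runners can be reused: containerizing a job amounts to wrapping its command line, which the runner then submits as usual. The `containerize` function is hypothetical, assumes the image provides `sh`, and uses only standard `docker run` and `singularity exec` options; Galaxy's real command construction handles far more (volumes, users, environment).

```python
import shlex


def containerize(command: str, image: str, engine: str, workdir: str) -> list:
    """Build an argv that runs `command` inside `image` with the job directory mounted."""
    if engine == "docker":
        return ["docker", "run", "--rm",
                "-v", f"{workdir}:{workdir}", "-w", workdir,
                image, "sh", "-c", command]
    if engine == "singularity":
        # Singularity can run Docker images via the docker:// URI scheme.
        return ["singularity", "exec",
                "-B", f"{workdir}:{workdir}", "--pwd", workdir,
                f"docker://{image}", "sh", "-c", command]
    raise ValueError(f"unsupported container engine: {engine}")


argv = containerize("seqtk seq -r in.fq > out.fq",
                    "quay.io/biocontainers/seqtk:1.2--0", "docker", "/jobs/42")
print(shlex.join(argv))
```

The job directory `/jobs/42` is a made-up path; the same wrapped command could be handed to a local runner or to a cluster scheduler.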
  29. Anatomy of a Galaxy instance (“out of the box”)
    • Web Server (Paste or uwsgi)
    • Database (sqlite)
    • Job Runner (local)
    • Object Store (local files)
  30. Anatomy of a Galaxy instance (“production”)
    • Web Server (uwsgi)
    • Database (postgres…)
    • Job Runner(s)
    • Object Store
    • Proxy Server (nginx)
  31. Anatomy of a Galaxy instance: Running Jobs
    • Web Server (uwsgi)
    • Database (postgres…)
    • Job Runner(s)
    • Object Store
    • Proxy Server (nginx)
    Any number of Job Runners: Slurm, PBS, Grid Engine, Condor, Kubernetes…
  32. Anatomy of a Galaxy instance: Storing Data
    • Web Server (uwsgi)
    • Database (postgres…)
    • Job Runner(s)
    • Object Store
    • Proxy Server (nginx)
    Any number of object stores: Files, S3, Azure, iRods, …
    Hierarchical, distributed…
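One way to picture a hierarchical object store is as an ordered list of backends that are consulted in turn. The Python sketch below is purely illustrative: the class names and methods are hypothetical and are not Galaxy's object store API, and an in-memory dictionary stands in for real backends such as files or S3.

```python
class DictStore:
    """Hypothetical in-memory backend standing in for files, S3, and so on."""

    def __init__(self):
        self._data = {}

    def exists(self, key):
        return key in self._data

    def get(self, key):
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value


class HierarchicalStore:
    """Read from the first backend that has the key; write to the first backend."""

    def __init__(self, backends):
        self.backends = list(backends)

    def get(self, key):
        for backend in self.backends:
            if backend.exists(key):
                return backend.get(key)
        raise KeyError(key)

    def put(self, key, value):
        self.backends[0].put(key, value)


fast, archive = DictStore(), DictStore()
archive.put("dataset_1", b"old data")
store = HierarchicalStore([fast, archive])
store.put("dataset_2", b"new data")
print(store.get("dataset_1"), store.get("dataset_2"))
```

Callers only see `get` and `put`, so where a dataset physically lives can change without the rest of the system noticing.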
  33. Anatomy of a Galaxy instance: More services…
    • Web Server (uwsgi)
    • Database (postgres…)
    • Job Runner(s)
    • Object Store
    • Proxy Server (nginx)
    • Messaging (RabbitMQ)
    • File Transfers (ProFTPd)
  34. It can get pretty complicated (e.g. usegalaxy.org)
    [Diagram of the compute resources behind usegalaxy.org]
    Dedicated resources (TACC, Austin):
    • Galaxy Cluster (Rodeo): 256 cores, 2 TB memory
    • Corral/Stockyard: 20 PB disk
    Shared XSEDE resources:
    • Stampede: 462,462 cores, 205 TB memory
    • Blacklight, Bridges (PSC, Pittsburgh)
    • PTI, IU Bloomington (funded by the National Science Foundation Award #ACI-1445604)
  35. Main Compute Architecture
    [Diagram; components shown: VMWare hosts web-01, web-02, web-03, web-04 and db-01, with slurm and rabbitmq; dedicated cluster roundup49 ... roundup64; Jetstream (TACC) and Jetstream (IU), each with slurm/pulsar and instances; Swarm @ JS with instances; pulsar on Stampede (TACC) and Bridges (PSC); Corral (2 PB dataset storage). Connection types: Pulsar/AMQP, Pulsar/HTTP, Slurm, PostgreSQL, NFS, Swarm]
  36. Stats for Galaxy Main (usegalaxy.org) in May 2018
    • 125,000 registered users
    • 2 PB user data
    • 19M jobs run
    • 100 training events (2017 & 2018)
  37. Orchestrating all this is hard
    • Lots of pieces involved in maintaining a Galaxy instance
    • Galaxy is deployed in a wide variety of environments, from appliances to institutional HPC to all different sorts of clouds
    • In recent years we’ve introduced lots of automation to make things easier, primarily through Ansible.
  38. Galaxy + Kubernetes Stack
    • Kubernetes: “an open-source system for automating deployment, scaling, and management of containerized applications.”
    • Helm: “The package manager for Kubernetes.”
    • Rancher: “Enterprise management for Kubernetes. Every distro. Every cluster. Every cloud.”