Containers and Workflows in Galaxy

Presented at the Rocky Mountain Genomics HackCon workshop on Containers and Workflows in Genomics #rmghc18.


James Taylor

June 18, 2018

Transcript

  1. Containers and workflows in Galaxy. @jxtx / #usegalaxy https://speakerdeck.com/jxtx

  2. Acknowledgements Team: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier,

    Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy Goecks, Björn Grüning, Sam Guerler, Jennifer Hillman-Jackson, Anton Nekrutenko, Helena Rasche, Nicola Soranzo, Marius van den Beek. +CloudLaunch+Kubernetes: Nuwan Goonasekera, Pablo Moreno. Other lab members: Boris Brenerman, Min Hyung Cho, Peter DeFord, Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy. Collaborators:
 Craig Stewart and the group
 Ross Hardison and the VISION group
 Victor Corces (Emory), Karen Reddy (JHU)
 Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology)
 Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics) NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620)
 NIDDK (DK065806) and NSF (DBI 0543285, DBI 0850103) funded by the National Science Foundation Award #ACI-1445604
  3. What is Galaxy?

  4. Galaxy is a web-based analysis environment for accessible, transparent, and

    reproducible scientific research. Initially built for genomics, but intended to support any compute- and data-intensive discipline. Provided both as a free public SaaS application (usegalaxy.org) and as open-source software
  5. None
  6. Tools and workflows in Galaxy (Today we’re going to focus

    on tools and workflows, but Galaxy is much more, including visualization, interactive environments, and sharing and publishing of integrated analyses…)
  7. The core unit of analysis in Galaxy is the Tool

    • Defined in terms of an abstract interface (inputs and outputs)
    • In practice, mostly command-line tools: a declarative XML description of the interface and of how to generate a command line
    • Designed to be as easy as possible for tool authors, while still allowing rigorous reasoning
  8. None
  9. User interface generated from abstract parameter description

  10. Template for generating a command line from parameter values

  11. Functional tests to be run with the “full stack” in

    place
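A complete tool description is compact. The following is a minimal sketch modeled on the seqtk example that appears later in this deck (the id, labels, and test file names are illustrative): it shows the requirement metadata, the command template, the abstract input/output interface, and a functional test together in one file.

```xml
<tool id="seqtk_seq" name="Convert to FASTA (seqtk)" version="0.1.0">
    <!-- Requirement metadata: resolved later via Conda, containers, ... -->
    <requirements>
        <requirement type="package" version="1.2">seqtk</requirement>
    </requirements>
    <!-- Template for generating the command line from parameter values -->
    <command detect_errors="exit_code"><![CDATA[
        seqtk seq -A '$input1' > '$output1'
    ]]></command>
    <!-- Abstract interface: the UI form is generated from these -->
    <inputs>
        <param name="input1" type="data" format="fastq" label="Input FASTQ"/>
    </inputs>
    <outputs>
        <data name="output1" format="fasta"/>
    </outputs>
    <!-- Functional test, run with the "full stack" in place -->
    <tests>
        <test>
            <param name="input1" value="2.fastq"/>
            <output name="output1" file="2.fasta"/>
        </test>
    </tests>
</tool>
```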
  12. Complex interfaces can be described

  13. Repeating groups of parameters

  14. Conditional groups, grouping constructs can be nested
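As a sketch of these grouping constructs (the parameter names here are illustrative, not from the deck), a `<conditional>` switches between parameter sets and can nest a `<repeat>`:

```xml
<inputs>
    <conditional name="library">
        <param name="type" type="select" label="Library type">
            <option value="single">Single-end</option>
            <option value="paired">Paired-end</option>
        </param>
        <when value="single">
            <param name="reads" type="data" format="fastq" label="Reads"/>
        </when>
        <when value="paired">
            <!-- Repeating group nested inside a conditional branch -->
            <repeat name="pairs" title="Read pair">
                <param name="forward" type="data" format="fastq" label="Forward reads"/>
                <param name="reverse" type="data" format="fastq" label="Reverse reads"/>
            </repeat>
        </when>
    </conditional>
</inputs>
```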

  15. Template language for building complex command lines

  16. Or additional configuration files, scripts, ...

  17. Workflows combine tools into more complex analyses

  18. Simple data flows: the output of bwa mapping (a SAM file) flows

    into MACS for peak calling
  19. Parallelized data flows: input is a “dataset collection” with any

    number of paired-read datasets
  20. Parallelized data flows: HISAT2 alignments are run in parallel across

    each element of the collection
  21. Parallelized data flows: StringTie transcript assembly is also run in parallel

    across all datasets
  22. Parallelized data flows: transcripts from all datasets are combined to produce

    a consensus assembly
  23. Galaxy Workflows: Summary

    • A directed graph capturing data flow relationships between a set of tools (the steps of the workflow)
    • With some extras:
      • Map and reduce data flows enabled by dataset collections
      • Sub-workflows
      • Pause points, decision points, re-planning
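The map/reduce shape of these collection-based workflows can be sketched in a few lines of Python (a toy illustration, not Galaxy's implementation; the step names stand in for tools like HISAT2 and StringTie):

```python
# Toy sketch of a map/reduce data flow over a dataset collection:
# per-dataset steps run once per collection element ("map"), and a
# final step consumes all per-dataset results at once ("reduce").

def run_workflow(collection):
    """collection: list of dataset names; returns the reduced result."""
    # Map: alignment and assembly run independently for each dataset
    aligned = [f"aligned({ds})" for ds in collection]
    assembled = [f"assembled({a})" for a in aligned]
    # Reduce: combine per-dataset transcripts into one consensus assembly
    return "consensus(" + ", ".join(assembled) + ")"
```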
  24. Tooling to develop tools and workflows: “a scientific workflow SDK”

    Planemo: the way to develop Galaxy tools. Linting, testing, … Supports every aspect of the tool development lifecycle
  25. Run a workflow in a dynamically created or existing Galaxy

    instance
  26. Build / edit your workflows in a text editor

  27. Testing workflows

  28. Realizing tool and workflow execution

  29. From “tools” to “dependencies”

    • Tools and workflows describe the interface to the underlying software
    • Given valid data and parameters, we can realize this as an ordered graph of command lines to execute
    • But we still need to ensure that the appropriate software is available
  30. History of Galaxy tool dependencies…

    • In the beginning, just install everything on the PATH Galaxy is using. I know. Look, it was 2005…
    • Biggest problem: versioning
    • We soon had workflows where different steps required different versions of some underlying software (hello samtools…)
    • For reproducibility, we wanted to be able to run workflows with older versions of the underlying software
  31. Requirement metadata

    <requirements>
        <requirement type="package" version="1.2">seqtk</requirement>
    </requirements>

  32. History of Galaxy tool dependencies… • Take 2: Dependency Resolvers

    • Allow a command line to be augmented based on a tool’s requirements (with a plugin interface, of course)
    • The default implementation looks for a directory based on the tool name/version and runs a shell script “env.sh” which adds to the environment
    • Alternative implementations for modules, brew, … soon followed
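The default resolver described above can be sketched as follows (a simplified illustration; the function names are mine, not Galaxy's):

```python
# Simplified sketch of a directory-based dependency resolver: look up
# <base_path>/<name>/<version>/env.sh for a tool requirement and, if
# found, prefix the command line so the script is sourced first.
import os

def resolve(base_path, name, version):
    """Return the env.sh path for a requirement, or None if unresolved."""
    env_sh = os.path.join(base_path, name, version, "env.sh")
    return env_sh if os.path.exists(env_sh) else None

def augment_command(command, env_sh):
    """Augment a tool command line to source the dependency's env.sh."""
    return f". {env_sh} && {command}" if env_sh else command
```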
  33. None
  34. History of Galaxy tool dependencies… • Take 3: The Galaxy

    ToolShed enters the scene
    • Uses a similar structure, separating dependencies/versions into different directories
    • But includes installation recipes, so that the Galaxy maintainer no longer needs to install each tool manually
    • Made sense at the time, but packaging is hard, and it was basically a nightmare
  35. History of Galaxy tool dependencies… • Take 4: The packaging

    ecosystem is getting better; community-developed projects are realizing the importance of OS-independent, versioned package management, so let’s join in!
  36. https://bioconda.github.io

  37. It is now reasonable to support one major server platform

    — Linux (this is great for portability and reproducibility, but scary for other reasons — monoculture leads to fragility)
  38. Builds on Conda packaging system, designed “for installing multiple versions

    of software packages and their dependencies and switching easily between them.” More than 4,000 recipes for software packages. All packages are automatically built in a minimal environment to ensure isolation and portability
  39. Submit recipe to GitHub Travis CI pulls recipes and builds

    in a minimal Docker container. Successful binary builds from the main repo are uploaded to Anaconda to be installed anywhere
  40. Conda: Key Features for Galaxy • No compilation at install

    time: binaries with their dependencies, libraries...
    • Support for all operating systems Galaxy targets
    • Easy to manage multiple versions of the same recipe
    • HPC-ready: no root privileges needed
    • Easy-to-write YAML recipes
    • Community: not restricted to Galaxy
  41. Best practice channels • Conda channels searched by Galaxy for

    packages: iuc, bioconda, defaults, conda-forge
    • Galaxy now automatically installs Conda when first launched and will use Bioconda and other channels for package resolution
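Expressed as a conda `.condarc` (one way to write down the search order the slide lists; Galaxy itself configures channels programmatically):

```yaml
# Channel search order from the slide; highest priority first.
channels:
  - iuc
  - bioconda
  - defaults
  - conda-forge
```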
  42. Installing tools with Conda

  43. Managing Tool Dependencies

  44. Managing Tool Dependencies

  45. Tool Dependencies 2018 • Conda and Bioconda are the best

    practice for tool dependency management in Galaxy
    • All tools in the “devteam” and “iuc” repositories now use requirement specifications that can be resolved by Conda
    • ToolShed packages still supported, but deprecated
    • Result: completely automatic installation of all the software needed to run a Galaxy workflow
  46. But what about containerization?

  47. Why we like containers • Isolation: can ensure that software

    is running in a known (and minimal) environment, limit side effects
    • Better reproducibility
    • Packaging and distribution: leverage existing ecosystem for deploying and running software
    • Security? Would be nice to be able to count on that…
  48. 2015: Galaxy in Docker

  49. Galaxy and Containers • We can run Galaxy itself in

    a container. Great. Super easy distribution. • Next level: run tools in containers, so that every step of a workflow runs in an isolated environment
  50. Configuring a Galaxy instance to use Docker

    Configure the destination. For instance, transform the cluster destination:

    <destination id="short_fast" runner="slurm">
        <param id="nativeSpecification">--time=00:05:00</param>
    </destination>

    as follows:

    <destination id="short_fast" runner="slurm">
        <param id="nativeSpecification">--time=00:05:00</param>
        <param id="docker_enabled">true</param>
        <param id="docker_sudo">false</param>
    </destination>

    But, how do we find the right container for a tool?
  51. Remember that requirement metadata?

    <requirements>
        <requirement type="package" version="1.2">seqtk</requirement>
    </requirements>
  52. Automatic container resolution

    Galaxy will automatically find or build containers for best practice tools. Let’s lint that tool config with Planemo:

    $ planemo lint --biocontainers seqtk_seq.xml
    ...
    Applying linter biocontainer_registered... CHECK
    .. INFO: BioContainer best-practice container found [quay.io/biocontainers/seqtk:1.2--0].
  53. BioContainers

  54. Bioconda + Containers Given a set of packages and versions

    in Conda/Bioconda, we can build a container with just that software on a minimal base image. If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions). With automation, these containers can be built for every package with no manual modification or intervention (e.g. mulled)
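The idea can be sketched as a hypothetical Dockerfile: pin one Bioconda package on a fixed base image, so rebuilding from the archived binaries reproduces the same container. (The actual automated mulled builds use a more minimal base and copy in only the resolved environment.)

```dockerfile
# Hypothetical sketch, not the actual BioContainers build recipe.
# Pin the base image and the package version for reproducibility.
FROM continuumio/miniconda3:4.5.4
RUN conda install --yes -c conda-forge -c bioconda seqtk=1.2
```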
  55. Travis CI pulls recipes and builds in minimal docker container

    Successful builds from the main repo uploaded to Anaconda to be installed anywhere. The same binary from Bioconda installed into a minimal container for each provider: Docker, rkt, Singularity
  56. Bioconda + Containers + Virtualization If we run our containers

    inside a specific (ideally minimal) known VM, we can control the kernel environment as well (e.g. Atmosphere, funded by the National Science Foundation)
  57. Tool and dependency binaries, built in minimal environment with controlled

    libs. Container defines minimum environment. Virtual machine controls kernel and apparent hardware environment (KVM, Xen, …). Increasingly precise environment control
  58. CloudMan: General purpose deployment manager for any cloud. Cluster and

    service management, auto-scaling. Cloudbridge: new abstraction library for working with multiple cloud APIs. Genomics Virtual Lab: CloudMan + Galaxy + many other common bioinformatics tools and frameworks. Galaxy Cloud
  59. What about multiple dependencies? Generate containers based on a reproducible

    hash of package name and version
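For illustration, a reproducible hash over a package set might look like this (a toy version of the idea; mulled's actual naming scheme differs):

```python
# Toy sketch of hash-addressed multi-package containers: a canonical,
# sorted listing of name=version pairs is hashed, so the same package
# set always maps to the same container name.
import hashlib

def container_hash(packages):
    """packages: dict of {name: version}; returns a reproducible digest."""
    canonical = ",".join(f"{name}={version}" for name, version in sorted(packages.items()))
    return hashlib.sha1(canonical.encode()).hexdigest()[:12]
```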
  60. Not just for Galaxy

  61. Not just for Galaxy Docker requirement, tightly coupled

  62. Not just for Galaxy Docker requirement, tightly coupled Software requirement,

    can be resolved in an environment specific way
  63. Not just for Galaxy Docker requirement, tightly coupled Software requirement,

    can be resolved in an environment specific way Implemented in “galaxy-lib” — integrated in CWL reference implementation, …
  64. Galaxy Containerization • Galaxy can run all jobs in containers

    • Uses existing job runners as long as the target supports the container engine (Docker or Singularity)
    • Resolves containers using existing requirement tags, allowing flexibility in how dependencies are resolved in different environments
    • Doesn’t matter if Galaxy itself is running in a container or not
  65. Anatomy of a Galaxy instance

  66. Anatomy of a Galaxy instance (“out of the box”) Web

    Server (Paste or uwsgi) Database (sqlite) Job Runner (local) Object Store (local files)
  67. Anatomy of a Galaxy instance (“production”) Web Server (uwsgi) Database

    (postgres…) Job Runner(s) Object Store Proxy Server (nginx)
  68. Anatomy of a Galaxy instance: Running Jobs Web Server (uwsgi)

    Database (postgres…) Job Runner(s) Object Store Proxy Server (nginx) Any number of Job Runners: Slurm, PBS, Grid Engine, Condor, Kubernetes…
  69. Anatomy of a Galaxy instance: Storing Data Web Server (uwsgi)

    Database (postgres…) Job Runner(s) Object Store Proxy Server (nginx) Any number of object stores: Files, S3, Azure, iRods, … Hierarchical, distributed…
  70. Anatomy of a Galaxy instance: More services… Web Server (uwsgi)

    Database (postgres…) Job Runner(s) Object Store Proxy Server (nginx) Messaging (RabbitMQ) File Transfers (ProFTPd)
  71. It can get pretty complicated (e.g. usegalaxy.org)

    [Architecture diagram: dedicated resources and shared XSEDE resources. PSC, Pittsburgh: Blacklight, Bridges. TACC, Austin: Stampede (462,462 cores, 205 TB memory), Galaxy Cluster “Rodeo” (256 cores, 2 TB memory), Corral/Stockyard (20 PB disk). PTI, IU Bloomington. Funded by the National Science Foundation Award #ACI-1445604]
  72. Main Compute Architecture

    [Architecture diagram: web-01/web-02, db-01, Slurm, and RabbitMQ on VMWare and Jetstream (TACC); web-03/web-04 and Pulsar; Slurm/Pulsar instances on Stampede (TACC), Bridges (PSC), and Jetstream (IU); connected via Pulsar/AMQP, Pulsar/HTTP, Slurm, and PostgreSQL; NFS to Corral (2 PB dataset storage); dedicated cluster roundup49 ... roundup64; Swarm @ JS]
  73. Stats for Galaxy Main (usegalaxy.org) in May 2018

    125,000 registered users • 2 PB user data • 19M jobs run • 100 training events (2017 & 2018)
  74. Orchestrating all this is hard • Lots of pieces involved

    in maintaining a Galaxy instance
    • Galaxy is deployed in a wide variety of environments, from appliances to institutional HPC to all different sorts of clouds
    • In recent years we’ve introduced lots of automation to make things easier, primarily through Ansible
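A minimal playbook using the community-maintained galaxyproject.galaxy Ansible role might look like this (the hosts group and variable values are illustrative):

```yaml
# Hypothetical minimal playbook; values are illustrative.
- hosts: galaxyservers
  become: true
  vars:
    galaxy_root: /srv/galaxy
  roles:
    - galaxyproject.galaxy
```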
  75. Containers all the way down

  76. “Production-Grade Container Orchestration”

  77. Galaxy + Kubernetes Stack • Kubernetes: “an open-source system for

    automating deployment, scaling, and management of containerized applications.”
    • Helm: “The package manager for Kubernetes.”
    • Rancher: “Enterprise management for Kubernetes. Every distro. Every cluster. Every cloud.”
  78. Demo? http://launch.usegalaxy.org
