Containers and Workflows in Galaxy

Presented at the Rocky Mountain Genomics HackCon workshop on Containers and Workflows in Genomics #rmghc18.

James Taylor

June 18, 2018

Transcript

  1. Containers and workflows in Galaxy
    @jxtx / #usegalaxy
    https://speakerdeck.com/jxtx

  2. Acknowledgements
    Team: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave
    Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy
    Goecks, Björn Grüning, Sam Guerler, Jennifer Hillman-Jackson, Anton
    Nekrutenko, Helena Rasche, Nicola Soranzo, Marius van den Beek
    +CloudLaunch+Kubernetes: Nuwan Goonasekera, Pablo Moreno
    Other lab members : Boris Brenerman, Min Hyung Cho, Peter DeFord,
    Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy
    Collaborators:

    Craig Stewart and the group

    Ross Hardison and the VISION group

    Victor Corces (Emory), Karen Reddy (JHU)

    Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology)

    Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
    NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620)

    NIDDK (DK065806) and NSF (DBI 0543285, DBI 0850103)
    funded by the National Science Foundation
    Award #ACI-1445604

  3. What is Galaxy?

  4. Galaxy is a web-based analysis environment for
    accessible, transparent, and reproducible
    scientific research
    Initially built for genomics, but intended to support
    any compute and data intensive discipline
    Provided both as a free public SaaS application
    (usegalaxy.org), and open-source software

  5. Tools and workflows in Galaxy
    (Today we’re going to focus on tools and workflows, but Galaxy
    is much more, including visualization, interactive environments,
    sharing and publishing integrated analyses…)

  6. The core unit of analysis in Galaxy is the Tool
    • Defined in terms of an abstract
    interface (inputs and outputs)
    • In practice, mostly
    command line tools, with a
    declarative XML description
    of the interface and of how to
    generate a command line
    • Designed to be as easy as
    possible for tool authors, while
    still allowing rigorous
    reasoning

  7. User interface generated from abstract parameter description

  8. Template for generating
    command line from
    parameter values
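
As a rough illustration of the idea on this slide, the Python sketch below fills a command line template from parameter values. It is not Galaxy's actual mechanism (Galaxy tool XML uses Cheetah templates); the template text, the parameter names, and the `render_command` helper are all hypothetical.

```python
import shlex
from string import Template

# Hypothetical command template; $name placeholders stand for tool parameters.
COMMAND_TEMPLATE = Template("seqtk seq -q $min_quality $input_file > $output_file")


def render_command(template: Template, params: dict) -> str:
    """Substitute shell-quoted parameter values into the template."""
    quoted = {name: shlex.quote(str(value)) for name, value in params.items()}
    # substitute() raises KeyError if a placeholder has no value.
    return template.substitute(quoted)


command = render_command(
    COMMAND_TEMPLATE,
    {"min_quality": 20, "input_file": "reads 1.fastq", "output_file": "out.fastq"},
)
print(command)
```

Quoting each value before substitution keeps a file name with spaces from splitting into several shell arguments.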

  9. Functional tests to be run
    with the “full stack” in place

  10. Complex interfaces can be described

  11. Repeating groups of parameters

  12. Conditional groups, grouping constructs can be nested

  13. Template language for building complex command lines

  14. Or additional configuration files, scripts, ...

  15. Workflows combine tools into more complex analyses

  16. Simple data flows
    Output of bwa mapping (a SAM file)…flows into MACS for peak calling

  17. Parallelized data flows
    Input is a “dataset collection” with any number of paired-read datasets

  18. Parallelized data flows
    HiSat2 alignments are run in parallel across each collection

  19. Parallelized data flows
    StringTie transcript assembly is also run in parallel across all datasets

  20. Parallelized data flows
    Transcripts from all datasets are combined to produce a consensus assembly
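
The map-then-reduce shape of this data flow can be sketched in Python as below. The functions `align`, `assemble`, and `merge` are hypothetical stand-ins for the alignment, per-dataset assembly, and consensus steps; they only build strings, and the file names are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def align(pair):
    """Stand-in for a per-dataset alignment step on one read pair."""
    forward, reverse = pair
    return f"aligned({forward},{reverse})"


def assemble(alignment):
    """Stand-in for a per-dataset transcript assembly step."""
    return f"transcripts({alignment})"


def merge(assemblies):
    """Reduce step: one consensus output from the whole collection."""
    return "consensus[" + ";".join(assemblies) + "]"


# A "collection" with any number of paired-read datasets.
collection = [("s1_R1.fq", "s1_R2.fq"), ("s2_R1.fq", "s2_R2.fq")]

with ThreadPoolExecutor() as pool:
    # Each map step runs independently per element; map() preserves order.
    alignments = list(pool.map(align, collection))
    assemblies = list(pool.map(assemble, alignments))

consensus = merge(assemblies)
print(consensus)
```

The workflow itself never mentions how many datasets there are; the collection size alone decides how many parallel jobs each map step produces.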

  21. Galaxy Workflows: Summary
    • A directed graph capturing data flow relationships
    between a set of tools (the steps of the workflow)
    • With some extras
    • Map and reduce data flows enabled by dataset
    collections
    • Sub-workflows
    • Pause points, decision points, re-planning

  22. Tooling to develop tools and workflows:
    “A scientific workflow SDK”
    The way to develop Galaxy tools
    Linting, testing, … Support every aspect of
    the tool development lifecycle

  23. Run a workflow in a dynamically created or existing
    Galaxy template

  24. Build / edit your workflows in a text editor

  25. Testing workflows

  26. Realizing tool and workflow execution

  27. From “tools” to “dependencies”
    • Tools and workflows describe the interface to the
    underlying software
    • Given valid data and parameters, we can realize
    this to an ordered graph of command lines to
    execute
    • But, we still need to ensure that the appropriate
    software is available

  28. History of Galaxy tool dependencies…
    • In the beginning, just install everything on the
    PATH Galaxy is using. I know. Look, it was 2005…
    • Biggest problem: versioning.
    • We soon had workflows where different steps
    required different versions of some
    underlying software (hello samtools…)
    • For reproducibility, we wanted to be able to
    run workflows with older versions of the
    underlying software

  29. Requirement metadata
    seqtk
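
A sketch of how such requirement metadata could be read, assuming the usual Galaxy form of a `<requirements>` block containing `<requirement type="package" version="...">` elements. The surrounding tool XML, the tool id, and the version `1.2` are assumptions made for illustration; only the package name `seqtk` comes from the slide.

```python
import xml.etree.ElementTree as ET

# Assumed tool XML fragment; the id, name, and versions are illustrative.
TOOL_XML = """
<tool id="seqtk_seq" name="seqtk seq" version="0.1.0">
    <requirements>
        <requirement type="package" version="1.2">seqtk</requirement>
    </requirements>
</tool>
"""


def read_requirements(tool_xml: str):
    """Return (name, version, type) for each declared requirement."""
    root = ET.fromstring(tool_xml)
    return [
        (req.text.strip(), req.get("version"), req.get("type"))
        for req in root.iterfind("./requirements/requirement")
    ]


requirements = read_requirements(TOOL_XML)
print(requirements)
```

The tool states only what it needs, by name and version; how that need is satisfied is left to whatever resolves the requirement.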


  30. History of Galaxy tool dependencies…
    • Take 2: Dependency Resolvers
    • Allow a command line to be augmented based
    on a tool’s requirements (with a plugin interface,
    of course)
    • Default implementation looks for a directory
    based on the tool name/version and runs a shell
    script “env.sh” which adds to the environment
    • Alternative implementations for modules, brew,
    … soon followed
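
The default behaviour described above can be sketched roughly as follows, assuming a directory layout of `<base>/<name>/<version>/env.sh`. The function names, the layout details, and the demo paths are hypothetical and are not Galaxy's resolver code.

```python
import shlex
import tempfile
from pathlib import Path
from typing import Optional


def resolve_env_script(base_dir: Path, name: str, version: str) -> Optional[Path]:
    """Return <base_dir>/<name>/<version>/env.sh if it exists, else None."""
    script = base_dir / name / version / "env.sh"
    return script if script.is_file() else None


def augment_command(command: str, env_script: Optional[Path]) -> str:
    """Source env.sh before the command when the dependency was resolved."""
    if env_script is None:
        return command
    return f". {shlex.quote(str(env_script))}; {command}"


# Demo with a throwaway directory standing in for the dependency root.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "seqtk" / "1.2").mkdir(parents=True)
    (base / "seqtk" / "1.2" / "env.sh").write_text("export PATH=/opt/seqtk/1.2/bin:$PATH\n")
    resolved = augment_command("seqtk seq in.fq", resolve_env_script(base, "seqtk", "1.2"))
    unresolved = augment_command("seqtk seq in.fq", resolve_env_script(base, "seqtk", "9.9"))

print(resolved)
print(unresolved)
```

Because each version lives in its own directory, two steps of one workflow can source different `env.sh` files and so run different versions of the same software.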

  31. History of Galaxy tool dependencies…
    • Take 3: The Galaxy ToolShed enters the scene
    • Uses a similar structure, separating
    dependencies/versions into different directories
    • But, includes installation recipes so that the
    Galaxy maintainer no longer needs to install
    each tool manually
    • Made sense at the time, but packaging is hard,
    and this was basically a nightmare

  32. History of Galaxy tool dependencies…
    • Take 4: The packaging ecosystem is getting better,
    community-developed projects are realizing the
    importance of OS-independent versioned package
    management, let’s join in!

  33. https://bioconda.github.io

  34. It is now reasonable to support one major
    server platform — Linux
    (this is great for portability and reproducibility, but scary
    for other reasons — monoculture leads to fragility)

  35. Builds on Conda packaging system, designed
    “for installing multiple versions of software
    packages and their dependencies and
    switching easily between them”
    More than 4000 recipes for software packages
    All packages are automatically built in a
    minimal environment to ensure isolation and
    portability

  36. Submit recipe to GitHub
    Travis CI pulls recipes and builds
    in minimal docker container
    Successful binary builds from
    main repo uploaded to Anaconda
    to be installed anywhere

  37. Conda: Key Features for Galaxy
    • No compilation at install time - binaries with their
    dependencies, libraries...
    • Support for all operating systems Galaxy targets
    • Easy to manage multiple versions of the same
    recipe
    • HPC-ready: no root privileges needed
    • Easy-to-write YAML recipes
    • Community - not restricted to Galaxy

  38. Best practice channels
    • Conda channels searched by Galaxy for packages
    • iuc
    • bioconda
    • defaults
    • conda-forge
    • Galaxy now automatically installs Conda when first
    launched and will use Bioconda and other
    channels for package resolution
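
As an illustration of resolving one requirement against those channels, the sketch below assembles a `conda create` command line with the channels in the order listed on the slide. The helper name and the environment naming scheme (`__<package>@<version>`) are assumptions for the example; the command is only built, not executed.

```python
import shlex

# Channel order as listed on the slide.
CHANNELS = ("iuc", "bioconda", "defaults", "conda-forge")


def conda_create_args(package: str, version: str, channels=CHANNELS) -> list:
    """Build (but do not run) a conda command for one versioned package."""
    env_name = f"__{package}@{version}"  # assumed per-requirement environment name
    args = ["conda", "create", "--yes", "--name", env_name]
    for channel in channels:
        # Channels given earlier on the command line are searched first.
        args += ["--channel", channel]
    args.append(f"{package}={version}")
    return args


args = conda_create_args("seqtk", "1.2")
print(shlex.join(args))
```

One environment per package and version is what lets several versions of the same software coexist on one server.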

  39. Installing tools with Conda

  40. Managing Tool Dependencies

  41. Managing Tool Dependencies

  42. Tool Dependencies 2018
    • Conda with Bioconda is the best practice for tool
    dependency management in Galaxy
    • All tools in the “devteam” and “iuc” repositories now
    use requirement specifications that can be
    resolved by conda
    • ToolShed packages still supported, but deprecated
    • Result: completely automatic installation of all the
    software needed to run a Galaxy workflow

  43. But what about Containerization?

  44. Why we like containers
    • Isolation: can ensure that software is running in a
    known (and minimal) environment, limit side
    effects
    • Better reproducibility
    • Packaging and distribution, leverage existing
    ecosystem for deploying and running software
    • Security? Would be nice to be able to count on
    that…

  45. 2015: Galaxy in Docker

  46. Galaxy and Containers
    • We can run Galaxy itself in a container. Great. Super
    easy distribution.
    • Next level: run tools in containers, so that every
    step of a workflow runs in an isolated environment

  47. Configuring a Galaxy instance to use Docker
    Configure the destination. For instance, transform the cluster destination:

    --time=00:05:00

    as follows:

    --time=00:05:00
    true
    false
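
A minimal sketch of that transformation in Python, assuming the values above are the contents of `<param>` elements in a job destination, and that the two added values belong to Galaxy's `docker_enabled` and `docker_sudo` parameters. The destination id `cluster`, the `slurm` runner, and the `nativeSpecification` parameter id are placeholders, not taken from the slide.

```python
import xml.etree.ElementTree as ET

# Assumed starting point; the id and runner attributes are placeholders.
DESTINATION_XML = """
<destination id="cluster" runner="slurm">
    <param id="nativeSpecification">--time=00:05:00</param>
</destination>
"""


def enable_docker(destination: ET.Element) -> ET.Element:
    """Append the two Docker-related params to a destination element."""
    for param_id, value in (("docker_enabled", "true"), ("docker_sudo", "false")):
        param = ET.SubElement(destination, "param", id=param_id)
        param.text = value
    return destination


destination = enable_docker(ET.fromstring(DESTINATION_XML))
params = {p.get("id"): p.text for p in destination.findall("param")}
print(params)
```

The existing scheduler options stay untouched; only the destination gains the extra parameters, so the same job runner keeps submitting the work.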

    But, how do we find the right container for a tool?

  48. Remember that requirement metadata?
    seqtk


  49. Automatic container resolution
    Galaxy will automatically find or build containers for best practice
    tools.
    Let’s lint that tool config with Planemo:
    $ planemo lint --biocontainers seqtk_seq.xml
    ...
    Applying linter biocontainer_registered... CHECK
    .. INFO: BioContainer best-practice container
    found [quay.io/biocontainers/seqtk:1.2--0].
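
The container name in that lint output appears to follow a `registry/package:version--build` pattern. The Python sketch below builds such a name from a single requirement; the pattern is inferred from this one example, and the `biocontainer_image` helper and its default build string are assumptions.

```python
def biocontainer_image(
    package: str,
    version: str,
    build: str = "0",  # assumed default build string
    registry: str = "quay.io/biocontainers",
) -> str:
    """Compose an image reference of the form registry/package:version--build."""
    return f"{registry}/{package}:{version}--{build}"


image = biocontainer_image("seqtk", "1.2")
print(image)
```

Because the name is derived purely from the requirement, the tool description never has to mention a container at all.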

  50. BioContainers

  51. Bioconda + Containers
    Given a set of packages and versions in Conda/
    Bioconda, we can build a container with just that
    software on a minimal base image
    If we use the same base image, we can reconstruct
    exactly the same container (since we archive all
    binary builds of all versions)
    With automation, these containers can be built
    automatically for every package with no manual
    modification or intervention (e.g. mulled)

  52. Travis CI pulls recipes and builds
    in minimal docker container
    Successful builds from main
    repo uploaded to Anaconda
    to be installed anywhere
    Same binary from bioconda installed into
    minimal container for each provider
    rkt Singularity

  53. Bioconda + Containers + Virtualization
    If we run our containers inside a specific (ideally
    minimal) known VM we can control the kernel
    environment as well
    Atmosphere
    funded by the National Science Foundation

  54. Tool and dependency binaries, built in minimal
    environment with controlled libs
    Container defines minimum environment
    Virtual machine controls kernel and apparent
    hardware environment
    KVM, Xen, ….
    Increasingly precise environment control

  55. CloudMan: General purpose deployment
    manager for any cloud. Cluster and service
    management, auto-scaling
    Cloudbridge: New abstraction library for
    working with multiple cloud APIs
    Genomics Virtual Lab: CloudMan + Galaxy +
    many other common bioinformatics tools
    and frameworks
    Galaxy Cloud

  56. What about multiple dependencies?
    Generate containers based on a reproducible hash
    of package name and version
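
One way to get a reproducible name for a set of packages is to hash a canonical, sorted listing of them, as in the Python sketch below. This illustrates the idea only and is not necessarily the exact scheme Galaxy uses; the `multi-` prefix, the `name=version` line format, the choice of SHA-1, and the example packages are all assumptions.

```python
import hashlib


def multi_package_image_name(packages: dict) -> str:
    """Derive a deterministic image name from {package name: version}."""
    # Sorting makes the result independent of the order requirements were listed in.
    lines = [f"{name}={version}" for name, version in sorted(packages.items())]
    digest = hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()
    return f"multi-{digest}"  # hypothetical prefix


name_a = multi_package_image_name({"samtools": "1.3.1", "bwa": "0.7.15"})
name_b = multi_package_image_name({"bwa": "0.7.15", "samtools": "1.3.1"})
print(name_a)
```

Any two tools that declare the same packages at the same versions end up pointing at the same container, while a change in any version yields a different name.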

  57. Not just for Galaxy

  58. Not just for Galaxy
    Docker requirement,
    tightly coupled

  59. Not just for Galaxy
    Docker requirement,
    tightly coupled
    Software requirement,
    can be resolved in an
    environment specific way

  60. Not just for Galaxy
    Docker requirement,
    tightly coupled
    Software requirement,
    can be resolved in an
    environment specific way
    Implemented in “galaxy-lib” — integrated in CWL
    reference implementation, …
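
The contrast between the two kinds of requirement can be sketched with CWL-style structures written as Python dictionaries. `DockerRequirement` with `dockerPull` and `SoftwareRequirement` with `packages` are CWL constructs; the `resolve` function, the environment names, and the assumed `--0` build suffix are hypothetical and do not represent the galaxy-lib API.

```python
# Tightly coupled: the description names one specific container image.
docker_requirement = {
    "class": "DockerRequirement",
    "dockerPull": "quay.io/biocontainers/seqtk:1.2--0",
}

# Loosely coupled: the description names software, not how to obtain it.
software_requirement = {
    "class": "SoftwareRequirement",
    "packages": [{"package": "seqtk", "version": ["1.2"]}],
}


def resolve(requirement: dict, environment: str) -> list:
    """Hypothetical resolver: return the actions needed in a given environment."""
    if requirement["class"] == "DockerRequirement":
        return [f"docker pull {requirement['dockerPull']}"]
    actions = []
    for pkg in requirement["packages"]:
        name, version = pkg["package"], pkg["version"][0]
        if environment == "conda":
            actions.append(f"conda install {name}={version}")
        elif environment == "container":
            # Build suffix "--0" is an assumption for this example.
            actions.append(f"docker pull quay.io/biocontainers/{name}:{version}--0")
        else:
            raise ValueError(f"unknown environment: {environment}")
    return actions


print(resolve(docker_requirement, "conda"))
print(resolve(software_requirement, "conda"))
print(resolve(software_requirement, "container"))
```

The Docker requirement yields the same action everywhere, whereas the software requirement can become a Conda install on one system and a container pull on another.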

  61. Galaxy Containerization
    • Galaxy can run all jobs in containers
    • Uses existing job runners as long as the target
    supports the container engine (Docker or
    Singularity)
    • Resolve containers using existing requirement
    tags, allows flexibility in how dependencies are
    resolved in different environments
    • Doesn’t matter if Galaxy itself is running in a
    container or not.

  62. Anatomy of a Galaxy instance

  63. Anatomy of a Galaxy instance (“out of the box”)
    Web Server
    (Paste or uwsgi)
    Database
    (sqlite)
    Job Runner
    (local)
    Object Store
    (local files)

  64. Anatomy of a Galaxy instance (“production”)
    Web Server
    (uwsgi)
    Database
    (postgres…)
    Job Runner(s) Object Store
    Proxy Server
    (nginx)

  65. Anatomy of a Galaxy instance: Running Jobs
    Web Server
    (uwsgi)
    Database
    (postgres…)
    Job Runner(s) Object Store
    Proxy Server
    (nginx)
    Any number of Job Runners: Slurm, PBS, Grid Engine, Condor, Kubernetes…

  66. Anatomy of a Galaxy instance: Storing Data
    Web Server
    (uwsgi)
    Database
    (postgres…)
    Job Runner(s) Object Store
    Proxy Server
    (nginx)
    Any number of object stores: Files, S3, Azure, iRods, … Hierarchical, distributed…

  67. Anatomy of a Galaxy instance: More services…
    Web Server
    (uwsgi)
    Database
    (postgres…)
    Job Runner(s) Object Store
    Proxy Server
    (nginx)
    Messaging
    (RabbitMQ)
    File Transfers
    (ProFTPd)

  68. It can get pretty complicated (e.g. usegalaxy.org)
    PSC, Pittsburgh
    Stampede
    ● 462,462 cores
    ● 205 TB memory
    Blacklight
    Bridges
    Dedicated resources Shared XSEDE resources
    TACC
    Austin
    Galaxy Cluster (Rodeo)
    ● 256 cores
    ● 2 TB memory
    Corral/Stockyard
    ● 20 PB disk
    funded by the National Science Foundation
    Award #ACI-1445604
    PTI IU Bloomington

  69. Main Compute Architecture
    [Diagram labels: Jetstream (TACC); Jetstream (IU); Stampede (TACC);
    Bridges (PSC); VMWare; web-01, web-02, web-03, web-04; db-01; slurm;
    rabbitmq; pulsar; slurm/pulsar; instance; Swarm @ JS; Swarm;
    dedicated cluster (roundup49 ... roundup64);
    Corral (2 PB dataset storage); NFS; PostgreSQL; Slurm;
    Pulsar/AMQP; Pulsar/HTTP]

  70. 125,000 registered users
    2 PB user data
    19M jobs run
    100 training events (2017 & 2018)
    Stats for Galaxy Main (usegalaxy.org) in May 2018

  71. Orchestrating all this is hard
    • Lots of pieces involved in maintaining a Galaxy
    instance
    • Galaxy is deployed in a wide variety of
    environments, from appliances to institutional HPC
    to all different sorts of clouds
    • In recent years we’ve introduced lots of automation
    to make things easier, primarily through Ansible.

  72. Containers all the way down

  73. “Production-Grade Container Orchestration”

  74. Galaxy + Kubernetes Stack
    • Kubernetes: “an open-source system for
    automating deployment, scaling, and
    management of containerized applications.”
    • Helm: “The package manager for Kubernetes.”
    • Rancher: “Enterprise management for Kubernetes.
    Every distro. Every cluster. Every cloud.”

  75. Demo?
    http://launch.usegalaxy.org

  76. Acknowledgements
    Team: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave
    Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy
    Goecks, Björn Grüning, Sam Guerler, Jennifer Hillman-Jackson, Anton
    Nekrutenko, Helena Rasche, Nicola Soranzo, Marius van den Beek
    +CloudLaunch+Kubernetes: Nuwan Goonasekera, Pablo Moreno
    Other lab members : Boris Brenerman, Min Hyung Cho, Peter DeFord,
    Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy
    Collaborators:

    Craig Stewart and the group

    Ross Hardison and the VISION group

    Victor Corces (Emory), Karen Reddy (JHU)

    Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology)

    Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
    NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620)

    NIDDK (DK065806) and NSF (DBI 0543285, DBI 0850103)
    funded by the National Science Foundation
    Award #ACI-1445604
