Containers and Workflows in Galaxy

Presented at the Rocky Mountain Genomics HackCon workshop on Containers and Workflows in Genomics #rmghc18.

James Taylor

June 18, 2018
Transcript

  1. Containers and workflows in Galaxy
    @jxtx / #usegalaxy
    https://speakerdeck.com/jxtx

  2. Acknowledgements
    Team: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave
    Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy
    Goecks, Björn Grüning, Sam Guerler, Jennifer Hillman-Jackson, Anton
    Nekrutenko, Helena Rasche, Nicola Soranzo, Marius van den Beek
    +CloudLaunch+Kubernetes: Nuwan Goonasekera, Pablo Moreno
    Other lab members: Boris Brenerman, Min Hyung Cho, Peter DeFord,
    Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy
    Collaborators:

    Craig Stewart and the group

    Ross Hardison and the VISION group

    Victor Corces (Emory), Karen Reddy (JHU)

    Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology)

    Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
    NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620)

    NIDDK (DK065806) and NSF (DBI 0543285, DBI 0850103)
    funded by the National Science Foundation
    Award #ACI-1445604

  3. What is Galaxy?

  4. Galaxy is a web-based analysis environment for
    accessible, transparent, and reproducible
    scientific research
    Initially built for genomics, but intended to support
    any compute and data intensive discipline
    Provided both as a free public SaaS application
    (usegalaxy.org), and open-source software

  5. (image-only slide)

  6. Tools and workflows in Galaxy
    (Today we’re going to focus on tools and workflows, but Galaxy
    is much more including visualization, interactive environments,
    sharing and publishing integrated analyses…)

  7. The core unit of analysis in Galaxy is the Tool
    • Defined in terms of an abstract
    interface (inputs and outputs)
    • In practice, mostly
    command line tools, a
    declarative XML description
    of the interface, how to
    generate a command line
    • Designed to be as easy as
    possible for tool authors, while
    still allowing rigorous
    reasoning
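A minimal tool description in this style might look like the following sketch (modeled on the seqtk example used later in this deck; the exact id, labels, and flags are illustrative):

```xml
<tool id="seqtk_seq" name="Convert FASTQ to FASTA (seqtk)" version="1.2.0">
    <!-- Declares the software the tool needs (resolved later via Conda, containers, ...) -->
    <requirements>
        <requirement type="package" version="1.2">seqtk</requirement>
    </requirements>
    <!-- Template for generating the command line from parameter values -->
    <command detect_errors="exit_code"><![CDATA[
        seqtk seq -a '$input1' > '$output1'
    ]]></command>
    <!-- Abstract interface: inputs... -->
    <inputs>
        <param type="data" name="input1" format="fastq" label="Input FASTQ file"/>
    </inputs>
    <!-- ...and outputs -->
    <outputs>
        <data name="output1" format="fasta"/>
    </outputs>
</tool>
```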

  8. (image-only slide)

  9. User interface generated from abstract parameter description

  10. Template for generating command line from parameter values

  11. Functional tests to be run
    with the “full stack” in place
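In the tool XML these tests are expressed declaratively; a sketch (the fixture file names are illustrative):

```xml
<tests>
    <!-- Run the tool with this input and compare the result to the expected file -->
    <test>
        <param name="input1" value="2.fastq"/>
        <output name="output1" file="2.fasta"/>
    </test>
</tests>
```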

  12. Complex interfaces can be described

  13. Repeating groups of parameters

  14. Conditional groups, grouping constructs can be nested
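A sketch of how these grouping constructs combine (parameter names are illustrative): a repeat nested inside a conditional.

```xml
<inputs>
    <conditional name="mode">
        <param name="mode_select" type="select" label="Settings to use">
            <option value="simple">Simple</option>
            <option value="advanced">Advanced</option>
        </param>
        <when value="simple"/>
        <when value="advanced">
            <!-- Repeating group: the user can add any number of filters -->
            <repeat name="filters" title="Filter">
                <param name="expr" type="text" label="Filter expression"/>
            </repeat>
        </when>
    </conditional>
</inputs>
```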

  15. Template language for building complex command lines
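Galaxy command templates use the Cheetah template language; a sketch referencing hypothetical conditional/repeat parameters:

```
#if $mode.mode_select == "advanced"
    #for $f in $mode.filters
        --filter '$f.expr'
    #end for
#end if
```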

  16. Or additional configuration files, scripts, ...

  17. Workflows combine tools into more complex analyses

  18. Simple data flows
    Output of bwa mapping (a SAM file) flows into MACS for peak calling

  19. Parallelized data flows
    Input is a “dataset collection” with any number of paired-read datasets

  20. Parallelized data flows
    HISAT2 alignments are run in parallel across each dataset in the collection

  21. Parallelized data flows
    StringTie transcript assembly is also run in parallel across all datasets

  22. Parallelized data flows
    Transcripts from all datasets are combined to produce a consensus assembly

  23. Galaxy Workflows: Summary
    • A directed graph capturing data flow relationships
    between a set of tools (the steps of the workflow)
    • With some extras
    • Map and reduce data flows enabled by dataset
    collections
    • Sub-workflows
    • Pause points, decision points, re-planning

  24. Tooling to develop tools and workflows:
    “A scientific workflow SDK”
    Planemo is the way to develop Galaxy tools.
    Linting, testing, … supporting every aspect of
    the tool development lifecycle

  25. Run a workflow in a dynamically created or existing
    Galaxy instance

  26. Build / edit your workflows in a text editor
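Workflows can be written in a YAML-based syntax (“Format 2”) that suits version control; a rough sketch (tool ids and field values are illustrative, and the exact schema may differ):

```yaml
class: GalaxyWorkflow
inputs:
  input_fastq: data
steps:
  convert:
    tool_id: seqtk_seq
    in:
      input1: input_fastq
```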

  27. Testing workflows

  28. Realizing tool and workflow execution

  29. From “tools” to “dependencies”
    • Tools and workflows describe the interface to the
    underlying software
    • Given valid data and parameters, we can realize
    this as an ordered graph of command lines to
    execute
    • But, we still need to ensure that the appropriate
    software is available

  30. History of Galaxy tool dependencies…
    • In the beginning, just install everything on the
    PATH Galaxy is using. I know. Look, it was 2005…
    • Biggest problem: versioning.
    • We soon had workflows where different steps
    required different versions of some
    underlying software (hello samtools…)
    • For reproducibility, we wanted to be able to
    run workflows with older versions of the
    underlying software

  31. Requirement metadata

    <requirements>
        <requirement type="package" version="1.2">seqtk</requirement>
    </requirements>
  32. History of Galaxy tool dependencies…
    • Take 2: Dependency Resolvers
    • Allow a command line to be augmented based
    on a tool’s requirements (with a plugin interface,
    of course)
    • Default implementation looks for a directory
    based on the tool name/version and runs a shell
    script “env.sh” which adds to the environment
    • Alternative implementations for modules, brew,
    … soon followed
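The default resolver’s convention can be sketched in a few lines of shell (the directory layout and the seqtk path are illustrative, not a real deployment):

```shell
# One directory per dependency name/version, each containing an env.sh
mkdir -p tool_dependencies/seqtk/1.2
cat > tool_dependencies/seqtk/1.2/env.sh <<'EOF'
# Prepend this build's binaries to PATH (path is hypothetical)
export PATH="/opt/seqtk/1.2/bin:$PATH"
EOF

# Before running the tool's command line, the resolver sources the script,
# augmenting the environment for exactly that version
. tool_dependencies/seqtk/1.2/env.sh
```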

  33. (image-only slide)

  34. History of Galaxy tool dependencies…
    • Take 3: The Galaxy ToolShed enters the scene
    • Uses a similar structure, separating
    dependencies/versions into different directories
    • But, includes installation recipes so that the
    Galaxy maintainer no longer needs to install
    each tool manually
    • Made sense at the time, but packaging is hard,
    and was basically a nightmare

  35. History of Galaxy tool dependencies…
    • Take 4: The packaging ecosystem is getting better,
    and community-developed projects are realizing the
    importance of OS-independent, versioned package
    management. Let’s join in!

  36. https://bioconda.github.io

  37. It is now reasonable to support one major
    server platform — Linux
    (this is great for portability and reproducibility, but scary
    for other reasons — monoculture leads to fragility)

  38. Builds on Conda packaging system, designed
    “for installing multiple versions of software
    packages and their dependencies and
    switching easily between them”
    More than 4000 recipes for software packages
    All packages are automatically built in a
    minimal environment to ensure isolation and
    portability

  39. Submit recipe to GitHub
    Travis CI pulls recipes and builds
    in minimal docker container
    Successful binary builds from
    main repo uploaded to Anaconda
    to be installed anywhere

  40. Conda: Key Features for Galaxy
    • No compilation at install time - binaries with their
    dependencies, libraries...
    • Support for all operating systems Galaxy targets
    • Easy to manage multiple versions of the same
    recipe
    • HPC-ready: no root privileges needed
    • Easy-to-write YAML recipes
    • Community - not restricted to Galaxy

  41. Best practice channels
    • Conda channels searched by Galaxy for packages
    • iuc
    • bioconda
    • defaults
    • conda-forge
    • Galaxy now automatically installs Conda when first
    launched and will use Bioconda and other
    channels for package resolution
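Expressed as a Conda configuration file, the channel priority above would look like this (an illustrative .condarc; Galaxy manages this ordering itself rather than requiring you to write it):

```yaml
# Channel order defines search priority
channels:
  - iuc
  - bioconda
  - defaults
  - conda-forge
```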

  42. Installing tools with Conda

  43. Managing Tool Dependencies

  44. Managing Tool Dependencies

  45. Tool Dependencies 2018
    • Conda and Bioconda are the best practice for tool
    dependency management in Galaxy
    • All tools in the “devteam” and “iuc” repositories now
    use requirement specifications that can be
    resolved by conda
    • ToolShed packages still supported, but deprecated
    • Result: completely automatic installation of all the
    software needed to run a Galaxy workflow

  46. But what about containerization?

  47. Why we like containers
    • Isolation: can ensure that software is running in a
    known (and minimal) environment, limit side
    effects
    • Better reproducibility
    • Packaging and distribution, leverage existing
    ecosystem for deploying and running software
    • Security? Would be nice to be able to count on
    that…

  48. 2015: Galaxy in Docker

  49. Galaxy and Containers
    • We can run Galaxy itself in a container. Great. Super
    easy distribution.
    • Next level: run tools in containers, so that every
    step of a workflow runs in an isolated environment

  50. Configuring a Galaxy instance to use Docker
    Configure the destination in job_conf.xml. For instance, transform the cluster destination:

    <destination id="cluster" runner="slurm">
        <param id="nativeSpecification">--time=00:05:00</param>
    </destination>

    as follows (the destination id shown is illustrative):

    <destination id="cluster" runner="slurm">
        <param id="nativeSpecification">--time=00:05:00</param>
        <param id="docker_enabled">true</param>
        <param id="docker_sudo">false</param>
    </destination>

    But, how do we find the right container for a tool?

  51. Remember that requirement metadata?

    <requirements>
        <requirement type="package" version="1.2">seqtk</requirement>
    </requirements>
  52. Automatic container resolution
    Galaxy will automatically find or build containers for best practice
    tools.
    Let’s lint that tool config with Planemo:
    $ planemo lint --biocontainers seqtk_seq.xml
    ...
    Applying linter biocontainer_registered... CHECK
    .. INFO: BioContainer best-practice container
    found [quay.io/biocontainers/seqtk:1.2--0].

  53. BioContainers

  54. Bioconda + Containers
    Given a set of packages and versions in Conda/
    Bioconda, we can build a container with just that
    software on a minimal base image
    If we use the same base image, we can reconstruct
    exactly the same container (since we archive all
    binary builds of all versions)
    With automation, these containers can be built
    automatically for every package with no manual
    modification or intervention (e.g. mulled)

  55. Travis CI pulls recipes and builds in a minimal Docker container.
    Successful builds from the main repo are uploaded to Anaconda
    to be installed anywhere.
    The same binary from Bioconda is installed into a minimal
    container for each provider: Docker, rkt, Singularity.

  56. Bioconda + Containers + Virtualization
    If we run our containers inside a specific (ideally
    minimal) known VM, we can control the kernel
    environment as well
    (Atmosphere; funded by the National Science Foundation)

  57. Tool and dependency binaries, built in minimal
    environment with controlled libs
    Container defines minimum environment
    Virtual machine controls kernel and apparent
    hardware environment
    KVM, Xen, ….
    Increasingly precise environment control

  58. CloudMan: General purpose deployment
    manager for any cloud. Cluster and service
    management, auto-scaling
    Cloudbridge: New abstraction library for
    working with multiple cloud APIs
    Genomics Virtual Lab: CloudMan + Galaxy +
    many other common bioinformatics tools
    and frameworks
    Galaxy Cloud

  59. What about multiple dependencies?
    Generate containers based on a reproducible hash
    of package name and version
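The idea can be sketched in shell (this only illustrates “hash the sorted package list”; the real mulled naming scheme differs in its details):

```shell
# Canonicalize: sort name=version pairs so order doesn't matter,
# then hash to get a deterministic container tag
targets=$(printf '%s\n' 'samtools=1.9' 'bwa=0.7.17' | sort | paste -sd, -)
tag="mulled-like-$(printf '%s' "$targets" | sha256sum | cut -c1-12)"
echo "$tag"
```

Any set of packages and versions therefore maps to exactly one container name, regardless of the order in which they are listed.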

  60. Not just for Galaxy

  61. Not just for Galaxy
    Docker requirement,
    tightly coupled

  62. Not just for Galaxy
    Docker requirement,
    tightly coupled
    Software requirement,
    can be resolved in an
    environment specific way

  63. Not just for Galaxy
    Docker requirement,
    tightly coupled
    Software requirement,
    can be resolved in an
    environment specific way
    Implemented in “galaxy-lib” — integrated in CWL
    reference implementation, …

  64. Galaxy Containerization
    • Galaxy can run all jobs in containers
    • Uses existing job runners as long as the target
    supports the container engine (Docker or
    Singularity)
    • Resolve containers using existing requirement
    tags, allows flexibility in how dependencies are
    resolved in different environments
    • Doesn’t matter if Galaxy itself is running in a
    container or not.

  65. Anatomy of a Galaxy instance

  66. Anatomy of a Galaxy instance (“out of the box”)
    Web Server
    (Paste or uwsgi)
    Database
    (sqlite)
    Job Runner
    (local)
    Object Store
    (local files)

  67. Anatomy of a Galaxy instance (“production”)
    Web Server
    (uwsgi)
    Database
    (postgres…)
    Job Runner(s)
    Object Store
    Proxy Server
    (nginx)

  68. Anatomy of a Galaxy instance: Running Jobs
    Web Server
    (uwsgi)
    Database
    (postgres…)
    Job Runner(s)
    Object Store
    Proxy Server
    (nginx)
    Any number of Job Runners: Slurm, PBS, Grid Engine, Condor, Kubernetes…

  69. Anatomy of a Galaxy instance: Storing Data
    Web Server
    (uwsgi)
    Database
    (postgres…)
    Job Runner(s)
    Object Store
    Proxy Server
    (nginx)
    Any number of object stores: Files, S3, Azure, iRods, … Hierarchical, distributed…

  70. Anatomy of a Galaxy instance: More services…
    Web Server
    (uwsgi)
    Database
    (postgres…)
    Job Runner(s)
    Object Store
    Proxy Server
    (nginx)
    Messaging
    (RabbitMQ)
    File Transfers
    (ProFTPd)

  71. It can get pretty complicated (e.g. usegalaxy.org)
    Dedicated resources (TACC, Austin):
    Galaxy Cluster (Rodeo): 256 cores, 2 TB memory
    Corral/Stockyard: 20 PB disk
    Shared XSEDE resources:
    Stampede: 462,462 cores, 205 TB memory
    Blacklight and Bridges (PSC, Pittsburgh)
    funded by the National Science Foundation, Award #ACI-1445604
    PTI IU Bloomington

  72. Main Compute Architecture (diagram)
    Jetstream (TACC): VMWare; web-01, web-02, web-03, web-04; db-01 (PostgreSQL); slurm; rabbitmq; pulsar
    Stampede (TACC): slurm/pulsar instances
    Bridges (PSC): pulsar
    Jetstream (IU): slurm/pulsar instances; Swarm @ JS instances
    Links: Pulsar/AMQP, Pulsar/HTTP, Slurm, PostgreSQL, NFS
    Storage: Corral (2 PB dataset storage)
    Dedicated cluster: roundup49 ... roundup64

  73. Stats for Galaxy Main (usegalaxy.org) in May 2018
    125,000 registered users
    2 PB user data
    19M jobs run
    100 training events (2017 & 2018)

  74. Orchestrating all this is hard
    • Lots of pieces involved in maintaining a Galaxy
    instance
    • Galaxy is deployed in a wide variety of
    environments, from appliances to institutional HPC
    to all different sorts of clouds
    • In recent years we’ve introduced lots of automation
    to make things easier, primarily through Ansible.

  75. Containers all the way down

  76. “Production-Grade Container Orchestration”

  77. Galaxy + Kubernetes Stack
    • Kubernetes: “an open-source system for
    automating deployment, scaling, and
    management of containerized applications.”
    • Helm: “The package manager for Kubernetes.”
    • Rancher: “Enterprise management for Kubernetes.
    Every distro. Every cluster. Every cloud.”

  78. Demo?
    http://launch.usegalaxy.org

  79-93. (demo screenshots)

  94. Acknowledgements (repeats slide 2)