ISMB 2017: Supporting highly scalable scientific data analysis with Galaxy

Technology Talk for ISMB 2017 on 1) Galaxy scalability to thousands of samples, 2) Practical reproducibility with #bioconda, #biocontainers, and virtualization, and 3) [didn't get to this] working with Galaxy entirely from the command line.

James Taylor

July 22, 2017

Transcript

  1. Supporting highly scalable scientific
    data analysis with
    @jxtx / #usegalaxy
    https://speakerdeck.com/jxtx

    View Slide

  2. 0. What is Galaxy?
    1. Galaxy support for large-scale analysis
    2. An infrastructure stack for practical reproducibility
    3. Galaxy without the UI

    View Slide

  3. What happens to traditional research outputs
    when an area of science rapidly becomes data
    intensive?

    View Slide

  4. Idea
    Experiment
    Raw Data
    Tidy Data
    Summarized data
    Results
    Experimental design
    Data collection
    Data cleaning
    Data analysis
    Inference
    Data Pipeline, inspired by Leek and Peng, Nature 2015
    The part we are
    considering here
    The part that
    ends up in the
    Publication

    View Slide

  5. Goals
    Accessibility: Eliminate barriers for researchers
    wanting to use complex methods, make these
    methods available to everyone
    Transparency: Facilitate communication of analyses
    and results in ways that are easy to understand while
    providing all details
    Reproducibility: Ensure that analysis performed in
    the system can be reproduced precisely and
    practically

    View Slide

  6. Galaxy: accessible analysis system

    View Slide

  7. A free (for everyone) web service integrating a
    wealth of tools, compute resources, terabytes of
    reference data and permanent storage
    Open source software that makes integrating your
    own tools and data and customizing for your own
    site simple
    An open extensible platform for sharing tools,
    datatypes, workflows, ...

    View Slide

  8. Describe analysis tool
    behavior abstractly
    Analysis environment automatically
    and transparently tracks details
    Workflow system for complex analysis,
    constructed explicitly or automatically
    Pervasive sharing, and publication
    of documents with integrated analysis

    View Slide

  9. 1. Analysis user interfaces for large-scale
    data analyses: An example using
    Dataset Collections

    View Slide

  10. John Chilton

    View Slide

  11. View Slide

  12. View Slide

  13. View Slide

  14. Single Dataset In
    One execution of bwa mem
    Single Dataset Out

    View Slide

  15. View Slide

  16. Collection In
    “Map” over collection, execute bwa mem for each element
    Collection Out

    View Slide
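
The "map over collection" idea on the two slides above can be sketched in a few lines of Python. This is only an illustration of the pattern, not Galaxy's implementation: `map_over_collection` and `fake_bwa_mem` are hypothetical names, and the stand-in tool merely renames a file instead of running bwa mem.

```python
from typing import Callable, Dict


def map_over_collection(tool: Callable[[str], str],
                        collection: Dict[str, str]) -> Dict[str, str]:
    """Run `tool` once per element and keep the element identifiers as keys."""
    return {element_id: tool(dataset)
            for element_id, dataset in collection.items()}


def fake_bwa_mem(fastq_name: str) -> str:
    # Hypothetical stand-in for one tool execution: returns the output name.
    return fastq_name.replace(".fastq", ".bam")


# A collection in, one execution per element, a collection out.
reads = {"cell_001": "cell_001.fastq", "cell_002": "cell_002.fastq"}
alignments = map_over_collection(fake_bwa_mem, reads)
```

Because the output collection reuses the input identifiers, each result can still be traced back to the sample it came from.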

  17. View Slide

  18. Nestorowa et al. (GSE81682)
    Single-cell RNA-seq analysis of 3,840 cells
    (432 LT-HSCs, 1704 HSC-MPPs, and 1704 HPCs)
    Sequenced ~1-2 million reads per cell:
    3.4 TB raw data.

    View Slide

  19. Critical points framework needs to address
    Keeping the naming traceable

    Collapsing single cell data to single tables

    Operating on an unknown number of columns

    Visualize hundreds of samples easily

    View Slide

  20. Critical points framework needs to address
    Keeping the naming traceable
    Collections

    Collapsing single cell data to single tables
    Collection collapse (“reduce”)

    Operating on an unknown number of columns
    Melt and cast tools

    Visualize hundreds of samples easily
    New visualization tools

    View Slide
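
To show what "melt and cast" means for a table with an unknown number of columns, here is a minimal standard-library sketch. It is not the code of the Galaxy tools themselves; the function names, the `gene` id column, and the sample values are assumptions made up for the example.

```python
from collections import defaultdict


def melt(rows, id_column):
    """Wide to long: one (id, variable, value) record per non-id column."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                long_rows.append((row[id_column], column, value))
    return long_rows


def cast(long_rows):
    """Long to wide: one dict per id, one key per variable seen."""
    wide = defaultdict(dict)
    for row_id, variable, value in long_rows:
        wide[row_id][variable] = value
    return dict(wide)


# Hypothetical expression table; the number of cell columns is not fixed.
table = [
    {"gene": "geneA", "cell_001": 5, "cell_002": 0},
    {"gene": "geneB", "cell_001": 2, "cell_002": 7},
]
long_form = melt(table, id_column="gene")
wide_form = cast(long_form)
```

In long form every row has the same three fields, so downstream steps do not need to know how many cells were sequenced.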

  21. Import from SRA a
    list of dataset pairs
    Read QC Mapping Quantification Comprehensive
    expression table
    Collection
    collapse
    Cell based
    metrics
    Expression table
    of cells passing
    filters
    Expression table
    of cells and genes
    passing filters
    Table of z-scores
    per gene per cell
    Report of
    experimental
    metrics
    Mo Heydarian

    View Slide

  22. q workflow
    Mo Heydarian

    View Slide

  23. View Slide

  24. QC, Trimming, and HiSat+StringTie
    workflow per-cell

    View Slide

  25. Collection collapse, “reduce” to
    aggregate elements of collection
    into single dataset

    View Slide
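
A rough sketch of the collapse ("reduce") step, assuming each collection element is a small per-cell table of (gene, count) rows. The function name and the data are hypothetical; the point is only that the element identifier is carried into the merged table so naming stays traceable.

```python
def collapse_collection(collection):
    """Merge per-element tables into one, tagging each row with its element id."""
    merged = []
    for element_id, rows in collection.items():
        for gene, count in rows:
            merged.append((element_id, gene, count))
    return merged


# Hypothetical per-cell quantification results.
per_cell = {
    "cell_001": [("geneA", 5), ("geneB", 2)],
    "cell_002": [("geneA", 0), ("geneB", 7)],
}
expression_table = collapse_collection(per_cell)
```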

  26. Downstream analysis using single
    datasets and collections

    View Slide

  27. Big Fella taking big strides
    Processing all 3840 cells took 108 hours and
    generated 100,149 history items!
    Zero errors!
    Mo Heydarian

    View Slide

  28. 1. The results look correct in aggregate
    The data looks about right.
    tSNE clustering resembles our understanding of
    hematopoiesis

    View Slide

  29. 2. Novel lncRNAs follow “jackpot model”
    My lncRNAs are expressed in real cells and in
    jackpot model across the population

    View Slide

  30. What about the backend?
    Extensive improvements to the Galaxy workflow system to
    support analysis at this scale.
    Robustness: pausing, partial restarts, better
    recovery, better throughput
    (but nothing you can see)

    View Slide

  31. Galaxy’s workflow system is robust, flexible,
    and integrates with nearly any environment
    Install locally with many compute environments
    Deploy on a cloud using Cloudman
    Atmosphere

    View Slide

  32. For example, the single-cell RNA-seq
    analysis was run on
    Running Galaxy version 16.10
    Head node: 16 core, 122 GB
    (r4.4xlarge)
    Worker nodes: 2 x 16 core, 122 GB
    (r4.4xlarge)
    10 TB EBS volume

    View Slide

  33. 2. An infrastructure stack for practical
    reproducibility

    View Slide

  34. 1 2 3 ∞
    http://usegalaxy.org
    http://usegalaxy.org/community
    ...
    Galaxies on
    private clouds
    Galaxies on
    public clouds
    ...
    private Galaxy installations
    Private Tool Sheds
    Galaxy Tool
    Shed

    View Slide

  35. State of the Galaxy ToolShed
    ToolShed now contains thousands of tools
    Community response has been phenomenal
    However, packaging is challenging — it never ends!
    Need to move to a model that pulls in and integrates with
    a broader community

    View Slide

  36. Packaging software for
    reproducible research

    View Slide

  37. Portability and Isolation
    are crucial for practical reproducibility

    View Slide

  38. https://bioconda.github.io

    View Slide

  39. It is now reasonable to support one major
    server platform — Linux
    (this is great for portability and reproducibility, but scary
    for other reasons — monoculture leads to fragility)

    View Slide

  40. Builds on Conda packaging system, designed
    “for installing multiple versions of software
    packages and their dependencies and
    switching easily between them”
    ~2200 recipes for software packages
    (as of yesterday)
    All packages are automatically built in a
    minimal environment to ensure isolation and
    portability

    View Slide

  41. Submit recipe to GitHub
    Travis CI pulls recipes and builds
    in minimal docker container
    Successful binary builds from
    main repo uploaded to Anaconda
    to be installed anywhere

    View Slide

  42. Containers for composing and recreating
    complete environments

    View Slide

  43. rkt Singularity

    View Slide

  44. Containerization
    Builds on Linux kernel features enabling complete
    isolation from the kernel level up
    Containers — lightweight environments with
    isolation enforced at the OS level, complete control
    over all software
    Adds a complete ecosystem for sharing, versioning,
    managing containers — e.g. Docker hub, quay.io

    View Slide

  45. Galaxy + Containers
    Run every analysis in a clean container: analyses
    are isolated and the environment is the same every time
    Archive that container — containers are lightweight
    thanks to layers — and the analysis can always be
    recreated

    View Slide

  46. Bioconda + Containers
    Given a set of packages and versions in Conda/
    Bioconda, we can build a container with just that
    software on a minimal base image
    If we use the same base image, we can reconstruct
    exactly the same container (since we archive all
    binary builds of all versions)
    With automation, these containers can be built
    automatically for every package with no manual
    modification or intervention (e.g. mulled)

    View Slide

  47. Travis CI pulls recipes and builds
    in minimal docker container
    Successful builds from main
    repo uploaded to Anaconda
    to be installed anywhere
    Same binary
    from
    bioconda
    installed into
    minimal
    container for
    each provider
    rkt Singularity

    View Slide

  48. Bioconda + Containers + Virtualization
    If we run our containers inside a specific (ideally
    minimal) known VM we can control the kernel
    environment as well
    Atmosphere
    funded by the National Science Foundation

    View Slide

  49. Tool and dependency binaries, built in minimal
    environment with controlled libs
    Container defines minimum environment
    Virtual machine controls kernel and apparent
    hardware environment
    KVM, Xen, ….
    Increasingly precise environment control

    View Slide

  50. …and it all just works in Galaxy
    Depending on how Galaxy is configured this can be
    resolved with conda, with biocontainers…

    View Slide

  51. …and it all just works in Galaxy
    Depending on how Galaxy is configured this can be
    resolved with conda, with biocontainers…
    …or environment modules, or brew, guix, …
    (Resolvers are completely pluggable)

    View Slide
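
The "pluggable resolvers" idea can be illustrated with a short chain-of-resolvers sketch. None of these names come from Galaxy's code base: `conda_resolver`, `container_resolver`, `resolve`, and the returned strings are all hypothetical, and the real configuration is richer than this.

```python
from typing import Callable, List, Optional

# A resolver takes (package name, version) and returns a description of how
# the requirement is satisfied, or None if it cannot satisfy it.
Resolver = Callable[[str, str], Optional[str]]


def conda_resolver(name: str, version: str) -> Optional[str]:
    # Hypothetical: pretend only this one package is installed via Conda.
    known = {("samtools", "1.3.1")}
    if (name, version) in known:
        return f"conda:{name}={version}"
    return None


def container_resolver(name: str, version: str) -> Optional[str]:
    # Hypothetical: pretend a container exists for every package and version.
    return f"container:{name}:{version}"


def resolve(name: str, version: str,
            resolvers: List[Resolver]) -> Optional[str]:
    """Ask each configured resolver in order; the first answer wins."""
    for resolver in resolvers:
        result = resolver(name, version)
        if result is not None:
            return result
    return None


via_conda = resolve("samtools", "1.3.1", [conda_resolver, container_resolver])
via_container = resolve("bwa", "0.7.15", [conda_resolver, container_resolver])
unresolved = resolve("bwa", "0.7.15", [conda_resolver])
```

The same software requirement resolves differently depending on which resolvers a site configures and in what order.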

  52. What about multiple packages?
    Generate containers based on a reproducible hash of
    package name and version
    Walk the ToolShed and archive containers for every
    combination of tools used

    View Slide
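
One way to get a reproducible identifier for a combination of packages is to hash a canonical, sorted listing of names and versions. The sketch below shows that general technique only; it is an assumption for illustration and not the actual naming scheme used by mulled or BioContainers, and `environment_hash` is a hypothetical name.

```python
import hashlib


def environment_hash(packages):
    """Hash (name, version) pairs so the result does not depend on input order."""
    canonical = ",".join(f"{name}={version}"
                         for name, version in sorted(packages))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# The same package set in a different order gives the same identifier.
a = environment_hash([("samtools", "1.3.1"), ("bwa", "0.7.15")])
b = environment_hash([("bwa", "0.7.15"), ("samtools", "1.3.1")])
# Changing any version gives a different identifier.
c = environment_hash([("bwa", "0.7.16"), ("samtools", "1.3.1")])
```

Sorting before hashing is what makes the identifier stable: two tools that declare the same requirements in a different order map to the same container.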

  53. Not just for Galaxy

    View Slide

  54. Not just for Galaxy
    Docker requirement,
    tightly coupled

    View Slide

  55. Not just for Galaxy
    Docker requirement,
    tightly coupled
    Software requirement,
    can be resolved in an
    environment specific way

    View Slide

  56. Not just for Galaxy
    Docker requirement,
    tightly coupled
    Software requirement,
    can be resolved in an
    environment specific way
    Implemented in “galaxy-lib” — integrated in CWL
    reference implementation, …

    View Slide

  57. This is the best stack for complete
    reproducibility we have ever had in
    bioinformatics.
    With the right technologies, reproducibility
    is possible and practical.

    View Slide

  58. 3. Galaxy without the UI

    View Slide

  59. John Chilton

    View Slide

  60. “A scientific workflow SDK”
    The way to develop Galaxy tools
    Linting, testing, … Support every aspect of
    the tool development lifecycle

    View Slide

  61. What about workflows?

    View Slide

  62. Start a Galaxy instance serving a specific workflow
    and specific tools

    View Slide

  63. Create and save “template” Galaxy instances

    View Slide

  64. Run a workflow in a dynamically created or existing
    Galaxy template

    View Slide

  65. Build / edit your workflows in a text editor

    View Slide

  66. Testing workflows

    View Slide

  67. Acknowledgements
    Galaxy Team: Enis Afgan, Dannon Baker, Daniel Blankenberg,
    Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor,
    Carl Eberhard, Jeremy Goecks, Björn Grüning, Sam Guerler,
    Mo Heydarian, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric
    Rasche, Nicola Soranzo, Marius van den Beek
    BioConda and Biocontainers: Johannes Köster, Ryan Dale,
    Björn Grüning, …
    All contributors to and users of all of the projects I’ve talked about
    NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620)
    NIDDK (DK065806) and NSF (DBI 0543285, DBI 0850103)

    View Slide