ISMB 2017: Supporting highly scalable scientific data analysis with Galaxy

Technology Talk for ISMB 2017 on 1) Galaxy scalability to thousands of samples, 2) Practical reproducibility with #bioconda, #biocontainers, and virtualization, and 3) [didn't get to this] working with Galaxy entirely from the command line.

James Taylor

July 22, 2017

Transcript

  1. 2.

    0. What is Galaxy? 1. Galaxy support for large-scale analysis

    2. An infrastructure stack for practical reproducibility 3. Galaxy without the UI
  2. 3.

    What happens to traditional research outputs when an area of

    science rapidly becomes data intensive?
  3. 4.

    [Figure: Data pipeline, inspired by Leek and Peng, Nature 2015. Stages: Idea, Experiment, Raw Data, Tidy Data, Summarized data, Results. Steps between stages: Experimental design, Data collection, Data cleaning, Data analysis, Inference. Annotations: "The part we are considering here" and "The part that ends up in the Publication".]
  4. 5.

    Goals Accessibility: Eliminate barriers for researchers wanting to use complex

    methods, make these methods available to everyone Transparency: Facilitate communication of analyses and results in ways that are easy to understand while providing all details Reproducibility: Ensure that analysis performed in the system can be reproduced precisely and practically
  5. 7.

    A free (for everyone) web service integrating a wealth of

    tools, compute resources, terabytes of reference data and permanent storage Open source software that makes integrating your own tools and data and customizing for your own site simple An open extensible platform for sharing tools, datatypes, workflows, ...
  6. 8.

    Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details Workflow system for complex analysis, constructed explicitly or automatically Pervasive sharing, and publication of documents with integrated analysis
  7. 11.
  8. 12.
  9. 13.
  10. 15.
  11. 17.
  12. 18.

    Nestorowa et al. (GSE81682) Single-cell RNA-seq analysis of 7,248 cells

    (432 LT-HSCs, 1704 HSC-MPPs, and 1704 HPCs) Sequenced ~1-2 million reads per cell: 3.4 TB raw data.
  13. 19.

    Critical points the framework needs to address:

    Keeping the naming traceable
    Collapsing single cell data to single tables
    Operating on an unknown number of columns
    Visualizing hundreds of samples easily
  14. 20.

    Critical points the framework needs to address:

    Keeping the naming traceable: Collections
    Collapsing single cell data to single tables: Collection collapse (“reduce”)
    Operating on an unknown number of columns: Melt and cast tools
    Visualizing hundreds of samples easily: New visualization tools
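The collapse, melt, and cast operations named on this slide can be illustrated outside Galaxy. The following is a minimal sketch in plain Python, not the Galaxy tools themselves: the function names `collapse`, `cast`, and `melt`, the row layout, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def collapse(collection):
    """Collapse a collection of per-cell tables into one long table.

    `collection` maps an element identifier (here a cell name) to a list of
    (gene, count) rows. The identifier is kept as a column, so every row
    stays traceable to the dataset it came from.
    """
    long_rows = []
    for cell_id in sorted(collection):
        for gene, count in collection[cell_id]:
            long_rows.append((cell_id, gene, count))
    return long_rows


def cast(long_rows, fill=0):
    """Cast long (cell, gene, value) rows into a wide gene-by-cell table.

    The number of columns is discovered from the data, not fixed in advance.
    """
    cells = sorted({cell for cell, _, _ in long_rows})
    values = defaultdict(dict)
    for cell, gene, value in long_rows:
        values[gene][cell] = value
    header = ["gene"] + cells
    body = [[gene] + [values[gene].get(cell, fill) for cell in cells]
            for gene in sorted(values)]
    return [header] + body


def melt(wide):
    """Melt a wide gene-by-cell table back into long (cell, gene, value) rows."""
    header, body = wide[0], wide[1:]
    return [(cell, row[0], value)
            for row in body
            for cell, value in zip(header[1:], row[1:])]


# Hypothetical toy collection with two cells; not data from the talk.
collection = {
    "cell_A": [("Gata1", 5), ("Spi1", 0)],
    "cell_B": [("Gata1", 2), ("Cebpa", 7)],
}
long_table = collapse(collection)
wide_table = cast(long_table)
for row in wide_table:
    print("\t".join(str(field) for field in row))
```

Because `cast` derives its columns from the identifiers present in the long table, the same code handles two cells or several thousand, which is the point of operating on an unknown number of columns.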
  15. 21.

    Import from SRA a list of dataset pairs; Read QC; Mapping; Quantification; Comprehensive expression table; Collection collapse; Cell based metrics; Expression table of cells passing filters; Expression table of cells and genes passing filters; Table of z-scores per gene per cell; Report of experimental metrics. (Mo Heydarian)
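One step of this pipeline, the table of z-scores per gene per cell, is easy to show in isolation. The sketch below is an assumption about what such a step computes, not the workflow's actual tool: it standardizes each gene across cells using the population standard deviation, and the name `zscores_per_gene` and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def zscores_per_gene(table):
    """Standardize each gene's expression across cells.

    `table` maps gene -> {cell: expression}. Genes with no variation get a
    z-score of 0.0 for every cell to avoid dividing by zero.
    """
    result = {}
    for gene, by_cell in table.items():
        values = list(by_cell.values())
        mu = mean(values)
        sigma = pstdev(values)  # assumption: population standard deviation
        result[gene] = {
            cell: 0.0 if sigma == 0 else (value - mu) / sigma
            for cell, value in by_cell.items()
        }
    return result


# Hypothetical expression values; not data from the talk.
expression = {
    "Gata1": {"cell_A": 1.0, "cell_B": 3.0},
    "Actb": {"cell_A": 4.0, "cell_B": 4.0},
}
z = zscores_per_gene(expression)
print(z)
```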
  16. 23.
  17. 27.

    Big Fella taking big strides

    Processing all 3,840 cells took 108 hours and generated 100,149 history items. Zero errors. (Mo Heydarian)
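The figures on this slide can be turned into rough per-cell rates. This is only arithmetic on the three numbers quoted above; the variable names are mine, and the per-cell time is cluster throughput (wall-clock time divided by cells), not the latency of processing any single cell.

```python
# Numbers quoted on the slide.
cells = 3840
wall_clock_hours = 108
history_items = 100_149

# Derived rates.
items_per_cell = history_items / cells
cells_per_hour = cells / wall_clock_hours
seconds_per_cell = wall_clock_hours * 3600 / cells

print(f"{items_per_cell:.1f} history items per cell")
print(f"{cells_per_hour:.1f} cells per hour")
print(f"{seconds_per_cell:.2f} seconds of wall-clock time per cell")
```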
  18. 28.

    1. The results look correct in aggregate

    The data looks about right. tSNE clustering resembles our understanding of hematopoiesis
  19. 29.

    2. Novel lncRNAs follow “jackpot model”

    My lncRNAs are expressed in real cells and in jackpot model across the population
  20. 30.

    What about the backend? Extensive improvements to the Galaxy workflow

    to support analysis at this scale. Robustness: pausing, partial restarts, better recovery, better throughput (but nothing you can see)
  21. 31.

    Galaxy’s workflow system is robust, flexible, and integrates with nearly

    any environment Install locally with many compute environments Deploy on a cloud using Cloudman Atmosphere
  22. 32.

    For example, the single-cell RNA-seq analysis was run on

    Running Galaxy version 16.10 Head node: 16 core, 122 GB (r4.4xlarge) Worker nodes: 2 x 16 core, 122 GB (r4.4xlarge) 10 TB EBS volume
  23. 34.

    1 2 3 ∞ http://usegalaxy.org http://usegalaxy.org/community ... Galaxies on private

    clouds Galaxies on public clouds ... private Galaxy installations Private Tool Sheds Galaxy Tool Shed
  24. 35.

    State of the Galaxy ToolShed ToolShed now contains thousands of

    tools Community response has been phenomenal However, packaging is challenging — it never ends! Need to move to a model that pulls in and integrates with a broader community
  25. 39.

    It is now reasonable to support one major server platform

    — Linux (this is great for portability and reproducibility, but scary for other reasons — monoculture leads to fragility)
  26. 40.

    Builds on Conda packaging system, designed “for installing multiple versions

    of software packages and their dependencies and switching easily between them” ~2200 recipes for software packages (as of yesterday) All packages are automatically built in a minimal environment to ensure isolation and portability
  27. 41.

    Submit recipe to GitHub Travis CI pulls recipes and builds

    in minimal docker container Successful binary builds from main repo uploaded to Anaconda to be installed anywhere
  28. 44.

    Containerization Builds on Linux kernel features enabling complete isolation from

    the kernel level up Containers — lightweight environments with isolation enforced at the OS level, complete control over all software Adds a complete ecosystem for sharing, versioning, managing containers — e.g. Docker hub, quay.io
  29. 45.

    Galaxy + Containers

    Run every analysis in a clean container: analyses are isolated and the environment is the same every time. Archive that container (containers are lightweight thanks to layers) and the analysis can always be recreated.
  30. 46.

    Bioconda + Containers Given a set of packages and versions

    in Conda/Bioconda, we can build a container with just that software on a minimal base image. If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions). With automation, these containers can be built automatically for every package with no manual modification or intervention (e.g. mulled).
  31. 47.

    Travis CI pulls recipes and builds in minimal docker container

    Successful builds from main repo uploaded to Anaconda to be installed anywhere Same binary from bioconda installed into minimal container for each provider rkt Singularity
  32. 48.

    Bioconda + Containers + Virtualization If we run our containers

    inside a specific (ideally minimal) known VM we can control the kernel environment as well Atmosphere funded by the National Science Foundation
  33. 49.

    Tool and dependency binaries, built in minimal environment with controlled

    libs Container defines minimum environment Virtual machine controls kernel and apparent hardware environment KVM, Xen, …. Increasingly precise environment control
  34. 50.

    …and it all just works in Galaxy Depending on how

    Galaxy is configured this can be resolved with conda, with biocontainers…
  35. 51.

    …and it all just works in Galaxy Depending on how

    Galaxy is configured this can be resolved with conda, with biocontainers… …or environment modules, or brew, guix, … (Resolvers are completely pluggable)
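The idea of pluggable resolvers can be sketched as a chain in which each resolver either satisfies a software requirement or passes it on. This is an illustration of the pattern only, not Galaxy's resolver API: `make_resolver`, `resolve_requirement`, and the sets of available packages are hypothetical, and the returned strings are just example shell commands.

```python
from typing import Callable, List, Optional, Set, Tuple

# A resolver maps (package name, version) to a shell command that makes the
# requirement available, or returns None if it cannot satisfy it.
Resolver = Callable[[str, str], Optional[str]]


def make_resolver(template: str, available: Set[Tuple[str, str]]) -> Resolver:
    """Build a toy resolver from a command template and a set of known packages."""
    def resolve(name: str, version: str) -> Optional[str]:
        if (name, version) in available:
            return template.format(name=name, version=version)
        return None
    return resolve


def resolve_requirement(name: str, version: str, resolvers: List[Resolver]) -> str:
    """Try each configured resolver in order and return the first match."""
    for resolver in resolvers:
        command = resolver(name, version)
        if command is not None:
            return command
    raise LookupError(f"no resolver could satisfy {name} {version}")


# Hypothetical site configuration: what each resolver happens to provide.
conda = make_resolver("conda install {name}={version}", {("samtools", "1.3.1")})
modules = make_resolver("module load {name}/{version}", {("bwa", "0.7.15")})

print(resolve_requirement("samtools", "1.3.1", [conda, modules]))
print(resolve_requirement("bwa", "0.7.15", [conda, modules]))
```

The tool only declares the requirement (name and version); which resolver answers depends on the order and contents of the list a site configures.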
  36. 52.

    What about multiple packages? Generate containers based on a reproducible

    hash of package name and version. Walk the ToolShed and archive containers for every combination of tools used.
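A reproducible hash over package names and versions might look like the sketch below. It is not the naming scheme the mulled tooling actually uses; `environment_key`, the choice of SHA-1, and the example packages are assumptions made to show the property that matters: the same set of packages always yields the same key, regardless of the order in which they are listed.

```python
import hashlib


def environment_key(packages):
    """Return a deterministic key for a set of package requirements.

    `packages` maps package name -> version. Entries are sorted before
    hashing, so the key does not depend on insertion order.
    """
    canonical = ",".join(
        f"{name}={version}" for name, version in sorted(packages.items())
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# Hypothetical requirement sets for a two-package tool.
first = environment_key({"samtools": "1.3.1", "bwa": "0.7.15"})
second = environment_key({"bwa": "0.7.15", "samtools": "1.3.1"})
changed = environment_key({"bwa": "0.7.16", "samtools": "1.3.1"})

print(first == second)   # same packages, different order
print(first == changed)  # one version differs
```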
  37. 55.
  38. 56.

    Not just for Galaxy Docker requirement, tightly coupled Software requirement,

    can be resolved in an environment specific way Implemented in “galaxy-lib” — integrated in CWL reference implementation, …
  39. 57.

    This is the best stack for complete reproducibility we have

    ever had in bioinformatics. With the right technologies, reproducibility is possible and practical.
  40. 60.

    “A scientific workflow SDK” The way to develop Galaxy tools

    Linting, testing, … Support every aspect of the tool development lifecycle
  41. 67.

    Acknowledgements

    Galaxy Team: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Björn Grüning, Sam Guerler, Mo Heydarian, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche, Nicola Soranzo, Marius van den Beek

    BioConda and Biocontainers: Johannes Köster, Ryan Dale, Björn Grüning, …

    All contributors to and users of all of the projects I’ve talked about

    NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620), NIDDK (DK065806), and NSF (DBI 0543285, DBI 0850103)