Bio Genomics 2017

James Taylor
February 22, 2017

Galaxy, BioConda, BioContainers at http://biogenomics2017.org/

Transcript

  1. What happens to traditional research outputs when an area of science rapidly becomes data intensive?
  2. Data pipeline, inspired by Leek and Peng, Nature 2015:
    Idea → (experimental design) → Experiment → (data collection) → Raw data → (data cleaning) → Tidy data → (data analysis) → Summarized data → (inference) → Results.
    Diagram labels: "The part we are considering here" and "The part that ends up in the publication".
  3. Galaxy’s goals:
    Accessibility: eliminate barriers for researchers wanting to use complex methods, and make these methods available to everyone.
    Transparency: facilitate communication of analyses and results in ways that are easy to understand while providing all details.
    Reproducibility*: ensure that analysis performed in the system can be reproduced precisely and practically.
    *The state of which is still frighteningly bad; see doi:10.1038/nrg3305, doi:10.7717/peerj.148
  4. A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage.
    Open source software that makes integrating your own tools and data and customizing for your own site simple.
    An open extensible platform for sharing tools, datatypes, workflows, ...
  5. Describe analysis tool behavior abstractly.
    Analysis environment automatically and transparently tracks details.
    Workflow system for complex analysis, constructed explicitly or automatically.
  6. Describe analysis tool behavior abstractly.
    Analysis environment automatically and transparently tracks details.
    Workflow system for complex analysis, constructed explicitly or automatically.
    Pervasive sharing, and publication of documents with integrated analysis.
  7. More Powerful Workflows:
    Arbitrary # of inputs (... paired). Run applications in parallel (one per input). Merged output for subsequent processing.
    Dataset collections: map/reduce workflows over 1000s of datasets.
    Interactive tours for building realtime interactive training.
    Interactive Environments: custom analysis in Galaxy workflows using Jupyter, …
  8. Ways to use Galaxy:
    Use the public web service (http://usegalaxy.org).
    Install locally with many compute environments.
    Deploy on a cloud using Cloudman, Atmosphere.
  9. Galaxy was designed to be a flexible, configuration-driven platform.
    Galaxy can be easily customized to the needs of different types of analyses by assembling different tools, workflows, visualizations, datasets…
    For this to work, we need to make it as easy as possible for developers to integrate and share tools.
  10. The Galaxy ToolShed:
    For early Galaxy instances, tool wrapper management was very ad hoc. No tracking of wrapper version information in the Galaxy database, no standard way to share.
    ToolShed enables not just sharing, but global identifiers and versions across all Galaxy instances.
  11. The Galaxy ToolShed:
    Easy to make, version, and share tool wrappers, but we also need to manage the underlying software packages.
    This is hard, and best handled by a broader community.
  12. It is now reasonable to support one major server platform: Linux (this is great for portability and reproducibility, but scary for other reasons: monoculture leads to fragility).
  13. Builds on the Conda packaging system, designed “for installing multiple versions of software packages and their dependencies and switching easily between them”.
    ~2000 recipes for software packages*.
    All packages are built in a minimal environment to ensure isolation and portability.
    *not even including different versions!
  14. Submit recipe to GitHub.
    Travis CI pulls recipes and builds them in a minimal Docker container.
    Successful builds from the main repo are uploaded to Anaconda, to be installed anywhere.
  15. Containerization:
    Builds on Linux kernel features enabling complete isolation from the kernel level up.
    Containers: lightweight environments with isolation enforced at the OS level, complete control over all software.
    Adds a complete ecosystem for sharing, versioning, managing containers, e.g. Docker Hub.
  16. Galaxy + Containers:
    Run every analysis in a clean container: analyses are isolated and the environment is the same every time.
    Archive that container (containers are lightweight thanks to layers) and the analysis can always be recreated.
  17. Bioconda + Containers:
    Given a set of packages and versions in Conda/Bioconda, we can build a container with just that software on a minimal base image.
    If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions).
    With automation, these containers can be built automatically for every package with no manual modification or intervention (e.g. mulled).
  18. Bioconda + Containers + Virtualization:
    If we run our containers inside a specific (ideally minimal) known VM, we can control the kernel environment as well.
    Atmosphere, funded by the National Science Foundation.
  19. Tool and dependency binaries, built in minimal environment with controlled libs.
    Container defines minimum environment.
    Virtual machine controls kernel and apparent hardware environment (KVM, Xen, …).
    Increasingly precise environment control.
  20. This is the best stack for complete reproducibility we have ever had in bioinformatics.
    With the right technologies, reproducibility is possible and practical.
  21. This is the best stack for complete reproducibility we have ever had in bioinformatics.
    With the right technologies, reproducibility is possible and practical.
    …and for Galaxy users, it makes getting new tools into a Galaxy instance incredibly easy.
  22. Galaxy’s inherent flexibility makes it an ideal platform for new genome projects.
    Many relevant tools are wrapped and can now be easily used in a fully reproducible way.
    Custom data can easily be added to the system using “Data Managers”.
    Custom genomes can be added on the fly and then used in any of Galaxy’s genome analysis tools.
  23. Galaxy is a community:
    Users, tool developers, maintainers, training material authors, …
    The value of the Galaxy platform comes from these community contributions.
    Join us!
  24. ACKnowledgements:
    Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Björn Grüning, Sam Guerler, Mo Heydarian, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche, Nicola Soranzo, Marius van den Beek.
    JHU Data Science: Jeff Leek, Roger Peng, …
    Jetstream: Craig Stewart, Ian Foster, Matthew Vaughn, Nirav Merchant.
    BioConda: Johannes Köster, Björn Grüning, Ryan Dale, Chris Tomkins-Tinch, Brad Chapman, …
    Other lab members: Boris Brenerman, Min Hyung Cho, Peter DeFord, German Uritskiy, Mallory Freeberg.
    NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620), NIDDK (DK065806) and NSF (DBI 0543285, DBI 0850103).
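
Slide 17 argues that a fixed set of package versions plus a fixed base image is enough to determine a container. That idea can be sketched in a few lines of Python. This is only an illustration, not how mulled or Bioconda actually name or build images: the helper `environment_id`, the hashing scheme, the base image name `example-base:1.0` and the package versions are all assumptions made up for this sketch.

```python
import hashlib


def environment_id(packages, base_image):
    """Return a deterministic identifier for a set of pinned packages.

    Hypothetical helper for illustration only; this is NOT the naming
    scheme used by mulled. `packages` maps package name -> exact version.
    """
    # Sort the pins so the order in which packages are listed does not matter.
    pins = sorted(f"{name}={version}" for name, version in packages.items())
    # Include the base image: the same packages on a different base image
    # would give a different container.
    payload = "\n".join([base_image] + pins)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    base = "example-base:1.0"  # hypothetical base image name
    a = environment_id({"samtools": "1.3.1", "bwa": "0.7.15"}, base)
    b = environment_id({"bwa": "0.7.15", "samtools": "1.3.1"}, base)
    c = environment_id({"bwa": "0.7.15", "samtools": "1.4"}, base)
    print("same pins, different order, same id:", a == b)
    print("one version changed, same id:", a == c)
```

The point of the sketch is that the identifier depends only on the pinned versions and the base image, so anyone holding the same pins can ask for the same environment; rebuilding identical contents additionally relies on the archived binary builds mentioned on the slide.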