Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Galaxy Workshop Tokyo 2016

Galaxy Workshop Tokyo 2016

Brief presentation and Galaxy update for the Galaxy Workshop Tokyo 2016.

James Taylor

April 28, 2016
Tweet

More Decks by James Taylor

Other Decks in Science

Transcript

  1. What happens to traditional research outputs when an area of

    science rapidly become data intensive?
  2. Idea Experiment Raw Data Tidy Data Summarized data Results Experimental

    design Data collection Data cleaning Data analysis Inference Data Pipeline, inspired by Leek and Peng, Nature 2015 The part we are considering here The part that ends up in the Publication
  3. What is reproducibility? (for computational analyses) Reproducibility means that an

    analysis is described/captured in sufficient detail that it can be precisely reproduced Reproducibility is not provenance, reusability/ generalizability, or correctness A minimum standard for evaluating analyses
  4. A minimum standard for evaluating analyses Yet most published analyses

    are not reproducible 
 Ioannadis et al. 2009 – 6/18 microarray experiments reproducible Nekrutenko and Taylor 2012 – 7/50 re-sequencing experiments reproducible … Missing software, versions, parameters, data…
  5. A free (for everyone) web service integrating a wealth of

    tools, compute resources, terabytes of reference data and permanent storage Open source software that makes integrating your own tools and data and customizing for your own site simple An open extensible platform for sharing tools, datatypes, workflows, ...
  6. Galaxy’s goals: Accessibility: Eliminate barriers for researchers wanting to use

    complex methods, make these methods available to everyone Transparency: Facilitate communication of analyses and results in ways that are easy to understand while providing all details Reproducibility: Ensure that analysis performed in the system can be reproduced precisely and practically
  7. Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details Workflow system for complex analysis, constructed explicitly or automatically
  8. Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details Workflow system for complex analysis, constructed explicitly or automatically Pervasive sharing, and publication of documents with integrated analysis
  9. PSC, Pittsburgh Stampede • 462,462 cores • 205 TB memory

    Blacklight Bridges Dedicated resources Shared XSEDE resources TACC Austin Galaxy Cluster 
 (Rodeo) • 256 cores • 2 TB memory Corral/Stockyard • 20 PB disk funded by the National Science Foundation Award #ACI-1445604 PTI IU Bloomington Leveraging National Cyberinfrastructure: Galaxy/XSEDE Gateway
  10. CloudMan: General purpose deployment manager for any cloud. Cluster and

    service management, auto-scaling Cloudbridge: New abstraction library for working with multiple cloud APIs Genomics Virtual Lab: CloudMan + Galaxy + many other common bioinformatics tools and frameworks Galaxy Cloud
  11. Galaxy gives us… Abstract definition of tool interfaces and precise

    capture of parameters for every tool invocation Complete provenance for data relationships (user defined and system wide) Usefulness of such a system relies on having large numbers of tools integrated, how do we facilitate this?
  12. 1 2 3 ∞ http://usegalaxy.org http://usegalaxy.org/community ... Galaxies on private

    clouds Galaxies on public clouds ... private Galaxy installations Private Tool Sheds Galaxy Tool Shed
  13. Vision for the Galaxy ToolShed Grow tool development by supporting

    and nurturing community Provide infrastructure to host all tools, make it easy to build tools, install tools into Galaxy, … Quality oversight by a group of volunteers from the community Version and store every dependency of every tool to ensure that we can reconstruct environments exactly
  14. New tools: ~400 new tools for the main Galaxy server

    deployed in the last year, all available to any Galaxy through the Tool Shed
  15. Users typically use many histories when working with many samples;

    New multiple history view makes working with 100s of histories easy Carl Eberhard
  16. A not-so-new feature: mapping over multiple datasets However, this breaks

    down for complex combinations of datasets (e.g. many sets of paired end reads, in replicates)
  17. Operations over collections For “list” collections, existing tools can automatically

    be mapped across the entire collection Existing tools that support multiple inputs and one output act as reducers Many existing tools just work; but “structured” collections like “paired” need explicit support in tools
  18. Map/reduce in workflows More Powerful Workflows Arbitrary # of Inputs

    (... paired). Run applications in parallel (one per input). Merged output for subsequent processing. John Chilton
  19. Dataset Collections Extremely flexible for grouping collections of complex datasets,

    can be nested to arbitrary depth, structure is preserved through mapping More complex reductions, other collection operations in progress Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming)
  20. Workflow engine Improved workflow scheduling — workflows can be paused,

    restarted, etc Sub-workflows can be embedded in other workflows and reused Much more to come here!
  21. +

  22. Galaxy Interactive Environments General framework support environments other that Jupyer

    (e.g. RStudio) Problems with the notebook model: history can be edited! Only reproducible when all cells are rerun Goal: keep complete history (provenence graph) for every dataset generated from a notebook — preserve Galaxy’s provenance guarantees
  23. Planemo Utilities to assist in building and publishing Galaxy tools

    Automates tool creation, testing, publishing to the ToolShed, etc % planemo lint mytool.xml % planemo test --galaxy_root=../myTestServer mytool.xml % planemo serve mytool.xml
  24. It is now reasonable to support one major server platform

    — Linux (this is great for portability and reproducibility, but scary for other reasons — monoculture leads to fragility)
  25. Builds on Conda packaging system, designed “for installing multiple versions

    of software packages and their dependencies and switching easily between them” ~936 recipes for software packages (as of yesterday) All packages are built in a minimal environment to ensure isolation and portability
  26. Submit recipe to GitHub Travis CI pulls recipes and builds

    in minimal docker container Successful builds from main repo uploaded to Anaconda to be installed anywhere
  27. Docker Builds on Linux kernel features enabling complete isolation from

    the kernel level up Containers — lightweight environments with isolation enforced at the OS level, complete control over all software Adds a complete ecosystem for sharing, versioning, managing containers — Docker hub
  28. Galaxy + Docker Run every analysis in a clean container

    — analysis are isolated and environment is the same every time Archive that container — containers are lightweight thanks to layers — and the analysis can always be recreated
  29. Bioconda + Docker Given a set of packages and versions

    in Conda/ Bioconda, we can build a container with just that software on a minimal base image If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions) And we can even host on a specific VM image…
  30. Tool and dependency binaries, built in minimal environment with controlled

    libs Docker container defines minimum environment Virtual machine controls kernel and apparent hardware environment KVM, Xen, …. Increasingly precise environment control
  31. ACKnowledgements Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier,

    Martin Cěch, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Björn Grüning, Aysam Guerler, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche, Nicola Soranzo, Nitesh Turaga, Marius van den Beek JHU Data Science: Jeff Leek, Roger Peng, … BioConda: Johannes Köster, Björn Grüning, Ryan Dale, Andreas Sjödin, Adam Caprez, Chris Tomkins-Tinch, Brad Chapman, Alexey Strokach, … CWL: Peter Amstutz, Robin Andeer, Brad Chapman, John Chilton, Michael R. Crusoe, Roman Valls Guimerà, Guillermo Carrasco Hernandez, Sinisa Ivkovic, Andrey Kartashov, John Kern, Dan Leehr, Hervé Ménager, Maxim Mikheev, Tim Pierce, Josh Randall, Stian Soiland-Reyes, Luka Stojanovic, Nebojša Tijanić Everyone I forgot…
  32. 2016 Galaxy Community Conference (GCC2016) June 25-29, 2016 Bloomington, Indiana

    galaxyproject.org/GCC2016 Posters & Demos due May 20 Early registration ends May 20