Galaxy Workshop Tokyo 2016

Brief presentation and Galaxy update for the Galaxy Workshop Tokyo 2016.


James Taylor

April 28, 2016


  1. @jxtx / #usegalaxy Reproducible computational research with

  2. What happens to traditional research outputs when an area of science rapidly becomes data intensive?
  3. Data Pipeline, inspired by Leek and Peng, Nature 2015
    Idea, Experiment, Raw Data, Tidy Data, Summarized data, Results
    Experimental design, Data collection, Data cleaning, Data analysis, Inference
    The part we are considering here
    The part that ends up in the Publication
  4. What is reproducibility? (for computational analyses)
    Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced
    Reproducibility is not provenance, reusability/generalizability, or correctness
    A minimum standard for evaluating analyses
  5. A minimum standard for evaluating analyses
    Yet most published analyses are not reproducible
    Ioannidis et al. 2009: 6/18 microarray experiments reproducible
    Nekrutenko and Taylor 2012: 7/50 re-sequencing experiments reproducible
    …
    Missing software, versions, parameters, data…
  7. Galaxy: accessible analysis system

  8. A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
    Open source software that makes integrating your own tools and data and customizing for your own site simple
    An open extensible platform for sharing tools, datatypes, workflows, ...
  9. Galaxy’s goals:
    Accessibility: Eliminate barriers for researchers wanting to use complex methods, make these methods available to everyone
    Transparency: Facilitate communication of analyses and results in ways that are easy to understand while providing all details
    Reproducibility: Ensure that analysis performed in the system can be reproduced precisely and practically
  13. Describe analysis tool behavior abstractly
    Analysis environment automatically and transparently tracks details
    Workflow system for complex analysis, constructed explicitly or automatically
    Pervasive sharing, and publication of documents with integrated analysis
  14. How do we make this available to as many people as possible?
  15. Leveraging National Cyberinfrastructure: Galaxy/XSEDE Gateway
    PSC, Pittsburgh
    Stampede • 462,462 cores • 205 TB memory
    Blacklight
    Bridges
    Dedicated resources
    Shared XSEDE resources
    TACC Austin
    Galaxy Cluster (Rodeo) • 256 cores • 2 TB memory
    Corral/Stockyard • 20 PB disk
    PTI IU Bloomington
    funded by the National Science Foundation Award #ACI-1445604
  16. Galaxy Cloud
    CloudMan: General purpose deployment manager for any cloud. Cluster and service management, auto-scaling
    Cloudbridge: New abstraction library for working with multiple cloud APIs
    Genomics Virtual Lab: CloudMan + Galaxy + many other common bioinformatics tools and frameworks
  17. Proteomics, Metabolomics, Natural Language, Image Analysis, Climate Change, Social Science

  18. Galaxy gives us…
    Abstract definition of tool interfaces and precise capture of parameters for every tool invocation
    Complete provenance for data relationships (user defined and system wide)
    The usefulness of such a system relies on having large numbers of tools integrated. How do we facilitate this?
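As a rough illustration of what "precise capture of parameters for every tool invocation" could look like, here is a minimal Python sketch. The `ToolInvocation` record, its fields, and the tool name used in the usage note are hypothetical and are not Galaxy's actual data model; the point is only that tool identity, version, parameters, and input checksums together give a stable fingerprint for one run.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToolInvocation:
    """Hypothetical record of one tool run (not Galaxy's real schema)."""
    tool_id: str
    tool_version: str
    parameters: dict
    input_checksums: dict  # input name -> SHA-256 hex digest of the input bytes

    def fingerprint(self) -> str:
        # Sorted keys make the serialization independent of dict ordering,
        # so two records with the same content hash to the same value.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def record_invocation(tool_id, tool_version, parameters, inputs):
    """Build a record from raw inputs (`inputs` maps name -> bytes)."""
    checksums = {
        name: hashlib.sha256(data).hexdigest() for name, data in inputs.items()
    }
    return ToolInvocation(tool_id, tool_version, dict(parameters), checksums)
```

For example, `record_invocation("example_aligner", "1.0", {"threads": 4}, {"reads": b"ACGT"}).fingerprint()` changes whenever the version, a parameter, or the input bytes change, which is the property a rerun would need to check.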
  19. 1 2 3 ∞ ...
    Galaxies on private clouds
    Galaxies on public clouds
    ...
    private Galaxy installations
    Private Tool Sheds
    Galaxy Tool Shed
  20. Vision for the Galaxy ToolShed
    Grow tool development by supporting and nurturing community
    Provide infrastructure to host all tools, make it easy to build tools, install tools into Galaxy, …
    Quality oversight by a group of volunteers from the community
    Version and store every dependency of every tool to ensure that we can reconstruct environments exactly
  21. New and upcoming

  22. New tools: ~400 new tools for the main Galaxy server deployed in the last year, all available to any Galaxy through the Tool Shed
  23. User interface improvements for large scale data analysis

  24. Users typically use many histories when working with many samples; the new multiple history view makes working with 100s of histories easy
    Carl Eberhard
  25. A not-so-new feature: mapping over multiple datasets
    However, this breaks down for complex combinations of datasets (e.g. many sets of paired end reads, in replicates)
  26. Dataset Collections
    Organize user data
    Individual Datasets, Collection, Collection Contents
    John Chilton and Carl Eberhard
  27. Operations over collections
    For “list” collections, existing tools can automatically be mapped across the entire collection
    Existing tools that support multiple inputs and one output act as reducers
    Many existing tools just work, but “structured” collections like “paired” need explicit support in tools
  28. Map/reduce in workflows
    More Powerful Workflows
    Arbitrary # of Inputs (... paired).
    Run applications in parallel (one per input).
    Merged output for subsequent processing.
    John Chilton
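The map/reduce pattern on this slide can be sketched in a few lines of Python. This is only an analogy under stated assumptions: `map_over_collection`, `reduce_collection`, and the `count_reads` stand-in tool are hypothetical names for illustration, a plain Python list stands in for a "list" collection, and a thread pool stands in for Galaxy's job scheduling.

```python
from concurrent.futures import ThreadPoolExecutor


def map_over_collection(tool, collection, max_workers=4):
    """Run `tool` once per element, in parallel, preserving element order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(tool, collection))


def reduce_collection(tool, collection):
    """Hand the whole collection to a multi-input, single-output tool."""
    return tool(collection)


def count_reads(fastq_text):
    # Stand-in "tool": a FASTQ record spans four lines.
    return len(fastq_text.strip().splitlines()) // 4


# Two toy samples: one read in the first, two reads in the second.
samples = [
    "@r1\nACGT\n+\nIIII\n",
    "@r1\nAC\n+\nII\n@r2\nGT\n+\nII\n",
]
per_sample = map_over_collection(count_reads, samples)  # one job per input
total = reduce_collection(sum, per_sample)              # merged output
```

Here `sum` plays the role of an existing tool that accepts multiple inputs and produces one output, so it acts as the reducer without any changes.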
  29. Enhanced Tuxedo Suite Workflow
    RNA-Seq workflow based on the Tuxedo suite.
    John Chilton
  30. Dataset Collections
    Extremely flexible for grouping collections of complex datasets, can be nested to arbitrary depth, structure is preserved through mapping
    More complex reductions, other collection operations in progress
    Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming)
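To make "structure is preserved through mapping" concrete, here is a small Python sketch. It assumes a collection can be modeled as nested lists and dicts with individual datasets at the leaves; `map_nested` is a hypothetical helper and not part of Galaxy. Note that it maps over every leaf, whereas a tool with explicit "paired" support would consume each pair as a unit.

```python
def map_nested(tool, collection):
    """Apply `tool` to every leaf dataset while keeping the nesting intact."""
    if isinstance(collection, dict):
        return {key: map_nested(tool, value) for key, value in collection.items()}
    if isinstance(collection, list):
        return [map_nested(tool, item) for item in collection]
    return tool(collection)  # a leaf: an individual dataset


# Toy nested collection: a list of replicates, each a forward/reverse pair.
replicates = [
    {"forward": "ACGT", "reverse": "TGCA"},
    {"forward": "ACG", "reverse": "CGT"},
]
lengths = map_nested(len, replicates)
```

The output has the same shape as the input (a list of pairs), so a later step can still tell which result belongs to which replicate and which mate.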
  31. Workflow engine
    Improved workflow scheduling: workflows can be paused, restarted, etc
    Sub-workflows can be embedded in other workflows and reused
    Much more to come here!
  32. Assistive interfaces: Interactive tours

  34. Galaxy Interactive Environments

  44. Galaxy Interactive Environments
    General framework, supports environments other than Jupyter (e.g. RStudio)
    Problems with the notebook model: history can be edited! Only reproducible when all cells are rerun
    Goal: keep complete history (provenance graph) for every dataset generated from a notebook, preserving Galaxy’s provenance guarantees
  45. Making tool development easier

  47. Planemo
    Utilities to assist in building and publishing Galaxy tools
    Automates tool creation, testing, publishing to the ToolShed, etc
    % planemo lint mytool.xml
    % planemo test --galaxy_root=../myTestServer mytool.xml
    % planemo serve mytool.xml
  48. Packaging software for reproducible research

  49. Portability and Isolation are crucial for practical reproducibility


  51. It is now reasonable to support one major server platform: Linux
    (this is great for portability and reproducibility, but scary for other reasons: monoculture leads to fragility)
  52. Builds on the Conda packaging system, designed “for installing multiple versions of software packages and their dependencies and switching easily between them”
    ~936 recipes for software packages (as of yesterday)
    All packages are built in a minimal environment to ensure isolation and portability
  53. Submit recipe to GitHub
    Travis CI pulls recipes and builds in a minimal Docker container
    Successful builds from the main repo are uploaded to Anaconda to be installed anywhere
  54. Containers for composing and recreating complete environments

  56. Docker
    Builds on Linux kernel features enabling complete isolation from the kernel level up
    Containers: lightweight environments with isolation enforced at the OS level, complete control over all software
    Adds a complete ecosystem for sharing, versioning, managing containers: Docker hub
  57. Galaxy + Docker
    Run every analysis in a clean container: analyses are isolated and the environment is the same every time
    Archive that container (containers are lightweight thanks to layers) and the analysis can always be recreated
  58. Bioconda + Docker
    Given a set of packages and versions in Conda/Bioconda, we can build a container with just that software on a minimal base image
    If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions)
    And we can even host on a specific VM image…
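One way to picture "a set of packages and versions plus a fixed base image gives the same container" is a function that renders a Dockerfile deterministically from pinned versions. This is a sketch under assumptions: `render_dockerfile` is a hypothetical helper, the base image name `example/minimal-conda:1.0` is a placeholder for some minimal image with `conda` already installed, and the package versions in the usage note are illustrative only. It is not how the Galaxy or Bioconda projects actually build their containers.

```python
def render_dockerfile(packages, base_image="example/minimal-conda:1.0"):
    """Render a Dockerfile that installs exact package versions via conda.

    `packages` maps package name -> pinned version. The base image name is
    a placeholder and is assumed to provide the `conda` command.
    """
    if not packages:
        raise ValueError("at least one package is required")
    # Sort so the same package set always yields byte-identical text.
    specs = " ".join(
        f"{name}={version}" for name, version in sorted(packages.items())
    )
    lines = [
        f"FROM {base_image}",
        f"RUN conda install --yes --channel bioconda {specs}",
    ]
    return "\n".join(lines) + "\n"
```

For example, `render_dockerfile({"samtools": "1.3", "bwa": "0.7.13"})` produces the same text regardless of the order in which the packages were listed, so the description of the environment itself can be archived and compared.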
  59. Increasingly precise environment control
    Tool and dependency binaries, built in minimal environment with controlled libs
    Docker container defines minimum environment
    Virtual machine controls kernel and apparent hardware environment
    KVM, Xen, …
  62. Acknowledgements
    Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Björn Grüning, Aysam Guerler, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche, Nicola Soranzo, Nitesh Turaga, Marius van den Beek
    JHU Data Science: Jeff Leek, Roger Peng, …
    BioConda: Johannes Köster, Björn Grüning, Ryan Dale, Andreas Sjödin, Adam Caprez, Chris Tomkins-Tinch, Brad Chapman, Alexey Strokach, …
    CWL: Peter Amstutz, Robin Andeer, Brad Chapman, John Chilton, Michael R. Crusoe, Roman Valls Guimerà, Guillermo Carrasco Hernandez, Sinisa Ivkovic, Andrey Kartashov, John Kern, Dan Leehr, Hervé Ménager, Maxim Mikheev, Tim Pierce, Josh Randall, Stian Soiland-Reyes, Luka Stojanovic, Nebojša Tijanić
    Everyone I forgot…
  63. 2016 Galaxy Community Conference (GCC2016)
    June 25-29, 2016, Bloomington, Indiana
    Posters & Demos due May 20
    Early registration ends May 20
  64. (fin)