Galaxy Workshop Tokyo 2016

@jxtx / #usegalaxy Reproducible computational research with https://speakerdeck.com/jxtx

What happens to traditional research outputs when an area of
science rapidly becomes data intensive?

Data pipeline (inspired by Leek and Peng, Nature 2015):
Idea → (experimental design) → Experiment → (data collection) → Raw data → (data cleaning) → Tidy data → (data analysis) → Summarized data → (inference) → Results
Figure annotations: "The part we are considering here"; "The part that ends up in the publication"

What is reproducibility? (for computational analyses)
Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced.
Reproducibility is not provenance, reusability/generalizability, or correctness.
A minimum standard for evaluating analyses.

A minimum standard for evaluating analyses, yet most published analyses are not reproducible
Ioannidis et al. 2009: 6/18 microarray experiments reproducible
Nekrutenko and Taylor 2012: 7/50 re-sequencing experiments reproducible
…
Missing software, versions, parameters, data…
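
The missing details named on this slide (software, versions, parameters, data) can be recorded mechanically alongside a result. The following is a minimal sketch of that idea in Python, not Galaxy's implementation; the record layout, the function names, and the example tool, version, and parameter are all hypothetical.

```python
import hashlib
import json


def describe_analysis(tool, version, parameters, inputs):
    """Build a record with the details a reader needs to rerun an analysis.

    `inputs` maps a dataset name to its raw bytes; only a SHA-256 digest of
    each dataset is stored, so the record stays small but the data can still
    be verified later.
    """
    return {
        "tool": tool,
        "version": version,
        "parameters": dict(sorted(parameters.items())),
        "inputs": {
            name: hashlib.sha256(data).hexdigest()
            for name, data in sorted(inputs.items())
        },
    }


def missing_details(record):
    """Return the names of required fields that are absent or empty."""
    required = ("tool", "version", "parameters", "inputs")
    return [name for name in required if not record.get(name)]


record = describe_analysis(
    tool="example_aligner",          # hypothetical tool name
    version="1.2.3",                 # hypothetical version
    parameters={"seed_length": 20},  # hypothetical parameter
    inputs={"reads.fastq": b"@r1\nACGT\n+\nIIII\n"},
)
print(json.dumps(record, indent=2))
print(missing_details({"tool": "example_aligner"}))
```

A record that fails `missing_details` is in the same position as the published analyses counted above: something needed to rerun it was never written down.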

Galaxy: accessible analysis system

A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
Open source software that makes integrating your own tools and data and customizing for your own site simple
An open extensible platform for sharing tools, datatypes, workflows, ...

Galaxy’s goals:
Accessibility: Eliminate barriers for researchers wanting to use complex methods, make these methods available to everyone
Transparency: Facilitate communication of analyses and results in ways that are easy to understand while providing all details
Reproducibility: Ensure that analyses performed in the system can be reproduced precisely and practically

Describe analysis tool behavior abstractly

Analysis environment automatically and transparently tracks details

Workflow system for complex analysis, constructed explicitly or automatically

Pervasive sharing, and publication of documents with integrated analysis

How do we make this available to as many people
as possible?

Leveraging National Cyberinfrastructure: Galaxy/XSEDE Gateway (diagram)
PSC, Pittsburgh: Blacklight, Bridges
TACC, Austin: Stampede • 462,462 cores • 205 TB memory; Galaxy Cluster (Rodeo) • 256 cores • 2 TB memory; Corral/Stockyard • 20 PB disk
PTI, IU Bloomington
Legend: dedicated resources, shared XSEDE resources
Funded by the National Science Foundation, Award #ACI-1445604

Galaxy Cloud
CloudMan: General purpose deployment manager for any cloud. Cluster and service management, auto-scaling
CloudBridge: New abstraction library for working with multiple cloud APIs
Genomics Virtual Lab: CloudMan + Galaxy + many other common bioinformatics tools and frameworks

Proteomics Metabolomics Natural Language Image Analysis Climate Change Social Science
Cosmology

Galaxy gives us…
Abstract definition of tool interfaces and precise capture of parameters for every tool invocation
Complete provenance for data relationships (user defined and system wide)
The usefulness of such a system relies on having large numbers of tools integrated. How do we facilitate this?
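
To make "abstract tool definition", "parameter capture", and "provenance" concrete, here is a small Python sketch under stated assumptions. It is not Galaxy's data model: the `Tool`, `Invocation`, and `History` classes, the command templates, and the file names are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    """Abstract tool description: a name, a version, and a command template."""
    name: str
    version: str
    command: str  # template with {placeholders} for inputs, output, parameters
    defaults: dict = field(default_factory=dict)


@dataclass
class Invocation:
    """One recorded tool run: what ran, with which values, on which data."""
    tool: str
    version: str
    parameters: dict
    inputs: list
    output: str


class History:
    def __init__(self):
        self.invocations = []

    def run(self, tool, inputs, output, **params):
        # Resolve defaults so the record holds every parameter value,
        # not only the ones the user changed.
        resolved = {**tool.defaults, **params}
        command_line = tool.command.format(
            inputs=" ".join(inputs), output=output, **resolved
        )
        self.invocations.append(
            Invocation(tool.name, tool.version, resolved, list(inputs), output)
        )
        return command_line  # a real system would execute this

    def provenance(self, dataset):
        """Walk backwards from a dataset to every invocation it depends on."""
        steps = []
        for inv in self.invocations:
            if inv.output == dataset:
                for parent in inv.inputs:
                    steps.extend(self.provenance(parent))
                steps.append(inv)
        return steps


sort_tool = Tool("sorter", "0.1", "sort -k{column} {inputs} > {output}", {"column": 1})
head_tool = Tool("header", "0.1", "head -n {lines} {inputs} > {output}", {"lines": 10})

history = History()
history.run(sort_tool, ["raw.tsv"], "sorted.tsv", column=2)
history.run(head_tool, ["sorted.tsv"], "top.tsv")
for step in history.provenance("top.tsv"):
    print(step.tool, step.version, step.parameters, step.inputs, "->", step.output)
```

Because the user never writes the command line directly, the system sees every parameter value and every input/output relationship, which is what makes the provenance walk possible. This sketch would list a shared ancestor more than once in a diamond-shaped history; a fuller version would deduplicate.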

[Diagram: the Galaxy Tool Shed and private Tool Sheds serving many Galaxies (1, 2, 3, … ∞): http://usegalaxy.org, http://usegalaxy.org/community, ..., Galaxies on private clouds, Galaxies on public clouds, ..., private Galaxy installations]

Vision for the Galaxy ToolShed
Grow tool development by supporting and nurturing community
Provide infrastructure to host all tools, make it easy to build tools, install tools into Galaxy, …
Quality oversight by a group of volunteers from the community
Version and store every dependency of every tool to ensure that we can reconstruct environments exactly

New and upcoming

New tools: ~400 new tools for the main Galaxy server
deployed in the last year, all available to any Galaxy through the Tool Shed

User interface improvements for large scale data analysis

Users typically use many histories when working with many samples; the new multiple history view makes working with 100s of histories easy (Carl Eberhard)

A not-so-new feature: mapping over multiple datasets. However, this breaks
down for complex combinations of datasets (e.g. many sets of paired end reads, in replicates)

Dataset Collections: organize user data
[Screenshot labels: individual datasets, collection, collection contents]
John Chilton and Carl Eberhard

Operations over collections
For “list” collections, existing tools can automatically be mapped across the entire collection
Existing tools that support multiple inputs and one output act as reducers
Many existing tools just work, but “structured” collections like “paired” need explicit support in tools
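
The map and reduce behavior described on this slide can be illustrated with plain Python lists. This is a sketch of the idea only, not Galaxy code: the helper names, the toy "tools", and the dictionary layout used for a "paired" element are assumptions made for the example.

```python
def map_over(tool, collection):
    """Apply a single-input tool to each element of a list collection."""
    return [tool(dataset) for dataset in collection]


def reduce_with(tool, collection):
    """A tool that takes many inputs and gives one output acts as a reducer."""
    return tool(collection)


# Stand-ins for real tools: each dataset is just a list of lines here.
def count_lines(dataset):
    return len(dataset)


def concatenate(datasets):
    return [line for dataset in datasets for line in dataset]


samples = [["a", "b"], ["c"], ["d", "e", "f"]]  # a "list" collection
counts = map_over(count_lines, samples)           # one output per element
merged = reduce_with(concatenate, samples)        # one output for the collection
print(counts, merged)


# A "paired" element has structure (forward/reverse), so the tool has to
# know about both members explicitly instead of treating it as one dataset.
def interleave(pair):
    return [read for both in zip(pair["forward"], pair["reverse"]) for read in both]


paired_samples = [
    {"forward": ["f1", "f2"], "reverse": ["r1", "r2"]},
    {"forward": ["f3"], "reverse": ["r3"]},
]
interleaved = map_over(interleave, paired_samples)
print(interleaved)
```

`count_lines` and `concatenate` need no changes to work on a collection, which mirrors "many existing tools just work"; `interleave` only works because it was written with the paired structure in mind.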

Map/reduce in workflows: More Powerful Workflows
Arbitrary # of Inputs (... paired).
Run applications in parallel (one per input).
Merged output for subsequent processing.
John Chilton

Enhanced Tuxedo Suite Workflow
RNA-Seq workflow based on the Tuxedo suite.
John Chilton

Dataset Collections
Extremely flexible for grouping collections of complex datasets; can be nested to arbitrary depth; structure is preserved through mapping
More complex reductions, other collection operations in progress
Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming)
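
"Structure is preserved through mapping" means the output collection has the same shape as the input collection. A minimal recursive sketch in Python follows; the use of dicts and lists to stand for nested collections, and the condition/replicate/pair layout, are assumptions for illustration rather than Galaxy's representation.

```python
def map_nested(tool, collection):
    """Apply `tool` to every leaf dataset, keeping the nesting unchanged.

    Dicts stand for named collections (e.g. pairs), lists for ordered
    collections, and anything else is treated as a single dataset.
    """
    if isinstance(collection, dict):
        return {key: map_nested(tool, value) for key, value in collection.items()}
    if isinstance(collection, list):
        return [map_nested(tool, item) for item in collection]
    return tool(collection)


# Hypothetical layout: condition -> list of replicates -> read pair.
experiment = {
    "control": [
        {"forward": "ACGT", "reverse": "TTGA"},
        {"forward": "GGCA", "reverse": "CATG"},
    ],
    "treated": [
        {"forward": "ACGTAC", "reverse": "TT"},
    ],
}

lengths = map_nested(len, experiment)
print(lengths)
```

The result can be fed to the next mapped step without any bookkeeping about which output belongs to which sample, because the keys and positions carry that information.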

Workflow engine
Improved workflow scheduling: workflows can be paused, restarted, etc.
Sub-workflows can be embedded in other workflows and reused
Much more to come here!
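
Embedding one workflow inside another is easiest to see when a workflow exposes the same interface as a single step. The sketch below assumes exactly that and nothing more: the `Workflow` class and the toy steps are hypothetical, and pausing and restarting are not modeled.

```python
class Workflow:
    """An ordered list of steps; a step is a function or another Workflow."""

    def __init__(self, name, steps):
        self.name = name
        self.steps = steps

    def __call__(self, data):
        for step in self.steps:
            data = step(data)  # a Workflow is callable, so nesting just works
        return data


def drop_short(reads):
    return [r for r in reads if len(r) >= 4]


def trim(reads):
    return [r[:4] for r in reads]


def count(reads):
    return len(reads)


cleanup = Workflow("cleanup", [drop_short, trim])          # reusable sub-workflow
counting = Workflow("count-clean-reads", [cleanup, count])
listing = Workflow("list-clean-reads", [cleanup, sorted])

reads = ["TTGACA", "AC", "ACGTAC"]
print(counting(reads), listing(reads))
```

The `cleanup` sub-workflow is defined once and reused by both parent workflows, so a fix to it reaches every workflow that embeds it.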

Assistive interfaces: Interactive tours

Galaxy Interactive Environments

Galaxy Interactive Environments
General framework supports environments other than Jupyter (e.g. RStudio)
Problems with the notebook model: history can be edited! Only reproducible when all cells are rerun
Goal: keep complete history (provenance graph) for every dataset generated from a notebook, preserving Galaxy’s provenance guarantees

Making tool development easier

Planemo
Utilities to assist in building and publishing Galaxy tools
Automates tool creation, testing, publishing to the ToolShed, etc.
% planemo lint mytool.xml
% planemo test --galaxy_root=../myTestServer mytool.xml
% planemo serve mytool.xml

Packaging software for reproducible research

Portability and Isolation are crucial for practical reproducibility

https://bioconda.github.io

It is now reasonable to support one major server platform
— Linux (this is great for portability and reproducibility, but scary for other reasons — monoculture leads to fragility)

Builds on the Conda packaging system, designed “for installing multiple versions of software packages and their dependencies and switching easily between them”
~936 recipes for software packages (as of yesterday)
All packages are built in a minimal environment to ensure isolation and portability

Submit recipe to GitHub
Travis CI pulls recipes and builds in minimal Docker container
Successful builds from main repo uploaded to Anaconda to be installed anywhere

Containers for composing and recreating complete environments

Docker
Builds on Linux kernel features enabling complete isolation from the kernel level up
Containers: lightweight environments with isolation enforced at the OS level, complete control over all software
Adds a complete ecosystem for sharing, versioning, managing containers: Docker Hub

Galaxy + Docker
Run every analysis in a clean container: analyses are isolated and the environment is the same every time
Archive that container (containers are lightweight thanks to layers) and the analysis can always be recreated
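
As an illustration of "a clean container for every analysis", the Python sketch below builds (but does not run) a `docker run` command line for one job. It is not how Galaxy launches containers; the function name, image name and tag, tool command, and directory paths are hypothetical. The flags used (`--rm`, `-v`, `-w`) are standard `docker run` options.

```python
import shlex


def container_command(image, command, inputs_dir, outputs_dir):
    """Build (not run) a `docker run` call for one isolated analysis step."""
    return [
        "docker", "run",
        "--rm",                            # discard the container afterwards
        "-v", f"{inputs_dir}:/inputs:ro",  # inputs are mounted read-only
        "-v", f"{outputs_dir}:/outputs",   # results are written here
        "-w", "/outputs",                  # working directory inside the container
        image,
        *command,
    ]


argv = container_command(
    image="example/aligner:1.2.3",  # hypothetical image name and pinned tag
    command=["aligner", "--in", "/inputs/reads.fastq", "--out", "aligned.bam"],
    inputs_dir="/data/job42/inputs",
    outputs_dir="/data/job42/outputs",
)
print(shlex.join(argv))
```

Every job starts from the same image, so nothing left behind by an earlier job can change the result. The sketch assumes the pinned tag is never reassigned to a different image.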

Bioconda + Docker
Given a set of packages and versions in Conda/Bioconda, we can build a container with just that software on a minimal base image
If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions)
And we can even host on a specific VM image…
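
The "same base image plus same pinned packages gives the same container" idea can be sketched as a pure function from a package list to a Dockerfile, with a hash of that text serving as an identifier. This is an illustrative Python sketch, not the tooling used by Galaxy or Bioconda; the function names, the base image name (assumed to already provide `conda`), and the package versions are hypothetical.

```python
import hashlib


def dockerfile_for(base_image, packages):
    """Render a Dockerfile that installs exactly the pinned Conda packages."""
    specs = " ".join(
        f"{name}={version}" for name, version in sorted(packages.items())
    )
    return (
        f"FROM {base_image}\n"
        f"RUN conda install -y -c bioconda {specs}\n"
    )


def environment_id(base_image, packages):
    """Same base image and same pinned packages give the same identifier."""
    text = dockerfile_for(base_image, packages)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


base = "example/minimal-conda:1.0"                # hypothetical base image
packages = {"samtools": "1.3", "bwa": "0.7.12"}  # illustrative version pins

print(dockerfile_for(base, packages))
print(environment_id(base, packages))
```

Sorting the package list makes the output independent of the order in which packages were requested, so two users asking for the same software get the same description and the same identifier. Bit-for-bit identical containers additionally depend on the archived binary builds mentioned on the slide.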

Increasingly precise environment control
Tool and dependency binaries, built in minimal environment with controlled libs
Docker container defines minimum environment
Virtual machine controls kernel and apparent hardware environment (KVM, Xen, …)

Acknowledgements
Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Björn Grüning, Aysam Guerler, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche, Nicola Soranzo, Nitesh Turaga, Marius van den Beek
JHU Data Science: Jeff Leek, Roger Peng, …
BioConda: Johannes Köster, Björn Grüning, Ryan Dale, Andreas Sjödin, Adam Caprez, Chris Tomkins-Tinch, Brad Chapman, Alexey Strokach, …
CWL: Peter Amstutz, Robin Andeer, Brad Chapman, John Chilton, Michael R. Crusoe, Roman Valls Guimerà, Guillermo Carrasco Hernandez, Sinisa Ivkovic, Andrey Kartashov, John Kern, Dan Leehr, Hervé Ménager, Maxim Mikheev, Tim Pierce, Josh Randall, Stian Soiland-Reyes, Luka Stojanovic, Nebojša Tijanić
Everyone I forgot…

2016 Galaxy Community Conference (GCC2016)
June 25-29, 2016, Bloomington, Indiana
galaxyproject.org/GCC2016
Posters & Demos due May 20
Early registration ends May 20
