ISMB 2017: Supporting highly scalable scientific data analysis with Galaxy

Supporting highly scalable scientific data analysis with @jxtx / #usegalaxy
https://speakerdeck.com/jxtx

0. What is Galaxy?
1. Galaxy support for large-scale analysis
2. An infrastructure stack for practical reproducibility
3. Galaxy without the UI

What happens to traditional research outputs when an area of
science rapidly becomes data intensive?

Data pipeline (inspired by Leek and Peng, Nature 2015)
Stages: Idea → Experiment → Raw Data → Tidy Data → Summarized Data → Results
Steps between stages: Experimental design, Data collection, Data cleaning, Data analysis, Inference
Annotations: "The part we are considering here"; "The part that ends up in the Publication"

Goals
Accessibility: Eliminate barriers for researchers wanting to use complex methods, make these methods available to everyone
Transparency: Facilitate communication of analyses and results in ways that are easy to understand while providing all details
Reproducibility: Ensure that analyses performed in the system can be reproduced precisely and practically

Galaxy: accessible analysis system

A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
Open source software that makes integrating your own tools and data and customizing for your own site simple
An open extensible platform for sharing tools, datatypes, workflows, ...

Describe analysis tool behavior abstractly
Analysis environment automatically and transparently tracks details
Workflow system for complex analysis, constructed explicitly or automatically
Pervasive sharing, and publication of documents with integrated analysis

1. Analysis user interfaces for large-scale data analyses: An example
using Dataset Collections

John Chilton

Single Dataset In One execution of bwa mem Single Dataset
Out

Collection In “Map” over collection, execute bwa mem for each
element Collection Out
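
The "map over collection" idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Collection` type, `map_over`, and `fake_bwa_mem` are hypothetical names made up here, and the "tool" merely derives an output file name instead of running bwa mem. It is not Galaxy's implementation.

```python
from typing import Callable, Dict

# A "collection" is modelled as an ordered mapping from element identifier
# (e.g. a cell or sample name) to a dataset. This is a hypothetical
# stand-in, not Galaxy's internal data model.
Collection = Dict[str, str]


def map_over(collection: Collection, tool: Callable[[str], str]) -> Collection:
    """Run `tool` once per element and keep each element's identifier."""
    return {name: tool(dataset) for name, dataset in collection.items()}


def fake_bwa_mem(fastq: str) -> str:
    # Placeholder for a real tool execution: it only derives an output name.
    return fastq.replace(".fastq", ".bam")


reads = {"cell_001": "cell_001.fastq", "cell_002": "cell_002.fastq"}
alignments = map_over(reads, fake_bwa_mem)
print(alignments)
```

Because the output collection reuses the input identifiers, every result stays traceable to the element it came from.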

Nestorowa et al. (GSE81682)
Single-cell RNA-seq analysis of 7,248 cells (432 LT-HSCs, 1704 HSC-MPPs, and 1704 HPCs)
Sequenced ~1-2 million reads per cell: 3.4 TB raw data.

Critical points the framework needs to address
Keeping the naming traceable: Collections
Collapsing single cell data to single tables: Collection collapse (“reduce”)
Operating on an unknown number of columns: Melt and cast tools
Visualize hundreds of samples easily: New visualization tools
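
To illustrate why "melt" helps with an unknown number of columns, here is a small pure-Python sketch. It assumes a wide table held as a list of dictionaries with one identifier column. The function name `melt` and the sample values are made up for illustration and are not taken from the Galaxy tools mentioned on the slide.

```python
from typing import Dict, List, Tuple


def melt(table: List[Dict[str, str]], id_col: str) -> List[Tuple[str, str, str]]:
    """Turn a wide table into (id, column, value) rows, whatever the columns are."""
    long_rows = []
    for row in table:
        for column, value in row.items():
            if column != id_col:
                long_rows.append((row[id_col], column, value))
    return long_rows


# Hypothetical wide expression table: one column per cell.
wide = [
    {"gene": "Gata1", "cell_001": "12", "cell_002": "3"},
    {"gene": "Spi1", "cell_001": "0", "cell_002": "7"},
]
long_table = melt(wide, "gene")
for entry in long_table:
    print(entry)
```

The long form always has three columns, so downstream steps do not need to know how many cells were in the input.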

Import from SRA a list of dataset pairs
Read QC
Mapping
Quantification
Comprehensive expression table
Collection collapse
Cell based metrics
Expression table of cells passing filters
Expression table of cells and genes passing filters
Table of z-scores per gene per cell
Report of experimental metrics
Mo Heydarian

q workflow Mo Heydarian

QC, Trimming, and HiSat+StringTie workflow per-cell

Collection collapse, “reduce” to aggregate elements of collection into single
dataset
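
A minimal sketch of the "reduce" step follows, assuming each element of the collection is a small tab-separated table with the same header. The function `collapse`, the `cell` column name, and the sample data are hypothetical, and this is not the actual Galaxy collapse tool.

```python
from typing import Dict, List


def collapse(collection: Dict[str, List[str]], id_column: str = "cell") -> List[str]:
    """Concatenate per-element tables, prefixing each row with its element identifier."""
    rows: List[str] = []
    header_written = False
    for name, lines in collection.items():
        # Assumes every table has at least a header line.
        header, *body = lines
        if not header_written:
            rows.append(f"{id_column}\t{header}")
            header_written = True
        rows.extend(f"{name}\t{line}" for line in body)
    return rows


per_cell = {
    "cell_001": ["gene\tcount", "Gata1\t12", "Spi1\t0"],
    "cell_002": ["gene\tcount", "Gata1\t3", "Spi1\t7"],
}
table = collapse(per_cell)
print("\n".join(table))
```

Writing the element identifier into each row is what keeps the naming traceable after many datasets become one.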

Downstream analysis using single datasets and collections

Big Fella taking big strides
Processing all 3,840 cells took 108 hours and generated 100,149 history items. Zero errors!
Mo Heydarian

1. The results look correct in aggregate
The data looks about right. tSNE clustering resembles our understanding of hematopoiesis.

2. Novel lncRNAs follow “jackpot model”
My lncRNAs are expressed in real cells and in jackpot model across the population

What about the backend?
Extensive improvements to the Galaxy workflow system to support analysis at this scale.
Robustness: pausing, partial restarts, better recovery, better throughput (but nothing you can see)

Galaxy’s workflow system is robust, flexible, and integrates with nearly any environment
Install locally with many compute environments
Deploy on a cloud using Cloudman Atmosphere

For example, the single-cell RNA-seq analysis was run on:
Running Galaxy version 16.10
Head node: 16 core, 122 GB (r4.4xlarge)
Worker nodes: 2 x 16 core, 122 GB (r4.4xlarge)
10 TB EBS volume

2. An infrastructure stack for practical reproducibility

[Diagram labels: 1, 2, 3, … ∞; http://usegalaxy.org; http://usegalaxy.org/community; Galaxies on private clouds; Galaxies on public clouds; private Galaxy installations; Private Tool Sheds; Galaxy Tool Shed]

State of the Galaxy ToolShed
ToolShed now contains thousands of tools
Community response has been phenomenal
However, packaging is challenging: it never ends!
Need to move to a model that pulls in and integrates with a broader community

Packaging software for reproducible research

Portability and Isolation are crucial for practical reproducibility

https://bioconda.github.io

It is now reasonable to support one major server platform
— Linux (this is great for portability and reproducibility, but scary for other reasons — monoculture leads to fragility)

Builds on the Conda packaging system, designed “for installing multiple versions of software packages and their dependencies and switching easily between them”
~2200 recipes for software packages (as of yesterday)
All packages are automatically built in a minimal environment to ensure isolation and portability

Submit recipe to GitHub
Travis CI pulls recipes and builds in minimal Docker container
Successful binary builds from main repo uploaded to Anaconda to be installed anywhere

Containers for composing and recreating complete environments

rkt Singularity

Containerization
Builds on Linux kernel features enabling complete isolation from the kernel level up
Containers: lightweight environments with isolation enforced at the OS level, complete control over all software
Adds a complete ecosystem for sharing, versioning, managing containers, e.g. Docker hub, quay.io

Galaxy + Containers
Run every analysis in a clean container: analyses are isolated and the environment is the same every time
Archive that container (containers are lightweight thanks to layers) and the analysis can always be recreated

Bioconda + Containers
Given a set of packages and versions in Conda/Bioconda, we can build a container with just that software on a minimal base image
If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions)
With automation, these containers can be built automatically for every package with no manual modification or intervention (e.g. mulled)

Travis CI pulls recipes and builds in minimal Docker container
Successful builds from main repo uploaded to Anaconda to be installed anywhere
Same binary from bioconda installed into minimal container for each provider
rkt Singularity

Bioconda + Containers + Virtualization
If we run our containers inside a specific (ideally minimal) known VM we can control the kernel environment as well
Atmosphere, funded by the National Science Foundation

Increasingly precise environment control
Tool and dependency binaries, built in minimal environment with controlled libs
Container defines minimum environment
Virtual machine controls kernel and apparent hardware environment (KVM, Xen, …)

…and it all just works in Galaxy
Depending on how Galaxy is configured this can be resolved with conda, with biocontainers…
…or environment modules, or brew, guix, … (Resolvers are completely pluggable)
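
One way to picture "pluggable resolvers" is a chain of functions that are tried in order until one can satisfy a software requirement. The sketch below is an assumption-laden illustration: the resolver functions, their return strings, and the `resolve` helper are invented here and do not reflect Galaxy's actual resolver API or configuration.

```python
from typing import Callable, List, Optional

# A resolver takes (package name, version) and returns a description of how
# to provide it, or None if it cannot. Hypothetical interface.
Resolver = Callable[[str, str], Optional[str]]


def conda_resolver(name: str, version: str) -> Optional[str]:
    # Pretend only this one package is available through conda.
    known = {("bwa", "0.7.15")}
    if (name, version) in known:
        return f"conda:{name}={version}"
    return None


def modules_resolver(name: str, version: str) -> Optional[str]:
    # Pretend the site provides samtools through environment modules.
    if name == "samtools":
        return f"module load {name}/{version}"
    return None


def resolve(name: str, version: str, resolvers: List[Resolver]) -> Optional[str]:
    """Return the answer of the first resolver that can satisfy the requirement."""
    for resolver in resolvers:
        result = resolver(name, version)
        if result is not None:
            return result
    return None


site_resolvers = [conda_resolver, modules_resolver]
print(resolve("bwa", "0.7.15", site_resolvers))
print(resolve("samtools", "1.3.1", site_resolvers))
```

The tool only states what it needs. Each site decides how to provide it by choosing and ordering the resolvers.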

What about multiple packages?
Generate containers based on a reproducible hash of package name and version
Walk the ToolShed and archive containers for every combination of tools used
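
The idea of a reproducible hash over package names and versions can be sketched as follows. This is not the actual naming scheme used by mulled or biocontainers. The `multi-` prefix, the choice of SHA-1, and the canonical string format are assumptions made only to show that the name must not depend on the order in which packages are listed.

```python
import hashlib
from typing import Iterable, Tuple


def container_name(packages: Iterable[Tuple[str, str]]) -> str:
    """Derive a deterministic container name from (name, version) pairs."""
    # Sorting makes the result independent of the order of the requirements.
    canonical = ",".join(f"{name}={version}" for name, version in sorted(packages))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    return f"multi-{digest}"


a = container_name([("samtools", "1.3.1"), ("bwa", "0.7.15")])
b = container_name([("bwa", "0.7.15"), ("samtools", "1.3.1")])
c = container_name([("bwa", "0.7.16"), ("samtools", "1.3.1")])
print(a == b, a == c)
```

Any tool that declares the same set of packages and versions maps to the same container name, so a container built once can be found and reused.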

Not just for Galaxy
Docker requirement, tightly coupled
Software requirement, can be resolved in an environment specific way
Implemented in “galaxy-lib”, integrated in CWL reference implementation, …

This is the best stack for complete reproducibility we have
ever had in bioinformatics. With the right technologies, reproducibility is possible and practical.

3. Galaxy without the UI

John Chilton

“A scientific workflow SDK”
The way to develop Galaxy tools
Linting, testing, …
Support every aspect of the tool development lifecycle

What about workflows?

Start a Galaxy instance serving a specific workflow and specific
tools

Create and save “template” Galaxy instances

Run a workflow in a dynamically created or existing Galaxy
template

Build / edit your workflows in a text editor

Testing workflows

Acknowledgements
Galaxy Team: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Björn Grüning, Sam Guerler, Mo Heydarian, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche, Nicola Soranzo, Marius van den Beek
BioConda and Biocontainers: Johannes Köster, Ryan Dale, Björn Grüning, …
All contributors to and users of all of the projects I’ve talked about
NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620), NIDDK (DK065806), and NSF (DBI 0543285, DBI 0850103)

Deck by James Taylor