Molecular Medicine Tri-con 2015

Accessible, transparent, and reproducible genomics at scale with Galaxy
@jxtx / #usegalaxy
https://speakerdeck.com/jxtx

A continuing crisis in genomics research: reproducibility

What is reproducibility? (for computational analyses)
Reproducibility is not provenance, reusability/generalizability, or correctness.
Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced (given the data).
Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009: 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012: 7/50 resequencing experiments reproducible).
Missing software, versions, parameters, data…

Reproducibility Project: Cancer Biology
Independently replicating 50 “high-impact” cancer studies from 2010-2012 (https://osf.io/e81xl/wiki/home/)

Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130
32/127 tools, 6/41 papers

#METHODSMATTER
[Figure 1: Frequency Fluctuation for site 8992. Values between 0.480 and 0.510 plotted for versions 5.2 through 6.1, with two series: Default and -n 3 -q 15]
(Nekrutenko and Taylor, Nature Reviews Genetics, 2012)

Is reproducibility achievable?

A spectrum of solutions
Analysis environments (Galaxy, GenePattern, Mobyle, …)
Workflow systems (Taverna, Pegasus, VisTrails, …)
Notebook style (IPython notebook, …)
Literate programming style (Sweave/knitr, …)
System level provenance capture (ReproZip, …)
Complete environment capture (VMs, containers, …)

Galaxy’s motivating questions
How best can data intensive methods be accessible to scientists?
How best to facilitate transparent communication of computational analyses?
How best to ensure that analyses are reproducible?

Galaxy: accessible analysis system

A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
Open source software that makes integrating your own tools and data and customizing for your own site simple
An open extensible platform for sharing tools, datatypes, workflows, ...

Describe analysis tool behavior abstractly
Analysis environment automatically and transparently tracks details
Workflow system for complex analysis, constructed explicitly or automatically
Pervasive sharing, and publication of documents with integrated analysis
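
A minimal sketch of the first two points, not Galaxy's actual implementation: if a tool is described abstractly (identifier, version, command template, default parameters), the environment can build the command line itself and record everything needed to rerun the job. The class, function, tool name, version, and file name below are all hypothetical; the -n and -q parameters echo the Figure 1 legend.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToolDescription:
    """Hypothetical abstract description of a command-line tool."""
    tool_id: str
    version: str
    command_template: str
    defaults: dict = field(default_factory=dict)


def plan_job(tool, inputs, **overrides):
    """Build the command line plus a record of what is needed to rerun it."""
    # User-supplied values take precedence over the tool's defaults.
    params = {**tool.defaults, **overrides}
    command = tool.command_template.format(**params, **inputs)
    record = {
        "tool_id": tool.tool_id,
        "version": tool.version,
        "parameters": params,
        "inputs": inputs,
        "command": command,
    }
    return command, record


# Hypothetical tool and input names, for illustration only.
mapper = ToolDescription(
    tool_id="example_mapper",
    version="1.0",
    command_template="example_mapper -n {n} -q {q} {reads}",
    defaults={"n": 3, "q": 15},
)
command, record = plan_job(mapper, {"reads": "sample.fastq"}, q=20)
print(command)
print(json.dumps(record, indent=2, sort_keys=True))
```

Because the record is produced by the same code that builds the command, the user never has to remember to write down a version or a parameter by hand.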

Visualization and visual analytics

Ways to use Galaxy
The public web service at http://usegalaxy.org
Install locally with many compute environments
Deploy on a cloud using CloudMan, Atmosphere

Galaxy in a world of increasingly complex analyses

1) Shift from tools to workflows
As analysis needs become increasingly complex, typical users have moved from running individual tools to primarily running workflows

For research use, users need to be able to construct and modify workflows, not just run existing best practice pipelines
The Galaxy Workflow editor supports this use case well, providing ways for users to easily construct and modify workflows

(Goecks et al. Cancer Medicine, 2015)

[...] ensures that the pipelines can evolve and incorporate new tools as they become available rather than requiring the development of new pipelines. The exome and transcriptome analysis pipelines require vastly more time and computing resources than the variant analysis pipeline: the exome/transcriptome processing pipelines require about a day to complete on a small computing cluster, while the integrated variant analysis pipeline can be run in less than an hour. Also, there are established protocols for exome and transcriptome processing but less so for variant analysis. Hence, by splitting the pipelines up as we have and putting the pipelines in Galaxy, it is simple and fast to experiment with different settings in the variant analysis pipeline and find settings that are most useful for a particular set of samples.

Results
Validation using cell line data
To validate our pipelines, we analyzed targeted exome and whole transcriptome sequencing data from three well-characterized pancreatic cancer cell lines: MIA PaCa2 (MP), HPAC, and PANC-1. Exonic regions of 577 genes that are commonly included in cancer gene panels were sequenced. All three cell lines are included in the Cancer Cell Line Encyclopedia (CCLE) [15]; the CCLE includes a mutational profile for known oncogenes and drug response information for each cell line. The goal of this analysis is to use our pipelines to process the cell line [...]

Figure 2. Galaxy Circos plot showing data produced from (A; at top) exome and transcriptome analysis of Mia PaCa2 cell line and (B; at bottom) transcriptome analysis of a pancreatic adenocarcinoma tumor. Starting at the innermost track, the data are: (i) mapped read coverage; (ii) mapped read coverage after PCR duplicates removed; (iii) called variants; (iv) rare and deleterious variants; (v) rare, deleterious, and druggable variants; (vi) rare and deleterious variants

[...] performance. Figure 2A shows an interactive Galaxy-Circos plot of data generated from analysis of the MIA PaCa2 cell line.
(Goecks et al. Cancer Medicine, 2015)

However, for reproducibility, we want to be able to ensure that a workflow can be exactly rerun, even in a different compute environment, and get exactly the same results

[Diagram: the Galaxy Tool Shed and Private Tool Sheds serving instances 1, 2, 3, … ∞: http://usegalaxy.org, http://usegalaxy.org/community, ..., Galaxies on private clouds, Galaxies on public clouds, ..., private Galaxy installations]
Greg von Kuster

The Galaxy ToolShed: Sharing tools, workflows, and their dependencies

Repositories are owned by the contributor, can contain tools, workflows, etc.
Backed by version control, a complete version history is retained for everything that passes through the ToolShed
Galaxy instance admins can install tools directly from the ToolShed using only a web UI
Support for recipes for installing the underlying software that tools depend on (also versioned)

2700 tools in 1200 repositories

ToolShed Challenges
Good for deployment and archiving, difficult for development

New command line tools to address concerns from tool developers
Tool Development
Planemo: command-line tools to aid development.
◦ Test tools quickly without worrying about configuration files.
◦ Check tools for common bugs and best practices.
◦ Optimized publishing to the ToolShed.
◦ Testbed for new dependency management: Homebrew and Homebrew-science
John Chilton

Move to git[hub] centric development workflow
Within three weeks, four major community contributions to core tools

Tool citations, credit and incentivization
Embed DOIs in the tool configuration; Galaxy resolves them and provides a list of citations, with links, which can be exported for reference managers
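
As a rough illustration of the idea (not Galaxy's own resolver), the sketch below reads DOI entries out of a tool configuration and turns them into links. The XML fragment is a hypothetical tool modelled on the citations section of a Galaxy tool file; the DOI is the figshare record cited earlier in this deck, and the function name is an assumption. Nothing is fetched over the network here, so the bibliographic metadata that a reference manager would need is out of scope.

```python
import xml.etree.ElementTree as ET

# Hypothetical tool configuration fragment; only the DOI comes from this deck.
TOOL_XML = """
<tool id="example_tool" version="1.0">
  <citations>
    <citation type="doi">10.6084/m9.figshare.987130</citation>
  </citations>
</tool>
"""


def extract_doi_links(tool_xml):
    """Return one https://doi.org/ link per DOI citation in the config."""
    root = ET.fromstring(tool_xml)
    links = []
    for citation in root.iter("citation"):
        # Skip non-DOI entries and empty elements.
        if citation.get("type") == "doi" and citation.text:
            links.append("https://doi.org/" + citation.text.strip())
    return links


print(extract_doi_links(TOOL_XML))
```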

ToolShed Challenges
Complex dependency definitions; packaging dependencies is a rabbit hole

Virtualize everything: control the host environment

Enis Afgan, Dannon Baker

POSTER PRESENTATION Open Access
CLIA-certified next-generation sequencing analysis in the cloud
Ying Zhang1*, Jesse Erdmann1, John Chilton1, Getiria Onsongo1, Matthew Bower2,3, Kenny Beckman4, Bharat Thyagarajan5, Kevin Silverstein1, Anne-Francoise Lamblin1, the Whole Galaxy Team at MSI1
From Beyond the Genome 2012, Boston, MA, USA. 27-29 September 2012

The development of next-generation sequencing (NGS) technology opens new avenues for clinical researchers to make discoveries, especially in the area of clinical diagnostics. However, combining NGS and clinical data presents two challenges: first, the accessibility to clinicians of sufficient computing power needed for the analysis of high volume of NGS data; and second, the stringent requirements of accuracy and patient information data governance in a clinical setting. Cloud computing is a natural fit for addressing the computing power requirements, while Clinical Laboratory Improvement Amendments (CLIA) certification provides a baseline standard for meeting the demands on researchers in working with clinical data. Combining a cloud-computing environment with CLIA certification presents its own challenges due to the level of control users have over the cloud environment and CLIA’s stability requirements. We have bridged this gap by creating a locked virtual machine with a pre-defined and validated set of workflows. This virtual machine is created using our Galaxy VM launcher tool to instantiate a Galaxy [http://www.usegalaxy.org] environment at Amazon with [...] patient samples were analyzed using customized hybrid-capture bait libraries to boost read coverage in low-coverage regions, followed by targeted enrichment sequencing at the BioMedical Genomics Center. The NGS data is imported to a tested Galaxy single nucleotide polymorphism (SNP) detection workflow in a locked Galaxy virtual machine on Amazon’s Elastic Compute Cloud (EC2).
This project illustrates our ability to carry out CLIA-certified NGS analysis in the cloud, and will provide valuable guidance in any future implementation of NGS analysis involving clinical diagnosis.

Author details
1Research Informatics Support System, Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA. 2Division of Genetics and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA. 3Molecular Diagnostics Laboratory, University of Minnesota Medical Center-Fairview, University of Minnesota, Minneapolis, MN 55455, USA. 4BioMedical Genomics Center, University of Minnesota, Minneapolis, MN 55455, USA. 5Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA.
Published: 1 October 2012
Zhang et al. BMC Proceedings 2012, 6(Suppl 6):P54 http://www.biomedcentral.com/1753-6561/6/S6/P54

CLIA-certified Galaxy pipelines using virtual machines (Minnesota Supercomputing Institute)

funded by the National Science Foundation Award #ACI-1445604

A user-friendly cloud environment designed to give researchers access to interactive computing and data analysis resources on demand; researchers can create their own “private computing system” within Jetstream
Two widely used biology platforms will be supported: Galaxy and iPlant
Allow users to preserve VMs with Digital Object Identifiers (DOIs), which enables sharing of results, reproducibility of analyses, and new analyses of published research data.

Share a snapshot of this instance
Current support for archiving instances with CloudMan
Plan to support archiving analyses both from custom Galaxy instances and on Galaxy main
Enis Afgan

New approaches for dependency management
Alternative approach for installing dependencies: Homebrew/Linuxbrew
How can we run community contributed tools safely and efficiently?
Support for defining dependencies as Docker containers

What is Docker?
[Diagram: Traditional Virtual Machine vs. Docker]
Kernel is shared between containers; achieves the isolation and management benefits of VMs but much more lightweight and efficient

Reproducibility advantages of Docker
Standard recipe approach for creating Docker containers called a Dockerfile
Where VMs are typically a black box, the Dockerfile allows inspection of exactly how the container was created, leading to greater transparency

ToolShed and Docker
Tools can assert their dependencies are provided by a Docker container
Potentially tool execution is more secure due to isolation
Easier for tool developers to package dependencies
Much easier for end-users to get dependencies
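
To make the first point concrete, here is a sketch of how a job runner might wrap a tool's command line so that it executes inside the container the tool names. This is an illustration under assumptions, not Galaxy's job runner: the helper name, image name, tool command, and working directory are hypothetical. Only standard `docker run` options are used (`--rm`, `-v`, `-w`), and the command is built but not executed.

```python
import shlex


def containerize(command, image, workdir):
    """Wrap a tool command so that it runs inside a Docker container.

    Hypothetical helper: mounts the job working directory into the
    container at the same path and runs the command from there.
    """
    return [
        "docker", "run", "--rm",        # remove the container when the job ends
        "-v", f"{workdir}:{workdir}",   # expose job files at the same path
        "-w", workdir,                  # run from the job directory
        image,
    ] + shlex.split(command)


argv = containerize(
    "example_mapper -n 3 -q 15 reads.fastq",   # hypothetical tool command
    image="example/mapper:1.0",                 # hypothetical image and tag
    workdir="/tmp/job_1",
)
print(shlex.join(argv))
```

Pinning the image tag is what ties the dependency to a specific, inspectable build; the host only needs Docker itself.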

2) Greater need for building new, ad hoc, analysis within
Galaxy

Interactive programming environments Björn Grüning, Eric Rasche, John Chilton

For researchers without informatics expertise, the web UI and existing tools are often sufficient
For informaticians, Galaxy provides an extensive API and wrappers (e.g. BioBlend)
But many users can do some programming and would like the benefits of Galaxy with the flexibility to do some scripting
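
A short sketch of what scripting against the API through BioBlend can look like. It assumes the bioblend package is installed and that you have an API key for the server; the helper function names are illustrative, while `GalaxyInstance`, `histories.get_histories`, `histories.create_history`, and `tools.upload_file` are BioBlend calls.

```python
def connect(url, api_key):
    """Create a BioBlend client (requires the bioblend package)."""
    # Imported lazily so the helpers below can be read and tested without it.
    from bioblend.galaxy import GalaxyInstance
    return GalaxyInstance(url=url, key=api_key)


def history_names(gi):
    """Return the names of the histories visible to this API key."""
    return [history["name"] for history in gi.histories.get_histories()]


def upload_to_new_history(gi, path, history_name):
    """Create a history, upload one local file into it, return the history id."""
    history = gi.histories.create_history(name=history_name)
    gi.tools.upload_file(path, history["id"])
    return history["id"]


# Example use (needs network access and a real key, so it is not run here):
#   gi = connect("https://usegalaxy.org", "YOUR_API_KEY")
#   print(history_names(gi))
#   upload_to_new_history(gi, "reads.fastq", "scripted upload")
```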

Docker enables interactive environments
Framework allows spinning up secure* isolated environments that can interact with the Galaxy history through Galaxy’s API
Initial implementation supporting IPython Notebook

Example from John Chilton

Next steps
Support for Jupyter (both Python and Julia) and RStudio environments
Interactive programming environments as first class citizens: full provenance tracking, establish inputs and outputs, be used in workflows, etc.
Databases as first class citizens, e.g. GEMINI query interface as a reusable tool

3) Galaxy users need to work not just with large
datasets, but large numbers of datasets

Galaxy’s user interface is designed to be simple and intuitive for users without informatics expertise
Can we scale this user interface to the analysis of hundreds of samples while maintaining interface idioms and usability?

Users typically use many histories when working with many samples; the new multiple history view makes working with 100s of histories easy
Carl Eberhard

A not-so-new feature: mapping over multiple datasets
However, this breaks down for complex combinations of datasets (e.g. many sets of paired end reads, in replicates)

Dataset collections: complex combinations of datasets that can be treated as a single unit

Dataset Collections
Organize user data
[Diagram: Individual Datasets, Collection, Collection Contents]
John Chilton and Carl Eberhard

Operations over collections
For “list” collections, existing tools can automatically be mapped across the entire collection
Existing tools that support multiple inputs and one output act as reducers
Many existing tools just work, but “structured” collections like “paired” need explicit support in tools
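
The map and reduce behaviour described above can be sketched in a few lines of plain Python. This is a toy model and not Galaxy code: a “list” collection is represented as a dict from element identifier to dataset, and the two “tools” are ordinary functions with made-up names.

```python
def map_over(collection, tool):
    """Apply a single-input tool to every element, keeping the identifiers."""
    return {name: tool(dataset) for name, dataset in collection.items()}


def reduce_collection(collection, tool):
    """Feed all elements to a tool that takes many inputs and gives one output."""
    return tool(list(collection.values()))


# Toy datasets: read lengths per sample (sample names are illustrative).
samples = {"s1": [100, 75, 50], "s2": [100, 100], "s3": [30]}


def count_reads(reads):
    """Single-input tool: one dataset in, one result out."""
    return len(reads)


def merge_counts(datasets):
    """Multi-input tool: many datasets in, one result out."""
    return sum(datasets)


counts = map_over(samples, count_reads)          # one output per input
total = reduce_collection(counts, merge_counts)  # one merged output
print(counts, total)
```

The tool functions know nothing about collections, which is the sense in which many existing tools just work.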

Map/reduce in workflows
More Powerful Workflows
Arbitrary # of Inputs (... paired).
Run applications in parallel (one per input).
Merged output for subsequent processing.
John Chilton

Enhanced Tuxedo Suite Workflow
RNA-Seq workflow based on the Tuxedo suite.
John Chilton

Dataset Collections
Extremely flexible for grouping collections of complex datasets, can be nested to arbitrary depth, structure is preserved through mapping
More complex reductions, other collection operations in progress
Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming)
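
What “structure is preserved through mapping” means for nested collections can be shown with a small recursive sketch. Again this is a toy model rather than Galaxy's implementation: nested dicts stand in for a collection of conditions, replicates, and read pairs, and every name and sequence below is illustrative.

```python
def map_nested(collection, tool):
    """Apply `tool` to every leaf dataset, returning the same nesting."""
    if isinstance(collection, dict):
        return {key: map_nested(value, tool) for key, value in collection.items()}
    # Anything that is not a dict is treated as a leaf dataset.
    return tool(collection)


# Conditions -> replicates -> paired reads (toy sequences as datasets).
nested = {
    "treated": {
        "rep1": {"forward": "ACGT", "reverse": "TTGA"},
        "rep2": {"forward": "GGCA", "reverse": "CATG"},
    },
    "control": {
        "rep1": {"forward": "ACGTAC", "reverse": "GT"},
    },
}

lengths = map_nested(nested, len)
print(lengths)
```

The output has the same keys at every level as the input, so a downstream step can still tell which result belongs to which replicate and which mate.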

4) Visualization as a tool to make sense of complex
data

Towards a pluggable interactive visualization framework

Modifying Cufflinks parameters and locally reassembling

PhyloViz from Google Summer of Code student Tomithy Too

Circster, interactive Circos-style plots

Visualization framework: Charts plugin

Enables users to quickly visualize tabular data. Screencast
Sam Guerler

Pluggable visualization framework
Similar to tools, new visualizations can be dropped into a Galaxy instance
Typically a simple server side template to bootstrap a client side visualization
Framework for serving data sliced and aggregated in various ways
Adaptor for BioJS visualizations in progress
Linked visualizations on related data

Stuff that’s coming
Backend workflow engine improvements to support the much larger analyses that can now be constructed in the UI (ongoing)
Increasing complexity and control over how datasets are used
Federation between Galaxy instances, support for transparently accessing data from other APIs

Using Galaxy main to drive scalability improvements…

TACC, Austin
• Galaxy Cluster: 256 cores, 2 TB memory
• Rodeo: 128 cores, 1 TB memory
• Corral/Stockyard: 20 PB disk
• Stampede: 462,462 cores, 205 TB memory
PSC, Pittsburgh
• Blacklight: 4,096 cores, 32 TB memory, dedicated resources
SDSC, San Diego
• Trestles: 10,368 cores, 20.7 TB memory, shared resources
Nate Coraor

Summary
Galaxy is an (obsessively) open framework for making data analysis accessible and reproducible
Nearly everything in Galaxy is “pluggable”, allowing it to be customized for myriad purposes
New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability
By supporting and leveraging tool developers, the Galaxy community can collectively keep up with rapid changes in available tools

Galaxy is a community!
Join us on IRC, mailing lists, Galaxy Biostar
Contribute code on Bitbucket, GitHub, or the ToolShed
Join us for a Hackathon or our annual conference
Fifth annual Galaxy Community Conference: Hackathon, training day, and two days of talks
