Galaxy and Reproducibility

Biology as a data intensive science: Computation is the new
cloning @galaxyproject / #usegalaxy http://www.galaxyproject.org

A continuing crisis in genomics research: reproducibility

What is reproducibility? (for computational analyses)
- Reproducibility is not provenance, reusability/generalizability, or correctness
- Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced (given the data)
- Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009: 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012: 7/50 resequencing experiments reproducible)
- Missing software, versions, parameters, data…
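One way to picture "sufficient detail" is a small record that captures the items the slide lists as commonly missing: software, version, parameters, and data. The sketch below is illustrative only; the tool name, version string, and record layout are assumptions and do not describe how Galaxy stores this information.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the raw input bytes."""
    return hashlib.sha256(data).hexdigest()


def describe_step(tool: str, version: str, parameters: dict, input_data: bytes) -> dict:
    """Record software, version, parameters, and a fingerprint of the data."""
    return {
        "tool": tool,
        "version": version,
        # Sort parameters so the record does not depend on argument order.
        "parameters": dict(sorted(parameters.items())),
        "input_sha256": sha256_of(input_data),
    }


# Hypothetical step: "example_mapper" and "0.0.1" are placeholders.
record = describe_step(
    tool="example_mapper",
    version="0.0.1",
    parameters={"q": 15, "n": 3},
    input_data=b"ACGT\n",
)
print(json.dumps(record, indent=2))
```

With such a record published next to the results, a reader at least knows which software version and parameter values to use when repeating the step on the same data.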

Reproducibility Project: Cancer Biology Independently replicating 50 “high-impact” cancer studies
from 2010-2012 (https://osf.io/e81xl/wiki/home/)

Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa;
Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130 32/127 tools 6/41 papers

#METHODSMATTER
[Figure 1: Frequency fluctuation for site 8992 across software versions 5.2 to 6.1 (including 5.8a, 5.8c and 5.9rc), plotted for two parameter settings: Default and -n 3 -q 15. Frequency axis range: 0.480 to 0.510.]

Example: A tale of two Science papers

Paper 1

All you need for reproducing is here (Fig. 2)

Paper 2

Genomic signatures to guide the use of chemotherapeutics

Anil Potti1,2, Holly K Dressman1,3, Andrea Bild1,3, Richard F Riedel1,2, Gina Chan4, Robyn Sayer4, Janiel Cragun4, Hope Cottrill4, Michael J Kelley2, Rebecca Petersen5, David Harpole5, Jeffrey Marks5, Andrew Berchuck1,6, Geoffrey S Ginsburg1,2, Phillip Febbo1–3, Johnathan Lancaster4 & Joseph R Nevins1–3

Using in vitro drug sensitivity data coupled with Affymetrix microarray data, we developed gene expression signatures that predict sensitivity to individual chemotherapeutic drugs. Each signature was validated with response data from an independent set of cell line studies. We further show that many of these signatures can accurately predict clinical response in individuals treated with these drugs. Notably, signatures developed to predict response to individual agents, when combined, could also predict response to multidrug regimens. Finally, we integrated the chemotherapy response signatures with signatures of oncogenic pathway deregulation to identify new therapeutic strategies that make use of all available drugs. The development of gene expression profiles that can predict response to commonly used cytotoxic agents provides opportunities to better use these drugs, including using them in combination with existing targeted therapies.

Numerous advances have been achieved in the development, selection and application of chemotherapeutic agents, sometimes with remarkable clinical successes, as in the case of treatment for lymphomas or platinum-based therapy for testicular cancers1. In addition, in several instances, combination chemotherapy in the postoperative (adjuvant) setting has been curative. However, most people with advanced solid tumors will relapse and die of their disease. Moreover, administration of ineffective chemotherapy increases the probability of side effects, particularly those from cytotoxic agents, and of a consequent decrease in quality of life1,2.

Recent work has demonstrated the value in using biomarkers to select individuals for various targeted therapeutics, including tamoxifen, trastuzumab and imatinib mesylate. In contrast, equivalent tools to select those most likely to respond to the commonly used chemotherapeutic drugs are lacking3. With the goal of developing genomic predictors of chemotherapy sensitivity that could direct the use of cytotoxic agents to those most likely to respond, we combined in vitro drug response data, together with microarray gene expression data, to develop models that could potentially predict responses to various cytotoxic chemotherapeutic drugs4. We now show that these signatures can predict clinical or pathologic response to the corresponding drugs, including combinations of drugs. We further use the ability to predict deregulated oncogenic signaling pathways in tumors to develop a strategy that identifies opportunities for combining chemotherapeutic drugs with targeted therapeutic drugs in a way that best matches the characteristics of the individual.

RESULTS

A gene expression–based predictor of sensitivity to docetaxel

To develop predictors of cytotoxic chemotherapeutic drug response, we used an approach similar to previous work analyzing the NCI-60 panel4 from the US National Cancer Institute (NCI). We first identified cell lines that were most resistant or sensitive to docetaxel (Fig. 1a,b) and then genes whose expression correlated most highly with drug sensitivity, and used Bayesian binary regression analysis to develop a model that differentiates a pattern of docetaxel sensitivity from that of resistance. A gene expression signature consisting of 50 genes was identified that classified cell lines on the basis of docetaxel sensitivity (Fig. 1b, right). In addition to leave-one-out cross-validation, we used an independent dataset derived from docetaxel sensitivity assays in a series of 30 lung and ovarian cancer cell lines for further validation.

The significant correlation (P < 0.01, log-rank test) between the predicted probability of sensitivity to docetaxel (in both lung and ovarian cell lines) (Fig. 1c, left) and the respective 50% inhibitory concentration (IC50) for docetaxel confirmed the capacity of the docetaxel predictor to predict sensitivity to the drug in cancer cell

ARTICLES. © 2011 Nature America, Inc. All rights reserved.

The importance of being reproducible
- Starting in 2006, Potti published papers describing algorithms that take gene-expression data from a cancer cell and predict whether the cancer will be sensitive to a particular therapy
- Duke began three clinical trials based on the technology, enrolling 110 patients

The importance of being reproducible
- However, Keith Baggerly and Kevin Coombes demonstrated that the findings could not be replicated
- A long and difficult fight to get this acknowledged, followed by a series of investigations
- So far: ten major paper retractions, all trials cancelled, two lawsuits ongoing…

The importance of being reproducible
- NCI investigates and demands that the software for the method be provided
- Not only could they not replicate the results, the software produced substantially different predictions when run again on the same data!
- Some scores changed from 5% to 95%; classifications changed ~25% of the time!

How does this even pass peer review? DON’T TRUST BLACK
BOXES! Be smart consumers!

Is reproducibility achievable?

To answer this question we need to understand causes of
the problem

Who are we dealing with? Users (Biologists) Developers HPC

Users (Biologists) troubles:
- Data logistics
- HPC
- Poor knowledge of existing tools
- Inability to develop new tools
- Lack of transparency and reproducibility

Developers’ grief:
- Limited tool exposure
- Parameter picking troubles
- Data format nightmare
- High profile publications

HPC providers’ challenges:
- Lack of HPC utilization skills
- Software is not optimized
- HPC is heterogeneous

[Diagram, built up over four slides: user (Biologist), admin and dev; the final slide adds Galaxy]

Galaxy: accessible analysis system

Galaxy Servers Worldwide http://bit.ly/gxyServers

- A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
- Open source software that makes integrating your own tools and data and customizing for your own site simple
- An open extensible platform for sharing tools, datatypes, workflows, ...

Galaxy’s ideological goals:
- How best can data intensive methods be accessible to scientists?
- How best to facilitate transparent communication of computational analyses?
- How best to ensure that analyses are reproducible?

Galaxy’s practical goals:
- How to arm researchers with access to powerful compute and the latest tools
- How to build a community of tool developers
- How to run Galaxy on any HPC

- Describe analysis tool behavior abstractly
- Analysis environment automatically and transparently tracks details
- Workflow system for complex analysis, constructed explicitly or automatically
- Pervasive sharing, and publication of documents with integrated analysis

Visualization and visual analytics

Ways to use Galaxy
- The public web service at http://usegalaxy.org
- Install locally with many compute environments
- Deploy on a cloud using CloudMan, Atmosphere

Galaxy in a world of increasingly complex analyses

user HPC dev Galaxy

user HPC dev

We are in the age of multiple datasets

Galaxy’s user interface is designed to be simple and intuitive for users without informatics expertise.
Can we scale this user interface to the analysis of hundreds of samples while maintaining interface idioms and usability?

Users typically use many histories when working with many samples;
New multiple history view makes working with 100s of histories easy

A not-so-new feature: mapping over multiple datasets.
However, this breaks down for complex combinations of datasets (e.g. many sets of paired end reads, in replicates)

Dataset collections: complex combinations of datasets that can be treated as a single unit

Dataset Collections Organize user data Individual Datasets Collection Collection Contents

Operations over collections
- For “list” collections, existing tools can automatically be mapped across the entire collection
- Existing tools that support multiple inputs and one output act as reducers
- Many existing tools just work; but “structured” collections like “paired” need explicit support in tools
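The map and reduce behaviour described above can be sketched in a few lines. This is a toy model under stated assumptions: a "list" collection is represented as a plain dict from element identifier to dataset content, and `uppercase` and `concatenate` are hypothetical stand-ins for real tools. It is not Galaxy's implementation.

```python
from typing import Callable, Dict, List


def map_over(collection: Dict[str, str], tool: Callable[[str], str]) -> Dict[str, str]:
    """Run a one-input tool once per element, keeping element identifiers."""
    return {name: tool(dataset) for name, dataset in collection.items()}


def reduce_over(collection: Dict[str, str], tool: Callable[[List[str]], str]) -> str:
    """Run a multiple-input, single-output tool over the whole collection."""
    return tool(list(collection.values()))


def uppercase(dataset: str) -> str:
    """Hypothetical one-input tool."""
    return dataset.upper()


def concatenate(datasets: List[str]) -> str:
    """Hypothetical multiple-input tool that acts as a reducer."""
    return "".join(datasets)


samples = {"sample1": "acgt", "sample2": "ttga"}
mapped = map_over(samples, uppercase)      # still a collection, same identifiers
merged = reduce_over(mapped, concatenate)  # a single dataset
print(mapped)
print(merged)
```

The point of the sketch is that neither tool knows about collections: mapping and reducing happen around the tool, which is why many existing tools "just work".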

Map/reduce in workflows: More Powerful Workflows
- Arbitrary # of Inputs (... paired).
- Run applications in parallel (one per input).
- Merged output for subsequent processing.

Enhanced Tuxedo Suite Workflow: RNA-Seq workflow based on the Tuxedo suite.

Dataset Collections
- Extremely flexible for grouping collections of complex datasets, can be nested to arbitrary depth, structure is preserved through mapping
- More complex reductions, other collection operations in progress
- Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming)
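"Structure is preserved through mapping" can be illustrated with a recursive map over a nested collection. In this sketch nested dicts stand in for a list of paired datasets; the file names and the `trim` function are hypothetical, and the code only models the idea rather than Galaxy's internals.

```python
from typing import Callable, Union

Collection = Union[str, dict]


def map_nested(collection: Collection, tool: Callable[[str], str]) -> Collection:
    """Apply a tool to every leaf dataset, rebuilding the same nesting."""
    if isinstance(collection, dict):
        return {name: map_nested(child, tool) for name, child in collection.items()}
    return tool(collection)


def trim(dataset_name: str) -> str:
    """Hypothetical tool: returns the name of its (imaginary) output dataset."""
    return "trimmed_" + dataset_name


# Two replicates, each a forward/reverse pair (illustrative file names).
replicates = {
    "rep1": {"forward": "rep1_1.fq", "reverse": "rep1_2.fq"},
    "rep2": {"forward": "rep2_1.fq", "reverse": "rep2_2.fq"},
}
trimmed = map_nested(replicates, trim)
print(trimmed)
```

The output has the same replicate and pair layout as the input, so a downstream step can still tell which forward read belongs to which reverse read.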

An analysis is really a workflow

As analysis needs become increasingly complex, typical users have moved
from running individual tools to primarily running workflows

For research use, users need to be able to construct and modify workflows, not just run existing best practice pipelines.
The Galaxy Workflow editor supports this use case well, providing ways for users to easily construct and modify workflows

(Goecks et al. Cancer Medicine, 2015)

However, for reproducibility, we want to be able to ensure
that a workflow can be exactly rerun, even in a different compute environment, and get exactly the same results
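A simple way to test the "exactly the same results" requirement is to compare checksums of every output from two runs. The sketch below assumes each run's outputs are available as a mapping from output name to bytes; the output names and contents are made up for illustration.

```python
import hashlib
from typing import Dict, List


def digests(outputs: Dict[str, bytes]) -> Dict[str, str]:
    """Fingerprint every output of one workflow run."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in outputs.items()}


def differing_outputs(run_a: Dict[str, bytes], run_b: Dict[str, bytes]) -> List[str]:
    """Names of outputs that are missing from one run or differ in content."""
    a, b = digests(run_a), digests(run_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Illustrative outputs from the same workflow run in two environments.
first_run = {"variants.vcf": b"chrM\t8992\tA\tG\n", "report.txt": b"ok\n"}
second_run = {"variants.vcf": b"chrM\t8992\tA\tG\n", "report.txt": b"ok\n"}
changed_run = {"variants.vcf": b"chrM\t8992\tA\tT\n", "report.txt": b"ok\n"}

print(differing_outputs(first_run, second_run))
print(differing_outputs(first_run, changed_run))
```

Byte-level comparison is deliberately strict; outputs that embed timestamps or other run-specific metadata would need those fields normalized before comparing.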

[Diagram: Galaxy Tool Shed and Private Tool Sheds, alongside Galaxy instances numbered 1, 2, 3, ... ∞: http://usegalaxy.org, http://usegalaxy.org/community, Galaxies on private clouds, Galaxies on public clouds, private Galaxy installations]

Fostering the tool developer community

Galaxy has a highly expressive tool definition syntax

Conditionals

Repeats

Dynamic options

And many others…
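To make conditionals and repeats concrete, the sketch below embeds a small tool definition in Python and inspects it with the standard library XML parser. The `<conditional>`, `<when>`, `<repeat>` and `<param>` elements follow Galaxy's tool XML, but the tool itself ("example_filter") and all of its parameters are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical tool definition; only the element names come from Galaxy's tool syntax.
TOOL_XML = """
<tool id="example_filter" name="Example filter" version="0.1.0">
  <inputs>
    <param name="input" type="data" format="tabular" label="Input table"/>
    <conditional name="mode">
      <param name="select" type="select" label="Filtering mode">
        <option value="simple">Simple</option>
        <option value="advanced">Advanced</option>
      </param>
      <when value="simple"/>
      <when value="advanced">
        <param name="threshold" type="float" value="0.05" label="Threshold"/>
      </when>
    </conditional>
    <repeat name="columns" title="Column">
      <param name="index" type="integer" value="1" label="Column index"/>
    </repeat>
  </inputs>
</tool>
"""

root = ET.fromstring(TOOL_XML)
conditionals = [element.get("name") for element in root.iter("conditional")]
branches = [element.get("value") for element in root.iter("when")]
repeats = [element.get("name") for element in root.iter("repeat")]

print("conditionals:", conditionals)
print("branches:", branches)
print("repeats:", repeats)
```

In this example the "threshold" parameter is only offered when the user picks the "advanced" branch, and the "columns" block can be added as many times as needed.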

The Galaxy Toolshed: Sharing tools, workflows, and their dependencies

- Repositories are owned by the contributor, can contain tools, workflows, etc.
- Backed by version control, a complete version history is retained for everything that passes through the toolshed
- Galaxy instance admins can install tools directly from the toolshed using only a web UI
- Support for recipes for installing the underlying software that tools depend on (also versioned)

ToolShed Challenges: Good for deployment and archiving, difficult for development

Tool citations, credit and incentivization: Embed DOIs in the Tool Configuration;
Galaxy resolves them and provides a list of citations, with links, which can be exported for reference managers
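As an illustration of embedding a DOI in a tool configuration, the sketch below reads `<citation type="doi">` entries from a tool definition and turns them into resolver links. The tool is hypothetical, the DOI is the figshare record cited earlier in this deck, and the link-building step is an assumption standing in for whatever Galaxy does internally.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical tool; the DOI is reused from the figshare citation earlier in the deck.
TOOL_XML = """
<tool id="example_tool" name="Example tool" version="0.1.0">
  <citations>
    <citation type="doi">10.6084/m9.figshare.987130</citation>
  </citations>
</tool>
"""


def doi_links(tool_xml: str) -> List[str]:
    """Collect DOI citations from a tool definition as doi.org links."""
    root = ET.fromstring(tool_xml)
    return [
        "https://doi.org/" + citation.text.strip()
        for citation in root.iter("citation")
        if citation.get("type") == "doi" and citation.text
    ]


print(doi_links(TOOL_XML))
```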

ToolShed Challenges: Complex dependency definitions, packaging dependencies is a rabbit hole

Virtualize everything: control the host environment

CLIA-certified Galaxy pipelines using virtual machines (Minnesota Supercomputing Institute)

POSTER PRESENTATION, Open Access
CLIA-certified next-generation sequencing analysis in the cloud
Ying Zhang1*, Jesse Erdmann1, John Chilton1, Getiria Onsongo1, Matthew Bower2,3, Kenny Beckman4, Bharat Thyagarajan5, Kevin Silverstein1, Anne-Francoise Lamblin1, the Whole Galaxy Team at MSI1
From Beyond the Genome 2012, Boston, MA, USA. 27-29 September 2012

The development of next-generation sequencing (NGS) technology opens new avenues for clinical researchers to make discoveries, especially in the area of clinical diagnostics. However, combining NGS and clinical data presents two challenges: first, the accessibility to clinicians of sufficient computing power needed for the analysis of high volume of NGS data; and second, the stringent requirements of accuracy and patient information data governance in a clinical setting. Cloud computing is a natural fit for addressing the computing power requirements, while Clinical Laboratory Improvement Amendments (CLIA) certification provides a baseline standard for meeting the demands on researchers in working with clinical data. Combining a cloud-computing environment with CLIA certification presents its own challenges due to the level of control users have over the cloud environment and CLIA’s stability requirements. We have bridged this gap by creating a locked virtual machine with a pre-defined and validated set of workflows. This virtual machine is created using our Galaxy VM launcher tool to instantiate a Galaxy [http://www.usegalaxy.org] environment at Amazon with [...] patient samples were analyzed using customized hybrid-capture bait libraries to boost read coverage in low-coverage regions, followed by targeted enrichment sequencing at the BioMedical Genomics Center. The NGS data is imported to a tested Galaxy single nucleotide polymorphism (SNP) detection workflow in a locked Galaxy virtual machine on Amazon’s Elastic Compute Cloud (EC2).

This project illustrates our ability to carry out CLIA-certified NGS analysis in the cloud, and will provide valuable guidance in any future implementation of NGS analysis involving clinical diagnosis.

Author details
1 Research Informatics Support System, Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA.
2 Division of Genetics and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA.
3 Molecular Diagnostics Laboratory, University of Minnesota Medical Center-Fairview, University of Minnesota, Minneapolis, MN 55455, USA.
4 BioMedical Genomics Center, University of Minnesota, Minneapolis, MN 55455, USA.
5 Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA.

Published: 1 October 2012
Zhang et al. BMC Proceedings 2012, 6(Suppl 6):P54 http://www.biomedcentral.com/1753-6561/6/S6/P54

Share a snapshot of this instance
- Current support for archiving instances with CloudMan
- Plan to support archiving analyses both from custom Galaxy instances and on Galaxy main

New approaches for dependency management
- Alternative approach for installing dependencies: Conda
- How can we run community contributed tools safely and efficiently?
- Support for defining dependencies as Docker containers

What is Docker?
[Figure: Traditional Virtual Machine vs. Docker]
Kernel is shared between containers; achieves the isolation and management benefits of VMs but much more lightweight and efficient

ToolShed and Docker
- Tools can assert their dependencies are provided by a Docker container
- Potentially tool execution is more secure due to isolation
- Easier for tool developers to package dependencies
- Much easier for end-users to get dependencies
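The idea of a tool asserting a container can be sketched as follows: read a `<container type="docker">` requirement from a tool definition and wrap the tool's command line in `docker run`. The image name, job directory and tool command are hypothetical, and the way the command is assembled here is an assumption for illustration, not a description of how Galaxy launches containers.

```python
import shlex
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical tool and image name.
TOOL_XML = """
<tool id="example_tool" name="Example tool" version="0.1.0">
  <requirements>
    <container type="docker">example/image:1.0</container>
  </requirements>
</tool>
"""


def docker_image(tool_xml: str) -> Optional[str]:
    """Return the first Docker image the tool declares, if any."""
    root = ET.fromstring(tool_xml)
    for container in root.iter("container"):
        if container.get("type") == "docker" and container.text:
            return container.text.strip()
    return None


def docker_command(image: str, job_dir: str, tool_argv: List[str]) -> List[str]:
    """Wrap a tool command so it runs inside the declared container."""
    return [
        "docker", "run", "--rm",        # remove the container when the job ends
        "-v", f"{job_dir}:{job_dir}",   # expose only the job directory
        "-w", job_dir,                  # run the tool from inside it
        image, *tool_argv,
    ]


image = docker_image(TOOL_XML)
command = docker_command(image, "/tmp/job_1", ["example_tool", "--input", "in.tabular"])
print(shlex.join(command))
```

Mounting only the job directory is what gives the isolation benefit mentioned above: the tool sees its own inputs and outputs and nothing else on the host.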

What if you need a new, ad hoc analysis within Galaxy?

Interactive programming environments

- For researchers without informatics expertise, the web UI and existing tools are often sufficient
- For informaticians, Galaxy provides an extensive API and wrappers (e.g. BioBlend)
- But many users can do some programming and would like the benefits of Galaxy with the flexibility to do some scripting
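For a sense of what scripting against the API looks like, here is a minimal sketch that lists history names using only the Python standard library. It assumes the `/api/histories` endpoint with an API key passed as the `key` query parameter, and that each returned history has a `name` field; the key shown is a placeholder, and a wrapper such as BioBlend would normally hide these details.

```python
import json
from typing import List
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


def histories_url(base_url: str, api_key: str) -> str:
    """Build the URL for listing the histories visible to this API key."""
    endpoint = urljoin(base_url.rstrip("/") + "/", "api/histories")
    return endpoint + "?" + urlencode({"key": api_key})


def list_history_names(base_url: str, api_key: str) -> List[str]:
    """Fetch the histories and return their names (needs network access)."""
    with urlopen(histories_url(base_url, api_key)) as response:
        histories = json.load(response)
    return [history["name"] for history in histories]


# Placeholder key; no request is sent by building the URL.
print(histories_url("http://usegalaxy.org", "YOUR_API_KEY"))
```

Calling `list_history_names` with a real key would perform the request; it is left uncalled here so the example runs without network access or credentials.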

Docker enables interactive environments
- Framework allows spinning up secure* isolated environments that can interact with the Galaxy history through Galaxy’s API
- Initial implementation supporting IPython Notebook

Using Galaxy main to drive scalability improvements…

Sites: PSC, Pittsburgh; SDSC, San Diego; TACC, Austin
- Galaxy Cluster: 256 cores, 2 TB memory
- Rodeo: 128 cores, 1 TB memory
- Corral/Stockyard: 20 PB disk
- Stampede: 462,462 cores, 205 TB memory
- Blacklight: 4,096 cores, 32 TB memory, dedicated resources
- Trestles: 10,368 cores, 20.7 TB memory, shared resources

funded by the National Science Foundation Award #ACI-1445604

- A user-friendly cloud environment designed to give researchers access to interactive computing and data analysis resources on demand; researchers can create their own “private computing system” within Jetstream
- Two widely used biology platforms will be supported: Galaxy and iPlant
- Allow users to preserve VMs with Digital Object Identifiers (DOIs), which enables sharing of results, reproducibility of analyses, and new analyses of published research data.

Summary
- Galaxy is an (obsessively) open framework for making data analysis accessible and reproducible
- Nearly everything in Galaxy is “pluggable”, allowing it to be customized for myriad purposes
- New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability
- By supporting and leveraging tool developers, the Galaxy community can collectively keep up with rapid changes in available tools

The “Core” Galaxy Team
Dan Blankenberg, Nate Coraor, Dannon Baker, Jeremy Goecks, Anton Nekrutenko, James Taylor, Dave Clements, Jennifer Jackson, Carl Eberhard, Dave Bouvier, John Chilton, Sam Guerler, Martin Čech, Enis Afgan, Nitesh Turaga
Team groups: Engineering, Support and outreach, Custodians
Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Emory University, and the Pennsylvania Department of Public Health

Extended team and other contributors…
- Björn Grüning (Uni Freiburg)
- Peter Cock (TJHI)
- Kyle Ellrott (UCSC)
- Eric Rasche (CPT)
- Nicola Soranzo (TGAC)
- Brad Chapman (HSPH)
- Nuwan Goonasekera (VeRSI)
- Yousef Kowsar (VLSCI)
And many others who have contributed to the main Galaxy code, contributed tools to the ToolShed, participated in discussions, attended the Galaxy conferences, …

Galaxy is a community!
- Join us on IRC, mailing lists, Galaxy Biostar
- Contribute code on Bitbucket, GitHub, or the ToolShed
- Join us for a Hackathon or our annual conference 2016
