Slide 1

Slide 1 text

Building data analysis ecosystem in life sciences with Galaxy @galaxyproject / #usegalaxy http://www.galaxyproject.org

Slide 2

Slide 2 text

A continuing crisis in genomics research: reproducibility

Slide 3

Slide 3 text

What is reproducibility? (for computational analyses) Reproducibility is not provenance, reusability/generalizability, or correctness Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced (given the data) Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009, 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012, 7/50 resequencing experiments reproducible) Missing software, versions, parameters, data…
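
The items listed as commonly missing (software, versions, parameters, data) can be captured in a small machine-readable record. The following is a minimal sketch, not Galaxy's own provenance format: the function names, the tool name, the version string, and the parameter values are all placeholders chosen for illustration.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest that identifies an input dataset."""
    return hashlib.sha256(data).hexdigest()


def describe_step(tool: str, version: str, parameters: dict, inputs: dict) -> dict:
    """Capture what is needed to rerun one analysis step.

    `inputs` maps a dataset name to its raw bytes; only the checksum is kept.
    Keys are sorted so that the same step always yields the same record.
    """
    return {
        "tool": tool,
        "version": version,
        "parameters": dict(sorted(parameters.items())),
        "inputs": {name: sha256_of(blob) for name, blob in sorted(inputs.items())},
    }


# Hypothetical step: tool name, version, and parameters are placeholders.
record = describe_step(
    tool="example_mapper",
    version="0.5.9",
    parameters={"q": 15, "n": 3},
    inputs={"reads.fastq": b"@r1\nACGT\n+\nIIII\n"},
)
print(json.dumps(record, indent=2))
```

Storing a checksum rather than the data keeps the record small while still letting a reader verify that they are rerunning the step on the same input.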

Slide 4

Slide 4 text

Reproducibility ≈ Engine efficiency Schwarz 2015 (DOI: 10.1126/science.aaa3276)

Slide 5

Slide 5 text

Reproducibility Project: Cancer Biology Independently replicating 50 “high-impact” cancer studies from 2010-2012 (https://osf.io/e81xl/wiki/home/)

Slide 6

Slide 6 text

Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130 32/127 tools 6/41 papers

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

#METHODSMATTER [Figure 1: Frequency fluctuation for site 8992. x-axis: versions 5.2 to 6.1; y-axis: frequency, 0.480 to 0.510; series: Default and -n 3 -q 15]

Slide 9

Slide 9 text

Example: A tale of two Science papers

Slide 10

Slide 10 text

Paper 1

Slide 11

Slide 11 text

All you need for reproducing is here (Fig. 2)

Slide 12

Slide 12 text

Paper 2

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

Genomic signatures to guide the use of chemotherapeutics
Anil Potti, Holly K Dressman, Andrea Bild, Richard F Riedel, Gina Chan, Robyn Sayer, Janiel Cragun, Hope Cottrill, Michael J Kelley, Rebecca Petersen, David Harpole, Jeffrey Marks, Andrew Berchuck, Geoffrey S Ginsburg, Phillip Febbo, Johnathan Lancaster & Joseph R Nevins
Using in vitro drug sensitivity data coupled with Affymetrix microarray data, we developed gene expression signatures that predict sensitivity to individual chemotherapeutic drugs. Each signature was validated with response data from an independent set of cell line studies. We further show that many of these signatures can accurately predict clinical response in individuals treated with these drugs. Notably, signatures developed to predict response to individual agents, when combined, could also predict response to multidrug regimens. Finally, we integrated the chemotherapy response signatures with signatures of oncogenic pathway deregulation to identify new therapeutic strategies that make use of all available drugs. The development of gene expression profiles that can predict response to commonly used cytotoxic agents provides opportunities to better use these drugs, including using them in combination with existing targeted therapies.
Numerous advances have been achieved in the development, selection and application of chemotherapeutic agents, sometimes with remarkable clinical successes, as in the case of treatment for lymphomas or platinum-based therapy for testicular cancers [1]. In addition, in several instances, combination chemotherapy in the postoperative (adjuvant) setting has been curative. However, most people with advanced solid tumors will relapse and die of their disease. Moreover, administration of ineffective chemotherapy increases the probability of side effects, particularly those from cytotoxic agents, and of a consequent decrease in quality of life [1,2].
Recent work has demonstrated the value in using biomarkers to select individuals for various targeted therapeutics, including tamoxifen, trastuzumab and imatinib mesylate. In contrast, equivalent tools to select those most likely to respond to the commonly used chemotherapeutic drugs are lacking [3]. With the goal of developing genomic predictors of chemotherapy sensitivity that could direct the use of cytotoxic agents to those most likely to respond, we combined in vitro drug response data, together with microarray gene expression data, to develop models that could potentially predict responses to various cytotoxic chemotherapeutic drugs [4]. We now show that these signatures can predict clinical or pathologic response to the corresponding drugs, including combinations of drugs. We further use the ability to predict deregulated oncogenic signaling pathways in tumors to develop a strategy that identifies opportunities for combining chemotherapeutic drugs with targeted therapeutic drugs in a way that best matches the characteristics of the individual.
RESULTS
A gene expression-based predictor of sensitivity to docetaxel
To develop predictors of cytotoxic chemotherapeutic drug response, we used an approach similar to previous work analyzing the NCI-60 panel [4] from the US National Cancer Institute (NCI). We first identified cell lines that were most resistant or sensitive to docetaxel (Fig. 1a,b) and then genes whose expression correlated most highly with drug sensitivity, and used Bayesian binary regression analysis to develop a model that differentiates a pattern of docetaxel sensitivity from that of resistance. A gene expression signature consisting of 50 genes was identified that classified cell lines on the basis of docetaxel sensitivity (Fig. 1b, right). In addition to leave-one-out cross-validation, we used an independent dataset derived from docetaxel sensitivity assays in a series of 30 lung and ovarian cancer cell lines for further validation. The significant correlation (P < 0.01, log-rank test) between the predicted probability of sensitivity to docetaxel (in both lung and ovarian cell lines) (Fig. 1c, left) and the respective 50% inhibitory concentration (IC50) for docetaxel confirmed the capacity of the docetaxel predictor to predict sensitivity to the drug in cancer cell
ARTICLES © 2011 Nature America, Inc. All rights reserved.

Slide 18

Slide 18 text

The importance of being reproducible Starting in 2006, Potti published papers describing algorithms that take gene-expression data from a cancer cell and predict whether the cancer will be sensitive to a particular therapy Duke began three clinical trials based on the technology enrolling 110 patients

Slide 19

Slide 19 text

The importance of being reproducible However, Keith Baggerly and Kevin Coombes demonstrate that the findings cannot be replicated Long and difficult fight to get this acknowledged, followed by a series of investigations So far, ten major paper retractions, all trials cancelled, two lawsuits ongoing…

Slide 20

Slide 20 text

The importance of being reproducible NCI investigates, demands the software for the method be provided Not only could they not replicate the results, the software produced substantially different predictions when run again on the same data! Some scores changed from 5% to 95%, classifications changed ~25% of the time!

Slide 21

Slide 21 text

How does this even pass peer review? DON’T TRUST BLACK BOXES!

Slide 22

Slide 22 text

Is reproducibility achievable?

Slide 23

Slide 23 text

To answer this question we need to understand causes of the problem

Slide 24

Slide 24 text

Who are we dealing with? Users Developers HPC

Slide 25

Slide 25 text

Users’ troubles: - Data logistics - HPC - Poor knowledge of existing tools - Inability to develop new tools - Lack of transparency and reproducibility

Slide 26

Slide 26 text

Developers’ grief: - Limited tool exposure - Parameter picking troubles - Data format nightmare - High profile publications

Slide 27

Slide 27 text

HPC providers’ challenges: - Lack of HPC utilization skills - Software is not optimized - HPC is heterogeneous

Slide 28

Slide 28 text

user HPC dev

Slide 29

Slide 29 text

user HPC dev

Slide 30

Slide 30 text

user HPC dev

Slide 31

Slide 31 text

user HPC dev Galaxy

Slide 32

Slide 32 text

Galaxy: accessible analysis system

Slide 33

Slide 33 text

A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage Open source software that makes integrating your own tools and data and customizing for your own site simple An open extensible platform for sharing tools, datatypes, workflows, ...

Slide 34

Slide 34 text

Galaxy’s ideological goals: How best can data intensive methods be accessible to scientists? How best to facilitate transparent communication of computational analyses? How best to ensure that analyses are reproducible?

Slide 35

Slide 35 text

Galaxy’s practical goals: How to arm researchers with access to powerful compute and latest tools How to build a community of tool developers How to run Galaxy on any HPC

Slide 36

Slide 36 text

Galaxy’s goals (an xkcd version) Galaxy no Galaxy

Slide 37

Slide 37 text

Describe analysis tool behavior abstractly

Slide 38

Slide 38 text

Describe analysis tool behavior abstractly Analysis environment automatically and transparently tracks details

Slide 39

Slide 39 text

Describe analysis tool behavior abstractly Analysis environment automatically and transparently tracks details Workflow system for complex analysis, constructed explicitly or automatically

Slide 40

Slide 40 text

Describe analysis tool behavior abstractly Analysis environment automatically and transparently tracks details Workflow system for complex analysis, constructed explicitly or automatically Pervasive sharing, and publication of documents with integrated analysis

Slide 41

Slide 41 text

Visualization and visual analytics

Slide 42

Slide 42 text

Ways to use Galaxy The public web service at http://usegalaxy.org Install locally with many compute environments Deploy on a cloud using CloudMan or Atmosphere

Slide 43

Slide 43 text

Galaxy in a world of increasingly complex analyses

Slide 44

Slide 44 text

user HPC dev Galaxy

Slide 45

Slide 45 text

user HPC dev

Slide 46

Slide 46 text

We are in the age of multiple datasets

Slide 47

Slide 47 text

Galaxy’s user interface is designed to be simple and intuitive for users without informatics expertise Can we scale this user interface to the analysis of hundreds of samples while maintaining interface idioms and usability?

Slide 48

Slide 48 text

Users typically use many histories when working with many samples; New multiple history view makes working with 100s of histories easy

Slide 49

Slide 49 text

A not-so-new feature: mapping over multiple datasets However, this breaks down for complex combinations of datasets (e.g. many sets of paired end reads, in replicates)

Slide 50

Slide 50 text

Dataset collections complex combinations of datasets that can be treated as a single unit

Slide 51

Slide 51 text

Dataset Collections Organize user data Individual Datasets Collection Collection Contents

Slide 52

Slide 52 text

Operations over collections For “list” collections, existing tools can automatically be mapped across the entire collection Existing tools that support multiple inputs and one output act as reducers Many existing tools just work; but “structured” collections like “paired” need explicit support in tools
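
The two behaviours described on this slide, mapping a single-input tool across a "list" collection and using a multi-input tool as a reducer, can be sketched in a few lines. This is an illustration of the idea only, not Galaxy's data model: the `ListCollection` type, the helper names, and the two toy "tools" are assumptions made for the example.

```python
from typing import Callable, Dict, List

# A "list" collection is modelled here as an ordered mapping of
# element identifier -> dataset content (hypothetical representation).
ListCollection = Dict[str, str]


def map_over(tool: Callable[[str], str], collection: ListCollection) -> ListCollection:
    """Run a one-input tool on every element; identifiers are preserved."""
    return {identifier: tool(dataset) for identifier, dataset in collection.items()}


def reduce_collection(tool: Callable[[List[str]], str], collection: ListCollection) -> str:
    """Hand all elements to a multi-input tool that produces one output."""
    return tool(list(collection.values()))


def count_lines(dataset: str) -> str:
    """Toy single-input tool: report the number of lines in a dataset."""
    return str(len(dataset.splitlines()))


def concatenate(datasets: List[str]) -> str:
    """Toy multi-input tool: join datasets into one output."""
    return "\n".join(datasets)


samples = {"sample_a": "chr1\t10\nchr1\t20", "sample_b": "chr2\t5"}
counts = map_over(count_lines, samples)          # one output per element
merged = reduce_collection(concatenate, counts)  # one output for the collection
print(counts, repr(merged))
```

Neither toy tool knows anything about collections, which mirrors the slide's point that many existing tools "just work" once the framework does the mapping and reducing.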

Slide 53

Slide 53 text

Map/reduce in workflows More Powerful Workflows Arbitrary # of Inputs (... paired). Run applications in parallel (one per input). Merged output for subsequent processing.
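
The pattern on this slide (an arbitrary number of paired inputs, one job per input run in parallel, then a merged output) can be sketched with Python's standard thread pool. This is a toy stand-in rather than Galaxy's workflow scheduler: `Pair`, `process_pair`, and `run_workflow` are hypothetical names, and the per-input "job" only counts bases.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, NamedTuple


class Pair(NamedTuple):
    """One paired-end input: forward and reverse reads (toy strings)."""
    forward: str
    reverse: str


def process_pair(pair: Pair) -> int:
    """Stand-in for a per-input job: count the bases in both reads."""
    return len(pair.forward) + len(pair.reverse)


def run_workflow(pairs: List[Pair]) -> int:
    """Run one job per input in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map returns results in input order, whatever order jobs finish in.
        per_input = list(pool.map(process_pair, pairs))
    return sum(per_input)  # merged output for subsequent processing


pairs = [Pair("ACGT", "TT"), Pair("GG", "C")]
total = run_workflow(pairs)
print(total)
```

The same function works unchanged for two inputs or two hundred, which is the property the slide highlights.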

Slide 54

Slide 54 text

Enhanced Tuxedo Suite Workflow RNA-Seq workflow built using the Tuxedo suite.

Slide 55

Slide 55 text

Dataset Collections Extremely flexible for grouping collections of complex datasets, can be nested to arbitrary depth, structure is preserved through mapping More complex reductions, other collection operations in progress Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming)
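
The claim that nested structure is preserved through mapping can be made concrete with a short recursive sketch. It assumes, purely for illustration, that a nested collection is represented as nested dictionaries with datasets at the leaves; `map_nested` is a hypothetical helper, not a Galaxy function.

```python
from typing import Any, Callable


def map_nested(tool: Callable[[Any], Any], collection: Any) -> Any:
    """Apply `tool` to every leaf dataset, keeping the nesting intact."""
    if isinstance(collection, dict):
        return {key: map_nested(tool, value) for key, value in collection.items()}
    return tool(collection)  # a leaf: an individual dataset


# Conditions -> replicates -> paired reads, nested three levels deep.
experiment = {
    "treated": {
        "rep1": {"forward": "ACGT", "reverse": "TTGA"},
        "rep2": {"forward": "ACG", "reverse": "TT"},
    },
    "control": {
        "rep1": {"forward": "A", "reverse": "C"},
    },
}

lengths = map_nested(len, experiment)
print(lengths)
```

The output has exactly the same keys at every level as the input, so a later step can still tell which result belongs to which condition, replicate, and read direction.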

Slide 56

Slide 56 text

An analysis is really a workflow

Slide 57

Slide 57 text

As analysis needs become increasingly complex, typical users have moved from running individual tools to primarily running workflows

Slide 58

Slide 58 text

For research use, users need to be able to construct and modify workflows, not just run existing best practice pipelines The Galaxy Workflow editor supports this use case well, providing ways for users to easily construct and modify workflows

Slide 59

Slide 59 text

(Goecks et al. Cancer Medicine, 2015)

Slide 60

Slide 60 text

(Goecks et al. Cancer Medicine, 2015)

Slide 61

Slide 61 text

However, for reproducibility, we want to be able to ensure that a workflow can be exactly rerun, even in a different compute environment, and get exactly the same results

Slide 62

Slide 62 text

1 2 3 ∞ http://usegalaxy.org http://usegalaxy.org/community ... Galaxies on private clouds Galaxies on public clouds ... private Galaxy installations Private Tool Sheds Galaxy Tool Shed

Slide 63

Slide 63 text

Fostering the tool developer community

Slide 64

Slide 64 text

Galaxy has highly expressive tool definition syntax

Slide 65

Slide 65 text

Conditionals

Slide 66

Slide 66 text

Conditionals

Slide 67

Slide 67 text

Conditionals

Slide 68

Slide 68 text

Repeats

Slide 69

Slide 69 text

Repeats

Slide 70

Slide 70 text

Dynamic options

Slide 71

Slide 71 text

And many others…
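
Since the slides for conditionals and repeats carry only their titles, a small sketch may help. The XML below uses the `conditional`, `when`, `repeat`, and `param` elements of Galaxy's tool definition syntax, but the tool itself (its id, parameter names, and options) is invented for illustration, and the Python helper `summarize_inputs` is a hypothetical reader, not part of Galaxy.

```python
import xml.etree.ElementTree as ET

# Hypothetical tool definition: a conditional that reveals an extra
# parameter in "advanced" mode, and a repeat for any number of regions.
TOOL_XML = """
<tool id="example_filter" name="Example filter" version="0.1.0">
  <inputs>
    <conditional name="mode">
      <param name="selector" type="select" label="Filtering mode">
        <option value="simple">Simple</option>
        <option value="advanced">Advanced</option>
      </param>
      <when value="simple"/>
      <when value="advanced">
        <param name="min_quality" type="integer" value="15" label="Minimum quality"/>
      </when>
    </conditional>
    <repeat name="regions" title="Region">
      <param name="chrom" type="text" label="Chromosome"/>
    </repeat>
  </inputs>
</tool>
""".strip()


def summarize_inputs(tool_xml: str) -> dict:
    """List which parameters each conditional branch and each repeat exposes."""
    inputs = ET.fromstring(tool_xml).find("inputs")
    conditionals = {
        cond.get("name"): {
            when.get("value"): [p.get("name") for p in when.findall("param")]
            for when in cond.findall("when")
        }
        for cond in inputs.findall("conditional")
    }
    repeats = {
        rep.get("name"): [p.get("name") for p in rep.findall("param")]
        for rep in inputs.findall("repeat")
    }
    return {"conditionals": conditionals, "repeats": repeats}


summary = summarize_inputs(TOOL_XML)
print(summary)
```

Because the form logic lives in the declarative definition, the interface can show or hide `min_quality` and add region blocks without the tool author writing any UI code.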

Slide 72

Slide 72 text

The Galaxy Toolshed: Sharing tools, workflows, and their dependencies

Slide 73

Slide 73 text

Repositories are owned by the contributor, can contain tools, workflows, etc. Backed by version control, a complete version history is retained for everything that passes through the toolshed Galaxy instance admins can install tools directly from the toolshed using only a web UI Support for recipes for installing the underlying software that tools depend on (also versioned)

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

No content

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

ToolShed Challenges Good for deployment and archiving, difficult for development

Slide 82

Slide 82 text

New command line tools to address concerns from tool developers Tool Development Planemo Command-line tools to aid development. ○ Test tools quickly without worrying about configuration files. ○ Check tools for common bugs and best practices. ○ Optimized publishing to the ToolShed. ○ Testbed for new dependency management - Homebrew and Homebrew-science

Slide 83

Slide 83 text

Move to git[hub] centric development workflow Within three weeks, four major community contributions to core tools

Slide 84

Slide 84 text

Tool citations, credit and incentivization Embed DOIs in Tool Configuration, Galaxy resolves and provides a list of citations, with links, which can be exported for reference managers
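
As a rough illustration of embedding DOIs in a tool configuration, the sketch below reads `citation` elements of type `doi` from a tool definition and turns them into resolvable links. The tool is hypothetical, the two DOIs are simply reused from earlier slides in this deck as placeholders, and `citation_links` is an assumed helper that does not reproduce how Galaxy itself resolves citations.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical tool definition; the DOIs are placeholders taken from
# references that appear elsewhere in this deck.
TOOL_XML = """
<tool id="example_tool" name="Example tool" version="0.1.0">
  <citations>
    <citation type="doi">10.1126/science.aaa3276</citation>
    <citation type="doi">10.6084/m9.figshare.987130</citation>
  </citations>
</tool>
""".strip()


def citation_links(tool_xml: str) -> List[str]:
    """Return a doi.org link for every DOI citation in the tool definition."""
    root = ET.fromstring(tool_xml)
    return [
        "https://doi.org/" + citation.text.strip()
        for citation in root.findall("./citations/citation")
        if citation.get("type") == "doi" and citation.text
    ]


links = citation_links(TOOL_XML)
for link in links:
    print(link)
```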

Slide 85

Slide 85 text

No content

Slide 86

Slide 86 text

ToolShed Challenges Complex dependency definitions, packaging dependencies is a rabbit hole

Slide 87

Slide 87 text

Virtualize everything: control the host environment

Slide 88

Slide 88 text

No content

Slide 89

Slide 89 text

No content

Slide 90

Slide 90 text

POSTER PRESENTATION Open Access
CLIA-certified next-generation sequencing analysis in the cloud
Ying Zhang1*, Jesse Erdmann1, John Chilton1, Getiria Onsongo1, Matthew Bower2,3, Kenny Beckman4, Bharat Thyagarajan5, Kevin Silverstein1, Anne-Francoise Lamblin1, the Whole Galaxy Team at MSI1
From Beyond the Genome 2012 Boston, MA, USA. 27-29 September 2012
The development of next-generation sequencing (NGS) technology opens new avenues for clinical researchers to make discoveries, especially in the area of clinical diagnostics. However, combining NGS and clinical data presents two challenges: first, the accessibility to clinicians of sufficient computing power needed for the analysis of high volume of NGS data; and second, the stringent requirements of accuracy and patient information data governance in a clinical setting. Cloud computing is a natural fit for addressing the computing power requirements, while Clinical Laboratory Improvement Amendments (CLIA) certification provides a baseline standard for meeting the demands on researchers in working with clinical data. Combining a cloud-computing environment with CLIA certification presents its own challenges due to the level of control users have over the cloud environment and CLIA’s stability requirements. We have bridged this gap by creating a locked virtual machine with a pre-defined and validated set of workflows. This virtual machine is created using our Galaxy VM launcher tool to instantiate a Galaxy [http://www.usegalaxy.org] environment at Amazon with patient samples were analyzed using customized hybrid-capture bait libraries to boost read coverage in low-coverage regions, followed by targeted enrichment sequencing at the BioMedical Genomics Center. The NGS data is imported to a tested Galaxy single nucleotide polymorphism (SNP) detection workflow in a locked Galaxy virtual machine on Amazon’s Elastic Compute Cloud (EC2). This project illustrates our ability to carry out CLIA-certified NGS analysis in the cloud, and will provide valuable guidance in any future implementation of NGS analysis involving clinical diagnosis.
Author details 1Research Informatics Support System, Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA. 2Division of Genetics and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA. 3Molecular Diagnostics Laboratory, University of Minnesota Medical Center-Fairview, University of Minnesota, Minneapolis, MN 55455, USA. 4BioMedical Genomics Center, University of Minnesota, Minneapolis, MN 55455, USA. 5Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA.
Published: 1 October 2012 Zhang et al. BMC Proceedings 2012, 6(Suppl 6):P54 http://www.biomedcentral.com/1753-6561/6/S6/P54
CLIA-certified Galaxy pipelines using virtual machines (Minnesota Supercomputing Institute)

Slide 91

Slide 91 text

Share a snapshot of this instance Current support for archiving instances with CloudMan Plan to support archiving analyses both from custom Galaxy instances and on Galaxy main

Slide 92

Slide 92 text

New approaches for dependency management Alternative approach for installing dependencies: Homebrew/Linuxbrew How can we run community contributed tools safely and efficiently? Support for defining dependencies as Docker containers

Slide 93

Slide 93 text

What is Docker? Traditional Virtual Machine vs. Docker Kernel is shared between containers; achieves the isolation and management benefits of VMs but much more lightweight and efficient

Slide 94

Slide 94 text

ToolShed and Docker Tools can assert their dependencies are provided by a Docker container Potentially tool execution is more secure due to isolation Easier for tool developers to package dependencies Much easier for end-users to get dependencies
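
One way to picture a tool asserting that its dependencies live in a Docker container is sketched below. The `container` requirement element is part of Galaxy's tool syntax, but everything else is an assumption for illustration: the tool, the image name `example/image:1.0`, the helper names, and in particular the `docker run` command line, which is a plausible invocation and not the one Galaxy constructs.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical tool definition naming a (made-up) Docker image.
TOOL_XML = """
<tool id="example_tool" name="Example tool" version="0.1.0">
  <requirements>
    <container type="docker">example/image:1.0</container>
  </requirements>
  <command>example_binary --help</command>
</tool>
""".strip()


def docker_image(tool_xml: str) -> Optional[str]:
    """Return the Docker image a tool declares, or None if it declares none."""
    root = ET.fromstring(tool_xml)
    for container in root.findall("./requirements/container"):
        if container.get("type") == "docker" and container.text:
            return container.text.strip()
    return None


def docker_command(image: str, command: str, workdir: str) -> List[str]:
    """Build an illustrative `docker run` invocation (not Galaxy's own).

    The job directory is mounted at the same path inside the container so
    that input and output paths stay valid.
    """
    return [
        "docker", "run", "--rm",
        "--volume", f"{workdir}:{workdir}",
        "--workdir", workdir,
        image, "sh", "-c", command,
    ]


image = docker_image(TOOL_XML)
argv = docker_command(image, "example_binary --help", "/tmp/job_1")
print(" ".join(argv))
```

The end user never installs `example_binary`; pulling the image supplies it, which is the "much easier for end-users" point on the slide.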

Slide 95

Slide 95 text

What if you need a new, ad hoc analysis within Galaxy?

Slide 96

Slide 96 text

Interactive programming environments

Slide 97

Slide 97 text

For researchers without informatics expertise, the web UI and existing tools are often sufficient For informaticians, Galaxy provides an extensive API and wrappers (e.g. Bioblend) But, many users can do some programming, would like the benefits of Galaxy with the flexibility to do some scripting
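
For users who want to script against Galaxy, the sketch below shows the general shape of a request to the REST API using only the Python standard library. It assumes the histories resource at `/api/histories` and an API key passed as the `key` query parameter; `api_url` and `fetch_json` are hypothetical helpers, `YOUR_API_KEY` is a placeholder, and a wrapper library such as BioBlend hides these details in practice.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


def api_url(base_url: str, resource: str, api_key: str) -> str:
    """Build the URL for a Galaxy API resource, e.g. 'histories'."""
    base = base_url.rstrip("/") + "/"
    return urljoin(base, "api/" + resource) + "?" + urlencode({"key": api_key})


def fetch_json(url: str):
    """Perform the GET request and decode the JSON response.

    Not called below, because it needs network access and a valid key.
    """
    with urlopen(url) as response:
        return json.load(response)


# Placeholder key: replace with a real API key before calling fetch_json.
url = api_url("https://usegalaxy.org", "histories", "YOUR_API_KEY")
print(url)
```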

Slide 98

Slide 98 text

Docker enables interactive environments Framework allows spinning up secure* isolated environments, that can interact with the Galaxy history through Galaxy’s API Initial implementation supporting iPython Notebook

Slide 99

Slide 99 text

No content

Slide 100

Slide 100 text

No content

Slide 101

Slide 101 text

No content

Slide 102

Slide 102 text

No content

Slide 103

Slide 103 text

No content

Slide 104

Slide 104 text

No content

Slide 105

Slide 105 text

No content

Slide 106

Slide 106 text

Next steps Support for Jupyter (both Python and Julia) and RStudio environments Interactive programming environments as first class citizens: full provenance tracking, establish inputs and outputs, be used in workflows, etc. Databases as first class citizens, e.g. GEMINI query interface as a reusable tool

Slide 107

Slide 107 text

Visualization as a tool to make sense of complex data

Slide 108

Slide 108 text

Towards a pluggable interactive visualization framework

Slide 109

Slide 109 text

No content

Slide 110

Slide 110 text

Modifying Cufflinks parameters and locally reassembling

Slide 111

Slide 111 text

PhyloViz from Google Summer of Code student Tomithy Too

Slide 112

Slide 112 text

Circster, interactive circos-style plots

Slide 113

Slide 113 text

Visualization framework: Charts plugin

Slide 114

Slide 114 text

Visualization framework: Charts plugin

Slide 115

Slide 115 text

Enables users to quickly visualize tabular data. Screencast

Slide 116

Slide 116 text

Stuff that’s coming Backend workflow engine improvements to support the much larger analyses that can now be constructed in the UI (ongoing) Increasing complexity and control over how datasets are used Federation between Galaxy instances, support for transparently accessing data from other APIs

Slide 117

Slide 117 text

Using Galaxy main to drive scalability improvements…

Slide 118

Slide 118 text

TACC, Austin
Galaxy Cluster ● 256 cores ● 2 TB memory
Rodeo ● 128 cores ● 1 TB memory
Corral/Stockyard ● 20 PB disk
Stampede ● 462,462 cores ● 205 TB memory
PSC, Pittsburgh
Blacklight ● 4,096 cores ● 32 TB memory ● Dedicated resources
SDSC, San Diego
Trestles ● 10,368 cores ● 20.7 TB memory ● Shared resources

Slide 119

Slide 119 text

funded by the National Science Foundation Award #ACI-1445604

Slide 120

Slide 120 text

A user-friendly cloud environment designed to give researchers access to interactive computing and data analysis resources on demand; researchers can create their own “private computing system” within Jetstream Two widely used biology platforms will be supported - Galaxy and iPlant Allow users to preserve VMs with Digital Object Identifiers (DOIs), which enables sharing of results, reproducibility of analyses, and new analyses of published research data.

Slide 121

Slide 121 text

Summary Galaxy is an (obsessively) open framework for making data analysis accessible and reproducible Nearly everything in Galaxy is “pluggable”, allowing it to be customized for myriad purposes New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability By supporting and leveraging tool developers the Galaxy community can collectively keep up with rapid changes in available tools

Slide 122

Slide 122 text

Dan Blankenberg Nate Coraor Dannon Baker Jeremy Goecks Anton Nekrutenko James Taylor Dave Clements Jennifer Jackson Engineering Support and outreach Custodians Carl Eberhard Dave Bouvier John Chilton Sam Guerler Martin Čech Enis Afgan Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Emory University, and the Pennsylvania Department of Public Health Nitesh Turaga The “Core” Galaxy Team

Slide 123

Slide 123 text

Björn Grüning Uni Freiburg Peter Cock TJHI Kyle Ellrott UCSC Eric Rasche CPT Nicola Soranzo TGAC Brad Chapman HSPH Nuwan Goonasekera VeRSI Yousef Kowsar VLSCI Extended team and other contributors… And many others who have contributed to the main Galaxy code, tools to the ToolShed, participated in discussions, attended the Galaxy conferences, …

Slide 124

Slide 124 text

Galaxy is a community! Join us on irc, mailing lists, Galaxy Biostar Contribute code on bitbucket, github, or the ToolShed Join us for a Hackathon or our annual conference Fifth annual Galaxy Community Conference Hackathon, training day, and two days of talks