reusability/generalizability, or correctness
Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced (given the data)
Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009: 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012: 7/50 resequencing experiments reproducible)
Missing software, versions, parameters, data…
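The items that are typically missing (software, versions, parameters, data) can be captured in a small machine-readable record. The following is a minimal sketch only: `provenance_record`, the tool name, and the parameter names are hypothetical, and this is not how Galaxy itself stores provenance.

```python
import hashlib
import json


def provenance_record(tool, version, parameters, data):
    """Describe one analysis step (hypothetical helper, not a Galaxy API).

    `data` is the raw input as bytes; only its SHA-256 digest is stored,
    so the record identifies the exact input without copying it.
    """
    return {
        "tool": tool,
        "version": version,
        # Sort parameter names so equal settings give an identical record.
        "parameters": dict(sorted(parameters.items())),
        "input_sha256": hashlib.sha256(data).hexdigest(),
    }


# Illustrative values only; the tool name and parameters are invented.
record = provenance_record(
    tool="example-aligner",
    version="1.2.3",
    parameters={"seed": 42, "max_mismatches": 2},
    data=b"ACGTACGT\n",
)
print(json.dumps(record, indent=2))
```

Storing a digest rather than the data keeps the record small while still letting a reader confirm that they are rerunning the analysis on the same input.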
papers describing algorithms that take gene-expression data from a cancer cell and predict whether the cancer will be sensitive to a particular therapy
Duke began three clinical trials based on the technology, enrolling 110 patients
Coombes demonstrate that the findings cannot be replicated
Long and difficult fight to get this acknowledged, followed by a series of investigations
So far, ten major paper retractions, all trials cancelled, two lawsuits ongoing…
for the method be provided
Not only could they not replicate the results, the software produced substantially different predictions when run again on the same data!
Some scores changed from 5% to 95%, classifications changed ~25% of the time!
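A rerun check of this kind (same software, same data, compare outputs) can be sketched in a few lines. The slides do not say why the software's output changed between runs; the unseeded random step below is purely a stand-in, and both predictors are hypothetical.

```python
import hashlib
import random


def output_digest(predict, data):
    """Hash the predictions so two runs can be compared exactly."""
    text = ",".join(f"{p:.6f}" for p in predict(data))
    return hashlib.sha256(text.encode()).hexdigest()


def is_repeatable(predict, data, runs=3):
    """True if every run on the same data yields identical predictions."""
    digests = {output_digest(predict, data) for _ in range(runs)}
    return len(digests) == 1


def unseeded_predict(data):
    # Hypothetical predictor that draws fresh random numbers on each call.
    rng = random.Random()
    return [x * rng.random() for x in data]


def seeded_predict(data, seed=0):
    # The same hypothetical predictor with a fixed seed.
    rng = random.Random(seed)
    return [x * rng.random() for x in data]


samples = [0.5, 1.5, 2.5]
print("seeded repeatable:", is_repeatable(seeded_predict, samples))
print("unseeded repeatable:", is_repeatable(unseeded_predict, samples))
```

Comparing digests rather than eyeballing scores makes even small run-to-run differences visible.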
tools, compute resources, terabytes of reference data and permanent storage
Open source software that makes integrating your own tools and data and customizing for your own site simple
An open extensible platform for sharing tools, datatypes, workflows, ...
be mapped across the entire collection
Existing tools that support multiple inputs and one output act as reducers
Many existing tools just work; but “structured” collections like “paired” need explicit support in tools
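The map and reduce behaviour described here can be illustrated with plain Python. This is a conceptual sketch, not Galaxy's implementation: a collection is modelled as a dict of named datasets, and `map_over`, `reduce_collection`, `count_reads`, and `total` are hypothetical stand-ins for tools.

```python
def map_over(tool, collection):
    """Apply a single-input tool to each element, keeping element names."""
    return {name: tool(dataset) for name, dataset in collection.items()}


def reduce_collection(tool, collection):
    """Feed all elements to a multi-input, single-output tool."""
    return tool(list(collection.values()))


def count_reads(dataset):
    # Stand-in for a tool with one input and one output.
    return len(dataset)


def total(values):
    # Stand-in for a tool with many inputs and one output.
    return sum(values)


samples = {
    "sample_a": ["ACGT", "TTGA", "CCAT"],
    "sample_b": ["GGCA"],
}
counts = map_over(count_reads, samples)       # one output per input element
print(counts)
print(reduce_collection(total, counts))       # one output for the whole collection
```

Neither stand-in tool knows about collections: the mapping and reducing happen around them, which is why many existing tools can work unchanged.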
can be nested to arbitrary depth, structure is preserved through mapping
More complex reductions, other collection operations in progress
Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming)
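Structure-preserving mapping over nested collections can be sketched with a short recursive function. This is an illustration under stated assumptions, not Galaxy code: nested collections are modelled as nested dicts, datasets as strings, and `map_nested`, `shape`, and `gc_count` are hypothetical names.

```python
def map_nested(tool, collection):
    """Apply `tool` to every leaf dataset, preserving the nesting."""
    if isinstance(collection, dict):
        return {name: map_nested(tool, child) for name, child in collection.items()}
    return tool(collection)


def shape(collection):
    """Describe the nesting only: element names at every level, no data."""
    if isinstance(collection, dict):
        return {name: shape(child) for name, child in collection.items()}
    return None


def gc_count(sequence):
    # Stand-in for a single-input tool: count G and C bases.
    return sum(1 for base in sequence if base in "GC")


# A list of pairs, modelled as a two-level dict.
paired_samples = {
    "patient_1": {"forward": "ACGTAC", "reverse": "GTACGT"},
    "patient_2": {"forward": "TTGACA", "reverse": "TGTCAA"},
}
gc = map_nested(gc_count, paired_samples)
print(gc)
print(shape(gc) == shape(paired_samples))
```

Because the recursion rebuilds each level with the same element names, the output collection has the same nesting as the input at any depth.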
etc.
Backed by version control, a complete version history is retained for everything that passes through the toolshed
Galaxy instance admins can install tools directly from the toolshed using only a web UI
Support for recipes for installing the underlying software that tools depend on (also versioned)
Tool Development
Planemo
Command-line tools to aid development.
◦ Test tools quickly without worrying about configuration files.
◦ Check tools for common bugs and best practices.
◦ Optimized publishing to the ToolShed.
◦ Testbed for new dependency management - Homebrew and Homebrew-science
cloud
Ying Zhang1*, Jesse Erdmann1, John Chilton1, Getiria Onsongo1, Matthew Bower2,3, Kenny Beckman4, Bharat Thyagarajan5, Kevin Silverstein1, Anne-Francoise Lamblin1, the Whole Galaxy Team at MSI1
From Beyond the Genome 2012 Boston, MA, USA. 27-29 September 2012
The development of next-generation sequencing (NGS) technology opens new avenues for clinical researchers to make discoveries, especially in the area of clinical diagnostics. However, combining NGS and clinical data presents two challenges: first, the accessibility to clinicians of sufficient computing power needed for the analysis of high volume of NGS data; and second, the stringent requirements of accuracy and patient information data governance in a clinical setting. Cloud computing is a natural fit for addressing the computing power requirements, while Clinical Laboratory Improvement Amendments (CLIA) certification provides a baseline standard for meeting the demands on researchers in working with clinical data. Combining a cloud-computing environment with CLIA certification presents its own challenges due to the level of control users have over the cloud environment and CLIA’s stability requirements. We have bridged this gap by creating a locked virtual machine with a pre-defined and validated set of workflows. This virtual machine is created using our Galaxy VM launcher tool to instantiate a Galaxy [http://www.usegalaxy.org] environment at Amazon with
patient samples were analyzed using customized hybrid-capture bait libraries to boost read coverage in low-coverage regions, followed by targeted enrichment sequencing at the BioMedical Genomics Center. The NGS data is imported to a tested Galaxy single nucleotide polymorphism (SNP) detection workflow in a locked Galaxy virtual machine on Amazon’s Elastic Compute Cloud (EC2).
This project illustrates our ability to carry out CLIA-certified NGS analysis in the cloud, and will provide valuable guidance in any future implementation of NGS analysis involving clinical diagnosis.
Author details
1Research Informatics Support System, Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA.
2Division of Genetics and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA.
3Molecular Diagnostics Laboratory, University of Minnesota Medical Center-Fairview, University of Minnesota, Minneapolis, MN 55455, USA.
4BioMedical Genomics Center, University of Minnesota, Minneapolis, MN 55455, USA.
5Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA.
Published: 1 October 2012
Zhang et al. BMC Proceedings 2012, 6(Suppl 6):P54 http://www.biomedcentral.com/1753-6561/6/S6/P54
CLIA-certified Galaxy pipelines using virtual machines (Minnesota Supercomputing Institute)
What is Docker? https://d
[Figure: Traditional Virtual Machine vs. Docker]
Kernel is shared between containers; achieves the isolation and management benefits of VMs but much more lightweight and efficient
tools are often sufficient
For informaticians, Galaxy provides an extensive API and wrappers (e.g. Bioblend)
But, many users can do some programming, would like the benefits of Galaxy with the flexibility to do some scripting
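As an illustration of scripting against the Galaxy API, the sketch below uses the BioBlend Python library (`pip install bioblend`) to list history names. It assumes a reachable Galaxy server and a valid API key; the helper names `connect` and `history_names` are hypothetical, and the server URL and key are supplied by the caller.

```python
def connect(galaxy_url, api_key):
    """Create a BioBlend client for one Galaxy server."""
    # Imported here so the module loads even where BioBlend is not installed.
    from bioblend.galaxy import GalaxyInstance

    return GalaxyInstance(url=galaxy_url, key=api_key)


def history_names(gi):
    """Return the names of the histories visible to this API key."""
    return [history["name"] for history in gi.histories.get_histories()]
```

With BioBlend installed, calling `history_names(connect(url, key))` with a real server URL and API key would return the names of that user's histories. Passing the client in as an argument also lets the function be exercised with a stand-in object when no server is available.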
RStudio environments
Interactive programming environments as first class citizens: full provenance tracking, establish inputs and outputs, be used in workflows, etc.
Databases as first class citizens, e.g. GEMINI query interface as a reusable tool
much larger analyses that can now be constructed in the UI (ongoing)
Increasing complexity and control over how datasets are used
Federation between Galaxy instances, support for transparently accessing data from other APIs
interactive computing and data analysis resources on demand; researchers can create their own “private computing system” within Jetstream
Two widely used biology platforms will be supported: Galaxy and iPlant
Allow users to preserve VMs with Digital Object Identifiers (DOIs), which enables sharing of results, reproducibility of analyses, and new analyses of published research data.
analysis accessible and reproducible
Nearly everything in Galaxy is “pluggable”, allowing it to be customized for myriad purposes
New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability
By supporting and leveraging tool developers, the Galaxy community can collectively keep up with rapid changes in available tools
James Taylor Dave Clements Jennifer Jackson Engineering Support and outreach Custodians Carl Eberhard Dave Bouvier John Chilton Sam Guerler Martin Čech Enis Afgan Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Emory University, and the Pennsylvania Department of Public Health Nitesh Turaga The “Core” Galaxy Team
Eric Rasche (CPT), Nicola Soranzo (TGAC), Brad Chapman (HSPH), Nuwan Goonasekera (VeRSI), Yousef Kowsar (VLSCI)
Extended team and other contributors…
And many others who have contributed to the main Galaxy code, tools to the ToolShed, participated in discussions, attended the Galaxy conferences, …
Galaxy Biostar
Contribute code on Bitbucket, GitHub, or the ToolShed
Join us for a Hackathon or our annual conference
Fifth annual Galaxy Community Conference: Hackathon, training day, and two days of talks