Presentation on scaling Galaxy, particularly from the UI perspective, at Molecular Medicine Tri-con 2015 session "Large-scale genomics data transfer, analysis, and storage".
reusability/generalizability, or correctness. Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced (given the data). Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009: 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012: 7/50 resequencing experiments reproducible). Missing software, versions, parameters, data…
Iorns, Elizabeth (2014): Unique Identification of Research Resources in Studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130 — 32/127 tools, 6/41 papers
accessible to scientists? How best to facilitate transparent communication of computational analyses? How best to ensure that analyses are reproducible?
tools, compute resources, terabytes of reference data, and permanent storage. Open-source software that makes integrating your own tools and data and customizing for your own site simple. An open, extensible platform for sharing tools, datatypes, workflows, ...
tracks details. Workflow system for complex analyses, constructed explicitly or automatically. Pervasive sharing and publication of documents with integrated analyses.
and modify workflows, not just run existing best-practice pipelines. The Galaxy workflow editor supports this use case well, providing ways for users to easily construct and modify workflows.
as they become available rather than requiring the development of new pipelines. The exome and transcriptome analysis pipelines require vastly more time and computing resources than the variant analysis pipeline: the exome/transcriptome processing pipelines require about a day to complete on a small computing cluster, while the integrated variant analysis pipeline can be run in less than an hour. Also, there are established protocols for exome and transcriptome processing but less so for variant analysis. Hence, by splitting the pipelines up as we have and putting the pipelines in Galaxy, it is simple and fast to experiment with different settings in the variant analysis pipeline and find settings that are most useful for a particular set of samples. Results: Validation using cell line data. To validate our pipelines, we analyzed targeted exome and whole transcriptome sequencing data from three well-characterized pancreatic cancer cell lines: MIA PaCa2 (MP), HPAC, and PANC-1. Exonic regions of 577 genes that are commonly included in cancer gene panels were sequenced. All three cell lines are included in the Cancer Cell Line Encyclopedia (CCLE) [15]; the CCLE includes a mutational profile for known oncogenes and drug response information for each cell line. The goal of this analysis is to use our pipelines to process the cell line data. [Figure 2. Galaxy Circos plot showing data produced from (A) exome and transcriptome analysis of the MIA PaCa2 cell line and (B) transcriptome analysis of a pancreatic adenocarcinoma tumor. Starting at the innermost track, the data are: (i) mapped read coverage; (ii) mapped read coverage after PCR duplicates removed; (iii) called variants; (iv) rare and deleterious variants; (v) rare, deleterious, and druggable variants; (vi) rare and deleterious variants.] Figure 2A shows an interactive Galaxy Circos plot of data generated from analysis of the MIA PaCa2 cell line. (Goecks et al., Cancer Medicine, 2015)
etc. Backed by version control, a complete version history is retained for everything that passes through the ToolShed. Galaxy instance admins can install tools directly from the ToolShed using only a web UI. Support for recipes for installing the underlying software that tools depend on (also versioned).
Tool Development: Planemo. Command-line tools to aid development.
◦ Test tools quickly without worrying about configuration files.
◦ Check tools for common bugs and best practices.
◦ Optimized publishing to the ToolShed.
◦ Testbed for new dependency management: Homebrew and homebrew-science.
(John Chilton)
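A typical development loop with these commands might look like the following illustrative CLI sketch (the tool file name is hypothetical; consult the Planemo documentation for exact flags):

```
# Check the tool wrapper for common bugs and best-practice violations
planemo lint my_tool.xml

# Run the tool's embedded test cases, no Galaxy config files needed
planemo test my_tool.xml

# Spin up a throwaway Galaxy instance with the tool loaded for interactive checks
planemo serve my_tool.xml

# Publish the validated tool to the ToolShed
planemo shed_update --shed_target toolshed
```

The point is that each step works directly on the tool definition, so developers never hand-edit a full Galaxy configuration just to iterate on one wrapper.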
cloud Ying Zhang1*, Jesse Erdmann1, John Chilton1, Getiria Onsongo1, Matthew Bower2,3, Kenny Beckman4, Bharat Thyagarajan5, Kevin Silverstein1, Anne-Francoise Lamblin1, the Whole Galaxy Team at MSI1. From Beyond the Genome 2012, Boston, MA, USA, 27-29 September 2012. The development of next-generation sequencing (NGS) technology opens new avenues for clinical researchers to make discoveries, especially in the area of clinical diagnostics. However, combining NGS and clinical data presents two challenges: first, the accessibility to clinicians of the computing power needed to analyze high volumes of NGS data; and second, the stringent requirements of accuracy and patient information data governance in a clinical setting. Cloud computing is a natural fit for addressing the computing power requirements, while Clinical Laboratory Improvement Amendments (CLIA) certification provides a baseline standard for meeting the demands on researchers in working with clinical data. Combining a cloud-computing environment with CLIA certification presents its own challenges due to the level of control users have over the cloud environment and CLIA's stability requirements. We have bridged this gap by creating a locked virtual machine with a pre-defined and validated set of workflows. This virtual machine is created using our Galaxy VM launcher tool to instantiate a Galaxy [http://www.usegalaxy.org] environment at Amazon. Patient samples were analyzed using customized hybrid-capture bait libraries to boost read coverage in low-coverage regions, followed by targeted enrichment sequencing at the BioMedical Genomics Center. The NGS data is imported to a tested Galaxy single nucleotide polymorphism (SNP) detection workflow in a locked Galaxy virtual machine on Amazon's Elastic Compute Cloud (EC2).
This project illustrates our ability to carry out CLIA-certified NGS analysis in the cloud, and will provide valuable guidance in any future implementation of NGS analysis involving clinical diagnosis. Author details 1Research Informatics Support System, Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA. 2Division of Genetics and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA. 3Molecular Diagnostics Laboratory, University of Minnesota Medical Center- Fairview, University of Minnesota, Minneapolis, MN 55455, USA. 4BioMedical Genomics Center, University of Minnesota, Minneapolis, MN 55455, USA. 5Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA. Published: 1 October 2012 Zhang et al. BMC Proceedings 2012, 6(Suppl 6):P54 http://www.biomedcentral.com/1753-6561/6/S6/P54 CLIA-certified Galaxy pipelines using virtual machines (Minnesota Supercomputing Institute)
interactive computing and data analysis resources on demand; researchers can create their own “private computing system” within Jetstream. Two widely used biology platforms will be supported: Galaxy and iPlant. Allows users to preserve VMs with Digital Object Identifiers (DOIs), which enables sharing of results, reproducibility of analyses, and new analyses of published research data.
What is Docker? [figure: traditional virtual machine vs. Docker architecture] The kernel is shared between containers; Docker achieves the isolation and management benefits of VMs but is much more lightweight and efficient.
containers, called a Dockerfile. Where VMs are typically a black box, the Dockerfile allows inspection of exactly how the container was created, leading to greater transparency.
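A minimal Dockerfile sketch illustrates this transparency: every layer of the resulting image is spelled out as an explicit build step (the base image and packaged tool here are illustrative, not a specific Galaxy tool's recipe):

```dockerfile
# Base image is recorded explicitly, so the starting point is auditable
FROM ubuntu:14.04

# Every installed dependency is a visible, versioned build step
RUN apt-get update && apt-get install -y samtools

# The exact command the container runs is declared up front
CMD ["samtools", "--version"]
```

Anyone can read this file to see precisely how the container was built, or rebuild it from scratch, which is the transparency argument made above.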
by a Docker container. Tool execution is potentially more secure due to isolation. Easier for tool developers to package dependencies; much easier for end users to get dependencies.
tools are often sufficient. For informaticians, Galaxy provides an extensive API and wrappers (e.g. BioBlend). But many users can do some programming and would like the benefits of Galaxy with the flexibility to do some scripting.
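Scripting against Galaxy boils down to authenticated REST calls under `/api/`, which wrappers like BioBlend hide behind Python methods. A minimal sketch of the underlying request shape, assuming a hypothetical helper (not part of BioBlend) that builds the URL with the per-user API key:

```python
from urllib.parse import urlencode, urljoin


def galaxy_api_url(base_url, endpoint, api_key, **params):
    """Build a Galaxy REST API request URL.

    Galaxy's API lives under /api/; requests are authenticated
    with a per-user key passed as the `key` query parameter.
    """
    query = urlencode({"key": api_key, **params})
    return urljoin(base_url.rstrip("/") + "/", "api/" + endpoint) + "?" + query


# e.g. list a user's histories (key value is a placeholder):
url = galaxy_api_url("https://usegalaxy.org", "histories", "SECRET")
# -> https://usegalaxy.org/api/histories?key=SECRET
```

With BioBlend the same call is a one-liner against a `GalaxyInstance` object, but the REST surface above is what both the wrappers and the Galaxy UI itself ultimately speak.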
RStudio environments. Interactive programming environments as first-class citizens: full provenance tracking, established inputs and outputs, usable in workflows, etc. Databases as first-class citizens, e.g. the GEMINI query interface as a reusable tool.
for users without informatics expertise. Can we scale this user interface to the analysis of hundreds of samples while maintaining interface idioms and usability?
be mapped across the entire collection. Existing tools that support multiple inputs and one output act as reducers. Many existing tools just work, but “structured” collections like “paired” need explicit support in tools.
can be nested to arbitrary depth; structure is preserved through mapping. More complex reductions and other collection operations are in progress. Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming).
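The map/reduce semantics described above can be sketched in plain Python. This is a conceptual model of collection operations, not Galaxy's actual scheduler: a single-input tool is mapped element-wise across a collection (recursing to preserve nesting), while a many-inputs/one-output tool acts as a reducer. File names are hypothetical.

```python
def map_over_collection(tool, collection):
    """Apply a one-input tool to every element, preserving nested structure."""
    return [map_over_collection(tool, el) if isinstance(el, list) else tool(el)
            for el in collection]


def reduce_collection(tool, collection):
    """A tool taking many inputs and producing one output acts as a reducer."""
    return tool(collection)


# A nested ("list of paired") collection of hypothetical sample files:
samples = [["a_fwd.fq", "a_rev.fq"], ["b_fwd.fq", "b_rev.fq"]]

# Mapping preserves the paired structure:
trimmed = map_over_collection(lambda f: f + ".trimmed", samples)
# -> [['a_fwd.fq.trimmed', 'a_rev.fq.trimmed'],
#     ['b_fwd.fq.trimmed', 'b_rev.fq.trimmed']]

# A multi-input, single-output step collapses the collection:
merged = reduce_collection(lambda files: "merged.bam", trimmed)
```

This is why "many existing tools just work": a tool written for one dataset never sees the collection at all, while structured collections like "paired" only need explicit tool support when the tool itself must distinguish forward from reverse reads.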
dropped into a Galaxy instance. Typically a simple server-side template to bootstrap a client-side visualization. Framework for serving data sliced and aggregated in various ways. Adaptor for BioJS visualizations in progress. Linked visualizations on related data.
much larger analyses that can now be constructed in the UI (ongoing). Increasing complexity and control over how datasets are used. Federation between Galaxy instances; support for transparently accessing data from other APIs.
analysis accessible and reproducible. Nearly everything in Galaxy is “pluggable”, allowing it to be customized for myriad purposes. New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability. By supporting and leveraging tool developers, the Galaxy community can collectively keep up with rapid changes in available tools.
Galaxy Biostar. Contribute code on Bitbucket, GitHub, or the ToolShed. Join us for a hackathon or our annual conference: the fifth annual Galaxy Community Conference, with a hackathon, training day, and two days of talks.