The Story of Galaxy, NSF SI2 PI Meeting Keynote

The story of Galaxy www.galaxyproject.org @jxtx / #usegalaxy

My Story
• 2000: BS in CS, concentration in software engineering, XP fanatic
• 2000-2003: Series of small Internet startups (healthcare, publishing)
• 2003: Vowed never to develop software frameworks for other people to use ever again; devote life to science
• 2003-2006: Graduate school in CS at Penn State, using machine learning to identify cis-regulatory elements and understand transcriptional regulation
• 2005: Saw how difficult it was for computational and biological researchers to interact around data-intensive problems; sucked back into building tools
• 2006-present: Building tools and infrastructure for researchers, while trying to maintain a research program in chromatin structure and gene regulation

Why Galaxy?

Map of high-throughput sequencing instruments (http://pathogenomics.bham.ac.uk/hts/)

• Resequencing
• De novo genome sequencing
• Direct RNA sequencing
• Open chromatin assays (DNase, FAIRE)
• Transcription factors (ChIP-seq)
• Histone variants (ChIP-seq, MNase-seq)
• Long-range interactions (5C, Hi-C, ChIA-PET)
• Methylation (Bisulfite-seq)

Biology has rapidly become data intensive, and dependent on computational methods
How can we ensure that these methods are accessible to researchers?
...while also ensuring that scientific results remain reproducible?

A crisis in genomics research: reproducibility

Are we capturing analysis details precisely?

Microarray Experiment Reproducibility
• 18 Nat. Genetics microarray gene expression experiments
• Less than 50% reproducible
• Problems
  • missing data (38%)
  • missing software, hardware details (50%)
  • missing method, processing details (66%)
Ioannidis, J.P.A. et al. Repeatability of published microarray gene expression analyses. Nat Genet 41, 149-155 (2009)

NGS Re-sequencing Experiment Reproducibility

• Consider a sample of 50 papers from 2011 that used bwa for mapping (of ~380 published):
  • 36 did not provide primary data (all but 2 provided it upon request)
  • 31 provide neither parameters nor versions
  • 19 provide settings only and 8 list versions only
  • Only 7 provide all details

These details matter!
[Figure 1: Frequency fluctuation for site 8992. Y-axis: frequency, 0.480 to 0.510. X-axis labels: 5.2 through 6.1. Settings compared: Default, -n 3, -q 15, -n 3 -q 15]

Even when methods are captured precisely, are they followed by
others?

• 19 publications in 2011 described genome- or exome-level re-sequencing for polymorphism discovery and cited the 1000 genomes analysis
• Only 5 explicitly use the methods described in that paper

1000 genomes analysis methods precisely documented, but few papers citing
these methods actually follow them

PERSPECTIVES
APPLICATIONS OF NEXT-GENERATION SEQUENCING: OPINION
Next-generation sequencing data interpretation: enhancing reproducibility and accessibility
Anton Nekrutenko and James Taylor

Abstract | Areas of life sciences research that were previously distant from each other in ideology, analysis practices and toolkits, such as microbial ecology and personalized medicine, have all embraced techniques that rely on next-generation sequencing instruments. Yet the capacity to generate the data greatly outpaces our ability to analyse it. Existing sequencing technologies are more mature and accessible than the methodologies that are available for individual researchers to move, store, analyse and present data in a fashion that is transparent and reproducible. Here we discuss currently pressing issues with analysis, interpretation, reproducibility and accessibility of these data, and we present promising solutions and venture into potential future developments.

Today, one often hears that life sciences are faced with the ‘big data problem’. However, data are just a small facet of a much bigger challenge. The true difficulty is that most biomedical researchers have no capacity [...] studies, such as the recently published genomic profiling of a human individual whose genome was sequenced and gene expression tracked over an extended period in a series of RNA-seq experiments [7]. As a [...] analysis transparency and reproducibility. To give the reader a sense of immediate urgency, we survey a number of recent studies that use NGS technologies and that show the lack of general agreement on how data analyses are to be carried out. We specifically highlight the fact that very few current studies record exact details of their computational experiments, making it difficult for others to repeat them.

Adoption of existing analysis practices
As mentioned above, there are numerous applications of NGS technologies. Yet there are common analysis challenges among all of these applications. Here we use one type of NGS application, variant discovery, as an example. In this analysis, which is becoming common in medical genetics and serves as the foundation for future personalized medicine, genomic DNA is sequenced, and the resulting data are compared against a reference sequence to catalogue differences: such differences can range from SNPs to complex chromosomal rearrangements. A series of accepted practices for variant discovery is starting to emerge owing to efforts such as the 1000 Genomes Project [8] (also see BOX 1). One would expect that these approaches will be widely used in studies [...]

Just capturing methods doesn’t lead to reuse
We need to both make it easy to precisely capture and communicate all of the details necessary to reproduce an analysis, and facilitate the development and sharing of reusable best practices
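
To make "precisely capture all of the details" concrete, here is a minimal Python sketch of a record for one analysis step: tool, exact version, every parameter, and a checksum of each input. It is an illustration under stated assumptions, not how Galaxy stores provenance. The names `AnalysisStep` and `record_step`, the version string, and the tiny FASTQ record are all made up; only the tool name `bwa` and the settings `-n 3 -q 15` come from the slides above, and the rendered command line is a simplified layout, not the tool's real invocation.

```python
import hashlib
import json
import shlex
from dataclasses import asdict, dataclass, field


@dataclass
class AnalysisStep:
    """One tool invocation, recorded with enough detail to repeat it."""

    tool: str          # name of the command-line tool
    version: str       # exact version string of the tool that was run
    parameters: dict   # every setting used, keyed by flag
    input_checksums: dict = field(default_factory=dict)  # input name -> SHA-256 hex digest

    def command_line(self) -> str:
        """Render tool and parameters as a shell-quoted string (illustrative layout)."""
        parts = [self.tool]
        for flag, value in sorted(self.parameters.items()):
            parts += [flag, str(value)]
        return " ".join(shlex.quote(part) for part in parts)


def record_step(tool, version, parameters, inputs):
    """Build a provenance record; `inputs` maps a name to the raw bytes of that input."""
    checksums = {name: hashlib.sha256(data).hexdigest() for name, data in inputs.items()}
    return AnalysisStep(tool, version, dict(parameters), checksums)


# Placeholder values: the version string and the FASTQ record are hypothetical.
step = record_step(
    tool="bwa",
    version="hypothetical-version",
    parameters={"-n": 3, "-q": 15},
    inputs={"reads.fastq": b"@read1\nACGT\n+\nIIII\n"},
)
record = json.dumps(asdict(step), indent=2, sort_keys=True)
print(record)
print(step.command_line())
```

A record like this addresses the three gaps counted in the survey above: missing primary data (the checksum identifies the input), missing versions, and missing parameters. Storing it as plain JSON keeps it easy to share alongside a publication.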

What is Galaxy?

Galaxy: accessible analysis system

• Describe analysis tool behavior abstractly
• Analysis environment automatically and transparently tracks details
• Workflow system for complex analysis, constructed explicitly or automatically
• Pervasive sharing, and publication of documents with integrated analysis
• Visualization and visual analytics

History

Resources
GALA, a Database for Genomic Sequence Alignments and Annotations
Belinda Giardine,1 Laura Elnitski,1,2 Cathy Riemer,1 Izabela Makalowska,4 Scott Schwartz,1 Webb Miller,1,3,4 and Ross C. Hardison2,4,5
Departments of 1Computer Science and Engineering, 2Biochemistry and Molecular Biology, 3Biology, and 4Huck Institute for Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA

We have developed a relational database to contain whole genome sequence alignments between human and mouse with extensive annotations of the human sequence. Complex queries are supported on recorded features, both directly and on proximity among them. Searches can reveal a wide variety of relationships, such as finding all genes expressed in a designated tissue that have a highly conserved noncoding sequence 5′ to the start site. Other examples are finding single nucleotide polymorphisms that occur in conserved noncoding regions upstream of genes and identifying CpG islands that overlap the 5′ ends of divergently transcribed genes. The database is available online at http://globin.cse.psu.edu/ and http://bio.cse.psu.edu/.

The determination and annotation of complete genomic DNA sequences provide the opportunity for unprecedented advances in our understanding of evolution, genetics, and physiology, but the amount and diversity of data pose daunting challenges as well. Three excellent browsers provide access to the sequence and annotations of the human genome, viz., the human genome browser (HGB) at UCSC (Kent et al. 2002) (http://genome.ucsc.edu/), Map Viewer at the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/), and Ensembl (Hubbard et al. 2002) at the Sanger Centre (http://www.sanger.ac.uk/) and EBI (http://www.ebi.ac.uk/). These sites show known and predicted genes, repetitive elements, genetic markers, and many other types of information as separate tracks in a display using [...] combined with sequence conservation can refine predictions of functional sequences (Levy et al. 2001). One way to do this is to record both extensive annotations and sequence alignments in a database. We have developed a database of genomic DNA sequence alignments and annotations, called GALA, to search across tracks of data supplied by current browsers, alignment resources, and databases. Output from GALA queries can be viewed as tracks on the Human Genome Browser, as a table of data with hyperlinks, or as text. When users of the GALA database wish to view alignments, they can choose to display their query output using local alignments with Java (Laj), which is a versatile, interactive alignment viewer implemented using Java (Wilson et al. 2001).

The story of Galaxy starts in 2003 with the Genome Annotation and Alignment DB

[Figure panels A, B, C] GALA enables querying annotation information from the human genome alongside alignments with the mouse genome, integrates with the UCSC browser, and allows building up set queries using the results of previous queries (the birth of the History system)

Galaxy began with a simple idea: can we extend GALA to enable other types of analysis?

Galaxy prototype as a single Perl script (~2005)

We threw the first one away (quickly) and rewrote from scratch in Python
At this point we made several key design decisions that (in hindsight) determined whether we would succeed or fail
(We got very lucky)

1. No longer store data in a database, but in flat files in various common formats
This meant existing tools could be integrated easily, because they did not need to change the data formats they work with or interact with a database
It also meant that when high-throughput sequence data suddenly came along, we were prepared to deal with data at that scale easily

2. Rather than build new analysis tools in the system, build an abstract, configuration-driven interface to command-line tools
We did this to make our lives easier: we had many analysis tools lying around that we didn’t want to rewrite for Galaxy
But this was equally appealing to other developers, who could now easily make their tools available to biologists

Early Pythonic Age Galaxy (mid 2005): Built around existing command
line tools

3. Make the entire stack self-contained, allowing a complete Galaxy to be set up on most systems in minutes
We primarily did this to engage tool developers, making it as easy as possible to develop new tool wrappers for contribution
We envisioned those tools would all be made available through the main Galaxy service
But it also provided a scaling strategy, making it very easy for sites to run their own Galaxy

4. Open-source and openly developed from the first commit
Provide everything we do under a liberal open-source license (no copyleft), and only support open-source tools on the main instance
Our primary development repository is exposed to the public, initially hosted by us but later moved out to a third party (bitbucket.org)
The software is distributed only through version control, with a rapid release cycle (at least monthly)
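
Design decision 2 above (an abstract, configuration-driven interface to command-line tools) can be sketched in a few lines of Python. This is a minimal illustration of the idea, not Galaxy's actual tool description format or API: the dictionary layout, the `$name` placeholder convention, and the functions `build_command` and `run_tool` are all hypothetical. The wrapped "tool" is a one-line Python script so the sketch runs without any external program.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical tool description: plain data, with no tool-specific code in the framework.
COUNT_LINES = {
    "id": "count_lines",
    "label": "Count lines, optionally skipping header lines",
    "command": [
        sys.executable, "-c",
        "import sys; print(max(0, sum(1 for _ in open(sys.argv[1])) - int(sys.argv[2])))",
        "$input", "$skip",
    ],
    "params": {
        "input": {"type": "path"},
        "skip": {"type": "int", "default": 0},
    },
}

# Converters used to validate each declared parameter type.
CASTS = {"path": str, "int": int}


def build_command(tool, values):
    """Validate user-supplied values against the description and fill the template."""
    unknown = set(values) - set(tool["params"])
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {}
    for name, spec in tool["params"].items():
        if name in values:
            raw = values[name]
        elif "default" in spec:
            raw = spec["default"]
        else:
            raise ValueError(f"missing required parameter: {name}")
        resolved["$" + name] = str(CASTS[spec["type"]](raw))
    # Template entries that name a parameter are replaced; everything else is literal.
    return [resolved.get(part, part) for part in tool["command"]]


def run_tool(tool, values):
    """Run the described tool on flat-file input and return its standard output."""
    completed = subprocess.run(
        build_command(tool, values), capture_output=True, text=True, check=True
    )
    return completed.stdout.strip()


# Demo on a small flat file: one header line followed by two data lines.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("header\nrow1\nrow2\n")
    demo_path = handle.name

default_count = run_tool(COUNT_LINES, {"input": demo_path})             # default skip
skipped_count = run_tool(COUNT_LINES, {"input": demo_path, "skip": 1})  # override skip
os.remove(demo_path)
print(default_count, skipped_count)
```

Because the description is data, a framework can generate a form from `params`, validate input, and log the exact command without knowing anything about the tool itself. This is also why decision 1 matters: the tool reads an ordinary file and never needs to talk to a database.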

2006-2013: Evolution
• ~2007: Workflows
• ~2008: Data libraries, role-based security
• ~2009: Sample tracking (“LIMS”)
• ~2010: Sharing, tagging, annotation, pages
• ~2011: Visualization and visual analytics

Growing a community

Engaging users

Be useful!
Be useful as quickly as possible
Don’t build anything if it isn’t useful
Minimize barriers to entry
(Running a free web service really helps here: new users can start working with their data immediately)

Be passionate and evangelize
We were trying to solve a serious problem and save the world (and we still are)
Go to tons of scientific meetings, organize workshops wherever you can
Above all, stay close to the science. Use your own tools to do science and present on that. Let the science drive everything.

Resource
A framework for collaborative analysis of ENCODE data: Making large-scale analyses biologist-friendly
Daniel Blankenberg, James Taylor, Ian Schenck, Jianbin He, Yi Zhang, Matthew Ghent, Narayanan Veeraraghavan, Istvan Albert, Webb Miller, Kateryna D. Makova, Ross C. Hardison, and Anton Nekrutenko1
Center for Comparative Genomics and Bioinformatics, Huck Institutes of the Life Sciences, Penn State University, University Park, Pennsylvania 16802, USA

The standardization and sharing of data and tools are the biggest challenges of large collaborative projects such as the Encyclopedia of DNA Elements (ENCODE). Here we describe a compact Web application, Galaxy2ENCODE, that effectively addresses these issues. It provides an intuitive interface for the deposition and access of data, and features a vast number of analysis tools including operations on genomic intervals, utilities for manipulation of multiple sequence alignments, and molecular evolution algorithms. By providing a direct link between data and analysis tools, Galaxy2ENCODE allows addressing biological questions that are beyond the reach of existing software. We use Galaxy2ENCODE to show that the ENCODE regions contain >2000 unannotated transcripts under strong purifying selection that are likely functional. We also show that the ENCODE regions are representative of the entire genome by estimating the rate of nucleotide substitution and comparing it to published data. Although each of these analyses is complex, none takes more than 15 min from beginning to end. Finally, we demonstrate how new tools can be added to Galaxy2ENCODE with almost no effort. Every section of the manuscript is supplemented with QuickTime screencasts. Galaxy2ENCODE and the screencasts can be accessed at http://g2.bx.psu.edu.
[Supplemental material is available online at www.genome.org and http://g2.bx.psu.edu.]

Analysis of data generated by The ENCODE Project Consortium (2004) for the Encyclopedia of DNA Elements (ENCODE) is proving to be one of the most exciting collaborative events of the post-genomic era. The interpretation of enormous amounts of data generated by the ENCODE Consortium requires new methodologies for the sharing and standardization of data and new analysis tools. The system we describe here, Galaxy2ENCODE (http://g2.bx.psu.edu), is the first attempt to solve data and tool integration challenges for ENCODE-like projects and make data [...] (ESTs) in ENCODE regions. We show that over 2000 ESTs do not correspond to any annotated genes, yet show strong signature of purifying selection, indicating possible function. In the second example, we estimate the rate of nucleotide substitutions in ENCODE regions and demonstrate that it is consistent with genome-wide estimates. The two analyses are designed as “cookbook” examples for two distinct audiences. The first analysis is geared toward researchers studying the structure and function of the human genome. The second example is for researchers [...]

As part of the ENCODE pilot project, we used Galaxy for several evolutionary analyses

SOFTWARE Open Access
Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences
Jeremy Goecks1, Anton Nekrutenko2*, James Taylor1*, The Galaxy Team

Abstract
Increased reliance on computational approaches in the life sciences has revealed grave concerns about how accessible and reproducible computation-reliant results truly are. Galaxy (http://usegalaxy.org), an open web-based platform for genomic research, addresses these problems. Galaxy automatically tracks and manages data provenance and provides support for capturing the context and intent of computational methods. Galaxy Pages are interactive, web-based documents that provide users with a medium to communicate a complete computational analysis.

Rationale
Computation has become an essential tool in life science research. This is exemplified in genomics, where first microarrays and now massively parallel DNA sequencing have enabled a variety of genome-wide functional assays, such as ChIP-seq [1] and RNA-seq [2] (and many others), that require increasingly complex analysis tools [3]. However, sudden reliance on computation has created an ‘informatics crisis’ for life science researchers: computational resources can be difficult to use, and ensuring that computational experiments are communicated well and hence reproducible is challenging. Galaxy helps to address this crisis by providing an open, web-based platform for performing accessible, reproducible, and transparent genomic science. The problem of accessibility of computational tools has long been recognized. Without programming or informatics expertise, scientists needing to use computational approaches are impeded by problems ranging from tool installation; to determining which parameter values to use; to efficiently combining multiple tools together in an analysis chain. The severity of these problems is evidenced by the numerous solutions to address them.
Tutorials [4,5], software libraries such as Bioconductor [6] and Bioperl [7], and web-based interfaces for tools [8,9] all improve the accessibility of computation. These approaches each have advantages, but do not offer a general solution that enables a computational tool to be easily included in an analysis chain and run by scientists without programming experience. However, making tools accessible does not necessarily address the crucial problem of reproducibility. Reproducing experimental results is an essential facet of scientific inquiry, providing the foundation for understanding, integrating, and extending results toward new discoveries. Learning a programming language might enable a scientist to perform a given analysis, but ensuring that analysis is documented in a form another scientist can reproduce requires learning and practicing software engineering skills. (Note that neither programming nor software engineering are included in a typical biomedical curriculum.) A recent investigation found that less than half of selected microarray experiments published in Nature Genetics could be reproduced. Issues that prevented reproduction included missing raw data, details in processing methods (especially computational ones), and software and hardware details [10]. Experiments that employ next-generation sequencing (NGS) will only exacerbate challenges in reproducibility due to a lack of standards, exceedingly large dataset sizes, and increasingly complex computational tools. In addition, integrative experiments, which use multiple data sources and multiple computational tools in their analyses, further complicate reproducibility.
1Department of Biology and Department of Mathematics and Computer Science, Emory University, 1510 Clifton Road NE, Atlanta, GA 30322, USA. 2Center for Comparative Genomics and Bioinformatics, Penn State University, 505 Wartik Lab, University Park, PA 16802, USA.
Goecks et al. Genome Biology 2010, 11:R86 http://genomebiology.com/2010/11/8/R86
© 2010 Goecks et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Resource
Windshield splatter analysis with the Galaxy metagenomic pipeline
Sergei Kosakovsky Pond,1,2,6,9 Samir Wadhawan,3,6,7 Francesca Chiaromonte,4 Guruprasad Ananda,1,3 Wen-Yu Chung,1,3,8 James Taylor,1,5,9 Anton Nekrutenko,1,3,9 and The Galaxy Team1
1http://galaxyproject.org; 2Division of Infectious Diseases, Division of Biomedical Informatics, School of Medicine University of California San Diego, San Diego, California 92103, USA; 3Huck Institute for the Life Sciences, Penn State University, University Park, Pennsylvania 16803, USA; 4Department of Statistics, Penn State University, University Park, Pennsylvania 16803, USA; 5Departments of Biology and Mathematics & Computer Science, Emory University, Atlanta, Georgia 30322, USA

How many species inhabit our immediate surroundings? A straightforward collection technique suitable for answering this question is known to anyone who has ever driven a car at highway speeds. The windshield of a moving vehicle is subjected to numerous insect strikes and can be used as a collection device for representative sampling.
Unfortunately the analysis of biological material collected in that manner, as with most metagenomic studies, proves to be rather demanding due to the large number of required tools and considerable computational infrastructure. In this study, we use organic matter collected by a moving vehicle to design and test a comprehensive pipeline for phylogenetic profiling of metagenomic samples that includes all steps from processing and quality control of data generated by next-generation sequencing technologies to statistical analyses and data visualization. To the best of our knowledge, this is also the first publication that features a live online supplement providing access to exact analyses and workflows used in the article.
[Supplemental material is available online at http://www.genome.org. All data and tools described in this manuscript can be downloaded or used directly at http://galaxyproject.org. Exact analyses and workflows used in this paper are available at http://usegalaxy.org/u/aun1/p/windshield-splatter.]

Metagenomics is often thought of as an exclusively microbial enterprise, as one of the field’s seminal papers was titled “Metagenomics: application of genomics to uncultured microorganisms” (Handelsman 2004). Because we simply do not know the number of bacterial taxa, the major motivation behind metagenomic studies was the need to estimate the biodiversity of various environments by direct sampling of potentially unculturable organisms (Beja et al. 2000, 2001; Tyson et al. 2004; Venter et al. 2004; DeLong 2005; Tringe et al. 2005; Gill et al. 2006; Poinar et al. 2006; von Mering et al. 2007). However, our understanding of eukaryotic diversity may not be much more advanced. Although the number of distinct eukaryotic (and, in particular, insect) taxa is likely far below microbial, the existing confusion about the species number is as striking. For example, Erwin (1982) obtained an estimate of 30 million insect species via extrapolation.
This figure was fiercely debated, and the latest calculations converge on an educated guess on the order of 10 million (May 1988; Erwin 1991; Mayr 1998; Odegaard 2000). If we assume that these estimates are correct, then only a minute number of insect species have been described to date. For example, as of February 2009 the taxonomy database at the National Center for Biotechnology Information (NCBI) lists 318,068 species from all branches of life. In this study we apply existing metagenomic methodologies to directly determine the taxonomic composition of biological matter collected by the front end of a moving vehicle. Although our specimen collection strategy is straightforward, we set ourselves the nontrivial task of taxonomic identification of collected species. Because morphological identification is precluded by the destructive nature of the collection procedure, only DNA sequence analysis is feasible making this study de facto metagenomic. Metagenomic methodology has been evolving rapidly in the past 5 yr, and now includes a diverse array of approaches for profiling (binning) of complex samples (for excellent reviews, see McHardy and Rigoutsos 2007; Raes et al. 2007; Kunin et al. 2008; Pop and Salzberg 2008). Classification procedures make use of multiple sequence features including GC content (Foerstner et al. 2005), oligonucleotide composition (McHardy et al. 2007; McHardy and Rigoutsos 2007; Chatterji et al. 2008), and codon usage bias (Noguchi et al. 2006). Homology-based methods compare sequence reads against existing protein markers (Baldauf et al. 2000; Ludwig and Klenk 2001; Rusch et al. 2007; Wu and Eisen 2008) or genomic data (Angly et al. 2006; DeLong et al. 2006; Poinar et al. 2006; Huson et al. 2007). For our study (a eukaryotic metagenome survey), a homology-based approach is more suitable, as we do not expect compositional properties (i.e., GC content) to be informative for, say, a particular family of insects.
In addition, because we expect high taxonomic complexity within our samples, the coverage of individual eukaryotic genomes will likely be small, rendering protein (gene)-based approaches useless. Hence our best chance for successful phylogenetic profiling of windshield samples is the approach used by Poinar et al. (2006) and Huson et al. (2007), which relies on the comparison of metagenomic reads against existing sequence databases.

6These authors contributed equally to this work. Present addresses: 7Department of Genetics, University of Pennsylvania Medical School, 415 Curie Blvd., Philadelphia, PA 19104, USA; 8Cold Spring Harbor Laboratory, One Bungtown Rd., Cold Spring Harbor, NY 11724, USA. 9Corresponding authors.
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.094508.109. Freely available online through the Genome Research Open Access option.
Genome Research 19:2144–2153 © 2009 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/09; www.genome.org

Metagenomics pipelines, windshield splatter analysis, and Galaxy Pages based interactive supplements

Nature Biotechnology, volume 29, number 11, November 2011, page 972

To the Editor: Continuing evolution of DNA sequencing has transformed modern biology. Lower sequencing costs coupled with novel sequencing-based assays have led to rapid adoption of next-generation sequencing across diverse areas of life sciences research [1–4]. Sequencing has moved out of the genome centers into core facilities and individual laboratories where any investigator can access it for modest and progressively declining cost. Although easy to generate in tremendous quantities, sequence data are still difficult to manage and analyze. Sophisticated informatics techniques and supporting infrastructure are needed to make sense of even conceptually simple sequencing experiments, let alone the more complex analysis techniques being developed. The most pressing challenge facing the sequencing community today is providing the informatics infrastructure and accessible analysis methods needed to make it possible for all investigators to realize the power of high-throughput sequencing to advance their research. A possible solution to this infrastructure challenge comes in the form of cloud computing, a model where computation and storage exist as virtual resources, accessed by means of the internet, which can be dynamically allocated and released as needed [5]. Where previously acquisition of large amounts of computing power required large initial and ongoing costs, the cloud model radically alters this by allowing computing resources and services to be acquired and paid for on demand. Importantly, cloud resources can provide storage and computation at far lower cost than dedicated resources for certain use cases. For several specific applications, effective use of cloud resources has already been demonstrated [6–8]. In general, however, cloud resources are not provided in a form that can be immediately used by a researcher without informatics expertise.
Several commercial vendors provide cloud-based sequence analysis services through the web that hide all complexity of the underlying infrastructure. Yet these contain limited sets of analysis tools, and because they are proprietary solutions, users must give up some control over their own data and risk becoming dependent on a single commercial service for continued data access and analysis. All ‘battle-tested’ next-generation sequencing analysis practices (e.g., analysis of human variation exemplified by the 1000 Genome Consortium publication [9]) are open source. One popular open-source platform that has made substantial progress toward making complex analysis available to researchers is Galaxy [10,11]. Galaxy enables users to perform analysis using nothing more than a web browser. The environment automatically and transparently tracks every detail of the analysis, allows the construction of complex workflows and permits the results to be documented, shared and published with complete provenance, guaranteeing transparency and reproducibility. Importantly, Galaxy is an extensible platform; nearly any software tool can easily be integrated into Galaxy, and there is an active community of developers ensuring the latest tools are wrapped and made available through the Galaxy Tool Shed (http://usegalaxy.org/community). Galaxy is provided as a free public service with which thousands of users perform hundreds of thousands of analyses each month. However, this free public resource cannot meet increasing demand without implementing limits on data transfer and computer usage, resulting [...]
Harnessing cloud computing with Galaxy Cloud

RESEARCH Open Access
Dynamics of mitochondrial heteroplasmy in three families investigated via a repeatable re-sequencing study
Hiroki Goto1†, Benjamin Dickins2†, Enis Afgan3, Ian M Paul4, James Taylor3*, Kateryna D Makova1* and Anton Nekrutenko2*

Abstract
Background: Originally believed to be a rare phenomenon, heteroplasmy (the presence of more than one mitochondrial DNA (mtDNA) variant within a cell, tissue, or individual) is emerging as an important component of eukaryotic genetic diversity. Heteroplasmies can be used as genetic markers in applications ranging from forensics to cancer diagnostics. Yet the frequency of heteroplasmic alleles may vary from generation to generation due to the bottleneck occurring during oogenesis. Therefore, to understand the alterations in allele frequencies at heteroplasmic sites, it is of critical importance to investigate the dynamics of maternal mtDNA transmission.
Results: Here we sequenced, at high coverage, mtDNA from blood and buccal tissues of nine individuals from three families with a total of six maternal transmission events. Using simulations and re-sequencing of clonal DNA, we devised a set of criteria for detecting polymorphic sites in heterogeneous genetic samples that is resistant to the noise originating from massively parallel sequencing technologies. Application of these criteria to nine human mtDNA samples revealed four heteroplasmic sites.
Conclusions: Our results suggest that the incidence of heteroplasmy may be lower than estimated in some other recent re-sequencing studies, and that mtDNA allelic frequencies differ significantly both between tissues of the same individual and between a mother and her offspring.
We designed our study in such a way that the complete analysis described here can be repeated by anyone either at our site or directly on the Amazon Cloud. Our computational pipeline can be easily modified to accommodate other applications, such as viral re-sequencing. Background The mitochondrial genome is maternally inherited and harbors 37 genes in a circular molecule of approxi- mately 16.6 kb that is present in hundreds to thousands of copies per cell [1] and has accumulated mutations at a rate at least an order of magnitude higher than its nuclear counterpart [2,3]. Frequently, more than one mtDNA variant is present in the same individual, a phenomenon called ‘heteroplasmy’ [4]. The mitochondrial genome is implicated in hundreds of diseases (over 200 catalogued at [5] as of mid-2010) with the majority of them caused by point mutations [6]. Multiple mtDNA mutations might also predispose one to common meta- bolic and neurological diseases of advanced age, such as diabetes as well as Parkinson’s and Alzheimer’s diseases [7]. Additionally, mtDNA mutations appear to have a role in cancer etiology [8]. Many disease-causing mtDNA variants are heteroplasmic and their clinical manifestation depends on the relative proportion of mutant versus normal mitochondrial genomes [7,9,10]. No effective treatment for genetic diseases caused by mtDNA mutations currently exists, placing great emphasis on reducing the occurrence and preventing the transmission of these mutations in human popula- tions [11]. There is therefore a pressing need to understand the biological mechanisms for the origin and * Correspondence: [email protected]; [email protected]; [email protected]. 
edu † Contributed equally 1The Huck Institutes of Life Sciences and Department of Biology, Penn State University, 305 Wartik Lab, University Park, PA 16802, USA 2The Huck Institutes for the Life Sciences and Department of Biochemistry and Molecular Biology, Penn State University, Wartik 505, University Park, PA 16802, USA Full list of author information is available at the end of the article Goto et al. Genome Biology 2011, 12:R59 http://genomebiology.com/2011/12/6/R59 © 2011 Goto et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Mitochondrial re-sequencing + cloud computing infrastructure

Training
Provide a wide variety of different training tools for different kinds of users
From traditional publications (e.g. Current Protocols), to online wiki-based tutorials, to integrated tutorials, to screencasts
Screencasts continue to be one of our most successful ventures

Always do live demos, and obsess about doing them perfectly
This is a fantastic form of dogfooding: preparing for demos has driven many usability and stability improvements for us

Invest in answering user questions
We run several mailing lists and track all emails
We try to ensure some response to all emails within two weeks
Everyone on the project (support personnel, developers, and PIs) participates in support and outreach activities

User List Developer List

Use good tools
We currently use Bitbucket, mailing lists, Redmine, MediaWiki, Buildbot, Trello

Hire a community organizer

Dave Clements, Galaxy Community Director

Engaging developers

Can be difficult: developers have a very low tolerance for difficulties before they just decide to do it themselves
Invest in lowering barriers to entry, make things as easy as possible, choose good simple technologies that are easy to learn
There will always be barriers; they need to be offset by clearly articulated and significant added value

Easy to turn off outside developers when a project is rapidly changing
Based on feedback, we now invest significant effort in documenting and communicating all changes
Specifically, the Galaxy Development News Brief

When all else fails, organize a workshop to engage developers directly
Our first attempt was at the American Society for Human Genetics meeting in 2008, with the specific goal of targeting tool developers in the human variation community
Free lunchtime workshop for ~25 developers, focused on understanding the architecture and integrating their own tools
Participants were bribed with t-shirts (Lesson: should have provided food and beer)

2010, the First Galaxy Developers conference held after Genome Informatics
at CSHL

2011, rebranded as the Galaxy Community Conference, organized largely by
the Netherlands Bioinformatics Center

2012, organized with several prominent institutions in the Chicago area,
first full Training Day

GCC attendance through the years
2010: 69
2011: 148
2012: 201

Participation outside the “core” team

[Text histogram: one row per contributor identity, value shown beside each bar]
[email protected] 439975
[email protected] 433227
[email protected] 236491
[email protected] 118044
[email protected] 106226
[email protected] 61444
rc 56488
[email protected] 46319
[email protected] 32294
[email protected] 30999
[email protected] 29337
[email protected] 24798
guru 19287
fubar: ross Lazarus at gmail period com 19232
[email protected] 18878
[email protected] 18707
[email protected] 16373
[email protected] 13650
[email protected] 10210
[email protected] 9548
[email protected] 7415
[email protected] 6491
[email protected] 5333
gua110 5122
[email protected] 4490
[email protected] 3943
[email protected] 3885
[email protected] 2444
clements 2118
[email protected] 2112
[email protected] 2061
jeremy.goecks at emory.edu 2039
[email protected] 2027
wychung 1892
[email protected] 1710
[email protected] 1395
[email protected] 1047
[email protected] 1025
rpark37 800
ichorny 792
[email protected] 660
fubar/ross period lazarus at gmail d0t com 611
[email protected] 601
[email protected] 598
[email protected] 491
[email protected] 473
[email protected] 437
[email protected] 416
[email protected] 387
[email protected] 385
[email protected] 345
[email protected] 336
[email protected] 332
[email protected] 331
hiralv 328
[email protected] 260
[email protected] 229
greg 207
[email protected] 194
[email protected] 146
[email protected] 130
[email protected] 114
roryk 101
[email protected] 89
Guru Ananda 81
rerla@localhost 80
[email protected] 77
[email protected] 76
dan 56
[email protected] 56
[email protected] 53
smcmanus 49
[email protected] 40
[email protected] 40
Rory Kirchner ([email protected]) 37
chapmanb 35
[email protected] 32
nuwan_ag 31
[email protected] 21
[email protected] 10
[email protected] 9
[email protected] 8
jen 8
[email protected] 8
takadonet 6
[email protected] 5
[email protected] 5
[email protected] 3
[email protected] 2
[email protected] 2
[email protected] 2
[email protected] 1
[email protected] 1

[Same histogram, annotated] Greg James Dan Nate Jeremy Mostly community contributions
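A chart of this kind is easy to regenerate from raw counts. The following is a minimal sketch, not the script used for the slide: the function name `text_histogram`, the `width` parameter, and the round-to-nearest scaling are all assumptions made for illustration. The four sample rows reuse values that appear in the chart above.

```python
def text_histogram(counts, width=60, mark="*"):
    """Render (label, count) pairs as 'label count bar' rows, largest first.

    Bars are scaled so the largest count gets `width` marks and every other
    count is rounded to the nearest whole mark. Assumes `counts` is non-empty
    and that the largest count is positive.
    """
    largest = max(count for _, count in counts)
    rows = []
    for label, count in sorted(counts, key=lambda pair: pair[1], reverse=True):
        # Integer round-to-nearest avoids floating point surprises.
        marks = (count * width + largest // 2) // largest
        rows.append(f"{label} {count} {mark * marks}".rstrip())
    return rows


# Four identities and values copied from the chart; the width is arbitrary.
sample = [("guru", 19287), ("jen", 8), ("rc", 56488), ("clements", 2118)]
rows = text_histogram(sample, width=20)
for row in rows:
    print(row)
```

Very small counts round to zero marks and are printed without a bar, which is why a long tail of occasional contributors can still be listed without distorting the scale.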

Scalability and Sustainability

The State of Galaxy “main”
Public web site that anyone can use for free, backed by a medium-scale compute cluster currently hosted at Penn State

Plateau is due to resource constraints, not reduced demand; quotas were introduced in late 2011

How can this possibly scale? Decentralize, provide many deployment models

Leverage existing compute infrastructure
Currently engaged in several efforts to leverage XSEDE in Galaxy
10 Gig dedicated link from Galaxy server room to XSEDE network via PSC
Mirror of all Galaxy data on PSC’s SLASH2 store
Ongoing work with XSEDE Extended Support group enabling job submission to XSEDE
Integration with Globus Online from Ian Foster’s group
Goal is to allow users to link XSEDE allocations to Galaxy accounts, submit directly from Galaxy main

Local Galaxy Deployment
Galaxy is designed for local installation and customization... just download and run
Pluggable interfaces to compute resources, easily connect to one or more existing clusters
Ideally, allow users to take advantage of whatever computational resources they already have access to.

Running analysis
• A Galaxy instance can be configured to use one or more existing batch systems (PBS, SGE, Platform, anything that uses DRMAA)
• Handles workflows that span different resources
• Job running is completely pluggable at multiple levels: plug in your own execution engine, workflow engine, storage model, whatever
• Even different approaches for different types of analysis
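The pluggable job-running idea above can be illustrated with a small registry that maps tools to interchangeable runner objects. This is a sketch only and is not Galaxy's actual code or configuration: `JobRunner`, `LocalRunner`, `DryRunClusterRunner`, `RunnerRegistry`, and the tool and queue names are all hypothetical, and the "cluster" runner only reports what it would submit rather than talking to a real batch system through DRMAA.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class JobRunner(ABC):
    """Hypothetical plugin interface: anything that can run a command."""

    @abstractmethod
    def run(self, command):
        """Run a command (a list of strings) and return a text result."""


class LocalRunner(JobRunner):
    """Runs the command as a local subprocess and returns its stdout."""

    def run(self, command):
        done = subprocess.run(command, capture_output=True, text=True, check=True)
        return done.stdout.strip()


class DryRunClusterRunner(JobRunner):
    """Stand-in for a batch system: only describes the submission."""

    def __init__(self, queue):
        self.queue = queue

    def run(self, command):
        return f"would submit to queue {self.queue!r}: {' '.join(command)}"


class RunnerRegistry:
    """Routes each tool to a named runner, falling back to a default."""

    def __init__(self, default="local"):
        self.default = default
        self._runners = {}
        self._routes = {}

    def register(self, name, runner):
        self._runners[name] = runner

    def route(self, tool_id, runner_name):
        self._routes[tool_id] = runner_name

    def run(self, tool_id, command):
        runner_name = self._routes.get(tool_id, self.default)
        return self._runners[runner_name].run(command)


registry = RunnerRegistry()
registry.register("local", LocalRunner())
registry.register("cluster", DryRunClusterRunner("batch"))
registry.route("mapper", "cluster")  # heavy tool goes to the cluster stand-in

print(registry.run("mapper", ["bwa", "aln", "reads.fastq"]))
print(registry.run("hello", [sys.executable, "-c", "print('hello')"]))
```

Because callers only see `RunnerRegistry.run`, a new execution backend can be added by registering another `JobRunner` subclass, and different tools can be routed to different resources without changing the tools themselves.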

~25 known public Galaxy servers

Many large-scale private Galaxy instances (typically at the institutional level)
Since April 2012 we have been supporting a community-initiated group called GalaxyCzarsAdmins, organizing regular Meetups for administrators of local Galaxy instances
So far, talks from University of Iowa, University of Florida, UAB, and Minnesota Supercomputing Institute

However, local installations require informatics expertise and resources...

High throughput sequencing
Much more widely available, accessible to individual investigators, relatively inexpensive
Cloud computing
Available to anyone, rapidly acquired and released, pay for use
A natural fit?

Directly launching a Galaxy/Cloudman VM on AWS

Share a snapshot of this instance
Complete instances can be archived and shared with others

The IaaS cloud solution may still be challenging for naive
users, but is proving extremely useful for local instances

Scaling Galaxy: two distinct problems
• So much data, not enough infrastructure
• Solution: encourage local Galaxy instances, cloud Galaxy, support an increasingly decentralized model, improve access to existing resources
• So many tools and workflows, not enough manpower
• Focus on building infrastructure to allow the community to integrate and share tools, workflows, and best practices

[Diagram labels: 1, 2, 3, ∞; http://usegalaxy.org; http://usegalaxy.org/community; Galaxies on private clouds; Galaxies on public clouds; private Galaxy installations; Private Tool Sheds; Galaxy Tool Shed]

Galaxy toolshed vision
• Allow users to share “suites” containing tools, datatypes, workflows, sample data, and automated installation scripts for tool dependencies
• Version controlled
• Community annotation, rating, comments, review
• Dependency resolution
• Integration with Galaxy instances to automate tool installation and updates

Repositories are owned by the contributor, can contain tools, workflows, etc.
Backed by version control, a complete version history is retained for everything that passes through the toolshed
Galaxy instance admins can install tools directly from the toolshed using only a web UI
Support for recipes for installing the underlying software that tools depend on (also versioned)
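Dependency resolution of the kind mentioned above amounts to installing each item only after everything it depends on. The sketch below shows one standard way to compute such an order with a topological sort from the Python standard library (`graphlib`, Python 3.9+). It is not the Tool Shed's implementation: the function `install_order` and every repository name in `suite` are invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def install_order(dependencies):
    """Return an order in which every item comes after its dependencies.

    `dependencies` maps an item name to the set of names it depends on.
    graphlib raises CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical suite: a workflow needs two tool wrappers, and each wrapper
# needs the underlying software installed first.
suite = {
    "variant_workflow": {"mapper_wrapper", "caller_wrapper"},
    "mapper_wrapper": {"mapper_binary"},
    "caller_wrapper": {"caller_binary"},
    "mapper_binary": set(),
    "caller_binary": set(),
}

order = install_order(suite)
print(order)
```

Several valid orders exist for this graph, so the only guarantee is that each binary precedes its wrapper and both wrappers precede the workflow.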

Summary

What’s worked for us
Be useful quickly
Be idealistic
Low barriers, especially providing a completely free web service
Don’t pay too much attention to user demands, do less but do it well
Stay close to the science; don’t engage the user, be the user
Serious investment in community outreach and infrastructure to support community activities (nearly half of our entire team’s time)

Challenges
Finding the right people: good engineers who can work with researchers, fit with the culture of an academic research environment, deal with rapidly changing requirements...
A large user community (success!) makes it more difficult to make big changes, greatly increasing time spent on backwards compatibility, migrations
Difficult to invest time and resources in improving existing components; our incentives (funding, publications) require building new things

Our team

Dan Blankenberg, Nate Coraor, Greg von Kuster, Dannon Baker, Jeremy Goecks, Anton Nekrutenko, James Taylor, Dave Clements, Jennifer Jackson, Carl Eberhard, Dave Bouvier
Group labels: Engineering; Support and outreach; Keeping the lights on
Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Emory University, and the Pennsylvania Department of Public Health
