…working on methods for understanding gene regulation using comparative genomic data. While there, I started Galaxy with Anton Nekrutenko as a way to facilitate better collaboration between computational and experimental researchers. Since then, I have continued (with Anton) to lead the Galaxy project in my groups at Emory and Johns Hopkins.
Informatics: a one-semester undergraduate class on the core technologies and algorithms of genomics; no programming, uses Galaxy for assignments. Quantitative Biology Bootcamp: a one-week intensive bootcamp for entering Biology PhD students at Hopkins; hands-on learning to work at the UNIX command line and basic Python programming using genomic data examples, followed by a year-long weekly lab.
…at CSHL for decades; covers a wide variety of topics in comparative and computational genomics, has used Galaxy since 2010, also covers some UNIX and R/RStudio, and is project-based. Genomic Data Science: I teach one of the nine courses in the Genomic Data Science Coursera Specialization, a popular MOOC sequence covering many aspects of genomic data analysis. Penguin Coast: …
[Figure: the data pipeline, inspired by Leek and Peng, Nature 2015 — …design, data collection, data cleaning, data analysis, inference. Labels mark "the part we are considering here" and "the part that ends up in the publication".]
An analysis is described/captured in sufficient detail that it can be precisely reproduced.
Reproducibility is not provenance, reusability/generalizability, or correctness.
It is a minimum standard for evaluating analyses.
…bwa for mapping (of ~380 published):
• 36 did not provide primary data (all but 2 provided it upon request)
• 31 provided neither parameters nor versions
• 19 provided settings only, and 8 listed versions only
• Only 7 provided all details
…identification of research resources in the biomedical literature
Nicole A. Vasilevsky, Matthew H. Brush, Holly Paddock, Laura Ponting, Shreejoy J. Tripathy, Gregory M. LaRocca and Melissa A. Haendel (Oregon Health & Science University; University of Oregon; University of Cambridge; University of Cambridge; Carnegie Mellon University)
ABSTRACT: Scientific reproducibility has been at the forefront of many news stories and there exist numerous initiatives to help address this problem. We posit that a contributor is simply a lack of the specificity that is required to enable adequate research reproducibility. In particular, the inability to uniquely identify research resources, such as antibodies and model organisms, makes it difficult or impossible to reproduce experiments even where the science is otherwise sound. In order to better understand the magnitude of this problem, we designed an experiment to ascertain the "identifiability" of research resources in the biomedical literature. We evaluated recent journal articles in the fields of Neuroscience, Developmental Biology, Immunology, Cell and Molecular Biology and General Biology, selected randomly based on a diversity of impact factors for the journals, publishers, and experimental method reporting guidelines. We attempted to uniquely identify model organisms (mouse, rat, zebrafish, worm, fly and yeast), antibodies, knockdown reagents (morpholinos or RNAi), constructs, and cell lines. Specific criteria were developed to determine if a resource was uniquely identifiable, and included examining relevant repositories …
Iorns, Elizabeth (2014): Unique identification of research resources in studies in the Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130 — 32/127 tools; 6/41 papers
1. Describe the experiment (either as it is being carried out, or after the fact)
2. Assemble all of the necessary data and software dependencies needed by the described experiment
3. Combine the above to verify the analysis
Many analyses require data, such as genome annotations, from external databases that change regularly; either record the versions used or store a copy of the specific data.
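One lightweight way to do that record-keeping, sketched below (the helper function and manifest format are made up for illustration and are not part of Galaxy), is to keep a local copy of each external file together with its source URL, retrieval date, and checksum:

    # Sketch: pin external reference data by keeping a local copy plus provenance.
    import hashlib
    import json
    import urllib.request
    from datetime import date

    def fetch_and_record(url, local_path, manifest_path="data_manifest.json"):
        # Download and keep the exact copy of the reference data used.
        urllib.request.urlretrieve(url, local_path)

        # Checksum the copy so it can be verified later.
        sha256 = hashlib.sha256()
        with open(local_path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                sha256.update(chunk)

        # Append source, date, and checksum to a simple JSON manifest.
        record = {
            "url": url,
            "local_path": local_path,
            "retrieved": date.today().isoformat(),
            "sha256": sha256.hexdigest(),
        }
        try:
            with open(manifest_path) as fh:
                manifest = json.load(fh)
        except FileNotFoundError:
            manifest = []
        manifest.append(record)
        with open(manifest_path, "w") as fh:
            json.dump(manifest, fh, indent=2)
        return record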
…methods, make these methods available to everyone.
Transparency: facilitate communication of analyses and results in ways that are easy to understand while providing all details.
Reproducibility: ensure that analyses performed in the system can be reproduced precisely and practically.
…tools, compute resources, terabytes of reference data, and permanent storage.
Open source software that makes it simple to integrate your own tools and data and to customize for your own site.
An open, extensible platform for sharing tools, datatypes, workflows, …
Tools are described in terms of an abstract interface (inputs and outputs)
• In practice, mostly command-line tools: a declarative XML description of the interface and of how to generate a command line (see the sketch below)
• Designed to be as easy as possible for tool authors, while still allowing rigorous reasoning
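For illustration, here is a minimal sketch of what such a declarative tool wrapper can look like; the tool id, name, and wrapped command are invented for this example rather than taken from an existing Galaxy tool:

    <tool id="example_line_filter" name="Filter comment lines" version="0.1.0">
        <!-- The command template describes how parameter values become a command line -->
        <command detect_errors="exit_code"><![CDATA[
            grep -v '^#' '$input' > '$output'
        ]]></command>
        <!-- Inputs and outputs declare the tool's abstract interface -->
        <inputs>
            <param name="input" type="data" format="tabular" label="Input dataset"/>
        </inputs>
        <outputs>
            <data name="output" format="tabular"/>
        </outputs>
        <help>Removes comment lines (lines starting with #) from a tabular dataset.</help>
    </tool>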
…tracks details (see the API sketch after this list)
Workflow system for complex analysis, constructed explicitly or automatically
Pervasive sharing and publication of documents with integrated analyses
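Those recorded details can also be pulled out programmatically. As a rough sketch (the BioBlend client library is not covered in these slides, and the URL, API key, and history name below are placeholders):

    # Sketch: inspect what a Galaxy history recorded about each dataset.
    from bioblend.galaxy import GalaxyInstance

    gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

    # Assumes a history named "My analysis" exists on the server.
    history = gi.histories.get_histories(name="My analysis")[0]

    for dataset in gi.histories.show_history(history["id"], contents=True):
        # Each dataset records the tool and parameters that produced it.
        prov = gi.histories.show_dataset_provenance(history["id"], dataset["id"])
        print(dataset["name"], prov.get("tool_id"), prov.get("parameters"))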
…start:
• Linux install
• Linux admin
• Command line
• Tool installs
The Galaxy goal: focus on the questions and techniques, rather than on the compute infrastructure.
…the kernel level up.
Containers — lightweight environments with isolation enforced at the OS level, with complete control over all software.
Docker adds a complete ecosystem for sharing, versioning, and managing containers — Docker Hub.
— analyses are isolated and the environment is the same every time. Archive that container — containers are lightweight thanks to layers — and the analysis can always be recreated.
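A rough sketch of that pattern at the command line (the image name, tag, and script here are hypothetical):

    # Run the analysis inside a container: isolated, and the same environment every time
    docker run --rm -v "$(pwd)":/data my-analysis-image:1.0 bash /data/run_analysis.sh

    # Archive the exact image (layers and all) so the analysis can always be recreated
    docker save -o my-analysis-image-1.0.tar my-analysis-image:1.0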
…there are now many), deploying your own instance at an institution or course level, or having students deploy their own instances. All have strengths and weaknesses depending on the training goals.
Galaxy main (http://usegalaxy.org): leveraging national cyberinfrastructure
• Dedicated resources at TACC, Austin: Galaxy Cluster (Rodeo) — 256 cores, 2 TB memory; Corral/Stockyard — 20 PB disk
• Shared XSEDE resources: Blacklight, Bridges; PTI, IU Bloomington
Funded by the National Science Foundation Award #ACI-1445604
…of science” where traditional HPC resources have not served well.
Jetstream will enable archival of volumes and virtual machines in the IU ScholarWorks archive with DOIs attached: the exact analysis environment, data, and provenance become an archived, citable entity.
Funded by the National Science Foundation Award #ACI-1445604
…to relatively long wait times, especially for larger jobs (sequence mapping, et cetera). This can be fine for homework assignments and exercises completed offline, but it is difficult for in-person training.
Galaxy Cloud
…service management, auto-scaling
Cloudbridge: new abstraction library for working with multiple cloud APIs (see the sketch below)
Genomics Virtual Lab: CloudMan + Galaxy + many other common bioinformatics tools and frameworks
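As a rough illustration of the Cloudbridge idea (this sketch is not from the slides; the provider choice, credentials, and exact module path are assumptions that may vary between releases), the same client code can target different clouds:

    # Sketch: one Python interface across cloud providers via Cloudbridge.
    # Module path, config keys, and attributes may differ between releases;
    # the credentials are placeholders.
    from cloudbridge.factory import CloudProviderFactory, ProviderList

    config = {"aws_access_key": "PLACEHOLDER", "aws_secret_key": "PLACEHOLDER"}
    provider = CloudProviderFactory().create_provider(ProviderList.AWS, config)

    # The same compute and storage abstractions work regardless of the provider.
    for instance in provider.compute.instances.list():
        print(instance.id, instance.state)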
…money, but education grants are available.
All CloudMan and Galaxy features available; fewer node types and available resources; educational allocations available through XSEDE.
(Also available on Google Compute and Azure, but some features, like autoscaling, are still in development.)
Funded by the National Science Foundation Award #ACI-1445604
…Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Björn Grüning, Sam Guerler, Mo Heydarian, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche, Nicola Soranzo, Marius van den Beek
Our lab: Enis Afgan, Dannon Baker, Boris Brenerman, Min Hyung Cho, Dave Clements, Peter DeFord, Sam Guerler, Nathan Roach, Michael E. G. Sauria, German Uritskiy
Collaborators: Craig Stewart and group; Ross Hardison and the VISION group; Victor Corces (Emory); Karen Reddy (JHU); the Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology); the Battle, Goff, Langmead, Leek, Schatz, and Timp labs (JHU Genomics)
Funding: NHGRI (HG005133, HG004909, HG005542, HG005573, HG006620), NIDDK (DK065806), and NSF (DBI 0543285, DBI 0850103)
Funded by the National Science Foundation Award #ACI-1445604