U in St. Louis Workshop Leaders and Key URLs Jeremy Goecks OHSU Dave Clements Johns Hopkins Galaxy: http://galaxyproject.org G-OnRamp: http://gonramp.org
unfamiliar with computation, but complex methods and infrastructure required Creating and reproducing workflows (pipelines) hindered by complexity: systems, scripts, tools, parameters Collaboration and reuse difficult because current approaches do not support computational artifacts well
command line syntax ‣ tool and dependency installation ‣ creating pipelines (workflows) ‣ using computing clouds/clusters These difficulties hinder biomedicine in profound ways ‣ time spent on computing rather than science ‣ little exploration and difficult to test ideas ‣ computing is underutilized
correctness Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced (given the data) Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009: 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012: 7/50 resequencing experiments reproducible) Missing software, versions, parameters, data…
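To make "sufficient detail" concrete, here is a minimal sketch of a record for one analysis step that captures the items the slide lists as commonly missing: software, version, parameters, and data. The function name `describe_step`, the field names, and the example tool are hypothetical and chosen for illustration only; this is not how Galaxy stores provenance.

```python
import hashlib
import json


def describe_step(tool, version, parameters, input_bytes):
    """Return a JSON string describing one analysis step.

    All field names are illustrative assumptions, not a standard.
    """
    record = {
        "tool": tool,                # which software was run
        "version": version,          # exact version of that software
        "parameters": parameters,    # every parameter value used
        # A checksum identifies the input data without storing it.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }
    # Sorting keys makes the output stable, so two records can be compared.
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical tool name and parameters.
    print(describe_step("example_aligner", "1.0", {"threads": 4}, b"ACGT"))
```

A record like this, kept for every step, is the kind of information a reader would need in order to rerun the analysis on the same data.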
Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130 32/127 tools 6/41 papers
methods in PDF documents ‣ extract a table from a PDF document? Need links/embedding of methods plus surrounding discussion ‣ community understanding and evaluation critical ‣ want to build on existing methods rather than start from scratch
a wealth of tools, compute resources, terabytes of reference data and permanent storage Open source software that makes integrating your own tools and data and customizing for your own site simple An open extensible platform for sharing tools, datatypes, workflows, ...
accessible to scientists? How best to facilitate transparent communication of computational analyses? How best to ensure that analyses are reproducible?
details Scalable* workflow system for automated complex analysis Pervasive sharing, and publication of documents with integrated analysis *several examples of 10,000+ dataset analyses across the world
than exons and SNPs? Which transcription factor binding sites have the most SNPs? Which exons have the most repeats? Inputs: Exons, SNPs ‣ Join exons with SNPs ‣ Group by exons ‣ Sort exons by SNP count ‣ Select top five exons ‣ Recover exon info
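The workflow steps on this slide (join, group, sort, select the top five, recover exon info) can be sketched in plain Python. This is an illustration only, not the Galaxy tools themselves: the tuple layouts, the half-open interval convention, the function name `top_exons_by_snp_count`, and the toy data are all assumptions.

```python
from collections import Counter

# Hypothetical toy data, not taken from the slides.
# Exons: (chrom, start, end, name), treated as half-open intervals [start, end).
EXONS = [
    ("chr1", 100, 200, "exonA"),
    ("chr1", 300, 400, "exonB"),
    ("chr2", 100, 200, "exonC"),
]
# SNPs: (chrom, position, snp_id).
SNPS = [
    ("chr1", 150, "snp1"),
    ("chr1", 160, "snp2"),
    ("chr1", 250, "snp3"),  # falls between exons, so it joins with nothing
    ("chr1", 350, "snp4"),
    ("chr2", 150, "snp5"),
    ("chr2", 199, "snp6"),
    ("chr2", 200, "snp7"),  # equals the exon end, so it is excluded
]


def top_exons_by_snp_count(exons, snps, n=5):
    """Return up to n (exon, snp_count) pairs, highest count first."""
    # Join exons with SNPs: keep each (exon name, SNP id) pair that overlaps.
    # This nested loop is quadratic; fine for a sketch, slow for real data.
    joined = [
        (exon[3], snp[2])
        for exon in exons
        for snp in snps
        if exon[0] == snp[0] and exon[1] <= snp[1] < exon[2]
    ]
    # Group by exon: count the SNPs joined to each exon name.
    counts = Counter(name for name, _ in joined)
    # Sort by SNP count (descending, ties broken by name) and select the top n.
    top = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]
    # Recover exon info: look up the full exon record for each name.
    by_name = {exon[3]: exon for exon in exons}
    return [(by_name[name], count) for name, count in top]


if __name__ == "__main__":
    for exon, count in top_exons_by_snp_count(EXONS, SNPS):
        print(exon, count)
```

Each comment in the function corresponds to one box in the workflow, which is the point of the slide: the question "which exons have the most SNPs?" decomposes into a short chain of generic operations.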
and modify workflows, not just run existing best practice pipelines The Galaxy workflow editor supports this use case well, providing ways for users to easily construct and modify workflows
genomics datasets to annotate any eukaryotic genome ‣ providing educators with a platform to train undergraduate students on “big data” biomedical analyses Collaboration between Galaxy and Genomics Education Partnership (GEP) Opportunities to participate in G-OnRamp workshops this summer for research or for education: June 20-22 or July 25-27
into the undergraduate curriculum ‣ engage students in genomics research Approach ‣ use genome annotation of Drosophila for “hands-on” exercise ‣ students learn to integrate multiple lines of evidence, learn about genes/genomes, about genomics, underlying algorithms, and more
annually Shaffer CD et al. 2014, CBE Life Sci Educ. 13(1):111-30 Source of funding Total enrollment Admissions selectivity Highest biology degree % life sciences majors Residential vs. commuter Minority/Hispanic serving Minority Non-traditional students First generation (>30%)
Galaxy features
‣ Requires expertise (e.g., familiarity with Linux) to configure and run bioinformatics tools
  Galaxy: Provides a web-based user interface to configure and run tools
‣ Difficult to share workflows and results
  Galaxy: Can make Histories, Datasets, and Workflows publicly available or share with individual Galaxy users
‣ Difficult to incorporate additional analyses and tools
  Galaxy: Can use the Workflow Canvas to modify existing workflows and add new tools from the Galaxy Tool Shed
‣ GEP projects are currently limited to the analysis of different Drosophila species
  Galaxy: Can extract a Workflow from History and run the Workflow on other genome assemblies
for genome annotation Combines multiple tools into reproducible sub-workflows Uses Hub Archive Creator (HAC) to create UCSC Assembly Hubs Displays genome browsers using the servers maintained by UCSC
the G-OnRamp beta testers workshop 10 participants from 9 institutions ‣ Five genome assemblies: Amazona vittata, Chlamydomonas reinhardtii, Kryptolebias marmoratus, Sebastes rubrivinctus, Xenopus laevis ‣ Assembly sizes: 111 Mb - 2.8 Gb ‣ Number of scaffolds: 54 - 402,501 ‣ Four genomes with RNA-Seq data Photos by Tom MacKenzie (A. vittata), Dartmouth Electron Microscope Facility (C. reinhardtii), Chad King (S. rubrivinctus), Brian Gratwicke (X. laevis), and Jean-Paul Cicéron (K. marmoratus)
for interactive viewing ‣ WebApollo for real-time interactive collaborative annotation ‣ CyVerse for storing and accessing generated data Make it easier to install and use ‣ on a local computer with a virtual machine for running small analyses ‣ on the cloud for large analyses
G-OnRamp beta testing workshops will be held in Summer 2017 at Washington University in St. Louis ‣ June 20-22 or July 25-27 ‣ Lodging and food costs covered, you pay travel costs Express interest and join mailing list at http://gonramp.org/signup
data analysis accessible and reproducible ‣ Nearly everything in Galaxy is “pluggable”, allowing it to be customized for myriad purposes New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability By supporting and leveraging tool developers, the Galaxy community can collectively keep up with rapid changes in available tools http://galaxyproject.org for all things Galaxy
the Genomics Education Partnership (GEP) Provides best-practice genome annotation workflows for engaging undergraduates in data science and for data analysis Provides workshops for learning about the workflow and how to use it in education; see the sign-up sheet in the back or visit http://gonramp.org/signup
James Taylor Dave Clements Jennifer Jackson Support and outreach Leadership Dave Bouvier Sam Guerler Martin Čech Enis Afgan Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Johns Hopkins University, Oregon Health and Science University, and the Pennsylvania Department of Public Health Nick Stoler The “Core” Galaxy Team Mo Heydarian John Chilton Engineering
Eric Rasche CPT Nicola Soranzo TGAC Brad Chapman HSPH Nuwan Goonasekera VeRSI Yousef Kowsar VLSCI Extended team and other contributors… And many others who have contributed to the main Galaxy code, tools to the ToolShed, participated in discussions, attended the Galaxy conferences, …