Genome Annotation with Galaxy and G-OnRamp Jeremy Goecks Assistant Professor, Oregon Health and Science University @galaxyproject / #usegalaxy http://www.galaxyproject.org
Workshop Leaders and Key URLs Jeremy Goecks (OHSU), Dave Clements (Johns Hopkins), Wilson Leung (Wash U in St. Louis), Yating Liu (Wash U in St. Louis) Galaxy: http://galaxyproject.org G-OnRamp: http://gonramp.org
Motivation for Galaxy: Computational (e.g., Genomic) Analyses are Difficult Investigators unfamiliar with computation, but complex methods and infrastructure required Creating and reproducing workflows (pipelines) hindered by complexity: systems, scripts, tools, parameters Collaboration and reuse difficult because current approaches do not support computational artifacts well
Accessibility Getting started in bioinformatics is so very hard ‣ command line syntax ‣ tool and dependency installation ‣ creating pipelines (workflows) ‣ using computing clouds/clusters These difficulties hinder biomedicine in profound ways ‣ time spent on computing rather than science ‣ little exploration and difficult to test ideas ‣ computing is underutilized
Reproducibility for Computational Science Reproducibility is not provenance, reusability/generalizability, or correctness Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced (given the data) Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009, 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012, 7/50 resequencing experiments reproducible) Missing software, versions, parameters, data…
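To make "captured in sufficient detail" concrete, here is a minimal Python sketch of a per-step record covering the items the slide lists as commonly missing: software, version, parameters, and data. The `AnalysisStep` class, its field names, and the example tool are hypothetical and chosen only for illustration; this is not Galaxy's internal format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AnalysisStep:
    """One step of an analysis (hypothetical record layout)."""
    tool: str                      # name of the software that was run
    version: str                   # exact version of that software
    parameters: dict               # every parameter value used
    input_checksums: dict = field(default_factory=dict)  # file name -> SHA-256


def checksum(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying the exact input bytes."""
    return hashlib.sha256(data).hexdigest()


def empty_fields(step: AnalysisStep) -> list:
    """List the fields left empty, i.e. details a reader could not recover."""
    return [name for name, value in asdict(step).items() if not value]


def serialize(steps: list) -> str:
    """Serialize steps to JSON with sorted keys so the output is stable."""
    return json.dumps([asdict(s) for s in steps], sort_keys=True, indent=2)


# Hypothetical example: a fully described step and an under-described one.
complete = AnalysisStep(
    tool="example_aligner",
    version="1.2.3",
    parameters={"min_quality": 20},
    input_checksums={"reads.fq": checksum(b"ACGT")},
)
incomplete = AnalysisStep(tool="example_aligner", version="", parameters={})

print(empty_fields(complete))
print(empty_fields(incomplete))
```

A record like `incomplete` names the tool but not its version, parameters, or inputs, which is the situation the reproducibility surveys above describe.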
Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130 32/127 tools 6/41 papers
Collaboration and Reuse There is very little actionable data or methods in PDF documents ‣ extract a table from a PDF document? Need links/embedding of methods plus surrounding discussion ‣ community understanding and evaluation critical ‣ want to build on existing methods rather than start from scratch
Galaxy: accessible analysis system Consistent tool user interfaces automatically generated History system facilitates and tracks multistep analyses Exact parameters of a step can always be inspected, and easily rerun Workflow system
Galaxy is… A free (for everyone) web service (http://usegalaxy.org) integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage Open source software that makes integrating your own tools and data and customizing for your own site simple An open extensible platform for sharing tools, datatypes, workflows, ...
Galaxy’s Ideological Goals How best can data intensive methods be accessible to scientists? How best to facilitate transparent communication of computational analyses? How best to ensure that analyses are reproducible?
Galaxy’s Practical Goals How to arm researchers with access to latest tools and applications How to build a community of tool developers How to run Galaxy on any HPC
Ways to use Galaxy The public web service at http://usegalaxy.org Install locally with many compute environments Deploy on a cloud using CloudMan, Atmosphere
Galaxy 101-1: Find the top 5 exons with the highest number of SNPs Exons: ~10,000 regions SNPs: ~200,000 regions https://github.com/nekrut/galaxy/wiki/Galaxy101-1
Describe analysis tool behavior abstractly Scalable* analysis environment transparently tracks details Scalable* workflow system for automated complex analysis Pervasive sharing, and publication of documents with integrated analysis *several examples of 10,000+ dataset analyses across the world
From Galaxy 101-1 to Galaxy 101-2 What about features other than exons and SNPs? which transcription factor binding sites have the most SNPs? which exons have the most repeats? Exons SNPs Join exons with SNPs Group by exons Sort exons by SNP count Select top five exons Recover exon info
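The workflow steps above (join, group, sort, select, recover) can be sketched outside Galaxy in a few lines of Python. This is an illustrative sketch only: the toy coordinates, the tuple layouts, and the function name `top_features_by_overlap` are assumptions, and Galaxy's own tools operate on interval files rather than in-memory lists. Swapping in a different feature list (for example, transcription factor binding sites) reuses the same steps, which is the point of generalizing from 101-1 to 101-2.

```python
from collections import Counter

# Hypothetical toy data. Features: (chrom, start, end, name), half-open intervals.
EXONS = [
    ("chr1", 100, 200, "exonA"),
    ("chr1", 300, 400, "exonB"),
    ("chr2", 50, 150, "exonC"),
]
# Points: (chrom, position, name).
SNPS = [
    ("chr1", 120, "rs1"),
    ("chr1", 150, "rs2"),
    ("chr1", 350, "rs3"),
    ("chr2", 10, "rs4"),
    ("chr1", 199, "rs5"),
]


def top_features_by_overlap(features, points, n=5):
    """Return the n features containing the most points, with coordinates."""
    counts = Counter()
    # Join features with points, then group by feature name and count.
    for p_chrom, pos, _p_name in points:
        for f_chrom, start, end, f_name in features:
            if p_chrom == f_chrom and start <= pos < end:
                counts[f_name] += 1
    # Sort by count (descending, ties broken by name) and select the top n.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    # Recover feature info: look the coordinates up again by name.
    coords = {name: (chrom, start, end) for chrom, start, end, name in features}
    return [(name, *coords[name], count) for name, count in top]


for row in top_features_by_overlap(EXONS, SNPS):
    print(row)
```

The nested loop is quadratic and only suitable for toy inputs; at the scale on the Galaxy 101-1 slide (~10,000 exons, ~200,000 SNPs) an interval index or a sorted sweep would be the usual choice.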
For research use, users need to be able to construct and modify workflows, not just run existing best practice pipelines The Galaxy workflow editor supports this use case well, providing ways for users to easily construct and modify workflows
G-OnRamp http://gonramp.org/ Create Galaxy servers for ‣ utilizing large genomics datasets to annotate any eukaryotic genome ‣ providing educators with a platform to train undergraduate students on “big data” biomedical analyses Collaboration between Galaxy and Genomics Education Partnership (GEP) Opportunities to participate in G-OnRamp workshops this summer for research or for education: June 20-22 or July 25-27
Genomics Education Partnership http://gep.wustl.edu Goals ‣ introduce genomics and bioinformatics into the undergraduate curriculum ‣ engage students in genomics research Approach ‣ use genome annotation of Drosophila for “hands-on” exercise ‣ students learn to integrate multiple lines of evidence, learn about genes/genomes, about genomics, underlying algorithms, and more
GEP Results Results produced by GEP students are assembled for domain analysis and scientific publications Students report substantial learning gains ‣ Gains enhanced by increased time investment; Q4 > Q1 [Figure: mean scores for learning gain items in the SURE survey (e.g., understanding the research process, ability to analyze data, independence), comparing Q1 (1-10 hrs.), Q4 (>36 hrs.), and SURE (summer research)] Scientific results: Leung et al. 2015, G3. 5(5):719-42
GEP Participants >100 faculty from >100 schools, >1000 undergraduates participate annually Shaffer CD et al. 2014, CBE Life Sci Educ. 13(1):111-30 [Figure labels: source of funding; total enrollment; admissions selectivity; highest biology degree; % life sciences majors; residential vs. commuter; minority/Hispanic serving; minority; non-traditional students; first generation (>30%)]
G-OnRamp: Use Galaxy to address GEP Challenges
‣ GEP challenge: Requires expertise (e.g., familiarity with Linux) to configure and run bioinformatics tools. Galaxy feature: Provides a web-based user interface to configure and run tools
‣ GEP challenge: Difficult to share workflows and results. Galaxy feature: Can make Histories, Datasets, and Workflows publicly available or share with individual Galaxy users
‣ GEP challenge: Difficult to incorporate additional analyses and tools. Galaxy feature: Can use the Workflow Canvas to modify existing workflows and add new tools from the Galaxy Tool Shed
‣ GEP challenge: GEP projects are currently limited to the analysis of different Drosophila species. Galaxy feature: Can extract a Workflow from History and run the Workflow on other genome assemblies
Galaxy for Genome Annotation Extends Galaxy with tools and workflows for genome annotation Combines multiple tools into reproducible sub-workflows Uses Hub Archive Creator (HAC) to create UCSC Assembly Hubs Displays genome browsers using the servers maintained by UCSC
Summer 2016 G-OnRamp Beta Workshop Created UCSC Assembly Hubs for the G-OnRamp beta testers workshop 10 participants from 9 institutions ‣ Five genome assemblies: Amazona vittata, Chlamydomonas reinhardtii, Kryptolebias marmoratus, Sebastes rubrivinctus, Xenopus laevis ‣ Assembly sizes: 111 Mb - 2.8 Gb ‣ Number of scaffolds: 54 - 402,501 ‣ Four genomes with RNA-Seq data Photos by Tom MacKenzie (A. vittata), Dartmouth Electron Microscope Facility (C. reinhardtii), Chad King (S. rubrivinctus), Brian Gratwicke (X. laevis), and Jean-Paul Cicéron (K. marmoratus)
G-OnRamp: Coming Features (1/2) Better documentation for usage and for teaching Extend workflow: ‣ ChIP-seq ‣ DNase-seq/ATAC-seq ‣ DNA methylation (bisulfite sequencing) Optimize workflow: ‣ better labels ‣ better repeat detection to speed up workflow
G-OnRamp: Coming Features (2/2) Connect to broader ecosystem ‣ JBrowse for interactive viewing ‣ WebApollo for real-time interactive collaborative annotation ‣ CyVerse for storing and accessing generated data Make easier to install and use ‣ on local computer with a virtual machine for running small analyses ‣ on the cloud for large analyses
Want to participate in a Summer 2017 G-OnRamp beta testing workshop? G-OnRamp beta testing workshops will be held in Summer 2017 at Washington University in St. Louis ‣ June 20-22 or July 25-27 ‣ Lodging and food costs covered, you pay travel costs Express interest and join mailing list at http://gonramp.org/signup
Galaxy Summary Galaxy is an (obsessively) open framework for making data analysis accessible and reproducible ‣ Nearly everything in Galaxy is “pluggable”, allowing it to be customized for myriad purposes New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability By supporting and leveraging tool developers, the Galaxy community can collectively keep up with rapid changes in available tools http://galaxyproject.org for all things Galaxy
G-OnRamp Summary G-OnRamp is a project that joins Galaxy with the Genomics Education Partnership (GEP) Provides best-practice genome annotation workflows for engaging undergraduates in data science and for data analysis Provides workshops for learning about the workflow and how to use it in education; see the sign-up sheet in the back or visit http://gonramp.org/signup
The “Core” Galaxy Team (Leadership, Engineering, Support and outreach): Dan Blankenberg, Nate Coraor, Dannon Baker, Jeremy Goecks, Anton Nekrutenko, James Taylor, Dave Clements, Jennifer Jackson, Dave Bouvier, Sam Guerler, Martin Čech, Enis Afgan, Nick Stoler, Mo Heydarian, John Chilton Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Johns Hopkins University, Oregon Health and Science University, and the Pennsylvania Department of Public Health
Extended team and other contributors… Björn Grüning (Uni Freiburg), Peter Cock (TJHI), Kyle Ellrott (OHSU), Eric Rasche (CPT), Nicola Soranzo (TGAC), Brad Chapman (HSPH), Nuwan Goonasekera (VeRSI), Yousef Kowsar (VLSCI) And many others who have contributed to the main Galaxy code, tools to the ToolShed, participated in discussions, attended the Galaxy conferences, …
Galaxy is a community! Join us on irc, mailing lists, Galaxy Biostar Contribute code on bitbucket, github, or the ToolShed Join us for a Hackathon or our annual conference