Genome Annotation with Galaxy and G-OnRamp

Jeremy Goecks
January 17, 2017

Slides for the "Genome Annotation with Galaxy and G-OnRamp" workshop at the 2017 Plant and Animal Genomes Conference (PAG)

Transcript

  1. Genome Annotation with Galaxy and G-OnRamp
    Jeremy Goecks, Assistant Professor, Oregon Health and Science University
    @galaxyproject / #usegalaxy
    http://www.galaxyproject.org
  2. Agenda
    Introductions
    Galaxy introduction and exercises
    G-OnRamp introduction and demonstration

  3. Workshop Leaders and Key URLs
      ‣ Wilson Leung, Wash U in St. Louis
      ‣ Yating Liu, Wash U in St. Louis
      ‣ Jeremy Goecks, OHSU
      ‣ Dave Clements, Johns Hopkins
    Galaxy: http://galaxyproject.org
    G-OnRamp: http://gonramp.org
  4. Goals
    Share Galaxy awesomeness with you
    Get you working with Galaxy
    Demonstrate a complex workflow in Galaxy
  5. Agenda
    Introductions
    Galaxy introduction and exercises
    G-OnRamp introduction and demonstration

  6. Motivation for Galaxy: Computational (e.g., Genomic) Analyses are Difficult
    Investigators unfamiliar with computation, but complex methods and infrastructure required
    Creating and reproducing workflows (pipelines) hindered by complexity: systems, scripts, tools, parameters
    Collaboration and reuse difficult because current approaches do not support computational artifacts well
  7. Accessibility
    Getting started in bioinformatics is so very hard
      ‣ command line syntax
      ‣ tool and dependency installation
      ‣ creating pipelines (workflows)
      ‣ using computing clouds/clusters
    These difficulties hinder biomedicine in profound ways
      ‣ time spent on computing rather than science
      ‣ little exploration and difficult to test ideas
      ‣ computing is underutilized
  8. Reproducibility for Computational Science
    Reproducibility is not provenance, reusability/generalizability, or correctness
    Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced (given the data)
    Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009: 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012: 7/50 resequencing experiments reproducible)
    Missing software, versions, parameters, data…
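A minimal sketch of what capturing one analysis step "in sufficient detail" could look like: it records the software, version, parameters, and a checksum of the input data, the items slide 8 lists as typically missing. This is not Galaxy's implementation; `describe_step`, the tool name, and the parameters are hypothetical and chosen only for illustration.

```python
import hashlib
import json


def describe_step(tool, version, parameters, input_bytes):
    """Build a record of one analysis step (hypothetical helper)."""
    return {
        "tool": tool,
        "version": version,
        # Sort parameters so the same settings always serialize identically.
        "parameters": dict(sorted(parameters.items())),
        # A checksum identifies the exact input data without storing it.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }


# Hypothetical tool name, version, and parameters, for illustration only.
record = describe_step(
    "example_aligner",
    "1.2.3",
    {"threads": 4, "min_quality": 20},
    b"ACGT\n",
)
print(json.dumps(record, indent=2))
```

Two runs with the same tool, version, parameters, and input bytes produce identical records, so a mismatch between records points to where a reproduction attempt diverged.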
  9. Reproducibility Project: Cancer Biology
    Independently replicating 50 “high-impact” cancer studies from 2010-2012 (https://osf.io/e81xl/wiki/home/)
  10. Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130
    32/127 tools
    6/41 papers
  11. Collaboration and Reuse
    There is very little actionable data or methods in PDF documents
      ‣ extract a table from a PDF document?
    Need links/embedding of methods plus surrounding discussion
      ‣ community understanding and evaluation critical
      ‣ want to build on existing methods rather than start from scratch
  12. Galaxy: accessible analysis system
    Consistent tool user interfaces automatically generated
    History system facilitates and tracks multistep analyses
    Exact parameters of a step can always be inspected, and easily rerun
    Workflow system
  13. Galaxy is…
    A free (for everyone) web service (http://usegalaxy.org) integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
    Open source software that makes integrating your own tools and data and customizing for your own site simple
    An open extensible platform for sharing tools, datatypes, workflows, ...
  14. Galaxy’s Ideological Goals
    How best can data intensive methods be accessible to scientists?
    How best to facilitate transparent communication of computational analyses?
    How best to ensure that analyses are reproducible?
  15. Galaxy’s Practical Goals
    How to arm researchers with access to latest tools and applications
    How to build a community of tool developers
    How to run Galaxy on any HPC
  16. Ways to use Galaxy
    The public web service at http://usegalaxy.org
    Install locally with many compute environments
    Deploy on a cloud using Cloudman, Atmosphere
  17. Galaxy Main Usage

  18. Galaxy Main Usage

  19. bit.ly/gxyServers

  20. Proteomics, Metabolomics, Drug Discovery, Cosmology, Image Analysis, Climate Change, Social Science, Natural Language
  21. Galaxy Citations

  22. Goal: Bring everyone together user (scientist) HPC admin dev

  23. Goal: Bring everyone together user admin dev

  24. Goal: Bring everyone together admin user dev

  25. Goal: Bring everyone together user dev Galaxy admin

  26. Galaxy 101-1

  27. Galaxy 101-1: Find the top 5 exons with the highest number of SNPs
    Exons: ~10,000 regions
    SNPs: ~200,000 regions
    https://github.com/nekrut/galaxy/wiki/Galaxy101-1
  28. Galaxy features

  29. Describe analysis tool behavior abstractly

  30. Describe analysis tool behavior abstractly
    Scalable* analysis environment transparently tracks details
    *several examples of 10,000+ dataset analyses across the world
  31. Describe analysis tool behavior abstractly
    Scalable* analysis environment transparently tracks details
    Scalable* workflow system for automated complex analysis
    *several examples of 10,000+ dataset analyses across the world
  32. Describe analysis tool behavior abstractly
    Scalable* analysis environment transparently tracks details
    Scalable* workflow system for automated complex analysis
    Pervasive sharing, and publication of documents with integrated analysis
    *several examples of 10,000+ dataset analyses across the world
  33. Visualization and visual analytics

  34. From Galaxy 101-1 to Galaxy 101-2
    What about features other than exons and SNPs?
      which transcription factor binding sites have the most SNPs?
      which exons have the most repeats?
    Workflow: Exons, SNPs → Join exons with SNPs → Group by exons → Sort exons by SNP count → Select top five exons → Recover exon info
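The Galaxy 101-1 steps listed on this slide (join, group, sort, select, recover) can be sketched in plain Python. This is an illustration on toy data, not how Galaxy's interval tools are implemented; the `Interval` type, `top_features_by_overlap`, and all coordinates are assumptions made for the example.

```python
from collections import Counter
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def top_features_by_overlap(features, variants, n=5):
    """Return the n features overlapped by the most variants.

    Assumes feature names are unique. The nested loop is fine for toy
    data; genome-scale inputs would need an interval index instead.
    """
    counts = Counter()
    # Join + group: count the variants that overlap each feature.
    for f in features:
        for v in variants:
            if f.chrom == v.chrom and f.start < v.end and v.start < f.end:
                counts[f.name] += 1
    # Sort by count (descending) and select the top n.
    top = counts.most_common(n)
    # Recover the full feature record for each selected name.
    by_name = {f.name: f for f in features}
    return [(by_name[name], count) for name, count in top]


# Toy data (made up for illustration).
exons = [
    Interval("chr1", 100, 200, "e1"),
    Interval("chr1", 300, 400, "e2"),
    Interval("chr2", 100, 200, "e3"),
]
snps = [
    Interval("chr1", 150, 151, "s1"),
    Interval("chr1", 160, 161, "s2"),
    Interval("chr1", 199, 200, "s3"),
    Interval("chr1", 350, 351, "s4"),
    Interval("chr2", 500, 501, "s5"),
]

result = top_features_by_overlap(exons, snps, n=2)
for feature, count in result:
    print(feature.name, feature.chrom, feature.start, feature.end, count)
```

Passing a different feature list (transcription factor binding sites instead of exons, or repeats instead of SNPs) reuses the same steps, which is the point of treating the analysis as a workflow.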
  35. An analysis is really a workflow

  36. As analysis needs become increasingly complex, typical users have moved from running individual tools to primarily running workflows
  37. For research use, users need to be able to construct and modify workflows, not just run existing best practice pipelines
    The Galaxy workflow editor supports this use case well, providing ways for users to easily construct and modify workflows
  38. (Goecks et al. Cancer Medicine, 2015)

  39. (Goecks et al. Cancer Medicine, 2015)

  40. Galaxy 101-2
    https://github.com/nekrut/galaxy/wiki/Galaxy101-2
    On your own: do analysis across entire genome
  41. Agenda
    Introductions
    Galaxy introduction and exercises
    G-OnRamp introduction and demonstration

  42. G-OnRamp
    http://gonramp.org/
    Create Galaxy servers for
      ‣ utilizing large genomics datasets to annotate any eukaryotic genome
      ‣ providing educators with a platform to train undergraduate students on “big data” biomedical analyses
    Collaboration between Galaxy and Genomics Education Partnership (GEP)
    Opportunities to participate in G-OnRamp workshops this summer for research or for education: June 20-22 or July 25-27
  43. Genomics Education Partnership
    http://gep.wustl.edu
    Goals
      ‣ introduce genomics and bioinformatics into the undergraduate curriculum
      ‣ engage students in genomics research
    Approach
      ‣ use genome annotation of Drosophila for “hands-on” exercise
      ‣ students learn to integrate multiple lines of evidence, learn about genes/genomes, about genomics, underlying algorithms, and more
  44. GEP Gene Annotation
    [Figure: D. erecta F element contig1. Labels: Evidence tracks, Reconciled gene models, Sequence similarity, Gene predictions, RNA-seq, Comparative genomics, Repeats, Genomic sequence]
  45. GEP Results
    Results produced by GEP students are assembled for domain analysis and scientific publications
    Students report substantial learning gains
      ‣ Gains enhanced by increased time investment; Q4 > Q1
    [Figure: mean scores for learning gain items in the SURE survey, comparing Q1 (1-10 hrs.), Q4 (>36 hrs.), and SURE (summer research). Labeled items: Understanding the research process, Ability to analyze data, Independence]
    Scientific results: Leung et al. 2015, G3. 5(5):719-42
  46. GEP Participants
    >100 faculty from >100 schools, >1000 undergraduates participate annually
    [Figure legend: Year joined, 2006 to 2016]
  47. GEP Participants
    >100 faculty from >100 schools, >1000 undergraduates participate annually
    Shaffer CD et al. 2014, CBE Life Sci Educ. 13(1):111-30
    [Figure labels: Source of funding, Total enrollment, Admissions selectivity, Highest biology degree, % life sciences majors, Residential vs. commuter, Minority/Hispanic serving, Minority, Non-traditional students, First generation (>30%)]
  48. G-OnRamp: Use Galaxy to address GEP Challenges
    GEP challenge: Requires expertise (e.g., familiarity with Linux) to configure and run bioinformatics tools
      Galaxy feature: Provides a web-based user interface to configure and run tools
    GEP challenge: Difficult to share workflows and results
      Galaxy feature: Can make Histories, Datasets, and Workflows publicly available or share with individual Galaxy users
    GEP challenge: Difficult to incorporate additional analyses and tools
      Galaxy feature: Can use the Workflow Canvas to modify existing workflows and add new tools from the Galaxy Tool Shed
    GEP challenge: GEP projects are currently limited to the analysis of different Drosophila species
      Galaxy feature: Can extract a Workflow from History and run the Workflow on other genome assemblies
  49. GEP + Galaxy = G-OnRamp

  50. Galaxy for Genome Annotation
    Extends Galaxy with tools and workflows for genome annotation
    Combines multiple tools into reproducible sub-workflows
    Uses Hub Archive Creator (HAC) to create UCSC Assembly Hubs
    Displays genome browsers using the servers maintained by UCSC
  51. G-OnRamp Subworkflows
    Sequence similarity (tblastn search against protein sequences from informant species)
    Gene predictions (GlimmerHMM, Augustus, and SNAP)
    RNA-Seq (HISAT2, read coverage, splice junctions, and StringTie)
    Repeats (TRF)
  52. G-OnRamp Demonstration

  53. Simplify with Subworkflows

  54. Simplify with Subworkflows

  55. Summer 2016 G-OnRamp Beta Workshop
    Created UCSC Assembly Hubs for the G-OnRamp beta testers workshop
    10 participants from 9 institutions
      ‣ Five genome assemblies: Amazona vittata, Chlamydomonas reinhardtii, Kryptolebias marmoratus, Sebastes rubrivinctus, Xenopus laevis
      ‣ Assembly sizes: 111Mb - 2.8Gb
      ‣ Number of scaffolds: 54 - 402,501
      ‣ Four genomes with RNA-Seq data
    Photos by Tom MacKenzie (A. vittata), Dartmouth Electron Microscope Facility (C. reinhardtii), Chad King (S. rubrivinctus), Brian Gratwicke (X. laevis), and Jean-Paul Cicéron (K. marmoratus)
  56. G-OnRamp: Coming Features (1/2)
    Better documentation for usage and for teaching
    Extend workflow:
      ‣ ChIP-seq
      ‣ DNase-seq/ATAC-seq
      ‣ DNA methylation (bisulfite sequencing)
    Optimize workflow:
      ‣ better labels
      ‣ better repeat detection to speed up workflow
  57. G-OnRamp: Coming Features (2/2)
    Connect to broader ecosystem
      ‣ JBrowse for interactive viewing
      ‣ WebApollo for real-time interactive collaborative annotation
      ‣ CyVerse for storing and accessing generated data
    Make easier to install and use
      ‣ on local computer with a virtual machine for running small analyses
      ‣ on the cloud for large analyses
  58. Want to participate in Summer 2017 G-OnRamp beta testing workshop?
    G-OnRamp beta testing workshops will be held in Summer 2017 at Washington University in St. Louis
      ‣ June 20-22 or July 25-27
      ‣ Lodging and food costs covered, you pay travel costs
    Express interest and join mailing list at http://gonramp.org/signup
  59. Galaxy Summary
    Galaxy is an (obsessively) open framework for making data analysis accessible and reproducible
      ‣ Nearly everything in Galaxy is “pluggable”, allowing it to be customized for myriad purposes
    New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability
    By supporting and leveraging tool developers, the Galaxy community can collectively keep up with rapid changes in available tools
    http://galaxyproject.org for all things Galaxy
  60. G-OnRamp Summary
    G-OnRamp is a project that joins Galaxy with the Genomics Education Partnership (GEP)
    Provides best-practice genome annotation workflows for engaging undergraduates in data science and for data analysis
    Provides workshops for learning about the workflow and how to use it in education; see the sign-up sheet in the back or visit http://gonramp.org/signup
  61. The “Core” Galaxy Team
    Dan Blankenberg, Nate Coraor, Dannon Baker, Jeremy Goecks, Anton Nekrutenko, James Taylor, Dave Clements, Jennifer Jackson, Dave Bouvier, Sam Guerler, Martin Čech, Enis Afgan, Nick Stoler, Mo Heydarian, John Chilton
    Group labels on the slide: Leadership, Engineering, Support and outreach
    Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Johns Hopkins University, Oregon Health and Science University, and the Pennsylvania Department of Public Health
  62. Extended team and other contributors…
    Björn Grüning (Uni Freiburg), Peter Cock (TJHI), Kyle Ellrott (OHSU), Eric Rasche (CPT), Nicola Soranzo (TGAC), Brad Chapman (HSPH), Nuwan Goonasekera (VeRSI), Yousef Kowsar (VLSCI)
    And many others who have contributed to the main Galaxy code, tools to the ToolShed, participated in discussions, attended the Galaxy conferences, …
  63. Galaxy is a community!
    Join us on irc, mailing lists, Galaxy Biostar
    Contribute code on bitbucket, github, or the ToolShed
    Join us for a Hackathon or our annual conference
  64. G-OnRamp
    http://gonramp.org
      ‣ Wilson Leung, Wash U in St. Louis
      ‣ Yating Liu, Wash U in St. Louis
      ‣ Sarah Elgin, Wash U in St. Louis
      ‣ Jeremy Goecks, OHSU