PAG 2018 Galaxy Workshop - G-OnRamp

Jeremy Goecks
January 17, 2018

G-OnRamp (http://gonramp.org) is a collaboration between two successful and long-running projects — the Genomics Education Partnership (GEP; http://gep.wustl.edu) and the Galaxy Project (https://galaxyproject.org). G-OnRamp provides biologists with an integrated, web-based, scalable environment for interactive annotation of eukaryotic genomes using large genomic datasets. It also provides educators with a platform to help undergraduates develop “big data” science skills through eukaryotic genome annotation.

GEP is a consortium of over 100 colleges and universities that provides Classroom Undergraduate Research Experiences (CURE) in bioinformatics/genomics for students at all levels. GEP faculty currently use the annotation of multiple Drosophila species to introduce genomics and research thinking to undergraduates. Galaxy is a popular open-source, web-based scientific gateway for accessible, reproducible, and transparent analyses of large biomedical datasets. G-OnRamp extends Galaxy with tools and workflows that create UCSC Assembly Hubs and Apollo/JBrowse genome browsers with evidence tracks for sequence similarity, ab initio gene predictions, RNA-Seq, and repeats. Educators can use this system to design CUREs based on their favorite eukaryotic species (e.g., parasitoid wasps).

G-OnRamp provides a VirtualBox virtual appliance and an AMI image for local and cloud (Amazon EC2) deployments. Future versions of G-OnRamp will (i) enable data storage with CyVerse; (ii) support additional configuration options for UCSC Assembly Hubs; and (iii) adapt GEP annotation tools for other informant species. We will host G-OnRamp training workshops in June and July 2018 at Washington University in St. Louis. If you are interested in attending, please sign up for the G-OnRamp mailing list at http://gonramp.org/signup. Supported by NIH 1R25GM119157.

Transcript

  1. 1.

    Using Galaxy for Genome Annotation
    Jeremy Goecks, Assistant Professor, Oregon Health and Science University
    @galaxyproject / #usegalaxy
    http://www.galaxyproject.org
  2. 2.

    The Lead: G-OnRamp (http://gonramp.org)
    Ready-to-use Galaxy server and workflows for genome annotation
    ✦ for any eukaryotic genome, can adapt as needed
    ✦ generates genome browser for doing manual annotation
    ✦ integrates into ecosystem of bioinformatics tools such as UCSC Genome Browser, JBrowse/Apollo, and CyVerse
    ✦ applicable to both research and education
    Travel funds available for attending G-OnRamp workshops this summer: http://gonramp.org/signup
    Diagram labels: RNA-Seq reads; Sequence similarity; RNA-Seq analysis; Gene predictions; Repeats; Hub Archive Creator; JBrowse Archive Creator; Transcripts / proteins from informant genome; Reference genome assembly
  3. 3.

    Outline
    • Motivation: Genomics Education Partnership and Eukaryotic Genome Annotation
    • G-OnRamp Features and Demo
    • Next Steps and Participation Opportunities
  4. 4.

    Goals:
    • Integration of genomics into the undergraduate biology curriculum
    • Integration of research thinking into the academic year curriculum
    • Creation of dynamic student-scientist partnerships
    • Publication of research in genomics & in science education
    Advantages of bioinformatics:
    – Low laboratory costs (computers, internet connection)
    – Web-based tools
    – No lab safety issues; open access 24/7
    – Lends itself to peer instruction
    – Allows low-cost failure
    – Large pool of publicly accessible raw data
    • Currently 126 faculty from >100 colleges & universities; last year 74 schools claimed projects, >1500 students involved.
    Courtesy of Sally Elgin (and next ~9 slides)
  5. 5.

    Current GEP Membership
    Year joined: 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016 & 2017
    • Mostly PUIs & community colleges
    • >100 faculty from >100 schools, >1000 undergraduates participate annually
  6. 6.

    Drosophila comparative genomics: to understand the organization, evolution and gene function on the 4th (dot) chromosome
    Species shown: D. melanogaster, D. simulans, D. sechellia, D. yakuba, D. ficusphila, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. kikkawai, D. ananassae, D. bipectinata, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, D. grimshawi, D. erecta
    Legend: Reference; Ongoing annotation; New species sequenced by modENCODE; 2010 Genetics publication; 2015 G3 publication; 2017 G3 publication
    Phylogenetic tree produced by Thom Kaufman as part of the modENCODE project
  7. 7.

    Strategy: divide and conquer! Students improve the sequence & carefully annotate the genes
    • Use publicly available genome sequences from the web
    • Divide into 100 or 40 kb projects, which different schools claim & then submit
    • Each project done at least twice independently; checked by student; results reconciled (~75% complete congruence for D. biarmipes).
    Figure labels: Fosmid sequence matches consensus sequence; Putative polymorphisms
    Finished sequences and improved annotations noted in FlyBase, publicly available.
    This could not be accomplished without the efforts of hundreds of students; what research could you do with that kind of help?
  8. 8.

    2015 publication built on student assemblies and gene annotations from four species (940 students contributed)
    Students improved 3.8 Mb, closed 72/86 gaps, added 44,468 bases; annotated 878 genes (1619 isoforms); only 58% agree with GLEAN-R. (D. mojavensis Muller F element)
  9. 9.

    Annotation challenge: Create gene models from evidence: sequence similarity, computational predictions, RNA-seq, etc.
    (GEP UCSC Genome Browser Mirror, D. mojavensis)
    Track labels: Improved Sequence; Sequence Homology; Gene Predictions; RNA-seq, TopHat, Cufflinks; Repeats
  10. 10.

    SURE survey: More time spent on the GEP project results in a better understanding of science.
    Chart labels: GEP school; SURE 09 Bio; Skill in interpreting results; Ability to analyze data
    Blue = 10 hr; Red = ~45 hr
  11. 11.

    Conclusions:
    • Genomics is an excellent topic for CUREs (course-based undergraduate research experiences)
    • Evidence points to knowledge gains, gains in understanding of science, improved retention in STEM & improved graduation rates with CUREs.
    • Massively parallel undergraduates can accomplish research that could not be done in any other way!
    • How to create varied opportunities? G-OnRamp should make it easy to create good genome browsers for annotating new genomes!
  12. 12.

    Galaxy features complement GEP challenges
    • GEP challenge: Requires expertise (e.g., familiarity with Linux) to configure and run bioinformatics tools
      Galaxy feature: Provides a web-based user interface to configure and run tools
    • GEP challenge: Difficult to share workflows and results
      Galaxy feature: Can make Histories, Datasets, and Workflows publicly available or share with individual Galaxy users
    • GEP challenge: Difficult to incorporate additional analyses and tools
      Galaxy feature: Can use the Workflow Canvas to modify existing workflows and add new tools from the Galaxy Tool Shed
    • GEP challenge: GEP projects are currently limited to the analysis of different Drosophila species
      Galaxy feature: Can extract a Workflow from History and run the Workflow on other genome assemblies
  13. 13.

    Outline
    • Motivation: Genomics Education Partnership and Eukaryotic Genome Annotation
    • G-OnRamp Features and Demo
    • Next Steps and Participation Opportunities
  14. 14.

    The Lead: G-OnRamp (http://gonramp.org)
    Ready-to-use Galaxy server and workflows for genome annotation
    ✦ for any eukaryotic genome, can adapt as needed
    ✦ generates genome browser for doing manual annotation
    ✦ integrates into ecosystem of bioinformatics tools such as UCSC Genome Browser, JBrowse/Apollo, and CyVerse
    ✦ applicable to both research and education
    Travel funds available for attending G-OnRamp workshops this summer: http://gonramp.org/signup
    Diagram labels: RNA-Seq reads; Sequence similarity; RNA-Seq analysis; Gene predictions; Repeats; Hub Archive Creator; JBrowse Archive Creator; Transcripts / proteins from informant genome; Reference genome assembly
  15. 15.

    G-OnRamp can facilitate annotation projects
    For faculty:
    ✦ Investigator identifies new species and any genes, sequences, or regions of interest
    ✦ Students can use GEP and G-OnRamp for annotation
    ✦ Recommend “parallel” projects with two independent determinations
    ✦ GEP looking for partners with new projects
    For researchers: get a copy of G-OnRamp and start working
    Figure label: Gene Model Checker
  16. 16.

    The Tools, by Sub-Workflow
    • Sequence similarity: NCBI BLAST+, UCSC BLAT
    • RNA-Seq: HISAT, StringTie, regtools
    • Gene predictions: Augustus, GlimmerHMM, SNAP
    • Repeats: Tandem Repeats Finder
    • Create Genome Browser Assembly Hubs: Hub Archive Creator, JBrowse Archive Creator
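The grouping of tools by sub-workflow on this slide can be written down as a small lookup table. The sketch below is only an illustration: the tool names come from the slide, while the dictionary layout, the shortened "Create Genome Browser" key, and the two helper functions (`tools_for`, `sub_workflow_of`) are hypothetical and are not part of G-OnRamp or Galaxy.

```python
# Sub-workflow -> tools, transcribed from the slide.
# The dict layout and both helpers are illustrative assumptions,
# not an API provided by G-OnRamp or Galaxy.
SUB_WORKFLOWS = {
    "Sequence similarity": ["NCBI BLAST+", "UCSC BLAT"],
    "RNA-Seq": ["HISAT", "StringTie", "regtools"],
    "Gene predictions": ["Augustus", "GlimmerHMM", "SNAP"],
    "Repeats": ["Tandem Repeats Finder"],
    "Create Genome Browser": ["Hub Archive Creator", "JBrowse Archive Creator"],
}


def tools_for(sub_workflow):
    """Return a copy of the tools listed for a sub-workflow (empty if unknown)."""
    return list(SUB_WORKFLOWS.get(sub_workflow, []))


def sub_workflow_of(tool):
    """Return the sub-workflow that lists a tool, or None if it is not listed."""
    for name, tools in SUB_WORKFLOWS.items():
        if tool in tools:
            return name
    return None


if __name__ == "__main__":
    # Print one line per sub-workflow, in the order given on the slide.
    for name in SUB_WORKFLOWS:
        print(f"{name}: {', '.join(tools_for(name))}")
```

A table like this is mainly useful for bookkeeping, for example to check which evidence track a given tool feeds when adapting the workflow to a new genome.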
  17. 17.

    Workflow diagram labels: Input Data; Sequence Similarity; Ab initio Gene Predictions; RNA-Seq; JBrowse Archive Creator; JBrowse Archive to Apollo; Repeats
  18. 18.

    Use the UCSC genome browser to visualize multiple genomic datasets
    Tools: NCBI BLAST+, UCSC BLAT, Augustus, GlimmerHMM, SNAP, HISAT, StringTie, regtools, TRF, WindowMasker
    Track groups: Repeats; Sequence similarity; RNA-Seq analysis; Gene predictions
    Species: D. miranda
  19. 19.

    Apollo: The Annotation Platform
    An instantaneous, collaborative, genome annotation editor
    Programmatic interaction between Galaxy and Apollo, from organism population to user administration
  20. 20.

    Use the JBrowse/Apollo browser to visualize multiple genomic datasets
    Track groups: Sequence similarity; Gene predictions; RNA-Seq analysis; Repeats
    Tools: UCSC BLAT, Augustus, GlimmerHMM, SNAP, HISAT, StringTie, regtools, TRF, WindowMasker
    Species: D. miranda
  21. 22.

    Outline
    • Motivation: Genomics Education Partnership and Eukaryotic Genome Annotation
    • G-OnRamp Features and Demo
    • Next Steps and Participation Opportunities
  22. 25.

    G-OnRamp: Future Developments
    Adding tools for simple user management
    Galaxy:
    ✦ Adding personal storage with 'Pluggable Media'
    ✦ AWS, Azure, GCE & CyVerse integration
    ✦ Authorization from external providers (Google, etc) via Python Social Auth
    Apollo:
    ✦ Instructor role as 'middle manager'
  23. 26.

    Acquiring G-OnRamp
    Available through (a) virtual machine image for local deployment or (b) AWS AMI for cloud deployment
    ✦ includes Galaxy, JBrowse, and Apollo
    Detailed installation instructions can be found at http://gonramp.org
    Figure label: Amazon EC2
  24. 27.

    Summer 2016 G-OnRamp Alpha Workshop
    Created UCSC Assembly Hubs for the G-OnRamp alpha testers workshop
    10 participants from 9 institutions
    ✦ Five genome assemblies: Amazona vittata, Chlamydomonas reinhardtii, Kryptolebias marmoratus, Sebastes rubrivinctus, Xenopus laevis
    ✦ Assembly sizes: 111Mb - 2.8Gb
    ✦ Number of scaffolds: 54 - 402,501
    ✦ Four genomes with RNA-Seq data
    Photos by Tom MacKenzie (A. vittata), Dartmouth Electron Microscope Facility (C. reinhardtii), Chad King (S. rubrivinctus), Brian Gratwicke (X. laevis), and Jean-Paul Cicéron (K. marmoratus)
  25. 28.

    Summer 2017 G-OnRamp Beta Workshop
    Created UCSC Assembly Hubs for the G-OnRamp beta testers workshop
    23 participants from 21 institutions
    ✦ Ten genome assemblies
    ✦ Assembly sizes: 70Mb - 2.8Gb
    ✦ Eight genomes with RNA-Seq data
    https://de.cyverse.org/anon-files/iplant/home/shared/G-OnRamp_hubs/index.html
  26. 29.

    Summer 2018 G-OnRamp Workshops
    • 15 genome browsers created:
    • Assembly sizes: 70Mb - 2.8Gb
    • Number of scaffolds: 54 - 402,501
    • Data hosted on the CyVerse Data Store
    • http://gonramp.org ➜ “View Assembly”
    • June 12-15 and July 16-19 at Washington University in St. Louis
    • Travel, room and meals supported by NIH BD2K grant
    Sign up for more information: http://gonramp.org/signup
  27. 30.

    G-OnRamp Summary
    G-OnRamp is a project that joins Galaxy with the Genomics Education Partnership (GEP)
    Provides best-practice genome annotation workflows for engaging undergraduates in data science and for data analysis
    Provides workshops for learning about the workflow and how to use it in education; see the sign-up sheet in the back or visit http://gonramp.org/signup
  28. 31.

    Thank you
    G-OnRamp web site (VMs, workshop information): http://gonramp.org
    G-OnRamp Demo Server: http://cloud5.galaxyproject.org
    Galaxy: http://galaxyproject.org
    Genomics Education Partnership: http://gep.wustl.edu
    Wilson Leung, Wash U in St. Louis
    Yating Liu, Wash U in St. Louis
    Sarah Elgin, Wash U in St. Louis
    Jeremy Goecks, OHSU
    Luke Sargent, OHSU