PAG 2018 Galaxy Workshop - G-OnRamp

Jeremy Goecks
January 17, 2018

G-OnRamp (http://gonramp.org) is a collaboration between two successful and long-running projects — the Genomics Education Partnership (GEP; http://gep.wustl.edu) and the Galaxy Project (https://galaxyproject.org). G-OnRamp provides biologists with an integrated, web-based, scalable environment for interactive annotation of eukaryotic genomes using large genomic datasets. It also provides educators with a platform to help undergraduates develop “big data” science skills through eukaryotic genome annotation.

GEP is a consortium of over 100 colleges and universities that provides Classroom Undergraduate Research Experiences (CURE) in bioinformatics/genomics for students at all levels. GEP faculty currently use the annotation of multiple Drosophila species to introduce genomics and research thinking to undergraduates. Galaxy is a popular open-source, web-based scientific gateway for accessible, reproducible, and transparent analyses of large biomedical datasets. G-OnRamp extends Galaxy with tools and workflows that create UCSC Assembly Hubs and Apollo/JBrowse genome browsers with evidence tracks for sequence similarity, ab initio gene predictions, RNA-Seq, and repeats. Educators can use this system to design CUREs based on their favorite eukaryotic species (e.g., parasitoid wasps).

G-OnRamp provides a VirtualBox virtual appliance and an AMI image for local and cloud (Amazon EC2) deployments. Future versions of G-OnRamp will (i) enable data storage with CyVerse; (ii) support additional configuration options for UCSC Assembly Hubs; and (iii) adapt GEP annotation tools for other informant species. We will host G-OnRamp training workshops in June and July 2018 at Washington University in St. Louis. If you are interested in attending, please sign up for the G-OnRamp mailing list at http://gonramp.org/signup. Supported by NIH 1R25GM119157.

Transcript

  1. Using Galaxy for Genome Annotation
    Jeremy Goecks
    Assistant Professor, Oregon Health and Science University
    @galaxyproject / #usegalaxy
    http://www.galaxyproject.org

  2. The Lead: G-OnRamp
    http://gonramp.org
    Ready to use Galaxy server and workflows for genome annotation
    • for any eukaryotic genome, can adapt as needed
    • generates genome browser for doing manual annotation
    • integrates into ecosystem of bioinformatics tools such as UCSC Genome Browser, JBrowse/Apollo, and CyVerse
    • applicable to both research and education
    Travel funds available for attending G-OnRamp workshops this summer: http://gonramp.org/signup
    [Workflow diagram labels: Reference genome assembly; Transcripts / proteins from informant genome; RNA-Seq reads; Sequence similarity; RNA-Seq analysis; Gene predictions; Repeats; Hub Archive Creator; JBrowse Archive Creator]

  3. Outline
    Motivation: Genomics Education Partnership and Eukaryotic Genome Annotation
    G-OnRamp Features and Demo
    Next Steps and Participation Opportunities

  4. Goals:
    • Integration of genomics into the undergraduate biology curriculum
    • Integration of research thinking into the academic year curriculum
    • Creation of dynamic student-scientist partnerships
    • Publication of research in genomics & in science education
    Advantages of bioinformatics:
    – Low laboratory costs (computers, internet connection)
    – Web-based tools
    – No lab safety issues; open access 24/7
    – Lends itself to peer instruction
    – Allows low-cost failure
    – Large pool of publicly accessible raw data
    • Currently 126 faculty from >100 colleges & universities; last year 74 schools claimed projects, >1500 students involved.
    Courtesy of Sally Elgin (and next ~9 slides)

  5. Current GEP Membership
    • Mostly PUIs & community colleges
    • >100 faculty from >100 schools, >1000 undergraduates participate annually
    [Map legend: Year joined, 2006 through 2016 & 2017]

  6. Drosophila comparative genomics: to understand the organization, evolution and gene function on the 4th (dot) chromosome
    [Phylogenetic tree, species shown: D. melanogaster, D. simulans, D. sechellia, D. yakuba, D. ficusphila, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. kikkawai, D. ananassae, D. bipectinata, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, D. grimshawi, D. erecta]
    [Tree legend: Reference; Ongoing annotation; 2010 Genetics publication; 2015 G3 publication; 2017 G3 publication; New species sequenced by modENCODE]
    Phylogenetic tree produced by Thom Kaufman as part of the modENCODE project

  7. Strategy: divide and conquer!
    Students improve the sequence & carefully annotate the genes
    • Use publicly available genome sequences from the web
    • Divide into 100 or 40 kb projects, which different schools claim & then submit
    • Each project done at least twice independently; checked by student; results reconciled (~75% complete congruence for D. biarmipes).
    [Figure legend: Fosmid sequence matches consensus sequence; Putative polymorphisms]
    Finished sequences and improved annotations noted in FlyBase, publicly available.
    This could not be accomplished without the efforts of hundreds of students;
    what research could you do with that kind of help?
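
The divide-and-conquer strategy on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not GEP's actual procedure: the function names `split_into_projects` and `assign_projects` are hypothetical, projects are cut as simple non-overlapping windows, and schools are assigned round-robin so that every project gets two independent determinations.

```python
def split_into_projects(scaffold_length, project_size=40_000):
    """Cut a scaffold into half-open (start, end) windows of project_size bases.

    Assumption: fixed-size, non-overlapping windows; the last one may be shorter.
    """
    if scaffold_length < 0 or project_size <= 0:
        raise ValueError("scaffold_length must be >= 0 and project_size > 0")
    return [
        (start, min(start + project_size, scaffold_length))
        for start in range(0, scaffold_length, project_size)
    ]


def assign_projects(projects, schools, replicates=2):
    """Give each project to `replicates` different schools (round-robin).

    Hypothetical helper: it models "each project done at least twice
    independently" and nothing more.
    """
    if len(schools) < replicates:
        raise ValueError("need at least as many schools as replicates")
    assignments = {}
    for i, project in enumerate(projects):
        # Consecutive indices modulo len(schools) are distinct schools.
        assignments[project] = [
            schools[(i * replicates + r) % len(schools)]
            for r in range(replicates)
        ]
    return assignments


if __name__ == "__main__":
    windows = split_into_projects(100_000, project_size=40_000)
    for window, claimed_by in assign_projects(windows, ["A", "B", "C"]).items():
        print(window, claimed_by)
```

With two determinations per project, reconciliation becomes a comparison of two independent gene models for the same window.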

  8. 2015 publication built on student assemblies and gene annotations from four species (940 students contributed)
    Students improved 3.8 Mb, closed 72/86 gaps, added 44,468 bases; annotated 878 genes (1619 isoforms); only 58% agree with GLEAN-R.
    (D. mojavensis Muller F element)

  9. Annotation challenge: Create gene models from evidence: sequence similarity, computational predictions, RNA-seq, etc.
    (GEP UCSC Genome Browser Mirror, D. mojavensis)
    [Browser track groups: Improved Sequence; Sequence Homology; Gene Predictions; RNA-seq, TopHat, Cufflinks; Repeats]

  10. SURE survey: More time spent on the GEP project results in a better understanding of science.
    [Chart labels: Skill in interpreting results; Ability to analyze data; GEP school; SURE 09 Bio; Blue = 10 hr; Red = ~45 hr]

  11. Conclusions:
    • Genomics is an excellent topic for CUREs (course-based undergraduate research experiences)
    • Evidence points to knowledge gains, gains in understanding of science, improved retention in STEM & improved graduation rates with CUREs.
    • Massively parallel undergraduates can accomplish research that could not be done in any other way!
    • How to create varied opportunities? G-OnRamp should make it easy to create good genome browsers for annotating new genomes!

  12. Galaxy features complement GEP challenges
    GEP challenge: Requires expertise (e.g., familiarity with Linux) to configure and run bioinformatics tools
    Galaxy feature: Provides a web-based user interface to configure and run tools

    GEP challenge: Difficult to share workflows and results
    Galaxy feature: Can make Histories, Datasets, and Workflows publicly available or share with individual Galaxy users

    GEP challenge: Difficult to incorporate additional analyses and tools
    Galaxy feature: Can use the Workflow Canvas to modify existing workflows and add new tools from the Galaxy Tool Shed

    GEP challenge: GEP projects are currently limited to the analysis of different Drosophila species
    Galaxy feature: Can extract a Workflow from History and run the Workflow on other genome assemblies

  13. Outline
    Motivation: Genomics Education Partnership and Eukaryotic Genome Annotation
    G-OnRamp Features and Demo
    Next Steps and Participation Opportunities

  14. The Lead: G-OnRamp
    http://gonramp.org
    Ready to use Galaxy server and workflows for genome annotation
    • for any eukaryotic genome, can adapt as needed
    • generates genome browser for doing manual annotation
    • integrates into ecosystem of bioinformatics tools such as UCSC Genome Browser, JBrowse/Apollo, and CyVerse
    • applicable to both research and education
    Travel funds available for attending G-OnRamp workshops this summer: http://gonramp.org/signup
    [Workflow diagram labels: Reference genome assembly; Transcripts / proteins from informant genome; RNA-Seq reads; Sequence similarity; RNA-Seq analysis; Gene predictions; Repeats; Hub Archive Creator; JBrowse Archive Creator]

  15. G-OnRamp can facilitate annotation projects
    For faculty:
    • Investigator identifies new species and any genes, sequences, or regions of interest
    • Students can use GEP and G-OnRamp for annotation
    • Recommend “parallel” projects with two independent determinations
    • GEP looking for partners with new projects
    For researchers: get a copy of G-OnRamp and start working
    [Figure label: Gene Model Checker]

  16. The Tools, by Sub-Workflow
    Sequence similarity: NCBI BLAST+, UCSC BLAT
    RNA-Seq: HISAT, StringTie, regtools
    Gene predictions: Augustus, GlimmerHMM, SNAP
    Repeats: Tandem Repeats Finder
    Create Genome Browser Assembly Hubs: Hub Archive Creator, JBrowse Archive Creator

  17. [Workflow diagram labels: Input Data; Sequence Similarity; Ab initio Gene Predictions; RNA-Seq; Repeats; JBrowse Archive Creator; JBrowse Archive to Apollo]

  18. Use the UCSC genome browser to visualize multiple genomic datasets
    • NCBI BLAST+
    • UCSC BLAT
    • Augustus
    • GlimmerHMM
    • SNAP
    • HISAT
    • StringTie
    • regtools
    • TRF
    • WindowMasker
    [Browser screenshot, D. miranda; track groups: Repeats; Sequence similarity; RNA-Seq analysis; Gene predictions]
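
For readers new to UCSC Assembly Hubs, the sketch below shows the kind of plain-text files a hub consists of. It is not the Hub Archive Creator's implementation. The field names (`hub`, `shortLabel`, `longLabel`, `genomesFile`, `email`, `genome`, `trackDb`, `twoBitPath`, `description`, `organism`) follow my reading of the public UCSC Assembly Hub documentation and should be checked against the current docs; every value, file name, and function name here is a made-up placeholder.

```python
def render_stanza(fields):
    """Render (key, value) pairs as one 'key value' stanza, one pair per line."""
    return "\n".join(f"{key} {value}" for key, value in fields) + "\n"


def render_hub_files(hub_name, genome_name, two_bit_name, email,
                     organism="Unknown organism"):
    """Return (hub_txt, genomes_txt) text for a single-genome assembly hub.

    Assumption: a minimal layout with all genome files in a directory named
    after the genome. Real hubs usually carry more settings than this.
    """
    hub_txt = render_stanza([
        ("hub", hub_name),
        ("shortLabel", hub_name),
        ("longLabel", f"Assembly hub for {genome_name}"),
        ("genomesFile", "genomes.txt"),
        ("email", email),
    ])
    genomes_txt = render_stanza([
        ("genome", genome_name),
        ("trackDb", f"{genome_name}/trackDb.txt"),
        ("twoBitPath", f"{genome_name}/{two_bit_name}"),
        ("description", f"{genome_name} assembly"),
        ("organism", organism),
    ])
    return hub_txt, genomes_txt


if __name__ == "__main__":
    # Placeholder names only; not taken from the talk.
    hub_txt, genomes_txt = render_hub_files(
        "dmir_hub", "dmir", "dmir.2bit", "user@example.org",
        organism="Drosophila miranda",
    )
    print(hub_txt)
    print(genomes_txt)
```

The evidence tracks listed on this slide would then be described in the `trackDb.txt` file that `genomes.txt` points to.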

  19. Apollo: The Annotation Platform
    An instantaneous, collaborative, genome annotation editor
    Programmatic interaction between Galaxy and Apollo, from organism population to user administration

  20. Use the JBrowse/Apollo browser to visualize multiple genomic datasets
    • UCSC BLAT
    • Augustus
    • GlimmerHMM
    • SNAP
    • HISAT
    • StringTie
    • regtools
    • TRF
    • WindowMasker
    [Browser screenshot, D. miranda; track groups: Sequence similarity; Gene predictions; RNA-Seq analysis; Repeats]

  21. G-OnRamp Demonstration

  22. Outline
    Motivation: Genomics Education Partnership and Eukaryotic Genome Annotation
    G-OnRamp Features and Demo
    Next Steps and Participation Opportunities

  23. Simplify with Subworkflows

  24. Simplify with Subworkflows

  25. G-OnRamp: Future Developments
    Adding tools for simple user management
    Galaxy:
    • Adding personal storage with 'Pluggable Media'
    • AWS, Azure, GCE & CyVerse integration
    • Authorization from external providers (Google, etc.) via Python Social Auth
    Apollo:
    • Instructor role as 'middle manager'

  26. Acquiring G-OnRamp
    Available through (a) virtual machine image for local deployment or (b) AWS AMI for cloud deployment (Amazon EC2)
    • includes Galaxy, JBrowse, and Apollo
    Detailed installation instructions can be found at http://gonramp.org

  27. Summer 2016 G-OnRamp Alpha Workshop
    Created UCSC Assembly Hubs for the G-OnRamp alpha testers workshop
    10 participants from 9 institutions
    ✦ Five genome assemblies: Amazona vittata, Chlamydomonas reinhardtii, Kryptolebias marmoratus, Sebastes rubrivinctus, Xenopus laevis
    ✦ Assembly sizes: 111Mb - 2.8Gb
    ✦ Number of scaffolds: 54 - 402,501
    ✦ Four genomes with RNA-Seq data
    Photos by Tom MacKenzie (A. vittata), Dartmouth Electron Microscope Facility (C. reinhardtii), Chad King (S. rubrivinctus), Brian Gratwicke (X. laevis), and Jean-Paul Cicéron (K. marmoratus)
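
The two figures quoted for these assemblies, total size and number of scaffolds, are easy to compute for any new genome before loading it. The helper below is a minimal sketch under stated assumptions, not part of G-OnRamp: `assembly_stats` is a hypothetical name, and it expects an uncompressed FASTA file in which each `>` header starts one scaffold.

```python
import io


def assembly_stats(fasta_handle):
    """Return (number_of_scaffolds, total_bases) for a FASTA text stream.

    Assumptions: plain-text FASTA, one '>' header per scaffold, and every
    non-header, non-blank line contains only sequence characters.
    """
    n_scaffolds = 0
    total_bases = 0
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            n_scaffolds += 1
        else:
            total_bases += len(line)
    return n_scaffolds, total_bases


if __name__ == "__main__":
    # Tiny made-up example; a real assembly would be read with open(path).
    demo = io.StringIO(">scaffold_1\nACGTAC\nGT\n>scaffold_2\nGGCC\n")
    scaffolds, bases = assembly_stats(demo)
    print(f"{scaffolds} scaffolds, {bases / 1e6:.6f} Mb")
```

Reading line by line keeps memory use flat, which matters at the upper end of the ranges on this slide.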

  28. Summer 2017 G-OnRamp Beta Workshop
    Created UCSC Assembly Hubs for the G-OnRamp beta testers workshop
    23 participants from 21 institutions
    ✦ Ten genome assemblies
    ✦ Assembly sizes: 70Mb - 2.8Gb
    ✦ Eight genomes with RNA-Seq data
    https://de.cyverse.org/anon-files/iplant/home/shared/G-OnRamp_hubs/index.html

  29. Summer 2018 G-OnRamp Workshops
    • 15 genome browsers created:
    • Assembly sizes: 70Mb - 2.8Gb
    • Number of scaffolds: 54 - 402,501
    • Data hosted on the CyVerse Data Store
    • http://gonramp.org ➜ “View Assembly”
    • June 12-15 and July 16-19 at Washington University in St. Louis
    • Travel, room and meals supported by NIH BD2K grant
    Sign up for more information: http://gonramp.org/signup

  30. G-OnRamp Summary
    G-OnRamp is a project that joins Galaxy with the Genomics Education Partnership (GEP)
    Provides best-practice genome annotation workflows for engaging undergraduates in data science and for data analysis
    Provides workshops for learning about the workflow and how to use it in education; see the sign-up sheet in the back or visit http://gonramp.org/signup

  31. Thank you
    G-OnRamp web site (VMs, workshop information): http://gonramp.org
    G-OnRamp Demo Server: http://cloud5.galaxyproject.org
    Galaxy: http://galaxyproject.org
    Genomics Education Partnership: http://gep.wustl.edu

    Wilson Leung, Wash U in St. Louis
    Yating Liu, Wash U in St. Louis
    Sarah Elgin, Wash U in St. Louis
    Jeremy Goecks, OHSU
    Luke Sargent, OHSU
