Genome Annotation with Galaxy and G-OnRamp

Jeremy Goecks
January 17, 2017

Slides for the "Genome Annotation with Galaxy and G-OnRamp" workshop at the 2017 Plant and Animal Genome Conference (PAG)

Transcript

  1. Genome Annotation with Galaxy and G-OnRamp
    Jeremy Goecks
    Assistant Professor, Oregon Health and Science University
    @galaxyproject / #usegalaxy
    http://www.galaxyproject.org

  2. Agenda
    Introductions
    Galaxy introduction and exercises
    G-OnRamp introduction and demonstration

  3. Workshop Leaders and Key URLs
    Wilson Leung, Wash U in St. Louis
    Yating Liu, Wash U in St. Louis
    Jeremy Goecks, OHSU
    Dave Clements, Johns Hopkins
    Galaxy: http://galaxyproject.org
    G-OnRamp: http://gonramp.org

  4. Goals
    Share Galaxy awesomeness with you
    Get you working with Galaxy
    Demonstrate a complex workflow in Galaxy

  5. Agenda
    Introductions
    Galaxy introduction and exercises
    G-OnRamp introduction and demonstration

  6. Motivation for Galaxy: Computational (e.g., Genomic) Analyses are Difficult
    Investigators unfamiliar with computation, but complex methods and infrastructure required
    Creating and reproducing workflows (pipelines) hindered by complexity: systems, scripts, tools, parameters
    Collaboration and reuse difficult because current approaches do not support computational artifacts well

  7. Accessibility
    Getting started in bioinformatics is so very hard
    ‣ command line syntax
    ‣ tool and dependency installation
    ‣ creating pipelines (workflows)
    ‣ using computing clouds/clusters
    These difficulties hinder biomedicine in profound ways
    ‣ time spent on computing rather than science
    ‣ little exploration and difficult to test ideas
    ‣ computing is underutilized

  8. Reproducibility for Computational Science
    Reproducibility is not provenance, reusability/generalizability, or correctness
    Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced (given the data)
    Yet most published analyses are not reproducible
    (see e.g. Ioannidis et al. 2009: 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012: 7/50 resequencing experiments reproducible)
    Missing software, versions, parameters, data…
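
The items named on this slide (software, versions, parameters, data) can be thought of as fields of a per-step record. The following is a minimal sketch of such a record, not Galaxy's actual data model; the class, the helper functions, and all example values are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict, field


@dataclass
class AnalysisStep:
    """One step of an analysis, captured in enough detail to rerun it."""
    tool: str                 # tool name
    version: str              # exact tool version
    parameters: dict          # every parameter, including defaults
    input_sha256: list = field(default_factory=list)  # checksums of inputs


def sha256_of(data: bytes) -> str:
    """Checksum that identifies the exact input data."""
    return hashlib.sha256(data).hexdigest()


def missing_fields(step: AnalysisStep) -> list:
    """Names of fields left empty: the gaps that block reproduction."""
    return [name for name, value in asdict(step).items() if not value]


step = AnalysisStep(
    tool="example_aligner",           # hypothetical tool name
    version="1.2.3",                  # hypothetical version
    parameters={"seed_length": 20},   # hypothetical parameter
    input_sha256=[sha256_of(b"ACGT\n")],
)

# A serialized record can travel with the results or a publication.
record = json.dumps(asdict(step), indent=2, sort_keys=True)
print(record)

# A step described only by its tool name is flagged as incomplete.
print(missing_fields(AnalysisStep("example_aligner", "", {})))
```

A record like this only helps if it is captured automatically at run time; filling it in by hand after the fact is where versions and parameters tend to go missing.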

  9. Reproducibility Project: Cancer Biology
    Independently replicating 50 “high-impact” cancer
    studies from 2010-2012
    (https://osf.io/e81xl/wiki/home/)

  10. Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014):
    Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology.
    figshare.
    http://dx.doi.org/10.6084/m9.figshare.987130
    32/127 tools
    6/41 papers

  11. Collaboration and Reuse
    There is very little actionable data or methods in PDF
    documents
    ‣ extract a table from a PDF document?
    Need links/embedding of methods plus surrounding
    discussion
    ‣ community understanding and evaluation critical
    ‣ want to build on existing methods rather than start from
    scratch

  12. Galaxy: accessible analysis system
    Consistent tool user
    interfaces automatically
    generated
    History system facilitates and
    tracks multistep analyses
    Exact parameters of a step
    can always be inspected, and
    easily rerun
    Workflow system

  13. Galaxy is…
    A free (for everyone) web service (http://usegalaxy.org)
    integrating a wealth of tools, compute resources, terabytes of
    reference data and permanent storage
    Open source software that makes integrating your own tools
    and data and customizing for your own site simple
    An open extensible platform for sharing tools, datatypes,
    workflows, ...

  14. Galaxy’s Ideological Goals
    How best can data intensive methods be
    accessible to scientists?
    How best to facilitate transparent communication
    of computational analyses?
    How best to ensure that analyses are
    reproducible?

  15. Galaxy’s Practical Goals
    How to arm researchers with access to latest tools
    and applications
    How to build a community of tool developers
    How to run Galaxy on any HPC

  16. Ways to use Galaxy
    The public web service at http://usegalaxy.org
    Install locally with many compute environments
    Deploy on a cloud using Cloudman
    Atmosphere

  17. Galaxy Main Usage

  18. Galaxy Main Usage

  19. bit.ly/gxyServers

  20. Proteomics
    Metabolomics
    Drug Discovery
    Cosmology
    Image Analysis
    Climate Change
    Social Science
    Natural Language

  21. Galaxy Citations

  22. Goal: Bring everyone together
    user (scientist)
    HPC admin
    dev

  23. Goal: Bring everyone together
    user
    admin
    dev

  24. Goal: Bring everyone together
    admin
    user
    dev

  25. Goal: Bring everyone together
    user
    dev
    Galaxy
    admin

  26. Galaxy 101-1

  27. Galaxy 101-1: Find the top 5 exons
    with the highest number of SNPs
    Exons: ~10,000 regions SNPs: ~200,000 regions
    https://github.com/nekrut/galaxy/wiki/Galaxy101-1

  28. Galaxy features

  29. Describe analysis tool
    behavior abstractly

  30. Describe analysis tool
    behavior abstractly
    Scalable* analysis environment
    transparently tracks details
    *several examples of 10,000+ dataset analyses across the world

  31. Describe analysis tool
    behavior abstractly
    Scalable* analysis environment
    transparently tracks details
    Scalable* workflow system for
    automated complex analysis
    *several examples of 10,000+ dataset analyses across the world

  32. Describe analysis tool
    behavior abstractly
    Scalable* analysis environment
    transparently tracks details
    Scalable* workflow system for
    automated complex analysis
    Pervasive sharing, and publication
    of documents with integrated analysis
    *several examples of 10,000+ dataset analyses across the world

  33. Visualization and visual analytics

  34. From Galaxy 101-1 to Galaxy 101-2
    What about features other than exons and SNPs?
    which transcription factor binding sites have the most SNPs?
    which exons have the most repeats?
    Inputs: Exons, SNPs
    Join exons with SNPs
    Group by exons
    Sort exons by SNP count
    Select top five exons
    Recover exon info
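
The workflow steps on this slide can be sketched in plain Python. This is an illustration of the logic only, not what the Galaxy tools run internally; the function names and the toy coordinates are made up, and the all-pairs join would be too slow for the ~10,000 exons and ~200,000 SNPs in the exercise.

```python
from collections import Counter

# Toy data in BED-like form: (chrom, start, end, name); coordinates are
# zero-based and half-open. All values are made up for illustration.
exons = [
    ("chr22", 100, 200, "exonA"),
    ("chr22", 300, 400, "exonB"),
    ("chr22", 500, 600, "exonC"),
]
snps = [
    ("chr22", 110, 111, "snp1"),
    ("chr22", 150, 151, "snp2"),
    ("chr22", 310, 311, "snp3"),
    ("chr22", 450, 451, "snp4"),  # falls between exons, so it is dropped
]


def overlaps(a, b):
    """True when two intervals share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def top_exons(exons, snps, n=5):
    # 1. Join exons with SNPs (naive all-pairs join; fine for toy data).
    joined = [(exon, snp) for exon in exons for snp in snps if overlaps(exon, snp)]
    # 2. Group by exon and count the SNPs in each group.
    counts = Counter(exon[3] for exon, _ in joined)
    # 3 and 4. Sort by SNP count and select the top n.
    top = counts.most_common(n)
    # 5. Recover the full exon record for each selected name.
    by_name = {exon[3]: exon for exon in exons}
    return [(by_name[name], count) for name, count in top]


for exon, count in top_exons(exons, snps):
    print(exon, count)
```

Swapping the `snps` list for another feature set (repeats, transcription factor binding sites) leaves the steps unchanged, which is the point of treating the analysis as a reusable workflow.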

  35. An analysis is really a workflow

  36. As analysis needs become
    increasingly complex, typical users
    have moved from running individual
    tools to primarily running workflows

  37. For research use, users need to be
    able to construct and modify
    workflows, not just run existing best
    practice pipelines
    The Galaxy workflow editor supports
    this use case well, providing ways for
    users to easily construct and modify
    workflows

  38. (Goecks et al. Cancer Medicine, 2015)

  39. (Goecks et al. Cancer Medicine, 2015)

  40. Galaxy 101-2
    https://github.com/nekrut/galaxy/wiki/Galaxy101-2
    On your own: do analysis across entire genome

  41. Agenda
    Introductions
    Galaxy introduction and exercises
    G-OnRamp introduction and demonstration

  42. G-OnRamp
    http://gonramp.org/
    Create Galaxy servers for
    ‣ utilizing large genomics datasets to annotate any eukaryotic
    genome
    ‣ providing educators with a platform to train undergraduate
    students on “big data” biomedical analyses
    Collaboration between Galaxy and Genomics Education Partnership
    (GEP)
    Opportunities to participate in G-OnRamp workshops this summer
    for research or for education: June 20-22 or July 25-27

  43. Genomics Education Partnership
    http://gep.wustl.edu
    Goals
    ‣ introduce genomics and bioinformatics into the undergraduate
    curriculum
    ‣ engage students in genomics research
    Approach
    ‣ use genome annotation of Drosophila for “hands-on” exercise
    ‣ students learn to integrate multiple lines of evidence, learn about
    genes/genomes, about genomics, underlying algorithms, and more

  44. GEP Gene Annotation
    Evidence tracks
    Reconciled gene models
    Sequence similarity
    Gene predictions
    RNA-seq
    Comparative genomics
    Repeats
    Genomic sequence
    D. erecta F element contig1

  45. GEP Results
    Results produced by GEP students are assembled for domain analysis and scientific publications
    Students report substantial learning gains
    ‣ Gains enhanced by increased time investment; Q4 > Q1
    [Figure: mean scores on 20 learning gain items in the SURE survey, including understanding the research process, ability to analyze data, and independence; series: Q1 (1-10 hrs.), Q4 (>36 hrs.), SURE (summer research)]
    Scientific results: Leung et al. 2015, G3. 5(5):719-42

  46. GEP Participants
    >100 faculty from >100 schools, >1000
    undergraduates participate annually
    [Figure: GEP participants by year joined, 2006-2016]

  47. GEP Participants
    >100 faculty from >100 schools, >1000
    undergraduates participate annually
    Shaffer CD et al. 2014, CBE Life Sci Educ. 13(1):111-30
    Source of funding
    Total enrollment
    Admissions selectivity
    Highest biology degree
    % life sciences majors
    Residential vs. commuter
    Minority/Hispanic serving
    Minority
    Non-traditional students
    First generation (>30%)

  48. G-OnRamp: Use Galaxy to address GEP Challenges
    GEP challenge: Requires expertise (e.g., familiarity with Linux) to configure and run bioinformatics tools
    Galaxy feature: Provides a web-based user interface to configure and run tools
    GEP challenge: Difficult to share workflows and results
    Galaxy feature: Can make Histories, Datasets, and Workflows publicly available or share with individual Galaxy users
    GEP challenge: Difficult to incorporate additional analyses and tools
    Galaxy feature: Can use the Workflow Canvas to modify existing workflows and add new tools from the Galaxy Tool Shed
    GEP challenge: GEP projects are currently limited to the analysis of different Drosophila species
    Galaxy feature: Can extract a Workflow from History and run the Workflow on other genome assemblies

  49. GEP + Galaxy = G-OnRamp

  50. Galaxy for Genome Annotation
    Extends Galaxy with tools
    and workflows for genome
    annotation
    Combines multiple tools into
    reproducible sub-workflows
    Uses Hub Archive Creator
    (HAC) to create UCSC
    Assembly Hubs
    Displays genome browsers
    using the servers maintained
    by UCSC
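
For readers unfamiliar with UCSC Assembly Hubs: a hub is a directory of plain-text stanza files plus data files that the UCSC Genome Browser loads by URL. The sketch below generates two of those files, `hub.txt` and `genomes.txt`, using standard UCSC hub settings. It is not the Hub Archive Creator's implementation; the function names, the assembly name, the e-mail address, and the default position are all hypothetical.

```python
def stanza(fields):
    """Render one UCSC-style stanza: one 'key value' pair per line."""
    return "\n".join(f"{key} {value}" for key, value in fields.items()) + "\n"


def assembly_hub_files(genome, organism, email):
    """Return the text of hub.txt and genomes.txt for a one-genome hub."""
    hub_txt = stanza({
        "hub": f"{genome}_hub",
        "shortLabel": f"{organism} hub",
        "longLabel": f"Assembly hub for {organism}",
        "genomesFile": "genomes.txt",
        "email": email,
    })
    genomes_txt = stanza({
        "genome": genome,
        "trackDb": f"{genome}/trackDb.txt",       # per-track settings live here
        "twoBitPath": f"{genome}/{genome}.2bit",  # the assembly sequence
        "organism": organism,
        "description": f"{organism} assembly",
        "defaultPos": "contig1:1-10000",          # hypothetical start position
    })
    return {"hub.txt": hub_txt, "genomes.txt": genomes_txt}


# Hypothetical assembly name and contact address.
files = assembly_hub_files("dere1", "Drosophila erecta", "user@example.org")
for name, text in files.items():
    print(f"--- {name} ---")
    print(text)
```

A complete hub also needs the `.2bit` sequence file and a `trackDb.txt` describing each evidence track, which is the part that a tool such as HAC automates.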

  51. G-OnRamp Subworkflows
    Sequence similarity (tblastn
    search against protein sequences
    from informant species)
    Gene predictions (GlimmerHMM,
    Augustus, and SNAP)
    RNA-Seq (HISAT2, read coverage,
    splice junctions, and StringTie)
    Repeats (TRF)
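
One way to picture the sub-workflow structure listed above is as a small tree: a top-level workflow that owns named sub-workflows, each of which owns tools. The sketch below models only that nesting; the `Workflow` class is hypothetical and is not Galaxy's workflow format, and only the tools named on the slide are included.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """A workflow holds its own tools plus any nested sub-workflows."""
    name: str
    tools: list = field(default_factory=list)
    subworkflows: list = field(default_factory=list)

    def flatten(self):
        """All tools: own tools first, then each sub-workflow's, in order."""
        result = list(self.tools)
        for sub in self.subworkflows:
            result.extend(sub.flatten())
        return result


annotation = Workflow(
    name="genome annotation",
    subworkflows=[
        Workflow("sequence similarity", ["tblastn"]),
        Workflow("gene predictions", ["GlimmerHMM", "Augustus", "SNAP"]),
        Workflow("RNA-Seq", ["HISAT2", "StringTie"]),
        Workflow("repeats", ["TRF"]),
    ],
)

# The top level shows four boxes, while the full analysis runs seven tools.
print(len(annotation.subworkflows), "sub-workflows,",
      len(annotation.flatten()), "tools")
```

Each sub-workflow can be tested or replaced on its own, which is what keeps the top-level view manageable as more evidence types are added.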

  52. G-OnRamp Demonstration

  53. Simplify with Subworkflows

  54. Simplify with Subworkflows

  55. Summer 2016 G-OnRamp Beta
    Workshop
    Created UCSC Assembly Hubs for the G-OnRamp beta testers workshop
    10 participants from 9 institutions
    ‣ Five genome assemblies: Amazona vittata, Chlamydomonas reinhardtii, Kryptolebias marmoratus, Sebastes rubrivinctus, Xenopus laevis
    ‣ Assembly sizes: 111Mb - 2.8Gb
    ‣ Number of scaffolds: 54 - 402,501
    ‣ Four genomes with RNA-Seq data
    Photos by Tom MacKenzie (A. vittata), Dartmouth Electron Microscope Facility (C. reinhardtii), Chad King (S. rubrivinctus), Brian Gratwicke (X. laevis), and Jean-Paul Cicéron (K. marmoratus)

  56. G-OnRamp: Coming Features (1/2)
    Better documentation for usage and for teaching
    Extend workflow:
    ‣ ChIP-seq
    ‣ DNase-seq/ATAC-seq
    ‣ DNA methylation (bisulfite sequencing)
    Optimize workflow:
    ‣ better labels
    ‣ better repeat detection to speed up workflow

  57. G-OnRamp: Coming Features (2/2)
    Connect to broader ecosystem
    ‣ JBrowse for interactive viewing
    ‣ WebApollo for real-time interactive collaborative annotation
    ‣ CyVerse for storing and accessing generated data
    Make easier to install and use
    ‣ on local computer with a virtual machine for running small
    analyses
    ‣ on the cloud for large analyses

  58. Want to participate in Summer 2017
    G-OnRamp beta testing workshop?
    G-OnRamp beta testing workshops will be held in
    Summer 2017 at Washington University in St. Louis
    ‣ June 20-22 or July 25-27
    ‣ Lodging and food costs covered, you pay travel
    costs
    Express interest and join mailing list at
    http://gonramp.org/signup

  59. Galaxy Summary
    Galaxy is an (obsessively) open framework for making data analysis
    accessible and reproducible
    ‣ Nearly everything in Galaxy is “pluggable”, allowing it to be customized for
    myriad purposes
    New UI approaches are enabling more complex analysis of much larger
    numbers of datasets without sacrificing usability
    By supporting and leveraging tool developers the Galaxy community can
    collectively keep up with rapid changes in available tools
    http://galaxyproject.org for all things Galaxy

  60. G-OnRamp Summary
    G-OnRamp is a project that joins Galaxy with the
    Genomics Education Partnership (GEP)
    Provides best-practice genome annotation
    workflows for engaging undergraduates in data
    science and for data analysis
    Provides workshops for learning about the workflow and
    how to use it in education; see the sign-up sheet in the
    back or visit http://gonramp.org/signup

  61. The “Core” Galaxy Team
    Groups: Leadership, Engineering, Support and outreach
    Dan Blankenberg, Nate Coraor, Dannon Baker, Jeremy Goecks, Anton Nekrutenko, James Taylor, Dave Clements, Jennifer Jackson, Dave Bouvier, Sam Guerler, Martin Čech, Enis Afgan, Nick Stoler, Mo Heydarian, John Chilton
    Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103),
    Penn State University, Johns Hopkins University, Oregon Health and Science University, and the
    Pennsylvania Department of Public Health

  62. Extended team and other contributors…
    Björn Grüning, Uni Freiburg
    Peter Cock, TJHI
    Kyle Ellrott, OHSU
    Eric Rasche, CPT
    Nicola Soranzo, TGAC
    Brad Chapman, HSPH
    Nuwan Goonasekera, VeRSI
    Yousef Kowsar, VLSCI
    And many others who have contributed to the main Galaxy code, tools to the ToolShed, participated in discussions, attended the Galaxy conferences, …

  63. Galaxy is a community!
    Join us on IRC, mailing lists, Galaxy Biostar
    Contribute code on Bitbucket, GitHub, or the ToolShed
    Join us for a Hackathon or our annual conference

  64. G-OnRamp
    http://gonramp.org
    Wilson Leung, Wash U in St. Louis
    Yating Liu, Wash U in St. Louis
    Sarah Elgin, Wash U in St. Louis
    Jeremy Goecks, OHSU
