Galaxy... from genomic data science gateway to global community

James Taylor
September 24, 2019

Keynote presentation on the origins of Galaxy (https://galaxyproject.org) and the Galaxy community, presented at Gateways 2019, the annual meeting hosted by the Science Gateways Community Institute (https://sciencegateways.org/).

Transcript

  1. ...from genomic data science
    gateway to global community
    James Taylor (@jxtx), Johns Hopkins, http://speakerdeck.com/jxtx

  2. 1. Science
    2. Gateways
    3. Community

  3. Mammalian comparative genomics — the beginning
    2001: Initial sequence of the human genome
    2002: Initial sequence of the mouse genome
    2004: Initial sequence of the rat genome

  4. Mammalian comparative genomics — the beginning
    2001: Initial sequence of the human genome
    2002: Initial sequence of the mouse genome
    2004: Initial sequence of the rat genome
    Our story begins somewhere
    around here!

  5. Why care about comparative genomics?

  6. https://twitter.com/lpachter/status/526904556261625857

  7. Coding regions (genes) – deeply
    conserved across evolution,
    ~1.5% of the human genome
    Regulatory regions – much less
    conserved, 5-10% of the human
    genome

  8. View Slide

  9. Preservation of functional sequences
    (Miller et al. Annu. Rev. Genomics Hum. Genet. 2004)
    Time

  10. Whole genome scale alignments can potentially help us to
    understand biological function
    What is aligned to what and does it overlap with anything
    interesting?
    Can we see specific signals in alignments that inform us
    about specific functions?
    Answering these questions requires computational
    approaches
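
The question "does it overlap with anything interesting?" comes down to interval intersection on genome coordinates. A minimal Python sketch, assuming half-open (start, end) coordinates on a single chromosome; the coordinate values and the function name `overlaps` are made up for illustration and do not come from the talk or from any Galaxy tool:

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def overlaps(aligned: List[Interval], features: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Return every (aligned block, feature) pair that shares at least one base."""
    hits = []
    for a_start, a_end in aligned:
        for f_start, f_end in features:
            # Two half-open intervals overlap when each starts before the other ends.
            if a_start < f_end and f_start < a_end:
                hits.append(((a_start, a_end), (f_start, f_end)))
    return hits


# Hypothetical alignment blocks and annotated features.
aligned_blocks = [(100, 200), (500, 650), (900, 1000)]
annotated_features = [(150, 180), (640, 700), (2000, 2100)]

for block, feature in overlaps(aligned_blocks, annotated_features):
    print(block, "overlaps", feature)
```

The nested loop is quadratic in the number of intervals. Whole-genome data would call for sorted intervals or an index, but the overlap test itself stays the same.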

  11. Can we make it easier and more efficient for
    experimental and computational
    researchers to collaborate?

  12. View Slide

  13. GALA enabled querying of annotation information from the human genome, alongside
    alignments with the mouse genome, was integrated with the UCSC browser, and allowed
    building up set queries from the results of previous queries

  14. To enable collaboration, can we make it
    easy for computational researchers to
    integrate new tools, and for
    experimental researchers to use them?

  15. View Slide

  16. View Slide

  17. View Slide

  18. 2006 Galaxy
    Tools
    Generated Web UI
    Analysis
    History

  19. And then everything changed… again.
    Illumina NovaSeq 6000
    20 Billion 300bp DNA
    fragments per run
    ~ 6 Terabytes
    Every 2 days…
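
The figures on this slide are consistent with each other, as a quick calculation shows. This sketch assumes one byte per base and ignores quality scores, read identifiers, and compression, none of which the slide specifies:

```python
fragments_per_run = 20_000_000_000  # 20 billion fragments, from the slide
bases_per_fragment = 300            # 300 bp per fragment, from the slide
bytes_per_base = 1                  # assumption: one byte per base, no quality scores

bases_per_run = fragments_per_run * bases_per_fragment
terabytes_per_run = bases_per_run * bytes_per_base / 1e12  # decimal terabytes

print(f"{bases_per_run:.1e} bases per run")
print(f"{terabytes_per_run:.0f} TB per run")
print(f"{terabytes_per_run / 2:.0f} TB per day at one run every 2 days")
```

Under these assumptions a single instrument produces terabytes of sequence per day, which is the scale the following slides respond to.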

  20. And then everything changed… again.

  21. Sequencing is widely available… (http://omicsmaps.com)

  22. ...practically free...
    (https://www.genome.gov/27541954/dna-sequencing-costs-data/)
    Cost Per Human Genome ($)

  23. ...and applicable across (nearly) all of Biology!
    - How is the production of the right protein at the right time controlled?
    - How are cells organized in 3D?
    - How are cell types decided in development?
    - How are different species related?
    - What genome variants lead to different phenotypes or disease risk?

  24. Modern biology has rapidly transformed into a data
    intensive discipline
    - Large scale data acquisition has become easy, e.g. high-throughput
    sequencing and imaging
    - Experiments are increasingly complex
    - Making sense of results often requires mining and making connections across
    multiple databases
    - Nearly all high-profile research involves some quantitative methods
    How does this affect traditional research practices
    and outputs?

  25. Data Pipeline, inspired by Leek and Peng, Nature 2015
    Stages: Idea, Experiment, Raw Data, Tidy Data, Summarized data, Results
    Steps between stages: Experimental design, Data collection, Data cleaning, Data analysis, Inference
    The part we are considering here
    The part that ends up in the Publication

  26. Three major concerns
    Accessibility: Making use of large-scale data requires complex computational
    resources and methods. Can all researchers access these approaches? How can
    we make these methods available to everyone?
    Transparency: Is it possible to communicate analyses and results in ways that are
    both easy to understand and provide all of the essential details?
    Reproducibility: Can analyses be precisely reproduced, to facilitate rigorous
    validation and peer review, and ease reuse?

  27. View Slide

  28. Galaxy: accessible analysis system

  29. Describe analysis tool
    behavior abstractly

  30. Describe analysis tool
    behavior abstractly
    Analysis environment automatically and
    transparently tracks details

  31. Describe analysis tool
    behavior abstractly
    Analysis environment automatically and
    transparently tracks details
    Workflow system for complex analysis,
    constructed explicitly or automatically

  32. Describe analysis tool
    behavior abstractly
    Pervasive sharing, and publication
    of documents with integrated analysis
    Analysis environment automatically and
    transparently tracks details
    Workflow system for complex analysis,
    constructed explicitly or automatically

  33. Visualization and visual analytics

  34. Galaxy IEs: containerized apps, rapidly move between analysis modes

  35. Galaxy is available as...
    A free (for everyone) web service integrating a wealth of tools, compute
    resources, terabytes of reference data and permanent storage
    Open source software that makes integrating your own tools and data and
    customizing for your own site simple
    An open extensible platform for sharing tools, datatypes, workflows, ...

  36. usegalaxy.org
    A free science gateway for the genomics
    research community

  37. usegalaxy.org
    - We provided Galaxy as a free public website from the very beginning
    - Fortunately nobody knew about it at first, and in 2005 the data wasn’t all that big anyway
    - However, the demand for easy-to-use tools in the research community was
    even greater than we anticipated… and we didn’t have much funding
    - For eight years Galaxy was run largely on surplus hardware decommissioned
    by other groups, borrowed storage, whatever we could find

  38. The great flood of 2012

  39. The great flood of 2012
    Your data here

  40. ...In which
    Save main
    , , and
    ,

  41. A nationally distributed service:
    The Galaxy / XSEDE Gateway

  42. Stats for Galaxy Main (usegalaxy.org) in May 2018
    125,000 registered users
    2 PB user data
    19M jobs run
    100 training events (2017 & 2018)

  43. PSC, Pittsburgh
    Stampede
    ● 462,462 cores
    ● 205 TB memory
    Bridges
    Dedicated resources Shared XSEDE resources
    TACC
    Austin
    Galaxy Cluster (Rodeo)
    ● 256 cores
    ● 2 TB memory
    Corral/Stockyard
    ● 20 PB disk
    PTI IU Bloomington
    (Nate Coraor)

  44. usegalaxy.org Compute Architecture (June 2018)
    [Architecture diagram. Labels: SmartOS (PSU); Bare metal cluster (TACC); VMWare (TACC);
    Stampede2 (TACC); Bridges (PSC); Jetstream (TACC); Jetstream (IU); Corral (TACC),
    2.3 PB dataset storage; Pulsar/AMQP; Pulsar/HTTP; Slurm; slurm/rabbitmq; Swarm;
    PostgreSQL; NFS; CVMFS; cvmfs stratum0; cvmfs stratum1; web; jobs; db;
    roundup49 ... roundup64; slurm instances; swarm instances]
    (Nate Coraor)

  45. This approach provides both scalability and
    flexibility
    - A set of dedicated compute resources (deployed on TACC’s internal cloud)
    provide basic services and first line job execution
    - The bulk of Galaxy jobs run on Jetstream, an OpenStack cloud which allows
    us to leverage elasticity to efficiently adjust to changing user demands
    - Unique resources like Bridges and Stampede2 allow us to serve jobs that
    have extremely large memory demands (e.g. genome and transcriptome
    assembly), or are highly parallel with long runtimes (e.g. large-scale read
    mapping jobs)
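
The routing described above can be sketched as a small decision function. This is an illustration only: the thresholds, the `Job` fields, the tool names, and the function `choose_destination` are hypothetical, and the sketch does not reproduce the actual usegalaxy.org job configuration. Only the resource names come from the slide, and the pairing of Bridges with large-memory jobs and Stampede2 with highly parallel jobs is an assumption read from the order of the examples.

```python
from dataclasses import dataclass


@dataclass
class Job:
    tool: str
    memory_gb: int    # requested memory
    cores: int        # requested cores
    est_hours: float  # estimated runtime


# Hypothetical thresholds, chosen only to make the example concrete.
LARGE_MEMORY_GB = 256
HIGHLY_PARALLEL_CORES = 64
LONG_RUNTIME_HOURS = 24.0


def choose_destination(job: Job) -> str:
    """Pick a compute resource following the three tiers listed on the slide."""
    if job.memory_gb >= LARGE_MEMORY_GB:
        return "bridges"      # assumed: extremely large memory, e.g. assembly
    if job.cores >= HIGHLY_PARALLEL_CORES and job.est_hours >= LONG_RUNTIME_HOURS:
        return "stampede2"    # assumed: highly parallel with long runtimes
    if job.cores <= 1 and job.est_hours < 0.1:
        return "dedicated"    # assumed: first line execution for small jobs
    return "jetstream"        # elastic cloud takes the bulk of jobs


jobs = [
    Job("assembler", 512, 32, 48.0),
    Job("read_mapper", 64, 128, 36.0),
    Job("aligner", 16, 8, 2.0),
    Job("text_cut", 1, 1, 0.01),
]
for job in jobs:
    print(job.tool, "->", choose_destination(job))
```

Checking the rare, expensive cases first keeps the default path simple: anything that does not need a unique resource falls through to the elastic cloud.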

  46. Initial move to XSEDE resources
    (Enis Afgan)

  47. Not just more jobs, different types of jobs
    Can now run larger jobs, and more
    jobs:
    325,000 jobs run
    on behalf of 12,000 users
    Can run new types of jobs:
    Galaxy Interactive
    Environments: Jupyter, RStudio
    (Enis Afgan)

  48. Growing Community

  49. 2010: Galaxy Developer Conference

  50. View Slide

  51. - Galaxy makes it easy to integrate new tools
    - The Galaxy Toolshed (2011) makes it easy to share those tools
    - However, new tools are published far faster than we can integrate them
    - We needed help if this was going to scale at all!

  52. Intergalactic Utilities Commission

  53. ● Maintains a set of high quality
    Galaxy tools in the GitHub
    repository. This repo serves as
    an excellent example and
    inspiration to all Galaxy tool
    developers.
    ● Cultivates and shares the
    Galaxy tool development best
    practices document.
    ● Provides support to tool
    developers on a public Gitter
    channel.

  54. The IUC made the Galaxy tool
    ecosystem vastly more sustainable, can
    we do the same for Galaxy core?

  55. 2015: CONTRIBUTING.md
    - In 2015 we established an official
    open governance policy for core
    Galaxy code
    - We established the committers group,
    consisting of experienced Galaxy
    developers with the responsibility of
    managing contributions, as well as
    adding additional committers
    - All committers have equal power – we
    gave up control over the code in order
    to share ownership with the
    community!

  56. View Slide

  57. What about training?

  58. View Slide

  59. View Slide

  60. View Slide

  61. View Slide

  62. What about the Gateway itself?

  63. An internationally distributed service:
    usegalaxy.✱
    usegalaxy.org usegalaxy.org.au usegalaxy.eu

  64. View Slide

  65. CVMFS server distribution
    Legend: ● Stratum 0 servers, ● Stratum 1 servers
    XSEDE & CyVerse, TACC, Austin: cvmfs0-tacc0 (test.galaxyproject.org, main.galaxyproject.org), cvmfs1-tacc0
    Penn State: cvmfs0-psu0 (singularity.galaxyproject.org, data.galaxyproject.org), cvmfs1-psu0
    XSEDE, Indiana University: cvmfs1-iu0
    EU JRC, Ispra: galaxy.jrc.ec.europa.eu
    de.NBI, RZ Freiburg: cvmfs1-ufr0.usegalaxy.eu
    Galaxy Australia, Melbourne: cvmfs1-mel0.gvl.org.au

  66. Achieving usegalaxy.✱ coherence
    - Common reference and index data
      - These are already distributed by CVMFS, but organized in an ad hoc manner due to the history
        of Galaxy
      - Currently building an automated approach where metadata defining the complete set of
        reference and index data will live in GitHub, builds will be automated based on GitHub state,
        and successful builds deployed through CVMFS for replication to all sites
      - Intergalactic Data Commission: https://github.com/usegalaxy-eu/idc
    - Common tools
      - A common set of tools and a common tool menu organization are currently being defined.
        Tools and tool configuration will also be replicated through CVMFS
      - This will ensure both that users will have the same user experience across different
        usegalaxy.✱ instances, and that workflows can be moved between instances and still execute
        correctly and reproducibly
      - Local custom tools will still be supported but clearly identified

  67. View Slide

  68. View Slide

  69. Challenges for human genomic (+) data sharing
    The value of data is greatly increased by integration across datasets
    - e.g. in human genomics, power to detect relationships between individual
    variants and disease depends on the number of individuals measured
    Moving/copying data is wasteful: transfer costs, redundant storage costs
    Human genomic data comes with privacy concerns, need to ensure security and
    detect threats

  70. AnVIL
    The NHGRI Genomic Data Science Analysis,
    Visualization, and Informatics Lab-Space

  71. AnVIL: Inverting the model of genomic data sharing
    Traditional: Bring data to the researcher
    - Copying/moving data is costly
    - Harder to enforce security
    - Redundant infrastructure
    - Siloed compute
    Goal: Bring researcher to the data
    - Reduced redundancy and costs
    - Active threat detection and auditing
    - Greater accessibility
    - Elastic, shared, compute

  72. What is the AnVIL?
    - Scalable and interoperable resource for the genomic scientific community
    - Cloud-based infrastructure
    - Shared analysis and computing environment
    - Support genomic data access, sharing and computing across large genomic,
    and genomic related, data sets
    - Genomic datasets, phenotypes and metadata
    - Large datasets generated by NHGRI programs, as well as other initiatives / agencies
    - Data access controls and data security
    - Collaborative environment for datasets and analysis workflows
    - ...for both users with limited computational expertise and sophisticated data scientist users

  73. Goals of the AnVIL
    1. Create open source software
    Storage, scalable analytics, data visualization
    2. Organize and host key NHGRI datasets
    CCDG, CMG, eMERGE, and more
    3. Operate services for the world
    Security, training & outreach, new models of data access

  74. AnVIL / Terra: analysis
    workspaces and batch workflows
    AnVIL / Gen3: Data models,
    indexing, querying
    AnVIL / Dockstore: sharing
    containerized tools and workflows
    AnVIL / Analysis Environments: Jupyter
    Notebooks, RStudio, Galaxy, ...

  75. AnVIL / Terra: analysis workspaces
    and batch workflows
    AnVIL / Gen3: Data models,
    indexing, querying
    AnVIL / Analysis Environments: Jupyter
    Notebooks, RStudio, Galaxy, ...
    FISMA Moderate
    2 ATOs
    Pursuing FedRAMP
    All data use and analysis in a FISMA moderate environment
    Implemented on
    Primary data storage costs covered by AnVIL, user private
    data and compute billed directly through Google

  76. Proposed system architecture
    [Architecture diagram. Labels: AnVIL portal; Launch; Leo; Start Galaxy; Start RStudio;
    Scale; Kubernetes + Helm; CloudMan Galaxy; RStudio / Bioconductor; API Persistence;
    Workspace Persistence; CVMFS; One instance per user]

  77. Isolated Galaxy instances with a single interface
    Galaxy Multiplexer
    Users: User 1, User 2, Anonymous User
    Security Boundary
    User 1 Isolated Resources: User Data and DB, User 1 Galaxy Instance, User Compute Containers
    User 2 Isolated Resources: User Data and DB, User 2 Galaxy Instance, User Compute Containers
    Anonymous User: Unprivileged Galaxy Instance
    Shared DB (No protected Data)

  78. Future k8s Remote Execution Data Flow
    [Sequence diagram over time. Components: Galaxy; Kubernetes; Job Pod; Executor Container;
    BioContainer; Data Storage Volume; NFS]
    new job:
      inputs:
        - dataset 1
        - dataset 2
      outputs:
        - dataset 3
      tool: HISAT2
    Labeled messages: create job; execute job; get datasets 1, 2; execute job; compute; job complete
    Legend: control message, data movement
    @jmchilton
    @natefoo
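
The job on this slide (datasets 1 and 2 in, dataset 3 out) follows a stage in, compute, stage out pattern. The sketch below is a toy model under loud assumptions: local temporary directories stand in for Galaxy's shared storage and the job pod's volume, a string-joining function stands in for the tool, and none of the names (`run_remote_job`, `galaxy_store`, `pod_volume`) come from Galaxy, Pulsar, or Kubernetes.

```python
import shutil
import tempfile
from pathlib import Path


def run_remote_job(galaxy_store: Path, inputs: list, output: str, tool) -> Path:
    """Copy inputs to a job-local volume, run the tool there, copy the output back."""
    with tempfile.TemporaryDirectory() as tmp:
        pod_volume = Path(tmp)  # stands in for the job pod's data storage volume
        # Stage in: data movement from shared storage to the job volume.
        for name in inputs:
            shutil.copy(galaxy_store / name, pod_volume / name)
        # Compute: the tool only sees files on the job volume.
        result = tool([(pod_volume / name).read_text() for name in inputs])
        (pod_volume / output).write_text(result)
        # Stage out: data movement of the output back to shared storage.
        shutil.copy(pod_volume / output, galaxy_store / output)
    return galaxy_store / output


# Toy run: two input datasets, one output dataset, and a stand-in "tool".
store = Path(tempfile.mkdtemp())
(store / "dataset1").write_text("ACGT")
(store / "dataset2").write_text("TTGA")
out = run_remote_job(store, ["dataset1", "dataset2"], "dataset3",
                     tool=lambda texts: "+".join(texts))
print(out.read_text())
```

Keeping the tool's view limited to the job volume is what lets the compute step run anywhere: only the two copy steps need access to shared storage.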

  79. Challenges for (health) science gateways
    - Human genomic, health, and other protected data will only be available from
    a small set of analysis platforms
    - For the foreseeable future this is motivated by policy, compliance, and political questions
    rather than technical concerns
    - Moving data requires meeting substantial compliance requirements
    - Making gateway software more modular and flexible, along with standards
    for deployment, can mitigate this
    - Kubernetes could be a lowest common denominator, but more standardization is needed
    - We need to renew emphasis on interoperability at the platform, tool, and
    workflow level

  80. ACK

  81. Acknowledgements: Galaxy Contributors
    - Core Code: contributors to galaxyproject/galaxy:
    - ~315 (~39 new since last year)
    - Tools: contributors to galaxyproject/tools-iuc:
    - ~195 (~38 new since last year)
    - ...and the ever vigilant Intergalactic Utilities Commission for handling these contributions and
    maintaining the quality of essential Galaxy tools
    - ...and everyone else who has contributed a tool to the ToolShed
    - Training: contributors to galaxyproject/training-material
    - ~140 (~34 new since last year)
    - ...and everyone who has conducted or attended Galaxy Training
    - Everyone who has contributed to Galaxy in other ways:
    - users, supporters, …
    - Funding: NSF and NIH (to our team), and all of the funders of the Global Galaxy Community

  82. Acknowledgements
    Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave
    Clements, Nate Coraor, Jeremy Goecks, Sergey Golitsynskiy, Qiang Gu, Juleen Graham, Björn Grüning,
    Sam Guerler, Mo Heydarian, Will Holden, Jennifer Hillman-Jackson, Vahid Jalili, Delphine Lariviere,
    Alexandru Mahmoud, Anton Nekrutenko, Alex Ostrovsky, Helena Rasche, Luke Sargent, Nicola Soranzo,
    Marius van den Beek
    The rest of the Taylor Lab at JHU: Boris Brenerman, Min Hyung Cho, Peter DeFord, Max Echterling,
    Nathan Roach, Michael E. G. Sauria, German Uritskiy
    Funding: NHGRI U41 HG006620 (Galaxy), NHGRI U24 HG010263 (AnVIL), NCI U24 CA231877 (Galaxy
    Federation), NSF DBI 0543285 and DBI 0850103 (Galaxy on US cyberinfrastructure)
    +Collaborators: Dave Hancock and the Jetstream group, Ross Hardison and the VISION group, Victor
    Corces, Karen Reddy, Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology), Battle, Goff, Langmead,
    Leek, Schatz, Timp labs (JHU Genomics)

  83. Mo Heydarian Dave Clements

  84. Acknowledgements: AnVIL Team
    Broad Institute
    Anthony Philippakis, Daniel MacArthur, Alex Bauman, Adrian
    Sharma, Andrew Rula, Dave Bernick, Jonathan Lawson,
    Kristian Cibulskis, Namrata Gupta, Rob Title, Eric Banks, Rich
    Silva
    University of Chicago
    Robert Grossman, Abby George, Garrett Rupp, Zac Flamig
    University of California Santa Cruz
    Benedict Paten, Denis Yuen, Brian O’Connor, Charles Overbeck,
    Kevin Osborn, Louise Cabansay, Natalie Perez, Stefan Kuhn, Walt
    Shands
    Vanderbilt
    Robert Carroll, Lakhan Swamy, Kristin Wuichet
    Washington University
    Ira Hall, Adam Coffman, Allison Reieir, Haley Abel, Jason Walker
    Johns Hopkins
    James Taylor, Jeff Leek, Kasper Hansen, Enis Afgan, Alexandru
    Mahmoud, Sergey Golitsynskiy, Jenn Vessio, John Muschelli, Mo
    Heydarian
    Penn State University
    Anton Nekrutenko, John Chilton, Nate Coraor, Martin Čech
    Oregon Health & Science University
    Jeremy Goecks, Kyle Ellrott, Brian Walsh, Luke Sargent, Vahid Jalili
    Roswell Park Cancer Institute
    Martin Morgan, Nitesh Turaga
    Harvard
    Vincent Carey, BJ Stubbs, Shweta Gopaulakrishnan
    City University of New York
    Levi Waldron, Sehyun Oh, Ludwig Geistlinger

  85. (fin)

  86. You’ve gone too far!

  87. (seriously stop)

  88. Colors
    We used (nearly) the “Paired” colormap for the grant figures

  89. Template
