Galaxy... from genomic data science gateway to global community

James Taylor
September 24, 2019

Keynote on the origins of Galaxy (https://galaxyproject.org) and the Galaxy community, presented at Gateways 2019, the annual meeting hosted by the Science Gateways Community Institute (https://sciencegateways.org/).

Transcript

  1. ...from genomic data science gateway to global community James Taylor

    (@jxtx), Johns Hopkins, http://speakerdeck.com/jxtx
  2. 1. Science 2. Gateways 3. Community

  3. Mammalian comparative genomics — the beginning 2001: Initial sequence of

    the human genome 2002: Initial sequence of the mouse genome 2004: Initial sequence of the rat genome
  4. Mammalian comparative genomics — the beginning 2001: Initial sequence of

    the human genome 2002: Initial sequence of the mouse genome 2004: Initial sequence of the rat genome Our story begins somewhere around here!
  5. Why care about comparative genomics?

  6. https://twitter.com/lpachter/status/526904556261625857

  7. Coding regions (genes) – deeply conserved across evolution, ~1.5% of

    the human genome Regulatory regions – much less conserved, 5-10% of the human genome
  8. None
  9. Preservation of functional sequences (Miller et al. Annu. Rev. Genomics

    Hum. Genet. 2004) Time
  10. Whole genome scale alignments can potentially help us to understand

    biological function What is aligned to what and does it overlap with anything interesting? Can we see specific signals in alignments that inform us about specific functions? Answering these questions requires computational approaches
  11. Can we make it easier and more efficient for experimental

    and computational researchers to collaborate?
  12. None
  13. GALA enabled querying of annotation information from the human genome, alongside

    alignments with the mouse genome, integrated with the UCSC browser, and allowed building up set queries using the results of previous queries
  14. To enable collaboration, can we make it easy for computational

    researchers to integrate new tools, and for experimental researchers to use them?
  15. None
  16. None
  17. None
  18. 2006 Galaxy Tools Generated Web UI Analysis History

  19. And then everything changed… again. Illumina NovaSeq 6000 20 Billion

    300bp DNA fragments per run ~ 6 Terabytes Every 2 days…
  20. And then everything changed… again.

  21. Sequencing is widely available… (http://omicsmaps.com)

  22. ...practically free... (https://www.genome.gov/27541954/dna-sequencing-costs-data/) Cost Per Human Genome ($)

  23. ...and applicable across (nearly) all of Biology! - How is

    the production of the right protein at the right time controlled? - How are cells organized in 3D? - How are cell types decided in development? - How are different species related? - What genome variants lead to different phenotypes or disease risk?
  24. Modern biology has rapidly transformed into a data intensive discipline

    - Large scale data acquisition has become easy, e.g. high-throughput sequencing and imaging - Experiments are increasingly complex - Making sense of results often requires mining and making connections across multiple databases - Nearly all high-profile research involves some quantitative methods How does this affect traditional research practices and outputs?
  25. Idea Experiment Raw Data Tidy Data Summarized data Results Experimental

    design Data collection Data cleaning Data analysis Inference Data Pipeline, inspired by Leek and Peng, Nature 2015 The part we are considering here The part that ends up in the Publication
  26. Three major concerns Accessibility: Making use of large-scale data requires

    complex computational resources and methods. Can all researchers access these approaches? How can we make these methods available to everyone? Transparency: Is it possible to communicate analyses and results in ways that are both easy to understand and provide all of the essential details? Reproducibility: Can analyses be precisely reproduced, to facilitate rigorous validation and peer review, and ease reuse?
  27. None
  28. Galaxy: accessible analysis system

  29. Describe analysis tool behavior abstractly

  30. Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details
  31. Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details Workflow system for complex analysis, constructed explicitly or automatically
  32. Describe analysis tool behavior abstractly Pervasive sharing, and publication of

    documents with integrated analysis Analysis environment automatically and transparently tracks details Workflow system for complex analysis, constructed explicitly or automatically
  33. Visualization and visual analytics

  34. Galaxy IEs: containerized apps, rapidly move between analysis modes

  35. Galaxy is available as... A free (for everyone) web service

    integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage Open source software that makes integrating your own tools and data and customizing for your own site simple An open extensible platform for sharing tools, datatypes, workflows, ...
  36. usegalaxy.org A free science gateway for the genomics research community

  37. usegalaxy.org - We provided Galaxy as a free public website

    from the very beginning - Fortunately nobody knew about it at first, and in 2005 the data wasn’t all that big anyway - However, the demand for easy-to-use tools in the research community was even more than we anticipated… and we didn’t have much funding - For eight years Galaxy was run largely on surplus hardware decommissioned by other groups, borrowed storage, whatever we could find
  38. The great flood of 2012

  39. The great flood of 2012 Your data here

  40. ...In which Save main , , and ,

  41. A nationally distributed service: The Galaxy / XSEDE Gateway

  42. 125,000 registered users 2PB user data 19M jobs run 100

    training events (2017 & 2018) Stats for Galaxy Main (usegalaxy.org) in May 2018
  43. PSC, Pittsburgh Stampede • 462,462 cores • 205 TB memory

    Bridges Dedicated resources Shared XSEDE resources TACC Austin Galaxy Cluster (Rodeo) • 256 cores • 2 TB memory Corral/Stockyard • 20 PB disk PTI IU Bloomington (Nate Coraor)
  44. SmartOS (PSU) Bare metal cluster (TACC) VMWare (TACC) Stampede2 (TACC)

    pulsar Bridges (PSC) Pulsar/AMQP Pulsar/HTTP Slurm PostgreSQL usegalaxy.org Compute Architecture (June 2018) NFS Jetstream (TACC) Jetstream (IU) Swarm db CVMFS slurm/rabbitmq roundup64 ... roundup49 cvmfs stratum0 cvmfs stratum0 jobs jobs web web swarm instance swarm instance swarm instance swarm instance slurm/pulsar/ swarm cvmfs stratum1 slurm instance slurm instance slurm instance slurm instance Corral (TACC) 2.3 PB dataset storage pulsar cvmfs stratum1 slurm/pulsar /swarm slurm instance instance instance instance cvmfs stratum1/swarm (Nate Coraor)
  45. This approach provides both scalability and flexibility - A set

    of dedicated compute resources (deployed on TACC’s internal cloud) provide basic services and first line job execution - The bulk of Galaxy jobs run on Jetstream, an OpenStack cloud which allows us to leverage elasticity to efficiently adjust to changing user demands - Unique resources like Bridges and Stampede2 allow us to serve jobs that have extremely large memory demands (e.g. genome and transcriptome assembly), or are highly parallel with long runtimes (e.g. large-scale read mapping jobs)
  46. Initial move to XSEDE resources (Enis Afgan)

  47. Not just more jobs, different types of jobs Can now

    run larger jobs, and more jobs: 325,000 jobs run on behalf of 12,000 users Can run new types of jobs: Galaxy Interactive Environments: Jupyter, RStudio (Enis Afgan)
  48. Growing Community

  49. 2010: Galaxy Developer Conference

  50. None
  51. - Galaxy makes it easy to integrate new tools -

    The Galaxy Toolshed (2011) makes it easy to share those tools - However, new tools are published far faster than we can integrate them - We needed help if this was going to scale at all!
  52. Intergalactic Utilities Commission

  53. • Maintains a set of high quality Galaxy tools in

    the GitHub repository. This repo serves as an excellent example and inspiration to all Galaxy tool developers. • Cultivates and shares the Galaxy tool development best practices document. • Provides support to tool developers on a public Gitter channel.
  54. The IUC made the Galaxy tool ecosystem vastly more sustainable,

    can we do the same for Galaxy core?
  55. 2015: CONTRIBUTING.md - In 2015 we established an official open

    governance policy for core Galaxy code - We established the committers group, consisting of experienced Galaxy developers with the responsibility of managing contributions, as well as adding additional committers - All committers have equal power – we gave up control over the code in order to share ownership with the community!
  56. None
  57. What about training?

  58. None
  59. None
  60. None
  61. None
  62. What about the Gateway itself?

  63. An internationally distributed service: usegalaxy.✱ usegalaxy.org usegalaxy.org.au usegalaxy.eu

  64. None
  65. XSEDE, Indiana University XSEDE & CyVerse, TACC, Austin EU JRC,

    Ispra Penn State cvmfs0-tacc0 • test.galaxyproject.org • main.galaxyproject.org cvmfs1-tacc0 cvmfs1-iu0 • Stratum 0 servers • Stratum 1 servers galaxy.jrc.ec.europa.eu de.NBI, RZ Freiburg cvmfs0-psu0 • singularity.galaxyproject.org • data.galaxyproject.org cvmfs1-psu0 cvmfs1-ufr0.usegalaxy.eu CVMFS server distribution Galaxy Australia, Melbourne cvmfs1-mel0.gvl.org.au
  66. Achieving usegalaxy.✱ coherence - Common reference and index data -

    These are already distributed by CVMFS, but organized in an ad hoc manner due to the history of Galaxy - Currently building an automated approach where metadata defining the complete set of reference and index data will live in GitHub, builds will be automated based on GitHub state, and successful builds deployed through CVMFS for replication to all sites - Intergalactic Data Commission: https://github.com/usegalaxy-eu/idc - Common tools - A common set of tools and a common tool menu organization are currently being defined. Tools and tool configuration will also be replicated through CVMFS - This will ensure both that users will have the same user experience across different usegalaxy.✱ instances, and that workflows can be moved between instances and still execute correctly and reproducibly - Local custom tools will still be supported but clearly identified
  67. None
  68. None
  69. Challenges for human genomic (+) data sharing The value of

    data is greatly increased by integration across datasets - e.g. in human genomics, power to detect relationships between individual variants and disease depends on the number of individuals measured Moving/copying data is wasteful: transfer costs, redundant storage costs Human genomic data comes with privacy concerns, need to ensure security and detect threats
  70. AnVIL The NHGRI Genomic Data Science Analysis, Visualization, and Informatics

    Lab-Space
  71. AnVIL: Inverting the model of genomic data sharing Traditional: Bring

    data to the researcher - Copying/moving data is costly - Harder to enforce security - Redundant infrastructure - Siloed compute Goal: Bring researcher to the data - Reduced redundancy and costs - Active threat detection and auditing - Greater accessibility - Elastic, shared, compute
  72. What is the AnVIL? - Scalable and interoperable resource for

    the genomic scientific community - Cloud-based infrastructure - Shared analysis and computing environment - Support genomic data access, sharing and computing across large genomic, and genomic related, data sets - Genomic datasets, phenotypes and metadata - Large datasets generated by NHGRI programs, as well as other initiatives / agencies - Data access controls and data security - Collaborative environment for datasets and analysis workflows - ...for both users with limited computational expertise and sophisticated data scientist users
  73. Goals of the AnVIL 1. Create open source software Storage,

    scalable analytics, data visualization 2. Organize and host key NHGRI datasets CCDG, CMG, eMERGE, and more 3. Operate services for the world Security, training & outreach, new models of data access
  74. AnVIL / Terra: analysis workspaces and batch workflows AnVIL /

    Gen3: Data models, indexing, querying AnVIL / Dockstore: sharing containerized tools and workflows AnVIL / Analysis Environments: Jupyter Notebooks, RStudio, Galaxy, ...
  75. AnVIL / Terra: analysis workspaces and batch workflows AnVIL /

    Gen3: Data models, indexing, querying AnVIL / Analysis Environments: Jupyter Notebooks, RStudio, Galaxy, ... FISMA Moderate 2 ATOs Pursuing FedRAMP All data use and analysis in a FISMA moderate environment Implemented on Primary data storage costs covered by AnVIL, user private data and compute billed directly through Google
  76. Scale Start Kubernetes + Helm Kubernetes + Helm Proposed system

    architecture Leo Kubernetes + Helm CloudMan Galaxy RStudio / Bioconductor ... API Persistence Workspace Persistence Launch AnVIL portal Start Galaxy Start RStudio One instance per user CVMFS
  77. Security Boundary User 1 Isolated Resources User Data and DB

    User 1 Galaxy Instance User Compute Containers Shared DB (No protected Data) User 2 Isolated Resources User Data and DB User 2 Galaxy Instance User Compute Containers Anonymous User Unprivileged Galaxy Instance User 1 User 2 Galaxy Multiplexer Isolated Galaxy instances with a single interface
  78. Kubernetes Job Pod Galaxy new job: inputs: - dataset 1

    - dataset 2 outputs: - dataset 3 tool: HISAT2 create job Data Storage Volume execute job get datasets 1, 2 execute job 3 job complete 1 2 1 2 3 compute Time Future k8s Remote Execution Data Flow NFS 3 1 2 control message data movement BioContainer Executor Container @jmchilton @natefoo
  79. Challenges for (health) science gateways - Human genomic, health, and

    other protected data will only be available from a small set of analysis platforms - For the foreseeable future this is motivated by policy, compliance, and political questions rather than technical concerns - Moving data requires meeting substantial compliance requirements - Making gateway software more modular and flexible, along with standards for deployment can mitigate this - Kubernetes could be a lowest common denominator, but more standardization is needed - We need to renew emphasis on interoperability at the platform, tool, and workflow level
  80. ACK

  81. Acknowledgements: Galaxy Contributors - Core Code: contributors to galaxyproject/galaxy: -

    ~315 (~39 new since last year) - Tools: contributors to galaxyproject/tools-iuc: - ~195 (~38 new since last year) - ...and the ever vigilant Intergalactic Utilities Commission for handling these contributions and maintaining the quality of essential Galaxy tools - ...and everyone else who has contributed a tool to the ToolShed - Training: contributors to galaxyproject/training-material - ~140 (~34 new since last year) - ...and everyone who has conducted or attended Galaxy Training - Everyone who has contributed to Galaxy in other ways: - users, supporters, … - Funding: NSF and NIH (to our team), and all of the funders of the Global Galaxy Community
  82. Acknowledgements Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier,

    Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy Goecks, Sergey Golitsynskiy, Qiang Gu, Juleen Graham, Björn Grüning, Sam Guerler, Mo Heydarian, Will Holden, Jennifer Hillman-Jackson, Vahid Jalili, Delphine Lariviere, Alexandru Mahmoud, Anton Nekrutenko, Alex Ostrovsky, Helena Rasche, Luke Sargent, Nicola Soranzo, Marius van den Beek The rest of the Taylor Lab at JHU: Boris Brenerman, Min Hyung Cho, Peter DeFord, Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy Funding: NHGRI U41 HG006620 (Galaxy), NHGRI U24 HG010263 (AnVIL), NCI U24 CA231877 (Galaxy Federation), NSF DBI 0543285 and DBI 0850103 (Galaxy on US cyberinfrastructure) +Collaborators: Dave Hancock and the Jetstream group, Ross Hardison and the VISION group, Victor Corces, Karen Reddy, Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology), Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
  83. Mo Heydarian Dave Clements

  84. Broad Institute Anthony Philippakis, Daniel MacArthur, Alex Bauman, Adrian Sharma,

    Andrew Rula, Dave Bernick, Jonathan Lawson, Kristian Cibulskis, Namrata Gupta, Rob Title, Eric Banks, Rich Silva University of Chicago Robert Grossman, Abby George, Garrett Rupp, Zac Flamig University of California Santa Cruz Benedict Paten, Denis Yuen, Brian O’Connor, Charles Overbeck, Kevin Osborn, Louise Cabansay, Natalie Perez, Stefan Kuhn, Walt Shands Vanderbilt Robert Carroll, Lakhan Swamy, Kristin Wuichet Washington University Ira Hall, Adam Coffman, Allison Reieir, Haley Abel, Jason Walker Johns Hopkins James Taylor, Jeff Leek, Kasper Hansen, Enis Afgan, Alexandru Mahmoud, Sergey Golitsynskiy, Jenn Vessio, John Muschelli, Mo Heydarian Penn State University Anton Nekrutenko, John Chilton, Nate Coraor, Martin Čech Oregon Health & Sciences University Jeremy Goecks, Kyle Ellrott, Brian Walsh, Luke Sargent, Vahid Jalili Roswell Park Cancer Institute Martin Morgan, Nitesh Turaga Harvard Vincent Carey, BJ Stubbs, Shweta Gopaulakrishnan City University of New York Levi Waldron, Sehyun Oh, Ludwig Geistlinger Acknowledgements: AnVIL Team
  85. (fin)

  86. You’ve gone too far!

  87. (seriously stop)

  88. Colors We used (nearly) the “Paired” colormap for the grant

    figures
  89. Template
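
Editor's sketch for slide 45: the slide describes three tiers of compute (dedicated resources for first-line execution, Jetstream for the bulk of jobs, and Bridges and Stampede2 for very large memory or highly parallel, long-running jobs). The Python below is only a rough illustration of that kind of routing decision. It is not Galaxy's job configuration or API. The destination names, the `Job` fields, the thresholds, and the choice of which unique resource takes which job type are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical destination labels, loosely named after the resources on slide 45.
DEDICATED = "dedicated_cluster"      # first-line job execution
JETSTREAM = "jetstream_cloud"        # elastic cloud for the bulk of jobs
BRIDGES = "bridges_large_memory"     # assumed here to take large-memory jobs
STAMPEDE2 = "stampede2_parallel"     # assumed here to take parallel, long jobs

# Illustrative thresholds only; the talk does not give real cutoffs.
LARGE_MEMORY_GB = 128
MANY_CORES = 64
LONG_RUNTIME_HOURS = 12
SMALL_JOB_GB = 4


@dataclass(frozen=True)
class Job:
    """Minimal, hypothetical description of a job's resource needs."""
    tool: str
    memory_gb: float
    cores: int
    expected_hours: float


def choose_destination(job: Job) -> str:
    """Pick a destination tier for a job based on its resource needs."""
    # Extremely large memory demands (e.g. genome or transcriptome assembly).
    if job.memory_gb >= LARGE_MEMORY_GB:
        return BRIDGES
    # Highly parallel jobs with long runtimes (e.g. large-scale read mapping).
    if job.cores >= MANY_CORES and job.expected_hours >= LONG_RUNTIME_HOURS:
        return STAMPEDE2
    # Small single-core jobs stay on the dedicated first-line resources.
    if job.memory_gb <= SMALL_JOB_GB and job.cores == 1:
        return DEDICATED
    # Everything else goes to the elastic cloud.
    return JETSTREAM


if __name__ == "__main__":
    examples = [
        Job("genome_assembly", memory_gb=512, cores=16, expected_hours=24),
        Job("read_mapping", memory_gb=32, cores=128, expected_hours=48),
        Job("text_filter", memory_gb=1, cores=1, expected_hours=0.1),
        Job("variant_calling", memory_gb=16, cores=8, expected_hours=2),
    ]
    for job in examples:
        print(job.tool, "->", choose_destination(job))
```

Checking the most specialised resources first keeps the common case as the fall-through, which mirrors the slide's point that most jobs land on the elastic cloud while only unusual jobs need the unique systems.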