Galaxy... from genomic data science gateway to global community

James Taylor (@jxtx), Johns Hopkins, http://speakerdeck.com/jxtx

1. Science
2. Gateways
3. Community

Mammalian comparative genomics: the beginning
- 2001: Initial sequence of the human genome
- 2002: Initial sequence of the mouse genome
- 2004: Initial sequence of the rat genome
Our story begins somewhere around here!

Why care about comparative genomics?

https://twitter.com/lpachter/status/526904556261625857

- Coding regions (genes): deeply conserved across evolution, ~1.5% of the human genome
- Regulatory regions: much less conserved, 5-10% of the human genome

Preservation of functional sequences (Miller et al., Annu. Rev. Genomics Hum. Genet. 2004) [figure; axis: Time]

Whole genome scale alignments can potentially help us to understand biological function
- What is aligned to what and does it overlap with anything interesting?
- Can we see specific signals in alignments that inform us about specific functions?
Answering these questions requires computational approaches

Can we make it easier and more efficient for experimental and computational researchers to collaborate?

GALA enabled querying of annotation information from the human genome, alongside alignments with the mouse genome. It was integrated with the UCSC browser and allowed building up set queries from the results of previous queries.

To enable collaboration, can we make it easy for computational
researchers to integrate new tools, and for experimental researchers to use them?

2006: Galaxy [figure labels: Tools, Generated Web UI, Analysis History]

And then everything changed… again.
Illumina NovaSeq 6000: 20 billion 300bp DNA fragments per run, ~6 terabytes, every 2 days…
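
The NovaSeq figures quoted above can be checked with simple arithmetic. The sketch below assumes one byte per base and ignores quality scores and compression, neither of which is stated on the slide, so it is only an order-of-magnitude check.

```python
# Back-of-the-envelope check of the data volume quoted on the slide.
fragments_per_run = 20_000_000_000  # 20 billion fragments per run (from the slide)
bases_per_fragment = 300            # 300 bp per fragment (from the slide)
days_per_run = 2                    # "every 2 days" (from the slide)
bytes_per_base = 1                  # ASSUMPTION: 1 byte per base, no quality scores, no compression

bases_per_run = fragments_per_run * bases_per_fragment
terabytes_per_run = bases_per_run * bytes_per_base / 1e12
terabytes_per_year = terabytes_per_run * 365 / days_per_run

print(f"{bases_per_run:.1e} bases per run")
print(f"{terabytes_per_run:.1f} TB per run")
print(f"{terabytes_per_year:.0f} TB per year for one instrument running continuously")
```

Under these assumptions a single instrument running without pause would produce on the order of a petabyte of sequence per year.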

Sequencing is widely available… (http://omicsmaps.com)

...practically free... (https://www.genome.gov/27541954/dna-sequencing-costs-data/) [figure: Cost Per Human Genome ($)]

...and applicable across (nearly) all of Biology!
- How is the production of the right protein at the right time controlled?
- How are cells organized in 3D?
- How are cell types decided in development?
- How are different species related?
- What genome variants lead to different phenotypes or disease risk?

Modern biology has rapidly transformed into a data intensive discipline
- Large scale data acquisition has become easy, e.g. high-throughput sequencing and imaging
- Experiments are increasingly complex
- Making sense of results often requires mining and making connections across multiple databases
- Nearly all high-profile research involves some quantitative methods
How does this affect traditional research practices and outputs?

Data Pipeline, inspired by Leek and Peng, Nature 2015
- Stages: Idea → Experiment → Raw Data → Tidy Data → Summarized data → Results
- Steps: Experimental design, Data collection, Data cleaning, Data analysis, Inference
- Figure annotations: "The part we are considering here"; "The part that ends up in the Publication"

Three major concerns
- Accessibility: Making use of large-scale data requires complex computational resources and methods. Can all researchers access these approaches? How can we make these methods available to everyone?
- Transparency: Is it possible to communicate analyses and results in ways that are both easy to understand and provide all of the essential details?
- Reproducibility: Can analyses be precisely reproduced, to facilitate rigorous validation and peer review, and ease reuse?

Galaxy: accessible analysis system

- Describe analysis tool behavior abstractly
- Analysis environment automatically and transparently tracks details
- Workflow system for complex analysis, constructed explicitly or automatically
- Pervasive sharing, and publication of documents with integrated analysis
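
To make the first two ideas concrete, here is a minimal sketch of an abstract tool description from which both a form and a command line can be derived, plus a history that records each step. It is not Galaxy's actual tool format or API: `ToolDescription`, `History`, and the `sort_lines` tool are hypothetical names invented for illustration, and the command is only built, not executed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolDescription:
    """Hypothetical abstract description of a command-line tool."""
    tool_id: str
    version: str
    command_template: str        # placeholders are filled from parameter values
    parameters: Dict[str, type]  # parameter name -> expected type

    def form_fields(self) -> List[str]:
        # A user interface can be generated from the declared parameters alone.
        return [f"{name} ({kind.__name__})" for name, kind in self.parameters.items()]

    def build_command(self, values: Dict[str, object]) -> str:
        # Validate supplied values against the declaration before substitution.
        for name, kind in self.parameters.items():
            if not isinstance(values.get(name), kind):
                raise TypeError(f"parameter {name!r} must be {kind.__name__}")
        return self.command_template.format(**values)


@dataclass
class History:
    """Hypothetical record of every step, kept without extra user effort."""
    steps: List[dict] = field(default_factory=list)

    def run(self, tool: ToolDescription, values: Dict[str, object]) -> str:
        command = tool.build_command(values)
        # Record tool identity, version, and exact parameters for later reproduction.
        self.steps.append({"tool": tool.tool_id, "version": tool.version,
                           "parameters": dict(values), "command": command})
        return command


# Illustrative tool: the id, version, and template are made up for this example.
sort_tool = ToolDescription(
    tool_id="sort_lines",
    version="1.0",
    command_template="sort -k {column} {input} > {output}",
    parameters={"column": int, "input": str, "output": str},
)
history = History()
command = history.run(sort_tool, {"column": 2, "input": "a.tsv", "output": "b.tsv"})
print(sort_tool.form_fields())
print(command)
```

Because the description is declarative, the same object can drive the user-facing form, the command line, and the provenance record, which is the property the list above relies on.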

Visualization and visual analytics

Galaxy Interactive Environments (IEs): containerized apps, rapidly move between analysis modes

Galaxy is available as...
- A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
- Open source software that makes integrating your own tools and data and customizing for your own site simple
- An open extensible platform for sharing tools, datatypes, workflows, ...

usegalaxy.org A free science gateway for the genomics research community

usegalaxy.org
- We provided Galaxy as a free public website from the very beginning
- Fortunately nobody knew about it at first, and in 2005 the data wasn’t all that big anyway
- However, the demand for easy-to-use tools in the research community was even more than we anticipated… and we didn’t have much funding
- For eight years Galaxy was run largely on surplus hardware decommissioned by other groups, borrowed storage, whatever we could find

The great flood of 2012
Your data here

...In which [names shown as logos in the original slide] save Main

A nationally distributed service: The Galaxy / XSEDE Gateway

Stats for Galaxy Main (usegalaxy.org) in May 2018
- 125,000 registered users
- 2 PB user data
- 19M jobs run
- 100 training events (2017 & 2018)

[Map of sites and resources, credit: Nate Coraor]
- PSC, Pittsburgh: Bridges
- TACC, Austin: Stampede (462,462 cores, 205 TB memory); Galaxy Cluster (Rodeo) (256 cores, 2 TB memory); Corral/Stockyard (20 PB disk)
- PTI, IU Bloomington
- Legend: Dedicated resources, Shared XSEDE resources

usegalaxy.org Compute Architecture (June 2018) [diagram, credit: Nate Coraor]
- Hosts and sites: SmartOS (PSU), Bare metal cluster (TACC), VMWare (TACC), Stampede2 (TACC), Bridges (PSC), Jetstream (TACC), Jetstream (IU), Corral (TACC, 2.3 PB dataset storage)
- Services: web, jobs, db (PostgreSQL), Slurm, RabbitMQ, Pulsar (AMQP and HTTP), Swarm, NFS, CVMFS (stratum 0 and stratum 1)
- Nodes: roundup49 ... roundup64; Slurm instances and Swarm instances on Jetstream

This approach provides both scalability and flexibility
- A set of dedicated compute resources (deployed on TACC’s internal cloud) provide basic services and first line job execution
- The bulk of Galaxy jobs run on Jetstream, an OpenStack cloud which allows us to leverage elasticity to efficiently adjust to changing user demands
- Unique resources like Bridges and Stampede2 allow us to serve jobs that have extremely large memory demands (e.g. genome and transcriptome assembly), or are highly parallel with long runtimes (e.g. large-scale read mapping jobs)
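
The division of labor described above amounts to a routing rule. The sketch below is one way such a rule could look; it is not the actual usegalaxy.org configuration. The thresholds, the `Job` fields, and the pairing of Bridges with large memory and Stampede2 with long parallel jobs are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    tool: str              # hypothetical label, not a real tool id
    memory_gb: int
    cores: int
    expected_hours: float


# ASSUMPTION: illustrative thresholds, not values used by usegalaxy.org.
LARGE_MEMORY_GB = 128
MANY_CORES = 32
LONG_RUNTIME_HOURS = 24


def choose_destination(job: Job, dedicated_slots_free: int) -> str:
    """Pick a compute resource for a job, most specialized resource first."""
    if job.memory_gb >= LARGE_MEMORY_GB:
        return "bridges"    # extremely large memory, e.g. assembly (assumed pairing)
    if job.cores >= MANY_CORES and job.expected_hours >= LONG_RUNTIME_HOURS:
        return "stampede2"  # highly parallel with long runtime (assumed pairing)
    if dedicated_slots_free > 0:
        return "dedicated"  # first line execution on dedicated resources
    return "jetstream"      # elastic cloud absorbs the bulk of the jobs


print(choose_destination(Job("assembly", 512, 16, 48.0), dedicated_slots_free=0))
print(choose_destination(Job("read_mapping", 64, 64, 36.0), dedicated_slots_free=0))
print(choose_destination(Job("text_filter", 2, 1, 0.1), dedicated_slots_free=0))
```

Checking the specialized cases first keeps scarce resources for the jobs that cannot run anywhere else, while everything ordinary falls through to the elastic pool.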

Initial move to XSEDE resources (Enis Afgan)

Not just more jobs, different types of jobs
- Can now run larger jobs, and more jobs: 325,000 jobs run on behalf of 12,000 users
- Can run new types of jobs: Galaxy Interactive Environments: Jupyter, RStudio
(Enis Afgan)

Growing Community

2010: Galaxy Developer Conference

- Galaxy makes it easy to integrate new tools
- The Galaxy Toolshed (2011) makes it easy to share those tools
- However, new tools are published far faster than we can integrate them
- We needed help if this was going to scale at all!

Intergalactic Utilities Commission

- Maintains a set of high quality Galaxy tools in the GitHub repository. This repo serves as an excellent example and inspiration to all Galaxy tool developers.
- Cultivates and shares the Galaxy tool development best practices document.
- Provides support to tool developers on a public Gitter channel.

The IUC made the Galaxy tool ecosystem vastly more sustainable. Can we do the same for Galaxy core?

2015: CONTRIBUTING.md
- In 2015 we established an official open governance policy for core Galaxy code
- We established the committers group, consisting of experienced Galaxy developers with the responsibility of managing contributions, as well as adding additional committers
- All committers have equal power: we gave up control over the code in order to share ownership with the community!

What about training?

What about the Gateway itself?

An internationally distributed service: usegalaxy.✱
- usegalaxy.org
- usegalaxy.org.au
- usegalaxy.eu

CVMFS server distribution [map; cvmfs0 = Stratum 0 servers, cvmfs1 = Stratum 1 servers]
- XSEDE & CyVerse, TACC, Austin: cvmfs0-tacc0 (test.galaxyproject.org, main.galaxyproject.org), cvmfs1-tacc0
- XSEDE, Indiana University: cvmfs1-iu0
- Penn State: cvmfs0-psu0 (singularity.galaxyproject.org, data.galaxyproject.org), cvmfs1-psu0
- EU JRC, Ispra: galaxy.jrc.ec.europa.eu
- de.NBI, RZ Freiburg: cvmfs1-ufr0.usegalaxy.eu
- Galaxy Australia, Melbourne: cvmfs1-mel0.gvl.org.au

Achieving usegalaxy.✱ coherence
- Common reference and index data
  - These are already distributed by CVMFS, but organized in an ad hoc manner due to the history of Galaxy
  - Currently building an automated approach where metadata defining the complete set of reference and index data will live in GitHub, builds will be automated based on GitHub state, and successful builds deployed through CVMFS for replication to all sites
  - Intergalactic Data Commission: https://github.com/usegalaxy-eu/idc
- Common tools
  - A common set of tools and a common tool menu organization are currently being defined. Tools and tool configuration will also be replicated through CVMFS
  - This will ensure both that users will have the same user experience across different usegalaxy.✱ instances, and that workflows can be moved between instances and still execute correctly and reproducibly
  - Local custom tools will still be supported but clearly identified
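
One consequence of a common tool set is that workflow portability becomes a simple set comparison. The sketch below illustrates the idea only; the tool versions and the installed-tool sets are invented for the example and do not describe any real instance, and `missing_tools` is a hypothetical helper, not part of Galaxy.

```python
from typing import List, Set, Tuple

ToolRef = Tuple[str, str]  # (tool id, version)


def missing_tools(workflow_steps: List[ToolRef], installed: Set[ToolRef]) -> List[ToolRef]:
    """Return the workflow steps whose exact tool version is absent on an instance."""
    return [step for step in workflow_steps if step not in installed]


# ASSUMPTION: illustrative tool ids and versions, not real installation data.
workflow = [("fastqc", "0.72"), ("hisat2", "2.1.0"), ("featurecounts", "1.6.3")]
common_tool_set = {("fastqc", "0.72"), ("hisat2", "2.1.0"), ("featurecounts", "1.6.3")}
divergent_instance = {("fastqc", "0.72"), ("hisat2", "2.0.5")}

print(missing_tools(workflow, common_tool_set))     # nothing missing
print(missing_tools(workflow, divergent_instance))  # two steps cannot run as written
```

Matching on the exact version, rather than the tool name alone, is what makes the moved workflow reproducible and not merely runnable.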

Challenges for human genomic (+) data sharing
- The value of data is greatly increased by integration across datasets, e.g. in human genomics, power to detect relationships between individual variants and disease depends on the number of individuals measured
- Moving/copying data is wasteful: transfer costs, redundant storage costs
- Human genomic data comes with privacy concerns, need to ensure security and detect threats

AnVIL: The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-Space

AnVIL: Inverting the model of genomic data sharing
Traditional: Bring data to the researcher
- Copying/moving data is costly
- Harder to enforce security
- Redundant infrastructure
- Siloed compute
Goal: Bring researcher to the data
- Reduced redundancy and costs
- Active threat detection and auditing
- Greater accessibility
- Elastic, shared, compute

What is the AnVIL?
- Scalable and interoperable resource for the genomic scientific community
- Cloud-based infrastructure
- Shared analysis and computing environment
- Support genomic data access, sharing and computing across large genomic, and genomic related, data sets
- Genomic datasets, phenotypes and metadata
- Large datasets generated by NHGRI programs, as well as other initiatives / agencies
- Data access controls and data security
- Collaborative environment for datasets and analysis workflows
- ...for both users with limited computational expertise and sophisticated data scientist users

Goals of the AnVIL
1. Create open source software: storage, scalable analytics, data visualization
2. Organize and host key NHGRI datasets: CCDG, CMG, eMERGE, and more
3. Operate services for the world: security, training & outreach, new models of data access

- AnVIL / Terra: analysis workspaces and batch workflows
- AnVIL / Gen3: Data models, indexing, querying
- AnVIL / Dockstore: sharing containerized tools and workflows
- AnVIL / Analysis Environments: Jupyter Notebooks, RStudio, Galaxy, ...

- AnVIL / Terra: analysis workspaces and batch workflows
- AnVIL / Gen3: Data models, indexing, querying
- AnVIL / Analysis Environments: Jupyter Notebooks, RStudio, Galaxy, ...
- FISMA Moderate; 2 ATOs; pursuing FedRAMP
- All data use and analysis in a FISMA moderate environment
- Implemented on [platform shown as a logo in the original slide]
- Primary data storage costs covered by AnVIL, user private data and compute billed directly through Google

Proposed system architecture [diagram]
- Labels: AnVIL portal; Leo; Launch; Start Galaxy; Start RStudio; CloudMan; Galaxy; RStudio / Bioconductor; Kubernetes + Helm; API; Persistence; Workspace Persistence; CVMFS; Scale; Start
- One instance per user

Isolated Galaxy instances with a single interface [diagram]
- Galaxy Multiplexer in front of User 1, User 2, and anonymous users
- User 1 isolated resources (inside a security boundary): User 1 Galaxy instance, user data and DB, user compute containers
- User 2 isolated resources (inside a security boundary): User 2 Galaxy instance, user data and DB, user compute containers
- Anonymous user: unprivileged Galaxy instance
- Shared DB (no protected data)

Future k8s Remote Execution Data Flow [diagram, credit: @jmchilton, @natefoo]
- New job: inputs: dataset 1, dataset 2; outputs: dataset 3; tool: HISAT2
- Components: Galaxy, Kubernetes Job Pod, Executor Container, BioContainer, Volume, Data Storage, NFS
- Sequence over time: create job; get datasets 1, 2; execute job; compute; dataset 3; job complete
- Legend: control message, data movement
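
The sequence in this diagram (stage inputs, execute, return outputs, report completion) can be sketched with plain dictionaries standing in for the data storage and the job's volume. This is a simulation under stated assumptions, not Galaxy or Kubernetes code: `run_remote_job` is a hypothetical function, the "put" step is inferred from the diagram's data-movement arrows, and string concatenation stands in for the real tool.

```python
from typing import Callable, Dict, List


def run_remote_job(job: Dict[str, List[str]],
                   data_storage: Dict[str, str],
                   tool: Callable[..., str]) -> List[str]:
    """Simulate one remote job and return the ordered list of steps taken."""
    log: List[str] = []
    volume: Dict[str, str] = {}  # scratch space private to this job

    # Data movement: copy each input from shared storage into the job volume.
    for name in job["inputs"]:
        volume[name] = data_storage[name]
        log.append(f"get {name}")

    # Compute: the tool only ever sees the staged copies.
    result = tool(*(volume[name] for name in job["inputs"]))
    log.append("execute job")

    # Data movement: copy outputs back (ASSUMPTION: the tool yields one result).
    for name in job["outputs"]:
        volume[name] = result
        data_storage[name] = volume[name]
        log.append(f"put {name}")

    # Control message: tell the caller the job has finished.
    log.append("job complete")
    return log


storage = {"dataset 1": "ACGT", "dataset 2": "TTGA"}
new_job = {"inputs": ["dataset 1", "dataset 2"], "outputs": ["dataset 3"]}
steps = run_remote_job(new_job, storage, tool=lambda a, b: a + b)  # stand-in for a real tool
print(steps)
print(storage["dataset 3"])
```

Keeping staging and execution as separate steps is what lets the compute container run without direct access to the shared file system.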

Challenges for (health) science gateways
- Human genomic, health, and other protected data will only be available from a small set of analysis platforms
- For the foreseeable future this is motivated by policy, compliance, and political questions rather than technical concerns
- Moving data requires meeting substantial compliance requirements
- Making gateway software more modular and flexible, along with standards for deployment can mitigate this
- Kubernetes could be a lowest common denominator, but more standardization is needed
- We need to renew emphasis on interoperability at the platform, tool, and workflow level

Acknowledgements: Galaxy Contributors
- Core Code: contributors to galaxyproject/galaxy
  - ~315 (~39 new since last year)
- Tools: contributors to galaxyproject/tools-iuc
  - ~195 (~38 new since last year)
  - ...and the ever vigilant Intergalactic Utilities Commission for handling these contributions and maintaining the quality of essential Galaxy tools
  - ...and everyone else who has contributed a tool to the ToolShed
- Training: contributors to galaxyproject/training-material
  - ~140 (~34 new since last year)
  - ...and everyone who has conducted or attended Galaxy Training
- Everyone who has contributed to Galaxy in other ways
  - users, supporters, …
- Funding: NSF and NIH (to our team), and all of the funders of the Global Galaxy Community

Acknowledgements
- Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy Goecks, Sergey Golitsynskiy, Qiang Gu, Juleen Graham, Björn Grüning, Sam Guerler, Mo Heydarian, Will Holden, Jennifer Hillman-Jackson, Vahid Jalili, Delphine Lariviere, Alexandru Mahmoud, Anton Nekrutenko, Alex Ostrovsky, Helena Rasche, Luke Sargent, Nicola Soranzo, Marius van den Beek
- The rest of the Taylor Lab at JHU: Boris Brenerman, Min Hyung Cho, Peter DeFord, Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy
- Funding: NHGRI U41 HG006620 (Galaxy), NHGRI U24 HG010263 (AnVIL), NCI U24 CA231877 (Galaxy Federation), NSF DBI 0543285 and DBI 0850103 (Galaxy on US cyberinfrastructure)
- Collaborators: Dave Hancock and the Jetstream group, Ross Hardison and the VISION group, Victor Corces, Karen Reddy, Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology), Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)

Mo Heydarian Dave Clements

Acknowledgements: AnVIL Team
- Broad Institute: Anthony Philippakis, Daniel MacArthur, Alex Bauman, Adrian Sharma, Andrew Rula, Dave Bernick, Jonathan Lawson, Kristian Cibulskis, Namrata Gupta, Rob Title, Eric Banks, Rich Silva
- University of Chicago: Robert Grossman, Abby George, Garrett Rupp, Zac Flamig
- University of California Santa Cruz: Benedict Paten, Denis Yuen, Brian O’Connor, Charles Overbeck, Kevin Osborn, Louise Cabansay, Natalie Perez, Stefan Kuhn, Walt Shands
- Vanderbilt: Robert Carroll, Lakhan Swamy, Kristin Wuichet
- Washington University: Ira Hall, Adam Coffman, Allison Reieir, Haley Abel, Jason Walker
- Johns Hopkins: James Taylor, Jeff Leek, Kasper Hansen, Enis Afgan, Alexandru Mahmoud, Sergey Golitsynskiy, Jenn Vessio, John Muschelli, Mo Heydarian
- Penn State University: Anton Nekrutenko, John Chilton, Nate Coraor, Martin Čech
- Oregon Health & Sciences University: Jeremy Goecks, Kyle Ellrott, Brian Walsh, Luke Sargent, Vahid Jalili
- Roswell Park Cancer Institute: Martin Morgan, Nitesh Turaga
- Harvard: Vincent Carey, BJ Stubbs, Shweta Gopaulakrishnan
- City University of New York: Levi Waldron, Sehyun Oh, Ludwig Geistlinger

(fin)

You’ve gone too far!

(seriously stop)

Colors: We used (nearly) the “Paired” colormap for the grant figures
