Slide 1

Slide 1 text

...from genomic data science gateway to global community
James Taylor (@jxtx), Johns Hopkins, http://speakerdeck.com/jxtx

Slide 2

Slide 2 text

1. Science
2. Gateways
3. Community

Slide 3

Slide 3 text

Mammalian comparative genomics: the beginning
- 2001: Initial sequence of the human genome
- 2002: Initial sequence of the mouse genome
- 2004: Initial sequence of the rat genome

Slide 4

Slide 4 text

Mammalian comparative genomics: the beginning
- 2001: Initial sequence of the human genome
- 2002: Initial sequence of the mouse genome
- 2004: Initial sequence of the rat genome
Our story begins somewhere around here!

Slide 5

Slide 5 text

Why care about comparative genomics?

Slide 6

Slide 6 text

https://twitter.com/lpachter/status/526904556261625857

Slide 7

Slide 7 text

- Coding regions (genes): deeply conserved across evolution, ~1.5% of the human genome
- Regulatory regions: much less conserved, 5-10% of the human genome

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Preservation of functional sequences (Miller et al. Annu. Rev. Genomics Hum. Genet. 2004) [figure, axis: Time]

Slide 10

Slide 10 text

Whole genome scale alignments can potentially help us to understand biological function
- What is aligned to what and does it overlap with anything interesting?
- Can we see specific signals in alignments that inform us about specific functions?
Answering these questions requires computational approaches
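The first question ("what is aligned to what and does it overlap with anything interesting?") comes down to interval intersection on genome coordinates. A minimal sketch in Python, assuming half-open (start, end) intervals on a single chromosome; the function name and the sample coordinates are made up for illustration and do not come from the talk:

```python
def overlaps(aligned_blocks, features):
    """Return (block, feature) pairs that share at least one base.

    Both inputs are lists of half-open (start, end) intervals on the same
    chromosome. This is a simple O(n * m) scan, fine for illustration only.
    """
    hits = []
    for block in aligned_blocks:
        for feature in features:
            # Two half-open intervals overlap when each starts before the other ends.
            if block[0] < feature[1] and feature[0] < block[1]:
                hits.append((block, feature))
    return hits


# Hypothetical coordinates: two aligned blocks and two annotated features.
blocks = [(100, 200), (500, 650)]
annotations = [(150, 180), (700, 800)]
result = overlaps(blocks, annotations)
print(result)
```

At whole-genome scale a sorted sweep or an interval tree would likely replace the nested loop; the overlap test itself stays the same.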

Slide 11

Slide 11 text

Can we make it easier and more efficient for experimental and computational researchers to collaborate?

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

GALA enabled querying of annotation information from the human genome alongside alignments with the mouse genome, was integrated with the UCSC browser, and allowed building up set queries using the results of previous queries

Slide 14

Slide 14 text

To enable collaboration, can we make it easy for computational researchers to integrate new tools, and for experimental researchers to use them?

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

2006 Galaxy: Tools, Generated Web UI, Analysis History

Slide 19

Slide 19 text

And then everything changed… again.
Illumina NovaSeq 6000
- 20 billion 300bp DNA fragments per run
- ~6 Terabytes
- Every 2 days…
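The ~6 terabyte figure follows from simple arithmetic on the numbers above. A rough check, assuming about one byte per sequenced base and ignoring quality scores, read headers, and compression (all of which change real file sizes):

```python
fragments_per_run = 20_000_000_000  # 20 billion fragments, from the slide
bases_per_fragment = 300            # 300 bp per fragment, from the slide
bytes_per_base = 1                  # assumption: one byte per base call

total_bases = fragments_per_run * bases_per_fragment
terabytes = total_bases * bytes_per_base / 1e12  # decimal terabytes

print(f"{total_bases:.1e} bases, about {terabytes:.0f} TB per run")
```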

Slide 20

Slide 20 text

And then everything changed… again.

Slide 21

Slide 21 text

Sequencing is widely available… (http://omicsmaps.com)

Slide 22

Slide 22 text

...practically free... (https://www.genome.gov/27541954/dna-sequencing-costs-data/) [figure: Cost Per Human Genome ($)]

Slide 23

Slide 23 text

...and applicable across (nearly) all of Biology!
- How is the production of the right protein at the right time controlled?
- How are cells organized in 3D?
- How are cell types decided in development?
- How are different species related?
- What genome variants lead to different phenotypes or disease risk?

Slide 24

Slide 24 text

Modern biology has rapidly transformed into a data intensive discipline
- Large scale data acquisition has become easy, e.g. high-throughput sequencing and imaging
- Experiments are increasingly complex
- Making sense of results often requires mining and making connections across multiple databases
- Nearly all high-profile research involves some quantitative methods
How does this affect traditional research practices and outputs?

Slide 25

Slide 25 text

Data Pipeline, inspired by Leek and Peng, Nature 2015 [diagram]
- Stages, in order: Idea, Experiment, Raw Data, Tidy Data, Summarized data, Results
- Steps between stages, in order: Experimental design, Data collection, Data cleaning, Data analysis, Inference
- Annotations: "The part we are considering here" and "The part that ends up in the Publication"

Slide 26

Slide 26 text

Three major concerns
- Accessibility: Making use of large-scale data requires complex computational resources and methods. Can all researchers access these approaches? How can we make these methods available to everyone?
- Transparency: Is it possible to communicate analyses and results in ways that are both easy to understand and provide all of the essential details?
- Reproducibility: Can analyses be precisely reproduced, to facilitate rigorous validation and peer review, and ease reuse?

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Galaxy: accessible analysis system

Slide 29

Slide 29 text

Describe analysis tool behavior abstractly

Slide 30

Slide 30 text

- Describe analysis tool behavior abstractly
- Analysis environment automatically and transparently tracks details

Slide 31

Slide 31 text

- Describe analysis tool behavior abstractly
- Analysis environment automatically and transparently tracks details
- Workflow system for complex analysis, constructed explicitly or automatically

Slide 32

Slide 32 text

- Describe analysis tool behavior abstractly
- Pervasive sharing, and publication of documents with integrated analysis
- Analysis environment automatically and transparently tracks details
- Workflow system for complex analysis, constructed explicitly or automatically
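To make "describe analysis tool behavior abstractly" concrete: the idea is that a tool is declared as data (its parameters and a command template), so that a user interface and a record of what was run can both be derived from that declaration. The sketch below is in Python for illustration only. It is not Galaxy's tool format or API, and every class, field, and tool name in it is hypothetical:

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class Param:
    name: str
    label: str
    default: str = ""


@dataclass
class ToolDescription:
    """Hypothetical abstract description of a command-line tool."""
    tool_id: str
    command: str  # template with $name placeholders
    params: list = field(default_factory=list)

    def form_fields(self):
        # A web form could be generated from the declared parameters alone.
        return [(p.name, p.label, p.default) for p in self.params]

    def build_command(self, values):
        # Start from declared defaults, then apply the user's values.
        merged = {p.name: p.default for p in self.params}
        merged.update(values)
        return Template(self.command).substitute(merged)

    def provenance(self, values):
        # Record what was run so the step can be inspected and repeated.
        return {
            "tool": self.tool_id,
            "params": dict(values),
            "command": self.build_command(values),
        }


tool = ToolDescription(
    tool_id="example_filter",
    command="filter --min-length $min_length $input > $output",
    params=[
        Param("input", "Input dataset"),
        Param("output", "Output dataset"),
        Param("min_length", "Minimum length", default="50"),
    ],
)
record = tool.provenance({"input": "reads.txt", "output": "kept.txt"})
print(record["command"])
```

Because the description is plain data, the same object can drive the generated form, the executed command, and the tracked history entry, which is the property the bullets above rely on.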

Slide 33

Slide 33 text

Visualization and visual analytics

Slide 34

Slide 34 text

Galaxy IEs: containerized apps, rapidly move between analysis modes

Slide 35

Slide 35 text

Galaxy is available as...
- A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
- Open source software that makes integrating your own tools and data and customizing for your own site simple
- An open extensible platform for sharing tools, datatypes, workflows, ...

Slide 36

Slide 36 text

usegalaxy.org A free science gateway for the genomics research community

Slide 37

Slide 37 text

usegalaxy.org
- We provided Galaxy as a free public website from the very beginning
- Fortunately nobody knew about it at first, and in 2005 the data wasn’t all that big anyway
- However, the demand for easy-to-use tools in the research community was even more than we anticipated… and we didn’t have much funding
- For eight years Galaxy was run largely on surplus hardware decommissioned by other groups, borrowed storage, whatever we could find

Slide 38

Slide 38 text

The great flood of 2012

Slide 39

Slide 39 text

The great flood of 2012 Your data here

Slide 40

Slide 40 text

...In which Save main , , and ,

Slide 41

Slide 41 text

A nationally distributed service: The Galaxy / XSEDE Gateway

Slide 42

Slide 42 text

Stats for Galaxy Main (usegalaxy.org) in May 2018
- 125,000 registered users
- 2PB user data
- 19M jobs run
- 100 training events (2017 & 2018)

Slide 43

Slide 43 text

[diagram: dedicated resources and shared XSEDE resources by site] (Nate Coraor)
- TACC, Austin: Galaxy Cluster (Rodeo), 256 cores, 2 TB memory; Corral/Stockyard, 20 PB disk; Stampede, 462,462 cores, 205 TB memory
- PSC, Pittsburgh: Bridges
- PTI, IU Bloomington

Slide 44

Slide 44 text

usegalaxy.org Compute Architecture (June 2018) [diagram] (Nate Coraor)
- Hosts and clouds: SmartOS (PSU), bare metal cluster (TACC), VMWare (TACC), Jetstream (TACC), Jetstream (IU), Stampede2 (TACC), Bridges (PSC)
- Storage: Corral (TACC), 2.3 PB dataset storage; NFS
- Services shown: web, jobs, db, PostgreSQL, Slurm, RabbitMQ, Pulsar (over AMQP and HTTP), Swarm, CVMFS stratum 0 and stratum 1
- Cluster nodes: roundup49 ... roundup64

Slide 45

Slide 45 text

This approach provides both scalability and flexibility
- A set of dedicated compute resources (deployed on TACC’s internal cloud) provide basic services and first line job execution
- The bulk of Galaxy jobs run on Jetstream, an OpenStack cloud which allows us to leverage elasticity to efficiently adjust to changing user demands
- Unique resources like Bridges and Stampede2 allow us to serve jobs that have extremely large memory demands (e.g. genome and transcriptome assembly), or are highly parallel with long runtimes (e.g. large-scale read mapping jobs)
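The three tiers described here can be pictured as a simple routing decision. The sketch below is an illustration only, not the configuration used on usegalaxy.org: the thresholds, field names, and destination labels are assumptions, and sending large-memory jobs to Bridges and highly parallel jobs to Stampede2 is one reading of the text rather than something the slide states.

```python
from dataclasses import dataclass


@dataclass
class Job:
    tool: str
    memory_gb: int
    cores: int
    expected_hours: float


# Hypothetical thresholds; the talk does not give actual cutoffs.
LARGE_MEMORY_GB = 256
MANY_CORES = 64
LONG_RUNTIME_HOURS = 24
SMALL_JOB_MEMORY_GB = 8


def choose_destination(job: Job) -> str:
    # Extremely large memory demands (e.g. assembly) go to a large-memory system.
    if job.memory_gb >= LARGE_MEMORY_GB:
        return "bridges"
    # Highly parallel, long-running jobs (e.g. large-scale read mapping).
    if job.cores >= MANY_CORES and job.expected_hours >= LONG_RUNTIME_HOURS:
        return "stampede2"
    # Small, quick jobs stay on the dedicated first-line resources.
    if job.memory_gb <= SMALL_JOB_MEMORY_GB and job.expected_hours < 1:
        return "dedicated"
    # Everything else runs on the elastic cloud.
    return "jetstream"


for job in [
    Job("assembler", memory_gb=512, cores=16, expected_hours=48),
    Job("read_mapper", memory_gb=64, cores=128, expected_hours=36),
    Job("sort", memory_gb=2, cores=1, expected_hours=0.1),
    Job("variant_caller", memory_gb=32, cores=8, expected_hours=4),
]:
    print(job.tool, "->", choose_destination(job))
```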

Slide 46

Slide 46 text

Initial move to XSEDE resources (Enis Afgan)

Slide 47

Slide 47 text

Not just more jobs, different types of jobs (Enis Afgan)
- Can now run larger jobs, and more jobs: 325,000 jobs run on behalf of 12,000 users
- Can run new types of jobs: Galaxy Interactive Environments: Jupyter, RStudio

Slide 48

Slide 48 text

Growing Community

Slide 49

Slide 49 text

2010: Galaxy Developer Conference

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

- Galaxy makes it easy to integrate new tools
- The Galaxy Toolshed (2011) makes it easy to share those tools
- However, new tools are published far faster than we can integrate them
- We needed help if this was going to scale at all!

Slide 52

Slide 52 text

Intergalactic Utilities Commission

Slide 53

Slide 53 text

● Maintains a set of high quality Galaxy tools in the GitHub repository. This repo serves as an excellent example and inspiration to all Galaxy tool developers.
● Cultivates and shares the Galaxy tool development best practices document.
● Provides support to tool developers on a public Gitter channel.

Slide 54

Slide 54 text

The IUC made the Galaxy tool ecosystem vastly more sustainable. Can we do the same for Galaxy core?

Slide 55

Slide 55 text

2015: CONTRIBUTING.md
- In 2015 we established an official open governance policy for core Galaxy code
- We established the committers group, consisting of experienced Galaxy developers with the responsibility of managing contributions, as well as adding additional committers
- All committers have equal power: we gave up control over the code in order to share ownership with the community!

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

What about training?

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

What about the Gateway itself?

Slide 63

Slide 63 text

An internationally distributed service: usegalaxy.✱
- usegalaxy.org
- usegalaxy.org.au
- usegalaxy.eu

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

CVMFS server distribution [map; legend: Stratum 0 servers, Stratum 1 servers]
- Penn State: cvmfs0-psu0 (singularity.galaxyproject.org, data.galaxyproject.org), cvmfs1-psu0
- XSEDE & CyVerse, TACC, Austin: cvmfs0-tacc0 (test.galaxyproject.org, main.galaxyproject.org), cvmfs1-tacc0
- XSEDE, Indiana University: cvmfs1-iu0
- EU JRC, Ispra: galaxy.jrc.ec.europa.eu
- de.NBI, RZ Freiburg: cvmfs1-ufr0.usegalaxy.eu
- Galaxy Australia, Melbourne: cvmfs1-mel0.gvl.org.au

Slide 66

Slide 66 text

Achieving usegalaxy.✱ coherence
- Common reference and index data
  - These are already distributed by CVMFS, but organized in an ad hoc manner due to the history of Galaxy
  - Currently building an automated approach where metadata defining the complete set of reference and index data will live in GitHub, builds will be automated based on GitHub state, and successful builds deployed through CVMFS for replication to all sites
  - Intergalactic Data Commission: https://github.com/usegalaxy-eu/idc
- Common tools
  - A common set of tools and a common tool menu organization is currently being defined. Tools and tool configuration will also be replicated through CVMFS
  - This will ensure both that users will have the same user experience across different usegalaxy.✱ instances, and that workflows can be moved between instances and still execute correctly and reproducibly
  - Local custom tools will still be supported but clearly identified
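One way to picture the automated reference-data approach: metadata in the repository declares which reference and index builds should exist, and automation builds whatever is declared but not yet published. A minimal sketch under that assumption; the data layout, the genome and indexer names, and the function are hypothetical and are not taken from the IDC repository:

```python
def plan_builds(declared, published):
    """Return the (genome, indexer) builds that are declared but not yet published.

    declared:  dict mapping genome id -> list of indexers requested in metadata
    published: set of (genome, indexer) pairs already available for replication
    """
    todo = []
    for genome, indexers in sorted(declared.items()):
        for indexer in indexers:
            if (genome, indexer) not in published:
                todo.append((genome, indexer))
    return todo


# Hypothetical state: hg38 needs one more index, mm10 needs everything.
declared = {"hg38": ["bwa", "hisat2"], "mm10": ["bwa"]}
published = {("hg38", "bwa")}
pending = plan_builds(declared, published)
print(pending)
```

Driving builds from the difference between declared and published state keeps every site that replicates the result on the same set of data, which is the coherence goal stated above.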

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

Challenges for human genomic (+) data sharing
- The value of data is greatly increased by integration across datasets
  - e.g. in human genomics, power to detect relationships between individual variants and disease depends on the number of individuals measured
- Moving/copying data is wasteful: transfer costs, redundant storage costs
- Human genomic data comes with privacy concerns, need to ensure security and detect threats

Slide 70

Slide 70 text

AnVIL The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-Space

Slide 71

Slide 71 text

AnVIL: Inverting the model of genomic data sharing
Traditional: Bring data to the researcher
- Copying/moving data is costly
- Harder to enforce security
- Redundant infrastructure
- Siloed compute
Goal: Bring researcher to the data
- Reduced redundancy and costs
- Active threat detection and auditing
- Greater accessibility
- Elastic, shared, compute

Slide 72

Slide 72 text

What is the AnVIL?
- Scalable and interoperable resource for the genomic scientific community
- Cloud-based infrastructure
- Shared analysis and computing environment
- Support genomic data access, sharing and computing across large genomic, and genomic related, data sets
- Genomic datasets, phenotypes and metadata
- Large datasets generated by NHGRI programs, as well as other initiatives / agencies
- Data access controls and data security
- Collaborative environment for datasets and analysis workflows
- ...for both users with limited computational expertise and sophisticated data scientist users

Slide 73

Slide 73 text

Goals of the AnVIL
1. Create open source software: Storage, scalable analytics, data visualization
2. Organize and host key NHGRI datasets: CCDG, CMG, eMERGE, and more
3. Operate services for the world: Security, training & outreach, new models of data access

Slide 74

Slide 74 text

- AnVIL / Terra: analysis workspaces and batch workflows
- AnVIL / Gen3: Data models, indexing, querying
- AnVIL / Dockstore: sharing containerized tools and workflows
- AnVIL / Analysis Environments: Jupyter Notebooks, RStudio, Galaxy, ...

Slide 75

Slide 75 text

- AnVIL / Terra: analysis workspaces and batch workflows
- AnVIL / Gen3: Data models, indexing, querying
- AnVIL / Analysis Environments: Jupyter Notebooks, RStudio, Galaxy, ...
- FISMA Moderate, 2 ATOs, pursuing FedRAMP
- All data use and analysis in a FISMA moderate environment
- Implemented on Google Cloud Platform
- Primary data storage costs covered by AnVIL, user private data and compute billed directly through Google

Slide 76

Slide 76 text

Proposed system architecture [diagram]
- Labeled components: AnVIL portal (Launch, Start Galaxy, Start RStudio), Leo, CloudMan, Galaxy, RStudio / Bioconductor, Kubernetes + Helm, API, Persistence, Workspace Persistence, CVMFS
- Annotations: Start, Scale, One instance per user

Slide 77

Slide 77 text

Isolated Galaxy instances with a single interface [diagram]
- Galaxy Multiplexer: the single interface for User 1, User 2, and anonymous users
- Security boundary: User 1 and User 2 each have isolated resources (user data and DB, their own Galaxy instance, user compute containers)
- Anonymous user: unprivileged Galaxy instance
- Shared DB (no protected data)
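The multiplexer idea, a single entry point that forwards each user to their own isolated instance, can be sketched as a lookup with a fallback for anonymous users. This is a toy model only: the class, method names, and URLs are hypothetical placeholders and are not part of AnVIL or Galaxy.

```python
class GalaxyMultiplexer:
    """Hypothetical router: one interface in front of per-user instances."""

    def __init__(self, unprivileged_url):
        self.unprivileged_url = unprivileged_url  # shared, no protected data
        self.instances = {}                       # user id -> instance URL

    def register(self, user_id, instance_url):
        # Each authenticated user gets exactly one isolated instance.
        self.instances[user_id] = instance_url

    def route(self, user_id=None):
        # Anonymous or unknown users never reach an isolated instance.
        if user_id is None or user_id not in self.instances:
            return self.unprivileged_url
        return self.instances[user_id]


mux = GalaxyMultiplexer("https://galaxy.example.org/anonymous")
mux.register("user1", "https://galaxy.example.org/u/user1")
mux.register("user2", "https://galaxy.example.org/u/user2")

print(mux.route("user1"))
print(mux.route())
```

A real deployment would also have to authenticate the user before the lookup and enforce the boundary at the network and storage level; the sketch only shows the routing decision.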

Slide 78

Slide 78 text

Future k8s Remote Execution Data Flow [diagram] (@jmchilton, @natefoo)
- Components: Galaxy, Kubernetes Job Pod (BioContainer, Executor Container), Data Storage Volume, NFS
- Example new job: tool: HISAT2; inputs: dataset 1, dataset 2; outputs: dataset 3
- Step labels along the Time axis: create job, get datasets 1, 2, execute job, compute, job complete
- Legend: control message, data movement

Slide 79

Slide 79 text

Challenges for (health) science gateways
- Human genomic, health, and other protected data will only be available from a small set of analysis platforms
- For the foreseeable future this is motivated by policy, compliance, and political questions rather than technical concerns
- Moving data requires meeting substantial compliance requirements
- Making gateway software more modular and flexible, along with standards for deployment can mitigate this
- Kubernetes could be a lowest common denominator, but more standardization is needed
- We need to renew emphasis on interoperability at the platform, tool, and workflow level

Slide 80

Slide 80 text

ACK

Slide 81

Slide 81 text

Acknowledgements: Galaxy Contributors
- Core Code: contributors to galaxyproject/galaxy: ~315 (~39 new since last year)
- Tools: contributors to galaxyproject/tools-iuc: ~195 (~38 new since last year)
  - ...and the ever vigilant Intergalactic Utilities Commission for handling these contributions and maintaining the quality of essential Galaxy tools
  - ...and everyone else who has contributed a tool to the ToolShed
- Training: contributors to galaxyproject/training-material: ~140 (~34 new since last year)
  - ...and everyone who has conducted or attended Galaxy Training
- Everyone who has contributed to Galaxy in other ways: users, supporters, …
- Funding: NSF and NIH (to our team), and all of the funders of the Global Galaxy Community

Slide 82

Slide 82 text

Acknowledgements
- Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy Goecks, Sergey Golitsynskiy, Qiang Gu, Juleen Graham, Björn Grüning, Sam Guerler, Mo Heydarian, Will Holden, Jennifer Hillman-Jackson, Vahid Jalili, Delphine Lariviere, Alexandru Mahmoud, Anton Nekrutenko, Alex Ostrovsky, Helena Rasche, Luke Sargent, Nicola Soranzo, Marius van den Beek
- The rest of the Taylor Lab at JHU: Boris Brenerman, Min Hyung Cho, Peter DeFord, Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy
- Funding: NHGRI U41 HG006620 (Galaxy), NHGRI U24 HG010263 (AnVIL), NCI U24 CA231877 (Galaxy Federation), NSF DBI 0543285 and DBI 0850103 (Galaxy on US cyberinfrastructure)
- Collaborators: Dave Hancock and the Jetstream group, Ross Hardison and the VISION group, Victor Corces, Karen Reddy, Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology), Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)

Slide 83

Slide 83 text

Mo Heydarian Dave Clements

Slide 84

Slide 84 text

Acknowledgements: AnVIL Team
- Broad Institute: Anthony Philippakis, Daniel MacArthur, Alex Bauman, Adrian Sharma, Andrew Rula, Dave Bernick, Jonathan Lawson, Kristian Cibulskis, Namrata Gupta, Rob Title, Eric Banks, Rich Silva
- University of Chicago: Robert Grossman, Abby George, Garrett Rupp, Zac Flamig
- University of California Santa Cruz: Benedict Paten, Denis Yuen, Brian O’Connor, Charles Overbeck, Kevin Osborn, Louise Cabansay, Natalie Perez, Stefan Kuhn, Walt Shands
- Vanderbilt: Robert Carroll, Lakhan Swamy, Kristin Wuichet
- Washington University: Ira Hall, Adam Coffman, Allison Reieir, Haley Abel, Jason Walker
- Johns Hopkins: James Taylor, Jeff Leek, Kasper Hansen, Enis Afgan, Alexandru Mahmoud, Sergey Golitsynskiy, Jenn Vessio, John Muschelli, Mo Heydarian
- Penn State University: Anton Nekrutenko, John Chilton, Nate Coraor, Martin Čech
- Oregon Health & Science University: Jeremy Goecks, Kyle Ellrott, Brian Walsh, Luke Sargent, Vahid Jalili
- Roswell Park Cancer Institute: Martin Morgan, Nitesh Turaga
- Harvard: Vincent Carey, BJ Stubbs, Shweta Gopaulakrishnan
- City University of New York: Levi Waldron, Sehyun Oh, Ludwig Geistlinger

Slide 85

Slide 85 text

(fin)

Slide 86

Slide 86 text

You’ve gone too far!

Slide 87

Slide 87 text

(seriously stop)

Slide 88

Slide 88 text

Colors We used (nearly) the “Paired” colormap for the grant figures
