Making large-scale genomic analysis accessible, transparent, and reproducible

James Taylor (@jxtx), Johns Hopkins
http://speakerdeck.com/jxtx
http://galaxyproject.org
http://anvilproject.org

SEQUENCING

It’s widely available… (http://omicsmaps.com)

...practically free... (https://www.genome.gov/27541954/dna-sequencing-costs-data/)
Figure: cost per human genome ($)

...and applicable across (nearly) all of Biology!
- How is the production of the right protein at the right time controlled?
- How are cells organized in 3D?
- How are cell types decided in development?
- How are different species related?
- What genome variants lead to different phenotypes or disease risk?

However, it produces massive amounts of data
Illumina NovaSeq 6000: 20 billion 300 bp DNA fragments per run, ~6 terabytes, every 2 days…
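
The ~6 terabyte figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes one byte per base and ignores quality scores, read names, and compression, none of which the slide specifies.

```python
# Rough size estimate for one sequencing run, using the figures quoted above.
# Assumption (not from the slide): one byte per base, with no quality scores,
# read identifiers, or compression.
FRAGMENTS_PER_RUN = 20_000_000_000  # 20 billion fragments
BASES_PER_FRAGMENT = 300            # 300 bp each
BYTES_PER_BASE = 1                  # assumed encoding


def run_size_terabytes(fragments: int, bases: int, bytes_per_base: int) -> float:
    """Return the raw sequence size of one run in terabytes (10**12 bytes)."""
    return fragments * bases * bytes_per_base / 10**12


size_tb = run_size_terabytes(FRAGMENTS_PER_RUN, BASES_PER_FRAGMENT, BYTES_PER_BASE)
print(f"{size_tb:.0f} TB per run")
```

Under these assumptions the sequence alone accounts for the quoted volume; real output files would differ depending on format and compression.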

And sequencing is only the beginning
Lattice light-sheet microscope: 3D live cell imaging, terabytes per experiment
(Image from Karen Reddy: LMNB1 m6a-tracer)

Modern biology has rapidly transformed into a data intensive discipline
- Large scale data acquisition has become easy, e.g. high-throughput sequencing and imaging
- Experiments are increasingly complex
- Making sense of results often requires mining and making connections across multiple databases
- Nearly all high-profile research involves some quantitative methods
How does this affect traditional research practices and outputs?

Data pipeline, inspired by Leek and Peng, Nature 2015 (diagram)
- Stages: Idea → Experiment → Raw data → Tidy data → Summarized data → Results
- Steps: Experimental design → Data collection → Data cleaning → Data analysis → Inference
- Annotations: "The part we are considering here"; "The part that ends up in the publication"

Three major concerns
- Accessibility: Making use of large-scale data requires complex computational resources and methods. Can all researchers access these approaches? How can we make these methods available to everyone?
- Transparency: Is it possible to communicate analyses and results in ways that are both easy to understand and provide all of the essential details?
- Reproducibility: Can analyses be precisely reproduced, to facilitate rigorous validation and peer review, and ease reuse?

Galaxy: accessible analysis system

- Describe analysis tool behavior abstractly
- Analysis environment automatically and transparently tracks details
- Workflow system for complex analysis, constructed explicitly or automatically
- Pervasive sharing, and publication of documents with integrated analysis

Visualization and visual analytics

Galaxy IEs: containerized apps, rapidly move between analysis modes

Practical computational reproducibility

Persistent challenge: managing underlying software
- Bioinformatics workflows use a lot of different tools, which each use different software packages, which depend on other software packages…
- Running a workflow requires that we make it possible, and hopefully easy, for all of the underlying dependencies to be installed
- Reproducing a workflow requires assembling all of the right dependencies with all of the right versions, ideally in a controlled environment
- Sometimes different steps require different, and incompatible, versions of dependencies…
- The Galaxy project has wasted a lot of time trying to solve this problem

- Builds on the Conda packaging system, designed “for installing multiple versions of software packages and their dependencies and switching easily between them”
- More than 4000 recipes for software packages
- All packages are automatically built in a minimal environment to ensure isolation and portability
https://bioconda.github.io
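
To make the idea of pinned, side-by-side environments concrete, here is a minimal sketch that renders one Conda environment file per workflow step. The helper `render_environment`, the environment names, and the package versions are illustrative assumptions, not taken from the talk; only the `name` / `channels` / `dependencies` layout and the `package=version` pin syntax are standard Conda.

```python
from typing import Mapping, Sequence


def render_environment(
    name: str,
    pins: Mapping[str, str],
    channels: Sequence[str] = ("conda-forge", "bioconda"),
) -> str:
    """Render a Conda environment file with exact version pins.

    Hypothetical helper for illustration; package names and versions are
    supplied by the caller and are not checked against any channel.
    """
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    # Sort so the same pins always produce byte-identical output.
    lines += [f"  - {pkg}={version}" for pkg, version in sorted(pins.items())]
    return "\n".join(lines) + "\n"


# Two workflow steps that need different (illustrative) versions of one tool
# each get their own environment instead of sharing a single install.
step_envs = {
    "map-reads": {"samtools": "1.9", "hisat2": "2.1.0"},
    "legacy-report": {"samtools": "0.1.19"},
}
for env_name, pins in step_envs.items():
    print(render_environment(env_name, pins))
```

Each rendered file could then be handed to `conda env create -f <file>`, so incompatible versions live in separate environments.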

- Submit recipe to GitHub
- CircleCI pulls recipes and builds in minimal Docker container
- Successful binary builds from main repo uploaded to Anaconda, to be installed anywhere

Biocontainers
- Given a set of packages and versions in Conda/Bioconda, we can build a container with just that software on a minimal base image
- If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions)
- With automation, these containers can be built for every package with no manual modification or intervention (e.g. mulled)
https://biocontainers.pro/
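
The claim that the same packages on the same base image give the same container can be illustrated with a deterministic key. This is a sketch under stated assumptions: `container_key`, the base image name, and the package versions are hypothetical, and the hashing scheme is not the naming scheme that mulled or BioContainers actually use.

```python
import hashlib
from typing import Mapping


def container_key(pins: Mapping[str, str], base_image: str) -> str:
    """Derive a deterministic key from a base image and pinned packages.

    Illustrative scheme only; this is not the naming scheme that mulled or
    BioContainers actually use.
    """
    # Sort the pins so that input ordering does not change the key.
    parts = [base_image] + [f"{name}={version}" for name, version in sorted(pins.items())]
    digest = hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
    return digest[:16]


a = container_key({"samtools": "1.9", "hisat2": "2.1.0"}, "minimal-base:1")
b = container_key({"hisat2": "2.1.0", "samtools": "1.9"}, "minimal-base:1")
c = container_key({"hisat2": "2.1.0", "samtools": "1.8"}, "minimal-base:1")
print(a == b)  # same packages and base image, listed in a different order
print(a == c)  # one package version differs
```

A key like this could serve as a cache lookup: if a container for that exact combination already exists, it can be reused rather than rebuilt.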

- CircleCI pulls recipes and builds in minimal Docker container
- Successful binary builds from main repo uploaded to Anaconda, to be installed anywhere
- Same binary from Bioconda installed into minimal container for each provider (e.g. Singularity)

Increasingly precise environment control
- Tool and dependency binaries, built in minimal environment with controlled libs
- Container defines minimum environment
- Virtual machine controls kernel and apparent hardware environment (KVM, Xen, …)

Galaxy is available as...
- A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
- Open source software that makes integrating your own tools and data and customizing for your own site simple
- An open extensible platform for sharing tools, datatypes, workflows, ...

A nationally distributed service: The Galaxy / XSEDE Gateway

Stats for Galaxy Main (usegalaxy.org) in May 2018
- 125,000 registered users
- 2 PB user data
- 19M jobs run
- 100 training events (2017 & 2018)

Galaxy / XSEDE Gateway resources (diagram, Nate Coraor)
- Sites: PSC, Pittsburgh; TACC, Austin; PTI, IU Bloomington
- Stampede: 462,462 cores, 205 TB memory
- Bridges
- Galaxy Cluster (Rodeo): 256 cores, 2 TB memory
- Corral/Stockyard: 20 PB disk
- Legend: dedicated resources, shared XSEDE resources

usegalaxy.org Compute Architecture (June 2018) (diagram, Nate Coraor)
- Platforms: SmartOS (PSU), bare metal cluster (TACC), VMWare (TACC), Stampede2 (TACC), Bridges (PSC), Jetstream (TACC), Jetstream (IU)
- Storage: Corral (TACC), 2.3 PB dataset storage; NFS
- Services: web, jobs, db (PostgreSQL), Slurm, RabbitMQ, Pulsar (AMQP and HTTP), Swarm, CVMFS stratum 0 and stratum 1 servers
- Nodes: roundup49 ... roundup64, Slurm instances, Swarm instances

This approach provides both scalability and flexibility
- A set of dedicated compute resources (deployed on TACC’s internal cloud) provides basic services and first line job execution
- The bulk of Galaxy jobs run on Jetstream, an OpenStack cloud which allows us to leverage elasticity to efficiently adjust to changing user demands
- Unique resources like Bridges and Stampede2 allow us to serve jobs that have extremely large memory demands (e.g. genome and transcriptome assembly), or are highly parallel with long runtimes (e.g. large-scale read mapping jobs)
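
The routing idea above can be sketched as a small decision function. This is a minimal illustration, not Galaxy's actual job configuration: the `JobRequest` fields, the destination labels, and every threshold are assumptions chosen only to show the shape of the logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical description of a job's resource needs."""
    tool: str
    memory_gb: int
    cores: int
    expected_hours: float


def choose_destination(
    job: JobRequest,
    large_memory_gb: int = 256,
    many_cores: int = 64,
    long_hours: float = 24.0,
) -> str:
    """Pick a destination class for a job (thresholds are made-up defaults)."""
    if job.memory_gb >= large_memory_gb:
        return "hpc-large-memory"   # e.g. genome or transcriptome assembly
    if job.cores >= many_cores and job.expected_hours >= long_hours:
        return "hpc-parallel"       # e.g. large-scale read mapping
    return "elastic-cloud"          # the bulk of ordinary jobs


jobs = [
    JobRequest("assembler", memory_gb=1000, cores=16, expected_hours=12.0),
    JobRequest("read-mapper", memory_gb=64, cores=128, expected_hours=48.0),
    JobRequest("text-filter", memory_gb=2, cores=1, expected_hours=0.1),
]
for job in jobs:
    print(job.tool, "->", choose_destination(job))
```

Keeping the thresholds as parameters means the same rule can be retuned as resources change, without touching the jobs themselves.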

Initial move to XSEDE resources (Enis Afgan)

Not just more jobs, different types of jobs
- Can now run larger jobs, and more jobs: 325,000 jobs run on behalf of 12,000 users
- Can run new types of jobs: Galaxy Interactive Environments (Jupyter, RStudio)
(Enis Afgan)

An internationally distributed service: usegalaxy.✱ usegalaxy.org usegalaxy.org.au usegalaxy.eu

CVMFS server distribution (diagram)
- Sites: XSEDE, Indiana University; XSEDE & CyVerse, TACC, Austin; EU JRC, Ispra; Penn State; de.NBI, RZ Freiburg; Galaxy Australia, Melbourne
- cvmfs0-tacc0: test.galaxyproject.org, main.galaxyproject.org
- cvmfs0-psu0: singularity.galaxyproject.org, data.galaxyproject.org
- Other servers: cvmfs1-tacc0, cvmfs1-iu0, cvmfs1-psu0, cvmfs1-ufr0.usegalaxy.eu, cvmfs1-mel0.gvl.org.au, galaxy.jrc.ec.europa.eu
- Legend: Stratum 0 servers, Stratum 1 servers

Achieving usegalaxy.✱ coherence
- Common reference and index data
  - These are already distributed by CVMFS, but organized in an ad hoc manner due to the history of Galaxy
  - Currently building an automated approach where metadata defining the complete set of reference and index data will live in GitHub, builds will be automated based on GitHub state, and successful builds deployed through CVMFS for replication to all sites
  - Intergalactic Data Commission: https://github.com/usegalaxy-eu/idc
- Common tools
  - A common set of tools and a common tool menu organization are currently being defined. Tools and tool configuration will also be replicated through CVMFS
  - This will ensure both that users have the same experience across different usegalaxy.✱ instances, and that workflows can be moved between instances and still execute correctly and reproducibly
  - Local custom tools will still be supported but clearly identified
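
The "builds automated based on GitHub state" idea amounts to comparing what the metadata declares with what has already been published. The sketch below shows only that comparison; `plan_builds`, the `(genome build, index type)` key shape, and the example entries are assumptions for illustration and do not reflect the actual IDC repository layout.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical key: (genome build, index type), e.g. ("hg38", "bowtie2").
DataKey = Tuple[str, str]


def plan_builds(declared: Iterable[DataKey], published: Iterable[DataKey]) -> List[DataKey]:
    """Return declared entries that have not been published yet, in stable order."""
    missing: Set[DataKey] = set(declared) - set(published)
    return sorted(missing)


# Declared state would come from metadata under version control; published
# state would come from what is already being replicated to every site.
declared = [("hg38", "bowtie2"), ("hg38", "bwa"), ("mm10", "bowtie2")]
published = [("hg38", "bowtie2")]

for genome, index_type in plan_builds(declared, published):
    print(f"build {index_type} index for {genome}")
```

Because the plan is derived purely from the two states, rerunning it after a successful deployment yields an empty list, which keeps the automation idempotent.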

Toward federated cloud Galaxy

A long-coming convergence (Enis Afgan)

A tool suite for cloud virtual environments (Enis Afgan)

Orchestrating a Galaxy instance
Since 2008 we’ve had two very different models for managing Galaxy:
- Traditional HPC instances running on metal, e.g. Galaxy Main
- Cloud instances managed by our cloud stack (CloudBridge, CloudLaunch, CloudMan), e.g. the Genomics Virtual Lab
We’re actively working to unify these approaches and create a single best practice for deploying and managing Galaxy instances. By orchestrating all components of Galaxy through Kubernetes, we can deploy robust Galaxy instances on local or cloud resources (e.g. using Rancher).

Bootstrap via CloudLaunch (diagram, Enis Afgan)
- Stack layers: CloudBridge (multi-cloud), CloudLaunch (infrastructure), CloudMan (coordination), HelmsMan (applications)
- Cloud providers: AWS, Azure, GCE, OpenStack
- Boot sequence: CloudLaunch-plugin runs a VM and gets its IP, galaxy/cloudman-boot, cloudman-boot → Rancher, K8S, Helm, CloudMan chart
- Galaxy Chart: remote object store(s), local cache, authn / authz, containerized jobs

Future Remote Execution Data Flow (diagram, Enis Afgan)
- Example job: tool HISAT2; inputs: dataset 1, dataset 2; outputs: dataset 3
- Components: Galaxy, Kubernetes, Google Bucket, Volume, compute; Job Pod with a BioContainer and an Executor Container
- Steps shown over time: create job; execute job; get datasets 1, 2; execute job; job complete
- Legend: control message, data movement

Challenges for human genomic (+) data sharing
- The value of data is greatly increased by integration across datasets
- e.g. in human genomics, power to detect relationships between individual variants and disease depends on the number of individuals measured
- Moving/copying data is wasteful: transfer costs, redundant storage costs
- Human genomic data comes with privacy concerns; we need to ensure security and detect threats

AnVIL: The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-Space

AnVIL: Inverting the model of genomic data sharing
Traditional: Bring data to the researcher
- Copying/moving data is costly
- Harder to enforce security
- Redundant infrastructure
- Siloed compute
Goal: Bring researcher to the data
- Reduced redundancy and costs
- Active threat detection and auditing
- Greater accessibility
- Elastic, shared compute

What is the AnVIL?
- Scalable and interoperable resource for the genomic scientific community
- Cloud-based infrastructure
- Shared analysis and computing environment
- Support genomic data access, sharing and computing across large genomic, and genomic related, data sets
- Genomic datasets, phenotypes and metadata
- Large datasets generated by NHGRI programs, as well as other initiatives / agencies
- Data access controls and data security
- Collaborative environment for datasets and analysis workflows
- ...for both users with limited computational expertise and sophisticated data scientist users

The AnVIL Team
Hopkins Team
- James Taylor, Jeff Leek, Mike Schatz, Kasper Hansen (Johns Hopkins)
- Anton Nekrutenko (Penn State University)
- Jeremy Goecks, Kyle Ellrott (Oregon Health & Sciences University)
- Martin Morgan (Roswell Park Cancer Institute)
- Vincent Carey (Harvard)
- Levi Waldron (City University of New York)
Broad Team
- Anthony Philippakis, Daniel MacArthur (Broad Institute)
- Robert Grossman (University of Chicago)
- Benedict Paten (University of California Santa Cruz)
- Josh Denny (Vanderbilt)
- Ira Hall (Washington University)
- Jennifer Hall (American Heart Association)

Cloud infrastructure and services (Broad)
Principles:
- Modular
- Open
- Community-driven
- Standards-based
Services:
- A modular suite of cloud services to support sharing and analyzing genomic and clinical data at scale
- Deployed in production as part of several flagship scientific activities, including:
  - All of Us
  - NIH Data Commons and NHLBI STAGE
  - NCI Cloud Resources
- We will now leverage these services to support AnVIL

Analysis tools, environments, training (Hopkins)
- Bring together groups that have built open-source platforms, tools, and workflows that are widely used in the genomics community
- Delivered a cloud-based analysis platform to hundreds of thousands of users for over ten years
- Developers of 3 of the world’s most popular MOOC sequences and have trained thousands of genomic researchers
Principles:
- Focus on enabling users
- Meet the needs of multiple research communities
- Leverage existing investment in tools to be useful quickly

Goals of the AnVIL
1. Create open source software: storage, scalable analytics, data visualization
2. Organize and host key NHGRI datasets: CCDG, CMG, eMERGE, and more
3. Operate services for the world: security, training & outreach, new models of data access

Components: Data commons / ecosystem (Gen3)
- Data commons framework services (authn, authz, data management, …)
- Applications for importing, exploring, and exporting data
- Interoperable based on GA4GH and Gen3 standards

Components: Analysis Platform (FireCloud/Terra)
- Collaborative cloud-based analysis platform built on top of Google Cloud Platform
- Free to access / compute & storage charged by Google
- All software components are fully open source
- Access published data and methods or add your own
- Execute analyses in an auditable manner
- Securely share data, methods and results
Architecture diagram labels: AUTH, API, Workspaces, Data Library, Tool Content Repository, Analysis Tools, FireCloud Portal (www.firecloud.org), Workbench

FireCloud/Terra: Security
Development and Deployment
- Authenticate, Authorize, Encrypt, Audit
- All activity audited, retained for 5 years
Verification
- Internal AppSec team (red team)
- Quarterly 3rd party pen tests
Compliance Certification
- 2 FISMA ATOs (FireCloud/NCI, AoU/NIH)
- Pursuing FedRAMP

Components: Portals and Applications

Hosting tools and analysis environments
- Data access / authorization constraints are pushed down into and enforced by the underlying cloud platform
- Virtual machines are provisioned by the platform on behalf of users; all workflows, tools, and analysis environments are run within a user’s security context
- Tools can be as simple as single container images, or multiple orchestrated containers
- e.g. in the case of Galaxy, the analysis environment will run in one or more containers provisioned for the user, with additional containers provisioned on demand to handle job execution elastically

Different analysis environments, common view of data

Combine multiple tools and environments
AnVIL APIs

Organize and host key NHGRI datasets
Data curation is a key unmet need across NIH
- Processing with consistent pipelines to facilitate sharing
- Common metadata model to support indexing and search
- Rigorous quality control and white/black-listing of data
- Structured data use restrictions to expedite DAC review
AnVIL will leverage experiences from the following efforts
- Phenotypic data models (Vanderbilt in All of Us/eMERGE)
- Read reprocessing and QC (Broad/WashU from CCDG; U. Chicago from GDC effort)
- Metadata models (UCSC from genome browser)

Components: Training and Outreach (diagram)
- Content: training materials (Jupyter/Markdown), videos (mp4), projects/questions (Jupyter/Markdown)
- Distribution: GitHub, YouTube, MOOCs, Leanpub, Coursera, edX
- AnVIL Training Network: Galaxy Training Network, Bioconductor courses, Data Carpentry
- Non-AnVIL training: Data Carpentry, university courses

DUOS: Broad Data Use Oversight System
- The model for requesting and reviewing data access scales poorly
- Each data access request needs to be manually reviewed against each data use agreement: O(N^2)
dbGaP figures shown on the slide:
- 826: number of studies in dbGaP
- 5,344: number of PIs requesting data
- 46: number of PI countries
- 1500+: number of publications resulting from secondary use of dbGaP data
- 13 days: average data access request time
- As of July 1, 2017: 50,167 submitted, 34,162 approved

DUOS: Broad Data Use Oversight System
1. Interfaces to transform data use restrictions and data access requests to machine-readable code
2. Interfaces for the Data Access Committee to adjudicate whether structuring and matching has been done appropriately
3. A matching algorithm that checks if data access requests are compatible with data use restrictions
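
Once restrictions and requests are machine-readable, the compatibility check can be automated. The sketch below is a deliberately simplified illustration and not the DUOS algorithm: it assumes both sides have been reduced to sets of category labels, and the class names, dataset names, and labels are all invented for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class DataUseRestriction:
    """Hypothetical structured form of a dataset's data use terms."""
    dataset: str
    permitted_uses: FrozenSet[str]


@dataclass(frozen=True)
class DataAccessRequest:
    """Hypothetical structured form of a researcher's request."""
    requester: str
    intended_uses: FrozenSet[str]


def is_compatible(request: DataAccessRequest, restriction: DataUseRestriction) -> bool:
    """A request is compatible if every intended use is explicitly permitted."""
    return request.intended_uses <= restriction.permitted_uses


def matching_datasets(
    request: DataAccessRequest, restrictions: Iterable[DataUseRestriction]
) -> List[str]:
    """Return the datasets whose restrictions the request satisfies."""
    return [r.dataset for r in restrictions if is_compatible(request, r)]


restrictions = [
    DataUseRestriction("study-a", frozenset({"disease-x", "methods"})),
    DataUseRestriction("study-b", frozenset({"disease-y"})),
]
request = DataAccessRequest("pi-1", frozenset({"disease-x"}))
print(matching_datasets(request, restrictions))
```

The number of request-restriction pairs still grows with both lists, but each check becomes a cheap set comparison, leaving the committee to review the structuring and the matches rather than every pair by hand.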

Acknowledgements
Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy Goecks, Sergey Golitsynskiy, Qiang Gu, Björn Grüning, Sam Guerler, Mo Heydarian, Jennifer Hillman-Jackson, Vahid Jalili, Delphine Lariviere, Alexandru Mahmoud, Anton Nekrutenko, Helena Rasche, Luke Sargent, Nicola Soranzo, Marius van den Beek
Taylor Lab at JHU: Boris Brenerman, Min Hyung Cho, Peter DeFord, Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy
AnVIL: Anthony Philippakis, Vincent Carey, Josh Denny, Kyle Ellrott, Jeremy Goecks, Robert Grossman, Ira Hall, Jennifer Hall, Kasper Hansen, Jeff Leek, Daniel MacArthur, Martin Morgan, Anton Nekrutenko, Benedict Paten, Mike Schatz, Levi Waldron, and many others!
Funding: NHGRI U41 HG006620 (Galaxy), NHGRI U24 HG010263 (AnVIL), NCI U24 CA231877 (Galaxy Federation), NSF DBI 0543285, DBI 0850103 (Galaxy on US cyberinfrastructure)
Collaborators: Dave Hancock and the Jetstream group, Ross Hardison and the VISION group, Victor Corces, Karen Reddy, Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology), Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
