Making large-scale genomic analysis accessible, transparent, and reproducible

James Taylor
February 08, 2019

CMU CompBio Seminar Series, some newish Galaxy stuff (galaxyproject.org) and some even newer stuff on the AnVIL (anvilproject.org)

http://www.cbd.cmu.edu/event/james-taylor-johns-hopkins-university-cmu/


Transcript

  1. Making large-scale genomic analysis accessible, transparent, and reproducible

    James Taylor (@jxtx), Johns Hopkins, http://speakerdeck.com/jxtx http://galaxyproject.org http://anvilproject.org
  2. SEQUENCING

  3. It’s widely available… (http://omicsmaps.com)

  4. ...practically free... (https://www.genome.gov/27541954/dna-sequencing-costs-data/)

    Chart: Cost Per Human Genome ($)

  5. ...and applicable across (nearly) all of Biology!

    - How is the production of the right protein at the right time controlled?
    - How are cells organized in 3D?
    - How are cell types decided in development?
    - How are different species related?
    - What genome variants lead to different phenotypes or disease risk?
  6. However, it produces massive amounts of data

    Illumina NovaSeq 6000: 20 billion 300 bp DNA fragments per run, ~6 terabytes, every 2 days…
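
The figures on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes roughly one byte per base (ignoring quality scores and compression) and continuous operation at one run every 2 days, neither of which is stated in the talk.

```python
# Figures from the slide; the one-byte-per-base reading is an assumption.
FRAGMENTS_PER_RUN = 20_000_000_000   # 20 billion fragments per run
BASES_PER_FRAGMENT = 300             # 300 bp each
RUN_INTERVAL_DAYS = 2                # one run every 2 days

bases_per_run = FRAGMENTS_PER_RUN * BASES_PER_FRAGMENT
terabases_per_run = bases_per_run / 1e12

# Assumes the instrument runs back to back all year (an idealization).
runs_per_year = 365 / RUN_INTERVAL_DAYS
terabases_per_year = terabases_per_run * runs_per_year

print(f"{terabases_per_run:.0f} terabases per run")
print(f"{terabases_per_year:.0f} terabases per year per instrument")
```

Under these assumptions a single instrument produces on the order of a petabase per year.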
  7. And sequencing is only the beginning

    Lattice light-sheet microscope: 3D live cell imaging, terabytes per experiment. Image from Karen Reddy (LMNB1 m6a-tracer)
  8. Modern biology has rapidly transformed into a data intensive discipline

    - Large scale data acquisition has become easy, e.g. high-throughput sequencing and imaging
    - Experiments are increasingly complex
    - Making sense of results often requires mining and making connections across multiple databases
    - Nearly all high-profile research involves some quantitative methods

    How does this affect traditional research practices and outputs?
  9. Data Pipeline, inspired by Leek and Peng, Nature 2015

    Stages: Idea → Experiment → Raw data → Tidy data → Summarized data → Results

    Steps between stages: Experimental design, Data collection, Data cleaning, Data analysis, Inference

    Annotations: "The part we are considering here" and "The part that ends up in the Publication"
  10. Three major concerns

    - Accessibility: Making use of large-scale data requires complex computational resources and methods. Can all researchers access these approaches? How can we make these methods available to everyone?
    - Transparency: Is it possible to communicate analyses and results in ways that are both easy to understand and provide all of the essential details?
    - Reproducibility: Can analyses be precisely reproduced, to facilitate rigorous validation and peer review, and ease reuse?
  12. Galaxy: accessible analysis system

  13. Describe analysis tool behavior abstractly

  14. Describe analysis tool behavior abstractly

    Analysis environment automatically and transparently tracks details
  15. Describe analysis tool behavior abstractly

    Analysis environment automatically and transparently tracks details

    Workflow system for complex analysis, constructed explicitly or automatically
  16. Describe analysis tool behavior abstractly

    Analysis environment automatically and transparently tracks details

    Workflow system for complex analysis, constructed explicitly or automatically

    Pervasive sharing, and publication of documents with integrated analysis
  17. Visualization and visual analytics

  18. Galaxy IEs: containerized apps, rapidly move between analysis modes

  21. Practical computational reproducibility

  22. Persistent challenge: managing underlying software

    - Bioinformatics workflows use a lot of different tools, which each use different software packages, which depend on other software packages…
    - Running a workflow requires that we make it possible, and hopefully easy, for all of the underlying dependencies to be installed
    - Reproducing a workflow requires assembling all of the right dependencies with all of the right versions, ideally in a controlled environment
    - Sometimes different steps require different, and incompatible, versions of dependencies…

    The Galaxy project has wasted a lot of time trying to solve this problem
  23. Bioconda (https://bioconda.github.io)

    - Builds on the Conda packaging system, designed “for installing multiple versions of software packages and their dependencies and switching easily between them”
    - More than 4000 recipes for software packages
    - All packages are automatically built in a minimal environment to ensure isolation and portability
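
To make the "right versions" idea concrete, here is a minimal sketch of pinning every package for one workflow step to an exact version. The helper `conda_create_command`, the environment name, and the package versions are hypothetical and chosen for illustration; the code only builds the command line and does not run Conda.

```python
import shlex


def conda_create_command(env_name, pins, channels=("conda-forge", "bioconda")):
    """Build a `conda create` command that pins each package to one version."""
    cmd = ["conda", "create", "--yes", "--name", env_name]
    for channel in channels:
        cmd += ["--channel", channel]
    # Sort the pins so the same set of packages always gives the same command.
    cmd += [f"{name}={version}" for name, version in sorted(pins.items())]
    return cmd


# Illustrative pins for a single workflow step (versions are examples only).
align_env = conda_create_command("align-step", {"samtools": "1.9", "hisat2": "2.1.0"})
print(shlex.join(align_env))
```

Giving each step its own environment is one way to handle steps that need incompatible versions of the same dependency.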
  24. Submit recipe to GitHub

    CircleCI pulls recipes and builds in a minimal Docker container

    Successful binary builds from the main repo are uploaded to Anaconda, to be installed anywhere
  25. Biocontainers (https://biocontainers.pro/)

    - Given a set of packages and versions in Conda/Bioconda, we can build a container with just that software on a minimal base image
    - If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions)
    - With automation, these containers can be built automatically for every package with no manual modification or intervention (e.g. mulled)
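
The reconstruction argument depends on a container being fully determined by its base image and its pinned packages. The sketch below shows that idea with a content-derived key. It is an assumption-laden illustration, not the naming scheme that mulled or Biocontainers actually use; `container_key` and the base image name are hypothetical.

```python
import hashlib


def container_key(pins, base_image="minimal-base:1.0"):
    """Derive a stable identifier from a base image plus pinned packages."""
    # Sorting makes the key independent of the order the pins were listed in.
    spec = [base_image] + [f"{name}={version}" for name, version in sorted(pins.items())]
    digest = hashlib.sha256("\n".join(spec).encode("utf-8")).hexdigest()
    return digest[:16]


a = container_key({"samtools": "1.9", "hisat2": "2.1.0"})
b = container_key({"hisat2": "2.1.0", "samtools": "1.9"})   # same pins, other order
c = container_key({"hisat2": "2.1.0", "samtools": "1.8"})   # one version differs
print(a == b, a == c)
```

The same pins give the same key regardless of order, while changing a single version gives a different one, so a key of this kind could be used to look up or rebuild a matching container.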
  26. CircleCI pulls recipes and builds in a minimal Docker container

    Successful binary builds from the main repo are uploaded to Anaconda, to be installed anywhere

    Same binary from Bioconda installed into a minimal container for each provider (Singularity)
  27. Increasingly precise environment control

    - Tool and dependency binaries, built in minimal environment with controlled libs
    - Container defines minimum environment
    - Virtual machine controls kernel and apparent hardware environment (KVM, Xen, …)
  28. Galaxy is available as...

    - A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
    - Open source software that makes integrating your own tools and data and customizing for your own site simple
    - An open extensible platform for sharing tools, datatypes, workflows, ...
  29. A nationally distributed service: The Galaxy / XSEDE Gateway

  30. Stats for Galaxy Main (usegalaxy.org) in May 2018

    - 125,000 registered users
    - 2 PB user data
    - 19M jobs run
    - 100 training events (2017 & 2018)
  31. Dedicated resources and shared XSEDE resources (Nate Coraor)

    - PSC, Pittsburgh: Bridges
    - TACC, Austin: Stampede (462,462 cores, 205 TB memory); Galaxy Cluster (Rodeo) (256 cores, 2 TB memory); Corral/Stockyard (20 PB disk)
    - PTI, IU Bloomington
  32. usegalaxy.org Compute Architecture (June 2018) (Nate Coraor)

    Diagram. Sites and resources shown: SmartOS (PSU); bare metal cluster, VMWare, Stampede2, Jetstream, and Corral with 2.3 PB dataset storage (TACC); Jetstream (IU); Bridges (PSC). Services shown: web and job handlers (roundup64 ... roundup49), PostgreSQL, NFS, Slurm, RabbitMQ, Swarm, Pulsar (over AMQP and HTTP), and CVMFS stratum 0 and stratum 1 servers.
  33. This approach provides both scalability and flexibility

    - A set of dedicated compute resources (deployed on TACC’s internal cloud) provides basic services and first line job execution
    - The bulk of Galaxy jobs run on Jetstream, an OpenStack cloud which allows us to leverage elasticity to efficiently adjust to changing user demands
    - Unique resources like Bridges and Stampede2 allow us to serve jobs that have extremely large memory demands (e.g. genome and transcriptome assembly), or are highly parallel with long runtimes (e.g. large-scale read mapping jobs)
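
The routing described on this slide can be sketched as a simple rule. Everything specific here is an assumption for illustration: the `Job` fields, the thresholds, and the choice of Bridges for large-memory jobs and Stampede2 for highly parallel ones (the slide names both resources but does not say which serves which). Galaxy's real job configuration is not shown in the talk.

```python
from dataclasses import dataclass


@dataclass
class Job:
    tool: str
    memory_gb: int
    cores: int
    expected_hours: float


# Hypothetical cut-offs, chosen only to make the example concrete.
LARGE_MEMORY_GB = 256
MANY_CORES = 64
LONG_RUNTIME_HOURS = 24


def choose_destination(job: Job) -> str:
    """Pick a compute resource from a job's resource demands."""
    if job.memory_gb >= LARGE_MEMORY_GB:
        return "bridges"        # assumed target for extremely large memory jobs
    if job.cores >= MANY_CORES and job.expected_hours >= LONG_RUNTIME_HOURS:
        return "stampede2"      # assumed target for highly parallel, long jobs
    return "jetstream"          # the elastic cloud that takes the bulk of jobs


jobs = [
    Job("assembly", memory_gb=512, cores=16, expected_hours=10),
    Job("read-mapping", memory_gb=64, cores=128, expected_hours=48),
    Job("text-filter", memory_gb=2, cores=1, expected_hours=0.1),
]
for job in jobs:
    print(job.tool, "->", choose_destination(job))
```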
  34. Initial move to XSEDE resources (Enis Afgan)

  35. Not just more jobs, different types of jobs (Enis Afgan)

    - Can now run larger jobs, and more jobs: 325,000 jobs run on behalf of 12,000 users
    - Can run new types of jobs: Galaxy Interactive Environments (Jupyter, RStudio)
  36. An internationally distributed service: usegalaxy.✱ (usegalaxy.org, usegalaxy.org.au, usegalaxy.eu)

  38. CVMFS server distribution

    - XSEDE & CyVerse, TACC, Austin: cvmfs0-tacc0 (test.galaxyproject.org, main.galaxyproject.org), cvmfs1-tacc0
    - XSEDE, Indiana University: cvmfs1-iu0
    - Penn State: cvmfs0-psu0 (singularity.galaxyproject.org, data.galaxyproject.org), cvmfs1-psu0
    - EU JRC, Ispra: galaxy.jrc.ec.europa.eu
    - de.NBI, RZ Freiburg: cvmfs1-ufr0.usegalaxy.eu
    - Galaxy Australia, Melbourne: cvmfs1-mel0.gvl.org.au

    Legend: Stratum 0 servers, Stratum 1 servers
  39. Achieving usegalaxy.✱ coherence

    - Common reference and index data
      - These are already distributed by CVMFS, but organized in an ad hoc manner due to the history of Galaxy
      - Currently building an automated approach where metadata defining the complete set of reference and index data will live in GitHub, builds will be automated based on GitHub state, and successful builds deployed through CVMFS for replication to all sites
      - Intergalactic Data Commission: https://github.com/usegalaxy-eu/idc
    - Common tools
      - A common set of tools and a common tool menu organization is currently being defined. Tools and tool configuration will also be replicated through CVMFS
      - This will ensure both that users will have the same user experience across different usegalaxy.✱ instances, and that workflows can be moved between instances and still execute correctly and reproducibly
      - Local custom tools will still be supported but clearly identified
  40. Toward federated cloud Galaxy

  41. A long-coming convergence (Enis Afgan)

  42. A tool suite for cloud virtual environments (Enis Afgan)

  43. Orchestrating a Galaxy instance

    Since 2008 we’ve had two very different models for managing Galaxy:
    - Traditional HPC instances running on metal, e.g. Galaxy Main
    - Cloud instances managed by our cloud stack (CloudBridge, CloudLaunch, CloudMan), e.g. the Genomics Virtual Lab

    We’re actively working to unify these approaches and create a single best practice for deploying and managing Galaxy instances. By orchestrating all components of Galaxy through Kubernetes, we can deploy robust Galaxy instances on local or cloud resources (e.g. using Rancher).
  44. Bootstrap via CloudLaunch (Enis Afgan)

    Diagram. Layer labels: Multi-cloud, Infrastructure, Coordination, Applications. Components shown: CloudBridge (AWS, Azure, GCE, OpenStack), CloudLaunch and the CloudLaunch-plugin, which run a VM from galaxy/cloudman-boot (cloudman-boot → Rancher, K8S, Helm, CloudMan chart), CloudMan, HelmsMan, and the Galaxy chart. Also shown: remote object store(s), local cache, authn / authz, and containerized jobs.
  45. Future Remote Execution Data Flow (Enis Afgan)

    Diagram. Galaxy creates a new job on Kubernetes (tool: HISAT2; inputs: dataset 1, dataset 2; outputs: dataset 3). A Job Pod containing a BioContainer and an Executor Container works against a Google Bucket Volume. Labeled steps over time: create job, get datasets 1 and 2, execute job, dataset 3, job complete. Legend: compute, control message, data movement.
  46. Challenges for human genomic (+) data sharing

    - The value of data is greatly increased by integration across datasets, e.g. in human genomics, power to detect relationships between individual variants and disease depends on the number of individuals measured
    - Moving/copying data is wasteful: transfer costs, redundant storage costs
    - Human genomic data comes with privacy concerns; need to ensure security and detect threats
  47. AnVIL: The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-Space
  48. AnVIL: Inverting the model of genomic data sharing

    Traditional: Bring data to the researcher
    - Copying/moving data is costly
    - Harder to enforce security
    - Redundant infrastructure
    - Siloed compute

    Goal: Bring researcher to the data
    - Reduced redundancy and costs
    - Active threat detection and auditing
    - Greater accessibility
    - Elastic, shared compute
  49. What is the AnVIL?

    - Scalable and interoperable resource for the genomic scientific community
    - Cloud-based infrastructure
    - Shared analysis and computing environment
    - Support genomic data access, sharing and computing across large genomic, and genomic related, data sets
    - Genomic datasets, phenotypes and metadata
    - Large datasets generated by NHGRI programs, as well as other initiatives / agencies
    - Data access controls and data security
    - Collaborative environment for datasets and analysis workflows
    - ...for both users with limited computational expertise and sophisticated data scientist users
  50. The AnVIL Team

    Hopkins Team
    - James Taylor, Jeff Leek, Mike Schatz, Kasper Hansen (Johns Hopkins)
    - Anton Nekrutenko (Penn State University)
    - Jeremy Goecks, Kyle Ellrott (Oregon Health & Science University)
    - Martin Morgan (Roswell Park Cancer Institute)
    - Vincent Carey (Harvard)
    - Levi Waldron (City University of New York)

    Broad Team
    - Anthony Philippakis, Daniel MacArthur (Broad Institute)
    - Robert Grossman (University of Chicago)
    - Benedict Paten (University of California Santa Cruz)
    - Josh Denny (Vanderbilt)
    - Ira Hall (Washington University)
    - Jennifer Hall (American Heart Association)
  51. Cloud infrastructure and services (Broad)

    Principles:
    - Modular
    - Open
    - Community-driven
    - Standards-based

    Services:
    - A modular suite of cloud services to support sharing and analyzing genomic and clinical data at scale
    - Deployed in production as part of several flagship scientific activities, including
      - All of Us
      - NIH Data Commons and NHLBI STAGE
      - NCI Cloud Resources
    - We will now leverage these services to support AnVIL
  52. Analysis tools, environments, training (Hopkins)

    - Bring together groups that have built open-source platforms, tools, and workflows that are widely used in the genomics community
    - Delivered a cloud-based analysis platform to hundreds of thousands of users for over ten years
    - Developers of 3 of the world’s most popular MOOC sequences and have trained thousands of genomic researchers

    Principles:
    - Focus on enabling users
    - Meet the needs of multiple research communities
    - Leverage existing investment in tools to be useful quickly
  53. Goals of the AnVIL

    1. Create open source software: storage, scalable analytics, data visualization
    2. Organize and host key NHGRI datasets: CCDG, CMG, eMERGE, and more
    3. Operate services for the world: security, training & outreach, new models of data access
  54. Components: Data commons / ecosystem (Gen3)

    - Data commons framework services (authn, authz, data management, …)
    - Applications for importing, exploring, and exporting data
    - Interoperable based on GA4GH and Gen3 standards
  55. Components: Analysis Platform (Firecloud/Terra)

    - Collaborative cloud-based analysis platform built on top of Google Cloud Platform
    - Free to access / compute & storage charged by Google
    - All software components are fully open source
    - Access published data and methods or add your own
    - Execute analyses in an auditable manner
    - Securely share data, methods and results

    Diagram labels: AUTH, API, Workspaces, Data Library, Tool Content Repository, Analysis Tools, FireCloud Portal (www.firecloud.org), Workbench
  56. Firecloud/Terra: Security

    Development and Deployment
    - Authenticate, Authorize, Encrypt, Audit
    - All activity audited, retained for 5 years

    Verification
    - Internal AppSec team (red team)
    - Quarterly 3rd party pen tests

    Compliance Certification
    - 2 FISMA ATOs (FireCloud/NCI, AoU/NIH)
    - Pursuing FedRAMP
  57. Components: Portals and Applications

  58. Hosting tools and analysis environments

    - Data access / authorization constraints are pushed down into and enforced by the underlying cloud platform
    - Virtual Machines are provisioned by the platform on behalf of users; all workflows, tools, analysis environments are run within a user’s security context
    - Tools can be as simple as single container images, or multiple orchestrated containers, e.g. in the case of Galaxy, the analysis environment will run in one or more containers provisioned for the user, with additional containers provisioned on demand to handle job execution elastically
  59. Different analysis environments, common view of data

  60. Combine multiple tools and environments

    AnVIL APIs
  61. Goals of the AnVIL

    1. Create open source software: storage, scalable analytics, data visualization
    2. Organize and host key NHGRI datasets: CCDG, CMG, eMERGE, and more
    3. Operate services for the world: security, training & outreach, new models of data access
  62. Organize and host key NHGRI datasets

    Data curation is a key unmet need across NIH
    - Processing with consistent pipelines to facilitate sharing
    - Common metadata model to support indexing and search
    - Rigorous quality control and white/black-listing of data
    - Structured data use restrictions to expedite DAC review

    AnVIL will leverage experiences from the following efforts
    - Phenotypic data models (Vanderbilt in All of Us/eMERGE)
    - Read reprocessing and QC (Broad/WashU from CCDG; U. Chicago from GDC effort)
    - Metadata models (UCSC from genome browser)
  63. Goals of the AnVIL

    1. Create open source software: storage, scalable analytics, data visualization
    2. Organize and host key NHGRI datasets: CCDG, CMG, eMERGE, and more
    3. Operate services for the world: security, training & outreach, new models of data access
  64. Components: Training and Outreach

    Diagram labels:
    - Content: training materials (Jupyter/Markdown), videos (mp4), projects/questions (Jupyter/Markdown)
    - Hosting and delivery: GitHub, YouTube, MOOCs (Leanpub, Coursera, edX)
    - Non-AnVIL training: Data Carpentry, university courses
    - AnVIL Training Network: Galaxy Training Network, Bioconductor courses, Data Carpentry
  72. DUOS – Broad Data Use Oversight system

    - The model for requesting and reviewing data access scales poorly
    - Each Data Access request needs to be manually reviewed against each data use agreement: O(N²)

    dbGaP figures as of July 1, 2017:
    - 826 studies in dbGaP
    - 5,344 PIs requesting data
    - 46 PI countries
    - 1500+ publications resulting from secondary use of dbGaP data
    - 13 days average Data Access Request time
    - 50,167 submitted, 34,162 approved
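
A rough calculation shows why this scales poorly. The sketch below uses the slide's figures and reads "submitted" and "approved" as counts of data access requests, which is an assumption. The PI-by-study product is only an illustrative upper bound on pairwise reviews, not a figure reported in the talk.

```python
# Figures from the slide (dbGaP, as of July 1, 2017).
STUDIES = 826
REQUESTING_PIS = 5_344
SUBMITTED = 50_167
APPROVED = 34_162


def worst_case_reviews(n_requesters, n_agreements):
    """Every requester checked against every data use agreement."""
    return n_requesters * n_agreements


pairs = worst_case_reviews(REQUESTING_PIS, STUDIES)
approval_rate = APPROVED / SUBMITTED

print(f"{pairs:,} PI/study pairs in the worst case")
print(f"{approval_rate:.1%} of submitted requests approved")

# Doubling both sides quadruples the work, which is the O(N²) concern.
print(worst_case_reviews(2 * REQUESTING_PIS, 2 * STUDIES) // pairs)
```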
  73. DUOS – Broad Data Use Oversight system

    1. Interfaces to transform data use restrictions and data access requests to machine-readable code
    2. Interfaces for the Data Access Committee to adjudicate whether structuring and matching has been done appropriately
    3. A matching algorithm that checks if data access requests are compatible with data use restrictions
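
Once restrictions and requests are machine-readable, the compatibility check becomes a small function. The sketch below is a toy model: the two fields (permitted research areas and a commercial-use flag) and all names are hypothetical and do not reflect the data model or matching algorithm DUOS actually uses.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class DataUseRestriction:
    # None means the dataset may be used for any research area.
    allowed_diseases: Optional[FrozenSet[str]] = None
    commercial_use_allowed: bool = True


@dataclass(frozen=True)
class DataAccessRequest:
    disease: str
    commercial: bool


def is_compatible(request: DataAccessRequest, restriction: DataUseRestriction) -> bool:
    """Return True when the request violates none of the restriction's terms."""
    if request.commercial and not restriction.commercial_use_allowed:
        return False
    if restriction.allowed_diseases is None:
        return True
    return request.disease in restriction.allowed_diseases


diabetes_only = DataUseRestriction(frozenset({"diabetes"}), commercial_use_allowed=False)
print(is_compatible(DataAccessRequest("diabetes", commercial=False), diabetes_only))
print(is_compatible(DataAccessRequest("diabetes", commercial=True), diabetes_only))
```

In the workflow on the slide, a result like this would still go to the Data Access Committee, which adjudicates whether the structuring and matching were done appropriately.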
  74. What is the AnVIL?

    - Scalable and interoperable resource for the genomic scientific community
    - Cloud-based infrastructure
    - Shared analysis and computing environment
    - Support genomic data access, sharing and computing across large genomic, and genomic related, data sets
    - Genomic datasets, phenotypes and metadata
    - Large datasets generated by NHGRI programs, as well as other initiatives / agencies
    - Data access controls and data security
    - Collaborative environment for datasets and analysis workflows
    - ...for both users with limited computational expertise and sophisticated data scientist users
  75. Acknowledgements

    Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy Goecks, Sergey Golitsynskiy, Qiang Gu, Björn Grüning, Sam Guerler, Mo Heydarian, Jennifer Hillman-Jackson, Vahid Jalili, Delphine Lariviere, Alexandru Mahmoud, Anton Nekrutenko, Helena Rasche, Luke Sargent, Nicola Soranzo, Marius van den Beek

    Taylor Lab at JHU: Boris Brenerman, Min Hyung Cho, Peter DeFord, Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy

    AnVIL: Anthony Philippakis, Vincent Carey, Josh Denny, Kyle Ellrott, Jeremy Goecks, Robert Grossman, Ira Hall, Jennifer Hall, Kasper Hansen, Jeff Leek, Daniel MacArthur, Martin Morgan, Anton Nekrutenko, Benedict Paten, Mike Schatz, Levi Waldron, and many others!

    Funding: NHGRI U41 HG006620 (Galaxy), NHGRI U24 HG010263 (AnVIL), NCI U24 CA231877 (Galaxy Federation), NSF DBI 0543285, DBI 0850103 (Galaxy on US cyberinfrastructure)

    Collaborators: Dave Hancock and the Jetstream group, Ross Hardison and the VISION group, Victor Corces, Karen Reddy, Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology), Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)