Making large-scale genomic analysis accessible, transparent, and reproducible

James Taylor
February 08, 2019

CMU CompBio Seminar Series, some newish Galaxy stuff (galaxyproject.org) and some even newer stuff on the AnVIL (anvilproject.org)

http://www.cbd.cmu.edu/event/james-taylor-johns-hopkins-university-cmu/

Transcript

  1. Making large-scale genomic analysis accessible, transparent, and reproducible
    James Taylor (@jxtx), Johns Hopkins, http://speakerdeck.com/jxtx
    http://galaxyproject.org http://anvilproject.org

  2. SEQUENCING

  3. It’s widely available… (http://omicsmaps.com)

  4. ...practically free...
    (https://www.genome.gov/27541954/dna-sequencing-costs-data/)
    [Chart: cost per human genome ($)]

  5. ...and applicable across (nearly) all of Biology!
    - How is the production of the right protein at the right time controlled?
    - How are cells organized in 3D?
    - How are cell types decided in development?
    - How are different species related?
    - What genome variants lead to different phenotypes or disease risk?

  6. However, it produces massive amounts of data
    Illumina NovaSeq 6000: 20 billion 300bp DNA fragments per run, ~6 terabytes, every 2 days…

  7. And sequencing is only the beginning
    Lattice light-sheet microscope: 3D live cell imaging, terabytes per experiment (image from Karen Reddy; labels: LMNB1, m6a-tracer)

  8. Modern biology has rapidly transformed into a data-intensive discipline
    - Large-scale data acquisition has become easy, e.g. high-throughput sequencing and imaging
    - Experiments are increasingly complex
    - Making sense of results often requires mining and making connections across multiple databases
    - Nearly all high-profile research involves some quantitative methods
    How does this affect traditional research practices and outputs?

  9. Data pipeline, inspired by Leek and Peng, Nature 2015:
    Idea → Experiment → Raw Data → Tidy Data → Summarized Data → Results
    (via experimental design, data collection, data cleaning, data analysis, and inference)
    The part considered here is the analysis portion of the pipeline; typically only the final results end up in the publication.

  10. Three major concerns
    Accessibility: Making use of large-scale data requires complex computational resources and methods. Can all researchers access these approaches? How can we make these methods available to everyone?
    Transparency: Is it possible to communicate analyses and results in ways that are easy to understand and yet provide all of the essential details?
    Reproducibility: Can analyses be precisely reproduced, to facilitate rigorous validation and peer review, and to ease reuse?

  12. Galaxy: accessible analysis system

  13. Describe analysis tool behavior abstractly

  14. Describe analysis tool behavior abstractly
    Analysis environment automatically and transparently tracks details

  15. Describe analysis tool behavior abstractly
    Analysis environment automatically and transparently tracks details
    Workflow system for complex analysis, constructed explicitly or automatically

  16. Describe analysis tool behavior abstractly
    Pervasive sharing, and publication of documents with integrated analysis
    Analysis environment automatically and transparently tracks details
    Workflow system for complex analysis, constructed explicitly or automatically
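
    To make this concrete, here is a hedged sketch of driving such an analysis programmatically with BioBlend, the Python client library for the Galaxy API. The server URL, API key, input file, and workflow ID are placeholders, not a specific published workflow.

```python
# Hedged sketch using BioBlend (the Python client for the Galaxy API):
# upload a dataset, invoke a workflow, and let Galaxy track tools, versions,
# and parameters for every step. URL, API key, and IDs are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

history = gi.histories.create_history(name="rna-seq-demo")
upload = gi.tools.upload_file("reads.fastqsanger.gz", history["id"])
dataset_id = upload["outputs"][0]["id"]

# Invoke a previously constructed workflow; the invocation record captures
# the provenance (tools, versions, parameters) of the whole analysis.
invocation = gi.workflows.invoke_workflow(
    workflow_id="YOUR_WORKFLOW_ID",
    inputs={"0": {"src": "hda", "id": dataset_id}},
    history_id=history["id"],
)
print(invocation["id"], invocation["state"])
```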

  17. Visualization and visual analytics

  18. Galaxy IEs: containerized apps, rapidly move between analysis modes

  21. Practical computational reproducibility

  22. Persistent challenge: managing underlying software
    Bioinformatics workflows use a lot of different tools, each of which uses different software packages, which in turn depend on other software packages…
    Running a workflow requires making it possible, and hopefully easy, to install all of the underlying dependencies.
    Reproducing a workflow requires assembling all of the right dependencies, with all of the right versions, ideally in a controlled environment.
    - Sometimes different steps require different, and incompatible, versions of dependencies…
    The Galaxy project has wasted a lot of time trying to solve this problem.

  23. Bioconda builds on the Conda packaging system, designed “for installing multiple versions of software packages and their dependencies and switching easily between them”
    More than 4000 recipes for software packages
    All packages are automatically built in a minimal environment to ensure isolation and portability
    https://bioconda.github.io
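
    As a small illustration of what “multiple versions, easy switching” looks like in practice, here is a hedged sketch that drives the conda command line from Python to create isolated, pinned environments for individual analysis steps; the environment names, channels, and version pins are illustrative.

```python
# Hedged sketch: create isolated conda environments with pinned Bioconda
# packages for individual analysis steps (names and versions are illustrative).
import subprocess

def create_pinned_env(name, packages):
    """Create a conda environment containing exactly the pinned packages."""
    subprocess.run(
        ["conda", "create", "--yes", "--name", name,
         "--channel", "conda-forge", "--channel", "bioconda",
         *packages],
        check=True,
    )

# Two steps of the same workflow can use different, even incompatible, versions.
create_pinned_env("mapping-step", ["bwa=0.7.17", "samtools=1.9"])
create_pinned_env("legacy-step", ["samtools=0.1.19"])
```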

  24. Submit a recipe to GitHub → CircleCI pulls recipes and builds them in a minimal Docker container → successful binary builds from the main repo are uploaded to Anaconda, to be installed anywhere

  25. Biocontainers
    Given a set of packages and versions in Conda/Bioconda, we can build a container with just that software on a minimal base image
    If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions)
    With automation, these containers can be built automatically for every package with no manual modification or intervention (e.g. mulled)
    https://biocontainers.pro/
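
    For example, here is a hedged sketch of running a tool from a BioContainers image via Docker, driven from Python; the image follows the quay.io/biocontainers naming convention, but the tag is left as a placeholder because each package/version has its own build-specific tag.

```python
# Hedged sketch: run a BioContainers image with Docker from Python.
# The tag is a placeholder; pick the exact build tag for the version you need.
import subprocess

IMAGE = "quay.io/biocontainers/samtools:<version--build>"  # placeholder tag

subprocess.run(
    ["docker", "run", "--rm",
     "-v", "/data:/data",                       # mount inputs/outputs
     IMAGE,
     "samtools", "view", "-b",                  # convert SAM to BAM
     "-o", "/data/example.bam", "/data/example.sam"],
    check=True,
)
```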

  26. CircleCI pulls recipes and builds them in a minimal Docker container → successful binary builds from the main repo are uploaded to Anaconda, to be installed anywhere → the same binary from Bioconda is installed into a minimal container for each provider (e.g. Singularity)

  27. Increasingly precise environment control
    - Tool and dependency binaries, built in a minimal environment with controlled libs
    - Container defines the minimum environment
    - Virtual machine controls the kernel and apparent hardware environment (KVM, Xen, …)

  28. Galaxy is available as...
    - A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data, and permanent storage
    - Open source software that makes it simple to integrate your own tools and data and to customize for your own site
    - An open, extensible platform for sharing tools, datatypes, workflows, ...

  29. A nationally distributed service:
    The Galaxy / XSEDE Gateway

  30. Stats for Galaxy Main (usegalaxy.org) in May 2018:
    - 125,000 registered users
    - 2 PB user data
    - 19M jobs run
    - 100 training events (2017 & 2018)

  31. [Diagram: dedicated and shared XSEDE resources (Nate Coraor). Dedicated, at TACC, Austin: Galaxy Cluster “Rodeo” (256 cores, 2 TB memory), Corral/Stockyard (20 PB disk). Shared XSEDE resources: Stampede (462,462 cores, 205 TB memory), Bridges (PSC, Pittsburgh), PTI, IU Bloomington]

  32. [Diagram: usegalaxy.org compute architecture, June 2018 (Nate Coraor). Web and job-handler processes run in Docker Swarm instances on Jetstream (TACC and IU), backed by PostgreSQL, NFS, and CVMFS (stratum 0 and stratum 1 servers). Jobs are dispatched via Slurm and via Pulsar over AMQP/HTTP to Stampede2 (TACC), Bridges (PSC), a bare-metal cluster and VMWare at TACC, and SmartOS at PSU. Datasets (2.3 PB) are stored on Corral at TACC.]

  33. This approach provides both scalability and flexibility
    - A set of dedicated compute resources (deployed on TACC’s internal cloud) provides basic services and first-line job execution
    - The bulk of Galaxy jobs run on Jetstream, an OpenStack cloud, which allows us to leverage elasticity to efficiently adjust to changing user demands
    - Unique resources like Bridges and Stampede2 allow us to serve jobs that have extremely large memory demands (e.g. genome and transcriptome assembly), or that are highly parallel with long runtimes (e.g. large-scale read mapping jobs)
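
    To make the routing policy concrete, here is a deliberately simplified, hypothetical sketch of a dynamic job-routing rule in the spirit of what Galaxy supports; the destination names, tool lists, and thresholds are illustrative, not the production usegalaxy.org configuration.

```python
# Hypothetical job-routing rule (destinations, tool lists, and thresholds are
# illustrative, not the real usegalaxy.org configuration).

LARGE_MEMORY_TOOLS = {"trinity", "spades"}        # assemblers with huge memory needs
LONG_PARALLEL_TOOLS = {"bwa_mem", "hisat2"}       # large-scale read mappers

def route_job(tool_id, input_size_gb):
    """Pick a compute destination based on the tool and its input size."""
    if tool_id in LARGE_MEMORY_TOOLS or input_size_gb > 500:
        return "bridges_large_memory"             # PSC Bridges: extreme-memory nodes
    if tool_id in LONG_PARALLEL_TOOLS and input_size_gb > 50:
        return "stampede2"                        # TACC Stampede2: highly parallel, long runtimes
    return "jetstream_slurm"                      # default: elastic Jetstream cloud nodes

print(route_job("hisat2", input_size_gb=120))     # -> "stampede2"
```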

  34. Initial move to XSEDE resources
    (Enis Afgan)

  35. Not just more jobs, different types of jobs (Enis Afgan)
    Can now run larger jobs, and more jobs: 325,000 jobs run on behalf of 12,000 users
    Can run new types of jobs: Galaxy Interactive Environments (Jupyter, RStudio)

  36. An internationally distributed service:
    usegalaxy.✱
    usegalaxy.org usegalaxy.org.au usegalaxy.eu

  38. [Diagram: CVMFS server distribution. Stratum 0 servers: cvmfs0-tacc0 (test.galaxyproject.org, main.galaxyproject.org) at XSEDE & CyVerse, TACC, Austin; cvmfs0-psu0 (singularity.galaxyproject.org, data.galaxyproject.org) at Penn State. Stratum 1 servers: cvmfs1-tacc0; cvmfs1-iu0 at XSEDE, Indiana University; cvmfs1-psu0; cvmfs1-ufr0.usegalaxy.eu at de.NBI, RZ Freiburg; cvmfs1-mel0.gvl.org.au at Galaxy Australia, Melbourne; galaxy.jrc.ec.europa.eu at EU JRC, Ispra.]

  39. Achieving usegalaxy.✱ coherence
    - Common reference and index data
      - These are already distributed by CVMFS, but organized in an ad hoc manner due to the history of Galaxy
      - We are currently building an automated approach in which the metadata defining the complete set of reference and index data lives in GitHub, builds are automated based on the GitHub state, and successful builds are deployed through CVMFS for replication to all sites
      - Intergalactic Data Commission: https://github.com/usegalaxy-eu/idc
    - Common tools
      - A common set of tools and a common tool menu organization are currently being defined. Tools and tool configuration will also be replicated through CVMFS
      - This ensures both that users have the same user experience across different usegalaxy.✱ instances, and that workflows can be moved between instances and still execute correctly and reproducibly (see the sketch after this list)
      - Local custom tools will still be supported but clearly identified
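
    To illustrate why a common, CVMFS-replicated tool set makes workflows portable, here is a simplified, illustrative fragment of a workflow step in the spirit of Galaxy's workflow format; the tool ID, version, and parameter state shown are placeholders, not a real published workflow.

```python
# Simplified, illustrative fragment of one workflow step (not a complete
# Galaxy workflow file). Because the tool is referenced by a fully qualified
# Tool Shed ID plus a pinned version, any usegalaxy.* instance carrying the
# common tool set resolves and runs exactly the same tool.
workflow_step = {
    "tool_id": "toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.1",  # placeholder
    "tool_version": "0.7.17.1",                                              # pinned version
    "tool_state": {"reference_genome": "hg38"},                              # illustrative parameters
}
```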

  40. Toward federated cloud Galaxy

  41. A long-coming convergence
    (Enis Afgan)

  42. A tool suite for cloud virtual environments
    (Enis Afgan)

  43. Orchestrating a Galaxy instance
    Since 2008 we’ve had two very different models for managing Galaxy:
    - Traditional HPC instances running on metal, e.g. Galaxy Main
    - Cloud instances managed by our cloud stack (CloudBridge, CloudLaunch, CloudMan), e.g. the Genomics Virtual Lab
    We’re actively working to unify these approaches and create a single best practice for deploying and managing Galaxy instances.
    By orchestrating all components of Galaxy through Kubernetes, we can deploy robust Galaxy instances on local or cloud resources (e.g. using Rancher).
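
    As a rough illustration of what Kubernetes-based orchestration looks like from the operator's side, the sketch below drives Helm from Python to install a Galaxy chart into a cluster. The chart repository URL, chart name, and values are assumptions for illustration, not the project's published deployment recipe.

```python
# Hypothetical sketch: install a Galaxy Helm chart into an existing Kubernetes
# cluster. Repository URL, chart name, and values are illustrative assumptions.
import subprocess

def helm(*args):
    subprocess.run(["helm", *args], check=True)

helm("repo", "add", "galaxyproject", "https://example.org/galaxy-helm-charts")  # assumed repo URL
helm("repo", "update")
helm("install", "galaxy", "galaxyproject/galaxy",      # assumed chart name
     "--namespace", "galaxy", "--create-namespace",
     "--set", "persistence.size=100Gi")                # illustrative value override
```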

  44. [Diagram: multi-cloud bootstrap flow (Enis Afgan). CloudLaunch, with a CloudLaunch plugin, uses CloudBridge to launch a VM on AWS, Azure, GCE, or OpenStack; galaxy/cloudman-boot then brings up Rancher, Kubernetes, and Helm on the VM(s); CloudMan and HelmsMan install the Galaxy chart. Containerized jobs use remote object store(s) with a local cache, with shared authn/authz throughout.]

  45. [Diagram: future remote execution data flow (Enis Afgan). Galaxy asks Kubernetes to create a job (tool: HISAT2; inputs: datasets 1 and 2; output: dataset 3); Kubernetes schedules a job pod containing an executor container and a BioContainer; the pod fetches the input datasets from a Google bucket onto a volume, executes the job on the compute node, writes dataset 3 back, and reports job completion. Control messages and data movement follow separate paths.]
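
    A minimal sketch of the kind of job submission this flow implies, using the official Kubernetes Python client; the image, command, names, and data handling are illustrative assumptions, not Galaxy's actual job execution code.

```python
# Minimal sketch: submit a containerized job to Kubernetes with the official
# Python client. Image, command, and names are illustrative; real execution
# would also mount or stage the input datasets.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

HISAT2_IMAGE = "quay.io/biocontainers/hisat2"  # pin an exact build tag in practice

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="hisat2-job-42"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="hisat2",
                        image=HISAT2_IMAGE,
                        command=["hisat2", "--version"],
                    )
                ],
            )
        )
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="galaxy-jobs", body=job)
```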

  46. Challenges for human genomic (+) data sharing
    The value of data is greatly increased by integration across datasets
    - e.g. in human genomics, the power to detect relationships between individual variants and disease depends on the number of individuals measured
    Moving/copying data is wasteful: transfer costs, redundant storage costs
    Human genomic data comes with privacy concerns; we need to ensure security and detect threats

  47. AnVIL: The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-Space

  48. AnVIL: Inverting the model of genomic data sharing
    Traditional: Bring data to the researcher
    - Copying/moving data is costly
    - Harder to enforce security
    - Redundant infrastructure
    - Siloed compute
    Goal: Bring researcher to the data
    - Reduced redundancy and costs
    - Active threat detection and auditing
    - Greater accessibility
    - Elastic, shared, compute

  49. What is the AnVIL?
    - Scalable and interoperable resource for the genomic scientific community
    - Cloud-based infrastructure
    - Shared analysis and computing environment
    - Support for genomic data access, sharing, and computing across large genomic, and genomic-related, data sets
      - Genomic datasets, phenotypes, and metadata
      - Large datasets generated by NHGRI programs, as well as other initiatives / agencies
    - Data access controls and data security
    - Collaborative environment for datasets and analysis workflows
      - ...for both users with limited computational expertise and sophisticated data scientist users

  50. The AnVIL Team
    Hopkins Team: James Taylor, Jeff Leek, Mike Schatz, Kasper Hansen (Johns Hopkins); Anton Nekrutenko (Penn State University); Jeremy Goecks, Kyle Ellrott (Oregon Health & Science University); Martin Morgan (Roswell Park Cancer Institute); Vincent Carey (Harvard); Levi Waldron (City University of New York)
    Broad Team: Anthony Philippakis, Daniel MacArthur (Broad Institute); Robert Grossman (University of Chicago); Benedict Paten (University of California Santa Cruz); Josh Denny (Vanderbilt); Ira Hall (Washington University); Jennifer Hall (American Heart Association)

  51. Cloud infrastructure and services (Broad)
    Principles: modular, open, community-driven, standards-based
    - A modular suite of cloud services to support sharing and analyzing genomic and clinical data at scale
    - Deployed in production as part of several flagship scientific activities, including:
      - All of Us
      - NIH Data Commons and NHLBI STAGE
      - NCI Cloud Resources
    - We will now leverage these services to support AnVIL

  52. Analysis tools, environments, training (Hopkins)
    - Brings together groups that have built open-source platforms, tools, and workflows that are widely used in the genomics community
    - Delivered a cloud-based analysis platform to hundreds of thousands of users for over ten years
    - Developers of 3 of the world’s most popular MOOC sequences, who have trained thousands of genomic researchers
    - Principles:
      - Focus on enabling users
      - Meet the needs of multiple research communities
      - Leverage existing investment in tools to be useful quickly

  53. Goals of the AnVIL
    1. Create open source software
    Storage, scalable analytics, data visualization
    2. Organize and host key NHGRI datasets
    CCDG, CMG, eMERGE, and more
    3. Operate services for the world
    Security, training & outreach, new models of data access

  54. Components: Data commons / ecosystem (Gen3)
    - Data commons framework services (authn, authz, data management, …)
    - Applications for importing, exploring, and exporting data
    - Interoperable based on GA4GH and Gen3 standards
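
    As a taste of what GA4GH-based interoperability looks like in practice, here is a hedged sketch of fetching a file's metadata from a GA4GH Data Repository Service (DRS) endpoint; the base URL and object ID are placeholders, and controlled-access data would additionally require an authorization token.

```python
# Hedged sketch: look up a data object through the GA4GH Data Repository
# Service (DRS) API. Base URL and object ID are placeholders; controlled-access
# data would also require an Authorization header with a valid token.
import requests

DRS_BASE = "https://drs.example.org"   # placeholder DRS server
OBJECT_ID = "example-object-id"        # placeholder DRS object ID

resp = requests.get(f"{DRS_BASE}/ga4gh/drs/v1/objects/{OBJECT_ID}", timeout=30)
resp.raise_for_status()
obj = resp.json()

# A DRS object describes the file (size, checksums) and how to access it.
print(obj.get("size"), obj.get("checksums"))
for method in obj.get("access_methods", []):
    print(method.get("type"), method.get("access_url", {}).get("url"))
```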

  55. Components: Analysis Platform (FireCloud/Terra)
    - Collaborative cloud-based analysis platform built on top of Google Cloud Platform
    - Free to access; compute & storage charged by Google
    - All software components are fully open source
    - Access published data and methods or add your own
    - Execute analyses in an auditable manner
    - Securely share data, methods, and results
    [Diagram: FireCloud portal (www.firecloud.org) and workbench: auth, API, workspaces, data library, tool/content repository, analysis tools]
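
    For a sense of the platform's programmatic side, here is a hedged sketch of listing a user's workspaces through the FireCloud REST API; the bearer token is a placeholder for a Google OAuth access token, and the response handling assumes the documented workspace-list shape.

```python
# Hedged sketch: list workspaces via the FireCloud REST API. The token is a
# placeholder (in practice a Google OAuth access token for an account
# registered with FireCloud/Terra).
import requests

TOKEN = "YOUR_GOOGLE_OAUTH_ACCESS_TOKEN"  # placeholder

resp = requests.get(
    "https://api.firecloud.org/api/workspaces",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for entry in resp.json():
    ws = entry.get("workspace", {})
    print(ws.get("namespace"), ws.get("name"))
```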

  56. Firecloud/Terra: Security
    Development and Deployment
    - Authenticate, Authorize, Encrypt, Audit
    - All activity audited, retained for 5 years
    Verification
    - Internal AppSec team (red team)
    - Quarterly 3rd party pen tests
    Compliance Certification
    - 2 FISMA ATOs (FireCloud/NCI, AoU/NIH)
    - Pursuing FedRAMP

  57. Components: Portals and Applications

  58. Hosting tools and analysis environments
    Data access / authorization constraints are pushed down into, and enforced by, the underlying cloud platform
    Virtual machines are provisioned by the platform on behalf of users – all workflows, tools, and analysis environments run within a user’s security context
    Tools can be as simple as single container images, or multiple orchestrated containers
    e.g. in the case of Galaxy, the analysis environment will run in one or more containers provisioned for the user, with additional containers provisioned on demand to handle job execution elastically

  59. Different analysis environments, common view of data

  60. Combine multiple tools and environments
    [Diagram: multiple tools and analysis environments connected through AnVIL APIs]

  61. Goals of the AnVIL
    1. Create open source software
    Storage, scalable analytics, data visualization
    2. Organize and host key NHGRI datasets
    CCDG, CMG, eMERGE, and more
    3. Operate services for the world
    Security, training & outreach, new models of data access

  62. Organize and host key NHGRI datasets
    Data curation is a key unmet need across NIH:
    - Processing with consistent pipelines to facilitate sharing
    - Common metadata model to support indexing and search
    - Rigorous quality control and white/black-listing of data
    - Structured data use restrictions to expedite DAC review
    AnVIL will leverage experiences from the following efforts:
    - Phenotypic data models (Vanderbilt in All of Us/eMERGE)
    - Read reprocessing and QC (Broad/WashU from CCDG; U. Chicago from the GDC effort)
    - Metadata models (UCSC from the genome browser)

  63. Goals of the AnVIL
    1. Create open source software
    Storage, scalable analytics, data visualization
    2. Organize and host key NHGRI datasets
    CCDG, CMG, eMERGE, and more
    3. Operate services for the world
    Security, training & outreach, new models of data access

  64. Components: Training and Outreach
    [Diagram: training materials (Jupyter/Markdown), videos (mp4), and projects/questions (Jupyter/Markdown) are published through GitHub and YouTube, feeding MOOCs (Leanpub, Coursera, EdX), the AnVIL Training Network (Galaxy Training Network, Bioconductor courses, Data Carpentry), and non-AnVIL training (Data Carpentry, university courses)]

  72. DUOS – Broad Data Use Oversight System
    - The model for requesting and reviewing data access scales poorly
    - Each data access request needs to be manually reviewed against each data use agreement: O(N²)
    dbGaP, as of July 1, 2017:
    - 826 studies in dbGaP
    - 5,344 PIs requesting data
    - 46 PI countries
    - 1,500+ publications resulting from secondary use of dbGaP data
    - 50,167 data access requests submitted; 34,162 approved
    - 13 days average data access request time

  73. DUOS – Broad Data Use Oversight System
    1. Interfaces to transform data use restrictions and data access requests into machine-readable code
    2. Interfaces for the Data Access Committee to adjudicate whether structuring and matching have been done appropriately
    3. A matching algorithm that checks whether data access requests are compatible with data use restrictions
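
    To make the idea of machine-readable matching concrete, here is a deliberately simplified sketch using a few consent-code / Data Use Ontology (DUO)-style terms; real DUOS matching is ontology-driven and handles far more conditions than this illustration.

```python
# Deliberately simplified sketch of matching a data access request against a
# dataset's machine-readable use restrictions, using consent-code/DUO-style
# terms (GRU = general research use, HMB = health/medical/biomedical research,
# DS = disease-specific, NPU = not-for-profit use only). Real DUOS matching is
# ontology-driven and far more nuanced.

def request_allowed(restrictions, request):
    """Return True if the proposed use is compatible with the restrictions."""
    primary = restrictions["primary"]            # "GRU", "HMB", or ("DS", "<disease>")
    if primary == "GRU":
        purpose_ok = True
    elif primary == "HMB":
        purpose_ok = request["purpose"] in {"HMB", "DS"}
    else:                                        # disease-specific consent
        purpose_ok = request["purpose"] == "DS" and request.get("disease") == primary[1]

    profit_ok = not (restrictions.get("NPU") and request.get("for_profit"))
    return purpose_ok and profit_ok

dataset = {"primary": ("DS", "cancer"), "NPU": True}
print(request_allowed(dataset, {"purpose": "DS", "disease": "cancer"}))  # True
print(request_allowed(dataset, {"purpose": "GRU"}))                      # False
```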

  74. What is the AnVIL?
    - Scalable and interoperable resource for the genomic scientific community
    - Cloud-based infrastructure
    - Shared analysis and computing environment
    - Support for genomic data access, sharing, and computing across large genomic, and genomic-related, data sets
      - Genomic datasets, phenotypes, and metadata
      - Large datasets generated by NHGRI programs, as well as other initiatives / agencies
    - Data access controls and data security
    - Collaborative environment for datasets and analysis workflows
      - ...for both users with limited computational expertise and sophisticated data scientist users

  75. Acknowledgements
    Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy Goecks, Sergey Golitsynskiy, Qiang Gu, Björn Grüning, Sam Guerler, Mo Heydarian, Jennifer Hillman-Jackson, Vahid Jalili, Delphine Lariviere, Alexandru Mahmoud, Anton Nekrutenko, Helena Rasche, Luke Sargent, Nicola Soranzo, Marius van den Beek
    Taylor Lab at JHU: Boris Brenerman, Min Hyung Cho, Peter DeFord, Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy
    AnVIL: Anthony Philippakis, Vincent Carey, Josh Denny, Kyle Ellrott, Jeremy Goecks, Robert Grossman, Ira Hall, Jennifer Hall, Kasper Hansen, Jeff Leek, Daniel MacArthur, Martin Morgan, Anton Nekrutenko, Benedict Paten, Mike Schatz, Levi Waldron, and many others!
    Funding: NHGRI U41 HG006620 (Galaxy), NHGRI U24 HG010263 (AnVIL), NCI U24 CA231877 (Galaxy Federation), NSF DBI 0543285, DBI 0850103 (Galaxy on US cyberinfrastructure)
    +Collaborators: Dave Hancock and the Jetstream group, Ross Hardison and the VISION group, Victor Corces, Karen Reddy, Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology), Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
