Introduction to the AnVIL

Slide 1

Slide 1 text

Introduction to The AnVIL

Slide 2

Slide 2 text

Goals of The AnVIL

Slide 3

Slide 3 text

AnVIL: Inverting the model of genomic data sharing

Traditional: Bring data to the researcher
- Copying/moving data is costly
- Harder to enforce security
- Redundant infrastructure
- Siloed compute

Goal: Bring researcher to the data
- Reduced redundancy and costs
- Active threat detection and auditing
- Greater accessibility
- Elastic, shared compute

Slide 4

Slide 4 text

What is the AnVIL?
- Scalable and interoperable resource for the genomic scientific community
- Cloud-based infrastructure
- Shared analysis and computing environment
- Support genomic data access, sharing and computing across large genomic, and genomic related, data sets
- Genomic datasets, phenotypes and metadata
- Large datasets generated by NHGRI programs, as well as other initiatives / agencies
- Data access controls and data security
- Collaborative environment for datasets and analysis workflows
- ...for both users with limited computational expertise and sophisticated data scientist users

Slide 5

Slide 5 text

Goals of the AnVIL
1. Create open source software
   Storage, scalable analytics, data visualization
2. Organize and host key NHGRI datasets
   CCDG, CMG, eMERGE, and more
3. Operate services for the world
   Security, training & outreach, new models of data access

Slide 6

Slide 6 text

Software Components

Slide 7

Slide 7 text

- AnVIL / Terra: analysis workspaces and batch workflows
- AnVIL / Gen3: data models, indexing, querying
- AnVIL / Dockstore: sharing containerized tools and workflows
- AnVIL / Analysis Environments: Jupyter Notebooks, RStudio, Galaxy, ...

Slide 8

Slide 8 text

- AnVIL / Terra: analysis workspaces and batch workflows
- AnVIL / Gen3: data models, indexing, querying
- AnVIL / Analysis Environments: Jupyter Notebooks, RStudio, Galaxy, ...

- FISMA Moderate
- 2 ATOs
- Pursuing FedRAMP

All data use and analysis in a FISMA Moderate environment

Implemented on Google Cloud

Primary data storage costs covered by AnVIL; user private data and compute billed directly through Google

Slide 9

Slide 9 text

Approach to enabling analysis
- Goal is to integrate a wide variety of analysis tools and analysis environments to support different types of users and communities
- Built for extensibility: users will be able to bring new tools and analysis applications to the platform
- Initial launch has been focused on more expert users, supporting batch workflows and Python programming
- Additional environments and visualization tools will continue to be integrated throughout 2019/2020

Slide 10

Slide 10 text

Terra: Batch Workflows

Slide 11

Slide 11 text

Terra: Batch Workflows

Slide 12

Slide 12 text

Terra: Jupyter Notebooks

Slide 13

Slide 13 text

Dockstore: registry of tools and workflows
- Tools – a container with metadata that documents the tool's interface
  - 18 WDL and 192 CWL tools currently
- Workflows – a combination of multiple tools
  - 104 WDL and 139 CWL currently
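
The tool/workflow distinction on this slide can be sketched as a small data structure. This is only an illustration: the class names, field names, and image references below are hypothetical and are not Dockstore's actual schema or API. Only the four counts come from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tool:
    """A container plus metadata documenting the tool's interface (hypothetical fields)."""
    name: str
    container_image: str      # hypothetical image reference
    descriptor_language: str  # "WDL" or "CWL"


@dataclass
class Workflow:
    """A combination of multiple tools (hypothetical fields)."""
    name: str
    descriptor_language: str
    steps: List[Tool] = field(default_factory=list)


# Two made-up tools combined into one made-up workflow.
align = Tool("align-reads", "example.org/aligner:1.0", "WDL")
call = Tool("call-variants", "example.org/caller:1.0", "WDL")
pipeline = Workflow("align-and-call", "WDL", steps=[align, call])

# Registry counts quoted on the slide.
tool_counts = {"WDL": 18, "CWL": 192}
workflow_counts = {"WDL": 104, "CWL": 139}
total_tools = sum(tool_counts.values())
total_workflows = sum(workflow_counts.values())
print(total_tools, total_workflows)
```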

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Bioconductor + RStudio
- Bioconductor: tools and modules for the analysis and comprehension of high-throughput genomic data, implemented in R
- RStudio: analysis environment specifically designed for, and largely preferred by, the R community
- 1,741 software packages available in Bioconductor release 3.9
- AnVIL will provide a robust, well-tested RStudio environment with the latest Bioconductor release integrated

Slide 16

Slide 16 text

Galaxy
- Web-based analysis environment for running analysis tools and building workflows for users with no programming expertise
- Galaxy ToolShed, a repository for community-contributed tools and workflows, has 6,894 tools
- Additionally, Galaxy integrates dozens of visualization tools, which will also be available in AnVIL

Slide 17

Slide 17 text

Extending AnVIL
- Bring your own tools and workflows
  - Either by registering them in Dockstore, or by uploading your own custom WDL to Terra
- Build on top of the AnVIL APIs
  - All of the components of the AnVIL provide APIs
  - We will be providing a unified, stable API endpoint for the AnVIL with OpenAPI documentation
  - We are building API wrapper libraries in Python and R, largely generated from the OpenAPI specification but curated
- Adding new web applications
  - We are defining standards to allow a containerized web application to be hosted inside AnVIL
  - Leveraging standard container orchestration (Kubernetes) for complex applications

Slide 18

Slide 18 text

Data Management

Slide 19

Slide 19 text

Three areas of focus
1. Ingestion and Processing of Genomic Data
   Align, variant-call, and QC read-level data with best-practices pipelines
2. Ingestion and Processing of Phenotypic Data
   QC phenotypes and map to standardized data models
3. Data Access and Data Use Oversight
   Create more streamlined mechanisms to access the data

For all three, governance is crucial!

Slide 20

Slide 20 text

Working Groups
1. Data Processing Working Group
   Select best-practices pipelines, and define QC metrics for read-level data
2. Phenotype Working Group
   Select data models to map phenotypic data to, and oversee process
3. Data Access Working Group
   Map data use restrictions to ontologies, and create a data "passport"

We welcome your involvement!

Slide 21

Slide 21 text

Data Roadmap
1. By end of 2019
   - GSP: Center for Complex Disease Genomics (CCDG) and Center for Mendelian Genomics (CMG). Both will require updates beyond 2019
   - GTEx v8
   - High coverage 1kG
   - eMERGE
2. By end of 2020
   - eMERGE
   - Clinical Sequencing Evidence-Generating Research
   - Undiagnosed Diseases Network

Slide 22

Slide 22 text

Operations

Slide 23

Slide 23 text

AnVIL is Live as of July 2019!

anvil.terra.bio

Slide 24

Slide 24 text

Areas of Focus
1. Security & Compliance
   Ensure that all applications in the AnVIL ecosystem are secure & compliant
2. Data Access
   Can we use new models of data governance that accelerate access to data?
3. Training and Outreach
   Engage researchers around the world and train them to use the AnVIL

Slide 25

Slide 25 text

Security

We assume that all software that touches controlled-access data is NIST-800-53 compliant.
1) Extensive Disaster Recovery
2) Rigorous security program (personnel, process)
3) Continuous monitoring and auditing
4) Threat Assessment
5) Security-centric software development (code reviews, SDLC, etc.)

Slide 26

Slide 26 text

Slide 27

Slide 27 text

DUOS

Our current protocol for data access
- Data Depositors, Data Use Limitations: "This data is available for cancer research in a non-profit setting."
- Data Requestors, Data Access Request: "I am studying breast cancer at a company."
- Data Access Committee: "No!"

Slide 28

Slide 28 text

DUOS

Scales Poorly!! O(N^2)

dbGaP at PRIM&R 2017
- 826 = Number of studies in dbGaP
- 5,344 = Number of PIs requesting data
- 46 = Number of PI countries
- 1500+ = Number of publications resulting from secondary use of dbGaP data
- 13 days = Average Data Access Request processing time

As of July 1, 2017
- 50,167 Submitted
- 34,16 Approved

The model for this is dbGaP (database of Genotypes and Phenotypes), which was started by NIH over a decade ago. In general, all NHGRI-funded genotyping studies are required to be deposited in this database.
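
One way to read the O(N^2) label is that manual review grows with the number of (requester, study) pairs. The sketch below is an illustration under that assumption, not a figure from the slides: the function name is hypothetical, and the product is a worst-case upper bound in which every PI requests every study. Only the two input counts come from the slide.

```python
def worst_case_reviews(num_studies: int, num_requesters: int) -> int:
    """Upper bound on manual reviews if every requester asks for every study."""
    return num_studies * num_requesters


studies = 826       # studies in dbGaP (from the slide)
requesters = 5_344  # PIs requesting data (from the slide)

baseline = worst_case_reviews(studies, requesters)
doubled = worst_case_reviews(2 * studies, 2 * requesters)

print(baseline)
# Doubling both studies and requesters multiplies the worst case by this factor.
print(doubled // baseline)
```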

Slide 29

Slide 29 text

What is DUOS?
• Interfaces to transform data use restrictions and data access requests to machine-readable code (ADA-M & Consent Codes)
• A matching algorithm that checks if data access requests are compatible with data use restrictions
• Interfaces for the Data Access Committee to adjudicate whether structuring and matching has been done appropriately
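
The matching step can be sketched with the example from the earlier "current protocol" slide (cancer research in a non-profit setting versus a company studying breast cancer). This is a minimal sketch under stated assumptions, not the DUOS algorithm: the class names, fields, and the flat set of "broader areas" standing in for an ontology lookup are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class DataUseLimitation:
    """Structured form of a data use letter (hypothetical fields)."""
    allowed_research_areas: FrozenSet[str]  # empty set means any research area
    nonprofit_only: bool


@dataclass(frozen=True)
class ResearchPurpose:
    """Structured form of a research use statement (hypothetical fields)."""
    research_area: str
    broader_areas: FrozenSet[str]  # stand-in for ancestors in a disease ontology
    for_profit: bool


def is_compatible(limitation: DataUseLimitation, purpose: ResearchPurpose) -> bool:
    """Return True if the request satisfies every structured restriction."""
    if limitation.nonprofit_only and purpose.for_profit:
        return False
    if not limitation.allowed_research_areas:
        return True
    requested = {purpose.research_area} | purpose.broader_areas
    return bool(requested & limitation.allowed_research_areas)


cancer_nonprofit = DataUseLimitation(frozenset({"cancer"}), nonprofit_only=True)
company_request = ResearchPurpose("breast cancer", frozenset({"cancer"}), for_profit=True)
academic_request = ResearchPurpose("breast cancer", frozenset({"cancer"}), for_profit=False)

print(is_compatible(cancer_nonprofit, company_request))
print(is_compatible(cancer_nonprofit, academic_request))
```

In this sketch the company request is rejected on the non-profit restriction alone, which mirrors the "No!" in the slide's example; a request that no structured rule can decide would still go to the Data Access Committee.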

Slide 30

Slide 30 text

DUOS Trial

We aimed to answer the following questions: Can we...
- Translate a data use letter to structured data use limitations?
- Translate a research use statement to a structured research purpose?
- Evaluate these structured terms with an algorithm to make the same decision a human DAC would make when reviewing a DAR?

Slide 31

Slide 31 text

DUOS Trial

We aimed to answer the following questions: Can we...
● Results: 118/123 Data Use Letters able to be translated to structured data use limitations (96%)
● Examples of Data Use Letter text unable to be structured:
  ● Aggregate level data for general research use is prohibited
  ● Available for research on smoking
  ● Data must be held behind a firewall so it is only available to qualified scientists and healthcare professionals
  ● No data may be used from participants who consented before 8/7/2001

Slide 32

Slide 32 text

DUOS Trial

We aimed to answer the following questions: Can we...
● Results: 38/38 of researchers' submitted research use statements were correctly translated (100%)
● The human DAC & DUOS algorithm agreed on 37/37 of data access request evaluations (100%)
● >95% of DAR decisions were able to be adjudicated by the algorithm
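
The percentages quoted on the DUOS Trial slides follow directly from the counts. The short check below recomputes them; the dictionary labels and the helper name are illustrative, and rounding to the nearest whole percent is an assumption about how the slide figures were produced.

```python
# (successes, total) pairs quoted on the DUOS Trial slides.
trial_results = {
    "data use letters structured": (118, 123),
    "research use statements translated": (38, 38),
    "DAC and algorithm agreement": (37, 37),
}


def percent(successes: int, total: int) -> int:
    """Share of successes, rounded to the nearest whole percent."""
    return round(100 * successes / total)


for label, (successes, total) in trial_results.items():
    print(f"{label}: {successes}/{total} = {percent(successes, total)}%")
```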

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Training and Outreach

Slide 35

Slide 35 text

Portal

The primary communication channel of the AnVIL Project to the community.
● Background and team information
● News, events, announcements
● Data, tools, components
● Training materials

Slide 36

Slide 36 text

Portal (2)

The primary communication channel of the AnVIL Project to the community.
● Background and team information
● News, events, announcements
● Data, tools, components
● Training materials

Slide 37

Slide 37 text

Components: Training and Outreach
- Training materials (Jupyter/Markdown)
- Videos (mp4)
- Projects/questions (Jupyter/Markdown)
- GitHub
- YouTube
- MOOCs
- Leanpub
- Coursera
- edX
- Non-AnVIL Training
- Data Carpentry
- University Course materials
- AnVIL Training Network
- Galaxy Training Network
- Bioconductor courses
- Data Carpentry

Slide 38

Slide 38 text

Principles
1. Training material must be maintainable
2. Training material must be updatable
3. Training material creation must be distributed
4. Training material must be accessible
5. Training material must be repurposable
6. Training material must be free or low cost

Slide 39

Slide 39 text

Example input and output

AnVIL Supported Technologies in Blue

Slides: https://tinyurl.com/y52nny87
YouTube: https://youtu.be/nZE6mHUa-a4

ariExtra + ari

Slide 40

Slide 40 text

Example translation

AnVIL Supported Technologies in Blue

translate

Slides: https://tinyurl.com/yxtcja4r

Slide 41

Slide 41 text

{quiz, id: quiz_003_data_science_process, random-question-order: true}

### The Data Science Process quiz

{choose-answers: 4}
?1 Which of these is NOT an effective way to communicate the findings of your analysis?

C) save code locally on your computer
C) print code out and store in a desk drawer
o) write a blog post
o) publish a paper
o) publish a news article
o) write a report and share it with your team
o) write a report for your boss
o) give a talk at a conference and make materials available online

Leanpub

Slide 42

Slide 42 text

Existing training materials to adapt

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

Popular MOOCs to be adapted for AnVIL-based training using AnVIL components and software.

Slide 47

Slide 47 text

Existing platform-agnostic content port

Portable content
- Introduction to Genomics
- Statistics
- Bioconductor
- Galaxy
- Python programming
- Alignment
- Command line tools

Differences in AnVIL to be accommodated
- File management
- Permissions
- Workflow

Slide 48

Slide 48 text

Platform-specific content

Workflows
- How to navigate the system
- File permission and management
- Computing permission and management
- Compute credit access and management

Platform
- Galaxy
- R/Bioconductor
- Python

Slide 49

Slide 49 text

AnVIL Team
- Broad Institute: Anthony Philippakis, Daniel MacArthur, Alex Bauman, Adrian Sharma, Andrew Rula, Dave Bernick, Jonathan Lawson, Kristian Cibulskis, Namrata Gupta, Rob Title, Eric Banks, Rich Silva
- University of Chicago: Robert Grossman, Abby George, Garrett Rupp, Zac Flamig
- University of California Santa Cruz: Benedict Paten, Denis Yuen, Brian O’Connor, Charles Overbeck, Kevin Osborn, Louise Cabansay, Natalie Perez, Stefan Kuhn, Walt Shands
- Vanderbilt: Robert Carroll, Lakhan Swamy, Kristin Wuichet
- Washington University: Ira Hall, Adam Coffman, Allison Reieir, Haley Abel, Jason Walker
- Johns Hopkins: James Taylor, Jeff Leek, Kasper Hansen, Enis Afgan, Alexandru Mahmoud, Sergey Golitsynskiy, Jenn Vessio, John Muschelli, Mo Heydarian
- Penn State University: Anton Nekrutenko, John Chilton, Nate Coraor, Marten Cech
- Oregon Health & Sciences University: Jeremy Goecks, Kyle Ellrott, Brian Walsh, Luke Sargent, Vahid Jalili
- Roswell Park Cancer Institute: Martin Morgan, Nitesh Turaga, Lori Shepherd
- Harvard: Vincent Carey, BJ Stubbs, Shweta Gopaulakrishnan
- City University of New York: Levi Waldron, Sehyun Oh, Ludwig Geistlinger

Slide 50

Slide 50 text

(fin)