A Survey of Technologies for Reproducing and Communicating Biomedical Analyses

Short talk at High-throughput Sequencing Computational Standards for Regulatory Sciences workshop.

Jeremy Goecks

March 16, 2017

Transcript

  1. A Survey of Technologies for Reproducing and Communicating Biomedical Analyses

    Jeremy Goecks, Assistant Professor, Computational Biology and Biomedical Engineering, Oregon Health and Science University, @jgoecks
  2. Challenges

    Reproducibility: can you and others rerun your analysis now and in the future?
    Communication
    ‣ Can others understand what you’ve done at many different levels?
    ‣ Can others extend your analysis to their own work?
  3. Technical Complexity Impedes Scientific and Regulatory Progress

    Diagram labels: Operating System, Analysis Tools, Parameter Settings, Input Data, Pipelines/Workflows, arranged between “High level” and “Low level”, with a brace labeled “BCOs”
    BCOs within this ecosystem
    • BCOs will be consumers of pipeline, tool, data, and parameter technologies
    • Some technologies will produce BCOs for use/reuse
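
As an illustration of the layers on this slide, the following is a minimal sketch of a record that captures operating system, tools, parameter settings, input data, and workflow in one place. The class `AnalysisRecord`, its field names, and the placeholder tool and input values are assumptions made for this example; they are not a BCO or Research Object schema.

```python
import hashlib
import json
import platform
from dataclasses import asdict, dataclass, field


@dataclass
class AnalysisRecord:
    """Hypothetical record of the slide's layers (not an official BCO schema)."""
    operating_system: str
    tools: dict            # tool name -> version string
    parameters: dict       # parameter name -> value
    input_checksums: dict  # input name -> SHA-256 hex digest
    workflow: list = field(default_factory=list)  # ordered tool names


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


# Placeholder values: the tool name, version, and read data are invented.
record = AnalysisRecord(
    operating_system=platform.platform(),
    tools={"aligner": "1.0"},
    parameters={"min_quality": 20},
    input_checksums={"reads.fastq": sha256_of(b"@read1\nACGT\n+\nIIII\n")},
    workflow=["aligner"],
)

# Serialize with sorted keys so two records can be compared as text.
record_json = json.dumps(asdict(record), indent=2, sort_keys=True)
print(record_json)
```

Storing checksums rather than the input data keeps the record small while still letting someone else confirm they are rerunning the analysis on the same inputs.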
  4. Research Objects

    “More than just PDFs”
    http://www.researchobject.org/
    Diagram labels: Analysis Tools, Parameter Settings, Input Data, Pipelines/Workflows
  5. Galaxy: Web-based analysis system

    https://galaxyproject.org
    Use a Web browser for large biomedical analyses on high-performance computing or the cloud
    ‣ datasets
    ‣ tools
    ‣ workflows
    ‣ visualizations
    Diagram labels: Operating System, Analysis Tools, Parameter Settings, Pipelines/Workflows
  6. Communication and Reuse

  7. None
  8. (Goecks et al. Cancer Medicine, 2015)

  9. Common Workflow Language

    “Specification for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments”
    https://github.com/common-workflow-language/common-workflow-language
    Diagram labels: Analysis Tools, Parameter Settings, Pipelines/Workflows
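
To make the idea of a portable tool description concrete, here is a small sketch that builds a CWL `CommandLineTool` description for `echo` as a Python dictionary and serializes it as JSON, which CWL accepts alongside YAML. The choice of `echo`, the input name `message`, and the variable names are assumptions for illustration; the talk does not show a specific CWL document.

```python
import json

# A CWL tool description is a structured document. JSON is a subset of
# YAML, so a plain dict dumped as JSON can serve as one.
echo_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        # One string input, passed as the first positional argument.
        "message": {"type": "string", "inputBinding": {"position": 1}},
    },
    "outputs": [],
}

# Text that could be saved to a file such as echo.cwl (hypothetical name).
cwl_text = json.dumps(echo_tool, indent=2)
print(cwl_text)
```

The description names the command and its typed inputs but says nothing about the machine it runs on, which is what lets different execution engines run the same document.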
  10. GA4GH

    “Global standards and tools for the secure, privacy respecting and interoperable sharing of Genomic data”
    More than just data: workflows, containers, etc.
    http://ga4gh.org/
    Diagram labels: Operating System, Analysis Tools, Parameter Settings, Input Data, Pipelines/Workflows
  11. Software containers

    http://www.zdnet.com/article/what-is-docker-and-why-is-it-so-darn-popular/
    https://www.docker.com/
    Diagram labels: Operating System, Analysis Tools
  12. Finding and Creating Containers

    Many repositories where general-purpose and bioinformatics-specific containers are available
    ‣ General: Docker Hub (https://hub.docker.com/)
    ‣ Bioinformatics: Dockstore (https://dockstore.org/), Biocontainers (https://github.com/BioContainers)
    Many tools for creating containers by simplifying software installation
    ‣ Conda/Bioconda (https://conda.io/docs/): “Package, dependency and environment management for any language: Python, R, Ruby, Lua, Scala, Java, Javascript, C/C++, FORTRAN”
    ‣ Install software plus dependencies on many different systems
    Diagram labels: Operating System, Analysis Tools
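
One way Conda can simplify container creation is by reducing the container recipe to a base image plus a single install command. The sketch below generates such a Dockerfile as text. The helper `conda_dockerfile`, the `continuumio/miniconda3` base image, and the example packages are assumptions chosen for illustration, not a recipe taken from the talk.

```python
def conda_dockerfile(packages, base_image="continuumio/miniconda3"):
    """Return Dockerfile text that installs the given Conda packages.

    Hypothetical helper: the base image and channel order are assumptions.
    """
    if not packages:
        raise ValueError("at least one package is required")
    lines = [
        f"FROM {base_image}",
        # Conda resolves and installs each package plus its dependencies.
        "RUN conda install --yes -c conda-forge -c bioconda "
        + " ".join(packages),
    ]
    return "\n".join(lines) + "\n"


# Example packages; pinning a version (samtools=1.9) aids reproducibility.
dockerfile_text = conda_dockerfile(["samtools=1.9", "bwa"])
print(dockerfile_text)
```

Pinning package versions in the install line means that rebuilding the image later should install the same tool versions, which ties the container back to the reproducibility goal.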
  13. Questions

    How much detail should biocompute objects (BCOs) capture?
    What is the best way to ensure that BCOs can be easily used by non-technical users?
    Should execution platforms enforce intended BCO usage?
    What about clutter in the BCO repository?
    How to search for and provide feedback on BCOs that yield good performance?
    What incentives can encourage sharing of BCOs?
  14. Thank you! • goecksj@ohsu.edu • @jgoecks