Software and infrastructure to support Reproducible Computational Research

2016 Chicago Chapter ASA Conference: Learning Tools to Promote Reproducible Research and Open Science

James Taylor
April 01, 2016

Transcript

  1. @jxtx / #usegalaxy
    Software and infrastructure to support
    Reproducible Computational Research
    (with maybe some bias towards Galaxy)
    https://speakerdeck.com/jxtx

  2. What happens to traditional research outputs
    when an area of science rapidly becomes data
    intensive?

  3. Data pipeline (inspired by Leek and Peng, Nature 2015):
    Idea → Experiment → Raw Data → Tidy Data → Summarized Data → Results,
    via experimental design, data collection, data cleaning, data analysis,
    and inference
    [Diagram annotations mark “the part we are considering here” and “the
    part that ends up in the publication”]

  4. Questions one might ask about a
    published analysis
    Is the analysis as described correct?
    Was the analysis performed as described?
    Can the analysis be re-created exactly?

  5. What is reproducibility?
    (for computational analyses)
    Reproducibility means that an analysis is
    described/captured in sufficient detail that it can
    be precisely reproduced
    Reproducibility is not provenance, reusability/
    generalizability, or correctness
    A minimum standard for evaluating analyses

  6. A minimum standard for evaluating analyses
    Yet most published analyses are not reproducible

    Ioannidis et al. 2009 – 6/18 microarray experiments reproducible
    Nekrutenko and Taylor 2012 – 7/50 re-sequencing experiments reproducible

    Missing software, versions, parameters, data…

  7. A core challenge of reproducibility is
    identifiability
    Given a methods section, can we actually
    identify the resources, data, software …
    that were used?

  8. Submitted 2 June 2013
    On the reproducibility of science: unique
    identification of research resources in the
    biomedical literature
    Nicole A. Vasilevsky (1), Matthew H. Brush (1), Holly Paddock (2),
    Laura Ponting (3), Shreejoy J. Tripathy (4), Gregory M. LaRocca (4) and
    Melissa A. Haendel (1)
    1 Ontology Development Group, Library, Oregon Health & Science University,
      Portland, OR, USA
    2 Zebrafish Information Network, University of Oregon, Eugene, OR, USA
    3 FlyBase, Department of Genetics, University of Cambridge, Cambridge, UK
    4 Department of Biological Sciences and Center for the Neural Basis of
      Cognition, Carnegie Mellon University, Pittsburgh, PA, USA
    ABSTRACT
    Scientific reproducibility has been at the forefront of many news stories and there
    exist numerous initiatives to help address this problem. We posit that a contributor
    is simply a lack of specificity that is required to enable adequate research
    reproducibility. In particular, the inability to uniquely identify research
    resources, such as antibodies and model organisms, makes it difficult or impossible
    to reproduce experiments even where the science is otherwise sound. In order to
    better understand the magnitude of this problem, we designed an experiment to
    ascertain the “identifiability” of research resources in the biomedical literature.
    We evaluated recent journal articles in the fields of Neuroscience, Developmental
    Biology, Immunology, Cell and Molecular Biology and General Biology, selected
    randomly based on a diversity of impact factors for the journals, publishers, and
    experimental method reporting guidelines. We attempted to uniquely identify model
    organisms (mouse, rat, zebrafish, worm, fly and yeast), antibodies, knockdown
    reagents (morpholinos or RNAi), constructs, and cell lines. Specific criteria were
    developed to determine if a resource was uniquely identifiable, and included
    examining relevant repositories …

  9. Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014):
    Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare.
    http://dx.doi.org/10.6084/m9.figshare.987130
    32/127 tools
    6/41 papers

  10. #METHODSMATTER
    [Figure 1: frequency fluctuation for site 8992 across tool versions
    5.2 through 6.1; y-axis from 0.480 to 0.510; series compare default
    parameters with “-n 3 -q 15”]
    (Nekrutenko and Taylor, Nature Reviews Genetics, 2012)

  11. Core reproducibility tasks
    1. Capture the precise description of the experiment
    (either as it is being carried out, or after the fact)
    2. Assemble all of the necessary data and software
    dependencies needed by the described experiment
    3. Combine the above to verify the analysis

  12. Recommendations for performing
    reproducible computational research
    Nekrutenko and Taylor, Nature Reviews Genetics, 2012
    Sandve, Nekrutenko, Taylor and Hovig, PLoS Computational Biology 2013

  13. 1. Accept that computation is an integral
    component of biomedical research. Familiarize
    yourself with best practices of scientific computing,
    and implement good computational practices in
    your group

  14. 2. Always provide access to raw primary data

  15. 3. Record versions of all auxiliary datasets used in
    analysis. Many analyses require data, such as
    genome annotations, from external databases that
    change regularly; either record versions or store a
    copy of the specific data used.
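
    For example, a minimal sketch (not from the slides; the file paths, source
    URLs, and version labels are placeholders) that records a checksum, source,
    and stated version for each reference dataset used:

    # Minimal sketch: snapshot the identity of auxiliary/reference datasets.
    import datetime
    import hashlib
    import json

    def sha256(path, chunk=1 << 20):
        """Return the SHA-256 checksum of a file, read in chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as fh:
            for block in iter(lambda: fh.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    # Hypothetical reference datasets used by the analysis.
    reference_data = [
        {"path": "annotations/genes.gtf",
         "source": "https://example.org/release-83/genes.gtf",
         "stated_version": "release 83"},
        {"path": "genome/reference.fa",
         "source": "https://example.org/hg19/reference.fa",
         "stated_version": "hg19"},
    ]

    for entry in reference_data:
        entry["sha256"] = sha256(entry["path"])

    manifest = {"recorded": datetime.datetime.utcnow().isoformat(),
                "datasets": reference_data}
    with open("reference_data_manifest.json", "w") as fh:
        json.dump(manifest, fh, indent=2)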

  16. 4. Store the exact versions of all software used.
    Ideally archive the software to ensure it can be
    recovered later.

  17. 5. Record all parameters, even if default values are
    used. Default settings can change over time and
    determining what those settings were later can
    sometimes be difficult.
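
    For example, a minimal sketch (not from the slides; the tool name and
    parameters below are hypothetical) that logs every tool invocation with its
    exact command line, parameter settings, and tool version:

    # Minimal sketch: log each invocation with its full parameter set
    # (including defaults) and the tool version. Names are placeholders.
    import datetime
    import json
    import shlex
    import subprocess

    def get_version(tool):
        """Best-effort capture of `tool --version` (not every tool supports it)."""
        try:
            out = subprocess.run([tool, "--version"], capture_output=True, text=True)
            return (out.stdout or out.stderr).strip().splitlines()[0]
        except (OSError, IndexError):
            return "unknown"

    def run_logged(tool, params, logfile="invocations.jsonl"):
        """Run a command and append a provenance record describing it."""
        cmd = [tool] + [str(x) for pair in params.items() for x in pair]
        subprocess.run(cmd, check=True)
        record = {
            "timestamp": datetime.datetime.utcnow().isoformat(),
            "tool": tool,
            "version": get_version(tool),
            "parameters": params,  # record every parameter, even defaults
            "command_line": " ".join(shlex.quote(c) for c in cmd),
        }
        with open(logfile, "a") as fh:
            fh.write(json.dumps(record) + "\n")

    # Hypothetical usage:
    # run_logged("my_aligner", {"--input": "reads.fastq", "--seed-length": 20})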

  18. 6. Record and provide exact versions of any custom
    scripts used.
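
    If the custom scripts live in version control, the exact revision can be
    captured automatically. A minimal sketch assuming the scripts are in a git
    repository (not from the slides):

    # Minimal sketch: record the exact git commit of the analysis scripts and
    # whether the working tree had uncommitted changes.
    import subprocess

    def git(*args):
        return subprocess.run(["git", *args], capture_output=True, text=True).stdout.strip()

    commit = git("rev-parse", "HEAD")           # exact revision of the scripts
    dirty = bool(git("status", "--porcelain"))  # any uncommitted local edits?

    with open("scripts_version.txt", "w") as fh:
        fh.write(f"commit: {commit}\n")
        fh.write(f"uncommitted_changes: {dirty}\n")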

  19. 7. Do not reinvent the wheel; use existing software
    and pipelines when appropriate to contribute to the
    development of best practices.

  20. Is reproducibility achievable?

  21. A spectrum of solutions
    Analysis environments (Galaxy, GenePattern, Mobyle, …)
    Workflow systems (Taverna, Pegasus, VisTrails, …)
    Notebook style (IPython notebook, …)
    Literate programming style (Sweave/knitr, …)
    System level provenance capture (ReproZip, …)
    Complete environment capture (VMs, containers, …)

  22. [image-only slide]

  23. Galaxy: accessible analysis system

  24. A free (for everyone) web service integrating a
    wealth of tools, compute resources, terabytes of
    reference data and permanent storage
    Open source software that makes integrating your
    own tools and data and customizing for your own
    site simple
    An open extensible platform for sharing tools,
    datatypes, workflows, ...

  25. Galaxy’s goals:
    Accessibility: Eliminate barriers for researchers
    wanting to use complex methods, make these
    methods available to everyone
    Transparency: Facilitate communication of analyses
    and results in ways that are easy to understand while
    providing all details
    and of course…
    Reproducibility: Ensure that analyses performed in the
    system can be reproduced precisely and practically

  26. Integrating existing tools into a uniform framework
    • Tools are defined in terms of an abstract
    interface (inputs and outputs)
    • In practice, mostly command line tools,
    described by a declarative XML file that
    specifies the interface and how to
    generate a command line
    • Designed to be as easy as possible
    for tool authors, while still allowing
    rigorous reasoning

  27. Galaxy analysis interface
    • Consistent tool user
    interfaces
    automatically
    generated
    • History system
    facilitates and tracks
    multistep analyses

  28. Automatically and transparently tracks
    every step of every analysis

  29. As well as user-generated
    metadata and annotation...

  30. Galaxy workflow system
    • Workflows can be
    constructed from scratch
    or extracted from existing
    analysis histories
    • Facilitate reuse, as well as
    providing precise
    reproducibility of a
    complex analysis
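
    Workflows can also be re-run programmatically. A minimal sketch using the
    BioBlend client library (not shown in the slides; the URL, API key, workflow
    name, and input file are placeholders):

    # Minimal sketch: re-run a stored Galaxy workflow on new data via the API.
    from bioblend.galaxy import GalaxyInstance

    gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

    # A fresh history keeps the re-run cleanly separated and fully tracked.
    history = gi.histories.create_history(name="reproduced-analysis")

    # Upload the input dataset and look up the stored workflow by name.
    upload = gi.tools.upload_file("reads.fastq", history["id"])
    dataset_id = upload["outputs"][0]["id"]
    workflow = gi.workflows.get_workflows(name="RNA-seq differential expression")[0]

    # Map the uploaded dataset to the workflow's first input step and invoke it.
    inputs = {"0": {"id": dataset_id, "src": "hda"}}
    gi.workflows.invoke_workflow(workflow["id"], inputs=inputs, history_id=history["id"])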

  31. [image-only slide]

  32. Example: Workflow for differential expression analysis of RNA-seq using Tophat/
    Cufflinks tools

  33. [image-only slide]

  34. Galaxy Pages for publishing analysis

  35. Actual histories and datasets directly accessible from the text

  36. Histories can be imported and the exact parameters inspected

  37. Describe analysis tool
    behavior abstractly
    Analysis environment automatically
    and transparently tracks details
    Workflow system for complex analysis,
    constructed explicitly or automatically
    Pervasive sharing, and publication
    of documents with integrated analysis

  38. Visualization and visual analytics

  39. How do we make this available to
    as many people as possible?

  40. Leveraging National Cyberinfrastructure: Galaxy/XSEDE Gateway
    Dedicated resources at TACC, Austin: Galaxy cluster “Rodeo”
    (256 cores, 2 TB memory) and Corral/Stockyard storage (20 PB disk)
    Shared XSEDE resources: Stampede at TACC (462,462 cores, 205 TB memory);
    Blacklight and Bridges at PSC, Pittsburgh; PTI, IU Bloomington
    Funded by the National Science Foundation, Award #ACI-1445604

  41. Reproducibility possibilities with Jetstream
    Jetstream will enable archival of volumes and virtual
    machines in the IU Scholarworks Archive with DOIs
    attached
    The exact analysis environment, data, and provenance
    become an archived, citable entity

  42. Limitations of current infrastructure
    Many Galaxy jobs are of short duration; we want to enable
    interactive analysis where possible
    The Galaxy dedicated allocation on Rodeo is fully saturated
    Long wait times for scheduling on some XSEDE resources;
    bioinformatics tools are often difficult to install and test in
    different environments

  43. An alternative approach: Galaxy Cloud
    CloudMan: general purpose deployment
    manager for any cloud; cluster and service
    management, auto-scaling
    CloudBridge: new abstraction library for
    working with multiple cloud APIs
    Genomics Virtual Lab: CloudMan + Galaxy +
    many other common bioinformatics tools
    and frameworks

  44. Dedicated Galaxy instances on various Cloud environments

  45. Reproducibility assistance from CloudMan
    Complete instances can be archived and shared
    with others
    [Screenshot: “Share a snapshot of this instance”]

  46. Galaxy gives us…
    Abstract definition of tool interfaces and precise capture
    of parameters for every tool invocation
    Complete provenance for data relationships (user defined
    and system wide)
    The usefulness of such a system relies on having large
    numbers of tools integrated; how do we facilitate this?

  47. [Diagram: the Galaxy ecosystem, from 1, 2, 3 … to ∞]
    http://usegalaxy.org
    http://usegalaxy.org/community
    Galaxies on public and private clouds
    Private Galaxy installations and private Tool Sheds
    The Galaxy Tool Shed

  48. Vision for the Galaxy ToolShed
    Grow tool development by supporting and nurturing
    community
    Provide infrastructure to host all tools, make it easy to
    build tools, install tools into Galaxy, …
    Quality oversight by a group of volunteers from the
    community
    Version and store every dependency of every tool to
    ensure that we can reconstruct environments exactly

  49. Repositories are owned by the
    contributor, can contain tools,
    workflows, etc.
    Backed by version control, a complete
    version history is retained for everything
    that passes through the toolshed
    Galaxy instance admins can install tools
    directly from the toolshed using only a
    web UI
    Support for recipes for installing the
    underlying software that tools depend
    on (also versioned)

  50. State of the Galaxy ToolShed
    ToolShed now contains thousands of tools
    Community response has been phenomenal
    However, packaging is challenging — it never ends!
    Need to move to a model that pulls in and integrates with
    a broader community

  51. Packaging software for
    reproducible research

  52. Portability and Isolation
    are crucial for practical reproducibility

  53. https://bioconda.github.io

  54. It is now reasonable to support one major
    server platform — Linux
    (this is great for portability and reproducibility, but scary
    for other reasons — monoculture leads to fragility)

  55. Builds on Conda packaging system, designed
    “for installing multiple versions of software
    packages and their dependencies and
    switching easily between them”
    823 recipes for software packages
    (as of yesterday)
    All packages are built in a minimal
    environment to ensure isolation and
    portability
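
    In practice this means an analysis environment can be pinned to exact
    versions and rebuilt later. A minimal sketch driving the conda command line
    from Python (not from the slides; the environment name, channel, packages,
    and versions are placeholders):

    # Minimal sketch: create a conda environment pinned to exact versions,
    # then export its full specification so it can be recreated later.
    import subprocess

    subprocess.run(
        ["conda", "create", "--yes", "--name", "myanalysis-2016",
         "--channel", "bioconda", "samtools=1.3", "bwa=0.7.13"],
        check=True,
    )

    spec = subprocess.run(
        ["conda", "env", "export", "--name", "myanalysis-2016"],
        capture_output=True, text=True, check=True,
    ).stdout

    with open("environment.yml", "w") as fh:
        fh.write(spec)  # keep this file alongside the analysis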

  56. Submit recipe to GitHub
    Travis CI pulls recipes and builds them
    in a minimal Docker container
    Successful builds from the main repo are
    uploaded to Anaconda and can be installed
    anywhere

  57. See also…

  58. Containers for composing and recreating
    complete environments

  59. [image-only slide]

  60. Docker
    Builds on Linux kernel features enabling complete
    isolation from the kernel level up
    Containers — lightweight environments with
    isolation enforced at the OS level, complete control
    over all software
    Adds a complete ecosystem for sharing, versioning, and
    managing containers — Docker Hub

  61. Galaxy + Docker
    Run every analysis in a clean container — analyses
    are isolated and the environment is the same every time
    Archive that container — containers are lightweight
    thanks to layers — and the analysis can always be
    recreated
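
    The same idea can be sketched outside Galaxy with the Docker SDK for Python
    (a minimal sketch, not how Galaxy itself does it; the image tag, command,
    and paths are placeholders): each step runs in a fresh container built from
    a fixed image, so the environment is identical every time.

    # Minimal sketch: run one analysis step in a clean, throwaway container.
    import docker

    client = docker.from_env()

    logs = client.containers.run(
        image="biocontainers/samtools:1.3",  # fixed, versioned environment
        command="samtools flagstat /data/sample.bam",
        volumes={"/home/user/analysis": {"bind": "/data", "mode": "rw"}},
        remove=True,                         # discard the container afterwards
    )
    print(logs.decode())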

  62. Bioconda + Docker
    Given a set of packages and versions in Conda/
    Bioconda, we can build a container with just that
    software on a minimal base image
    If we use the same base image, we can reconstruct
    exactly the same container (since we archive all
    binary builds of all versions)
    And we can even host on a specific VM image…

  63. Increasingly precise environment control
    Tool and dependency binaries, built in a minimal
    environment with controlled libs
    Docker container defines the minimum environment
    Virtual machine controls the kernel and apparent
    hardware environment (KVM, Xen, …)

  64. Sharing tools and workflows
    beyond Galaxy

  65. https://www.commonwl.org

  66. Common Workflow Language
    “Multi-vendor working group … enable data
    scientists to describe analysis tools and workflows
    that are powerful, easy to use, portable, and support
    reproducibility.”
    Schemas for describing tools and workflows in a
    workflow system agnostic manner
    Containers as default dependency resolution
    mechanism
    Work in progress, currently Draft 3

  67. Common Workflow Language
    ~10 implementations: cwltool (the reference
    implementation), Rabix, Arvados, Galaxy, Toil
    (BD2K Translational Genomics), Taverna, …
    Get involved!
    More communities will help to ensure a general and
    expressive approach

  68. No need to be stuck with just one system

  69.–77. [image-only slides]

  78. Galaxy Interactive Environments
    General framework supporting environments
    other than Jupyter (e.g. RStudio)
    Problems with the notebook model: history can
    be edited! Only reproducible when all cells are
    rerun
    Goal: keep complete history (provenance
    graph) for every dataset generated from a
    notebook — preserve Galaxy’s provenance
    guarantees
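
    One way to keep notebook-based analyses honest is to always re-execute them
    top to bottom before publishing. A minimal sketch with nbformat/nbconvert
    (not part of Galaxy's implementation; the notebook filename is a placeholder):

    # Minimal sketch: re-run a notebook from top to bottom so the published
    # copy reflects one linear execution rather than an edited history.
    import nbformat
    from nbconvert.preprocessors import ExecutePreprocessor

    nb = nbformat.read("analysis.ipynb", as_version=4)
    ExecutePreprocessor(timeout=600, kernel_name="python3").preprocess(
        nb, {"metadata": {"path": "."}}
    )
    nbformat.write(nb, "analysis.executed.ipynb")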

  79. Reproducibility is possible

  80. Even partial reproducibility is better than nothing
    Striving for reproducibility makes methods more
    transparent and understandable, leading to better science

  81. Reproducibility is possible,
    why is it not the norm?
    Slightly more difficult than not doing it right
    Analysts don’t know how to do it right
    Fear of being critiqued –
    “my code is too ugly”
    “why hold myself to a higher standard”

  82. Tools can only fix so much of the problem
    Need to create an expectation of
    reproducibility
    Require authors to make their work
    reproducible as part of the peer review
    process
    Need to educate analysts

  83. Reproducibility is only one part of research
    integrity
    Need widespread education on how to conduct
    computational analyses that are correct and
    transparent

  84. Science culture needs to adapt
    Mistakes will be made!
    Need to create an environment where
    researchers are willing to be open and
    transparent enough that these mistakes are
    found
    Research should be subject to continuous,
    constructive, and open peer review

  85. (and, bring back methods sections!)

  86. Acknowledgements
    Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin
    Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks,
    Björn Grüning, Aysam Guerler, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric
    Rasche, Nicola Soranzo, Nitesh Turaga, Marius van den Beek
    UiO: Geir Kjetil Sandve and Eivind Hovig
    JHU Data Science: Jeff Leek, Roger Peng, …
    Bioconda: Johannes Köster, Björn Grüning, Ryan Dale, Andreas Sjödin,
    Adam Caprez, Chris Tomkins-Tinch, Brad Chapman, Alexey Strokach, …
    CWL: Peter Amstutz, Robin Andeer, Brad Chapman, John Chilton, Michael R.
    Crusoe, Roman Valls Guimerà, Guillermo Carrasco Hernandez, Sinisa Ivkovic,
    Andrey Kartashov, John Kern, Dan Leehr, Hervé Ménager, Maxim Mikheev, Tim
    Pierce, Josh Randall, Stian Soiland-Reyes, Luka Stojanovic, Nebojša Tijanić
    Everyone I forgot…

  87. (fin)
