Slide 1

Slide 1 text

@jxtx / #usegalaxy Software and infrastructure to support Reproducible Computational Research (with maybe some bias towards ) https://speakerdeck.com/jxtx

Slide 2

Slide 2 text

What happens to traditional research outputs when an area of science rapidly becomes data intensive?

Slide 3

Slide 3 text

Data pipeline, inspired by Leek and Peng, Nature 2015: Idea → (experimental design) → Experiment → (data collection) → Raw data → (data cleaning) → Tidy data → (data analysis) → Summarized data → (inference) → Results. [Figure annotations: "the part we are considering here" and "the part that ends up in the publication".]

Slide 4

Slide 4 text

Questions one might ask about a published analysis:
Is the analysis as described correct?
Was the analysis performed as described?
Can the analysis be re-created exactly?

Slide 5

Slide 5 text

What is reproducibility? (for computational analyses)
Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced.
Reproducibility is not provenance, reusability/generalizability, or correctness.
It is a minimum standard for evaluating analyses.

Slide 6

Slide 6 text

A minimum standard for evaluating analyses, yet most published analyses are not reproducible:
Ioannidis et al. 2009 – 6/18 microarray experiments reproducible
Nekrutenko and Taylor 2012 – 7/50 re-sequencing experiments reproducible
… Missing software, versions, parameters, data…

Slide 7

Slide 7 text

A core challenge of reproducibility is identifiability: given a methods section, can we identify the resources, data, software … that were actually used?

Slide 8

Slide 8 text

Submitted 2 June 2013
On the reproducibility of science: unique identification of research resources in the biomedical literature
Nicole A. Vasilevsky (1), Matthew H. Brush (1), Holly Paddock (2), Laura Ponting (3), Shreejoy J. Tripathy (4), Gregory M. LaRocca (4) and Melissa A. Haendel (1)
1 Ontology Development Group, Library, Oregon Health & Science University, Portland, OR, USA; 2 Zebrafish Information Framework, University of Oregon, Eugene, OR, USA; 3 FlyBase, Department of Genetics, University of Cambridge, Cambridge, UK; 4 Department of Biological Sciences and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA, USA
ABSTRACT: Scientific reproducibility has been at the forefront of many news stories and there exist numerous initiatives to help address this problem. We posit that a contributor is simply a lack of specificity that is required to enable adequate research reproducibility. In particular, the inability to uniquely identify research resources, such as antibodies and model organisms, makes it difficult or impossible to reproduce experiments even where the science is otherwise sound. In order to better understand the magnitude of this problem, we designed an experiment to ascertain the "identifiability" of research resources in the biomedical literature. We evaluated recent journal articles in the fields of Neuroscience, Developmental Biology, Immunology, Cell and Molecular Biology and General Biology, selected randomly based on a diversity of impact factors for the journals, publishers, and experimental method reporting guidelines. We attempted to uniquely identify model organisms (mouse, rat, zebrafish, worm, fly and yeast), antibodies, knockdown reagents (morpholinos or RNAi), constructs, and cell lines. Specific criteria were developed to determine if a resource was uniquely identifiable, and included examining relevant repositories …

Slide 9

Slide 9 text

Vasilevsky, Nicole; Kavanagh, David J.; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130
32/127 tools, 6/41 papers

Slide 10

Slide 10 text

#METHODSMATTER
[Figure 1: fluctuation in the computed frequency at site 8992 (y axis, roughly 0.480 to 0.510) across software versions 5.2 through 6.1 (x axis), comparing default parameters with "-n 3 -q 15"] (Nekrutenko and Taylor, Nature Reviews Genetics, 2012)

Slide 11

Slide 11 text

Core reproducibility tasks:
1. Capture the precise description of the experiment (either as it is being carried out, or after the fact)
2. Assemble all of the necessary data and software dependencies needed by the described experiment
3. Combine the above to verify the analysis

Slide 12

Slide 12 text

Recommendations for performing reproducible computational research Nekrutenko and Taylor, Nature Reviews Genetics, 2012 Sandve, Nekrutenko, Taylor and Hovig, PLoS Computational Biology 2013

Slide 13

Slide 13 text

1. Accept that computation is an integral component of biomedical research. Familiarize yourself with best practices of scientific computing, and implement good computational practices in your group

Slide 14

Slide 14 text

2. Always provide access to raw primary data

Slide 15

Slide 15 text

3. Record versions of all auxiliary datasets used in analysis. Many analyses require data, such as genome annotations, from external databases that change regularly; either record the versions used or store a copy of the specific data.
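
For example, a minimal sketch (the file name and URL are placeholders) of recording exactly which copy of an external annotation file was used:

```python
# Minimal sketch: record where an auxiliary dataset came from and exactly
# which copy was used, by storing its URL, retrieval date, and checksum.
import hashlib, json, datetime

def sha256sum(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

record = {
    "file": "annotation.gtf.gz",                              # placeholder file name
    "source_url": "https://example.org/annotation.gtf.gz",    # placeholder URL
    "retrieved": datetime.date.today().isoformat(),
    "sha256": sha256sum("annotation.gtf.gz"),
}
with open("dataset_provenance.json", "w") as out:
    json.dump(record, out, indent=2)
```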

Slide 16

Slide 16 text

4. Store the exact versions of all software used. Ideally archive the software to ensure it can be recovered later.
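
A minimal sketch of capturing software versions from within a Python analysis (the package names are placeholders; for non-Python tools a conda export or container manifest serves the same role):

```python
# Minimal sketch: snapshot the versions of the Python packages an analysis
# imports (requires Python 3.8+ for importlib.metadata).
from importlib import metadata
import json

packages = ["numpy", "pysam"]   # placeholder list of packages the analysis uses
versions = {p: metadata.version(p) for p in packages}
with open("software_versions.json", "w") as out:
    json.dump(versions, out, indent=2)
```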

Slide 17

Slide 17 text

5. Record all parameters, even if default values are used. Default settings can change over time and determining what those settings were later can sometimes be difficult.
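
A minimal sketch of writing out every parameter value actually used, defaults included (the option names are illustrative):

```python
# Minimal sketch: record every parameter value actually used, including
# ones left at their defaults, so the run can be reconstructed later.
import argparse, json

parser = argparse.ArgumentParser()
parser.add_argument("--min-quality", type=int, default=15)     # placeholder options
parser.add_argument("--max-mismatches", type=int, default=3)
args = parser.parse_args()

with open("parameters_used.json", "w") as out:
    json.dump(vars(args), out, indent=2)   # defaults are captured too
```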

Slide 18

Slide 18 text

6. Record and provide exact versions of any custom scripts used.
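
A minimal sketch (assuming the custom scripts live in a git repository) of recording the exact commit used:

```python
# Minimal sketch: record the exact git commit of the custom scripts used,
# and flag the run if there are uncommitted changes.
import subprocess

commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
dirty = subprocess.run(["git", "diff", "--quiet"]).returncode != 0

with open("script_version.txt", "w") as out:
    out.write(commit + (" (uncommitted changes!)\n" if dirty else "\n"))
```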

Slide 19

Slide 19 text

7. Do not reinvent the wheel: use existing software and pipelines when appropriate, and contribute to the development of best practices.

Slide 20

Slide 20 text

Is reproducibility achievable?

Slide 21

Slide 21 text

A spectrum of solutions:
Analysis environments (Galaxy, GenePattern, Mobyle, …)
Workflow systems (Taverna, Pegasus, VisTrails, …)
Notebook style (IPython notebook, …)
Literate programming style (Sweave/knitr, …)
System-level provenance capture (ReproZip, …)
Complete environment capture (VMs, containers, …)

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

Galaxy: accessible analysis system

Slide 24

Slide 24 text

A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data, and permanent storage
Open source software that makes integrating your own tools and data and customizing for your own site simple
An open, extensible platform for sharing tools, datatypes, workflows, …

Slide 25

Slide 25 text

Galaxy’s goals:
Accessibility: eliminate barriers for researchers wanting to use complex methods; make these methods available to everyone
Transparency: facilitate communication of analyses and results in ways that are easy to understand while providing all details
and of course…
Reproducibility: ensure that analyses performed in the system can be reproduced precisely and practically

Slide 26

Slide 26 text

Integrating existing tools into a uniform framework:
• Defined in terms of an abstract interface (inputs and outputs)
• In practice, mostly command line tools, with a declarative XML description of the interface and of how to generate a command line
• Designed to be as easy as possible for tool authors, while still allowing rigorous reasoning
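
As a hedged illustration of that abstract interface from the caller's side, a sketch using BioBlend, the Python client for the Galaxy API (the server URL, API key, tool id, and parameter names are placeholders):

```python
# Sketch: a tool is addressed purely through its declared interface -- an id
# plus named inputs -- and Galaxy builds and records the actual command line.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")
history = gi.histories.create_history(name="reproducible-run")
uploaded = gi.tools.upload_file("reads.fastq", history["id"])

result = gi.tools.run_tool(
    history_id=history["id"],
    tool_id="example_filter_tool",          # placeholder tool id
    tool_inputs={
        "input": {"src": "hda", "id": uploaded["outputs"][0]["id"]},
        "min_quality": 15,                  # placeholder parameter
    },
)
print([d["name"] for d in result["outputs"]])
```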

Slide 27

Slide 27 text

Galaxy analysis interface:
• Consistent tool user interfaces automatically generated
• History system facilitates and tracks multistep analyses

Slide 28

Slide 28 text

Automatically and transparently tracks  every step of every analysis

Slide 29

Slide 29 text

As well as user-generated  metadata and annotation...

Slide 30

Slide 30 text

Galaxy workflow system:
• Workflows can be constructed from scratch or extracted from existing analysis histories
• Facilitates reuse, as well as providing precise reproducibility of a complex analysis
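
A hedged BioBlend sketch of re-running a saved workflow against new inputs (the workflow name, dataset id, and input mapping are placeholders):

```python
# Sketch: look up a saved Galaxy workflow by name and invoke it on a dataset.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

workflow = gi.workflows.get_workflows(name="rna-seq-dge")[0]   # placeholder name
invocation = gi.workflows.invoke_workflow(
    workflow["id"],
    inputs={"0": {"src": "hda", "id": "PLACEHOLDER_DATASET_ID"}},
    history_name="workflow-rerun",
)
print(invocation["state"])
```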

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Example: workflow for differential expression analysis of RNA-seq data using the TopHat/Cufflinks tools

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Galaxy Pages for publishing analysis

Slide 35

Slide 35 text

Actual histories and datasets directly accessible from the text

Slide 36

Slide 36 text

Histories can be imported and the exact parameters inspected
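
A hedged BioBlend sketch of inspecting those recorded parameters programmatically (ids are placeholders; this assumes the show_dataset_provenance call available in current BioBlend releases):

```python
# Sketch: for any dataset in a history, ask Galaxy which tool and which
# parameters produced it.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")
prov = gi.histories.show_dataset_provenance(
    history_id="PLACEHOLDER_HISTORY_ID",
    dataset_id="PLACEHOLDER_DATASET_ID",
    follow=True,   # also walk back through the datasets this one was derived from
)
print(prov["tool_id"])
print(prov["parameters"])
```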

Slide 37

Slide 37 text

Describe analysis tool behavior abstractly
Analysis environment automatically and transparently tracks details
Workflow system for complex analysis, constructed explicitly or automatically
Pervasive sharing, and publication of documents with integrated analysis

Slide 38

Slide 38 text

Visualization and visual analytics

Slide 39

Slide 39 text

How do we make this available to as many people as possible?

Slide 40

Slide 40 text

Leveraging national cyberinfrastructure: the Galaxy/XSEDE gateway
Shared XSEDE resources: Stampede (462,462 cores, 205 TB memory); Blacklight and Bridges (PSC, Pittsburgh)
Dedicated resources (TACC, Austin): Galaxy Cluster (Rodeo) with 256 cores and 2 TB memory; Corral/Stockyard with 20 PB disk
PTI, IU Bloomington
Funded by the National Science Foundation, Award #ACI-1445604

Slide 41

Slide 41 text

Reproducibility possibilities with Jetstream: Jetstream will enable archival of volumes and virtual machines in the IU ScholarWorks archive with DOIs attached. The exact analysis environment, data, and provenance become an archived, citable entity.

Slide 42

Slide 42 text

Limitations of current infrastructure:
Many Galaxy jobs are short duration; we want to enable interactive analysis where possible
The dedicated Galaxy allocation on Rodeo is fully saturated
Long wait times for scheduling on some XSEDE resources; bioinformatics tools are often difficult to install and test in different environments

Slide 43

Slide 43 text

An alternative approach: Galaxy Cloud
CloudMan: general-purpose deployment manager for any cloud; cluster and service management, auto-scaling
CloudBridge: new abstraction library for working with multiple cloud APIs
Genomics Virtual Lab: CloudMan + Galaxy + many other common bioinformatics tools and frameworks

Slide 44

Slide 44 text

Dedicated Galaxy instances on various Cloud environments

Slide 45

Slide 45 text

Reproducibility assistance from CloudMan: complete instances can be archived and shared with others ("Share a snapshot of this instance")

Slide 46

Slide 46 text

Galaxy gives us…
Abstract definition of tool interfaces and precise capture of parameters for every tool invocation
Complete provenance for data relationships (user defined and system wide)
The usefulness of such a system relies on having large numbers of tools integrated; how do we facilitate this?

Slide 47

Slide 47 text

[Diagram: scaling from one to many (1, 2, 3, … ∞): http://usegalaxy.org and the Galaxy Tool Shed (http://usegalaxy.org/community); Galaxies on private clouds; Galaxies on public clouds; private Galaxy installations; private Tool Sheds]

Slide 48

Slide 48 text

Vision for the Galaxy ToolShed:
Grow tool development by supporting and nurturing community
Provide infrastructure to host all tools; make it easy to build tools, install tools into Galaxy, …
Quality oversight by a group of volunteers from the community
Version and store every dependency of every tool to ensure that we can reconstruct environments exactly

Slide 49

Slide 49 text

Repositories are owned by the contributor and can contain tools, workflows, etc.
Backed by version control; a complete version history is retained for everything that passes through the ToolShed
Galaxy instance admins can install tools directly from the ToolShed using only a web UI
Support for recipes for installing the underlying software that tools depend on (also versioned)

Slide 50

Slide 50 text

State of the Galaxy ToolShed:
The ToolShed now contains thousands of tools
Community response has been phenomenal
However, packaging is challenging: it never ends!
Need to move to a model that pulls in and integrates with a broader community

Slide 51

Slide 51 text

Packaging software for reproducible research

Slide 52

Slide 52 text

Portability and Isolation are crucial for practical reproducibility

Slide 53

Slide 53 text

https://bioconda.github.io

Slide 54

Slide 54 text

It is now reasonable to support one major server platform — Linux (this is great for portability and reproducibility, but scary for other reasons — monoculture leads to fragility)

Slide 55

Slide 55 text

Builds on the Conda packaging system, designed “for installing multiple versions of software packages and their dependencies and switching easily between them”
823 recipes for software packages (as of yesterday)
All packages are built in a minimal environment to ensure isolation and portability
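
For example, a sketch of creating an isolated environment holding exact pinned versions from the bioconda channel (the package names and versions are illustrative):

```python
import subprocess

# Create an isolated conda environment holding exactly the pinned tool versions
# an analysis needs; several such environments can coexist side by side.
subprocess.run(
    ["conda", "create", "--yes", "--name", "analysis-2016",
     "--channel", "bioconda", "--channel", "conda-forge",
     "samtools=1.3.1", "bwa=0.7.15"],     # illustrative packages/versions
    check=True,
)
# `source activate analysis-2016` (or `conda activate` on newer conda)
# then switches into that exact software stack.
```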

Slide 56

Slide 56 text

Submit a recipe to GitHub → Travis CI pulls recipes and builds them in a minimal Docker container → successful builds from the main repo are uploaded to Anaconda, to be installed anywhere

Slide 57

Slide 57 text

See also…

Slide 58

Slide 58 text

Containers for composing and recreating complete environments

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

Docker:
Builds on Linux kernel features enabling complete isolation from the kernel level up
Containers: lightweight environments with isolation enforced at the OS level, and complete control over all software
Adds a complete ecosystem for sharing, versioning, and managing containers (Docker Hub)

Slide 61

Slide 61 text

Galaxy + Docker:
Run every analysis in a clean container; analyses are isolated and the environment is the same every time
Archive that container (containers are lightweight thanks to layers) and the analysis can always be recreated
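
A sketch of the same idea using the plain docker CLI from Python (the image tag, paths, and command are placeholders, not Galaxy's internal mechanism):

```python
import subprocess

image = "quay.io/biocontainers/samtools:1.3.1--0"   # placeholder image tag

# Run one analysis step in a fresh, isolated container with the data mounted.
subprocess.run(
    ["docker", "run", "--rm",
     "-v", "/data/run42:/data",                      # placeholder data path
     image, "samtools", "flagstat", "/data/aln.bam"],
    check=True,
)

# Archive the exact image so the environment can be recreated later.
subprocess.run(["docker", "save", "-o", "samtools-1.3.1.tar", image], check=True)
```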

Slide 62

Slide 62 text

Bioconda + Docker:
Given a set of packages and versions in Conda/Bioconda, we can build a container with just that software on a minimal base image
If we use the same base image, we can reconstruct exactly the same container (since we archive all binary builds of all versions)
And we can even host on a specific VM image…
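
A hedged sketch of that kind of build (the base image and packages are illustrative choices, not the project's canonical ones):

```python
import pathlib, subprocess, tempfile

# A minimal base image plus only the pinned Bioconda packages the analysis needs.
dockerfile = """\
FROM continuumio/miniconda3
RUN conda install --yes --channel bioconda --channel conda-forge \\
    samtools=1.3.1 bwa=0.7.15
"""

with tempfile.TemporaryDirectory() as ctx:
    pathlib.Path(ctx, "Dockerfile").write_text(dockerfile)
    subprocess.run(["docker", "build", "-t", "my-analysis:2016-06", ctx], check=True)
```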

Slide 63

Slide 63 text

Increasingly precise environment control:
Tool and dependency binaries, built in a minimal environment with controlled libs
Docker container defines the minimum environment
Virtual machine (KVM, Xen, …) controls the kernel and apparent hardware environment

Slide 64

Slide 64 text

Sharing tools and workflows beyond Galaxy

Slide 65

Slide 65 text

https://www.commonwl.org

Slide 66

Slide 66 text

Common Workflow Language
“Multi-vendor working group … enable data scientists to describe analysis tools and workflows that are powerful, easy to use, portable, and support reproducibility.”
Schemas for describing tools and workflows in a workflow system agnostic manner
Containers as default dependency resolution mechanism
Work in progress, currently Draft 3
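
As a flavor of what such a tool description looks like, a minimal hedged sketch (written in the later v1.0-style syntax rather than Draft 3; file names are placeholders), executed with the cwltool reference runner:

```python
import json, pathlib, subprocess

# A CommandLineTool description: declares the base command, one File input,
# and captures stdout as the output.
pathlib.Path("count_lines.cwl").write_text("""\
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
stdout: line_count.txt
inputs:
  infile:
    type: File
    inputBinding: {position: 1}
outputs:
  line_count:
    type: stdout
""")

# The job document binds concrete data to the declared inputs.
pathlib.Path("job.json").write_text(
    json.dumps({"infile": {"class": "File", "path": "reads.txt"}}))

subprocess.run(["cwltool", "count_lines.cwl", "job.json"], check=True)
```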

Slide 67

Slide 67 text

Common Workflow Language: ~10 implementations
cwltool (reference), Rabix, Arvados, Galaxy, Toil (BD2K Translational Genomics), Taverna, …
Get involved! More communities will help to ensure a general and expressive approach

Slide 68

Slide 68 text

No need to be stuck with just one system

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

+

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

No content

Slide 73

Slide 73 text

No content

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

Galaxy Interactive Environments
General framework supports environments other than Jupyter (e.g. RStudio)
Problems with the notebook model: history can be edited! Only reproducible when all cells are rerun
Goal: keep a complete history (provenance graph) for every dataset generated from a notebook, preserving Galaxy’s provenance guarantees
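
One mitigation, sketched here (the notebook file name is a placeholder): re-execute the notebook from top to bottom so the saved copy reflects a single linear run:

```python
import subprocess

# Re-run every cell in order and save the executed copy; the result reflects
# one clean linear run rather than an edited, out-of-order session.
subprocess.run(
    ["jupyter", "nbconvert", "--to", "notebook", "--execute",
     "--output", "analysis_rerun.ipynb", "analysis.ipynb"],
    check=True,
)
```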

Slide 79

Slide 79 text

Reproducibility is possible

Slide 80

Slide 80 text

Even partial reproducibility is better than nothing. Striving for reproducibility makes methods more transparent and understandable, leading to better science.

Slide 81

Slide 81 text

Reproducibility is possible, so why is it not the norm?
Doing it right is slightly more difficult than not
Analysts don’t know how to do it right
Fear of being critiqued: “my code is too ugly”, “why hold myself to a higher standard?”

Slide 82

Slide 82 text

Tools can only fix so much of the problem
Need to create an expectation of reproducibility
Require authors to make their work reproducible as part of the peer review process
Need to educate analysts

Slide 83

Slide 83 text

Reproducibility is only one part of research integrity. Need widespread education on how to conduct computational analyses that are correct and transparent.

Slide 84

Slide 84 text

Science culture needs to adapt. Mistakes will be made! Need to create an environment where researchers are willing to be open and transparent enough that these mistakes are found. Research should be subject to continuous, constructive, and open peer review.

Slide 85

Slide 85 text

(and, bring back methods sections!)

Slide 86

Slide 86 text

Acknowledgements
Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Carl Eberhard, Jeremy Goecks, Björn Grüning, Aysam Guerler, Jennifer Hillman-Jackson, Anton Nekrutenko, Eric Rasche, Nicola Soranzo, Nitesh Turaga, Marius van den Beek
UiO: Geir Kjetil Sandve and Eivind Hovig
JHU Data Science: Jeff Leek, Roger Peng, …
Bioconda: Johannes Köster, Björn Grüning, Ryan Dale, Andreas Sjödin, Adam Caprez, Chris Tomkins-Tinch, Brad Chapman, Alexey Strokach, …
CWL: Peter Amstutz, Robin Andeer, Brad Chapman, John Chilton, Michael R. Crusoe, Roman Valls Guimerà, Guillermo Carrasco Hernandez, Sinisa Ivkovic, Andrey Kartashov, John Kern, Dan Leehr, Hervé Ménager, Maxim Mikheev, Tim Pierce, Josh Randall, Stian Soiland-Reyes, Luka Stojanovic, Nebojša Tijanić
Everyone I forgot…

Slide 87

Slide 87 text

(fin)