Slide 1

Slide 1 text

Reproducibility of computational workflows is automated using continuous analysis
Brett K Beaulieu-Jones, Casey S Greene
Nature Biotechnology, vol. 35, no. 4, pp. 342-346, 2017.
April 20th, 2017
Ph.D. Student Kento Aoyama
Akiyama Laboratory, Department of Computer Science, School of Computing, Tokyo Institute of Technology

Slide 2

Slide 2 text

Journal Information
Nature Biotechnology
• Top scientific journal in the biological, biomedical, agricultural and environmental sciences
• 2-year IF: 43.113 (2016)
• e.g.) Nature, IF = 38.138 (2016)
Source: http://www.nature.com/npg_/company_info/journal_metrics.html
nature biotechnology, April 2017, vol.35 no.4

Slide 3

Slide 3 text

Authors Information
Brett K Beaulieu-Jones (1), Casey S Greene (2)
1. Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania (Twitter: @beaulieujones)
2. Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania (Twitter: @GreeneScientist)

Slide 4

Slide 4 text

What is the value of this research?
Target Problem
Reproducibility of computational research
Proposed Method
Continuous Integration + Computational Research = Continuous Analysis
Continuous Analysis can automatically verify research reproducibility
• Easy to reproduce, review, and collaborate
[GitHub] https://greenelab.github.io/continuous_analysis/

Slide 5

Slide 5 text

Outline
1. Background
2. Result (Survey)
3. Proposed Method (Architecture)
4. Experiments
5. Discussion, Conclusion

Slide 6

Slide 6 text

Background | Reproducibility Crisis
Research reproducibility is crucial for science.
But 90% of researchers acknowledged a reproducibility crisis [1].
[1] Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).

Slide 7

Slide 7 text

Background | Reproducibility Spectrum
Reproducibility Problems
• lack of experimental details
  • data, parameters, code, etc.
• lack of machine environment information
  • software versions, libraries, operating systems, etc.
Computational research should be reproducible.
Peng, R.D. Reproducible research in computational science. Science 334, 1226–1227 (2011).
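The environment details listed above (software versions, libraries, operating system) can be recorded next to the results. The following is a minimal sketch using only the Python standard library; the function name `capture_environment` and the package names are illustrative assumptions, not something taken from the paper.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Collect OS, interpreter, and package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "executable": sys.executable,
        "packages": versions,
    }


if __name__ == "__main__":
    # "numpy" and "pandas" are example names; list whatever the analysis imports.
    print(json.dumps(capture_environment(["numpy", "pandas"]), indent=2))
```

Storing this JSON with each result makes it possible to see later which versions produced it, although it does not pin those versions the way a container image does.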

Slide 8

Slide 8 text

Background | Reproducibility in Biology
18 articles, published in Nature Genetics (2005, 2006)
• could not be reproduced (10 articles, 56%)
• could be reproduced with discrepancies (6 articles, 33%)
Ioannidis, J.P.A. et al. “Repeatability of published microarray gene expression analyses”, Nat. Genet. 41, 149–155 (2009)

Slide 9

Slide 9 text

Result (Survey)

Slide 10

Slide 10 text

Reproducibility on RNA-Analysis
Survey of Differential Gene Expression Research
• Probe information is necessary for reproduction
  • A probe is an oligonucleotide with a specific sequence, used to measure transcript expression levels
BrainArray Custom CDF [1]
• A popular source of probe set description files
• Published and maintained by Dai, M. et al. [1]
• Knowing the Custom CDF version lets readers verify the detailed probe set information
The authors analyzed 200 articles that cited Dai, M. et al. [1].
[1] Dai, M. et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 33, e175 (2005).

Slide 11

Slide 11 text

Reporting of Custom CDF in articles
a) Most recent 100 articles
• 51% of articles did NOT report the version of the Custom CDF
b) 100 most-cited articles
• 64% of articles did NOT report the version of the Custom CDF
cannot download (14 Nov. 2016)

Slide 12

Slide 12 text

Effects on Analysis Result
How different versions affect the analysis result
To measure the effects,
• download different versions of the Custom CDF
• use the same data set
  • normal HeLa cells and HeLa cells in which TIA1 and TIAR (TIAL1) were knocked down
Comparing the results
• same source code
• same data set
• different versions of BrainArray Custom CDF (18, 19, 20)
• different versions of software packages

Slide 13

Slide 13 text

Figure 2a. Differential gene expression analysis of HeLa cells
Each version identified a different number of significantly altered genes.
• e.g.) 15 genes were identified as significant in version 19, but not in version 18.
…
Analysis results are NOT reproducible without the exact versions of the software and data set.
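The version-to-version comparison above comes down to set differences between lists of significant genes. Here is a minimal sketch of that comparison; the gene identifiers are placeholders, NOT results from the paper, and the function name `compare_significant_genes` is hypothetical.

```python
def compare_significant_genes(results):
    """For each ordered pair of versions, list genes significant in the first but not the second."""
    differences = {}
    for version, genes in results.items():
        for other, other_genes in results.items():
            if version != other:
                differences[(version, other)] = sorted(genes - other_genes)
    return differences


# Placeholder gene identifiers, NOT results from the paper.
results = {
    "v18": {"GENE_A", "GENE_B"},
    "v19": {"GENE_A", "GENE_B", "GENE_C"},
    "v20": {"GENE_A", "GENE_C", "GENE_D"},
}

for (version, other), genes in compare_significant_genes(results).items():
    print(f"significant in {version} but not in {other}: {genes}")
```

A non-empty difference for any pair signals that the reported gene list depends on the Custom CDF version, which is why the version has to be stated.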

Slide 14

Slide 14 text

Figure 2b. Container-based approaches
Using Docker [1] containers improves reproducibility
• Docker can create an “image” that contains the software environment
• Docker allows users to run exactly the same applications in any environment
Using Docker containers allowed the versions to be matched and produced the same result.
[1] https://www.docker.com

Slide 15

Slide 15 text

Docker for reproducible workflow
• Docker is useful for reproducible workflows
  • same versions of software
  • same version of the data set
  • isolation from the host OS software environment
• Image tags are useful for managing software releases and paper revisions.
Supplementary Information
• Material on Docker (Container Virtualization) is attached at the end of these slides.

Slide 16

Slide 16 text

Proposed Method

Slide 17

Slide 17 text

Continuous Analysis
Resolving the Reproducibility Problem
To avoid version problems with data & software
• Docker can share an executable container that contains the data & software
But sometimes the software needs to be upgraded, and then the results must be checked again.
Automatic verification is needed.
An automatic & verifiable software development approach: Continuous Integration (CI)

Slide 18

Slide 18 text

About Continuous Integration
Continuous Integration (CI) [1]
• is a software engineering practice for fast development
• automatically builds, runs tests, and produces analytics, triggered by a version control system (e.g. git)
e.g.) Travis CI [2] badge
[1] Booch, Grady (1991). Object Oriented Design: With Applications. Benjamin Cummings. p. 209. ISBN 9780805300918.
[2] Travis CI, https://travis-ci.org/

Slide 19

Slide 19 text

e.g.) Travis CI
1. A developer pushes commits to the repository
2. The test script is executed automatically on the CI service
3. The test result is generated automatically
e.g.) https://github.com/galaxyproject/galaxy

Slide 20

Slide 20 text

e.g.) CI on Product Development
e.g.) Xamarin Test Cloud
figure: https://developer.xamarin.com/guides/cross-platform/ci/intro_to_ci/

Slide 21

Slide 21 text

Continuous Analysis
• Docker provides environment reproducibility
  • same version of the data set
  • same version of software
  • easy to build the environment (Dockerfile)
  • easy to share the environment (Docker Hub)
• Continuous Analysis can verify the reproducibility of computational research
  • automatically tests the reproducibility
  • automatically updates results
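One simple way to test reproducibility automatically is to fingerprint the output files of each run and report which ones changed. The sketch below does this with SHA-256 digests. It is an illustration under stated assumptions, not the mechanism used in the paper, and the names `hash_outputs` and `changed_outputs` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_outputs(output_dir):
    """Map each file under output_dir (relative path) to its SHA-256 digest."""
    root = Path(output_dir)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_outputs(previous, current):
    """Return the sorted names of files that were added, removed, or modified."""
    names = set(previous) | set(current)
    return sorted(name for name in names if previous.get(name) != current.get(name))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        result = Path(tmp) / "result.csv"
        result.write_text("gene,p_value\nGENE_A,0.01\n")
        before = hash_outputs(tmp)
        result.write_text("gene,p_value\nGENE_A,0.02\n")  # simulate a changed result
        after = hash_outputs(tmp)
        print(changed_outputs(before, after))
```

Byte-level comparison is strict: outputs that embed timestamps or other run-specific metadata will be reported as changed even when the analysis result is the same.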

Slide 22

Slide 22 text

Fig.3 Continuous Analysis Workflow

Slide 23

Slide 23 text

Workflow
1. Push source code changes
2. (Generate the base Docker image from the Dockerfile)
3. Read parameters and commands from YAML files
   • Users can describe and execute any commands using YAML, e.g.) pre-processing, data analysis, etc.
4. Write the outputs to another branch
   • result data, figures, logs (managed in VCS)
5. Update the latest Docker image
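Step 3 of the workflow above (running user-defined commands in order) can be sketched as follows. This is a simplified illustration, not Drone's implementation: the step list is hard-coded instead of being read from a YAML file, and the step names and the `run_steps` function are hypothetical.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (name, command) step in order; stop at the first failure."""
    log = []
    for name, command in steps:
        completed = subprocess.run(command, capture_output=True, text=True)
        log.append((name, completed.returncode, completed.stdout.strip()))
        if completed.returncode != 0:
            break  # later steps depend on earlier ones
    return log


# Hypothetical steps; a real setup would read them from the CI YAML file.
steps = [
    ("pre-process", [sys.executable, "-c", "print('cleaned data')"]),
    ("analysis", [sys.executable, "-c", "print('analysis done')"]),
]

for name, code, output in run_steps(steps):
    print(f"{name}: exit={code} output={output}")
```

Stopping at the first non-zero exit code mirrors how a CI build is marked as failed, so a broken pre-processing step never produces published results.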

Slide 24

Slide 24 text

System Components
Drone
• Open-source Continuous Integration software
• https://github.com/drone/drone
• Easy to set up using Docker containers
• (almost the same as other CI services)
GitHub
• Online Git repository hosting
• BitBucket and GitLab are also available

Slide 25

Slide 25 text

Example of YAML file
.drone.yml Example Configuration
https://greenelab.github.io/continuous_analysis/
https://github.com/greenelab/continuous_analysis/blob/master/.drone.yml
    # choose the base docker image
    image: brettbj/continuous_analysis_base
    script:
      # run pre-process
      # run tests
      # perform analysis
      # publish results
    publish:
      docker:
        # docker details

Slide 26

Slide 26 text

Experiments
The authors introduced this system into their own work
• “Denoising Autoencoders for Phenotype Stratification (DAPS): Preprint Release”
• http://doi.org/10.5281/zenodo.46165
They ran two example analyses:
• a phylogenetic tree-building analysis
• an RNA-seq differential expression analysis
(detailed information is in the Online Methods)

Slide 27

Slide 27 text

Experiments Result (Fig.4)
Changes in the output figures are easy to compare

Slide 28

Slide 28 text

Discussion | Conclusion
• Continuous analysis provides verifiable scientific software in a fully specified environment
  • easy to get a reproducible environment using Docker
  • the environment is automatically kept up to date
• It allows reviewers, editors and readers to assess reproducibility without a large time commitment

Slide 29

Slide 29 text

Discussion | Limitations
• It may be impractical to run computationally expensive analyses at every commit
  • A cloud computing environment can resolve this, but it requires auto-provisioning skills
  • CI steps can be skipped by using a registered phrase
• It does not address reproducibility in the broader sense:
  • robustness of results to parameter settings
  • starting conditions
  • partitions in the data
  (these are not targets of this research)
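Skipping CI steps with a registered phrase can be pictured as a check on the commit message. The sketch below assumes the phrases `[ci skip]` and `[skip ci]`, which several CI services recognize. The exact phrases for a given service are an assumption to confirm in its documentation, and `should_run_analysis` is a hypothetical helper.

```python
# Assumed skip phrases; confirm the exact spelling in your CI service's docs.
SKIP_PHRASES = ("[ci skip]", "[skip ci]")


def should_run_analysis(commit_message):
    """Return False when the commit message contains a skip phrase."""
    message = commit_message.lower()
    return not any(phrase in message for phrase in SKIP_PHRASES)


print(should_run_analysis("Fix typo in README [ci skip]"))
print(should_run_analysis("Update normalization step"))
```

This keeps expensive analyses from re-running on commits that only touch documentation, at the cost of relying on the committer to judge when a skip is safe.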

Slide 30

Slide 30 text

Container-based Virtualization
Linux Container
• virtualizes host resources as containers
  • filesystem, hostname, IPC, PID, network, user, etc.
• can be used like virtual machines
Linux Kernel Features
• Containers share the same host kernel
• namespace [1], chroot, cgroup, SELinux, etc.
[Figure: two containers, each running several processes, on top of the shared Linux kernel space of one machine]
[1] E. W. Biederman. “Multiple instances of the global Linux namespaces.”, In Proceedings of the 2006 Ottawa Linux Symposium, 2006.

Slide 31

Slide 31 text

Linux Container Management Tools
Docker [1]
• Most popular Linux container management platform
• Many useful components and services
[Figure: logos of the tools in references [1], [2], and [3]]
[1] Solomon Hykes and others. “What is Docker?” - https://www.docker.com/what-docker
[2] W. Bhimji, S. Canon, D. Jacobsen, L. Gerhardt, M. Mustafa, and J. Porter, “Shifter : Containers for HPC,” Cray User Group, pp. 1–12, 2016.
[3] “Singularity” - http://singularity.lbl.gov/

Slide 32

Slide 32 text

Easy container sharing – Docker Hub
Portability & Reproducibility
• Easy to share the application environment via Docker Hub
• Containers can be executed on other host machines
[Figure: images (App, Bins/Libs) are generated from a Dockerfile (apt-get install …, wget …, make) and shared between an Ubuntu host and a CentOS host, each running Docker Engine, by pushing to and pulling from Docker Hub]

Slide 33

Slide 33 text

Docker - Filesystem
AUFS (Advanced multi layered unification filesystem) [1]
• Docker uses AUFS as its default filesystem
• Layers can be reused in other container images
• AUFS helps software reproducibility
[Figure: a Docker container image built as stacked layers on the ubuntu:16.04 base image. Layers shown: f49eec89601e (129.5 MB), 366a03547595 (39.85 MB), ef122501292c (3.6 MB), e50c89716342 (15.4 KB), 5aec9aa5462c (1.17 MB), 0d3cccd04bdb (1.07 MB). Tags shown: beta, version-1.0, version-1.0.2, version-1.1, latest]
[1] Advanced multi layered unification filesystem. http://aufs.sourceforge.net, 2014.

Slide 34

Slide 34 text

Linux Container – Performance [1]
[Figure: bar chart of the performance ratio relative to native for PXZ [MB/s], Linpack [GFLOPS], and Random Access [GUPS]; series: Native, Docker, KVM, KVM-tuned. Bar labels as extracted: 0.96, 1.00, 0.98, 0.78, 0.83, 0.99, 0.82, 0.98]
[1] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, “An updated performance comparison of virtual machines and Linux containers,” IEEE International Symposium on Performance Analysis of Systems and Software, pp.171-172, 2015. (IBM Research Report, RC25482 (AUS1407-001), 2014.)
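The performance ratio on the chart axis is each platform's result divided by the native result for the same benchmark. Below is a minimal sketch of that normalization for higher-is-better metrics such as throughput; the numbers are placeholders, NOT values from Felter et al., and `performance_ratio` is a hypothetical helper.

```python
def performance_ratio(measured, native):
    """Normalize a higher-is-better measurement by the native (bare-metal) result."""
    if native <= 0:
        raise ValueError("native result must be positive")
    return measured / native


# Placeholder throughputs, NOT values from Felter et al.
native_throughput = 100.0
measurements = {"Docker": 96.0, "KVM": 80.0}

for platform_name, value in measurements.items():
    print(platform_name, round(performance_ratio(value, native_throughput), 2))
```

A ratio close to 1.00 means the platform adds little overhead compared with running directly on the host.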