Journal Seminar: Reproducibility of computational workflows is automated using continuous analysis

journal seminar in Akiyama-Lab@Tokyo Tech (http://www.bi.cs.titech.ac.jp/)
(2017-04-20)

> B. K. Beaulieu-Jones and C. S. Greene, “Reproducibility of computational workflows is automated using continuous analysis,” Nature Biotechnology, vol. 35, no. 4, pp. 342–346, 2017.
> http://www.nature.com/nbt/journal/v35/n4/full/nbt.3780.html

metaVariable

April 21, 2017

Transcript

  1. Reproducibility of computational workflows is automated using continuous analysis

    Brett K Beaulieu-Jones, Casey S Greene. Nature Biotechnology, vol. 35, no. 4, pp. 342-346, 2017.
    April 20th, 2017
    Ph.D. Student Kento Aoyama, Akiyama Laboratory, Department of Computer Science, School of Computing, Tokyo Institute of Technology
  2. Journal Information

    Nature Biotechnology
    • Top scientific journal in biological, biomedical, agricultural and environmental sciences
    • 2-year IF: 43.113 (2016)
    • For comparison, Nature: IF = 38.138 (2016)
    Source: http://www.nature.com/npg_/company_info/journal_metrics.html
    (nature biotechnology, April 2017, vol. 35 no. 4)
  3. Authors Information

    Brett K Beaulieu-Jones (1), Casey S Greene (2)
    (1) Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania (Twitter: @beaulieujones)
    (2) Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania (Twitter: @GreeneScientist)
  4. What is the value of this research?

    Target Problem: reproducibility of computational research
    Proposed Method: Continuous Integration + Computational Research = Continuous Analysis
    Continuous Analysis can automatically verify the reproducibility of research
    • Easy to reproduce, review, and collaborate
    [GitHub] https://greenelab.github.io/continuous_analysis/
  5. Outline

    1. Background
    2. Result (Survey)
    3. Proposed Method (Architecture)
    4. Experiments
    5. Discussion, Conclusion
  6. Background | Reproducibility Crisis

    Research reproducibility is crucial for science, but 90% of researchers acknowledged a reproducibility crisis [1].
    [1] Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).
  7. Background | Reproducibility Spectrum

    Reproducibility Problems
    • Lack of experimental details: data, parameters, code, etc.
    • Lack of machine environment information: software versions, libraries, operating systems, etc.
    Computational research should be reproducible.
    Peng, R.D. Reproducible research in computational science. Science 334, 1226–1227 (2011).
  8. Background | Reproducibility in Biology

    18 articles published in Nature Genetics (2005, 2006):
    • could not be reproduced (10 articles, 56%)
    • could be reproduced with discrepancies (6 articles, 33%)
    Ioannidis, J.P.A. et al. “Repeatability of published microarray gene expression analyses”, Nat. Genet. 41, 149–155 (2009)
  9. Reproducibility on RNA-Analysis

    Survey of Differential Gene Expression Research
    • Probe information is necessary for reproduction
    • A probe is an oligonucleotide of a specific sequence that is used to measure transcript expression levels
    BrainArray Custom CDF [1]
    • A popular source of probe set description files
    • Published and maintained by Dai, M. et al.
    • Knowing the Custom CDF version makes it possible to verify the detailed probe set information
    The authors analyzed 200 articles that cited Dai, M. et al. [1].
    [1] Dai, M. et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 33, e175 (2005).
  10. Reporting of Custom CDF in articles

    a) Most recent 100 articles: 51% of articles did NOT report the version of Custom CDF
    b) Highest cited 100 articles: 64% of articles did NOT report the version of Custom CDF
    cannot download (14 Nov. 2016)
  11. Effects on Analysis Result

    How different versions affect the analysis result
    To measure the effects:
    • download different versions of the Custom CDF
    • use the same data set: normal HeLa cells and HeLa cells in which TIA1 and TIAR (TIAL1) were knocked down
    Comparing the results:
    • same source code
    • same data set
    • different versions of BrainArray Custom CDF (18, 19, 20)
    • different versions of software packages
  12. Figure 2a. Differential gene expression analysis of HeLa cells

    Each version identified a different number of significantly altered genes.
    • e.g., 15 genes were identified as significant in version 19, but not in version 18. …
    Analysis results are NOT reproducible without the exact versions of the software and data set.
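The comparison on this slide can be illustrated with a minimal Python sketch. The gene names and set contents below are hypothetical placeholders, not results from the paper; only the idea (same code and data, different annotation versions, different sets of significant genes) comes from the slide.

```python
# Hypothetical sets of significantly altered genes, keyed by Custom CDF version.
# These names are placeholders for illustration, not data from the paper.
significant_by_version = {
    "v18": {"GENE_A", "GENE_B", "GENE_C"},
    "v19": {"GENE_A", "GENE_B", "GENE_D", "GENE_E"},
    "v20": {"GENE_A", "GENE_D", "GENE_E", "GENE_F"},
}


def version_differences(results, old, new):
    """Return (gained, lost): genes significant only in `new`, and only in `old`."""
    gained = results[new] - results[old]
    lost = results[old] - results[new]
    return gained, lost


gained, lost = version_differences(significant_by_version, "v18", "v19")
print("significant in v19 but not v18:", sorted(gained))
print("significant in v18 but not v19:", sorted(lost))
```

A non-empty `gained` or `lost` set is the kind of discrepancy that an unreported version can hide.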
  13. Figure 2b. Container-based approaches

    Using Docker [1] containers improves reproducibility
    • Docker can create an “image” that contains the software environment
    • Docker allows users to run the exact same applications in any environment
    Using Docker containers allowed the versions to be matched and produced the same result.
    [1] https://www.docker.com
  14. Docker for reproducible workflow

    • Docker is useful for reproducible workflows
      • same versions of software
      • same version of dataset
      • isolation from the host OS software environment
    • Image tags are useful for managing software releases and paper revisions.
    Supplementary Information
    • Docker (Container Virtualization) is covered at the end of these slides.
  15. Resolving the Reproducibility Problem

    To avoid problems with versions of data & software
    • Docker can share an executable container that contains the data & software
    But sometimes we need to upgrade the software, and then the result must be checked again. Automatic verification is needed.
    An automatic & verifiable software development approach: Continuous Integration (CI)
    Continuous Analysis
  16. About Continuous Integration

    Continuous Integration (CI) [1]
    • is a software engineering practice for fast development
    • automatically builds, runs tests, and produces analytics, triggered by a version control system (e.g. git)
    e.g.) Travis CI [2] badge
    [1] Booch, G. (1991). Object Oriented Design: With Applications. Benjamin Cummings. p. 209. ISBN 9780805300918. Retrieved 2014-08-18.
    [2] Travis CI, https://travis-ci.org/
  17. e.g.) Travis CI

    1. Developer pushes commits to the repository
    2. Test script is executed automatically on the CI service
    3. Test result is generated automatically
    e.g.) https://github.com/galaxyproject/galaxy
  18. Continuous Analysis

    • Docker provides environment reproducibility
      • same version of dataset
      • same version of software
      • easy to build the environment (Dockerfile)
      • easy to share the environment (Docker Hub)
    • Continuous Analysis can verify the reproducibility of computational research
      • automatically tests the reproducibility
      • automatically updates results
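One way to picture "automatically tests the reproducibility" is to compare content hashes of the analysis outputs between runs. The Python sketch below is an illustration under that assumption, not the authors' implementation; the helper names (`file_digest`, `snapshot`, `changed_outputs`) and the file `result.csv` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(output_dir):
    """Map each file under output_dir (by relative path) to its digest."""
    root = Path(output_dir)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_outputs(expected, actual):
    """List outputs that were added, removed, or whose contents differ."""
    names = set(expected) | set(actual)
    return sorted(n for n in names if expected.get(n) != actual.get(n))


# Demo: a temporary directory stands in for the analysis output directory.
with tempfile.TemporaryDirectory() as tmp:
    result = Path(tmp) / "result.csv"
    result.write_text("gene,p_value\nGENE_A,0.01\n")
    baseline = snapshot(tmp)

    # Re-running the same analysis should reproduce identical bytes.
    result.write_text("gene,p_value\nGENE_A,0.01\n")
    unchanged = changed_outputs(baseline, snapshot(tmp))

    # A dependency change that alters the output is flagged.
    result.write_text("gene,p_value\nGENE_A,0.20\n")
    changed = changed_outputs(baseline, snapshot(tmp))

print("after identical re-run:", unchanged)
print("after altered output:", changed)
```

Byte-level hashing is a strict check; analyses with non-deterministic output (timestamps, random seeds) would need a looser comparison.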
  19. Workflow

    1. Push source code changes
    2. (Generate the base Docker image from the Dockerfile)
    3. Read parameters and commands from YAML files
      • Users can describe and execute any commands using YAML, e.g., pre-processing, data analysis, etc.
    4. Write the outputs to another branch
      • result data, figures, logs (managed in VCS)
    5. Update the latest Docker image
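Step 3 of the workflow above (read commands from a configuration, then execute them in order) can be sketched in Python as follows. This is a simplified illustration, not how Drone works internally: the `pipeline` dictionary is a hypothetical stand-in for an already parsed YAML file, and the commands are trivial placeholders for pre-processing and analysis steps.

```python
import subprocess
import sys

# Hypothetical stand-in for a parsed YAML configuration.
# Each entry is one command; real steps would run pre-processing, tests, analysis.
pipeline = {
    "script": [
        [sys.executable, "-c", "print('pre-processing done')"],
        [sys.executable, "-c", "print('analysis done')"],
    ]
}


def run_pipeline(config):
    """Run each command in order, stop at the first failure, and collect the logs."""
    logs = []
    for command in config["script"]:
        # check=True raises CalledProcessError if a step exits with a non-zero status.
        completed = subprocess.run(command, capture_output=True, text=True, check=True)
        logs.append(completed.stdout.strip())
    return logs


logs = run_pipeline(pipeline)
for line in logs:
    print(line)
```

Stopping at the first failing step mirrors how a CI run is marked as failed; the collected logs correspond to the outputs that step 4 stores alongside the results.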
  20. System Components

    Drone
    • Open source Continuous Integration software
    • https://github.com/drone/drone
    • Easy to set up using a Docker container
    • (almost the same as other CI services)
    GitHub
    • Online Git repository
    • BitBucket and GitLab are also available
  21. Example of YAML file

    .drone.yml Example Configuration
    https://greenelab.github.io/continuous_analysis/
    https://github.com/greenelab/continuous_analysis/blob/master/.drone.yml

        # choose the base docker image
        image: brettbj/continuous_analysis_base
        script:
          # run pre-process
          # run tests
          # perform analysis
          # publish results
        publish:
          docker:
            # docker details
  22. Experiments

    The authors introduced this system into their own work
    • “Denoising Autoencoders for Phenotype Stratification (DAPS): Preprint Release”
    • http://doi.org/10.5281/zenodo.46165
    They run 2 example analyses:
    • a phylogenetic tree–building analysis
    • an RNA-seq differential expression analysis
    (detailed information is in the Online Methods)
  23. Discussion | Conclusion

    • Continuous analysis provides verifiable scientific software in a fully specified environment
      • easy to get a reproducible environment using Docker
      • the environment is automatically kept up to date
    • It allows reviewers, editors and readers to assess reproducibility without a large time commitment
  24. Discussion | Limitations

    • It may be impractical to run computationally large analyses at every commit
      • A cloud computing environment can resolve this, but it requires auto-provisioning skills
      • It is possible to skip CI steps using a registered phrase
    • It does not address reproducibility in the broader sense:
      • robustness of results to parameter settings
      • starting conditions
      • partitions in the data
      (these are not the target of this research)
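Skipping CI steps with a registered phrase usually means the CI service inspects the commit message before starting a run. The Python sketch below illustrates that check only; the phrases `[ci skip]` and `[skip ci]` are assumed here as common conventions, and the exact phrases a given service honors should be taken from its documentation.

```python
# Assumed skip phrases; check the CI service's documentation for the real list.
SKIP_PHRASES = ("[ci skip]", "[skip ci]")


def should_run_analysis(commit_message, skip_phrases=SKIP_PHRASES):
    """Return False when the commit message contains any registered skip phrase."""
    message = commit_message.lower()
    return not any(phrase in message for phrase in skip_phrases)


for msg in ("Update analysis parameters", "Fix typo in README [ci skip]"):
    print(repr(msg), "->", "run" if should_run_analysis(msg) else "skip")
```

This keeps expensive analyses from re-running on commits that cannot change the results, such as documentation fixes.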
  25. Container-based Virtualization

    Linux Container
    • virtualizes host resources as containers
      • Filesystem, hostname, IPC, PID, Network, User, etc.
    • can be used like virtual machines
    Linux Kernel Features
    • Containers share the same host kernel
    • namespace [1], chroot, cgroup, SELinux, etc.
    [Diagram: a machine whose Linux kernel space hosts two containers, each running two processes]
    [1] E. W. Biederman. “Multiple instances of the global Linux namespaces.”, In Proceedings of the 2006 Ottawa Linux Symposium, 2006.
  26. Linux Container Management Tools

    Docker [1]
    • Most popular Linux container management platform
    • Many useful components and services
    [Logos: [1], [2], [3]]
    [1] Solomon Hykes and others. “What is Docker?” - https://www.docker.com/what-docker
    [2] W. Bhimji, S. Canon, D. Jacobsen, L. Gerhardt, M. Mustafa, and J. Porter, “Shifter : Containers for HPC,” Cray User Group, pp. 1–12, 2016.
    [3] “Singularity” - http://singularity.lbl.gov/
  27. Easy container sharing – Docker Hub

    Portability & Reproducibility
    • Easy to share the application environment via Docker Hub
    • Containers can be executed on other host machines
    [Diagram: a Dockerfile (apt-get install …, wget …, …, make) generates an image (App, Bins/Libs); images are shared between an Ubuntu host and a CentOS host, each running Docker Engine, by push and pull via Docker Hub]
  28. Docker - Filesystem

    AUFS (Advanced multi layered unification filesystem) [1]
    • Docker uses AUFS as its default filesystem
    • Layers can be reused in other container images
    • AUFS helps software reproducibility
    [Diagram: a Docker container image built as layers on top of ubuntu:16.04 (base image); layers f49eec89601e (129.5 MB), 366a03547595 (39.85 MB), ef122501292c (3.6 MB), e50c89716342 (15.4 KB), 5aec9aa5462c (1.17 MB), 0d3cccd04bdb (1.07 MB); tags: beta, version-1.0, version-1.0.2, version-1.1, latest]
    [1] Advanced multi layered unification filesystem. http://aufs.sourceforge.net, 2014.
  29. Linux Container – Performance [1]

    Performance ratio (relative to Native = 1.00):

    Benchmark              Docker   KVM    KVM-tuned
    PXZ [MB/s]             0.96     0.78   0.82
    Linpack [GFLOPS]       1.00     0.83   0.98
    Random Access [GUPS]   0.98     0.99   (not shown)

    [1] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, “An updated performance comparison of virtual machines and Linux containers,” IEEE International Symposium on Performance Analysis of Systems and Software, pp.171-172, 2015. (IBM Research Report, RC25482 (AUS1407-001), 2014.)