Meeting 2019
Kento Aoyama, Ph.D. Student
Research Assistant, RWBC-OIL (group: 2-3)
Akiyama Laboratory, Dept. of Computer Science, Tokyo Institute of Technology
March 8th, 2019
Introduction | Purpose and Motivation
been expected as something that can help our research work related to software environments. But why do we need to use it? What does it contribute to? We show the contributions of containers to computational science through a survey of container use in bioinformatics on HPC environments.
We present the following:
• A brief introduction to container virtualization
• Container use and its advantages in bioinformatics
• Case studies of containers in bioinformatics on HPC environments
Introduction | Note
File I/O, Memory I/O, Network I/O, etc.
• details about native MPI/GPU support
• Prof. D. K. Panda talked about this at the RWBC-OIL Workshop a year ago [Zhang, et al. 2017]
• “Singularity-based container technology can achieve near-native performance”
• “Singularity has very little overhead for running MPI-based HPC applications on both Omni-Path and InfiniBand networks”
• “container packages an application with all of its dependencies into a standardized unit for software development, shipment”
• without kernel, virtual devices
• better performance than normal VMs
• “… executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings”
• good portability, enough compatibility
https://www.docker.com/resources/what-container
Core features of the Linux container
cgroups
• restricting system resources of process groups (CPU, memory, drive, etc.)
namespaces [Biederman, 2006]
• isolating system resources between host and guest (filesystem, hostname, IPC, PID, network, user, etc.)
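The two kernel features above can be inspected from an ordinary process. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function names are hypothetical and it only reads what the kernel exposes, it does not create containers. On other systems both functions simply return empty results.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace name: identifier} for a process, or {} if unavailable."""
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):
        return {}  # not Linux, or /proc is not mounted
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            # Each entry is a symlink such as "pid:[4026531836]".
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry not readable for this process
    return result


def list_cgroups(pid="self"):
    """Return the raw cgroup membership lines of a process, or [] if unavailable."""
    path = f"/proc/{pid}/cgroup"
    try:
        with open(path) as handle:
            return [line.strip() for line in handle if line.strip()]
    except OSError:
        return []


if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print("namespace", name, ident)
    for line in list_cgroups():
        print("cgroup", line)
```

Two processes that print the same identifier for a namespace share it; a process inside a container typically shows different identifiers from the host for the isolated resources.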
Reproducibility Crisis
acknowledged reproducibility problem
Baker, M., “1,500 scientists lift the lid on reproducibility”, Nature 533, 452–454 (2016).
Reproducibility Spectrum
parameters, code, etc.
• lack of machine environment information
• software versions, libraries, operating systems, etc.
Computational research should be reproducible
Peng, R.D., “Reproducible research in computational science”, Science 334, 1226–1227 (2011).
Application Reproducibility
is a threat to research reproducibility, especially in computational science
• e.g., Peng, R.D., “Reproducible research in computational science”, Science 334, 1226–1227 (2011).
• Recently, scientists have become interested in container virtualization as a lightweight and useful virtualization technology for improving research reproducibility.
[Figure: Reproducibility Crisis. Env. A (Application, Library version 1.2) gives Result; Env. B (Application, Library version 1.3) gives Result’, a different result]
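The Env. A / Env. B situation above can be made concrete by recording each environment and comparing the records. This is a minimal sketch under assumptions: the record layout, the function names, and the application version "1.0" are hypothetical; only the library versions 1.2 and 1.3 come from the figure.

```python
import platform


def capture_environment():
    """Record basic facts about the machine running this script."""
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
    }


def diff_environments(env_a, env_b):
    """Return {key: (value in A, value in B)} for every key that differs."""
    keys = sorted(set(env_a) | set(env_b))
    return {
        key: (env_a.get(key), env_b.get(key))
        for key in keys
        if env_a.get(key) != env_b.get(key)
    }


# Illustrative records matching the figure; "application" is a made-up entry.
env_a = {"application": "1.0", "library": "1.2"}
env_b = {"application": "1.0", "library": "1.3"}

if __name__ == "__main__":
    print(diff_environments(env_a, env_b))  # {'library': ('1.2', '1.3')}
    print(capture_environment())
```

A container image plays the role of such a record in executable form: instead of describing the library version, it ships it.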
distribution [D.V. Leprevost F., et al. 2017]
• providing executable bioinformatics containers and build recipes
• currently, the project integrates the BioConda project and automates container building for its packages
Continuous Analysis
science workflow enables researchers to reproduce results without contacting the study authors.
B. K. Beaulieu-Jones and C. S. Greene, “Reproducibility of computational workflows is automated using continuous analysis”, Nature Biotechnology, vol. 35, no. 4, pp. 342–346, 2017.
& TSUBAME3.0
AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL)
Case 2 | Singularity and BioContainer on National Institute of Genetics
SuperComputer Facilities of National Institute of Genetics
Case 3 | Containers for Reproducible Pipeline Analysis (e.g. Nextflow)
Centre for Genomic Regulation (Barcelona, Spain)
Case 4 | Protein-Protein Interaction prediction on Public Clouds and HPC
Akiyama Laboratory, Tokyo Institute of Technology
“TSUBAME3.0 and ABCI: Supercomputer Architecture for HPC and AI/BD Convergence”, GTC2017, 2017.
https://www.t3.gsic.titech.ac.jp/applications
The “Singularity” container is now available on TSUBAME3.0
Many thanks to Prof. Endo and Dr. Nomura!
(or build) a container image from a registry service (e.g. DockerHub, SingularityHub), then save it as a local file (.simg)
2. Execute singularity commands (exec/run/shell) to run a container image
[Figure: workflow diagram with labels 1. Registry Service, Local file (.simg), Local Machine, Container, Application; 2. User script, Login node, Job Scheduler, Compute node, Singularity, Local file (.simg)]
• Computational resource management is the same as before
• Singularity is compatible with the Docker image format
• we can use any Docker images on HPC environments! *
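The two steps above can be sketched as command construction. This is a minimal illustration, assuming the standard `singularity pull` and `singularity exec` subcommands; the image URI, the local file name, and the helper names are hypothetical, and scheduler directives are omitted because they are site-specific. Nothing is executed, so the sketch runs on a machine without Singularity installed.

```python
import shlex


def pull_command(image_uri):
    """Step 1: fetch an image from a registry and store it as a local file."""
    return ["singularity", "pull", image_uri]


def exec_command(image_path, command):
    """Step 2: run a command inside the container image."""
    return ["singularity", "exec", image_path] + list(command)


def job_script(image_path, command):
    """Wrap step 2 in a shell script that a job scheduler could submit."""
    line = " ".join(shlex.quote(part) for part in exec_command(image_path, command))
    return "#!/bin/sh\n" + line + "\n"


if __name__ == "__main__":
    # Hypothetical image and file name, chosen only for illustration.
    print(" ".join(pull_command("docker://ubuntu:18.04")))
    print(job_script("ubuntu.simg", ["cat", "/etc/os-release"]))
```

On a login node, the first command would be run once; the generated script would then be submitted through the usual scheduler, which is why resource management stays the same as before.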
Genetics
… following container images are available on this system:
• user-custom Singularity images built on another system
• container images downloaded from Docker Hub, Singularity Hub, NGC, etc.
• community container images provided by bioinformatics projects (e.g. BioContainers), etc.
https://sc2.ddbj.nig.ac.jp/index.php/singularity
Genetics
https://sc2.ddbj.nig.ac.jp/index.php/available-biotools
• Pre-downloaded container images are available and shared on the system
• container images (approx. 9000) provided by the BioContainers project
• Container images are updated on a regular basis (at least once every 6 months)
Case 3 | Containers for Reproducible Pipeline Analysis
language) based pipeline software for omics analysis
• provides various functions to support result reproducibility
• recommends using containers (Docker, Singularity) in pipeline executions
Case 3 | Containers for Reproducible Pipeline Analysis
us to virtualize the analytics processes
• avoid conflicts of library dependencies between tools
• keep reproducibility by saving container images and logs
• users can specify the tool versions (= image tag), data, etc.
[Figure: on the Host System, Container A runs TASK 1 (Tool A), Container B runs TASK 2 (Tool B), Container C runs TASK 3 (Tool C); image tags shown: samtools:v1.3, bwa:0.7.15, picard:2.3.0]
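The idea of pinning each task to an image tag and keeping a log can be sketched as follows. This is not how Nextflow implements it; it is a minimal illustration in which the task-to-image pairing is an assumption read off the figure, and the record layout and function names are hypothetical.

```python
import json

# Pairing of tasks and images is assumed from the order shown in the figure.
PIPELINE = [
    {"task": "TASK 1", "image": "samtools:v1.3"},
    {"task": "TASK 2", "image": "bwa:0.7.15"},
    {"task": "TASK 3", "image": "picard:2.3.0"},
]


def split_image(image):
    """Split "name:tag" into (name, tag); a missing tag falls back to "latest".

    Simplified on purpose: registry hosts with a port are not handled.
    """
    name, sep, tag = image.rpartition(":")
    if not sep:
        return image, "latest"
    return name, tag


def provenance_log(pipeline):
    """Return a JSON record of which tool version ran in which task."""
    records = []
    for step in pipeline:
        tool, version = split_image(step["image"])
        records.append({"task": step["task"], "tool": tool, "version": version})
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    print(provenance_log(PIPELINE))
```

Because the tool version is the image tag, saving this log together with the images is enough to state exactly which software produced a result.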
HPC
MEGADOCK [M. Ohue, et al., 2014]
• Protein-protein interaction prediction software using FFT-grid based docking
• Hybrid parallelization using OpenMP/GPU/MPI
• research achievements on TSUBAME, K computer, etc.
• provides container images, deployed on public clouds [K. Aoyama, et al. 2017] [Y. Yamamoto, et al. 2018]
Conclusion
not only for companies
• it provides a standard unit for distributing executable applications
• it enables us to reproduce research results
• it contributes to the credibility of computational science
• Containers are now available even in HPC environments
• ABCI (AIST), TSUBAME3.0 (Tokyo Tech), Piz Daint (CSCS), etc.
• Singularity, Docker, Shifter, etc.
• We believe that using containers in our research work is a reasonable approach to keeping the credibility of computational science
Rootless Docker (Experimental)
available at the following URL
• it will be officially supported in the future, but Singularity is enough for the current HPC use case (my opinion)
https://engineering.docker.com/2019/02/experimenting-with-rootless-docker/