Slide 1

Slide 1 text

Container virtualizations for computational science on HPC environments
RWBC-OIL Annual Meeting 2019
Kento Aoyama, Ph.D. Student
Research Assistant, RWBC-OIL (group: 2-3)
Akiyama Laboratory, Dept. of Computer Science, Tokyo Institute of Technology
March 8th, 2019

Slide 2

Slide 2 text

Outline
1. Introduction
2. Container Virtualization
3. Container Virtualization in Bioinformatics
4. Case Study: Containers in Bioinformatics on HPC environment
5. Conclusion

Slide 3

Slide 3 text

Introduction | Purpose and Motivation
Linux containers (e.g. Docker, Singularity), as a lightweight virtualization technology, are expected to help with the software-environment side of our research work. But why do we need to use them, and what do they contribute?
We show the contributions of containers to computational science through a survey of container use in bioinformatics on HPC environments.
We present the following:
• A brief introduction to container virtualization
• Container use and its advantages in bioinformatics
• Case studies of containers in bioinformatics on HPC environments

Slide 4

Slide 4 text

Introduction | Note
We will NOT discuss performance:
• performance overhead in file I/O, memory I/O, network I/O, etc.
• details of native MPI/GPU support
• Prof. D. K. Panda covered these topics at the RWBC-OIL Workshop a year ago [Zhang, et al. 2017]
  • “Singularity-based container technology can achieve near-native performance”
  • “Singularity has very little overhead for running MPI-based HPC applications on both Omni-Path and InfiniBand networks”

Slide 5

Slide 5 text

2. Container Virtualization

Slide 6

Slide 6 text

Background | Container Virtualization
Application-Centric Virtualization
“lightweight, standalone, executable package”

Slide 7

Slide 7 text

Background | Container Virtualization
Application-Centric Virtualization
“lightweight, standalone, executable package”
• “container packages an application with all of its dependencies into a standardized unit for software development, shipment”
  • contains no guest kernel or virtual devices
  • better performance than conventional VMs
• “… executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings”
  • good portability, sufficient compatibility
https://www.docker.com/resources/what-container

Slide 8

Slide 8 text

Background | Container Virtualization
Application-Centric Virtualization
“lightweight, standalone, executable package”
Core features of the Linux container:
cgroups
• restrict the system resources of process groups (CPU, memory, drive, etc.)
namespaces [Biederman, 2006]
• isolate system resources between host and guest (filesystem, hostname, IPC, PID, network, user, etc.)
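
The two kernel features named on this slide can be observed from any Linux process through the /proc filesystem. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper names `read_namespaces` and `read_cgroups` are illustrative and not part of any container tool. Running it on the host and again inside a container should show different namespace identifiers for the isolated resources.

```python
import os
from pathlib import Path


def read_namespaces(pid="self"):
    """Return {namespace_name: identifier} for a process, or {} if unavailable."""
    ns_dir = Path("/proc") / str(pid) / "ns"
    namespaces = {}
    try:
        entries = sorted(os.listdir(ns_dir))
    except OSError:
        return namespaces  # not Linux, /proc not mounted, or no such process
    for name in entries:
        try:
            # Each entry is a symbolic link whose target names the namespace.
            namespaces[name] = os.readlink(ns_dir / name)
        except OSError:
            continue  # skip entries we are not permitted to read
    return namespaces


def read_cgroups(pid="self"):
    """Return the non-empty lines of /proc/<pid>/cgroup, or [] if unavailable."""
    try:
        text = (Path("/proc") / str(pid) / "cgroup").read_text()
    except OSError:
        return []
    return [line for line in text.splitlines() if line]


if __name__ == "__main__":
    for name, ident in read_namespaces().items():
        print(f"namespace {name}: {ident}")
    for line in read_cgroups():
        print(f"cgroup {line}")
```

On a non-Linux system both helpers return empty results rather than failing, so the script is safe to run anywhere.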

Slide 9

Slide 9 text

Background | Portability (e.g. Docker)
Keeping portability and reproducibility of applications
• Easy to share applications through a registry service (e.g. Docker Hub)
• Easy to reproduce environments from a shell-like recipe (e.g. Dockerfile)
[Figure: a Dockerfile (apt-get install …, wget …, …, make) generates an image (App, Bins/Libs); images are pushed to and pulled from Docker Hub by hosts (Ubuntu, CentOS) running Docker Engine on the Linux kernel, and are run there as containers]
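
To make the "shell-like recipe" idea concrete, here is a minimal sketch that renders a Dockerfile as text from a base image, a package list, and build steps. The function name `render_dockerfile`, the base image tag, and the example.org download URL are all assumptions chosen for illustration; the slide itself elides the actual commands.

```python
def render_dockerfile(base_image, apt_packages, build_steps):
    """Render a Dockerfile as text from a base image, packages, and shell steps."""
    lines = [f"FROM {base_image}"]
    if apt_packages:
        pkgs = " ".join(sorted(apt_packages))
        # One RUN instruction installs all packages in a single image layer.
        lines.append(f"RUN apt-get update && apt-get install -y {pkgs}")
    for step in build_steps:
        # Each remaining shell step becomes its own RUN instruction.
        lines.append(f"RUN {step}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical recipe: the URL and archive name are placeholders.
    recipe = render_dockerfile(
        "ubuntu:18.04",
        ["build-essential", "wget"],
        [
            "wget -q https://example.org/tool.tar.gz",
            "tar xzf tool.tar.gz",
            "make -C tool",
        ],
    )
    print(recipe, end="")
```

The rendered text can be saved as `Dockerfile` and built with `docker build`, after which `docker push` and `docker pull` move the resulting image through a registry such as Docker Hub.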

Slide 10

Slide 10 text

Background | Linux Containers for HPC
[G.M. Kurtzer, et al., 2017] [R.S. Canon, et al., 2015] [L. Benedicic, et al., 2017] [S. Hykes, 2013]

Slide 11

Slide 11 text

3. Container Virtualization in Bioinformatics

Slide 12

Slide 12 text

Reproducibility Crisis
Research reproducibility is crucial for science, but 90% of surveyed researchers acknowledged a reproducibility crisis.
Baker, M., “1,500 scientists lift the lid on reproducibility”, Nature 533, 452–454 (2016).

Slide 13

Slide 13 text

Reproducibility Spectrum
Reproducibility problems:
• lack of details of the experiment
  • data, parameters, code, etc.
• lack of machine environment information
  • software versions, libraries, operating systems, etc.
Computational research should be reproducible.
Peng, R.D., “Reproducible research in computational science”, Science 334, 1226–1227 (2011).
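
The "machine environment information" listed on this slide can be recorded alongside results with very little code. The sketch below uses only the Python standard library; the function name `capture_environment` and the package names in the demo are assumptions for illustration, and the record covers only a Python environment, not every library on the system.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages=()):
    """Collect OS, interpreter, and package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # record the absence rather than guess
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "packages": versions,
    }


if __name__ == "__main__":
    # The package names here are examples; list whatever the analysis imports.
    record = capture_environment(["pip", "numpy"])
    print(json.dumps(record, indent=2, sort_keys=True))
```

Saving this JSON record next to each result file gives later readers the software versions and operating system that produced it.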

Slide 14

Slide 14 text

Application Reproducibility
A different version of a library may produce a different result
• This is a threat to research reproducibility, especially in computational science
  • e.g. Peng, R.D., “Reproducible research in computational science”, Science 334, 1226–1227 (2011).
• Recently, scientists have become interested in container virtualization as a lightweight and useful technology for improving research reproducibility.
[Figure: Env. A (Application, Library version 1.2) produces Result; Env. B (Application, Library version 1.3) produces Result’, a different result; labeled “Reproducibility Crisis”]

Slide 15

Slide 15 text

BioContainers Project
• community-driven project for open-source bioinformatics software distribution [D.V. Leprevost F., et al. 2017]
• provides executable bioinformatics containers and build recipes
• currently, the project integrates with the BioConda project and automates container builds for its packages

Slide 17

Slide 17 text

Continuous Analysis
Introducing software engineering practices such as continuous integration into computational science workflows enables researchers to reproduce results without contacting the study authors.
B. K. Beaulieu-Jones and C. S. Greene, “Reproducibility of computational workflows is automated using continuous analysis”, Nature Biotechnology, vol. 35, no. 4, pp. 342–346, 2017.

Slide 18

Slide 18 text

4. Case Study: Containers in Bioinformatics on HPC environment

Slide 19

Slide 19 text

Case Study | Overview
Case 1: Container Virtualizations on ABCI & TSUBAME3.0
• AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL)
Case 2: Singularity and BioContainers on National Institute of Genetics
• Supercomputer Facilities of National Institute of Genetics
Case 3: Containers for Reproducible Pipeline Analysis (e.g. Nextflow)
• Centre for Genomic Regulation (Barcelona, Spain)
Case 4: Protein-Protein Interaction prediction on Public Clouds and HPC
• Akiyama Laboratory, Tokyo Institute of Technology

Slide 20

Slide 20 text

Case 1: Container Virtualizations on ABCI & TSUBAME3.0

Slide 21

Slide 21 text

Case 1 | Container Virtualizations on ABCI
Hitoshi Sato, “Building Software Ecosystems for AI Cloud using Singularity HPC Container”, 5th Accelerated Data Analytics and Computing (ADAC5), 2018.

Slide 22

Slide 22 text

Case 1 | Container Virtualizations on TSUBAME3.0
Satoshi Matsuoka, “TSUBAME3.0 and ABCI: Supercomputer Architecture for HPC and AI/BD Convergence”, GTC2017, 2017.

Slide 23

Slide 23 text

Case 1 | Container Virtualizations on TSUBAME3.0
Satoshi Matsuoka, “TSUBAME3.0 and ABCI: Supercomputer Architecture for HPC and AI/BD Convergence”, GTC2017, 2017.
https://www.t3.gsic.titech.ac.jp/applications
Singularity containers are now available on TSUBAME3.0.
Many thanks to Prof. Endo and Dr. Nomura!

Slide 24

Slide 24 text

Case 1 | Singularity on HPC environment
1. Download (or build) a container image from a registry service (e.g. DockerHub, SingularityHub) and save it as a local file (.simg)
2. Execute singularity commands (exec/run/shell) to run the container image
[Figure labels: Registry Service, Local Machine, Login node, Job Scheduler, Compute node, Singularity, Local file (.simg), User script, Container, Application]
• Computational resource management is the same as before
• Singularity is compatible with the Docker image format
  • we can use any Docker images on HPC environments! *
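
The two steps on this slide can be sketched as a small script that builds the corresponding command lines. This is a sketch under stated assumptions: the image reference `ubuntu:18.04` and the local file name `ubuntu-18.04.simg` are examples only (the file name that `singularity pull` writes depends on the installed Singularity version), and the helper names are illustrative. The commands are executed only if a `singularity` binary is found.

```python
import shutil
import subprocess


def pull_command(docker_ref):
    """Step 1: command that downloads a Docker image as a local image file."""
    return ["singularity", "pull", f"docker://{docker_ref}"]


def exec_command(image_path, command):
    """Step 2: command that runs `command` inside the container image."""
    return ["singularity", "exec", image_path, *command]


def run_if_available(argv):
    """Run argv only when its executable is installed; otherwise just report."""
    if shutil.which(argv[0]) is None:
        print("skipped (executable not found):", " ".join(argv))
        return None
    return subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Assumed image reference and local file name, for illustration only.
    run_if_available(pull_command("ubuntu:18.04"))
    run_if_available(exec_command("ubuntu-18.04.simg", ["cat", "/etc/os-release"]))
```

On a cluster, the `singularity exec` line would typically sit inside the user script submitted to the job scheduler, so resource management stays unchanged.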

Slide 25

Slide 25 text

Case 2: Singularity and BioContainers on National Institute of Genetics

Slide 26

Slide 26 text

Case 2 | Singularity and BioContainers on National Institute of Genetics
Opened on March 5th, 2019
https://sc2.ddbj.nig.ac.jp/index.php/systemconfig

Slide 27

Slide 27 text

Case 2 | Singularity and BioContainers on National Institute of Genetics
… the following container images are available on this system:
• user-customized Singularity images built on another system
• container images downloaded from Docker Hub, Singularity Hub, NGC, etc.
• community container images provided by bioinformatics projects (e.g. BioContainers), etc.
https://sc2.ddbj.nig.ac.jp/index.php/singularity

Slide 28

Slide 28 text

Case 2 | Singularity and BioContainers on National Institute of Genetics
• Pre-downloaded container images are available and shared on the system
  • container images (approx. 9,000) provided by the BioContainers project
• Container images are updated on a regular basis (at least once every 6 months)
https://sc2.ddbj.nig.ac.jp/index.php/available-biotools

Slide 29

Slide 29 text

Case 3: Containers for Reproducible Pipeline Analysis

Slide 30

Slide 30 text

Case 3 | Containers for Reproducible Pipeline Analysis
Nextflow [P. D. Tommaso, et al., 2017]
• DSL (domain-specific language) based pipeline software for omics analysis
• provides various functions to support result reproducibility
• recommends using containers (Docker, Singularity) in pipeline executions

Slide 31

Slide 31 text

Case 3 | Containers for Reproducible Pipeline Analysis
• Using a separate container image for each execution task lets us virtualize the analysis processes
• avoids conflicts of library dependencies between tools
• keeps reproducibility by saving container images and logs
• users can specify the tool versions (= image tags), data, etc.
[Figure: on the host system, TASK 1 runs Tool A in Container A, TASK 2 runs Tool B in Container B, and TASK 3 runs Tool C in Container C; image tags shown: samtools:v1.3, bwa:0.7.15, picard:2.3.0]
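
The per-task container idea can be sketched in a few lines. This is not Nextflow's DSL or API; it is a plain-Python illustration, and the `Task` class, the helper names, and the bare tool names used as command lines are all hypothetical. The image tags are the ones shown on the slide and are used here as labels; a real pipeline would use full registry references and pass sub-commands and input files.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str       # pipeline step, e.g. "TASK 1"
    image: str      # tool name and version tag, e.g. "samtools:v1.3"
    command: tuple  # command line to run inside the container


def container_command(task, runtime="docker"):
    """Build the command that runs one task inside its own pinned image."""
    if runtime == "docker":
        return ["docker", "run", "--rm", task.image, *task.command]
    if runtime == "singularity":
        return ["singularity", "exec", f"docker://{task.image}", *task.command]
    raise ValueError(f"unknown runtime: {runtime}")


def provenance_log(tasks, runtime="docker"):
    """Return a JSON record of which image and command each task used."""
    records = [
        {"task": t.name, "image": t.image, "argv": container_command(t, runtime)}
        for t in tasks
    ]
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    # Placeholder command lines: each task just names its tool.
    pipeline = [
        Task("TASK 1", "samtools:v1.3", ("samtools",)),
        Task("TASK 2", "bwa:0.7.15", ("bwa",)),
        Task("TASK 3", "picard:2.3.0", ("picard",)),
    ]
    print(provenance_log(pipeline))
```

Because every task names its own image tag, two tools with conflicting library requirements never share an environment, and the saved log records exactly which versions produced a result.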

Slide 32

Slide 32 text

Case 4: Protein-Protein Interaction prediction on Public Clouds and HPC

Slide 33

Slide 33 text

Case 4 | Protein-Protein Interaction prediction on Public Clouds and HPC
MEGADOCK [M. Ohue, et al., 2014]
• protein-protein interaction prediction software using FFT-grid-based docking
• hybrid parallelization using OpenMP/GPU/MPI
• research achievements on TSUBAME, K computer, etc.
• provides container images and deployment on public clouds [K. Aoyama, et al. 2017] [Y. Yamamoto, et al. 2018]

Slide 34

Slide 34 text

Case 4 | Protein-Protein Interaction prediction on Public Clouds and HPC
MEGADOCK containers on Microsoft Azure [K. Aoyama 2017]

Slide 35

Slide 35 text

5. Conclusion

Slide 36

Slide 36 text

Conclusion
• Containers, as lightweight virtualization, are gaining acceptance among researchers, not only in industry
  • they provide a standard executable unit for distributing applications
  • they enable us to reproduce research results
  • they contribute to the credibility of computational science
• Containers are now available even in HPC environments
  • ABCI (AIST), TSUBAME3.0 (Tokyo Tech), Piz Daint (CSCS), etc.
  • Singularity, Docker, Shifter, etc.
• We believe that using containers in our research work is a reasonable approach to maintaining the credibility of computational science

Slide 37

Slide 37 text

Rootless Docker (Experimental)
• Rootless Docker is under development
  • an experimental script is available at the following URL
  • it will be officially supported in the future, but Singularity is sufficient for current HPC use cases (my opinion)
https://engineering.docker.com/2019/02/experimenting-with-rootless-docker/