Container Virtualization for Computational Science on HPC Environments

RWBC-OIL Annual Meeting 2019

metaVariable

March 08, 2019

Transcript

  1. Container virtualizations for computational science on HPC environments
    RWBC-OIL Annual Meeting 2019
    Kento Aoyama, Ph.D. Student
    Research Assistant, RWBC-OIL (group: 2-3)
    Akiyama Laboratory, Dept. of Computer Science, Tokyo Institute of Technology
    March 8th, 2019
  2. Outline
    1. Introduction
    2. Container Virtualization
    3. Container Virtualization in Bioinformatics
    4. Case Study: Containers in Bioinformatics on HPC environment
    5. Conclusion
  3. Introduction | Purpose and Motivation
    Linux containers (e.g. Docker, Singularity) are a lightweight virtualization technology that is expected to help with the software-environment problems in our research work. But why do we need them, and what do they contribute? We show the contributions of containers to computational science through a survey of container use in bioinformatics on HPC environments. We present the following:
    • A brief introduction to container virtualization
    • Container use and its advantages in bioinformatics
    • Case studies of containers in bioinformatics on HPC environments
  4. Introduction | Note
    We will NOT cover performance:
    • performance overhead in file I/O, memory I/O, network I/O, etc.
    • details about native MPI/GPU support
    • Prof. D.K. Panda talked about this at the RWBC-OIL Workshop a year ago [Zhang, et al. 2017]
      • “Singularity-based container technology can achieve near-native performance”
      • “Singularity has very little overhead for running MPI-based HPC applications on both Omni-Path and InfiniBand networks”
  5. 2. Container Virtualization

  6. Background | Container Virtualization
    Application-Centric Virtualization
    “lightweight, standalone, executable package”
  7. Background | Container Virtualization
    Application-Centric Virtualization
    “lightweight, standalone, executable package”
    • “container packages an application with all of its dependencies into a standardized unit for software development, shipment”
      • without kernel, virtual devices
      • better performance than normal VMs
    • “… executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings”
      • good portability, enough compatibility
    https://www.docker.com/resources/what-container
  8. Background | Container Virtualization
    Application-Centric Virtualization
    “lightweight, standalone, executable package”
    Core features of the Linux container
    cgroups
    • restricting system resources of process groups (cpu, memory, drive, etc.)
    namespaces [Biederman, 2006]
    • isolating system resources between host and guest (filesystem, hostname, IPC, PID, network, user, etc.)
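The namespaces listed on slide 8 can be inspected from user space on Linux. The following is a minimal sketch, not taken from the slides: the function name `list_namespaces` is our own, and it assumes a Linux `/proc` filesystem. It reads the symlinks under `/proc/<pid>/ns`, which identify the namespaces a process belongs to. A process inside a container would typically report different identifiers than a process on the host.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_type: identifier} for a process.

    Reads the symlinks under /proc/<pid>/ns (Linux only). Returns an empty
    dict when that directory is missing or unreadable.
    """
    ns_dir = os.path.join("/proc", str(pid), "ns")
    namespaces = {}
    try:
        entries = sorted(os.listdir(ns_dir))
    except OSError:
        return namespaces
    for name in entries:
        try:
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Some entries may be unreadable without extra privileges.
            continue
    return namespaces


if __name__ == "__main__":
    for ns_type, ident in list_namespaces().items():
        print(f"{ns_type:10s} {ident}")
```

Running the script once on the host and once inside a container, then comparing the two listings, is one simple way to see which resources are isolated.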
  9. Background | Portability (e.g. Docker)
    Keeping Portability & Reproducibility of application
    • Easy to share application through registry service (e.g. Docker Hub)
    • Easy to reproduce the environments from shell-like recipe (e.g. Dockerfile)
    [Figure: a Dockerfile (apt-get install …, wget …, …, make) generates an image (App, Bins/Libs); images are pushed to and pulled from Docker Hub and run as containers by Docker Engine on the Linux kernel of Ubuntu and CentOS hosts.]
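To make the "shell-like recipe" idea on slide 9 concrete, here is a small sketch that renders a Dockerfile from a base image and a list of shell steps. The helper `render_dockerfile`, the base image, and the package names are hypothetical and are not the recipe from the slide, whose steps are elided. Only the `FROM` and `RUN` instructions are used.

```python
def render_dockerfile(base_image, run_steps):
    """Render a minimal Dockerfile: one FROM line, then one RUN line per shell step."""
    lines = [f"FROM {base_image}"]
    lines += [f"RUN {step}" for step in run_steps]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical recipe; the base image and package names are illustrative only.
    print(render_dockerfile(
        "ubuntu:18.04",
        ["apt-get update", "apt-get install -y build-essential wget"],
    ))
```

Because the recipe is plain text, it can be kept under version control next to the analysis code, which is what makes the environment reproducible from the recipe alone.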
  10. Background | Linux Containers for HPC
    [G.M. Kurtzer, et al., 2017] [R.S. Canon, et al., 2015] [L. Benedicic, et al., 2017] [S. Hykes, 2013]
  11. 3. Container Virtualization in Bioinformatics

  12. Reproducibility Crisis
    Research reproducibility is crucial for science
    But 90% of researchers acknowledged a reproducibility problem
    Baker, M., “1,500 scientists lift the lid on reproducibility”, Nature 533, 452–454 (2016).
  13. Reproducibility Spectrum
    Reproducibility Problems:
    • lack of details of experiment
      • data, parameters, code, etc.
    • lack of machine environment information
      • software versions, libraries, operating systems, etc.
    Computational research should be reproducible
    Peng, R.D., “Reproducible research in computational science”, Science 334, 1226–1227 (2011).
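The "machine environment information" on slide 13 can be captured automatically alongside results. The sketch below is an illustration under our own assumptions, not a method from the talk: the function name `environment_snapshot` and the example package name are ours. It records the operating system, machine type, interpreter version, and the installed versions of selected packages using only the Python standard library.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages=()):
    """Collect OS, interpreter, and package versions as a JSON-serializable dict.

    `packages` is an iterable of distribution names; ones that are not
    installed are recorded as None instead of raising.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "packages": versions,
    }


if __name__ == "__main__":
    # "numpy" is only an example; list whichever libraries the analysis uses.
    print(json.dumps(environment_snapshot(["numpy"]), indent=2))
```

Such a snapshot documents the environment but does not recreate it. Recreating the environment is what the container images discussed in the following slides are for.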
  14. Application Reproducibility
    A different version of a library may produce a different result
    • This is a threat to research reproducibility, especially in computational science
      • e.g. Peng, R.D., “Reproducible research in computational science”, Science 334, 1226–1227, 2011.
    • Recently, scientists have become interested in container virtualization as a lightweight and useful technology to improve research reproducibility.
    [Figure: Reproducibility Crisis. Env. A runs the application with library version 1.2 and yields Result; Env. B runs it with library version 1.3 and yields a different Result’.]
  15. BioContainers Project
    • community-driven project for open-source bioinformatics software distribution [D.V. Leprevost F., et al. 2017]
    • provides executable bioinformatics containers and build recipes
    • currently, the project integrates with the BioConda project and automates container building for its packages
  17. Continuous Analysis
    Introducing software engineering practices such as Continuous Integration into the computational science workflow enables researchers to reproduce results without contacting the study authors.
    B. K. Beaulieu-Jones and C. S. Greene, “Reproducibility of computational workflows is automated using continuous analysis”, Nature Biotechnology, vol. 35, no. 4, pp. 342–346, 2017.
  18. 4. Case Study: Containers in Bioinformatics on HPC environment

  19. Case Study | Overview
    Case 1: Container Virtualizations on ABCI & TSUBAME3.0
      AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL)
    Case 2: Singularity and BioContainers on National Institute of Genetics
      SuperComputer Facilities of National Institute of Genetics
    Case 3: Containers for Reproducible Pipeline Analysis (e.g. Nextflow)
      Centre for Genomic Regulation (Barcelona, Spain)
    Case 4: Protein-Protein Interaction prediction on Public Clouds and HPC
      Akiyama Laboratory, Tokyo Institute of Technology
  20. Case 1: Container Virtualizations on ABCI & TSUBAME3.0

  21. Case 1 | Container Virtualizations on ABCI
    Hitoshi Sato, “Building Software Ecosystems for AI Cloud using Singularity HPC Container”, 5th Accelerated Data Analytics and Computing (ADAC5), 2018.
  22. Case 1 | Container Virtualizations on TSUBAME3.0
    Satoshi Matsuoka, “TSUBAME3.0 and ABCI: Supercomputer Architecture for HPC and AI/BD Convergence”, GTC2017, 2017.
  23. Case 1 | Container Virtualizations on TSUBAME3.0
    Satoshi Matsuoka, “TSUBAME3.0 and ABCI: Supercomputer Architecture for HPC and AI/BD Convergence”, GTC2017, 2017.
    https://www.t3.gsic.titech.ac.jp/applications
    The “Singularity” container is now available on TSUBAME3.0
    Many thanks to Prof. Endo and Dr. Nomura!
  24. Case 1 | Singularity on HPC environment
    1. Download (or build) a container image from a registry service (e.g. DockerHub, SingularityHub) and save it as a local file (.simg)
    2. Execute singularity commands (exec/run/shell) to run the container image
    [Figure: workflow diagram for steps 1 and 2, with the labels Registry Service, Local Machine, Login node, Job Scheduler, Compute node, User script, Singularity, Container, Application, and Local file (.simg).]
    • Computational resource management is the same as before
    • Singularity is compatible with the Docker image format
      • we can use any Docker image on the HPC environment! *
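The two steps on slide 24 can be sketched as command lines. The snippet below only builds the argument lists and prints them, so it can be read or adapted without Singularity installed. It uses the `singularity pull` and `singularity exec` subcommands with no extra flags. The helper names, the image URI, and the local file name are hypothetical, and the file name that `pull` actually produces depends on the Singularity version.

```python
import shlex


def pull_command(image_uri):
    """argv for step 1: fetch an image, e.g. from a docker:// or shub:// URI."""
    return ["singularity", "pull", image_uri]


def exec_command(image_path, command):
    """argv for step 2: run `command` (a list of strings) inside the image."""
    return ["singularity", "exec", image_path, *command]


if __name__ == "__main__":
    # Hypothetical image and file names, for illustration only.
    steps = [
        pull_command("docker://ubuntu:18.04"),
        exec_command("ubuntu-18.04.simg", ["cat", "/etc/os-release"]),
    ]
    for argv in steps:
        # Print shell-quoted lines that could be pasted into a job script.
        print(" ".join(shlex.quote(a) for a in argv))
```

On a typical cluster, the first line would be run once on the login node and the second would go into the user script submitted to the job scheduler, so resource management stays unchanged.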
  25. Case 2: Singularity and BioContainers on National Institute of Genetics
  26. Case 2 | Singularity and BioContainers on National Institute of Genetics
    https://sc2.ddbj.nig.ac.jp/index.php/systemconfig
    Opened on March 5th, 2019
  27. Case 2 | Singularity and BioContainers on National Institute of Genetics
    … the following container images are available on this system:
    • user-custom Singularity images built on another system
    • container images downloaded from Docker Hub, Singularity Hub, NGC, etc.
    • community container images provided by bioinformatics projects (e.g. BioContainers)
    etc.
    https://sc2.ddbj.nig.ac.jp/index.php/singularity
  28. Case 2 | Singularity and BioContainers on National Institute of Genetics
    https://sc2.ddbj.nig.ac.jp/index.php/available-biotools
    • Pre-downloaded container images are available and shared on the system
      • container images (approx. 9000) provided by the BioContainers project
    • Container images are updated on a regular basis (at least once every 6 months)
  29. Case 3: Containers for Reproducible Pipeline Analysis

  30. Case 3 | Containers for Reproducible Pipeline Analysis
    Nextflow [P. D. Tommaso, et al., 2017]
    • DSL (domain-specific language) based pipeline software for omics analysis
    • provides various functions to support result reproducibility
    • recommends using containers (Docker, Singularity) in pipeline executions
  31. Case 3 | Containers for Reproducible Pipeline Analysis
    • Using a separate container image for each execution task enables us to virtualize the analytics processes
      • avoids conflicts of library dependencies between tools
      • keeps reproducibility by saving container images and logs
      • users can specify the tool versions (= image tags), data, etc.
    [Figure: on the host system, TASK 1 runs Tool A in Container A (samtools:v1.3), TASK 2 runs Tool B in Container B (bwa:0.7.15), and TASK 3 runs Tool C in Container C (picard:2.3.0).]
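Since the tool version is carried by the image tag, a pipeline is only reproducible if every task names an explicit tag. The following sketch checks that property for a task-to-image mapping. It is our own illustration, not part of Nextflow: the helper names and task names are hypothetical, and the parsing is deliberately simplified (it ignores registry hosts with ports and digests). The image references are the ones shown on slide 31.

```python
def split_image_ref(ref):
    """Split "name:tag" into (name, tag); tag is None when absent.

    Simplified: does not handle registry hosts with ports or digests.
    """
    name, sep, tag = ref.partition(":")
    return name, (tag if sep and tag else None)


def unpinned_tasks(task_images):
    """Return task names whose image has no tag or uses the mutable "latest" tag."""
    return [
        task
        for task, ref in task_images.items()
        if split_image_ref(ref)[1] in (None, "latest")
    ]


if __name__ == "__main__":
    # Image tags as shown on the slide; task names are illustrative.
    pipeline = {
        "task1": "samtools:v1.3",
        "task2": "bwa:0.7.15",
        "task3": "picard:2.3.0",
    }
    print(unpinned_tasks(pipeline))
```

A check like this could run before a pipeline is launched, so that an unpinned image is caught before it silently changes the results.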
  32. Case 4: Protein-Protein Interaction prediction on Public Clouds and HPC
  33. Case 4 | Protein-Protein Interaction prediction on Public Clouds and HPC
    MEGADOCK [M. Ohue, et al., 2014]
    • Protein-protein interaction prediction software using FFT-grid-based docking
    • Hybrid parallelization using OpenMP/GPU/MPI
    • research achievements on TSUBAME, K computer, etc.
    • provides container images, deployed on public clouds [K. Aoyama, et al. 2017] [Y. Yamamoto, et al. 2018]
  34. Case 4 | Protein-Protein Interaction prediction on Public Clouds and HPC
    MEGADOCK containers on Microsoft Azure [K. Aoyama 2017]
  35. 5. Conclusion

  36. Conclusion
    • Containers, as lightweight virtualization, are gaining acceptance among researchers, not only in industry
      • they provide a standard executable unit for application distribution
      • they enable us to reproduce research results
      • they contribute to the credibility of computational science
    • Containers are now available even in HPC environments
      • ABCI (AIST), TSUBAME3.0 (Tokyo Tech), Piz Daint (CSCS), etc.
      • Singularity, Docker, Shifter, etc.
    • We believe that using containers in our research work is a reasonable approach to maintaining the credibility of computational science
  37. Rootless Docker (Experimental)
    • Rootless Docker is under development
      • an experimental script is available at the following URL
      • it will be officially supported in the future, but Singularity is enough for current HPC use cases (my opinion)
    https://engineering.docker.com/2019/02/experimenting-with-rootless-docker/