Slide 1

Slide 1 text

Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? by Jie Zhang, Xiaoyi Lu, Dhabaleswar K. Panda (The Ohio State University), in Proceedings of the 10th International Conference on Utility and Cloud Computing (UCC '17), pp. 151-160, 2017. Presenter: Kento Aoyama, Ph.D. Student, Akiyama Laboratory, Dept. of Computer Science, Tokyo Institute of Technology. Journal Seminar (Akiyama and Ishida Laboratory), April 19th, 2018.

Slide 2

Slide 2 text

Self-Introduction 2 Name: Kento Aoyama (青山 健人) Research Interests: High Performance Computing, Container Virtualization, Parallel/Distributed Computing, Bioinformatics Education: Toyama National College of Technology, Dept. of Information Engineering (A.Sc. in Eng.); The University of Electro-Communications, Faculty of Electro-Communications, Dept. of Information Engineering (B.Sc. in Eng.); Tokyo Institute of Technology, Graduate School of Information Science and Engineering, Master's program (M.Sc. in Eng.); (Software Developer, Fujitsu Ltd.); Tokyo Institute of Technology, School of Computing, Doctoral program (Ph.D. Student) WebCV: https://metavariable.github.io @meta1127

Slide 3

Slide 3 text

SlideShare: https://www.slideshare.net/KentoAoyama/reproducibility-of-computational-workflows-is-automated-using-continuous-analysis Side-Story | Bioinformatics + Container 3

Slide 4

Slide 4 text

4 Why do we have to do the “INSTALLATION BATTLE” on Supercomputers? Isn’t it a waste of time? (image from Jiro Ueda, “Why don’t you do your best?”, 2004)

Slide 5

Slide 5 text

1. Meta-Information • Conference / Authors / Abstract 2. Background • Container Virtualization 3. HPC Features • Intel Knights Landing 4. Experiments 5. Conclusion 6. (Additional Discussion) Outline 5

Slide 6

Slide 6 text

• In Proceedings of the 10th International Conference on Utility and Cloud Computing (UCC '17) • Place : Austin, Texas, USA • Date : December 5-8, 2017 • H5-Index : 14.00 ( https://aminer.org/ranks/conf ) Conference Information 6 http://www.depts.ttu.edu/cac/conferences/ucc2017/

Slide 7

Slide 7 text

Authors 7 Jie Zhang • Ph.D. student in NOWLAB (Network Based Computing Lab) • Best Student Paper Award (UCC '17, this paper) Prof. Dhabaleswar K. (DK) Panda • Professor at The Ohio State University • Faculty member of NOWLAB MVAPICH • well-known MPI implementation in HPC, used on e.g. Sunway TaihuLight, TSUBAME 2.5, etc. • http://mvapich.cse.ohio-state.edu/publications/ OSU Micro-Benchmarks • MPI communication benchmarks • point-to-point, collective, non-blocking, GPU memory access, … “The MVAPICH Project: Evolution and Sustainability of an Open Source Production Quality MPI Library for HPC”, D. Panda, K. Tomko, K. Schulz, A. Majumdar, Int'l Workshop on Sustainable Software for Science: Practice and Experiences, Nov 2013. http://nowlab.cse.ohio-state.edu/people

Slide 8

Slide 8 text

Question: “Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?” Answer: Yes. • Singularity shows near-native performance even when running MPI (HPC) applications • Container technology is ready for the HPC field! What’s the message? 8

Slide 9

Slide 9 text

• Presents a 4-dimension-based evaluation methodology for characterizing Singularity performance Contributions (1/2) 9 [Diagram labels: Singularity; Omni-Path; InfiniBand; Intel Xeon Haswell; Intel KNL (cluster modes, cache/flat modes)]

Slide 10

Slide 10 text

• Conducts Extensive Performance Evaluation on cutting-edge HPC technologies • Intel Xeon • Intel KNL • Omni-Path • InfiniBand • Provides Performance Reports and analysis of running MPI Benchmarks with Singularity on different platforms • Chameleon Cloud ( https://www.chameleoncloud.org/ ) • Local Clusters Contributions (2/2) 10

Slide 11

Slide 11 text

Background Container Virtualization 11

Slide 12

Slide 12 text

12 Why do we have to do the “INSTALLATION BATTLE” on Supercomputers? Because of … - complex library dependencies - version mismatches of libraries (too old…) etc.

Slide 13

Slide 13 text

13 All you need is container technology … to end the “INSTALLATION BATTLE” (image from Jiro Ueda, “Why don’t you do your best?”, 2004)

Slide 14

Slide 14 text

Application-Centric Virtualization or “Process-Level Virtualization” Background | Container Virtualization 14 [Diagram: Virtual machines (hypervisor-based virtualization): Hardware → Hypervisor → per-VM Guest OS + Bins/Libs + App. Containers (container-based virtualization): Hardware → shared Linux Kernel → per-container Bins/Libs + App, fast & lightweight.]

Slide 15

Slide 15 text

Linux Container • The concept of a Linux container is based on Linux namespaces • No visibility of objects outside the container • Containers add another level of access control namespace • Namespaces isolate system resources by creating separate instances of global namespaces • process IDs (PID), host & domain name (UTS), inter-process communication (IPC), users (UID), … • Processes running inside the container … • share the host Linux kernel • have their own root directory and mount table • appear to be running on a normal Linux system Background | Linux Container (namespace) 15 E. W. Biederman, “Multiple instances of the global Linux namespaces,” In Proceedings of the 2006 Ottawa Linux Symposium, 2006. [Diagram: Hardware → Linux Kernel → Container (App, Bins/Libs) with its own namespace: pid, uid, gid, hostname, filesystem, …]
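To make the namespace mechanism above concrete, here is a minimal sketch (my own illustration, not from the slides or the paper): with root privileges (CAP_SYS_ADMIN) on Linux, a process can detach into its own UTS namespace and change the hostname without affecting the host. The hostname string is arbitrary.

```c
/* Minimal UTS-namespace sketch (illustrative only). */
#define _GNU_SOURCE
#include <sched.h>        /* unshare(), CLONE_NEWUTS */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>       /* sethostname(), gethostname() */

int main(void)
{
    char name[64];

    /* Detach from the host's UTS namespace; the change below stays inside it. */
    if (unshare(CLONE_NEWUTS) != 0) {
        perror("unshare(CLONE_NEWUTS)");       /* typically EPERM without root */
        return EXIT_FAILURE;
    }

    const char *inside = "inside-container";   /* hostname value is arbitrary */
    if (sethostname(inside, strlen(inside)) != 0) {
        perror("sethostname");
        return EXIT_FAILURE;
    }

    gethostname(name, sizeof(name));
    printf("hostname inside the new UTS namespace: %s\n", name);
    return EXIT_SUCCESS;                       /* the host's hostname is unchanged */
}
```

Container runtimes combine several such namespaces (PID, mount, IPC, UTS, user) to give processes the isolated view described on this slide.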

Slide 16

Slide 16 text

Background | Portability (e.g. Docker) 16 Keeping Portability & Reproducibility for applications • Easy to port the application using Docker Hub • Easy to reproduce the environment using a Dockerfile [Diagram: a Dockerfile (apt-get install …, wget …, make) generates an image (App + Bins/Libs); the image is pushed to / pulled from Docker Hub and shared, then run by the Docker Engine as a container on either an Ubuntu or a CentOS host's Linux kernel.]

Slide 17

Slide 17 text

What about Performance? • virtualization overhead • Compute, Network I/O, File I/O, … • Latency, Bandwidth, Throughput, … What about Security? • requires a root daemon process (Docker) • requires an SUID binary (Shifter, Singularity) What about Usability? • where to store container images (repository, local file, …) • affinity with the user’s workflow Background | Concerns on HPC Field 17

Slide 18

Slide 18 text

SlideShare: https://www.slideshare.net/KentoAoyama/an-updated-performance-comparison-of-virtual-machines-and-linux-containers-73758906 Side-Story | Docker Performance 18

Slide 19

Slide 19 text

Side-Story | Docker Performance Overview (1/2) 19

Case | Perf. Category | Docker | KVM
A, B | CPU | Good | Bad*
C | Memory Bandwidth (Sequential) | Good | Good
D | Memory Bandwidth (Random) | Good | Good
E | Network Bandwidth | Acceptable* | Acceptable*
F | Network Latency | Bad* | Bad
G | Block I/O (Sequential) | Good | Good
G | Block I/O (Random Access) | Good (with volume option) | Bad

Comparing to native performance … equal = Good, a little worse = Acceptable, worse = Bad (* = depends on the case or tuning)

Slide 20

Slide 20 text

Side-Story | Docker Performance Overview (2/2) 20 [Chart: performance ratio relative to native for PXZ [MB/s], Linpack [GFLOPS], and Random Access [GUPS], comparing Native, Docker, KVM, and KVM-tuned.] [1] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, “An updated performance comparison of virtual machines and Linux containers,” IEEE International Symposium on Performance Analysis of Systems and Software, pp. 171-172, 2015. (IBM Research Report, RC25482 (AUS1407-001), 2014.)

Slide 21

Slide 21 text

Docker Solomon Hykes and others. “What is Docker?” - https://www.docker.com/what-docker Shifter W. Bhimji, S. Canon, D. Jacobsen, L. Gerhardt, M. Mustafa, and J. Porter, “Shifter : Containers for HPC,” Cray User Group, pp. 1–12, 2016. Singularity Gregory M. K., Vanessa S., Michael W. B., “Singularity: Scientific containers for mobility of compute”, PLOS ONE 12(5): e0177459. Background | Containers for HPC 21

Slide 22

Slide 22 text

Background | Singularity (OSS) 22 Gregory M. K., Vanessa S., Michael W. B., “Singularity: Scientific containers for mobility of compute”, PLOS ONE 12(5): e0177459. A Linux container OSS for HPC workloads • Developed at LBNL (Lawrence Berkeley National Laboratory, USA) Key features • Near-native performance • Does not require a root daemon • Compatible with the Docker container format • Supports HPC features • NVIDIA GPU, MPI, InfiniBand, etc. http://singularity.lbl.gov/

Slide 23

Slide 23 text

Background | Singularity Workflow 23

Command | Privilege | Function
singularity create | required | create an empty container image file
singularity import | required | import a container image from a registry (e.g. Docker Hub)
singularity bootstrap | required | build a container image from a definition file
singularity shell | (partially) required | attach an interactive shell to the container (the ‘--writable’ option requires privilege)
singularity run | | run a container process from a container image file
singularity exec | | execute a user command inside the container process

https://singularity.lbl.gov/

Slide 24

Slide 24 text

• Container virtualization is an application-centric virtualization technology • can package library dependencies • can provide application portability & reproducibility at reasonable performance • “Singularity” is a Linux container OSS for HPC workloads • provides near-native performance • supports HPC features (GPU, MPI, InfiniBand, …) Background | Summary 24

Slide 25

Slide 25 text

HPC Features Intel KNL: Memory Modes Intel KNL: Cluster Modes Intel Omni-Path 25

Slide 26

Slide 26 text

Intel KNL (Knights Landing) 26 2nd Generation Intel® Xeon Phi Processor • MIC (Many Integrated Core) architecture designed by Intel® for High-Performance Computing • covers similar HPC areas to GPUs • allows the use of standard programming APIs such as OpenMP, MPI, … • Examples of use on supercomputers • Oakforest-PACS by JCAHPC (Univ. of Tokyo, Univ. of Tsukuba) • Tianhe-2A by NSCC-GZ (China) A. Sodani, “Knights landing (KNL): 2nd Generation Intel® Xeon Phi processor,” 2015 IEEE Hot Chips 27 Symp. HCS 2015, 2016.

Slide 27

Slide 27 text

Intel KNL Architecture 27 A. Sodani, “Knights landing (KNL): 2nd Generation Intel® Xeon Phi processor,” 2015 IEEE Hot Chips 27 Symp. HCS 2015, 2016.

Slide 28

Slide 28 text

Intel KNL Memory Modes 28 A. Sodani, “Knights landing (KNL): 2nd Generation Intel® Xeon Phi processor,” 2015 IEEE Hot Chips 27 Symp. HCS 2015, 2016.

Slide 29

Slide 29 text

Intel KNL Cluster Modes (1/4) 29 A. Sodani, “Knights landing (KNL): 2nd Generation Intel® Xeon Phi processor,” 2015 IEEE Hot Chips 27 Symp. HCS 2015, 2016.

Slide 30

Slide 30 text

Intel KNL Cluster Modes (2/4) 30 A. Sodani, “Knights landing (KNL): 2nd Generation Intel® Xeon Phi processor,” 2015 IEEE Hot Chips 27 Symp. HCS 2015, 2016.

Slide 31

Slide 31 text

Intel KNL Cluster Modes (3/4) 31 A. Sodani, “Knights landing (KNL): 2nd Generation Intel® Xeon Phi processor,” 2015 IEEE Hot Chips 27 Symp. HCS 2015, 2016.

Slide 32

Slide 32 text

Intel KNL Cluster Modes (4/4) 32 A. Sodani, “Knights landing (KNL): 2nd Generation Intel® Xeon Phi processor,” 2015 IEEE Hot Chips 27 Symp. HCS 2015, 2016.

Slide 33

Slide 33 text

Intel KNL with Omni-Path 33 A. Sodani, “Knights landing (KNL): 2nd Generation Intel® Xeon Phi processor,” 2015 IEEE Hot Chips 27 Symp. HCS 2015, 2016.

Slide 34

Slide 34 text

MIC (Many Integrated Core) architecture designed by Intel® for High-Performance Computing Memory Modes • Cache Mode : MCDRAM is automatically used as an L3-like cache • Flat Mode : data is manually allocated onto MCDRAM (see the allocation sketch below) Cluster Modes • All-to-All : addresses are uniformly hashed across all distributed directories • Quadrant : addresses are hashed to a directory in the same quadrant as the memory • SNC : each quadrant is exposed as a NUMA node (the chip can be seen as 4 sockets) Intel KNL Summary 34
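The sketch below illustrates what "manually allocate data onto MCDRAM" means in flat mode (my own illustration, not from the paper). It assumes the memkind library is installed and the KNL node is booted in flat mode so MCDRAM appears as a separate NUMA node; it would be built with something like gcc -std=c99 flat.c -lmemkind. Array size and fallback policy are arbitrary.

```c
/* Flat-mode MCDRAM allocation sketch using memkind's hbwmalloc interface. */
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>     /* hbw_check_available(), hbw_malloc(), hbw_free() */

int main(void)
{
    const size_t n = 1UL << 26;                /* 64 Mi doubles = 512 MiB (arbitrary) */
    int on_hbm = (hbw_check_available() == 0); /* 0 means high-bandwidth memory is usable */

    double *a = on_hbm ? hbw_malloc(n * sizeof(double))  /* explicit MCDRAM allocation */
                       : malloc(n * sizeof(double));     /* fall back to DDR4 */
    if (a == NULL)
        return EXIT_FAILURE;

    for (size_t i = 0; i < n; i++)             /* touch the pages so they are placed */
        a[i] = (double)i;
    printf("allocated on %s, a[42] = %.1f\n", on_hbm ? "MCDRAM" : "DDR", a[42]);

    if (on_hbm) hbw_free(a); else free(a);
    return EXIT_SUCCESS;
}
```

In cache mode no such change is needed: MCDRAM is used transparently, at the cost of possible cache-miss penalties discussed later in the results.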

Slide 35

Slide 35 text

Experiments MPI Point-to-Point Communication Performance MPI Collective Communication Performance HPC Application Performance 35

Slide 36

Slide 36 text

case 1: MPI Point-to-Point Communication Performance • MPI_Send / MPI_Recv • measure Latency and Bandwidth • (both MPI intra-node and MPI inter-node) case 2: MPI Collective Communication Performance • MPI_Bcast / MPI_Allgather / MPI_Allreduce / MPI_Alltoall • measure Latency and Bandwidth case 3: HPC Application Performance • Graph500 (https://graph500.org/), NAS Parallel Benchmarks [NASA, 1991] • measure Execution Time Experiments Overview 36
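To make case 1 concrete, here is a minimal ping-pong latency sketch in the spirit of the OSU micro-benchmarks (my own simplified illustration, not the OSU code). It runs with two MPI ranks, e.g. mpicc -std=c99 -O2 latency.c -o latency && mpirun -np 2 ./latency; the message size and iteration count are arbitrary, and in the paper's setting the same kind of binary is launched inside the Singularity container instead of on the bare host.

```c
/* Ping-pong latency sketch: rank 0 and rank 1 bounce a fixed-size message
 * back and forth and report the average one-way latency. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    const int msg_size = 4096;                 /* message size in bytes (arbitrary) */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
        MPI_Finalize();
        return EXIT_FAILURE;
    }

    char *buf = calloc(msg_size, 1);

    MPI_Barrier(MPI_COMM_WORLD);               /* start everyone together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {                       /* ping ... */
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {                /* ... pong */
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)                             /* each iteration is one round trip */
        printf("%d-byte avg one-way latency: %.2f us\n",
               msg_size, (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return EXIT_SUCCESS;
}
```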

Slide 37

Slide 37 text

Chameleon Cloud (for Intel Xeon nodes) • 32 bare-metal InfiniBand nodes • CPU: Intel Xeon E5-2670 v3 (Haswell), 24 cores, 2 sockets • Memory: 128 GB • Network card: Mellanox ConnectX-3 FDR (56 Gbps) Local Cluster (for Intel KNL nodes) • CPU: Intel Xeon Phi 7250 (1.40 GHz) • Memory: 96 GB (host, DDR4) + 16 GB (MCDRAM) • Network card: Omni-Path HFI Silicon 100 Series fabric controller Clusters Information 37

Slide 38

Slide 38 text

Common Software Settings • Singularity: 2.3 • gcc: 4.8.3 (used to compile all applications & libraries in the experiments) • MPI library: MVAPICH2 2.3a • OSU Micro-Benchmarks v5.3 Others • Results are averaged across 5 runs • Cluster mode was set to “All-to-All” or “Quadrant” (?) • “Since there is only one NUMA node on KNL architecture, we do not consider intra/inter-socket anymore here.” Other Settings 38

Slide 39

Slide 39 text

case 1: MPI Point-to-Point Communication Performance • Singularity’s overhead is less than 7% (on Haswell) • Singularity’s overhead is less than 8% (on KNL) case 2: MPI Collective Communication Performance • Singularity’s overhead is less than 8% for all operations • Singularity reflects the native performance characteristics case 3: HPC Application Performance • Singularity’s overhead is less than 7% in all cases “It reveals a promising way for efficiently running MPI applications on HPC clouds.” Results Summary (about Singularity) 39

Slide 40

Slide 40 text

MPI Point-to-Point Communication Performance on Haswell 40 [Chart annotations: the intra-socket (intra-node) case is better; InfiniBand FDR: 6.4 GB/s]

Slide 41

Slide 41 text

MPI Point-to-Point Communication Performance on KNL with Cache Mode 41 [Chart annotations: latency on the Haswell architecture is better than on KNL with cache mode (complex memory access, maintaining cache coherency); Omni-Path fabric controller: 9.2 GB/s]

Slide 42

Slide 42 text

MPI Point-to-Point Communication Performance on KNL with Flat Mode 42 [Chart annotations: Omni-Path fabric controller: 9.2 GB/s; inter-node bandwidth exceeds intra-node; bandwidth on KNL with flat mode is better than with cache mode because of the cache-miss penalty on MCDRAM]

Slide 43

Slide 43 text

MPI Collective Communication Performance with 512-Process (32 nodes) on Haswell 43

Slide 44

Slide 44 text

MPI Collective Communication Performance with 128-Process (2 nodes) on KNL with Cache Mode 44

Slide 45

Slide 45 text

MPI Collective Communication Performance with 128-Process (2 nodes) on KNL with Flat Mode 45 [Chart annotation: over L2-cache capacity]

Slide 46

Slide 46 text

NAS, Graph500 46

NAS Parallel Benchmarks
CG: Conjugate Gradient, irregular memory access and communication
EP: Embarrassingly Parallel
FT: discrete 3D fast Fourier Transform, all-to-all communication
IS: Integer Sort, random memory access
LU: Lower-Upper Gauss-Seidel solver
MG: Multi-Grid on a sequence of meshes, long- and short-distance communication, memory intensive

Graph500
• Graph data-analytics workload
• Heavily utilizes point-to-point communication (MPI_Isend, MPI_Irecv) with 4 KB messages for the BFS search of random vertices (see the sketch below)
• scale (x, y) = the graph has 2^x vertices and 2^y edges
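Since the slide highlights Graph500's use of non-blocking point-to-point communication, here is a minimal sketch of that pattern (my own illustration, not Graph500 code): pairs of ranks exchange a 4 KB message with MPI_Isend/MPI_Irecv and wait on both requests at once, which is what allows the library to overlap the transfers.

```c
/* Non-blocking point-to-point exchange sketch (illustrative only). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

enum { MSG_SIZE = 4096 };                      /* 4 KB, as described for Graph500 BFS */

int main(int argc, char **argv)
{
    char sendbuf[MSG_SIZE], recvbuf[MSG_SIZE];
    int rank, size, peer;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(sendbuf, rank, MSG_SIZE);           /* fill with something rank-specific */

    peer = rank ^ 1;                           /* pair even/odd neighbouring ranks */
    if (peer >= size) {                        /* odd rank count: last rank idles */
        MPI_Finalize();
        return 0;
    }

    /* Post the receive and the send without blocking, then wait for both. */
    MPI_Irecv(recvbuf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0)
        printf("exchanged %d bytes with rank %d, first byte = %d\n",
               MSG_SIZE, peer, recvbuf[0]);

    MPI_Finalize();
    return 0;
}
```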

Slide 47

Slide 47 text

Application Performance with 512-Process (32 nodes) on Haswell 47 Singularity-based container technology only introduces <7% overhead

Slide 48

Slide 48 text

Application Performance with 128-Process (2 nodes) on KNL with Cache/Flat Mode 48 Singularity-based container technology only introduces <7% overhead

Slide 49

Slide 49 text

Discussion (Personal Discussion) 49

Slide 50

Slide 50 text

• What’s the cause of Singularity’s overhead (0-8%)? • Network? File I/O? Memory I/O? Compute? • What’s the cause of the performance differences in Fig. 12? (e.g. characteristics of the benchmarks) Where does Singularity’s overhead come from? 50

Slide 51

Slide 51 text

Singularity Process Flow 51
1. The Singularity application is invoked
2. Global options are parsed and activated
3. The Singularity command (subcommand) process is activated
4. Subcommand options are parsed
5. The appropriate sanity checks are made
6. Environment variables are set
7. The Singularity execution binary is called (sexec)
8. Sexec determines if it is running privileged and calls the SUID code if necessary
9. Namespaces are created depending on configuration and process requirements
10. The Singularity image is checked, parsed, and mounted in the CLONE_NEWNS namespace
11. Bind mount points are set up so that files on the host are visible in the container
12. The namespace CLONE_FS is used to virtualize a new root file system
13. Singularity calls execvp() and the Singularity process itself is replaced by the process inside the container
14. When the process inside the container exits, all namespaces collapse with that process, leaving a clean system
(material)
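To make steps 9-13 more tangible, here is a heavily simplified sketch of the general pattern (my own illustration; the real Singularity code loop-mounts an image file, sets up bind mounts, and handles the SUID/privilege checks): unshare a mount namespace, switch the root file system to an already-prepared directory, and replace the process with execvp(). It assumes root privileges and a root-filesystem directory given on the command line, e.g. ./contain <rootfs-dir> <command> [args...].

```c
/* Simplified container-launch skeleton (illustrative, not Singularity's code). */
#define _GNU_SOURCE
#include <sched.h>         /* unshare(), CLONE_NEWNS */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>     /* mount(), MS_REC, MS_PRIVATE */
#include <unistd.h>        /* chroot(), chdir(), execvp() */

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <rootfs-dir> <command> [args...]\n", argv[0]);
        return EXIT_FAILURE;
    }

    /* Step 9: create a new mount namespace so mounts do not leak to the host. */
    if (unshare(CLONE_NEWNS) != 0) { perror("unshare"); return EXIT_FAILURE; }

    /* Keep mount events private inside the new namespace. */
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0) {
        perror("mount MS_PRIVATE");
        return EXIT_FAILURE;
    }

    /* Steps 10-12 (simplified): switch the root file system to the image dir. */
    if (chroot(argv[1]) != 0 || chdir("/") != 0) {
        perror("chroot/chdir");
        return EXIT_FAILURE;
    }

    /* Step 13: replace this process with the command inside the container. */
    execvp(argv[2], &argv[2]);
    perror("execvp");                          /* reached only if execvp() fails */
    return EXIT_FAILURE;
}
```

Because the container process is just a normal child of the shell that launched it (step 14), no daemon is left running, which is part of why this model fits HPC batch schedulers well.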

Slide 52

Slide 52 text

Conclusion 52

Slide 53

Slide 53 text

• proposed a 4D evaluation methodology for HPC clouds • conducted comprehensive studies to evaluate Singularity’s performance • Singularity-based container technology can achieve near-native performance on Intel Xeon/KNL • Singularity provides a promising way to build the next-generation HPC clouds Conclusion 53