Evaluation of Container Virtualized
MEGADOCK System
in Distributed Computing Environment
March 23rd, 2017
SIG BIO 49@Japan Advanced Institute of Science and Technology
Kento Aoyama1,2, Yuki Yamamoto1,2, Masahito Ohue1,3, Yutaka Akiyama1,2,3
1) Department of Computer Science, School of Computing
Tokyo Institute of Technology
2) Education Academy of Computational Life Sciences (ACLS)
Tokyo Institute of Technology
3) Advanced Computational Drug Discovery Unit, Institute of Innovative Research
Tokyo Institute of Technology
Slide 2
Slide 2 text
“Docker” 2
https://www.docker.com/what-container
No. of pulled containers from DockerHub
Slide 3
Slide 3 text
Docker and Bioinformatics 3
A. Paolo, D. Tommaso, A. B. Ramirez, E. Palumbo, C. Notredame, and D.
Gruber, “Benchmark Report: Univa Grid Engine, Nextflow, and Docker
for running Genomic Analysis Workflows.”
Docker Integration Benchmark Report
@Centre for Genomic Regulation
(Barcelona, Spain)
• Univa Grid Engine (Job Scheduler)
• Nextflow (Workflow manager)
• Docker (Linux Container)
• Reproducibility
• Portability
Slide 4
Slide 4 text
To develop the
Container-Native HPC Bioinformatics Application
Using Linux Container
which has …
• Low Dependency on Environment
• High-Performance
• Parallel execution performance
• Overhead of virtualization
• Dynamically Scaling
Research Purpose 4
Slide 5
Slide 5 text
• To evaluate the
Performance of Docker Container-Virtualization
in Bioinformatics Application
Target Application
• MEGADOCK[1]
• FFT-grid-based Protein-Protein Docking software
• Multi-threading, Multi-node, Multi-GPU (OpenMP, MPI, GPU)
• Extremely compute intensive workloads
Today’s Report 5
[1] Masahito Ohue, et al. “MEGADOCK 4.0: an ultra-high-performance protein-protein docking
software for heterogeneous supercomputers”, Bioinformatics, 30(22): 3281-3283, 2014.
Slide 6
Slide 6 text
Background
Linux Container
Docker
Container & Bioinformatics
6
Slide 7
Slide 7 text
Kernel-Shared Virtualization
• Lightweight : small size, fast deploy, easy sharing
• Performance : little virtualization overhead, faster than VM
Linux Container 7
[Figure: architecture comparison. Virtual Machines: Hardware, Hypervisor, then two Virtual Machines, each with its own Guest OS, Bins/Libs, and App. Containers: Hardware, Linux Kernel, then two Containers, each with only Bins/Libs and App.]
Slide 8
Slide 8 text
Linux Container
• virtualizes host resources as containers
• Filesystem, hostname, IPC, PID, Network, User, etc.
• can be used like Virtual Machines
Linux Kernel Features
• Containers share the same host kernel
• namespace[1], chroot, cgroup, SELinux, etc.
Container-based Virtualization 8
[1] E. W. Biederman. “Multiple instances of the global Linux namespaces.”,
In Proceedings of the 2006 Ottawa Linux Symposium, 2006.
[Figure: one machine with a single Linux kernel space hosting two containers, each containing two processes.]
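The namespace feature listed above can be inspected from user space. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper name list_namespaces is hypothetical and not part of the original slides. It lists the namespaces the current process belongs to. Two processes in the same container report the same identifiers, while processes in different containers report different ones.

```python
import os


def list_namespaces(pid="self"):
    """Map each namespace type (e.g. 'pid', 'net') to its kernel identifier."""
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    if not os.path.isdir(ns_dir):
        # Not a Linux system, or /proc is not mounted: nothing to report.
        return namespaces
    for name in sorted(os.listdir(ns_dir)):
        try:
            # Each entry is a symlink whose target looks like 'pid:[4026531836]'.
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Some entries may be unreadable without sufficient privileges.
            continue
    return namespaces


if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print(f"{name:10s} {ident}")
```

Running this once on the host and once inside a container should show different identifiers for the isolated namespaces, even though both processes share one kernel.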
Slide 9
Slide 9 text
Linux Container – Performance [1] 9
[1] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, “An updated performance comparison of virtual
machines and Linux containers,” IEEE International Symposium on Performance Analysis of Systems and
Software, pp.171-172, 2015. (IBM Research Report, RC25482 (AUS1407-001), 2014.)
[Figure: bar chart of performance ratio relative to Native (axis 0.00 to 1.00) for PXZ [MB/s], Linpack [GFLOPS], and Random Access [GUPS]; series: Native, Docker, KVM, KVM-tuned. Bar labels as extracted: 0.96, 1.00, 0.98, 0.78, 0.83, 0.99, 0.82, 0.98.]
Slide 10
Slide 10 text
Docker [1]
• Most popular Linux Container management platform
• Many useful components and services
Linux Container Management Tools 10
[1] Solomon Hykes and others. “What is Docker?” - https://www.docker.com/what-docker
[2] W. Bhimji, S. Canon, D. Jacobsen, L. Gerhardt, M. Mustafa, and J. Porter, “Shifter : Containers for
HPC,” Cray User Group, pp. 1–12, 2016.
[3] “Singularity” - http://singularity.lbl.gov/
[Figure: tool logos labeled [1], [2], and [3].]
Slide 11
Slide 11 text
Easy container sharing – Docker Hub 11
Portability & Reproducibility
• Easy to share the application environment via Docker Hub
• Containers can be executed on other host machines
[Figure: a Dockerfile (apt-get install …, wget …, …, make) generates an image (App, Bins/Libs) on an Ubuntu host running Docker Engine. The image is pushed to Docker Hub and pulled by a CentOS host running Docker Engine, where it runs as a container (App, Bins/Libs). Labels: Generate, Share, Push, Pull.]
Why in the field of Bioinformatics?
• Types of Applications
• Data Analysis, Machine Learning
• MD Simulation, Docking calc. , etc.
• Data-centric workload
• Compute : Large
• Data I/O : Case by case
• Communication : Small
• Containers perform well on compute-intensive workloads [1]
For Bioinformatics Apps : 1 13
[1] W. Felter, et al. “An updated performance comparison of virtual
machines and Linux containers,” IEEE International Symposium on
Performance Analysis of Systems and Software, pp.171-172, 2015.
Slide 14
Slide 14 text
Reproducibility
• Different versions of a library can produce different results
• e.g.) Genomic analysis pipeline [Paolo, 2016]
For Bioinformatics Apps : 2 14
Dependency conflict
• Different applications can require different versions of the same library
[Figure, Dependency Isolation: Application A requires Library A version >= 1.2 while Application B requires version < 1.1 (conflict); each runs in its own container (Container A, Container B).]
[Figure, Application Reproducibility: Application A with library version 1.2 (Container A) gives Result A, while Application A with library version 1.3 (Container A’) gives Result A’ (different result).]
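The dependency conflict on this slide can be checked mechanically. The sketch below uses the two constraints shown on the slide (version >= 1.2 for Application A, version < 1.1 for Application B); the list of candidate versions and the helper names parse and satisfies are hypothetical, introduced only for illustration.

```python
def parse(version):
    """Turn '1.2' into (1, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, op, bound):
    """Check one constraint such as ('>=', '1.2') against a version string."""
    v, b = parse(version), parse(bound)
    return {">=": v >= b, "<": v < b}[op]


# Constraints taken from the slide; the candidate list is hypothetical.
app_a = (">=", "1.2")
app_b = ("<", "1.1")
candidates = ["1.0", "1.1", "1.2", "1.3"]

shared = [v for v in candidates if satisfies(v, *app_a) and satisfies(v, *app_b)]
only_a = [v for v in candidates if satisfies(v, *app_a)]
only_b = [v for v in candidates if satisfies(v, *app_b)]

print("usable by both applications:", shared)
print("usable by Application A:", only_a)
print("usable by Application B:", only_b)
```

No single installed version can satisfy both constraints, which is why each application needs its own isolated set of libraries.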
Slide 15
Slide 15 text
Performance
• Little performance overhead
Reproducibility
• Dependency Isolation from other applications/libraries
Portability, Generality
• Sharing/Porting to other environments
Features for Bioinformatics Apps 15
Features                    Native    VM       Container
Performance, Scalability    Great     Bad      Good
Reproducibility             Bad       Good     Great
Portability, Generality     Bad       Great    Great
Slide 16
Slide 16 text
Proposed Method
16
Slide 17
Slide 17 text
MEGADOCK 17
Masahito Ohue, et al. “MEGADOCK 4.0: an ultra-high-
performance protein-protein docking software for
heterogeneous supercomputers”, Bioinformatics,
30(22): 3281-3283, 2014.
High-performance protein-protein interaction predictions
• FFT-grid based docking software
• Extremely compute-intensive
• OpenMP/MPI/GPU support
• Great HPC Performance
Slide 18
Slide 18 text
Container-based Application Distribution 18
[Figure: MEGADOCK containers form the Application Layer and run on resources in the Compute Resource Layer; Add/Remove labels mark where containers and resources can be added or removed.]
• All application dependencies exist in the Container
• Easy-to-test application
• Easy-to-scale size of resources
Test Environment Production Environment
(a) MEGADOCK-Azure[2]
Measurement
• megadock-dp exec. time
• time command (3 times, median)
Dataset
• ZDOCK benchmark 1.0 [1]
(59 * 59 = 3481 pairs)
Options (OpenMP, OpenMPI)
• MPI : 12 threads / 4 MPI processes / 1 node
All file input/output in Local SSD
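The measurement procedure above (three runs, median taken, 59 * 59 = 3481 pairs) can be sketched as follows. This is only an illustration under stated assumptions: the slide uses the shell time command, whereas this sketch times a subprocess with time.perf_counter; the placeholder command stands in for the real megadock-dp job; and reading 59 * 59 as receptors times ligands, with these names, is an assumption.

```python
import itertools
import statistics
import subprocess
import sys
import time


def median_wall_time(command, repeats=3):
    """Run `command` `repeats` times; return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(command, check=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Hypothetical names: 59 receptors docked against 59 ligands.
receptors = [f"receptor_{i:02d}" for i in range(59)]
ligands = [f"ligand_{i:02d}" for i in range(59)]
pairs = list(itertools.product(receptors, ligands))
print("number of docking pairs:", len(pairs))

# Placeholder command standing in for the real docking job.
command = [sys.executable, "-c", "pass"]
print(f"median of 3 runs: {median_wall_time(command):.3f} s")
```

Taking the median rather than the mean keeps a single slow run (for example, one disturbed by other cloud tenants) from skewing the reported time.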
Overview of Experiment II-(a) 25
[Figure: multiple virtual machines (VM), each running 4 MPI processes. Legend: Master Process, Worker Process, (Other).]
[1] R. Chen, et al. “A protein-protein docking benchmark,” Proteins: Structure,
Function and Genetics, vol. 52, no. 1, pp. 88-91, 2003.
[2] Masahito Ohue, et al. ”MEGADOCK-Azure: High-performance protein-protein
interaction prediction system on Microsoft Azure HPC”, IIBMP2016.
Slide 26
Slide 26 text
(b) MEGADOCK + Docker on Microsoft Azure
Measurement
• megadock-dp exec. time
• time command (3 times, median)
Dataset
• ZDOCK benchmark 1.0 [1]
(59 * 59 = 3481 pairs)
Options (OpenMP, OpenMPI)
• MPI : 12 threads / 4 MPI processes / 1 node
All file input/output in Local SSD
Docker Swarm
• All containers are in one overlay network
Overview of Experiment II-(b) 26
[Figure: multiple virtual machines, each running one Docker container with 4 MPI processes; all containers are connected through Docker Swarm (Docker Network). Legend: Master Process, Worker Process, (Other).]
[1] R. Chen, J. Mintseris, J. Janin, and Z. Weng, “A protein-protein docking benchmark,”
Proteins: Structure, Function and Genetics, vol. 52, no. 1, pp. 88-91, 2003.
Slide 27
Slide 27 text
VM Instance/Software Specification 27
Software Env.    Virtual Machine                    Docker
OS (image)       SUSE Linux Enterprise Server 12    ubuntu:14.04
Linux Kernel     3.12.43                            3.12.43
GCC              4.8.3                              4.8.4
FFTW             3.3.4                              3.3.5
OpenMPI          1.10.2                             1.6.5
Docker Engine    1.12.6                             N/A

VM Instance      Standard_D14_v2
CPU              Intel Xeon E5-2673, 2.40 [GHz] × 16 [core]
Memory           112 [GB]
Local SSD        800 [GB]
Slide 28
Slide 28 text
Execution time 28
[Figure: bar chart of execution time, Time [sec] (axis 0 to 150,000) vs. # of VMs; series listed in legend order.]
# of VMs    VM         Docker on VM
1           145,534    117,219
5           25,515     25,145
10          13,132     12,331
20          6,006      6,344
30          4,098      3,971
Chart annotation: May be a measurement mistake
Slide 29
Slide 29 text
Scalability (Strong Scaling, based VM=1) 29
[Figure: speed-up (axis 0 to 45) vs. # of worker cores (axis 0 to 500) for Ideal, VM, and Docker on VM, with points at VM=1, 5, 10, 20, and 30. Annotation: comparable scalability.]
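The strong-scaling comparison can be recomputed from the execution times on the previous slide. This is a sketch under two assumptions that the slides do not state explicitly: the first group of five time labels is the VM series and the second is Docker on VM (legend order), and both series use the single-VM time of the VM series as the baseline ("based VM=1").

```python
# Execution times [sec] from the "Execution time" slide.
# Assumption: first five labels = VM, last five = Docker on VM (legend order).
vm_counts = [1, 5, 10, 20, 30]
time_vm = [145534, 25515, 13132, 6006, 4098]
time_docker = [117219, 25145, 12331, 6344, 3971]


def speedup(times, baseline):
    """Strong-scaling speed-up of each run relative to a baseline time."""
    return [baseline / t for t in times]


# Assumption: both series are normalized by the VM=1 time of the VM series.
baseline = time_vm[0]
speedup_vm = speedup(time_vm, baseline)
speedup_docker = speedup(time_docker, baseline)

for n, s_vm, s_dk, t_vm, t_dk in zip(
    vm_counts, speedup_vm, speedup_docker, time_vm, time_docker
):
    # Relative time difference; negative means Docker on VM finished sooner.
    diff = (t_dk - t_vm) / t_vm * 100
    print(f"VMs={n:2d}  VM x{s_vm:5.2f}  Docker x{s_dk:5.2f}  time diff {diff:+.1f}%")
```

Because the single-VM measurement is flagged on the slide as a possible measurement mistake, speed-up values derived from it should be read with the same caution.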
Slide 30
Slide 30 text
Experiment I
• MEGADOCK + Docker on Physical Machine
showed 6.32% lower performance.
• Docker can cause a 0-4% drop in compute performance [1]
• Communications via Docker NAT (Network Address Translation)
• MEGADOCK (GPU) + NVIDIA-Docker on Physical Machine
showed comparable performance to native.
• GPU computation is independent of container virtualization
• Container virtualization adds little overhead on memory bandwidth
Experiment II
• MEGADOCK + Docker on Microsoft Azure
showed comparable scalability.
• Container virtualization overhead is smaller than other cloud-environment factors
Result & Discussion 30
[1] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, “An updated performance comparison of virtual
machines and Linux containers”, IEEE International Symposium on Performance Analysis of Systems
and Software, pp.171-172, 2015. (IBM Research Report, RC25482 (AUS1407-001), 2014.)
Slide 31
Slide 31 text
• Performance overhead of
Docker container-virtualization is small.
• Suitable for GPU-accelerated applications and cloud environments
• Container virtualization can isolate
the application environment from the host environment.
• The same container image can be used on various machines
• Physical machine on local environment
• Virtual machine on cloud environment
• Docker is useful for computational research work
Conclusion 31
Slide 32
Slide 32 text
Multi-Node & Multi-GPU Evaluation on Cloud
• NVIDIA-Docker is not available in Docker Swarm mode
• Kubernetes[1] officially supports 1 GPU per node
• (experimental-feature: multi-GPU support)
Container-based Task Distribution
• Web-service-like, container-based distribution
• easy to scale computing resources
• easy to extend to multiple tasks (e.g. GHOST-MP, MEGADOCK)
Future Work 32
[1] B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes, “Borg, Omega, and
Kubernetes,” acmqueue, vol. 14, no. 1, p. 24, 2016.